Topic Models for Historical Research
Scholars in the humanities, particularly those working on the 19th and 20th
centuries, have access to a staggering amount of surviving textual
material---books, plays, newspapers, periodicals, personal diaries, and so on.
Reading through and documenting trends in, for example, thousands of 19th
century novels or all issues of a regional newspaper presents a considerable
challenge. Probabilistic topic models such as Latent Dirichlet Allocation have
gained acceptance as a valuable tool for exploring and summarizing the content
of very large collections of texts. But the unadorned topic model (LDA) fails to
consider the way language and terminology change over time. As many historical
text collections span multiple decades, this is an important consideration. Topic
models that take into consideration gradual change in patterns of language use
do exist but posterior inference is computationally taxing. This presentation
provides a selective history of quantitative text analysis in the humanities and
describes the computational challenges associated with better adapting topic
models to the demands of historical research.
Allen Riddell is a postdoctoral fellow in the Neukom Institute and the Leslie
Center for the Humanities at Dartmouth College. His research explores how
historians might use statistics and machine learning to study collections of
tens of thousands of text documents, including books, academic journal
articles, and newspapers. His recent work uses Bayesian models of text corpora
to investigate trends in the nineteenth-century British novel and to study the
development of German Studies in the United States between 1920 and 2000. He
earned his PhD from Duke University in 2013.
Events are free and open to the public unless otherwise noted.