Topic Models + tf-idf

This project will utilize a few digital humanities tools to produce its close and distant reading analysis of the public history field: topic modelling and tf-idf, or Term Frequency-Inverse Document Frequency, along with the related models of raw and relative term frequency. As ‘models’, these tools function as hermeneutical lenses through which to understand the material and to engage in a contextualized close reading that draws on distant reading’s methods of abstraction. Topic modelling involves breaking a document down into its component words and identifying patterns in the vocabulary. The program sorts words again and again based on word probabilities until it generates lists of topic words that represent the overall focuses of the given corpus. How those words are connected, and what it is that connects them, is left up for analysis. Tf-idf models identify terms that appear frequently in a particular document but rarely across the corpus as a whole; that is, they identify the words that are distinctive to each document.[1] Similarly, this project will use raw frequency and relative frequency to track individual words over time. Though raw and relative frequency are not ideal models, they provide additional context that is needed to analyze the topic models. They are an additional lens for my macroscope, providing a slightly different way to examine the same data in drawing my conclusions.
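To make the arithmetic behind these frequency models concrete, the short sketch below computes raw frequency, relative frequency, and tf-idf over a handful of toy documents using scikit-learn. Both the documents and the choice of scikit-learn are stand-ins for illustration, not part of the project's actual workflow.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-in documents; the real corpus would be articles from The Public Historian.
docs = [
    "museum visitors and public memory",
    "oral history interviews with museum staff",
    "digital archives and public memory projects",
]

# Raw frequency: how often each word appears in each document.
count_vec = CountVectorizer(stop_words="english")
raw = count_vec.fit_transform(docs).toarray()

# Relative frequency: raw counts divided by document length, so that longer
# documents do not dominate comparisons across the corpus.
relative = raw / raw.sum(axis=1, keepdims=True)

# tf-idf: weight each term by its frequency in a document against how many
# documents contain it, surfacing the words distinctive to each document.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(docs).toarray()

# Show the most distinctive terms of the first toy document under tf-idf.
terms = tfidf_vec.get_feature_names_out()
top = tfidf[0].argsort()[::-1][:3]
print([(terms[i], round(float(tfidf[0][i]), 3)) for i in top])
```

The point of the sketch is the contrast: raw and relative counts describe how often a word appears, while tf-idf downweights words that appear in most documents of the corpus.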

Topic modelling, as noted, is a “tool [that] takes a single text (or corpus) and looks for patterns in the use of words; it is an attempt to inject semantic meaning into vocabulary.”[2] Topic models represent a family of computational programs that extract topics from texts, a topic being a list of words that occur together in statistically meaningful ways. Essentially, the program reads all the documents at once and begins comparing the frequencies and distributions of words within each document against the entire corpus.[3] It peers into an abyss of words, the entire corpus, and structures it based on the connections between words. For example, a topic in a corpus might be composed of words like “office”, “lab”, “government”, “research”, and “work”. This topic could be termed “workplace words”, or “scientific workplace”, or “federal workplace”. It is hard to determine exactly which, and that is where the analysis of the topic model enters. From this work on the scrambled words, the algorithm generates a distribution of words and their proportions within each topic, and a distribution of topics and their proportions within each document. By setting the number of topics to be generated, the content of the lists can be changed: the program broadens each topic to encompass more when asked for fewer topics, or makes the topics more narrowly focused when asked for more.
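As a rough illustration of these two distributions (words within topics, topics within documents), the sketch below fits a small LDA topic model on toy documents with scikit-learn. The project's citations point to MALLET rather than scikit-learn, and the documents and topic count here are placeholders, so this is only a sketch of the general technique, not the project's actual procedure.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in documents; the real corpus would be the scrambled journal texts.
docs = [
    "museum exhibit design and visitor interpretation",
    "oral history interviews preserve community memory",
    "exhibit labels shape visitor interpretation",
    "community memory and oral history archives",
    "digital archives for museum collections",
    "visitor studies of museum exhibit spaces",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)          # document-term matrix of raw counts

n_topics = 2  # the researcher chooses how many topics the model must produce
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topics = lda.fit_transform(dtm)           # topic proportions within each document

# The most probable words in each topic (lda.components_ holds per-topic word weights).
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {', '.join(top)}")

print(doc_topics.round(2))                    # each row (document) sums to 1
```

The word lists printed here are the raw material for interpretation; the program never names the topics itself.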

It is important to properly understand the meaning of the word “topics” here. The program does not define the lists of words that it generates by giving them a name or a category. In this way, these are less categories of set terms than discourses; they are not “settled and obvious topics.”[4] The content of the lists, and the connections between the words the computer has identified, are up for analysis by the researcher. The topic lists are, in effect, discourses: arguments waiting to be made about the underlying context and meaning that links their words. Topic modelling is therefore a lens for analysis. In calling it a model, and later in calling the term frequency programs models, I mean that it functions as a hermeneutical lens for analysis. These models adjust and structure the data in different ways to bring out different aspects, but they do not produce anything in the data that is not already there. They do not produce objective results, but rather results that hold for the data they have been given. By reading the generated topic lists I can make inferences and assumptions about what those lists say about the corpus, but those remain inferences and assumptions; I will have to read context back into the model. In my case, I will be feeding a corpus of 10,000 scrambled works from the Public Historian into the model and having it generate a set list of topics, to see which aspects of the field are important at that point in time, and even a little of how they are spoken of in the corpus.

Topic modelling as a practice has seen some utility for scholars in the social sciences, in the humanities, and in the digital humanities.[5] It has been used extensively over the last several years to analyze single texts and, more commonly, large corpora of texts to find commonalities and underpinning contexts and meanings. It is most often used in literary studies to analyze the whole corpus of an author, or to examine how the method interacts with figurative language and wrestles with figurative meaning. Rhody writes extensively on topic modelling and poetry, and on the ways the model interprets the connections between words layered with hidden meanings. Rhody suggests that the space between the literary possibility of language and the rigors of the computer model is ripe for interpretation and analysis.[6] That said, the exact specifications of the algorithm's code and the way it functions are outside the purview of this project,[7] and the only real code work of importance is defining the number of topics. When I focus in by producing fewer topics, the model has to generate exactly that number, and any subtopics that might exist can be buried. If more topics are requested, the model can expand its definition of what counts as a topic in the corpus, and new aspects of it can be analyzed. Future projects of a similar nature might find many possibilities for changing the model through its code. The different results produced will be interesting in and of themselves, but the space between the two interpretations will also provide a further depth of analysis.
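Since the number of topics is the one parameter this project adjusts, the following sketch (again using hypothetical toy documents and scikit-learn rather than the project's MALLET workflow) shows how refitting the same corpus with different topic counts broadens or narrows the resulting word lists.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# The same kind of toy documents as before; purely illustrative.
docs = [
    "museum exhibit design and visitor interpretation",
    "oral history interviews preserve community memory",
    "digital archives for museum collections",
    "visitor studies of museum exhibit spaces",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# Fewer topics fold subtopics together; more topics split the corpus more finely.
for n_topics in (2, 4):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(dtm)
    print(f"--- {n_topics} topics ---")
    for k, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[::-1][:4]]
        print(f"Topic {k}: {', '.join(top)}")
```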

But, as Rhody points out, the model can only train itself on what it reads.[8] What the topic model produces can change depending on what, and how much, it is given. With this in mind, this project will break the general corpus down into smaller chunks. Essentially, I will fit a topic model on the whole corpus, and then I will turn the gain up on my macroscope and focus on distinct time periods within that corpus, occasionally adjusting the focus by changing the number of topics to see what that brings out in the data. This will allow for a more in-depth analysis of smaller trends and a more accurate understanding of trends over time, and it will additionally provide a useful counterpoint against which to read the topic model of the larger corpus.
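A sketch of this period-by-period approach, assuming a hypothetical list of (year, text) pairs and decade-wide chunks that are illustrative rather than the project's actual divisions, might look like this:

```python
from collections import defaultdict
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical (year, text) rows; the real corpus would carry article texts and dates.
corpus = [
    (1981, "museum exhibit design and visitor interpretation"),
    (1984, "visitor studies of museum exhibit spaces"),
    (2012, "digital archives for museum collections"),
    (2015, "public memory and digital exhibit projects"),
]

def period(year, width=10):
    """Bucket a year into a decade-wide chunk (e.g. everything from 1980 to 1989)."""
    return (year // width) * width

# Group the documents by period, then fit a separate topic model on each chunk.
chunks = defaultdict(list)
for year, text in corpus:
    chunks[period(year)].append(text)

for start, docs in sorted(chunks.items()):
    vec = CountVectorizer(stop_words="english")
    dtm = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
    terms = vec.get_feature_names_out()
    top = [terms[i] for i in lda.components_[0].argsort()[::-1][:4]]
    print(f"{start}s, first topic: {', '.join(top)}")
```

Each chunk gets its own topic model, so the lists for one period can be read against the other periods as well as against the model of the full corpus.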

[1] Matthew Lavin, “Analyzing Documents with tf-idf,” The Programming Historian, 2019. https://doi.org/10.46430/phen0082

[2] Shawn Graham, Scott Weingart, and Ian Milligan, “Getting Started with Topic Modelling and MALLET,” The Programming Historian, 2012.

[3] Ibid.

[4] Graham, Weingart, and Milligan, “Getting Started with Topic Modelling and MALLET,” 2012.

[5] John Mohr, “Introduction – Topic models: What they are and why they matter,” Poetics 41, no. 6 (2013): 4.

[6] Lisa M. Rhody, “Topic Modeling and Figurative Language,” Journal of Digital Humanities 2, no. 1 (2012).

[7] Graham, Weingart, and Milligan, “Getting Started with Topic Modelling and MALLET,” 2012.

[8] Ibid.
