Method
I will use topic modelling to carry out the distant reading approach discussed earlier. The topic models will analyze my research corpus, in this case the entire digitized collection of The Public Historian journal, first all at once and then decade by decade, which provides a narrower view for tracking smaller trends over time. I will use the Constellate Beta Digital ToolBench from ITHAKA and JSTOR to complete this computational analysis. The Constellate Digital ToolBench allows for distant readings and for qualitative and quantitative engagements with materials housed in JSTOR. The tool also provides premade Jupyter Notebooks with topic modelling code built in. The datasets produced with it can be seen and interacted with here on my website, along with more reflections on the process of my analysis.
The Jupyter Notebooks provided by Constellate analyze the corpora the ToolBench assembled. Jupyter Notebook, whose name is derived from the Julia, Python, and R programming languages, is “an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.”[1] It supports reproducible and replicable workflows, which allows me to easily tinker with the model and the corpus and, equally importantly, to save my work and share my data and workbench with other academics so that they can reproduce or tinker with it as well. I will primarily work with these premade notebooks, amending them as necessary to analyze the data fully. The notebooks provided enable me to analyze the corpora in terms of both topic modelling and tf-idf.
Tf-idf (term frequency-inverse document frequency) originated in a 1972 paper by Karen Spärck Jones.[2] It focuses on identifying “the terms in a document that are distinctively frequent in a document, when that document is compared other documents.”[3] Each word is scored by how frequently it appears in a given document compared with how often it appears in the other documents. The program applies a formula that weighs a word’s frequency within one document against the number of documents that contain it, and produces a score for each word. This singles out the words that are distinctive, and therefore likely the most meaningful to the text’s arguments. It is no surprise that articles in The Public Historian would frequently use words such as “history” and “public”. By treating those words as ordinary, the tf-idf model will focus on words that are much more distinctive, like “California” or “oral”.
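The scoring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in the Constellate notebooks: it uses one common textbook variant (term frequency as a share of the document's words, multiplied by the logarithm of the number of documents divided by the number of documents containing the word), and the function name tf_idf and the three toy "articles" are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    `documents` is a list of token lists. Assumed variant:
    tf = count / document length, idf = log(N / document frequency).
    Libraries often use smoothed versions of this formula.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus standing in for three journal articles.
docs = [
    "public history in california archives".split(),
    "oral history and public memory".split(),
    "public history museums and public audiences".split(),
]

scores = tf_idf(docs)
for i, doc_scores in enumerate(scores):
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    print(i, ranked[:3])
```

Because “public” and “history” occur in every toy document, their inverse document frequency is log(1) = 0 and they score zero, while words confined to a single document, such as “california” or “oral”, receive positive scores.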
This is useful for my project, both through tf-idf itself and through targeted raw and relative frequency models. The tf-idf model can provide an overview of contextually important words. This can, as stated, help bridge the contextual and analytical abyss that the topic models leave me with. The brief historiography of public history has told me what to look for. It has identified two possible trends for examination, and therefore I can use raw and relative frequency models in a way that produces meaningful results. When I know which words to look for, I can input specific words and check how their raw or relative frequency changes over time. Once again, the Constellate Digital ToolBench provides built-in code for visualizations such as raw and relative frequency models. Though neither topic models nor tf-idf might identify words such as “oral” or “colonialism” as particularly relevant, tracing their development as terms is important for tracking the academic trends identified in my historiographic research. If I need a rope bridge to take my research from topic modelling to a rounded overview of public history, then tf-idf is the rope, and raw and relative frequency models are the wooden planks. Both are needed, and both are valuable.
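To make the difference between raw and relative frequency concrete, the following is a minimal sketch of tracking one chosen word by decade. It is an illustration under assumptions rather than the Constellate code: the function name frequency_by_decade, the (year, tokens) input format, and the three toy articles are all hypothetical.

```python
from collections import defaultdict


def frequency_by_decade(articles, term):
    """Return {decade: (raw_count, relative_frequency)} for one term.

    `articles` is an iterable of (publication_year, token_list) pairs.
    Raw frequency is the number of occurrences in a decade; relative
    frequency divides that count by all words published in the decade.
    """
    term = term.lower()
    raw = defaultdict(int)
    totals = defaultdict(int)
    for year, tokens in articles:
        decade = (year // 10) * 10
        raw[decade] += sum(1 for token in tokens if token.lower() == term)
        totals[decade] += len(tokens)
    return {
        decade: (raw[decade], raw[decade] / totals[decade] if totals[decade] else 0.0)
        for decade in sorted(totals)
    }


# Hypothetical toy data: (year, tokens) pairs standing in for articles.
articles = [
    (1979, "oral history and the public".split()),
    (1985, "public policy and public memory".split()),
    (1988, "oral sources and oral testimony".split()),
]

result = frequency_by_decade(articles, "oral")
for decade, (raw_count, relative) in result.items():
    print(decade, raw_count, relative)
```

In this toy data the raw count of “oral” doubles from the 1970s to the 1980s, but so does the amount of text, so the relative frequency stays the same. This is why both measures are worth checking when the size of the journal's output changes from decade to decade.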
[1] https://jupyter.org/.
[2] Karen Spärck Jones, “A Statistical Interpretation of Term Specificity and Its Application in Retrieval,” Journal of Documentation 28, no. 1 (1972): 11-21.
[3] Lavin, “Analyzing Documents with tf-idf,” The Programming Historian.
