What Data and Why?

Academic draft regarding data transparency of my project, specifically regarding the datasets being used, the nature of JSTOR data, and the resulting influence on the project as a whole.

Crucial to this project is its digital nature. As a result, several considerations had to be addressed at the outset, and they had an impact across the whole scope of the project.

Wide literature review coupled with close reading is only made possible via digital tools. It is impossible to look at thousands of books or articles in conjunction with each other to examine them on the basis of an evolution over time or throughlines of subjects. As a result, the project is inherently digital, and this has carried with it several key components.

On the methodological side, the way this project was built changed because it was digital. Notetaking, scheduling, planning, reading, consultations, outlines, and writing were all done digitally. There is a digital footprint of the entire project on my computer, beyond what has been collected for the project's final paper. Moreover, using tools such as GitBook and Jupyter Notebooks changed what was possible for the project. GitHub and its affiliated tools allow me to integrate my research directly into the project itself. This relates to another central tenet of digital work - my entire project can be reconstructed, because my work is visible and available online. Academics, public historians, and digital historians alike can view my project and my methodology from start to finish, and can interact with my collected source material. My project can be examined through different lenses, and my source material used in ways beyond the scope of this project.

The fact that this was a digital project also carried more practical methodological implications. For example, there were limitations on the kind of data I would be working with. The stated goal was to produce a close reading and literature review of the Public History corpus of works as a whole, in order to examine the field's academic and philosophical evolution. Practically, this meant that only digitally published works could be examined. While this is still a wide corpus, archival digitization is lacking in many fields, and my results for the public history field at different points in time could therefore be skewed. Additionally, of the two primary digital academic archives I am drawing sources from, JSTOR limits its digital publication to articles that have been in circulation for four years. The last four years of the public history corpus I am examining should therefore be treated with caution. The results I can get now might differ significantly from the results I would have obtained if JSTOR published digital records without a delay.

Complications such as these are part and parcel of digital work, and as a result are not truly complications at all. They instead serve to highlight the complex and multifaceted nature of digital work, in which multiple variables must be assessed and weighed both during the project's completion, and in light of its results. As a result, transparency is one of the most important parts of digital work. What results were obtained, and using what methods, are crucial to understanding the results in the first place.

In this way, the nature of digitally published public history works ought to be examined further. Setting aside the journal The Public Historian for examination later, it is worth commenting on the ethos behind digitally published works. All the works used in this project were collected through ITHAKA services, specifically JSTOR and Portico. ITHAKA is a non-profit educational company whose goal is to provide accessible and affordable knowledge for students and researchers. Both services aim to collect and digitally archive journals, books, reviews, and primary sources for public research and use. The collections and research are made free and available to countless partnering institutions, but it is worth noting that my ability to use JSTOR and Portico for research and reading in this project was entirely enabled by my university affiliation, and the material would not otherwise be freely accessible to me.

Also of note is that, though JSTOR and Portico collect from sources all across the academic spectrum, the English language resources are orders of magnitude more plentiful than international resources. This project, by necessity of my own language abilities, the North American origins of the field of Public History, and the kinds of data available through JSTOR and Portico, concerns itself only with the English-language (and more broadly, North American) history of the public history field.

Turning now to the specifics of the data examined in this project, I collected, through ITHAKA's services and the tdm-pilot project's digital toolbench, the entire digitally archived corpus of the Public Historian, from its first digitally available articles to the most recent that were accessible to me. The total collection numbered just shy of 10,000 articles. There were several reasons why the Public Historian comprises my main corpus. Chief among them is the origins of the field as a whole. Public History can potentially trace its origins to the founding of one graduate program in 1976 at UC Santa Barbara, and three years later those same founders created the first run of the Public Historian, a journal meant to publish articles and pieces in this new and emergent field. Since then, it has remained at the forefront of the English-language public history consciousness. It is the largest single collection of public history works, and moreover the largest such collection that is accessible online.

As a result, this project draws on the corpus of nearly 10,000 articles published in the Public Historian in order to draw conclusions about the field as a whole. This, quite obviously, comes with some caveats. My data will transparently show that any conclusions drawn from this research are founded solely on examination of this one public history corpus. Though academic research indicates that this is the best single corpus for examining public history as a field since its origins, there are nevertheless articles that could easily fit within the bounds of the field that will be left out of this examination. As a field, public history is strongly interdisciplinary, and many academics argue that the work they do, or the work done by others, is public history, but if it is not published in the Public Historian it will not enter the conversation within my project. This is a notable limiting factor in my work, but it does not detract entirely from the conclusions that can be drawn by using the Public Historian's corpus as a sample of the field as a whole. Additionally, this is an opportunity for further work in the field of close reading and public history.

An additional limiting factor is the role that (digitally) published articles play in terms of my understanding of public history work. It is important to note that once again the field I am studying is interdisciplinary, and intersects with many different fields of academic and professional work. Museum exhibits, digital projects, books, performance art, and more can all be understood as works of public history. But, due to the academic and methodological nature of my project, they will not be examined. This is another limiting factor, as my project will only be able to generate analysis based on published articles of public history, and will not place that into conversation with other methods of doing public history. In this way, though the project's scope is smaller, it is actually achievable using topic modelling tools, as articles are units easily dissectible using my digital methods, whereas museum exhibits and digital projects are not. Like the limiting factor of the Public Historian as a corpus, this too stands merely as an opportunity for further work in this field in order to generate and analyze a more comprehensive picture of the history of public history.

A final pillar of my project is worth noting in terms of transparency of data and methodology - the method of topic modelling itself. Topic modelling, like the other word frequency tools made available by the tdm-pilot digital toolbench, is a subjective practice utilizing objective metrics. Term frequency analysis gives a black and white account of how frequently words are used, in what year, and at what percentage or count. Topic modelling arranges word frequencies into smaller, more digestible topics, linked together by commonalities that depend on how the code is programmed to break them down. The data and the tools do not make guesses as to why these patterns appear, nor do they make evident the underlying trends or evolutions in topics and word frequencies. The analysis in this project is wholly my own, and as a result there will be unconscious human bias and error influencing the analysis I produce. The data is entirely transparent, and by recording the data I am analyzing, my methods, my initial thoughts, and my fleshed out analysis, I hope to make the analysis equally transparent. The nature of the tools is necessarily experimental and reflexive, as topic modelling is a relatively new method to be applied to academic purposes. Furthermore, regardless of how transparent the data appears, the coding of the topic modelling program, and the analysis of its results, involve a human element.
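To make the "objective metric" half of this concrete, the following is a minimal sketch of a term frequency count by year. It is not the tdm-pilot toolbench's own code, and it is not the code used in this project: the sample records, the field names "year" and "text", and the function names are all hypothetical, chosen only for illustration. It uses only the Python standard library.

```python
from collections import Counter, defaultdict
import re

# Hypothetical toy records standing in for article metadata and full text.
# The field names "year" and "text" are assumptions made for this sketch.
SAMPLE_ARTICLES = [
    {"year": 1979, "text": "Public history brings the historian to the public."},
    {"year": 1979, "text": "The museum serves the public."},
    {"year": 2015, "text": "Digital public history and the digital archive."},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency_by_year(articles):
    """Return a dict mapping each year to a Counter of word counts."""
    counts = defaultdict(Counter)
    for article in articles:
        counts[article["year"]].update(tokenize(article["text"]))
    return dict(counts)


def relative_frequency(counts_by_year, term, year):
    """Share of all tokens in the given year accounted for by one term."""
    year_counts = counts_by_year.get(year, Counter())
    total = sum(year_counts.values())
    return year_counts[term] / total if total else 0.0


counts = term_frequency_by_year(SAMPLE_ARTICLES)
print(counts[1979]["public"])                   # raw count for one year
print(relative_frequency(counts, "digital", 2015))  # share of that year's tokens
```

The counts themselves involve no interpretation, but the choices around them do: how text is tokenized, whether common words are removed, and which terms are looked up are all decisions made by the researcher, which is where the human element described above enters.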

Once again, this is both a complication and limiting factor, and also simply a hallmark of digital work. All digital work involves these potentially objective metrics in which the data is entirely transparent, but is necessarily combined with human elements of analysis and purpose. Digital work is complicated, and as a result transparency - in terms of data, method, and analysis - is one of the most important parts of it.
