Abstract – Computational Textual Analysis (CTA) is a controversial sub-field in the digital humanities. Previous work has shown that critics misunderstand the field’s objectives but also that practitioners overstate their results. Using the JSTOR/Folger Shakespeare dataset, I demonstrate that supercomputing on big humanities data can help both camps to understand one another. I designed and optimized two jobs: an HPC job using Python and OpenMPI that transformed the Shakespeare dataset from a citational network with 623,428 edges into a co-citational network with 29,256,101 entries (5,080 CPU hours), and an HTC job (16,000 CPU hours) that reduced this dataset for a Shakespeare recommendation system. The resulting recommendation system powers a simple intertextual reading interface, and its recommendations qualitatively outperform standard CTA approaches. Computational analysis can make better use of big humanities data, especially the quantifiable interpretive activities of trained practitioners.

I was fortunate to be included in the 2019 Rice Data Science Conference. The presentation was videorecorded.

In September 2017, I began to split my time between the Humanities Research Center (HRC) and the Center for Research Computing (CRC) at Rice. It was a valuable opportunity to put into practice some of the digital humanities project-management skills I had learned and to share them by facilitating other scholars’ research, while staying connected to the School of Humanities out of which all this work would emerge.

It was also valuable because it allowed me to jump into the deep end of the computing pool. My new colleagues in the CRC were incredibly generous with their time, introducing me to the high-end infrastructure that most universities have for their science programs but which few humanists encounter.

I like to think of our research computing infrastructure at Rice as a three-legged stool: networked storage, a private cloud, and a supercomputing cluster. Over the last two years, I learned how to supercompute, how to develop full-stack data-driven web applications, and how to leverage networked storage to realize the full potential of both.

Whiteboard diagram of workflow from a research computing workshop I co-led in 2018.

I have also helped professors and graduate students to utilize the resources available to them, and to develop custom solutions to their research computing problems.

Lisa Spiro introduced me to the JSTOR Matchmaker dataset in July 2017, and I began experimenting with ways to make use of the enormous amount of data on disciplinary knowledge production that JSTOR has collected. I settled on trying out supercomputing methods for analyzing their data on citations of passages in Shakespeare’s dramatic corpus, as imported from the Folger library’s digital editions.

The practical outcome of all of this was a recommendation system for passages from Shakespeare’s dramatic corpus. You can play that game at the above link, but do be gentle; it’s a fragile system. It took ~21,000 CPU hours (about 2.4 years on a laptop) to produce.
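Most of those hours went into the transformation described in the abstract: turning a citational network (articles citing passages) into a co-citational one (passages cited together). The production job used Python with OpenMPI to split that work across nodes; the serial sketch below is only meant to show the shape of the computation, and the record format and identifiers are illustrative rather than the actual Matchmaker schema.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Illustrative citation records: (citing article, cited Shakespeare passage).
# These identifiers are made up; they stand in for whatever the dataset uses.
citations = [
    ("article_001", "Ham 3.1.56"),
    ("article_001", "Mac 5.5.19"),
    ("article_002", "Ham 3.1.56"),
    ("article_002", "Lr 1.1.90"),
    ("article_002", "Mac 5.5.19"),
]

# Group the passages cited by each article.
passages_by_article = defaultdict(set)
for article, passage in citations:
    passages_by_article[article].add(passage)

# Every unordered pair of passages cited in the same article counts as one
# co-citation. This pairwise expansion is what turns ~623k citation edges
# into tens of millions of co-citation entries, and it is the step worth
# distributing across MPI ranks (e.g., a slice of articles per rank, with
# the partial counts merged at the end).
cocitation_counts = Counter()
for passages in passages_by_article.values():
    for pair in combinations(sorted(passages), 2):
        cocitation_counts[pair] += 1

print(cocitation_counts.most_common(3))
```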

The formalization and explanation of my ranking algorithm, from my slide deck at the conference.
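I won’t reproduce the slide here, but to give a feel for what ranking over a co-citation network can look like, the toy scorer below ranks candidate passages by how often they are co-cited with a query passage, damped by each candidate’s overall citation count so that the most famous lines don’t win every recommendation. That normalization is my illustrative stand-in, not the formula from the deck.

```python
from collections import Counter

def recommend(query, cocitation_counts, citation_counts, k=5):
    """Rank passages by co-citation weight with `query`, normalized by each
    candidate's overall citation count. Illustrative only; the actual
    ranking algorithm is the one formalized on the slide."""
    scores = Counter()
    for (a, b), weight in cocitation_counts.items():
        if query in (a, b):
            other = b if a == query else a
            scores[other] = weight / citation_counts[other]
    return scores.most_common(k)

# Toy example, consistent with the sketch above:
cocitation_counts = {("Ham 3.1.56", "Mac 5.5.19"): 2,
                     ("Ham 3.1.56", "Lr 1.1.90"): 1,
                     ("Lr 1.1.90", "Mac 5.5.19"): 1}
citation_counts = {"Ham 3.1.56": 2, "Mac 5.5.19": 2, "Lr 1.1.90": 1}
print(recommend("Ham 3.1.56", cocitation_counts, citation_counts))
```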

Creative uses of big humanities data can help:

  1. Data scientists to better respect the richness of humanities “data” and explore methods more nuanced than “mining.”
  2. Humanities scholars to realize that our data is computable, if only because we have been producing our knowledge from within institutional structures for a very long time.

In short, it would benefit us all at least to be able to take an infrastructural view of humanities data in the academy, and to read for the middle distance.