Last month, the HathiTrust announced that they were opening their archive to text mining. The HathiTrust is among the largest libraries of digital books in the world. It brings together the resources of hundreds of research libraries and provides access to this library through a partnership program. They also work with publishers and rights holders to make published works available in open formats. Last year, we released an open version of the entire back archive of North Dakota Quarterly through the HathiTrust, and since that time the NDQ archive has seen thousands of readers, both new and old. In many ways, the interest in our archive has reinforced our faith in the continued vitality of NDQ as a publication.
Our goal in releasing the NDQ archive was to open it to conventional readers around the world. The HathiTrust’s announcement means that their archive will now be available for text mining as well. For those unfamiliar with the idea of text mining, it can be informally defined as a form of remote reading that looks for patterns across many thousands of pages of text. Patterns in these texts can reflect anything from genre and narrative style to an author’s level of education, vocabulary, place of origin, and psychological makeup.
At its simplest, text mining allows us to get a rough idea of the relevance of a particular phrase or set of concepts. Just for fun, we used Google’s Ngram viewer to compare Thomas McGrath’s name and the North Dakota Quarterly from 1940, when McGrath published his first work, First Manifesto, through his remarkable career to the present. The graph below represents the frequency with which the two phrases appear in the Google Books corpus.
For another fun, and simple, graph, I compared the appearance of the names of several literary magazines in the Google Books corpus, including TriQuarterly, Colorado Review, The Iowa Review, South Dakota Review, and Quarterly West. While this provides only a very coarse indicator of impact, it’s clear that NDQ holds its own among similar publications.
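For readers curious about what sits behind graphs like these, the counting can be sketched in a few lines of Python. This is only a minimal illustration, not Google’s or the HathiTrust’s actual method. The function name phrase_frequency_by_year and the tiny toy_corpus are invented for the example, and it assumes that a phrase’s frequency in a given year is simply its share of all word sequences of the same length published that year.

```python
import re

WORD = re.compile(r"[a-z0-9']+")


def phrase_frequency_by_year(corpus, phrase):
    """Return {year: share of n-word sequences that match `phrase`}.

    `corpus` maps a year to a list of text strings (hypothetical toy data).
    Matching is case-insensitive and ignores punctuation.
    """
    target = WORD.findall(phrase.lower())
    n = len(target)
    if n == 0:
        raise ValueError("phrase must contain at least one word")

    frequencies = {}
    for year, texts in sorted(corpus.items()):
        hits = 0
        total = 0
        for text in texts:
            tokens = WORD.findall(text.lower())
            # Slide a window of n words across the text and count matches.
            for i in range(len(tokens) - n + 1):
                total += 1
                if tokens[i:i + n] == target:
                    hits += 1
        frequencies[year] = hits / total if total else 0.0
    return frequencies


# Invented sentences for illustration only; not real archive data.
toy_corpus = {
    1940: ["Thomas McGrath published First Manifesto."],
    1975: [
        "North Dakota Quarterly printed new poems.",
        "Readers praised North Dakota Quarterly.",
    ],
}

for year, share in phrase_frequency_by_year(toy_corpus, "North Dakota Quarterly").items():
    print(year, round(share, 3))
```

Dividing by the total number of word sequences in each year matters because far more text survives from some years than others; raw counts alone would mostly track the size of the corpus.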
Recently, Google has experimented with using text mining to produce the raw materials for computer generated poetry (the scholars published a paper on it here). The results have been a bit mixed.
In academic circles, text mining has become an important tool in the arsenal of the growing number of digital humanists who leverage large bodies of textual information to make expansive cultural arguments. Not all scholars have embraced the potential of the digital humanities, and a recent, polemical history of the field in the Los Angeles Review of Books even questioned its core assumptions. It goes without saying that digital humanists responded with a wide range of critiques, some of which are compiled here, here, and here. The real value of the digital humanities likely falls somewhere between optimistic claims verging on technological utopianism or solutionism and fears of an intellectually impoverished neoliberal academy.
With the opening of the HathiTrust to text mining, it will be possible for the tech-savvy humanist to query the entire corpus of North Dakota Quarterly (and other literary magazines in the HathiTrust corpus). As NDQ looks to a more digital future, it will be exciting to see whether this makes for a more digital past as well.
Bill Caraher, Associate Professor, Department of History, University of North Dakota