Historic Texts: Using digitized content in research

"Historic Texts: Using digitized content in research" by Gabriel Egan

There are two big things that the computer can do for us in literary research, and those who tried to predict over the past few decades which would be most important got it wrong [SLIDE]:

1) Perfect fast copying of digital files for dissemination (computer-as-replicator)

2) Searching within digital files and counting what is found (computer-as-processor).

Most predictions about how computers would affect literary studies said that the computer-as-processor would matter most and they underestimated the effects of the computer-as-replicator. They did this because they had not figured on the Internet, had not figured on just how cheap digital storage would get, and had not figured on how fast digital file transmission would get. When the World Wide Web was invented 26 years ago, home users typically connected to the Internet at around 1,000 alphanumeric characters per second; today it is 3 million characters per second, which is 3,000 times faster. Back then, a typical hard disk held 1,000 million characters; today it is a million million, which is 1,000 times more. If we used computers for nothing but storing and copying our raw materials, they would be utterly transformatory because of these numbers alone. I will come back in a moment to what the computer-as-processor is doing to research in my field.

With digitization of literary works, primary and secondary, we are not tied to the economics of print publication or to the costs of maintaining paper copies within libraries. Instead, the economics of providing readers' access to the materials are rather different. Once digitized and stored in a well-designed repository, works of literature and works about literature cost virtually nothing to deliver to the reader. But unless they were created digitally in the first place, digitizing them from paper is a major cost, and building and maintaining well-designed repositories is itself costly. If we leave these things to the free market, we become dependent on commercial providers like ProQuest, which have a poor track record on providing cheap, reliable access to digital materials in my field. (I will come back to a salutary lesson about that in a moment.)

In practice, then, what does a resource such as Jisc's Historical Texts do to literary research in my field? The main effect, and it arises directly from the computer-as-replicator effect, is a flattening out of the peaks in the literary landscape. The economics of print dissemination required that publishers focus on the authors and the topics that are reasonably certain to be of enough interest to enough people that sufficient print copies will be sold to buyers in order to return at least a modest profit. It was and still is significantly easier to get a publisher to invest in a proposed monograph on Shakespeare than one on, say, the Gawain poet or an obscure Modernist novelist. Equally, it is Shakespeare's works that were amongst the first to be keyboarded into digital texts and for which we have the best sets of high-resolution digital facsimiles. But, with services like Jisc's Historical Texts, that sharp difference between the most and the least popular writers is markedly flattened out: in Early English Books Online (EEBO), for example, the digital images of Shakespeare's books are no better than those of everyone else's books. In digital dissemination, all texts are essentially equal, which in practice is a huge step up for the non-canonical texts.

Where services such as Jisc Historical Texts most obviously transform literary studies is in this levelling effect produced by the completeness of their coverage. For most purposes, the EEBO books in Historical Texts effectively represent everything published in this country up to the year 1700, and the Eighteenth Century Collections Online (ECCO) books represent everything published in this country up to the year 1800. Thereafter, things get patchy, because nobody has yet digitized the complete 19th century publishers' output. But up to 1800 we have virtually everything, and that changes everything, whether you want to work on Shakespeare or the most obscure writers.

Why does completeness matter so much? Because it allows us to speak definitively on how language was being used by writers in a particular period, as I shall demonstrate in a moment. I first discovered this fact in practice when working 15 years ago at the replica Globe theatre in south London. Like any cultural institution, the replica Globe had developed its own set of orthodox beliefs that were widely repeated without anyone closely interrogating them. [SLIDE] One of these was the belief that people in Shakespeare's time did not say, as we do, that they were going to see a play but rather they would say that they were going to hear a play. It occurred to me that using Literature Online (LION) one could check if that were true, at least for the literary authors in Shakespeare's time. Two of the difficulties I had to overcome were that there are rather a lot of ways of referring to seeing or hearing a play, even if we confine ourselves to just a verb, an article, and the noun play or plays [SLIDE] and that with the exception of the indefinite article each of these words could be spelt in a variety of ways four hundred years ago. But with the Oxford English Dictionary's lists of historical variant spellings and the patience to enter a great many Literature Online searches by hand, these difficulties were surmountable and I was able to publish definitive proof that in fact like us--with a few notable exceptions--the early modern writers referred to seeing a play not to hearing it. One of the notable exceptions is Shakespeare himself, who more often referred to hearing a play, which presumably is why people at the Globe theatre--who are much more familiar with Shakespeare's works than those of any of his contemporaries--thought that everyone spoke like that 400 years ago. [BLANK SLIDE]
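
To give a sense of how quickly those hand-entered searches multiply, here is a minimal sketch in Python that builds every verb-article-noun combination from lists of variant spellings. The spelling lists are illustrative assumptions of mine, not the Oxford English Dictionary's actual lists, and the function name build_queries is hypothetical; a real investigation would substitute the dictionary's attested forms.

```python
from itertools import product

# Illustrative variant spellings only (an assumption for this sketch);
# a real search would take these from the OED's historical spelling lists.
SEE_FORMS = ["see", "saw", "sawe", "seen", "seene"]
HEAR_FORMS = ["hear", "heare", "heard", "hearde"]
ARTICLES = ["a", "the", "ye"]
NOUNS = ["play", "playe", "plaie", "plays", "playes", "plaies"]


def build_queries(verbs, articles, nouns):
    """Return every 'verb article noun' phrase as a search string.

    Some combinations (such as 'a' with a plural noun) are ungrammatical
    and would simply return no hits; they are kept here for simplicity.
    """
    return [" ".join(parts) for parts in product(verbs, articles, nouns)]


see_queries = build_queries(SEE_FORMS, ARTICLES, NOUNS)
hear_queries = build_queries(HEAR_FORMS, ARTICLES, NOUNS)

print(len(see_queries), "phrases for seeing a play")
print(len(hear_queries), "phrases for hearing a play")
```

Even these short lists produce well over a hundred distinct phrases, which is why entering them by hand takes patience.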

You may have heard similar stories to this one before. It is now well known to everyone except journalists that Shakespeare did not coin as many words and phrases as we used to think. The hundreds of compilers of the Oxford English Dictionary who crowdsourced its lists of first occurrences for each word tended to know their Shakespeare better than they knew anyone else's works, so Shakespeare got credited with the first occurrences of words and phrases that had in fact been previously used by less well-known predecessors. Now, with EEBO Text Creation Partnership (or, TCP) transcriptions provided by Jisc Historical Texts it is easy to find a pre-Shakespearian antecedent for many of the words and phrases that we used to credit to him. It is the relative completeness of the EEBO collection that makes this finding valid: if it were only a partial selection of the writings of others, we might well overlook an antecedent usage we are interested in and would instead credit Shakespeare undeservedly.

You might well think that such an error would be fairly unimportant in the wider scheme of things, and it is, but observe what follows from a faulty assumption of completeness and a conclusion drawn from the absence of evidence. Investigators working on authorship attribution rely on definitive answers to questions about whether certain words and phrases are in a given body of works, or not. That 'not' is crucial, since in this kind of searching one might fail to find something in the writings not because it is not there but because one failed to perform the searching properly. For example, Brian Vickers thinks that particular phrases and collocations from the play Macbeth are not found anywhere in the canon of Thomas Middleton, and so he thinks Middleton did not adapt the play, as many other Shakespearians now believe. But the only reason that Vickers fails to find these phrases in Middleton's work is that he searches in a private electronic corpus that lacks most of what Middleton wrote. If you search for those phrases in EEBO-TCP provided by Historical Texts, they show up in Middleton's work. This fundamental error also vitiates Vickers's attributions of several works to Thomas Kyd.

If you want to be definitive about absence, you have to start with a complete corpus. You also have to be scrupulous about how you phrase your searches, which is difficult because unfortunately each of the tools we use--the Oxford English Dictionary, EEBO-TCP, the English Short Title Catalogue, Literature Online--has a different convention for expressing complex searches by specifying the desired proximity of the targets, the application of Boolean logic ('this AND/OR that'), and the use of wildcards to do grammatical stemming. A frequent cause of error is that users misremember the various conventions needed and enter their searches incorrectly. And even if you express your search correctly, you might still get back the wrong answer because the database lies to you. Simply lies. In June 2014, the ProQuest Corporation upgraded its server software and accidentally broke the advanced searching features of its flagship product Literature Online. As of yesterday when I last checked, Literature Online's advanced searching remains broken: the numbers it returns for complex queries are just wrong and the words its search engine claims are in the texts are not there.
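
To make concrete what a proximity search with wildcards is asking for, here is a minimal sketch in Python that runs over a plain-text string held locally. It does not reproduce the query syntax of any of the tools just named; the function names tokenize and near, the shell-style wildcard convention, and the sample sentence are all assumptions of mine for illustration.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def near(text, pattern_a, pattern_b, max_distance):
    """Find pairs of words lying close together in the text.

    Returns (i, j) token positions where the word at i matches pattern_a,
    the word at j matches pattern_b, and the two positions differ by at
    most max_distance, in either order. Patterns use shell-style
    wildcards, so 'hear*' matches 'hear', 'heare', 'heard', and so on.
    """
    tokens = tokenize(text)
    hits_a = [i for i, tok in enumerate(tokens) if fnmatchcase(tok, pattern_a)]
    hits_b = [j for j, tok in enumerate(tokens) if fnmatchcase(tok, pattern_b)]
    return [
        (i, j)
        for i in hits_a
        for j in hits_b
        if i != j and abs(i - j) <= max_distance
    ]


# An invented sample sentence, not a quotation from any early modern text.
sample = "We went to heare a playe"
print(near(sample, "hear*", "pla*", max_distance=2))
```

Writing the rule out in this way shows exactly what is being counted, which is the thing one has to trust a commercial search engine to get right.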

I am one of the General Editors of the New Oxford Shakespeare and we have made extensive use of EEBO-TCP to establish that Shakespeare contributed to three plays that will, with our edition, appear for the first time in a Complete Works edition, and that substantial parts of plays that have long been attributed to Shakespeare were in fact written by someone else, including Christopher Marlowe. We came to these conclusions by an advanced form of the kind of searching for phrases about seeing a play and hearing a play that I mentioned earlier. Luckily, at the New Oxford Shakespeare we spotted that ProQuest had accidentally broken its Literature Online search function and we made sure that this did not affect any of the edition's claims about who wrote what. In response, we made much heavier use of EEBO-TCP in place of Literature Online.

It is not only in authorship attribution that we make use of databases such as Historical Texts, however. Whenever we are trying to make sense of a line of Shakespeare that we think might in fact contain a printer's error, or some other kind of error, we go looking for the troublesome phrase in EEBO-TCP to see if it was in fact a common expression of the period that we are simply not used to reading. In the case of King Henry Fourth Part One, our best authority for what Shakespeare wrote is the edition published in 1598, in which appears the following troubling line [SLIDE]:

Which thou powrest downe from these swelling heauens

The trouble is trying to make this fit into iambic pentameter. Is it [SLIDE]:

which THOU powREST downe FROM these SWELLing HEAuens ?

Or if we try to make powrest one syllable is it [SLIDE]:

which THOU powrest DOWNE from THESE swellING heaUENS ?

No matter how you try it, something goes wrong in the first half of the line. One suggestion that has been made by past editors is to emend the line by reversing powrest downe to downe powrest so that it reads [SLIDE]:

which THOU downe POWrest FROM these SWELLing HEAuens

That was the solution adopted in the last Oxford Shakespeare Complete Works edition 30 years ago, before the advent of resources such as Historical Texts. Even then, we knew from our printed concordances that the phrase down pour appears nowhere else in Shakespeare, so that if we adopted this reading we would be claiming that it is a hapax legomenon: a phrase used just once in Shakespeare's works because it is used uniquely here. But did anybody else of Shakespeare's time use the phrase down pour? A search in EEBO-TCP within Historical Texts tells us that this phrase, in any of its cognate forms, really is extremely rare in Shakespeare's time, which encouraged the editor of this play in the New Oxford Shakespeare to see if she could come up with a better emendation. What if we try the solution of emending thou powrest downe to downe thou powrest? That would make the line [SLIDE]:

which DOWNE thou POWrest FROM these SWELLing HEAuens

Such a placing of the preposition before the verb phrase is typical of Shakespeare, and it gives the line a pleasingly regular meter: which DOWN thou POURest FROM these SWELLing HEAVens. But is down followed by another word followed by pour a common idiom of the time? Yes, EEBO-TCP tells us that it is. On the evidence provided by the EEBO-TCP texts in Jisc's Historical Texts service, the New Oxford Shakespeare emends this line and we would not be surprised if all future editors accept our emendation, based as it is on the largest accumulation of evidence that has ever been brought to bear on this problem.
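
The pattern just described, down followed by one other word followed by some form of pour, can be written as a regular expression. What follows is a minimal sketch in Python, not the search the edition actually ran: the spelling variants built into the pattern and the function name find_down_x_pour are my own assumptions, and the only texts it is tried on here are the two versions of the line quoted above.

```python
import re

# Assumed spelling variants for this sketch: down/downe, and forms of
# pour such as poure, powre, pours, powrest, poureth, powred, pouring.
DOWN = r"downe?"
POUR = r"po[uw]re?(?:s|st|th|d|ing)?"

# 'down', exactly one intervening word, then a form of 'pour'.
PATTERN = re.compile(rf"\b{DOWN}\s+(\w+)\s+({POUR})\b", re.IGNORECASE)


def find_down_x_pour(text):
    """Return (intervening word, pour-form) pairs found in the text."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]


original = "Which thou powrest downe from these swelling heauens"
emended = "which downe thou powrest from these swelling heauens"

print(find_down_x_pour(original))  # the 1598 reading does not fit the pattern
print(find_down_x_pour(emended))   # the emended reading does
```

Run over a whole corpus rather than a single line, a pattern of this kind is what lets us say whether an idiom was common or rare.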

* * *

I have made it sound like editors of Shakespeare have reached digital nirvana and that nothing remains to be done to help them recover what Shakespeare wrote. In fact, far from it. The powers we have been given by resources such as Historical Texts have made us only still more hungry for certain computational ways to work on this data that are not currently implemented in this or any comparable resource. One of the ways that Historical Texts is responding to that is to make available what is known as its Application Programming Interface (or, API) so that interested investigators can write their own computer programs to interrogate the data. I see this as the most important next step in digitally enhanced literary research. But rather than leave you with that optimistic thought, I wish to end with some suggestions of where I see that we have gone wrong in the digitization of the study of our literary culture:

* Thinking that it takes big changes in our practices and theories to make pedagogy and research digital -- it just takes small things like digitizing all the books and getting used to using the rough-and-ready digital texts.

* Thinking in terms of 'projects' around literary topics: just make the digital realm as full of texts as the paper realm and let the tutors and researchers get on with it

* Social media is at best irrelevant and most probably positively harmful to students and tutors in my field of English Literature (18 year olds are narcissistic enough, reading is the way to be less introspective, more in touch with the wider world of deep thoughts)

* In English Literature, everything but work done in the 20th century is out of copyright so we should not be using paper at all to study this subject unless paper and printing itself are in fact what you are studying.

* Not making new front-ends or aggregations. Don't expect users to leave their work on your site (annotations) as the lack of an effective archive mechanism means that this work is necessarily lost.

* Forget one-stop portals except where you've genuinely covered the whole of something (like EEBO having all books up to 1700 or LION having all English Literature). If you want to create some kind of "virtual collection" you should have a very strong rationale indeed for why those texts should be brought together.

* Possibly teach the more advanced students computer programming, else they're relying on other people's tools that might easily disappear (same as 'the cloud' problem with texts).