**"Scholarly Method, Truth, and Evidence in Shakespearian Textual
Studies" by Gabriel Egan**

There are plenty of things still to be discovered about Shakespeare just by
counting certain features in the plays. For example, what do you think is the
average length of the speeches in a Shakespeare play, measured in words? That
is, how many words (on average) are there in a Shakespearian speech? Just guess.
It is a trick question because you should have asked me what I meant by the word
"average". There are at least three kinds of average and they can give
vastly different results. [SLIDE] If we are talking about the *mean*
average, this is the total number of words in all the speeches in a play divided
by the total number of speeches. (I use the word 'tokens' here to mean that each
word counts each time it occurs, so that "Never, never, never, never,
never" is five words, not one.) [SLIDE] If we make that calculation for
each Shakespeare play in chronological order, we get these numbers. We see that
the mean average of words per speech varies from a low of 18 in *The Two
Gentlemen of Verona* to a high of 32 in *Richard 2*. Looking
at these columns of numbers, there are no obvious patterns of similarity.
[SLIDE] We could try labelling them by genre to see if that makes a pattern
emerge, but it does not seem to. There is a different kind of average, though,
that might suit this data: the *mode* average. [SLIDE] In this kind of
average, we put the speeches into ranked categories: all the one-word speeches
(as you can see, there aren't any here), [SLIDE] all the two-word speeches, [SLIDE]
all the three-word speeches, [SLIDE] all the four-word speeches, and so on until
we reach the end of the play. Then we just see how many speeches we have in each
category. [SLIDE] The mode average is useful for datasets that have a bulge near
the left side and a long diminishing tail going to the right, as is typically
found, for example, with income distribution. The *mean* average income in
all countries is skewed upwards by the stratospheric income of a small number of
people. This chart shows yet another average, the *median*, which is the
value for which as many people are below that amount as are above it. Which
kind of average--mean, mode, or median--is most useful to an investigator
depends on the kind of data she is working with.
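
All three kinds of average can be computed in a few lines of code. The speech lengths below are invented for illustration (they are not counts from any real play); Python's standard `statistics` module supplies all three measures:

```python
from statistics import mean, median, mode

# Hypothetical word-counts for ten speeches (illustrative only, not real
# Shakespeare data): a bulge of short speeches and one very long one.
speech_lengths = [4, 4, 4, 5, 7, 9, 12, 20, 35, 120]

print(mean(speech_lengths))    # total words / total speeches: 22.0
print(median(speech_lengths))  # middle value when sorted: 8.0
print(mode(speech_lengths))    # most frequent value: 4
```

Note how the one long speech drags the mean far above the median and the mode, just as a few stratospheric incomes skew mean income upwards.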

[SLIDE] The graph for lengths of Shakespearian speeches looks much like the
graph of income data, with a bulge at the low end and a long tail. Here is the
graph for *Hamlet*, and it shows that there are more four-word speeches in *Hamlet*
than any other kind, and that is not obvious from the mean average. If we find
the mode average for each play in our original list, a startling pattern becomes
obvious [SLIDE]. Where before the data showed no pattern, suddenly a very clear
pattern emerges: in all the plays up to 1599 the mode average length of a
Shakespearian speech was about nine words and then suddenly in that year it
dropped to about four words, and stayed that way for the rest of his career.
(These numbers are my own counts, made to replicate the results reported
by Hartmut Ilsemann (Ilsemann 2005), who made this amazing discovery; I will
have more to say on replication of others' results shortly.) [SLIDE] To see what
a difference it makes to use the mode average instead of the mean average, we
can put our new results alongside our earlier results: the mean tells us
nothing, the mode reveals something we care about. As before, the plays on the
left were written up to 1599, the plays on the right were written from 1599.

What happened in 1599? "The obvious reason", wrote Ilsemann about this pattern he discovered, "must be the opening of the Globe Theatre in the same year. The first assumption that comes to mind is the spatial dimension of the stage, which would have prompted a shift from monological to dialogical action, and included a higher speed" (Ilsemann 2005, 162). But as Ilsemann acknowledged, the stage of the company's previous home, the Theatre in Shoreditch, was probably about the same size and shape as the new one at the Globe, so he wondered if moving to the Globe changed Shakespeare's style because previously "the playwright had to produce texts to be performed at various localities" (Ilsemann 2005, 163). But this idea is also difficult to reconcile with the theatre-historical evidence. As Alan Somerset showed, Shakespeare's company toured more often and more widely in the 1600s than they did in the 1590s (Somerset 1994, 53), so the need to produce plays to be performed in various locations increased rather than decreased after the move to the Globe.

* * *

As I show in the longer version of this talk, theatre historians and
historians of the book have long been counting things and performing arithmetic
operations on the results to make their arguments, and these are usually well
accepted within Shakespeare studies. Whence, then, the current anxieties over the use
of mathematics in early modern drama studies? When mathematical conventions are
used in studies about Shakespeare's texts, this presents an obstacle to readers
who cannot remember, or never learnt, what all the mathematical symbols denote.
These mathematical symbols could be written out longhand as words to convey the
same thing; mathematical notation is merely a shorthand employed by specialists
when talking to one another. [SLIDE] In *A Brief History of Time*, Stephen
Hawking reported that in the planning of the work "Someone told me that
each equation I included in the book would halve the sales" (Hawking 1988, ix).
This is a sly joke on Hawking's part, since that sentence is itself a statement
of the principle of exponential decay [SLIDE]. If we suppose world sales of 10
million copies--the book's actual world sales in the first 20 years--then the
falling off that would have been caused by each additional equation is given by
the equation *y* = 10 million / 2 to the power of the number of equations
in the book. This kind of exponential falling off governs many things, such
as the rate at which unstable atoms undergo radioactive decay and the rate at
which hot things get cold. By presenting this equation in words, Hawking
exemplified the very procedure he needed to adopt. Mathematics is about language
as much as it is about numbers.
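
The falling-off can itself be written as a one-line function; the 10-million figure follows the talk's own assumption about world sales:

```python
def expected_sales(equations: int, base_sales: int = 10_000_000) -> float:
    """Hawking's joke as arithmetic: sales halve with each equation included."""
    return base_sales / 2 ** equations

for n in range(6):
    print(n, expected_sales(n))
# 0 equations: 10,000,000 copies; 1 equation: 5,000,000; 5 equations: 312,500
```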

* * *

The increasing use of mathematics in Shakespeare Studies is creating an unwelcome divide between those who do it and those who think they have trouble understanding it, [SLIDE] so I will now offer a guide to sceptical reading of studies in computational stylistics. This guide uses just three headings for the kinds of thing that should ring your mental alarm bells in studies whose mathematics you do not understand: Probability, Replication, and Validation.

[SLIDE] Probability is the measure of how likely it is that an event will occur.
Notice I said *will* occur, not *has* occurred. Probability has
nothing to say about past events, only future ones. If you read that there is a
1% probability that Thomas Kyd wrote the play *Edward III*, be aware that
this assertion is nothing like the assertion that there is a 1% probability of
Italy leaving the European Union, since Kyd either did or did not write the
play. In its proper sense *probability* has nothing to say on such a
matter. Yet such claims about probability are frequently heard in courtrooms, as
when an expert witness testifies that there is a 1% probability that the DNA
found at the crime scene belongs to someone other than the accused. The key to
making sense of such a claim is understanding what kind of simile it constructs.
The idea is that if we had a large number of cases to consider, say 10,000
cases, then in 1% of them (100 cases) the DNA evidence that seems to incriminate
the accused would in fact come from someone else.

In textual studies, statistics is most commonly used after we have collected some data--counted some feature of the text--and want to know how likely it is that the results we got came about merely by chance rather than being driven by some phenomenon we are interested in, such as co-authorship. The proposition that no real effect is being observed in our data is conventionally called the 'null hypothesis'. We start by assuming that the data reveal nothing meaningful at all; the numbers are just random. The key question is how unusual can our data get before that assumption becomes untenable? How much of a pattern do we need before we abandon the null hypothesis and assume that something other than random variation is producing the data we have? To help with this there are a number of calculations we can make--such as Fisher's Exact test, Student's t-test, and the chi-squared test--that are able to specify just how often unlikely results will come about purely by chance. We feed into these calculations the results of our experiments and they will tell us how often we should expect to get those results when the null hypothesis is true and nothing interesting is going on. If the results that we have will come up by chance only once in a billion years, we perhaps should abandon the null hypothesis that nothing interesting is going on and assume instead that something beyond mere chance is driving our results.

Let us take an example. [SLIDE] Suppose that we have counted the frequency at
which a couple of features--verse lines with feminine endings and verse lines
that rhyme--appear in the first two acts of a play. We do not know who wrote the
play, but we have a few candidates in mind. We do not know if the play was
sole-authored or co-written, and we wonder if the rates of feminine endings and
rhyme can at least help us decide that. Looking at the numbers, what strikes us
is the asymmetry: Act One seems to have lots of feminine endings and little
rhyme while Act Two seems to have few feminine endings and lots of rhyme.
[SLIDE] Our null hypothesis is that there is no real association, no
'contingency', underlying these numbers. That is, our null hypothesis is that
the proportions of feminine endings and rhyme do not vary significantly between
the rows, do not vary significantly between Act One and Act Two. If the numbers
in the two columns vary significantly by row, then we have found a contingency
between the columns and the rows, [SLIDE] we have found a dependency between the
variable 'verse style' and the variable 'division', they are not independent
variables but are somehow linked. We will not have established *how* these
variables are linked, only that they are linked.

Tests such as Fisher's Exact test, Student's t-test, and the chi-squared test allow us to ask [SLIDE] how often we should expect to see these results when the null hypothesis is true. That is, if there is no underlying dependency influencing the numbers, just chance variation, how rare is this asymmetry we find in the numbers? These tests are widely misused, and the first common mistake--aside from neglecting to mention the null hypothesis at all--is choosing an improper null hypothesis, such as "Act One and Act Two are by the same author". These tests have no power to comment on such an hypothesis because it contains a set of additional assumptions that we have no information about, such as the assumption that writers are consistent in their rates of feminine endings and rhyme. There may be any number of reasons other than authorship that explain Act One and Act Two being so asymmetrical regarding these features. Maybe Act One consists almost entirely of verse dialogue (giving opportunities for feminine endings) and no songs (which tend to cause rhyme) while Act Two contains mainly prose dialogue (so few opportunities for feminine endings) and lots of songs (which tend to cause rhyme). Fisher's Exact test, Student's t-test, and the chi-squared test have nothing to say on such matters: they can only comment on how often we would get this asymmetry by chance alone when nothing else is driving the difference. These tests may tell us that the asymmetry in our results is rare, tempting us to reject the null hypothesis, but if we chose an improper null hypothesis in the first place--such as the null hypothesis that the two texts are by the same author--then we will leap to a false conclusion when we reject it.
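
For readers who want to see what such a test actually computes, here is a minimal sketch of a one-sided Fisher's Exact test in pure Python. The 2×2 counts are invented for illustration, not taken from any real play:

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """Right-tail Fisher's Exact p-value for the 2x2 table [[a, b], [c, d]]:
    with all margins held fixed, how often chance alone would put a count
    of a or more in the top-left cell."""
    n = a + b + c + d
    p = 0.0
    for k in range(a, min(a + b, a + c) + 1):
        # hypergeometric probability of exactly k in the top-left cell
        p += comb(a + b, k) * comb(c + d, (a + c) - k) / comb(n, a + c)
    return p

# Invented counts, not from any real play: rows are Act One and Act Two,
# columns are verse lines with feminine endings and verse lines that rhyme.
p = fisher_exact_one_sided(30, 5, 6, 25)
print(p)  # a very small value: this asymmetry is rare under the null hypothesis
```

The test says only that such a lopsided table rarely arises by chance when the rows and columns are independent; it says nothing about *why* they are linked, let alone about authorship.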

The second common error is inverting the meaning of the results of the
test, so that instead of telling us how often we would get those results when
the null hypothesis is true, the result is assumed to be telling us how likely
it is that the null hypothesis is true. This is a logical fallacy. A test that
tells you what to expect about the world if you assume that something (the null
hypothesis) is true cannot, by definition, also be a comment on whether that
same something is true. It is traditional to reject the null hypothesis when the
frequency with which chance would produce your results is low. [SLIDE] In social
sciences, a traditional cut-off is one-time-in-20 (probability *p* <
0.05), at which point the results are said to be statistically significant.
[SLIDE] This is the most pernicious of all fallacies. There is nothing magical
about a one-in-20 probability.

To see why, consider that every week someone wins the UK's National Lottery
at odds of about one-in-10-million. This does not mean that the National Lottery
is unfair and the winner must have cheated. It is utterly predictable that
somebody will win each week with a ticket that had just a one-in-10-million
chance of being the winning ticket. This is utterly predictable because 10
million tickets are sold each week. A p-value on its own--no matter how
small--tells you nothing without additional information about the wider context
in which it emerged. Yet exactly this faulty reasoning disfigures much
scholarship in the field of computational stylistics. For illustrations of the
widespread misuse of statistics, and especially the ubiquitous but meaningless *p*
< 0.05 threshold, see John P. A. Ioannidis's paper "Why most published
research findings are false" (Ioannidis 2005).

Ioannidis's paper takes us to the second consideration in sceptical reading, the problem of replication [SLIDE]. It is a basic tenet of science that studies should be replicable: using the same conditions as those described in the experiment, the same or closely similar results should be obtained. Hartmut Ilsemann claimed in 2005 that the mode average length of Shakespearian speeches suddenly dropped from about 9 to about 4 in 1599, and because this is a straightforward claim I was able to independently replicate his results using the digital texts of the Oxford Complete Works edition of 1986-87 and three dozen lines of programming code. Ioannidis showed that this is rarely possible with most scientific publications. The situation is even worse in our field of Shakespeare Studies because usually the replication cannot even be attempted. If the author relies on a dataset to which no one else has access--as Donald Foster did 20 years ago, and Brian Vickers and Marcus Dahl do today--and/or software that is not Open Source (ditto) and/or methods that are not fully described in all their technical detail (ditto), it is impossible for other investigators to check their results by attempting direct replication.
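
The counting behind that replication is simple enough to sketch. The snippet below is an illustrative toy, not the code I actually used: the speeches here are stand-in lines, where the real replication parsed the Oxford Complete Works digital texts:

```python
from collections import Counter

def speech_lengths(speeches):
    """Word-token count per speech ('never, never, never' counts as three)."""
    return [len(speech.split()) for speech in speeches]

def mode_length(speeches):
    """The most frequent speech length: the mode average."""
    counts = Counter(speech_lengths(speeches))
    return counts.most_common(1)[0][0]

# Toy speeches standing in for a parsed play text:
speeches = [
    "My lord?",
    "Words, words, words.",
    "What do you read, my lord?",
    "Slanders, sir.",
]
print(speech_lengths(speeches))  # [2, 3, 6, 2]
print(mode_length(speeches))     # 2
```

Precisely because the claim reduces to counting of this kind, anyone with access to the same digital texts can check it.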

[SLIDE] The databases Literature Online (LION) and Early English Books Online Text Creation Partnership (EEBO-TCP) are available to most investigators and when studies are based on those databases it is possible for other investigators to check the claims that are being made. There are reasons why an investigator might find the LION and EEBO-TCP texts unsuited to her methods, most commonly because they are in original spelling and hence likely to upset counts based on the automated searching for particular strings of characters representing words. It is reasonable to take texts from these sources and first regularize the spelling, for example using the Variant Detector (VARD) software developed at the University of Lancaster, but if one does that it is good practice to then make the regularized text available to everyone else. After all, there is more than one way to regularize early modern spelling and for replication we need to know just how it was done. Aside from the source texts being available, it is important for replication that the methods used in a study are described in sufficient detail that another investigator may apply them for herself. Where software is developed for an investigation, it should be given away in Open Source form so that others can look at the code and see that it does what the investigator thinks it does.

Replication is a high ideal, but even without it there is another kind of scepticism about its own truth claims that any study can embody. [SLIDE] If someone claims to have found a method of distinguishing authorship by measuring some feature of the text, this itself is a readily testable claim. Using the thousands of digital texts available to us, the validation of the new method would involve simply setting it to work on texts for which we already know the true author and then counting how often the method was able to correctly identify this person. Without validation, or with only a few validation runs, there is simply no way to tell if the new method really is capable of distinguishing authorship. There should at least be tens, and preferably hundreds or thousands, of validation runs and at the end of them the study should give a percentage figure for how often the new method got its authorship attribution correct when applied to the known cases. In general, a correctness figure of less than 90% is hardly worth anyone's attention, since the best methods we currently have point to the correct authors about 95% of the time, when given sufficiently large samples to work on. The same principle applies to studies that claim to quantify aspects of style such as genre or to identify the date of a work's composition. We have hundreds of works for which we already agree the genres and dates and the new methods must be shown, in rigorous tests, to come in the great majority of cases to the same widely accepted conclusions that we have already reached by other methods. This is perhaps the easiest kind of scepticism to commit to memory: if they did not validate their method, it is not valid.
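
The validation loop itself is trivial to write; what matters is the size and representativeness of the known cases. A sketch, with a deliberately naive stand-in attribution method (`toy_attribute` is hypothetical, not any published method):

```python
def validate(attribute, known_cases):
    """Run an attribution method over works of known authorship and report
    the percentage of cases in which it names the right author."""
    hits = sum(1 for text, author in known_cases if attribute(text) == author)
    return 100 * hits / len(known_cases)

# A deliberately naive stand-in method, just to exercise the harness
# (any real method would replace this function):
def toy_attribute(text):
    return "Shakespeare" if "thou" in text else "Kyd"

# A handful of known cases; real validation needs hundreds or thousands.
known = [
    ("thou art more lovely and more temperate", "Shakespeare"),
    ("what outcries pluck me from my naked bed", "Kyd"),
    ("shall I compare thee to a summer's day", "Shakespeare"),
]
print(f"{validate(toy_attribute, known):.0f}% correct")  # the toy method misses one case
```

A method scoring as poorly as this toy does on known cases should never be trusted on unknown ones.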

This account of scholarly method in Shakespeare Studies has engaged with mathematics only so far as the elementary arithmetic operators and measures such as the three kinds of average: the mean, the median, and the mode. I hope to have shown that when using even these simplest of mathematical procedures, investigators need to exercise caution in order to provide an adequate verbal account of what was done and why the results should be accepted by other specialists. I hope also to have provided some guidance for those trying to discriminate between good and bad scholarly practices in this field. When we move beyond these simple mathematical operations to more complex ones, such as calculating standard deviation, variance, and Shannon Entropy, or applying data reduction with methods such as Principal Component Analysis, the majority of Shakespearians have little hope of following the detail. Why does the threshold of comprehension fall just here, at +, −, ×, ÷ and syllogistic logic? What is it about the more complex operations that makes Shakespearians so uncomfortable? The simplest answer is probably correct: this threshold corresponds to the level at which most Shakespearians ceased to study mathematics in their formal education. It is unhelpful to excoriate the profession for its collective lack of advanced mathematics ability. But it should be as much a source of embarrassment to admit that one is innumerate as to admit that one is illiterate, rather than (as now) innumeracy being almost a badge of honour for some Humanist scholars. Since even the most elementary arithmetic operators and measures--the ones most Shakespearians are comfortable with--are quite capable of misleading us, we need to move forward collectively and slowly.

**Works Cited**

Ilsemann, Hartmut. 2005. "Some Statistical Observations on Speech Lengths in Shakespeare's Plays." *Shakespeare Jahrbuch* 141. 158-68.

Ioannidis, John P. A. 2005. "Why Most Published Research Findings Are False." *Public Library of Science Medicine (PLoS Med)* 2.8. n. pag. DOI 10.1371/journal.pmed.0020124.

Somerset, Alan. 1994. "'How Chances it They Travel?': Provincial Touring, Playing Places, and the King's Men." *Shakespeare Survey* 47. 45-60.

Hawking, Stephen. 1988. *A Brief History of Time: From the Big Bang to Black Holes*. Introd. Carl Sagan. London: Bantam.