Can We Tell from Lexical Analysis What Parts Shakespeare (or Anyone Else) Played? Testing the Cibber Corpus
by Matthew Steggle and Steve Roth
The short answer to the question posed in the title of this article is, "We can't." We tested several analysis methods; none showed any ability to identify accurately the parts played by a player/playwright, or to demonstrate that conned parts influence that player/playwright's subsequently written works. The analysis methods are described below.
These are negative results, so they can never be definitive. It's possible that some other analysis method would succeed where ours have failed. And there are some caveats to our implementation of the analyses (discussed below) that might disqualify their results.
Testing the Methods Against Known Results
To recap: Foster's model proposes that:
1. Shakespeare writes a play, say, Midsummer Night's Dream.
2. The play is performed, and Shakespeare acts a part in it, say, Egeus. In memorizing all Egeus' speeches, Shakespeare becomes unusually attuned to the rare words included in those speeches.
3. In Shakespeare's next play, rare words from Egeus' part will crop up disproportionately often, because he will remember those words unusually clearly.
Aside from the general problems of rare-word analysis discussed by Steve and Gabriel elsewhere on this site, and aside from the persisting problem that
Foster's results depend on a database that no other scholar has access to, there are several more particular difficulties with this highly attractive model:
1. The date and sequence of composition of Shakespeare's plays is not known with complete precision.
2. Our knowledge of the date and extent of stage performances of the plays is, at best, patchy.
3. There is little "gold standard" evidence of the parts that Shakespeare did play as an actor: tradition ascribes to him roles including the ghost in Hamlet, and Adam in As You Like It, but beyond this it is hard to go. Therefore, it is hard to check the accuracy of the conclusions of this method.
4. The model has not been demonstrated on any other comparable author.
We set out to address problem 4. Would the Foster method be able to predict the parts played by an author whose dramatic career could be traced with some certainty?
Finding a candidate
The candidate author had to be an actor-writer, as close as possible in time to Shakespeare, with a large corpus of writing, well-documented as to date and solely authored by him. Furthermore, he needed to have enjoyed a well-attested theatrical career: we needed to know what parts he played in what plays at what date.
No pre-1642 actor-dramatist meets these criteria adequately. The best contenders are Robert Armin, whose corpus of accessible work is too
small--two full-length plays, and one jest-book--and Ben Jonson, whose writings are certainly extensive, but whose theatrical career cannot be traced beyond his appearances as "mad Hieronimo" in the 1590s.
Moving to the Restoration theatre is less than ideal, since it opens up a gap in
many aspects of theatrical culture. This applies most obviously to patterns of line-learning and performance, but in fact affects all aspects of authorship practice as well. Indeed, the very factors that make Restoration theatre better documented make it less like the theatre of Shakespeare. However, it does contain a candidate who meets our initial criteria: Colley Cibber.
Cibber's parts and works
Colley Cibber (1671-1757), actor, playwright, theatre manager, and ultimately poet laureate, is generally remembered as the enemy of Pope and as one of the victims of
Pope's Dunciad. But he was also a prolific actor-writer. Furthermore, both his literary and theatrical careers are well documented, thanks partly to his own autobiography. He is therefore a suitable candidate for this project. If the rare-words technique works, it should be possible to deduce the parts he played from the influence that they have on subsequent plays. We can then compare the results of this deduction to the known parts Cibber played to evaluate the rare-words technique's value in determining parts played.
Matt's first step was to build a corpus of available material written by Cibber, and of parts that Cibber played. The backbone of both took the form of
Cibber's own plays, sourced in electronic form from Literature Online (LION), whose policy has been to keyboard the texts from the first edition.
The plays involved can usefully be presented in the following table:
Exclusions
LION has 26 dramatic texts ascribed to Cibber. However, two of these are linked: Cibber completed
Vanbrugh's A Journey to London after Vanbrugh's death, performing the resulting play as The Provoked Husband. Both plays were therefore excluded from the building of the database, as they do not represent
Cibber's sole authorship.
Richard III is Cibber's adaptation of Shakespeare's play, incorporating large numbers of Shakespeare's lines: again, it is excluded on the grounds that not all the words are by Cibber. Damon and Phillida, listed as Cibber's by LION, is excluded from the Cibber canon by Viator and Burling, and from our corpus (Timothy J. Viator and William J. Burling, eds., The Plays of Colley Cibber, Volume One (Madison, NJ: Fairleigh Dickinson UP, 2001)). In spite of uncertainties about date and text regarding The Rival Queans, Viator and Burling (429) conclude that it is probably all by Cibber, so we retain it. Myrtillo, Venus and Adonis, and Love in a Riddle are all excluded from this study on the grounds that they are musicals rather than conventional plays.
Other texts that do not appear on LION, but should also be mentioned as exclusions, are:
The Lottery (1728). Not by Cibber according to Viator and Burling.
Hob in the well, also called Flora (1729). Not by Cibber according to Viator and Burling.
Polypheme (1735). No etext available.
The Hypocrite (1716), a rewrite of The Non-Juror. No etext available.
The Bulls and the Bears (1715). Lost.
Thus, we are left with nineteen dramatic texts. All of them were (according to the title-pages of the first editions) written unaided by Cibber. In all but two of these,
Cibber's part can be established either from the cast-list in the publication or from other evidence collected by Viator and Burling.
There is one other text that it is possible to add to the corpus: Cibber's prose autobiography, Apology for the life of Colley Cibber (1740). A two-part etext of this is available from links in the left frame (Cibber apology 1, 2). The resulting corpus comes to about 566,000 words, which is still smaller than Shakespeare's (about 711,000 words), at roughly 80% of its size.
Preparing the Texts
In the case of the LION plays, each text was downloaded from LION and saved as a
".txt" file. Using a combination of automated search and replace routines, and manual editing and checking, Matt adjusted the format so as to be suitable for the
Perl program. This involved:
1. stripping out LION's header and footer material
2. stripping out LION's indications of page and line numbers
3. stripping out all those parts of the publication that were not the dialogue itself: prefatory material, cast lists, indications of scene-division, inset songs, and stage-directions.
4. replacing all "VV" with "W" (but spelling itself was left unstandardized, for reasons of repeatability and speed)
5. ensuring that each speech was arranged with a blank line, followed by a line containing only the speech-prefix, followed by the speech itself.
6. ensuring that each speaker always had a consistent and unique speech-prefix.
The file was then emailed to Steve, who ran it through a specially modified
Perl script to convert it into a list of parts, and processed them through FileMaker as described in the
SHAXICAN section 'Roth's refinements'. Matt checked the list of parts generated, and used this to weed out remaining problems with inconsistent speech-prefixes, incorrect positioning of blank lines, etc.
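The splitting step can be sketched in Python as follows. This is an illustrative reconstruction, not the modified Perl script itself: the function name split_into_parts and the sample speech-prefixes are hypothetical, and the sketch assumes the prepared layout described in step 5 (a blank line, then a line containing only the speech-prefix, then the speech).

```python
import re
from collections import defaultdict


def split_into_parts(text):
    """Group speeches by speech-prefix, assuming the prepared layout:
    blank line, prefix-only line, then the lines of the speech."""
    speeches = defaultdict(list)
    # Speeches are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2:
            # A prefix with no speech, or a stray line: worth checking by hand.
            continue
        prefix = lines[0].strip()
        speeches[prefix].append(" ".join(line.strip() for line in lines[1:]))
    # One string per part: all of that speaker's speeches joined together.
    return {prefix: " ".join(parts) for prefix, parts in speeches.items()}


# Hypothetical sample in the prepared format.
sample = "\nFOP.\nStap my vitals!\n\nLOV.\nHow now?\nWhat news?\n\nFOP.\nGad's curse.\n"
parts = split_into_parts(sample)
```

A part list built this way makes inconsistent speech-prefixes easy to spot, since each variant spelling of a prefix shows up as a separate, suspiciously short part.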
The same procedure was followed with the Apology, except that instead of stage directions to remove, there were the notes of the later editor to be deleted.
By the end of this process, then, we had built a database of 8,186 "rare" words (words used between two and twelve times in the Cibber corpus).
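The rare-word definition amounts to a frequency filter over the whole corpus. A minimal Python sketch follows, assuming a simple string-based tokenizer (lower-cased runs of letters and apostrophes); the source does not specify how SHAXICAN splits text into strings, so tokenize is a hypothetical stand-in, as are the function names.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a "word" is a lower-cased run of letters and apostrophes.
    # Strings are counted as spelled, with no lemmatizing or modernizing.
    return re.findall(r"[a-z']+", text.lower())


def rare_words(corpus_texts, low=2, high=12):
    """Return the strings used between `low` and `high` times (inclusive)
    across all the texts in the corpus."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(tokenize(text))
    return {word for word, n in counts.items() if low <= n <= high}


# Toy corpus: "the" occurs 14 times, "fop" twice, "beau" once.
toy_corpus = ["the fop the fop", "the beau", "the " * 11]
toy_rare = rare_words(toy_corpus)
```

In the toy corpus only "fop" qualifies: "the" is too common and "beau", used once, falls below the lower bound.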
Counting and Analysis Methods
We tested several analysis methods, not just the one that Foster adopted for his Shakespeare Newsletter (SNL) articles. We named each method after a person to make it easier to remember. The methods are described below. A few definitions of the terms we used will make those methods easier to understand.
We refer to a "source play" as a play made up of parts ("source parts") that might be influencing "target plays."
Influence means the presumed effect of a source part or play on a target play, as measured by the usage of rare words from the source part/play in the target play.
There are two counting methods that may be used as the basis for each analysis method. It's unclear which of these methods Foster used in his SNL analysis.
1. How many of the rare words from a source part are used in a target play? We called this a count of "usages."
2. How many instances of the source-part rare words are there in the target play? (A word can obviously be used multiple times in a play.) We called this a count of "instances."
We tested both counting methods for the two analysis methods for which it was appropriate. The results from each counting method varied significantly in one of the analysis methods, but not in a way that affected our overall conclusion.
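The difference between the two counts is easiest to see side by side. The following Python sketch is illustrative only; the function name and the sample words are hypothetical, and the target play is assumed to be already split into a list of strings.

```python
from collections import Counter


def count_usages_and_instances(source_part_rare, target_tokens):
    """Return (usages, instances) for one source part and one target play.

    usages:    how many distinct rare words from the part occur in the target
    instances: how many times those words occur in the target, in total
    """
    target_counts = Counter(target_tokens)
    shared = {w for w in source_part_rare if target_counts[w] > 0}
    usages = len(shared)
    instances = sum(target_counts[w] for w in shared)
    return usages, instances


usages, instances = count_usages_and_instances(
    {"foppery", "periwig", "snuff"},
    "my periwig and my snuff and more snuff".split(),
)
print(usages, instances)  # 2 3
```

Here two of the part's rare words appear in the target (two usages), but "snuff" appears twice, giving three instances.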
The result of each analysis method (the presumed "influence") was expressed as a ratio.
Foster Ratio. Assume that a source part which constitutes 5% of its play will also exert 5% of that source play's influence on a target play. If that source part's influence is significantly higher, it suggests that the author may have played the source part prior to writing the target play. Foster's (unstated) break-point was an influence at least 50% greater than predicted.
There are at least two difficulties with this method; it does not account for two aspects of the data:
- Target play length. A long play has more words than a short play, hence more rare words, hence more likely correlations with any source part. So long plays, in general, would by this method seem to have been more heavily influenced than short plays. Our analysis bore this out.
- Random variation. Some parts, statistically, will by sheer chance have lots of rare words. These parts will seem to exert inordinate influence on all target plays.
It's also unclear whether the assumption described in the first sentence of this ratio description is valid, and it's not clear how to test its validity given the random variation discussed in the preceding bullet point.
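As a worked illustration of the Foster Ratio, here is a minimal Python sketch. It assumes that a part's share of its play is measured by word count (the source does not say how Foster measured it), and the function name and figures are invented for the example.

```python
def foster_ratio(part_shared, play_shared, part_words, play_words):
    """Observed share of influence over projected share of influence.

    part_shared: rare words (usages or instances) shared by the source part
                 and the target play
    play_shared: rare words shared by the whole source play and the target play
    part_words:  length of the source part (assumed here: word count)
    play_words:  length of the whole source play (same unit)
    """
    if play_shared == 0 or part_words == 0 or play_words == 0:
        raise ValueError("ratio is undefined when a denominator is zero")
    observed_share = part_shared / play_shared
    projected_share = part_words / play_words
    return observed_share / projected_share


# A part that is 5% of its play (500 of 10,000 words) but accounts for
# 8% of the play's shared rare words (16 of 200).
ratio = foster_ratio(part_shared=16, play_shared=200, part_words=500, play_words=10_000)
```

In this invented case the part exerts 8% of the influence against a projected 5%, a ratio of about 1.6, which would clear the 50% break-point. Note that nothing in the calculation corrects for target-play length or for parts that happen to be rich in rare words.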
Egan Ratio. Gabriel suggested this method in his initial SHAXICAN ('The idea'). Compare the relative frequency of source-part rare words in a target play to their frequency in the whole corpus. If their frequency is significantly higher in the target play, it suggests that the source part influenced that play, causing the writer to use the words more frequently. Since this method inherently counts the number of times a word is used (in a target play and in the corpus), it relies on only one of the two counting methods, the "instances" count. This method accounts for both the difficulties inherent in the Foster Ratio, and does not rely on the underlying assumption of that method.
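A minimal Python sketch of the Egan Ratio follows. It assumes that "relative frequency" means instances per word of running text, in the target play and in the corpus respectively; the function name and the toy data are hypothetical.

```python
from collections import Counter


def egan_ratio(source_part_rare, target_tokens, corpus_tokens):
    """Frequency of the part's rare words in the target play, divided by
    their frequency in the whole corpus (both counted as instances)."""
    target_counts = Counter(target_tokens)
    corpus_counts = Counter(corpus_tokens)
    target_instances = sum(target_counts[w] for w in source_part_rare)
    corpus_instances = sum(corpus_counts[w] for w in source_part_rare)
    if not target_tokens or corpus_instances == 0:
        raise ValueError("ratio is undefined for an empty play or unused words")
    target_freq = target_instances / len(target_tokens)
    corpus_freq = corpus_instances / len(corpus_tokens)
    return target_freq / corpus_freq


# Toy data: 2 instances in a 10-word play, 4 instances in a 100-word corpus.
toy_target = ["periwig", "snuff"] + ["x"] * 8
toy_corpus_tokens = toy_target + ["periwig", "snuff"] + ["y"] * 88
toy_egan = egan_ratio({"periwig", "snuff"}, toy_target, toy_corpus_tokens)
```

Because both frequencies are normalized by length, a long target play gains no automatic advantage, and a part rich in rare words is compared against its own baseline in the corpus.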
Roth Ratio. Compare the count of rare words shared between source part and target play to the count of rare words in the source part. A high ratio suggests more influence. This method accounts for the number of rare words in the source part, but not for the length of target plays.
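In Python, the Roth Ratio might be sketched as below, with a switch for the two counting methods. The function name, the use_instances flag, and the sample words are hypothetical.

```python
from collections import Counter


def roth_ratio(source_part_rare, target_tokens, use_instances=False):
    """Rare words shared by source part and target play (counted as usages,
    or as instances in the target) over the rare words in the source part."""
    if not source_part_rare:
        raise ValueError("the source part has no rare words")
    target_counts = Counter(target_tokens)
    shared = [w for w in source_part_rare if target_counts[w] > 0]
    if use_instances:
        numerator = sum(target_counts[w] for w in shared)
    else:
        numerator = len(shared)
    return numerator / len(source_part_rare)


part_rare = {"periwig", "snuff", "foppery", "equipage"}
target = "my periwig and my snuff and more snuff".split()
by_usages = roth_ratio(part_rare, target)
by_instances = roth_ratio(part_rare, target, use_instances=True)
```

With four rare words in the part, two of them found in the target and three instances in all, the two variants come out at 0.5 and 0.75.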
Steggle Ratio. A count of rare words shared between works as a proportion of the works' combined lengths. This is not so much a test of Foster's approach as a test of the analysis machinery in use. Matt suggested this analysis early in the process to see whether the SHAXICAN engine would reveal likely similarities between plays (with no analysis of parts or their influence). As expected, it showed that tragedies tend to share rare words with other tragedies (the strongest correlation), comedies with comedies, and so on. Hence, we conclude that rare-word statistics overall are more likely to reflect associations due to genre than those due to other factors.
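A minimal Python sketch of the Steggle Ratio follows, assuming that shared rare words are counted as distinct strings and that a work's length is its number of words. The function name and sample data are hypothetical.

```python
def steggle_ratio(tokens_a, tokens_b, rare):
    """Distinct rare words found in both works, as a proportion of the
    two works' combined lengths in words."""
    combined_length = len(tokens_a) + len(tokens_b)
    if combined_length == 0:
        raise ValueError("both works are empty")
    shared = set(tokens_a) & set(tokens_b) & set(rare)
    return len(shared) / combined_length


work_a = "the periwig and the snuff".split()
work_b = "a periwig a sword a cloak".split()
toy_steggle = steggle_ratio(work_a, work_b, {"periwig", "snuff", "sword"})
```

Only "periwig" is a rare word common to both toy works, so the ratio is one shared word over eleven words of combined length.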
As explained at the beginning of this article, none of the analysis methods showed a correlation that we could discern between parts played and works written subsequently. Detailed tables of results are available in PDF format in the appendix below.
Now, we are aware that we're not exactly replicating Foster's technique:
- As discussed by Gabriel, Foster's counting method depends on words (distinguishing the noun 'row' from the verb 'row') while SHAXICAN's depends on strings (treating 'row' as just three letters in succession, irrespective of the meaning).
- There is an extra complication with Cibber which doesn't apply with the work on Shakespeare: the Cibber corpus is built from texts in unmodernized spelling. However, we consider that the overall effect of this is likely to be small, since the eighteenth-century spelling of the Cibber texts is already reasonably consistent.
And there are a number of stones that we have left unturned:
- We didn't experiment with altering the threshold for defining a "rare" word, currently set at between two and twelve appearances in the whole Cibber corpus.
- We didn't experiment with permutations of the initial canon: with including the 'marginal' texts we excluded, or with paring it down by (say) excluding the autobiography. Since generic similarity clearly 'drowns out' other factors in the Steggle ratio, we could also have tried analyses based solely on a corpus of tragedies, say, or city comedies. Furthermore, we didn't analyze the potential influence of parts known to have been acted by Cibber in plays he did not write.
The reason we have not investigated these avenues is that the initial results were so disappointing as to discourage us. In order to make a convincing case that the technique can meaningfully be applied to Shakespeare, where the correct answer is not known, it would be necessary first to get loud and clear results from Cibber. And at the moment, we can't.
Matt Steggle & Steve Roth, December 2002

Appendix.
Parts played by Cibber and their "influence" on plays he wrote
Links in the left frame point to the following four reports (in Acrobat PDF format)
that show the "influence" of parts played by Cibber on his writings. Each presents the same data--comparing all the analysis methods--but sorted by different ratios so it's easy to scan for patterns resulting from each
method.
FosterRatioSort.pdf. The Foster Ratio is the part's percentage of rare-word (RW) influence divided by the part's percentage of its play (which is Foster's measure of "projected" influence). The part's percentage of RW influence is the number of rare words from the part that appear in the target play (or the number of instances of those rare words) over the total rare words shared with the source play. Only one report is provided (sorted by usages) because the usages/instances ratios don't vary much.
EganRatioSort.pdf. The Egan Ratio is the frequency of shared rare words in the target play over the frequency of those words in the corpus. Both frequencies are necessarily counts of instances, so this method inherently relies on an "instances" count.
RothRatioUsagesSort.pdf. The Roth Ratio is the number of rare words shared between source part and target play (or instances of those words in target play) over number of rare words in the source part. For the 'instances' case, there's another report: RothRatioInstancesSort.pdf.
These reports only show source-part/target-play correlations where
the part has a known playing date by Cibber, and the source part and target play share at least 10 rare
words. A report on the Steggle Ratio is not included because it compares plays and plays, not parts and plays.
The abbreviations used in the reports are as follows.