Gabriel Egan . com


Can We Tell from Lexical Analysis What Parts Shakespeare (or Anyone Else) Played? Testing the Cibber Corpus

by Matthew Steggle and Steve Roth

The short answer to the question posed in the title of this article is, "We can't." We tested several analysis methods; none showed any ability to accurately identify the parts played by a player/playwright, or to demonstrate that conned parts influence that player/writer's subsequently written works. The analysis methods are described below.

These are negative results, so they can never be definitive. It's possible that some other analysis method would succeed where ours have failed. And there are some caveats to our implementation of the analyses (discussed below) that might disqualify their results.

Testing the Methods Against Known Results

To recap: Foster's model proposes that:

1. Shakespeare writes a play, say, Midsummer Night's Dream.

2. The play is performed, and Shakespeare acts a part in it, say, Egeus. In memorizing all Egeus' speeches, Shakespeare becomes unusually attuned to the rare words included in those speeches.

3. In Shakespeare's next play, rare words from Egeus' part will crop up disproportionately often, because he will remember those words unusually clearly.

Aside from the general problems of rare-word analysis discussed by Steve and Gabriel elsewhere on this site, and aside from the persisting problem that Foster's results depend on a database that no other scholar has access to, there are several more particular difficulties with this highly attractive model:

1. The date and sequence of composition of Shakespeare's plays are not known with complete precision.

2. Our knowledge of the date and extent of stage performances of the plays is, at best, patchy.

3. There is little "gold standard" evidence of the parts that Shakespeare did play as an actor: tradition ascribes to him roles including the ghost in Hamlet, and Adam in As You Like It, but beyond this it is hard to go. Therefore, it is hard to check the accuracy of the conclusions of this method.

4. The model has not been demonstrated on any other comparable author.

We set out to address problem 4. Would the Foster method be able to predict the parts played by an author whose dramatic career could be traced with some certainty?

Finding a candidate

The candidate author had to be an actor-writer, as close as possible in time to Shakespeare, with a large corpus of writing, well-documented as to date and solely authored by him. Furthermore, he needed to have enjoyed a well-attested theatrical career: we needed to know what parts he played in what plays at what date.

No pre-1642 actor-dramatist meets these criteria adequately. The best contenders are Robert Armin, whose corpus of accessible work is too small--two full-length plays, and one jest-book--and Ben Jonson, whose writings are certainly extensive, but whose theatrical career cannot be traced beyond his appearances as "mad Hieronimo" in the 1590s.

Moving to the Restoration theatre is less than ideal, since it opens up a gap in many aspects of theatrical culture. This applies most obviously to patterns of line-learning and performance, but in fact affects all aspects of authorship practice as well. Indeed, the very factors that make Restoration theatre better documented make it less like the theatre of Shakespeare. However, it does contain a candidate who meets our initial criteria: Colley Cibber.

Cibber's parts and works

Colley Cibber (1671-1757), actor, playwright, theatre manager, and ultimately poet laureate, is generally remembered as the enemy of Pope and as one of the victims of Pope's Dunciad. But he was also a prolific actor-writer. Furthermore, both his literary and theatrical careers are well documented, thanks partly to his own autobiography. He is therefore a suitable candidate for this project. If the rare-words technique works, it should be possible to deduce the parts he played from the influence that they have on subsequent plays. We can then compare the results of this deduction to the known parts Cibber played to evaluate the rare-words technique's value in determining parts played.

Matt's first step was to build a corpus of available material written by Cibber, and of parts that Cibber played. The backbone of both took the form of Cibber's own plays, sourced in electronic form from Literature Online (LION), whose policy has been to keyboard the texts from the first edition.

The plays involved can usefully be presented in the following table:

| Play (date of first publication) | Cibber's part | Year of first performance |
| --- | --- | --- |
| Love's last shift (1696) | Sir Novelty Fashion | 1696 |
| Woman's wit (1697) | Longville | 1697 |
| The rival queans (1729) | ?Alexander | ?1699 |
| Xerxes (1699) | No part | 1699 |
| Love makes a man (1701) | Clodio | 1700 |
| The school-boy (1707) | Mass Johnny | 1702 |
| She wou'd, and she wou'd not (1703) | Don Manuel | 1702 |
| The careless husband (1705) | Lord Foppington | 1704 |
| Perolla and Izadora (1706) | Pacuvius | 1705 |
| The comical lovers (1707) | Celadon | 1707 |
| The double gallant (1707) | Atall | 1707 |
| The lady's last stake (1708) | Sir George Brillant | 1707 |
| The rival fools (1709) | Samuel Simple | 1709 |
| The non-juror (1718) | Dr Wolf | 1711 |
| Ximena (1719) | Don Alvarez | 1712 |
| Cinna's conspiracy (1713) | No part | 1713 |
| The refusal (1721) | Witling | 1721 |
| Caesar in Aegypt (1725) | Achoreus | 1724 |
| Papal tyranny in the reign of King John (1745) | Pandulph | 1745 |

Exclusions

LION has 26 dramatic texts ascribed to Cibber. However, two of these are linked: Cibber completed Vanbrugh's A Journey to London after Vanbrugh's death, performing the resulting play as The Provoked Husband. Both plays were therefore excluded from the building of the database, as they do not represent Cibber's sole authorship.

Richard III is Cibber's adaptation of Shakespeare's play, incorporating large numbers of Shakespeare's lines: again, it is excluded on the grounds that not all the words are by Cibber. Damon and Phillida, listed as Cibber's by LION, is excluded from the Cibber canon by Viator and Burling, and from our corpus (Timothy J. Viator and William J. Burling, eds., The Plays of Colley Cibber, Volume One (Madison, NJ: Fairleigh Dickinson UP, 2001)). In spite of uncertainties about date and text regarding The Rival Queans, Viator and Burling (429) conclude that it is probably all by Cibber. Myrtillo, Venus and Adonis, and Love in a Riddle are all excluded from this study on the grounds that they are musicals rather than conventional plays.

Other texts that do not appear on LION, but should also be mentioned as exclusions, are:

The Lottery (1728). Not by Cibber according to Viator and Burling.

Hob in the well, also called Flora (1729). Not by Cibber according to Viator and Burling.

Polypheme (1735). No etext available.

The Hypocrite (1716), a rewrite of The Non-Juror. No etext available.

The Bulls and the Bears (1715). Lost.

Thus, we are left with nineteen dramatic texts. All of them were (according to the title-pages of the first editions) written unaided by Cibber. In all but two of these, Cibber's part can be established either from the cast-list in the publication or from other evidence collected by Viator and Burling.

There is one other text that it is possible to add to the corpus: Cibber's prose autobiography, Apology for the life of Colley Cibber (1740). A two-part etext of this is available from links in the left frame (Cibber Apology 1, 2). The resulting corpus comes to about 566,000 words, which is still a little smaller than Shakespeare's (about 711,000 words).

Preparing the Texts

In the case of the LION plays, each text was downloaded from LION and saved as a ".txt" file. Using a combination of automated search and replace routines, and manual editing and checking, Matt adjusted the format so as to be suitable for the Perl program. This involved:

1. stripping out LION's header and footer material

2. stripping out LION's indications of page and line numbers

3. stripping out all those parts of the publication that were not the dialogue itself: prefatory material, cast lists, indications of scene-division, inset songs, and stage-directions.

4. replacing all "VV" with "W" (but spelling itself was left unstandardized, for reasons of repeatability and speed)

5. ensuring that each speech was arranged with a blank line, followed by a line containing only the speech-prefix, followed by the speech itself.

6. ensuring that each speaker always had a consistent and unique speech-prefix.

The file was then emailed to Steve, who ran it through a specially modified Perl script to convert it into a list of parts, and processed them through FileMaker as described in the SHAXICAN section 'Roth's refinements'. Matt checked the list of parts generated, and used this to weed out remaining problems with inconsistent speech-prefixes, incorrect positioning of blank lines, etc.
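The conversion of a prepared text into a list of parts can be sketched as follows. This is not the Perl script Steve used (which we do not reproduce here); it is a minimal Python illustration that assumes only the layout described in steps 4 to 6 above. The function name `split_into_parts` and the sample dialogue are invented for the example.

```python
import re
from collections import defaultdict

def split_into_parts(text):
    """Collect each speaker's speeches into one "part".

    Assumes the prepared layout: a blank line, then a line holding only
    the speech-prefix, then the speech itself.
    """
    parts = defaultdict(list)
    # Speeches are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n")):
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        if len(lines) < 2:
            continue  # empty block, or a prefix with no speech after it
        prefix, speech = lines[0], " ".join(lines[1:])
        # Step 4 of the preparation list: replace "VV" with "W".
        parts[prefix].append(speech.replace("VV", "W"))
    return {speaker: " ".join(speeches) for speaker, speeches in parts.items()}

# Invented sample text, not taken from any Cibber play.
sample = """
Fop.
VVhy, Madam, 'tis a fine Day.

Lady.
It is, Sir.

Fop.
Then let us walk.
"""

parts = split_into_parts(sample)
for speaker, words in parts.items():
    print(speaker, "->", words)
```

A speaker whose prefix is spelled two different ways would show up here as two separate parts, which is why the list of parts was a convenient way to spot inconsistent speech-prefixes.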

The same procedure was followed with the Apology, except that instead of stage directions to remove, there were the notes of the later editor to be deleted.

By the end of this process, then, we had built a database of 8,186 "rare" words (words used between two and twelve times in the Cibber corpus).
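To make the definition of "rare" concrete, here is a small Python sketch of how such a list could be drawn up. It is an illustration only, not the FileMaker processing actually used. The tokenizing rule (lower-cased runs of letters, with internal apostrophes kept) is our assumption, and the toy corpus is invented.

```python
import re
from collections import Counter

def tokenize(text):
    # String-based tokens: no lemmatizing, no spelling normalization,
    # and no attempt to tell the noun "row" from the verb "row".
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def rare_words(texts, low=2, high=12):
    """Return the strings used between `low` and `high` times in all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return {word for word, n in counts.items() if low <= n <= high}

# Invented toy corpus: "the" occurs 14 times, "row" and "periwig" twice each.
toy_corpus = [
    "the fop wou'd row the boat",
    "a row of beaux and the periwig",
    "the " * 11 + "periwig",
]

rare = rare_words(toy_corpus)
print(sorted(rare))
```

In the toy corpus "row" counts as rare even though one use is a verb and the other a noun, which is exactly the string-versus-word difference between this kind of count and Foster's.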

Counting and Analysis Methods

We tested several analysis methods, not just the one that Foster adopted for his Shakespeare Newsletter (SNL) articles. We gave them people's names for easy remembrance. The methods are described below. A few definitions of the terms we used will make those methods easier to understand. We refer to a "source play" as a play made up of parts ("source parts") that might be influencing "target plays." Influence means the presumed effect of a source part or play on a target play, as measured by the usage of rare words from the source part/play in the target play. There are two counting methods that may be used as the basis for each analysis method. It's unclear which of these methods Foster used in his SNL analysis.

1. How many of the rare words from a source part are used in a target play? We called this a count of "usages."

2. How many instances of the source-part rare words are there in the target play? (A word can obviously be used multiple times in a play.) We called this a count of "instances."
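The difference between the two counts is easy to see in a few lines of Python. This is a sketch for illustration, with an invented source-part word list and an invented target text; the function name `usages_and_instances` is ours.

```python
from collections import Counter

def usages_and_instances(source_part_rare, target_tokens):
    """Count shared rare words two ways.

    usages:    how many distinct source-part rare words occur in the target
    instances: how many times those words occur in the target in total
    """
    target_counts = Counter(target_tokens)
    shared = [w for w in source_part_rare if target_counts[w] > 0]
    usages = len(shared)
    instances = sum(target_counts[w] for w in shared)
    return usages, instances

# Invented example data.
source_part_rare = {"coxcomb", "periwig", "foppery"}
target_tokens = "a coxcomb in a periwig is a coxcomb still".split()

usages, instances = usages_and_instances(source_part_rare, target_tokens)
print(usages, instances)  # 2 usages (coxcomb, periwig), 3 instances
```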

We tested both counting methods for the two analysis methods for which it was appropriate. The results from each counting method varied significantly in one of the analysis methods, but not in a way that affected our overall conclusion. The result of each analysis method (the presumed "influence") was expressed as a ratio.

Foster Ratio. Assume that a source part which constitutes 5% of its play will also exert 5% of that source play's influence on a target play. If that source part's influence is significantly higher, it suggests that the author may have played the source part prior to writing the target play. Foster's (unstated) break-point was an influence at least 50% greater than predicted.

There are at least two difficulties with this method; it does not account for two aspects of the data:

• Target play length. A long play has more words than a short play, hence more rare words, hence more likely correlations with any source part. So long plays, in general, would by this method seem to have been more heavily influenced than short plays. Our analysis bore this out.

• Random variation. Some parts, statistically, will by sheer chance have lots of rare words. These parts will seem to exert inordinate influence on all target plays.

It's also unclear whether the assumption described in the first sentence of this ratio description is valid, and it's not clear how to test its validity given the random variation discussed in the preceding bullet point.
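As we implemented it, the Foster Ratio comes down to one division of two shares. The Python sketch below uses invented numbers and assumes that a part's share of its play is measured in words; Foster does not state his own procedure, so this should not be read as his implementation.

```python
def foster_ratio(part_shared, play_shared, part_words, play_words):
    """Observed share of rare-word influence over projected share.

    part_shared: rare words (or instances) shared by source part and target play
    play_shared: rare words (or instances) shared by whole source play and target play
    part_words, play_words: lengths of the part and of its play (assumed: in words)
    """
    observed = part_shared / play_shared   # the part's share of the influence
    projected = part_words / play_words    # the part's share of its play
    return observed / projected

BREAK_POINT = 1.5  # "at least 50% greater than predicted"

# Invented figures: a part that is 5% of its play but accounts for
# 10% of the rare words its play shares with the target.
ratio = foster_ratio(part_shared=12, play_shared=120,
                     part_words=1000, play_words=20000)
flagged = ratio >= BREAK_POINT
print(round(ratio, 2), flagged)
```

Nothing in this calculation refers to the length of the target play or to how many rare words the part happens to contain, which is where the two difficulties listed above come from.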

Egan Ratio. Gabriel suggested this method in his initial SHAXICAN ('The idea'). Compare the relative frequency of source-part rare words in a target play to their frequency in the whole corpus. If their frequency is significantly higher in the target play, it suggests that the source part influenced that play, causing the writer to use the words more frequently. Since this method inherently counts the number of times a word is used (in a target play and in the corpus), it only relies on one of the two counting methods. This method accounts for both the difficulties inherent in the Foster Ratio, and does not rely on the underlying assumption of that method.
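A minimal Python sketch of the Egan Ratio follows. It assumes that the words compared are the rare words shared between source part and target play (the definition given in the appendix), that the corpus includes the target play, and that frequency means instances divided by length in tokens. The data are invented.

```python
from collections import Counter

def egan_ratio(source_part_rare, target_tokens, corpus_tokens):
    """Frequency of shared rare words in the target over their corpus frequency."""
    target_counts = Counter(target_tokens)
    corpus_counts = Counter(corpus_tokens)
    shared = [w for w in source_part_rare if target_counts[w] > 0]
    if not shared:
        return 0.0  # no shared rare words, so no measurable influence
    target_freq = sum(target_counts[w] for w in shared) / len(target_tokens)
    corpus_freq = sum(corpus_counts[w] for w in shared) / len(corpus_tokens)
    return target_freq / corpus_freq

# Invented data: "periwig" is 2 of 10 tokens in the target play,
# but only 4 of 100 tokens in the corpus as a whole.
source_part_rare = {"periwig", "coxcomb"}
target_tokens = ["periwig"] * 2 + ["the"] * 8
corpus_tokens = target_tokens + ["periwig"] * 2 + ["the"] * 88

ratio = egan_ratio(source_part_rare, target_tokens, corpus_tokens)
print(round(ratio, 2))
```

Because both numerator and denominator are frequencies, a long target play gains no advantage simply from being long.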

Roth Ratio. Compare the count of rare words shared between source part and target play to the count of rare words in the source part. A high ratio suggests more influence. This method accounts for the number of rare words in the source part, but not for the length of target plays.
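The Roth Ratio can be sketched in Python as below, with a switch between the "usages" and "instances" counts. The word list and target text are invented, and the function name `roth_ratio` is ours.

```python
from collections import Counter

def roth_ratio(source_part_rare, target_tokens, count_instances=False):
    """Shared rare words (usages or instances) over rare words in the source part."""
    target_counts = Counter(target_tokens)
    shared = [w for w in source_part_rare if target_counts[w] > 0]
    if count_instances:
        numerator = sum(target_counts[w] for w in shared)
    else:
        numerator = len(shared)
    return numerator / len(source_part_rare)

# Invented example data: four rare words in the part, two of them in the target.
source_part_rare = {"coxcomb", "periwig", "foppery", "snuff"}
target_tokens = "a coxcomb in a periwig is a coxcomb still".split()

by_usages = roth_ratio(source_part_rare, target_tokens)
by_instances = roth_ratio(source_part_rare, target_tokens, count_instances=True)
print(by_usages, by_instances)  # 0.5 0.75
```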

Steggle Ratio. A count of rare words shared between works as a proportion of the works' combined lengths. This is not so much a test of Foster's approach as a test of the analysis machinery in use. Matthew suggested this analysis early in the process to see if the SHAXICAN engine would reveal likely similarities between plays (with no analysis of parts or their influence). As expected, it showed that tragedies tend to share rare words with other tragedies (tragedies showed the strongest correlation), comedies with comedies, etc. Hence, we conclude that rare-word statistics overall are more likely to reflect associations due to genre than due to other factors.
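For completeness, the Steggle Ratio can be sketched in the same way. We assume here that "combined lengths" means the total number of tokens in the two works; the texts and rare-word list are invented.

```python
def steggle_ratio(tokens_a, tokens_b, rare):
    """Rare words found in both works, over the works' combined length."""
    shared = set(tokens_a) & set(tokens_b) & set(rare)
    return len(shared) / (len(tokens_a) + len(tokens_b))

# Invented example data: two short "works" of 6 and 4 tokens.
rare = {"periwig", "coxcomb", "foppery"}
work_a = "the periwig and the coxcomb sat".split()
work_b = "no coxcomb no periwig".split()

ratio = steggle_ratio(work_a, work_b, rare)
print(ratio)  # 2 shared rare words over 10 tokens = 0.2
```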

As explained at the beginning of this article, none of the analysis methods showed a discernible (by us) correlation between parts played and works written subsequently. Detailed tables of results are available in PDF format in the appendix below.

Now, we are aware that we're not exactly replicating Foster's technique:

  1. As discussed by Gabriel, Foster's counting method depends on words (distinguishing the noun 'row' from the verb 'row') while SHAXICAN's depends on strings (treating 'row' as just three letters in succession, irrespective of the meaning).
  2. There is an extra complication with Cibber which doesn't apply with the work on Shakespeare: the Cibber corpus is built from texts in unmodernized spelling. However, we consider that the overall effect of this is likely to be small, since the eighteenth-century spelling of the Cibber texts is already reasonably consistent.

And there are a number of stones that we have left unturned:

  1. We didn't experiment with altering the threshold of defining a "rare" word, currently set at 12 or fewer appearances in the whole canon.
  2. We didn't experiment with permutations of the initial canon: with including the 'marginal' texts we excluded, or with paring it down by (say) excluding the autobiography. Since generic similarity clearly 'drowns out' other factors in the Steggle ratio, we could also have tried analyses based solely on a corpus of tragedies, say, or city comedies. Furthermore, we didn't analyze the potential influence of parts known to have been acted by Cibber in plays he did not write.

The reason we have not investigated these avenues is that the initial results were so disappointing as to discourage us. In order to make a convincing case that the technique can meaningfully be applied to Shakespeare, where the correct answer is not known, it would be necessary first to get loud and clear results from Cibber. And at the moment, we can't.

Matt Steggle & Steve Roth, December 2002

Appendix. Parts played by Cibber and their "influence" on plays he wrote

Links in the left frame point to the following four reports (in Acrobat PDF format) that show the "influence" of parts played by Cibber on his writings. Each presents the same data--comparing all the analysis methods--but sorted by different ratios so it's easy to scan for patterns resulting from each method.

FosterRatioSort.pdf. The Foster Ratio is the part's share of rare-word (RW) influence divided by the part's share of its play (which is Foster's measure of "projected" influence). The part's share of influence is the number of rare words from the part that appear in the target play (or the number of instances of those rare words) divided by the total number of rare words shared with the source play. Only one report is provided (sorted by usages) because the usages and instances ratios don't vary much.

EganRatioSort.pdf. The Egan Ratio is the frequency of shared rare words in the target play (necessarily usage instances) over frequency of those words in the corpus (ditto). This method inherently relies on an "instances" count.

RothRatioUsagesSort.pdf. The Roth Ratio is the number of rare words shared between source part and target play (or instances of those words in the target play) over the number of rare words in the source part. For the 'instances' case, there's another report: RothRatioInstancesSort.pdf.

These reports only show source-part/target-play correlations where the part has a known playing date by Cibber, and the source part and target play share at least 10 rare words. A report on the Steggle Ratio is not included because it compares plays and plays, not parts and plays.
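The two inclusion rules for the reports amount to a simple filter. The Python sketch below uses invented rows and invented field names (`played`, `shared_rare`); it does not reflect the layout of the actual reports.

```python
# Invented rows: one per source-part/target-play pairing.
rows = [
    {"part": "play1_PartA", "played": 1700, "shared_rare": 14},
    {"part": "play2_PartB", "played": None, "shared_rare": 25},  # playing date unknown
    {"part": "play3_PartC", "played": 1705, "shared_rare": 7},   # fewer than 10 shared words
]

def reportable(row, min_shared=10):
    """Keep a pairing only if the part has a known playing date and
    shares at least `min_shared` rare words with the target play."""
    return row["played"] is not None and row["shared_rare"] >= min_shared

kept = [row["part"] for row in rows if reportable(row)]
print(kept)
```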

The abbreviations used in the reports are as follows.

| Play (date of first publication) | Cibber's part | Perf. | Play title abbrev. | Cibber's part abbrev. |
| --- | --- | --- | --- | --- |
| Love's last shift (1696) | Sir Novelty Fashion | 1696 | llsh | llsh_SirNov |
| Woman's wit (1697) | Longville | 1697 | wowi | wowi_Lon |
| The rival queans (1729) | ?Alexander | ?1699 | rivq | rivq_Al |
| Xerxes (1699) | No part | 1699 | xerx | - |
| Love makes a man (1701) | Clodio | 1700 | lmam | lmam_Clo |
| The school-boy (1707) | Mass Johnny | 1702 | scho | scho_Maj |
| She wou'd, and she wou'd not (1703) | Don Manuel | 1702 | shwo | shwo_DMa |
| The careless husband (1705) | Lord Foppington | 1704 | care | care_LdFop |
| Perolla and Izadora (1706) | Pacuvius | 1705 | pero | pero_Pac |
| The comical lovers (1707) | Celadon | 1707 | comi | comi_Cel |
| The double gallant (1707) | Atall | 1707 | doub | doub_At |
| The lady's last stake (1708) | Sir George Brillant | 1707 | lady | lady_LdGeo |
| The rival fools (1709) | Samuel Simple | 1709 | rivf | rivf_Sim |
| The non-juror (1718) | Dr Wolf | 1711 | nonj | nonj_Doct |
| Ximena (1719) | Don Alvarez | 1712 | xime | xime_Alv |
| Cinna's conspiracy (1713) | No part | 1713 | cinn | - |
| The refusal (1721) | Witling | 1721 | refu | refu_Wit |
| Caesar in Aegypt (1725) | Achoreus | 1724 | caes | caes_Acho |
| Papal tyranny in the reign of King John (1745) | Pandulph | 1745 | papa | papa_Pand |

Cibber’s two volumes of autobiography were identified in the study as APOLVOL1 and APOLVOL2 respectively.