Gabriel Egan . com


SHAXICON meets SHAXICAN

By Steve Roth

In 1991, Donald Foster published a three-part series of articles in Shakespeare Newsletter suggesting that it might be possible to determine, through lexical analysis, what parts Shakespeare played, and when. The idea is that Shakespeare would have remembered words from parts he’d conned, and those words would appear more frequently in plays he wrote (shortly) thereafter--they would exert an inordinate "influence" on those subsequent plays. A table of plays between 1599 and 1607 (below) seemed to support his position.

Don has said that this use of lexical analysis is something of a sideshow and demonstration piece (the uncharitable might call it a parlor trick <g>), but it has generated a great deal of interest from both adherents and detractors. Unfortunately nobody has been able to check or verify Don’s findings. He promised over the years to publish the database (which he dubbed "Shaxicon") on which his proposals were based, but it has never been published.

This led Gabriel Egan to propose re-creating the data using perl scripting, starting with the Oxford electronic texts. He dubbed the enterprise "Shaxican," and posted the work on his site (www.totus.org), which has since moved here.

On this page you’ll find my continuation of Gabriel’s work, taking Shaxican to the point that results can be compared to Don’s analysis.

Words versus Strings

I should begin by pointing out that the base text we’re working from does not identify "words" per se, as Don’s Shaxicon database apparently does. "Row" (n.) and "row" (v.) are not the same word, but they are the same string. Since we’re comparing strings of characters here, not words, the results aren’t really comparable to Don’s work, nor can they serve to support or disqualify that work. When I say "word" in the rest of this writeup, I’m actually referring to strings. The tools presented here could, with some revision, be adapted for use with a "lemmatized" text.

The primary goal and result here is the analysis/reporting architecture itself, along with the tools to build it.

The specific goal of the architecture was to duplicate Don’s analysis of playing parts as presented in the Shakespeare Newsletter articles. That goal restricts the wider usability of the architecture to some extent, but the granularity of the data and the database analysis/reporting tools are still fairly flexible for those who want to compare rare words in various plays, or in various parts compared to various plays.

The key to the analysis was building a database of each part/play/rareword correlation--occurrences where a rare word appears in both a play and a part from a different play. viz:

Word        Play  Part                  Play Count  Part Count
a-bleeding  ROM   MV_LANCELOT           1           1
a-bleeding  MV    ROM_PRINCE            1           1
a-doing     R3    COR_BRUTUS            1           1
a-doing     COR   R3_SCRIVENER          1           1
a-hungry    WIV   TN_SIR ANDREW         1           1
a-hungry    TN    WIV_SLENDER           1           1
a-making    MAC   HAM_POLONIUS          1           1
a-making    HAM   MAC_LADY MACBETH      1           1
a-nights    JC    2H4_MISTRESS QUICKLY  2           1
a-nights    TIM   2H4_MISTRESS QUICKLY  1           1
a-nights    2H4   JC_CAESAR             1           2
a-nights    TIM   JC_CAESAR             1           2
a-nights    2H4   TIM_APEMANTUS         1           1
a-nights    JC    TIM_APEMANTUS         2           1
a-pieces    TNK   AIT_PORTER            1           1
a-pieces    AIT   TNK_PALAMON           1           1
a-tilt      CYL   1H6_JOAN              1           1
a-tilt      1H6   CYL_QUEEN MARGARET    1           1
a-weary     ROM   1H4_KING HENRY        1           1
a-weary     1H4   ROM_NURSE             1           1

For instance, "a-nights" appears once in the 2H4_MISTRESS QUICKLY part, and twice in Julius Caesar (both occurrences in Caesar’s part).

By sorting, summarizing, and analyzing those "match" occurrences we can see how many rare words are shared between different plays, and between specific parts and other plays.
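As a rough illustration of that kind of summarizing, here is a minimal sketch in Python (not the Perl or FileMaker actually used here). It takes a few of the match records from the table above and counts distinct shared words per part/play pair and per pair of plays. The function names are hypothetical, and it assumes part names take the form PLAY_ROLE, as in the table.

```python
from collections import defaultdict

# A few match records copied from the table above:
# (word, play, part, play_count, part_count)
MATCHES = [
    ("a-nights", "JC", "2H4_MISTRESS QUICKLY", 2, 1),
    ("a-nights", "TIM", "2H4_MISTRESS QUICKLY", 1, 1),
    ("a-nights", "2H4", "JC_CAESAR", 1, 2),
    ("a-nights", "TIM", "JC_CAESAR", 1, 2),
    ("a-weary", "ROM", "1H4_KING HENRY", 1, 1),
    ("a-weary", "1H4", "ROM_NURSE", 1, 1),
]


def shared_words_by_part_and_play(matches):
    """Collect the distinct shared words for each (part, play) pair."""
    summary = defaultdict(set)
    for word, play, part, _play_count, _part_count in matches:
        summary[(part, play)].add(word)
    return dict(summary)


def shared_word_counts_by_play_pair(matches):
    """Count distinct shared words between a part's own play and another play."""
    words = defaultdict(set)
    for word, play, part, _play_count, _part_count in matches:
        # Assumption: the text before the first underscore names the part's play.
        source_play = part.split("_", 1)[0]
        words[(source_play, play)].add(word)
    return {pair: len(found) for pair, found in words.items()}


if __name__ == "__main__":
    for pair, count in sorted(shared_word_counts_by_play_pair(MATCHES).items()):
        print(pair, count)
```

Run against the full correlations file instead of this handful of records, the same grouping would give the play-to-play and part-to-play totals.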

ShakesPerl

I started by building the database records in perl, expanding on Gabriel’s work. (All files and scripts mentioned are linked from the frame on the left.)

First we need clean text files of each play, and each part. The script MakePartAndPlayFiles.pl reads All.txt (the whole Oxford Shakespeare in one file, as improved upon and provided by Gabriel) and creates those files, stripping out stage directions, speaker designations, and line-numbering apparatus in the process. This results in 38 play files (both Lear Quarto and Folio are included in the Oxford text, but here I use only the Quarto version) and (you were probably wondering this) 1,754 part files.

Then we need a list of rare words in All.txt. MakeRareWordList.pl does that. You can specify the minimum and maximum number of occurrences that together define a "rare" word. It’s currently set to a maximum of 12 (Foster’s apparent breakpoint judging from Funeral Elegy, though the SNL articles are ambiguous, suggesting 10) and a minimum of 2 (if there’s only one occurrence in the corpus, the word can’t very well appear in both a part and a different play). I’ve also excluded words with fewer than three letters. The results with those settings (11,051 words) are here in rarewords.txt (116k). This file also includes, for each word, the total number of occurrences in the corpus.
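The rare-word filter can be sketched in a few lines. This is a Python illustration under stated assumptions, not a transcription of MakeRareWordList.pl: the tokenizing rule (lowercased runs of letters, optionally joined by hyphens or apostrophes) and all names in it are hypothetical. Only the three thresholds come from the settings described above.

```python
import re
from collections import Counter

MIN_OCCURRENCES = 2   # a word seen once cannot be in both a part and another play
MAX_OCCURRENCES = 12  # upper bound currently used for "rare"
MIN_LETTERS = 3       # words with fewer letters than this are excluded

# Assumed tokenizer: letters, optionally joined by hyphens or apostrophes.
TOKEN = re.compile(r"[a-z]+(?:[-'][a-z]+)*")


def rare_words(corpus_text, lo=MIN_OCCURRENCES, hi=MAX_OCCURRENCES,
               min_letters=MIN_LETTERS):
    """Return {word: corpus count} for every string that counts as rare."""
    counts = Counter(TOKEN.findall(corpus_text.lower()))
    rare = {}
    for word, n in counts.items():
        letters = sum(ch.isalpha() for ch in word)
        if lo <= n <= hi and letters >= min_letters:
            rare[word] = n
    return rare


if __name__ == "__main__":
    sample = "a-nights a-nights of of of once " + "the " * 13
    print(rare_words(sample))
```

In the sample, "of" is dropped for length, "once" for appearing only once, and "the" for appearing more than 12 times.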

CorrelationBuilder.pl reads the part, play, and rareword files and creates correlations.txt (4.6 MB!), containing the database records as described above. Thanks and no few kudos go here to my officemate Glenn Fleishman, who wrote this central and quite sophisticated script (and taught me Perl in the process). I’ve merely fine-tuned, packaged, and debugged a bit.
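To make the record-building step concrete, here is a minimal Python sketch of the idea. It is not Glenn’s script. It assumes per-play and per-part word counts are already in memory as dictionaries and that part names take the form PLAY_ROLE; the function and parameter names are hypothetical. It emits one record for each rare word that occurs in a part and also in a different play.

```python
def build_correlations(play_counts, part_counts, rare):
    """Return (word, play, part, play_count, part_count) records.

    play_counts: {play: {word: count}}
    part_counts: {part_name: {word: count}}, part_name like "JC_CAESAR"
    rare: iterable of rare words
    """
    records = []
    for word in sorted(rare):
        for part, words_in_part in part_counts.items():
            part_count = words_in_part.get(word, 0)
            if part_count == 0:
                continue
            # Assumption: the text before the first underscore is the part's play.
            source_play = part.split("_", 1)[0]
            for play, words_in_play in play_counts.items():
                play_count = words_in_play.get(word, 0)
                # Skip the part's own play and plays that lack the word.
                if play == source_play or play_count == 0:
                    continue
                records.append((word, play, part, play_count, part_count))
    return records


if __name__ == "__main__":
    plays = {"JC": {"a-nights": 2}, "TIM": {"a-nights": 1}, "2H4": {"a-nights": 1}}
    parts = {
        "2H4_MISTRESS QUICKLY": {"a-nights": 1},
        "JC_CAESAR": {"a-nights": 2},
        "TIM_APEMANTUS": {"a-nights": 1},
    }
    for record in build_correlations(plays, parts, ["a-nights"]):
        print(record)
```

Fed the "a-nights" counts from the sample table, this produces the same six records shown there, though not necessarily in the same order.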

Glenn suggests that this whole portion of Shaxican should be dubbed "ShakesPerl."

Finally, to calculate the percentage each part constitutes of its play, we need to count the words in each part and play. PartAndPlayWordCounter.pl does that, generating two files of counts--one for the plays (playwordcounts.txt), one for the parts (partwordcounts.txt).
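The percentage itself is simple arithmetic. A small Python sketch, with hypothetical function names and made-up word counts chosen only to keep the numbers round:

```python
def part_share_percent(part_word_count, play_word_count):
    """Percentage of a play's words spoken by one part."""
    if play_word_count <= 0:
        raise ValueError("play_word_count must be positive")
    return 100.0 * part_word_count / play_word_count


def combined_share_percent(part_word_counts, play_word_count):
    """Share for several parts taken together (one actor doubling roles)."""
    return part_share_percent(sum(part_word_counts), play_word_count)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(part_share_percent(250, 5000))              # one part
    print(combined_share_percent([1500, 500], 25000)) # two doubled parts
```

The combined version matters because the analysis below treats pairs of parts (for instance a Chorus plus another role) as one actor’s share of the play.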

ShakeMaker

I did the next part of the analysis in FileMaker, so the moniker is unavoidable. I won’t document that whole database here, though interested parties are welcome to look at the field definitions in the Adobe Acrobat file ShakeMakerFields.pdf, and contact me if you have any questions or would like a copy of the database.

Some results from the analysis are summarized in the following table, re-creating the table in Shakespeare Newsletter, and comparing the Shaxican results to Shaxicon’s. I’ve highlighted fields where the parts’ "influence" is (arbitrarily) at least 50% higher than those parts’ share of the source play.

Shakespeare’s supposed roles in Henry V - Hamlet:
Percentages of cross-indexed vocabulary, 1599-1607

Parts                   Analysis  % of Play  H5    AYL   TN    Ham   Tro   AWW   MM    Oth   Mac   Cym   Cor   Ant
H5: Chor & Mont.        Shaxicon  8.1        -     16.8  15.5  17.9  20.2  11.4  14.1  15.4  18.0  17.9  20.5  33.3
                        Shaxican  7.4        -     6.9   9.4   12.7  11.9  7.2   11.8  11.3  13.3  8.7   12.8  16.7
AYL: Adam & Corin.      Shaxicon  5.0        3.6   -     4.3   8.9   11.8  11.1  10.4  11.9  3.8   12.9  11.1  13.0
                        Shaxican  5.0        6.8   -     1.3   8.5   7.2   5.3   7.2   7.0   3.8   7.1   6.1   7.4
TN: Valen. & Anton.     Shaxicon  4.6        5.1   6.4   -     5.9   12.2  8.9   18.2  3.6   8.6   6.7   6.3   22.2
                        Shaxican  4.6        4.6   5.0   -     6.3   5.8   4.1   4.0   1.0   7.4   2.9   5.5   7.7
Ham: Ghost & 1 Player   Shaxicon  3.5        4.8   5.3   4.5   -     10.4  3.4   3.2   5.2   11.2  10.0  8.0   11.0
                        Shaxican  3.7        5.3   4.2   4.7   -     5.4   3.3   4.2   1.7   6.4   8.7   7.2   7.5
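The highlighting rule described before the table (a part’s "influence" on a play is at least 50% higher than that part’s share of its own play) can be written as a one-line test. The Python sketch below applies it to the Shaxican figures for the H5 Chorus and Montjoy row; the function name is hypothetical, and the 1.5 factor is the arbitrary threshold mentioned above.

```python
def is_highlighted(influence_pct, share_pct, factor=1.5):
    """True when influence is at least `factor` times the part's share."""
    return influence_pct >= factor * share_pct


# Shaxican figures for H5: Chor & Mont., copied from the table above.
H5_SHARE = 7.4
H5_INFLUENCE = {
    "AYL": 6.9, "TN": 9.4, "Ham": 12.7, "Tro": 11.9, "AWW": 7.2, "MM": 11.8,
    "Oth": 11.3, "Mac": 13.3, "Cym": 8.7, "Cor": 12.8, "Ant": 16.7,
}


def highlighted_plays(influence_by_play, share_pct):
    """List the plays whose cell would be highlighted, in table order."""
    return [play for play, pct in influence_by_play.items()
            if is_highlighted(pct, share_pct)]


if __name__ == "__main__":
    print(highlighted_plays(H5_INFLUENCE, H5_SHARE))
```

With a share of 7.4%, the cutoff is about 11.1%, so seven of the eleven target plays in that row qualify.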

ShakeMaker Reports

You can see the more detailed FileMaker reporting which provided the table data in two Acrobat files: one with all the shared words listed (FosterAnalysisWithWordDetail.pdf), and one without word detail (FosterAnalysisNoWordDetail.pdf).

Two other reports look at the same parts and their possible influences on every play in the corpus, using the method that Gabriel proposed: comparing relative word frequency in target plays to the frequency of those words in the whole corpus. (The theory is that words that Shakespeare "remembered" would appear more frequently in plays he wrote thereafter.) The report with full word detail is in the document RelFreqAnalWithWordDetail.pdf (43 pages). The summary report without word detail is in the document RelFreqAnalNoWordDetail.pdf (9 pages). This analysis method needs further scrutiny.
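One plausible way to express "relative word frequency in a target play compared to the whole corpus" is as a ratio of two rates. The Python sketch below is only that: an assumed formulation for illustration, not necessarily the calculation behind the reports, with hypothetical names and made-up counts.

```python
def relative_frequency_ratio(play_hits, play_words, corpus_hits, corpus_words):
    """Ratio of a word's rate in one play to its rate in the whole corpus.

    A value above 1 means the word is more frequent in the play than in
    the corpus at large. Assumed formulation, for illustration only.
    """
    if play_words <= 0 or corpus_words <= 0 or corpus_hits <= 0:
        raise ValueError("counts must be positive")
    # (play_hits / play_words) / (corpus_hits / corpus_words), rearranged
    # so that integer inputs divide only once.
    return (play_hits * corpus_words) / (corpus_hits * play_words)


if __name__ == "__main__":
    # Made-up counts: a word used 2 times in a 20,000-word play and
    # 10 times in an 800,000-word corpus.
    print(relative_frequency_ratio(2, 20000, 10, 800000))
```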

These are just two examples of the types of reports that can be generated given some moderately high-level FileMaker skills (which I am happy to impart to interested parties).

This application definitely pushes FileMaker’s limits. While generating these reports on a small subset of the data takes very little time, when you start sorting 165,000 "match" records, you need to plan on lunch or a good night’s sleep. One goal might be to move this architecture to a speedier platform. MySQL is a likely candidate because it’s free, feature-rich, widely used at universities, and awesomely fast. It’s not nearly as easy to learn as FileMaker, though.

Miscellaneous Issues

One issue is the Oxford editors’ uncertainty about who spoke certain speeches. Those speakers are bracketed in the text, so you end up with additional part files containing those questionably attributed speeches (e.g. WIV-[SHALLOW].txt and ADO_[FIRST] WATCHMAN.txt). This requires more care in searching for parts, to check whether bracketed versions of the part exist. These brackets could be removed and the parts combined with the non-bracketed parts, in essence accepting the Oxford editors’ suppositions.
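Combining the bracketed and non-bracketed parts could look something like the following Python sketch. It assumes the part texts are held in a dictionary keyed by file name, and the function names are hypothetical. It simply concatenates the texts, so the original order of speeches within the play is not preserved; for counting words that should not matter.

```python
def canonical_part_name(filename):
    """Drop the editors' square brackets from a part file name."""
    return filename.replace("[", "").replace("]", "")


def merge_bracketed_parts(part_texts):
    """Merge each bracketed part into its non-bracketed counterpart.

    part_texts: {file name: text of the part}
    """
    merged = {}
    for name in sorted(part_texts):
        key = canonical_part_name(name)
        combined = merged.get(key, "") + " " + part_texts[name]
        merged[key] = combined.strip()
    return merged


if __name__ == "__main__":
    parts = {
        "ADO_[FIRST] WATCHMAN.txt": "doubtful speech",
        "ADO_FIRST WATCHMAN.txt": "certain speech",
    }
    print(merge_bracketed_parts(parts))
```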

The perl scripts here, when run on the Oxford texts, remove all foreign-language speeches. (Foreign-language speeches are enclosed in {} in the text, which is the same delimiter used to enclose stage directions.) While this is arguably fine for rare-word analysis, it also alters the part-as-a-percentage-of-play calculations some, especially in H5.
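The side effect is easy to see in a sketch. The Python below (an illustration, not the Perl in the scripts) removes every {}-enclosed span with one regular expression. It assumes braces are never nested, and the sample line with its French phrase is invented. Because stage directions and foreign-language speeches share the delimiter, both disappear together.

```python
import re

# Anything between a "{" and the next "}", including line breaks.
# Assumption: braces are never nested in the text.
BRACED = re.compile(r"\{[^{}]*\}")


def strip_braced(text):
    """Remove all {...} spans and collapse the leftover whitespace."""
    return " ".join(BRACED.sub(" ", text).split())


def braced_spans(text):
    """List the {...} spans, e.g. to review what would be removed."""
    return BRACED.findall(text)


if __name__ == "__main__":
    # Invented sample line, not quoted from the Oxford text.
    line = "Once more {Exeunt} unto the {Je pense que} breach"
    print(braced_spans(line))
    print(strip_braced(line))
```

Keeping the foreign-language speeches would need some way of telling the two kinds of span apart, which a single shared delimiter does not provide.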

ALL of this work requires vetting by others for errors of both design and implementation. The tools are quite complex, and there are many opportunities for missteps.

Please feel free to contact me with any questions, comments, ideas, or suggestions.

Thanks,

Steve Roth