SHAXICON meets SHAXICAN
By Steve Roth
In 1991, Donald Foster published a three-part series of articles in Shakespeare
Newsletter suggesting that it might be possible to determine, through
lexical analysis, what parts Shakespeare played, and when. The idea is that
Shakespeare would have remembered words from parts he’d conned, and those
words would appear more frequently in plays he wrote (shortly) thereafter--they
would exert an inordinate "influence" on those subsequent plays. A
table of plays between 1599 and 1607 (below) seemed to support his position.
Don has said that this use of lexical analysis is something of a
sideshow and demonstration piece (the uncharitable might call it a parlor trick
<g>), but it has generated a great deal of interest from both adherents
and detractors. Unfortunately nobody has been able to check or verify Don’s
findings. He promised over the years to publish the database (which he dubbed
"Shaxicon") on which his proposals were based, but it has never been
published.
This led Gabriel Egan to propose re-creating the data using Perl
scripting, starting with the Oxford electronic texts. He dubbed the enterprise
"Shaxican," and posted the work on his site (www.totus.org); it has since
moved here.
On this page you’ll find my continuation of Gabriel’s work, taking
Shaxican to the point that results can be compared to Don’s analysis.
Words versus Strings
I should begin by pointing out that the base text we’re working from
does not identify "words" per se, as Don’s Shaxicon database
apparently does. "Row" (n.) and "row" (v.) are not the same
word. Since we’re comparing strings of characters here, not words, the results
aren’t really comparable to Don’s work, nor can they serve to support or
disqualify that work. When I say "word" in the rest of this writeup, I’m
actually referring to strings. The tools presented here could, with some
revision, be adapted for use with a "lemmatized" text.
The primary goal and result here is the analysis and reporting architecture
itself, together with the tools to build it.
The specific goal of the architecture was to duplicate Don’s analysis
of playing parts as presented in the Shakespeare Newsletter articles.
That goal restricts the wider usability of the architecture to some extent, but
the granularity of the data and the database analysis/reporting tools are still
fairly flexible for those who want to compare rare words in various plays, or in
various parts compared to various plays.
The key to the analysis was building a database of each
part/play/rareword correlation--occurrences where a rare word appears in both a
play and a part from a different play. viz:
| Word | Play | Part | Occurrences in this play | Occurrences in this part |
| a-bleeding | ROM | MV_LANCELOT | 1 | 1 |
| a-bleeding | MV | ROM_PRINCE | 1 | 1 |
| a-doing | R3 | COR_BRUTUS | 1 | 1 |
| a-doing | COR | R3_SCRIVENER | 1 | 1 |
| a-hungry | WIV | TN_SIR ANDREW | 1 | 1 |
| a-hungry | TN | WIV_SLENDER | 1 | 1 |
| a-making | MAC | HAM_POLONIUS | 1 | 1 |
| a-making | HAM | MAC_LADY MACBETH | 1 | 1 |
| a-nights | JC | 2H4_MISTRESS QUICKLY | 2 | 1 |
| a-nights | TIM | 2H4_MISTRESS QUICKLY | 1 | 1 |
| a-nights | 2H4 | JC_CAESAR | 1 | 2 |
| a-nights | TIM | JC_CAESAR | 1 | 2 |
| a-nights | 2H4 | TIM_APEMANTUS | 1 | 1 |
| a-nights | JC | TIM_APEMANTUS | 2 | 1 |
| a-pieces | TNK | AIT_PORTER | 1 | 1 |
| a-pieces | AIT | TNK_PALAMON | 1 | 1 |
| a-tilt | CYL | 1H6_JOAN | 1 | 1 |
| a-tilt | 1H6 | CYL_QUEEN MARGARET | 1 | 1 |
| a-weary | ROM | 1H4_KING HENRY | 1 | 1 |
| a-weary | 1H4 | ROM_NURSE | 1 | 1 |
For instance, "a-nights" appears once in the 2H4_MISTRESS QUICKLY part, and
twice in Julius Caesar (both occurrences in Caesar’s part).
By sorting, summarizing, and analyzing those "match"
occurrences we can see how many rare words are shared between different plays,
and between specific parts and other plays.
ShakesPerl
I started by building the database records in Perl, expanding on
Gabriel’s work. (All files and scripts mentioned are linked from the frame on
the left.)
First we need clean text files of each play, and each part. The script MakePartAndPlayFiles.pl
reads All.txt (the whole Oxford
Shakespeare in one file, as improved upon and provided by Gabriel) and creates
those files, stripping out stage directions, speaker designations, and
line-numbering apparatus in the process. This results in 38 play files (both
Lear Quarto and Folio are included in the Oxford text, but here I use only the
Quarto version) and (you were probably wondering this) 1,754 part files.
Then we need a list of rare words in All.txt.
MakeRareWordList does that. You can
specify the minimum and maximum number of occurrences that together define a
"rare" word. It’s currently set to a maximum of 12 (Foster’s
apparent breakpoint judging from Funeral Elegy, though the SNL articles
are ambiguous, suggesting 10) and a minimum of 2 (if there’s only one
occurrence in the corpus, the word can’t very well appear in both a part and
a different play). I’ve also excluded words with fewer than three letters. The
results with those settings (11,051 words) are here in rarewords.txt
(116 KB). This file also includes, for each word, the total number of occurrences
in the corpus.
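The selection rule just described can be sketched in a few lines. This is an illustrative Python sketch, not the MakeRareWordList script itself: the function name `rare_words`, the tokenizing pattern, the choice to measure length in characters (hyphens included), and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def rare_words(text, min_count=2, max_count=12, min_letters=3):
    """Return {string: corpus count} for strings that qualify as "rare"."""
    # Assumed tokenization: lowercase runs of letters, allowing internal
    # hyphens or apostrophes so that "a-nights" stays a single string.
    tokens = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    counts = Counter(tokens)
    return {
        word: n
        for word, n in counts.items()
        if min_count <= n <= max_count and len(word) >= min_letters
    }

# Invented sample text; max_count is lowered so the cutoff is visible.
sample = "He walks a-nights. She wakes a-nights. The the the the to to"
print(rare_words(sample, max_count=3))
```

In the sample, "the" is dropped for occurring too often, "to" for being too short, and the single-occurrence strings for falling below the minimum.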
CorrelationBuilder.pl reads
the part, play, and rareword files and creates correlations.txt
(4.6 MB!), containing the database records as described above. Thanks and no few
kudos go here to my officemate Glenn Fleishman, who wrote this central and quite
sophisticated script (and taught me Perl in the process). I’ve merely
fine-tuned, packaged, and debugged a bit.
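As a rough illustration of what such a correlation pass involves, here is a minimal Python sketch. It is not CorrelationBuilder.pl: the function name, the in-memory dictionaries, and the assumption that a part's source play is whatever precedes the first underscore in its name are all made up for the example. The counts are the "a-nights" figures from the sample records shown earlier.

```python
def build_correlations(rare, play_counts, part_counts):
    """Return (word, play, part, n_in_play, n_in_part) match records.

    rare:        set of rare strings
    play_counts: {play: {word: count}}
    part_counts: {part: {word: count}}, with parts named "PLAY_ROLE"
    """
    records = []
    for part, part_words in sorted(part_counts.items()):
        # Assumption: the source play is the prefix before the first "_".
        source_play = part.split("_", 1)[0]
        for play, play_words in sorted(play_counts.items()):
            if play == source_play:
                continue  # a part is only matched against *other* plays
            shared = rare & set(part_words) & set(play_words)
            for word in sorted(shared):
                records.append(
                    (word, play, part, play_words[word], part_words[word])
                )
    return records

rare = {"a-nights"}
play_counts = {"2H4": {"a-nights": 1}, "JC": {"a-nights": 2}, "TIM": {"a-nights": 1}}
part_counts = {
    "2H4_MISTRESS QUICKLY": {"a-nights": 1},
    "JC_CAESAR": {"a-nights": 2},
    "TIM_APEMANTUS": {"a-nights": 1},
}
records = build_correlations(rare, play_counts, part_counts)
for record in records:
    print(record)
```

Each part is paired with every play except its own, which is why one rare word shared by three plays produces six records.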
Glenn suggests that this whole portion of Shaxican should be dubbed
"ShakesPerl."
Finally, to calculate the percentage each part constitutes of its play,
we need to count the words in each part and play. PartAndPlayWordCounter.pl
does that, generating two files of counts--one for the plays (playwordcounts.txt),
one for the parts (partwordcounts.txt).
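The percentage calculation itself is straightforward. The following Python sketch assumes that a "word" is any whitespace-delimited token and that several parts (say, a pair of roles supposedly played by one actor) can simply be added together; the function names and the toy texts are hypothetical, and PartAndPlayWordCounter.pl may count differently.

```python
def count_words(text):
    """Count whitespace-delimited tokens (an assumed definition of "word")."""
    return len(text.split())

def combined_share(part_texts, play_text):
    """Percentage of the play's words spoken by the given parts together."""
    play_total = count_words(play_text)
    if play_total == 0:
        return 0.0
    part_total = sum(count_words(text) for text in part_texts)
    return 100.0 * part_total / play_total

# Toy data: a ten-word "play" and two parts of two words and one word.
play = "one two three four five six seven eight nine ten"
share = combined_share(["one two", "three"], play)
print(share)
```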
ShakeMaker
I did the next part of the analysis in FileMaker, so the moniker is
unavoidable. I won’t document that whole database here, though interested
parties are welcome to look at the field definitions in the Adobe Acrobat file ShakeMakerFields.pdf,
and contact me if you have any questions or would like a copy of the database.
Some results from the analysis are summarized in the following table,
re-creating the table in Shakespeare Newsletter, and comparing the
Shaxican results to Shaxicon’s. I’ve highlighted fields where the parts’
"influence" is (arbitrarily) at least 50% higher than those parts’
share of the source play.
Shakespeare’s supposed roles in Henry V - Hamlet:
Percentages of cross-indexed vocabulary, 1599-1607
| Parts | Source | % of Play | H5 | AYL | TN | Ham | Tro | AWW | MM | Oth | Mac | Cym | Cor | Ant |
| H5: Chor & Mont. | Shaxicon | 8.1 | - | 16.8 | 15.5 | 17.9 | 20.2 | 11.4 | 14.1 | 15.4 | 18.0 | 17.9 | 20.5 | 33.3 |
| H5: Chor & Mont. | Shaxican | 7.4 | - | 6.9 | 9.4 | 12.7 | 11.9 | 7.2 | 11.8 | 11.3 | 13.3 | 8.7 | 12.8 | 16.7 |
| AYL: Adam & Corin. | Shaxicon | 5.0 | 3.6 | - | 4.3 | 8.9 | 11.8 | 11.1 | 10.4 | 11.9 | 3.8 | 12.9 | 11.1 | 13.0 |
| AYL: Adam & Corin. | Shaxican | 5.0 | 6.8 | - | 1.3 | 8.5 | 7.2 | 5.3 | 7.2 | 7.0 | 3.8 | 7.1 | 6.1 | 7.4 |
| TN: Valen. & Anton. | Shaxicon | 4.6 | 5.1 | 6.4 | - | 5.9 | 12.2 | 8.9 | 18.2 | 3.6 | 8.6 | 6.7 | 6.3 | 22.2 |
| TN: Valen. & Anton. | Shaxican | 4.6 | 4.6 | 5.0 | - | 6.3 | 5.8 | 4.1 | 4.0 | 1.0 | 7.4 | 2.9 | 5.5 | 7.7 |
| Ham: Ghost & 1 Player | Shaxicon | 3.5 | 4.8 | 5.3 | 4.5 | - | 10.4 | 3.4 | 3.2 | 5.2 | 11.2 | 10.0 | 8.0 | 11.0 |
| Ham: Ghost & 1 Player | Shaxican | 3.7 | 5.3 | 4.2 | 4.7 | - | 5.4 | 3.3 | 4.2 | 1.7 | 6.4 | 8.7 | 7.2 | 7.5 |
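The 50% rule used for the highlighting is simple arithmetic: a cell qualifies when the parts' "influence" on a play is at least 1.5 times their share of the source play. The Python sketch below applies that rule to the Shaxican rows of the table. The names `PLAYS`, `SHAXICAN`, and `flagged_plays` are invented for the example, and the sketch does not reproduce the FileMaker calculation, only the threshold test.

```python
PLAYS = ["H5", "AYL", "TN", "Ham", "Tro", "AWW", "MM", "Oth", "Mac", "Cym", "Cor", "Ant"]

# (share of source play, influence on each play); None marks the source play.
SHAXICAN = {
    "H5: Chor & Mont.": (7.4, [None, 6.9, 9.4, 12.7, 11.9, 7.2, 11.8, 11.3, 13.3, 8.7, 12.8, 16.7]),
    "AYL: Adam & Corin.": (5.0, [6.8, None, 1.3, 8.5, 7.2, 5.3, 7.2, 7.0, 3.8, 7.1, 6.1, 7.4]),
    "TN: Valen. & Anton.": (4.6, [4.6, 5.0, None, 6.3, 5.8, 4.1, 4.0, 1.0, 7.4, 2.9, 5.5, 7.7]),
    "Ham: Ghost & 1 Player": (3.7, [5.3, 4.2, 4.7, None, 5.4, 3.3, 4.2, 1.7, 6.4, 8.7, 7.2, 7.5]),
}

def flagged_plays(share, influences, factor=1.5):
    """Plays where influence is at least `factor` times the parts' share."""
    threshold = factor * share
    return [
        play
        for play, value in zip(PLAYS, influences)
        if value is not None and value >= threshold
    ]

for parts, (share, influences) in SHAXICAN.items():
    print(parts, "->", flagged_plays(share, influences))
```

The factor is a parameter because the 50% cutoff is arbitrary; raising or lowering it shows how sensitive the pattern of flagged plays is to that choice.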
ShakeMaker Reports
You can see the more detailed FileMaker reporting which provided the
table data in two Acrobat files: one with all the shared words listed (FosterAnalysisWitWordDetail.pdf,
Two other reports look at the same parts and their possible influences on
every play in the corpus, using the method that Gabriel proposed: comparing
relative word frequency in target plays to the frequency of those words in the
whole corpus. (The theory is that words that Shakespeare "remembered"
would appear more frequently in plays he wrote thereafter.) The report with full
word detail is in the document RelFreqAnalWithWordDetail.pdf
(43 pages). The summary report without word detail is in the document RelFreqAnalNoWordDetail.pdf
(9 pages). This analysis method needs further scrutiny.
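One possible reading of the relative-frequency comparison is a simple ratio: how often a part's rare words occur per word of the target play, divided by how often they occur per word of the whole corpus. The Python sketch below shows only that reading. The function name and every number in the example are invented, and the FileMaker reports may compute the comparison differently.

```python
def relative_frequency_ratio(part_words, play_counts, play_total,
                             corpus_counts, corpus_total):
    """Ratio of the words' frequency in one play to their corpus frequency.

    A value above 1.0 means the part's rare words are over-represented
    in the target play relative to the corpus as a whole.
    """
    in_play = sum(play_counts.get(word, 0) for word in part_words)
    in_corpus = sum(corpus_counts.get(word, 0) for word in part_words)
    if in_corpus == 0 or play_total == 0 or corpus_total == 0:
        return None  # nothing to compare
    return (in_play / play_total) / (in_corpus / corpus_total)

# Invented figures, for illustration only.
ratio = relative_frequency_ratio(
    part_words={"a-nights", "a-weary"},
    play_counts={"a-nights": 1},
    play_total=1000,
    corpus_counts={"a-nights": 4, "a-weary": 2},
    corpus_total=60000,
)
print(ratio)
```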
These are just two examples of the types of reports that can be
generated given some moderately high-level FileMaker skills (which I am happy to
impart to interested parties).
This application definitely pushes FileMaker’s limits. While
generating these reports on a small subset of the data takes very little time,
when you start sorting 165,000 "match" records, you need to plan on
lunch or a good night’s sleep. One goal might be to move this architecture to
a speedier platform. MySQL is a likely candidate because it’s free,
feature-rich, widely used at universities, and awesomely fast. It’s not nearly
as easy to learn as FileMaker, though.
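To give a sense of what the move to SQL might look like, here is a small sketch that loads a few match records into a table and summarizes them by part and play. It uses Python's bundled sqlite3 module purely as a stand-in, since it needs no server; the table name, column names, and index are hypothetical, and the statements would need checking against MySQL before being used there.

```python
import sqlite3

# A handful of match records taken from the sample rows shown earlier.
rows = [
    ("a-nights", "JC", "2H4_MISTRESS QUICKLY", 2, 1),
    ("a-nights", "TIM", "2H4_MISTRESS QUICKLY", 1, 1),
    ("a-nights", "2H4", "JC_CAESAR", 1, 2),
    ("a-nights", "TIM", "JC_CAESAR", 1, 2),
    ("a-weary", "ROM", "1H4_KING HENRY", 1, 1),
    ("a-weary", "1H4", "ROM_NURSE", 1, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE correlations ("
    "word TEXT, play TEXT, part TEXT, n_in_play INTEGER, n_in_part INTEGER)"
)
conn.executemany("INSERT INTO correlations VALUES (?, ?, ?, ?, ?)", rows)
# An index on the grouping columns is what keeps large sorts tolerable.
conn.execute("CREATE INDEX idx_part_play ON correlations (part, play)")

summary = conn.execute(
    "SELECT part, play, COUNT(DISTINCT word) AS shared_words, "
    "SUM(n_in_play) AS occurrences_in_play "
    "FROM correlations GROUP BY part, play ORDER BY part, play"
).fetchall()

for line in summary:
    print(line)
```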
Miscellaneous Issues
One issue is the Oxford editors’ uncertainty about who spoke certain
speeches. Those speakers are bracketed in the text, so you end up with additional
part files containing those questionably attributed speeches (e.g.
WIV-[SHALLOW].txt and ADO_[FIRST] WATCHMAN.txt). This requires more care in
searching for parts, to be sure there are/are not bracketed versions of the
part. These brackets could be removed and the parts combined with the
non-bracketed parts, in essence accepting the Oxford editors’ suppositions.
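Combining the bracketed and non-bracketed parts could be as simple as stripping the brackets from the part name and adding the counts together. The Python sketch below shows that idea; the function names and the word counts are invented for the example, and nothing like this exists in the scripts linked here.

```python
import re

def normalize_part_name(name):
    """Drop editorial brackets: 'ADO_[FIRST] WATCHMAN' -> 'ADO_FIRST WATCHMAN'."""
    return re.sub(r"[\[\]]", "", name)

def merge_bracketed(part_counts):
    """Fold each bracketed part's count into its non-bracketed counterpart."""
    merged = {}
    for part, count in part_counts.items():
        key = normalize_part_name(part)
        merged[key] = merged.get(key, 0) + count
    return merged

# Invented word counts, for illustration only.
counts = {"ADO_FIRST WATCHMAN": 120, "ADO_[FIRST] WATCHMAN": 15, "ADO_DOGBERRY": 900}
merged = merge_bracketed(counts)
print(merged)
```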
The Perl scripts here, when run on the Oxford texts, remove all
foreign-language speeches. (Foreign-language speeches are enclosed in {} in the
text, which is the same delimiter used to enclose stage directions.) While this
is arguably fine for rare-word analysis, it also alters the
part-as-a-percentage-of-play calculations some, especially in H5.
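The side effect is easy to see in miniature. Because stage directions and foreign-language speeches share the same {} delimiter, a pattern that strips one necessarily strips the other. The Python sketch below uses an invented sample line, not a quotation from the Oxford text, and the pattern is an assumption about how the braces are used (in particular, that they are never nested).

```python
import re

# Assumption: braces are never nested in the source text.
BRACED = re.compile(r"\{[^{}]*\}")

def strip_braced(text):
    """Remove every {...} span, whatever it happens to contain."""
    return BRACED.sub(" ", text)

# Invented sample: one stage direction, one foreign-language speech.
sample = "{Enter the Princess} {Je ne sais pas.} I cannot tell what is that."

kept = strip_braced(sample).split()
all_words = sample.replace("{", " ").replace("}", " ").split()
print(len(all_words), "words before,", len(kept), "words after")
```

The four French words vanish along with the three-word stage direction, so a part with many such speeches ends up with a smaller word count, and therefore a smaller share of its play, than it would otherwise have.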
ALL of this work requires vetting by others for errors of both design
and implementation. The tools are quite complex, and there are many
opportunities for missteps.
Please feel free to contact me with any questions, comments, ideas, or
suggestions.
Thanks,
Steve Roth