|
SHAXICON meets SHAXICAN
By Steve Roth
In 1991, Donald Foster published a three-part series of articles in Shakespeare
Newsletter suggesting that it might be possible to determine, through lexical
analysis, what parts Shakespeare played, and when. The idea is that Shakespeare would have
remembered words from parts he'd conned, and those words would appear more frequently
in plays he wrote (shortly) thereafter--they would exert an inordinate
"influence" on those subsequent plays. A table of plays between 1599 and 1607
(below) seemed to support his position.
Don has said that this use of lexical analysis is something of a sideshow and
demonstration piece (the uncharitable might call it a parlor trick <g>), but it has
generated a great deal of interest from both adherents and detractors. Unfortunately
nobody has been able to check or verify Don's findings. He promised over the years to
publish the database (which he dubbed "Shaxicon") on which his proposals were
based, but it has never been published.
This led Gabriel Egan to propose re-creating the data using Perl scripting,
starting with the Oxford electronic texts. He dubbed the enterprise "Shaxican,"
and posted the work on his site (www.totus.org); it has now moved here.
On this page you'll find my continuation of Gabriel's work, taking
Shaxican to the point that results can be compared to Don's analysis.
Words versus Strings
I should begin by pointing out that the base text we're working from does
not identify "words" per se, as Don's Shaxicon database apparently does.
"Row" (n.) and "row" (v.) are not the same word. Since we're
comparing strings of characters here, not words, the results aren't really comparable
to Don's work, nor can they serve to support or disqualify that work. When I say
"word" in the rest of this writeup, I'm actually referring to strings. The
tools presented here could, with some revision, be adapted for use with a
"lemmatized" text.
The primary goal and result here is the analysis/reporting architecture
conceived, and the tools to build that architecture.
The specific goal of the architecture was to duplicate Don's analysis of
playing parts as presented in the Shakespeare Newsletter articles. That goal
restricts the wider usability of the architecture to some extent, but the granularity of
the data and the database analysis/reporting tools are still fairly flexible for those who
want to compare rare words in various plays, or in various parts compared to various
plays.
The key to the analysis was building a database of each part/play/rareword
correlation--occurrences where a rare word appears in both a play and a part from a
different play. viz:
| Word | Play | Part | Play Count | Part Count |
|---|---|---|---|---|
| a-bleeding | ROM | MV_LANCELOT | 1 | 1 |
| a-bleeding | MV | ROM_PRINCE | 1 | 1 |
| a-doing | R3 | COR_BRUTUS | 1 | 1 |
| a-doing | COR | R3_SCRIVENER | 1 | 1 |
| a-hungry | WIV | TN_SIR ANDREW | 1 | 1 |
| a-hungry | TN | WIV_SLENDER | 1 | 1 |
| a-making | MAC | HAM_POLONIUS | 1 | 1 |
| a-making | HAM | MAC_LADY MACBETH | 1 | 1 |
| a-nights | JC | 2H4_MISTRESS QUICKLY | 2 | 1 |
| a-nights | TIM | 2H4_MISTRESS QUICKLY | 1 | 1 |
| a-nights | 2H4 | JC_CAESAR | 1 | 2 |
| a-nights | TIM | JC_CAESAR | 1 | 2 |
| a-nights | 2H4 | TIM_APEMANTUS | 1 | 1 |
| a-nights | JC | TIM_APEMANTUS | 2 | 1 |
| a-pieces | TNK | AIT_PORTER | 1 | 1 |
| a-pieces | AIT | TNK_PALAMON | 1 | 1 |
| a-tilt | CYL | 1H6_JOAN | 1 | 1 |
| a-tilt | 1H6 | CYL_QUEEN MARGARET | 1 | 1 |
| a-weary | ROM | 1H4_KING HENRY | 1 | 1 |
| a-weary | 1H4 | ROM_NURSE | 1 | 1 |
For instance, "a-nights" appears once in the 2H4_MISTRESS QUICKLY part, and twice in
Julius Caesar (both occurrences in Caesar's part).
By sorting, summarizing, and analyzing those "match" occurrences we can
see how many rare words are shared between different plays, and between specific parts and
other plays.
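As a rough illustration of how such match records can be derived (a Python sketch, not the Perl used in this project), the function below pairs every part containing a rare word with every other play containing it. The function name and the dictionary layout are assumptions made for the example; the counts for "a-nights" come from the sample records above.

```python
from collections import Counter

def build_matches(part_counts, play_counts, rare_words):
    """Return (word, play, part, play_count, part_count) match records.

    part_counts: {(source_play, part_name): Counter of words in that part}
    play_counts: {play: Counter of words in that play}
    """
    records = []
    for word in sorted(rare_words):
        for (source_play, part_name), in_part in sorted(part_counts.items()):
            if in_part[word] == 0:
                continue
            for play, in_play in sorted(play_counts.items()):
                # A part is only matched against plays other than its own.
                if play == source_play or in_play[word] == 0:
                    continue
                records.append((word, play, f"{source_play}_{part_name}",
                                in_play[word], in_part[word]))
    return records

# Counts for "a-nights" taken from the sample records above.
part_counts = {
    ("2H4", "MISTRESS QUICKLY"): Counter({"a-nights": 1}),
    ("JC", "CAESAR"): Counter({"a-nights": 2}),
    ("TIM", "APEMANTUS"): Counter({"a-nights": 1}),
}
play_counts = {
    "2H4": Counter({"a-nights": 1}),
    "JC": Counter({"a-nights": 2}),
    "TIM": Counter({"a-nights": 1}),
}
matches = build_matches(part_counts, play_counts, {"a-nights"})
```

With these inputs the sketch yields six records for "a-nights", one for each pairing of a part with a different play.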
ShakesPerl
I started by building the database records in Perl, expanding on Gabriel's
work. (All files and scripts mentioned are linked from the frame on the left.)
First we need clean text files of each play, and each part. The script MakePartAndPlayFiles.pl reads
All.txt
(the whole Oxford Shakespeare in one file, as improved upon and provided by Gabriel) and
creates those files, stripping out stage directions, speaker designations, and
line-numbering apparatus in the process. This results in 38 play files (both Lear Quarto
and Folio are included in the Oxford text, but here I use only the Quarto version) and
(you were probably wondering this) 1,754 part files.
Then we need a list of rare words in All.txt. MakeRareWordList
does that. You can specify the minimum and maximum number of occurrences that together
define a "rare" word. It's currently set to a maximum of 12 (Foster's
apparent breakpoint judging from Funeral Elegy, though the SNL articles are
ambiguous, suggesting 10) and a minimum of 2 (if there's only one occurrence
in the corpus, the word can't very well appear in both a part and a different play).
I've also excluded words with fewer than three letters. The results with those
settings (11,051 words) are here in rarewords.txt (116k). This
file also includes, for each word, the total number of occurrences in the corpus.
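The filtering rule can be sketched in a few lines of Python. This is only an illustration of the settings described above (minimum 2, maximum 12, at least three letters), not a port of MakeRareWordList; the function name and the tokenizing pattern (lowercased letters, with internal hyphens and apostrophes kept) are assumptions.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters, optionally joined by hyphens or apostrophes.
TOKEN = re.compile(r"[a-z]+(?:['-][a-z]+)*")

def rare_words(text, min_count=2, max_count=12, min_letters=3):
    """Map each 'rare' string to its total number of occurrences in text."""
    counts = Counter(TOKEN.findall(text.lower()))
    return {
        word: n
        for word, n in counts.items()
        if min_count <= n <= max_count
        and sum(ch.isalpha() for ch in word) >= min_letters
    }

sample = "a-nights a-nights to to to be be rose " + "the " * 13
rare = rare_words(sample)
```

In the sample, only "a-nights" survives: "to" and "be" are too short, "rose" occurs once, and "the" occurs more than 12 times.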
CorrelationBuilder.pl reads the part, play, and
rareword files and creates correlations.txt (4.6 MB!),
containing the database records as described above. Thanks and no few kudos go here to my
officemate Glenn Fleishman, who
wrote this central and quite sophisticated script (and taught me Perl in the process).
I've merely fine-tuned, packaged, and debugged a bit.
Glenn suggests that this whole portion of Shaxican should be dubbed
"ShakesPerl."
Finally, to calculate the percentage each part constitutes of its play, we need
to count the words in each part and play. PartAndPlayWordCounter.pl
does that, generating two files of counts--one for the plays (playwordcounts.txt),
one for the parts (partwordcounts.txt).
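The percentage calculation itself is simple division. The Python sketch below shows it for a single part and for a pair of parts taken together (as in the combined roles considered later). The function names and the word counts are hypothetical, and the layout of the two count files is not assumed here.

```python
def part_share(part_words, play_words):
    """Percentage of a play's words that belong to one part."""
    return 100.0 * part_words / play_words

def combined_share(part_word_counts, play_words):
    """Percentage of a play's words belonging to several parts taken together."""
    return part_share(sum(part_word_counts), play_words)

# Hypothetical counts: two parts of 50 and 24 words in a 1,000-word play.
share = combined_share([50, 24], 1000)
```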
ShakeMaker
I did the next part of the analysis in FileMaker, so the moniker is unavoidable.
I won't document that whole database here, though interested parties are welcome to
look at the field definitions in the Adobe Acrobat file ShakeMakerFields.pdf, and
contact me if you have any questions or would like
a copy of the database.
Some results from the analysis are summarized in the following table, re-creating
the table in Shakespeare Newsletter, and comparing the Shaxican results to
Shaxicon's. I've highlighted fields where the parts' "influence"
is (arbitrarily) at least 50% higher than those parts' share of the source play.
Shakespeare's supposed roles in Henry V - Hamlet:
Percentages of cross-indexed vocabulary, 1599-1607
| Parts | Database | % of Play | H5 | AYL | TN | Ham | Tro | AWW | MM | Oth | Mac | Cym | Cor | Ant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| H5: Chor & Mont. | Shaxicon | 8.1 | - | 16.8 | 15.5 | 17.9 | 20.2 | 11.4 | 14.1 | 15.4 | 18.0 | 17.9 | 20.5 | 33.3 |
| | Shaxican | 7.4 | - | 6.9 | 9.4 | 12.7 | 11.9 | 7.2 | 11.8 | 11.3 | 13.3 | 8.7 | 12.8 | 16.7 |
| AYL: Adam & Corin. | Shaxicon | 5.0 | 3.6 | - | 4.3 | 8.9 | 11.8 | 11.1 | 10.4 | 11.9 | 3.8 | 12.9 | 11.1 | 13.0 |
| | Shaxican | 5.0 | 6.8 | - | 1.3 | 8.5 | 7.2 | 5.3 | 7.2 | 7.0 | 3.8 | 7.1 | 6.1 | 7.4 |
| TN: Valen. & Anton. | Shaxicon | 4.6 | 5.1 | 6.4 | - | 5.9 | 12.2 | 8.9 | 18.2 | 3.6 | 8.6 | 6.7 | 6.3 | 22.2 |
| | Shaxican | 4.6 | 4.6 | 5.0 | - | 6.3 | 5.8 | 4.1 | 4.0 | 1.0 | 7.4 | 2.9 | 5.5 | 7.7 |
| Ham: Ghost & 1 Player | Shaxicon | 3.5 | 4.8 | 5.3 | 4.5 | - | 10.4 | 3.4 | 3.2 | 5.2 | 11.2 | 10.0 | 8.0 | 11.0 |
| | Shaxican | 3.7 | 5.3 | 4.2 | 4.7 | - | 5.4 | 3.3 | 4.2 | 1.7 | 6.4 | 8.7 | 7.2 | 7.5 |
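The highlighting rule (an "influence" at least 50% higher than the parts' share of their source play) is easy to apply mechanically. The Python sketch below does so for the Shaxican row for H5: Chor & Mont., using the figures from the table; the function name and the 1.5 multiplier as a parameter are choices made for this example.

```python
def is_highlighted(influence_pct, share_pct, factor=1.5):
    """True when influence is at least 50% higher than the share of the source play."""
    return influence_pct >= factor * share_pct

# Shaxican row for "H5: Chor & Mont." (7.4% of the play), from the table above.
share = 7.4
influence = {
    "AYL": 6.9, "TN": 9.4, "Ham": 12.7, "Tro": 11.9, "AWW": 7.2, "MM": 11.8,
    "Oth": 11.3, "Mac": 13.3, "Cym": 8.7, "Cor": 12.8, "Ant": 16.7,
}
highlighted = [play for play, pct in influence.items() if is_highlighted(pct, share)]
```

For this row the cutoff works out to about 11.1%, so seven of the eleven target plays clear it.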
ShakeMaker Reports
You can see the more detailed FileMaker reporting which provided the table data
in two Acrobat files: one with all the shared
words listed (FosterAnalysisWitWordDetail.pdf,
Two other reports look at the same parts and their possible influences on every play in
the corpus, using the method that Gabriel proposed: comparing relative word frequency in
target plays to the frequency of those words in the whole corpus. (The theory is that
words that Shakespeare "remembered" would appear more frequently in plays he
wrote thereafter.) The report with full word detail is in the document RelFreqAnalWithWordDetail.pdf
(43 pages). The summary report without word
detail is in the document RelFreqAnalNoWordDetail.pdf
(9 pages). This analysis method
needs further scrutiny.
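One plausible reading of that comparison, offered only as a sketch, is a ratio of rates: how often a word occurs per word of the target play, divided by how often it occurs per word of the whole corpus. The exact formula behind the reports is not spelled out here, so the Python function below, its name, and the sample numbers are all assumptions.

```python
def relative_frequency_ratio(count_in_play, play_words, count_in_corpus, corpus_words):
    """Ratio of a word's rate in one play to its rate in the whole corpus.

    Values above 1.0 mean the word is over-represented in the play.
    """
    play_rate = count_in_play / play_words
    corpus_rate = count_in_corpus / corpus_words
    return play_rate / corpus_rate

# Hypothetical figures: 2 occurrences in a 20,000-word play,
# 8 occurrences in an 800,000-word corpus.
ratio = relative_frequency_ratio(2, 20_000, 8, 800_000)
```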
These are just two examples of the types of reports that can be generated given
some moderately high-level FileMaker skills (which I am happy to impart to interested
parties).
This application definitely pushes FileMaker's limits. While generating
these reports on a small subset of the data takes very little time, when you start sorting
165,000 "match" records, you need to plan on lunch or a good night's sleep.
One goal might be to move this architecture to a speedier platform. MySQL is a likely candidate because it's free,
feature-rich, widely used at universities, and awesomely fast. It's not nearly as
easy to learn as FileMaker, though.
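To give a feel for what the match records might look like in a SQL database, here is a small Python sketch. It uses SQLite (through Python's standard sqlite3 module) as a stand-in for MySQL, and the table name, column names, and index are assumptions, not an existing schema. The query counts the distinct rare words each part shares with each other play.

```python
import sqlite3

def shared_word_counts(records):
    """Load match records into an in-memory table and summarize them per part and play."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE matches ("
        "word TEXT, play TEXT, part TEXT, play_count INTEGER, part_count INTEGER)"
    )
    # An index on the grouping columns is what keeps large sorts and summaries quick.
    conn.execute("CREATE INDEX idx_part_play ON matches (part, play)")
    conn.executemany("INSERT INTO matches VALUES (?, ?, ?, ?, ?)", records)
    rows = conn.execute(
        "SELECT part, play, COUNT(DISTINCT word) FROM matches "
        "GROUP BY part, play ORDER BY part, play"
    ).fetchall()
    conn.close()
    return rows

# Three of the sample match records shown earlier on this page.
sample = [
    ("a-nights", "JC", "2H4_MISTRESS QUICKLY", 2, 1),
    ("a-nights", "TIM", "2H4_MISTRESS QUICKLY", 1, 1),
    ("a-bleeding", "ROM", "MV_LANCELOT", 1, 1),
]
summary = shared_word_counts(sample)
```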
Miscellaneous Issues
One issue is the Oxford editors' uncertainty about who spoke certain
speeches. Those speakers are bracketed in the text, so you end up with additional part
files containing those questionably attributed speeches (e.g. WIV-[SHALLOW].txt and
ADO_[FIRST] WATCHMAN.txt). This requires more care in searching for parts, to be sure
there are/are not bracketed versions of the part. These brackets could be removed and the
parts combined with the non-bracketed parts, in essence accepting the Oxford editors'
suppositions.
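Combining bracketed and non-bracketed parts amounts to stripping the square brackets from the part name and adding the counts together. The Python sketch below illustrates this; the function names and the word counts are hypothetical.

```python
def normalize_part_name(name):
    """Drop the square brackets that mark an uncertain speaker attribution."""
    return name.replace("[", "").replace("]", "")

def merge_bracketed(part_word_counts):
    """Fold bracketed parts into their non-bracketed counterparts."""
    merged = {}
    for name, count in part_word_counts.items():
        key = normalize_part_name(name)
        merged[key] = merged.get(key, 0) + count
    return merged

# Hypothetical word counts for a bracketed and a non-bracketed version of one part.
counts = {"ADO_[FIRST] WATCHMAN": 40, "ADO_FIRST WATCHMAN": 100, "WIV_SLENDER": 7}
merged = merge_bracketed(counts)
```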
The Perl scripts here, when run on the Oxford texts, remove all foreign-language
speeches. (Foreign-language speeches are enclosed in {} in the text, which is the same
delimiter used to enclose stage directions.) While this is arguably fine for rare-word
analysis, it also alters the part-as-a-percentage-of-play calculations some, especially in
H5.
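The effect is easy to see in miniature. The Python sketch below strips every brace-delimited span before counting words, so a foreign-language speech disappears along with the stage directions. It assumes braces are never nested; the function names and sample lines are invented for illustration and are not taken from the scripts.

```python
import re

# Assumes braces are not nested; a span may run across line breaks.
BRACED = re.compile(r"\{[^{}]*\}")

def strip_braced(text):
    """Remove every {...} span, whether stage direction or foreign-language speech."""
    return BRACED.sub(" ", text)

def word_count(text):
    return len(text.split())

line_with_direction = "O for a muse {Enter Chorus} of fire"
line_with_foreign = "{Je pense que vous} speak well"

stripped_direction = word_count(strip_braced(line_with_direction))
stripped_foreign = word_count(strip_braced(line_with_foreign))
```

In the second line, four of the six words vanish from the count, which is why a part's share of its play can shift when speeches are treated like stage directions.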
ALL of this work requires vetting by others for errors of both design and
implementation. The tools are quite complex, and there are many opportunities for
missteps.
Please feel free to contact me with any questions, comments, ideas, or
suggestions.
Thanks,
Steve Roth
|