SHAXICAN
Perl scripts and Shakespeare's language
by Gabriel Egan
This page is intended to offer, free to all, Perl scripts[1]
that do the sorts of things Donald Foster's SHAXICON database is designed
to do. In 1995 SHAXICON was used to support Foster's attribution of the
'Funeral Elegy' poem to Shakespeare, the academic community being asked at
the time to wait a short while until the database itself was ready for
publication. In the 5 years since these claims were made, SHAXICON's
non-appearance has prevented substantiation of the claims made for it, and
the resources provided here are intended to assist those interested in the
development of public domain tools which do the same analyses. This is
work-in-progress and others interested in this area are invited to copy
and improve on materials provided here and to share their results. I'll
post here anything which moves the academic community towards a set of
public-domain tools for this kind of work.
The area of exploration is Shakespeare's 'rare' words: those he used,
say, 12 times or fewer in his entire extant output. To start with, one needs
an etext of the complete works. On the left is one ('all.txt') based on the
Oxford Shakespeare etext (prepared by Lou Burnard,
1989) to which I hope I've added enough value--by retagging and
play-title/line-numbering every line--that Oxford University Press don't
claim copyright infringement. If they do, I will of course comply with a
'cease and desist' notice and substitute an inferior text.
Before getting started with the following explanations, you might want
to see the practical work ('Steve Roth's refinements' on the left) that
has been done by building on the limited foundations sketched here.
Part One: Building the raw files
The first thing one needs is a list of all the characters in the
Shakespeare canon, with unique names for the many Antonios and Claudios.
One solution is to prepend (in the sense of 'add before') the name of the
play before the name of the character, so ADO-ANTONIO is distinct from
MV-ANTONIO, and TGV-ANTONIO. The following script will do that:
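    # findchars.pl: list every speaker in the canon as PLAY-NAME.
    while (<>) {
        # The first field of each line is the play's short name (e.g. TGV).
        ($PlayName, $RestOfLine) = split(" ", $_, 2);
        # A name is a run of two or more characters from the class (capitals,
        # plus the symbols | [ ] ^ @), optionally followed by further such
        # runs, each preceded by a single space (e.g. KING JOHN).
        if ($RestOfLine =~ m/\W([A-Z|-|\[|\]\^|\@]{2,})(( [A-Z|-|\[|\]|\^|\@]{2,})*)\W/) {
            $PlayAndName = $PlayName."-".$1.$2;
            $CharacterNames{$PlayAndName}++;
        }
    }
    # Print each distinct name once, in sorted order.
    foreach $CharacterName (sort keys(%CharacterNames)) {
        print $CharacterName, "\n";
    }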
This script I call findchars.pl
and when the above complete works etext is run through it, a
file of character names ('charnames.txt' on the left) is produced. The
next step is to produce a list of 'rare' words: those Shakespeare
used infrequently. The following script lists those words Shakespeare used
no more than 12 times in his entire extant dramatic output:
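    # rare.pl: list the words used no more than 12 times.
    $/ = "";                # paragraph mode: read one blank-line-separated block at a time
    while (<>) {
        s/-\n//g;           # rejoin words hyphenated across a line break
        tr/A-Z/a-z/;        # fold everything to lowercase
        # Split on whitespace, shedding any punctuation next to it.
        @words = split(/\W*\s+\W*/, $_);
        foreach $word (@words) {
            $wordcount{$word}++;
        }
    }
    # Print each word that occurs fewer than 13 times, with its count.
    foreach $word (sort keys(%wordcount)) {
        if ($wordcount{$word} < 13) {
            printf "%20s %d\n", $word, $wordcount{$word};
        }
    }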
This script I call rare.pl
and when the above complete works etext is run through it, a file of rare
words ('rarewords.txt' on the left) is produced. The next step will be to
work out which character spoke which rare words, and then we can see if an
already-existing character's vocabulary "floods into"
Shakespeare's writing with a given new play[2]. This is the claim
of Foster's SHAXICON: learning a part for performance brought that part's
rare words to Shakespeare's mind and these words are over-represented in
the next play he wrote. I've picked '12 or fewer' occurrences as the
cut-off for distinguishing what is a rare word because this was Foster's
cut-off, but comments are invited on whether '6 or fewer' might be a
better borderline.
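To compare candidate cut-offs, the threshold can be made a parameter rather than being fixed at 13 in the script. The following is only a sketch: the subroutine name rare_words is hypothetical, and the counts are toy values invented for illustration, standing in for the %wordcount hash that rare.pl builds.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of rare.pl): return, in sorted order,
# the words whose count is at or below $cutoff.
sub rare_words {
    my ($counts, $cutoff) = @_;
    my @rare = grep { $counts->{$_} <= $cutoff } sort keys %$counts;
    return @rare;
}

# Toy counts, invented for this example; real values would come from
# running rare.pl's counting loop over all.txt.
my %wordcount = (the => 500, grave => 13, schoolboy => 7, abc => 1);

for my $cutoff (12, 6) {
    my @rare = rare_words(\%wordcount, $cutoff);
    printf "cut-off %2d: %d rare words (%s)\n",
        $cutoff, scalar @rare, join(", ", @rare);
}
```

Running the same counts through both cut-offs shows at a glance how many words each borderline admits.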
The observant will have noticed problems in the above. The findchars.pl
script produces a few 'ghosts' such as "ABC" in TGV. This is
because the underlying etext renders character names using all uppercase
letters, which is the distinguishing feature the script looks for. Thus in
TGV when Speed likens Valentine to "a schoolboy that had lost his
ABC" (2.1.21) a spurious hit occurs because ABC looks like a
character's name. Also, the script rare.pl
doesn't discard numbers so there's much numerical guff at the top of the
rare words file before one gets to the alphabetical section, and the
script doesn't deal properly with possessive apostrophes and hyphens.
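One way the last two problems might be tackled is sketched below. It is not part of rare.pl: the subroutine name tokenize is hypothetical, and the rules it applies (skip anything that is not alphabetic, keep internal hyphens and apostrophes, strip a final 's) are assumptions made for illustration. The sample sentence is invented.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of rare.pl): split text into lowercase
# words, ignoring numbers and reducing possessives to the bare word.
sub tokenize {
    my ($text) = @_;
    my @words;
    # A word is a run of letters that may contain internal apostrophes
    # or hyphens, such as "o'er" or "love-sick". Digits never match.
    while ($text =~ /([A-Za-z]+(?:['-][A-Za-z]+)*)/g) {
        my $word = lc $1;
        $word =~ s/'s$//;    # "king's" is counted as "king"
        push @words, $word;
    }
    return @words;
}

# Invented sample text containing numbers, a possessive and a hyphen.
my @sample = tokenize("The King's 3 love-sick lords o'er 12 walls");
print join(" ", @sample), "\n";
```

A known weakness of this rule is that contractions ending in 's (such as "he's") are reduced in the same way as possessives.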
Page 2: Making a 'part' for each character
The 6 steps to reproducing what SHAXICON does, as I understand it, are
these:
1) You make a list of all the characters in all the plays (using
unique identifiers for the multiple Claudios, Antonios, etc).
2) You make a list of all the rare words in all the plays (rare in the
sense that they are words Shakespeare rarely uses).
3) You count how many times each of the characters in (1) uses each of
the rare words in (2).
4) You take a sample text (one of the plays) and make a list of all its
words and how frequently they appear.
5) You check the list in (4) against each of the lists in (3) to see if
there's a character whose rare words turn up much more often in the
sample play than they do in the Shakespeare canon as a whole.
6) If (5) yields a good match, that character was played by Shakespeare
shortly before he wrote that play. (Hence those rare words were
over-represented in the sample play: they were in Shakespeare's head
from his having recently memorized them for his part.)
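Once the per-character lists from step (3) exist, the comparison in steps (4) and (5) might be sketched as follows. The measure used here, the rate at which a character's rare words occur in the sample play divided by their rate in the canon, is an assumption chosen for illustration and is not claimed to be SHAXICON's own measure. The subroutine name, the character names and all the counts are invented.

```perl
use strict;
use warnings;

# Hypothetical scoring function; the ratio is an assumption, not
# SHAXICON's measure.
#   $rare   : hash ref, rare word => 1, for one character's part (step 3)
#   $sample : hash ref, word => count in the sample play (step 4)
#   $canon  : hash ref, word => count in the whole canon
# Returns a value above 1 when the character's rare words are
# over-represented in the sample play.
sub overrepresentation {
    my ($rare, $sample, $canon) = @_;
    my ($sample_total, $canon_total) = (0, 0);
    $sample_total += $_ for values %$sample;
    $canon_total  += $_ for values %$canon;
    my ($in_sample, $in_canon) = (0, 0);
    for my $word (keys %$rare) {
        $in_sample += $sample->{$word} || 0;
        $in_canon  += $canon->{$word}  || 0;
    }
    return 0 unless $in_canon && $sample_total;
    return ($in_sample / $sample_total) / ($in_canon / $canon_total);
}

# Toy data, invented for this example (not real counts).
my %canon  = (the => 990, welkin => 6, bodkin => 4);
my %sample = (the => 97, welkin => 3);
my %parts  = (
    'PLAY-ALPHA' => { welkin => 1 },
    'PLAY-BETA'  => { bodkin => 1 },
);

for my $character (sort keys %parts) {
    printf "%-12s %.2f\n", $character,
        overrepresentation($parts{$character}, \%sample, \%canon);
}
```

What counts as a "good match" in step (6) is left open here: any threshold on this ratio would need to be justified separately.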
We have achieved, albeit in a rough-and-ready form, steps (1) and (2).
Step (3) is difficult, and as an intermediate step we could do with an
actor's "part" for each character in the canon. That is to say,
we need a separate document containing just the words for a given
character in a given play. Once we have ALL the words a character says, we
can throw away the ones which are not rare (by using the list from step
(2)) and will be left with just the rare words spoken by a particular
character.
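The filtering just described might look like the following sketch. The subroutine name rare_in_part is hypothetical, and the rare-word set is a toy one invented for the example rather than taken from rarewords.txt. The splitting pattern is the one used in rare.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: count the rare words in one character's part.
# $rare is a hash ref whose keys are the rare words from step (2).
sub rare_in_part {
    my ($part_text, $rare) = @_;
    my $text = lc $part_text;
    $text =~ s/^\W+|\W+$//g;    # trim punctuation at either end of the part
    my %found;
    for my $word (split /\W*\s+\W*/, $text) {
        $found{$word}++ if $rare->{$word};
    }
    return \%found;
}

# Toy rare-word set, invented for the example.
my %rare  = (grave => 1, enough => 1);
my $found = rare_in_part("Death. A grave. Enough.", \%rare);
print "$_ $found->{$_}\n" for sort keys %$found;
```

Run once per part, this gives the per-character counts that step (3) asks for.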
The original etext we started with, 'all.txt', has a feature which
makes it not immediately suitable for generating "parts": some
lines are shared between two or more characters. Take this example:
KING JOHN Death. HUBERT My lord. KING JOHN A grave. HUBERT He shall
not live. KING JOHN Enough.
(King John 3.3.66)
For our purposes, it would be better if each speech began on a new
line. The following script breaker.pl
does this, and also wraps a tag (<speaker>...</speaker>)
around each speaker:
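    # breaker.pl: put each speech on its own line, preceded by a speaker tag.
    while (<>) {
        # The first field of each line is the play's short name.
        ($PlayName, $RestOfLine) = split(" ", $_, 2);
        # Replace every speaker name in the line with a tagged,
        # play-prefixed version on a line of its own.
        s/\W([A-Z|-|\[|\]\^|\@]{2,})(( [A-Z|-|\[|\]\^|\@]{2,})*)\W/\n<speaker>$PlayName-$1$2<\/speaker>\n/g;
        print;
    }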
This is essentially the same search pattern as used in findchars.pl
(which looks for strings of uppercase characters, which may be separated
by spaces but not by anything else), but instead of just finding them they
are tagged for easy identification later. When all.txt,
the original etext, is run through breaker.pl,
the result is a file with speeches tagged by speaker ('tagged.txt').
In tagged.txt the first
line for a new speaker hasn't got the playname and act.scene.line numbers
which we want to strip out, so the code for "if there's a new
speaker" just pulls the line after the speaker's name and sends it
out to that speaker's file. But in the code for "if it's just a
continuing speech"--and hence does have the playname and
act.scene.line numbers which need to be removed--a 'split' on whitespace
(represented by \s+ in Perl) takes away
PlayName and ActScene and Line numbers, and then just the Speech itself is
sent out to the character's file. The script makeparts.pl
creates a new file for each character in the canon. Rather than provide a
link for each one (there are 1046 "parts"), I've bundled them into a
zip-file, parts.zip
on the left.
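The two branches described above can be sketched as follows. This is a minimal illustration of the logic, not makeparts.pl itself. It assumes that a continuing line begins with two whitespace-separated reference fields (a play name and an act.scene.line number, as in "KJ 3.3.66"), and it gathers the parts into a hash instead of writing files. The subroutine name collect_parts and the final continuing line of the sample are invented.

```perl
use strict;
use warnings;

# ASSUMPTION: each continuing line starts with this many
# whitespace-separated reference fields before the speech itself.
my $REF_FIELDS = 2;

# Hypothetical helper: gather each speaker's lines from tagged text.
sub collect_parts {
    my (@lines) = @_;
    my (%parts, $speaker);
    my $first_line = 0;    # true for the line straight after a speaker tag
    for my $line (@lines) {
        chomp $line;
        if ($line =~ m{^<speaker>(.+)</speaker>$}) {
            $speaker    = $1;
            $first_line = 1;
        }
        elsif (defined $speaker) {
            my $speech = $line;
            if (!$first_line) {
                # Continuing speech: drop the leading reference fields.
                my @fields = split /\s+/, $line, $REF_FIELDS + 1;
                $speech = defined $fields[$REF_FIELDS] ? $fields[$REF_FIELDS] : '';
            }
            $first_line = 0;
            $parts{$speaker} .= "$speech\n" if length $speech;
        }
    }
    return \%parts;
}

my @tagged = (
    "KJ 3.3.66",
    "<speaker>KJ-KING JOHN</speaker>", "Death.",
    "<speaker>KJ-HUBERT</speaker>",    "My lord.",
    "<speaker>KJ-KING JOHN</speaker>", "A grave.",
    "<speaker>KJ-HUBERT</speaker>",    "He shall not live.",
    "<speaker>KJ-KING JOHN</speaker>", "Enough.",
    "KJ 3.3.67 more words here",       # invented continuing line
);

my $parts = collect_parts(@tagged);
print "== $_ ==\n$parts->{$_}" for sort keys %$parts;
```

Writing each value of the returned hash to a file named after its key would give one "part" per character.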
Another outstanding problem: The stage directions should be deleted
from the original etext, else they'll end up being attributed as someone's
speech.
Notes
[1] Perl is a programming language common on Unix systems and
also available for the Windows and Macintosh operating systems. Perl is
good with textual strings, hence it's suitable for this application.
[2] The poetical output should be considered too, shouldn't it?
That goes on the list of 'next time' improvements, together with a
consideration of the effects of editorial modernization of spellings found
in the early printed texts.