all.txt
charnames.txt
rarewords.txt
tagged.txt
parts.zip
SHAXICAN Perl scripts and Shakespeare's language

by Gabriel Egan

This page is intended to offer, free to all, Perl scripts [1] that do the sorts of things Donald Foster's SHAXICON database is designed to do. In 1995 SHAXICON was used to support Foster's attribution of the 'Funeral Elegy' poem to Shakespeare, the academic community being asked at the time to wait a short while until the database itself was ready for publication. In the 5 years since these claims were made, SHAXICON's non-appearance has prevented substantiation of the claims made for it, and the resources provided here are intended to assist those interested in the development of public-domain tools which do the same analyses. This is work-in-progress and others interested in this area are invited to copy and improve on materials provided here and to share their results. I'll post here anything which moves the academic community towards a set of public-domain tools for this kind of work.

The area of exploration is Shakespeare's 'rare' words: those he used, say, 12 times or fewer in his entire extant output. To start with, one needs an etext of the complete works. On the left is one ('all.txt') based on the Oxford Shakespeare etext (prepared by Lou Burnard, 1989) to which I hope I've added enough value--by retagging and play-title/line-numbering every line--that Oxford University Press don't claim copyright infringement. If they do, I will of course comply with a 'cease and desist' notice and substitute an inferior text.

Before getting started with the following explanations, you might want to see the practical work ('Steve Roth's refinements' on the left) that has been done by building on the limited foundations sketched here.

Part One: Building the raw files

The first thing one needs is a list of all the characters in the Shakespeare canon, with unique names for the many Antonios and Claudios. One solution is to prepend (in the sense of 'add before') the name of the play to the name of the character, so that ADO-ANTONIO is distinct from MV-ANTONIO and TGV-ANTONIO. The following script will do that:

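while (<>) {
    # The first field of each line is the play's short name (e.g. ADO, MV, TGV)
    ($PlayName, $RestOfLine) = split(" ", $_, 2);
    # A character name is one or more space-separated runs of two or more
    # characters from A-Z | [ ] ^ @, with a non-word character on either side
    if ($RestOfLine =~ m/\W([A-Z|\[\]^\@]{2,})(( [A-Z|\[\]^\@]{2,})*)\W/) {
        $PlayAndName = $PlayName."-".$1.$2;   # e.g. ADO-ANTONIO
        $CharacterNames{$PlayAndName}++;      # hash keys keep each name once
    }
}
# Print the unique names in sorted order
foreach $CharacterName (sort keys(%CharacterNames)) {
    print $CharacterName, "\n";
}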

This script I call findchars.pl and when the above complete works etext is run through it, the resulting file of character names ('charnames.txt' on the left) is produced. The next step is to produce a list of 'rare' words: those Shakespeare used infrequently. The following script lists those words Shakespeare used no more than 12 times in his entire extant dramatic output:

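$/ = "";                               # paragraph mode: read one blank-line-separated block at a time
while (<>) {
    s/-\n//g;                          # rejoin words hyphenated across a line break
    tr/A-Z/a-z/;                       # fold everything to lowercase
    @words = split(/\W*\s+\W*/, $_);   # split on whitespace, shedding adjacent punctuation
    foreach $word (@words) {
        $wordcount{$word}++;
    }
}
foreach $word (sort keys(%wordcount)) {
    if ($wordcount{$word} < 13) {      # i.e. 12 occurrences or fewer
        printf "%20s %d\n", $word, $wordcount{$word};
    }
}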

This script I call rare.pl and when the above complete works etext is run through it, a file of rare words ('rarewords.txt' on the left) is produced. The next step will be to work out which character spoke which rare words, and then we can see if an already-existing character's vocabulary "floods into" Shakespeare's writing with a given new play [2]. This is the claim of Foster's SHAXICON: learning a part for performance brought that part's rare words to Shakespeare's mind and these words are over-represented in the next play he wrote. I've picked '12 or fewer' occurrences as the cut-off for distinguishing what is a rare word because this was Foster's cut-off, but comments are invited on whether '6 or fewer' might be a better borderline.

The observant will have noticed problems in the above. The findchars.pl script produces a few 'ghosts' such as "ABC" in TGV. This is because the underlying etext renders character names using all uppercase letters, which is the distinguishing feature the script looks for. Thus in TGV when Speed likens Valentine to "a schoolboy that had lost his ABC" (2.1.21) a spurious hit occurs because ABC looks like a character's name. Also, the script rare.pl doesn't discard numbers so there's much numerical guff at the top of the rare words file before one gets to the alphabetical section, and the script doesn't deal properly with possessive apostrophes and hyphens.
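One possible way round the numbers, apostrophes, and hyphens is sketched below. It is not part of rare.pl: the subroutine names tokenize and rare_words, the cut-off parameter, and the definition of a word as a run of letters with optional internal apostrophes or hyphens are all assumptions made for illustration. Note that it still counts a possessive such as "king's" separately from "king"; whether to fold the two together is a decision left open here.

```perl
use strict;
use warnings;

# Split text into lowercase word tokens.
# Assumption: a word is a run of letters that may contain internal
# apostrophes or hyphens; digits and other punctuation are discarded.
sub tokenize {
    my ($text) = @_;
    my @words = lc($text) =~ /[a-z]+(?:['-][a-z]+)*/g;
    return @words;
}

# Count every token and keep only those used $cutoff times or fewer.
# Returns a reference to a hash of word => count.
sub rare_words {
    my ($text, $cutoff) = @_;
    my %count;
    $count{$_}++ for tokenize($text);
    my %rare;
    for my $word (keys %count) {
        $rare{$word} = $count{$word} if $count{$word} <= $cutoff;
    }
    return \%rare;
}

# Tiny made-up sentence, printed in the same layout that rare.pl uses.
my $rare = rare_words("The king's 12 men o'er-leap the wall; the KING laughs.", 1);
printf "%20s %d\n", $_, $rare->{$_} for sort keys %$rare;
```

Passing the cut-off as a parameter makes it easy to compare a '12 or fewer' list with a '6 or fewer' one.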

Part Two: Making a 'part' for each character

The 6 steps to reproducing what SHAXICON does, as I understand it, are these:

1) You make a list of all the characters in all the plays (using unique identifiers for the multiple Claudios, Antonios, etc).

2) You make a list of all the rare words in all the plays (rare in the sense that they are words Shakespeare rarely uses).

3) You count how many times each of the characters in (1) uses each of the rare words in (2).

4) You take a sample text (one of the plays) and make a list of all its words and how frequently they appear.

5) You compare the list in (4) with each of the lists in (3) to see if there's a character whose rare words turn up much more often in the sample play than they do in the Shakespeare canon as a whole.

6) If (5) yields a good match, that character was played by Shakespeare shortly before he wrote that play. (Hence those rare words were over-represented in the sample play: they were in Shakespeare's head from his having recently memorized them for his part.)
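To make the comparison step concrete, here is a rough sketch of one way a character's rare words could be scored against a sample play. How SHAXICON itself measures over-representation is not known to me, so the simple observed-over-expected ratio, the subroutine name overuse_ratio, and all the parameter names are assumptions, and the numbers in the example are invented.

```perl
use strict;
use warnings;

# Score how strongly one character's rare words are over-represented in a
# sample play. The ratio used here is an illustrative assumption.
#   $part_words  : hashref, rare word => times the character speaks it
#   $sample      : hashref, word => times it occurs in the sample play
#   $canon       : hashref, word => times it occurs in the whole canon
#   $sample_size : total number of words in the sample play
#   $canon_size  : total number of words in the canon
sub overuse_ratio {
    my ($part_words, $sample, $canon, $sample_size, $canon_size) = @_;
    my ($observed, $canon_total) = (0, 0);
    for my $word (keys %$part_words) {
        $observed    += $sample->{$word} // 0;
        $canon_total += $canon->{$word}  // 0;
    }
    # Occurrences expected if the sample used these words at the canon-wide rate
    my $expected = $canon_total * $sample_size / $canon_size;
    return $expected > 0 ? $observed / $expected : 0;
}

# Invented counts for two placeholder words, purely to show the call.
my %part   = (rareone => 2, raretwo => 1);
my %canon  = (rareone => 6, raretwo => 2);
my %sample = (rareone => 4, raretwo => 2);
printf "%.2f\n", overuse_ratio(\%part, \%sample, \%canon, 250, 1000);
```

A ratio well above 1 would suggest over-representation, but a serious test would also need some measure of how likely such a ratio is to arise by chance.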

We have achieved, albeit in a rough-and-ready form, steps (1) and (2). Step (3) is difficult, and as an intermediate step we could do with an actor's "part" for each character in the canon. That is to say, we need a separate document containing just the words for a given character in a given play. Once we have ALL the words a character says, we can throw away the ones which are not rare (by using the list from step (2)) and will be left with just the rare words spoken by a particular character.
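The throwing-away of non-rare words might look something like the following sketch. It assumes the rare-word list is in the two-column layout that rare.pl prints, and it reuses rare.pl's own split so that the tokens match; the subroutine names load_rare and rare_in_part, and the count given for 'schoolboy', are made up for illustration.

```perl
use strict;
use warnings;

# Turn lines in rare.pl's "word count" layout into a lookup hash.
sub load_rare {
    my (@lines) = @_;
    my %rare;
    for my $line (@lines) {
        my ($word, $count) = split ' ', $line;
        $rare{$word} = $count if defined $count;   # skip blank or malformed lines
    }
    return \%rare;
}

# Tally only those words of a part that appear in the rare-word lookup.
# The text is lowercased and split exactly as rare.pl does it.
sub rare_in_part {
    my ($part_text, $rare) = @_;
    my %hits;
    for my $word (split /\W*\s+\W*/, lc($part_text)) {
        $hits{$word}++ if exists $rare->{$word};
    }
    return \%hits;
}

# One invented rare-list entry and the TGV line quoted earlier.
my $rare_list = load_rare(sprintf("%20s %d\n", "schoolboy", 3));
my $hits = rare_in_part("a schoolboy that had lost his ABC\n", $rare_list);
print "$_ $hits->{$_}\n" for sort keys %$hits;
```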

The original etext we started with, 'all.txt', has a feature which makes it not immediately suitable for generating "parts": some lines are shared between two or more characters. Take this example:

KING JOHN Death. HUBERT My lord. KING JOHN A grave. HUBERT He shall not live. KING JOHN Enough.
(King John 3.3.66)

For our purposes, it would be better if each speech began on a new line. The following script breaker.pl does this, and also wraps a tag (<speaker>...</speaker>) around each speaker:

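while (<>) {
    # The first field of each line is the play's short name (e.g. JN)
    ($PlayName, $RestOfLine) = split(" ", $_, 2);
    # Put every speaker's name on its own line, prefixed with the play name and wrapped in tags
    s/\W([A-Z|\[\]^\@]{2,})(( [A-Z|\[\]^\@]{2,})*)\W/\n<speaker>$PlayName-$1$2<\/speaker>\n/g;
    print;
}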

This is essentially the same search pattern as used in findchars.pl (which looks for strings of uppercase characters, which may be separated by spaces but not by anything else), but instead of just finding the names, the script tags them for easy identification later. When all.txt, the original etext, is run through breaker.pl, the result is a file with speeches tagged by speaker ('tagged.txt' on the left). To illustrate the tagging, here is how the above line from King John looks after processing by breaker.pl:

JN 3.3. 66B
<speaker>JN-KING JOHN</speaker>
Death.
<speaker>JN-HUBERT</speaker>
My lord.
<speaker>JN-KING JOHN</speaker>
A grave.
<speaker>JN-HUBERT</speaker>
+
JN 3.3. 66B He shall not live.
<speaker>JN-KING JOHN</speaker>
Enough.

There are still things here which we don't want in the final "parts" for these two characters: the code representing the play's name (JN), the act, scene, and line numbers, and the line-number suffix 'B' with its associated run-over character '+'. These are all features of the underlying etext which we will remove in the next stage.

To turn the file tagged.txt into a collection of characters' "parts" all we need to do is spot those single lines which have the <speaker>...</speaker> tagging and when we find one, start pouring the output into a new file whose name is based on the character name we find between the tags. If this is a new character (one who hasn't yet spoken in the play), the act of OPENing a filehandle for that character will create the output file, but if the file already exists (and hence contains what the character has said so far), the act of OPENing the same filehandle again will cause the new words to be appended to the existing file. This happens because the filehandle is opened in the 'append' mode by putting ">>" before the filename. Here is the script makeparts.pl which takes the output of breaker.pl, ie takes the file tagged.txt, and creates a character part for each character in the canon:

while (<>) {
    $WholeLine = $_;
    if ($WholeLine =~ m/<speaker>(.*)<\/speaker>/) {   # if there's a new speaker
        # open new (or reopen old) handle in append mode; the parts/ directory must already exist
        open(CURRENTSPEAKER, ">>parts/$1.cue") or die "Cannot open parts/$1.cue: $!";
        $FirstLineOfNewSpeaker = <>;    # pull first line (which hasn't got
                                        # playname, act.sc.line number prefix)
        print CURRENTSPEAKER $FirstLineOfNewSpeaker;
                                        # and send line to the current part
    }
    else {                              # or if it's just a continuing speech....
        ($PlayName, $ActScene, $Line, $Speech) = split(/\s+/, $WholeLine, 4);
                                        # then separate off the playname
                                        # and act.sc.line from speech
        print CURRENTSPEAKER $Speech;   # send this line to the current speaker's part
    }
}                                       # go back for next line

In tagged.txt the first line for a new speaker hasn't got the playname and act.scene.line numbers which we want to strip out, so the code for "if there's a new speaker" just pulls the line after the speaker's name and sends it out to that speaker's file. A continuing line of speech, however, does have the playname and act.scene.line numbers which need to be removed, so in the code for "if it's just a continuing speech" a 'split' on whitespace (represented by \s+ in Perl) takes away PlayName, ActScene, and Line, and just the Speech itself is sent out to the character's file. The script makeparts.pl creates a new file for each character in the canon. Rather than provide a link for each one (there are 1046 "parts"), I've bundled them into a zip-file, parts.zip on the left.
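The effect of the four-way 'split' can be seen in isolation in this small sketch, which applies it to the continuing line from the King John example. The subroutine name strip_prefix is an assumption for illustration and does not appear in makeparts.pl.

```perl
use strict;
use warnings;

# Return just the speech from a continuing line of tagged.txt.
sub strip_prefix {
    my ($line) = @_;
    # A LIMIT of 4 stops the split after three fields, so the fourth
    # field is the rest of the line with its internal spaces intact.
    my (undef, undef, undef, $speech) = split /\s+/, $line, 4;
    return defined $speech ? $speech : '';
}

print strip_prefix("JN 3.3. 66B He shall not live.\n");   # prints "He shall not live."
print strip_prefix("+\n");                                 # a lone run-over marker prints nothing
```

Because a line consisting only of the run-over character '+' has no fourth field, it contributes nothing to the part.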

Another outstanding problem: The stage directions should be deleted from the original etext, else they'll end up being attributed as someone's speech.

Notes

[1] Perl is a programming language common on Unix systems and also available for the Windows and Macintosh operating systems. Perl is good with textual strings, hence it's suitable for this application.

[2] The poetical output should be considered too, shouldn't it? That goes on the list of 'next time' improvements, together with a consideration of the effects of editorial modernization of spellings found in the early printed texts.