Information Metrics

Pairwise Information Metrics

*Relationship of these metrics to each other....

Notes


Experiment 1.i

i) Compare Q-common and F-common. Calculate the Jensen-Shannon Divergence, the DTW cost, and the Edit Distance between them. These are chunks of Q and F that we consider to be essentially the same writing but printed 15 years apart, so let's get a baseline for how alike they appear by these three metrics. For comparison, we could run the same tests on other Shakespeare plays that we are confident are by him alone, that are available to us in both Q and F form, and that we consider essentially the same writing but printed about 15 years apart.
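
As a rough illustration of the three measures named above, here is a minimal sketch that works on lists of word tokens. Several choices in it are assumptions, not settled decisions: the Jensen-Shannon Divergence is taken over word-frequency distributions with base-2 logarithms, the Edit Distance is plain Levenshtein distance over words, and the DTW local cost is 0 for identical tokens and 1 otherwise. The function names are hypothetical.

```python
import math
from collections import Counter


def jensen_shannon_divergence(tokens_a, tokens_b):
    """JSD (base 2, so between 0 and 1) of two word-frequency distributions.

    Assumes both token lists are non-empty.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    jsd = 0.0
    for word in set(counts_a) | set(counts_b):
        p = counts_a[word] / total_a
        q = counts_b[word] / total_b
        m = (p + q) / 2  # mixture distribution
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd


def edit_distance(seq_a, seq_b):
    """Levenshtein distance: insert, delete and substitute each cost 1."""
    previous = list(range(len(seq_b) + 1))
    for i, item_a in enumerate(seq_a, start=1):
        current = [i]
        for j, item_b in enumerate(seq_b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def dtw_cost(seq_a, seq_b):
    """DTW cost with an assumed local cost: 0 if tokens match, 1 otherwise."""
    inf = float("inf")
    previous = [0.0] + [inf] * len(seq_b)
    for item_a in seq_a:
        current = [inf]
        for j, item_b in enumerate(seq_b, start=1):
            local = 0.0 if item_a == item_b else 1.0
            current.append(local + min(previous[j], current[j - 1], previous[j - 1]))
        previous = current
    return previous[-1]
```

All three return 0 for identical inputs and grow as the texts diverge, but they are on different scales, so the baseline for each would need to be read separately.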

1) Corresponding speeches must be at least fairly close ....
Say, within any scene and for any speaker, the speech numbers must be within 6 of each other.
This rules out many candidate pairings but is still slack enough to cover the fuzzy matches.

2) For speeches to match, the edit distance (e.d.) must be low-ish ...
So we remove any speech pairs that are mismatches by this e.d. criterion ....

3) Also, for speeches to match, their m.i. must be high-ish ...
Again remove mismatches ....
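
The first two criteria above could be sketched as follows. This is only an illustration under stated assumptions: the `Speech` record and its fields are hypothetical, the edit distance is computed over words and normalised by the longer speech, and the cut-off of 0.5 is a placeholder, since the notes do not fix a threshold. The m.i. criterion is left out because the notes do not yet say how it is to be computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speech:
    scene: str    # scene label (hypothetical labelling scheme)
    speaker: str  # speaker name, already normalised across Q and F
    number: int   # running speech number within the scene
    text: str


def word_edit_distance(words_a, words_b):
    """Levenshtein distance over word lists."""
    previous = list(range(len(words_b) + 1))
    for i, word_a in enumerate(words_a, start=1):
        current = [i]
        for j, word_b in enumerate(words_b, start=1):
            substitution = previous[j - 1] + (word_a != word_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def normalised_distance(text_a, text_b):
    """Word edit distance divided by the length of the longer speech (0 to 1)."""
    words_a, words_b = text_a.lower().split(), text_b.lower().split()
    longest = max(len(words_a), len(words_b))
    return word_edit_distance(words_a, words_b) / longest if longest else 0.0


def candidate_pairs(q_speeches, f_speeches, max_gap=6):
    """Criterion 1: same scene, same speaker, speech numbers within max_gap."""
    return [
        (q, f)
        for q in q_speeches
        for f in f_speeches
        if q.scene == f.scene
        and q.speaker == f.speaker
        and abs(q.number - f.number) <= max_gap
    ]


def filter_by_edit_distance(pairs, max_distance=0.5):
    """Criterion 2: keep only pairs whose normalised e.d. is low enough.

    The default max_distance is a placeholder, not a value from these notes.
    """
    return [
        (q, f) for q, f in pairs
        if normalised_distance(q.text, f.text) <= max_distance
    ]
```

Applying `candidate_pairs` first keeps the number of edit-distance computations small, since only speeches that are already near each other in the same scene are compared.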

Code

Results

DECISION


Experiment 1.ii and 1.iii

ii) Compare Q-only with Q-common. Count the frequencies of occurrence of the top 100 function words and then apply (a) Principal Component Analysis and (b) Nearest Shrunken Centroid to the counts to see if the texts seem to be the same style in this regard. Our starting hypothesis is that the Q-only matter is material cut from the play after Q was published, so we expect Q-only and Q-common to test the same in style. (This somewhat replicates the tests in the second half of the Kinney-on-King-Lear chapter.)

iii) Compare F-only with F-common. Count the frequencies of occurrence of the top 100 function words and then apply (a) Principal Component Analysis and (b) Nearest Shrunken Centroid to the counts to see if the texts seem to be the same style in this regard. Our starting hypothesis is that the F-only matter was freshly written by Shakespeare about 6 years after he originally wrote the play, so we would expect F-only and F-common to test roughly the same in style, whereas a great difference in style would suggest that someone other than Shakespeare wrote F-only. (This somewhat replicates the tests in the second half of the Kinney-on-King-Lear chapter.)
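
The counting step shared by ii) and iii) might look like the sketch below. It covers only the preparation of the frequency table, not the Principal Component Analysis or Nearest Shrunken Centroid steps themselves. The ten-word list is a stand-in for the real top-100 list, the tokenizer is deliberately crude, and all names are hypothetical.

```python
import re
from collections import Counter

# Illustrative stand-in only; the experiment calls for the top 100 function words.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "not", "but", "with"]


def tokenize(text):
    """Lower-case the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def function_word_profile(text, function_words=FUNCTION_WORDS):
    """Relative frequency of each function word (its count / total tokens)."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total if total else 0.0 for word in function_words]


def profile_matrix(segments, function_words=FUNCTION_WORDS):
    """One row per text segment, one column per function word."""
    return [function_word_profile(segment, function_words) for segment in segments]
```

Each row of `profile_matrix` describes one segment (for example a block of Q-only or Q-common text), and the resulting table is the kind of input that PCA and Nearest Shrunken Centroid implementations expect. Relative frequencies are used here instead of raw counts so that segments of different lengths stay comparable; that choice is an assumption.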

Code

Results


Experiment 2

EXPERIMENT 2) "FIND-YOUR-PARTNER"

Take a play for which we have a quarto and a Folio text. Take the first 2000 words (dialogue, stage directions, and speech prefixes) in the quarto and compare them to the first 2000 words in the Folio, using as a measure of the difference each of DTW, Jensen-Shannon Divergence, and Edit Distance. Record the value of the difference by each of these three metrics. Then move the 'window' in the Folio (but not the quarto) up by one word so that words 1-2000 of the quarto are being compared to words 2-2001 in the Folio. Take the same three measures of difference and record them. Then move the 'window' in the Folio up by one word again so that words 1-2000 of the quarto are being compared to words 3-2002 in the Folio. Repeat this until there are no more words left in the Folio. Review the three sets of scores (one for DTW, one for Jensen-Shannon Divergence, one for Edit Distance) and find for each metric the Folio segment with the lowest difference score. Record this lowest-scoring segment as the best match for words 1-2000 in the quarto.

Then move the window in the quarto up by one word to words 2-2001, reset the Folio window to words 1-2000, and again make the comparison, recording the difference for each of the three metrics. Then move the Folio window up by 1 so that quarto words 2-2001 are being compared with Folio words 2-2001 and again record the differences. Then move the Folio window up by 1 so that quarto words 2-2001 are being compared to Folio words 3-2002 and record the difference. Repeat until there are no more Folio words. Again review the scores to find the partner in F for Q's segment 2-2001. Repeat all of this for Q's segment 3-2002 and so on until there are no quarto words left. The result should be a mapping of each 2000-word segment in Q to the closest 2000-word segment in F, as detected by each of the three metrics, together with the score for that mapping (how alike the segments are). This should show where Q and F are much alike and where they are not.

Things that might be varied in this experiment:

Using segments smaller or larger than 2000 words

Using steps of more than 1 word each time
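
The sliding-window search, with the segment size and step exposed as the two variables listed above, could be sketched like this. It is an illustration under assumptions: positions are zero-based (so start 0 means words 1-2000), the search stops at the last full-length window in each text, and the metric is passed in as a function. Only a Jensen-Shannon Divergence over word frequencies is included here as an example metric; the names are hypothetical.

```python
import math
from collections import Counter


def jsd(tokens_a, tokens_b):
    """Jensen-Shannon Divergence (base 2) of two non-empty token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    result = 0.0
    for word in set(counts_a) | set(counts_b):
        p = counts_a[word] / total_a
        q = counts_b[word] / total_b
        m = (p + q) / 2
        if p > 0:
            result += 0.5 * p * math.log2(p / m)
        if q > 0:
            result += 0.5 * q * math.log2(q / m)
    return result


def find_partners(q_tokens, f_tokens, metric, window=2000, step=1):
    """For each quarto window, find the Folio window with the lowest score.

    Returns a list of (q_start, best_f_start, best_score) with zero-based
    start positions. Ties go to the earliest Folio window.
    """
    results = []
    for q_start in range(0, len(q_tokens) - window + 1, step):
        q_window = q_tokens[q_start:q_start + window]
        best_f_start, best_score = None, float("inf")
        for f_start in range(0, len(f_tokens) - window + 1, step):
            score = metric(q_window, f_tokens[f_start:f_start + window])
            if score < best_score:
                best_f_start, best_score = f_start, score
        results.append((q_start, best_f_start, best_score))
    return results
```

The number of comparisons is the number of quarto windows multiplied by the number of Folio windows, so with a step of 1 it grows roughly with the product of the two text lengths. A larger `step` reduces that cost at the price of a coarser mapping.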

Code

Results

DECISION


Experiment 3

EXPERIMENT 3) "HE-SAID-WHAT?"

Take a play for which we have a quarto and a Folio text. Pull out all the speeches by one character who is common to both texts. (This might necessitate a manual mapping to show that character xxx in the quarto is the same person as character yyy in the Folio.) Putting all the speeches into one long chunk of text, measure the difference between the chunk from Q and the chunk from F by each of Jensen-Shannon Divergence, DTW, and Edit Distance.

Things that might be varied in this experiment:

Rather than putting all of one character's speeches into one long chunk of text, we might subdivide by act and also by scene (requiring some manual mapping of Q acts/scenes to Folio acts/scenes) so that we can see where in the play that character's speeches are most alike in Q and F and where they are most different.
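
Gathering one character's speeches, either as one long chunk or per scene, might be sketched as below. The input format is an assumption: each text is a list of (scene, speaker, text) tuples, and the manual mapping of names is supplied as a set of speech prefixes for each text. A word-level edit distance is included only so the example has something to measure with; the other two metrics would be applied to the same chunks. All names are hypothetical.

```python
from collections import defaultdict


def word_edit_distance(words_a, words_b):
    """Levenshtein distance over word lists."""
    previous = list(range(len(words_b) + 1))
    for i, word_a in enumerate(words_a, start=1):
        current = [i]
        for j, word_b in enumerate(words_b, start=1):
            substitution = previous[j - 1] + (word_a != word_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def whole_play_chunk(speeches, names):
    """All words spoken by the character, in order, as one long list."""
    return [
        word
        for scene, speaker, text in speeches
        if speaker in names
        for word in text.lower().split()
    ]


def chunks_by_scene(speeches, names):
    """Map each scene label to the words the character speaks in that scene."""
    chunks = defaultdict(list)
    for scene, speaker, text in speeches:
        if speaker in names:
            chunks[scene].extend(text.lower().split())
    return dict(chunks)
```

Passing the names as a set lets one character carry several speech prefixes in the same text. The per-scene version assumes the scene labels have already been aligned between Q and F, which is the manual mapping mentioned above.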

Results


EEBO Texts

S. Texts