Experiments in the Q/F King Lear differences.

Now that we have TEI-XML encoded texts of Q1 (1608) and
Folio (1623) King Lear, we can undertake some experiments.
I propose that the first step is for you to put tags
into the XML encodings to mark off:

* In Q1, the passages that are unique to Q1

* In F, the passages that are unique to F

The crib for finding and marking up the passages is
the document "LR-Q1-F-mapping.htm". Below I provide
some notes on adding the XML tags needed to mark
up the texts in the above way. 

Once we have the XML documents so marked up, it should
be easy to use XPath to pull out:

* The Q1-only passages (hereafter Q1-unique)

* The F-only passages (hereafter F-unique)

* Q1 minus the Q1-only passages (hereafter Q1-common-with-F)

* F minus the F-only passages (hereafter F-common-with-Q1)
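
As a rough illustration of the four extractions listed above,
here is a minimal Python sketch using the standard library's
xml.etree.ElementTree, which supports only a limited subset of
XPath. The sample string and the function names unique_text
and common_text are hypothetical, made up for this note. The
sketch assumes the encoding has no XML namespace and that the
<add> elements carry the resp values described later in this
document.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a tagged Q1 speech; the real input would be the full TEI file.
SAMPLE = """<sp who="D"><speaker>Cord.</speaker>
<l>To be exposd against the warring winds,</l>
<l><add resp="Q1-only">To stand against the deepe dread bolted thunder,</add></l>
<l><add resp="Q1-only">With this thin helme</add> mine iniurious dogge,</l>
</sp>"""


def unique_text(root, resp):
    """Return the text of every <add> carrying the given resp value."""
    # ElementTree accepts this simple XPath form (descendant plus attribute test).
    runs = ("".join(add.itertext()) for add in root.iterfind(f".//add[@resp='{resp}']"))
    return [" ".join(run.split()) for run in runs]


def common_text(elem, resp):
    """Return the text under elem with every matching <add> run left out."""
    parts = [elem.text or ""]
    for child in elem:
        if not (child.tag == "add" and child.get("resp") == resp):
            parts.append(common_text(child, resp))
        # Tail text follows the child's closing tag, so it is never inside the <add>.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
q1_unique = unique_text(root, "Q1-only")
q1_common = " ".join(common_text(root, "Q1-only").split())
print(q1_unique)
print(q1_common)
```

Note that common_text as written keeps speech prefixes such as
"Cord." in its output. For word-based measures we would
probably want to drop <speaker> content first.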

I was thinking we'd apply the various quantification methods,
including such things as Shannon Entropy, Jensen-Shannon
Divergence, Edit Distance, and Dynamic Time Warping (DTW), to
do a series of comparisons. Some tests will be more sensible than
others in different cases. For example, DTW on texts that are not
in different cases. For example, DTW on texts that are not
supposed to have any runs of words in common (such as
Q1-unique and F-unique) makes no sense, but DTW does
make sense in comparing Q1-common-with-F with F-common-with-Q1
since these are meant to be essentially the same writing.
Conversely, Shannon Entropy and Jensen-Shannon Divergence
make sense for comparing Q1-unique with F-unique. I leave
it to you to consider what tests make best sense with what
texts, but the kinds of questions I was hoping we could start
to answer are:

Q1) Some basic stats. How long are the Q1-unique and F-unique
passages? How do their contributions break down by scene
or act-scene (as in can we graph where in the play--beginning,
middle, end--they fall)? 

Q2) What differences are there between Q1-common-with-F and
F-common-with-Q1? Editorial judgement says that these bodies
of writing are essentially the same lines with small verbal
variants between them but no whole lines unique to either.
Do our tests substantiate that?

Q3) What differences if any are there between Q1-unique and
Q1-common-with-F? I'm thinking not only of differences in
choices of words within speeches, but also in kinds of speech
(prose versus verse) and who speaks. Can we say that certain
characters are over/under-represented in the Q1-unique material?
We are looking to see if Q1-unique seems to be 'of a piece'
with Q1-common-with-F or seems different in some way.

Q4) What differences if any are there between F-unique and
F-common-with-Q1? I'm thinking of the same kind of questions
as in Q3 above.

Q5) What differences are there between Q1-unique and F-unique?
In the one-text theory there was only ever one play and
Q1 and F are both imperfect records of it, each deviating
from the correct script only by errors of transmission. If
that is true, then Q1-unique and F-unique should be alike
despite being printed 15 years apart, since both are parts
of the same lost original play. That original is the superset
formed by Q1-unique + F-unique + the lines common to both
editions, namely Q1-common-with-F and F-common-with-Q1, which
in this view are essentially the same thing.

Q6) Any further question(s) suggested to us by our Advisory
Board, which I am going to poll now.

Q7) Any further question(s) and permutations of the above
that occur to us.
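
For the distribution-based measures mentioned earlier, a minimal
sketch in Python (standard library only) might look like the
following. It treats each text as a bag of lower-cased,
whitespace-separated tokens, which is an assumption made for
brevity: punctuation stays attached to words and spelling
variants are not normalised. The function names are hypothetical,
and both functions expect non-empty texts.

```python
import math
from collections import Counter


def word_counts(text):
    """Count lower-cased whitespace-separated tokens (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def shannon_entropy(counts):
    """Entropy in bits of the word distribution described by counts."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def jensen_shannon_divergence(counts_a, counts_b):
    """JSD in bits: 0 for identical distributions, 1 for disjoint vocabularies."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    divergence = 0.0
    for word in set(counts_a) | set(counts_b):
        p = counts_a[word] / total_a  # Counter returns 0 for unseen words
        q = counts_b[word] / total_b
        m = (p + q) / 2  # the mixture distribution
        if p:
            divergence += 0.5 * p * math.log2(p / m)
        if q:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence


# Two short stand-ins for Q1-unique and F-unique.
q1_sample = word_counts("to stand against the deepe dread bolted thunder")
f_sample = word_counts("place sinnes with gold and the strong lance of iustice")
print(shannon_entropy(q1_sample))
print(jensen_shannon_divergence(q1_sample, f_sample))
```

Because the divergence is computed with base-2 logarithms it is
bounded between 0 and 1, which makes values for different pairs
of texts directly comparable. Texts of very different lengths
will still need care in interpretation.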

How to Add the XML Tagging for these Experiments

In order to conform to the TEIlite encoding standard,
we will use the <add> (meaning "addition") element for this,
because that element is already defined in the TEIlite
DTD. For shorthand here, we will call a piece of text
so marked an "addition" but that is just a convenience
and we do not in fact know if something was added to
something else to make the edition we are marking up or if
something was taken away from something else to make the
edition we are marking up. We are calling the Q1-unique
and F-unique materials "additions" solely because it is
convenient to use the TEI element <add> to capture them.

This <add> element is not allowed to contain any sub-elements
other than empty milestone elements such as <lb/> line breaks.
So, we need to encode each run of text that is unique
to one edition at the lowest possible level in the document
tree, amongst the parsed character data. The @resp attribute
that we apply to the <add> element will point to one of two
IDs that we add to the start of the document, after the list
of IDs used to encode character names, to indicate what
kind of material it is:

...
<name id="AM">Knight</name>
<name id="Q1-only">Material present in Q1 and not in F</name>
<name id="F-only">Material present in F and not in Q1</name>
...

When marking up a single line that is Q1-only or F-only,
we need to mark up the speech prefix and the spoken words
as separate additions, each with its own <add> element
inside its parent element:

IN KING LEAR FOLIO:
...
<sp who="D"><speaker>Cord.</speaker><l>Nothing my Lord.</l></sp>
<sp who="A"><speaker><add resp="F-only">Lear.</add></speaker><l><add resp="F-only">Nothing?</add></l></sp>
<sp who="D"><speaker><add resp="F-only">Cord.</add></speaker><l><add resp="F-only">Nothing.</add></l></sp>
<sp who="A"><speaker>Lear.</speaker><l>Nothing will come of nothing, speake againe.</l></sp>
...
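
Tagging of this kind should make the "who speaks" part of Q3
straightforward to tally. The following Python sketch is one
possible approach and not a settled method. The wrapping <div>
and the function name are hypothetical, and it assumes that
<speaker> and <stage> children of <sp> hold no spoken words.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The Folio example above, wrapped in a hypothetical <div> so it parses as one tree.
SAMPLE = """<div>
<sp who="D"><speaker>Cord.</speaker><l>Nothing my Lord.</l></sp>
<sp who="A"><speaker><add resp="F-only">Lear.</add></speaker><l><add resp="F-only">Nothing?</add></l></sp>
<sp who="D"><speaker><add resp="F-only">Cord.</add></speaker><l><add resp="F-only">Nothing.</add></l></sp>
<sp who="A"><speaker>Lear.</speaker><l>Nothing will come of nothing, speake againe.</l></sp>
</div>"""


def spoken_words_by_speaker(root, resp):
    """Tally spoken words per sp/@who: (words inside matching <add>, all words)."""
    unique, total = Counter(), Counter()
    for sp in root.iter("sp"):
        who = sp.get("who")
        for child in sp:
            if child.tag in ("speaker", "stage"):
                continue  # speech prefixes and directions are not spoken words
            total[who] += len("".join(child.itertext()).split())
            for add in child.iterfind(f".//add[@resp='{resp}']"):
                unique[who] += len("".join(add.itertext()).split())
    return unique, total


unique, total = spoken_words_by_speaker(ET.fromstring(SAMPLE), "F-only")
for who in sorted(total):
    print(who, unique[who], total[who])
```

Dividing the first figure by the second for each speaker gives
the share of that character's words that is unique to the
edition, which can then be compared across characters.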

When the addition is a run of verse lines, the content of each line
is treated as a separate addition, and the addition may end
(as it does here) before the end of the verse line:

IN KING LEAR QUARTO:
<sp who="D"><speaker>Cord.</speaker>
<l>Had you not bene their father these white flakes,</l>
<l>Had challengd pitie of them, was this a face</l>
<l>To be exposd against the warring winds,</l>
<l><add resp="Q1-only">To stand against the deepe dread bolted thunder,</add></l>
<l><add resp="Q1-only">In the most terrible and nimble stroke</add></l>
<l><add resp="Q1-only">Of quick crosse lightning to watch poore Per du,</add></l>
<l><add resp="Q1-only">With this thin helme</add> mine iniurious dogge,</l>
<l>Though he had bit me, should haue stood that night</l>

When the addition is a run of prose lines, the fact that all
that separates each prose line is an empty milestone <lb/> element
(which is allowed within the <add> element) means that we can wrap
the whole addition in a single pair of tags:

IN KING LEAR FOLIO:
<sp who="A"><speaker>Lear.</speaker><p>And the Creature . . <lb/>
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . <lb/>
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . <lb/>
. . . . . . . . . . . . . . . . . . . . . . . do appeare: Robes,<lb/>
and Furr'd gownes hide all. <add resp="F-only">Place sinnes with Gold, and<lb/>
the strong Lance of Iustice, hurtlesse breakes: Arme it in<lb/>
ragges, a Pigmies straw do's pierce it. None do's offend,<lb/>
none, I say none, Ile able 'em; take that of me my Friend,<lb/>
who haue the power to seale th'accusers lips.</add> Get thee<lb/>
glasse-eyes, and like a scuruy Politician, seeme to see the<lb/>
. . . . . . . . . . . . . . . . .. . . . . . . . . . . . . <lb/>
Bootes: harder, harder, so.</p></sp>
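
The constraints on <add> described above can be checked
mechanically before running any experiments. What follows is a
minimal Python sketch and not a substitute for validating against
the DTD. The function name is hypothetical, and the set of
permitted child elements assumes that <lb/> is the only milestone
we use inside additions.

```python
import xml.etree.ElementTree as ET

ALLOWED_RESP = {"Q1-only", "F-only"}
ALLOWED_CHILDREN = {"lb"}  # assumption: <lb/> is the only milestone in use


def check_additions(root):
    """Return a list of human-readable problems found in the <add> elements."""
    problems = []
    for add in root.iter("add"):
        snippet = " ".join("".join(add.itertext()).split())[:30]
        resp = add.get("resp")
        if resp not in ALLOWED_RESP:
            problems.append(f"unexpected resp {resp!r} near {snippet!r}")
        for descendant in add.iter():
            # add.iter() yields the <add> itself first, so skip it.
            if descendant is not add and descendant.tag not in ALLOWED_CHILDREN:
                problems.append(f"<{descendant.tag}> inside <add> near {snippet!r}")
    return problems


# A prose addition spanning a line break: this should produce no problems.
good = ET.fromstring(
    '<p>hide all. <add resp="F-only">Place sinnes with Gold, and<lb/>'
    "the strong Lance</add> Get thee</p>"
)
# A verse line wrongly wrapped inside the addition: this should be reported.
bad = ET.fromstring('<sp><add resp="F-only"><l>Nothing?</l></add></sp>')
print(check_additions(good))
print(check_additions(bad))
```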

...


* I think the labour of adding tags like this gives one a better
appreciation of just how the XML is working, and this experience
prevents one jumping to false assumptions about exactly what is
being pulled out of the XML by XPath and other tools.

...  it is important that...  the resulting
files must validate against the "teilite.dtd".