Advisory Board Online Conference, 26 May to 7 June 2017

PI Egan wrote to the project's Advisory Board: << The project now has TEI-XML transcriptions of the 1608 Q1 and 1623 Folio texts of King Lear and is marking them up to show the bits that scholars have traditionally said are common to the two editions and the bits that have traditionally been thought unique to one or other edition. Attached is the crib-sheet that we use for this purpose, and any critiques you can make of it would be very helpful. But what I really want you for is to suggest experiments we should do. The kinds of textual-comparison tests we can do include:

* Measurement of the Shannon Entropy and Jensen-Shannon Divergence of texts
* Word Adjacency Networks using Markov chains to store the proximity values of 100+ function words found within a text
* Nearest Shrunken Centroid applied to frequencies of any words we care to count
* Edit Distance between texts
* Dynamic Time Warping to align two similar-but-not-identical texts

By marking up Q1 and F, we have essentially four blocks of writing to experiment with:

1) The Q1-only passages (hereafter Q1-unique)
2) The F-only passages (hereafter F-unique)
3) Q1 minus the Q1-only passages (hereafter Q1-common-with-F)
4) F minus the F-only passages (hereafter F-common-with-Q1)

To get us started, we thought we'd try to answer the following questions:

Q1) Some basic stats. How long are the Q1-unique and F-unique passages? How do their contributions break down by scene or act-scene (as in, can we graph where in the play--beginning, middle, end--they fall)?

Q2) What differences are there between Q1-common-with-F and F-common-with-Q1? Editorial judgement says that these bodies of writing are essentially the same lines with small verbal variants between them but no whole lines unique to either. Do our tests substantiate that?

Q3) What differences, if any, are there between Q1-unique and Q1-common-with-F? I'm thinking not only of differences in choices of words within speeches, but also in kinds of speech (prose versus verse) and who speaks. Can we say that certain characters are over/under-represented in the Q1-unique material? We are looking to see if Q1-unique seems to be 'of a piece' with Q1-common-with-F (and indeed with F-common-with-Q1, if that is not essentially the same as Q1-common-with-F) or seems different in some way.

Q4) What differences, if any, are there between F-unique and F-common-with-Q1? I'm thinking of the same kind of questions as in Q3 above.

Q5) What differences are there between Q1-unique and F-unique? In the one-text theory there was only ever one play, and Q1 and F are both imperfect records of it, each deviating from the authoritative script only by errors of transmission. If that is true, then Q1-unique and F-unique should be alike despite being printed 15 years apart, since both are parts of the same lost original play that comprises the superset of Q1-unique + F-unique + the lines common to both texts. Are we seeing that kind of wholeness across the various parts?

Here's where you come in: please could you let me know what you think of the above questions--be brutally honest--and either suggest improvements to them and/or suggest wholly new questions that we haven't thought of. You should also critique our approach at a more fundamental level if you think we are going about things the wrong way. We are keen to hear and take your advice. >>
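For concreteness, the entropy and divergence measures listed in the message above could be computed along the following lines. This is only a minimal sketch: it assumes plain-text versions of the four blocks of writing, and the file names ('q1_unique.txt', 'f_unique.txt') are placeholders rather than files the project has produced.

import math
import re
from collections import Counter

def tokenize(text):
    # Lower-case and split into word tokens (keeping internal apostrophes).
    return re.findall(r"[a-z']+", text.lower())

def distribution(tokens, vocab):
    # Relative frequency of each vocabulary word in this block of writing.
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]

def shannon_entropy(probs):
    # Shannon entropy in bits, skipping zero-probability words.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def jensen_shannon(p, q):
    # Jensen-Shannon divergence in bits between two word-frequency distributions.
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

if __name__ == "__main__":
    # Placeholder file names for two of the four blocks of writing.
    a = tokenize(open("q1_unique.txt").read())
    b = tokenize(open("f_unique.txt").read())
    vocab = sorted(set(a) | set(b))
    p, q = distribution(a, vocab), distribution(b, vocab)
    print("Shannon entropy of Q1-unique:", shannon_entropy(p))
    print("Shannon entropy of F-unique:", shannon_entropy(q))
    print("Jensen-Shannon divergence:", jensen_shannon(p, q))

The same pair of functions can be pointed at any two of the four blocks, so the comparisons proposed in Q2 to Q5 reduce to choosing which pair of files to read.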
John Burrows replied: << These are good questions, I think, and well worth pressing in the manner you suggest. Is there also a role for Hugh's version of Zeta? But I wonder whether it is worth asking a more general question: do the several relationships between Q and F versions have common properties or are they all different cases? Can one reasonably speak of bad quartos as a kind? You may well think that Lear alone may be enough for the moment. >>

PI Egan replied to John Burrows: << Because our projects are so similar, Hugh and I are coordinating our labours so as not to duplicate our efforts except where we want one team to try to replicate the findings of the other. For now, Zeta is something Hugh's exploring and my team is not.

> But I wonder whether it is worth asking a more general
> question: do the several relationships between Q and F
> versions have common properties or are they all different
> cases? Can one reasonably speak of bad quartos as a
> kind?

That's actually one of the fundamental research questions for our project and we hope to get to it. Right now we're groping our way forward with more local, single-play questions in order to check that we've got our basic tools working. >>

Doug Duhaime contributed: << I'm wondering what might turn up if you complemented some of these lexical questions with some related syntactic and semantic questions. One could train a Markov model on part-of-speech sequences, dependency tree structures, or verb tenses, for instance, and compare distributions in Q1 and F. In a related way, I thought one could use Brown clustering [1][2] or a word embedding model to obtain vector representations of tokens from each of the four corpora [3], then run analysis on those word vectors to obtain distance measures.

The basic idea with the latter is to build a term cooccurrence matrix in which each linguistic "type" (unique word string) in a target language is given one row and one column. Then update the cell value at each column, row position to the number of times the word in the given column cooccurs with the word in the given row (where cooccurrence means the words appear within a set number of words of each other in any document within a corpus, and a corpus is the largest collection of texts from a given time and place you can lay hands on). This operation produces an n by n matrix, where n is the number of linguistic types in the dataset. If you then reduce the number of columns with a dimension reduction technique such as non-negative matrix factorization, you'll have a matrix that's n by p, where p is perhaps <= 200. Then each row in the matrix represents a given linguistic type as a p-dimensional vector. For each word, one can then retrieve the corresponding vector, and those vectors can be compared for continuous similarity, clustered, or even added and subtracted for higher-order linguistic analysis. I've found these kinds of methods very helpful in capturing more of the latent signal in language than raw tokens provide, as they preserve similarities that are sometimes lost when we treat tokens as discrete phenomena.

I also wanted to raise the question of ground truth. I think it would be highly interesting if there were some known instances of the single-text theory outside of Shakespeare that could be studied, measured, and used to help inform the interpretation of the studies your team will pursue on your corpora. If there were some oral traditions (or similar) for which the ur-text and its multiple printings were available, running analysis on the similarities those descendants shared with the ur-text could potentially be insightful for the case of the Q1 and F texts. Figure 2.2 of Craig & Kinney's Shakespeare, Computers, and the Mystery of Authorship stands out as a phenomenal instance of leveraging ground truth to evaluate model predictions, and I was wondering if there's another sort of ground truth that could be used to evaluate model predictions of the Q1 + F transmission case. Whether Philaster provides an expected case that's analogous to the folio case I can't say, but the general thrust of my thought was to suggest that it would be ideal to have some null hypothesis that could be tested with the analytic approaches to be leveraged against the folio case.

[1] https://aclweb.org/anthology/J/J92/J92-4003.pdf
[2] https://curtis.ml.cmu.edu/w/courses/index.php/Brown_clustering
[3] https://nlp.stanford.edu/projects/glove/
>>
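As a pointer to how the cooccurrence-and-reduction suggestion above might be realized, here is a minimal sketch. The corpus file name, the window width, the use of scikit-learn's non-negative matrix factorization, and the example word pair are all placeholder assumptions, not project decisions; a corpus of any real size would also call for a sparse rather than dense matrix.

import re
import numpy as np
from collections import Counter
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def cooccurrence_matrix(tokens, window=5):
    # Count how often each type appears within `window` words of each other type.
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                matrix[index[w], index[tokens[j]]] += 1
    return vocab, matrix

if __name__ == "__main__":
    # 'corpus.txt' stands in for whatever large contemporary corpus is used.
    tokens = tokenize(open("corpus.txt").read())
    vocab, matrix = cooccurrence_matrix(tokens)
    # Reduce the n-by-n counts to n-by-p word vectors (p <= 200, as above).
    p = min(200, len(vocab) - 1)
    vectors = NMF(n_components=p, init="nndsvd", max_iter=500).fit_transform(matrix)
    # Compare two word vectors for continuous similarity; the example words
    # "king" and "crown" must actually occur in the corpus for this to run.
    index = {w: i for i, w in enumerate(vocab)}
    sim = cosine_similarity(vectors[[index["king"]]], vectors[[index["crown"]]])
    print("cosine similarity king/crown:", sim[0, 0])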
PI Egan replied to Doug Duhaime: << [Regarding 'ground truth'] Perhaps sermons could fulfil this role: we have cases in which they are supposed to have been published from aural capture (by shorthand) during oral delivery and then are published again by their authors in more correct versions. But doesn't the aural/oral vector muddy the waters here? Your first suggestion seems to be that we want texts which were printed in multiple versions in which oral garbling is not suspected, in order to be parallel to Q1/F King Lear, in which oral garbling is generally not suspected. Perhaps Q1 (1620) and Q2 (1622) >>

MacDonald P. Jackson contributed: << I can't guess how suited the kinds of tests you mention are to answering the questions you list. I think that with respect to King Lear the most interesting question is Q4: Does the F-unique material differ in any way from the F-common-with-Q1 material? Of course there may not be enough F-unique material for the tests to provide a definite answer. What we most want evidence about is whether the F-unique material was always part of the play or was added by Shakespeare several years later, as Gary Taylor argued in Division of the Kingdoms, or isn't by Shakespeare at all, as P. W. K. Stone claimed (most implausibly in my opinion) in The Textual History of 'King Lear'. Are the tests capable of giving us an answer? Whether they are or not, running the tests may well reveal something of potential significance that we haven't known before. What I suspect they might be able to give an answer to is the much broader question of whether the Shakespeare quartos once labelled 'bad' (and examined in Alfred Hart's Stolne and Surreptitious Copies) form a distinct category that differentiates them from all their 'good' counterparts (and from other plays of the period with sound texts). >>

PI Egan replied to MacDonald P. Jackson: << Agreed, ["whether the F-unique material was always part of the play or was added by Shakespeare several years later"] that's a key question. In principle, our tests are capable of answering it. One problem, though, is that Q1 and F were printed 15 years apart during a period when, we know, certain aspects of the language were changing rapidly. We don't want to mistake mere 'modernization' of the spelling and inflectional kind (doth>does, hath>has), which could have been applied by a scribe or a compositor, for more significant changes. Unfortunately, in our base transcriptions the words are not lemmatized, so such discriminations are hard to make. Hugh is working on markup to record words under their modern spelling forms and to record the grammatical functions of homographs, but his texts are not ready yet. For now we're hoping to find out what can be done without such sophistications. Deciding whether the 'bad' quartos do indeed form a group is definitely where we hope to be headed. >>

Hugh Craig contributed: << I wonder if your alignment work could be extended to identify cases within the shared portions of *Lear* [= Q1-common-with-F and F-common-with-Q1] where aligned strings (punctuation, 1-gram, n-gram, spelling, contraction, etc.) differ repeatedly between the two versions. This would extend what was done in the Craig and Kinney chapter on *Lear*, where we used a bag-of-words approach and then looked at individual cases by hand. Good data to collect would be instances where Q1 has the A form and F has B, where Q1 has B and F has A, where Q1 and F both have A, and where Q1 and F both have B.
What I mean by A and B is things like:

A "that" [as in "he said that he would come"], B "zero that" [as in "he said he would come"]
A "upon", B "on"
A colon, B full stop
A "God's wounds", B "Zounds"

If there is a consistent one-directional difference representing a good proportion of the cases overall, this might look like human agency rather than textual indeterminacy à la Lene Petersen, and then the next question would be whether this is more likely compositorial or authorial. On hath>has, doth>does, etc., it would be a great project to look at these in variant versions of the same texts to see if and how much the later-printed versions incorporate these changes--especially cases where no authorial revision is suspected--with the idea of getting a handle on whether this was a common compositorial change. >>
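The tally Hugh Craig describes could be collected along the following lines. This is only an illustrative sketch: the variant pairs and the two aligned line pairs are toy placeholders, whereas in practice the pairs would come from the alignment of Q1-common-with-F against F-common-with-Q1.

import re
from collections import Counter

# Example competing forms (A, B); any pair of the kinds listed above could be added.
VARIANT_PAIRS = [("upon", "on"), ("hath", "has"), ("doth", "does")]

def tokens(line):
    return re.findall(r"[a-z']+", line.lower())

def tally(aligned_pairs, a, b):
    # Return counts of Q1=A/F=B, Q1=B/F=A, both A, both B for one variant pair.
    counts = Counter()
    for q1_line, f_line in aligned_pairs:
        q1_a, q1_b = a in tokens(q1_line), b in tokens(q1_line)
        f_a, f_b = a in tokens(f_line), b in tokens(f_line)
        if q1_a and f_b:
            counts["Q1=A, F=B"] += 1
        elif q1_b and f_a:
            counts["Q1=B, F=A"] += 1
        elif q1_a and f_a:
            counts["both A"] += 1
        elif q1_b and f_b:
            counts["both B"] += 1
    return counts

if __name__ == "__main__":
    # Toy aligned (Q1 line, F line) pairs standing in for the real alignment.
    aligned = [
        ("he hath set me upon the heath", "he has set me on the heath"),
        ("she doth weep upon the stage", "she doth weep upon the stage"),
    ]
    for a, b in VARIANT_PAIRS:
        print(f"{a} vs {b}:", dict(tally(aligned, a, b)))

A consistently lopsided table for a given pair (many Q1=A/F=B cases and few of the reverse) is the kind of one-directional pattern Craig suggests would point to human agency rather than textual indeterminacy.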