"PROBLEM #1: The Transposition Problem". This occurs where two texts differ only in that a run of words has swapped over. For (made-up) example:
Q1 O for a muse of fire, that would ascend The brightest heaven of invention: A kingdom for a stage, princes to act, And monarchs to behold the swelling scene.
Q2 O for a muse of fire, that would ascend A kingdom for a stage, princes to act, The brightest heaven of invention: And monarchs to behold the swelling scene.
Here the middle two lines of Q1 appear as the middle two lines of Q2, but in reversed order. We could mark up this difference as two kinds of edit applied to Q1 in order to achieve Q2:
i) deletion-of-material-in-Q1-to-achieve-Q2 (encoded below by square brackets);
ii) insertion-of-material-not-in-Q1-to-achieve-Q2 (encoded below by pointy brackets).
Marked up this way, the texts would be:
Q1 = ... O for a muse of fire, that would ascend The brightest heaven of invention: A kingdom for a stage, princes to act, And monarchs to behold the swelling scene.
Q2 = Q1 WITH THESE EDITS APPLIED ...
O for a muse of fire, that would ascend
[The brightest heaven of invention:] TYPE (i) EDIT
A kingdom for a stage, princes to act,
<The brightest heaven of invention:> TYPE (ii) EDIT
And monarchs to behold the swelling scene.
It seems to me from what I understand so far about the alignment techniques you're exploring that this is how these techniques treat such transposition (which editorially we consider a single edit), namely as a combination of two edits: a deletion and an insertion.
That approach might be acceptable if the techniques are able to track three things: that the two edits together comprise one textual event; that the same piece of text (here, "The brightest heaven of invention:") is the subject of both edits; and that this same transposition could be expressed by a different pair of edits. That last point arises because an equally valid way to express the differences between Q1 and Q2 would be:
Q2 = Q1 WITH THESE EDITS APPLIED ...
O for a muse of fire, that would ascend
<A kingdom for a stage, princes to act,> TYPE (ii) EDIT
The brightest heaven of invention:
[A kingdom for a stage, princes to act,] TYPE (i) EDIT
And monarchs to behold the swelling scene.
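To make concrete what I mean by "tracking" that a deletion and an insertion belong together, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not a description of the techniques you are actually using: it compares the texts verse line by verse line with the standard library's difflib, and the function name find_transpositions and the from/to record layout are invented for the example. Any deleted line that reappears verbatim as an inserted line is reported as one move rather than two edits.

```python
from difflib import SequenceMatcher

# The made-up example from above, one verse line per list item.
Q1 = [
    "O for a muse of fire, that would ascend",
    "The brightest heaven of invention:",
    "A kingdom for a stage, princes to act,",
    "And monarchs to behold the swelling scene.",
]
Q2 = [Q1[0], Q1[2], Q1[1], Q1[3]]  # middle two lines swapped


def find_transpositions(a, b):
    """Pair each deleted unit of `a` with an identical inserted unit of `b`.

    Hypothetical helper: a unit that is both deleted and inserted is
    treated as a single transposition instead of two unrelated edits.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend((i, a[i]) for i in range(i1, i2))
        if tag in ("insert", "replace"):
            inserted.extend((j, b[j]) for j in range(j1, j2))

    moves = []
    unused = list(inserted)
    for i, text in deleted:
        for entry in unused:
            if entry[1] == text:
                # Same text removed from position i and added at entry[0].
                moves.append({"text": text, "from": i, "to": entry[0]})
                unused.remove(entry)
                break
    return moves


print(find_transpositions(Q1, Q2))
```

Which of the two middle lines gets reported as "the one that moved" depends on which of the two equally valid edit pairs the aligner happens to choose, which is exactly the ambiguity described above. A fuller treatment would need to record both readings as the same event.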
I can also see how these alignment techniques can accommodate substitutions, if say Q2 had "observe" instead of "behold" in the last line. But that takes us to ...
"PROBLEM #2: The Substitution Problem". The rules that we want to apply when quantifying substitutions are not agreed even amongst Shakespearians. Take one line from King Lear that appears in the First Quarto and the Folio editions thus:
Q1 = Sir I am made of the selfe same mettall that my sister is
F = I am made of that selfe-mettle as my Sister
There are multiple ways to align these two strings depending on what we consider to be our threshold of dissimilarity beyond which something counts as a substitution. In the Oxford English Dictionary, which records variations in words' spellings and meanings over time, "mettall" and "mettle" are given as equally valid spellings of the modern word "metal" and so for our purposes this is no substitution at all: they are identical words. Likewise, "sister" and "Sister" are identical spellings differing only in capitalization, which in general is not meaningful in our editions. But how to treat the difference between "selfe same mettall" and "selfe-mettle" is hard to say. We might ignore the hyphen and say that these are the same except that the latter deletes "same".
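The effect of these choices can be sketched in Python as follows. This is a hedged illustration only: the spelling table SPELLING_VARIANTS is a tiny hand-made stand-in (a real one would have to be compiled from a source such as the OED), and the function names tokenize and word_edits are my own inventions. The sketch lower-cases every word, treats a hyphen as a word break, maps variant spellings to one form, and only then aligns the two lines word by word with the standard library's difflib.

```python
from difflib import SequenceMatcher

# Assumption: a toy lookup table of variant spellings. Only the two
# spellings discussed above are included.
SPELLING_VARIANTS = {"mettall": "metal", "mettle": "metal"}


def tokenize(line):
    """Lower-case, treat hyphens as word breaks, unify variant spellings."""
    words = line.lower().replace("-", " ").split()
    return [SPELLING_VARIANTS.get(w, w) for w in words]


def word_edits(line_a, line_b):
    """Return the non-matching stretches as (tag, words_in_a, words_in_b)."""
    a, b = tokenize(line_a), tokenize(line_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


Q1_LINE = "Sir I am made of the selfe same mettall that my sister is"
F_LINE = "I am made of that selfe-mettle as my Sister"

for edit in word_edits(Q1_LINE, F_LINE):
    print(edit)
```

Under these particular rules "mettall"/"mettle" and "sister"/"Sister" no longer count as differences, and "selfe same mettall" versus "selfe-mettle" reduces to the deletion of "same". Changing any one rule (keeping hyphens, say, or keeping capitals) changes the count, which is why the rules themselves need to be argued for.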
This second problem is really an opportunity because in fact the field of Shakespeare studies has no agreed way to quantify these things, and coming up with defensible ways to do so is a key objective of our research project.
"PROBLEM #3: The 'Why?' Problem". While thinking about these questions of how to quantify the differences between the early editions of Shakespeare, we should bear in mind that the bigger question we want to answer is "Why are these editions different?"
The possible answers include that the texts of Shakespeare were corrupted in transmission, both in the copying of manuscripts (which is prone to its own distinctive errors arising from certain letters looking alike in early modern handwriting) and in printing (which is prone to different distinctive errors arising from the selection and combination of pieces of type).
Another possible source of textual corruption is aural error. If a play were written down by someone listening to a live performance and then published, the published version might suffer from mishearings. A possible example from King Lear is "A dogge, so bade in office" (Q1) versus "a Dogg's obey'd in Office" (Folio).
Another kind of variation arising in transmission, but not really a form of corruption, is the updating of linguistic forms over time. A play first written down by Shakespeare in 1590 might well have linguistic forms--such as 'hath' where we say 'has' and 'doth' where we say 'does'--that looked distinctly old-fashioned by 1620, when the printer's copy for the 1623 Folio was being prepared. Possibly in 1620 someone recopying Shakespeare's 30-year-old manuscript simply updated the old-fashioned linguistic forms. This isn't quite error in transmission, but it does cause variation.
Aside from textual corruption separating the early editions, there may be authorial and/or non-authorial revision. There has been a lot of work done on this lately, detailed in the Authorship Companion to the New Oxford Shakespeare I sent you. In the process of authorial revision of a play, the consideration we just saw about old-fashioned linguistic forms might well apply. The middle-aged Shakespeare might in 1610, while revising a play he first wrote 20 years earlier, update the old-fashioned expressions, or he might leave them alone but use more modern forms in the bits he was newly writing.
Naturally, it falls to me as the specialist on these early editions to come up with the explanations we offer for their differences. But I think the above considerations should always be in our minds as we try to quantify the differences. It is likely that multiple explanations are operating at once. That is, Q1 King Lear (1608) and Folio King Lear (1623) might well be separated by both textual corruption and authorial (or non-authorial) revision, and their difference in date may also factor into our explanation(s).