Play word counts

Projects that analyse the words in early modern drama often need to consider the frequencies of occurrence of particular words in relation to the total number of words in a play. By words here we mean tokens, so that Shakespeare's line 'Never, never, never, never, never' (King Lear 5.3.284) counts as five words (strictly, five tokens) although it has only one word type, namely the word never. The significance of finding a particular word a given number of times often depends on how many other words there are in a play.
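
The token/type distinction can be illustrated with a few lines of Python (a hypothetical snippet, not part of the project's script):

```python
# Shakespeare's line at King Lear 5.3.284: five tokens, one type.
line = "Never, never, never, never, never"

# Strip punctuation and lower-case each word so that 'Never,'
# and 'never' count as the same type.
tokens = [word.strip(",.!?;:").lower() for word in line.split()]
types = set(tokens)

print(len(tokens))  # 5 tokens
print(len(types))   # 1 type
```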

We would be surprised to find a relatively rare word such as stole appearing often in a short play such as Shakespeare's The Comedy of Errors (where in fact it never appears) or his Macbeth (where it occurs just once) but less surprised to find it in a rather long play such as Hamlet (where it occurs three times). Long plays have more 'opportunity', as it were, to use rare words than short ones do. The question of just how surprised we should be is rather more complicated than it at first appears and cannot be solved by simple arithmetic. But as a starting point, it is often useful to know how many words there are in the plays we are working on.
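
As a first step towards such comparisons, raw counts can be turned into relative frequencies, such as occurrences per 1,000 words. A minimal sketch follows: the counts of stole come from the paragraph above, Hamlet's total is the approximate Folio count mentioned below, and Macbeth's total is an assumed round figure used purely for illustration.

```python
def rate_per_thousand(occurrences, total_words):
    """Occurrences of a word per 1,000 words of the play."""
    return 1000 * occurrences / total_words

# 'stole' occurs three times in Hamlet (c. 29,850 words in the Folio)
# and once in Macbeth (c. 17,000 words assumed here for illustration).
hamlet_rate = rate_per_thousand(3, 29850)
macbeth_rate = rate_per_thousand(1, 17000)

# Even though Hamlet has three times the occurrences, the rates are of
# the same order of magnitude once play length is taken into account.
print(round(hamlet_rate, 3), round(macbeth_rate, 3))
```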

There are readily available sources of information for how many words there are in each of Shakespeare's plays, although the precise number found will depend on which edition we are referring to. There are about 31,000 words in the 1604-5 Second Quarto of Hamlet but only about 29,850 in the 1623 Folio edition. Despite this additional complication, it is useful to have a sense of how many words there are in not only the plays of Shakespeare but also the extensive canon of drama from around the same time written by other dramatists. Shakespeare's case is somewhat anomalous since there are more early editions of his plays than those of other writers, and in most canons the variation between editions is not as large as in his.

A useful standard source for the play texts is ProQuest's Literature Online database, recently renamed (in some marketplaces) as ProQuest One Literature. To provide working data for the New Oxford Shakespeare project, useful also in the Shakespeare's Early Editions project, the Post-Doctoral Research Associate Dr Paul Brown, funded by the Modern Humanities Research Association, set to work with De Montfort Frontrunner interns Jess Samuel and Carys Hudson. Under Brown's direction, the interns downloaded 382 plays from Literature Online and undertook some preliminary 'cleaning' of them. Brown wrote a script in the Python programming language to produce a word count for each play. The script was specifically written to detect and strip out some of the non-word artefacts peculiar to the transcriptions of plays provided by Literature Online, such as the string "[Page xyz]" that it uses to mark and number page-breaks.

The script was applied to two batches of plays. The first batch was a particular set of plays (in certain editions) requested by Prof Gary Taylor, a General Editor of the New Oxford Shakespeare, and the Comma Separated Values (CSV) file containing the word counts for these plays is linked on the left of this page as "Taylor's-List.csv". The second batch was all the plays that Literature Online dates from 1564 to 1616, and this batch has 40 more plays than the first. The CSV file containing the word counts for these plays is linked on the left of this page as "All-Plays.csv". The CSV format is a non-proprietary open format for tabular data and files conforming to it can be opened by many software applications, including the commonly used Microsoft Excel spreadsheet program.
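
Such a file can also be read programmatically. A minimal sketch using Python's built-in csv module follows, assuming (as described on this page) a first row of column headings and then one row per play giving its title and word count:

```python
import csv

def load_word_counts(csv_file):
    """Read play titles and word counts from an open CSV file whose
    first row is a heading row and whose first two columns hold the
    play title and its word count."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the 'Play Title, Wordcount' heading row
    return {row[0]: int(row[1]) for row in reader if row and row[0].strip()}

# Usage, once the file has been downloaded:
# with open("All-Plays.csv", newline="") as f:
#     counts = load_word_counts(f)
```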

Each of these two CSV files contains, in column A, a list of the titles of plays held in the Literature Online database. Each play has, courtesy of the accompanying Python script, its word count given next to its title, in column B. These word counts were verified by taking a sample of the plays and copying their full texts into Microsoft Word and recording the word count it reported (column C). To check there were no wild discrepancies, the percentage closeness of the counts was computed in column D. Columns C and D serve no purpose beyond checking that the Python script produced reliable results and they can be discarded by most users.
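
The exact spreadsheet formula behind column D is not reproduced here; one natural way to express how close two counts are is the smaller as a percentage of the larger (a hypothetical reconstruction, not necessarily the project's formula):

```python
def percent_closeness(count_a, count_b):
    """The smaller of two word counts as a percentage of the larger,
    so that identical counts score 100."""
    return 100 * min(count_a, count_b) / max(count_a, count_b)

# e.g. a script count of 29700 against a Word count of 29850
print(round(percent_closeness(29700, 29850), 1))  # 99.5
```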

The Python script that Dr Brown wrote is below and includes his copious documentation and his disclaimer that he has learnt a lot more about programming since he wrote this. The script is presented in this webpage using HTML's <pre>...</pre> tags so it may be copied from here and pasted directly into a programming text editor.


''' AUTHOR NOTE: this script calculates the number of words
in a given set of text files. It handles some transcription
quirks particular to texts drawn from the Literature Online
database (as text files from this database were the dataset
with which the script was meant to be used). The script was
written in the earliest days of the author's work as a
software engineer and bears the marks of one learning a new
skill: it is functional but inelegant; it opts for a
straight-up procedural style over abstracting the heavy
lifting into functions and classes (which he now knows is
the way to do things). It does work, though . . .'''

# Import the operating system module and call it os. This
# module is used to manipulate and navigate around elements
# of the operating system. Here it is used to move around
# working directories.

import os as os

# Import the glob module and call it glob when you reference
# it. This module is used to find file names and path names
# based on pattern matching. It creates lists of files that
# match specified patterns.

import glob as glob

# Import the regular expressions module and call it re. This
# allows the program to use regular expressions (here used
# to match strings found in files).

import re as re

# Define a variable called play_file_directory -- this is
# where all the raw (but manually cleaned) texts are saved.
# os.getcwd tells the os module to return the current working
# directory (here, that's the one in which this Python program
# is housed). This line also concatenates the string
# "\raw_files" to it, so play_file_directory becomes one
# subdirectory below where the program lives. Here the current
# working directory is
# C:\Users\Admin\PycharmProjects\LION-Word-Counts,
# which makes play_file_directory
# C:\Users\Admin\PycharmProjects\LION-Word-Counts\raw_files.
# The backslash is used as an escape character to signal
# the following backslash is to be taken literally.

play_file_directory = os.getcwd() + "\\raw_files"

# Define a variable called location_for_cleaned_files which
# will store the processed files. Using the same os.getcwd
# and concatenation method, this line specifies that the
# location for cleaned files will be
# C:\Users\Admin\PycharmProjects\LION-Word-Counts\cleaned_files
# The backslash again is used as an escape character.
# Note also the backslash at the end of the location -- this
# is required when appending the .txt files to the location
# later on.

location_for_cleaned_files = os.getcwd() + "\\cleaned_files\\"

# Use the change directory method of the os module (chdir)
# to go to the location as specified by the play_file_directory
# variable. That is
# C:\Users\Admin\PycharmProjects\LION-Word-Counts\raw_files
# in this case.

os.chdir(play_file_directory)

# Once in this location, create a new list called play_text_list
# and use the glob method of the glob module to make a list
# of all the filenames that include some text, denoted by
# the *, followed by .txt. In effect, this serves to
# create a list of all the text files in the directory.
# This makes a list of all our play text files in this case.

play_text_list = glob.glob("*.txt")

# This line makes a list called punctuation_to_remove of all
# the punctuation marks we will remove from our texts

punctuation_to_remove = [',', '.', '!', '?', '"', ':', ';', ')', '(', '[', ']', '<', '>', '-', '{', '}', '/', '|']

# This code block instantiates a dictionary called
# play_word_counts. The {} tell us it is a dictionary,
# and, given that there is nothing between the braces,
# we have created it empty. The for loop counts from 0
# up until the number that is equal to the length of
# the play_text_list. So if there are 15 plays in the
# list, the loop will count from 0 to 14. Each time
# through the loop, the code will add a key to the
# dictionary play_word_counts that is equal to the
# item of play_text_list at index [i]. So, if i = 5,
# and the 5th item in the play_text_list list is 'Hamlet'
# then the play_word_counts dictionary will have a key
# called 'Hamlet'. This line also sets the value of
# the key to 0. For our purposes, these lines make
# all the plays we're dealing with a key in a dictionary,
# with a value of 0.

play_word_counts = {}
for i in range(0, (len(play_text_list))):
	play_word_counts[play_text_list[i]] = 0

# This block handles the bulk of the automated cleaning.
# It matches with page and number markers in the LION
# transcriptions and removes them, and removes punctuation.
# The loop counts from 0 to the length of play_text_list.
# Each time through the loop it opens a play file -- the
# name of which is an item in play_text_list and calls
# it play_text. We set play_text (currently the file
# 'hamlet.txt', say) to be equal to play_text.read() --
# this has Python read in the whole file, ready for
# some instruction to do something with it. So we now
# have the whole text of hamlet.txt ready for some
# processing. We make play_text (currently all the
# words of Hamlet ready for some processing) equal
# to the substitution (=sub) method of the regular
# expression module (as re). This means our program
# will comb through all of play_text and replace
# anything that matches the specified regular expression
# with "" (that is two quotation marks with no space
# between them). This deletes anything that matches
# the regular expression. The regular expression itself
# works in the following way: the 'r' that prefaces
# the expression denotes a raw string literal, which
# means 'treat what follows as literal characters'.
# In particular, when you see a backslash, treat it
# as 'just a backslash' -- don't make me escape them.
# This is done because regular expressions use backslashes
# to say 'literally find this character, don't use its
# special meaning'. The expression itself says find a
# literal '[' (marked by '\['), followed by either an
# uppercase P or a lowercase p (as given by '[Pp]'),
# followed by lowercase 'age'. This is then followed 
# by 0 or 1 space characters ('\s?') which in turn is
# followed by any number of digits ('\d*'), which is
# or are followed by 0 or 1 space characters ('\s?') which
# is followed by a ']' (marked by '\]'). This part of
# the expression captures all the variations of the
# [Page XXXX] or [Page ] or [page XX] markers in the
# LION transcriptions. The | serves as a boolean OR
# in regular expressions, so says match whatever comes
# before | OR match whatever comes after | and, in this
# case, replaces it with "" (nothing). The next section
# says match a literal [ (given as '\[') followed by one
# or more digits (denoted by '\d+') and is or are followed
# by ] (given as '\]') and replace it with "". This
# combats things like [25] and [5643] in LION transcriptions.
# The next section -- following the second | -- says
# find a [ (marked by '\[') followed by one or more
# full stops (denoted by '\.+') followed by a closing
# square bracket ('\]'). This matches with [.....] which
# is used in LION transcriptions to mark missing text.
# NOTE: must use '\.' -- . on its own means 'any character'
# Once all the matched expressions have been replaced
# in play_text, we enter a for loop that cycles through
# all the punctuation marks in the list created above
# (punctuation_to_remove). The loop crawls through
# play_text asking, in turn, is this character (say)
# a '!'? If it is, then the line
# play_text = play_text.replace(punctuation_mark, "")
# replaces the found ! with "" (nothing). The loop does
# this for all punctuation marks and the result is a
# play_text cleaned of all punctuation. Python's built-in
# 'replace' method handles this for us in the same way
# the re .sub method did previously. Both take the
# string to find and the string to replace it with,
# but the sub method takes an additional argument that
# specifies the text in which to look.

for i in range(0, (len(play_text_list))):
	with open(play_text_list[i]) as play_text:
		play_text = play_text.read()
		play_text = re.sub(r"\[[Pp]age\s?\d*\s?\]|\[\d+\]|\[\.+\]", "", play_text)
		for punctuation_mark in punctuation_to_remove:
			play_text = play_text.replace(punctuation_mark, "")

# This code block tackles the issue of some transcriptions
# including spaced-out characters where superscript
# characters appear in the early modern texts. First,
# play_text (which is now all the text of given play
# -- Hamlet in our example above) is turned into a list
# of single words using the .split method. The .split
# method splits by default on the space character.
# Thus, if the first words of the play were 'to be or
# not to be that is the question', play_text would
# become a python list that looked like this:
# ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is',
# 'the', 'question']. A list
# is indexed so play_text[0] is 'to' and play_text[2]
# is 'or'. The for loop gives j the value of 0 and
# increments it each time through the loop up until
# j reaches the value of the length of play_text.
# In our example, j would go up until it matched
# the number of words in Hamlet. The conditional
# statement if j < len(play_text) - 1 ensures that
# the following lines do not try and test the last
# word in the play plus another word that is the
# number of words in the play + 1. It is necessary
# because the lines below it call for j + 1 (so
# when j is the total number of words in Hamlet,
# asking to examine that value + 1 is impossible
# and throws an error). The next conditional tests
# whether two consecutive words (j and j + 1) are
# equal to a series of letters found in the
# transcriptions because of superscript transcription.
# So if play_text[j] = y and play_text[j+1] = e (because
# y followed by a superscript e in the printed text
# was transcribed as 'y e' in LION's text), then
# play_text[j] + play_text[j+1] equals 'ye'. The
# 'or' operator allows the conditional to search for
# multiple of these superscript transcriptions at
# once. Because the test requires two words to have
# produced the letters tested for ('ye', 'yt' and
# so on), the program does not do anything when these
# letters appear as a single word in the text.
# Where one such pair is discovered (that is, the
# if condition is met), the line play_text[j] = play_text[j] +
# play_text[j+1] sticks the two letters or collection
# of letters together and stores them where the
# first was found. play_text.pop(j+1) then deletes
# the letter or letters found in the second part of
# the condition -- since these are now included
# at play_text[j]

		play_text = play_text.split()
		for j in range(0, (len(play_text))):
			if j < len(play_text) - 1:
				if play_text[j] + play_text[j + 1] == "ye" or play_text[j] + play_text[j + 1] == "yt" \
						or play_text[j] + play_text[j + 1] == "wt" or play_text[j] + play_text[j + 1] == "mr" \
						or play_text[j] + play_text[j + 1] == "sr" or play_text[j] + play_text[j + 1] == "or" \
						or play_text[j] + play_text[j + 1] == "for" or play_text[j] + play_text[j + 1] == "our" \
						or play_text[j] + play_text[j + 1] == "your" or play_text[j] + play_text[j + 1] == "yor":
					play_text[j] = play_text[j] + play_text[j + 1]
					play_text.pop(j + 1)

# Some superscript transcriptions for 'thee' appear
# as 'y e e'. Since the code above already stitches
# all occurrences of 'y e' together, a further test
# is needed to stitch 'y e e' into 'yee'. These lines
# do that. They first check that j is not at its
# last location (since then j+1 would be impossible),
# before testing whether the word found at the
# current position j and the word at the position
# j+1 equals 'yee' when stitched together. 'y e'
# has already been stitched to 'ye' so if j + 1
# is found to be 'e' the condition is met. In which
# case, play_text[j] is set to 'yee' and
# the .pop method is used to delete play_text[j+1].

			if j < len(play_text) - 1:
				if play_text[j] + play_text[j + 1] == "yee":
					play_text[j] = play_text[j] + play_text[j + 1]
					play_text.pop(j + 1)

# The two following code blocks do the same thing
# when three or four split words need to be tested.
# That is, when a number of letters have been in
# superscript and so a space has been added between
# multiple letters that belong to the same word.
# 'which', for instance, is transcribed as 'w c h'.
# To make it 'wch' play_text[j], play_text[j+1],
# and play_text[j+2] must be examined. The same
# is the case four-letter superscripts like
# 's n t t', occasionally found for 'saint'.
# The only other difference in these blocks is
# that the .pop method deletes more than one
# list item, but always at position j+1. This
# is because at least two items subsequent to j
# need to be deleted, but once j+1 has been
# removed, the new j+1 was j+2 when the test
# began -- deleting j+2 would be deleting the
# word found immediately after the superscript
# transcription.

			if j < len(play_text) - 2:
				if play_text[j] + play_text[j + 1] + play_text[j + 2] == "wch" or play_text[j] + play_text[j + 1] + \
						play_text[j + 2] == "your" or play_text[j] + play_text[j + 1] + play_text[j + 2] == "you" \
						or play_text[j] + play_text[j + 1] + play_text[j + 2] == "wth" or play_text[j] + play_text[
						j + 1] + play_text[j + 2] == "yee" \
						or play_text[j] + play_text[j + 1] + play_text[j + 2] == "firste" or play_text[j] + play_text[
						j + 1] + play_text[j + 2] == "yeir":
					play_text[j] = play_text[j] + play_text[j + 1] + play_text[j + 2]
					play_text.pop(j + 1)
					play_text.pop(j + 1)
			if j < len(play_text) - 3:
				if play_text[j] + play_text[j + 1] + play_text[j + 2] + play_text[j + 3] == "sntt":
					play_text[j] = play_text[j] + play_text[j + 1] + play_text[j + 2] + play_text[j + 3]
					play_text.pop(j + 1)
					play_text.pop(j + 1)
					play_text.pop(j + 1)

# This line makes the value of the dictionary
# play_word_counts at the key play_text_list[i]
# equal to the number of words that remain in
# play_text (the list of all the words in the
# play currently being examined). So if hamlet.txt
# is the current play in the list we are looking
# at, and it has 20,000 words after cleaning,
# the value of the key hamlet.txt in the
# play_word_counts dictionary would be 20,000

		play_word_counts[play_text_list[i]] = len(play_text)

# This block opens a new file that the current
# play_text is going to be written to since it
# is now cleaned of punctuation and other
# transcription oddities. We open the file
# in the directory we specified above for
# cleaned files and add 'cleaned_' to it
# followed by the play file name (like
# 'hamlet.txt') and we open the file in
# write mode, specified by the 'w'. This
# gives a file path like:
# C:\Users\Admin\PycharmProjects\LION-Word-Counts\cleaned_files\cleaned_hamlet.txt
# We call this file play_out in this code block.
# We create a variable called words_per_line
# and give it the value 0. This will count
# how many words we have written to each
# line of the file. We begin a loop that
# loops through every word in play_text
# (currently a python list of all the words
# in a play, with each word being its own
# item: ['to', 'be', 'or', 'not', . . .]).
# This is a slightly more pythonic way of
# looping through an indexed list than the
# for i in range(0, len(play_text)) that
# we have used elsewhere but it has the
# same effect. The if statement checks whether
# there are 18 or fewer words on the line
# already; if there are, the program writes the next word of
# play_text to the file, followed by a space.
# It then increments words_per_line by adding
# 1 to its value. Once words_per_line equals
# 19, the else condition executes and writes
# a final word to the same line, before adding
# a newline character ('\n') and resetting
# words_per_line to 0 to start counting again
# for the new line.

	with open(location_for_cleaned_files + "cleaned_" + play_text_list[i], "w") as play_out:
		words_per_line = 0
		for word in play_text:
			if words_per_line <= 18:
				play_out.write(word + " ")
				words_per_line += 1
			else:
				play_out.write(word + " \n")
				words_per_line = 0

# The major for loop of the program that
# cycles through each play in turn, cleaning
# its content, counting its words and
# writing it to a file ends here. The
# 'with open' instruction in Python closes
# the files it has been using by default
# so no explicit file closing is required.

# This block opens a .csv (= comma separated
# values) file in write mode and calls it
# word_counts. It is going to house the
# names of our plays and our wordcounts.
# The program uses the .write method to
# write column headings of 'Play Title'
# and 'Wordcount' followed by a newline
# marker. It then cycles through each play
# in the play_text list. The first thing
# the loop does is make a variable called
# title_cleaned equal to the play in
# play_text_list (currently '2_henry_4.txt')
# and replaces the .txt at its end with
# nothing. title_cleaned (now '2_henry_4')
# now replaces the _ character(s) with
# a space to make title_cleaned equal
# to '2 henry 4'. The .write method is
# then called again to write the cleaned
# title to the file, followed by a comma,
# then a space then the value of the
# dictionary play_word_counts at position
# [play]. The value -- though stored as
# an integer in the dictionary -- is rendered
# as a string by using 'str' -- this is required
# for the concatenation (strings
# and integers cannot be stuck together since
# they are different data types). This is
# followed by another comma and space
# and a new line marker. This means the file
# is operating on a new line for the next
# play title and count to be written
# the next time through the loop. 

with open("word_counts.csv", "w") as word_counts:
	word_counts.write("Play Title, Wordcount \n")
	for play in play_text_list:
		title_cleaned = play.replace(".txt", "")
		title_cleaned = title_cleaned.replace("_", " ")
		word_counts.write(title_cleaned + ", " + str(play_word_counts[play]) + ", " + "\n")	 # + play + "\n")
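
As the author's note anticipates, the heavy lifting could be abstracted into functions. A minimal sketch of that refactoring is below, covering only the marker-stripping, punctuation-removal, and counting steps (the superscript-stitching pass would still need its word-by-word loop); it is an illustration, not the project's code.

```python
import re

# The same page/number/missing-text markers matched by the script above.
LION_MARKERS = re.compile(r"\[[Pp]age\s?\d*\s?\]|\[\d+\]|\[\.+\]")
PUNCTUATION = ',.!?":;)([]<>-{}/|'

def clean_text(raw):
    """Strip LION transcription markers and punctuation from a play text."""
    text = LION_MARKERS.sub("", raw)
    return text.translate(str.maketrans("", "", PUNCTUATION))

def word_count(raw):
    """Count the words (tokens) left after cleaning."""
    return len(clean_text(raw).split())
```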