Tonight I attended a lecture by the new director of the Folger Shakespeare Library, Michael Witmore, on using data-mining technology on the works of Shakespeare. Applying quantifying technology, typically associated with the sciences and other "real world" pursuits, to something as subjective as literature is novel, and offers many insights, not the least of which concerns its own subjectivity.
The talk began with an obligatory crack about the Roland Emmerich Shakespeare authorship abortion slithering into theaters this Friday ("The media was asking me for the Folger's position on Anonymous. I told them we're a library; we don't have positions, we have collections."), and then Witmore gave some background on his topic. As an English professor at Carnegie Mellon University, he was introduced to a program called Docuscope, which breaks down text fed into it and assigns tags to different words based on the functions they serve. The program was originally intended for improving the writing skills of incoming freshmen, but Witmore and his University of Glasgow colleague Jonathan Hope fed it Shakespeare's 36 plays and have been analyzing the results on grounds of genre.
Witmore's blog demystifies some of the technical aspects of how this works:
Docuscope, that is, codes words and “strings” of words based on the ways in which they render a world experientially for a reader or listener. The theory behind how texts do this, and thus the rationale for Docuscope’s coding strategy, is derived from Michael Halliday’s systemic-functional grammar. But what is particularly interesting about Docuscope is the human element involved in its creation. The main architect of the system, a rhetorician named David Kaufer, spent 8 years hand-tagging several million pieces of English according to their rhetorical function, and then expanded out this initial tagging spread with wild-card operators so that Docuscope now classes over 200 million strings of English (1 to 10 words in length) into over 100 distinct categories of use or function.
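To make the idea of coding "strings" of words a little more concrete, here is a rough sketch of how a dictionary-based tagger might work. This is not Docuscope's code: the dictionary entries and category names are invented for illustration, and the longest-match-first rule is my own assumption about how overlapping strings could be resolved.

```python
# Toy illustration only: the entries and category names below are made up
# for this sketch and are not taken from Docuscope.
TOY_DICTIONARY = {
    ("i",): "FirstPerson",
    ("you",): "DirectAddress",
    ("i", "think"): "SelfDisclosure",
    ("once", "upon", "a", "time"): "Narrative",
}
MAX_STRING_LENGTH = 10  # the quoted passage says strings run 1 to 10 words


def tag_text(text, dictionary=TOY_DICTIONARY, max_len=MAX_STRING_LENGTH):
    """Greedy tagging: at each position, try the longest string first (assumed rule)."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    tags = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            if candidate in dictionary:
                tags.append((" ".join(candidate), dictionary[candidate]))
                i += n
                break
        else:
            i += 1  # no entry starts here; leave the word untagged
    return tags
```

With this toy dictionary, tag_text("I think you know.") tags "i think" as a single two-word string rather than tagging "i" on its own, which is the point of working with strings instead of isolated words.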
The resulting data clusters, they found, would belong in certain categories, while being (categorically?) excluded from others--given text chunks would be characteristic of ABC, but not XYZ. Add these together and you'd get a Gaussian scatter plot, divided among Tragedy, Comedy, History, and the Late Romances of Shakespeare. The purest Comedy section came from The Merry Wives of Windsor, while the purest History was a scene from Richard II, which after consideration most would probably agree is the least funny of the History plays.
How these results were reached has everything to do with the generic criteria at work. As Witmore described it, comedy often involves two people plotting away, which involves a lot of 'I' and 'you' exchanges, setting in motion a snowballing series of misunderstandings. A hallmark of History, by contrast, is heavily descriptive dialogue, which makes sense considering the sheer amount of, well, history that needs to be conveyed.
These distinctions lead to the seemingly unusual grouping of Othello with The Merry Wives of Windsor in comedy. Its plot, recall, revolves around an elaborate scheme of mounting falsehoods to trick Othello into suspecting his wife of adultery. Familiar comic territory, but punctuated with the horrible success of Iago's stratagem.
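The 'I' and 'you' observation above is the kind of thing that is easy to count. Here is a minimal sketch, assuming a hand-picked pronoun list of my own (not the categories Witmore and Hope actually used), of how one might measure the rate of first- and second-person pronouns in a chunk of text.

```python
import re

# Assumption for this sketch: a hand-picked pronoun list stands in for
# whatever categories the real analysis used.
FIRST_SECOND_PERSON = {"i", "me", "my", "you", "your", "thou", "thee", "thy"}


def pronoun_rate(text):
    """Return first- and second-person pronouns per 100 words of text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in FIRST_SECOND_PERSON)
    return 100.0 * hits / len(words)
```

Fed an invented plotting line like "I will tell you my plan", this scores far higher than an invented descriptive line like "The king rode north with his army", which is roughly the contrast between comedy and history described above, in miniature.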
An obvious objection to this computerized approach to textual analysis is that its reasoning is circular. A computer program can only find what its programmers tell it to find, so how can it actually tell us anything new? Witmore was asked such a question in the follow-up Q&A (I think; it came from an analyst who works with similar data, so the wording was far more technical and above my pay grade). His response was, essentially, that one has to find a balance, and that overly detailed programming would indeed produce narrow and redundant results.
I myself had a hard time getting past the issue of the premises Docuscope is given, but it seems its real strength is in allowing us to re-examine our notions of genre and our classification of plays. Witmore mentioned to me that one of the other surprising results was the grouping of A Midsummer Night's Dream in History, which had everything to do with the extensive descriptive language in the play, plus the whole Pyramus and Thisby episode. One can actually track the points in a play where it veers into different territory, he said. More accurate than identifying a play by genre is identifying its tendency towards a given genre.
Thus the usefulness of the Docuscope approach lies in its strict logic. Though we may know what we're telling it when we input its text-tagging parameters, its thoroughness and consistency in applying them will show us things we may not have considered before.