The History of LEME

This website is a report on the making of the LEME corpus. What we learned from using the corpus has appeared, and will continue to appear, elsewhere.

The first publication about this project (Lancashire 1992) describes it as a diachronic corpus, a systematic collection of lexical texts for analysis by a corpus linguistics community such as ICAME. The emergence of the Web led me to a simpler application, suitable for teaching a graduate course in Shakespeare’s language and comparable to Google-based searching. The Early Modern English Dictionaries Database (EMEDD) went online at CHASS in 1996 with software by University of Waterloo computer scientists at the OED project. The Canada Foundation for Innovation, in funding the TAPoR project six years later, made possible a dedicated MySQL database and a much larger collection of texts, published as LEME in 2006. This work also brought LEME two partners: the University of Toronto Press became our publisher, and the University of Toronto Library our host. LEME funded its work by licensing its full database to individuals and academic institutions, while keeping public querying of the database free to all.

In 2015 LEME undertook a survey of user experiences. The fifty-one respondents worked in Canada, the United States, the United Kingdom, France, Spain, Germany, and Serbia, and more than half of them studied English literature, linguistics, or philology. Ninety-four percent of respondents identified LEME as currently meeting their research needs. Respondents praised its simple and clear interface, its comprehensive coverage, its ability to print, a timeline feature, and a search engine that could be limited to headwords or definitions, and to variant spellings. Most respondents reported using LEME in conjunction with Early English Books Online (EEBO-TCP), the OED, ECCO (Eighteenth-Century Collections Online), Project Muse, and LION. LEME served a need that these projects could not. Half of the respondents replied “yes” when asked if LEME should be the basis of a period (Early Modern English) dictionary. A few thought LEME would be improved by offering full texts, dialect sources, multiple editions of Johnson’s dictionary, and early-American-English dictionaries. One of them, in challenging language, thought LEME to be unscholarly because, unlike EEBO-TCP, it did not grant free access to the encoded texts of its dictionaries.

LEME took that criticism seriously. By 2016, LEME, the Press, and the Library were discussing how to make full texts of our dictionaries available to researchers who were not always satisfied by LEME’s Web interface and needed to employ other software. Historical dictionaries are useful in research activities such as authorship attribution, translation studies, literary and philosophical analysis, computational linguistics, and historical lexicography. Scholarly projects also form to edit dictionaries in various languages, and an English-department project like LEME has few or uncertain qualifications in editing Latin, French, Spanish, Italian, and Dutch word-entries. LEME dictionaries can also be bundled with literary works in order to do semi-automatic annotation. Writing software for these applications is beyond our means.

LEME is now becoming what it was originally intended to be, a corpus. It is a single-genre collection, unlike the Brown Corpus (and many other corpora created to study a language statistically), which comprised balanced equal-length extracts from representative writing genres. Brown Corpus texts served the study of modern grammar and syntax, but the LEME corpus consists of glossaries, herbals, spelling lists, and dictionaries, all collections of a single genre, the word-entry. At its simplest, the word-entry asserts the relationship of a form (or headword) and an explanation (or definition). By collecting many lexical texts from a period, LEME complements the much broader compass of the Oxford English Dictionary. The main difference between LEME and OED is in the beliefs of their respective lexicographers about the function of word-entries. The moderns and post-moderns expect that an OED explanation will define the word-entry headword. The LEME corpus texts, however, do not adopt a modern semantics. They show another theory at work, one made explicit by grammars and works of logic and rhetoric in the period.

A SSHRC grant during 2017-20 enabled LEME to meet most of its goals and some large new ones. We are releasing the plain texts of LEME works, and the encoded texts of all works from 1475 to 1625. They are coming about a half-year later than we hoped, but for a good reason: the Library upgraded LEME 1.0 (2006) to LEME 2.0 in 2019 as a service donated to global researchers. LEME has acquired about sixty additional texts since my SSHRC grant began, including the great dictionary by Nathan Bailey and Joseph Nicol Scott (1755), which has 30,000 more entries than Johnson’s and introduces the term “lexical definition,” a concept central to modern semantics. We also have the text of numbered headwords in John Minsheu’s Ductor in Linguas (1617). We are preparing a list of possible additions to the OED (antedatings and new words) from selected texts, a contribution to its great work. Lemmatization has been substantially revised to make possible an index of Early Modern English vocabulary in LEME word-entries, side by side with OED headwords (as of 2015). That index is growing; lemmatization does not happen overnight. Like LEME 2.0 and the information supplied to OED, this vocabulary index was not anticipated in our SSHRC grant application. We owe it to two gifted undergraduate programmers, Xueqi (Sherry) Fan and Sky Li.

The University of Toronto enables LEME to store its texts for downloading in its TSpace area. Once stored in TSpace, files cannot be removed, although files for new versions may be added. The present WordPress web database for the LEME Corpus exists in the space of the Faculty of Arts and Science Information and Instructional Technology Services (I&ITS).[1] The I&ITS WordPress site that hosts this webfile can be regularly updated.

We have yet to release the encoded texts for the LEME Corpus, part 2 (1626-1755).

LEME owes much to the University of Toronto Work-Study program. It has enabled us to hire undergraduate text assistants and programmers. In our first year, Timothy Aberdingk Thijm devised stand-alone Python programs to convert MySQL tags (used in the database) to XML format, and to check all encoding for adherence to its standard.[2] Xueqi (Sherry) Fan in our second year undertook to lemmatize headwords and keywords in LEME texts according to OED practice so that our English-language search of the database would perform reliably.[3] This work too was done in Python and, to speed the process, she moved the program to the Niagara server of SciNet, Ontario’s high-performance computing institute. Two programmers came to LEME in its third year, Shirley Wang and Sky Li. Shirley Wang programmed the analysis, read-out, and graphing of an Excel spreadsheet that held an extended bibliography of corpus texts for the 1475-to-1625 period. Sky Li developed a stand-alone Source Analyser that compared dictionary texts to measure the overlap of each text’s word-entries with every other text’s word-entries. He also devised a vocabulary index that listed all LEME lemmatized terms (upwards of 40,000 headwords) beside the 97,800 OED headwords active in the period 1475-1625.[4] For details on these five applications, see the chapters below.
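For readers who want a concrete picture of what measuring overlap between lexical texts can involve, here is a minimal Python sketch. It is not the Source Analyser itself, whose design is described in the chapters below. The function name, the sample headwords, and the choice of measure (shared headwords as a share of the smaller text's headword set) are all assumptions made for illustration.

```python
from itertools import combinations


def headword_overlap(texts):
    """Compare every pair of texts by their shared headwords.

    `texts` maps a text label to a list of headwords (hypothetical input
    format). For each pair, the score is the number of shared headwords
    divided by the size of the smaller headword set, so 1.0 means the
    smaller text is wholly contained in the larger one.
    """
    results = {}
    for (label_a, words_a), (label_b, words_b) in combinations(texts.items(), 2):
        set_a, set_b = set(words_a), set(words_b)
        shared = set_a & set_b
        smaller = min(len(set_a), len(set_b))
        results[(label_a, label_b)] = len(shared) / smaller if smaller else 0.0
    return results


# Invented sample data, not drawn from any LEME text.
sample = {
    "text_a": ["abandon", "abash", "abate"],
    "text_b": ["abate", "abbey", "abash", "abbot"],
    "text_c": ["zeal"],
}

overlaps = headword_overlap(sample)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

A real comparison of Early Modern texts would first need lemmatized headwords, since variant spellings would otherwise hide genuine overlap.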

LEME tasks quickly raise research questions, and these programs go some way to responding to them. Text analysts and production staff at LEME are encouraged to do their own research on Early Modern English lexicature with LEME resources. LEME has had upwards of one hundred workers in its life-span to date, but at any moment only a half dozen are in our Robarts Library lab. Dictionaries are humbling, given their reputation as an authority about everything, and the relative brevity of our own human span and cognitive grasp, let alone reach. After thirty years on this project, I have to admit my slowness in seeing the forest above the fascinating trees, bushes, and plants flourishing in the garden of languages that LEME has made its signature image. As servants of Samuel Johnson’s harmless drudges, the early lexicographers, LEME hopes that you learn something interesting from them.

[1] I&ITS derives from the Computing in the Humanities and Social Sciences (CHASS) centre, which in turn was created when the Social Science EPAS computing facility and the Centre for Computing in the Humanities (CCH) combined in 1996.

[2] The LEME tag-set is not that of the Text Encoding Initiative (TEI), which is a late-twentieth-century encoding language developed on top of XML syntax. We follow the international standard, XML, but adopt a tag-set that is consistent with the Early Modern period. Modern dictionaries, from which TEI drew its lexicographical tag-set, operate on different principles than historical dictionaries.

[3] LEME follows OED headwords and their spelling in order to unify, for reference purposes, the wide variety of orthographies practised in the Early Modern period. The vocabulary we project from LEME lexical texts should not be assumed to follow the final or preferred orthography of the period.

[4] For this list, I am deeply grateful to James McCracken, the Content Technology Manager in the dictionaries department of Oxford University Press: his elegant spreadsheet unlocked a door for LEME.