
LEME

Lexicons of Early Modern English

Editorial Practice

LEME uses MySQL database structures and an XML-encoded corpus because its method derives from corpus linguistics (McEnery and Hardie 2012). Conclusions about diachronic English, a language whose speakers are no longer alive to answer questions, are best based on text samples chosen from a large collection of word-entries written by those speakers. LEME assembles these from a multi-genre body of bilingual, monolingual, and polyglot dictionaries, spelling lists, definition collections, and glossaries that serve fields as diverse as medicine, botany, law, husbandry, horsemanship, architecture, education, surveying, and navigation.

The LEME database lists 1,387 different books and manuscripts in its primary bibliography (not counting routine re-editions), 30,529 pages of text from 272 transcribed works, 1,135,142 word-entries, 1,216,763 encoded forms and sub-forms, and 231,431 lexemes (or lemmatized terms). Each text has a text id number, which can be found in two places: the LEME 2.0 filename of the text, and the pop-up entry window for a word retrieved from that text. To retrieve the permanent URL for a word-entry, search for a word in the relevant text, click on any entry among the generated hits to open a pop-up window for that entry, and then click the bottom-center icon, (-). For example, Sir Thomas Elyot’s word-entry on “hyphen” in the 1538 edition has the URL

https://leme.library.utoronto.ca/lexicon/entry/53/7464

in which 53 is the text id and 7464 is the word-entry id. This URL will retrieve the word-entry from any workstation with Internet access, even one not running LEME.
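The two-part structure of these permanent URLs can be split apart mechanically. The sketch below assumes the path always takes the form lexicon/entry/&lt;text_id&gt;/&lt;entry_id&gt;, as in the Elyot example above; it is an illustration, not part of the LEME software.

```python
from urllib.parse import urlparse

def parse_entry_url(url: str) -> tuple[int, int]:
    """Split a LEME permanent entry URL into (text_id, entry_id)."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected path shape: lexicon/entry/<text_id>/<entry_id>
    if len(parts) != 4 or parts[:2] != ["lexicon", "entry"]:
        raise ValueError(f"not a LEME entry URL: {url}")
    return int(parts[2]), int(parts[3])

text_id, entry_id = parse_entry_url(
    "https://leme.library.utoronto.ca/lexicon/entry/53/7464"
)
print(text_id, entry_id)  # 53 7464
```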

The LEME Corpus website lists all language texts from 1475 to 1625, whether edited or not, in analytic paragraphs, and indexes them by chronology, subject, and proper name. Once a text has been uploaded into TSpace, it opens with a metadata page that identifies the text and its LEME id.

Principles

The editorial principles that govern LEME are intended to represent lexical information as the Early Modern period understood it. LEME uses a present-day English alphabet but otherwise keeps to old spelling and original lineation.[1] Its texts are not diplomatic. They do not reproduce display elements (font and illustrations) or bibliographical information in the text (running titles, signatures, page and folio numbers, and catchwords), and they treat some common letter-forms (e.g., long-s, different forms of r, ligatures such as ct) as the single characters we use today. Images of most original texts, readily available online at EEBO/TCP, deliver these details. It is true that font sometimes usefully identifies language, but LEME, unlike even critical editions, identifies the language of all words explicitly in a form, explanation, or term tag.

Not all expanded later editions of a lexical text, updated by a lexicographer, have been transcribed, but we recognize that they should be. Infrequently, an EEBO image-set was damaged; we then used text from another copy of the same edition, or even from a later edition, and noted the fact. Most LEME texts have been entered and encoded in the LEME lab.

The earliest LEME transcriptions date from the early 1990s. They include very large texts like Palsgrave, Cotgrave, Thomas Thomas, Florio (1598), and Minsheu (1599). I normally chose the earliest edition as copytext, although a TCP transcription of a later edition was sometimes too valuable not to use. We adapt many texts available in EEBO/TCP and the Internet Archive and have outsourced the entry of large dictionaries to various firms, most recently Apex CoVantage. We have certainly not encoded everything, desirable though that is.

Tools

The University of Toronto Library has chosen the programming languages for our Web database software. We use the programmer’s editors UltraEdit and Notepad++ to enter and encode texts in Unicode-based, well-formed XML. BabelMap has proved very capable in helping us write non-Roman character sets, especially Greek and Hebrew. The database editorial-tools page gives us a processing function to validate our XML-like database encoding and to add lemma elements to form and explanation tags, as does a stand-alone program devised by Tim Thijm for our corpus texts. LEME accepts headwords in the online OED as the standard lemmatized forms that everyone should follow.
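The well-formedness check described above can be approximated with a standard XML parser. This is a minimal sketch, not LEME’s validation function; the word-entry fragment and its tag names are illustrative assumptions only.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical word-entry fragment; the tag names are illustrative only.
sample = "<wordentry><form>hyphen</form><explanation>a mark</explanation></wordentry>"
print(is_well_formed(sample))          # True
print(is_well_formed("<form>hyphen"))  # False (unclosed tag)
```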

Encoding

LEME encoding is just a start on what can be done. It had to impose one encoding language on everything, and because its texts are highly variable in structure (both because of their different functions and because of each lexicographer’s familiarity with the history of word-entries), the simpler the set of tags, the fewer the errors in applying them. Thus LEME operates with as minimal an XML tag-set as possible.[2] For example, most lexicographers of the period do not indicate senses, and attempts to delineate them seem speculative. Our lexicographical tag-set grew slowly. In the early 1990s, I used COCOA (Oxford Text Concordance) tags but soon shifted to SGML, following the recommendations of the Text Encoding Initiative. LEME eventually deprecated some SGML tags, such as those for font and for part of speech, gender, and grammatical inflections explicitly stated in a text. Early Modern English lexicographers from time to time employed cross-references between word-entries, but the target often could not be found, and so I was content to tag the cross-reference without linking to its supposed target.[3] The LEME.xml tagset follows the database tags closely.
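To make the idea of a minimal tag-set concrete, the sketch below parses a hypothetical word-entry built from just a form and an explanation tag. The tag names, attribute name, and entry text are illustrative assumptions, not the published LEME-XML tagset.

```python
import xml.etree.ElementTree as ET

# A hypothetical LEME-style word-entry: one form with a lemma
# attribute, and one explanation. Names and content are assumptions.
entry = ET.fromstring(
    '<wordentry>'
    '<form lemma="hyphen">hyphen</form>'
    '<explanation>a mark that joins two words</explanation>'
    '</wordentry>'
)
form = entry.find("form")
print(form.attrib["lemma"], "->", form.text)   # hyphen -> hyphen
print(entry.find("explanation").text)
```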

Our Toronto librarians and some others have expressed concern that we have not released the texts in Text Encoding Initiative (TEI) encoding, without question an excellent language for humanities texts that use the XML standard. We have several reasons for choosing LEME.xml tags. First, our current LEME-to-TEI conversion program encounters many exceptions and cannot yet be automated. Second, although the recommended TEI subset of tags concerns dictionaries, the number of actual LEME dictionaries is overwhelmed by other genres that introduce word-entries (e.g., grammars, herbals, spelling lists, treatises with definitions, concordances, etc.). Third, Early Modern English worked steadily but slowly toward formalizing its lexicographical structures. Fourth, the dominant theory of the word-entry at this time involves two quite different definitions, one for the headword (a network of other words) and one for the thing named by the headword (the so-called logical definition found in classical rhetoric). How TEI could implement this structure and yet remain within its dictionary encoding subset is questionable. Thus the structure, naming, and function of many tags in LEME texts depart from those recommended by the Dictionaries module in TEI P5. To most researchers, these differences will not be self-explanatory, but to historical lexicologists they are significant. In such circumstances, LEME recognizes that user convenience can trump scholarly considerations. The Creative Commons 4.0 copyright designation enables researchers or institutions to revise the encoding along with the texts. TEI aficionados can replace the LEME encoding language with TEI if they wish.

LEME lemmatizes English headwords and other important English words in word-entries in an additional .xml file for each text. For researchers in non-English languages, English headwords may be undesirable; LEME is not able to lemmatize headwords in other languages. Form and explanation tags include elements for lemmas (lexemes) that follow OED headword spellings. No such standard existed in the Early Modern period, and it would have been unwise to impose editorially a set of arbitrary spelling conventions; not a few radical reforms of English spelling failed in the Early Modern period.
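In effect, lemmatization maps many Early Modern spellings onto a single OED headword. The toy lookup table below illustrates this; the entries and the fallback behaviour are my assumptions, not LEME’s lemmatizer.

```python
# A toy lemma table mapping Early Modern spellings to OED headword
# spellings. Entries and fallback behaviour are illustrative only.
LEMMAS = {
    "hyphen": "hyphen",
    "musique": "music",
    "musick": "music",
    "vertue": "virtue",
}

def lemmatize(word: str) -> str:
    """Return the OED-style lemma for a spelling, else the spelling itself."""
    return LEMMAS.get(word.lower(), word.lower())

print(lemmatize("Musick"))  # music
print(lemmatize("vertue"))  # virtue
```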

LEME emends errors in the text lightly, usually only for typos and foul case, and retains the erroneous form in a tag. LEME also expands contractions without identifying their marks of abbreviation because there is no standard for naming them as Unicode does language characters. My attempt to use, in an expansion tag, an arbitrary encoding for the shapes of abbreviated characters (e.g., “a+_” for “a-macron,” that is, expanded am or an) has recently been deprecated: one spelling is often abbreviated by quite different characters, and the same abbreviation may be expanded into quite different spellings.[4] In any case, outside of early Latin and English dictionaries remaining from the early fifteenth century, most expansions are obvious to readers. We have benefited from EEBO/TCP editorial guidelines on special characters and from scholarly papers on Renaissance Greek ligatures and abbreviations. A diligent attempt has been made to reproduce Greek and Hebrew characters, but readers are advised that illegibility and LEME’s editorial unfamiliarity with these languages may have produced some odd results.
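One way to picture “emend lightly, but retain the erroneous form in a tag” is markup in which the corrected reading is the element text and the original printed form sits in an attribute. The tag and attribute names below are hypothetical, for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical emendation markup: the emended reading is the element
# text; the erroneous printed form is kept in an "err" attribute.
entry = ET.fromstring('<form err="hypen">hyphen</form>')
print("printed form:", entry.attrib["err"])  # printed form: hypen
print("emended form:", entry.text)           # emended form: hyphen
```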

We welcome all corrections.

Textual Commentary

In database lemenotes at the bottom of a word-entry, we registered English words antedating the earliest OED citation or “not found” in the OED. A “not found” lemenote does not mean that the questioned word-form is not in the OED; it only means that LEME researchers have not located it there. The database, unfortunately, does not give the date when we viewed the OED, and if the OED has added our suggested information to its word-entry after we viewed it, LEME will not know. The OED is in a state of constant updating and can be counted on to attend to readers’ suggestions. For this reason, we are remapping our LEME headwords to OED headwords. This step, while time-consuming, offers the best mechanism for comparing the vocabulary that Early Modern word-entries knew about with the considerably larger vocabulary that the OED has assembled. It also enables us to insert the date when we viewed, in a LEME text, either an antedating or a word that we have not found in the OED.


[1] It is important to be as explicit as one can in distinguishing an end-of-line hyphen as soft or hard.

[2] Tags in texts processed by the LEME 2.0 database are xml-like. The tagged entries can be viewed in the pop-up windows in which word-entries generated in response to a search request appear.

[3] Researchers can use a search function to locate these.

[4] The prospect of encoding abbreviations as documented in Cappelli is daunting, but groups are working on it. See Joel Fredell, Charles Borchers IV, and Terri Ilgen, “TEI P5 and Special Characters Outside Unicode,” Journal of the Text Encoding Initiative 4 (2013). https://journals.openedition.org/jtei/727

Ed. Ian Lancashire and Isabel Zhu, with contributions from Julia DaSilva, Paramita Dutta, Xueqi Fan, Sky Li, Kristie Lui, Annika Sparrell, Timothy Aberdingk Thijm, and Shirley Wang
© 2025 Ian Lancashire