From Database to LEME-XML

LEME maintains three separate encodings for each document in its database: an “XML-like” database encoding (henceforth “LEME-DB”), a strict XML-compliant encoding (“LEME-XML”), and a TEI-compliant XML encoding (“LEME-TEI”).

Ian Lancashire and Marc Plamondon produced the LEME-DB encoding in 2006 for the LEME 1.0 database, basing it on the Early Modern English Dictionaries Database encoding (1996). LEME-XML, by Timothy Aberdingk Thijm, is produced programmatically from LEME-DB and hence must remain in some sense “backwards compatible” with it. LEME-TEI, by Sky Li, is produced programmatically from LEME-XML. Yet LEME encoding is rooted less in the database than in the original structures of its old lexical texts.

Timothy Aberdingk Thijm describes the basics of the LEME encodings below. His catalogue of the valid XML tags of the intermediary LEME-XML encoding, together with their contents, appears in the appendix.

Basics

The primary unit of LEME’s texts is the word-entry, which represents a basic lexicographic unit of the text. Each word-entry has a form, which stores the headword and its lexemes, and an explanation, which stores the author’s elaboration of the form. I deliberately avoid calling the explanation a “definition”, because LEME’s texts tend to predate this distinction, and their word-entry styles are correspondingly diverse. Some dictionaries contain complex explanations of word-entries, extended parenthetical remarks, fused monolingual and bilingual word-entries, embedded etymological information, and obsolescent language theory. More discussion of what belongs in a form or an explanation can be found below.

Proceeding up from the word-entry, LEME identifies word-groups, which usually mark the lexicographic subsections of each dictionary (e.g., word-group “A”, word-group “Ab”). These word-groups fall into sections, which may contain remarks on the contents or text that does not resemble a word-entry. All sections belong to a LEME element which encloses the entire text and identifies it.
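The nesting described above can be illustrated with a short Python sketch that walks a small fragment from the enclosing element down to each word-entry. Only <wordgroup1> and <wordentry> are tag names taken from this document; the names <leme>, <section>, <form> and <explanation>, the @type value "alphabetical", and the dictionary content are all assumptions invented for illustration, and the appendix remains the authority on the actual tags.

```python
import xml.etree.ElementTree as ET

# Invented sample. <wordgroup1> and <wordentry> come from the text; the tags
# <leme>, <section>, <form>, <explanation> and the type value are assumed.
SAMPLE = """
<leme>
  <section>
    <wordgroup1 type="alphabetical" object="A">
      <wordentry>
        <form>Abacus</form>
        <explanation>a counting table</explanation>
      </wordentry>
      <wordentry>
        <form>Abbot</form>
        <explanation>the head of an abbey</explanation>
      </wordentry>
    </wordgroup1>
  </section>
</leme>
"""


def list_entries(xml_text):
    """Return (group label, form, explanation) for each word-entry."""
    root = ET.fromstring(xml_text)
    rows = []
    # Walk downward: enclosing element -> section -> word-group -> word-entry.
    for section in root.iter("section"):
        for group in section.iter("wordgroup1"):
            label = group.get("object", "")
            for entry in group.iter("wordentry"):
                form = entry.findtext("form", default="").strip()
                explanation = entry.findtext("explanation", default="").strip()
                rows.append((label, form, explanation))
    return rows


entries = list_entries(SAMPLE)
for label, form, explanation in entries:
    print(f"[{label}] {form}: {explanation}")
```

Each word-entry is reported with the label of the word-group that contains it, which mirrors the way a word-entry is located within the lexicographic subsections of a dictionary.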

Various minor textual elements occur throughout each of these units (word-entry, form, explanation, word-group, section, LEME). They indicate, for instance, damage, marginal notes (by the lexicographer or LEME personnel), font changes, foreign words, etymologies, and page breaks. In general LEME does not bind these elements to particular parts of the text, although exceptions exist when converting to LEME-TEI (see “LEME-TEI”, below).

LEME-XML Tags

Each XML tag used by LEME-XML is described here. This description is based on the edition of the LEME-XML RelaxNG schema current in August 2018.

Each element is identified by its tag name. Tag names are written in bold font within angle brackets, e.g. <root>. I include each tag’s possible XML attributes and their values. XML attributes and values are likewise written in bold font, with an “@” symbol before the attribute name and a colon before the description or list of values, e.g. @no: string. Optional attributes carry a question mark before the colon, e.g. @type?: string. I also include each tag’s possible child XML tags and a description of when and how the tag is used.

Children are divided into two groups: structural and textual. Structural tags indicate the structure of the lexicon; they are all referenced in the Basics section above. Textual tags are quite diverse and cover all remarks on the text, such as marking words as foreign, citations, damage, notes and so forth. The LEME-XML schema is generally more permissive than it needs to be about which children are allowed: LEME-DB’s specification may disallow a child that LEME-XML allows. This leaves LEME-XML some flexibility in case a LEME-DB file disobeys the standard. Hence, the list of children is permissive and shows every child that the schema accepts in the element. Whatever the schema accepts, consult the description for what should or should not go in the element.
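The division into structural and textual children can be sketched in Python as follows. The structural set is the one listed for <wordgroup1> in the annotated example further down; the textual set is only a partial selection from the same list. The sample fragment, including the @type value "alphabetical", is invented for illustration and is not drawn from any LEME text.

```python
import xml.etree.ElementTree as ET

# Structural children of <wordgroup1>, as listed in the annotated example.
STRUCTURAL = {"wordentry", "wordgroup2", "alpha", "heading"}
# A partial selection of the textual children from the same list.
TEXTUAL = {"br", "cit", "damage", "lemenote", "note", "pb", "term", "xref"}


def classify_children(element):
    """Sort the direct children of an element by the two groups."""
    groups = {"structural": [], "textual": [], "other": []}
    for child in element:
        if child.tag in STRUCTURAL:
            groups["structural"].append(child.tag)
        elif child.tag in TEXTUAL:
            groups["textual"].append(child.tag)
        else:
            # Not in either list used here; a real check would consult the schema.
            groups["other"].append(child.tag)
    return groups


# Invented fragment; the type value is a hypothetical grouptype.
SAMPLE = (
    '<wordgroup1 type="alphabetical">'
    "<heading>A</heading><pb/>"
    "<wordentry/><note>an invented note</note><wordentry/>"
    "</wordgroup1>"
)
result = classify_children(ET.fromstring(SAMPLE))
print(result)
```

A listing of this kind only reports what an element contains. Whether a given child ought to appear there is a question for the element’s description, since the schema accepts more than the description recommends.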

Some tags are deprecated. They may still appear in older documents, but their use should generally be avoided. Where relevant, the entry for a deprecated tag indicates which other tag is preferred. When referred to by tag name, deprecated tags are written in bold font and italics, e.g. <set>.

If an attribute only accepts a particular set of values, these will be listed after the elements and identified by an alias such as “entrytype” or “langstr”. Where the term string is used, it means any sequence of Unicode characters, enclosed by quotation marks.

Below is an annotated example.

Wordgroup1 (the name of the tag, with the first letter capitalized)

Attributes: @type: grouptype, @lang?: langstr, @object?: string (the list of attributes and their allowed values)

Children: (the possible children of this element; any child listed may appear in any order)

Structural: <wordentry> <wordgroup2> <alpha> <heading>
Textual: <br> <blockquote> <cit> <class> <damage> <editoraddition> <emend> <expan> <expression> <etym> <etymlang> <f> <hungword> <i> <infl> <lemeformat> <lemenote> <lemepagenote> <note> <ornament> <p> <pb> <sic> <term> <xref>

The <wordgroup1> element surrounds a section of the dictionary such as words beginning with the letter “A”. It has a @type attribute which specifies what grouping is being made (see grouptype). The element may also have an optional @lang attribute to specify the language of the enclosed content, and an optional @object attribute which typically contains an editorially-spelled uppercase form of the group’s header, such as “A”…. (The description of the element)
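The attribute notation used in these entries is regular enough to read mechanically. The following Python sketch parses an attribute list such as the one shown for <wordgroup1> and checks an element against it. It is an illustration only, not part of the LEME toolchain: the function names are hypothetical, the attribute values in the samples ("alphabetical", "en") are invented, and the check covers only the presence of attributes, not whether their values belong to aliases such as grouptype or langstr.

```python
import re
import xml.etree.ElementTree as ET

# "@name: alias" is a required attribute; "@name?: alias" is an optional one.
ATTR_PATTERN = re.compile(r"@(\w+)(\?)?:\s*(\w+)")


def parse_attribute_spec(spec):
    """Map each attribute name to (value alias, is_optional)."""
    return {
        name: (alias, marker == "?")
        for name, marker, alias in ATTR_PATTERN.findall(spec)
    }


def check_attributes(element, spec):
    """List required attributes that are missing and attributes not in the spec."""
    problems = []
    for name, (_alias, optional) in spec.items():
        if not optional and name not in element.attrib:
            problems.append(f"missing required @{name}")
    for name in element.attrib:
        if name not in spec:
            problems.append(f"unexpected @{name}")
    return problems


# The attribute list given for <wordgroup1> in the annotated example.
SPEC = parse_attribute_spec("@type: grouptype, @lang?: langstr, @object?: string")

# Invented elements; the attribute values are hypothetical.
good = ET.fromstring('<wordgroup1 type="alphabetical" object="A"/>')
bad = ET.fromstring('<wordgroup1 lang="en" no="3"/>')

print(check_attributes(good, SPEC))
print(check_attributes(bad, SPEC))
```

The first element carries the required @type and one optional attribute, so no problems are reported. The second lacks @type and carries an @no attribute that the list for <wordgroup1> does not mention, so both are reported.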

Notes on TEI

The LEME-TEI conversion system produces a valid TEI-encoded text from a LEME-XML-encoded one. LEME-TEI texts do not necessarily use all of the most common TEI conventions, and their tagging may in fact seem sparse compared to standard TEI documents. This sparseness is a deliberate compromise between LEME’s structure and TEI’s. The LEME-TEI encoding uses the schema of the TEI P5 Guidelines, and makes use of the core, dictionaries, figures, header, linking and textstructure modules as provided through the Roma web tool.