This method helps identify whether two English texts have a source-influence relationship. It is illustrated by data and visualization tools for about twenty dictionaries in the first part of the distributed corpus. It can also be used to analyze relationships among other dictionaries formatted as the LEME texts are.
My goal was to determine source relations between early modern English dictionaries and word listings using texts from the LEME database.
With early modern English dictionaries, it is often unclear whether a lexicographer borrowed from previous authors, so we must look for alternative ways to determine whether one dictionary acted as a source for another. In this case, source relations were determined computationally, using 8 primary metrics (1 more planned). The metrics were divided into two major categories, explanation similarity and headword similarity, with 5 explanation metrics and 4 headword metrics (1 more planned). Each metric is classified by whether it compares the explanations of a dictionary or solely its headwords. This distinction is important, since many of the source dictionaries simply do not have explanations associated with each headword (e.g. Mulcaster’s The First Part of the Elementary).
The full results are available to download online (website link pending).
Visualization
A visualization of the results can be found online (link pending). The results are visualized in the form of a source network. Each box/node in the graph represents a single dictionary, whose name can be seen by hovering over it. Each connection represents a possible source relationship, and can be hovered over to view the exact similarity score. You can click on a box to highlight its individual connections, and you can also click and drag a box to separate dictionaries that sit close together. For each metric, the tabs let you filter out connections based on their strength. For example, clicking on the 50% tab will remove all connections between dictionaries except for those that have a similarity score above 50%.
The dictionaries are positioned vertically by publication date, so that the oldest dictionaries are at the top of the network and the youngest are at the bottom. Their horizontal position is determined by a force-directed layout. The visualization is built using d3.js, a JavaScript library for in-browser data visualization, and is served by a Python server using Flask.
The visualization provides a useful tool to directly compare the similarities of various pairs of dictionaries because of the ability to filter out connections under a certain threshold. Because I refrained from setting my own threshold score to definitively determine whether or not one dictionary used another as a source, this tool allows you to view potential source connections at varying degrees of similarity.


Results
Among the dictionaries that I analyzed, the connections that I found most suggestive were:
Pairs with the same author
- The later edition of John Bullokar’s English Expositor (1621) reused much of his earlier edition (1616), with an aggregate explanation similarity score of 85.6%, the highest of any pair of dictionaries.
- Similarly, the later edition of Robert Cawdrey’s A Table Alphabetical (1617) reused much of his earlier edition (1604), with an aggregate explanation similarity score of 70.0%.
Highly Suggestive
- Henry Cockeram’s English Dictionary (1623) likely used a significant number of explanations from Bullokar’s English Expositor (1621), with an aggregate explanation score of 45.2% and an aggregate headword score of 41.8%. The same relationship also exists for Bullokar’s earlier edition of English Expositor (1616).
Suggestive
- Robert Cawdrey’s first edition of A Table Alphabetical (1604) likely used a number of explanations from Edmund Coote’s The English School-master (1596), with an aggregate explanation score of 31.0% and an aggregate headword score of 30.2%.
- Both Timothy Bright’s Charactery (1588) and John Evans’ The Palace of Profitable Pleasure (1621) used headwords from Richard Mulcaster’s The First Part of the Elementary (1582), with aggregate headword scores of 37.7% and 30.1% respectively.
- Cawdrey’s second edition of A Table Alphabetical (1617) likely used explanations from Bullokar’s first edition of English Expositor (1616), with an aggregate explanation score of 26.8% and an aggregate headword score of 34.4%.
- Thomas Blount’s Glossographia (1656) likely provided some explanations for Edward Phillips’ The New World of English Words (1658), with an aggregate explanation score of 26.6% and an aggregate headword score of 30.5%.
Plausible
- Both editions of Cawdrey’s A Table Alphabetical (1604/1617) seem to have influenced Bullokar’s editions of English Expositor, with aggregate explanation scores of 20-25%, and aggregate headword scores of 20-30%.
- Both editions of Bullokar’s English Expositor (1616/1621) seem to have influenced the explanations of Blount’s Glossographia (1656), with aggregate explanation scores of 21.2% and 21.4% respectively, and aggregate headword scores of 10.8% and 11.2% respectively.
- Blount’s Glossographia (1656) was potentially influenced by John Cowell’s The Interpreter (1607), with an aggregate explanation score of 22.5%.
While these are not all of the potential source relations, they are the ones I found most significant and interesting.
Note that dictionaries that have not yet been lemmatized (so that each headword is associated with a modern English word) will almost invariably have low similarity scores, since it becomes impossible to compare dictionaries if the headwords cannot be matched. This problem only occurs with Minsheu’s Vocabularium Hispanicolatinum and Coles’ An English Dictionary at the moment.
Also, when applicable, similarity between two dictionaries is always given as the percentage of the younger dictionary that is similar to the older dictionary. This is because my primary purpose is to determine source relations: since a dictionary cannot draw on a source from the future, it is much more relevant to measure how much of the younger dictionary could have been taken from older sources.
Metrics
Explanation Metrics
Explanation content words (entry)
For this similarity metric, two explanations of the same headword are considered similar if they share either two content words (not function words), or one content word and a function word. Thus, the similarity score (as a percentage) between two dictionaries can be determined by taking the total number of similar explanations between the two dictionaries and dividing by the total number of word entries in the younger dictionary.
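The rule above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's actual code: the function-word list is a small stand-in (the real list is not given here), each dictionary is modelled as a mapping from headword to a single explanation, and all function names are hypothetical.

```python
import re

# Assumption: a tiny illustrative function-word list; the real list is not specified.
FUNCTION_WORDS = {"a", "an", "the", "to", "of", "and", "or", "in", "is",
                  "that", "it", "as", "for", "with", "by", "on"}

def tokens(text):
    """Lowercase the text and return its set of unique alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def explanations_similar(older_expl, younger_expl):
    """True if the explanations share two content words,
    or one content word and one function word."""
    shared = tokens(older_expl) & tokens(younger_expl)
    content = shared - FUNCTION_WORDS
    function = shared & FUNCTION_WORDS
    return len(content) >= 2 or (len(content) >= 1 and len(function) >= 1)

def entry_similarity(older, younger):
    """Similar explanations divided by the number of entries in the younger dictionary."""
    if not younger:
        return 0.0
    similar = sum(
        1 for headword, expl in younger.items()
        if headword in older and explanations_similar(older[headword], expl)
    )
    return similar / len(younger)

bullokar = "To estrange and withdraw the minde, sometime to sell."
cockeram = "To estrange ones selfe."
# Shared words: "to" (function) and "estrange" (content).
print(explanations_similar(bullokar, cockeram))  # True
```

With the alienate explanations from Bullokar and Cockeram, the pair counts as similar because it shares one content word and one function word.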
First two words (first)
For this similarity metric, two explanations of the same headword are considered similar if they share the first two words. Thus, the similarity score (as a percentage) between two dictionaries can be determined by taking the total number of similar explanations between the two dictionaries and dividing by the total number of word entries in the younger dictionary.
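A minimal Python sketch of the first-two-words comparison is given here. It assumes that words are compared case-insensitively with punctuation ignored, and that each dictionary is a mapping from headword to one explanation; neither detail is confirmed by the description, and the function names are hypothetical.

```python
import re

def first_two(text):
    """Return the first two lowercase words of an explanation as a tuple."""
    return tuple(re.findall(r"[a-z]+", text.lower())[:2])

def first_two_similar(older_expl, younger_expl):
    """True if both explanations have at least two words and the first two match."""
    a, b = first_two(older_expl), first_two(younger_expl)
    return len(a) == 2 and a == b

def first_two_similarity(older, younger):
    """Matching explanations divided by the number of entries in the younger dictionary."""
    if not younger:
        return 0.0
    matches = sum(
        1 for headword, expl in younger.items()
        if headword in older and first_two_similar(older[headword], expl)
    )
    return matches / len(younger)

print(first_two_similar(
    "To estrange and withdraw the minde, sometime to sell.",
    "To estrange ones selfe.",
))  # True
```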
Jaccard Similarity
For two explanations of the same headword, the Jaccard similarity of the two explanations is the number of (unique) shared words, both content and function, in the two entries divided by the total number of (unique) words between the two entries.
For example, consider the two explanations of alienate (v) from Bullokar and Cockeram, respectively:
- (Bullokar): To estrange and withdraw the minde, sometime to sell.
- (Cockeram): To estrange ones selfe.
There are 10 unique words, with two shared words (To estrange), so the Jaccard similarity of these two entries would be 0.2 (20%).
The Jaccard similarity of two dictionaries is the average Jaccard similarity of all common headword entries in the younger dictionary. All other entries are ignored.
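The calculation can be reproduced with a short Python sketch. It assumes case-insensitive matching with punctuation stripped, and models each dictionary as a mapping from headword to one explanation; the function names are hypothetical.

```python
import re

def word_set(text):
    """Lowercase the text and return its set of unique alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(expl_a, expl_b):
    """Shared unique words divided by all unique words across both explanations."""
    a, b = word_set(expl_a), word_set(expl_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dictionary_jaccard(older, younger):
    """Average Jaccard similarity over headwords present in both dictionaries."""
    common = [headword for headword in younger if headword in older]
    if not common:
        return 0.0
    return sum(jaccard(older[h], younger[h]) for h in common) / len(common)

bullokar = "To estrange and withdraw the minde, sometime to sell."
cockeram = "To estrange ones selfe."
print(jaccard(bullokar, cockeram))  # 0.2
```

Headwords that appear in only one of the two dictionaries never enter the average, which matches the rule that all other entries are ignored.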
More information on the mathematics behind Jaccard similarity can be found here.
Cosine similarity
For two entries of the same headword, cosine similarity is calculated by converting each entry into a list of numbers by counting the number of times each word appears in the entry (the Bag of Words strategy), then taking the cosine of the angle between the two lists of numbers.
Each list of numbers can be conceptualized as a multidimensional vector, and so as the entries get more similar, the angle between their corresponding vectors decreases, and so the cosine of that angle increases.
For example, the sentences “I walked down the street” and “The boy sprinted up the street” could be transformed into the word vectors:
- “I walked down the street” : [1, 1, 1, 1, 1, 0, 0, 0], meaning it contains one “I”, one “walk”, one “down”, one “the”, and one “street”, with 0 of “boy”, “sprint”, or “up”.
- “The boy sprinted up the street”: [0, 0, 0, 2, 1, 1, 1, 1], meaning it contains 0 of “I”, “walk” or “down”, 2 “the”’s, and one each of “street”, “boy”, “sprint”, “up”.
The cosine similarity of these two sentences would be ~0.474 (47.4%).
The cosine similarity of two dictionaries is the average cosine similarity of all common headword entries in the younger dictionary. All other entries are ignored.
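The street example can be checked with a small Python sketch. It assumes the words have already been lowercased and reduced to base forms (as in "walk" and "sprint" above); that preprocessing is not shown, and the function name is hypothetical.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both entries contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

first = ["i", "walk", "down", "the", "street"]
second = ["the", "boy", "sprint", "up", "the", "street"]
print(round(cosine(first, second), 3))  # 0.474
```

Here the dot product is 3 (2 from "the", 1 from "street"), and the vector lengths are the square roots of 5 and 8, giving 3 divided by the square root of 40.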
More information on the mathematics behind Cosine similarity can be found here.
Aggregate Explanation Similarity
The aggregate explanation similarity between two dictionaries is taken by simply averaging all other explanation similarity metrics between the two dictionaries.
Headword Similarity
Basic Headword Similarity
The basic headword similarity between two dictionaries is the percentage of headwords in the younger dictionary that are also present in the older dictionary.
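As a minimal Python sketch, assuming each dictionary is available as a list of (lemmatized) headwords; the headword lists below are hypothetical and serve only as an illustration:

```python
def basic_headword_similarity(older_headwords, younger_headwords):
    """Fraction of the younger dictionary's headwords also found in the older one."""
    if not younger_headwords:
        return 0.0
    older = set(older_headwords)
    shared = sum(1 for headword in younger_headwords if headword in older)
    return shared / len(younger_headwords)

# Hypothetical headword lists, not taken from any real dictionary.
older = ["abandon", "abash", "abate"]
younger = ["abash", "abate", "abbot", "abet"]
print(basic_headword_similarity(older, younger))  # 0.5
```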
Consecutive Headword Similarity
A run of headwords is simply a series of consecutive headwords. For consecutive headword similarity between two dictionaries, a shared run is any run that appears in both dictionaries. Thus, the consecutive headword similarity score between two dictionaries is simply the percentage of headwords in the younger dictionary that are part of a shared run.
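One possible way to compute this is sketched below in Python. The minimum run length is not stated above, so the sketch assumes a shared run must contain at least two headwords in the same order in both dictionaries; under that assumption a headword is in a shared run exactly when it forms an adjacent pair that also occurs adjacently in the older dictionary. The headword lists are hypothetical.

```python
def consecutive_headword_similarity(older_headwords, younger_headwords):
    """Fraction of younger headwords that belong to a shared run of length >= 2."""
    if not younger_headwords:
        return 0.0
    # Every adjacent pair of headwords in the older dictionary.
    older_pairs = set(zip(older_headwords, older_headwords[1:]))
    covered = [False] * len(younger_headwords)
    for i in range(len(younger_headwords) - 1):
        if (younger_headwords[i], younger_headwords[i + 1]) in older_pairs:
            covered[i] = covered[i + 1] = True
    return sum(covered) / len(younger_headwords)

# Hypothetical headword lists, not taken from any real dictionary.
older = ["abandon", "abash", "abate", "abbot"]
younger = ["abash", "abate", "abet", "abbot"]
# Only the run "abash, abate" is shared, so 2 of 4 headwords are covered.
print(consecutive_headword_similarity(older, younger))  # 0.5
```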
Nonconsecutive Headword Similarity
This is the same as consecutive headword similarity, except that words in a shared run do not have to occur consecutively in the younger dictionary in this metric.
(PLANNED) Grouping Headword Similarity
This is the same as consecutive headword similarity, except that words in a shared run do not have to occur in the same order (but they still have to be consecutive).
Aggregate Headword Similarity
The aggregate headword similarity between two dictionaries is taken by simply averaging all other headword similarity metrics between the two dictionaries.
Raw Data
The raw results are available online (link pending), and are contained in a zip file with the following structure:
- scores/ – a directory that contains spreadsheets of similarity scores for 8 different scoring methods.
- jsons/ – a directory that contains the generated JSON files that describe the source network connections made solely for visualization purposes. JSON files are a type of data storage file that store data as a series of key-value pairs. In this case, a list of nodes (dictionaries) is stored, along with a list of the source connections between all pairs of dictionaries, with their similarity scores.
- paired/ – a directory that contains a direct comparison of all entries between two dictionaries based on two specific scoring methods, explanation content word similarity and first two words similarity.

- oed_similarity.xlsx – an Excel table of all unique words from the Oxford English Dictionary as of 2015 and source dictionaries, along with the positions at which each word appears (if applicable) in each dictionary.
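To give a sense of how a network file of the kind kept in jsons/ might be used, here is a hedged Python sketch that filters connections by a similarity threshold, as the tabs in the visualization do. The key names ("nodes", "links", "source", "target", "score", "id", "year") are assumptions for illustration; the actual files may be laid out differently. The two scores are the aggregate explanation scores reported in the results above.

```python
import json

# Assumed layout: a list of nodes (dictionaries) and a list of scored connections.
network = {
    "nodes": [
        {"id": "Bullokar 1616", "year": 1616},
        {"id": "Bullokar 1621", "year": 1621},
        {"id": "Cockeram 1623", "year": 1623},
    ],
    "links": [
        {"source": "Bullokar 1616", "target": "Bullokar 1621", "score": 0.856},
        {"source": "Bullokar 1621", "target": "Cockeram 1623", "score": 0.452},
    ],
}

def filter_links(network, threshold):
    """Keep every node, but only the connections scoring above the threshold."""
    return {
        "nodes": network["nodes"],
        "links": [link for link in network["links"] if link["score"] > threshold],
    }

strong = filter_links(network, 0.5)
print(json.dumps(strong["links"], indent=2))
```

At a 50% threshold only the connection between the two Bullokar editions remains.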

Dictionary List
- An Exposition of Certain Difficult and Obscure Words (William and John Rastell, 1579)
- The First Part of the Elementary (Richard Mulcaster, 1582)
- Charactery: An Art of Short, Swift, and Secret Writing by Character (Timothy Bright, 1588)
- The English School-master (Edmund Coote, 1596)
- A Table Alphabetical (Robert Cawdrey, 1604)
- The Interpreter: or Book Containing the Signification of Words (John Cowell, 1607)
- An English Expositor (John Bullokar, 1616)
- A Table Alphabetical (Robert Cawdrey, 1617)
- Vocabularium Hispanicolatinum (John Minsheu, 1617)
- English Expositor (John Bullokar, 1621)
- The Palace of Profitable Pleasure (John Evans, 1621)
- English Dictionary (Henry Cockeram, 1623)
- Glossographia or a Dictionary (Thomas Blount, 1656)
- The New World of English Words (Edward Phillips, 1658)
- An English Dictionary (Elisha Coles, 1677)
- Gazophylacium Anglicanum (Stephen Skinner, 1689)
- A New English Dictionary (John Kersey the younger, 1702)