The resource is the assessment data of The Dictionary of Modern Lithuanian, 6th edition (DML6) [2], from the point of view of its coverage in the Joint Corpus of Lithuanian (JCL) [1]. The JCL is a merge of three corpora: 1) the Vilnius University corpus, compiled from Lithuanian internet content from 2014 and primarily used for machine translation; 2) a legal document corpus in the form of a wordlist (courtesy of the Office of the Seimas of the Republic of Lithuania, 2011); and 3) the balanced corpus of present-day Lithuanian of Vytautas Magnus University (VMU). The total size of the JCL is more than 1.3 billion tokens.
The resource consists of 5 files.
1. Frequency list of types (different tokens) in JCL versus DML6.
Columns: type, count, occurrence_in_dml6 (0 – no, 1 – main entries, 2 – geographic names, 3 – abbreviations).
2. List of explicit lemmas in DML6 versus JCL.
Columns: lemma, part_of_speech, occurrence_in_JCL.
Possible part_of_speech values: N – noun, V – verb, A – adjective, P – pronoun, R – adverb, S – preposition, C – conjunction, M – numeral, Q – particle, I – interjection, O – onomatopoeia, Y – abbreviation.
occurrence_in_JCL is the count of all tokens in JCL that can be interpreted as a word-form of the particular lemma.
3. Hunspell affixes (inflection rules) for the Lithuanian language.
4. Hunspell dictionary, constructed from both explicit and implicit DML6 lemmas.
5. List of 254,726 word-forms of JCL that are missing in DML6, after filtering out misspellings, foreign words, proper names, etc.
Columns: type, count.
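The column layouts described above lend themselves to simple processing. The following Python sketch shows one possible way to compute the share of JCL tokens covered by DML6 from the frequency list (file 1) and to list DML6 lemmas that never occur in JCL (file 2). It makes several assumptions that are not stated in this description: the files are taken to be tab-separated without a header row, the sample rows are invented for illustration and are not taken from the resource, and the function names are hypothetical.

```python
import csv
import io
from collections import Counter

# Invented sample rows for illustration only; the real files are far larger.
# The tab delimiter is an assumption, not something the description states.
TYPES_SAMPLE = "ir\t1000\t1\nVilnius\t200\t2\nkt\t50\t3\nxyzq\t5\t0\n"
LEMMAS_SAMPLE = "namas\tN\t300\neiti\tV\t120\nabrakadabra\tN\t0\n"

# Meaning of occurrence_in_dml6, as given in the description of file 1.
DML6_STATUS = {0: "not in DML6", 1: "main entry",
               2: "geographic name", 3: "abbreviation"}


def read_rows(text, delimiter="\t"):
    """Yield the non-empty rows of a delimited text as lists of strings."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row


def token_coverage(types_text):
    """Return (share of JCL tokens whose type is in DML6, tokens per status)."""
    per_status = Counter()
    for _type, count, status in read_rows(types_text):
        per_status[int(status)] += int(count)
    total = sum(per_status.values())
    covered = total - per_status[0]  # status 0 means absent from DML6
    return (covered / total if total else 0.0), per_status


def unattested_lemmas(lemmas_text):
    """Return DML6 lemmas with occurrence_in_JCL == 0, grouped by part of speech."""
    by_pos = {}
    for lemma, pos, occurrences in read_rows(lemmas_text):
        if int(occurrences) == 0:
            by_pos.setdefault(pos, []).append(lemma)
    return by_pos


coverage, per_status = token_coverage(TYPES_SAMPLE)
print(f"token coverage: {coverage:.3f}")
for status, tokens in sorted(per_status.items()):
    print(f"  {DML6_STATUS[status]}: {tokens}")
print(unattested_lemmas(LEMMAS_SAMPLE))
```

To use the sketch on the actual files, the sample strings would be replaced by the file contents (read with UTF-8 encoding, which is also an assumption). The same read_rows helper should work for file 5, which has only the type and count columns.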
References
[1] Dadurkevičius, V., Petrauskaitė, R. 2020: Corpus based methods for assessment of the traditional dictionaries. Human language technologies - the Baltic perspective: the 9th international conference Baltic HLT, Kaunas, Lithuania, September 22–23, 2020.
[2] The Dictionary of Modern Lithuanian. Edited by Keinys S. 6th (3rd electronic) edition of the Dabartinės lietuvių kalbos žodynas. 2006, ISBN 978-9955-704-37-9.