Assessment Data of the Dictionary of Modern Lithuanian versus Joint Corpora

The resource is the assessment data of The Dictionary of Modern Lithuanian, 6th edition (DML6) [2], from the point of view of its coverage in the Joint Corpus of Lithuanian (JCL) [1]. The JCL is a merger of three corpora: 1) the Vilnius University corpus, compiled from Lithuanian internet content from 2014 and primarily used for machine translation; 2) a legal document corpus in the form of a wordlist (courtesy of the Office of the Seimas of the Republic of Lithuania, 2011); and 3) the balanced corpus of present-day Lithuanian of Vytautas Magnus University (VMU). The total size of the JCL is more than 1.3 billion tokens.

The resource consists of 5 files.

1. Frequency list of types (distinct tokens) in JCL versus DML6. Columns: type, count, occurrence_in_dml6 (0 – no, 1 – main entries, 2 – geographic names, 3 – abbreviations).
2. List of explicit lemmas in DML6 versus JCL. Columns: lemma, part_of_speech, occurrence_in_JCL. Possible part_of_speech values: N – noun, V – verb, A – adjective, P – pronoun, R – adverb, S – preposition, C – conjunction, M – numeral, Q – particle, I – interjection, O – onomatopoeia, Y – abbreviation. occurrence_in_JCL is the count of all tokens in JCL that can be interpreted as a word form of the particular lemma.
3. Hunspell affixes (inflection rules) for the Lithuanian language.
4. Hunspell dictionary, constructed from both explicit and implicit DML6 lemmas.
5. Filtered list (misspellings, foreign words, proper names, etc. excluded) of 254726 JCL word forms that are missing from DML6. Columns: type, count.
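As an illustration of how file 1 might be used, the following minimal Python sketch sums token counts per occurrence_in_dml6 code and derives a token-level coverage figure. It rests on assumptions that the description does not confirm: that the file is tab-separated, has one type per row, and carries the three columns in the order type, count, occurrence_in_dml6. The function names and the sample rows are hypothetical and are not taken from the resource.

```python
import csv
import io
from collections import Counter

# Category codes for occurrence_in_dml6, as listed in the description of file 1.
DML6_CATEGORIES = {
    0: "not in DML6",
    1: "main entries",
    2: "geographic names",
    3: "abbreviations",
}


def coverage_by_category(lines, delimiter="\t"):
    """Sum token counts per occurrence_in_dml6 code.

    Assumption: each row is (type, count, occurrence_in_dml6) and the
    delimiter is a tab; adjust `delimiter` if the real file differs.
    """
    totals = Counter()
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank rows, malformed rows, and a possible header row.
        if len(row) != 3 or not row[1].isdigit() or not row[2].isdigit():
            continue
        _type, count, code = row
        totals[int(code)] += int(count)
    return totals


def token_coverage(totals):
    """Share of tokens whose type occurs in DML6 in any category (codes 1-3)."""
    all_tokens = sum(totals.values())
    covered = all_tokens - totals.get(0, 0)
    return covered / all_tokens if all_tokens else 0.0


# Made-up sample rows for demonstration only.
SAMPLE = "ir\t100\t1\nVilnius\t40\t2\nxyzq\t10\t0\nt.y.\t50\t3\n"
sample_totals = coverage_by_category(io.StringIO(SAMPLE))
print({DML6_CATEGORIES[k]: v for k, v in sorted(sample_totals.items())})
print(f"token coverage: {token_coverage(sample_totals):.2%}")
```

For the real data, the same function could be given an open file handle instead of the in-memory sample, for example `open(path, encoding="utf-8", newline="")`, where `path` is wherever the downloaded archive was unpacked.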

Literature

[1] Dadurkevičius, V., Petrauskaitė, R. 2020: Corpus based methods for assessment of the traditional dictionaries. Human language technologies - the Baltic perspective: the 9th international conference Baltic HLT, Kaunas, Lithuania, September 22–23, 2020.
[2] The Dictionary of Modern Lithuanian. Edited by Keinys S. 6th (3rd electronic) edition of the Dabartinės lietuvių kalbos žodynas. 2006, ISBN 978-9955-704-37-9.

Identifier
PID http://hdl.handle.net/20.500.11821/36
Metadata Access https://clarin.vdu.lt/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:clarin.vdu.lt:20.500.11821/36
Provenance
Creator Dadurkevičius, Virginijus
Publisher Vilnius university; Vytautas Magnus University
Publication Year 2020
Rights PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT; https://clarin.vdu.lt/licenses/eula/PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT.htm; PUB
OpenAccess true
Contact info(at)clarin.vdu.lt
Representation
Language Lithuanian
Resource Type lexicalConceptualResource
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics