ONTEM 1.0 comprises 1,019 manually prepared entries. Each entry records the lemma, its part of speech (following the MULTEXT-East tagset for Slovenian, https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), its CEFR level (based on the Core vocabulary for Slovenian as L2, organized by levels A1, A2, and B1; http://hdl.handle.net/11356/1697), and a confirmation of that level based on expert validation. Each entry also carries metadata: the semantic categorization (metatopic, topic, and subtopic), with a detailed description of each semantic category, and the source of the word.
The words are classified into up to three hierarchically organized levels of semantic categories: 12 top-level categories (metatopics), 23 topics, and 29 subtopics into which the topics are further divided. All categories are described in more detail in the provided README file. The words in ONTEM 1.0 were sourced from the KUUS corpus (http://hdl.handle.net/11356/1696), which comprises 17 textbooks for Slovenian as a Second and Foreign Language and contains 520,796 words. From this corpus, 1,019 semantically and thematically diverse words were manually selected to represent different parts of speech and CEFR levels. The selection focuses primarily on A1 and A2 textbook vocabulary but also includes higher-level words, so that the hierarchically structured system is robust and has potential for future expansion.
The ontology will be integrated into the Dictionary for Speakers of Slovene as a Second and Foreign Language – SLOGOST (https://lexonomy.cjvt.si/slovar-za-govorce-slovenscine-kot-drugega-in-tujega-jezika/). The dataset is available in CSV format, accompanied by a README document that describes its contents in more detail.
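
As a rough illustration of how the CSV could be processed, the Python sketch below reads entries, builds each word's category path (metatopic, then topic, then subtopic, where present), and counts entries per CEFR level. The column names (lemma, pos, cefr, metatopic, topic, subtopic) and the sample rows are assumptions made for this example, not the actual header or content of the dataset; the README gives the real column layout.

```python
import csv
import io
from collections import Counter

# Invented placeholder rows. The column names are assumptions for this
# sketch, not the real header of the ONTEM 1.0 CSV file.
SAMPLE_CSV = """lemma,pos,cefr,metatopic,topic,subtopic
lemma_a,N,A1,metatopic_1,topic_1,subtopic_1
lemma_b,V,A2,metatopic_1,topic_1,
lemma_c,A,B1,metatopic_2,,
"""


def read_entries(handle):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


def category_path(entry):
    """Return the non-empty category levels, from metatopic down to subtopic."""
    levels = (entry["metatopic"], entry["topic"], entry["subtopic"])
    return tuple(level for level in levels if level)


def count_by_cefr(entries):
    """Count how many entries carry each CEFR level."""
    return Counter(entry["cefr"] for entry in entries)


entries = read_entries(io.StringIO(SAMPLE_CSV))
for entry in entries:
    print(entry["lemma"], entry["cefr"], " > ".join(category_path(entry)))
print(count_by_cefr(entries))
```

To run this on the real file, the in-memory sample would be replaced by an opened file handle (for example open(path, newline="", encoding="utf-8")), and the keys would be adjusted to match the actual header.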