Ontology of topics for Slovenian as a second and foreign language ONTEM 1.0

Dataset

ONTEM 1.0 comprises 1,019 manually prepared entries. Each entry gives the lemma, the part of speech (following the MULTEXT-East tagset for Slovenian, https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), the CEFR level (based on the Core vocabulary for Slovenian as L2, organised by levels A1, A2, and B1; http://hdl.handle.net/11356/1697), and a confirmation of the CEFR level (based on expert validation). Each entry also carries metadata: the semantic categorisation, with a detailed description of each semantic category (metatopic, topic, and subtopic), and the source of the word.

The words are classified into up to three levels of hierarchically organised semantic categories: 12 top-level categories (metatopics) and 23 topics, the latter further divided into 29 subtopics. All categories are described in more detail in the provided README file.

The words in ONTEM 1.0 were sourced from the KUUS corpus (http://hdl.handle.net/11356/1696), which comprises 17 textbooks for Slovenian as a Second and Foreign Language and contains 520,796 words. From this corpus, 1,019 semantically and thematically diverse words were manually selected to represent different parts of speech and CEFR levels. The selection focuses primarily on A1 and A2 textbook vocabulary but also includes higher-level words, so that the hierarchically structured system is robust and can be expanded in the future.

The ontology will be integrated into the Dictionary for Speakers of Slovene as a Second and Foreign Language (SLOGOST; https://lexonomy.cjvt.si/slovar-za-govorce-slovenscine-kot-drugega-in-tujega-jezika/). The dataset is available in CSV format, accompanied by a README document that describes its contents in more detail.
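The entry structure described above can be explored with a few lines of Python. The following is only a minimal sketch: the column names (`lemma`, `pos`, `cefr_level`, `metatopic`, `topic`, `subtopic`) and the placeholder rows are assumptions made for illustration, not the actual header or content of the ONTEM 1.0 file. The README shipped with the dataset is the authoritative description of the columns.

```python
import csv
import io
from collections import Counter

# Hypothetical column names and invented placeholder rows (NOT real ONTEM data).
# Replace them with the header documented in the dataset's README.
SAMPLE_CSV = """lemma,pos,cefr_level,metatopic,topic,subtopic
word_a,N,A1,metatopic_1,topic_1,subtopic_1
word_b,V,A1,metatopic_1,topic_1,
word_c,N,A2,metatopic_2,,
"""


def load_entries(handle):
    """Read all rows from an open CSV file object into a list of dicts."""
    return list(csv.DictReader(handle))


def count_by(entries, column):
    """Count how many entries share each value of the given column."""
    return Counter(row[column] for row in entries)


def category_path(row):
    """Return the non-empty categories of an entry, from metatopic down.

    An entry is classified into up to three levels, so the lower levels
    may be empty.
    """
    levels = (row["metatopic"], row["topic"], row["subtopic"])
    return [level for level in levels if level]


entries = load_entries(io.StringIO(SAMPLE_CSV))
levels = count_by(entries, "cefr_level")
paths = [category_path(row) for row in entries]

for row, path in zip(entries, paths):
    print(row["lemma"], row["cefr_level"], " > ".join(path))
```

To run the same functions on the downloaded file, pass an open file object instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`, where `path` points to the CSV file from the repository.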

Identifier
PID	http://hdl.handle.net/11356/2069
Related Identifier	https://www.clarin.si/info/services/projects/#Ontology_of_topics_for_Slovenian_as_a_Second_and_Foreign_Language
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/2069

Provenance
Creator	Pori, Eva; Knez, Mihaela; Klemen, Matej; Jerman, Tanja
Publisher	Centre for Slovene as a Second and Foreign Language, University of Ljubljana
Publication Year	2025
Rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); PUB; https://creativecommons.org/licenses/by-nc-sa/4.0/
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene; English
Resource Type	lexicalConceptualResource
Format	text/plain; charset=utf-8; text/csv; text/plain; downloadable_files_count: 2
Discipline	Linguistics