Corpus of textbooks for learning Slovenian as L2 KUUS 2.0

The KUUS corpus comprises 17 textbooks and 7 workbooks (over 700,000 words) for Slovenian as a second and foreign language. They were published between 2002 and 2023 at the Centre for Slovene as a Second and Foreign Language (Faculty of Arts, University of Ljubljana) and, at the time the corpus was created, were widely used for teaching Slovenian as a second and foreign language to children, adolescents and adults in Slovenia and abroad. The metadata for each text includes its title, subtitle, authors, year of publication, publisher, CEFR level, target group and, for the textbooks, the estimated number of lesson hours.

The corpus is linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla/) at the levels of tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD-tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).

Compared with version 1.0, version 2.0 of the corpus adds 7 workbooks from sets whose textbooks were already part of KUUS 1.0. It is available not only in CoNLL-U format but also in TEI XML and in vertical encoding.
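As a rough illustration of working with the CoNLL-U distribution, the sketch below reads CoNLL-U text and counts lemmas. It relies only on the standard ten-column CoNLL-U layout; the sample sentence is invented for this example and is not taken from KUUS, its annotation is simplified (unused columns are left as "_"), and the function name parse_conllu is hypothetical.

```python
from collections import Counter

# Column names of the standard ten-column CoNLL-U layout.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Invented, simplified sample (NOT from KUUS); unused columns are "_".
SAMPLE = """\
# sent_id = example.1
# text = To je knjiga.
1\tTo\tta\tDET\t_\t_\t_\t_\t_\t_
2\tje\tbiti\tAUX\t_\t_\t_\t_\t_\t_
3\tknjiga\tknjiga\tNOUN\t_\t_\t_\t_\t_\t_
4\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_

"""


def parse_conllu(text):
    """Return a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # ignore malformed lines
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)

# Count lemmas, leaving out punctuation.
lemma_counts = Counter(
    tok["lemma"]
    for sent in sentences
    for tok in sent
    if tok["upos"] != "PUNCT"
)
print(lemma_counts.most_common())
```

To apply this to the downloaded corpus, the text of a CoNLL-U file from the archive would be passed to parse_conllu in place of SAMPLE; the file names inside the archive are not stated here and would need to be checked after download.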

The corpus KUUS 1.0 is presented in more detail in: KLEMEN, Matej, ARHAR HOLDT, Špela, POLLAK, Senja, KOSEM, Iztok, HUBER, Damjan, LUTAR, Mateja, 2022: Korpus učbenikov za učenje slovenščine kot drugega in tujega jezika. Nataša Pirih Svetina, Ina Ferbežar (eds.): Na stičišču svetov: slovenščina kot drugi in tuji jezik. Obdobja 41. Ljubljana: Založba Univerze v Ljubljani. 165–174. DOI: https://doi.org/10.4312/Obdobja.41.2784-7152

Note that a sample of the KUUS corpus, ccKUUS (http://hdl.handle.net/11356/1878), is available under a more permissive licence than KUUS and is also searchable via the CLARIN.SI concordancers.

Identifier
PID http://hdl.handle.net/11356/1877
Related Identifier https://doi.org/10.4312/Obdobja.41.2784-7152
Related Identifier http://hdl.handle.net/11356/1696
Related Identifier https://centerslo.si/KUUS
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1877
Provenance
Creator Klemen, Matej; Kosem, Iztok; Arhar Holdt, Špela; Pollak, Senja; Huber, Damjan; Lutar, Mateja
Publisher Centre for Slovene as a Second and Foreign Language, University of Ljubljana; Centre for Language Resources and Technologies, University of Ljubljana
Publication Year 2023
Rights CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0; ACA; https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 3
Discipline Linguistics