The Pilot corpus of student academic texts KOŠ 1.0 consists of authentic texts written by undergraduate students (aged approximately 19–23) as part of their coursework at two faculties of the University of Ljubljana. Each text is accompanied by metadata on the study programme, field of study, year of study, academic year of submission, number of authors (single or multiple), and text type. The corpus predominantly contains article reviews, essays, answers to questions, and seminar papers. It also includes reports, summaries and presentations of articles and lectures, lesson plans, and other academic materials.
The corpus was linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla/) at several levels: tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD tags, JOS-SYN dependency syntax, Universal Dependencies, and named entities (for details on the individual annotation layers, see https://wiki.cjvt.si/shelves/linguistic-annotation-of-slovene-corpora). For better accessibility and wider usability, we provide versions with JOS-SYN as well as Universal Dependencies syntax, and with English as well as Slovene tags.
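
As an illustration of how the token-level annotation layers listed above might be read, the following is a minimal Python sketch. It assumes a CoNLL-U-style export with ten tab-separated columns, which is an assumption and not something stated in this description. The sample sentence and its annotations are hand-written for illustration and are not taken from the corpus, and the names `FIELDS`, `SAMPLE` and `parse_conllu` are hypothetical.

```python
from typing import Dict, List

# Column names of the CoNLL-U format used by Universal Dependencies.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Hand-written illustrative sentence (NOT taken from the corpus).
# The xpos column holds MULTEXT-East style MSD tags with English labels.
SAMPLE = """\
# text = Študentka piše esej.
1\tŠtudentka\tštudentka\tNOUN\tNcfsn\t_\t2\tnsubj\t_\t_
2\tpiše\tpisati\tVERB\tVmpr3s\t_\t0\troot\t_\t_
3\tesej\tesej\tNOUN\tNcmsan\t_\t2\tobj\t_\t_
4\t.\t.\tPUNCT\tZ\t_\t2\tpunct\t_\t_
"""


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment lines (e.g. "# text = ...") are skipped.
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns: {line!r}")
        current.append(dict(zip(FIELDS, columns)))
    if current:
        # Flush the last sentence if the text does not end with a blank line.
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)
for token in sentences[0]:
    # Show the surface form with its lemma, MSD tag and dependency relation.
    print(token["form"], token["lemma"], token["xpos"], token["deprel"])
```

The parser is deliberately minimal: it ignores sentence-level comments and does not treat multiword-token ranges or empty nodes specially, so a dedicated CoNLL-U library may be preferable for real work with the corpus files.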