Corpus of Slovene linguistic scientific writing JezKor


JezKor is a collection of linguistic scientific writing in the Slovenian language. It consists of 43 monographs published between 2009 and 2022 by the Fran Ramovš Institute of the Slovenian Language and Založba ZRC, 267 papers published in the journal "Jezikoslovni zapiski", and 28 papers published in the journal "Slovenski jezik". Note that the texts were extracted directly from PDFs, so they contain various types of noise.

The corpus is linguistically annotated with the CLASSLA pipeline on the levels of lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, Universal Dependencies part-of-speech tags and morphological features, and named entities. It is distributed in the CoNLL-U and vertical file formats, one file for each text. Text metadata consists of the author(s), title, and year of publication.
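Since the corpus is distributed as one CoNLL-U file per text, the annotation layers described above can be read with a few lines of Python. The following is a minimal sketch, not part of the corpus distribution: it assumes the standard ten tab-separated CoNLL-U columns, the function names `read_conllu` and `lemma_frequencies` are hypothetical, and the sample sentence is hand-made for illustration rather than taken from JezKor.

```python
from collections import Counter

# Standard CoNLL-U column order (ten tab-separated fields per token line).
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(lines):
    """Yield sentences as lists of token dicts from an iterable of CoNLL-U lines."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines (sentence id, text, other metadata) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            # Skip malformed lines instead of failing on them.
            continue
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


def lemma_frequencies(lines):
    """Count lemmas, ignoring multiword-token ranges ("1-2") and empty nodes ("1.1")."""
    counts = Counter()
    for sentence in read_conllu(lines):
        for token in sentence:
            if token["id"].isdigit():
                counts[token["lemma"]] += 1
    return counts


# Hand-made illustrative sentence; the tags are examples, not corpus data.
SAMPLE = [
    "# sent_id = 1",
    "# text = Korpus je označen.",
    "1\tKorpus\tkorpus\tNOUN\tNcmsn\t_\t3\tnsubj\t_\t_",
    "2\tje\tbiti\tAUX\tVa-r3s-n\t_\t3\taux\t_\t_",
    "3\toznačen\toznačen\tADJ\tAppmsnn\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\tZ\t_\t3\tpunct\t_\t_",
    "",
]

if __name__ == "__main__":
    for lemma, count in lemma_frequencies(SAMPLE).most_common():
        print(lemma, count)
```

To process a downloaded file, an open file handle can be passed in place of `SAMPLE`, for example `lemma_frequencies(open(path, encoding="utf-8"))`, where `path` is whatever file name the unpacked archive uses. Because the texts were extracted from PDFs, the parser skips malformed lines rather than raising an error.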

Creator Atelšek, Simon; Nemec, Karmen; Jemec Tomazin, Mateja
Publisher ZRC SAZU
Publication Year 2023
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); PUB
OpenAccess true
Contact info(at)
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline Linguistics