Corpus of scientific texts of contemporary Slovenian KZB 1.0

Dataset

The Corpus of scientific texts of contemporary Slovenian consists of 25 million words from scientific monographs and scientific papers, most of them written between 2000 and 2023. It was designed as one of the resources of the project "eSSKJ and corpus - towards state-of-the-art language data".

The corpus is linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla/) at the levels of tokenization, sentence segmentation, lemmatization, morphosyntactic tagging with MULTEXT-East v6 MSD tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).

The corpus is available in the CoNLL-U format and as vertical files for use with Sketch Engine-type concordancers.
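
As a rough illustration of working with the CoNLL-U distribution, the following Python sketch reads sentences from CoNLL-U lines and counts lemmas. It relies only on the standard ten-column CoNLL-U layout. The two sample sentences, the helper names (make_row, read_conllu) and the assumption that the MSD tag sits in the XPOS column are illustrative and not taken from the corpus files; the unused columns in the sample are left as "_".

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

# Standard CoNLL-U column order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def make_row(idx: int, form: str, lemma: str, upos: str, xpos: str) -> str:
    """Build one tab-separated token line; unused columns are '_' (hypothetical helper)."""
    return "\t".join([str(idx), form, lemma, upos, xpos, "_", "_", "_", "_", "_"])


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            continue  # ignore malformed lines in this sketch
        token = dict(zip(CONLLU_FIELDS, cols))
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in token["id"] or "." in token["id"]:
            continue
        sentence.append(token)
    if sentence:
        yield sentence


# Invented two-sentence sample, not taken from the corpus.
SAMPLE_LINES = [
    "# sent_id = demo.1",
    "# text = Korpus je velik.",
    make_row(1, "Korpus", "korpus", "NOUN", "Ncmsn"),
    make_row(2, "je", "biti", "AUX", "Va-r3s-n"),
    make_row(3, "velik", "velik", "ADJ", "Agpmsnn"),
    make_row(4, ".", ".", "PUNCT", "Z"),
    "",
    "# sent_id = demo.2",
    "# text = Korpus je označen.",
    make_row(1, "Korpus", "korpus", "NOUN", "Ncmsn"),
    make_row(2, "je", "biti", "AUX", "Va-r3s-n"),
    make_row(3, "označen", "označen", "ADJ", "Appmsnn"),
    make_row(4, ".", ".", "PUNCT", "Z"),
    "",
]

sentences = list(read_conllu(SAMPLE_LINES))
lemma_counts = Counter(tok["lemma"] for sent in sentences for tok in sent)
```

For the real files, the same generator could be fed an open file handle, for example read_conllu(open(path, encoding="utf-8")), where path is whichever CoNLL-U file has been extracted from the downloaded archive.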

Identifier
PID	http://hdl.handle.net/11356/1872
Related Identifier	https://doi.org/10.3986/JZ.31.2.06
Related Identifier	https://isjfr.zrc-sazu.si/sl/programi-in-projekti/esskj-in-korpus-na-poti-k-najsodobnejsim-jezikovnim-podatkom
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1872

Provenance
Creator	Erjavec, Tomaž; Jemec Tomazin, Mateja; Ledinek, Nina; Perdih, Andrej; Romih, Miro; Trojar, Mitja; Romih, Luka
Publisher	ZRC SAZU
Publication Year	2023
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 2
Discipline	Linguistics