Corpus of scientific texts of contemporary Slovenian KZB 1.0

The Corpus of scientific texts of contemporary Slovenian consists of 25 million words from scientific monographs and scientific papers, most of them written between 2000 and 2023. It was designed as one of the resources of the project "eSSKJ and corpus - towards state-of-the-art language data".

The corpus is linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla/) at the levels of tokenization, sentence segmentation, lemmatization, morphosyntactic tagging with MULTEXT-East v6 MSD tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).

The corpus is available in the CoNLL-U format and as vertical files for use with Sketch Engine-type concordancers.
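
For readers who want to process the CoNLL-U files directly, the following is a minimal sketch of a reader, not part of the corpus distribution. It relies only on the standard CoNLL-U layout (ten tab-separated columns per token, comment lines starting with "#", a blank line between sentences). The names read_conllu, CONLLU_FIELDS and SAMPLE are hypothetical, and the sample sentence is invented for illustration rather than taken from the corpus; its dependency labels are placeholders and do not represent the JOS label set. Which column holds which annotation layer in the released files is an assumption that should be checked against the data.

```python
from typing import Dict, Iterable, Iterator, List

# Standard CoNLL-U column names, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Invented sample sentence (NOT from the corpus); labels are illustrative only.
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = To je primer.\n"
    "1\tTo\tta\tPRON\tPd-nsn\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\tVa-r3s-n\t_\t3\tcop\t_\t_\n"
    "3\tprimer\tprimer\tNOUN\tNcmsn\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\tZ\t_\t3\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE.splitlines()):
        for tok in sent:
            print(tok["form"], tok["lemma"], tok["xpos"], sep="\t")
```

To read a real file, the same function can be given an open file handle, for example read_conllu(open(path, encoding="utf-8")), where path points to a file unpacked from the downloaded archive.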

Identifier
PID http://hdl.handle.net/11356/1872
Related Identifier https://doi.org/10.3986/JZ.31.2.06
Related Identifier https://isjfr.zrc-sazu.si/sl/programi-in-projekti/esskj-in-korpus-na-poti-k-najsodobnejsim-jezikovnim-podatkom
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1872
Provenance
Creator Erjavec, Tomaž; Jemec Tomazin, Mateja; Ledinek, Nina; Perdih, Andrej; Romih, Miro; Trojar, Mitja; Romih, Luka
Publisher ZRC SAZU
Publication Year 2023
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 2
Discipline Linguistics