Corpus of Slovenian school texts SBSJ 1.0

Dataset

The Corpus of Slovenian school texts is a lemmatized and POS-tagged specialized corpus of 428 short school texts, written primarily by primary-school students in grades 1 to 5 between 2017 and 2020. The corpus contains approximately 95,000 tokens and was designed as one of the resources for compiling The School Dictionary of the Slovenian Language, which is being created as part of the project Franček Web Portal, Language Counselling for Slovene Teachers and School Dictionary of the Slovene Language. It was lemmatized and POS-tagged with the Obeliks tagger (http://oznacevalnik.slovenscina.eu/Vsebine/Sl/ProgramskaOprema/Navodila.aspx) using JOS morphosyntactic descriptions. The corpus is encoded in XML and complies with the TEI specifications as given in the CLARIN.SI customisation (https://github.com/clarinsi/TEI-schema).

Note that the corpus is integrated with the CLARIN.SI concordancers, but the version available in the concordancers is much larger than the TEI sample available for download.
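
For readers who want to process the downloadable TEI sample, the following is a minimal Python sketch of reading word tokens together with their lemmas and morphosyntactic descriptions. It rests on assumptions that should be checked against the actual files: that words are encoded as TEI `w` elements, that the lemma is stored in a `lemma` attribute, and that the JOS description is stored in an `ana` attribute with a prefix such as `mte:`. The inline fragment and the function name `read_tokens` are made up for illustration and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment, NOT taken from the corpus. The attribute names
# (lemma, ana) and the "mte:" prefix are assumptions about the encoding.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p><s>
    <w lemma="pes" ana="mte:Ncmsn">Pes</w>
    <w lemma="teči" ana="mte:Vmpr3s">teče</w>
    <pc ana="mte:Z">.</pc>
  </s></p></body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return (form, lemma, msd) tuples for every <w> element, in document order."""
    root = ET.fromstring(xml_text)
    tokens = []
    for el in root.iter(f"{{{TEI_NS}}}w"):
        form = (el.text or "").strip()
        lemma = el.get("lemma", "")
        # Drop an optional prefix such as "mte:" from the tag value.
        msd = el.get("ana", "").split(":", 1)[-1]
        tokens.append((form, lemma, msd))
    return tokens


if __name__ == "__main__":
    for form, lemma, msd in read_tokens(SAMPLE):
        print(form, lemma, msd, sep="\t")
```

Punctuation elements are skipped here; if the real files store the tag in a different attribute, only the `el.get(...)` calls need to change.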

Identifier
PID	http://hdl.handle.net/11356/1413
Related Identifier	https://doi.org/10.3986/JZ.28.1.07
Related Identifier	http://www.francek.si
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1413

Provenance
Creator	Ahačič, Kozma; Atelšek, Simon; Erjavec, Tomaž; Holozan, Peter; Jakop, Nataša; Jemec Tomazin, Mateja; Ježovnik, Janoš; Ledinek, Nina; Perdih, Andrej; Romih, Miro; Trojar, Mitja
Publisher	ZRC SAZU
Publication Year	2021
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics