Corpus of Slovenian school texts SBSJ 1.0

The Corpus of Slovenian school texts is a lemmatized and POS-tagged specialized corpus of 428 short school texts written between 2017 and 2020, primarily by primary-school pupils in grades 1 to 5. The corpus contains approximately 95,000 tokens and was designed as one of the resources for compiling The School Dictionary of the Slovenian Language, which is being created as part of the project Franček Web Portal, Language Counselling for Slovene Teachers and School Dictionary of the Slovene Language. The corpus was lemmatized and POS-tagged with the Obeliks tagger (http://oznacevalnik.slovenscina.eu/Vsebine/Sl/ProgramskaOprema/Navodila.aspx) using JOS morphosyntactic descriptions. It is encoded in XML and complies with the TEI specifications as given in the CLARIN.SI customisation (https://github.com/clarinsi/TEI-schema).

Note that the corpus is integrated with the CLARIN.SI concordancers, but the version available through the concordancers is much larger than the TEI sample available for download.
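The TEI encoding described above can be read with a standard XML parser. The following is a minimal sketch, not the corpus's documented layout: it assumes that words are encoded as `w` elements carrying a `lemma` attribute and an `ana` attribute holding the morphosyntactic description, and that punctuation is encoded as `pc` elements. The sample sentence and its tags are hypothetical and are not taken from the corpus; check the downloaded files and the CLARIN.SI schema before relying on these names.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample sentence, not taken from the corpus.
# Element and attribute names (w, pc, lemma, ana) are assumptions.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p><s>
    <w lemma="pes" ana="mte:Ncmsn">Pes</w>
    <w lemma="teči" ana="mte:Vmpr3s">teče</w>
    <pc ana="mte:Z">.</pc>
  </s></p></body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return (form, lemma, msd) triples in document order."""
    root = ET.fromstring(xml_text)
    tokens = []
    for el in root.iter():
        if el.tag == f"{{{TEI_NS}}}w":
            tokens.append((el.text, el.get("lemma"), el.get("ana")))
        elif el.tag == f"{{{TEI_NS}}}pc":
            # Punctuation has no lemma attribute here; reuse the form.
            tokens.append((el.text, el.text, el.get("ana")))
    return tokens


if __name__ == "__main__":
    for form, lemma, msd in read_tokens(SAMPLE):
        print(form, lemma, msd, sep="\t")
```

For a downloaded file, `ET.parse(path).getroot()` could replace `ET.fromstring`, with the rest of the function unchanged.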

Identifier
PID http://hdl.handle.net/11356/1413
Related Identifier https://www.xn--franek-l2a.si/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1413
Provenance
Creator Ahačič, Kozma; Atelšek, Simon; Erjavec, Tomaž; Holozan, Peter; Jakop, Nataša; Jemec Tomazin, Mateja; Ježovnik, Janoš; Ledinek, Nina; Perdih, Andrej; Romih, Miro; Trojar, Mitja
Publisher ZRC SAZU
Publication Year 2021
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics
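
The Metadata Access URL listed above is an OAI-PMH `GetRecord` request. As a rough sketch, the same URL can be assembled from its three query parameters and a Dublin Core (`oai_dc`) response can be read with a standard XML parser. The helper names `get_record_url` and `dc_values` are hypothetical, and the sample response is a trimmed, hand-written illustration rather than the repository's actual output.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hand-written, trimmed illustration of an oai_dc response (assumption).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Corpus of Slovenian school texts SBSJ 1.0</dc:title>
      <dc:publisher>ZRC SAZU</dc:publisher>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""


def get_record_url(identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for the given record identifier."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    }
    # Keep ':' and '/' unescaped so the URL matches the form in the record.
    return OAI_ENDPOINT + "?" + urlencode(params, safe=":/")


def dc_values(response_xml, field):
    """Return the text of every Dublin Core element named `field`."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iter(f"{{{DC_NS}}}{field}")]


if __name__ == "__main__":
    print(get_record_url("oai:www.clarin.si:11356/1413"))
    print(dc_values(SAMPLE_RESPONSE, "title"))
```

The sketch makes no network request; fetching the URL, for example with `urllib.request.urlopen`, and passing the response body to `dc_values` would be the natural next step.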