Reference corpus of historical Slovene goo300k 1.2

goo300k is a manually annotated reference corpus of historical Slovene. It contains 1,100 pages (about 300,000 tokens) sampled from 89 texts from the period 1584-1899.

Each text comes with extensive metadata and per-page links to facsimiles. The word tokens in the texts are annotated with their modernised word-form, lemma, part-of-speech, and, for archaic words, their nearest modern synonyms or a short explanation.

The corpus is available in source TEI P5 XML and in the simpler and smaller vertical format, used by various concordancers. Note that the vertical format does not contain all the information from the source TEI.
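As a rough illustration of what reading the vertical format might look like, here is a minimal Python sketch. It assumes the usual concordancer convention of one token per line with tab-separated annotation columns, with structural markup on lines of its own. The column names and their order in `COLUMNS` are assumptions rather than taken from the corpus documentation, and the sample lines are invented placeholders, so both should be checked against the actual files.

```python
# Hypothetical column layout: the real order of annotations in the
# goo300k vertical files is an assumption and must be verified.
COLUMNS = ("word", "modernised", "lemma", "pos")


def read_vertical(lines):
    """Yield one dict per token line, skipping structural markup lines."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Simplification: any line that starts with "<" and ends with ">"
        # is treated as structural markup (e.g. <text ...>, <p>, </p>).
        if line.startswith("<") and line.endswith(">"):
            continue
        fields = line.split("\t")
        yield dict(zip(COLUMNS, fields))


# Invented placeholder data, not taken from the corpus.
sample = [
    '<text id="example">\n',
    "<p>\n",
    "wordA\tmodernA\tlemmaA\tTAG1\n",
    "wordB\tmodernB\tlemmaB\tTAG2\n",
    "</p>\n",
    "</text>\n",
]

tokens = list(read_vertical(sample))
for token in tokens:
    print(token["word"], token["lemma"], token["pos"])
```

Because the vertical format omits some of the information in the source TEI, a reader like this is suitable for quick token-level processing only. Anything that depends on the full metadata or facsimile links would need the TEI files instead.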

Identifier
PID http://hdl.handle.net/11356/1025
Related Identifier https://doi.org/10.1007/s10579-015-9294-7
Related Identifier https://nl.ijs.si/imp/index-en.html
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1025
Provenance
Creator Erjavec, Tomaž
Publisher Jožef Stefan Institute
Publication Year 2015
Funding Reference info:eu-repo/grantAgreement/EC/FP7/215064
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 2
Discipline Linguistics
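
The Metadata Access link listed above is an OAI-PMH GetRecord request. The following minimal sketch shows how such a URL could be assembled for a given handle. The function name `oai_get_record_url` is hypothetical, and the assumption that other records in the same repository follow the same `oai:www.clarin.si:` identifier pattern is inferred from this one record only.

```python
from urllib.parse import urlencode

# Base endpoint as it appears in the Metadata Access field of this record.
BASE = "http://www.clarin.si/repository/oai/request"


def oai_get_record_url(handle, prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for a handle such as '11356/1025'."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": prefix,
        # Assumed identifier pattern, inferred from this record only.
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ":" and "/" unescaped so the result matches the listed link.
    return BASE + "?" + urlencode(params, safe=":/")


url = oai_get_record_url("11356/1025")
print(url)
```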