Training corpus jos1M 1.1

The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated with morphosyntactic descriptions (fine-grained PoS tags) and lemmas, with about one fourth of the most problematic annotations hand-validated.

The corpus is available as source TEI P5 XML and in a simpler, smaller vertical format used by various concordancers. Note that the vertical format does not contain all of the information in the source TEI.
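
For readers who want to feed the vertical format to a tagger trainer, the following is a minimal Python sketch of a reader. It rests on assumptions not confirmed by this record: that each token sits on its own line with tab-separated columns in the order word, lemma, MSD, and that sentences are delimited by `<s>` ... `</s>` structural lines. The names `Token`, `read_vertical` and `SAMPLE`, and the sample tags, are illustrative only; check the actual files before relying on the column order.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    msd: str  # morphosyntactic description (fine-grained PoS tag)


def read_vertical(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Tokens per sentence from vertical-format lines.

    Assumed layout (hypothetical, verify against the real files):
    word<TAB>lemma<TAB>MSD per token line, sentences closed by </s>.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        # A structural line looks like <tag ...> and has no tab; a token
        # line whose word is "<" still contains tabs, so it is kept.
        is_structural = (
            line.startswith("<") and line.endswith(">") and "\t" not in line
        )
        if is_structural:
            if line == "</s>" and sentence:
                yield sentence
                sentence = []
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed layout
        sentence.append(Token(parts[0], parts[1], parts[2]))
    if sentence:
        # Flush a final sentence that was not closed by </s>.
        yield sentence


# Illustrative sample, not taken from the corpus.
SAMPLE = (
    "<text>\n<p>\n<s>\n"
    "To\tta\tPd-nsn\n"
    "je\tbiti\tVa-r3s-n\n"
    "</s>\n<s>\n"
    "<\t<\tZ\n"
    "3\t3\tMdc\n"
    "</s>\n</p>\n</text>\n"
)

sentences = list(read_vertical(SAMPLE.splitlines()))
```

For a real file, the same function could be called on an open file handle, for example `read_vertical(open(path, encoding="utf-8"))`, where `path` points at the unzipped vertical file. Since the vertical format omits some information from the source TEI, anything beyond these three columns would have to come from the XML.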

Identifier
PID http://hdl.handle.net/11356/1037
Related Identifier http://www.lrec-conf.org/proceedings/lrec2010/summaries/139.html
Related Identifier http://hdl.handle.net/11356/1213
Related Identifier https://nl.ijs.si/jos/jos1M-en.html
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1037
Provenance
Creator Erjavec, Tomaž; Krek, Simon
Publisher Jožef Stefan Institute
Publication Year 2010
Rights Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0); PUB; https://creativecommons.org/licenses/by-nc/4.0/
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 3
Discipline Linguistics