Training corpus jos1M 1.1

Dataset

The jos1M corpus contains 1 million words of paragraphs sampled from the FidaPLUS corpus. It is intended as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated with morphosyntactic descriptions (fine-grained PoS tags) and lemmas, with about one fourth of the most problematic annotations hand-validated.

The corpus is available in its source TEI P5 XML encoding and in a simpler, smaller vertical format used by various concordancers. Note that the vertical format does not contain all of the information from the source TEI.
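
As a rough illustration of how a vertical file can be consumed, the sketch below reads one token per line and groups tokens into sentences. It rests on assumptions not stated in this record: that token lines carry three tab-separated columns (word, lemma, MSD) in that order, and that sentences are delimited by `<s>` and `</s>` lines. The names `Token`, `read_sentences` and `SAMPLE` are hypothetical, and the sample lines and their tags are invented for the example rather than taken from the corpus. Check the actual files before relying on this layout.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    msd: str  # morphosyntactic description (fine-grained PoS tag)


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence.

    Assumes: token lines are 'word<TAB>lemma<TAB>msd'; lines without a
    tab that start with '<' are structural markup such as <p> or <s>.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if "\t" not in line and line.startswith("<"):
            # Structural line: close the current sentence on </s>.
            if line == "</s>" and sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip lines that do not match the assumed layout
        sentence.append(Token(fields[0], fields[1], fields[2]))
    if sentence:
        # Tolerate a file that ends without a closing </s>.
        yield sentence


# Invented sample in the assumed layout (not taken from jos1M).
SAMPLE = (
    "<p>\n"
    "<s>\n"
    "Pes\tpes\tNcmsn\n"
    "spi\tspati\tVmpr3s\n"
    ".\t.\tZ\n"
    "</s>\n"
    "<s>\n"
    "Dobro\tdobro\tRgp\n"
    "</s>\n"
    "</p>\n"
)

for sent in read_sentences(SAMPLE.splitlines()):
    print(" ".join(f"{t.word}/{t.msd}" for t in sent))
```

Because `read_sentences` accepts any iterable of lines, an open file handle can be passed directly, so the corpus does not need to be loaded into memory at once.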

Identifier
PID	http://hdl.handle.net/11356/1037
Related Identifier	http://www.lrec-conf.org/proceedings/lrec2010/summaries/139.html
Related Identifier	http://hdl.handle.net/11356/1213
Related Identifier	https://nl.ijs.si/jos/jos1M-en.html
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1037
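
The Metadata Access link above is an OAI-PMH `GetRecord` request for Dublin Core (`oai_dc`) metadata. A minimal sketch of building that URL and pulling Dublin Core fields out of a response follows. The helper names `get_record_url` and `dc_values` are hypothetical, and `SAMPLE_RESPONSE` is a hand-written, heavily abbreviated stand-in for a real response, filled only with values from this record; the live response will contain more elements.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
DC_NS = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace


def get_record_url(identifier: str, prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord URL for the given record identifier."""
    query = {
        "verb": "GetRecord",
        "metadataPrefix": prefix,
        "identifier": identifier,
    }
    # Keep ':' and '/' unescaped so the URL matches the form shown above.
    return OAI_ENDPOINT + "?" + urlencode(query, safe=":/")


def dc_values(xml_text: str, element: str) -> list:
    """Return the text of every Dublin Core <element> in the response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(DC_NS + element)]


# Abbreviated, hand-written stand-in for a GetRecord response.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Training corpus jos1M 1.1</dc:title>
      <dc:creator>Erjavec, Tomaž</dc:creator>
      <dc:creator>Krek, Simon</dc:creator>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""

print(get_record_url("oai:www.clarin.si:11356/1037"))
print(dc_values(SAMPLE_RESPONSE, "creator"))
```

To query the live repository, the URL returned by `get_record_url` could be passed to `urllib.request.urlopen` and the decoded body handed to `dc_values`.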

Provenance
Creator	Erjavec, Tomaž; Krek, Simon
Publisher	Jožef Stefan Institute
Publication Year	2010
Rights	Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0); PUB; https://creativecommons.org/licenses/by-nc/4.0/
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 3
Discipline	Linguistics