Training corpus jos1M 1.2

Dataset

The jos1M corpus contains 1 million words of paragraphs sampled from the Gigafida corpus. It is intended as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated with morphosyntactic descriptions and lemmas, and about one fourth of the more problematic annotations have been hand-validated.

The morphosyntactic descriptions are given in both the JOS/MULTEXT-East framework (http://nl.ijs.si/ME/V6/msd/) and the Universal Dependencies framework for Slovene (https://universaldependencies.org/treebanks/sl_ssj/index.html).

The corpus is available in three formats: source TEI XML, with the MSDs in English or Slovene; the derived vertical format, used by the CQP and (no)Sketch Engine concordancers; and CoNLL-U, used by Universal Dependencies. Note that the corpus does not contain syntactic dependencies.

The texts and paragraphs of the jos1M corpus overlap with those of the ssj500k annotated corpus (http://hdl.handle.net/11356/1210), but the latter has been fully manually annotated and has had its tokenisation and sentence segmentation corrected. Texts and paragraphs in the jos1M corpus are marked if they are also included in ssj500k, and the CoNLL-U version is split into the part that is included in ssj500k and the part that is not. The part not included in ssj500k can serve as a training set for morphosyntactic tagging and lemmatisation in addition to ssj500k.
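As an illustration of how the CoNLL-U files might be read for tagger or lemmatiser training, the following is a minimal sketch, not part of the corpus distribution. It assumes the standard ten tab-separated CoNLL-U columns, with the JOS/MULTEXT-East MSD in the XPOS column and the Universal Dependencies tag and features in UPOS and FEATS. The function name `read_conllu` and the sample sentence are hypothetical and invented for this example; the sample is not taken from the corpus.

```python
from typing import Iterable, Iterator, List, Tuple

# (form, lemma, upos, xpos, feats) for one token
Token = Tuple[str, str, str, str, str]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of (form, lemma, upos, xpos, feats) tuples.

    Assumes standard CoNLL-U: ten tab-separated columns, '#' comment
    lines, and a blank line after each sentence.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges and empty nodes.
            continue
        # HEAD and DEPREL (columns 7 and 8) are ignored: the corpus
        # description states that it has no syntactic dependencies.
        sentence.append((cols[1], cols[2], cols[3], cols[4], cols[5]))
    if sentence:
        yield sentence


# Invented sample sentence, for illustration only (not from jos1M).
SAMPLE = "\n".join([
    "# sent_id = example.1",
    "\t".join(["1", "Pes", "pes", "NOUN", "Ncmsn",
               "Case=Nom|Gender=Masc|Number=Sing", "_", "_", "_", "_"]),
    "\t".join(["2", "laja", "lajati", "VERB", "Vmpr3s",
               "Number=Sing|Person=3", "_", "_", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "Z", "_", "_", "_", "_", "_"]),
    "",
])

sentences = list(read_conllu(SAMPLE.splitlines()))
for form, lemma, upos, xpos, feats in sentences[0]:
    print(form, lemma, upos, xpos, feats, sep="\t")
```

To read an actual file, the same function can be passed an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points to one of the CoNLL-U files in the downloaded archive.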

Identifier
PID	http://hdl.handle.net/11356/1213
Related Identifier	http://www.lrec-conf.org/proceedings/lrec2010/summaries/139.html
Related Identifier	http://hdl.handle.net/11356/1037
Related Identifier	http://nl.ijs.si/jos/jos1M-en.html
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1213

Provenance
Creator	Erjavec, Tomaž; Krek, Simon; Dobrovoljc, Kaja
Publisher	Jožef Stefan Institute
Publication Year	2019
Rights	Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0); https://creativecommons.org/licenses/by-nc/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 4
Discipline	Linguistics