Training corpus SUK 1.0

Dataset

The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with some parts also containing further manually verified annotations. The morphosyntactic tags and (where present) syntactic dependencies are included in both the JOS/MULTEXT-East framework and the Universal Dependencies framework.

The corpus is composed of several parts:
* ssj500k-syn (200,320 words): the syntactically annotated part of the updated ssj500k corpus 2.3 (http://hdl.handle.net/11356/1434); also contains named entity, verbal multiword expression, and semantic role label annotations;
* ssj500k-tag.xml (299,927 words): the PoS-tagged part of the updated ssj500k corpus 2.3 (http://hdl.handle.net/11356/1434); also contains verbal multiword expression annotations;
* Ambiga (13,929 words): a corpus constructed to contain many words that are potentially ambiguous in lemma or PoS, to help in the training of taggers and lemmatizers;
* ElexisWSD (27,091 words): the Slovenian part of the "Parallel sense-annotated corpus ELEXIS-WSD 1.0" (http://hdl.handle.net/11356/1674) with manually checked lemmatisation, PoS tagging, and syntactic parses; also contains named entity and semantic role label annotations;
* SentiCoref (340,401 words): the "Slovene corpus for aspect-based sentiment analysis - SentiCoref 1.0" (http://hdl.handle.net/11356/1285) with manually checked lemmatisation and PoS tagging; also contains named entity and coreference chain annotations.
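As a quick consistency check, the per-part word counts listed above can be added up. The following is a minimal sketch that uses only the figures given in this record; the variable names are illustrative. The counts sum to 881,668 words, which is lower than the "about 1 million tokens" stated earlier. One plausible explanation, which is an assumption and is not stated in the record, is that the token figure also includes punctuation.

```python
# Word counts per corpus part, copied from the list in this record.
part_sizes = {
    "ssj500k-syn": 200_320,
    "ssj500k-tag.xml": 299_927,
    "Ambiga": 13_929,
    "ElexisWSD": 27_091,
    "SentiCoref": 340_401,
}

# Total number of words across all parts.
total_words = sum(part_sizes.values())

# Share of each part in the whole corpus, in percent.
shares = {name: 100 * n / total_words for name, n in part_sizes.items()}

print(f"total words: {total_words:,}")
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {share:5.1f} %")
```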

The annotations follow:
(1) the MULTEXT-East V6 morphosyntactic specifications for Slovene, https://nl.ijs.si/ME/V6/msd/;
(2) the JOS dependency schema, https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf;
(3) the Universal Dependencies morphosyntactic specifications and syntactic dependencies for Slovene-SSJ, https://universaldependencies.org/;
(4) the Janes annotation guidelines for Slovenian named entities, https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf;
(5) the Guidelines of the PARSEME shared task on verbal multiword expressions, http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/.

In the TEI-encoded corpus, the vocabulary of (1) is provided in the back element, and the vocabularies of (3)-(5) are provided as taxonomies in the teiHeader. The semantic role labels are also documented in the teiHeader.

Compared with the previous version, ssj500k 2.3, this version contains significantly more text, corrects various annotation errors, annotates more text with syntactic parses, adds new types of annotation, updates the TEI encoding, provides CoNLL-U files with text metadata, and distinguishes UD-type CoNLL-U files from JOS-type CoNLL-U files.
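For readers who want to work with the CoNLL-U files, the sketch below shows one way to read the generic CoNLL-U layout (comment lines starting with "#", ten tab-separated columns per token, a blank line between sentences). It is a minimal illustration, not the corpus's own tooling: the function name parse_conllu, the sample sentence, and its annotation are made up for this example and are not taken from SUK, and the sketch assumes that the UD-type and JOS-type files share the same ten-column layout.

```python
from typing import Dict, List

# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[Dict]:
    """Split CoNLL-U text into sentences with metadata and token rows."""
    sentences = []
    meta, tokens = {}, []
    # The extra empty string flushes the last sentence if the text
    # does not end with a blank line.
    for line in text.splitlines() + [""]:
        if not line.strip():
            if tokens:
                sentences.append({"meta": meta, "tokens": tokens})
            meta, tokens = {}, []
        elif line.startswith("#"):
            # Metadata comments have the form "# key = value".
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
        else:
            cols = line.split("\t")
            if len(cols) == len(FIELDS):
                tokens.append(dict(zip(FIELDS, cols)))
    return sentences


# Hand-made sample (hypothetical, NOT from the corpus); XPOS and FEATS
# are left as "_" rather than guessing tag values.
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = To je test.\n"
    "1\tTo\tta\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\ttest\ttest\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)

sentences = parse_conllu(SAMPLE)
for sent in sentences:
    print(sent["meta"].get("text"))
    print([(tok["form"], tok["lemma"], tok["upos"]) for tok in sent["tokens"]])
```

Keeping the metadata comments in a separate dictionary means that any text-level metadata present in the files stays attached to each sentence. A real pipeline would more likely use an established CoNLL-U reader than this hand-rolled parser.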

Identifier
PID	http://hdl.handle.net/11356/1747
Related Identifier	http://hdl.handle.net/11356/1434
Related Identifier	http://hdl.handle.net/11356/1959
Related Identifier	https://rsdo.slovenscina.eu/en/language-resources
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1747

Provenance
Creator	Arhar Holdt, Špela; Krek, Simon; Dobrovoljc, Kaja; Erjavec, Tomaž; Gantar, Polona; Čibej, Jaka; Pori, Eva; Terčon, Luka; Munda, Tina; Žitnik, Slavko; Robida, Nejc; Blagus, Neli; Može, Sara; Ledinek, Nina; Holz, Nanika; Zupan, Katja; Kuzman, Taja; Kavčič, Teja; Škrjanec, Iza; Marko, Dafne; Jezeršek, Lucija; Zajc, Anja
Publisher	Centre for Language Resources and Technologies, University of Ljubljana
Publication Year	2022
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline	Linguistics