CMC training corpus Janes-Syn 1.0

Dataset

Janes-Syn is a syntactically annotated corpus of Slovene tweets. It is intended as a gold-standard training and testing dataset for syntactic annotation of Slovene computer-mediated communication (CMC), and for detailed linguistic studies that require highly accurate and reliable annotations. Words in the dataset are normalised, lemmatised, PoS-tagged and syntactically annotated with the JOS dependency model (http://eng.slovenscina.eu/tehnologije/razclenjevalnik). Annotations at all levels were manually corrected.

The creation and structure of the corpus are described in: ARHAR HOLDT, Špela, FIŠER, Darja, ERJAVEC, Tomaž, KREK, Simon. Syntactic annotation of Slovene CMC: first steps. Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 27-28 September 2016, Ljubljana, Slovenia, 2016, pp. 3-6. https://nl.ijs.si/janes/cmc-corpora2016/proceedings/

Janes-Syn was created from two larger corpora that are also available in the repository: Janes-Norm (http://hdl.handle.net/11356/1084) and Janes-Tag (http://hdl.handle.net/11356/1123).

Identifier
PID	http://hdl.handle.net/11356/1086
Related Identifier	https://nl.ijs.si/janes/viri/rocno-oznaceni-korpusi/#Janes-Syn
Related Identifier	https://doi.org/10.1007/s10579-018-9425-z
Related Identifier	https://nl.ijs.si/janes/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1086
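
The Metadata Access URL above is an OAI-PMH GetRecord request. As a minimal sketch, the same URL can be assembled from the PID. The helper name `oai_record_url` is hypothetical, and the `oai:www.clarin.si:<handle>` identifier pattern is an assumption inferred from this one record, not a documented rule of the repository.

```python
from urllib.parse import urlencode, urlparse

# Base endpoint copied from the Metadata Access line of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"


def oai_record_url(handle_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for a hdl.handle.net PID (hypothetical helper)."""
    # "http://hdl.handle.net/11356/1086" -> "11356/1086"
    handle = urlparse(handle_url).path.lstrip("/")
    # Assumption: identifiers follow the "oai:www.clarin.si:<handle>"
    # pattern seen in this record.
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ":" and "/" unescaped so the result matches the URL as listed.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


print(oai_record_url("http://hdl.handle.net/11356/1086"))
```

The sketch only builds the request string and makes no network call. Whether the same pattern holds for the related Janes-Norm and Janes-Tag records is not confirmed here.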

Provenance
Creator	Arhar Holdt, Špela; Erjavec, Tomaž; Fišer, Darja
Publisher	Jožef Stefan Institute
Publication Year	2017
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	application/pdf; application/zip; text/plain; charset=utf-8; downloadable_files_count: 4
Discipline	Linguistics