CMC training corpus Janes-Norm 3.0


Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs, forums and news comments.

The corpus is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, and word normalisation of non-standard Slovene.

The corpus is composed of three parts. One is Janes-Norm 1.2 proper (5,000 texts and 93,000 words, texts to 2016), which has automatically assigned lemmas and morphosyntactic tags. The other two parts constitute the complete Janes-Tag 3.0 corpus, which has manually annotated morphosyntactic tags, lemmas and named entities (15,000 texts and 20,000 words). Janes-Tag 3.0 consists of the older Janes-Tag 2.1 (texts to 2016) and the newer Janes-RSDO (tweets only, texts up to 2022). Both Janes-Norm and Janes-Tag (but not Janes-RSDO) have texts classified according to their estimated technical (T1-T3) and linguistic (L1-L3) standardness.

The data is available in the source TEI encoding and in a derived CoNLL-U format. All three parts contain lemmas and JOS/MULTEXT-East morphosyntactic descriptions. Janes-Tag and Janes-RSDO also contain Universal Dependencies morphological features, and Janes-Tag additionally contains named entity annotations.
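
As a rough illustration of working with the CoNLL-U files, the following minimal Python sketch reads sentences into lists of token dictionaries using the ten standard CoNLL-U columns. The sample sentence, its lemmas and its tags are invented for illustration and are not taken from the corpus. How the corpus stores normalised word forms in CoNLL-U is not described here, so the sketch does not attempt to extract them.

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns, in order.
COLUMNS = ["id", "form", "lemma", "upos", "xpos", "feats",
           "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        else:
            fields = line.split("\t")
            if len(fields) == len(COLUMNS):
                sentence.append(dict(zip(COLUMNS, fields)))
    # Emit a final sentence if the input did not end with a blank line.
    if sentence:
        yield sentence


# Invented sample (NOT from the corpus); tags are illustrative only.
SAMPLE = (
    "# text = js grem\n"
    "1\tjs\tjaz\tPRON\tPp1-sn\t_\t_\t_\t_\t_\n"
    "2\tgrem\titi\tVERB\tVmbr1s\t_\t_\t_\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["form"], token["lemma"], token["xpos"])
```

For a real file, the same function can be given an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points to one of the CoNLL-U files in the distribution.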

Compared to the previous version, this one corrects some capitalisation errors in normalised words of Janes-Norm, updates the encoding, and adds Janes-RSDO.

The first version of this corpus is described in:

FIŠER, Darja, LJUBEŠIĆ, Nikola, ERJAVEC, Tomaž. 2018. The Janes project: language resources and tools for Slovene user generated content. Language Resources & Evaluation.

Related Identifier
Metadata Access
Creator Lenardič, Jakob; Čibej, Jaka; Arhar Holdt, Špela; Erjavec, Tomaž; Fišer, Darja
Publisher Jožef Stefan Institute
Publication Year 2022
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); PUB
OpenAccess true
Contact info(at)
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline Linguistics