- English-Slovenian text genre dataset X-GENRE
  The X-GENRE dataset comprises almost 3,000 web texts in English and Slovenian, manually annotated with genre labels. The dataset allows for automated genre identification and...
- Training corpus SUK 1.0
  The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with...
- Q-CAT Corpus Annotation Tool 1.3
  The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these...
- Choice of plausible alternatives dataset in Macedonian COPA-MK
  The COPA-MK dataset (Choice of plausible alternatives in Macedonian) is a translation of the English COPA dataset (https://people.ict.usc.edu/~gordon/copa.html) by following the...
- Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0
  ReLDI-NormTagNER-sr 3.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
- List of formulaic sequences in standard written Slovenian
  This document contains 1,891 formulaic sequences in standard written Slovenian, i.e. frequently recurring strings of two to five words, manually annotated for syntactic...
- Corpus of term-annotated texts RSDO5 1.1
  The RSDO5 corpus was compiled in order to serve as a training set for automatic term identification. It consists of 12 texts with 250,000 words and almost 38,000 manually...
- Croatian Twitter training corpus ReLDI-NormTag-hr 1.1
  ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...
- CMC training corpus Janes-Norm 1.1
  Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...
- Slovene Web genre identification corpus GINCO 1.0
  The Slovene Web genre identification corpus GINCO 1.0 contains web texts, manually annotated with genre, from two Slovene web corpora, the slWaC 2.0 corpus, crawled in 2014, and...
- Training corpus ssj500k 2.3
  The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....
- Slovenian Twitter hate speech dataset IMSyPP-sl
  A hand-labeled training (50,000 tweets labeled twice) and evaluation set (10,000 tweets labeled twice) for hate speech on Slovenian Twitter. The data files contain tweet IDs,...
- Training corpus SETimes.SR 1.0
  The SETimes.SR training corpus contains 86,726 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic...
- Post-edited and error annotated machine translation corpus PE²rr 1.0
  The PE²rr corpus contains source language texts from different domains along with their automatically generated translations into several morphologically rich languages, their...
- Q-CAT Corpus Annotation Tool 1.1
  The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these...
