- Bilingual terminology extraction dataset KAS-biterm 1.0
  The KAS-biterm bilingual term extraction dataset contains complete sentences selected from PhD theses from the KAS corpus of Slovene academic writing. Only sentences that have a...
- Croatian Twitter training corpus ReLDI-NormTag-hr 1.0
  ReLDI-NormTag-hr 1.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...
- Choice of plausible alternatives dataset in Macedonian COPA-MK
  The COPA-MK dataset (Choice of plausible alternatives in Macedonian) is a translation of the English COPA dataset (https://people.ict.usc.edu/~gordon/copa.html) by following the...
- Manually sentiment annotated Slovenian news corpus SentiNews 1.0
  Between 2 and 6 annotators independently annotated the sentiment of a stratified random sample of 10,427 documents from the Slovenian news portals 24ur, Dnevnik, Finance, Rtvslo, and...
- Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0
  ReLDI-NormTagNER-sr 3.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
- Serbian Twitter training corpus ReLDI-NormTag-sr 1.0
  ReLDI-NormTag-sr 1.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...
- MULTEXT-East "1984" annotated corpus 4.0
  The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original...
- Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0
  ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
- CMC training corpus Janes-Norm 1.0
  Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...
- List of formulaic sequences in standard written Slovenian
  This document contains 1,891 formulaic sequences in standard written Slovenian, i.e. frequently recurring strings of two to five words, manually annotated for syntactic...
- Terminology identification dataset KAS-term 1.0
  The dataset contains 22,950 term candidates extracted from 15 Slovenian PhD theses. The term candidates are of length 1 to 4, extracted via morphosyntactic patterns and the...
- Choice of plausible alternatives dataset in Serbian COPA-SR
  The COPA-SR dataset (Choice of plausible alternatives in Serbian) is a translation of the English COPA dataset (https://people.ict.usc.edu/~gordon/copa.html) by following the...
- Corpus of term-annotated texts RSDO5 1.0
  The RSDO5 corpus was compiled in order to serve as a training set for automatic term identification. It consists of 12 texts with 250,000 words and almost 38,000 manually...
- List of formulaic sequences in spoken Slovenian
  This document contains 2,374 formulaic sequences in spoken Slovenian, i.e. frequently recurring strings of two to five words, manually annotated for syntactic structure,...
- Sentiment Annotated Dataset of Croatian News
  We present a collection of sentiment annotations for news articles (article links) in the Croatian language. A set of 2025 news articles was gathered from 24sata, one of the leading...
- CMC training corpus Janes-Tag 1.1
  Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
- Choice of plausible alternatives dataset in Croatian COPA-HR
  The COPA-HR dataset (Choice of plausible alternatives in Croatian) is a translation of the English COPA dataset (https://people.ict.usc.edu/~gordon/copa.html) by following the...
- Slovenian Twitter hate speech dataset IMSyPP-sl
  A hand-labeled training (50,000 tweets labeled twice) and evaluation set (10,000 tweets labeled twice) for hate speech on Slovenian Twitter. The data files contain tweet IDs,...
- Dataset of Slovene idiomatic expressions SloIE
  SloIE is a manually labelled dataset of Slovene idiomatic expressions. It contains 29,400 sentences with 75 different expressions that can occur with either a literal or an...
- Slovene Web genre identification corpus GINCO 1.0
  The Slovene Web genre identification corpus GINCO 1.0 contains web texts, manually annotated with genre, from two Slovene web corpora, the slWaC 2.0 corpus, crawled in 2014, and...