- Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0
  ReLDI-NormTagNER-sr 3.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
- News comment corpus Janes-News 1.0
  Janes-News is an annotated corpus of comments on online news articles from websites rtvslo.si, mladina.si, and reporter.si from the period 2007-03 to 2015-01. The corpus is...
- Forum corpus Janes-Forum 1.0
  Janes-Forum is an annotated corpus of Slovene forums from websites med.over.net, avtomobilizem.com, and kvarkadabra.net from the period 2001-02 to 2015-01. The corpus is...
- Croatian Twitter training corpus ReLDI-NormTag-hr 1.1
  ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...
- CMC training corpus Janes-Norm 1.1
  Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...
- Twitter corpus Janes-Tweet 1.0
  Janes-Tweet is an annotated corpus of almost 10 million tweets posted from 2013-06 to 2017-06 by approx. 9,000 users that tweet mostly in Slovene. The corpus is structured into...
- cSMTiser: word standardisation
  Word standardisation of non-standard language as found in user-generated content, using cSMTiser (https://github.com/clarinsi/csmtiser), a tool for text normalisation via...
