Dataset - B2FIND

How Much Top-Down and Bottom-Up do We Need to Build a Lemmatized Corpus?

Building a lemmatised corpus of German Sign Language (DGS) using iLex; lemmatisation as top-down and lexicon building as bottom-up process; lemma revision

Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0

ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

Reference corpus of historical Slovene goo300k 1.2

goo300k is a manually annotated reference corpus of historical Slovene. It contains 1,100 pages (about 300,000 tokens) sampled from 89 texts from the period 1584-1899. Each text...

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0

ReLDI-NormTagNER-sr 2.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

CMC training corpus Janes-Tag 1.0

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Serbian Twitter training corpus ReLDI-NormTag-sr 1.0

ReLDI-NormTag-sr 1.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...

The CLASSLA-Stanza model for lemmatisation of standard Croatian 2.1

The model for lemmatisation of standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k training corpus...

Spoken Torlak dialect corpus 1.0 (transcription)

Torlak corpus represents a spoken variety of the endangered Torlak dialect from the Timok area in Southeast Serbia. It comprises transcripts of interviews with the local...

The CLASSLA-StanfordNLP model for lemmatisation of standard Croatian 1.1

The model for lemmatisation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training...

ReLDI tag+lemma web service for WebLicht

WebLicht (https://weblicht.sfs.uni-tuebingen.de/) registry entry for webservice comprising tokenisation, PoS tagging, and lemmatisation.

CMC training corpus Janes-Tag 2.0

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Training corpus jos1M 1.1

The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This...

The CLASSLA-Stanza model for lemmatisation of standard Macedonian 2.1

The model for lemmatisation of standard Macedonian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the 1984 training...

The CLASSLA-Stanza model for lemmatisation of non-standard Serbian 2.1

The model for lemmatisation of non-standard Serbian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SETimes.SR training corpus...

Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0

ReLDI-NormTagNER-hr 2.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

The CLASSLA-StanfordNLP model for lemmatisation of non-standard Serbian 1.1

The model for lemmatisation of non-standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the SETimes.SR...

MULTEXT-East free lexicons 4.0

The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of...

The CLASSLA-Stanza model for lemmatisation of non-standard Croatian 2.1

The model for lemmatisation of non-standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k training corpus...

CMC training corpus Janes-Tag 3.0

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 15,000 short texts (190,000 words), mostly tweets but also blogs,...

The CLASSLA-StanfordNLP model for lemmatisation of standard Croatian

The model for lemmatisation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training...

84 datasets found