Dataset - B2FIND

The CLASSLA-StanfordNLP model for lemmatisation of standard Serbian 1.2

The model for lemmatisation of standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the SETimes.SR...

The CLASSLA-StanfordNLP model for lemmatisation of standard Croatian 1.2

The model for lemmatisation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training...

The CLASSLA-StanfordNLP model for lemmatisation of non-standard Serbian 1.0

The model for lemmatisation of non-standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the SETimes.SR...

The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian 1.4

The model for lemmatisation of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k...

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.1

ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

Training corpus jos1M 1.2

The jos1M corpus contains 1 million words of sampled paragraphs from the Gigafida corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This...

ReLDI tag+lemma+parse web service for WebLicht

WebLicht (https://weblicht.sfs.uni-tuebingen.de/) registry entry for a web service comprising tokenisation, PoS tagging, lemmatisation and dependency parsing. Tool source files...

The CLASSLA-StanfordNLP model for lemmatisation of standard Bulgarian 1.1

The model for lemmatisation of standard Bulgarian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the BulTreeBank...

Ekspress news article archive (in Estonian and Russian) 1.0

The dataset is an archive of articles published on the Ekspress Meedia news site from 2009 to 2019. It contains over 1.4M articles, mostly in Estonian (1,115,120 articles) with...

CMC training corpus Janes-Tag 1.2

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Word embeddings CLARIN.SI-embed.hr 1.0

CLARIN.SI-embed.hr contains word embeddings induced from a large collection of Croatian texts composed of the Croatian web corpus hrWaC and a 400-million-token collection...

Annotated Corpus of Pre-Standardized Balkan Slavic Literature 1.1

The corpus contains 23 linguistically annotated samples of "damaskini" and other Balkan Slavic manuscripts and print editions from the 15th to the 19th century, together with over 50...

Macedonian linguistic training corpus SETimes.MK 0.1

The SETimes.MK corpus is a sample of 570 sentences from the now unavailable setimes.com website of news articles on topics of South-Eastern Europe. The sentences were manually...

The CLASSLA-StanfordNLP model for lemmatisation of non-standard Slovenian 1.1

The model for lemmatisation of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k...

The CLASSLA-Stanza model for lemmatisation of standard Slovenian 2.0

This model for lemmatisation of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus...

The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian 1.2

The model for lemmatisation of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k...

The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian

The model for lemmatisation of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k...

Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0

ReLDI-NormTagNER-sr 3.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

Croatian Twitter training corpus ReLDI-NormTag-hr 1.1

ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...

Trankit model for SST 2.15

This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the SST treebank...

84 datasets found