Dataset - B2FIND

Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0

ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

Reference corpus of historical Slovene goo300k 1.2

goo300k is a manually annotated reference corpus of historical Slovene. It contains 1,100 pages (about 300,000 tokens) sampled from 89 texts from the period 1584-1899. Each text...

Croatian linguistic training corpus hr500k 2.0

The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and...

Spoken Torlak dialect corpus 1.0 (transcription)

The Torlak corpus represents a spoken variety of the endangered Torlak dialect from the Timok area in Southeast Serbia. It comprises transcripts of interviews with the local...

Word embeddings CLARIN.SI-embed.sl 1.0

CLARIN.SI-embed.sl contains word embeddings induced from a large collection of Slovene texts composed of existing corpora of Slovene, e.g. GigaFida, Janes, KAS and slWaC. The...

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Serb...

The model for morphosyntactic annotation of standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the...

Serbian linguistic training corpus SETimes.SR 2.0

The SETimes.SR training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation,...

MULTEXT-East free lexicons 4.0

The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of...
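The line format described above can be read with a few lines of Python. This is a minimal sketch, not the official MULTEXT-East tooling: the names `LexiconEntry` and `parse_lexicon_line` are hypothetical, and because the description is truncated after the first field, only `word_form` is named here; the other two fields are kept as unnamed placeholders.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon line; only the first field is named in the description."""
    word_form: str   # field (1): the word-form (inflected form)
    field_2: str     # placeholder: not named in the truncated description
    field_3: str     # placeholder: not named in the truncated description


def parse_lexicon_line(line: str) -> LexiconEntry:
    """Split one line into its three tab-separated fields."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(parts)}")
    return LexiconEntry(*parts)
```

Rejecting lines that do not have exactly three fields is a design choice of this sketch; a real lexicon file may need different handling for comments or blank lines.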

CMC training corpus Janes-Tag 3.0

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 15,000 short texts (190,000 words), mostly tweets but also blogs,...

The CLASSLA-Stanza model for morphosyntactic annotation of standard Slovenian...

This model for morphosyntactic annotation of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training...

The CLASSLA-Stanza model for morphosyntactic annotation of standard Bulgarian...

This model for morphosyntactic annotation of standard Bulgarian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the BulTreeBank...

MULTEXT-East non-commercial lexicons 4.0

The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of...

Word embeddings CLARIN.SI-embed.sr 1.0

CLARIN.SI-embed.sr contains word embeddings induced from the srWaC web corpus. The embeddings are based on the skip-gram model of fastText trained on 554,606,544 tokens of...

The CLASSLA-Stanza model for morphosyntactic annotation of standard Macedonia...

This model for morphosyntactic annotation of standard Macedonian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the 1984 training...

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Croa...

The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the...

Annotated Corpus of Pre-Standardized Balkan Slavic Literature

The corpus contains 15 linguistically annotated samples of "damaskini" and other Balkan Slavic manuscripts and print editions from the 16th-19th century, together with over 30...

Training corpus ssj500k 2.2

The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....

The CLASSLA-Stanza model for morphosyntactic annotation of standard Croatian 2.1

The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k training...

The Trankit model for linguistic processing of spoken and written Slovenian 1.1

This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation...

Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.1

ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

64 datasets found