Dataset - B2FIND

Parallel sense-annotated corpus ELEXIS-WSD 1.3

ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.3 contains sentences for 10...

Universal Dependencies 2.17 models for UDPipe 2 (2025-11-25)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 169 treebanks of 93 languages of Universal Dependencies 2.17 Treebanks, created solely using UD 2.17 data...

LegISTyr test set

LegISTyr is a machine translation test set for evaluating the quality of legal terminology translation from Italian to South Tyrolean German, a minor standard variety of German....

StarwarsNER French Italian Corpus - sample

The StarwarsNER French Italian Corpus - sample is a multilingual benchmark resource for Named Entity Recognition (NER) in the wastewater and stormwater management domain. It...

KIParla - KIPasti transcripts

The KIPasti corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface. The ParlaBO corpus was compiled...

KIParla - ParlaTO transcripts

The ParlaTO corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface. The ParlaTO corpus was...

KIParla - ParlaBO transcripts

The ParlaBO corpus is part of the larger KIParla collection, which can be freely queried through the NoSketch Engine interface. The ParlaBO corpus was compiled within the...

KIParla - KIP transcripts

The KIP corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface. The KIP corpus was compiled within...

MIXPAR Database: Version 1.0 (September 2025)

MIXPAR: A Database of Mixed Perfective Auxiliation in Italo-Romance (v1.0). This is the first public release (v1.0) of the MIXPAR database, a large-scale dataset documenting...

Multilingual comparable corpora of parliamentary debates ParlaMint 5.0

ParlaMint 5.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 5.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

Multilingual comparable corpora of parliamentary debates ParlaMint 4.1

ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

Wortschatz

Collected from newspaper texts, webcrawling, etc.: words (+frequency), cooccurrences (+graph), left/right neighbours, example sentences

Deltacorpus

Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger...

Deep Universal Dependencies 2.4

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-2988). It contains additional...

Copenhagen Dependency Treebanks versions 1-3

Parallel treebanks with annotation of syntax, discourse, coreference, morphology, and semantics. Version 3 also includes the Danish Dependency Treebank (version 1) and the...

Multilingual static embeddings for Verbal Multiword Expressions trained on PA...

This resource is a set of 14 vector spaces for single words and Verbal Multiword Expressions (VMWEs) in different languages (German, Greek, Basque, French, Irish, Hebrew, Hindi,...

The National Certificates corpus

The NC test results, background information, and speaking and writing performances in 9 foreign / second languages. A web-based database (HTML files).

335 datasets found