Dataset - B2FIND

Diccionario de neologismos on line

Lexicographic resource containing 3,530 neologisms documented in the Spanish-language press between 1989 and 2007.

Universal Dependencies 2.8

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

CoNLL 2017 and 2018 Shared Task Blind and Preprocessed Test Data

CoNLL 2017 and 2018 shared tasks: Multilingual Parsing from Raw Text to Universal Dependencies. This package contains the test data in the form in which they were presented to...

Universal Dependencies 1.0

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

VOLEM

Multilingual Verbal Lexicon: Catalan, Spanish (connection with French and Basque of other groups)

Deep Universal Dependencies 2.5

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3105). It contains additional...

Corpus bilingüe d’alternança de llengües (codeswitching)

8 interactive recordings of group dynamics. Bilingual speakers (L1 -> English; L1 -> Catalan/Spanish).

HamleDT 2.0

HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a...

Universal Segmentations 1.0 (UniSegments 1.0)

Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations harmonised into a cross-linguistically consistent annotation...

WMT 13 Test Set

We provide the Vietnamese version of the multi-lingual test set from WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness,...

OmegaWiki

This dataset has no description

Universal Derivations v1.1

Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent...

SynSemClass 5.0

The SynSemClass synonym verb lexicon version 5.0 is a multilingual resource that enriches previous editions of this event-type ontology with a new language, Spanish. The...

Multilingual Central Repository

Multilingual lexical database that follows the model proposed by the EuroWordNet project. The MCR integrates into the same EuroWordNet framework wordnets from five different...

Universal Dependencies 2.4 Models for UDPipe (2019-05-31)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 90 treebanks of 60 languages of Universal Dependencies 2.4 Treebanks, created solely using UD 2.4 data...

NameTag 3 Multilingual Model 250203

This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/). NameTag 3 is an open-source tool for both flat and nested named...

CorPipe 23 multilingual CorefUD 1.2 model (corpipe23-corefud1.2-240906)

The corpipe23-corefud1.2-240906 is an mT5-large-based multilingual model for coreference resolution usable in CorPipe 23 (https://github.com/ufal/crac2023-corpipe). It is released...

C4Corpus (CC BY-SA part)

A large web corpus (over 10 billion tokens) licensed under the Creative Commons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly...

Bwananet

Tool for querying the Technical Corpus of the Institut Universitari de Lingüística Aplicada.

PALIC

A package of tools for processing the Corpus Tècnic in Catalan and Spanish. It includes a preprocessor, a PoS tagger and a linguistic disambiguator.

1,005 datasets found