Dataset - B2FIND

Multilingual Culture-Independent Word Analogy Datasets

The word analogy task evaluates word embeddings based on analogous word pairs (e.g. "Paris - France" should be equivalent to "Rome - Italy", "son - daughter" should be equivalent to...

Croatian-English parallel corpus MaCoCu-hr-en 2.0

The Croatian-English parallel corpus MaCoCu-hr-en 2.0 was built by crawling the “.hr” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other...

DSI-enriched ParaCrawl 9 en-nl corpus

This is a derivative work based on Paracrawl release 9 English-Dutch (https://paracrawl.eu/). This version of the corpus includes a set of probabilities corresponding to the...

Turkish-English parallel corpus MaCoCu-tr-en 1.0

The Turkish-English parallel corpus MaCoCu-tr-en 1.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other...

Bulgarian-English parallel corpus MaCoCu-bg-en 1.0

The Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 was built by crawling the ".bg" and ".бг" internet top-level domains in 2021, extending the crawl dynamically to other...

Finnish-English parallel corpus fienWaC 1.0

The fienWaC corpus version 1.0 consists of parallel Finnish-English texts crawled from the .fi top-level domain for Finland. The corpus was built with Spidextor...

Post-edited and error annotated machine translation corpus PE²rr 1.0

The PE²rr corpus contains source language texts from different domains along with their automatically generated translations into several morphologically rich languages, their...

MultiEmo: Multilingual, Multilevel, Multidomain Sentiment Analysis Corpus of ...

MultiEmo is a new benchmark data set for the multilingual sentiment analysis task, covering 11 languages. The collection contains consumer reviews from four domains: medicine,...

LitLat BERT

A trilingual BERT-like (Bidirectional Encoder Representations from Transformers) model trained on Lithuanian, Latvian, and English data. State-of-the-art tool representing...

PolEmo 1.0 + MultiEmo-Test 1.0 Multilingual Sentiment Analysis Dataset for KE...

PolEmo 1.0 + MultiEmo-Test 1.0: Corpus of Multi-Domain Consumer Reviews. Test dataset from PolEmo 1.0 was translated to eight different languages: Dutch, English, French,...

Prague Czech-English Dependency Treebank 2.0 - Russian translation

Prague Czech-English Dependency Treebank - Russian translation (PCEDT-R) is a project of translating a subset of Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) to...

UFAL Parallel Corpus of North Levantine 1.0

This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the...

Preamble 1.0

Preamble 1.0 is a multilingual annotated corpus of the preamble of the EU REGULATION 2020/2092 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL. The corpus consists of four...

PAWS

PAWS is a multilingual parallel treebank with coreference annotation. It consists of English texts from the Wall Street Journal translated into Czech, Russian and Polish. In...

Czech Malach Cross-lingual Speech Retrieval Test Collection

The package contains Czech recordings from the Visual History Archive, which consists of interviews with Holocaust survivors. The archive consists of audio recordings, four...

Hindi Visual Genome 1.1

Hindi Visual Genome 1.1 is an updated version of Hindi Visual Genome 1.0. The update concerns primarily the text part of Hindi Visual Genome, fixing translation issues...

Universal Segmentations 1.0 (UniSegments 1.0)

Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations harmonised into a cross-linguistically consistent annotation...

Prague Czech-English Dependency Treebank 2.0 Coref

The Prague Czech-English Dependency Treebank 2.0 Coref (PCEDT 2.0 Coref) is a parallel treebank building upon the original PCEDT 2.0 release and enriching it with the extended...

Europarl QTLeap WSD/NED corpus

This corpus is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu). The texts are sentences from the Europarl parallel...

Hindi Visual Genome 1.0

Hindi Visual Genome 1.0 is a multimodal dataset consisting of text and images, suitable for the English-to-Hindi multimodal machine translation task and multimodal research. We...

63 datasets found