- Parallel sense-annotated corpus ELEXIS-WSD 1.0
ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10...
- Bulgarian-English parallel corpus MaCoCu-bg-en 2.0
The Bulgarian-English parallel corpus MaCoCu-bg-en 2.0 was built by crawling the “.bg” and “.бг” internet top-level domains in 2021, extending the crawl dynamically to other...
- Macedonian-English parallel corpus MaCoCu-mk-en 2.0
The Macedonian-English parallel corpus MaCoCu-mk-en 2.0 was built by crawling the “.mk” and “.мкд” internet top-level domains in 2021, extending the crawl dynamically to other...
- Turkish-English parallel corpus MaCoCu-tr-en 2.0
The Turkish-English parallel corpus MaCoCu-tr-en 2.0 was built by crawling the “.tr” and “.cy” internet top-level domains in 2021, extending the crawl dynamically to other...
- CroSloEngual BERT
Trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. State-of-the-art tool representing...
- CroSloEngual BERT 1.1
Trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. State-of-the-art tool representing...
- Maltese-English parallel corpus MaCoCu-mt-en 2.0
The Maltese-English parallel corpus MaCoCu-mt-en 2.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well....
- Slovene-English parallel corpus slenWaC 1.0
The slenWaC corpus version 1.0 consists of parallel Slovene-English texts crawled from the .si top-level domain for Slovenia. The corpus was built with Spidextor...
- Serbian-English parallel corpus srenWaC 1.0
The srenWaC corpus consists of sentence-aligned parallel Serbian-English texts crawled from the .rs top-level domain for Serbia. The corpus was built with Spidextor...
- Croatian-English parallel corpus MaCoCu-hr-en 1.0
The Croatian-English parallel corpus MaCoCu-hr-en 1.0 was built by crawling the ".hr" internet top-level domain in 2021, extending the crawl dynamically to other domains as...
- Macedonian-English parallel corpus MaCoCu-mk-en 1.0
The Macedonian-English parallel corpus MaCoCu-mk-en 1.0 was built by crawling the ".mk" and ".мкд" internet top-level domains in 2021, extending the crawl dynamically to other...
- Emoji Sentiment Ranking 1.0
A lexicon of 751 emoji characters with automatically assigned sentiment. The sentiment is computed from 70,000 tweets, labeled by 83 human annotators in 13 European languages....
- Japanese-Slovene learner's dictionary jaSlo 3.1
The jaSlo dictionary is primarily intended for Slovene students learning Japanese. For each entry, it contains the Japanese headword (kanji, hiragana or katakana, and romaji),...
- DSI-enriched ParaCrawl 9 en-es corpus
This is a derivative work based on Paracrawl release 9 English-Spanish (https://paracrawl.eu/). This version of the corpus includes a set of probabilities corresponding to the...
- Icelandic-English parallel corpus MaCoCu-is-en 2.0
The Icelandic-English parallel corpus MaCoCu-is-en 2.0 was built by crawling the “.is” internet top-level domain in 2021, extending the crawl dynamically to other domains as...
- Dataset of European Parliament roll-call votes and Twitter activities MEP 1.0
The resource consists of two datasets related to Members of the 8th European Parliament (MEPs). The first one is a dataset of 2,535 roll-call votes of MEPs until 2016-03-01. The...
- Serbian-English parallel corpus MaCoCu-sr-en 1.0
The Serbian-English parallel corpus MaCoCu-sr-en 1.0 was built by crawling the “.rs” and “.срб” internet top-level domains in 2021 and 2022, extending the crawl dynamically to...
- Twitter sentiment for 15 European languages
The dataset contains over 1.6 million tweets (tweet IDs), labeled with sentiment by human annotators. There are 15 Twitter corpora for the corresponding 15 European languages....
- MULTEXT-East non-commercial lexicons 4.0
The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of...
- Icelandic-English parallel corpus MaCoCu-is-en 1.0
The Icelandic-English parallel corpus MaCoCu-is-en 1.0 was built by crawling the ".is" internet top-level domain in 2021, extending the crawl dynamically to other domains as...
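
The MULTEXT-East non-commercial lexicons 4.0 entry in the list above describes a line-based format with three tab-separated fields, of which only the first (the word-form) is named before the description is cut off. A minimal Python sketch of reading one such line might look as follows. The names `LexiconEntry`, `parse_lexicon_line`, `field_2` and `field_3` are hypothetical and chosen for illustration only, and the sample line uses placeholder values, not real lexicon data.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    word_form: str  # field 1: the inflected word-form (named in the description)
    field_2: str    # field 2: not named in the truncated description (placeholder name)
    field_3: str    # field 3: not named in the truncated description (placeholder name)


def parse_lexicon_line(line: str) -> LexiconEntry:
    """Split one lexicon line into its three tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(
            f"expected 3 tab-separated fields, got {len(fields)}: {line!r}"
        )
    return LexiconEntry(*fields)


# Hypothetical sample line with placeholder values.
entry = parse_lexicon_line("wordform\tsecond\tthird\n")
print(entry.word_form)
```

Rejecting lines that do not have exactly three fields is a design choice of this sketch, so that malformed input is noticed early; the lexicon documentation should be consulted for the actual meaning of the second and third fields.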