Dataset - B2FIND

Diachrono

Polish texts from the 17th to the 19th century.

Word embeddings for Polish (KGR10, Fasttext binary) kgr10_fasttext_bin_v1

Distributional language model (binary) for Polish trained on KGR10 using Fasttext (vector dimension: 100).

LCM-PL

A dictionary describing the abstraction level of verbs.

ASPAC – Swedish-Polish (2017-10-16)

Part of the Amsterdam Slavic Parallel Aligned Corpus. The material is sentence-scrambled.

JRC EU DGT Translation Memory Parsebank DGT-UD 1.0

DGT-UD is a 2 billion word 23-language parallel syntactically parsed corpus, which consists of the JRC DGT translation memory of European law, automatically annotated with...

Multilingual comparable corpora of parliamentary debates ParlaMint 3.0

ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2022, with the individual corpora...

Treq Translation Equivalents (ELEXIS)

Data for Treq interface 2.0 derived from the InterCorp parallel corpus release 12.

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 2.1 is a multilingual set of 17 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20...

Parliamentary spoken corpus of Polish ParlaSpeech-PL 1.0

The ParlaSpeech-PL dataset is built from the transcripts of parliamentary proceedings available in the Polish part of the ParlaMint corpus, and the parliamentary recordings...

Concreteness and imageability lexicon MEGA.HR-Crossling

The lexicon contains concreteness and imageability predictions of words in 77 languages. The resource is built via supervised machine learning, using average human responses...

Emoji Sentiment Ranking 1.0

A lexicon of 751 emoji characters with automatically assigned sentiment. The sentiment is computed from 70,000 tweets, labeled by 83 human annotators in 13 European languages....

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million...

Multilingual comparable corpora of parliamentary debates ParlaMint 4.0

ParlaMint 4.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2022, with the individual corpora...

MULTEXT-East "1984" annotated corpus 4.0

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original...

GLOBAL Polish-French Dictionary - MLDS (ELEXIS)

A general language Polish to French dictionary.

Twitter sentiment for 15 European languages

The dataset contains over 1.6 million tweets (tweet IDs), labeled with sentiment by human annotators. There are 15 Twitter corpora for the corresponding 15 European languages....

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 4.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

Multilingual comparable corpora of parliamentary debates ParlaMint 1.0

ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting at the end of 2015 and extending to mid-2020, with each corpus being about...

653 datasets found