Dataset - B2FIND

NoticIA

We present NoticIA, a dataset consisting of 850 Spanish news articles featuring prominent clickbait headlines, each paired with high-quality, single-sentence generative...

Psycholinguistic Experiment Video

A video recording used in psycholinguistic experiments.

Laburpen corpusa The Basque Summaries Corpus

School summaries obtained from Unai Atutxa's thesis (Atutxa, 2022) are available under the CC BY-NC 4.0 license. A total of 1676 extractions and abstractions have been...

SemMdf - Semantic Database for Moksha

This SQLite database contains Moksha lemmas and their frequencies in a large corpus. The lemmas are linked to each other based on the syntactic relations they have had in the...

SemKpv - Semantic Database for Komi-Zyrian

This SQLite database contains Komi-Zyrian lemmas and their frequencies in a large corpus. The lemmas are linked to each other based on the syntactic relations they have had in the...

Skolt Sami - North Sami Cognates

A human-curated list of Skolt Sami (sms) - North Sami (sme) cognates found with an automatic method described in: Hämäläinen, M., & Rueter, J. (2019). Finding Sami Cognates...

Celebrities and Famous People, and their Properties

Context: This dataset is based on the work presented in the following publication; please cite it if you use the data in an academic publication: Alnajjar, K., Hämäläinen, M.,...

UralicNLP - The NLP library for Uralic languages

UralicNLP is a natural language processing library aimed mainly at Uralic languages. UralicNLP can produce morphological analysis, generate morphological forms, lemmatize...

El mejor conjunto de datos para identificación del sarcasmo

This corpus contains all utterances from two episodes of South Park (Latin American Spanish dub) and two episodes of Archer (European Spanish dub). Each utterance has been annotated...

s.morfcorpus.6ec19594.20131227-2309

WMT 2013 Crawled News monolingual corpus, Czech, segmented by Morfessor

Exploring genealogical blends_Online Corpus

The online corpus supplement to the paper "Exploring genealogical blends: the Surinamese Creole Cluster and the Virgin Islands Dutch Creole Cluster", published in the CLARIN...

Movie Title Puns

Context: The data is based on the following paper on pun generation: Hämäläinen, M., & Alnajjar, K. (2019). Modelling the Socialization of Creative Agents in a...

SemFi: Finnish Semantics with Syntactic Relations

Context: This dataset is covered in detail in the following publication: Hämäläinen, Mika. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost...

Finnish Dialect Normalization Model

This is an OpenNMT-py model for normalizing spoken Finnish text into written Finnish. For usage, please see https://github.com/mikahama/murre/. This model has been produced in...

SentiLex-PT 02

SentiLex-PT is a sentiment lexicon for Portuguese, made up of 7,014 lemmas and 82,347 inflected forms. In detail, the lexicon describes: 4,779 (16,863) adjectives, 1,081...
