-
Interaction and dialogue with large-scale textual data: Parliamentary speeche...
Prof. Dr. Andreas Blätte's keynote talk at the CLARIN Annual Conference 2015. Additional material, including the presented 3D visualisations, is available via... -
Model of English OCR Post-Correction
This is an OpenNMT-py model for OCR post-correction in English. For usage, see https://github.com/mikahama/natas. This is part of the following publication: Mika Hämäläinen, and... -
CATUC: Corpus académico de textos universitarios en castellano
This research was conducted on a corpus of texts produced by first-year undergraduate students at the University of the Basque Country (UPV/EHU). The corpus is called CATUC:... -
NoticIA
We present NoticIA, a dataset consisting of 850 Spanish news articles featuring prominent clickbait headlines, each paired with high-quality, single-sentence generative... -
Psycholinguistic Experiment Video
This is a video recording that is being used in psycholinguistic experiments. -
Laburpen corpusa The Basque Summaries Corpus
School summaries obtained from Unai Atutxa's thesis (Atutxa, 2022) are available under the CC BY-NC 4.0 license. A total of 1676 extractions and abstractions have been... -
SemMdf - Semantic Database for Moksha
This SQLite database contains Moksha lemmas and their frequencies in a big corpus. The lemmas are linked to each other based on the syntactic relations they have had in the... -
SemKpv - Semantic Database for Komi-Zyrian
This SQLite database contains Komi-Zyrian lemmas and their frequencies in a big corpus. The lemmas are linked to each other based on the syntactic relations they have had in the... -
Skolt Sami - North Sami Cognates
A human-curated list of Skolt Sami (sms) - North Sami (sme) cognates found with an automatic method described in: Hämäläinen, M., & Rueter, J. (2019). Finding Sami Cognates... -
Celebrities and Famous People, and their Properties
Context: This dataset is based on the work presented in the following publication; please cite it if you use the data in an academic publication: Alnajjar, K., Hämäläinen, M.,... -
UralicNLP - The NLP library for Uralic languages
UralicNLP is a natural language processing library targeted mainly for Uralic languages. UralicNLP can produce morphological analysis, generate morphological forms, lemmatize... -
El mejor conjunto de datos para identificación del sarcasmo
This corpus contains all the utterances from two episodes of South Park (Latin American dub) and two episodes of Archer (dub for Spain). Each utterance has been annotated... -
s.morfcorpus.6ec19594.20131227-2309
WMT 2013 Crawled News monolingual corpus, Czech, segmented by Morfessor -
Exploring genealogical blends_Online Corpus
The online corpus supplement to the paper "Exploring genealogical blends: the Surinamese Creole Cluster and the Virgin Islands Dutch Creole Cluster", published in the CLARIN... -
Movie Title Puns
Context: The data is based on the following paper on pun generation: Hämäläinen, M., & Alnajjar, K. (2019). Modelling the Socialization of Creative Agents in a... -
SemFi: Finnish Semantics with Syntactic Relations
Context: This dataset is covered in detail in the following publication: Hämäläinen, Mika. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost... -
Finnish Dialect Normalization Model
This is an OpenNMT-py model for normalizing spoken Finnish text into written Finnish. For usage, please see https://github.com/mikahama/murre/. This model has been produced in... -
SentiLex-PT 02
SentiLex-PT is a sentiment lexicon for Portuguese, made up of 7,014 lemmas and 82,347 inflected forms. In detail, the lexicon describes: 4,779 (16,863) adjectives, 1,081... -
HELLO CAMPANIA! Philippines Collection
The Philippines collection contains data for 66 speakers: 32 first generation (G1), 28 second generation (G2), 6 homeland (G0). The collection contains three folders for each... -
HELLO CAMPANIA! Bangladesh Collection
The collection contains 11 interviews with first-generation Bangladeshi migrants in Naples. It also contains language portraits of the migrants.