CLARIN - Repositories

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint-en.ana 4.1 is the English machine translation of the ParlaMint.ana 4.1 (http://hdl.handle.net/11356/1911) set of corpora of parliamentary debates across Europe. The...

ParCzech 4.0

The ParCzech 4.0 corpus consists of stenographic protocols that record the Chamber of Deputies' meetings in the 7th term (2013-2017), the 8th term (2017-2021) and the current...

Swedish-speaking population of Finland - statistics

Data about the Swedish-speaking minority in Finland from Finstat, used for a report in Power BI.

Possessive Pronoun Preference

The contribution includes the data frames and the R script (Markdown file) belonging to the paper "Morphological and Pragmatic Conditioning of Reflexivity in Possessive...

LITUND corpus v1

LITUND contains two comparable corpora: 1. Unreliable news texts. 147 full-text articles (100,678 words) identified as misleading by professional fact-checkers. The corpus...

MultiCo

The MultiCo multimodal corpus is one of the outcomes of the project "Digital Research Infrastructure for the Humanities and Arts Studies DARIAH-PL." This project was funded by...

Monitor corpus of Slovene Trendi 2025-05

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 57 publishers. Trendi 2025-05 covers the period from January...

MorfoCzech

A dictionary of morphologically segmented word forms in Czech. Rules of manual segmentation are described in Pelegrinová, K., Mačutek, J., Čech, R. (2021). The Menzerath-Altmann...

Deutsches Wörterbuch (1DWB, by Jacob and Wilhelm Grimm)

Retro-digitized version of the first edition of the Deutsches Wörterbuch by Jacob and Wilhelm Grimm, originally published from 1854 to 1960.

TITUS Middle Welsh

Ca. 20,000 tokens; linked with a relational database; XML encoding in progress.

UDPipe tagger Web Service for Weblicht

UDPipe is a trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files (https://lindat.mff.cuni.cz/services/udpipe/)

CzeSL Grammatical Error Correction Dataset (CzeSL-GEC)

CzeSL-GEC is a corpus containing sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech and...

Spoken corpus of Karel Makoň (2020-11-16)

Talks given by Karel Makoň to his friends from the late 1960s through the early 1990s. The topic is mostly Christian mysticism.

Delftse Bijbel 1477

Digitised version of the Delftse Bijbel 1477

FAUST cs-en 0.5

This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308)....

Arabic Phonetic Rules

Description: This XML file describes the Arabic phonetic constraints (rules) resulting from the analysis of the lexicons (Taj Alarous, Al ain, Lisan Al arab, Alwassit and...

Wortschatz

Collected from newspaper texts, web crawling, etc.: words (+frequency), co-occurrences (+graph), left/right neighbours, example sentences.

LiFR-Lite

Corpus of Czech educational texts for readability studies, with paraphrases, measured reading comprehension, and a multi-annotator subjective rating of selected text features...

Digitized Press

Collection of different digitized mastheads in Catalan and Spanish, covering a time span from 1808 to 2008. The collection, which is kept in the Girona City Council Archive,...

MLASK: Multimodal Summarization of Video-based News Articles

The MLASK corpus consists of 41,243 multi-modal documents – video-based news articles in the Czech language – collected from Novinky.cz (https://www.novinky.cz/) and Seznam...

4,938 datasets found