-
Linguistically annotated multilingual comparable corpora of parliamentary deb...
ParlaMint-en.ana 4.1 is the English machine translation of the ParlaMint.ana 4.1 (http://hdl.handle.net/11356/1911) set of corpora of parliamentary debates across Europe. The... -
ParCzech 4.0
The ParCzech 4.0 corpus consists of stenographic protocols that record the Chamber of Deputies' meetings in the 7th term (2013-2017), the 8th term (2017-2021) and the current... -
Swedish-speaking population of Finland - statistics
Data about the Swedish-speaking minority in Finland from Finstat, used for a report in Power BI -
Possessive Pronoun Preference
The contribution includes the data frames and the R script (Markdown file) belonging to the paper "Morphological and Pragmatic Conditioning of Reflexivity in Possessive... -
LITUND corpus v1
LITUND contains two comparable corpora: 1. Unreliable news texts. 147 full-text articles (100,678 words) identified as misleading by professional fact-checkers. The corpus... -
MultiCo
The MultiCo multimodal corpus is one of the outcomes of the project "Digital Research Infrastructure for the Humanities and Arts Studies DARIAH-PL." This project was funded by... -
Monitor corpus of Slovene Trendi 2025-05
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 57 publishers. Trendi 2025-05 covers the period from January... -
MorfoCzech
A dictionary of morphologically segmented word forms in Czech. Rules of manual segmentation are described in Pelegrinová, K., Mačutek, J., Čech, R. (2021). The Menzerath-Altmann... -
Deutsches Wörterbuch (1DWB, by Jacob and Wilhelm Grimm)
Retro-digitized version of the first edition of the Deutsches Wörterbuch by Jacob and Wilhelm Grimm, originally published from 1854 to 1960 -
TITUS Middle Welsh
Ca. 20,000 tokens; linked with a relational database; XML encoding in progress -
UDPipe tagger Web Service for Weblicht
UDPipe is a trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files (https://lindat.mff.cuni.cz/services/udpipe/) -
CzeSL Grammatical Error Correction Dataset (CzeSL-GEC)
CzeSL-GEC is a corpus containing sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech and... -
Spoken corpus of Karel Makoň (2020-11-16)
Talks given by Karel Makoň to his friends from the late 1960s through the early 1990s. The topic is mostly Christian mysticism. -
Delftse Bijbel 1477
Digitised version of the Delftse Bijbel 1477 -
FAUST cs-en 0.5
This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308).... -
Arabic Phonetic Rules
Description: this XML file describes the Arabic phonetic constraints (rules) resulting from the analysis of the lexicons (Taj Alarous, Al ain, Lisan Al arab, Alwassit and... -
Wortschatz
Collected from newspaper texts, web crawling, etc.: words (+frequency), co-occurrences (+graph), left/right neighbours, example sentences -
LiFR-Lite
Corpus of Czech educational texts for readability studies, with paraphrases, measured reading comprehension, and a multi-annotator subjective rating of selected text features... -
Digitized Press
Collection of different digitized mastheads in Catalan and Spanish, covering a time span from 1808 to 2008. The collection, which is kept in the Girona City Council Archive,... -
MLASK: Multimodal Summarization of Video-based News Articles
The MLASK corpus consists of 41,243 multi-modal documents – video-based news articles in the Czech language – collected from Novinky.cz (https://www.novinky.cz/) and Seznam...
