Dataset - B2FIND

UDPipe tagger Web Service for Weblicht

UDPipe is a trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files (https://lindat.mff.cuni.cz/services/udpipe/)

CzeSL Grammatical Error Correction Dataset (CzeSL-GEC)

CzeSL-GEC is a corpus containing sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech and...

Spoken corpus of Karel Makoň (2020-11-16)

Talks given by Karel Makoň to his friends from the late 1960s through the early 1990s. The topic is mostly Christian mysticism.

Delftse Bijbel 1477

Digitised version of the Delftse Bijbel 1477

FAUST cs-en 0.5

This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308)....

Arabic Phonetic Rules

Description: this XML file describes the Arabic phonetic constraints (rules) resulting from the analysis of the lexicons (Taj Alarous, Al ain, Lisan Al arab, Alwassit and...

Wortschatz

Collected from newspaper texts, webcrawling, etc.: words (+frequency), cooccurrences (+graph), left/right neighbours, example sentences

LiFR-Lite

Corpus of Czech educational texts for readability studies, with paraphrases, measured reading comprehension, and a multi-annotator subjective rating of selected text features...

Digitized Press

Collection of different digitized mastheads in Catalan and Spanish, covering a time span from 1808 to 2008. The collection, which is kept in the Girona City Council Archive,...

MLASK: Multimodal Summarization of Video-based News Articles

The MLASK corpus consists of 41,243 multi-modal documents – video-based news articles in the Czech language – collected from Novinky.cz (https://www.novinky.cz/) and Seznam...

Kacenka : parallel corpus of English and Czech texts

Parallel corpus, 3,297,283 words. The idea was to create a small parallel corpus that would make it possible to work with entire texts in translation analysis rather than short extracts....

Hausa Visual Genome 1.0

Hausa Visual Genome 1.0, a multimodal dataset consisting of text and images suitable for English-to-Hausa multimodal machine translation tasks and multimodal research. We...

Universal Dependencies 1.2 Models for Parsito

Parsing models for all Universal Dependencies 1.2 Treebanks, created solely using UD 1.2 data (http://hdl.handle.net/11234/1-1548). To use these models, you need the Parsito binary,...

vocabulary_analysis

Statistical analysis service: It calculates different lexicometric measures and displays them graphically (tokens, types, hapaxes & type/token ratio).

TITUS Old Saxon

ca. 40.000 tokens; linked with relational database; XML-encoding in progress

Many Czech References for 50 Sentences Selected from WMT11 Data

This dataset contains the full set of the many Czech translations produced for 50 English source sentences from the WMT11 test set (http://www.statmt.org/wmt11). In total, there are...

One-million Corpus of Croatian Literary Language

written; reference corpus; general; diachronic; monolingual

ParCzech PS7 2.0

The ParCzech PS7 2.0 corpus is the second version of ParCzech PS7 consisting of stenographic protocols that record the Chamber of Deputies' meetings held in the 7th term between...

Open SDP

The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data...

Språkbanken (Swedish Language Bank)

Mainly written Swedish corpora (all time periods except Runic Swedish; various genres, including learner corpora) and lexicons; some non-Swedish corpora (Faroese, Old Icelandic,...

4,790 datasets found