-
Diakorp v6: diachronic corpus of Czech
A diachronic corpus of Czech comprising 3.45 million words (i.e. 4.1 million tokens). It contains 116 texts from the 14th to the 20th century. The texts are transcribed, not... -
HamleDT 2.0
HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a... -
NAFIS Arabic Stemming Gold Standard Corpus
Normalized Arabic Fragments for Inestimable Stemming (NAFIS) is an Arabic stemming gold standard corpus composed of a collection of texts, selected to be representative of... -
MTMonkey
MTMonkey is a web service which handles and distributes JSON-encoded HTTP requests for machine translation (MT) among multiple machines running an MT system, including text pre-... -
Universal Segmentations 1.0 (UniSegments 1.0)
Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations harmonised into a cross-linguistically consistent annotation... -
Large Corpus of Czech Parliament Plenary Hearings
We present a large corpus of Czech parliament plenary sessions. The corpus consists of approximately 444 hours of speech data and corresponding text transcriptions. The whole... -
"Al wassit" LMF Arabic dictionary
An LMF-conformant XML-based file containing the electronic version of the al wassit dictionary, an Arabic monolingual dictionary produced by the Academy of the Arabic Language... -
CEC6-Converter
This software converts *.cec6.gz files into 24 formats commonly used in corpus linguistics / NLProc. It can be run on all modern... -
IDENTICv1.0-raw
Raw Text -
SQAD 3.2
Simple question answering database version 3.2 (SQAD v3.2) created from Czech Wikipedia. The new version consists of more than 16000 records. Each record of SQAD consists of... -
LiFR-Law. Corpus of Paraphrased Czech Administrative Texts with Reading Compr...
LiFR-Law is a corpus of Czech legal and administrative texts with measured reading comprehension and a subjective expert annotation of diverse textual properties based on the... -
Extended Morphosyntactic Testset for Word2Vec
We have created a test set for syntactic questions, presented in the paper [1], which is more general than Mikolov's [2]. Since we were interested in morphosyntactic relations, we... -
skTenTen
Slovak large web corpus skTenTen, comprising 876,003,720 tokens. -
The Use of Machine Translation by Ukrainian War Refugees in Czechia
Data from a questionnaire survey conducted from 2022-08-25 to 2022-11-15 and exploring the use of machine translation by Ukrainian refugees in the Czech Republic. The presented... -
MUSCIMA++
MUSCIMA++ is a dataset of handwritten music notation for musical symbol detection. It contains 91255 symbols, consisting of both notation primitives and higher-level notation... -
Czech Models (MorfFlex CZ 160310 + PDT 3.0) for MorphoDiTa 160310
Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ... -
VALLEX 3.0
VALLEX 3.0 provides information on the valency structure (combinatorial potential) of verbs in their particular senses, which are characterized by glosses and examples. VALLEX... -
sqad 3.0
Simple question answering database version 3 (SQAD v3) created from Czech Wikipedia. The new version consists of 13477 records. Each record of SQAD consists of multiple files -... -
Ptakopět data: the dataset for experiments on outbound translation
The dataset used for the Ptakopět experiment on outbound machine translation. It consists of screenshots of web forms with user queries entered. The queries are also available... -
CoNLL-based Extended Czech Named Entity Corpus 2.0
This is the Czech Named Entity Corpus 2.0 transformed into the CoNLL format. The original corpus can be downloaded from: http://hdl.handle.net/11858/00-097C-0000-0023-1B22-8. The...
