CLARIN - Repositories

Ukrainian War Refugees as Self-Translators Dataset

Data from a questionnaire survey conducted from 2022-08-25 to 2022-11-15 and exploring the use of machine translation by Ukrainian refugees in the Czech Republic. The presented...

OdiEnCorp 2.0

We have collected English-Odia parallel data for the purposes of NLP research of the Odia language. The data for the parallel corpus was extracted from existing parallel...

Czech Grammar Agreement Dataset for Evaluation of Language Models

AGREE is a dataset and task for evaluation of language models based on grammar agreement in Czech. The dataset consists of sentences with marked suffixes of past tense verbs....

Distribution of Mandarin synesthetic adjectives in five senses

BushBank

Czech corpus annotated for NP and clause chunks by 3-11 annotators (with average inter-annotator agreement at 88%). It consists of 10,000 sentences.

Czech Verbal MWEs

Lexicon of Czech verbal multiword expressions (VMWEs) used in Parseme Shared Task 2017....

CorPipe 23 multilingual CorefUD 1.2 model (corpipe23-corefud1.2-240906)

The corpipe23-corefud1.2-240906 is an mT5-large-based multilingual model for coreference resolution usable in CorPipe 23 (https://github.com/ufal/crac2023-corpipe). It is released...

NomVallex 2.5

NomVallex is a manually annotated valency lexicon of Czech nouns and adjectives, adopting the theoretical framework of Functional Generative Description as its theoretical...

Czesl - Universal Dependencies Release 0.5

Syntactic annotation of 1600 sentences from the Czesl-MAN corpus using the framework of Universal Dependencies 2.3.

English-Urdu Religious Parallel Corpus

The English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu with sentence alignments. The corpus can be used for experiments with...

Czech Text Document Corpus v 2.0

Czech Text Document Corpus v 2.0 is a collection of text documents for automatic document classification in the Czech language. It is composed of the text...

DeriNet 1.6 (2018-09-24)

DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent derivational...

Somali Web Corpus

Somali web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.

VIADAT-GIS

A VIADAT module; VIADAT-GIS connects the platform with maps. Developed in cooperation with ÚSD AV ČR and NFA.

HinDialect 1.1: 26 Hindi-related languages and dialects of the Indic Continuu...

HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in North India. This is a collection of folksongs for 26 languages that form a dialect...

A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Docum...

These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The...

DeriNet 2.1

DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent...

SYN2005: balanced corpus of written Czech

Balanced corpus of contemporary written Czech containing 100 million words. It was created as a representation of written language from 2000–2004 and thus contains a wide range of text types...

Coreference in Universal Dependencies 1.1 (CorefUD 1.1)

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version...

ORTOFON v3: corpus of informal spoken Czech with multi-tier transcription (tr...

ORTOFON v3 is a corpus of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) that covers the area of the whole Czech...

1,494 datasets found