CLARIN - Repositories

WMT21 Marian translation model (ca-oc)

Marian NMT model for Catalan to Occitan translation. Primary CUNI submission for WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task.

Czech PDT-C 1.0 Model for UDPipe 2 (2023-11-16)

Tokenizer, POS Tagger, Lemmatizer, and Parser model based on the PDT-C 1.0 treebank (https://hdl.handle.net/11234/1-3185). The model documentation including performance can be...

SQAD v2

Simple question answering database (SQAD) created from Czech Wikipedia. Each record of SQAD consists of four files (in vertical form provided with lemmatization and POS tagging)...

WMT21 Marian translation model (ca-oc multi-task)

Marian NMT model for Catalan to Occitan translation. It is a multi-task model that also produces a phonemic transcription of the Catalan source. The model was submitted to WMT'21...

FAUST 0.5

Syntactic (including deep-syntactic - tectogrammatical) annotation of user-generated noisy sentences. The annotation was made on Czech-English and English-Czech Faust Dev/Test...

SYN2013PUB: corpus of written Czech newspapers

Corpus of contemporary Czech newspapers and magazines sized 935 MW. It contains various titles published between 2005 and 2009. The corpus is lemmatized and morphologically tagged...

Tensor2tensor Translation for Docker

This submission contains a Dockerfile for creating a Docker image with a compiled Tensor2tensor backend with compatible (TensorFlow Serving) models available in the Lindat...

CLEF-TREC Q/A

List of 2264 questions + answers of CLEF and TREC, translated to Arabic

Czech Named Entity Corpus 2.0

Czech Named Entity Corpus 2.0 is a corpus of 8993 Czech sentences with 35220 manually annotated Czech named entities, classified according to a two-level hierarchy of 46 named...

CoNLL 2009 Shared Task Czech Trial Set

Czech trial (example) data for CoNLL 2009 Shared Task. The data are generated from PDT 2.0. LDC2009E32B

EdUKate Czech-Ukrainian translation model 2024

This package includes a Czech-to-Ukrainian translation model adapted for the educational domain. The model is exported into the TensorFlow Serving format (using Tensor2tensor...

Semantic annotation of noun/verb conversion in Czech

The item contains a list of 2,058 noun/verb conversion pairs along with related formations (word-formation paradigms) provided with linguistic features, including semantic...

Annotation of Dramatic Situations in Theater Play Scripts

We defined 58 dramatic situations and annotated them in 19 play scripts. Then we selected only 5 well-recognized dramatic situations and annotated a further 33 play scripts. In...

ESIC 1.1 -- Europarl Simultaneous Interpreting Corpus (2024-02-05)

ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and...

KonText Web Demo

An interactive web demo for querying selected ÚFAL and LINDAT corpora. LINDAT/CLARIN KonText is a fork of ÚČNK KonText (https://github.com/czcorpus/kontext, maintained by Tomáš...

CorPipe 24 Multilingual CorefUD 1.2 Model (corpipe24-corefud1.2-240906)

The corpipe24-corefud1.2-240906 is an mT5-large-based multilingual model for coreference resolution usable in CorPipe 24 (https://github.com/ufal/crac2024-corpipe). It is...

Contemporary Arabic dictionary

An XML-based file containing the electronic version of al logha al arabia al moassira (Contemporary Arabic) dictionary. An Arabic monolingual dictionary accomplished by Ahmed...

LatinISE corpus

The LatinISE corpus is a text corpus collected from the LacusCurtius, Intratext and Musisque Deoque websites. Corpus texts have rich metadata containing information such as genre,...

Universal Dependencies 2.15 models for UDPipe 2 (2024-11-21)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 147 treebanks of 78 languages of Universal Dependencies 2.15 Treebanks, created solely using UD 2.15 data...

Khresmoi Summary Translation Test Data 2.0

This package contains data sets for development (Section dev) and testing (Section test) of machine translation of sentences from summaries of medical articles between Czech,...

1,494 datasets found