CLARIN - Repositories

Retrograde Morphemic Dictionary of Czech - verbs

The file contains all Czech verbs included in the Retrograde Morphemic Dictionary of Czech Language (Slavíčková Eleonora, Academia 1975). The data was obtained by scanning a...

Česílko

Česílko is a tool for fast and efficient translation from one source language into many mutually related target languages.

NameTag 2 Models (2020-08-31)

NER models for NameTag 2, a named entity recognition tool, for English, German, Dutch, Spanish and Czech. Model documentation including performance can be found here:...

Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0)

A richly annotated and genre-diversified language resource, The Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, or PDT-C in short in the sequel) is a consolidated...

Czech-English Manual Word Alignment

Corpus of manually aligned Czech-English parallel sentences. It comprises 2500 parallel sentences from 7 different sources.

NameTag 3 Czech CNEC 2.0 Model

This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/), trained on the Czech Named Entity Corpus 2.0...

CoNLL 2017 Shared Task System Outputs

This package contains the system outputs from the CoNLL 2017 Shared Task in Multilingual Parsing from Raw Text to Universal Dependencies.

Italian Function Words v2

This dictionary is the second version of 11372/LRT-2288, a curated list of Italian function words in a JSON Lines format text file, particularly useful for tasks such as...

CzEngVallex

CzEngVallex is a bilingual valency lexicon of corresponding Czech and English verbs. It connects 20835 aligned valency frame pairs (verb senses) which are translations of each...

MorfFlex CZ 2.0

MorfFlex CZ 2.0 is the Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. MorfFlex is a flat list of...

Teaching practicum and its role in the professional training of English teach...

The presented data and metadata include answers to questions raised in the questionnaire focused on the experience of teaching practicums and their role in the practical...

Parsito

Parsito is a fast open-source dependency parser written in C++. It is based on greedy transition-based parsing, has very high accuracy, and achieves a throughput of 30K...

Implemented Spelling Rules

The book [1] contains spelling rules classified into ten categories, each category containing many rules. This XML file presents our implemented rules classified with six...

ORTOFON v3: corpus of informal spoken Czech with multi-tier transcription (tr...

ORTOFON v3 is a corpus of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) that covers the area of the whole Czech...

Manual Arabic spelling-errors correction for collected documents

The file is a text corpus for Arabic spell checking, in which a group of people edited different files, and all of the spelling errors committed by these...

Corpus OVER

Many studies in cognitive linguistics have analysed the semantics of 'over', notably the semantics associated with 'over' as a preposition. Most of them generally conclude that...

Uniform Meaning Representation 2.0

The goal of the Uniform Meaning Representation (UMR) project is to design a meaning representation that can be used to annotate the semantic content of a text. UMR is primarily...

Czech Relationship Extraction Dataset

CERED (Czech Relationship Dataset) is a family of datasets created via distant supervision on Czech Wikipedia and Wikidata. It was created as part of a thesis on Relationship...

CALEM (Comprehensive Arabic LEMmas)

Comprehensive Arabic LEMmas is a lexicon covering a large list of Arabic lemmas and their corresponding inflected word forms (stems) with details (POS + Root). Each lexical...

Universal Dependencies 2.0 – CoNLL 2017 Shared Task Development and Test Data

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

1,494 datasets found