CLARIN - Repositories

Coreference in Universal Dependencies 1.2 (CorefUD 1.2)

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version...

NameTag 2

NameTag 2 is a named entity recognition tool. It recognizes named entities (e.g., names and locations) and can recognize both flat and embedded (nested) entities. NameTag 2...

Gold Standard Reference Data for Multiword Expression Extraction: Czech Depen...

Annotated list of dependency bigrams occurring in the PDT more than five times and having part-of-speech patterns that can possibly form a collocation. Each bigram is assigned...

LiStr: Linguistic Structure Induction Toolkit

This toolkit comprises the tools and supporting scripts for unsupervised induction of dependency trees from raw texts or texts with already assigned part-of-speech tags. There...

PAWS

PAWS is a multi-lingual parallel treebank with coreference annotation. It consists of English texts from the Wall Street Journal translated into Czech, Russian and Polish. In...

Czech Translation of SQuAD 2.0 and 1.1

The Czech translation of SQuAD 2.0 and SQuAD 1.1 datasets contains automatically translated texts, questions and answers from the training set and the development set of the...

Annotation of Dramatic Situations in Theater Play Scripts (2023)

We defined 58 dramatic situations and annotated them in 19 play scripts. Then we selected only 5 well-recognized dramatic situations and annotated a further 33 play scripts. In...

Deltacorpus 1.1

Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger...

VIADAT-TEXT

A VIADAT module; the purpose of VIADAT-TEXT is analysis of transcribed recordings. Developed in cooperation with ÚSD AV ČR and NFA.

VIADAT

This component integrates other VIADAT modules; together with VIADAT-REPO this composes the Virtual Assistant for accessing historical audiovisual data. The zip archive contains...

Deep Universal Dependencies 2.8

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3687). It contains additional...

MMI_clustering

MMI_clustering is a set of command line tools implementing Mercer's maximum mutual information-based clustering technique.

HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in...

HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in North India. This is a collection of folksongs for 26 languages that form a dialect...

Automatically generated spelling correction corpus for Czech (Czech-SEC-AG)

Automatically generated spelling correction corpus for Czech (Czesl-SEC-AG) is a corpus containing text with automatically generated spelling errors. To create spelling errors, a...

EngVallex - English Valency Lexicon 2.0

EngVallex 2.0 is a slightly updated version of EngVallex. It is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the...

ParaDi 2.0 (2018-01-24)

ParaDi 2.0 is a dictionary of single-verb paraphrases of Czech verbal multiword expressions - light verb constructions and idiomatic verb constructions. Moreover, it provides...

PDT-Vallex: Czech Valency lexicon linked to treebanks

The valency lexicon PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague...

Facebook Data for Sentiment Analysis

Corpus consisting of 10,000 Facebook posts manually annotated for sentiment (2,587 positive, 5,174 neutral, 1,991 negative and 248 bipolar posts). The archive contains data and...

Europarl QTLeap WSD/NED corpus

This corpus is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu). The texts are sentences from the Europarl parallel...

HMM tagger

The HMM-based Tagger is software for morphological disambiguation (tagging) of Czech texts. The algorithm is statistical, based on Hidden Markov Models.

1,494 datasets found