CLARIN - Repositories

MUSCIMarker

MUSCIMarker is an open-source tool for annotating visual objects and their relationships in binary images. It is implemented in Python, known to run on Windows, Linux and OS X,...

VIADAT-GIS (2019-12-31)

VIADAT-GIS is a VIADAT module that connects the platform with maps. Developed in cooperation with ÚSD AV ČR and NFA.

SYN2006PUB: corpus of Czech newspapers

Corpus of contemporary Czech newspapers and magazines, 300 million words in size. It contains various titles published between the end of 1989 and 2004. The corpus is lemmatized and...

AKCES-GEC Grammatical Error Correction Dataset for Czech

AKCES-GEC is a grammar error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format. Note that in comparison...

Tigrinya Web Corpus

Tigrinya web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.

Video699: lecture recordings and lecture materials

This is an XML dataset of 17 lecture recordings randomly sampled from the lectures recorded at the Faculty of Informatics, Brno, Czechia during 2010–2016. We drew a stratified...

Universal Derivations v1.1

Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent...

Universal Dependencies 1.2 Models for UDPipe

Tokenizer, POS Tagger, Lemmatizer and Parser models for all Universal Dependencies 1.2 Treebanks, created solely using UD 1.2 data (http://hdl.handle.net/11234/1-1548). To use...

CORMAP - Corpus for Moroccan Arabic Processing

This resource is a corpus containing 34k Moroccan Colloquial Arabic sentences collected from different sources. The sentences are written in Arabic letters. This resource can be...

UDPipe

UDPipe is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. UDPipe is language-agnostic and can be trained given only...

DZ Interset

DZ Interset is a means of converting among various tag sets in natural language processing. The core idea is similar to interlingua-based machine translation. DZ Interset...

onion

onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool is implemented in Python, licensed under the New BSD License and...

A Speech Test Set of Practice Business Presentations with Additional Relevant...

We present a test corpus of audio recordings and transcriptions of presentations of students' enterprises together with their slides and web pages. The corpus is intended for...

Arabic Particles Lexicon

An XML file containing Arabic particles.

SYN2010: balanced corpus of written Czech

Balanced corpus of contemporary written Czech, 100 million words in size. It was created as a representation of written language from 2005–2009 and thus it contains a wide range of text types...

Multiword expressions in the Prague Dependency Treebank 2.0

This dataset adds annotation of multiword expressions and multiword named entities to the original PDT 2.0 data. The annotation is stand-off, stored in the same PML format as...

VIADAT-REPO

VIADAT-REPO is a modification of the lindat-dspace platform; it is part of the VIADAT project and as such will be a part of a "virtual assistant" for processing, annotation,...

NameTag 3 Multilingual Model 250203

This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/). NameTag 3 is an open-source tool for both flat and nested named...

Victoria

Victoria is an online annotation tool for HTML web pages, suitable for selecting text on those pages. It can be used to mark important/interesting parts of web pages for further...

English-Czech Corpus from Wikipedia

Sentence-parallel corpus built from the English and Czech Wikipedias, based on articles translated from English into Czech. The work is described in the paper: ŠTROMAJEROVÁ,...
