1,494 datasets found

  • MUSCIMarker

    MUSCIMarker is an open-source tool for annotating visual objects and their relationships in binary images. It is implemented in Python, known to run on Windows, Linux and OS X,...
  • VIADAT-GIS (2019-12-31)

    A VIADAT module; VIADAT-GIS connects the platform with maps. Developed in cooperation with ÚSD AV ČR and NFA.
  • SYN2006PUB: corpus of Czech newspapers

    Corpus of contemporary Czech newspapers and magazines containing 300 million words. It contains various titles published between the end of 1989 and 2004. The corpus is lemmatized and...
  • AKCES-GEC Grammatical Error Correction Dataset for Czech

    AKCES-GEC is a grammar error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format. Note that in comparison...
  • Tigrinya Web Corpus

    Tigrinya web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
  • Video699: lecture recordings and lecture materials

    This is an XML dataset of 17 lecture recordings randomly sampled from the lectures recorded at the Faculty of Informatics, Brno, Czechia during 2010–2016. We drew a stratified...
  • Universal Derivations v1.1

    Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent...
  • Universal Dependencies 1.2 Models for UDPipe

    Tokenizer, POS Tagger, Lemmatizer and Parser models for all Universal Dependencies 1.2 Treebanks, created solely using UD 1.2 data (http://hdl.handle.net/11234/1-1548). To use...
  • CORMAP - Corpus for Moroccan Arabic Processing

    This resource is a corpus containing 34k Moroccan Colloquial Arabic sentences collected from different sources. The sentences are written in Arabic letters. This resource can be...
  • UDPipe

    UDPipe is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. UDPipe is language-agnostic and can be trained given only...
  • DZ Interset

    DZ Interset is a means of converting among various tag sets in natural language processing. The core idea is similar to interlingua-based machine translation. DZ Interset...
  • onion

    onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool has been implemented in Python, licensed under the New BSD License and...
  • A Speech Test Set of Practice Business Presentations with Additional Relevant...

    We present a test corpus of audio recordings and transcriptions of presentations of students' enterprises together with their slides and web-pages. The corpus is intended for...
  • Arabic Particles Lexicon

    An XML-based file containing Arabic particles
  • SYN2010: balanced corpus of written Czech

    Balanced corpus of contemporary written Czech containing 100 million words. It was created as a representation of written language from 2005–2009 and thus it contains a wide range of text types...
  • Multiword expressions in the Prague Dependency Treebank 2.0

    This dataset adds annotation of multiword expressions and multiword named entities to the original PDT 2.0 data. The annotation is stand-off, stored in the same PML format as...
  • VIADAT-REPO

    VIADAT-REPO is a modification of the lindat-dspace platform; it is part of the VIADAT project and as such will be a part of a "virtual assistant" for processing, annotation,...
  • NameTag 3 Multilingual Model 250203

    This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/). NameTag 3 is an open-source tool for both flat and nested named...
  • Victoria

    Victoria is an online annotation tool for HTML web pages, suitable for selecting text on web pages. It can be used to mark important/interesting parts of web pages for further...
  • English-Czech Corpus from Wikipedia

    Sentence-parallel corpus made from the English and Czech Wikipedias, based on articles translated from English into Czech. The work done is described in the paper: ŠTROMAJEROVÁ,...