CLARIN - Repositories

Eye-Tracking Recordings from a Pilot Study of WMT-style MT Outputs Ranking

This package contains the eye-tracker recordings of 8 subjects evaluating English-to-Czech machine translation quality using the WMT-style ranking of sentences. We provide the...

Arabic Proclitics Lexicon

An XML-based file containing all Arabic proclitics.

HamleDT 3.0

HamleDT (HArmonized Multi-LanguagE Dependency Treebank) is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that...

enTenTen

Very large English web corpus enTenTen, comprising 3,268,798,627 tokens.

WMT 2011 Testing Set

Testing set from the WMT 2011 [1] competition, manually translated from Czech and English into Slovak. The test set contains 3003 sentences in Czech, Slovak and English. Test set is...

SYN v4: large corpus of written Czech

Corpus of contemporary written (printed) Czech sized 3.6 GW (i.e. 4.3 billion tokens). It covers mostly the period of 1990–2014 and it is a traditional corpus (as opposed to the...

QTLeap WSD/NED corpus

This corpus is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu). The texts are Q&A interactions from the...

Universal Derivations v1.0

Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent...

SQAD

The SQAD database consists of 3301 records obtained from Czech Wikipedia articles. The record structure is as follows: - the original sentence(s) from Wikipedia - a question...

huntoken - tokenizer and sentence splitter

HunToken is a rule-based tokenizer and sentence boundary detector for Hungarian (and English) texts.

EngVallex - English Valency Lexicon

EngVallex is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments....

Vystadial 2016 – Czech data

This is the Czech data collected during the VYSTADIAL project. It is an extension of the 'Vystadial 2013' Czech part data release. The dataset comprises telephone...

Indonesian web corpus

Indonesian web corpus crawled in 2010. Encoded in UTF-8, cleaned, deduplicated, tagged by Morphind.

Arabic Morphological evaluation corpus

An annotated corpus for benchmarking and evaluating Arabic morphological analyzers. It consists of 100 words with all their possible analyses. The corpus contains...

English Model (CoNLL-2003) for NameTag

English model for NameTag, a named entity recognition tool. The model is trained on CoNLL-2003 training data. Recognizes PER, ORG, LOC and MISC named entities. Achieves...

Deep Universal Dependencies 2.7

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3424). It contains additional...

ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcri...

ORTOFON v1 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole...

Uniform Meaning Representation

The goal of the Uniform Meaning Representation (UMR) project is to design a meaning representation that can be used to annotate the semantic content of a text. UMR is primarily...

Chared

Chared is a software tool that can detect the character encoding of a text document, provided the language of the document is known. The language of the text has to be specified as...

SYN2015: representative corpus of written Czech

Representative corpus of contemporary written Czech sized 100 MW. It was created as a representation of printed language from 2010–2014 containing a wide range of text types...

1,494 datasets found