-
Eye-Tracking Recordings from a Pilot Study of WMT-style MT Outputs Ranking
This package contains the eye-tracker recordings of 8 subjects evaluating English-to-Czech machine translation quality using the WMT-style ranking of sentences. We provide the... -
Arabic Proclitics Lexicon
An XML-based file containing all Arabic proclitics -
HamleDT 3.0
HamleDT (HArmonized Multi-LanguagE Dependency Treebank) is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that... -
enTenTen
Very large English web corpus enTenTen, comprising 3,268,798,627 tokens. -
WMT 2011 Testing Set
Testing set from the WMT 2011 [1] competition, manually translated from Czech and English into Slovak. The test set contains 3003 sentences in Czech, Slovak and English. Test set is... -
SYN v4: large corpus of written Czech
Corpus of contemporary written (printed) Czech sized 3.6 GW (i.e. 4.3 billion tokens). It covers mostly the period of 1990–2014 and it is a traditional corpus (as opposed to the... -
QTLeap WSD/NED corpus
This corpus is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu). The texts are Q&A interactions from the... -
Universal Derivations v1.0
Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent... -
SQAD
The SQAD database consists of 3301 records obtained from Czech Wikipedia articles. The record structure is as follows: - the original sentence(s) from Wikipedia - a question... -
huntoken - tokenizer and sentence splitter
HunToken is a rule-based tokenizer and sentence boundary detector for Hungarian (and English) texts. -
EngVallex - English Valency Lexicon
EngVallex is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments.... -
Vystadial 2016 – Czech data
This is the Czech data collected during the VYSTADIAL project. It is an extension of the 'Vystadial 2013' Czech part data release. The dataset comprises telephone... -
Indonesian web corpus
Indonesian web corpus crawled in 2010. Encoded in UTF-8, cleaned, deduplicated, tagged by Morphind. -
Arabic Morphological evaluation corpus
An annotated corpus dedicated to the benchmarking and evaluation of Arabic morphological analyzers. It consists of 100 words with all their possible analyses. The corpus contains... -
English Model (CoNLL-2003) for NameTag
English model for NameTag, a named entity recognition tool. The model is trained on CoNLL-2003 training data. Recognizes PER, ORG, LOC and MISC named entities. Achieves... -
Deep Universal Dependencies 2.7
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3424). It contains additional... -
ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcri...
ORTOFON v1 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole... -
Uniform Meaning Representation
The goal of the Uniform Meaning Representation (UMR) project is to design a meaning representation that can be used to annotate the semantic content of a text. UMR is primarily... -
Chared
Chared is a software tool which can detect character encoding of a text document provided the language of the document is known. The language of the text has to be specified as... -
SYN2015: representative corpus of written Czech
Representative corpus of contemporary written Czech sized 100 MW. It was created as a representation of printed language from 2010–2014 containing a wide range of text types...
