-
Kacenka : parallel corpus of English and Czech texts
Parallel corpus, 3,297,283 words. The idea was to create a small parallel corpus which would enable to work with entire texts in translation analysis rather then short extracts.... -
Hausa Visual Genome 1.0
Data Hausa Visual Genome 1.0, a multimodal dataset consisting of text and images suitable for English-to-Hausa multimodal machine translation tasks and multimodal research. We... -
Universal Dependencies 1.2 Models for Parsito
Parsing models for all Universal Depenencies 1.2 Treebanks, created solely using UD 1.2 data (http://hdl.handle.net/11234/1-1548). To use these models, you need Parsito binary,... -
vocabulary_analysis
Statistical analysis service: It calculates different lexicometric measures and displays them graphically (tokens, types, hapaxes & type/token ratio). -
TITUS Old Saxon
ca. 40.000 tokens; linked with relational database; XML-encoding in progress -
Many Czech References for 50 Sentences Selected from WMT11 Data
This dataset contains the whole set of very many Czech translations for 50 English source sentences coming from WMT11 test set (http://www.statmt.org/wmt11). In total, there are... -
One-million Corpus of Croatian Literary Language
written; reference corpus; general; diachornic; monolingual -
ParCzech PS7 2.0
The ParCzech PS7 2.0 corpus is the second version of ParCzech PS7 consisting of stenographic protocols that record the Chamber of Deputies' meetings held in the 7th term between... -
Open SDP
The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data... -
Språkbanken (Swedish Language Bank)
Mainly written Swedish corpora (all time periods except Runic Swedish; various genres, including learner corpora) and lexicons; some non-Swedish corpora (Faroese, Old Icelandic,... -
YouTube-ASL Clip Keypoint Dataset
The YouTube-ASL Clip Keypoint Dataset is a curated collection of sentence-level American Sign Language (ASL) keypoint sequences derived from publicly available YouTube videos.... -
tfidf
It calculates the Term Frequency and the Inverse Document Frequency of a word in a given corpus (a statistical measure used to evaluate how important a word is to a document in... -
Teop
Documentation of the Teop project (DoBeS project) -
ConsILR - Consortium for the Romanian Language: Resources & Tools
Resources and tools developed for Romanian -
Brockhaus' Kleines Konversations-Lexikon
Aufl. 1911; Fokus auf Politik, Wirtschaft, Kultur und Technik zu Beginn des 20. Jahrhunderts -
Nottinghamer Korpus Deutscher YouTube-Sprache (The NottDeuYTSch Corpus)
The NottDeuYTSch corpus contains over 33 million words taken from approximately 3 million YouTube comments from videos published between 2008 to 2018 targeted at a young,... -
SALDO
SALDO (Swedish Associative Thesaurus version 2) is an extensive lexicon resource for modern Swedish written language created for the purpose of language technology research and... -
Narrangansett dictionary
The Narrangansett corpus contains a cultural linguistic dictionary and grammatical information on the Narrangansett language, an extinct language of the USA. -
Wikipedia-Lexikon
Präsentation der am häufigsten gesuchten (20.000) Stichwörter der Wikipedia -
Romanian Explanatory Dictionary
292.792 definitions
