CLARIN - Repositories

Kacenka : parallel corpus of English and Czech texts

Parallel corpus, 3,297,283 words. The idea was to create a small parallel corpus which would enable to work with entire texts in translation analysis rather then short extracts....

Hausa Visual Genome 1.0

Data Hausa Visual Genome 1.0, a multimodal dataset consisting of text and images suitable for English-to-Hausa multimodal machine translation tasks and multimodal research. We...

Universal Dependencies 1.2 Models for Parsito

Parsing models for all Universal Depenencies 1.2 Treebanks, created solely using UD 1.2 data (http://hdl.handle.net/11234/1-1548). To use these models, you need Parsito binary,...

vocabulary_analysis

Statistical analysis service: It calculates different lexicometric measures and displays them graphically (tokens, types, hapaxes & type/token ratio).

TITUS Old Saxon

ca. 40.000 tokens; linked with relational database; XML-encoding in progress

Many Czech References for 50 Sentences Selected from WMT11 Data

This dataset contains the whole set of very many Czech translations for 50 English source sentences coming from WMT11 test set (http://www.statmt.org/wmt11). In total, there are...

One-million Corpus of Croatian Literary Language

written; reference corpus; general; diachornic; monolingual

ParCzech PS7 2.0

The ParCzech PS7 2.0 corpus is the second version of ParCzech PS7 consisting of stenographic protocols that record the Chamber of Deputies' meetings held in the 7th term between...

Open SDP

The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data...

Språkbanken (Swedish Language Bank)

Mainly written Swedish corpora (all time periods except Runic Swedish; various genres, including learner corpora) and lexicons; some non-Swedish corpora (Faroese, Old Icelandic,...

YouTube-ASL Clip Keypoint Dataset

The YouTube-ASL Clip Keypoint Dataset is a curated collection of sentence-level American Sign Language (ASL) keypoint sequences derived from publicly available YouTube videos....

tfidf

It calculates the Term Frequency and the Inverse Document Frequency of a word in a given corpus (a statistical measure used to evaluate how important a word is to a document in...

Teop

Documentation of the Teop project (DoBeS project)

ConsILR - Consortium for the Romanian Language: Resources & Tools

Resources and tools developed for Romanian

Brockhaus' Kleines Konversations-Lexikon

Aufl. 1911; Fokus auf Politik, Wirtschaft, Kultur und Technik zu Beginn des 20. Jahrhunderts

Nottinghamer Korpus Deutscher YouTube-Sprache (The NottDeuYTSch Corpus)

The NottDeuYTSch corpus contains over 33 million words taken from approximately 3 million YouTube comments from videos published between 2008 to 2018 targeted at a young,...

SALDO

SALDO (Swedish Associative Thesaurus version 2) is an extensive lexicon resource for modern Swedish written language created for the purpose of language technology research and...

Narrangansett dictionary

The Narrangansett corpus contains a cultural linguistic dictionary and grammatical information on the Narrangansett language, an extinct language of the USA.

Wikipedia-Lexikon

Präsentation der am häufigsten gesuchten (20.000) Stichwörter der Wikipedia

Romanian Explanatory Dictionary

292.792 definitions

4,938 datasets found