Dataset - B2FIND

Replication data for: The beginning of a beautiful friendship: rule-based and...

We describe and compare two tools for processing Middle Russian texts. Both tools provide lemmatization, part-of-speech and morphological annotation. One (“RNC”) was developed...
Replication data for: The ongoing eclipse of possessive suffixes in North Saa...

North Saami is replacing the use of possessive suffixes on nouns with a morphologically simpler analytic construction. Our data (>2K examples culled from >.5M words) track...
Diakorp v6: diachronic corpus of Czech

Diachronic corpus of Czech comprising 3.45 million words (i.e. 4.1 million tokens). It contains 116 texts from the 14th to the 20th century. The texts are transcribed, not...
B4 Heliand

Heliand 1, 4 and 5: complete text, status: final, digitization, translation into Modern German, manually annotated with parts of speech, syntactic categories, grammatical...
Frequency list of textbook vocabulary by level of education in elementary and...

The dataset contains a list of 11906 words (lemmas with part-of-speech information) and their frequency of occurrence in a corpus of Slovenian textbooks, covering elementary...
A Digital Dictionary of Tunis Arabic - TUNICO (ELEXIS)

A corpus-based dictionary, enriched with historical data. The dictionary was not only built on data from the corpus of spoken language that was compiled in the same project, but...
Diachrono

Polish texts from the 17th to the 19th century
Diachrono - sample

Sample of diachronic corpus
diachronic1

HISTORY

9 datasets found