Dataset - B2FIND

CCLL Lemmatised Frequency Lists

The resource contains 6 frequency lists for the Corpus of Contemporary Lithuanian language (CCLL) (https://sitti.vdu.lt/en/services/): 1-LT_token_freq_list.txt - a full frequency...

Frequency List of Lithuanian Homoforms

The list contains 63,139 homoforms. In the Frequency List of Lithuanian Homoforms, the following data are provided for each homoform: 1) the homoform itself, 2) its lemma (or...

IMP corpus n-grams 2.0

A collection of n-grams extracted from the IMP corpus of historical Slovene (cf. https://nl.ijs.si/imp/). Three sets of n-gram lists are provided for lowercased word n-grams of...

KRES corpus n-grams 1.0

This is a collection of n-grams extracted from the KRES corpus of written Slovene. In addition to the separate lists of n-grams for tokens and their attributes (morphosyntactic...

IMP corpus n-grams 1.0

This is a collection of n-grams extracted from the IMP corpus of historical Slovene (http://hdl.handle.net/11356/1031). In addition to the separate lists of n-grams for tokens...

Janes corpus n-grams 1.0

A collection of n-grams extracted from the Janes corpus of Slovenian user-generated content version 1.0 (cf. http://nl.ijs.si/janes/). Three sets of n-gram lists are provided...

Kres corpus n-grams 2.0

A collection of n-grams extracted from the Kres corpus of written Slovene (cf. http://eng.slovenscina.eu/korpusi/kres). Three sets of n-gram lists are provided for lowercased...

Keywords and n-grams from a textbook corpus

Wordlists, keywords and n-grams were extracted from a corpus of textbooks for Slovenian elementary and secondary schools. The corpus contains 4,302,857 words (5,373,268 tokens),...

Gos corpus n-grams 2.0

A collection of n-grams extracted from the Gos corpus of spoken Slovene (cf. http://eng.slovenscina.eu/korpusi/gos). Three sets of n-gram lists are provided for lowercased word...

Gos corpus n-grams 1.0

This is a collection of n-grams extracted from the Gos corpus of spoken Slovene (http://hdl.handle.net/11356/1040). In addition to the separate lists of n-grams for tokens and...

Wordlist of the Contemporary Corpus of Lithuanian Language in the Face of War...

We present the comparative wordlist based on the Corpus of the Contemporary Lithuanian Language (CCLL2 version 2, pre-2020), supplemented by the media (courtesy of the news...

Wordlist of Lemmas from the Joint Corpus of Lithuanian

The resource is a wordlist of lemmas from the Joint Corpus of Lithuanian (JCL). The JCL is a merge of three corpora: 1) Vilnius university corpus compiled out of the Lithuanian...

Wordlist of the Contemporary Corpus of Lithuanian language

Wordlists of Wordforms of the Contemporary Corpus of Lithuanian language. Corpus Structure...

Assessment Data of the Dictionary of Modern Lithuanian versus Joint Corpora

The resource is the assessment data of The Dictionary of Modern Lithuanian, 6th edition (DML6) [1], from the point of view of its coverage in the Joint Corpus of Lithuanian...

Lemmatised Wordlist of 1 m. Corpus of Contemporary Lithuanian

The lemmatised wordlist of a 1 m. word Lithuanian corpus. The structure of the tab-delimited text file (dazninis.txt): Headword, Part of Speech, Wordform, Frequency of Occurrence. The...

Cameroonian Languages Dataset

This is a collection of resources on Cameroonian Languages. The collection comprises electronic copies of scanned wordlists and TEI-XML encoded files of the wordlists.

16 datasets found