Dataset - B2FIND

Frequency lists of word-level n-grams from the Trendi corpus 2020

Frequency lists of word-level n-grams (or word sets) were extracted from the Trendi Monitor Corpus of Slovene (version 2022-05: http://hdl.handle.net/11356/1590) using the LIST...

Corpus extraction tool LIST 1.0

The LIST corpus extraction tool is a Java program for extracting lists from text corpora on the levels of characters, word parts, words, and word sets. It supports VERT and TEI...

KRES corpus n-grams 1.0

This is a collection of n-grams extracted from the KRES corpus of written Slovene. In addition to the separate lists of n-grams for tokens and their attributes (morphosyntactic...

Frequency lists of word-level n-grams from the Trendi corpus 2021

Frequency lists of word-level n-grams (or word sets) were extracted from the Trendi Monitor Corpus of Slovene (version 2022-05: http://hdl.handle.net/11356/1590) using the LIST...

Frequency lists of character-level n-grams from the GOS 1.0 corpus 1.1

Frequency lists of character-level n-grams were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool...

Gos corpus n-grams 1.0

This is a collection of n-grams extracted from the Gos corpus of spoken Slovene (http://hdl.handle.net/11356/1040). In addition to the separate lists of n-grams for tokens and...

Frequency lists of character-level n-grams from the Gigafida 2.0 corpus

Frequency lists of character-level n-grams were extracted from the Gigafida 2.0 Corpus of Written Standard Slovene (https://viri.cjvt.si/gigafida/) using the LIST corpus...

List of formulaic sequences in standard written Slovenian

This document contains 1,891 formulaic sequences in standard written Slovenian, i.e. frequently recurring strings of two to five words, manually annotated for syntactic...

Frequency lists of word-level n-grams from the GOS 1.0 corpus

Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction...

Dependency tree extraction tool STARK 1.0

STARK is a Python-based command-line tool for extraction of dependency trees from parsed corpora, aimed at corpus-driven linguistic investigations of syntactic phenomena of...

Janes corpus n-grams 1.0

A collection of n-grams extracted from the Janes corpus of Slovenian user-generated content version 1.0 (cf. http://nl.ijs.si/janes/). Three sets of n-gram lists are provided...

Keywords and n-grams from a textbook corpus

Wordlists, keywords and n-grams were extracted from a corpus of textbooks for Slovenian elementary and secondary schools. The corpus contains 4,302,857 words (5,373,268 tokens),...

Frequency lists of word-level n-grams from the Gigafida 2.0 corpus

Frequency lists of word-level n-grams (or word sets) were extracted from the Gigafida 2.0 Corpus of Written Standard Slovene (https://viri.cjvt.si/gigafida/) using the LIST...

Corpus extraction tool LIST 1.2

The LIST corpus extraction tool is a Java program for extracting lists from text corpora on the levels of characters, word parts, words, and word sets. It supports VERT and TEI...

Gos corpus n-grams 2.0

A collection of n-grams extracted from the Gos corpus of spoken Slovene (cf. http://eng.slovenscina.eu/korpusi/gos). Three sets of n-gram lists are provided for lowercased word...

List of formulaic sequences in spoken Slovenian

This document contains 2,374 formulaic sequences in spoken Slovenian, i.e. frequently recurring strings of two to five words, manually annotated for syntactic structure,...

Kres corpus n-grams 2.0

A collection of n-grams extracted from the Kres corpus of written Slovene (cf. http://eng.slovenscina.eu/korpusi/kres). Three sets of n-gram lists are provided for lowercased...

Frequency lists of word-level n-grams from the GOS 1.0 corpus 1.1

Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction...

IMP corpus n-grams 1.0

This is a collection of n-grams extracted from the IMP corpus of historical Slovene (http://hdl.handle.net/11356/1031). In addition to the separate lists of n-grams for tokens...

IMP corpus n-grams 2.0

A collection of n-grams extracted from the IMP corpus of historical Slovene (cf. http://nl.ijs.si/imp/). Three sets of n-gram lists are provided for lowercased word n-grams of...

26 datasets found