Dataset - B2FIND

Reference List of Slovene Frequent Common Words

The reference list of the most frequent Slovene common words was prepared by selecting vocabulary at the intersection of the most frequent 10,000 lemmas of four Slovene text...

A Resource for Evaluating Graded Word Similarity in Context: CoSimLex

The dataset contains human similarity ratings for pairs of words. The annotators were presented with contexts that contained both of the words in the pair and the dataset...

Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0

The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as...

CroSloEngual BERT

Trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. State-of-the-art tool representing...

ELMo embeddings model, Slovenian

ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on the entire Gigafida 2.0 corpus...

SimLex-999 Slovenian translation SimLex-999-sl 1.0

The resource contains the English SimLex-999 word pairs (Hill et al. 2015) and their Slovene translations. In the translation process, the word pairs were first translated by two translators...

Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0

The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as...

Dataset of Slovene idiomatic expressions SloIE

SloIE is a manually labelled dataset of Slovene idiomatic expressions. It contains 29,400 sentences with 75 different expressions that can occur with either a literal or an...

Summarization datasets from the KAS corpus KAS-Sum 1.0

Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.net/11356/1448) and the abstracts from the KAS-Abs 2.0 corpus...

Corpus of academic Slovene KAS 2.0

The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1.5 billion tokens)...

Slovenian keyword extraction dataset from SentiNews 1.0

The dataset consists of 7,514 Slovenian news articles from the SentiNews 1.0 corpus by Bučar et al. 2017 (http://hdl.handle.net/11356/1110) that had article keywords available....

CroSloEngual BERT 1.1

Trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. State-of-the-art tool representing...

ELMo embeddings models for seven languages

ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on large monolingual corpora for 7 languages: Slovenian, Croatian,...

Multilingual Culture-Independent Word Analogy Datasets

The word analogy task evaluates word embeddings based on analogous word pairs (e.g. "Paris - France" should be equivalent to "Rome - Italy", "son - daughter" should be equivalent to...

Machine Translation datasets from the KAS corpus KAS-MT 1.0

The Machine Translation datasets KAS-MT 1.0 contain automatically sentence-aligned Slovene and English plain-text abstracts from KAS-Abs 2.0 (http://hdl.handle.net/11356/1449)...

Abstracts from the KAS corpus KAS-Abs 2.0

The KAS-Abs 2.0 corpus contains 125,202 automatically identified Slovenian and/or English abstracts from BSc/BA, MSc/MA, and PhD theses included in the KAS Corpus of Academic...

List of single-word male and female occupations in Slovenian

The list of single-word occupations in Slovene is based on the Slovene Standard Classification of Occupations...

17 datasets found