Dataset - B2FIND

Dataset of annotated collocation-distractor pairs COLLDIST

The dataset contains 59,598 collocation-distractor pairs for 2,856 headwords. A distractor is defined as an incorrect answer/alternative to a collocation, which can be similar to...

Dataset of annotated headword-synonym-distractor triplets SYNDIST

The dataset contains 51,023 headword-synonym-distractor triplets for 5,000 headwords. A distractor is defined as an incorrect answer/alternative to a synonym, which can be similar...

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of units of...

ASR model evaluator

Docker image with an ASR evaluation tool that supports WER calculation on punctuated and capitalised transcripts. The UI allows uploading the reference and predicted...

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of pairs of...

Carniolan Provincial Assembly corpus Kranjska 1.0

The corpus contains meeting proceedings of the Carniolan Provincial Assembly from 1861 to 1913 (Obravnave deželnega zbora kranjskega / Bericht über die Verhandlungen des...

Spoken corpus Gos 2.1 (transcriptions)

The spoken corpus Gos 2.1 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand...

PyTorch model for Slovenian Coreference Resolution

Slovenian model for coreference resolution: a neural network based on a customized transformer architecture, usable with the code published on...

Dataset of Slovene medical texts PoVeJMo-VeMo-Med 1.0

PoVeJMo-VeMo-Med is a dataset containing Slovene medical texts. The bulk of it consists of instructions for use for various prescription drugs. The texts were extracted from...

Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0

The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as...

Slovene translation of the SQuAD2.0 dataset

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to...

Spoken corpus Gos 2.0 (transcriptions)

The spoken corpus Gos 2.0 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand...

Dependency tree extraction tool STARK 3.0

STARK is a highly customizable tool designed for extracting different types of syntactic structures (trees) from parsed corpora (treebanks), aimed at corpus-driven linguistic...

Slovene sentiment lexicon KSS 1.1

The Slovene opinion lexicon KSS is based on the manually translated opinion lexicon of Hu & Liu (2004). The lexicon is updated with some positive and negative words typical for...

CroSloEngual BERT

Trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. State-of-the-art tool representing...

Slovene Natural Language Inference Dataset SI-NLI

SI-NLI (Slovene Natural Language Inference Dataset) contains 5,937 human-created Slovene sentence pairs (premise and hypothesis) that are manually labeled with the labels...

Slovene Punctuation and Capitalisation model RSDO-DS2-P&C 3.6

This Punctuation and Capitalisation model was trained following the NVIDIA NeMo Punctuation and Capitalisation recipe (for details see the official NVIDIA NeMo P&C...

Slovene Text Normalizator RSDO-DS2-NORM 1.0

This Text Normalisator converts Slovene text from its written form into its spoken form. Traditionally, it is an essential preprocessing step before text-to-speech (TTS). As input it...

Neural Machine Translation model for Slovene-English language pair RSDO-DS4-N...

This Neural Machine Translation model for the Slovene-English language pair was trained following the NVIDIA NeMo NMT AAYN recipe (for details see the official NVIDIA NeMo NMT...

ELMo embeddings model, Slovenian

ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on the entire Gigafida 2.0 corpus...

66 datasets found