Slovene instruction-following dataset for large language models GaMS-Instruct-MED-Termset 1.0

Dataset

GaMS-Instruct-MED-Termset is an instruction-following dataset containing 975,060 prompt-response units in Slovene from the medical domain. It focuses on medical terms, with explanations for clinical and patient use and examples of their application.

The dataset is based on a set of medical terms obtained from Wikidata, accessible via the Wikidata Query Service (https://query.wikidata.org/). The initial set of terms was compared with the terms in the reference Slovenian Medical Dictionary published on Termania (https://www.termania.net/slovarji/95/slovenski-medicinski-slovar). Only matching terms were selected for further processing. The final set of terms was structured and enriched with descriptions generated using large language models (Azure OpenAI, GPT-4.1).
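The selection step described above (keep only the Wikidata terms that also appear in the reference dictionary) can be sketched as a set-membership filter. This is a minimal illustration, not the authors' pipeline: the record does not state the actual matching criteria, so the normalization rule (Unicode NFC, case folding, whitespace collapsing), the function names `normalize` and `select_matching_terms`, and the sample terms are all assumptions.

```python
import unicodedata


def normalize(term: str) -> str:
    """Assumed matching rule: NFC-normalize, case-fold, collapse whitespace."""
    return " ".join(unicodedata.normalize("NFC", term).casefold().split())


def select_matching_terms(candidates, dictionary_headwords):
    """Keep candidates whose normalized form is a dictionary headword.

    Preserves the original order and spelling of the candidates and drops
    candidates that normalize to an already selected term.
    """
    reference = {normalize(h) for h in dictionary_headwords}
    seen = set()
    selected = []
    for term in candidates:
        key = normalize(term)
        if key in reference and key not in seen:
            seen.add(key)
            selected.append(term)
    return selected


# Illustrative data only; these lists are not taken from the dataset.
wikidata_terms = ["Astma", "sladkorna  bolezen", "Q-fever", "astma"]
dictionary = ["astma", "sladkorna bolezen", "angina pektoris"]
matched = select_matching_terms(wikidata_terms, dictionary)
```

Exact matching after light normalization is the simplest possible choice; a real pipeline might also need to handle inflected forms or spelling variants, which this sketch does not attempt.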

The dataset includes:
• Professional descriptions of medical terms and phrases for medical professionals
• Popular descriptions of medical terms and phrases for the general public
• Conversions between professional and popular descriptions
• Synonyms and antonyms for medical terms and phrases

The result is a standardized database in an instructional format. It is suitable for use in computational linguistics, natural language processing (NLP), and medical informatics, as well as for training and adapting large language models, developing medical chatbots and assistants in Slovene, supporting healthcare professionals with medical terminology, standardizing Slovene medical terminology, medical education, and converting between professional and colloquial medical language.

For more details on the structure of the dataset, please consult 00README.txt.

Identifier
PID	http://hdl.handle.net/11356/2089
Related Identifier	https://www.cjvt.si/povejmo/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/2089

Provenance
Creator	Plesnik, Emil; Tovornik, Robert; Fabjan, Borut; Radnić, Vuk; Marjanović, Anđela; Korošec, Filip; Žabkar, Ines; Kuzman, Ema; Rigler, Martin; Škufca, Lara; Satler, Maša
Publisher	Better, d.o.o.; Faculty of Computer and Information Science, University of Ljubljana
Publication Year	2026
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics