Slovene instruction-following dataset for large language models GaMS-Instruct-MED-Termset 1.0

GaMS-Instruct-MED-Termset is an instruction-following dataset containing 975,060 prompt-response units in Slovene from the medical domain. It focuses on medical terms, with explanations for clinical and patient use and examples of their application.

The dataset is based on a set of medical terms obtained from Wikidata, accessible via the Wikidata Query Service (https://query.wikidata.org/). The initial set of terms was compared with the terms in the reference Slovenian Medical Dictionary published on Termania (https://www.termania.net/slovarji/95/slovenski-medicinski-slovar). Only matching terms were selected for further processing. The final set of terms was structured and enriched with descriptions generated using large language models (Azure OpenAI, GPT-4.1).
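The term-selection step described above (keep only Wikidata terms that also appear in the reference dictionary) can be sketched roughly as follows. This is an illustrative sketch, not the authors' pipeline: the function names, the sample terms, and the normalization rule (Unicode NFC, collapsed whitespace, case-folding) are all assumptions, since the record does not say how the terms were compared.

```python
import unicodedata


def normalize(term: str) -> str:
    """Assumed comparison key: NFC form, collapsed whitespace, case-folded."""
    return " ".join(unicodedata.normalize("NFC", term).split()).casefold()


def select_matching_terms(candidate_terms, reference_terms):
    """Keep candidates found in the reference set, in order, without duplicates."""
    reference = {normalize(t) for t in reference_terms}
    seen = set()
    selected = []
    for term in candidate_terms:
        key = normalize(term)
        if key in reference and key not in seen:
            seen.add(key)
            selected.append(term)
    return selected


# Hypothetical sample data standing in for the Wikidata and dictionary term lists.
wikidata_terms = ["Astma", "sladkorna  bolezen", "gripa", "astma", "xyz"]
dictionary_terms = ["astma", "gripa", "sladkorna bolezen"]

selected = select_matching_terms(wikidata_terms, dictionary_terms)
print(selected)
```

Exact matching on a normalized key is the simplest possible choice; a real pipeline might also need lemmatization or handling of multi-word variants, which this sketch does not attempt.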

It includes:
• Professional descriptions of medical terms and phrases for medical professionals
• Popular descriptions of medical terms and phrases for the general public
• Conversions between professional and popular descriptions
• Synonyms and antonyms for medical terms and phrases

The result is a standardized database in an instructional format. It is suitable for work in computational linguistics, natural language processing (NLP), and medical informatics, including: training and adapting large language models; developing medical chatbots and assistants in Slovene; supporting healthcare professionals with medical terminology; standardizing medical terminology in Slovene; medical education; and converting between professional and colloquial medical language.

For more details on the structure of the dataset, please consult 00README.txt.

Identifier
  PID: http://hdl.handle.net/11356/2089
  Related Identifier: https://www.cjvt.si/povejmo/
  Metadata Access: http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/2089

Provenance
  Creator: Plesnik, Emil; Tovornik, Robert; Fabjan, Borut; Radnić, Vuk; Marjanović, Anđela; Korošec, Filip; Žabkar, Ines; Kuzman, Ema; Rigler, Martin; Škufca, Lara; Satler, Maša
  Publisher: Better, d.o.o.; Faculty of Computer and Information Science, University of Ljubljana
  Publication Year: 2026
  Rights: Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
  OpenAccess: true
  Contact: info(at)clarin.si

Representation
  Language: Slovenian; Slovene
  Resource Type: corpus
  Format: text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
  Discipline: Linguistics
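
The Metadata Access URL listed in this record is an OAI-PMH GetRecord request. As a minimal sketch, it can be assembled from its parts with the Python standard library. The helper name get_record_url is hypothetical, and the pattern "oai:www.clarin.si:<handle>" is inferred only from this one record, so it may not hold for other items in the repository.

```python
from urllib.parse import urlencode

# Base endpoint taken from the Metadata Access field of this record.
BASE_URL = "http://www.clarin.si/repository/oai/request"


def get_record_url(handle: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord URL for a handle such as '11356/2089'."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        # Identifier pattern assumed from this record only.
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ':' and '/' unescaped so the result matches the URL as published.
    return f"{BASE_URL}?{urlencode(params, safe=':/')}"


url = get_record_url("11356/2089")
print(url)
```

The sketch only builds the request URL; fetching and parsing the returned Dublin Core XML is left out.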