The Tourism Corpus TURK 3.0 is a multilingual corpus of tourism-related texts, mostly in Slovenian, with a small share of texts (about 6% of the corpus) in English, Italian and German. TURK 3.0 contains almost 1,460 texts (20 million words) and is an upgraded version of the earlier TURK2 corpus (2016–2024), which contained 16,787 documents and 30 million words. As part of the 2025 upgrade, texts found to be of insufficient quality or to contain a mix of languages were removed, and the corpus was expanded with 127 additional documents (approximately 100,000 words) collected from key contemporary Slovenian tourism sources, including the Slovenian Tourist Board (STO), Visit Ljubljana, and Visit Koper. These new materials reflect post-COVID-19 developments in Slovenian tourism and emphasize sustainability, cultural and experiential tourism, and the growing popularity of sports and outdoor tourism.
The corpus texts are classified according to their language, topic (e.g. Cultural tourism, Sports tourism), class (e.g. written, periodical, monthly), type (e.g. newspaper article, diploma thesis, advertising) and whether the text has been proofread.
The texts have been automatically annotated with linguistic information according to the Universal Dependencies formalism. The Slovenian texts were annotated with CLASSLA (https://github.com/clarinsi/classla), a fork of Stanza (https://github.com/stanfordnlp/stanza), while the texts in the other languages were annotated with Stanza using the appropriate language model.
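The split between the two toolkits can be illustrated with a minimal sketch. It is not the pipeline actually used to build the corpus: the language codes ("sl", "en", "it", "de"), the helper names pick_annotator and build_pipeline, and the use of each toolkit's default models and processors are assumptions made for illustration only.

```python
# Hypothetical mapping from language code to annotation toolkit,
# following the description: Slovenian -> CLASSLA, other languages -> Stanza.
ANNOTATORS = {
    "sl": "classla",
    "en": "stanza",
    "it": "stanza",
    "de": "stanza",
}


def pick_annotator(lang: str) -> str:
    """Return the name of the toolkit assumed to handle the given language."""
    try:
        return ANNOTATORS[lang]
    except KeyError:
        raise ValueError(f"No annotator configured for language: {lang!r}")


def build_pipeline(lang: str):
    """Build an annotation pipeline with default settings (an assumption).

    The libraries are imported lazily so that the routing logic above
    can be used without either toolkit being installed.
    """
    if pick_annotator(lang) == "classla":
        import classla

        classla.download(lang)          # fetch the default Slovenian models
        return classla.Pipeline(lang)
    import stanza

    stanza.download(lang)               # fetch the default models for this language
    return stanza.Pipeline(lang)
```

Calling build_pipeline("sl") would download the default CLASSLA models and return a pipeline that can be applied to a raw text string; the models and settings actually used for TURK 3.0 are not specified here and may differ.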
TURK 3.0 provides an important foundation for further research in Slovenian tourism terminology and directly supports the development of the growing TURS Tourism Dictionary (https://turs.upr.si).