Monitor corpus of Slovene Trendi 2023-02


The Trendi corpus is a monitor corpus of Slovene. It contains news from 107 media websites, published by 72 publishers. Trendi 2023-02 covers the period from January 2019 to February 2023, complementing the Gigafida 2.0 reference corpus of written Slovene. All contents of the Trendi corpus are currently obtained through the Jožef Stefan Institute Newsfeed service. The texts have been annotated using the CLASSLA-Stanza pipeline, including syntactic parsing according to Universal Dependencies and named entity annotation. An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The following related resources are available: the text classification models SloBERTa-Trendi-Topics 1.0 and fastText-Trendi-Topics 1.0, and the SloBERTa model. At the moment, the corpus is not available as a dataset due to copyright restrictions, but we hope to make at least part of it available in the near future. The corpus is accessible through the CLARIN.SI concordancers.

Creator Kosem, Iztok; Čibej, Jaka; Dobrovoljc, Kaja; Erjavec, Tomaž; Ljubešić, Nikola; Ponikvar, Primož; Šinkec, Mihael; Krek, Simon
Publisher Jožef Stefan Institute; Centre for Language Resources and Technologies, University of Ljubljana
Publication Year 2023
OpenAccess true
Contact info(at)
Language Slovenian; Slovene
Resource Type corpus
Format downloadable_files_count: 0
Discipline Linguistics