- South Slavic web corpus collection CLASSLA-web 2.0
  The CLASSLA-web 2.0 collection is a large-scale, comparable set of web corpora covering all seven South Slavic languages: Slovenian, Croatian, Bosnian, Montenegrin, Serbian,...
- Multilingual training dataset for CAP policy topic classification ParlaCAP-train
  The multilingual training dataset for CAP policy topic classification ParlaCAP-train is a collection of parliamentary speeches in 29 European languages, automatically annotated...
- Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography
  This dataset contains data for testing machine translation and topic classification in Piedmontese. It is based on FLORES+ (NLLB Team et al., 2024) and SIB-200: A Simple,...
- Multilingual IPTC Media Topic dataset EMMediaTopic 1.0
  The multilingual IPTC Media Topic dataset EMMediaTopic 1.0 is a collection of news articles in Catalan, Croatian, Greek, and Slovenian, automatically annotated with the 17...
- Text classification model fastText-Trendi-Topics 1.0
  The fastText-Trendi-Topics model is a text classification model for categorizing news texts with one of 13 topic labels. It was trained on a set of approx. 36,000 Slovene texts...
- Text classification model SloBERTa-Trendi-Topics 1.0
The SloBERTa-Trendi-Topics model is a text classification model for categorizing news texts with one of 13 topic labels. It was trained on a set of approx. 36,000 Slovene texts...
