Dataset - B2FIND

South Slavic web corpus collection CLASSLA-web 2.0

The CLASSLA-web 2.0 collection is a large-scale, comparable set of web corpora covering all seven South Slavic languages: Slovenian, Croatian, Bosnian, Montenegrin, Serbian,...

Multilingual training dataset for CAP policy topic classification ParlaCAP-train

The multilingual training dataset for CAP policy topic classification ParlaCAP-train is a collection of parliamentary speeches in 29 European languages, automatically annotated...

Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

This dataset contains data for testing machine translation and topic classification in Piedmontese. It is based on FLORES+ (NLLB Team et al., 2024) and SIB-200: A Simple,...

Multilingual IPTC Media Topic dataset EMMediaTopic 1.0

The multilingual IPTC Media Topic dataset EMMediaTopic 1.0 is a collection of news articles in Catalan, Croatian, Greek, and Slovenian, automatically annotated with the 17...

Text classification model fastText-Trendi-Topics 1.0

The fastText-Trendi-Topics model is a text classification model for categorizing news texts with one of 13 topic labels. It was trained on a set of approx. 36,000 Slovene texts...

Text classification model SloBERTa-Trendi-Topics 1.0

The SloBERTa-Trendi-Topics model is a text classification model for categorizing news texts with one of 13 topic labels. It was trained on a set of approx. 36,000 Slovene texts...