South Slavic web corpus collection CLASSLA-web 2.0

Dataset

The CLASSLA-web 2.0 collection is a large-scale, comparable set of web corpora covering all seven South Slavic languages: Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian. This second major CLASSLA-web release follows the methodology of the CLASSLA-web 1.0 corpus collection while providing more recent texts and additional annotation layers, including automatic topic annotation alongside genre classification. The collection comprises approximately 17 billion words across 38 million texts: 2.31B words in the Slovenian corpus, 3.01B in the Croatian corpus, 1.01B in the Bosnian corpus, 294M in the Montenegrin corpus, 3.71B in the Serbian corpus, 691M in the Macedonian corpus, and 5.99B words in the Bulgarian corpus. Detailed size statistics for each corpus are provided in the accompanying README file.
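As a quick consistency check, the per-corpus word counts quoted above can be summed to confirm the approximate total. The sketch below uses only the rounded figures from this description, expressed in millions of words; the exact counts are in the README and may differ slightly.

```python
# Rounded per-corpus sizes from the description, in millions of words.
SIZES_MILLIONS = {
    "Slovenian": 2310,
    "Croatian": 3010,
    "Bosnian": 1010,
    "Montenegrin": 294,
    "Serbian": 3710,
    "Macedonian": 691,
    "Bulgarian": 5990,
}

# Sum of the rounded figures, still in millions of words.
total_millions = sum(SIZES_MILLIONS.values())

# The largest corpus by rounded word count.
largest = max(SIZES_MILLIONS, key=SIZES_MILLIONS.get)

print(total_millions)  # 17015, i.e. roughly 17 billion words
print(largest)         # Bulgarian
```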

Each corpus in the CLASSLA-web 2.0 collection is based on dedicated web crawls of the corresponding national top-level domains (TLDs), namely .si for Slovenian, .hr for Croatian, .ba for Bosnian, .me for Montenegrin, .rs and .срб for Serbian, .mk and .мкд for Macedonian, and .bg and .бг for Bulgarian, together with connected general domains (e.g. .com). All texts were collected in 2024. The corpora are linguistically annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla). Linguistic processing included tokenization, morphosyntactic annotation, and lemmatization. Each corpus was further automatically annotated with genre labels using the X-GENRE classifier (http://doi.org/10.57967/hf/0927) and with topic labels using the IPTC news topic classifier (http://doi.org/10.57967/hf/4709). Additional details on corpus construction are available at https://clarinsi.github.io/classla-web/.

The CLASSLA-web 2.0 corpora are distributed in two complementary formats. In JSONL format, each web document is represented on a single line containing a complete JSON object with document-level metadata and full text, enabling efficient line-by-line processing of large datasets. This format is primarily intended for downloading, filtering, and offline processing. Two JSONL files are provided for each corpus, with the suffixes .jsonl and .anno.jsonl. The two files contain the same documents and metadata; the only difference is that the .anno.jsonl version additionally includes the linguistically annotated text in CoNLL-U format. The second format is the so-called vertical format (VERT): a vertically tokenized, XML-like representation that integrates document-, paragraph-, sentence-, and token-level information together with linguistic annotation, and can be used by the (no)Sketch Engine and CWB concordancers. The document-level metadata provided in both formats include document ID, title, URL, domain, top-level domain (tld), language, script (Latin or Cyrillic, applicable to the Bosnian, Croatian, Montenegrin, and Serbian corpora), year of crawling, and predicted genre and topic categories. Further details on metadata attributes and formats are provided in the accompanying README file. In addition, compressed lists of full URLs for each web corpus are available, offering a concise overview of the corpora’s content.
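The line-by-line processing that the JSONL format allows can be sketched roughly as follows. This is a minimal illustration and not part of the official distribution: the file name and the metadata keys used in the usage comment ("genre", "url") are assumptions made for the example, and the actual attribute names and label values should be taken from the README.

```python
import gzip
import json
from typing import Iterable, Iterator


def iter_documents(path: str) -> Iterator[dict]:
    """Yield one JSON object per non-empty line of a JSONL file.

    Files ending in .gz are decompressed on the fly, so the whole corpus
    never has to be held in memory.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def filter_documents(docs: Iterable[dict], key: str, value: str) -> Iterator[dict]:
    """Keep only documents whose metadata field `key` equals `value`."""
    return (doc for doc in docs if doc.get(key) == value)


# Usage (hypothetical file name, key names, and label value):
#
#     docs = iter_documents("some-corpus.jsonl.gz")
#     for doc in filter_documents(docs, "genre", "News"):
#         print(doc.get("url"))
```

Because both functions are generators, filtering is streamed: each document is parsed, tested, and discarded before the next line is read, which is what makes this format practical for corpora of this size.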

Compared to CLASSLA-web 1.0 (collected in 2021–2022), the new release provides a substantially larger and more recent snapshot of web content, with only about 20 percent textual overlap between the two versions. The new release additionally includes topic annotations alongside genre labels and is distributed in the widely used JSONL and VERT formats. The CLASSLA-web 1.0 corpora were published as separate entries, namely Bosnian (https://hdl.handle.net/11356/1927), Bulgarian (https://hdl.handle.net/11356/1928), Croatian (https://hdl.handle.net/11356/1929), Macedonian (https://hdl.handle.net/11356/1932), Montenegrin (https://hdl.handle.net/11356/1930), Serbian (https://hdl.handle.net/11356/1931) and Slovenian (https://hdl.handle.net/11356/1882).

Notice and take down: Should you consider that our data contains material that is owned by you and should not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Identifier
PID	http://hdl.handle.net/11356/2079
Related Identifier	https://doi.org/10.48550/arXiv.2601.11170
Related Identifier	https://clarinsi.github.io/classla-web/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/2079

Provenance
Creator	Kuzman Pungeršek, Taja; Rupnik, Peter; Ljubešić, Nikola
Publisher	Jožef Stefan Institute
Publication Year	2026
Rights	CC0-No Rights Reserved; https://creativecommons.org/publicdomain/zero/1.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Bosnian; Bulgarian; Croatian; Macedonian; Montenegrin; Serbian; Slovenian
Resource Type	corpus
Format	application/octet-stream; application/gzip; application/zip; text/plain; charset=utf-8; downloadable_files_count: 29
Discipline	Linguistics