Text collection for training the BERTić transformer model BERTić-data

PID

The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin or Serbian. The collection was used to train the BERTić transformer model (https://huggingface.co/classla/bcms-bertic). The data consists of web crawls before 2015, i.e. bsWaC (http://hdl.handle.net/11356/1062), hrWaC (http://hdl.handle.net/11356/1064), and srWaC (http://hdl.handle.net/11356/1063); previously unpublished 2019-2020 crawls, i.e. cnrWaC, CLASSLA-bs, CLASSLA-hr, and CLASSLA-sr; the cc100-hr and cc100-sr parts of CommonCrawl (https://commoncrawl.org/); and the Riznica corpus (http://hdl.handle.net/11356/1180). All texts were transliterated to the Latin script. The format of the text collection is one-sentence-per-line, empty-line-as-document-boundary. More details, especially on the applied near-deduplication procedure, can be found in the BERTić paper (https://arxiv.org/pdf/2104.09243.pdf).

Identifier
PID http://hdl.handle.net/11356/1426
Related Identifier https://arxiv.org/abs/2104.09243
Related Identifier https://huggingface.co/classla/bcms-bertic
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1426
Provenance
Creator Ljubešić, Nikola
Publisher Jožef Stefan Institute
Publication Year 2021
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Bosnian; Croatian; Serbian
Resource Type corpus
Format application/gzip; text/plain; charset=utf-8; downloadable_files_count: 10
Discipline Linguistics