-
PAISÀ Corpus of Italian Web Text
The Paisà corpus is a large collection of Italian web texts, licensed under Creative Commons (Attribution-ShareAlike and Attribution-NonCommercial-ShareAlike). It has been... -
Croatian web corpus hrWaC 2.1
The Croatian web corpus hrWaC was built by crawling the .hr top-level domain in 2011 and again in 2014. The corpus was near-deduplicated at the paragraph level, normalised via... -
Text collection for training the BERTić transformer model BERTić-data
The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin or Serbian. The collection was used to... -
Slovene web corpus MaCoCu-sl 2.0
The Slovene web corpus MaCoCu-sl 2.0 was built by crawling the ".si" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains as well. The... -
Icelandic-English parallel corpus MaCoCu-is-en 2.0
The Icelandic-English parallel corpus MaCoCu-is-en 2.0 was built by crawling the “.is” internet top-level domain in 2021, extending the crawl dynamically to other domains as... -
Macedonian web corpus CLASSLA-web.mk 1.0
The Macedonian web corpus CLASSLA-web.mk 1.0 is based on the MaCoCu-mk 2.0 web corpus crawl (http://hdl.handle.net/11356/1801), which was additionally cleaned and enriched with... -
Montenegrin web corpus CLASSLA-web.cnr 1.0
The Montenegrin web corpus CLASSLA-web.cnr 1.0 is based on the MaCoCu-cnr 1.0 web corpus crawl (http://hdl.handle.net/11356/1809), which was additionally cleaned and enriched... -
Turkish-English parallel corpus MaCoCu-tr-en 1.0
The Turkish-English parallel corpus MaCoCu-tr-en 1.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other... -
Icelandic web corpus MaCoCu-is 1.0
The Icelandic web corpus MaCoCu-is 1.0 was built by crawling the ".is" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler... -
Serbian Web Corpus PDRS 1.0
PDRS 1.0 is a web corpus based on crawling the .rs domain. Crawling was done in September and October 2022 with BootCaT. As search terms, approx. 2,800 word forms with a... -
Montenegrin web corpus meWaC 1.0
The Montenegrin web corpus meWaC was built by crawling the .me top-level domain in 2019. The corpus was near-deduplicated at the paragraph level, normalised via transliteration into... -
Finnish-English parallel corpus fienWaC 1.0
The fienWaC corpus version 1.0 consists of parallel Finnish-English texts crawled from the .fi top-level domain for Finland. The corpus was built with Spidextor... -
Catalan-English parallel corpus MaCoCu-ca-en 1.0
The Catalan-English parallel corpus MaCoCu-ca-en 1.0 was built by crawling the ".cat", ".es", ".ad", ".fr", ".it" and ".eu" internet top-level domains in 2022, extending the... -
Turkish web corpus MaCoCu-tr 1.0
The Turkish web corpus MaCoCu-tr 1.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The... -
Bulgarian web corpus CLASSLA-web.bg 1.0
The Bulgarian web corpus CLASSLA-web.bg 1.0 is based on the MaCoCu-bg 2.0 web corpus crawl (http://hdl.handle.net/11356/1800), which was additionally cleaned and enriched with... -
Slovene-English parallel corpus MaCoCu-sl-en 2.0
The Slovene-English parallel corpus MaCoCu-sl-en 2.0 was built by crawling the “.si” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains... -
Croatian web corpus MaCoCu-hr 1.0
The Croatian web corpus MaCoCu-hr 1.0 was built by crawling the ".hr" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler is... -
Montenegrin-English parallel corpus MaCoCu-cnr-en 1.0
The Montenegrin-English parallel corpus MaCoCu-cnr-en 1.0 was built by crawling the “.me” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other... -
Maltese-English parallel corpus MaCoCu-mt-en 2.0
The Maltese-English parallel corpus MaCoCu-mt-en 2.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well.... -
Slovenian web corpus CLASSLA-web.sl 1.0
The Slovenian web corpus CLASSLA-web.sl 1.0 is based on the Slovenian MaCoCu-sl 2.0 web corpus crawl (http://hdl.handle.net/11356/1795), which was additionally cleaned and...