Dataset - B2FIND

PAISÀ Corpus of Italian Web Text

The Paisà corpus is a large collection of Italian web texts, licensed under Creative Commons (Attribution-ShareAlike and Attribution-Noncommercial-ShareAlike). It has been...

Croatian web corpus hrWaC 2.1

The Croatian web corpus hrWaC was built by crawling the .hr top-level domain in 2011 and again in 2014. The corpus was near-deduplicated at the paragraph level, normalised via...

Text collection for training the BERTić transformer model BERTić-data

The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin or Serbian. The collection was used to...

Slovene web corpus MaCoCu-sl 2.0

The Slovene web corpus MaCoCu-sl 2.0 was built by crawling the ".si" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains as well. The...

Icelandic-English parallel corpus MaCoCu-is-en 2.0

The Icelandic-English parallel corpus MaCoCu-is-en 2.0 was built by crawling the ".is" internet top-level domain in 2021, extending the crawl dynamically to other domains as...

Macedonian web corpus CLASSLA-web.mk 1.0

The Macedonian web corpus CLASSLA-web.mk 1.0 is based on the MaCoCu-mk 2.0 web corpus crawl (http://hdl.handle.net/11356/1801), which was additionally cleaned and enriched with...

Montenegrin web corpus CLASSLA-web.cnr 1.0

The Montenegrin web corpus CLASSLA-web.cnr 1.0 is based on the MaCoCu-cnr 1.0 web corpus crawl (http://hdl.handle.net/11356/1809), which was additionally cleaned and enriched...

Turkish-English parallel corpus MaCoCu-tr-en 1.0

The Turkish-English parallel corpus MaCoCu-tr-en 1.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other...

Icelandic web corpus MaCoCu-is 1.0

The Icelandic web corpus MaCoCu-is 1.0 was built by crawling the ".is" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler...

Serbian Web Corpus PDRS 1.0

PDRS 1.0 is a web corpus built by crawling the .rs domain. Crawling was done in September and October 2022 with BootCaT. As search terms, approx. 2,800 word forms with a...

Montenegrin web corpus meWaC 1.0

The Montenegrin web corpus meWaC was built by crawling the .me top-level domain in 2019. The corpus was near-deduplicated at the paragraph level, normalised via transliteration into...

Finnish-English parallel corpus fienWaC 1.0

The fienWaC corpus version 1.0 consists of parallel Finnish-English texts crawled from the .fi top-level domain for Finland. The corpus was built with Spidextor...

Catalan-English parallel corpus MaCoCu-ca-en 1.0

The Catalan-English parallel corpus MaCoCu-ca-en 1.0 was built by crawling the ".cat", ".es", ".ad", ".fr", ".it" and ".eu" internet top-level domains in 2022, extending the...

Turkish web corpus MaCoCu-tr 1.0

The Turkish web corpus MaCoCu-tr 1.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The...

Bulgarian web corpus CLASSLA-web.bg 1.0

The Bulgarian web corpus CLASSLA-web.bg 1.0 is based on the MaCoCu-bg 2.0 web corpus crawl (http://hdl.handle.net/11356/1800), which was additionally cleaned and enriched with...

Slovene-English parallel corpus MaCoCu-sl-en 2.0

The Slovene-English parallel corpus MaCoCu-sl-en 2.0 was built by crawling the ".si" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains...

Croatian web corpus MaCoCu-hr 1.0

The Croatian web corpus MaCoCu-hr 1.0 was built by crawling the ".hr" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler is...

Montenegrin-English parallel corpus MaCoCu-cnr-en 1.0

The Montenegrin-English parallel corpus MaCoCu-cnr-en 1.0 was built by crawling the ".me" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other...

Maltese-English parallel corpus MaCoCu-mt-en 2.0

The Maltese-English parallel corpus MaCoCu-mt-en 2.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well....

Slovenian web corpus CLASSLA-web.sl 1.0

The Slovenian web corpus CLASSLA-web.sl 1.0 is based on the Slovenian MaCoCu-sl 2.0 web corpus crawl (http://hdl.handle.net/11356/1795), which was additionally cleaned and...

65 datasets found