Dataset - B2FIND

DSI-enriched ParaCrawl 9 en-es corpus

This is a derivative work based on ParaCrawl release 9 English-Spanish (https://paracrawl.eu/). This version of the corpus includes a set of probabilities corresponding to the...

Bulgarian web corpus MaCoCu-bg 2.0

The Bulgarian web corpus MaCoCu-bg 2.0 was built by crawling the ".bg" and ".бг" internet top-level domains in 2021, extending the crawl dynamically to other domains as well....

Icelandic-English parallel corpus MaCoCu-is-en 2.0

The Icelandic-English parallel corpus MaCoCu-is-en 2.0 was built by crawling the ".is" internet top-level domain in 2021, extending the crawl dynamically to other domains as...

Serbian-English parallel corpus MaCoCu-sr-en 1.0

The Serbian-English parallel corpus MaCoCu-sr-en 1.0 was built by crawling the ".rs" and ".срб" internet top-level domains in 2021 and 2022, extending the crawl dynamically to...

Finnish web corpus fiWaC 1.0

The Finnish web corpus fiWaC was built by crawling the .fi top-level domain in 2015 for both Finnish and English documents. The corpus was naively tokenised (via spaces),...

Bulgarian web corpus MaCoCu-bg 1.0

The Bulgarian web corpus MaCoCu-bg 1.0 was built by crawling the ".bg" and ".бг" internet top-level domains in 2021, extending the crawl dynamically to other domains as well....

Slovene Web genre identification corpus GINCO 1.0

The Slovene Web genre identification corpus GINCO 1.0 contains web texts, manually annotated with genre, from two Slovene web corpora, the slWaC 2.0 corpus, crawled in 2014, and...

Icelandic-English parallel corpus MaCoCu-is-en 1.0

The Icelandic-English parallel corpus MaCoCu-is-en 1.0 was built by crawling the ".is" internet top-level domain in 2021, extending the crawl dynamically to other domains as...

Montenegrin web corpus MaCoCu-cnr 1.0

The Montenegrin web corpus MaCoCu-cnr 1.0 was built by crawling the ".me" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains as well....

Croatian web corpus MaCoCu-hr 2.0

The Croatian web corpus MaCoCu-hr 2.0 was built by crawling the ".hr" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains as well. The...

Croatian-English parallel corpus MaCoCu-hr-en 2.0

The Croatian-English parallel corpus MaCoCu-hr-en 2.0 was built by crawling the ".hr" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other...

DSI-enriched ParaCrawl 9 en-nl corpus

This is a derivative work based on ParaCrawl release 9 English-Dutch (https://paracrawl.eu/). This version of the corpus includes a set of probabilities corresponding to the...

Turkish-English parallel corpus MaCoCu-tr-en 1.0

The Turkish-English parallel corpus MaCoCu-tr-en 1.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other...

Macedonian web corpus MaCoCu-mk 2.0

The Macedonian web corpus MaCoCu-mk 2.0 was built by crawling the ".mk" and ".мкд" internet top-level domains in 2021, extending the crawl dynamically to other domains as well....

Bulgarian-English parallel corpus MaCoCu-bg-en 1.0

The Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 was built by crawling the ".bg" and ".бг" internet top-level domains in 2021, extending the crawl dynamically to other...

Finnish-English parallel corpus fienWaC 1.0

The fienWaC corpus version 1.0 consists of parallel Finnish-English texts crawled from the .fi top-level domain for Finland. The corpus was built with Spidextor...

Indonesian web corpus

Indonesian web corpus crawled in 2010. Encoded in UTF-8, cleaned, deduplicated, tagged by MorphInd.

Oromo web corpus

Oromo web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.

Hungarian Web Corpus

Monolingual written general; 700 million tokens; Segmentation, disambiguation

Amharic Web Corpus

Amharic web corpus. Crawled by SpiderLing in August 2013, October 2015, and January 2016. Encoded in UTF-8, cleaned, deduplicated. Tagged by TreeTagger trained on Amharic WIC...

62 datasets found