Dataset - B2FIND

Oromo web corpus

Oromo web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2021 – VERSION 1)

German version: see below. The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

C4Corpus (publicdomain part)

A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, that has been extracted from CommonCrawl, the largest publicly...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2019 – VERSION 1)

German version: see below. The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

CEHugeWebCorpus

This corpus was originally created for performance testing (server infrastructure CorpusExplorer - see: diskurslinguistik.net / diskursmonitor.de). It includes the filtered...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2013 – VERSION 1)

German version: see below. The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2022 – VERSION 1)

German version: see below. The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

Nottinghamer Korpus Deutscher YouTube-Sprache (The NottDeuYTSch Corpus) (2022...

The NottDeuYTSch corpus contains over 33 million words taken from approximately 3 million YouTube comments from videos published between 2008 and 2018 targeted at a young,...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2024 – VERSION 1)

German version: see below. The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

C4Corpus (CC BY-SA part)

A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, that has been extracted from CommonCrawl, the largest publicly...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2020 – VERSION 1)

German version: see below. The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

C4Corpus (CC-BY part)

A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, that has been extracted from CommonCrawl, the largest publicly...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2023 – VERSION 1)

German version: see below. The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2016 – VERSION 1)

German version: see below. The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

C4Corpus (CC BY-ND part)

A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, that has been extracted from CommonCrawl, the largest publicly...

Amharic Web Corpus

Amharic web corpus. Crawled by SpiderLing in August 2013, October 2015, and January 2016. Encoded in UTF-8, cleaned, deduplicated. Tagged by TreeTagger trained on Amharic WIC...

C4Corpus (CC BY-NC part)

A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, that has been extracted from CommonCrawl, the largest publicly...

HWC2023 – Hamburg.de Website Corpus 2023

A petition for a referendum (called: "Schluss mit Gendersprache in Verwaltung und Bildung" / eng.: "abolition of gender language in administration and education") was formed in...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2018 – VERSION 1)

German version: see below. The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

Hungarian Web Corpus

Monolingual written general; 700 million tokens; Segmentation, disambiguation

24 datasets found