Dataset - B2FIND

ParCzech 4.0

The ParCzech 4.0 corpus consists of stenographic protocols that record the Chamber of Deputies' meetings in the 7th term (2013-2017), the 8th term (2017-2021) and the current...

16S rRNA sequencing of the gut microbiome of young and aged mice infected wit...

Data correspond to a gut microbiome analysis through 16S rRNA sequencing of caecal samples collected from young and aged mice infected with influenza virus at day 0 and at days...

Hyperlink Graph of the World Wide Web of 2012 (aggregated by first level subd...

Knowledge about the general graph structure of the hyperlink graph is important for designing ranking methods for search engines. To amend the ranking calculated by search...

Hyperlink Graph of the World Wide Web of 2012 (aggregated by host)

Knowledge about the general graph structure of the hyperlink graph is important for designing ranking methods for search engines. To amend the ranking calculated by search...

MLASK: Multimodal Summarization of Video-based News Articles

The MLASK corpus consists of 41,243 multi-modal documents – video-based news articles in the Czech language – collected from Novinky.cz (https://www.novinky.cz/) and Seznam...

Hausa Visual Genome 1.0

Hausa Visual Genome 1.0 is a multimodal dataset consisting of text and images suitable for English-to-Hausa multimodal machine translation tasks and multimodal research. We...

ParCzech PS7 2.0

The ParCzech PS7 2.0 corpus is the second version of ParCzech PS7 consisting of stenographic protocols that record the Chamber of Deputies' meetings held in the 7th term between...

Open SDP

The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data...

Oromo web corpus

Oromo web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.

Deep Universal Dependencies 2.4

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-2988). It contains additional...

Synthetic part of CzEng 2.0

CzEng is a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). While the full CzEng 2.0 is freely available for...

WMT16 Tuning Shared Task Models (English-to-Czech)

This item contains models to tune for the WMT16 Tuning shared task for English-to-Czech. The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used for the training...

ParCzech PS7 1.0

The ParCzech PS7 1.0 corpus is the very first member of the corpus family of data coming from the Parliament of the Czech Republic. ParCzech PS7 1.0 consists of stenographic...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2017 – VERSION 1)

(German version: see below.) The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

jusText

jusText is a heuristic-based boilerplate removal tool useful for cleaning documents in large textual corpora. The tool is implemented in Python and licensed under New BSD...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2021 – VERSION 1)

(German version: see below.) The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

KER - Keyword Extractor

KER is a keyword extractor that was designed for scanned texts in Czech and English. It is based on the standard tf-idf algorithm with the idf tables trained on texts from...

Universal Dependencies 1.3

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)

EnTam is a sentence-aligned English-Tamil bilingual corpus that we have collected from publicly available websites for NLP research involving Tamil. The standard set...

Amharic WIC Corpus

A substantially cleaned version of the existing morphologically annotated WIC Corpus.

500 datasets found