OK, Computer, what are these books about? - data files
The core of this experiment is the use of the entity-fishing algorithm, as created and deployed by DARIAH. In the simplest terms, it scans texts for terms that can be linked... -
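The core idea (scanning a text for terms that can be matched against a knowledge base) can be sketched as follows. This is only an illustration of the concept: the tiny in-memory lookup table stands in for a real knowledge base, and it is not the entity-fishing algorithm or its API. The Wikidata IDs are included for flavor.

```python
# Toy "knowledge base": surface term -> Wikidata ID (illustrative values).
KB = {
    "warsaw": "Q270",
    "wikipedia": "Q52",
}

def link_terms(text, kb=KB):
    """Return (term, kb_id) pairs for every KB term found in the text.

    A real entity linker would also tokenize, disambiguate between
    candidate entities, and score matches; this sketch only does a
    case-insensitive substring lookup.
    """
    lowered = text.lower()
    return [(term, kb_id) for term, kb_id in kb.items() if term in lowered]
```

A real system would return ranked candidates with confidence scores rather than bare matches.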
W2C – Web to Corpus – tool
A tool for building multilingual corpora from Wikipedia: it downloads web pages, converts them to plain text, identifies the language, etc. A set of 120 corpora collected using this... -
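The middle steps of such a pipeline (HTML to plain text, then language identification) can be sketched with the standard library alone. The marker-word language detector below is a deliberately naive placeholder for whatever detector W2C actually uses; only the overall shape of the pipeline is taken from the description above.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content from HTML, skipping <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def html_to_text(html):
    """Strip markup and collapse whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

# Naive language ID: count occurrences of a few marker words per language.
MARKERS = {"en": (" the ", " and "), "pl": (" nie ", " jest ")}

def guess_language(text, markers=MARKERS):
    padded = f" {text.lower()} "
    scores = {lang: sum(padded.count(m) for m in ms) for lang, ms in markers.items()}
    return max(scores, key=scores.get)
```

In practice a corpus builder would use a trained character n-gram model for language ID; the marker-word version is just enough to show where that step sits in the pipeline.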
Context-Aware Representations for Knowledge Base Relation Extraction
We provide a subcorpus of Wikipedia that was annotated with Wikidata relations using a distant supervision procedure. The corpus contains two types of annotations: entities and... -
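The distant supervision procedure mentioned above can be illustrated with a minimal sketch: if a sentence mentions two entities that the knowledge base links with some relation, label that sentence with the relation. The relation table and IDs below are toy examples, not the corpus's actual annotation code.

```python
# Toy relation table: (subject, object) -> Wikidata property (illustrative).
RELATIONS = {("Q270", "Q36"): "P131"}

def annotate(sentence_entities, relations=RELATIONS):
    """Label entity pairs in one sentence with KB relations.

    sentence_entities: list of entity IDs found in the sentence, in order.
    Direction matters: only (subject, object) pairs present in the table
    produce a label.
    """
    labels = []
    for i, subj in enumerate(sentence_entities):
        for obj in sentence_entities[i + 1:]:
            if (subj, obj) in relations:
                labels.append((subj, obj, relations[(subj, obj)]))
    return labels
```

Distant supervision labels are noisy by construction (the sentence may mention both entities without expressing the relation), which is why such corpora are usually described as automatically rather than manually annotated.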
ensiwiki-2011 dataset for readability modelling
The ensiwiki dataset contains Wikipedia pages sampled from Simple English and regular English Wikipedia. For each Simple English page, a paired page was sampled from the regular... -
wikinewsy
Wikinews items -
Parallel Corpora from Comparable Corpora tool
The script consists of two parts: an article parser and an aligner. Required software (install before using the script): yalign; additional Ubuntu packages: mongodb, ipython, python-nose... -
Próbny korpus
Wikinews items -
Wiki test - 34 categories
Wikipedia, 34 categories - a test set for the classifier -
Wikinews korpus próbny
A set of files containing Wikinews items -
Wikipedia Infobox Mapping PL
Mapping between infobox attributes used in Polish Wikipedia and KPWr named entity schema. -
Korpus test - Wikinews
A test database for classes -
Wiki train - 34 categories
Wikipedia, 34 categories - a training set for the classifier -
CEN
The Corpus of Economic News (CEN) contains 797 documents from Polish Wikipedia annotated with 65 categories of proper names in the CCL format....
