OK, Computer, what are these books about? - data files
The core of this experiment is the use of the entity-fishing algorithm, as created and deployed by DARIAH. In the simplest terms, it scans texts for terms that can be linked... -
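The core idea (scanning a text for terms that can be matched against a knowledge base) can be sketched as follows. This is only an illustration of the concept: the tiny in-memory lookup table stands in for a real knowledge base, and it is not the entity-fishing algorithm or its API. The Wikidata IDs are included for flavor.

```python
# Toy "knowledge base": surface term -> Wikidata ID (illustrative values).
KB = {
    "warsaw": "Q270",
    "wikipedia": "Q52",
}

def link_terms(text, kb=KB):
    """Return (term, kb_id) pairs for every KB term found in the text.

    A real entity linker would also tokenize, disambiguate between
    candidate entities, and score matches; this sketch only does a
    case-insensitive substring lookup.
    """
    lowered = text.lower()
    return [(term, kb_id) for term, kb_id in kb.items() if term in lowered]
```

A real system would return ranked candidates with confidence scores rather than bare matches.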
W2C – Web to Corpus – tool
A tool for building multilingual corpora from Wikipedia: it downloads web pages, converts them to plain text, identifies the language, etc. A set of 120 corpora collected using this... -
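The middle steps of such a pipeline (HTML to plain text, then language identification) can be sketched with the standard library alone. The marker-word language detector below is a deliberately naive placeholder for whatever detector W2C actually uses; only the overall shape of the pipeline is taken from the description above.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content from HTML, skipping <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def html_to_text(html):
    """Strip markup and collapse whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

# Naive language ID: count occurrences of a few marker words per language.
MARKERS = {"en": (" the ", " and "), "pl": (" nie ", " jest ")}

def guess_language(text, markers=MARKERS):
    padded = f" {text.lower()} "
    scores = {lang: sum(padded.count(m) for m in ms) for lang, ms in markers.items()}
    return max(scores, key=scores.get)
```

In practice a corpus builder would use a trained character n-gram model for language ID; the marker-word version is just enough to show where that step sits in the pipeline.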
Context-Aware Representations for Knowledge Base Relation Extraction
We provide a subcorpus of Wikipedia that was annotated with Wikidata relations using a distant supervision procedure. The corpus contains two types of annotations: entities and... -
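The distant supervision procedure mentioned above can be illustrated with a minimal sketch: if a sentence mentions two entities that the knowledge base links with some relation, label that sentence with the relation. The relation table and IDs below are toy examples, not the corpus's actual annotation code.

```python
# Toy relation table: (subject, object) -> Wikidata property (illustrative).
RELATIONS = {("Q270", "Q36"): "P131"}

def annotate(sentence_entities, relations=RELATIONS):
    """Label entity pairs in one sentence with KB relations.

    sentence_entities: list of entity IDs found in the sentence, in order.
    Direction matters: only (subject, object) pairs present in the table
    produce a label.
    """
    labels = []
    for i, subj in enumerate(sentence_entities):
        for obj in sentence_entities[i + 1:]:
            if (subj, obj) in relations:
                labels.append((subj, obj, relations[(subj, obj)]))
    return labels
```

Distant supervision labels are noisy by construction (the sentence may mention both entities without expressing the relation), which is why such corpora are usually described as automatically rather than manually annotated.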
ensiwiki-2011 dataset for readability modelling
The ensiwiki dataset contains Wikipedia pages sampled from Simple English and regular English Wikipedia. For each Simple English page, a paired page was sampled from the regular... -
wikinewsy
Wikinews items -
Parallel Corpora from Comparable Corpora tool
The script consists of two parts: an article parser and an aligner. Required software (install before using the script): yalign; additional Ubuntu packages: mongodb, ipython, python-nose... -
Próbny korpus
Wikinews items -
Wiki test - 34 categories
Wikipedia, 34 categories - a test set for the classifier -
Wikinews korpus próbny
A set of files containing Wikinews items -
Wikipedia Infobox Mapping PL
Mapping between infobox attributes used in Polish Wikipedia and KPWr named entity schema. -
Korpus test - Wikinews
A test database for classes -
Wiki train - 34 categories
Wikipedia, 34 categories - a training set for the classifier -
CEN
The Corpus of Economic News (CEN) contains 797 documents from Polish Wikipedia annotated with 65 categories of proper names in the CCL format....
