Corpus of contemporary blogs


At the NLP Centre, text is currently divided into sentences by a rule-based tool. To produce enough training data for a machine-learning approach, annotators manually split a corpus of contemporary text (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were annotated in parallel by two annotators.
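Since two annotators split the same data, their sentence boundaries can be compared file by file. The following is only a minimal sketch, assuming each annotator's output has already been loaded as a list of whitespace-tokenized sentences (the actual file layout is not described in this record); the function names and the sample sentences are hypothetical.

```python
def boundary_offsets(sentences):
    """Return the set of token offsets at which a sentence ends."""
    offsets = set()
    position = 0
    for sentence in sentences:
        # Tokens are assumed to be separated by whitespace.
        position += len(sentence.split())
        offsets.add(position)
    return offsets


def boundary_agreement(sentences_a, sentences_b):
    """Jaccard overlap of the sentence boundaries marked by two annotators."""
    a = boundary_offsets(sentences_a)
    b = boundary_offsets(sentences_b)
    if not a and not b:
        return 1.0  # nothing to disagree about
    return len(a & b) / len(a | b)


# Hypothetical example: the same six tokens split by two annotators.
annotator_a = ["Dnes prší .", "Zůstanu doma ."]
annotator_b = ["Dnes prší . Zůstanu doma ."]
print(boundary_agreement(annotator_a, annotator_b))
```

Jaccard overlap is one simple choice here; other agreement measures could be substituted without changing the boundary extraction.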

The corpus was created from ten contemporary blogs:

Creator Grác, Marek
Publisher Masaryk University, NLP Centre
Publication Year 2011
Rights Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0); PUB
OpenAccess true
Contact Masaryk University, NLP Centre
Language Czech
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline Linguistics