Corpus of contemporary blogs

Dataset

PID

In NLP Centre, dividing text into sentences is currently done with a tool which uses rule-based system. In order to make enough training data for machine learning, annotators manually split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus and all data were processed in parallel by two annotators.

The corpus was created from ten contemporary blogs: hintzu.otaku.cz modnipeklo.cz bloc.cz aleneprokopova.blogspot.com blog.aktualne.cz fuchsova.blog.onaidnes.cz havlik.blog.idnes.cz blog.aktualne.centrum.cz klusak.blogspot.cz myego.cz/welldone

Identifier
PID	http://hdl.handle.net/11858/00-097C-0000-000E-011B-8
Related Identifier	http://nlp.fi.muni.cz/projekty/cocb/
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11858/00-097C-0000-000E-011B-8

Provenance
Creator	Grác, Marek
Publisher	Masaryk University, NLP Centre
Publication Year	2011
Rights	Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0); http://creativecommons.org/licenses/by-nc-nd/3.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Czech
Resource Type	corpus
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline	Linguistics