SiR 2.0


SiR 2.0 is an updated version of an annotated corpus of Czech articles published on iRozhlas, a news server of Czech public radio (https://www.irozhlas.cz/). SiR 2.0 is a collection of 1 718 articles (42 890 sentences, 614 995 words) with manual annotation of citation phrases, their sources, and the attribution links between them. The sources are classified into several classes of named and unnamed sources. The corpus consists of two parts, distinguished by the origin of the annotations: (i) expert-annotated articles: 589 articles (13 280 sentences) originally annotated by two or three student annotators and later curated or re-annotated by an expert; (ii) student-annotated articles: 1 129 articles (29 610 sentences), each annotated by a single student annotator. The data were annotated in the Brat tool (https://brat.nlplab.org/) and are distributed in the Brat native format, i.e. each article is represented by the original plain text and a stand-off annotation file. In total, the corpus contains 10 033 annotated citation phrases, 8 960 citation sources and 9 317 links between sources and phrases.
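Because the annotations are stand-off, each annotation refers to the plain text through character offsets. The following is a minimal sketch of reading such a file in Python, assuming the general Brat standoff layout (tab-separated lines, `T` for text-bound annotations, `R` for binary relations). The labels `Source`, `Phrase` and `Attribution`, the role names, and the sample sentence are hypothetical placeholders, not the corpus's actual annotation scheme or data.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    id: str
    type: str
    spans: list  # list of (start, end) character offsets into the .txt file
    text: str


@dataclass
class Relation:
    id: str
    type: str
    args: dict  # role name -> id of the linked text-bound annotation


def parse_ann(ann_text):
    """Parse text-bound (T) and relation (R) lines of a Brat .ann file."""
    entities, relations = {}, {}
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # T1<TAB>Type start end[;start end ...]<TAB>covered text
            ann_type, offsets = fields[1].split(" ", 1)
            spans = [tuple(int(n) for n in part.split()) for part in offsets.split(";")]
            text = fields[2] if len(fields) > 2 else ""
            entities[ann_id] = Entity(ann_id, ann_type, spans, text)
        elif ann_id.startswith("R"):
            # R1<TAB>Type Role1:T1 Role2:T2
            ann_type, *args = fields[1].split()
            relations[ann_id] = Relation(ann_id, ann_type, dict(a.split(":", 1) for a in args))
        # Other line kinds (attributes, notes, events) are ignored in this sketch.
    return entities, relations


# Hypothetical sample: an invented sentence with invented labels.
txt = 'The minister said: "We will act."'
ann = (
    'T1\tSource 0 12\tThe minister\n'
    'T2\tPhrase 19 33\t"We will act."\n'
    'R1\tAttribution Arg1:T1 Arg2:T2\n'
)

entities, relations = parse_ann(ann)
for rel in relations.values():
    source, phrase = (entities[i] for i in rel.args.values())
    print(rel.type, "|", source.text, "->", phrase.text)
```

For a real article, `txt` and `ann` would be read from the paired `.txt` and `.ann` files with `encoding="utf-8"`, and slicing `txt[start:end]` for each span is a quick way to check that the offsets match the covered text.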

Identifier
PID http://hdl.handle.net/11234/1-6144
Related Identifier http://hdl.handle.net/11234/1-4840
Related Identifier https://ufal.mff.cuni.cz/anotace-citacnich-frazi-v-datech-irozhlas/sir-20
Metadata Access http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-6144
Provenance
Creator Mírovský, Jiří; Hladká, Barbora; Kopp, Matyáš; Moravec, Václav
Publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year 2026
Rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); http://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess true
Contact lindat-help(at)ufal.mff.cuni.cz
Representation
Language Czech
Resource Type corpus
Format application/zip; text/plain; text/plain; charset=utf-8; downloadable_files_count: 2
Discipline Linguistics
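
The Metadata Access link above is an OAI-PMH GetRecord request. As a rough sketch, the same URL can be assembled from its parts and a Dublin Core (`oai_dc`) response flattened into a dictionary using only the Python standard library. The helper names `record_url` and `dc_fields` are illustrative, and nothing here assumes which fields the repository actually returns.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://lindat.mff.cuni.cz/repository/oai/request"
DC_NS = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace


def record_url(identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for one repository item."""
    query = urlencode(
        {"verb": "GetRecord", "metadataPrefix": metadata_prefix, "identifier": identifier},
        safe=":/",  # keep ':' and '/' readable, as in the link listed in the record
    )
    return f"{OAI_ENDPOINT}?{query}"


def dc_fields(xml_text):
    """Collect all Dublin Core elements of a response as {name: [values]}."""
    fields = {}
    for element in ET.fromstring(xml_text).iter():
        if element.tag.startswith(DC_NS) and element.text:
            fields.setdefault(element.tag[len(DC_NS):], []).append(element.text.strip())
    return fields


url = record_url("oai:lindat.mff.cuni.cz:11234/1-6144")
print(url)
```

The response body could then be fetched with `urllib.request.urlopen(url).read()` and passed to `dc_fields`; network access and error handling are left out of this sketch.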