SumeCzech

This entry contains the SumeCzech dataset and the metric RougeRAW used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al.

The dataset is distributed as a set of Python scripts which download the raw HTML pages from CommonCrawl and then process them into the required format.

The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation.

Note: sumeczech-1.0-update-230225.zip is the updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11. However, the downloaded dataset is still exactly the same. The original archive sumeczech-1.0.zip was renamed to sumeczech-1.0-obsolete-180213.zip and is kept for reference.

Identifier
PID http://hdl.handle.net/11234/1-2615
Related Identifier https://www.aclweb.org/anthology/L18-1551.pdf
Metadata Access http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-2615
Provenance
Creator Straka, Milan; Mediankin, Nikita; Kocmi, Tom; Žabokrtský, Zdeněk; Hudeček, Vojtěch; Hajič, Jan
Publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year 2018
Rights Mozilla Public License 2.0; http://opensource.org/licenses/MPL-2.0; PUB
OpenAccess true
Contact lindat-help(at)ufal.mff.cuni.cz
Representation
Language Czech
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline Linguistics
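
The Metadata Access URL listed above appears to follow the usual OAI-PMH GetRecord pattern, with the record identifier derived from the handle in the PID. As a rough illustration only, the sketch below rebuilds that URL from the PID. The helper names `handle_from_pid` and `build_get_record_url` are hypothetical and not part of the SumeCzech scripts, and the assumption that the endpoint and identifier prefix stay fixed is taken from this one record.

```python
from urllib.parse import urlencode, urlsplit

# Endpoint and identifier prefix copied from the "Metadata Access" field of this record.
OAI_ENDPOINT = "http://lindat.mff.cuni.cz/repository/oai/request"
OAI_ID_PREFIX = "oai:lindat.mff.cuni.cz:"


def handle_from_pid(pid: str) -> str:
    """Extract the handle (e.g. '11234/1-2615') from a hdl.handle.net PID URL.

    Hypothetical helper: assumes the handle is the whole URL path.
    """
    return urlsplit(pid).path.lstrip("/")


def build_get_record_url(handle: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a GetRecord request URL for the given handle.

    Hypothetical helper: assumes the identifier is the fixed prefix plus the handle.
    """
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": OAI_ID_PREFIX + handle,
    }
    # safe=":/" leaves colons and slashes unescaped, matching the URL in the record.
    return OAI_ENDPOINT + "?" + urlencode(params, safe=":/")


pid = "http://hdl.handle.net/11234/1-2615"
handle = handle_from_pid(pid)
url = build_get_record_url(handle)
print(url)
```

The printed URL should match the Metadata Access value of this record; fetching it is left out so that the sketch runs without network access.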