Jing bao ground truth – text block crops and annotations

This dataset accompanies the paper "Language Model Assisted OCR Classification for Republican Chinese Newspaper Text", JDADH 11/2023. In that work, we present methods to obtain a neural optical character recognition (OCR) tool for article blocks in a Republican Chinese newspaper. The dataset contains two subsets: (1) pairs of text block crops and corresponding ground truth annotations from the April 1920, 1930 and 1939 issues of the Jingbao newspaper (jingbao_annotated_crops.zip), and (2) labeled images of single characters, which we cropped automatically from the April 1939 issues of the Jingbao using separators generated from projection profiles (jingbao_char_imgs.zip).
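The description mentions separators generated from projection profiles. As a rough illustration of that general technique (not the authors' actual implementation), the sketch below sums the ink pixels in each row of a binarized text column and places a separator in the middle of every interior run of empty rows. The function names, the 0/1 list-of-lists image representation, and the `threshold` parameter are all assumptions made for this example.

```python
def row_profile(image):
    """Horizontal projection profile: number of ink pixels (1s) in each row."""
    return [sum(row) for row in image]


def find_separators(profile, threshold=0):
    """Return the midpoint of every interior run where profile <= threshold.

    Runs that touch the first or last index are treated as margins and ignored.
    `threshold` is a hypothetical tuning knob for noisy scans.
    """
    separators = []
    start = None
    for i, value in enumerate(profile):
        if value <= threshold:
            if start is None:
                start = i  # a gap begins here
        elif start is not None:
            if start > 0:  # skip a gap that touches the leading edge
                separators.append((start + i - 1) // 2)
            start = None
    return separators


def split_rows(image, separators):
    """Cut the image into consecutive row bands at the separator indices."""
    bounds = [0] + list(separators) + [len(image)]
    return [image[a:b] for a, b in zip(bounds, bounds[1:])]


# Toy column of vertical text: three "characters" separated by blank rows.
ink_rows = {1, 2, 3, 6, 7, 8, 10, 11}
image = [[1, 1, 0, 1] if r in ink_rows else [0, 0, 0, 0] for r in range(12)]

separators = find_separators(row_profile(image))  # [4, 9]
pieces = split_rows(image, separators)            # three row bands
```

On real scans one would first binarize the crop and would likely need a nonzero threshold or some smoothing of the profile, since characters with detached strokes can produce gaps inside a single character.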

Identifier
DOI https://doi.org/10.11588/data/PVYWKB
Related Identifier IsCitedBy https://doi.org/10.6853/DADH.202310_(12).0001
Related Identifier IsCitedBy https://doi.org/10.11588/heidok.00030845
Metadata Access https://heidata.uni-heidelberg.de/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.11588/data/PVYWKB
Provenance
Creator Henke, Konstantin; Arnold, Matthias
Publisher heiDATA
Contributor Arnold, Matthias; Henke, Konstantin; Heidelberg Centre for Transcultural Studies (HCTS); Heidelberg Research Architecture, University of Heidelberg
Publication Year 2023
Rights CC BY-SA 4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-sa/4.0
OpenAccess true
Contact Arnold, Matthias (Heidelberg Centre for Transcultural Studies)
Representation
Resource Type Image files in jpg and png formats; Dataset
Format application/zip
Size 189721051 bytes; 78600047 bytes
Version 1.2
Discipline Humanities; Linguistics
Spatial Coverage Heidelberg Centre for Transcultural Studies, University of Heidelberg
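
The Metadata Access link in this record has the shape of an OAI-PMH GetRecord request. A minimal sketch of how such a URL can be assembled from the DOI is shown below; the helper name `oai_get_record_url` and the constant `OAI_ENDPOINT` are made up for this example, and the code only builds the URL string without contacting the server.

```python
from urllib.parse import urlencode

# Endpoint taken from the Metadata Access field of this record.
OAI_ENDPOINT = "https://heidata.uni-heidelberg.de/oai"


def oai_get_record_url(doi, metadata_prefix="oai_datacite"):
    """Build a GetRecord URL for a DOI (hypothetical helper, not an official API)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"doi:{doi}",
    }
    # Keep ':' and '/' unescaped so the result matches the form used in the record.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = oai_get_record_url("10.11588/data/PVYWKB")
```

The resulting string can be passed to any HTTP client; the response is expected to be an XML document in the requested metadata format.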