Jing bao ground truth – text block crops and annotations


This dataset accompanies the paper "Language Model Assisted OCR Classification for Republican Chinese Newspaper Text", JDADH 11/2023. In this work, we present methods to obtain a neural optical character recognition (OCR) tool for article blocks in a Republican Chinese newspaper. The dataset contains two subsets: (1) pairs of text block crops and corresponding ground truth annotations from the April 1920, 1930 and 1939 issues of the Jingbao newspaper (jingbao_annotated_crops.zip); and (2) labeled images of single characters, automatically cropped from the April 1939 issues of the Jingbao using separators generated from projection profiles (jingbao_char_imgs.zip).
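For readers unfamiliar with projection profiles, the following is a minimal sketch of the general idea, not the authors' actual pipeline. It assumes a binarized image given as nested lists (1 = ink, 0 = background); the function names, the threshold parameter, and the toy image are all hypothetical. Ink pixels are summed per row or per column, and a separator is placed in the middle of each empty gap between two runs of ink.

```python
from typing import List, Tuple


def projection_profile(image: List[List[int]], axis: str = "rows") -> List[int]:
    """Sum ink pixels (1 = ink, 0 = background) along each row or each column."""
    if axis == "rows":
        return [sum(row) for row in image]
    if axis == "cols":
        return [sum(col) for col in zip(*image)]
    raise ValueError("axis must be 'rows' or 'cols'")


def ink_runs(profile: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where the profile exceeds the threshold."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i  # a run of ink begins
        elif value <= threshold and start is not None:
            runs.append((start, i))  # the run ended just before index i
            start = None
    if start is not None:
        runs.append((start, len(profile)))  # run reaches the image border
    return runs


def separators(profile: List[int], threshold: int = 0) -> List[int]:
    """Place one separator at the midpoint of each gap between consecutive ink runs."""
    runs = ink_runs(profile, threshold)
    return [
        (prev_end + next_start) // 2
        for (_, prev_end), (next_start, _) in zip(runs, runs[1:])
    ]


# Hypothetical toy image: two blobs side by side, and a gap between upper and lower ink.
toy = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
]

row_separators = separators(projection_profile(toy, "rows"))
col_separators = separators(projection_profile(toy, "cols"))
```

On real scans, a threshold above zero would likely be needed to tolerate noise, and characters whose strokes are disconnected could be split incorrectly; how the dataset's creators handled such cases is described in the paper, not here.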

DOI https://doi.org/10.11588/data/PVYWKB
Related Identifier IsCitedBy https://doi.org/10.6853/DADH.202310_(12).0001
Related Identifier IsCitedBy https://doi.org/10.11588/heidok.00030845
Creator Henke, Konstantin; Arnold, Matthias
Contributor Arnold, Matthias; Henke, Konstantin; Heidelberg Centre for Transcultural Studies (HCTS); Heidelberg Research Architecture, University of Heidelberg
Publication Year 2023
Rights CC BY-SA 4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-sa/4.0
Contact Arnold, Matthias (Heidelberg Centre for Transcultural Studies)
