This dataset accompanies the paper "Language Model Assisted OCR Classification for Republican Chinese Newspaper Text", JDADH 11/2023. In that work, we present methods for obtaining a neural optical character recognition (OCR) tool for article blocks in a Republican Chinese newspaper.
The dataset contains two subsets:
Pairs of text block crops and their corresponding ground truth annotations, taken from the April 1920, 1930 and 1939 issues of the Jingbao newspaper (jingbao_annotated_crops.zip).
Labeled images of single characters, which we automatically cropped from the April 1939 issues of the Jingbao using separators generated from projection profiles (jingbao_char_imgs.zip).
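
For readers unfamiliar with projection profiles, the following is a minimal sketch of the general technique, not the exact procedure used to build jingbao_char_imgs.zip. It assumes a binarized image in which ink pixels are 1 and background pixels are 0. The function names, the `threshold` parameter, and the choice of the gap midpoint as separator position are all hypothetical and chosen only for illustration.

```python
import numpy as np


def projection_profile(binary, axis):
    """Count ink pixels per row (axis=1) or per column (axis=0).

    Assumes a 2D array where 1 = ink and 0 = background.
    """
    return binary.sum(axis=axis)


def find_separators(profile, threshold=0):
    """Return the midpoint index of every run where profile <= threshold.

    Each run of (nearly) empty rows or columns is treated as one gap.
    The threshold value is an illustrative assumption.
    """
    is_gap = profile <= threshold
    separators = []
    start = None
    for i, gap in enumerate(is_gap):
        if gap and start is None:
            start = i  # a new gap begins here
        elif not gap and start is not None:
            separators.append((start + i - 1) // 2)  # gap ended at i - 1
            start = None
    if start is not None:  # gap that runs to the end of the profile
        separators.append((start + len(is_gap) - 1) // 2)
    return separators


def split_at(binary, separators, axis):
    """Cut the image at the separator indices and drop pieces without ink."""
    pieces = np.split(binary, separators, axis=axis)
    return [p for p in pieces if p.sum() > 0]


# Toy example: one vertical text column holding two stacked "characters".
column = np.zeros((7, 3), dtype=int)
column[1:3, :] = 1  # first character
column[4:6, :] = 1  # second character

row_profile = projection_profile(column, axis=1)
row_separators = find_separators(row_profile)
characters = split_at(column, row_separators, axis=0)
```

On a full text block, the same functions could first be applied with `axis=0` profiles to separate the vertical text columns, and then with `axis=1` profiles within each column to separate individual characters. Real scans usually need a nonzero threshold and some smoothing, because noise and touching strokes rarely leave completely empty gaps.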