Jing bao ground truth – text block crops and annotations - Dataset

This dataset accompanies the paper "Language Model Assisted OCR Classification for Republican Chinese Newspaper Text", JDADH 11/2023, which presents methods to obtain a neural optical character recognition (OCR) tool for article blocks in a Republican Chinese newspaper. The dataset contains two subsets: (1) pairs of text block crops and corresponding ground truth annotations from issues of the Jingbao newspaper dated April 1920, 1930 and 1939 (jingbao_annotated_crops.zip); (2) labeled images of single characters, automatically cropped from the April 1939 issues of the Jingbao using separators generated from projection profiles (jingbao_char_imgs.zip).
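
The character cropping mentioned above relies on projection profiles. The following is a minimal sketch of that general technique, not the authors' implementation: it assumes an already binarized image (1 = ink, 0 = background) holding a single vertical column of text, and it places a separator wherever a row contains no ink. The function names, the zero threshold and the toy image are illustrative assumptions.

```python
def projection_profile(image, axis):
    """Count ink pixels per row (axis=0) or per column (axis=1).

    `image` is a list of rows; each pixel is 1 for ink, 0 for background.
    """
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def find_segments(profile, threshold=0):
    """Return (start, end) pairs of runs where profile > threshold.

    `end` is exclusive. The gaps between runs act as separators.
    """
    segments = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i  # a run of ink begins
        elif value <= threshold and start is not None:
            segments.append((start, i))  # the run ended at a separator
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments


def crop_characters(image):
    """Split a single-column block into character crops, top to bottom."""
    rows = find_segments(projection_profile(image, axis=0))
    return [image[top:bottom] for top, bottom in rows]


# Toy 7x4 column: two glyphs separated by one blank row.
toy = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
crops = crop_characters(toy)
```

On real scans a non-zero threshold or a smoothed profile is usually needed, because noise and touching strokes rarely leave perfectly empty rows between characters.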

Identifier
DOI	https://doi.org/10.11588/data/PVYWKB
Related Identifier	IsCitedBy https://doi.org/10.6853/DADH.202310_(12).0001
Related Identifier	IsCitedBy https://doi.org/10.11588/heidok.00030845
Metadata Access	https://heidata.uni-heidelberg.de/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.11588/data/PVYWKB

Provenance
Creator	Henke, Konstantin ; Arnold, Matthias
Publisher	heiDATA
Contributor	Arnold, Matthias; Henke, Konstantin; Heidelberg Centre for Transcultural Studies (HCTS); Heidelberg Research Architecture, University of Heidelberg
Publication Year	2023
Rights	CC BY-SA 4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-sa/4.0
OpenAccess	true
Contact	Arnold, Matthias (Heidelberg Centre for Transcultural Studies)

Representation
Resource Type	Image files in jpg and png formats; Dataset
Format	application/zip
Size	189721051 bytes; 78600047 bytes
Version	1.2
Discipline	Humanities; Linguistics
Spatial Coverage	Heidelberg Centre for Transcultural Studies, University of Heidelberg