ECPO brings together several important digital collections of the early Chinese press within a single overarching framework. In the first phase, several databases on early women’s periodicals and entertainment publishing were created: “Chinese Women’s Magazines in the Late Qing and Early Republican Period” (WoMag), “Chinese Entertainment Newspapers” (Xiaobao), and databases hosted at the Academia Sinica in Taiwan. These systems approach the material in two ways: in the intensive approach, we record all articles, images, advertisements, and related agents and assign them to a complete set of scanned pages; in the extensive approach, we record the main characteristic features of publications. ECPO differs from other databases of Chinese periodicals in that it not only provides image scans but also preserves materials often excluded from reprint, microfilm, or digital (even full-text) editions, such as advertising inserts and illustrations. In addition, it aims to incorporate metadata in both English and Chinese, including keywords and biographical information on editors, authors, and individuals represented in illustrations and advertisements in the journals.
As the material basis of the database consists mostly of image scans, the project has been running experiments on one Republican newspaper to explore approaches to full-text generation. Computer-aided processing of image scans of historical periodicals remains challenging with current technology, in particular because processing standards for Latin-script newspapers do not apply to the Chinese context. Only with new approaches in machine learning has it become possible to transform material that was inaccessible just a few years ago. However, many challenges remain. Extremely complex layouts make reliable automatic page segmentation difficult, which has prevented full-text generation for these newspapers even within China.
The application of artificial intelligence requires a ground truth data set. This error-free, manually corrected text with structural information is used to evaluate and train software models for text and layout recognition. In the fall of 2021, the project successfully implemented OCR on a sample of the newspaper 晶報 Jing bao (The Crystal) with a character error rate below 3% (Henke 2021). On that basis, the project is now expanding and generalizing its approach. With additional funding recently received from the Research Council Cultural Dynamics in Globalized Worlds for the first half of 2022, the project is currently producing a new data set. The project aims to offer a solution for automatically producing full text from Republican newspapers using neural networks and machine learning.
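To illustrate how a character error rate is commonly computed against ground truth, the following is a minimal sketch and not the project's actual evaluation code. It assumes the usual definition: the Levenshtein edit distance between the OCR output and the ground truth text, divided by the number of characters in the ground truth. The function names and the sample strings are invented for illustration.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `hypothesis` into `reference`."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # character missing in hypothesis
                current[j - 1] + 1,                   # extra character in hypothesis
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the ground truth text."""
    if not reference:
        raise ValueError("ground truth text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Invented sample: one of six characters was recognized incorrectly.
    ground_truth = "晶報上海新聞"
    ocr_output = "晶报上海新聞"
    print(f"CER: {character_error_rate(ground_truth, ocr_output):.2%}")
```

Under this definition, a character error rate below 3% corresponds to fewer than three character edits per hundred ground truth characters. A real evaluation would also have to decide how to treat whitespace, punctuation, and character variants, which this sketch leaves out.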
The project’s current work will further develop its original aims and contribute to the field of research as a whole. By publishing its network models and data sets, the project makes its results reproducible and open to evaluation, and allows others in the field to adopt its approaches. Although processing non-Latin scripts is still a challenge in many cases, the project hopes its work may serve as a good-practice example for such initiatives.
The data set provides a first complete extract of all metadata edited by the project so far. Future versions will also incorporate the full text produced in our OCR pipeline.