Ground truth data for an OCR model. Will be continually updated.
The model was originally trained on Transkribus with a PyLaia model created from ground truth data based on transcriptions into Pracalit Unicode of four Nepalese manuscripts. The manuscripts used to create this model are Staatsbibliothek zu Berlin's Hitopadeśa (MIK I 4851) (mixed Newar and Sanskrit, dating to 1561) and Vetālapañcaviṃśati (HS. Or. 6414) (Newar, dating to 1675), as well as Cambridge Digital Library's Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) (Sanskrit, 18th century) and the Royal Asiatic Society Online Collection's Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) (Newar and Sanskrit, dating to c. 1800).
The training was done on 441 pages and validation on 242 pages.
This model does not recognise spacing, except for large gaps (i.e. those left for pictures or string holes). Newar word divider markers may not be represented or may be transcribed as a virama. In general, the model is made for MSS written in scriptio continua and will transcribe them in scriptio continua into Pracalit Unicode.
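Because the model outputs scriptio continua, a reference transcription with editorial word spacing would need its spaces removed before it can be compared line by line with the model output. The following is a minimal Python sketch of that step, assuming plain-text transcriptions with one manuscript line per text line. The function name `to_scriptio_continua` is hypothetical and is not part of Transkribus or PyLaia. Note that it also removes any spaces that mark large gaps, so it should be applied to both sides of a comparison.

```python
def to_scriptio_continua(text: str) -> str:
    """Remove all whitespace inside each line while keeping line breaks.

    Hypothetical helper for illustration only: it treats every
    whitespace character as editorial spacing, including spaces
    that stand for large gaps (pictures, string holes).
    """
    cleaned_lines = []
    for line in text.splitlines():
        # Keep every character that is not whitespace.
        cleaned_lines.append("".join(ch for ch in line if not ch.isspace()))
    return "\n".join(cleaned_lines)


if __name__ == "__main__":
    # Romanised sample text, used here only to show the effect.
    print(to_scriptio_continua("namo buddhāya\nnamo dharmāya"))
```

Applying the same function to the model output as well keeps the comparison consistent where the model has emitted a space for a large gap.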
Transcription was performed by Dr Alexander O'Neill (SOAS University of London). Transcription of the Vetālapañcaviṃśati (HS. Or. 6414) and Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) was aided by unpublished materials provided by Dr Felix Otter (Philipps-Universität Marburg), as well as the published transcription in Shakya, Min Bahadur, and Shanta Harsha Bajracharya, eds. "Svayambhū Purāṇa." Lalitpur: Nagarjuna Institute of Exact Methods, 2001. The transcription of Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) was aided by the transcription provided by the Digital Sanskrit Buddhist Canon Project based on Lokesh Chandra, "Guṇakāraṇḍavyūhasūtram," New Delhi: International Academy of Indian Culture, 1999.
Transkribus, 1.20.0–1.20.1