Deriving Orthographic Data from Classical Japanese Texts with Machine-Learning Methods

Chau, Herman; Zeng, Michael R.; Atkins, Paul S.

Files

jinmonkon paper.pdf (3.83 MB)

Date

2025-12-13

Authors

Chau, Herman

Zeng, Michael R.

Atkins, Paul S.

Abstract

This project applies advanced machine-learning techniques to extract orthographic data—specifically jibo 字母, the Chinese character matrices underlying cursive Japanese hiragana—from classical Japanese manuscripts. Inspired by the National Diet Library’s NDLkotenOCR and the Center for Open Data in the Humanities’ (CODH) KuroNet, our aim is to automate the generation of jibo data from manuscript images. This automation enables large-scale orthographic analysis and scribal attribution, which has traditionally required extensive manual effort. By integrating modern computer vision techniques, we seek to create a robust pipeline that identifies jibo to facilitate deeper linguistic and historical insights into classical Japanese texts.

Description

Published in JINMONKON 2025, the proceedings of the annual conference of the Computers and Humanities Special Interest Group of the Information Processing Society of Japan (IPSJ), December 2025.