Named Entity Resolution for Historical Texts
The field of digital humanities has spurred an increase in applications of computational linguistics to historical documents, but the field remains underdeveloped. Standard natural language processing (NLP) techniques developed using contemporary texts tend to perform poorly when applied to historical documents due to challenges such as spelling variation, semantic shifts, and the lack of a standard orthography. In this thesis, we compare the performance of common Named Entity Recognition (NER) libraries, including Stanford CoreNLP, spaCy, and Flair, on historical texts. We also present a method for named entity resolution designed specifically for historical texts, which combines domain-adapted word embeddings with phonetic and lexical similarities. This method has the potential to increase the speed of digitization of historical documents and improve search capabilities across historical corpora. The algorithm is one of the first trained on historical documents, and it improves upon common approaches to spelling normalization for historical documents, which rely only on lexical and/or phonetic similarity. Additionally, we provide a user interface so that scholars without programming expertise can easily use the tools developed in this thesis. Future work will include linking historical named entities to contemporary references and constructing knowledge graphs for historical corpora.
- Linguistics