Automatic Characterization of Text Difficulty
For the millions of U.S. adults who do not read well enough to complete day-to-day tasks, challenges arise in reading news articles or employment documents, researching health conditions, and many other activities. Adults may struggle to read because they are not native speakers of English, because of learning disabilities, or simply because they did not receive sufficient reading instruction as children. In classroom settings, struggling readers can be given hand-crafted simplified texts, but manual simplification is time-consuming for a teacher or other adult, and it is unavailable to adults outside a classroom environment.

In this thesis, we present a fundamentally new approach to understanding text difficulty, aimed at supporting automatic text simplification. This way of thinking about what makes a text "hard" is useful both for deciding what should be simplified and for judging whether a machine-generated simplification is a good one.

We start by describing a new corpus of parallel manual simplifications, which lets us analyze how writers perform simplification. We examine which words are replaced during simplification, and which sentences are split into multiple simplified sentences, shortened, or expanded. We find very low agreement among writers on how the simplification task should be completed: different writers find a variety of ways to simplify the same content.

This finding motivates a new, empirical approach to characterizing difficulty. Instead of inferring what is difficult from human simplifications, we look at human reading performance. We use an existing, large-scale collection of oral readings to explore acoustic features during oral reading. We then leverage measurements from a new eye-tracking study, finding that all hypothesized acoustic measurements are correlated with at least one of two eye-tracking features known to be related to reading difficulty.
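The correlation analysis described above can be sketched as follows. The feature names and the numbers are illustrative placeholders, not the thesis's actual measurements; the point is only the mechanics of correlating a word-level acoustic feature with a word-level eye-tracking feature.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equal-length feature vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical per-word measurements: an acoustic feature
# (e.g., pause duration before the word, in seconds) and an
# eye-tracking feature (e.g., gaze duration, in milliseconds).
pause_duration = [0.05, 0.40, 0.10, 0.55, 0.20, 0.60]
gaze_duration  = [180, 420, 210, 480, 260, 510]

r = pearson_r(pause_duration, gaze_duration)
# A strong positive r would suggest the acoustic feature tracks
# the same per-word difficulty signal as the eye-tracking feature.
```

In practice one would also test significance and control for word length and frequency, but the basic comparison is a per-word correlation of this kind.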
We use these human performance measures to connect text readability assessment to individual literacy assessment methods based on oral reading.

Finally, we develop several text difficulty measures based on large text corpora. Using comparable documents from English and Simple English Wikipedia, we identify words that are likely to be simplified. We use a character-based language model and features from Wiktionary definitions to predict word difficulty, and show that these measures correlate with observed word difficulty rankings. We also examine sentence difficulty, identifying lexical, syntactic, and topic-based features that help predict when a sentence should be split, shortened, or expanded during simplification. Comparing those predictions to empirical sentence difficulty based on oral reading, we find that lexical and syntactic features are useful for predicting difficulty, while topic-based features are useful for deciding how to simplify a difficult sentence.
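A character-based language model of the kind mentioned above can be sketched as a character n-gram model: words whose character sequences are improbable under a model trained on familiar words score as harder. The tiny training vocabulary, the bigram order, and the smoothing below are illustrative assumptions, not the thesis's actual model.

```python
import math
from collections import Counter

def train_char_bigrams(words):
    """Count character bigrams over a word list, with '^'/'$' boundaries."""
    counts, context = Counter(), Counter()
    for w in words:
        chars = "^" + w.lower() + "$"
        for a, b in zip(chars, chars[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    return counts, context

def avg_log_prob(word, counts, context, alpha=1.0, vocab=28):
    """Add-alpha-smoothed average log-probability per character transition.

    Lower (more negative) values suggest a less familiar, harder word.
    """
    chars = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(chars, chars[1:]):
        p = (counts[(a, b)] + alpha) / (context[a] + alpha * vocab)
        total += math.log(p)
    return total / (len(chars) - 1)

# Hypothetical "simple" training vocabulary; a real model would be
# trained on a large corpus such as Simple English Wikipedia.
simple_words = ["the", "cat", "ran", "house", "water", "read", "hand",
                "other", "than", "then", "rate", "hat", "tan"]
counts, context = train_char_bigrams(simple_words)

easy = avg_log_prob("rat", counts, context)
hard = avg_log_prob("xylophone", counts, context)
# Common character patterns score higher than rare ones, so easy > hard.
```

Normalizing by the number of transitions keeps the score comparable across words of different lengths, so long words are not penalized merely for being long.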
- Electrical engineering