Labeling and Automatically Identifying Basic-Level Categories

Mills, Chad

Files

Mills_washington_0250E_19253.pdf (2.82 MB)

labels.txt (703.55 KB)

system_output.txt (2.24 MB)

Date

2018-11-28

Abstract

Basic-level categories are the primary categories humans use to think and communicate. They are the first categories learned and carry numerous psychological advantages, including fast exemplar recognition. They are valuable in a range of applications, such as assessing text readability. Using WordNet, we create the first broad, representative dataset for building and evaluating systems that identify basic-level categories. We show that the limited labels available in the psychology literature exhibit significant label bias, and we add one novel label value because some chains in a hypernym/hyponym hierarchy do not include a basic-level category. We expand the number of available labels by a factor of 72, from 152 to 11,221. We build a heuristic baseline system to detect basic-level categories and show that systems evaluated on the previously available data can appear twice as effective as they are on a more broadly representative dataset. We take advantage of the larger quantity of labeled data to build a classifier-based system that raises the F-measure from 0.381 for the heuristic-based system to 0.607. We demonstrate that basic-level categories may be useful in a range of applications. For measuring text readability, we show that texts at lower reading levels contain proportionally more basic-level categories, and that our comparison of the reading levels of Wikipedia and Simple Wikipedia using basic-level categories alone aligns well with existing research in the area. We also show that image captions are much more likely to include basic-level categories than ordinary text, further suggesting that basic-level categories may be a useful signal in language grounding applications.
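For readers who want a concrete picture of the two quantities the abstract relies on, the following is a minimal sketch rather than the dissertation's implementation. It assumes the F-measure is the balanced (F1) form and that a "proportion of basic-level categories" can be approximated as the share of tokens found in a label set. The function names and the toy labels are hypothetical and are not drawn from labels.txt or system_output.txt.

```python
from typing import Iterable, Set


def f_measure(predicted: Set[str], gold: Set[str]) -> float:
    """Balanced F-measure: harmonic mean of precision and recall (assumed F1)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def basic_level_proportion(tokens: Iterable[str], basic_level: Set[str]) -> float:
    """Share of tokens that appear in the given basic-level label set."""
    lowered = [t.lower() for t in tokens]
    if not lowered:
        return 0.0
    return sum(t in basic_level for t in lowered) / len(lowered)


# Hypothetical toy labels for illustration only.
gold = {"dog", "chair", "car"}
predicted = {"dog", "chair", "animal"}
score = f_measure(predicted, gold)

caption = "a dog sits on a chair".split()
share = basic_level_proportion(caption, gold)
```

Under these assumptions, the same proportion computed over two corpora gives a crude way to compare them on basic-level category usage alone. A real comparison would need the dissertation's labels and proper tokenization and sense handling.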