Natural Language as a Scaffold for Visual Recognition
A goal of artificial intelligence is to create a system that can perceive and understand the visual world through images. Central to this goal is defining what exactly should be recognized, both in structure and in coverage. Numerous competencies have been proposed, ranging from low-level tasks, such as edge detection, to high-level tasks, such as semantic segmentation. In each case, a specific set of visual targets is considered (e.g., particular objects or activities to be recognized), and it can be difficult to define a comprehensive set of everything that could be present in the images. In contrast to these efforts, we take a broader view of visual recognition. We propose to use natural language as a guide for what people can perceive about the world from images and what machines should ultimately emulate. We show that it is possible to use unrestricted words and large natural language processing ontologies to define relatively complete sets of targets for visual recognition. We explore several core questions centered around this theme: (a) what kind of language can be used, (b) what it means to label everything, and (c) whether structure in language can be used to define a recognition problem. We make progress in several directions, showing, for example, that highly ambiguous sentimental language can be used to formulate concrete targets and that linguistic feature norms can be used to densely annotate many complex aspects of images. Finally, this thesis introduces situation recognition, a novel representation of events in images that relies on two natural language processing resources to achieve scale and expressivity. The formalism combines WordNet, an ontology of nouns, with FrameNet, an ontology of verbs and implicit argument types, and is supported by a newly collected large-scale image resource, imSitu.
Situation recognition significantly improves over existing formulations for activities in images, allowing for higher coverage, increased richness of the representation, and more accurate models. We also identify new challenges with our proposal, such as rarely observed target outputs, and develop methods for addressing them.
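
As an illustration of the kind of structure the situation recognition formalism describes, the following is a minimal sketch, not the thesis's actual data format. It assumes that a situation pairs a verb with a fixed set of roles, each filled by a noun. The `FRAMES` table, the `Situation` class, the role names, and the noun labels are all hypothetical; plain strings stand in for WordNet noun entries, and the role inventory stands in for what a FrameNet-derived frame would supply. Allowing a role to be left unfilled (`None`) is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical frame inventory: each verb maps to the roles it expects.
# In the formalism described above, such roles would be derived from FrameNet.
FRAMES: Dict[str, Tuple[str, ...]] = {
    "clipping": ("agent", "source", "tool", "item", "place"),
}


@dataclass
class Situation:
    """A verb plus one noun filler (or None) for every role of that verb."""

    verb: str
    roles: Dict[str, Optional[str]]  # role name -> noun label (stand-in for a WordNet entry)

    def __post_init__(self) -> None:
        expected = FRAMES.get(self.verb)
        if expected is None:
            raise ValueError(f"unknown verb: {self.verb!r}")
        if set(self.roles) != set(expected):
            raise ValueError(
                f"roles for {self.verb!r} must be exactly {sorted(expected)}"
            )

    def describe(self) -> str:
        """Render the situation as verb(role=noun, ...) in frame order."""
        parts = ", ".join(
            f"{role}={self.roles[role] or '?'}" for role in FRAMES[self.verb]
        )
        return f"{self.verb}({parts})"


# Illustrative annotation for an image of a person shearing a sheep.
example = Situation(
    verb="clipping",
    roles={
        "agent": "man",
        "source": "sheep",
        "tool": "shears",
        "item": "wool",
        "place": "field",
    },
)
print(example.describe())
```

Under these assumptions, a recognition model would predict the verb and one noun per role for each image, and a rarely observed target output corresponds to a verb-role-noun combination that appears only a few times in the annotated data.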