Building Blocks for Data-Driven Theories of Language Understanding

Michael, Julian

Files

Michael_washington_0250E_25907.pdf (2.03 MB)

Date

2023-08-14

Abstract

I propose a paradigm for scientific progress in natural language processing, centered around the development of data-driven theories of language understanding. The central idea is to collect data in tightly scoped, carefully defined ways which allow for exhaustive annotation of a behavioral phenomenon of interest. With such data, we can use machine learning to construct explanatory theories of these phenomena which can be used as building blocks for intelligible AI systems. After laying some conceptual groundwork for the idea, I describe a series of investigations into the development of data and theory for representations of shallow semantic structure in natural language — in particular, using Question-Answer driven Semantic Role Labeling (QA-SRL), a simple schema for annotating verbal predicate-argument structure using highly constrained question-answer pairs. While this just scratches the surface of the complex language behaviors of interest in AI, I outline principles for data collection and theoretical modeling which can inform future scientific progress.