Socially Responsible and Factual Reasoning for Equitable AI Systems
Authors
Gabriel, Saadia
Abstract
Through natural language communication, writers have enormous persuasive power over readers. This can have far-reaching positive societal impact, as in the case of social movements (e.g., the Black Lives Matter movement and protests against anti-Asian hate); however, there are severe negative ramifications when communication is used with malicious intent (e.g., to directly inflict harm through hate speech or to mislead). The ability to read between the lines of what is explicitly stated and to adapt to dynamic social contexts is critical to detecting false or harmful text. However, existing deep learning approaches still have limited generalization and commonsense reasoning capabilities. To expand machine reasoning capabilities, we propose theoretical formalisms to measure the intent, factuality, and social bias of language. We first introduce reaction frames, which allow us to distill knowledge of cognitive and physical effects on readers, such as implied actions (e.g., given the false statement "Water boiled with garlic cures coronavirus," we can infer that the writer is compelling an audience to "drink garlic water"). We find that while neural misinformation detection classifiers are highly capable of distinguishing between truthful and false content, these models are challenged by commonsense implications derived using our neuro-symbolic approach. We discuss how a major bottleneck comes from the inability of neural models to correctly interpret meaning, particularly as it pertains to the plausibility of claims. We conduct a meta-evaluation to test the efficacy of factuality metrics, and show that the evaluation used for generation is ill-suited to benchmarking progress in learning factuality. This study pinpoints specific failure cases of metrics and underlying models, outlining future directions for factuality evaluation. Finally, we show how, despite their limitations, large pretrained language models like GPT-3 can be used to mitigate dataset bias in existing hate speech corpora. We use adversarial generation approaches to better align classifiers with human interpretations of toxicity and to mitigate potentially harmful vulnerabilities in classifiers. As future work, we discuss the need for a proactive, community-driven approach to reducing online harms.
Description
Thesis (Ph.D.)--University of Washington, 2023
