Socially Responsible and Factual Reasoning for Equitable AI Systems


Authors

Gabriel, Saadia

Abstract

Through natural language communication, writers have enormous persuasive power over readers. This can have broad-reaching positive societal impact, as in the case of social movements (e.g., the Black Lives Matter movement and protests against anti-Asian hate); however, there are severe negative ramifications when communication is used with malicious intent (e.g., to directly inflict harm through hate speech or to mislead). The ability to read between the lines of what is explicitly stated and to adapt to dynamic social contexts is critical to detecting false or harmful text. However, existing deep learning approaches still have limited generalization and commonsense reasoning capabilities. To expand machine reasoning capabilities, we propose theoretical formalisms for measuring the intent, factuality, and social bias of language. We first introduce reaction frames, which allow us to distill knowledge of cognitive and physical effects on readers, such as implied actions (e.g., given the false statement "Water boiled with garlic cures coronavirus," we can infer that the writer is compelling an audience to "drink garlic water"). We find that while neural misinformation detection classifiers are highly capable of distinguishing between truthful and false content, these models are challenged by commonsense implications derived using our neuro-symbolic approach. We discuss how a major bottleneck stems from the inability of neural models to correctly interpret meaning, particularly the plausibility of claims. We conduct a meta-evaluation to test the efficacy of factuality metrics and show that the evaluation used for generation is ill-suited to benchmarking progress in learning factuality. This study pinpoints specific failure cases of metrics and underlying models, outlining future directions for factuality evaluation. Finally, we show how, despite their limitations, large pretrained language models like GPT-3 can be used to mitigate dataset bias in existing hate speech corpora. We use adversarial generation approaches to better align classifiers with human interpretations of toxicity and to mitigate potentially harmful vulnerabilities in classifiers. As future work, we discuss the need for a proactive, community-driven approach to reducing online harms.
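To ground the reaction-frame idea described above, a minimal sketch of how such a frame might be represented in code is shown below; the ReactionFrame class and its field names are illustrative assumptions for this page, not the exact schema or implementation from the dissertation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of a reaction frame record; the field names are
# illustrative assumptions, not the dissertation's exact annotation schema.
@dataclass
class ReactionFrame:
    claim: str                         # the claim or headline a reader encounters
    writer_intent: str                 # inferred goal of the writer
    reader_action: str                 # implied action a reader may take
    is_factual: Optional[bool] = None  # gold factuality label, when known

# Example instance reusing the abstract's own illustration of an implied action.
frame = ReactionFrame(
    claim="Water boiled with garlic cures coronavirus",
    writer_intent="convince readers of a home remedy for coronavirus",
    reader_action="drink garlic water",
    is_factual=False,
)
print(frame)
```

The example pairs the surface claim with the inferred writer intent and the implied reader action that a commonsense reasoner is expected to recover from text that never states them explicitly.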

Description

Thesis (Ph.D.)--University of Washington, 2023
