Better Education Through Improved Reinforcement Learning
Mandel, Travis Scott
When a new student comes to play an educational game, how can we determine what content to give them so that they learn as much as possible? When a frustrated customer calls in to a helpline, how can we determine what to say to best assist them? When an ill patient comes into the clinic, how do we determine what tests to run and which treatments to give to maximize their quality of life? These problems, though diverse, all seem a natural fit for reinforcement learning, in which an AI agent learns from experience how to make a sequence of decisions that maximizes some reward signal. However, unlike many recent successes of reinforcement learning, in these settings the agent gains experience solely by interacting with humans (e.g., game players or patients). As a result, although the potential to directly impact human lives is much greater, intervening to collect new data is often expensive and potentially risky. Therefore, in this thesis I present several methods that allow us to evaluate candidate learning approaches offline, using previously collected data, instead of actually deploying them. First, I present an unbiased evaluation methodology based on importance sampling that allows us to compare policies built on very different representations. I show how this approach enables us to improve student achievement by over 30% on a challenging and important educational-games problem with limited data but 4,500 features. Next, I examine the understudied problem of offline evaluation of algorithms that learn online. In the simplified case of bandits, I present a novel algorithm that is (often vastly) more efficient than the previous state-of-the-art approach. Then, for the first time, I examine the more general reinforcement learning case, developing several new evaluation approaches, each with fairly strong theoretical guarantees. Using actual student data, we show that each method has different empirical tradeoffs and is useful in different settings.
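As background, the importance-sampling idea underlying the offline evaluation methodology can be sketched in its simplest per-trajectory form. This is a minimal illustration of the standard estimator, not the thesis's actual implementation; the function name, trajectory format, and policy interface here are assumptions made for the example.

```python
def importance_sampling_estimate(trajectories, target_policy, behavior_policy):
    """Unbiased off-policy estimate of a target policy's expected return.

    Each trajectory is a list of (state, action, reward) triples collected
    under behavior_policy; each policy maps (state, action) to the
    probability of taking that action in that state.
    """
    total = 0.0
    for traj in trajectories:
        weight = 1.0   # cumulative likelihood ratio for this trajectory
        ret = 0.0      # undiscounted return of this trajectory
        for state, action, reward in traj:
            weight *= target_policy(state, action) / behavior_policy(state, action)
            ret += reward
        total += weight * ret
    return total / len(trajectories)
```

Because the estimate reweights logged returns by how likely the target policy was to produce each trajectory, it is unbiased regardless of how the candidate policy represents states, which is what makes comparing policies built on very different representations possible.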
Further, I present new learning algorithms which ensure that, when we do choose to deploy algorithms to humans, the data we gather is maximally useful. I first examine the important real-world problem of delayed feedback in the bandit case. I present an exploration algorithm which is theoretically on par with the state of the art but much more attractive empirically, as evaluated on real-world educational-games data. I show how one can incorporate arbitrary heuristics to further improve reward without harming the theoretical guarantees. Next, I present Thompson Clustering for Reinforcement Learning (TCRL), a Bayesian clustering algorithm which addresses the key twin problems of exploration and generalization in a computationally and data-efficient manner. TCRL has gained traction in industry, being used by an educational startup to serve literacy content to students. Finally, I explore how reinforcement learning agents can best leverage human expertise to gradually extend the capabilities of the system, a topic which lies in the exciting area of Human-in-the-Loop AI. Specifically, I develop Expected Local Improvement (ELI), an intuitive algorithm which carefully directs human effort when creating new actions (e.g., new lines of dialogue). I show that this approach performs extremely well across a variety of simulated domains. I then conclude by launching a large-scale online reinforcement learning system in which ELI is used to direct actual education experts to improve hint quality in a math word problems game. Our preliminary results, based on live student data, indicate that ELI performs well in this setting as well.
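For context on the Bayesian sampling idea behind TCRL, the standard building block is Thompson sampling for a Bernoulli bandit: maintain a Beta posterior per arm, draw one sample from each posterior, and pull the arm whose sample is largest. The sketch below shows only that generic building block under those assumptions; it is not TCRL itself, which adds clustering over states.

```python
import random

def thompson_bandit(arm_success_probs, horizon=2000, seed=0):
    """Simulate Bernoulli Thompson sampling and return pull counts per arm.

    arm_success_probs are the true (unknown to the agent) success rates,
    used here only to simulate rewards.
    """
    rng = random.Random(seed)
    k = len(arm_success_probs)
    alpha = [1] * k  # Beta(1, 1) uniform prior: successes + 1
    beta = [1] * k   # failures + 1
    pulls = [0] * k
    for _ in range(horizon):
        # Draw one posterior sample per arm; pull the arm with the largest draw.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < arm_success_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Because posterior samples for clearly inferior arms are rarely the maximum, exploration concentrates on plausible-best arms automatically, with no explicit exploration schedule to tune.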