Self-Improving Crowdsourcing: Near-Effortless Design of Adaptive Distributed Work
Systems that coordinate work by many contributors have enormous potential to solve many of the most pressing problems facing society today. In particular, crowdsourcing systems enable data collection at scale, which is critical to accelerating scientific discovery and supporting a host of new machine learning applications. However, ensuring these systems are cost-effective and produce high-quality data remains a key challenge, and one of central importance to downstream applications. Creating an efficient, successful crowdsourcing task requires significant time investment and iteration to optimize the entire task pipeline: recruiting workers who have the requisite knowledge, defining and communicating the task requirements, training and testing workers on those requirements, routing tasks to workers in a skill-aware manner, and prioritizing tasks to avoid wasted effort. These design costs underlie nearly every reported crowdsourcing success, yet they are seldom acknowledged and therefore often underestimated. This initial investment makes crowdsourcing impractical for all but the largest tasks; in many cases, it may actually be less costly for the task designer ("requester") simply to perform the task herself.

The central thesis of this dissertation is: high-quality, efficient crowdsourcing tasks can be created at low cost through self-improvement meta-workflows combining algorithms, workers, and minimal requester involvement. Toward this end, I present methods for automating or semi-automating the design of many stages of the task pipeline, thus reducing the burden of the task designer:

• Chapter 3 presents algorithms for efficiently recruiting workers with the requisite knowledge based on their digital footprints, reducing the number of recruiting requests the requester needs to issue.

• Chapter 4 provides algorithms for managing recruited workers by optimizing the amount of training or testing they receive. These algorithms outperform common ad hoc requester policies and require no tuning by the requester.

• Chapter 5 presents algorithms for routing tasks to all trained workers in parallel in a skill-aware manner. These methods also outperform baseline policies and do not require hand-tuning.

• Chapter 6 provides algorithms for prioritizing tasks to reduce wasted effort on multi-label classification tasks, a common task type. These methods use less than 10% of the labor of previously used requester policies.

• Chapter 7 presents a tool that helps the task designer rapidly specify and improve the task design and instructions, by enabling prioritized navigation of the dataset (through worker-surfaced ambiguous categories of questions) and semi-automated workflow creation (through suggested questions for training and testing workers).

This work fully closes the loop, with algorithms, the requester, and workers all contributing to task self-improvement.
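To make the idea of skill-aware routing concrete, here is a minimal illustrative sketch, not the dissertation's actual algorithm: it assumes each worker has a per-category accuracy estimate (e.g., from training or test questions) and greedily sends each task to the worker with the highest estimated accuracy for that task's category. The function and data names are hypothetical.

```python
# Hypothetical sketch of skill-aware task routing: assign each task to the
# worker with the highest estimated accuracy for the task's category.
# Skill estimates would come from training/testing phases in practice.

def route_tasks(tasks, workers, skill):
    """tasks: list of (task_id, category) pairs.
    workers: list of worker ids.
    skill: dict mapping (worker, category) -> estimated accuracy in [0, 1].
    Returns a dict mapping task_id -> chosen worker."""
    assignments = {}
    for task_id, category in tasks:
        # Unseen (worker, category) pairs default to 0.0 accuracy.
        best = max(workers, key=lambda w: skill.get((w, category), 0.0))
        assignments[task_id] = best
    return assignments

tasks = [("t1", "birds"), ("t2", "plants")]
workers = ["alice", "bob"]
skill = {("alice", "birds"): 0.9, ("bob", "birds"): 0.6,
         ("alice", "plants"): 0.5, ("bob", "plants"): 0.8}
print(route_tasks(tasks, workers, skill))  # t1 -> alice, t2 -> bob
```

A real router would also balance load across workers and update skill estimates online as labels arrive; this sketch only captures the skill-aware selection step.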