Managing Premium Data
Data is transforming science, business, and governance by making decisions increasingly data-driven and by enabling data-driven applications. The data used in these contexts usually has significant economic or social value. Frequently, data is purchased from a provider where the price is linked to how the data will be used, and the allowed usage is typically detailed in a license agreement. Data processing, too, is moving to public clouds where users must pay for access to cloud resources, which are frequently shared by multiple users, especially when users analyze a common dataset. Current solutions to manage the economic value of data (prices and licenses) rely on expensive support from economists, auditors and lawyers, thus reducing the net value of data. Similarly, how to price shared cloud resources is poorly understood, and when pricing ignores the shared nature of use, cloud resources are significantly underutilized and users cannot realize the full value of their data. In this thesis, we develop novel, principled and usable tools to manage data licenses and the pricing issues for data and cloud-based data processing. We first present DataLawyer, a system to specify and enforce data use policies on relational databases. It includes an SQL-based formalism to precisely define policies, and novel algorithms to automatically and efficiently evaluate the policies. Experiments on a real dataset from the health-care domain demonstrate overhead reductions of up to 330× compared to a direct implementation of such a system on existing databases. Next, we present a new approach for selecting and pricing shared optimizations on the cloud using mechanism design. We develop new mechanisms, in which users bid for optimizations, to select and price additive and substitutive optimizations, and for the general setting where the users and their bids can change over time.
We show analytically that our mechanisms incentivize truthful bidding and ensure that the cloud never loses money. We show experimentally that our mechanisms yield higher utility than the state-of-the-art approach based on regret accumulation. Lastly, we present improvements to data APIs. APIs are a common way to buy data, but users can significantly overpay when they make multiple API calls and end up purchasing the same data item more than once. We provide a novel, lightweight and fast method to support pricing where a buyer is charged only once for each purchased tuple, even across multiple API calls. To enable this, we present a pricing framework where buyers can claim refunds for repeat purchases of data. We provide the protocols for refunds and develop optimizations to reduce the overhead of exercising refunds. Experiments show that data costs are significantly reduced (10× to 99×) for comparatively modest increases (2× to 5×) in query runtimes.
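The charge-once idea can be illustrated with a minimal sketch. The abstract does not specify the actual refund protocols, so everything below is an assumption made for illustration: a flat per-tuple price, tuples identified by hashable IDs, and a hypothetical buyer-side `RefundLedger` that records what has already been paid for. Each API call is billed in full, and any tuple the buyer already owns is refunded.

```python
from dataclasses import dataclass, field


@dataclass
class RefundLedger:
    """Hypothetical buyer-side ledger (not the thesis's protocol).

    Assumes a flat price per tuple and hashable tuple IDs.
    """
    price_per_tuple: float
    owned: set = field(default_factory=set)
    charged: float = 0.0
    refunded: float = 0.0

    def record_call(self, tuple_ids):
        """Bill one API call in full, then refund tuples already owned.

        Returns the refund amount earned by this call.
        """
        ids = list(tuple_ids)
        self.charged += self.price_per_tuple * len(ids)

        repeats = 0
        for tuple_id in ids:
            if tuple_id in self.owned:
                repeats += 1          # already paid for: eligible for refund
            else:
                self.owned.add(tuple_id)

        refund = self.price_per_tuple * repeats
        self.refunded += refund
        return refund

    @property
    def net_cost(self):
        """Total paid after refunds."""
        return self.charged - self.refunded


ledger = RefundLedger(price_per_tuple=0.5)
ledger.record_call([1, 2, 3])   # first call: nothing to refund
ledger.record_call([2, 3, 4])   # tuples 2 and 3 are repeats
print(ledger.net_cost, len(ledger.owned))
```

In this sketch the net cost always equals the price of the distinct tuples purchased, which is the charge-once property described above. A real system would also need the seller to verify refund claims, which this example leaves out.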