Novel Data Summaries for Join Query Optimization
| dc.contributor.advisor | Suciu, Dan | |
| dc.contributor.author | Cai, Walter Zhen | |
| dc.date.accessioned | 2021-10-29T16:20:02Z | |
| dc.date.available | 2021-10-29T16:20:02Z | |
| dc.date.issued | 2021-10-29 | |
| dc.date.submitted | 2021 | |
| dc.description | Thesis (Ph.D.)--University of Washington, 2021 | |
| dc.description.abstract | As the demand for data-intensive pipelines has grown and the diversity of settings has expanded, generalizability of database management systems has suffered. The execution of join queries, especially multi-join queries, remains one of the field's greatest challenges. We propose novel approaches to such queries using data sketching. Traditional cost-based query optimizers have existed for decades, becoming the de facto method for designing performant analytical systems. Nevertheless, these systems are still hampered by the cost estimation stage. In particular, modern systems fall back on strong assumptions about the underlying data when confronted with multi-join queries. In lieu of chasing perfect estimates over multi-table queries, we propose the application of theoretically guaranteed cardinality upper bounds. These have the benefit that they force the optimizer to act conservatively and deliver fewer high-risk plans to the executor. We demonstrate that the use of bounds leads to fewer disastrous plans than traditional cost estimation techniques while remaining on par on 'easy' queries, where traditional query optimization techniques already perform well. We also preview how this technique may be generalized to large-scale distributed data scenarios. Streaming query optimization introduces fresh challenges on top of post hoc analytic pipelines. While queries are often semantically simpler, the introduction of temporal semantics, required immediacy of output results, and less reliable hardware puts a strain on the execution layer. In particular, the natural method of combining separate data streams, a temporal join, places the onus of adapting to changing stream characteristics on an integrated optimizer-executor. We propose a novel state management algorithm applicable to the threshold-function-over-joins scenario, a common setting in streaming data management. We demonstrate significant state savings and prove that our method is optimal while still guaranteeing no false positives; no threshold function triggers will be lost. | |
| dc.embargo.terms | Open Access | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.other | Cai_washington_0250E_23500.pdf | |
| dc.identifier.uri | http://hdl.handle.net/1773/47991 | |
| dc.language.iso | en_US | |
| dc.rights | CC BY | |
| dc.subject | bounding | |
| dc.subject | cardinality estimation | |
| dc.subject | entropy | |
| dc.subject | join | |
| dc.subject | query optimization | |
| dc.subject | stream | |
| dc.subject | Computer science | |
| dc.subject.other | Computer science and engineering | |
| dc.title | Novel Data Summaries for Join Query Optimization | |
| dc.type | Thesis |
Files
Original bundle
- Name: Cai_washington_0250E_23500.pdf
- Size: 1.92 MB
- Format: Adobe Portable Document Format
