Novel Data Summaries for Join Query Optimization

dc.contributor.advisor: Suciu, Dan
dc.contributor.author: Cai, Walter Zhen
dc.date.accessioned: 2021-10-29T16:20:02Z
dc.date.available: 2021-10-29T16:20:02Z
dc.date.issued: 2021-10-29
dc.date.submitted: 2021
dc.description: Thesis (Ph.D.)--University of Washington, 2021
dc.description.abstract: As the demand for data-intensive pipelines has grown and the diversity of settings has expanded, the generalizability of database management systems has suffered. The execution of join queries, especially multi-join queries, remains one of the field's greatest challenges. We propose novel approaches to such queries using data sketching. Traditional cost-based query optimizers have existed for decades, becoming the de facto method for designing performant analytical systems. Nevertheless, these systems are still hampered by the cost estimation stage. In particular, modern systems fall back on strong assumptions about the underlying data when confronted with multi-join queries. In lieu of chasing perfect estimates over multi-table queries, we propose the application of theoretically guaranteed cardinality upper bounds. These have the benefit that they force the optimizer to act conservatively and deliver fewer high-risk plans to the executor. We demonstrate that the use of bounds leads to fewer disastrous plans than traditional cost estimation techniques but still performs on par with them on 'easy' queries where traditional query optimization techniques already perform well. We also preview how this technique may be generalized to large-scale distributed data scenarios. Streaming query optimization introduces fresh challenges on top of post hoc analytic pipelines. While queries are often semantically simpler, the introduction of temporal semantics, the required immediacy of output results, and less reliable hardware put a strain on the execution layer. In particular, the natural method of combining separate data streams, a temporal join, places the onus of adapting to changing stream characteristics on an integrated optimizer-executor. We propose a novel state management algorithm applicable to the threshold-function-over-joins scenario, a common setting in streaming data management. We demonstrate significant state savings and prove that our method is optimal while still guaranteeing no false positives; no threshold function triggers will be lost.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Cai_washington_0250E_23500.pdf
dc.identifier.uri: http://hdl.handle.net/1773/47991
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: bounding
dc.subject: cardinality estimation
dc.subject: entropy
dc.subject: join
dc.subject: query optimization
dc.subject: stream
dc.subject: Computer science
dc.subject.other: Computer science and engineering
dc.title: Novel Data Summaries for Join Query Optimization
dc.type: Thesis

Files

Original bundle

Name: Cai_washington_0250E_23500.pdf
Size: 1.92 MB
Format: Adobe Portable Document Format