Accelerating and enabling discovery in the decade of astronomical surveys
Abstract
With the advent of a new generation of astronomical surveys, such as the Legacy Survey of Space and Time (LSST) at the Vera C. Rubin Observatory, astronomers will have access to a wealth of data. If we are to fully exploit the data these surveys will generate, we will need to develop novel algorithmic approaches for analyzing astronomical images, as well as tools to scale these algorithms to petabytes of data. In this thesis I focus on both aspects: scaling novel algorithms to extract more science from Rubin data than was previously possible, and developing a novel approach to the analysis and classification of lightcurves.

The challenges in scaling to petabytes of data are multi-faceted. The Vera C. Rubin Science Pipelines are a collection of algorithms and workflow-management functionality intended to process data taken during the Rubin Observatory's Legacy Survey of Space and Time (LSST). In the first chapter of this thesis I describe how we implemented an Amazon Web Services and Google Cloud compliant backend for the Rubin middleware components that enable executing the Rubin Science Pipelines on cloud resources. I demonstrate that, for short-term projects with a large input dataset and a small output volume of results, the cloud is almost always cost effective: at the same cost, analysis results can be retrieved much sooner by allocating more compute resources.

In chapter 2 I demonstrate how new algorithms can improve the amount of science delivered by processing Rubin-like data at Rubin-like scales. The Kernel-Based Moving Object Detection (KBMOD) package is a tool developed to search for moving objects in collections of images using a shift-and-stack method along linear trajectories. \citet{Smotherman2024} demonstrated that it can detect objects below the SNR limit of a single exposure.
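The shift-and-stack idea behind KBMOD can be illustrated with a minimal sketch: shift each epoch's image back along a candidate linear trajectory and co-add, so that flux from a faint mover following that trajectory accumulates coherently. This is a hypothetical toy implementation for intuition only, not the KBMOD API; the function name, integer-pixel shifting, and units (pixels per day) are illustrative assumptions.

```python
import numpy as np

def shift_and_stack(images, times, vx, vy):
    """Co-add images after shifting each one back along a linear
    trajectory (vx, vy) in pixels/day relative to the first epoch.
    A source moving at exactly (vx, vy) adds coherently, boosting
    its SNR roughly as sqrt(N) over a single exposure."""
    stack = np.zeros_like(images[0], dtype=float)
    t0 = times[0]
    for img, t in zip(images, times):
        dt = t - t0
        # Integer-pixel shifts for simplicity; production codes
        # interpolate or work on per-pixel likelihoods instead.
        dx, dy = int(round(vx * dt)), int(round(vy * dt))
        stack += np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)
    return stack / len(images)
```

A real search repeats this over a grid of candidate velocities and angles, which is why widening the searched angle and velocity ranges (as in chapter 2) directly multiplies the compute cost.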
In this chapter I discuss the work required to improve the performance of KBMOD and execute it on all of DEEP's data (~200 TB, equivalent to 10 nights of Rubin data), while simultaneously increasing the range of searched angles by 100\% and the range of searched velocities by an additional 38\% compared to the previous search. Compared to Smotherman et al. (2024), we achieve a 10\% higher peak detection efficiency, but a 0.3 magnitude shallower limiting magnitude at which 50\% of the objects are recovered. We identify the higher filtering thresholds, chosen because of the large number of estimated returned results, as the key culprit for the loss of limiting magnitude.

In the final chapter of the thesis I develop a new approach to time-series classification. By applying ideas inherited from differential geometry on Riemannian manifolds, I demonstrate how it is possible to construct a measure of distance between two curves based on nothing more than their shape. I consider two distance measures: the Square Root Velocity metric and the varifold fidelity measure; the latter is robust to different light curve parameterizations. Multiple classification schemes are constructed based on these distances, including agglomerative (hierarchical) clustering, fitting a sum of Gaussians, and a K-Means-like algorithm that finds generalized (Fréchet) means. Classification accuracy on a high-SNR dataset was 96.58\%, 95.9\%, and 98.93\% for the sum-of-Gaussians, agglomerative clustering, and K-Means approaches, respectively. Validation of the approach on the PLAsTiCC lightcurve dataset was less successful, achieving a top classification accuracy of 77.03\%. Two key reasons for the drop in classification accuracy are identified: heteroskedastic uncertainties in the data, and reparameterizations of the curves.
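The Square Root Velocity (SRV) idea can be sketched compactly: map each curve to q(t) = c'(t)/sqrt(|c'(t)|) and compare the images under the L2 norm, so that translations drop out and only shape-through-velocity remains. The sketch below is a simplified illustration under the assumption that both curves are sampled at the same parameter values; it omits the reparameterization optimization that makes the full framework elastic, and the function names are illustrative, not the thesis code.

```python
import numpy as np

def srv_transform(curve):
    """SRV representation of a discretized curve of shape (n, d):
    q(t) = c'(t) / sqrt(|c'(t)|). Only derivatives are used, so
    translating the curve leaves q unchanged."""
    deriv = np.gradient(curve, axis=0)
    speed = np.linalg.norm(deriv, axis=-1, keepdims=True)
    # Clip to avoid dividing by zero on flat segments.
    return deriv / np.sqrt(np.clip(speed, 1e-12, None))

def srv_distance(curve_a, curve_b):
    """Discrete L2 distance between SRV representations -- a
    shape-based distance between two curves sampled at the same
    parameter values (no reparameterization search)."""
    qa, qb = srv_transform(curve_a), srv_transform(curve_b)
    return np.sqrt(np.sum((qa - qb) ** 2) / len(curve_a))
```

Because this fixed-parameterization distance is sensitive to how a lightcurve is sampled, it motivates the varifold fidelity measure discussed above, which is robust to reparameterization.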
Description
Thesis (Ph.D.)--University of Washington, 2025
