Computer science and engineering

Permanent URI for this collection: https://digital.lib.washington.edu/handle/1773/4909


Recent Submissions

Now showing 1 - 20 of 569
  • Single Cell Methods to Learn Transcription Factor Interactions and Necessary Noncoding DNA During Zebrafish Somitogenesis
    (2026-04-20) Mullen, Andrew Carter; Trapnell, Cole
    The precise control of gene expression during embryonic development is orchestrated by complex networks of transcription factors (TFs) and their interactions with noncoding regulatory DNA elements. However, our understanding of how these elements function in vivo and how TFs cooperate to drive cell fate decisions remains incomplete and limited in scope. In this thesis, I present sciPlex ATAC-seq, a multiplexed, low-cost method for the single-cell assay for transposase-accessible chromatin. Pairing this method with sciPlex RNA-seq, a similar method for transcriptomic measurement, allowed me to build an integrated single-cell atlas of zebrafish embryogenesis and organogenesis from unpaired data to map cis-regulatory elements during zebrafish development. This approach enables single-cell resolution mapping of accessible regulatory DNA in hundreds to thousands of individually indexed embryos, allowing the identification of enhancers and TF binding motifs across hundreds of cell types. I further integrate these data with deep learning models to infer TF-TF cooperativity and test mechanistic predictions using F0 CRISPR-injected zebrafish embryos. Following up on novel interactions from our single-cell perturbation experiments, I developed and applied methods to identify TF-perturbation-responsive noncoding DNA elements and tested their sufficiency and necessity during zebrafish somitogenesis. This framework offers a low-cost and scalable approach to decode the regulatory logic of development and provides a blueprint for functionally annotating the noncoding genomes of multicellular models.
  • Distance and Symmetry: Two Pillars of a Good Code
    (2026-04-20) Sprumont, Oscar; Rao, Anup
    This thesis explores the benefits of distance and symmetry in the design of error-correcting codes. Our main measure of distance will be the minimum distance of a code, i.e., the smallest number of coordinates any two codewords may differ in. Our main criterion for symmetry will be transitivity, i.e., the requirement that any two coordinates be interchangeable. We will also consider generalizations of these two notions, for instance double transitivity (the requirement that any two pairs of coordinates be interchangeable) and generalized distances (the minimum number of nontrivial coordinates in any subcode of a certain size). We argue that codes with large distances and high symmetry present desirable properties for communication on noisy channels. Concretely, we prove the following results: 1) Any linear code that achieves list decoding capacity and has superconstant minimum distance also achieves capacity on the symmetric channel. 2) For any linear code C with large enough generalized distances, the bit-decoding and block-decoding thresholds of C on the erasure channel are asymptotically equal. 3) Any transitive linear code $C \subseteq \mathbb{F}_q^N$ contains at most $q^{h_q(\alpha) \cdot \dim C}$ codewords of weight $\alpha N$. This upper bound is tight, as evidenced by repetition codes. 4) For every doubly transitive code C, there is a range of noise levels for which C achieves the information-theoretic optimal trade-off between rate and list decoding size. 5) The canonical example of linear codes with large distances and high symmetry is the family of Reed-Muller codes. We show that for appropriate choices of Reed-Muller codes $C_1$, $C_2$ and $C_3$, the tensor code $C_1\otimes C_2\otimes C_3$ achieves capacity on the symmetric channel with quasilinear decoding time.
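    For reference, result 3) is stated in terms of the $q$-ary entropy function $h_q$; assuming the standard definition (the abstract does not spell it out), $h_q(\alpha) = \alpha\log_q(q-1) - \alpha\log_q\alpha - (1-\alpha)\log_q(1-\alpha)$ for $0 < \alpha < 1$, so the bound caps the number of weight-$\alpha N$ codewords at $q^{h_q(\alpha)\cdot\dim C}$, which repetition codes meet with equality.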
  • Accelerating Collective Communication for Distributed Machine Learning
    (2026-04-20) Zhao, Liangyu; Krishnamurthy, Arvind
    Collective communication has emerged as a cornerstone of distributed machine learning, enabling datacenter-scale clusters of accelerators to collaboratively train or serve large language models. However, it has also become a significant performance bottleneck, impeding the efficient utilization and scalability of hardware resources. This dissertation focuses on optimizing collective communication for machine learning hardware and workloads, approaching the challenge from the perspectives of network topology, communication scheduling, and parallelization strategies. We first present our work on co-optimizing network topology and communication scheduling for direct-connect optical circuit networks. We propose expansion techniques and a linear programming-based schedule generation algorithm to synthesize efficient large-scale topologies and schedules, thereby forming a Pareto frontier of the latency-throughput trade-off. Our approach enables efficient collective communication on low-diameter topologies. Then, we introduce ForestColl, a schedule generation algorithm capable of producing throughput-optimal schedules for any network topology in polynomial time. ForestColl leverages prior graph-theoretical results to construct spanning trees for collective communications. It is the first work to achieve throughput optimality for collective communications while delivering orders-of-magnitude speedups in schedule generation compared to prior approaches. Finally, we outline our future work on automating the search for parallelization and optimization strategies in machine learning training. We propose a strategy grounded in the sharding and processing states of tensors within the compiled computation graph. By adopting a unified view of all tensor types, we propose a method that can discover optimal parallelization and optimization strategies through the determination of abstract tensor states.
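    As a rough illustration of the objects ForestColl reasons about (not its algorithm, which constructs provably throughput-optimal trees), the sketch below builds one naive BFS broadcast tree per root for an allgather and reports the busiest link; the topology and the load metric are invented for illustration.
```python
# Toy illustration only: schedule an allgather as one BFS broadcast tree
# per root and report the bottleneck link load. ForestColl instead
# constructs spanning trees that are provably throughput-optimal.
from collections import deque, defaultdict

def bfs_tree(adj, root):
    """Parent pointers of a BFS spanning tree rooted at `root`."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def bottleneck_load(adj):
    """Each node broadcasts its shard along its own tree; the busiest
    directed link bounds the achievable allgather throughput."""
    load = defaultdict(int)
    for root in adj:
        for v, u in bfs_tree(adj, root).items():
            if u is not None:
                load[(u, v)] += 1  # one shard crosses link u -> v
    return max(load.values())

# 4-node bidirectional ring: 0-1-2-3-0
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(bottleneck_load(ring))  # busiest link carries 2 shards here
```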
  • Data as Foundation: Designing Systematic Curation for an Evolving Foundation Model Landscape
    (2026-04-20) Nguyen, Thao; Oh, Sewoong; Schmidt, Ludwig
    Foundation models have transformed the machine learning landscape with unprecedented generalization capabilities across a variety of tasks. Central to their success is the data on which they are trained, which has grown massively in scale through large web crawls and data generation efforts. Despite growing awareness of the need for data curation, current data practices remain largely heuristic and coupled with specific model and training configurations, making it difficult to isolate data-centric contributions. In this thesis, I present my work towards developing systematic, generalizable, and timely methods to optimize dataset design for foundation models. In the first work, I provided one of the earliest empirical demonstrations that indiscriminately mixing different web data sources undermines model generalization, establishing data quality as a foundational principle for large-scale curation. As the field embraced data quality and proposed increasingly aggressive filtering pipelines, I found that these methods tend to overfit to existing benchmarks and systematically exclude valuable data, such as non-English content, which can improve model performance as a whole. My subsequent work thus argues that diversity in representation should be a deliberate design decision in the curation process, instead of existing only as a byproduct. Next, moving beyond filtering as the primary curation tool, I proposed image recaptioning as a way to transform low-quality image-text pairs into useful training data. Rather than asking what data to discard, my research instead asked what discarded data can be recovered. In the last work covered by this thesis, I extended this philosophy to the text domain. I addressed the growing scarcity of high-quality web texts by offering a sustainable approach to recycle discarded documents, effectively doubling the yield of useful pretraining tokens. Collectively, my research contributes to establishing data curation as a scientific discipline---one that is systematic, adaptive, and central to the future of foundation model development.
  • Reconstructing Visual Appearance and Process by Repurposing Pretrained Diffusion Models
    (2026-04-20) Chen, Bowei; Curless, Brian; Seitz, Steven M.
    Visual generation problems often arise in regimes where the available observations are incomplete or indirect. Inputs may capture only fragments of visual appearance, or omit the intermediate processes that produced the final result, yet models are expected to synthesize outputs that are visually complete, coherent, and plausible. This setting places strong demands on generative models, requiring them to infer missing information, integrate fragmented evidence, and reconstruct underlying structure or dynamics consistent with the observed outcome. This thesis investigates how pretrained diffusion models can be repurposed to address visual generation tasks arising under partial or indirect observation. I study a set of representative applications in which this tension manifests in different forms, spanning both appearance reconstruction and process reconstruction. These include synthesizing complete human appearance and motion from a small number of casually captured selfies, selectively editing portraits while preserving fine-grained identity features, and reconstructing plausible painting processes from a single finished artwork. Across these settings, the desired outputs are visually complete, while the inputs provide only sparse, incomplete, or indirect constraints. Although recent diffusion models learn powerful visual priors through large-scale pretraining, they are primarily designed for generic generation tasks such as text-to-image or text-to-video synthesis, and are not directly suited to these partial-observation scenarios. Mismatches in input structure, supervision, and data availability make naive fine-tuning or prompt-based adaptation ineffective. This thesis addresses the central question: how can pretrained diffusion models be repurposed to support novel visual generation tasks under partial observation? I explore repurposing strategies across different supervision regimes. In settings where paired training data is unavailable, I develop methods that exploit weakly aligned observations or synthetically constructed supervision to enable appearance reconstruction from fragmented inputs. In settings with limited but well-aligned paired data, I show that composing multiple pretrained models into cascaded pipelines can amplify scarce supervision and enable complex process reconstruction, such as inferring plausible sequences of painting actions consistent with a final artwork. Beyond pipeline-level design, the effectiveness of reconstructing visual appearance and process also depends critically on the quality of visual priors learned during diffusion pretraining. Stronger priors lead to more robust adaptation and higher generation quality across tasks. This observation motivates the final part of the thesis, which explores repurposing at a more fundamental level. I propose AlignTok, a framework that repurposes pretrained visual encoders as tokenizers for diffusion models, enabling diffusion to operate in a semantically rich latent space. This design simplifies training and improves efficiency, scalability, and generation quality, resulting in stronger visual priors that better support downstream reconstruction tasks.
  • Grounding Perception and Reasoning in Multimodal Models
    (2026-04-20) Park, Jae Sung; Farhadi, Ali; Choi, Yejin
    Current large multimodal models achieve impressive recognition capabilities that approach human-level performance. Their success is driven largely by massive web-scale pretraining, with continued improvements from scaling model and data size. However, this reliance on web data has not yet translated into fully reliable multimodal systems; these models remain prone to hallucinating content and failing to correctly ground their reasoning in images and videos. This disconnect arises because web data often fails to capture the implicit, grounded reasoning inherent to human cognition, which is rarely explicitly verbalized online. Consequently, this creates a data scarcity where purely scaling up the current pipeline may no longer yield proportional improvements. This thesis investigates methods to bridge this gap by grounding perception and reasoning through targeted data distillation from structured representations. First, to address the scarcity of grounded reasoning data, I introduce VisualComet, a large-scale dataset of human annotations designed to train models that predict dynamic events, past contexts, and human intents from images. Recognizing the scalability bottlenecks of manual annotation, I subsequently present Localized Symbolic Knowledge Distillation (LSKD), which leverages large language models to generate synthetic reasoning chains and utilizes a trained critic model to filter these outputs for quality and alignment with human judgment. Next, I investigate the reliability of the underlying perception systems by presenting a diagnostic framework using automatic contrast sets for video-language models. By systematically manipulating entities and verbs in video descriptions, this work reveals that state-of-the-art models frequently ignore visual signals in favor of language priors. Finally, to directly enhance these perception capabilities, I introduce Synthetic Visual Genome, a method for scaling up structured scene graphs defined by objects and their relationships. By distilling this structural knowledge into the training process, I show that this approach significantly improves grounded relationship understanding and reasoning.
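    As a toy illustration of the contrast-set idea described above (the swap tables and editing rule here are invented; the thesis's actual manipulation of entities and verbs is more careful), a minimally edited caption should no longer match the video, so a model that scores it as highly as the original is likely ignoring the visual signal:
```python
# Toy contrast-set generation for video-language diagnostics
# (hypothetical swap tables; illustrative only).
VERB_SWAPS = {"opens": "closes", "enters": "exits", "picks up": "puts down"}
ENTITY_SWAPS = {"dog": "cat", "guitar": "violin", "man": "woman"}

def make_contrasts(caption):
    """Return minimally edited captions that should no longer match the video."""
    contrasts = []
    for table in (VERB_SWAPS, ENTITY_SWAPS):
        for src, dst in table.items():
            if src in caption:
                contrasts.append(caption.replace(src, dst))
    return contrasts

print(make_contrasts("a man picks up a guitar"))
# ['a man puts down a guitar', 'a man picks up a violin', 'a woman picks up a guitar']
```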
  • Systematic Explorations for Data-Efficient LLM Training
    (2026-04-20) Li, Jeffrey; Schmidt, Ludwig; Ratner, Alexander
    Large language models (LLMs) have demonstrated remarkable performance across many downstream domains. When training these models, the first phase of pretraining involves learning to predict the next token across massive amounts of web-scale text data. While the ever-increasing scale of pretraining gives models a crucial foundation of knowledge and skills, it also comes with extremely high costs, constraining both achievable performance (within practical compute budgets) and amenability to scientific study. In this dissertation, we discuss our work along two key directions for more data-efficient LLM pretraining. First, we study data curation, tackling the problem of how to best process and filter Internet data into usable training datasets. Second, we consider the question of how to best continually update models on new data as the world evolves over time. For both directions, we propose novel benchmark setups to systematically explore different strategies and highlight key challenges, ultimately resulting in interventions that offer significant efficiency gains. Beyond pretraining, we also examine the labeled data bottleneck when fine-tuning models for specific tasks. We investigate the intersection of programmatic weak supervision and semi-supervised learning, clarifying when and how the latter can further improve the labeling efficiency of the former. Collectively, these works aim to contribute to the broader effort of making LLM training a more effective and scientifically rigorous practice.
  • IVF Singular Search: Agent-Based Implementation of Vector Search on GPU
    (2026-02-05) Rakhmatullaev, Akbarbek Azamatovich; Fukuda, Munehiro
    Vector search plays a crucial role in large-scale similarity search applications, with IVF (Inverted File Index) being a widely used indexing method due to its balance between accuracy and efficiency. However, traditional vector search algorithms that use IVF as an indexing method, such as IVF Flat and IVFPQ, yield results by brute-force search within each cluster/list. This paper introduces a new IVF-based vector search algorithm, called IVF Singular Search, which searches within each cluster/list using a different data arrangement and a binary-search traversal. In order to accelerate the development phase, the author used MASS CUDA to handle the search, which raised the level of abstraction of the code. We evaluated IVF Singular Search, implemented for GPUs using MASS CUDA, against two other algorithms, IVF Flat and IVFPQ, demonstrating the approach's significant speed advantage. The findings suggest IVF Singular Search can make vector search more efficient and robust on infrastructure that requires immediate response, such as navigation systems or robots.
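    A minimal sketch of the general idea of combining IVF with a binary-search traversal inside each inverted list is shown below; the exact data arrangement used by IVF Singular Search, and its MASS CUDA GPU implementation, are not described here, so the sort key, probe window, and parameters are assumptions for illustration.
```python
# Illustrative IVF sketch: inside each inverted list, vectors are kept
# sorted by distance to the centroid so a query can binary-search for its
# own distance-to-centroid and score only a window of nearby candidates.
import bisect
import numpy as np

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid; keep each list sorted."""
    lists = {i: [] for i in range(len(centroids))}
    for v in vectors:
        i = int(np.argmin(np.linalg.norm(centroids - v, axis=1)))
        lists[i].append((float(np.linalg.norm(v - centroids[i])), v))
    for i in lists:
        lists[i].sort(key=lambda kv: kv[0])
    return lists

def search(query, centroids, lists, nprobe=2, window=8):
    """Probe the nprobe closest clusters; binary-search each sorted list."""
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    best = (np.inf, None)
    for i in order:
        keys = [k for k, _ in lists[i]]
        pos = bisect.bisect_left(keys, float(np.linalg.norm(query - centroids[i])))
        for k, v in lists[i][max(0, pos - window): pos + window]:
            d = float(np.linalg.norm(query - v))
            if d < best[0]:
                best = (d, v)
    return best

rng = np.random.default_rng(0)
data, cents = rng.normal(size=(1000, 16)), rng.normal(size=(8, 16))
ivf = build_ivf(data, cents)
print(search(rng.normal(size=16), cents, ivf)[0])  # distance of best candidate found
```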
  • Retrofitting automated verification to systems code by scaling symbolic evaluation
    (2026-02-05) Nelson, Luke Robert; Wang, Xi
    Formal verification is a technique for eliminating classes of bugs in systems software by formally proving that a system's implementation meets its intended specification. While effective at systematically preventing hard-to-catch bugs, formal verification demands a significant effort from developers in the form of manual proofs. Automated verification techniques reduce the burden of verification by leveraging automated reasoning to avoid the need for manual proofs. But as a result, they sacrifice generality and require developers to build bespoke verification tools and to carefully design systems with automated verification in mind. This dissertation explores how to make it easier to build and reuse automated verifiers, and how to retrofit systems to automated verification. To do so, we built Serval, a framework for writing automated verifiers for systems code. To use Serval, developers write an interpreter for a language; Serval then leverages the Rosette programming language to lift the interpreter into a verifier via symbolic evaluation. Serval also comes with a set of techniques and optimizations to help overcome verification bottlenecks. We use Serval to develop automated verifiers for RISC-V, x86, Arm, LLVM IR, and BPF. We apply these verifiers to retrofit automated verification to two existing security monitors previously formally verified using other techniques: CertiKOS, an OS kernel with strict process isolation, and Komodo, a monitor that implements secure enclaves. We port these two systems to RISC-V, modifying their interfaces for automated verification and to improve security. We write specifications amenable to automation, and compare our efforts with those of the original systems. To demonstrate applicability to systems beyond security monitors, we use Serval to build Jitterbug, a framework for writing and verifying just-in-time (JIT) compilers for the Berkeley Packet Filter (BPF) language in the Linux kernel. We develop a specification of compiler correctness suitable for these JITs. Using this approach, we found and fixed more than 30 new bugs in the JITs in the Linux kernel and developed a new, verified BPF JIT for 32-bit RISC-V.
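    The sketch below is a conceptual analogue of the interpreter-to-verifier idea, written with Z3's Python bindings rather than Rosette; the toy instruction set, program, and specification are invented for illustration and are not Serval's.
```python
# Conceptual sketch (not Serval): write an interpreter for a tiny
# register machine, run it on symbolic Z3 bit-vectors, and check a
# specification by asking whether its negation is satisfiable.
from z3 import BitVec, BitVecVal, If, Solver, unsat

def interp(program, regs):
    """Interpret a tiny three-operand ISA over Z3 bit-vector expressions."""
    for op, dst, src in program:
        if op == "mov":
            regs[dst] = regs[src]
        elif op == "add":
            regs[dst] = regs[dst] + regs[src]
        elif op == "neg_if_lt0":          # dst := -dst if dst < 0
            regs[dst] = If(regs[dst] < 0, -regs[dst], regs[dst])
    return regs

# A toy "absolute value" routine in the toy ISA.
abs_prog = [("mov", "r1", "r0"), ("neg_if_lt0", "r1", "r1")]

# Symbolic evaluation lifts the interpreter into a verifier: the result is
# a formula over the symbolic input x, which Z3 checks against the spec.
x = BitVec("x", 32)
out = interp(abs_prog, {"r0": x, "r1": BitVecVal(0, 32)})["r1"]
solver = Solver()
solver.add(out != If(x < 0, -x, x))       # negate the specification
print("verified" if solver.check() == unsat else "counterexample found")
```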
  • Steps Towards the Pluralistic Alignment of Language Models
    (2026-02-05) Sorensen, Taylor John; Choi, Yejin
    AI alignment is concerned with ensuring that AI systems understand and adhere to human values and preferences. However, most prior alignment work makes a simplifying assumption that preferences are monolithic. In reality, human values and preferences can vary between and within individuals, groups, and societies. In this dissertation, I formalize and advance the study of \textit{pluralistic alignment}, or aligning AI systems with diverse human values, perspectives, and preferences. Specifically, I use large language models (LLMs) as a test-bed for pluralistic alignment. I first motivate the need for pluralism in alignment, outlining failure modes and risks of either assuming that value variation doesn't exist or ignoring such variation. I propose a concrete framework for pluralistic alignment, including three definitions of how models and benchmarks can each be pluralistic. Based on this framework, I propose a roadmap with recommendations and directions for further empirical and methodological work in the area. This framework has been widely adopted by the community, and serves as an agenda for the remainder of the dissertation. Next, I focus on improving LLMs' ability to properly model and steer to varied human values. I introduce a large-scale dataset for value pluralism (\textsc{Value Prism}), and conduct a human study to understand whose values are represented. With this dataset, I train \textsc{Value Kaleidoscope}, a model for assessing the relevance of values to a particular situation and giving contextual judgments based on a value description. I find that the model is sensitive to situational changes and that it helps to explain human variation. I then propose an autoencoder-based approach for inferring the values that could have led to a particular individual's judgments (called \textit{value profiles}). I find that our value profile approach is able to preserve $>$70\% of the predictive information found in the rater demonstrations on which the profiles are based, and offers benefits in terms of interpretability and steerability. Based on value profiles, I propose a novel rater clustering method for assigning individuals to a fixed number of clusters. I find that these clusters are far more predictive than demographic groupings of the same size, and that the clusters enable dataset-specific analysis of the dimensionality of rater variation. Generalizing beyond textual value descriptions, I focus on language model post-training for general tasks and abilities. I find that current instruction-tuning techniques reduce pluralism in many ways, harming LLMs' ability to steer to subjective judgments and diverse generation distributions, leading to mode collapse on queries with many valid answers, and reducing distributional alignment. Pretrained models are better at steering and matching distributions, but are less usable as a result of being poor at following instructions. To improve instruction-following while also improving pluralism, I compile a large-scale resource of $>$40 datasets in a unified format that require inferring and steering to diverse generation functions in-context (\textsc{Spectrum Suite}). With this data, I introduce \textsc{Spectrum Tuning}, a simple and scalable post-training method which improves instruction-following concurrently with several modes of pluralism, leading to more steerable models which also avoid mode collapse. Based on \textsc{Spectrum Tuning}, I further design a system for steering to individuals, which achieves state-of-the-art performance at individual subjective judgment modeling. To conclude, I survey related work in the community building on our pluralistic alignment framework and methodologies and outline directions for future work.
  • Towards Interpretable and Robust ML Systems
    (2026-02-05) Verma, Sahil; Bilmes, Jeffery JB; Shah, Chirag CS
    Recent advancements in ML have taken strides in enabling models to accomplish unprecedented tasks, ranging from simple binary classification for loan applications to intrinsically complex self-driving. As the models have become better, faster, and more powerful, they have also become larger and more opaque. This has happened because of the widespread use of neural networks, which enable capturing and expressing incredibly complex representations but are uninterpretable to humans. This phenomenon raises the question of trust -- as humans who want to remain in control, how do we trust the model to make correct decisions? In this thesis, I aim to answer this question by making models more interpretable, examining their robustness, and ensuring they are safe for us as a society to rely on.
  • Improving Online Community Governance at Web Scale
    (2026-02-05) Weld, Galen Cassebeer; Althoff, Tim; Zhang, Amy X
    Nearly two out of every three people on the planet are members of an online community, and this number is forecast to keep growing. These communities have an incredible diversity of topic, size, and structure, and they offer unique ways to connect their users and bring people together. Unfortunately, online communities have also been associated with significant offline harms, including the mental health crisis, abuse and harassment, interference with free and democratic elections, and radicalization and political polarization. Almost all online communities rely on some form of governance to set and enforce rules, role model good behavior, and generally lead the community. The forms that this governance takes vary widely from community to community. On some platforms, moderators' work is conducted in the background, while in many others, community leaders are volunteers who take a more visible role. Many communities' governance also relies on a range of complex technical tools. Some communities operate on a pseudodemocratic basis, with nominations and regular elections, while others operate on a consensus model, and still others are effectively autocracies. It is very difficult to know how best to govern an online community, given different community needs, the enormous range of available governance strategies, and the challenge of empirically measuring governance and outcomes. In this dissertation, I conduct research that makes online communities better through data-driven analyses of community values, moderation practices, and experiments with new tools. My work focuses on three important research activities: (1) I \emph{characterize} communities' values in community members' own words to build a foundational understanding of communities' needs and what `better' actually means. (2) I \emph{assess} existing moderation practices and community affordances such as voting at a massive scale across hundreds of thousands of communities in order to identify which practices are most promising. (3) I \emph{deploy} interventions and best practices in partnership with community leaders to maximize real-world impact. Much of my research is conducted on Reddit, one of the largest platforms for online communities, and a platform where I am a longtime moderator of several subreddits, and a member of the Reddit Moderator Council. My dissertation makes several key contributions: My theoretical contributions include the first-ever taxonomy of community values, based on the largest-to-date surveys of community members. My methodological contributions include a new method for scalably measuring community outcomes by quantifying how community members talk about their moderators, and a new method for classifying the rules enforced by communities. Finally, I make artifact contributions by publishing classifiers for discussions of moderators and rules, and datasets of anonymized survey results, community rules, and news sharing behavior.
  • Hybrid Static-Dynamic Feature-Weighted Analysis for IoT Botnet Malware Detection
    (2026-02-05) Lemak, Colleen; Thamilarsu, Geetha
    As the Internet of Things (IoT) domain continues to evolve, IoT devices face escalating security challenges. Recent waves of IoT botnets have exploited device vulnerabilities to launch dangerous large-scale Distributed Denial of Service (DDoS) attacks from compromised, resource-constrained devices. These networks of infected devices pose a unique threat, placing modern infrastructure, homes, schools, medical facilities, and transportation systems at heightened risk of malicious exploitation. This paper proposes a novel hybrid framework that combines static and dynamic analysis techniques for IoT botnet malware detection without relying on complex Machine Learning (ML) models. By extracting key features from malware binaries and weighting them by their relevance to DDoS behavior, the framework maintains statistical adaptability to observed data while avoiding large memory usage and opaque black-box decision processes. Designed for interpretability and efficiency, this malware detection framework bridges code-level structure and runtime behavior, offering a transparent and practical botnet detection strategy for diverse resource-constrained IoT ecosystems.
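    To make the feature-weighting idea concrete, here is a minimal sketch of a weighted static-plus-dynamic score; the feature names, weights, and threshold are hypothetical, and the thesis derives its weights from each feature's relevance to DDoS behavior rather than fixing them by hand.
```python
# Illustrative sketch of a feature-weighted static + dynamic score
# (hypothetical feature names, weights, and threshold).
STATIC_WEIGHTS = {"suspicious_strings": 0.20, "packed_binary": 0.15,
                  "flooding_syscall_imports": 0.25}
DYNAMIC_WEIGHTS = {"outbound_conn_rate": 0.25, "repeated_dns_lookups": 0.15}

def botnet_score(static_feats, dynamic_feats):
    """Weighted sum of normalized (0..1) static and dynamic features."""
    score = sum(w * static_feats.get(name, 0.0) for name, w in STATIC_WEIGHTS.items())
    score += sum(w * dynamic_feats.get(name, 0.0) for name, w in DYNAMIC_WEIGHTS.items())
    return score

sample = botnet_score({"packed_binary": 1.0, "flooding_syscall_imports": 0.8},
                      {"outbound_conn_rate": 0.9})
print(sample, "-> flag" if sample > 0.5 else "-> pass")  # 0.575 -> flag
```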
  • Benchmarking TenSEAL’s Homomorphic Encryption Through Predicting Encrypted RNA Sequencing Data
    (2026-02-05) Choi, Logan; Kim, Wooyoung
    This study addresses the growing need to protect sensitive healthcare data as digital technologies and cloud-based analytics become integral to modern medical research and care delivery. Healthcare data, such as clinical or genomic information, holds immense potential to enhance disease understanding and improve diagnostics through machine learning models; however, adopting third-party cloud technologies increases the risks of data breaches and noncompliance with regulations such as the Health Insurance Portability and Accountability Act (HIPAA). To address these concerns, this research investigates homomorphic encryption, a cryptographic method that allows computations on encrypted data without exposing sensitive information. The study benchmarks the TenSEAL library to evaluate its performance in encrypting healthcare test datasets and executing predictions through a pre-trained machine learning model, while also evaluating memory utilization and encryption time. The findings show that TenSEAL’s CKKS encryption scheme effectively enables data encryption and secure machine learning inference on genomic datasets for breast, lung, and prostate cancers, achieving an average accuracy of 90% across all datasets. On the other hand, our results also highlight a key trade-off: as encryption strength and dataset size increase, computational overhead rises sharply. Thus, medical professionals and data scientists must carefully balance the need for security against the practical constraints of deployment in real-world healthcare systems.
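    For readers unfamiliar with TenSEAL, a minimal sketch of CKKS-encrypted inference looks roughly like the following; the encryption parameters and the toy linear model are illustrative stand-ins, not the configuration or classifier benchmarked in the study.
```python
# Hedged sketch of CKKS-encrypted inference with TenSEAL
# (illustrative parameters and a toy linear model).
import tenseal as ts

# Encryption context: CKKS with commonly used illustrative parameters.
ctx = ts.context(ts.SCHEME_TYPE.CKKS,
                 poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40
ctx.generate_galois_keys()

# A pre-trained model reduced to plaintext weights (hypothetical values
# standing in for a gene-expression classifier's linear layer).
weights = [0.8, -0.3, 0.5, 0.1]
bias = 0.2

# The client encrypts one sample; the server computes on ciphertext only.
sample = [1.2, 0.7, -0.4, 2.1]
enc_sample = ts.ckks_vector(ctx, sample)
enc_score = enc_sample.dot(weights) + bias   # homomorphic dot product

# Only the key holder can decrypt the prediction.
print(enc_score.decrypt())   # approximately [0.96]
```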
  • Understanding Aging at Multi-scale Using Explainable AI
    (2026-02-05) Qiu, Wei; Lee, Su-In
    As human lifespans increase, understanding the biological and clinical mechanisms that shape aging has become increasingly important. This dissertation presents a set of explainable AI (XAI) frameworks that illuminate aging at multiple scales, ranging from population-level health data to bulk transcriptomics and single-cell gene expression. I begin with IMPACT, an XAI framework for all-cause mortality prediction on the NHANES dataset. IMPACT improves prediction accuracy over traditional models and uses XAI methods to reveal previously underappreciated risk factors and clinically meaningful feature interactions. Building on this foundation, ENABL Age extends the IMPACT framework to model biological age. ENABL Age combines machine learning with XAI to estimate biological age and to quantify how specific lifestyle, clinical, and biochemical factors contribute to accelerated or slowed aging. This framework provides individualized insights into modifiable components of aging and supports the development of interpretable precision aging tools. At the molecular scale, DeepProfile learns biologically meaningful latent representations from 50,211 cancer transcriptomes across 18 tumor types. It identifies universal immune activation signals, cancer-type-specific subtype structure, and mechanistic links among mutation burden, cell-cycle activity, antigen presentation, and patient survival. By studying cancer across many organs, DeepProfile also offers insight into organ health and organ aging, illustrating how unsupervised learning can uncover clinically relevant biology from large transcriptomic datasets. Finally, ACE is an explainable deep generative model for single-cell RNA sequencing data that isolates aging-related gene expression changes from dominant background variation, enabling the study of cellular aging. Applied to mouse, fly, and human datasets, ACE recovers tissue- and cell-type-specific aging signatures, identifies conserved aging pathways across species, predicts biological age at cellular resolution, and prioritizes novel regulators such as Uba52, whose relevance is validated through lifespan-shortening RNAi experiments in C. elegans. Together, these contributions form an integrated XAI-driven framework for understanding aging across multiple scales and advance both mechanistic aging biology and transparent approaches for improving human healthspan.
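    The abstract does not name a specific attribution method; as a generic illustration of the kind of XAI feature attribution such frameworks build on, the sketch below uses the shap library on a synthetic model (data, model, and library choice are assumptions).
```python
# Generic sketch of feature attribution for a mortality/age model
# (synthetic data and model; the dissertation's actual pipelines differ).
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                       # e.g. age, BMI, SBP, glucose
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor().fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)              # per-sample, per-feature attributions
print(np.abs(shap_values).mean(axis=0))             # global importance ranking
```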
  • Building Flexible Data Center Network Stacks for the Terabit Era
    (2026-02-05) Shashidhara, Rajath; Peter, Simon
    Modern data center workloads demand end-host network stacks that sustain terabit-scale bandwidth alongside microsecond-scale latency, overwhelming traditional software TCP stacks with high CPU overheads. ASIC-based transport offloads deliver high performance and energy efficiency but sacrifice flexibility, hindering customization to diverse application and deployment needs. This thesis explores flexible stateful TCP offload using emerging programmable in-network accelerators. It tackles the core challenge of mapping TCP’s complex, stateful processing onto the restrictive programming models of resource-constrained hardware, enabling fine-grained data-path parallelization. We present FlexTOE and Laminar, two novel TCP stack offloads built on Network Processing Unit (NPU) and Reconfigurable Match-Action Table (RMT) architectures. Both eliminate all host TCP data-path CPU overheads, integrate transparently with existing applications, remain robust under realistic network dynamics, and crucially, retain software programmability. The design principles developed generalize beyond TCP and extend naturally to other accelerator architectures. Through extensive evaluation, we demonstrate that these practical designs achieve a meaningful balance of high performance, energy efficiency, and flexibility, surpassing state-of-the-art software stacks and offering a viable, adaptable alternative to rigid hardware transports.
  • Navigating the Ocean of Language Model Training Data
    (2026-02-05) Liu, Jiacheng; Choi, Yejin; Hajishirzi, Hannaneh
    One crucial step toward understanding large language models (LLMs) is to understand their training data. Modern LLMs are trained on text corpora with trillions of tokens, which makes these corpora difficult to analyze. In this thesis, I discuss my research on making these massive text corpora efficiently searchable and on revealing insights into the connection between LLMs and their training data. First, I developed infini-gram, a search engine system that enables fast string counting and document retrieval. With infini-gram, I indexed four open text corpora commonly used for LLM pretraining, totaling 5 trillion tokens. A by-product was the biggest n-gram language model ever built as of the date of publication, which I combined with neural LLMs to greatly improve their perplexity. Next, on top of infini-gram, I led the development of a system for tracing LLM generations into their multi-trillion-token training data in real time, named OLMoTrace. OLMoTrace shows long verbatim matches between LLM outputs and the full training data, enabling us to do fact-checking, trace "creative expressions", understand LLMs' math capabilities, and much more. Finally, to enable searching in even bigger, Internet-scale corpora with a limited budget, more storage-efficient indexing techniques are needed. To that end, we developed infini-gram mini, a search system with 12x less storage requirement than the original infini-gram, conceptually allowing us to index the entirety of Common Crawl (the main source of training data for LLMs). We indexed 83TB of text, including the Common Crawl snapshots between January and July 2025, making it the largest body of searchable text in the open-source community. With infini-gram mini, we revealed that many crucial LLM evaluation benchmarks are heavily contaminated, and we are hosting a public bulletin to continuously monitor this dire evaluation crisis. Together, my research enables everyone to inspect and understand LLM training data at scale, and paves the way toward comprehending and debugging LLM behaviors from a data perspective.
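    Counting queries of this kind are, at heart, binary searches over sorted suffixes; the toy sketch below shows that idea on an in-memory token list (the real infini-gram index is disk-backed and built very differently at trillion-token scale).
```python
# Toy n-gram counting with a suffix array and binary search
# (in-memory and O(n^2 log n) to build; illustrative only).
def build_suffix_array(tokens):
    """All suffix start positions, sorted by the token sequence they begin."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_ngram(tokens, sa, query):
    """Count occurrences of `query` via two binary searches over the suffix array."""
    q = len(query)
    def first(greater):              # first suffix whose q-prefix is >= (or >) query
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = tokens[sa[mid]: sa[mid] + q]
            if prefix < query or (greater and prefix == query):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return first(True) - first(False)

corpus = "the cat sat on the cat mat".split()
sa = build_suffix_array(corpus)
print(count_ngram(corpus, sa, ["the", "cat"]))   # -> 2
```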
  • Delivering Predictable Tail Latency in Data Center Networks
    (2026-02-05) Zhao, Kevin; Anderson, Thomas E
    Modern web services decompose a user request into thousands of RPCs whose slowest 1% dominate end-to-end latency, costing revenue and straining user patience. Operators codify expectations as tail latency SLOs, but meeting them is difficult even in well-run data center networks. Although such networks expose configuration parameters that have a large impact on tail latency, like switch weights, congestion windows, and switch marking thresholds, operators typically set these parameters once and rarely revisit them. When workload characteristics shift, for example in burstiness, traffic mix, or demand patterns, the resulting mismatch between the workload and the network can degrade user-observed performance and cause SLO violations, even in networks that deploy congestion control, traffic engineering, and class-based scheduling. A natural response is to adapt network parameters when workloads change, but existing methods adjust parameters by trial and error, risking intermediate violations and slow convergence in high-dimensional, noisy settings. This dissertation argues that prediction-guided control is an effective technique for delivering predictable tail latency in data center networks. It makes two contributions. First, Parsimon is a scalable tail-latency estimator. Through a series of approximations, Parsimon decouples links and simulates them in parallel, allowing it to run orders of magnitude faster than full-fidelity simulators while retaining distribution-level accuracy. Second, Polyphony embeds such estimators in a closed loop control system to improve network performance. It treats predictions as priors, fuses them with live measurements, and searches safely inside a trust region that resets as conditions drift. In a small testbed on real machines, Polyphony meets tail latency SLOs within minutes, whereas a state-of-the-art model-free tuner fails to converge after an hour. Together, fast prediction and prediction-guided control form a promising toolkit for steering large networks toward better performance for latency-sensitive applications, reducing the cost of provisioning and the risk of unsafe exploration.
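    A conceptual sketch of the prediction-guided, trust-region loop is given below; the proposal rule, the expand/shrink constants, and the stand-in `simulate_p99` (a fast estimator in the spirit of Parsimon) and `measure_p99` (live telemetry) are all placeholders rather than Polyphony's actual design.
```python
# Conceptual sketch only: rank candidate configs with a fast predictor,
# confirm with live measurement, and keep proposals inside a trust region
# that expands on success and shrinks when prediction and measurement disagree.
import random

def tune(params, simulate_p99, measure_p99, slo_ms, steps=20, radius=0.2):
    best = dict(params)
    for _ in range(steps):
        candidates = [{k: v * (1 + random.uniform(-radius, radius))
                       for k, v in best.items()} for _ in range(8)]
        candidates.sort(key=simulate_p99)          # prediction as a prior
        measured = measure_p99(candidates[0])      # live check before committing
        if measured < measure_p99(best):
            best, radius = candidates[0], min(radius * 1.5, 0.5)
        else:
            radius = max(radius * 0.5, 0.02)
        if measured <= slo_ms:
            break
    return best
```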
  • Facilitating FPGA Prototyping with Hardware OS Primitives
    (2026-02-05) Lim, Katherine; Anderson, Thomas; Kasikci, Baris
    Both data center operators and the research community have embraced hardware accelerators, because of their potential for significant improvements in performance and energy efficiency. There have now been several large-scale deployments of accelerators in datacenters from companies such as Google, Facebook, and Microsoft. FPGAs have become a compelling acceleration platform, because their reconfigurability allows them to be repurposed as the application mix changes. Both Microsoft and Amazon have deployed FPGAs throughout their datacenters, both to rent to consumers and to accelerate their own services. Microsoft in particular attaches the FPGAs it uses to accelerate its own workloads directly to the network. Directly attaching the FPGA to the network further reduces latency, improves cost-performance, and reduces energy use relative to mediating network communications with CPUs. However, building accelerated applications or services for direct-attached FPGAs is challenging, especially with the complex I/O and multi-accelerator capacity of modern FPGAs. This thesis argues that direct-attached accelerator systems can be built in a modular manner that preserves the benefits of a direct-attached accelerator while also reducing the engineering burden. We first describe a design and prototype for Apiary, a microkernel operating system for direct-attached FPGA accelerators based on message passing over a network-on-chip (NoC) architecture. The key idea in Apiary is to raise the level of abstraction for accelerated application code, with isolation, threaded execution, and interprocess communication provided by a portable hardware OS layer in order to ease development difficulties. We propose specific hardware OS primitives to provide these services and abstractions. We then conduct an end-to-end case study of Apiary by prototyping a selection of these primitives to evaluate how well they serve Apiary’s design goals. We then describe Beehive, a hardware network stack we designed and prototyped for Apiary based around message passing over a NoC. We show that our architecture is better able to support the complexity of a software datacenter network stack by providing replication of elements and applications and standard TCP and UDP interoperation. At the same time, direct-attached accelerators using Beehive can achieve a 4x improvement in end-to-end RPC tail latency for Linux UDP clients versus a CPU-attached accelerator.
  • Generative Keyframing
    (2026-02-05) Wang, Xiaojuan; Seitz, Steven M.; Curless, Brian
    Keyframing is a fundamental element of animation creation and video editing. It involves defining specific frames, i.e., keyframes, that mark important moments of change and guide how the intermediate frames are filled or interpolated. In early hand-drawn animation, a keyframe is a visual drawing created by animators, with assistants manually drawing the in-between frames. With the advent of digital animation and video editing software, a keyframe became a set of parameters that define the state of the rendered character or object at specific times, with in-between transitions produced by interpolating these parameters. However, such parametric approaches rely heavily on manually designed controls and artist-crafted heuristics, making it difficult for them to capture complex, nuanced, and realistic motions. Furthermore, they do not naturally generalize to real image and video domains. The rapid progress of visual generative models, which are trained on large collections of visual data and capable of learning rich appearance and motion patterns, has made it possible to generate high-fidelity imagery and realistic motion. Building on these advances, this thesis investigates generative keyframing, a data-driven, non-parametric, image-based approach to the keyframing process. To this end, I present a series of works in this thesis that collectively develop and explore this idea. I begin with the basic aspect: using generative models to synthesize transitions directly from images, and even to fully generate in-between motions. I first present a GAN-based technique for smoothing jump cuts in talking-head videos, synthesizing seamless transitions between the cuts even in challenging cases involving large head movement. I then introduce a method for generating in-between videos with dynamic motion between more distant keyframes by adapting a pretrained large-scale image-to-video diffusion model with minimal fine-tuning effort. Beyond automatically generating transitions between keyframes, I further explore multi-scale keyframing for achieving very deep zoom. Specifically, I introduce a multi-scale joint sampling diffusion approach for generating consistent images (keyframes) across different spatial scales while adhering to their respective input text prompts. This enables deep semantic zoom, and a continuous zoom video can be rendered from these images. When working with multiple keyframes, one important question is how they should be ordered in the final video. I address this in the context of dance video generation---specifically, music-synchronized and choreography-aware animal dance videos---where unordered keyframes representing distinct animal poses are arranged via graph optimization to satisfy a specified choreography pattern of beats that defines the long-range structure of a dance. Finally, I conclude with discussions and directions for future work.
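    For contrast with the generative approach, the parametric keyframing baseline the abstract describes amounts to interpolating keyframe parameters; a minimal sketch (with invented parameters and values) is shown below.
```python
# Minimal sketch of classic parametric keyframing: linearly interpolate
# keyframe parameters to produce in-between states (illustrative values).
def interpolate(keyframes, t):
    """keyframes: sorted list of (time, {param: value}); returns params at time t."""
    for (t0, p0), (t1, p1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            a = (t - t0) / (t1 - t0)
            return {k: (1 - a) * p0[k] + a * p1[k] for k in p0}
    return dict(keyframes[-1][1])

keys = [(0.0, {"x": 0.0, "angle": 0.0}), (1.0, {"x": 10.0, "angle": 90.0})]
print(interpolate(keys, 0.25))   # {'x': 2.5, 'angle': 22.5}
```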