Generalizable Object Tracking in Complex Real-World Scenes with Contextual Cues and Memory

Author: Yang, Cheng-Yen
Advisor: Hwang, Jenq-Neng
Institution: My University
Year: 2026
Date: 2026-04-20
Keywords: Deep Learning; Object Tracking
Subjects: Artificial intelligence; Computer science; Electrical and computer engineering
Language: en-US
Type: Thesis
File: Yang_washington_0250E_29235.pdf
Format: application/pdf
URI: https://hdl.handle.net/1773/55498
License: CC BY-NC
Description: Thesis (Ph.D.)--University of Washington, 2026

Abstract:

Multiple-Object Tracking (MOT) serves as a cornerstone of computer vision, yet achieving robust data association remains a significant challenge in dynamic environments characterized by frequent occlusions. This dissertation investigates the strategic integration of spatial contextual cues and hierarchical memory to enhance tracking stability. To bridge the gap between camera and image space, we first analyze three distinct spatial perspectives: an extrinsic approach leveraging drone metadata for maritime scenarios, an intrinsic method utilizing self-calibration for multi-camera consistency, and a depth-aware modality to prioritize non-occluded objects in dense crowds. Building upon these spatial foundations, we leverage vision foundation models to introduce SAMURAI, a motion-aware zero-shot tracker, and SAMURAI++, a unified framework that reconciles tracking-by-detection and tracking-by-query paradigms. By maintaining dual-horizon memory (short-term and long-term) for each tracklet, this work achieves superior identity preservation and cross-domain generalizability without the need for task-specific fine-tuning. Collectively, these contributions demonstrate that the synergy of temporal memory and spatial context provides a robust trajectory toward generalizable object tracking in complex, real-world scenes.
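
The abstract does not say how the dual-horizon memory is built, so the following is only a minimal sketch of the general idea, not the SAMURAI or SAMURAI++ implementation. It assumes each tracklet keeps a short buffer of its most recent appearance features and a sparser buffer of older snapshots, and scores a candidate detection by a weighted mean of cosine similarities. The class name `TrackletMemory`, the helper `mean_cosine`, and the parameters `short_len`, `long_stride`, `long_len`, and `w_short` are all hypothetical.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mean_cosine(bank, feature):
    """Mean cosine similarity between a feature and every entry in a bank."""
    if not bank:
        return 0.0
    return sum(cosine(f, feature) for f in bank) / len(bank)


class TrackletMemory:
    """Hypothetical per-tracklet memory with a short and a long horizon."""

    def __init__(self, short_len=5, long_stride=10, long_len=8):
        self.short = deque(maxlen=short_len)  # most recent features
        self.long = deque(maxlen=long_len)    # sparse, older snapshots
        self.long_stride = long_stride
        self.updates = 0

    def update(self, feature):
        """Store a new feature; every long_stride-th one also goes to long-term."""
        self.short.append(feature)
        if self.updates % self.long_stride == 0:
            self.long.append(feature)
        self.updates += 1

    def similarity(self, feature, w_short=0.5):
        """Blend short-term and long-term similarity (assumed weighting)."""
        s = mean_cosine(self.short, feature)
        l = mean_cosine(self.long, feature)
        return w_short * s + (1.0 - w_short) * l
```

In a tracker of this kind, `similarity` would feed the association step: the short-term bank follows gradual appearance change, while the long-term bank keeps older views available so an identity can be recovered after an occlusion.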