Robust and Efficient 6D Object Pose Estimation Using Color Information

Abstract

6D object pose estimation, a fundamental computer vision task, has seen remarkable progress driven by advances in deep learning and growing demand across applications. The challenge lies in determining an object's 3D orientation and position from 2D images while handling complications such as occlusions, cluttered environments, and varying lighting conditions. Traditional methods based on geometric features and templates struggle with texture-less objects and complex scenes, whereas deep learning approaches have significantly improved both accuracy and efficiency. These approaches include direct pose regression, keypoint detection, and hybrid methods, with particular attention to the synthetic-to-real domain gap. Current developments focus on end-to-end trainable frameworks capable of efficient multi-object handling and real-time operation from RGB inputs. The field continues to evolve through innovations in data augmentation, network architectures, self-supervised learning from synthetic data, and weakly supervised approaches that use only 2D annotations. Particular emphasis is placed on mobile computing platforms and real-time performance, as many applications require processing at the edge with limited network resources.

First, we propose the Sparse Color-Code Net (SCCN), a three-stage pipeline designed for real-time 6D object pose estimation that handles both single and multiple objects efficiently. At its core, SCCN employs a color-code representation that enables neural networks to learn and memorize correspondences between object points and their associated colors. In the first stage, Sobel filters extract sparse contours that capture essential surface details; these contours, together with the input image, are processed by a UNet for object segmentation and bounding box detection.
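The Sobel-based sparse contour extraction of the first stage can be illustrated with a minimal NumPy sketch. This is a generic stand-in, not the thesis implementation; the relative threshold `thresh` is a hypothetical parameter.

```python
import numpy as np

def sobel_contours(gray, thresh=0.2):
    """Extract a sparse contour mask via Sobel gradient magnitude.

    gray: 2-D float array; thresh is a hypothetical relative cutoff
    (fraction of the maximum gradient magnitude).
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    # edge-pad so the output matches the input size
    p = np.pad(gray, 1, mode="edge")
    gx = np.zeros_like(gray, dtype=float)
    gy = np.zeros_like(gray, dtype=float)
    for i in range(3):
        for j in range(3):
            win = p[i:i + gray.shape[0], j:j + gray.shape[1]]
            gx += kx[i, j] * win
            gy += ky[i, j] * win
    mag = np.hypot(gx, gy)  # gradient magnitude
    peak = mag.max()
    return mag > (thresh * peak if peak > 0 else thresh)
```

Applied to an image with a vertical intensity step, the mask is true only along the step, giving the sparse contour pixels that guide the later color-code sampling.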
In the second stage, another UNet performs pixel-level color-code regression on cropped object patches, establishing the crucial 2D-3D correspondences, while a novel symmetry mask resolves the ambiguities of symmetric objects. In the final stage, color-code pixels are selected based on the contours and transformed into a 3D point cloud, and the PnP algorithm produces the final 6D pose estimate. SCCN is notably efficient, achieving 19 FPS on the LINEMOD dataset and 6 FPS on the LINEMOD Occlusion dataset on an NVIDIA Jetson AGX Xavier platform. This performance, combined with high estimation accuracy and effective handling of symmetric objects, positions SCCN as a significant advance in real-time 6D pose estimation.

Second, we present Color-Pair Guided Robust Zero-Shot 6D Pose Estimation and Tracking of Cluttered Objects on Edge Devices. Our approach eliminates the need for training on labeled object pose data, requiring only the object's mesh file and texture as reference. The pipeline begins with object detection by a YOLO network, followed by the Segment Anything Model (SAM) to obtain precise instance masks. Central to the method is a novel, lighting-invariant color-pair feature descriptor extracted from texture contours. Instead of learning a classifier, we employ a robust similarity metric based on triangular geometry in CIELAB space to establish reliable correspondences between the scene and the rendered model. The final pose is estimated by a voting scheme over semantic triangles retrieved from a pre-built hash database, followed by a composite ICP optimization that jointly refines the alignment of individual semantic parts and the global structure. Finally, we integrate a lightweight tracking module that leverages optical flow filtered by our color-pair metric.
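The final PnP step, which recovers the 6D pose from the 2D-3D correspondences, can be sketched with a basic direct linear transform (DLT) solver. This is a generic stand-in assuming noise-free correspondences in normalized image coordinates (intrinsics already removed), not the thesis implementation, which uses a standard PnP solver on the color-code point cloud.

```python
import numpy as np

def pnp_dlt(pts3d, pts2d):
    """Recover a pose [R | t] from >= 6 2D-3D correspondences via DLT.

    pts3d: (n, 3) model points; pts2d: (n, 2) normalized image points.
    Returns the rotation matrix R and translation vector t.
    """
    n = len(pts3d)
    A = np.zeros((2 * n, 12))
    for i, (X, x) in enumerate(zip(pts3d, pts2d)):
        Xh = np.append(X, 1.0)          # homogeneous 3D point
        u, v = x
        A[2 * i, 0:4] = Xh              # row for the u equation
        A[2 * i, 8:12] = -u * Xh
        A[2 * i + 1, 4:8] = Xh          # row for the v equation
        A[2 * i + 1, 8:12] = -v * Xh
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)            # null-space solution, up to scale
    # fix scale and sign so the left 3x3 block has determinant +1
    P /= np.cbrt(np.linalg.det(P[:, :3]))
    U, _, Vt2 = np.linalg.svd(P[:, :3])
    R = U @ Vt2                         # nearest rotation (Procrustes)
    t = P[:, 3]
    return R, t
```

With exact correspondences the null space of `A` is one-dimensional and the true pose is recovered exactly; in practice a RANSAC loop and nonlinear refinement would sit on top of this.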
By training on procedurally generated random shapes with viewpoint-invariant features, the tracker captures general geometric motion, enabling efficient, continuous 6D pose tracking of arbitrary objects without object-specific training.

Looking forward, we identify two potential research trajectories in 6D object pose estimation: (1) zero-shot 6D object pose tracking, which extends the pose estimation problem into the temporal domain while remaining independent of target-specific training data and mesh references, and (2) mesh-free few-shot 6D pose estimation, which circumvents the requirement for explicit 3D mesh models. Our future research will pursue one of these directions to further advance our work in 6D pose estimation.
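The optical-flow component of the tracking module can be illustrated with a single-point Lucas-Kanade estimator. This is a generic sketch under assumed details (a square window of hypothetical size `win`, plain least-squares Lucas-Kanade), not the thesis tracker, which additionally filters flow vectors with the color-pair metric.

```python
import numpy as np

def lucas_kanade(im0, im1, pt, win=7):
    """Estimate the flow (u, v) of one point between two float images.

    pt = (x, y) pixel location; win is a hypothetical window size.
    Solves the Lucas-Kanade system Ix*u + Iy*v = -It by least squares.
    """
    x, y = pt
    h = win // 2
    # spatial gradients (central differences) and temporal difference
    Iy, Ix = np.gradient(im0)
    It = im1 - im0
    sl = (slice(y - h, y + h + 1), slice(x - h, x + h + 1))
    ix, iy, it = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
    A = np.stack([ix, iy], axis=1)
    flow, *_ = np.linalg.lstsq(A, -it, rcond=None)
    return flow  # (u, v) displacement in pixels
```

For a smooth image patch shifted by one pixel, the estimator recovers a displacement close to (1, 0); chaining such estimates frame to frame yields the continuous 2D motion that the full pipeline lifts to 6D pose updates.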

Description

Thesis (Ph.D.)--University of Washington, 2025
