Inferring the 3D information from the Outside World Using Monocular Cameras

Zhang, Haotian

Files

Zhang_washington_0250E_25024.pdf (27.28 MB)

Date

2023-01-21

Abstract

Technological advances have made autonomous driving increasingly feasible in common driving scenarios. Many large companies, such as Waymo, Tesla, GM, and Uber, have tested their self-driving vehicles with success in limited capacities. These vehicles employ a combination of cameras, radar, sonar, and LiDAR sensors. Yet the high cost of LiDAR, as well as the unreliability of sonar and radar, makes them unsuitable for quick large-scale deployment. In contrast, camera-based autonomous driving has the potential to be a cheap and reliable alternative through steadily advancing computer vision and deep learning techniques. A general autonomous driving system incorporates three correlated technologies: 3D-based object detection, tracking, and localization. While all three components are important, most relevant papers focus on only a single component. In this work, we first propose a multi-stage monocular vision-based framework for 3D-based detection, tracking, and localization that effectively integrates all three tasks in a complementary manner. Our system contains an RCNN-based Localization Network (LOCNet), which works in concert with fitness evaluation score (FES) based single-frame optimization to obtain more accurate and refined 3D vehicle localization. To better utilize temporal information, we further use a multi-frame optimization technique, taking advantage of camera ego-motion and a 3D TrackletNet Tracker (3D TNT), to improve both accuracy and consistency in our 3D localization. Moreover, we propose a joint framework (JMV3D) that effectively associates moving objects over time and estimates their 3D localization information as well as segmentation masks from a sequence of 2D images, so as to compensate for the individual drawbacks of each component. We further extend the existing Localization Network (LOCNet) into the Localization for Tracking Network (Loc4Trk-Net).
A Spatial Attention (SA) Neck is added to highlight the foreground (target of interest) and suppress the background with the help of mask segmentation, so that more concentrated appearance features can be obtained. In addition, an embedding head is introduced to train discriminative feature embeddings through deep pairwise contrastive learning, so that objects in various poses and viewpoints can be identified from appearance cues. A straightforward combination of a 3D Kalman filter and the Hungarian algorithm is then used for robust instance association via both feature similarity and 3D localization information. Overall, both systems outperform state-of-the-art image-based solutions in diverse scenarios and are even comparable with LiDAR-based methods. The proposed JMV3D pipeline also ranks first on the KITTI-MOTS and KITTI-STEP leaderboards and achieves impressive results among all image-based solutions on the nuScenes 3D tracking benchmark. Furthermore, monocular 3D object detection requires decoding 3D predictions solely from a single 2D image. However, by formulating this problem as a region-level understanding task, previous approaches neglect the image-level understanding of depth and semantics. To address this, we present monocular 3D object detection via coarse-to-fine training (Mono3DCFT), a new transformer-based architecture with an effective two-stage training strategy that can seamlessly handle both levels of tasks: (i) coarse-grained training on the whole image based on monocular depth data, followed by (ii) fine-grained training on specific regions based on 3D bounding box annotations. Instead of having dedicated transformer layers for fusion after the uni-modal backbone, Mono3DCFT pushes multi-modal cross-attention fusion into both the vision and depth backbones and, coupled with two-stage training, achieves significant gains on the KITTI benchmark.
Trained solely on the limited publicly available KITTI depth data, Mono3DCFT performs comparably to the previous state of the art, which is pre-trained on 15M additional proprietary depth data with a more compute-intensive architecture. Extensive ablation studies demonstrate the effectiveness of our approach and its potential to serve as a transformer baseline for future monocular 3D object detection.
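
The instance association step described in the abstract (Hungarian algorithm over feature similarity and 3D localization information) can be illustrated with a minimal sketch. This is not the thesis implementation. The function names (`hungarian`, `associate`), the cost weighting `w_app`, the distance scale `max_dist`, and the threshold `gate` are all hypothetical choices made for illustration, and the track positions are assumed to be the output of a Kalman filter prediction step that is not shown here.

```python
import math


def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list where result[i] is the column assigned to row i.
    Standard O(n^2 * m) potentials formulation of the Hungarian algorithm.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials (1-based)
    v = [0.0] * (m + 1)      # column potentials (1-based)
    p = [0] * (m + 1)        # p[j] = row matched to column j, 0 = unmatched
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    result = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            result[p[j] - 1] = j - 1
    return result


def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def associate(tracks, detections, w_app=0.5, max_dist=3.0, gate=0.7):
    """Match tracks to detections; each item is (embedding, (x, y, z)).

    Cost mixes appearance distance and scaled 3D distance (assumed weighting).
    Returns (track_index, detection_index) pairs whose cost is within `gate`.
    """
    if not tracks or not detections:
        return []
    size = max(len(tracks), len(detections))
    pad = 1e6                                  # cost of dummy rows/columns
    cost = [[pad] * size for _ in range(size)]
    for i, (t_emb, t_pos) in enumerate(tracks):
        for j, (d_emb, d_pos) in enumerate(detections):
            d3 = math.dist(t_pos, d_pos) / max_dist
            cost[i][j] = w_app * cosine_distance(t_emb, d_emb) + (1 - w_app) * d3
    assign = hungarian(cost)
    return [(i, assign[i]) for i in range(len(tracks))
            if assign[i] < len(detections) and cost[i][assign[i]] <= gate]


# Toy example: two predicted tracks, three detections (one unrelated).
tracks = [([1.0, 0.0], (0.0, 0.0, 10.0)), ([0.0, 1.0], (5.0, 0.0, 20.0))]
detections = [([0.0, 1.0], (5.2, 0.0, 20.1)),
              ([1.0, 0.0], (0.1, 0.0, 10.3)),
              ([1.0, 1.0], (30.0, 0.0, 50.0))]
matches = associate(tracks, detections)
print(matches)
```

In this toy example the third detection is far from both tracks, so it is left unmatched and would typically start a new track. Padding the cost matrix to a square and gating afterwards is one common way to handle unequal numbers of tracks and detections; other schemes are equally valid.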