Blog Article for Analysis of Center-based 3D Object Detection and Tracking

This blog article analyzes the paper Center-based 3D Object Detection and Tracking [1]. It was written by Halil Eralp Kocas for the seminar Computer Vision and Deep Learning for Autonomous Driving in the Summer Semester 2021. The paper was published at CVPR 2021. The article first introduces the task and the related work needed to understand the main method. It then explains the proposed model, CenterPoint, discusses the experiments, and closes with a conclusion and discussion of the paper.

Introduction

Object detection is one of the fundamental tasks of both 2D and 3D Computer Vision. In basic terms, object detection is the task of localizing and identifying objects present in given data. It has a wide range of applications such as pedestrian detection, video surveillance, and autonomous driving. Although object detection models are studied widely on images or image sequences, point clouds are another data type for an object detection system. The following images [2][3] are examples of object detection in the 2D image domain and 3D point cloud domain:

In addition to object detection, object tracking also works in both 2D and 3D. It is the task of forming temporal associations between the same objects across the input data sequences. In recent algorithms, the tracking-by-detection approach is applied widely. In tracking-by-detection, an object detection model is utilized to detect objects in given data. Then, a tracker creates a unique ID for each newly seen object. Also, the tracker tries to associate all detections in the new frame with previously created tracks before assigning any new ID. The following figure [4] is an example of tracking:

The inputs are RGB images. From the upper image to the bottom, time is moving forward. The black boxes over people indicate that these people are detected. Then, the output is colorized according to the tracking results.

This blog focuses on the two main tasks of the presented paper, CenterPoint: 3D object detection and 3D object tracking. CenterPoint operates on point clouds. For tracking, it follows the tracking-by-detection approach with a minor modification to its model. Next, we review related work to understand the motivation behind CenterPoint. We discuss the problems of object detection on point clouds as well as the challenges of box-based object detection.

Related Work

Box-Based Object Detection

Before moving on to center-based approaches, it is useful to look at box-based detectors so that we can better understand why center-based approaches are needed. 2D object detection models predict axis-aligned bounding boxes from image inputs. Two-stage 2D object detectors, mainly the R-CNN family such as Faster R-CNN [2], first find category-agnostic bounding box candidates, then classify and refine them. One-stage object detectors such as YOLO directly find category-specific box candidates. In 2D, bounding boxes aligned with the image reference frame represent objects sufficiently well. In 3D this is harder, because objects cannot simply be enclosed by boxes aligned with a fixed reference frame. Instead, we try to obtain 3D bounding boxes around objects, and for that we need the third dimension that is lost in images, namely the length of objects along the depth axis.

3D Object Detection models predict three-dimensional rotated bounding boxes. They mainly differ from 2D models in the input encoder. Many 3D models are extensions of 2D detection methods to 3D. Recent works in two-stage 3D object detection directly apply RCNN style 2D detectors. The detection pipeline is similar to the 2D object detection in which features are obtained from images or point clouds utilizing backbone networks.

Box-based object detectors face several challenges in 3D object detection on point clouds. Point clouds are sparse, and many 3D regions contain no measurements at all. Objects in 3D come in a wide range of sizes, shapes, and aspect ratios, and have no particular orientation, so fitting an axis-aligned bounding box to a rotated object is difficult. Finally, box-based detectors are slow because of the matching between anchors and the samples generated by the detection network. Given these challenges, the authors claim that a center-based object representation is a better fit for 3D object detection.

Center-Based Object Detection

Center-based approaches represent objects as points at their bounding box centers instead of as boxes. The main idea is to detect object centers with a keypoint detector on top of features extracted by a backbone, and to regress the other attributes from the feature at each detected keypoint. In contrast to box-based approaches, center-based models are more efficient and can be trained end-to-end since the need for post-processing is eliminated. The following example helps to understand how center-based approaches detect objects in given data:

Earlier center-based approaches have been shown to be effective on images for both detection and tracking. CenterNet [5] uses an encoder-decoder backbone to extract a keypoint heatmap from the input. A keypoint estimator then detects the center of each object and regresses other geometric properties such as size, location, and orientation. The predicted centers are the local maxima of the heatmap, i.e., points whose value is at least as high as that of their 8 neighbors. CenterTrack [6] is a tracking-by-detection method that uses CenterNet as its detection module and takes the previous and current frames as input. As shown in the following figure, CenterTrack predicts an offset for each object. This offset is a displacement vector pointing from the object's center in the current frame to its center in the previous frame. Tracking is done by greedy closest-distance matching: each center in the current frame is shifted by its predicted offset and associated with the closest detected center in the previous frame.
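
To make the peak extraction concrete, here is a minimal sketch in plain Python. It assumes a single-class heatmap stored as a nested list; the function name `find_peaks` and the score threshold are illustrative choices, not taken from CenterNet's code.

```python
def find_peaks(heatmap, threshold=0.1):
    """Return (row, col, score) for every cell that is a local maximum.

    A cell counts as a peak if its score reaches `threshold` and none of its
    8 neighbors has a strictly higher score (cells outside the map are ignored).
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            neighbors = [
                heatmap[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < rows
                and 0 <= c + dc < cols
            ]
            if all(score >= n for n in neighbors):
                peaks.append((r, c, score))
    return peaks


# Toy single-class heatmap with two clear object centers.
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.6],
    [0.0, 0.0, 0.3, 0.2],
]
print(find_peaks(heatmap, threshold=0.3))  # [(1, 1, 0.9), (2, 3, 0.6)]
```

The cell with score 0.3 passes the threshold but is suppressed because its neighbor with score 0.6 is higher, which is why this kind of peak picking can replace box-level non-maximum suppression.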

We first look at PointNet [7] for handling point clouds. PointNet learns pointwise features directly from point clouds and uses max-pooling as a symmetric function to aggregate information from all points, so the result does not depend on the order of the points. It is a simple neural network consisting of multilayer perceptrons and max-pooling. The right-hand figure below is a visualization of PointNet taken from the VoxelNet paper [8], which we discuss next. The drawback of PointNet is scale: a typical LiDAR point cloud contains around 100k points, and training at this scale leads to high computational and memory requirements. VoxelNet [8] was proposed to solve this scaling issue. It is an end-to-end, unified 3D detection method. It divides the input into equally spaced 3D voxels and transforms the points in each voxel through stacked voxel feature encoding layers, which apply a PointNet to each voxel. The middle layers then use 3D convolutions to consolidate the vertical axis, and a region proposal network on top of them detects objects. The architecture of VoxelNet is the left-hand figure below:
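
The two ideas in this paragraph, grouping points into voxels and pooling each group with a symmetric function, can be sketched in a few lines of plain Python. This is only an illustration: `point_feature` is a hypothetical hand-written stand-in for the learned per-point MLP, and `encode_voxels` is an assumed name, not part of PointNet or VoxelNet.

```python
from collections import defaultdict


def point_feature(point):
    """Hypothetical stand-in for the shared per-point MLP of PointNet."""
    x, y, z = point
    return (x, y, z, x * x + y * y + z * z)


def max_pool(features):
    """Channel-wise maximum over a set of feature vectors (order-independent)."""
    return tuple(max(channel) for channel in zip(*features))


def encode_voxels(points, voxel_size):
    """Group points into equally spaced voxels and pool one feature per voxel."""
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels[key].append(point_feature((x, y, z)))
    return {key: max_pool(feats) for key, feats in voxels.items()}


points = [(0.2, 0.1, 0.0), (0.8, 0.4, 0.3), (1.5, 0.2, 0.1)]
encoded = encode_voxels(points, voxel_size=1.0)

# Max-pooling is symmetric, so shuffling the input does not change the result.
assert encoded == encode_voxels(list(reversed(points)), voxel_size=1.0)
```

Only non-empty voxels appear in the result, which reflects why voxel-based encoders cope well with sparse point clouds.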

The drawback of VoxelNet is that the 3D convolutions in its middle layers are a computational bottleneck. PointPillars [9] was proposed to remove this bottleneck. It is an end-to-end learning pipeline for 3D object detection that uses only 2D convolutional layers. It encodes the input point cloud into a format suitable for the detection pipeline, and these encodings can be used with any 2D convolutional detection model. PointPillars discretizes the point cloud into an evenly spaced grid in the x-y plane, which yields a set of vertical columns (pillars). A simplified PointNet is applied to each pillar, and the resulting features are scattered back to the original pillar locations to create a pseudo-image. A 2D CNN backbone and a detection head then detect objects on this pseudo-image. The architecture of PointPillars can be seen in the following figure:
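
The pillar discretization and the scatter step can be illustrated with a small sketch. It is a simplification under stated assumptions: instead of a learned simplified PointNet, each pillar is described by two hand-crafted channels (point count and maximum height), and the function name and grid parameters are hypothetical.

```python
def build_pseudo_image(points, x_min, y_min, pillar_size, width, height):
    """Scatter per-pillar features into a (channels, height, width) pseudo-image.

    Channel 0 holds the number of points per pillar, channel 1 the maximum z.
    These two channels are a stand-in for learned pillar features.
    """
    pillars = {}
    for x, y, z in points:
        col = int((x - x_min) // pillar_size)
        row = int((y - y_min) // pillar_size)
        if 0 <= col < width and 0 <= row < height:  # drop points outside the grid
            pillars.setdefault((row, col), []).append(z)

    image = [[[0.0] * width for _ in range(height)] for _ in range(2)]
    for (row, col), heights in pillars.items():
        image[0][row][col] = float(len(heights))
        image[1][row][col] = max(heights)
    return image


points = [(0.5, 0.5, 1.0), (0.7, 0.2, 2.0), (2.5, 1.5, 0.5), (9.0, 9.0, 1.0)]
image = build_pseudo_image(points, x_min=0.0, y_min=0.0,
                           pillar_size=1.0, width=3, height=2)
print(image[0])  # [[2.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(image[1])  # [[2.0, 0.0, 0.0], [0.0, 0.0, 0.5]]
```

Because the result is a dense 2D grid, it can be passed to an ordinary 2D convolutional backbone, which is the point of the pillar representation.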

The proposed method, CenterPoint, builds on the center-based approaches and point cloud encoders described above. We discuss it next.

CenterPoint

The authors of CenterPoint claim that the main challenge in transferring detection methods from 2D to 3D is the representation of objects. They argue that representing objects as points is much simpler than representing them as bounding boxes, as discussed earlier. According to them, this representation has several advantages:

  1. Since points have no intrinsic orientation, it reduces the search space for an object while the detector is still robust.
  2. It simplifies the process for tracking since forming temporal relations between points is easier based on the predicted displacement between frames.
  3. Lastly, the center-based approach is much faster than the bounding box approaches since there is no matching between prior anchors and predicted locations.

CenterPoint is a two-stage 3D object detection model that operates on point clouds. In summary, it uses a keypoint detector to find object centers and regresses their properties from features extracted by a 3D backbone network. A second stage then refines the estimated objects. The method consists of a 3D backbone feature encoder followed by two detection stages, as shown in the following figure:

A 3D backbone network based on VoxelNet or PointPillars extracts features for detection. The authors use two different backbones to show that their approach is compatible with any 3D point cloud encoder. The backbone builds a representation of the input point cloud and converts it into a map-view feature map for the center head. Specifically, the Pillar Feature Net of PointPillars and the Feature Learning Network of VoxelNet are used as backbone modules.

The first stage consists of two steps. In the first step, a CenterNet-style center heatmap head produces a heatmap with peaks at the center locations of detected objects. Heatmap peaks, in other words local maxima, are pixels whose value is at least as high as that of their 8 neighboring pixels. The heatmap has one channel per class. In the second step, regression heads predict several object properties at each center detected in the first step: a sub-voxel location refinement that reduces the quantization error introduced by voxelization and the stride of the backbone, the height above ground that helps localize objects in 3D and restores the elevation information removed by the map-view projection, the 3D box size, and the sine and cosine of the yaw angle that give the orientation. Together, these provide the full state of the 3D bounding box. The size is regressed on a logarithmic scale to better handle boxes of various shapes. Two different loss functions are used to train this stage. In the first step, the training target is a 2D Gaussian rendered at the projection of each ground-truth 3D box center into the map-view, and a focal loss is used for optimization. In the second step, an L1 regression loss is applied only at the ground-truth center locations.
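
As a rough illustration of how the regression outputs combine into a box, the sketch below decodes the predictions at one heatmap peak. The function name, the dictionary layout, and the default `voxel_size`, `stride`, `x_min`, and `y_min` values are assumptions made for this example, not values taken from the paper.

```python
import math


def decode_box(row, col, offset, height, log_size, sin_yaw, cos_yaw,
               voxel_size=0.1, stride=8, x_min=0.0, y_min=0.0):
    """Turn the regression outputs at one heatmap peak into a 3D box."""
    cell = voxel_size * stride  # metric size of one heatmap cell (assumed values)
    # The sub-voxel refinement shifts the center away from the cell corner.
    x = (col + offset[0]) * cell + x_min
    y = (row + offset[1]) * cell + y_min
    # Sizes are regressed in log space, so exponentiate to get meters.
    length, width, box_height = (math.exp(v) for v in log_size)
    # atan2 recovers the yaw angle from its sine and cosine.
    yaw = math.atan2(sin_yaw, cos_yaw)
    return {"center": (x, y, height),
            "size": (length, width, box_height),
            "yaw": yaw}


box = decode_box(row=10, col=20, offset=(0.5, 0.25), height=1.0,
                 log_size=(math.log(4.0), math.log(2.0), math.log(1.5)),
                 sin_yaw=1.0, cos_yaw=0.0)
print(box)
```

Predicting sine and cosine instead of the raw angle avoids the discontinuity at the wrap-around of the yaw angle, and `atan2` handles all four quadrants.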

The second stage refines object locations and confidence scores. On top of the first-stage predictions, it predicts a class-agnostic confidence score and a box refinement. To do so, it extracts additional point features: one from the center of each face of the predicted 3D box. In the map-view, the top and bottom face centers project onto the box center, so only the four outward-facing side faces and the box center are used. This gives 5 point features, which are also visualized around the boxes at the end of the first stage in the CenterPoint architecture figure above. For each of these points, a feature is extracted from the map-view feature map. The features are concatenated and passed through an MLP to predict the final bounding box and confidence score. The box regression predicts a refinement on top of the first-stage proposals and is trained with an L1 regression loss. For the confidence score, the training target is a score target I derived from the 3D Intersection over Union (IoU) between a proposal and its corresponding ground-truth box, and supervision is provided by a binary cross-entropy loss between this target and the predicted confidence score. The score target I is calculated as follows:

$$I = \min\left(1, \max\left(0, 2 \times IoU_t - 0.5\right)\right)$$

where IoU_t is the 3D IoU between the t-th proposal box and its corresponding ground-truth box.

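The binary cross-entropy loss for the confidence score is then:

$$L_{score} = -I_t \log(\hat{I}_t) - (1 - I_t) \log(1 - \hat{I}_t)$$

where \hat{I}_t (I with the hat) is the predicted confidence score.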
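
A small numeric sketch may help here. It assumes the score target from the CenterPoint paper, I = min(1, max(0, 2 × IoU − 0.5)), supervised with binary cross-entropy; the function names and the clipping constant `eps` are choices made for this example.

```python
import math


def score_target(iou):
    """IoU-guided score target: 0 below IoU 0.25, 1 above IoU 0.75, linear between."""
    return min(1.0, max(0.0, 2.0 * iou - 0.5))


def score_loss(predicted_score, iou, eps=1e-7):
    """Binary cross-entropy between the predicted score and the score target."""
    target = score_target(iou)
    # Clip the prediction so that log() never receives 0.
    p = min(max(predicted_score, eps), 1.0 - eps)
    return -target * math.log(p) - (1.0 - target) * math.log(1.0 - p)


for iou in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(iou, score_target(iou))
```

With this target, proposals that overlap the ground truth by less than 0.25 are pushed towards a score of 0 and those above 0.75 towards 1, so the confidence score reflects localization quality rather than class probability alone.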

For tracking, CenterPoint learns to predict a two-dimensional velocity for each detected object as an additional output of the first-stage regression heads. Since the velocity corresponds to the difference in object position between frames, it requires the map-views of both the current and the previous time step as input. Objects are then tracked in the same way as in CenterTrack: current detections are projected back to the previous frame using the negated velocity and matched greedily to the closest tracked objects. For training, an L1 regression loss is applied at the ground-truth object locations in the current time step.
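
The association step can be sketched as follows. This is a simplified illustration, not the authors' implementation: the data layout, the name `associate`, the distance threshold, and the choice to process detections in order of decreasing score are assumptions for this example.

```python
import math


def associate(tracks, detections, dt, max_distance, next_id):
    """Greedy closest-distance matching of detections to existing tracks.

    tracks: {track_id: (x, y)} centers from the previous frame.
    detections: list of dicts with "center", "velocity" and "score".
    Returns ({detection_index: track_id}, next unused track id).
    """
    free = dict(tracks)  # tracks that have not been matched yet
    assigned = {}
    order = sorted(range(len(detections)), key=lambda i: -detections[i]["score"])
    for i in order:
        det = detections[i]
        # Project the current center back to the previous frame.
        px = det["center"][0] - det["velocity"][0] * dt
        py = det["center"][1] - det["velocity"][1] * dt
        best_id, best_dist = None, max_distance
        for track_id, (tx, ty) in free.items():
            dist = math.hypot(px - tx, py - ty)
            if dist < best_dist:
                best_id, best_dist = track_id, dist
        if best_id is None:  # no track close enough: start a new one
            best_id = next_id
            next_id += 1
        else:
            del free[best_id]
        assigned[i] = best_id
    return assigned, next_id


tracks = {1: (0.0, 0.0), 2: (10.0, 0.0)}
detections = [
    {"center": (1.0, 0.0), "velocity": (2.0, 0.0), "score": 0.9},
    {"center": (10.5, 0.0), "velocity": (1.0, 0.0), "score": 0.8},
    {"center": (30.0, 30.0), "velocity": (0.0, 0.0), "score": 0.5},
]
assigned, next_id = associate(tracks, detections, dt=0.5,
                              max_distance=1.0, next_id=3)
print(assigned)  # {0: 1, 1: 2, 2: 3}
```

The first two detections land exactly on the previous positions of tracks 1 and 2 after back-projection, while the third is too far from any track and receives a new ID.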

During training, all heatmap and regression losses are combined into one objective and optimized jointly. The following is an example output of CenterPoint:

Experiments and Ablation Studies

The following two datasets are used for experiments:

  1. Waymo Open Dataset [10] (798 training, 202 validation sequences, Lidar sensor, point clouds)
  2. nuScenes [3] (700 training, 150 validation, 150 test sequences, Lidar sensor with 20 FPS frequency, point clouds)

The following metrics are used in experiments:

  1. Mean Average Precision (mAP) is the mean over classes of the average precision, i.e., the average of the precision values on the precision-recall curve at the recall levels [0, 0.1, 0.2, ..., 1.0].
  2. Multiple Object Tracking Accuracy (MOTA) [11]:

$$MOTA = 1 - \frac{\sum_t (FN_t + FP_t + IDS_t)}{\sum_t GT_t}$$

where FN_t is the number of false negatives, FP_t the number of false positives, GT_t the number of ground-truth objects, and IDS_t the number of identity switches at time t.

  3. Multiple Object Tracking Precision (MOTP) [11]:

$$MOTP = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t}$$

where c_t is the number of matches at time t and d_{t,i} is the bounding box overlap of target i with its assigned ground-truth object at time t.

  4. Average Multiple Object Tracking Accuracy (AMOTA) [12] and Average Multiple Object Tracking Precision (AMOTP) [12] are averages of MOTA and MOTP over different recall thresholds. They are used in the evaluation on the nuScenes dataset.
  5. nuScenes Detection Score (NDS) [3] consolidates the different error types into a single scalar. It is calculated using the following formula:

$$NDS = \frac{1}{10}\left[5 \cdot mAP + \sum_{mTP \in TP} \left(1 - \min(1, mTP)\right)\right]$$

where mAP is the mean average precision and TP is the set of five mean True Positive metrics, which measure the errors in location, size, orientation, attributes, and velocity of detections.

  6. Planning KL-Divergence (PKL) [13] measures the impact of 3D detections on planning. It is based on the KL divergence between a planner's trajectory, computed from the detections, and the ground-truth trajectory.
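
To show how two of these metrics are computed in practice, here is a short sketch. It assumes the standard definitions MOTA = 1 − Σ(FN + FP + IDS) / ΣGT and NDS = (1/10) · [5 · mAP + Σ (1 − min(1, mTP))]; the per-frame counts and error values are made-up example numbers, not results from the paper.

```python
def mota(frames):
    """MOTA from per-frame counts of misses, false positives and ID switches."""
    errors = sum(f["fn"] + f["fp"] + f["ids"] for f in frames)
    total_gt = sum(f["gt"] for f in frames)
    return 1.0 - errors / total_gt


def nds(mean_ap, tp_errors):
    """nuScenes Detection Score from mAP and the five mean true-positive errors."""
    # Each error is clipped to 1 and converted into a score in [0, 1].
    tp_scores = sum(1.0 - min(1.0, err) for err in tp_errors)
    return (5.0 * mean_ap + tp_scores) / 10.0


# Made-up example values for illustration only.
frames = [
    {"fn": 1, "fp": 0, "ids": 0, "gt": 5},
    {"fn": 0, "fp": 1, "ids": 1, "gt": 5},
]
print(mota(frames))

tp_errors = [0.25, 0.25, 0.5, 0.25, 1.5]  # location, size, orientation, attributes, velocity
print(nds(0.6, tp_errors))
```

In this example, three errors over ten ground-truth objects give a MOTA of about 0.7. In the NDS, the velocity error of 1.5 is clipped to 1 and contributes nothing, so mAP accounts for half of the score and the five error terms for the other half.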

CenterPoint is evaluated on the Waymo Open Dataset and nuScenes dataset. Both results are obtained using a single CenterPoint-Voxel model. The following two tables are the results of evaluations for 3D detection. Level 1 and Level 2 in the left table correspond to the boxes with more than 5 Lidar points and at least one Lidar point respectively. CenterPoint outperforms all other methods in both datasets and reaches state-of-the-art in 3D Object detection.

Waymo Test Set & nuScenes Test Set

The authors performed additional experiments to show where the improvements of CenterPoint come from. The following two tables show the effects of object rotation and object size on the Level 2 Waymo validation data. The results show the strength of CenterPoint on both heavily rotated objects and objects whose size deviates from the average.

Waymo Validation Set

As the following tables show, the improvement comes largely from small categories such as traffic cones and from categories with extreme aspect ratios such as bicycles and construction vehicles.

nuScenes Test Set

The authors also evaluated the tracking ability of CenterPoint on these datasets. As the following tables show, CenterPoint also reaches state-of-the-art in 3D tracking.

Waymo Test Set & nuScenes Test Set

The following table shows that CenterPoint performs better than last year's challenge winner with both the Kalman filter-based approach and the CenterTrack-based approach. In addition, since it does not require a separate module for tracking, it runs in negligible time. The tracking times in the table show how fast the tracking is.

nuScenes Validation Set

I think their experiments are strong and well-designed: they show where the model performs better and where the improvements come from. Next, we conclude the article and discuss CenterPoint.

Conclusion and Discussion

In summary, CenterPoint is a two-stage 3D object detection model that builds on previous center-based approaches to detect objects in point clouds. It uses the feature encoders of VoxelNet and PointPillars as its 3D backbone, which shows that CenterPoint can be applied on top of any 3D point cloud encoder. In the first stage, a fully convolutional CenterNet-style head and regression heads produce heatmaps of predicted centers and regress bounding boxes with categories. As an optional performance boost, the second stage refines the first-stage predictions with an MLP.

I think that the motivation and intuition of this work are explained well. CenterPoint is not a complicated architecture. It follows a classic two-stage object detection pipeline in which a backbone processes the input data and some convolutional layers process the extracted features to obtain bounding boxes and categories. The experiments and ablation studies are well-designed, so we can learn how much each component of the model contributes and in which cases the model performs better. The experiments show that it is robust to variations in rotation and size. However, there are cases where CenterPoint may fail. Since it relies only on the feature at the center of an object, this feature may not be sufficient for accurate object localization in some cases. The authors reduced this weakness with the second stage of CenterPoint, but there may still be cases where the model fails.

Regarding tracking, I would say that the paper does not bring novelty to the field: it applies the CenterTrack approach to their model in the 3D domain. It reaches state-of-the-art performance, so it works well, but I think the reason is the quality of the detections, which are themselves state-of-the-art. The tracking algorithm is a simple greedy distance-based algorithm, and its inference time is very low.

References

[1] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Center-based 3D object detection and tracking. arXiv:2006.11275, 2020.

[2] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. NIPS, 2015.

[3] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. CVPR, 2020.

[4] Guillem Brasó and Laura Leal-Taixé. Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6247–6257, 2020.

[5] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv:1904.07850, 2019

[6] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. ECCV, 2020.

[7] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. CVPR, 2017.

[8] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. CVPR, 2018.

[9] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. CVPR, 2019.

[10] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: An open dataset benchmark. CVPR, 2020.

[11] Anton Milan, Laura Leal-Taixe, Ian D. Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multiobject tracking. CoRR, abs/1603.00831, 2016.

[12] Xinshuo Weng and Kris Kitani. A baseline for 3D multi-object tracking. arXiv, 2019.

[13] Jonah Philion, Amlan Kar, and Sanja Fidler. Learning to evaluate perception models using planner-centric metrics. 2020.