Visual SLAM (VSLAM) has been developing rapidly due to its advantages of low-cost sensors, easy fusion with other sensors, and richer environmental information. Traditional vision-based SLAM research has produced many achievements, but it may fail to achieve the desired results in challenging environments. Deep learning has driven the development of computer vision, and the combination of deep learning and SLAM has attracted more and more attention. Semantic information, as high-level environmental information, enables robots to understand the surrounding environment better.
Open-source SLAM systems: MonoSLAM, PTAM, the ORB-SLAM series, LSD-SLAM, SVO, RTAB-MAP; IMU-integrated VSLAM; semantic SLAM.
| SLAM algorithm | Sensors∗ |
|---|---|
| MonoSLAM | M |
| PTAM | M |
| DTAM | R |
| ORB-SLAM | M/S/R |
| LSD-SLAM | M |
| RTAB-MAP | S/R |
| SVO | M |
| RGBD-SLAM-V2 | R |
| ROVIO | M+IMU |
| DVO | R |
| DSO | M |
| OKVIS | M/S+IMU |
| VINS | M+IMU |
| GMapping | L |
| ElasticFusion | R |
| Hector SLAM | L |

∗M = Monocular, S = Stereo, R = RGB-D, L = Lidar
| MonoSLAM | PTAM |
|---|---|
| A. J. Davison's monocular SLAM [paper and paper 2] was proposed in 2007 as the first real-time monocular visual SLAM system. MonoSLAM uses the extended Kalman filter (EKF) as the backend and tracks very sparse Shi-Tomasi corner points for feature matching on the frontend. Since the EKF held a dominant position in early SLAM, MonoSLAM is also built on it, taking the current camera state and all landmark points as the state vector and updating their mean and covariance (a minimal EKF sketch follows this table). The monocular camera tracks very sparse feature points in an image using active tracking technology. In the EKF, each feature's position obeys a Gaussian distribution, so its mean and uncertainty can be expressed as an ellipsoid: the longer an ellipsoid is in a particular direction, the more uncertain the corresponding landmark is in that direction. When a feature point converges, its ellipsoid shrinks from a very elongated shape (initially very uncertain along the Z axis of the camera frame) to a small point. This approach seems to have many drawbacks today, but it was a milestone at the time because most previous visual SLAM systems could not run online. | In 2007, Klein's team proposed PTAM (Parallel Tracking and Mapping) [paper], another important event in the development of visual SLAM. The significance of PTAM lies in two points: (1) it proposed and realized the parallelization of tracking and mapping as two separate threads, distinguishing the front end and back end for the first time; (2) it abandoned filter-based back ends in favor of nonlinear optimization over keyframes. However, from a modern perspective, PTAM can also be regarded as one of the early SLAM works combined with AR. |
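A minimal sketch of the EKF loop that MonoSLAM-style systems build on, assuming caller-supplied motion and measurement models `f`/`h` with Jacobians `F`/`H` (all hypothetical placeholders, not MonoSLAM's actual code); the state `x` stacks the camera state and all landmark positions:

```python
import numpy as np

# Hypothetical EKF skeleton for a MonoSLAM-style filter. The state x
# stacks the camera state and all 3D landmark positions; P is the joint
# covariance (each landmark's uncertainty ellipsoid is a 3x3 block on
# its diagonal).

def ekf_predict(x, P, f, F, Q):
    """Propagate mean and covariance through motion model f
    with Jacobian F and process noise Q."""
    return f(x), F @ P @ F.T + Q

def ekf_update(x, P, z, h, H, R):
    """Correct the prediction with a feature measurement z, given
    measurement model h, Jacobian H, and measurement noise R."""
    y = z - h(x)                      # innovation
    S = H @ P @ H.T + R               # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new
```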
| ORB-SLAM Series | LSD-SLAM |
|---|---|
| ORB-SLAM [paper], proposed in 2015, is very famous and is one of the most complete and easy-to-use modern SLAM systems (if not the most complete and easy-to-use). ORB-SLAM is a peak of mainstream feature-based open-source SLAM. Compared with previous work, it has obvious advantages: it supports monocular, stereo, and RGB-D sensors; the whole system is computed around ORB features, which balance efficiency with rotation and scale invariance; its bag-of-words loop detection effectively eliminates accumulated error and allows fast relocalization after tracking is lost; and it uses a three-thread structure of real-time tracking, local mapping, and loop closing with global optimization. These advantages make ORB-SLAM a state-of-the-art open-source visual SLAM system. Of course, ORB-SLAM also has shortcomings: since the entire system relies on feature points, the ORB features must be detected in every image, which is very time-consuming. | LSD-SLAM (Large-Scale Direct monocular SLAM) [paper, paper 2] was proposed by J. Engel et al. in 2014. What ORB-SLAM is to feature points, LSD-SLAM is to the direct method: it marks the direct method's successful application in monocular SLAM. The core contribution of LSD-SLAM is applying the direct method to semi-dense monocular SLAM: it does not need to compute features yet still constructs a semi-dense map, where "semi-dense" means estimating only the pixels with an obvious gradient (see the pixel-selection sketch after this table). Since LSD-SLAM uses direct tracking, it inherits both the direct method's advantages and its disadvantages: for example, it is very sensitive to camera intrinsics and exposure time and is easily lost when the camera moves quickly. Also, in loop detection, LSD-SLAM still relies on the feature point method and has not entirely eliminated feature computation. |
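Since "semi-dense" means tracking only pixels with an obvious intensity gradient, the selection step can be sketched as a simple gradient-magnitude threshold (the threshold value here is an arbitrary assumption, not LSD-SLAM's actual criterion):

```python
import numpy as np

def select_semi_dense_pixels(gray, grad_thresh=30.0):
    """Return a boolean mask of pixels with an obvious gradient,
    i.e. the pixels a semi-dense direct method would track."""
    gy, gx = np.gradient(gray.astype(np.float32))
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    return grad_mag > grad_thresh
```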
| SVO | RTAB-MAP |
|---|---|
| SVO is the abbreviation of Semi-direct Visual Odometry [paper], a visual odometry based on the sparse direct method proposed by Forster et al. in 2014. "Semi-direct" in the original text refers to mixing feature points with the direct method: SVO tracks some key points (corner points without descriptors) and then, like the direct method, uses the information around these key points to estimate the camera motion and the map points' positions. SVO uses small 4 × 4 blocks around the key points for block matching to estimate the camera's motion (see the patch-matching sketch after this table). Compared with other schemes, SVO's most significant advantage is its speed: thanks to the sparse direct method, it neither has to laboriously compute descriptors nor process as much information as dense or semi-dense approaches, so it achieves real-time performance even on low-end computing platforms. Another SVO innovation is the concept of the depth filter: it derives a depth filter based on a uniform-Gaussian mixture distribution. | RTAB-MAP (Real-Time Appearance-Based Mapping) [paper] is a classic scheme in RGB-D SLAM. It implements everything an RGB-D SLAM should include: feature-based visual odometry, bag-of-words-based loop detection, backend pose-graph optimization, and point cloud and triangular mesh maps. Therefore, RTAB-MAP provides a complete (but somewhat huge) RGB-D SLAM solution. At present, the binary program can be obtained directly from ROS. |
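A toy illustration of the 4 × 4 block matching described above: compare small patches by sum of squared differences over a set of candidate positions (the candidate set and the plain SSD scoring are illustrative assumptions, not SVO's actual implementation):

```python
import numpy as np

def match_patch(ref_img, cur_img, kp, candidates, size=4):
    """Among candidate pixel positions in the current image, find the
    one whose size x size patch best matches (minimum SSD) the patch
    anchored at key point kp in the reference image."""
    r, c = kp
    ref_patch = ref_img[r:r + size, c:c + size].astype(np.float32)
    best, best_ssd = None, np.inf
    for rr, cc in candidates:
        cur_patch = cur_img[rr:rr + size, cc:cc + size].astype(np.float32)
        if cur_patch.shape != ref_patch.shape:
            continue  # candidate too close to the image border
        ssd = np.sum((ref_patch - cur_patch) ** 2)
        if ssd < best_ssd:
            best, best_ssd = (rr, cc), ssd
    return best, best_ssd
```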
VSLAM has the advantage of richer environmental information and is considered capable of giving mobile robots stronger perceptual ability, enabling applications in some specific scenarios.
In Section 1, this note introduces the characteristics of traditional VSLAM in detail, covering the direct and indirect methods of front-end visual odometry, and compares depth-camera-based VSLAM with classical VSLAM integrated with an IMU. Section 2 is divided into two parts: it first goes through the combination of deep learning and VSLAM from the perspective of two neural networks, CNNs and RNNs, and then summarizes the development of semantic VSLAM from three aspects: localization, mapping, and the elimination of dynamic objects.
VSLAM can be divided into:
- Traditional VSLAM
  - Monocular / stereo VSLAM and RGB-D VSLAM, each based on either the feature-based method or the direct method
  - Visual-inertial SLAM, either loosely coupled or tightly coupled
- Semantic VSLAM
The sensors used in VSLAM typically include the monocular camera, stereo camera, and RGB-D camera. The monocular and stereo cameras have similar principles and can be used in a wide range of indoor and outdoor environments. As a special form of camera, the RGB-D camera can directly obtain image depth, mainly by actively emitting infrared structured light or calculating time-of-flight (TOF). It is convenient to use but sensitive to light and can only be used indoors in most cases. The event camera, which has appeared in recent years, is a new camera sensor whose output differs from a traditional camera's: it outputs "events", which can be understood simply as "pixel brightness changes". SLAM based on the event camera is still at a preliminary research stage. ATE (Absolute Trajectory Error) and RPE (Relative Pose Error) are the two most important indicators used to evaluate the accuracy of SLAM. The relative pose error measures the difference of pose changes over the same time interval and is suitable for estimating system drift. The absolute trajectory error directly measures the difference between the ground-truth camera pose and the pose estimated by the SLAM system.
ATE: The absolute trajectory error is the direct difference between the estimated pose and the real pose, which directly reflects the accuracy of the algorithm and the global consistency of the trajectory. Note that the estimated poses and the ground truth are usually not in the same coordinate system, so they must be aligned first: for stereo SLAM and RGB-D SLAM, the scale is determined, so a transformation matrix S ∈ SE(3) from the estimated poses to the real poses is computed by least squares; for monocular cameras with scale uncertainty, a similarity transformation S ∈ Sim(3) is computed instead.
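A hedged sketch of the ATE computation described above, assuming the estimated and ground-truth trajectories are given as N × 3 arrays of positions already associated by timestamp; the alignment uses the Umeyama least-squares method, with scale enabled for the monocular Sim(3) case and disabled for the SE(3) case:

```python
import numpy as np

def align_umeyama(est, gt, with_scale=False):
    """Least-squares alignment of estimated positions to ground truth
    (Umeyama). with_scale=True gives the Sim(3)-style alignment needed
    for monocular SLAM; False gives the SE(3) case."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(est))   # cross-covariance SVD
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # avoid reflections
        S[2, 2] = -1
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / E.var(axis=0).sum() if with_scale else 1.0
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt, with_scale=False):
    """Root-mean-square absolute trajectory error after alignment."""
    s, R, t = align_umeyama(est, gt, with_scale)
    aligned = (s * (R @ est.T)).T + t
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))
```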
VSLAM can be divided into the direct method and the indirect method according to how the front-end visual odometry uses the collected image information. The indirect method selects a certain number of representative points, called key points, from the collected images, then detects and matches them in subsequent images to obtain the camera pose. It preserves the key information of the image while reducing the amount of calculation, so it is widely used. The direct method uses all the information in the image without preprocessing and operates directly on pixel intensities, which gives it higher robustness in environments with sparse texture.
The core of indirect VSLAM is to detect, extract, and match geometric features (points, lines, or planes), estimate the camera pose, and build an environment map. Because only important information is retained, calculation is effectively reduced.
Early feature extraction mostly adopted corner extraction methods such as Harris [paper], FAST [paper], and GFTT [paper]. However, in many scenarios simple corners cannot provide reliable features, which prompted researchers to seek more stable local image features. Nowadays, typical VSLAM methods based on point features first use feature detection algorithms, such as SIFT [paper], SURF [paper], and ORB [paper], to extract key points in the image for matching, then obtain the pose by minimizing the reprojection error. Feature points and their corresponding descriptors are employed for data association.
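As an illustration of this feature-based pipeline, a minimal OpenCV sketch: detect ORB key points, match binary descriptors between two frames, and recover the relative camera pose from the essential matrix. The function name and the intrinsic matrix `K` are placeholder assumptions:

```python
import cv2
import numpy as np

def relative_pose(img1, img2, K):
    """Estimate the relative camera pose between two grayscale images
    from matched ORB features (translation is up to scale)."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    # Hamming-distance brute force fits ORB's binary descriptors
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # RANSAC essential matrix, then cheirality check to recover R, t
    E, mask = cv2.findEssentialMat(pts1, pts2, K, cv2.RANSAC, 0.999, 1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t

# Placeholder intrinsics (fx, fy, cx, cy values are assumptions)
K = np.array([[718.0, 0.0, 607.0],
              [0.0, 718.0, 185.0],
              [0.0, 0.0, 1.0]])
```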
Comparison table of commonly used feature extraction algorithms:
| Method | Year | Type | Speed | Rotation Invariance | Scale Invariance | Illumination Invariance | Robustness |
|---|---|---|---|---|---|---|---|
| ORB [paper] | 2011 | Point | High | Yes | Yes | Yes | Stronger |
| SURF [paper] | 2008 | Point | Middle | Yes | Yes | No | Weak |
| FAST [paper] | 2006 | Point | High | No | Yes | No | Weak |
| SIFT [paper] | 2004 | Point | Low | Yes | Yes | Yes | Strong |
| Shi-Tomasi / GFTT [paper] | 1994 | Corner | Middle | Yes | No | Yes | Weak |
| Harris [paper] | 1988 | Corner | Low | Yes | No | Yes | Weak |
| LSD [paper] | 2010 | Line | Middle | Yes | Yes | Yes | Stronger |
The commonly used line feature extraction algorithm is LSD.
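For completeness, a minimal LSD usage sketch via OpenCV (whether `createLineSegmentDetector` is available depends on the OpenCV version/build; the file names are placeholders):

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
lsd = cv2.createLineSegmentDetector()
lines, _, _, _ = lsd.detect(gray)   # N x 1 x 4 array of x1, y1, x2, y2
vis = lsd.drawSegments(cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR), lines)
cv2.imwrite("lines.png", vis)
```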
SLAM algorithms: MonoSLAM [paper] in 2007, PTAM [paper] in 2007, LineSLAM [paper] in 2014, ORB-SLAM [paper] in 2015, ORB-SLAM2 [paper] in 2017, monocular PL-SLAM [paper] in 2017, stereo PL-SLAM [paper] in 2017, and ORB-SLAM3 [paper] in 2021.
The ORB-SLAM family is one of the most widely used visual SLAM solutions due to its real-time CPU performance and robustness. However, the ORB-SLAM series relies heavily on environmental features, so it may be difficult to obtain enough feature points in texture-less environments.
Different from feature-based methods, the direct method operates directly on pixel intensities and retains all the information in the image. Furthermore, the direct method skips feature extraction and matching, so its computational efficiency is better than the indirect method's, and it adapts well to environments with weak texture.
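The direct method's objective can be written as a photometric error minimization over the camera pose T (a standard textbook formulation, not tied to any particular system here): for pixels p with depth d_p,

```latex
\min_{T} \sum_{p \in \Omega}
  \left\| I_{1}(p) - I_{2}\!\left( \pi\!\left( T\,\pi^{-1}(p,\, d_p) \right) \right) \right\|^{2}
```

where π projects a 3D point to pixel coordinates, π⁻¹ back-projects pixel p with depth d_p, and Ω is the set of pixels used: all pixels for dense, high-gradient pixels for semi-dense, and sampled points for sparse direct methods.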
SLAM algorithms: DTAM [paper] in 2011, SVO [paper] in 2014 (semi-direct), and DSO [paper] in 2018.
SVO combines the advantages of the feature point method and the direct method. The algorithm is divided into two main threads: motion estimation and mapping. Motion estimation is carried out by feature point matching, while mapping is carried out by the direct method.
DSO can calculate an accurate camera pose even when feature point detectors perform poorly, improving robustness in low-texture areas or with blurred images. In addition, DSO uses both geometric and photometric camera calibration for high-accuracy estimation. However, DSO only considers local geometric consistency, so it inevitably accumulates error. Furthermore, it is not a complete SLAM system because it does not include loop closure, map reuse, etc.
An RGB-D camera can simultaneously collect color images and depth images of the environment, obtaining depth maps directly, mainly by actively emitting infrared structured light or calculating time-of-flight (TOF).
Some SLAM algorithms for RGB-D cameras:
| Method | Year | Camera Tracking | Loop Closure | Code Resource |
|---|---|---|---|---|
| KinectFusion [paper] | 2011 | Direct | No | [code] |
| Kintinuous [paper] | 2012 | Direct | Yes | [code] |
| RGB-D SLAMv2 [paper] | 2013 | Indirect | Yes | [code] |
| ElasticFusion [paper] | 2016 | Direct | Yes | [code] |
| DVO-SLAM [paper] | 2017 | Direct | Yes | [code] |
| BundleFusion [paper] | 2017 | Hybrid | Yes | [code] |
| RGBDTAM [paper] | 2017 | Direct | Yes | [code] |
KinectFusion is the first real-time 3D reconstruction system based on an RGB-D camera. It uses the point cloud created from the depth image to estimate the camera pose through ICP (Iterative Closest Point), then splices multi-frame point clouds based on the camera poses and represents the reconstruction result with a TSDF (Truncated Signed Distance Function) model. The 3D model can be constructed in real time with GPU acceleration. However, the system has no loop closure optimization, so obvious errors appear in long-term operation, and the RGB information of the RGB-D camera is not fully utilized.
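A compact sketch of the per-frame TSDF update KinectFusion performs, assuming voxel centers have already been transformed into the camera frame; the truncation distance, weight cap, and flat-array layout are simplified assumptions:

```python
import numpy as np

def tsdf_update(tsdf, weight, voxels_cam, depth, K, trunc=0.05):
    """Integrate one depth frame into a TSDF volume.
    tsdf, weight : flat arrays, one entry per voxel
    voxels_cam   : N x 3 voxel centers in the camera frame (meters)
    depth        : H x W depth image (meters); K : camera intrinsics."""
    z = voxels_cam[:, 2]
    safe_z = np.where(z > 0, z, np.inf)  # avoid divide-by-zero behind camera
    u = np.round(K[0, 0] * voxels_cam[:, 0] / safe_z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * voxels_cam[:, 1] / safe_z + K[1, 2]).astype(int)
    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    sdf = np.full(len(z), -np.inf)                 # invalid voxels fail below
    sdf[valid] = depth[v[valid], u[valid]] - z[valid]
    upd = sdf > -trunc                             # inside the truncation band
    new = np.clip(sdf[upd], -trunc, trunc) / trunc # normalized to [-1, 1]
    tsdf[upd] = (tsdf[upd] * weight[upd] + new) / (weight[upd] + 1)
    weight[upd] = np.minimum(weight[upd] + 1, 50)  # running-average weight cap
    return tsdf, weight
```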
In contrast, ElasticFusion makes full use of both the color and depth information of the RGB-D camera. It estimates the camera pose jointly from the color consistency of the RGB images and from ICP on the depth, and improves the pose estimation accuracy by constantly optimizing and reconstructing the map. It uses the surfel model for map representation, but it can only reconstruct small indoor scenes. Kintinuous adds loop closure on top of KinectFusion and, for the first time, applies a deformation graph to perform non-rigid transformations of the rigid 3D reconstruction, so that the two sides of a loop closure overlap, achieving good results in indoor environments.
Although the RGB-D camera is more convenient to use, it is extremely sensitive to light and suffers from a narrow field of view, noise, and limited range, so in most situations it is only used indoors.
Pure visual SLAM algorithms have achieved many successes. However, using only the camera as a single sensor, it is still difficult to handle image blur caused by fast camera movement and poor illumination. The IMU is considered one of the sensors most complementary to the camera: it obtains accurate estimates at high frequency over short periods and reduces the impact of dynamic objects on the camera, while the camera data can effectively correct the cumulative drift of the IMU. At the same time, thanks to the miniaturization and falling cost of cameras and IMUs, visual-inertial fusion has developed rapidly.
Nowadays, visual-inertial fusion can be divided into loosely coupled and tightly coupled approaches according to whether image feature information is added to the state vector. Loosely coupled means the IMU and the camera each estimate their own motion and then fuse the two pose estimates. Tightly coupled means combining the states of the IMU and the camera to jointly construct the motion and observation equations and then perform state estimation.
The core of tight coupling is to combine the states of the vision sensor and the IMU through an optimizer or filter: the image features are added to the state vector so that the motion and observation equations are constructed jointly, and state estimation then yields the pose. Tight coupling makes full use of both visual and inertial measurement information; it is more complicated to implement but achieves higher pose estimation accuracy. It is therefore the mainstream method, and many breakthroughs have been made in this area.
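As an illustration, in an optimization-based tightly coupled formulation such as VINS-Mono's (shown here as one common example, not a universal definition), the jointly estimated state stacks the IMU states at the frames together with the visual feature depths:

```latex
\mathcal{X} = \left[\, \mathbf{x}_0,\ \mathbf{x}_1,\ \dots,\ \mathbf{x}_n,\ \lambda_0,\ \lambda_1,\ \dots,\ \lambda_m \,\right],
\qquad
\mathbf{x}_k = \left[\, \mathbf{p}_k,\ \mathbf{v}_k,\ \mathbf{q}_k,\ \mathbf{b}_a,\ \mathbf{b}_g \,\right]
```

where x_k holds the position, velocity, orientation, and accelerometer/gyroscope biases at the k-th frame, and λ_l is the inverse depth of the l-th visual feature, so that visual and inertial measurements constrain one state vector jointly.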
This table summarizes well-known visual and visual-inertial SLAM algorithms:
Visual methods:

| Method | Sensor | Frontend | Backend | Loop Closure | Mapping | Code Resources |
|---|---|---|---|---|---|---|
| MonoSLAM [paper] | M | P | F | No | Sparse | [code] |
| PTAM [paper] | M | P | O | No | Sparse | [code] |
| ORB-SLAM2 [paper] | M/S/R | P | O | Yes | Sparse | [code] |
| PL-SVO [paper] | M | PL | O | No | Sparse | [code] |
| PL-SLAM [paper] | M/S | PL | O | Yes | Sparse | [code] |
| DTAM [paper] | M | D | O | No | Dense | [code] |
| SVO [paper] | M | H | O | No | Sparse | [code] |
| LSD-SLAM [paper] | M/S | D | O | Yes | Semi-dense | [code] |
| DSO [paper] | M | D | O | No | Sparse | [code] |

Visual-inertial methods:

| Method | Sensor | Frontend | Backend | Loop Closure | Mapping | Code Resources |
|---|---|---|---|---|---|---|
| MSCKF [paper] | M + I | T | F | No | Sparse | [code] |
| OKVIS [paper] | S + I | T | O | No | Sparse | [code] |
| ROVIO [paper] | M + I | T | F | No | Sparse | [code] |
| VINS-Mono [paper] | M + I | T | O | Yes | Sparse | [code] |

Sensor: M = monocular camera; S = stereo camera; R = RGB-D camera; I = IMU. Frontend: P = point; PL = point-line; D = direct; H = hybrid; T = tightly coupled. Backend: F = filtering; O = optimization.
As a supplement to cameras, inertial sensors can effectively solve problems that a single camera cannot cope with. Visual-inertial fusion is bound to remain a hot direction of SLAM research for a long time.
| Name | Year | Visual odometry | Feature method | Optimization | Loop closure detection | Real-time efficiency | Robustness | Characteristic |
|---|---|---|---|---|---|---|---|---|
| MonoSLAM [paper] | 2007 | Feature method | Shi-Tomasi | EKF | — | ✩✩ | ✩✩ | The first real-time monocular V-SLAM system. |
| PTAM [paper] | 2007 | Feature method | Shi-Tomasi | BA | — | ✩✩✩ | ✩✩✩ | The first monocular V-SLAM system based on nonlinear optimization of keyframes; proposed the concept of front end and back end. |
| LSD-SLAM [paper] | 2014 | Direct method | — | Depth filter | ✓ | ✩✩✩ | ✩✩✩ | Applies the direct method to semi-dense monocular V-SLAM. |
| SVO [paper] | 2014 | Semi-direct method | FAST + LK | Depth filter | — | ✩✩✩✩✩✩ | ✩✩✩ | A visual odometry based on the sparse direct method; its biggest advantage over other schemes is speed. |
| S-PTAM [paper] | 2017 | Feature method | Shi-Tomasi | BA | ✓ | ✩✩✩✩ | ✩✩✩✩ | A stereo V-SLAM system based on PTAM. |
| LDSO [paper] | 2018 | Direct method | Frames | BA | ✓ | ✩✩✩✩ | ✩✩✩✩✩ | Extends DSO into a monocular V-SLAM system with loop detection and pose optimization. |
| LCSD-SLAM [paper] | 2018 | Semi-direct method | Descriptor | BA | ✓ | ✩✩✩✩ | ✩✩✩✩✩ | A loosely coupled semi-direct monocular V-SLAM. |
| Maplab [paper] | 2018 | Semi-direct method | Descriptor | — | ✓ | ✩✩✩✩ | ✩✩✩ | A flexible and versatile multi-robot and multi-session V-SLAM framework. |
| RESLAM [paper] | 2019 | Semi-direct method | Edges | BA | ✓ | ✩✩✩ | ✩✩✩✩✩✩ | A real-time robust edge-based V-SLAM system. |
| CCM-SLAM [paper] | 2019 | Feature method | ORB | BA | ✓ | ✩✩✩ | ✩✩✩✩✩ | A multi-robot collaborative V-SLAM, usable for cooperative mapping in unknown environments. |
| ORB-SLAM3 [paper] | 2020 | Feature method | ORB | BA | ✓ | ✩✩✩✩✩ | ✩✩✩✩✩ | The first feature-based tightly coupled VIO system that relies only on maximum a posteriori estimation. |
| VINS-Fusion [paper] | 2021 | Semi-direct method | Shi-Tomasi + LK | BA | ✓ | ✩✩✩✩✩ | ✩✩✩ | An optimization-based multi-sensor state estimator achieving accurate self-localization for autonomous applications. |
| VOLDOR-SLAM [paper] | 2021 | Dense-indirect method | Optical flow | BA | ✓ | ✩✩✩✩ | ✩✩✩✩ | A dense-indirect visual odometry method that takes an externally estimated optical flow field as input. |
| ESVO [paper] | 2021 | Semi-direct method | Events | Depth filter | — | ✩✩✩✩✩✩ | ✩✩✩✩ | A novel pipeline for real-time visual odometry using a stereo event-based camera. |
There are also several newer SLAM methods, such as HKU-MaRS-HBA and Pharos SLAM; some of them can be found on the Hilti SLAM Challenge leaderboard.