

2019,  1 (5):   435 - 460

Published Date:2019-10-20 DOI: 10.1016/j.vrih.2019.09.001


Simultaneous localization and mapping (SLAM) has attracted considerable research interest from the robotics and computer-vision communities for more than 30 years. With steady and progressive efforts being made, modern SLAM systems allow robust and online applications in real-world scenes. We examined the evolution of this powerful perception tool in detail and noticed that the insights concerning incremental computation and temporal guidance are persistently retained. Herein, we denote this temporal continuity as a flow basis and present for the first time a survey that specifically focuses on the flow-based nature, ranging from geometric computation to the emerging learning techniques. We start by reviewing two essential stages for geometric computation, presenting the de facto standard pipeline and problem formulation, along with the utilization of temporal cues. The recently emerging techniques are then summarized, covering a wide range of areas, such as learning techniques, sensor fusion, and continuous-time trajectory modeling. This survey aims to draw attention to how robust SLAM systems benefit from continuous observation, and to the topics worthy of further investigation for better utilizing the temporal cues.


1 Introduction
The SLAM problem aims at obtaining the state of a camera along with the globally consistent representation of an unknown environment. From this perspective, SLAM has practical applications in a scene that can hardly be predefined. With over 30 years of research, state-of-the-art SLAM algorithms have achieved robust performance under certain conditions with regard to sensors, computational resources, environments, and motion types and are commonly applied to recently emerging areas such as augmented reality, mobile robotics, and autonomous driving.
Because both address the joint tasks of camera motion recovery and static environment modeling, SLAM and structure-from-motion (SfM) share a similar pipeline, as illustrated in Figure 1. SLAM and SfM are best distinguished by the input data and real-time implementation. SLAM deals with an ordered image sequence and allows other complementary sensor signals. The sequential data ensure the continuous observation of the unknown environment. However, the stringent timing requirement makes SLAM a task-driven problem, pursuing operational compliance rather than a perfect calculation. Hence, approximations and assumptions have been made to simplify the problem.
From this perspective, we find that utilizing temporal continuity is one main characteristic of recent progress made by the community. Although SLAM is commonly viewed as the online version of SfM[3,4,5], this opinion only captures the real-time property of the SLAM problem. The inherently continuous motion leads to continuous observations, and the advantages arising from this continuity make real-time implementation possible. We herein refer to this continuous nature of the SLAM problem as a flow basis. Specifically, the flow can be characterized as follows: (1) Temporal guidance: continuity makes the problem highly predictable, which allows guided searching and good initialization. (2) Incremental computation: the pre-computed results can be fully leveraged in an incremental fashion to reduce the amount of unnecessary operations. (3) Local constraint: the temporal constraints not only establish relative correlations but also include more information for local regularization.
1.1 Comparison with other surveys
Durrant-Whyte and Bailey[6,7] provided an in-depth and comprehensive review of the SLAM problem in the early stage of research (1986-2006), when filtering approaches were dominant. Recent progress was summarized by Cadena et al.[8], where the history of SLAM was divided into three stages: the classical age, the algorithm-analysis age, and the current robust-perception age. Increased demands for better scene understanding and a self-tuning system push research in a wide range of directions. Consequently, current reviews adopt more specific perspectives, e.g., problem formulation[9,10], dynamic environments[4], autonomous driving application[11], motion interpolation[12], and visual-inertial fusion[13].
Existing surveys provide a comprehensive overview of the technical development, whereas the present paper focuses on the flow-based perspective. We aim to reveal the ideas regarding how temporal cues are leveraged in each period and arouse consideration of topics that are overlooked and may lead to the remaining challenges.
1.2 Paper organization
The remainder of this paper is organized as follows. Sections 2 and 3 provide an overview of the technical development with regard to two major components: data association and mathematical solution. Advances in the learning technique are presented in Section 4. Section 5 discusses the emerging trends and open problems in SLAM, including sensor fusion, continuous-time modeling, and other future potentials. Final remarks are presented in Section 6.
2 Data association
Data association establishes connections between unknown pose variables, map variables, and observations. Owing to sensor noise, data association is inherently ambiguous in an unknown environment, and this ambiguity makes SLAM an optimal estimation problem. Online data association for the SLAM problem allows on-the-fly graph structuring[10], where localization (visual odometry and relocalization) provides local node connections with regard to inter-frame co-visibility or temporal continuity, and loop closure detection provides a global graph topology by associating new observations with previous landmarks. In general, data association involves the following three steps: (1) Representing observations in a compact form; (2) Efficient similarity reasoning; (3) Strict verification check.
The specific requirement of each sub-task leads to disparate methods of temporal information usage: visual odometry complies with the strict real-time requirements, hence focusing mainly on computational cost. The temporal cues provide guidance to enhance the efficiency. In contrast, loop closure detection is formulated as an image-retrieval task, where precision is of the highest priority. From this perspective, the temporal cues are usually treated as reliable constraints for better robustness.
2.1 Visual odometry
Visual odometry stresses the relative motion between consecutive frames, thus favoring temporally trackable features. The trackability is highly dependent on the similarity reasoning strategy, which presents the major computational burden. As illustrated in Figure 2, the choice of image features determines the representation of the reconstructed three-dimensional (3D) map: a feature is the two-dimensional (2D) noisy observation induced by a landmark on the map, and a map maintains the invariance of the environment.
Existing SLAM algorithms are commonly categorized as sparse/dense or direct/indirect[25], where sparse/dense refers to the representation choice, and direct/indirect refers to the similarity reasoning strategy. Taking the flow basis into consideration, we herein categorize existing visual odometry methods into two types: descriptor-based matching methods and flow-based tracking methods. A comparison of the front-end from prevalent SLAM systems is presented in Table 1.
Table 1 Comparison of prevalent SLAM systems with regard to front-end choices

| Reference | Representation (feature) | Matching (method) | Sparse/dense | Direct/indirect | Descriptor-/flow-based |
| --- | --- | --- | --- | --- | --- |
| MonoSLAM[18] | Shi-Tomasi corner point | Patch descriptor (NCC) | Sparse | Indirect | Descriptor-based |
| PTAM[19] | Shi-Tomasi corner point | Patch descriptor (SSD) | Sparse | Indirect | Descriptor-based |
| ORB-SLAM[14] | FAST feature | ORB descriptor | Sparse | Indirect | Descriptor-based |
| LSD-SLAM[16] | Edge points | OF on SE3 | Semi-dense | Direct | Flow-based |
| StructSLAM[15] | Struct line | Patch descriptor (ZNCC) | Sparse | Indirect | Descriptor-based |
| PL-SLAM[20] | ORB feature + line feature[21] | ORB and LBD[22] descriptors | Sparse | Indirect | Descriptor-based |
| VINS[23] | Shi-Tomasi corner point | KLT tracker[24] | Sparse | Indirect | Flow-based |
| ElasticFusion[17] | Dense pixels | OF on SE3 | Dense | Direct | Flow-based |
2.1.1   Descriptor-based matching methods
Descriptor-based matching methods rely on feature descriptors to build correspondences. Hence, such methods fall into the sparse and indirect category. Representations such as corner points[14,26] and line segments[15,20] are commonly utilized, as reliable descriptors ensure the distinguishability and are invariant to illumination changes, rotation, scale, and perspective distortion. However, standard descriptor-based matching is inherently an exhaustive searching strategy, which makes the problem computationally expensive. Recent DBoW2[27] maintains direct indices with promising efficiency. Owing to decades of research, descriptor-based matching methods achieve promising performance, allowing the incorporation of robust relocalization and loop closure detection modules, while random sample consensus (RANSAC)[28] is commonly applied for outlier rejection. However, the reliance on the sparse feature extraction potentially leads to failure when motion blurring occurs or in a repetitive or insufficient scenario.
Benefiting from their continuous nature, motion models, e.g., the decaying velocity model[19], constant motion model[14], and constant rotation model[20], are commonly utilized for modern SLAM systems. Motion models provide approximated positions of the reprojected map. This temporal guidance enhances the efficiency and accuracy by reducing the searching space. Meanwhile, the verification check evaluates the convergence of motion estimation guided by predicted camera poses. Once the tracking is lost, the system falls back to the standard method of global descriptor matching, where the system robustness is carefully retained.
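The temporal guidance described above can be illustrated with a small sketch. It assumes 4x4 homogeneous camera-to-world poses, a pinhole intrinsic matrix, and a fixed pixel search radius; the function names and toy values are hypothetical and are not taken from any of the cited systems.

```python
import numpy as np

def predict_pose_constant_velocity(T_prev, T_curr):
    """Predict the next camera-to-world pose by repeating the last motion."""
    # Inter-frame motion expressed in the previous camera frame:
    # T_curr = T_prev @ delta
    delta = np.linalg.inv(T_prev) @ T_curr
    return T_curr @ delta

def project(K, T_wc, p_w):
    """Project a 3D world point into the image of a camera with pose T_wc."""
    p_c = np.linalg.inv(T_wc) @ np.append(p_w, 1.0)  # world -> camera
    uv = K @ p_c[:3]
    return uv[:2] / uv[2]

def candidates_in_window(uv_pred, keypoints, radius):
    """Indices of keypoints within `radius` pixels of the predicted position."""
    dist = np.linalg.norm(keypoints - uv_pred, axis=1)
    return np.flatnonzero(dist < radius)

# Toy data (assumed): the camera moved 0.1 units along x between two frames.
T_prev = np.eye(4)
T_curr = np.eye(4)
T_curr[0, 3] = 0.1
T_pred = predict_pose_constant_velocity(T_prev, T_curr)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
landmark = np.array([0.2, 0.0, 2.0])
keypoints = np.array([[322.0, 241.0], [400.0, 300.0]])

uv_pred = project(K, T_pred, landmark)
matches = candidates_in_window(uv_pred, keypoints, radius=10.0)
print(uv_pred, matches)
```

Only the keypoints inside the predicted window need to be compared by descriptor, which is where the reduced search space comes from. A real system would fall back to global matching when the window contains no acceptable candidate.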
2.1.2   Flow-based tracking methods
Flow-based methods perform tracking rather than matching by incorporating an idea similar to optical flow estimation[29]. A warping function is applied through photometric error minimization. The main difference among existing methods lies in the motion space: commonly, the optical flow is represented in the Euclidean space[30], while local homogeneity in an affine space[23,31] and global homogeneity in Lie Algebra[2,25] are also encouraged. From the similarity reasoning perspective, establishing correspondences through optical flow estimation in the Euclidean space or affine space falls into a sparse indirect category, as these parallax-aware correspondences follow a feature-extraction-feature-matching framework. In contrast, Lie Algebra is usually denoted as a direct method in a semi-dense[32,33] or dense[34,35] way, without a specific matching procedure.
Flow-based tracking methods better reveal the flow basis underlying the SLAM problem: assuming a small displacement between two consecutive frames, brightness changes can be represented with a partial differential equation (PDE). This PDE-based formulation leads to an incremental method of computation, where historical estimations are temporally propagated. Compared with the descriptor-based matching methods, flow-based tracking methods are highly efficient and parallelizable[2,23]. However, in contrast to descriptor-based matching methods that have analytical solutions, flow-based tracking methods suffer from a non-convex issue, similar to optical flow estimation. The small motion and brightness constancy assumption can be easily violated in reality. Hence, good initialization (motion prediction) is crucial to flow-based tracking methods to avoid being trapped in a local minimum in cases such as abrupt motion and illumination changes.
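As a minimal illustration of the small-displacement assumption, the sketch below linearizes brightness constancy (I_x u + I_y v + I_t = 0) and solves for a single translation over a whole synthetic image in the least-squares sense, in the spirit of a Lucas-Kanade step. The synthetic image, the global-translation model, and the function name are assumptions of this example; real systems iterate, use image pyramids, and track per patch or per pose.

```python
import numpy as np

def lucas_kanade_step(I0, I1):
    """One linearized least-squares estimate of a global flow (u, v)."""
    Iy, Ix = np.gradient(I0)      # derivatives along rows (y) and columns (x)
    It = I1 - I0                  # temporal derivative
    inner = (slice(1, -1), slice(1, -1))  # skip one-sided border gradients
    A = np.stack([Ix[inner].ravel(), Iy[inner].ravel()], axis=1)
    rhs = -It[inner].ravel()
    flow, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return flow

def image(x, y):
    """Smooth synthetic intensity pattern (assumed for the demo)."""
    return np.sin(0.2 * x) + np.cos(0.15 * y)

ys, xs = np.mgrid[0:64, 0:64].astype(float)
true_flow = np.array([0.3, -0.2])
I0 = image(xs, ys)
I1 = image(xs - true_flow[0], ys - true_flow[1])  # content shifted by true_flow

flow = lucas_kanade_step(I0, I1)
print(flow)  # close to the true sub-pixel shift
```

With a sub-pixel shift the linearization holds and one step is nearly exact. For a shift of several pixels the same step would be badly biased, which is why a motion prediction is needed as the starting point.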
2.2 Loop closure detection
Loop closure detection aims at finding the most similar pre-visited place compared with the current frame. Hence, the loop closing status serves as a factor to enforce global consistency and provide the real topology of the unknown environment[8]. The performance of a loop closure detection system is evaluated through a precision-recall curve. In practice, 100% precision is the prerequisite, as false positives corrupt the system catastrophically, and higher recall indicates a more robust system against a changing environment. For a real-time system, the computational time of loop closure detection is strictly limited, although it is performed in a back-end. As illustrated in Figure 3, the basic idea of loop closure detection is to build a database that stores sequential images for efficient look-up.
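Because false positives are not tolerated, a common summary of a precision-recall curve is the recall achieved at 100% precision. The sketch below computes it from similarity scores and ground-truth labels; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def recall_at_full_precision(scores, is_true_loop):
    """Fraction of true loops accepted by the loosest threshold that still
    rejects every false candidate."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_true_loop, dtype=bool)
    n_true = labels.sum()
    if n_true == 0:
        return 0.0
    false_scores = scores[~labels]
    # Any accepted candidate must score above the best false candidate.
    threshold = false_scores.max() if false_scores.size else -np.inf
    accepted = labels & (scores > threshold)
    return accepted.sum() / n_true

# Toy candidates (assumed): three true loops, two false ones.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [True, True, False, True, False]
recall = recall_at_full_precision(scores, labels)
print(recall)
```

In this toy case the true loop scoring 0.6 is lost because a false candidate scores 0.7, so two of three loops are recovered.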
2.2.1   Standard appearance-based methods
For the loop detection task in the SLAM problem, the representative FAB-MAP[37] and DBoW2[27] dominate the field. Similar to prevalent SfM methods, these algorithms utilize a bag-of-words (BoW) model to discretize the feature descriptor space into a hierarchical tree structure, namely, a vocabulary. Each type of descriptor is then categorized as a word on the vocabulary tree. Through this tree-structured database, the entire image sequence can be quantized and stored compactly, allowing efficient retrieval. In contrast to feature descriptor discretization, other methods utilize the entire image information for compression in a downsampled form[38] or with randomized ferns[36]. We refer readers to a previous work[39] for a comprehensive review of visual place recognition (Figure 4).
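The quantize-then-retrieve idea can be sketched as follows. For brevity this example assumes a flat vocabulary with nearest-word assignment and cosine similarity on raw word counts; it is not the hierarchical tree, the weighting scheme, or the API of FAB-MAP or DBoW2, and all names and data are hypothetical.

```python
import numpy as np

def quantize(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word (Euclidean)."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bow_vector(descriptors, vocabulary):
    """L2-normalized histogram of visual words for one image."""
    words = quantize(descriptors, vocabulary)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def most_similar(query_vec, database):
    """Index and cosine similarity of the best-matching stored image."""
    scores = database @ query_vec
    best = int(scores.argmax())
    return best, float(scores[best])

# Toy vocabulary of three 2D "words" and two stored images (assumed data).
vocabulary = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
image_a = np.array([[0.1, 0.2], [-0.3, 0.1], [9.8, 0.2]])
image_b = np.array([[0.2, 9.9], [0.1, 10.3], [10.1, -0.2]])
database = np.stack([bow_vector(image_a, vocabulary),
                     bow_vector(image_b, vocabulary)])

query = np.array([[0.0, 0.3], [0.2, -0.1], [10.2, 0.1]])
best, score = most_similar(bow_vector(query, vocabulary), database)
print(best, score)
```

Each image is reduced to one short vector, so retrieval is a matrix-vector product instead of exhaustive descriptor matching.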
2.2.2   Flow-aided method
For a flow-based SLAM system, incremental updating is one notable trait. The word quantization is constructed and stored on-the-fly, which preserves spatiotemporal information of the environment. However, the commonly used methods[27,37] require clustering or classification as a pre-training process for vocabulary construction. This presents the issue that the word quantization may be inaccurate, as the words from the query frame are inconsistent with the predefined and environment-dependent vocabulary. Hence, agglomerative clustering strategies are adopted for online vocabulary formulation[42,43,44,45], where the bag-of-words model is more adaptive to a live status.
Meanwhile, the temporal cues are leveraged to further verify the detection results. A temporal consistency check is commonly used to determine whether the loop closure candidate should be accepted[27]. Furthermore, CAT-SLAM[40] models a continuous trajectory for a better frequency and adopts a particle filter to assign weights for metric similarity and appearance similarity; SeqSLAM[41] utilizes the local sequence for matching. These additional temporal constraints notably alleviate the perceptual aliasing induced by environment changes.
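A temporal consistency check can be as simple as requiring that several consecutive query frames retrieve neighboring database frames. The sketch below is one simplified reading of that idea; the parameters `k` and `max_gap` and the exact acceptance rule are assumptions, not the rule used by any cited system.

```python
def temporally_consistent(candidates, k=3, max_gap=2):
    """Accept the newest loop candidate only if the last `k` queries all
    produced matches whose database indices stay within `max_gap` of the
    previous one.

    `candidates` holds, per consecutive query frame, the index of the best
    database match, or None when retrieval found nothing.
    """
    if len(candidates) < k:
        return False
    recent = candidates[-k:]
    if any(c is None for c in recent):
        return False
    return all(abs(b - a) <= max_gap for a, b in zip(recent, recent[1:]))

print(temporally_consistent([10, 11, 12]))   # consistent run
print(temporally_consistent([10, 57, 12]))   # isolated jump, rejected
```

An isolated high-scoring match, which is the typical signature of perceptual aliasing, fails the check because its neighbors in time do not agree with it.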
3 Joint pose and map recovery
Given associated observations, SLAM mainly deals with interferences caused by noise and outliers and is thus formulated in a probabilistic way by estimating the posterior probability of all camera poses $x_{1:k}$ and the map $m = \{l_1, \ldots, l_m\}$ given observations $z_{1:k}$ and control signals $u_{1:k}$ as follows:
$$P(x_{1:k}, m \mid z_{1:k}, u_{1:k}, x_1) \tag{1}$$
As illustrated in Figure 5(a), the state space grows continuously over time, which may lead to an intractable implementation. Hence, the solution to (1) is separated into four divergent methods of temporal information utilization for better computational efficiency: the filtering-based method assumes a Markov process to marginalize all previous states and updates states in a recursive way; windowed optimization marginalizes out the states outside the sliding window to maintain a constant-time complexity, while local constraints are enforced within the sliding window for better robustness; keyframe-based optimization conducts keyframe insertion and culling on-the-fly and simply discards states that are not heuristically chosen as keyframes; incremental optimization leverages advances in matrix theory to perform global optimization by incrementally updating the information matrix.
3.1 Filtering-based methods
Filter-based methods maintain the instant probability distribution and hence are also known as online SLAM. The Bayes theorem is applied to ensure a prediction-updating fashion[7]:
$$P(x_k, m \mid z_{1:k-1}, u_{1:k}, x_1) = \int P(x_k \mid x_{k-1}, u_k)\, P(x_{k-1}, m \mid z_{1:k-1}, u_{1:k-1}, x_1)\, \mathrm{d}x_{k-1} \tag{2}$$
$$P(x_k, m \mid z_{1:k}, u_{1:k}, x_1) \propto P(z_k \mid x_k, m)\, P(x_k, m \mid z_{1:k-1}, u_{1:k}, x_1) \tag{3}$$
The recursive state transition from time $k-1$ to time $k$ is achieved with a motion model $P(x_k \mid x_{k-1}, u_k)$ and an observation model $P(z_k \mid x_k, m)$. This prediction-updating fashion reveals the flow-based nature of the SLAM problem. The pioneering MonoSLAM[18,46] uses a 6-degree-of-freedom (DoF) pose and a 3D feature representation for state propagation through an extended Kalman filter (EKF). By linearizing both models and assuming a Gaussian distribution, the system achieves real-time performance over a long time period. On the basis of the prediction-updating framework, other variants of EKF are applied to deal with non-Gaussian and nonlinear cases: unscented Kalman filter-based[47] methods use a sigma-point propagation as a nonlinear solution; particle filter-based methods[48] forgo the Gaussian assumption and operate in a non-parametric manner.
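The prediction-updating recursion is easiest to see in the linear-Gaussian case, where it reduces to a Kalman filter. The sketch below tracks a scalar position only; the map is left out of the state and the noise variances are made up, so it illustrates the recursion rather than an EKF-SLAM implementation.

```python
def predict(mean, var, u, q):
    """Prediction with motion model x_k = x_{k-1} + u_k + w, w ~ N(0, q)."""
    return mean + u, var + q

def update(mean, var, z, r):
    """Update with observation model z_k = x_k + v, v ~ N(0, r)."""
    gain = var / (var + r)            # Kalman gain
    return mean + gain * (z - mean), (1.0 - gain) * var

# One filter cycle on assumed numbers.
mean, var = 0.0, 1.0                  # belief at time k-1
mean, var = predict(mean, var, u=1.0, q=0.5)
print(mean, var)                      # prior at time k
mean, var = update(mean, var, z=2.0, r=0.5)
print(mean, var)                      # posterior at time k
```

Only the latest belief is carried forward: the prediction inflates the variance, and the observation shrinks it again, without revisiting any earlier state.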
Although filtering-based methods fully respect the temporal information propagation in an elegantly incremental and probabilistic way, issues arise owing to practical operations. As illustrated in Figure 6, on one hand, because of the nonlinear and non-Gaussian properties of SLAM, any form of approximation inevitably introduces inconsistency to the system[50], which leads to error accumulation that slowly corrupts the system. On the other hand, the elimination of past states leads to matrix fill-in and makes landmark states quickly become fully connected. This issue makes the computational cost quadratic with respect to the number of landmarks[5] and causes a severe scale limitation.
The inconsistency issue was thoroughly reviewed in previous studies[51,52]. Huang et al.[53] revisited the problem with an observability analysis and introduced a First-Estimates Jacobian (FEJ) EKF. By evaluating the Jacobians at the same linearization point, the FEJ-EKF ensures the same dimension of observable subspace between the linearized error-state system and the actual SLAM system, thus alleviating the overconfident estimates. For reduced complexity, FastSLAM[54] leverages the conditional independences in the SLAM problem to obtain a factorized form, where the complexity is significantly reduced by decomposing the landmark and pose distribution. One remarkable finding in the SLAM community is the sparse structure of the information matrix. Hence, an information filter[55], which propagates the information matrix rather than the covariance matrix, is adopted. This insight is also adopted by optimization-based SLAM, which is discussed later in this paper. Another trend is the introduction of sub-maps[56] that divide the environment into local groups, thus ensuring a bounded size.
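The factorized form that FastSLAM relies on can be written with the symbols used earlier in this section. The expression below is the standard Rao-Blackwellized decomposition, stated here as a sketch; conditioning of the landmark terms on the controls is omitted for brevity.

```latex
P(x_{1:k}, m \mid z_{1:k}, u_{1:k}, x_1)
  = P(x_{1:k} \mid z_{1:k}, u_{1:k}, x_1)
    \prod_{j=1}^{m} P(l_j \mid x_{1:k}, z_{1:k})
```

Given the whole trajectory, the landmarks are conditionally independent, so each one can be estimated by its own small filter while the trajectory posterior is handled separately.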
Although filtering-based methods are theoretically suited to the continuous and incremental SLAM system, Strasdat et al. presented a detailed comparison between filtering-based and optimization-based methods and argued that optimization-based methods are more beneficial for a modern application[5]. Since that study, optimization-based methods have attracted the majority of research attention. Nevertheless, recent progress reveals great potential: StructSLAM[15] leveraged robust data associations with a novel representation to achieve stable and accurate results in a Manhattan-world scenario; Lenac et al. implemented a Lie group while maintaining a sparse information matrix, achieving results comparable to those of graph-based optimization[57]. Owing to recent progress with regard to parametrization, sensors, features, and theoretical tools, the covariance-aware and incrementally updated framework may see a revival with more robust and adaptive performance.
3.2 Optimization-based methods
In contrast to filtering-based methods that value temporal propagation, optimization-based methods highlight the spatial connectivity of the scene graph. As these methods retain the entire historic state space, they are denoted as full SLAM[58] or smoothing and mapping (SAM)[59] and are formulated as a maximum a posteriori (MAP) problem:
$$\{x_{1:k}, m\} = \arg\max \; P(x_1) \prod_{i=2}^{k} P(x_i \mid x_{i-1}, u_i) \prod_{j=1}^{m} P(z_j \mid x_{i_j}, l_j) \tag{4}$$
By assuming a Gaussian distribution, (4) can then be transformed into a least-squares error-minimization problem, and the conditional distributions are treated as factors that enforce constraints between variables. Through Gauss-Newton iteration or the Levenberg-Marquardt algorithm, the minimum of the error function is reached by solving a sequence of linear systems:
$$H \Delta\theta = b \tag{5}$$
Dellaert and Kaess investigated the sparse nature of the SLAM problem and introduced the seminal square root SAM ($\sqrt{\text{SAM}}$)[59] approach to implement sparse linear algebra, where the matrix $H$ is factorized for efficient inversion. Recently, open-source libraries such as g2o[60] and the Ceres solver[61] have been commonly used, which exploit the sparse Jacobian structure for efficient batch optimization. One notable issue remains: the retained states continuously grow over time, which may lead to computational explosion. Hence, for optimization-based SLAM that deals with the entire sequence, the major concern lies in the treatment of ever-growing data, which presents a tradeoff between information usage and computational speed.
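The repeated solution of a linear system H Δθ = b can be illustrated on a toy problem. The sketch below runs Gauss-Newton to localize a 2D point from range measurements to known landmarks, forming H = JᵀJ and b = -Jᵀr at each step. The problem, the names, and the dense solve are assumptions of this example; libraries such as g2o and Ceres use sparse solvers and their own interfaces, which are not shown here.

```python
import numpy as np

def gauss_newton_localize(landmarks, ranges, p0, iterations=10, tol=1e-10):
    """Estimate a 2D position from ranges to known landmarks."""
    p = np.asarray(p0, dtype=float).copy()
    for _ in range(iterations):
        diff = p - landmarks                    # (n, 2)
        dist = np.linalg.norm(diff, axis=1)
        r = dist - ranges                       # residual per measurement
        J = diff / dist[:, None]                # d r_i / d p
        H = J.T @ J                             # Gauss-Newton approximation
        b = -J.T @ r
        delta = np.linalg.solve(H, b)           # H * delta = b
        p += delta
        if np.linalg.norm(delta) < tol:
            break
    return p

# Toy setup (assumed): four landmarks, noise-free ranges to the point (3, 4).
landmarks = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
p_true = np.array([3.0, 4.0])
ranges = np.linalg.norm(p_true - landmarks, axis=1)

p_est = gauss_newton_localize(landmarks, ranges, p0=[5.0, 5.0])
print(p_est)
```

In SLAM the parameter vector stacks all poses and landmarks, so H becomes very large but sparse, which is what the factorization-based solvers exploit.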
3.2.1   Keyframe-based batch optimization
As illustrated in Figure 5(d), keyframe-based optimization methods select keyframes and map points heuristically and simply discard the remaining information, thus avoiding redundancies and keeping the problem as compact as possible. Through the survival-of-the-fittest strategy, the matrix dimension is reduced significantly to ensure a large-scale implementation, and the matrix sparsity is carefully maintained. Two representative methods include PTAM[19] and ORB-SLAM[14], which allow real-time implementations with promising robustness.
Keyframe-based methods are more similar to a standard SfM algorithm and perform graph pruning on-the-fly, where the size of the graph grows over time but with a far lower speed. Frame selection and thread splitting play central roles in a keyframe-based optimization method. These operations allow accurate but computationally expensive bundle adjustment to be applied in a back end without sacrificing the real-time implementation. With the delicately designed implementation, ORB-SLAM[14,26] achieves a great balance between efficiency, accuracy, and robustness and serves as a benchmark for modern SLAM algorithms.
3.2.2   Windowed optimization
As illustrated in Figure 5(c), windowed optimization performs local optimization over a small set of time-successive frames, where states outside the sliding window are marginalized out. It is also known as fixed-lag smoothing. The fixed size of the sliding window ensures bounded complexity for online and constant-time optimization while maintaining more historic states as local constraints compared with a filter-based method. Additionally, the marginalization ensures temporal propagation without information loss.
As illustrated in Figure 7, windowed optimization[62] can be considered as a compromise between the computationally expensive batch optimization and the less accurate filtering-based solution. However, it suffers from issues similar to those of both methods: the marginalization operation brings inconsistency and fill-in issues, as mentioned previously. Fill-in may not make the matrix fully dense as in a filtering-based method. Nonetheless, the sparsity of the matrix will certainly be affected, which limits the window to a small size for real-time implementation. Similar to filtering-based methods, the choice of linearization points is the key to the inconsistency issue. Commonly, prior[25,63] or optimal[64] linearization points are selected to ensure the same observability properties between the linearized system and the actual system.
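Marginalization and the fill-in it causes can be reproduced on a very small information matrix. The sketch below uses the Schur complement to marginalize one scalar pose that observes three scalar landmarks; the toy factor weights and the function name are assumptions of this example.

```python
import numpy as np

def marginalize(H, b, drop):
    """Marginalize the variables in `drop` from the system H x = b
    (information form) using the Schur complement."""
    keep = [i for i in range(H.shape[0]) if i not in drop]
    H_kk = H[np.ix_(keep, keep)]
    H_kd = H[np.ix_(keep, drop)]
    H_dd_inv = np.linalg.inv(H[np.ix_(drop, drop)])
    H_marg = H_kk - H_kd @ H_dd_inv @ H_kd.T
    b_marg = b[keep] - H_kd @ H_dd_inv @ b[drop]
    return H_marg, b_marg

# Variable order: [pose, l1, l2, l3]. A unit prior on the pose plus one unit
# observation factor per landmark; landmarks are not linked to each other.
H = np.array([[ 4.0, -1.0, -1.0, -1.0],
              [-1.0,  1.0,  0.0,  0.0],
              [-1.0,  0.0,  1.0,  0.0],
              [-1.0,  0.0,  0.0,  1.0]])
x_true = np.array([1.0, 2.0, 3.0, 4.0])
b = H @ x_true

H_marg, b_marg = marginalize(H, b, drop=[0])
print(H_marg)                                # every landmark pair is now linked
print(np.linalg.solve(H_marg, b_marg))       # landmark estimates are preserved
```

The landmark block starts diagonal and ends fully dense, while the solution for the remaining variables is unchanged. This is the trade made by a sliding window: no information is lost, but sparsity is.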
3.2.3   Incremental optimization
Incremental optimization exploits the incremental nature of the SLAM problem to alleviate unnecessary calculations. Owing to the on-the-fly graph structuring, new observations only have local effects, while a large portion of the graph remains untouched. From this perspective, pre-calculated components can be reused to only update entries that are affected by new measurements. As SLAM struggles to compress information for a tradeoff between efficiency and accuracy, the reduced computation arising from the incremental optimization allows a larger parameter space to be optimized globally, where a longer sequence, more map points, and additional variables can be employed for better robustness and global accuracy.
The seminal work of iSAM[65] leverages an incremental QR factorization for efficient nonlinear optimization. As illustrated in Figure 8, new entries under the diagonal are zeroed out with Givens rotations, and most of the matrix is unchanged. To handle the fill-in and nonlinear issues, variable reordering and relinearization are implemented periodically in a batch mode, where sparsity is carefully maintained. AprilSAM[66] takes this one step further by adaptively selecting between incremental updating and batch updating. Cholesky factorization replaces the previous QR factorization to reduce the number of nonzero components.
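The Givens-rotation update can be sketched in a few lines of dense NumPy. The function below folds one new measurement row into an existing upper-triangular factor R and skips columns where the new row is already zero; it is an illustration of the idea only, with hypothetical names, and it ignores the right-hand side, variable ordering, and the sparse data structures used by iSAM.

```python
import numpy as np

def givens_add_row(R, w):
    """Return the updated upper-triangular factor after appending row `w`."""
    R = R.astype(float).copy()
    w = w.astype(float).copy()
    for j in range(R.shape[0]):
        if w[j] == 0.0:
            continue                      # nothing to eliminate in this column
        r = np.hypot(R[j, j], w[j])
        c, s = R[j, j] / r, w[j] / r
        row_j, rest = R[j, j:].copy(), w[j:].copy()
        R[j, j:] = c * row_j + s * rest   # rotated row of R
        w[j:] = -s * row_j + c * rest     # entry w[j] becomes zero
    return R

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))               # assumed existing measurement Jacobian
R = np.linalg.qr(A)[1]                    # its 4x4 triangular factor
w = np.array([0.0, 0.0, 1.5, -0.7])       # new row touching only two variables

R_new = givens_add_row(R, w)
print(np.array_equal(R_new[:2], R[:2]))   # rows of untouched variables
```

Because the new row only involves the last two variables, the first two rows of R are left exactly as they were, which is the sense in which most of the matrix is unchanged.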
However, these methods still require periodical batch updating. In contrast, iSAM2[67] introduces an incremental variable reordering and fluid relinearization technique. The factor graph is transferred to a Bayes tree, which ensures a cheap inference along with a recursive variable reordering. The validity of linearization points is tracked to inform necessary relinearization. The insight behind the methods is that information is propagated upward to the root. When a new observation enters, the sub-trees below the new factor cliques are not affected. In contrast, SLAM++[68] takes incremental covariance recovery into consideration, offering feedback of information confidence. The confidence guides information and connection selection through online graph pruning. Hence, redundancy is eliminated to ensure both efficiency and scalability.
4 Learning-based methods
Deep neural networks have recently gained popularity in the computer-vision community, exhibiting remarkable performance. However, the SLAM problem is more systematic than a single estimation problem. The requirement of a failsafe mechanism and a real-time implementation makes it difficult to design a SLAM system in a completely end-to-end learning fashion. In this section, we mainly categorize existing learning-based methods into two types: the first type follows the traditional SLAM pipeline and utilizes the progress of learning techniques in relevant sub-tasks to boost the performance of the SLAM system; the second type formulates the SLAM problem as a regression problem in an end-to-end fashion, where neural networks are utilized to directly represent the generative model of the SLAM problem.
4.1 Learning-enhanced SLAM in traditional framework
Deep learning is commonly viewed as a powerful perceptual tool for vision problems. It is generally a data-driven technique and is capable of acquiring a high-level representation from a large amount of data[69,70]. In contrast to geometric computation, which formulates the problem mathematically, neural networks learn generative models directly from data. The capability of data compression is well-suited for the SLAM problem. Existing methods leverage the expressive representation of a single raw image with learning techniques to enhance the performance of certain modules within the traditional SLAM framework.
Methods that fall into this category typically employ learning techniques for data post-processing. The network output is expected to provide more information, along with the input single image. The flow basis is inherited in the traditional SLAM framework, while the neural network learns a fixed generative model without temporal information involved.
4.1.1   Deep feature for SLAM
For geometric computation, the invariance of features is leveraged for robust data association. However, the reliance on handcrafted feature extraction makes sparse SLAM vulnerable under a textureless scene or when motion blurring occurs. Finding a feature representation that remains invariant, distinctive, well-distributed, and sufficient under diverse scenarios is a major concern.
GCNv2[72] designs a correspondence network for a 3D projective geometry. A deep feature is trained through the metric learning technique to learn the explicit keypoint detector and descriptor. This lightweight structure can be adaptively incorporated into the modern ORB-SLAM[26] system with readily enhanced performance. This indicates that invariant feature learning[73,74,75,76] can be beneficial for the SLAM problem. In contrast, DeTone and Malisiewicz[71] simplified the SLAM problem as a point tracking task, as illustrated in Figure 9. The pipeline is similar to that of feature matching-based visual odometry, as the pose is estimated between two views without building a global map. However, it introduces the possibility that deep SLAM may be formulated as a series of end-to-end networks following the traditional pipeline.
Maintaining the invariant property under all circumstances is nontrivial. A recent trend is observed concerning finding a suitable feature for a specific task: Jayaraman et al.[77] and Agrawal et al.[78] used similar methods to learn feature representation specifically for an egocentric task; Schmidt and Roth[79] introduced a rotation-aware descriptor. The task-specific invariance is known as the equivariant property[80]. Although the aforementioned methods focus on the recognition task, equivariance feature learning studies the connections between feature-representation changes and the input image transformation, which better characterizes the invariance of a specific task.
4.1.2   High-level cues for SLAM
The large receptive field induced by the convolutional operation allows high-level scene understanding with the learning technique[81]. Recent trends indicate that this high-level information within a single image can handle issues that geometry computation suffers from.
One notable tendency is the incorporation of semantic cues. Geometry-based SLAM merely utilizes photometric and geometric information to constrain the system, while semantic information contains strong priors at a category level. A semantic label can be viewed as an invariant property to regularize the optimization that leads to a more reliable data association[82], even in a complex dynamic scene[83]. Dense mapping also benefits from additional semantic information. Maintaining a dense semantic map allows better scene understanding, which is meaningful for intelligent robot interaction[84]; shape priors help to reconstruct a parametrized fine-grained object surface[85,86]. Conversely, semantic reasoning can also benefit from the temporal correlation established by SLAM[87]. The application of semantic reasoning fully connects the advances in recognition, tracking, and modeling for a highly perceptual system.
There is another trend regarding nontrivial monocular depth estimation. This task is inherently ambiguous owing to its ill-posed nature. Nonetheless, after the introduction of an end-to-end network[88], deep monocular depth estimation tackles challenging cases for monocular SLAM, e.g., pure rotation[89] and scale ambiguity[90], and allows dense mapping based on a modern sparse SLAM system in real-time[91]. CodeSLAM[92] is one inspiring method. A compact form of the depth map is obtained through auto-encoder-like training. The coded depth ensures efficient optimization and a complete scene geometry, exhibiting superiority over the existing geometry-based SLAM system.
4.2 Learning-based SLAM in end-to-end framework
SLAM performs pose estimation and map (explicit or implicit) recovery simultaneously. Some methods view SLAM as an end-to-end pose regression problem. None of the methods that fall into this category can be viewed as a deep SLAM system; rather, they are merely steps toward deep SLAM. Absolute pose regression treats the training procedure as global map establishment, but the pose estimation is conducted in a known environment. Relative pose regression does not establish an explicit global map; hence, it is merely a visual odometry task, as global consistency can hardly be achieved.
4.2.1   Standard pose regression
Absolute pose regression, which is a visual localization task, is highly correlated to the relocalization and loop closure detection sub-problems of SLAM. The task aims to regress the absolute camera pose given one image in a pre-trained environment. The general pipeline is similar to that of a relocalization task in the traditional SLAM framework: the data are first encoded as deep features and then stored in an embedding. The major insight is to train the network instead of building an explicit map, thus maintaining a constant size without suffering from the linear-growth issue (Table 2). PoseNet[109] first applies a convolutional neural network (CNN) for 6-DoF camera localization, followed by a series of extensions with uncertainty reasoning[114] and geometric constraints[115]. Walch et al. took this one step further, performing feature selection for dimension reduction with an LSTM[116]. However, as illustrated in Figure 10a, an interesting theory was recently proposed by Sattler et al.[112]. They argued that current approaches are likely closer to image retrieval than to accurate pose estimation. Hence, additional studies are needed to explore the task further.
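To make the regression target concrete, the following minimal sketch shows one common way to score a predicted 6-DoF pose: a position error plus a weighted orientation error on unit quaternions. It is an illustration under stated assumptions rather than the loss of any particular method cited above; the function name `pose_regression_loss` and the weight `beta` are hypothetical.

```python
import math

def pose_regression_loss(pred_pos, pred_quat, true_pos, true_quat, beta=1.0):
    """Position error plus beta-weighted orientation error.

    Positions are 3-vectors; orientations are quaternions (w, x, y, z).
    `beta` (an assumed hyperparameter) balances the two terms.
    """
    # A regressed quaternion is generally not unit length, so normalize it.
    norm = math.sqrt(sum(q * q for q in pred_quat))
    if norm == 0.0:
        raise ValueError("predicted quaternion must be non-zero")
    unit_quat = [q / norm for q in pred_quat]

    # Euclidean distance between predicted and ground-truth position.
    pos_err = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_pos, true_pos)))
    # Euclidean distance between the unit quaternions.
    rot_err = math.sqrt(sum((q - t) ** 2 for q, t in zip(unit_quat, true_quat)))
    return pos_err + beta * rot_err
```

Note that this simple form treats q and -q as different orientations although they encode the same rotation; a practical implementation would need to resolve that sign ambiguity.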
Representative relative pose regression approaches

| Supervisory | Reference | Input data | Architecture | Use of optical flow | Novel loss function |
| --- | --- | --- | --- | --- | --- |
| Unsupervised | SfMLearner[93] | Image triplet | CNN | Training-data rejection with mean optical flow magnitude | Warping loss |
| Unsupervised | Depth-VO-Feat[94] | Image pair | CNN | - | Feature reconstruction loss |
| Unsupervised | UnDeepVO[95] | Image pair | CNN | - | Stereo imagery for supervision |
| Unsupervised | Vid2Depth[96] | Image pair | CNN | - | 3D ICP loss |
| Unsupervised | GeoNet[97] | Image pair | CNN | Joint estimation of rigid flow and residual flow | Geometric consistency check |
| Unsupervised | UnDeMoN[98] | Image pair | CNN | - | Charbonnier penalty |
| Unsupervised | GANVO[99] | Image triplet | RNN+GAN | - | GAN loss |
| Supervised | P-CNN[100] | Optical flow | CNN | Optical flow to pose | Root mean squared loss |
| Supervised | DeepVO[101] | Video | RNN | FlowNet[102] encoder | Mean squared loss |
| Supervised | ESP-VO[103] | Video | RNN | FlowNet encoder | Covariance incorporation |
| Supervised | DeMoN[104] | Image pair | CNN | Optical flow as a supervised output | - |
| Supervised | DeepTAM[105] | Image pair | CNN | Optical flow as an auxiliary task | Uncertainty loss |
| Supervised | GFS-VO[106] | Video | RNN | FlowNet encoder | Separate rotation and translation loss |
| Supervised | L-VO[107] | Image pair | CNN | 2.5D scene flow to pose | Bivariate Gaussian loss |
| Supervised | VOMachine[108] | Video | RNN | FlowNet encoder | Global and local losses |

Notes: There is an obvious tendency for supervised methods to utilize temporal information compared with unsupervised methods.
In contrast, relative pose regression deals with the visual odometry task given at least two consecutive images. Representative methods mainly focus on designing a loss function to jointly constrain depth and pose estimation in an unsupervised manner, as illustrated in Figure 10b. Commonly used losses include the warping loss[93], 3D differentiable iterative closest point (ICP) loss[96], spatial and temporal consistency check[95], feature reconstruction loss[94], and modified warping loss with a Charbonnier penalty function[98]. GeoNet[97] jointly learns the depth, camera pose, and optical flow, exhibiting potential for applications in a dynamic scene. The constraints for a geometry-based method are commonly adopted as self-supervision. However, the continuous nature of visual odometry is not explicitly revealed.
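As a rough illustration of the losses listed above, the sketch below applies a Charbonnier penalty to per-pixel photometric residuals between a target image and a source image that is assumed to have already been warped into the target view using the predicted depth and relative pose. The warping step itself is omitted, and the function names and the `eps` default are assumptions made for this example, not taken from the cited methods.

```python
import math

def charbonnier(residual, eps=1e-3):
    """Charbonnier penalty sqrt(r^2 + eps^2): a smooth, robust stand-in for |r|."""
    return math.sqrt(residual * residual + eps * eps)

def warping_loss(target, warped, eps=1e-3):
    """Mean Charbonnier penalty over per-pixel photometric residuals.

    `target` and `warped` are flat lists of intensities of equal length.
    `warped` is assumed to be the source view resampled into the target
    view with the predicted depth and pose (that step is not shown here).
    """
    if len(target) != len(warped):
        raise ValueError("images must have the same number of pixels")
    if not target:
        raise ValueError("images must not be empty")
    total = sum(charbonnier(t - w, eps) for t, w in zip(target, warped))
    return total / len(target)
```

Because the penalty grows roughly linearly for large residuals instead of quadratically, a few badly warped pixels (for example, on moving objects or at occlusions) influence the loss less than they would under a squared error.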
4.2.2   Flow-based pose regression
Compared with the standard pose regression networks, some methods realize the continuous nature of the SLAM problem and take advantage of the temporal cues. For the absolute pose regression problem, an issue arises: a single image is visually ambiguous owing to similar textures or appearance changes[39]. With available sequential data, temporal constraints can be enforced for better robustness: VidLoc[117] utilizes a bidirectional LSTM to obtain spatiotemporal features, and MapNet[110] enforces temporal smoothness with a relative pose constraint through pose graph optimization. The use of a temporal constraint increases the accuracy, similar to the advances described in Section 2.2.
With regard to the relative pose regression problem, several supervised methods take the continuous nature inherent in a visual odometry task into consideration, as indicated by Table 2. Pose can be regressed directly from 2.5D scene flow[107], as the estimated flow establishes temporal correspondence explicitly. Because optical flow estimation is computationally expensive, the encoder of FlowNet[102] is used for spatiotemporal feature extraction[101,103,106,108]. These methods share a similar video sequence input and a recurrent neural network (RNN) architecture for temporal information storage, while focusing on different perspectives, such as uncertainty reasoning[103] and feature selection[106]. In contrast, DeMoN[104] and DeepTAM[105] pay more attention to better depth recovery. Benefiting from a motion stereo constraint, the depth estimation results of both methods are visually appealing and quantitatively promising. Recently, Xue et al. followed the traditional SLAM pipeline to leverage temporal information for robust relative pose estimation[108]. They introduced a memory module for local map construction and a refining module for global pose optimization. Benefiting from the selected temporal information propagation, the method achieves top-tier performance.
5 Recent trends and open problems
Thus far, we described how state-of-the-art SLAM algorithms benefit from the flow basis. In this section, we mainly discuss recent trends in the SLAM community and other frontiers in relevant areas. Additionally, we present our take on existing challenges and remaining issues, exploring topics worthy of further investigation for exploiting the temporally continuous property.
5.1 Sensor fusion
As mentioned previously, good motion prediction and good feature correspondence are essential for the SLAM problem, yet visual information alone is ambiguous. One key property of SLAM is the availability of multi-sensor fusion[8]. The incorporation of other sensors, e.g., inertial sensors and event cameras, can compensate for the deficiencies of purely visual observations. Additionally, their high sampling rate ensures a good linear characteristic between consecutive measurements and is well suited to a flow-based SLAM problem.
5.1.1   Inertial sensor
Visual-inertial fusion is not a new topic, as it has been studied for a long time[118]. Nevertheless, owing to the low cost, portability, and reliable short-term motion constraints, this field has recently attracted research attention with convenient applications for unmanned aerial vehicles and handheld mobile devices. Visual cues and inertial signals are complementary: inertial sensors provide more accurate short-term constraints, which ensure reliable state propagation; visual cues allow long-term correction, hence eliminating the error accumulation. Notably, inertial-aided visual SLAM is more robust than conventional SLAM in challenging cases, e.g., illumination changes, low texture, motion blurring, and abrupt motion.
According to experimental results, the involvement of inertial sensors significantly improves the performance of methods that rely greatly on temporal propagation. We refer readers to a previous work[13] for a comprehensive evaluation. Usenko et al.[119] proposed the integration of an inertial measurement unit (IMU) with stereo LSD-SLAM[16] for enhanced performance. A similar trend is observed in VI-DSO[120] with a direct front end. Notably, the filter-based VIO methods[121] achieve results comparable to those of optimization-based methods[122]. This is mainly due to the IMU-driven state propagation that produces an accurate motion model for pose prediction.
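The IMU-driven state propagation mentioned above can be sketched as simple dead reckoning. The example below is a deliberately simplified planar version: it assumes that gravity and sensor biases have already been removed from the measurements and ignores noise and covariance propagation, all of which a real visual-inertial system must handle. The state layout and function names are assumptions made for illustration.

```python
import math

def propagate_imu(state, gyro_z, accel_body, dt):
    """Propagate a planar state (x, y, yaw, vx, vy) over one IMU sample.

    gyro_z is the yaw rate in rad/s; accel_body is the (ax, ay) acceleration
    in the body frame, assumed to be free of gravity and bias.
    """
    x, y, yaw, vx, vy = state
    # Rotate the body-frame acceleration into the world frame.
    c, s = math.cos(yaw), math.sin(yaw)
    ax = c * accel_body[0] - s * accel_body[1]
    ay = s * accel_body[0] + c * accel_body[1]
    # Constant-acceleration integration over the sample interval.
    x += vx * dt + 0.5 * ax * dt * dt
    y += vy * dt + 0.5 * ay * dt * dt
    vx += ax * dt
    vy += ay * dt
    yaw += gyro_z * dt
    return (x, y, yaw, vx, vy)

def predict_pose(state, imu_samples, dt):
    """Chain all IMU samples received between two camera frames."""
    for gyro_z, accel_body in imu_samples:
        state = propagate_imu(state, gyro_z, accel_body, dt)
    return state
```

The state returned by `predict_pose` would serve as the initial guess for the next visual update, which in turn corrects the drift that accumulates from integrating noisy inertial signals.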
Apart from this flow basis, we also find an interesting case: although VIORB[123] tackles the challenging V2_03_difficult sequence, its improvement over the original ORB-SLAM[14] is not overwhelming. This validates our previous argument that ORB-SLAM is more like an online SfM method than a flow-based SLAM method. The delicately designed system ensures a globally optimal solution in most cases. Nonetheless, the inertial sensor improves the robustness and provides an absolute scale.
5.1.2   Event camera
Event cameras capture an asynchronous sequence of intensity changes with precise timestamps. This type of sensor is known for its low cost and latency, along with its wide dynamic range, and makes a SLAM system applicable under extreme conditions with regard to illumination and motion. The sparse and binary event signals avoid redundant data and allow extremely efficient computation with a low bandwidth and memory cost. Additionally, events mainly occur around the image edges[124] and are highly motion-aware. These properties make the event camera a perfect complement to the current visual SLAM system.
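To illustrate the data format, the sketch below treats each event as a tuple of timestamp, pixel coordinates, and a binary polarity, and accumulates the events that fall inside a time window into a signed image. This is one simple way to visualize or preprocess an event stream, not the representation used by any specific method cited here; the tuple layout and function name are assumptions.

```python
def accumulate_events(events, width, height, t_start, t_end):
    """Sum event polarities per pixel over the window [t_start, t_end).

    Each event is a tuple (t, x, y, polarity), where polarity is True for
    a brightness increase and False for a decrease.
    """
    frame = [[0 for _ in range(width)] for _ in range(height)]
    for t, x, y, polarity in events:
        # Skip events outside the time window or the sensor area.
        if t_start <= t < t_end and 0 <= x < width and 0 <= y < height:
            frame[y][x] += 1 if polarity else -1
    return frame
```

Because only pixels that observe a brightness change produce events, the accumulated image is mostly zero in static regions and concentrates around moving edges, which matches the sparsity and edge affinity described above.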
Recent methods indicate that a semi-dense SLAM system can be achieved with a single event camera[127,128] or by incorporating traditional visual cues[124,129]. The output of an event camera differs significantly from that of traditional visual sensors, which may trigger a paradigm shift[8]. Nevertheless, recent studies show that its high temporal resolution and salient characteristics enable long-term feature tracking[126], as illustrated in Figure 11, and sequence-based loop detection[130]. Hence, the optimal sensor for the visual SLAM community remains to be identified.
5.2 Continuous-time trajectory estimation
Taking the motion continuity into consideration, a series of continuous-time trajectory modeling methods have arisen. Continuous-time modeling allows the fusion of asynchronous sensors, such as inertial sensors[131] or event cameras[132], and the compensation of motion-distorted observations, such as those obtained using scanning laser rangefinders[133] or rolling shutter cameras[134]. As high-order differentiable representations are inherently more expressive, existing continuous-time trajectory estimation methods mainly leverage a temporal basis function or a non-parametric Gaussian process (GP) representation.
5.2.1   Temporal basis function
The use of a temporal basis function is commonly performed in a batch spline fitting manner[135], as the spline in Lie algebra closely matches torque-minimal trajectories[136]. Compared with a discrete-time representation, it produces a smooth trajectory with fewer state variables to be updated. The accuracy of a basis-function representation relies greatly on the model selection[137]. Additional issues are addressed with regard to the noise model[131] and temporal density[138]. Although this representation addresses practical issues that discrete-time SLAM methods encounter, a fair comparison for choosing the optimal knots and basis has not been presented.
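As a simplified illustration of a temporal basis function, the sketch below evaluates a uniform cubic B-spline over the translational part of a trajectory only. Each query time is a weighted blend of four neighboring control points, so the whole continuous trajectory is parametrized by a small set of control points. Orientation is deliberately left out, since it requires a Lie-group formulation; the function name and the uniform knot spacing `knot_dt` are assumptions made for this example.

```python
def cubic_bspline_position(control_points, knot_dt, t):
    """Evaluate a uniform cubic B-spline of positions at time t.

    Segment i spans [i * knot_dt, (i + 1) * knot_dt] and blends control
    points i .. i + 3, so N control points cover N - 3 segments.
    """
    n_segments = len(control_points) - 3
    if n_segments < 1:
        raise ValueError("need at least four control points")
    if not 0.0 <= t <= n_segments * knot_dt:
        raise ValueError("t is outside the spline's time span")

    # Segment index and normalized time u in [0, 1] within that segment.
    i = min(int(t / knot_dt), n_segments - 1)
    u = t / knot_dt - i

    # Uniform cubic B-spline basis weights; they always sum to one.
    weights = (
        (1 - u) ** 3 / 6.0,
        (3 * u**3 - 6 * u**2 + 4) / 6.0,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0,
        u**3 / 6.0,
    )
    dim = len(control_points[0])
    return tuple(
        sum(w * control_points[i + k][d] for k, w in enumerate(weights))
        for d in range(dim)
    )
```

Because the trajectory can be queried at any timestamp, measurements from asynchronous sensors can each be associated with the pose at their own capture time rather than being forced onto shared discrete states.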
5.2.2   Gaussian process
Modeling with a GP maintains a trajectory distribution and hence takes uncertainty into account[137]. Although the standard kernel makes the problem computationally expensive, Barfoot et al.[139,140] showed that applying a linear or nonlinear time-varying stochastic differential equation can lead to an exactly sparse matrix, yielding an efficient solution. From this perspective, the sparse nature of the SLAM problem along with the factor graph structure is maintained in a GP regression. This framework was extended to an incremental solution[141] and a Lie group parametrization[142].
5.3 Open problems
Because SLAM is a task-driven problem, challenging scenarios in practical applications of modern SLAM systems, such as illumination changes, low texture, and motion anomalies, are widely studied. Herein, we highlight several fundamental aspects where we believe better utilizing the temporal continuity may lead to enhanced performance.
5.3.1   Trackable and structural feature
As mentioned previously, ambiguous data association due to sensor noise is one of the most critical issues for the SLAM problem, and the trackability of features significantly affects the performance of the corresponding algorithm. Existing pointwise features and descriptors are cumbersome and discard structural connectivity. In real-world scenes, structural features such as lines and planes can be stably observed, and the continuous motion leads to steady feature changes. Hence, the temporal trackability of these structural features along with other high-level features is worthy of further investigation.
5.3.2   Predictable motion model
The continuous property makes SLAM a fully predictable problem, where prior information can provide reliable guidance. However, the prediction capability is not well studied. For data association, the motion model is assumed to be constant for a coarse initialization; continuous-time trajectory modeling approximates a smooth pose trajectory. However, existing methods focus more on the parametrization of the estimated trajectory for parameter space reduction or synchronization instead of the ability to predict the next move. Recent progress in visual-inertial fusion indicates that reliable motion prediction (pre-integration of inertial signals) can ensure convergence and enhance both the accuracy and efficiency. From this perspective, a reasonable motion predictor deserves attention.
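The constant motion model used for coarse initialization can be sketched as follows: the relative motion between the last two poses is assumed to repeat over the next interval. For brevity the example uses planar poses (x, y, heading) and does not wrap the heading angle; the pose layout and function names are assumptions for illustration only.

```python
import math

def compose(a, b):
    """Apply pose b, expressed in the frame of pose a, to obtain a world pose."""
    ax, ay, a_theta = a
    bx, by, b_theta = b
    c, s = math.cos(a_theta), math.sin(a_theta)
    return (ax + c * bx - s * by, ay + s * bx + c * by, a_theta + b_theta)

def inverse(a):
    """Inverse of a planar pose: (R^T, -R^T t)."""
    ax, ay, a_theta = a
    c, s = math.cos(a_theta), math.sin(a_theta)
    return (-(c * ax + s * ay), s * ax - c * ay, -a_theta)

def predict_next(prev_pose, curr_pose):
    """Constant-velocity prediction: replay the last relative motion once more."""
    delta = compose(inverse(prev_pose), curr_pose)
    return compose(curr_pose, delta)
```

Such a prediction only provides a starting point for data association and optimization; it carries no notion of uncertainty and degrades under abrupt motion, which is part of why a more capable motion predictor is of interest.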
5.3.3   Map updating and map representation
A map serves as a storage of flow information, and map updating captures the spatiotemporal invariance through streamed data. Hence, studies on temporal continuities indicate what type of map is needed in a specific scenario. A good method of map representation helps to retain more informative data for better robustness and allows efficient reasoning of sensor positions. In general, the map representation determines to what extent the historical information can be preserved, thus determining the perceptual ability and the self-awareness of the system in an unknown environment. Demands of an active and top-down system arise, revealing the potential of map-guided and map-dominated sensor exploration. We believe that a more informative and adaptive map can cause a paradigm shift for the SLAM problem.
5.3.4   Online learning
Current learning techniques perform batch training in a supervised or unsupervised manner. In contrast, SLAM deals with a continuously observing problem. We observe the advantages arising from sequential learning due to additional constraints derived from historic information. However, training with an image sequence cannot be implemented on-the-fly in real time unless online learning is achieved. Moreover, the generalization ability is vital to SLAM. Batch training only leads to a fixed generative model, while online learning brings adaptiveness to the network. We argue that a real-time, self-supervised, and incrementally updated learning method is one route toward robust perception, and online and lifelong learning may be a key that combines the perceptual tool of deep learning and the continuous nature of the SLAM problem.
6 Conclusion
We surveyed existing methods from a novel flow-based perspective. Owing to the requirements of practical applications, current SLAM systems focus on real-time performance and robustness. Approaches driven by this goal address information utilization and propagation well, commonly by exploiting temporal continuity. Major advantages arising from the temporally continuous property include temporal guidance, incremental computation, and the additional local constraint, which were comprehensively reviewed in the present paper, from geometry computation to learning-related methods, along with emerging sensor fusion tendencies and new modeling techniques.
Although the predictable property of the SLAM problem is profitable and offers great potential, the reliance on the temporal propagation inevitably leads to error accumulation due to noise, approximation, and erroneous estimation, which slowly corrupt the system. An uncertainty-aware and self-adaptive system is far from being reached. Moreover, existing learning-related methods hardly benefit from the continuous and incremental nature of the SLAM problem. We find that there is great room for improvement in this area, and we believe that a better understanding of the flow basis will be a driving force for research in the field.



Schonberger J L, Radenovic F, Chum O, Frahm J M. From single image query to detailed 3D reconstruction. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA, IEEE, 2015 DOI:10.1109/cvpr.2015.7299148


Engel J, Schöps T, Cremers D. LSD-SLAM: large-scale direct monocular SLAM// Computer Vision – ECCV 2014. Cham: Springer International Publishing, 2014, 834–849 DOI:10.1007/978-3-319-10605-2_54


Newcombe R A, Lovegrove S J, Davison A J. DTAM: Dense tracking and mapping in real-time. In: 2011 International Conference on Computer Vision. Barcelona, Spain, IEEE, 2011 DOI:10.1109/iccv.2011.6126513


Saputra M R U, Markham A, Trigoni N. Visual SLAM and structure from motion in dynamic environments. ACM Computing Surveys, 2018, 51(2): 1–36 DOI:10.1145/3177853


Strasdat H, Montiel J M M, Davison A J. Real-time monocular SLAM: Why filter? In: 2010 IEEE International Conference on Robotics and Automation. Anchorage, AK, USA, IEEE, 2010 DOI:10.1109/robot.2010.5509636


Bailey T, Durrant-Whyte H. Simultaneous localization and mapping (SLAM): Part II. IEEE Robotics & Automation Magazine, 2006, 13(3): 108–117 DOI:10.1109/mra.2006.1678144


Durrant-Whyte H, Bailey T. Simultaneous localization and mapping: Part I. IEEE Robotics & Automation Magazine, 2006, 13(2): 99–110 DOI:10.1109/mra.2006.1638022


Cadena C, Carlone L, Carrillo H, Latif Y, Scaramuzza D, Neira J, Reid I, Leonard J J. Past, present, and future of simultaneous localization and mapping: toward the robust-perception age. IEEE Transactions on Robotics, 2016, 32(6): 1309–1332 DOI:10.1109/tro.2016.2624754


Dellaert F, Kaess M. Factor graphs for robot perception. Foundations and Trends in Robotics, 2017, 6(1/2): 1–139 DOI:10.1561/2300000043


Grisetti G, Kummerle R, Stachniss C, Burgard W. A tutorial on graph-based SLAM. IEEE Intelligent Transportation Systems Magazine, 2010, 2(4): 31–43 DOI:10.1109/mits.2010.939925


Bresson G, Alsayed Z, Yu L, Glaser S. Simultaneous localization and mapping: A survey of current trends in autonomous driving. IEEE Transactions on Intelligent Vehicles, 2017, 2(3): 194–220 DOI:10.1109/tiv.2017.2749181


Haarbach A, Birdal T, Ilic S. Survey of higher order rigid body motion interpolation methods for keyframe animation and continuous-time trajectory estimation. In: 2018 International Conference on 3D Vision (3DV). Verona, Italy, IEEE, 2018 DOI:10.1109/3dv.2018.00051


Li J, Yang B, Chen D, Wang N, Zhang G F, Bao H J. Survey and evaluation of monocular visual-inertial SLAM algorithms for augmented reality. Virtual Reality and Intelligent Hardware, 2019, 1(1): 386–410 DOI:10.1016/j.vrih.2019.07.002


Mur-Artal R, Montiel J M M, Tardos J D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015, 31(5): 1147–1163 DOI:10.1109/tro.2015.2463671


Zhou H Z, Zou D P, Pei L, Ying R D, Liu P L, Yu W X. StructSLAM: visual SLAM with building structure lines. IEEE Transactions on Vehicular Technology, 2015, 64(4): 1364–1375 DOI:10.1109/tvt.2015.2388780


Engel J, Stuckler J, Cremers D. Large-scale direct SLAM with stereo cameras. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Hamburg, Germany, IEEE, 2015 DOI:10.1109/iros.2015.7353631


Whelan T, Salas-Moreno R F, Glocker B, Davison A J, Leutenegger S. ElasticFusion: Real-time dense SLAM and light source estimation. The International Journal of Robotics Research, 2016, 35(14): 1697–1716 DOI:10.1177/0278364916669237


Davison A J, Reid I D, Molton N D, Stasse O. MonoSLAM: real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6): 1052–1067 DOI:10.1109/tpami.2007.1049


Klein G, Murray D. Parallel tracking and mapping for small AR workspaces. In: 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality. Nara, Japan, IEEE, 2007 DOI:10.1109/ismar.2007.4538852


Pumarola A, Vakhitov A, Agudo A, Sanfeliu A, Moreno-Noguer F. PL-SLAM: Real-time monocular visual SLAM with points and lines. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). Singapore, Singapore, IEEE, 2017 DOI:10.1109/icra.2017.7989522


von Gioi R G, Jakubowicz J, Morel J M, Randall G. LSD: A fast line segment detector with a false detection control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(4): 722–732 DOI:10.1109/tpami.2008.300


Zhang L L, Koch R. An efficient and robust line segment matching approach based on LBD descriptor and pairwise geometric consistency. Journal of Visual Communication and Image Representation, 2013, 24(7): 794–805 DOI:10.1016/j.jvcir.2013.05.006


Qin T, Li P L, Shen S J. VINS-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Transactions on Robotics, 2018, 34(4): 1004–1020 DOI:10.1109/tro.2018.2853729


Lucas B D, Kanade T. An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence. 1981, 2: 674–679


Engel J, Koltun V, Cremers D. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(3): 611–625 DOI:10.1109/tpami.2017.2658577


Mur-Artal R, Tardos J D. ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 2017, 33(5): 1255–1262 DOI:10.1109/tro.2017.2705103


Galvez-López D, Tardos J D. Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics, 2012, 28(5): 1188–1197 DOI:10.1109/tro.2012.2197158


Fischler M A, Bolles R C. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981, 24(6): 381–395 DOI:10.1145/358669.358692


Baker S, Matthews I. Lucas-kanade 20 years on: A unifying framework. International Journal of Computer Vision, 2004, 56(3): 221–255 DOI:10.1023/b:visi.0000011205.11775.fd


Peasley B, Birchfield S. Replacing projective data association with Lucas-kanade for KinectFusion. In: 2013 IEEE International Conference on Robotics and Automation. Karlsruhe, Germany, IEEE, 2013 DOI:10.1109/icra.2013.6630640


Vidal A R, Rebecq H, Horstschaefer T, Scaramuzza D. Ultimate SLAM? combining events, images, and IMU for robust visual SLAM in HDR and high-speed scenarios. IEEE Robotics and Automation Letters, 2018, 3(2): 994–1001 DOI:10.1109/lra.2018.2793357


Forster C, Pizzoli M, Scaramuzza D. SVO: Fast semi-direct monocular visual odometry. In: 2014 IEEE International Conference on Robotics and Automation (ICRA). Hong Kong, China, IEEE, 2014 DOI:10.1109/icra.2014.6906584


Kerl C, Sturm J, Cremers D. Robust odometry estimation for RGB-D cameras. In: 2013 IEEE International Conference on Robotics and Automation. Karlsruhe, Germany, IEEE, 2013 DOI:10.1109/icra.2013.6631104


Park J, Zhou Q Y, Koltun V. Colored point cloud registration revisited. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy, IEEE, 2017 DOI:10.1109/iccv.2017.25


Zhou Q Y, Park J, Koltun V. Fast global registration// Computer Vision – ECCV 2016. Cham: Springer International Publishing, 2016, 766–782 DOI:10.1007/978-3-319-46475-6_47


Glocker B, Shotton J, Criminisi A, Izadi S. Real-time RGB-D camera relocalization via randomized ferns for keyframe encoding. IEEE Transactions on Visualization and Computer Graphics, 2015, 21(5): 571–583 DOI:10.1109/tvcg.2014.2360403


Cummins M, Newman P. FAB-MAP: probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research, 2008, 27(6): 647–665 DOI:10.1177/0278364908090961


Klein G, Murray D. Improving the agility of keyframe-based SLAM//Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, 802–815 DOI:10.1007/978-3-540-88688-4_59


Lowry S, Sunderhauf N, Newman P, Leonard J J, Cox D, Corke P, Milford M J. Visual place recognition: A survey. IEEE Transactions on Robotics, 2016, 32(1): 1–19 DOI:10.1109/tro.2015.2496823


Maddern W, Milford M, Wyeth G. CAT-SLAM: probabilistic localisation and mapping using a continuous appearance-based trajectory. The International Journal of Robotics Research, 2012, 31(4): 429–451 DOI:10.1177/0278364912438273


Milford M J, Wyeth G F. SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In: 2012 IEEE International Conference on Robotics and Automation. St. Paul, MN, USA, IEEE, 2012 DOI:10.1109/icra.2012.6224623


Angeli A, Filliat D, Doncieux S, Meyer J A. Fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics, 2008, 24(5): 1027–1037 DOI:10.1109/tro.2008.2004514


Kawewong A, Tongprasit N, Tangruamsub S, Hasegawa O. Online and incremental appearance-based SLAM in highly dynamic environments. The International Journal of Robotics Research, 2011, 30(1): 33–55 DOI:10.1177/0278364910371855


Khan S, Wollherr D. IBuILD: Incremental bag of Binary words for appearance based loop closure detection. In: 2015 IEEE International Conference on Robotics and Automation (ICRA). Seattle, WA, USA, IEEE, 2015 DOI:10.1109/icra.2015.7139959


Tsintotas K A, Bampis L, Gasteratos A. Assigning visual words to places for loop closure detection. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, QLD, Australia, IEEE, 2018 DOI:10.1109/icra.2018.8461146


Davison A J. Real-time simultaneous localisation and mapping with a single camera. In: Proceedings Ninth IEEE International Conference on Computer Vision. Nice, France, IEEE, 2003 DOI:10.1109/iccv.2003.1238654


Holmes S, Klein G, Murray D W. A square root unscented kalman filter for visual monoSLAM. In: 2008 IEEE International Conference on Robotics and Automation. Pasadena, CA, USA, IEEE, 2008 DOI:10.1109/robot.2008.4543780


Du J J, Carlone L, Kaouk Ng M, Bona B, Indri M. A comparative study on active SLAM and autonomous exploration with Particle Filters. In: 2011 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM). Budapest, Hungary, IEEE, 2011 DOI:10.1109/aim.2011.6027142


Sibley G, Matthies L, Sukhatme G. Sliding window filter with application to planetary landing. Journal of Field Robotics, 2010, 27(5): 587–608 DOI:10.1002/rob.20360


Bailey T, Nieto J, Guivant J, Stevens M, Nebot E. Consistency of the EKF-SLAM algorithm. In: 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. Beijing, China, IEEE, 2006 DOI:10.1109/iros.2006.281644


Dissanayake G, Huang S D, Wang Z, Ranasinghe R. A review of recent developments in Simultaneous Localization and Mapping. In: 2011 6th International Conference on Industrial and Information Systems. Kandy, Sri Lanka, IEEE, 2011 DOI:10.1109/iciinfs.2011.6038117


Huang S D, Dissanayake G. A critique of current developments in simultaneous localization and mapping. International Journal of Advanced Robotic Systems, 2016, 13(5): 172988141666948 DOI:10.1177/1729881416669482


Huang G P, Mourikis A I, Roumeliotis S I. A first-estimates Jacobian EKF for improving SLAM consistency// Experimental Robotics. Berlin, Heidelberg, 2009, 373–382 DOI:10.1007/978-3-642-00196-3_43


Montemerlo M, Thrun S, Koller D, Wegbreit B. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In: National Conf. on Artificial Intelligence (AAAI), 2002, 593–598


Thrun S, Liu Y F, Koller D, Ng A Y, Ghahramani Z, Durrant-Whyte H. Simultaneous localization and mapping with sparse extended information filters. The International Journal of Robotics Research, 2004, 23(7/8): 693–716 DOI:10.1177/0278364904045479


Huang S D, Wang Z, Dissanayake G. Sparse local submap joining filter for building large-scale maps. IEEE Transactions on Robotics, 2008, 24(5): 1121–1130 DOI:10.1109/tro.2008.2003259


Lenac K, Ćesić J, Marković I, Petrović I. Exactly sparse delayed state filter on Lie groups for long-term pose graph SLAM. The International Journal of Robotics Research, 2018, 37(6): 585–610 DOI:10.1177/0278364918767756


Thrun S, Burgard W, Fox D. Probabilistic robotics. MIT press, 2005


Dellaert F, Kaess M. Square root SAM: simultaneous localization and mapping via square root information smoothing. The International Journal of Robotics Research, 2006, 25(12): 1181–1203 DOI:10.1177/0278364906072768


Kummerle R, Grisetti G, Strasdat H, Konolige K, Burgard W. G2o: A general framework for graph optimization. In: 2011 IEEE International Conference on Robotics and Automation. Shanghai, China, IEEE, 2011 DOI:10.1109/icra.2011.5979949


Agarwal S, Mierle K, others. Ceres Solver. 2015


Sibley G, Sukhatme G S, Matthies L. Constant time sliding window filter SLAM as a basis for metric visual perception. In: IEEE International Conference on Robotics and Automation Workshop. 2007


Dong-Si T C, Mourikis A I. Motion tracking with fixed-lag smoothing: Algorithm and consistency analysis. In: 2011 IEEE International Conference on Robotics and Automation. Shanghai, China, IEEE, 2011 DOI:10.1109/icra.2011.5980267


Huang G P, Mourikis A I, Roumeliotis S I. An observability-constrained sliding window filter for SLAM. In: 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. San Francisco, CA, USA, IEEE, 2011 DOI:10.1109/iros.2011.6095161


Kaess M, Ranganathan A, Dellaert F. ISAM: incremental smoothing and mapping. IEEE Transactions on Robotics, 2008, 24(6): 1365–1378 DOI:10.1109/tro.2008.2006706


Wang X P, Marcotte R, Ferrer G, Olson E. AprilSAM: real-time smoothing and mapping. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, QLD, Australia, IEEE, 2018 DOI:10.1109/icra.2018.8461072


Kaess M, Johannsson H, Roberts R, Ila V, Leonard J J, Dellaert F. ISAM2: Incremental smoothing and mapping using the Bayes tree. The International Journal of Robotics Research, 2012, 31(2): 216–235 DOI:10.1177/0278364911430419


Ila V, Polok L, Solony M, Svoboda P. SLAM++-A highly efficient and temporally scalable incremental SLAM framework. The International Journal of Robotics Research, 2017, 36(2): 210–230 DOI:10.1177/0278364917691110


Hinton G E. Reducing the dimensionality of data with neural networks. Science, 2006, 313(5786): 504–507 DOI:10.1126/science.1127647


LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436–444 DOI:10.1038/nature14539


DeTone D, Malisiewicz T, Rabinovich A. Toward geometric deep SLAM. arXiv preprint arXiv:1707.07410, 2017


Tang J X, Ericson L, Folkesson J, Jensfelt P. GCNv2: efficient correspondence prediction for real-time SLAM. IEEE Robotics and Automation Letters, 2019: 1 DOI:10.1109/lra.2019.2927954


Zeng A, Song S R, Nießner M, Fisher M, Xiao J X, Funkhouser T. 3DMatch: learning local geometric descriptors from RGB-D reconstructions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017 DOI:10.1109/cvpr.2017.29


Yi K M, Trulls E, Lepetit V, Fua P. LIFT: learned invariant feature transform// Computer Vision – ECCV 2016. Cham: Springer International Publishing, 2016, 467–483 DOI:10.1007/978-3-319-46466-4_28


DeTone D, Malisiewicz T, Rabinovich A. SuperPoint: self-supervised interest point detection and description. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Salt Lake City, UT, USA, IEEE, 2018 DOI:10.1109/cvprw.2018.00060


Verdie Y, Yi K M, Fua P, Lepetit V. TILDE: A temporally invariant learned DEtector. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA, IEEE, 2015 DOI:10.1109/cvpr.2015.7299165


Jayaraman D, Grauman K. Learning image representations tied to ego-motion. In: 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile, IEEE, 2015 DOI:10.1109/iccv.2015.166


Agrawal P, Carreira J, Malik J. Learning to see by moving. In: 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile, IEEE, 2015 DOI:10.1109/iccv.2015.13


Schmidt U, Roth S. Learning rotation-aware features: From invariant priors to equivariant descriptors. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA, IEEE, 2012 DOI:10.1109/cvpr.2012.6247909


Lenc K, Vedaldi A. Understanding image representations by measuring their equivariance and equivalence. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA, IEEE, 2015 DOI:10.1109/cvpr.2015.7298701


Luo W, Li Y, Urtasun R, Zemel R. Understanding the effective receptive field in deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS). 2016, 4898–4906


Lianos K N, Schönberger J L, Pollefeys M, Sattler T. VSO: visual semantic odometry// Computer Vision – ECCV 2018. Cham: Springer International Publishing, 2018, 246–263 DOI:10.1007/978-3-030-01225-0_15


Barsan I A, Liu P D, Pollefeys M, Geiger A. Robust dense mapping for large-scale dynamic environments. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, Australia, IEEE, 2018 DOI:10.1109/icra.2018.8462974


Sunderhauf N, Pham T T, Latif Y, Milford M, Reid I. Meaningful maps with object-oriented semantic mapping. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Vancouver, Canada, IEEE, 2017 DOI:10.1109/iros.2017.8206392


Dame A, Prisacariu V A, Ren C Y, Reid I. Dense reconstruction using 3D object shape priors. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA, IEEE, 2013 DOI:10.1109/cvpr.2013.170


Salas-Moreno R F, Newcombe R A, Strasdat H, Kelly P H J, Davison A J. SLAM++: simultaneous localisation and mapping at the level of objects. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA, IEEE, 2013 DOI:10.1109/cvpr.2013.178


McCormac J, Handa A, Davison A, Leutenegger S. SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). Singapore, Singapore, IEEE, 2017 DOI:10.1109/icra.2017.7989538


Eigen D, Puhrsch C, Fergus R. Depth map prediction from a single image using a multi-scale deep network. Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014, 2: 2366–2374


Tateno K, Tombari F, Laina I, Navab N. CNN-SLAM: real-time dense monocular SLAM with learned depth prediction. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017 DOI:10.1109/cvpr.2017.695


Yin X C, Wang X W, Du X G, Chen Q J. Scale recovery for monocular visual odometry using depth estimated with deep convolutional neural fields. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy, IEEE, 2017 DOI:10.1109/iccv.2017.625


Tang J X, Folkesson J, Jensfelt P. Sparse2Dense: from direct sparse odometry to dense 3-D reconstruction. IEEE Robotics and Automation Letters, 2019, 4(2): 530–537 DOI:10.1109/lra.2019.2891433


Bloesch M, Czarnowski J, Clark R, Leutenegger S, Davison A J. CodeSLAM – learning a compact, optimisable representation for dense visual SLAM. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018 DOI:10.1109/cvpr.2018.00271


Zhou T H, Brown M, Snavely N, Lowe D G. Unsupervised learning of depth and ego-motion from video. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017 DOI:10.1109/cvpr.2017.700


Zhan H Y, Garg R, Weerasekera C S, Li K J, Agarwal H, Reid I M. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018 DOI:10.1109/cvpr.2018.00043


Li R H, Wang S, Long Z Q, Gu D B. UnDeepVO: monocular visual odometry through unsupervised deep learning. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, QLD, Australia, IEEE, 2018 DOI:10.1109/icra.2018.8461251


Mahjourian R, Wicke M, Angelova A. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018 DOI:10.1109/cvpr.2018.00594


Yin Z C, Shi J P. GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018 DOI:10.1109/cvpr.2018.00212


Madhu Babu V, Das K, Majumdar A, Kumar S. UnDEMoN: unsupervised deep network for depth and ego-motion estimation. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Madrid, Spain, IEEE, 2018 DOI:10.1109/iros.2018.8593864


Almalioglu Y, Saputra M R U, de Gusmao P P B, Markham A, Trigoni N. GANVO: unsupervised deep monocular visual odometry and depth estimation with generative adversarial networks. In: 2019 International Conference on Robotics and Automation (ICRA). Montreal, QC, Canada, IEEE, 2019 DOI:10.1109/icra.2019.8793512


Costante G, Mancini M, Valigi P, Ciarfuglia T A. Exploring representation learning with CNNs for frame-to-frame ego-motion estimation. IEEE Robotics and Automation Letters, 2016, 1(1): 18–25 DOI:10.1109/lra.2015.2505717


Wang S, Clark R, Wen H K, Trigoni N. DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). Singapore, Singapore, IEEE, 2017 DOI:10.1109/icra.2017.7989236


Dosovitskiy A, Fischer P, Ilg E, Hausser P, Hazirbas C, Golkov V, van der Smagt P, Cremers D, Brox T. FlowNet: learning optical flow with convolutional networks. In: 2015 IEEE International Conference on Computer Vision. Santiago, Chile, IEEE, 2015 DOI:10.1109/iccv.2015.316


Wang S, Clark R, Wen H K, Trigoni N. End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks. The International Journal of Robotics Research, 2018, 37(4/5): 513–542 DOI:10.1177/0278364917734298


Ummenhofer B, Zhou H Z, Uhrig J, Mayer N, Ilg E, Dosovitskiy A, Brox T. DeMoN: depth and motion network for learning monocular stereo. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017 DOI:10.1109/cvpr.2017.596


Zhou H Z, Ummenhofer B, Brox T. DeepTAM: deep tracking and mapping// Computer Vision – ECCV 2018. Cham: Springer International Publishing, 2018, 851–868 DOI:10.1007/978-3-030-01270-0_50


Xue F, Wang Q Y, Wang X, Dong W, Wang J Q, Zha H B. Guided feature selection for deep visual odometry// Computer Vision – ACCV 2018. Cham: Springer International Publishing, 2019, 293–308 DOI:10.1007/978-3-030-20876-9_19


Zhao C, Sun L, Purkait P, Duckett T, Stolkin R. Learning monocular visual odometry with dense 3D mapping from dense 3D flow. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Madrid, Spain, IEEE, 2018 DOI:10.1109/iros.2018.8594151


Xue F, Wang X, Li S, Wang Q, Wang J, Zha H. Beyond tracking: selecting memory and refining poses for deep visual odometry. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, 8575–8583


Kendall A, Grimes M, Cipolla R. PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In: 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile, IEEE, 2015 DOI:10.1109/iccv.2015.336


Brahmbhatt S, Gu J W, Kim K, Hays J, Kautz J. Geometry-aware learning of maps for camera localization. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018 DOI:10.1109/cvpr.2018.00277


Sattler T, Leibe B, Kobbelt L. Efficient & effective prioritized matching for large-scale image-based localization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(9): 1744–1756 DOI:10.1109/tpami.2016.2611662


Sattler T, Zhou Q, Pollefeys M, Leal-Taixe L. Understanding the limitations of CNN-based absolute camera pose regression. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2019, 3302–3312


Lai H, Tsai Y, Chiu W. Bridging stereo matching and optical flow via spatiotemporal correspondence. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2019, 1890–1899


Kendall A, Cipolla R. Modelling uncertainty in deep learning for camera relocalization. In: 2016 IEEE International Conference on Robotics and Automation (ICRA). Stockholm, Sweden, IEEE, 2016 DOI:10.1109/icra.2016.7487679


Kendall A, Cipolla R. Geometric loss functions for camera pose regression with deep learning. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017 DOI:10.1109/cvpr.2017.694


Walch F, Hazirbas C, Leal-Taixe L, Sattler T, Hilsenbeck S, Cremers D. Image-based localization using LSTMs for structured feature correlation. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy, IEEE, 2017 DOI:10.1109/iccv.2017.75


Clark R, Wang S, Markham A, Trigoni N, Wen H K. VidLoc: A deep spatio-temporal model for 6-DoF video-clip relocalization. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017 DOI:10.1109/cvpr.2017.284


Mourikis A I, Roumeliotis S I. A multi-state constraint Kalman filter for vision-aided inertial navigation. In: Proceedings 2007 IEEE International Conference on Robotics and Automation. Rome, Italy, IEEE, 2007 DOI:10.1109/robot.2007.364024


Usenko V, Engel J, Stuckler J, Cremers D. Direct visual-inertial odometry with stereo cameras. In: 2016 IEEE International Conference on Robotics and Automation (ICRA). Stockholm, Sweden, IEEE, 2016 DOI:10.1109/icra.2016.7487335


von Stumberg L, Usenko V, Cremers D. Direct sparse visual-inertial odometry using dynamic marginalization. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, Australia, IEEE, 2018 DOI:10.1109/icra.2018.8462905


Bloesch M, Burri M, Omari S, Hutter M, Siegwart R. Iterated extended Kalman filter based visual-inertial odometry using direct photometric feedback. The International Journal of Robotics Research, 2017, 36(10): 1053–1072 DOI:10.1177/0278364917728574


Leutenegger S, Lynen S, Bosse M, Siegwart R, Furgale P. Keyframe-based visual–inertial odometry using nonlinear optimization. The International Journal of Robotics Research, 2015, 34(3): 314–334 DOI:10.1177/0278364914554813


Mur-Artal R, Tardos J D. Visual-inertial monocular SLAM with map reuse. IEEE Robotics and Automation Letters, 2017, 2(2): 796–803 DOI:10.1109/lra.2017.2653359


Weikersdorfer D, Hoffmann R, Conradt J. Simultaneous localization and mapping for event-based vision systems// Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, 133–142 DOI:10.1007/978-3-642-39402-7_14


Brandli C, Berner R, Yang M H, Liu S C, Delbruck T. A 240 × 180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE Journal of Solid-State Circuits, 2014, 49(10): 2333–2341 DOI:10.1109/jssc.2014.2342715


Kueng B, Mueggler E, Gallego G, Scaramuzza D. Low-latency visual odometry using event-based feature tracks. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Daejeon, South Korea, IEEE, 2016 DOI:10.1109/iros.2016.7758089


Kim H, Leutenegger S, Davison A J. Real-time 3D reconstruction and 6-DoF tracking with an event camera// Computer Vision – ECCV 2016. Cham: Springer International Publishing, 2016, 349–364 DOI:10.1007/978-3-319-46466-4_21


Rebecq H, Horstschaefer T, Gallego G, Scaramuzza D. EVO: A geometric approach to event-based 6-DOF parallel tracking and mapping in real time. IEEE Robotics and Automation Letters, 2017, 2(2): 593–600 DOI:10.1109/lra.2016.2645143


Weikersdorfer D, Adrian D B, Cremers D, Conradt J. Event-based 3D SLAM with a depth-augmented dynamic vision sensor. In: 2014 IEEE International Conference on Robotics and Automation (ICRA). Hong Kong, China, IEEE, 2014 DOI:10.1109/icra.2014.6906882


Milford M, Kim H, Leutenegger S, Davison A. Towards visual SLAM with event-based cameras place recognition on event data using SeqSLAM. In: The problem of mobile sensors workshop in conjunction with RSS. 2015


Ovren H, Forssen P E. Spline error weighting for robust visual-inertial fusion. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018 DOI:10.1109/cvpr.2018.00041


Mueggler E, Gallego G, Rebecq H, Scaramuzza D. Continuous-time visual-inertial odometry for event cameras. IEEE Transactions on Robotics, 2018, 34(6): 1425–1440 DOI:10.1109/tro.2018.2858287


Anderson S, Barfoot T D. Towards relative continuous-time SLAM. In: 2013 IEEE International Conference on Robotics and Automation. Karlsruhe, Germany, IEEE, 2013 DOI:10.1109/icra.2013.6630700


Kerl C, Stuckler J, Cremers D. Dense continuous-time tracking and mapping with rolling shutter RGB-D cameras. In: 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile, IEEE, 2015 DOI:10.1109/iccv.2015.261


Furgale P, Tong C H, Barfoot T D, Sibley G. Continuous-time batch trajectory estimation using temporal basis functions. The International Journal of Robotics Research, 2015, 34(14): 1688–1710 DOI:10.1177/0278364915585860


Lovegrove S, Patron-Perez A, Sibley G. Spline Fusion: A continuous-time representation for visual-inertial fusion with application to rolling shutter cameras. In: Proceedings of the British Machine Vision Conference 2013. Bristol, UK, 2013 DOI:10.5244/c.27.93


Tong C H, Furgale P, Barfoot T D. Gaussian Process Gauss–Newton for non-parametric simultaneous localization and mapping. The International Journal of Robotics Research, 2013, 32(5): 507–525 DOI:10.1177/0278364913478672


Anderson S, Dellaert F, Barfoot T D. A hierarchical wavelet decomposition for continuous-time SLAM. In: 2014 IEEE International Conference on Robotics and Automation (ICRA). Hong Kong, China, IEEE, 2014 DOI:10.1109/icra.2014.6906884


Anderson S, Barfoot T D, Tong C H, Särkkä S. Batch nonlinear continuous-time trajectory estimation as exactly sparse Gaussian process regression. Autonomous Robots, 2015, 39(3): 221–238 DOI:10.1007/s10514-015-9455-y


Barfoot T D, Tong C H, Särkkä S. Batch continuous-time trajectory estimation as exactly sparse Gaussian process regression. In: Robotics: Science and Systems X, Robotics: Science and Systems Foundation, 2014 DOI:10.15607/rss.2014.x.001


Yan X Y, Indelman V, Boots B. Incremental sparse GP regression for continuous-time trajectory estimation and mapping. Robotics and Autonomous Systems, 2017, 87: 120–132 DOI:10.1016/j.robot.2016.10.004


Anderson S, Barfoot T D. Full STEAM ahead: Exactly sparse Gaussian process regression for batch continuous-time trajectory estimation on SE(3). In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Hamburg, Germany, IEEE, 2015 DOI:10.1109/iros.2015.7353368