Monocular Visual Odometry on a Jetson Tegra X1

Sean Buckley and David Zhu

Project GitHub

Summary

We plan to implement several stages of the monocular visual odometry pipeline on a portable platform in order to localize in real time.

Background

Standard odometry relies on tools such as IMUs, GPS, encoders, and control feedback for its localization. However, there are many times when these sensors are not reliable: IMUs drift over time, encoders can drift due to wheel slippage, GPS is blocked by inclement weather and structures, and control feedback becomes useless if the encoders are not reliable. As such, standard odometry becomes less reliable in buildings, tunnels, and dense urban areas, locations which we shall refer to as "dark zones." Visual odometry, however, is not subject to the same issues. The camera feed can be referenced against a map in the world frame, and loop closure can make it resilient to drift. And unlike GPS, visual odometry is great in buildings because its main constraint, lighting conditions, is often kept constant by the humans living and working in those buildings. As such, visual odometry has great potential as an indoor localization system.

The NVIDIA TX1 is a powerful embedded platform that could easily fit in most mobile robots. Its GPU, with 256 CUDA cores, is well suited to computer vision problems, and with good parallel algorithms it could perform real-time odometry on a live video feed. Coupled with the fact that visual odometry has great potential indoors, we can use the TX1 to build a mobile platform for localizing in "dark zones."

There are several stages to the pipeline in a standard visual odometry system. These generally include capturing the image, undistorting it, applying Gaussian blurs, detecting and tracking features across frames, and determining rotation and translation from the motion of those features.
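As a rough illustration of how these stages fit together (a minimal sketch only, not our final implementation), the snippet below strings them together with standard OpenCV calls; the function name, thresholds, and calibration inputs (K, distCoeffs) are placeholders we have assumed for the example.

#include <opencv2/opencv.hpp>
#include <vector>

// Sketch of one pipeline step: undistort, blur, detect/track features,
// and recover rotation and translation from the tracked motion.
// prevGray/prevPts carry state from the previous frame.
void processFrame(const cv::Mat& raw,                       // captured BGR frame
                  const cv::Mat& K, const cv::Mat& distCoeffs,
                  cv::Mat& prevGray, std::vector<cv::Point2f>& prevPts,
                  cv::Mat& R, cv::Mat& t)
{
    // 1. Undistort the captured image using the camera intrinsics.
    cv::Mat undistorted;
    cv::undistort(raw, undistorted, K, distCoeffs);

    // 2. Convert to grayscale and apply a Gaussian blur to suppress noise.
    cv::Mat gray;
    cv::cvtColor(undistorted, gray, cv::COLOR_BGR2GRAY);
    cv::GaussianBlur(gray, gray, cv::Size(5, 5), 1.0);

    // 3. Detect FAST corners and track last frame's points into this frame.
    std::vector<cv::KeyPoint> keypoints;
    cv::FAST(gray, keypoints, /*threshold=*/20, /*nonmaxSuppression=*/true);

    std::vector<cv::Point2f> currPts;
    std::vector<uchar> status;
    std::vector<float> err;
    if (!prevPts.empty())
        cv::calcOpticalFlowPyrLK(prevGray, gray, prevPts, currPts, status, err);

    // 4. Estimate the essential matrix with RANSAC and recover R and t.
    if (currPts.size() >= 5) {
        cv::Mat E = cv::findEssentialMat(prevPts, currPts, K, cv::RANSAC);
        cv::recoverPose(E, prevPts, currPts, K, R, t);
    }

    // Carry state forward for the next frame.
    prevGray = gray;
    cv::KeyPoint::convert(keypoints, prevPts);
}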

Many of these pipeline stages could potentially benefit from parallelization. We have determined that corner detection (for features) is a potentially parallelizable step, as it requires many independent calculations in small sub-areas of the image. Tracking the features across images is another potentially parallelizable substep. The RANSAC algorithm for estimating motion from feature correspondences is also a candidate: since each iteration of the algorithm is independent, it should be trivially parallelizable.

Challenges

Visual odometry involves identifying features, mapping them to references, calculating the offset, and several other computationally intensive steps. And we plan to do all of this in real-time on a small embedded platform. The first challenge will be porting relevant computer vision algorithms into CUDA. Much of this will not be too difficult to parallelize; for many of these algorithms there are few dependencies and work can be spatially decomposed.

Most challenges will come from efficient memory utilization (since we are working with relatively high resolution video), finding ways to fully utilize our system (we have a four-core CPU which can do some work as well), and efficient use of time (we are constrained by the real-time requirement). The biggest challenge, however, will come from our limited experience. Despite our interest in computer vision and robotics, neither of us has written visual odometry algorithms before. It will be exciting, nonetheless, to dive into the deep end of mobile robotics and build a platform like this.

Resources

We will be using a Jetson Tegra X1 as the platform for our system.

We will be drawing on several existing sequential algorithms and implementations of visual odometry found in papers and other sources. We have not yet located all of the sources we will need, but preliminary research has shown that existing work in this area is sufficiently developed.

For some stages of the pipeline we are also planning on adapting OpenCV code for our system.

Goals and Deliverables

Plan to achieve: determine translational and rotational movement in real time, at high frequency, on the Jetson TX1.

Hope to achieve: we would like to build a full map of one floor of Gates and be able to localize anywhere on that floor. If we succeed with that, we'd like to continue on to other floors. If the location permits, we would like to give a live demo of our localizer during the presentation.

Platform Choice

We have chosen the Jetson Tegra X1 for its small form factor and powerful computational abilities. It makes sense for our chosen workload because it can be incorporated into a mobile autonomous system, which requires the device to be small enough not to hinder motion and fast enough to provide real-time odometry data for improved path planning.

Final Approach and Results

Final Approach

Our final approach to parallelizing the feature detector was to write an implementation of FAST in CUDA. We parallelized the algorithm over pixels, grouping them into arbitrary blocks since no work is shared within a block. We determined that the main bottleneck was I/O, and we reduced execution time significantly by removing intermediary arrays and data structures. The algorithm itself was also heavily optimized, but since computation was not the bottleneck this did not improve our execution times significantly.
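A minimal sketch of this pixel-per-thread decomposition is shown below. It is illustrative rather than our exact kernel: the segment test is simplified (the real FAST test requires a contiguous arc of at least 9 circle pixels), and the names, block shape, and threshold are assumptions made for the example.

// One thread per pixel; blocks are an arbitrary shape since threads share no
// work. The kernel writes a score map directly, avoiding intermediary buffers.
__global__ void fastKernel(const unsigned char* img, unsigned char* score,
                           int width, int height, int threshold)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 3 || y < 3 || x >= width - 3 || y >= height - 3) return;

    // The 16-pixel Bresenham circle of radius 3 around the candidate corner.
    const int cx[16] = { 0, 1, 2, 3, 3, 3, 2, 1, 0,-1,-2,-3,-3,-3,-2,-1};
    const int cy[16] = {-3,-3,-2,-1, 0, 1, 2, 3, 3, 3, 2, 1, 0,-1,-2,-3};

    int center = img[y * width + x];
    int brighter = 0, darker = 0;
    for (int i = 0; i < 16; ++i) {
        int p = img[(y + cy[i]) * width + (x + cx[i])];
        if (p > center + threshold) ++brighter;
        if (p < center - threshold) ++darker;
    }
    // Simplified decision: the real test checks for a contiguous arc of 9+.
    score[y * width + x] = (brighter >= 9 || darker >= 9) ? 1 : 0;
}

// Example launch with an arbitrary 16x16 block shape:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   fastKernel<<<grid, block>>>(d_img, d_score, width, height, 20);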

Like our FAST implementation, our RANSAC implementation was written using CUDA kernels. The Nister five-point solver that sits at the core of each RANSAC iteration is largely sequential, so it benefited little from attempts to parallelize it beyond its matrix operations. We therefore focused our efforts on the outer RANSAC loop. Since each iteration of RANSAC is independent and there is enough work per iteration to justify the overhead of launching CUDA kernels, it was a prime target for optimization. Once each kernel returned its best essential matrix estimate and an evaluation of that model (the quality of a model in RANSAC is the number of inliers it captures), we used the Thrust library to pick the best result.
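The sketch below illustrates this outer-loop structure, under the assumption that the five-point solver has already produced one candidate essential matrix per iteration (the solver itself is omitted). Each thread scores one candidate by counting correspondences that satisfy the epipolar constraint within a threshold, and Thrust's max_element then picks the winner; the names, error measure, and threshold are our own placeholders, not the exact code.

#include <thrust/device_vector.h>
#include <thrust/extrema.h>

// One thread per RANSAC iteration: score a candidate essential matrix by
// counting correspondences whose algebraic epipolar error |x2^T E x1| is
// below a threshold.
__global__ void scoreCandidates(const float2* pts1, const float2* pts2, int n,
                                const float* essentials,  // 9 floats per candidate
                                int* inlierCounts, int numCandidates,
                                float threshold)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numCandidates) return;

    const float* E = &essentials[9 * i];
    int inliers = 0;
    for (int k = 0; k < n; ++k) {
        float u1 = pts1[k].x, v1 = pts1[k].y;   // x1 = (u1, v1, 1)
        float u2 = pts2[k].x, v2 = pts2[k].y;   // x2 = (u2, v2, 1)
        float a = E[0] * u1 + E[1] * v1 + E[2]; // E * x1
        float b = E[3] * u1 + E[4] * v1 + E[5];
        float c = E[6] * u1 + E[7] * v1 + E[8];
        float err = fabsf(u2 * a + v2 * b + c); // x2^T * E * x1
        if (err < threshold) ++inliers;
    }
    inlierCounts[i] = inliers;
}

// Host side: pick the candidate with the most inliers using Thrust.
int bestCandidate(const thrust::device_vector<int>& inlierCounts)
{
    auto it = thrust::max_element(inlierCounts.begin(), inlierCounts.end());
    return static_cast<int>(it - inlierCounts.begin());
}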

Final Results

As the graph shows, for FAST we achieved a speedup of about 2x over the reference sequential OpenCV implementation on the 720p video and about 5x on the 1080p video. Our conjecture is that the arithmetic intensity goes up for the larger image, and since this is a memory-bound operation, that translates into a larger speedup.

The graph reports average time in seconds per 100 iterations. We obtained this data by timing both implementations on the same test videos for 1000 iterations each; the videos were prerecorded and fed to the algorithms in a controlled manner.

The limits on our speedup came from memory overheads. The algorithm has almost no dependencies, so the bottleneck was transferring data back and forth between host and device, as well as creating intermediary data structures. We were able to mitigate this to some degree by removing some of those intermediary structures and reducing the total amount of memory copied back and forth.
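A host-side flow in this spirit might look like the sketch below: device buffers are allocated once, each frame is uploaded once, the kernel writes directly into a single output buffer, and only that buffer is copied back, with no per-frame intermediary allocations. The names reuse the fastKernel sketch above and are illustrative.

// Per-frame host flow with a single upload and a single download. d_frame and
// d_score are device buffers allocated once, outside the frame loop.
void detectOnDevice(const unsigned char* h_frame, unsigned char* h_score,
                    unsigned char* d_frame, unsigned char* d_score,
                    int width, int height)
{
    size_t bytes = static_cast<size_t>(width) * height;

    cudaMemcpy(d_frame, h_frame, bytes, cudaMemcpyHostToDevice);   // one upload

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    fastKernel<<<grid, block>>>(d_frame, d_score, width, height, 20);

    cudaMemcpy(h_score, d_score, bytes, cudaMemcpyDeviceToHost);   // one download
}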

Despite our best attempts, our parallelization of RANSAC was unsuccessful due to complications between OpenCV and CUDA. While we had the framework for the operation fully written, type incompatibilities meant that all matrix operations would have had to be rewritten, as was done for our FAST implementation. Since we did not have enough time to rewrite everything, we were unable to fully evaluate our RANSAC implementation.

An optimization that both of our implementations were missing is the use of shared memory. We saw in Assignment 2 that operations on large, frequently reused data sets (e.g. pixels) were much faster when those data sets were staged in shared memory. This would have especially benefited our FAST implementation, which, as noted above, is memory bound.
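For FAST, that optimization might look like the sketch below: each block stages a tile of the image plus a 3-pixel apron into shared memory, so the 16 circle reads per pixel hit on-chip memory instead of global memory. The tile size and names are assumptions for the example.

#define TILE 16
#define APRON 3   // FAST circle radius

__global__ void fastSharedKernel(const unsigned char* img, unsigned char* score,
                                 int width, int height, int threshold)
{
    __shared__ unsigned char tile[TILE + 2 * APRON][TILE + 2 * APRON];

    int baseX = blockIdx.x * TILE;              // tile origin in the image
    int baseY = blockIdx.y * TILE;
    int gx = baseX + threadIdx.x;               // this thread's pixel
    int gy = baseY + threadIdx.y;

    // Cooperative load of the tile plus apron, clamping at the image border.
    for (int dy = threadIdx.y; dy < TILE + 2 * APRON; dy += TILE) {
        for (int dx = threadIdx.x; dx < TILE + 2 * APRON; dx += TILE) {
            int sx = min(max(baseX + dx - APRON, 0), width - 1);
            int sy = min(max(baseY + dy - APRON, 0), height - 1);
            tile[dy][dx] = img[sy * width + sx];
        }
    }
    __syncthreads();

    if (gx < APRON || gy < APRON || gx >= width - APRON || gy >= height - APRON)
        return;

    // Same simplified segment test as before, now reading from shared memory.
    int lx = threadIdx.x + APRON, ly = threadIdx.y + APRON;
    int center = tile[ly][lx];
    const int cx[16] = { 0, 1, 2, 3, 3, 3, 2, 1, 0,-1,-2,-3,-3,-3,-2,-1};
    const int cy[16] = {-3,-3,-2,-1, 0, 1, 2, 3, 3, 3, 2, 1, 0,-1,-2,-3};
    int brighter = 0, darker = 0;
    for (int i = 0; i < 16; ++i) {
        int p = tile[ly + cy[i]][lx + cx[i]];
        if (p > center + threshold) ++brighter;
        if (p < center - threshold) ++darker;
    }
    score[gy * width + gx] = (brighter >= 9 || darker >= 9) ? 1 : 0;
}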

Work Done

CUDA FAST implemented by David.
Sequential reference pipeline implemented by Sean and David.
Five-point solver and RANSAC implemented by Sean.

Schedule

Date: 4/18/2017
Goal: Capture test images/videos with location annotations for tests; study/understand FAST and other pipeline algorithms; image undistortion
Actual: Captured test images/videos; studied pipeline algorithms

Date: 4/25/2017 (Note: Carnival is the preceding week; limited work will be done during this time.)
Goal: Essential matrix estimation; feature (edge) detection in images; feature tracking
Actual: Image undistortion; captured more test videos; serial feature detection

Date: 5/2/2017
Goal: Get OpenCV3 working; serial essential matrix estimation; parallel feature detection
Actual: Serial pipeline completed

Date: 5/12/2017
Goal: Parallel essential matrix estimation; writeup; gather speed/accuracy data
Actual: Parallel feature detection completed; parallel essential matrix estimation attempted; data gathered

References

Avi Singh's Monocular Visual Odometry Blogpost
OpenCV Documentation
CUDA Documentation
SIFT Paper
Nister's Five Point Algorithm Paper