Structure from Motion
3D from standard cameras for automotive applications
The World Health Organization estimates than 1.2 million people die in traffic worldwide. Drivers in the USA spend about five years of their lives in a car, and due to the high cost of traffic incidents, insurance has risen to about $0.10 per mile. Automotive Advanced Driver Assistance Systems (ADAS) have the potential to make a big impact: saving lives, saving time, and saving cost by aiding the driver in operating the vehicle, or even take over control of the vehicle completely.
One of the key tasks of an ADAS is to have an understanding of the vehicle’s surroundings. Besides recognizing pedestrians, other vehicles, lanes and obstacles, the system should be aware of where other objects are, in full 3D space. This 3D information enables the ADAS to understand the distance of objects, along with their size, direction and speed, allowing it to take appropriate action. It’s common to think that us humans use our two eyes to sense depth information. At the same time though, we can easily catch a ball with one eye closed. Research has shown that humans actually primarily use monocular vision to sense depth, using motion parallax. This is a depth cue that results from movement. As we move, objects that are closer to us move farther across our field of view than objects that are more distant. The same mechanism, called Structure from Motion can be used in to sense depth using standard video cameras. There are different ways to sense depth using special cameras. Lidar measures distance by illuminating a target with a laser and analyzing the reflected light. Time-of-flight cameras measure the delay of a light signal between the camera and the subject for each point of the image. Another method is to project a pattern of light onto the scene. Capturing this distored pattern with a camera allows the extraction of depth information. Using Structure from Motion has a few key advantages over these approaches. Firstly, there’s no active illumination of the scene required. Such active lighting limits range and outdoor use. In addition, a standard off-the-shelf camera suffices, instead of a specialized depth-sensing camera. This reduces cost, since the standard rear-view or surround-view cameras can be reused, and no active lighting components are needed.
Structure from Motion algorithm
The Structure from Motion algorithm consists of three steps:
- • 1. Detection of feature points in view
- • 2. Tracking feature points from one frame to the next frame
- • 3. Robust estimation of 3D position of these points, based on their motion
The first step is to identify points in the image that can be robustly tracked from one frame to the next. Features on textureless patches, like blank walls, are nearly impossible to localize. Areas with large contrast changes (gradients), like lines, are easier to localize, but lines suffer from the aperture problem, i.e., it is only possible to align patches along the direction of the line, not in a single position. This renders lines not useful to track from frame to frame either. Locations where you find gradients in two significantly different orientations are the good feature points that can be tracked from one frame to the next. Such features show up in the image as corners, where two lines come together. There are different feature detection algorithms, and they’ve been widely researched in the computer vision community. In our application, we use the Harris feature detector. The next step is to track these feature points from frame to frame, to find how much they moved in the image. We use the Lucas Kanade optical flow algorithm for this. This algorithm first builds a multiscale image pyramid, where each image is a smaller scaled image of the originally captured image. The algorithm then searches around the previous frame’s feature point location for a match. When the match is found, it reuses this position as an initial estimate with the larger image in the pyramid, traveling down the pyramid until the original image resolution is reached. This way, larger displacements can also be tracked. The result is two lists of feature points; one for the previous image and one for the current image. Based on these point pairs, you can define and solve a linear system of equations that finds the camera motion, and consequently the distance of each point from the camera. The result is a sparse 3D pointcloud covering the camera’s viewpoint. This pointcloud can then be used for different applications such as automated parking, obstacle detection, or even accurate indoor positioning for mobile phone applications.
Videantis has been working together with Viscoda to implement the Structure from Motion algorithm on the videantis v-MP4280HDX vision processor. The Viscoda Structure from Motion algorithm has been proven to be very robust in reconstructing a 3D point cloud under a wide variety of situations, whether it is in low light conditions or for complex scenes. The videantis vision processor is licensed to semiconductor companies for integration into their systems-on-chips that target a wide variety of automotive applications. The processor architecture has specifically been optimized to run computer vision algorithms at high-performance and with very-low-power consumption. The multi-core architecture can scale from just a few cores to many cores, enabling it to address different performance points: from chips that can be integrated into low-cost cameras, all the way up to very high-performance applications such as multi-camera systems that run a many computer vision algorithms concurrently. The combined solution runs the Viscoda Structure from Motion algorithm on the videantis processor architecture. The resulting combined implementation is small and low power enough to be integrated into smart cameras for automotive applications, making our rides safer and enabling us to let go of the wheel and pedals.