Neural Stereo Depth Matching Onboard the Luxonis OAK-D
November 14, 2022
Recently, the Luxonis OAK-D has emerged as a contender for the go-to RGB-D sensor. It’s cheap, has high-frame-rate global shutter image sensors and a high-quality rolling shutter RGB camera, and it comes with onboard compute through the Intel Movidius chip.
Depth is obtained by computing disparity from the stereo image pairs. Disparity is how many pixels along the x-axis a point in the right frame is displaced from its pixel coordinate in the left frame. This can be converted into metric depth (how far away things are in meters) by triangulation, using the focal length of the cameras and the distance between the two image sensors (the baseline).
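As a quick illustration of that conversion, here is a minimal sketch of the standard pinhole stereo relation, depth = focal length × baseline / disparity. The focal length and baseline values are made up for the example and are not calibration data from a real device; the function name is my own.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Convert a disparity in pixels to metric depth in meters.

    Returns None for non-positive disparities, which correspond to
    invalid matches or points infinitely far away.
    """
    if disparity_px <= 0:
        return None
    return focal_length_px * baseline_m / disparity_px


# Illustrative values only (assumed, not read from a calibrated camera).
FOCAL_LENGTH_PX = 800.0
BASELINE_M = 0.075

for d in (10.0, 40.0, 80.0):
    depth = disparity_to_depth(d, FOCAL_LENGTH_PX, BASELINE_M)
    print(f"disparity {d:5.1f} px -> depth {depth:.2f} m")
```

Note the inverse relationship: small disparities map to far away points, so a one pixel matching error matters much more at long range than up close.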
By default, the OAK-D computes disparity using a stereo block matching algorithm based on the semi-global matching algorithm of Hirschmuller [1]. It works by mapping pixel blocks through a feature transform and comparing each block to candidate matching blocks in the other image. Below is an example of disparity using the standard algorithm.
Disparity computed using the block matching algorithm.
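To make the block matching idea concrete, here is a toy sketch in plain Python. It assumes a 3x3 census transform as the feature transform and Hamming distance as the comparison, which are common choices for this family of algorithms; I am not claiming this is exactly what runs on the device. The semi-global cost aggregation from [1] is left out entirely, and all function names are my own.

```python
def census_3x3(image, y, x):
    """Encode the 3x3 neighborhood of (y, x) as an 8-bit integer.

    Each bit records whether a neighbor is darker than the center pixel.
    """
    center = image[y][x]
    bits = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            bits = (bits << 1) | (image[y + dy][x + dx] < center)
    return bits


def hamming(a, b):
    """Number of differing bits between two census descriptors."""
    return bin(a ^ b).count("1")


def match_pixel(left, right, y, x, max_disparity):
    """Search along the same row of the right image for the best match.

    Returns (disparity, cost) for the left pixel at (y, x).
    """
    target = census_3x3(left, y, x)
    best_d, best_cost = 0, None
    for d in range(max_disparity + 1):
        xr = x - d
        if xr < 1:  # the 3x3 window needs a one-pixel border
            break
        cost = hamming(target, census_3x3(right, y, xr))
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost


# A tiny textured "left" image, and a "right" image in which the scene
# appears shifted two pixels to the left (zero padded at the end).
left = [
    [10, 50, 20, 90, 30, 70, 40, 80, 60],
    [60, 20, 80, 10, 70, 30, 90, 50, 40],
    [30, 90, 40, 60, 10, 80, 20, 70, 50],
]
right = [row[2:] + [0, 0] for row in left]

print(match_pixel(left, right, y=1, x=5, max_disparity=4))  # -> (2, 0)
```

On a textureless surface every candidate block would produce nearly the same descriptor, so the minimum cost is no longer unique. That is the failure mode that shows up as holes and noise in the disparity image.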
While this works reasonably well, the depth is still quite sparse and noisy, and matching is hard on surfaces that lack clear texture or distinct points.
Since this algorithm was invented, many deep learning based methods have been developed that can do stereo matching. Most algorithms are based on a similar approach of mapping pixels to features, computing a cost volume and finding the features that best match (if any) in the other image.
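The features, cost volume and best match steps can be sketched roughly as follows. This is a simplified illustration and not the network from any particular paper: the per-pixel feature vectors are hand-written stand-ins for what a learned feature extractor would output, the cost is a plain L1 distance, and the final selection is a hard winner-take-all with a hypothetical `max_cost` threshold to model the "if any" part.

```python
def build_cost_volume(left_feats, right_feats, max_disparity):
    """Build cost[d][y][x]: L1 distance between the left feature at (y, x)
    and the right feature at (y, x - d).

    Entries with no corresponding right pixel are set to infinity.
    """
    height, width = len(left_feats), len(left_feats[0])
    volume = []
    for d in range(max_disparity + 1):
        plane = []
        for y in range(height):
            row = []
            for x in range(width):
                if x - d < 0:
                    row.append(float("inf"))
                else:
                    f_l, f_r = left_feats[y][x], right_feats[y][x - d]
                    row.append(sum(abs(a - b) for a, b in zip(f_l, f_r)))
            plane.append(row)
        volume.append(plane)
    return volume


def winner_take_all(volume, max_cost):
    """Pick the lowest-cost disparity per pixel, or None if even the best
    candidate costs more than max_cost (no reliable match)."""
    height, width = len(volume[0]), len(volume[0][0])
    disparity = []
    for y in range(height):
        row = []
        for x in range(width):
            costs = [plane[y][x] for plane in volume]
            best = min(range(len(costs)), key=costs.__getitem__)
            row.append(best if costs[best] <= max_cost else None)
        disparity.append(row)
    return disparity


# One image row of made-up 2D features. The right view sees the same
# scene shifted by one pixel, plus one new feature at the end.
left_feats = [[[1, 0], [2, 1], [5, 3], [7, 2], [4, 4]]]
right_feats = [[[2, 1], [5, 3], [7, 2], [4, 4], [9, 9]]]

volume = build_cost_volume(left_feats, right_feats, max_disparity=2)
print(winner_take_all(volume, max_cost=0.5))  # -> [[None, 1, 1, 1, 1]]
```

The leftmost pixel has no counterpart in the right view, so it is reported as unmatched rather than being forced onto a bad candidate. The cost volume holds one value per pixel per candidate disparity, so its size grows with both image resolution and disparity range.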
I went looking for an algorithm lightweight enough to run on the embedded hardware of the Luxonis camera. One such algorithm was developed at the Toyota Research Institute [2].
With the default settings, running it onboard the device is not feasible, as it needs more memory than the device has available. After a few tweaks to get it to fit in memory, I found that running it at the full 1280x800 resolution is still too expensive: the frame rate is well below 1 frame per second.
At half the resolution (640x400), the network runs at about 2 frames per second and produces reasonably good results. I trained the algorithm on three different datasets: FlyingThings, Middlebury and ETH3D.
 
Neural disparity computed onboard the OAK-D.
The neural version is considerably more computationally expensive, but it yields much denser and less noisy disparity. There are still some visible artifacts, some of which could probably be removed by using an architecture designed for this resolution.
If you want to try it yourself on your own OAK-D, you can follow the instructions on my fork of the TRI version here. There is also some interesting discussion in the Depthai SDK GitHub repository, where people are discussing and trying different algorithms for the problem.
[1] Hirschmuller, Heiko. “Stereo processing by semiglobal matching and mutual information.” IEEE Transactions on Pattern Analysis and Machine Intelligence 30.2 (2008): 328-341.
[2] Shankar, Krishna, et al. “A learned stereo depth system for robotic manipulation in homes.” IEEE Robotics and Automation Letters 7.2 (2022): 2305-2312.