008 : Real-time VR controller pose estimation using a wide FOV stereo camera (2021)

In this project, a real-time pose estimation algorithm was designed which tracks the VR controller pose relative to a wide FOV stereo camera. Controller tracking is needed because a user wearing a VR headset may not be able to see their hands directly. Instead the controllers are tracked relative to the headset, and visual objects such as swords, guns or laser pointers can be drawn on top of them. The outline of the project was to start from simulation and then proceed to a product phase with a real VR controller.

">

A snapshot of the final product captured from RGB stereo camera and IR stereo camera

The first step was to implement benchmarking software in C++ which is able to play back ground-truth videos and compare the pose estimates given by the novel algorithm to the ground truth. The benchmarking software parsed metadata tracks from Blender scripts and loaded the stereo images, which were saved as indexed frames. The algorithm parameters were loaded from a JSON file, and a result report was produced for each batch containing precision statistics and millisecond timings for the different computational steps. The basic algorithm used for generating poses was cv::solvePnP in OpenCV.
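The per-frame pose step of the benchmark could be sketched roughly as follows (an illustration only; frame loading, ground-truth comparison and report generation are omitted):

```cpp
// Sketch of the benchmark's per-frame pose step: solve the pose with cv::solvePnP
// and time it. The inputs are the known 3D LED positions and their 2D observations.
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

struct PoseResult { cv::Mat rvec, tvec; double ms = 0.0; bool ok = false; };

PoseResult estimatePose(const std::vector<cv::Point3f>& modelLeds,  // 3D LED positions (model frame)
                        const std::vector<cv::Point2f>& imageLeds,  // associated 2D observations
                        const cv::Mat& K, const cv::Mat& dist)      // camera intrinsics
{
    PoseResult r;
    cv::TickMeter tm;
    tm.start();
    r.ok = cv::solvePnP(modelLeds, imageLeds, K, dist, r.rvec, r.tvec);
    tm.stop();
    r.ms = tm.getTimeMilli();   // per-step timing collected into the batch report
    return r;
}
```

The report then accumulates the pose error against the Blender ground-truth track along with the timings.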

">

VR controller model in Blender

The second step was to implement a data generation pipeline in Blender which produces realistic ground-truth videos along with precise track information. Since the plan was to 3D print the physical VR controller from a 3D model, the same model could be used for generating the simulated videos. The only open question was the placement and number of LEDs needed. Therefore a larger number of LEDs was attached to the 3D model, from which visible subsets could be selected, rendered and benchmarked until the best configuration was found.

Blender is convenient as a simulation environment, because it allows testing different types of interference. The physical VR headset had a wide FOV IR camera and an RGB camera (with slightly better resolution) attached, and both could be utilized freely for tracking purposes, so matching simulation videos were generated for both cameras. Primarily LED occlusion, distance and different viewing angles were tested with benchmarking. To model hand-induced occlusion, a rigid hand was modeled and attached to the controller without the full kinematic arm. Mirrors were experimented with by adding them to the surrounding scene to test how much they interfere with tracking. Multiple controllers were tested with randomly occurring controller overlap. Sub-surface scattering was also tested with the hand model to see how much light the hand might reflect and whether it needs to be addressed on the algorithm side.

">

Interference simulations in Blender

With simulation and benchmarking working, it was time to start developing the tracking algorithm itself. First, 2D points representing the centroids of the observed LEDs are extracted from the images. The LED locations are precisely known in the 3D model; the problem is to associate them with the 2D observations. The basic outline of the algorithm follows RANSAC: minimal subsets of 2D and 3D points are tested combinatorially, and for the best candidate poses the full 3D model is validated against all available 2D points to compute a score. The best score wins and the 3D pose is produced. 3D normals were also stored in the 3D model; they are useful for limiting the 3D point subsets to ones whose normals point roughly towards the camera.
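A condensed sketch of how one candidate correspondence set could be scored is shown below (illustrative only; the real implementation enumerates the minimal subsets combinatorially and keeps the best-scoring candidate):

```cpp
// Illustrative sketch of scoring one RANSAC pose hypothesis: solve the pose from a
// minimal 2D-3D subset, then validate the full LED model against all observations.
#include <opencv2/calib3d.hpp>
#include <vector>

struct Candidate { cv::Mat rvec, tvec; int inliers = -1; };

Candidate scoreCandidate(const std::vector<cv::Point3f>& ledPos,     // model LED positions
                         const std::vector<cv::Point3f>& ledNormal,  // model LED normals
                         const std::vector<cv::Point3f>& subset3d,   // hypothesized 3D subset
                         const std::vector<cv::Point2f>& subset2d,   // hypothesized 2D subset
                         const std::vector<cv::Point2f>& obs,        // all 2D observations
                         const cv::Mat& K, const cv::Mat& dist)
{
    Candidate cand;
    if (!cv::solvePnP(subset3d, subset2d, K, dist, cand.rvec, cand.tvec))
        return cand;

    cv::Mat R;
    cv::Rodrigues(cand.rvec, R);

    // Validate the full model: project every LED whose normal faces the camera
    // and count observations close to the projection.
    std::vector<cv::Point2f> proj;
    cv::projectPoints(ledPos, cand.rvec, cand.tvec, K, dist, proj);
    cand.inliers = 0;
    for (size_t i = 0; i < ledPos.size(); ++i) {
        cv::Mat n = R * cv::Mat(cv::Vec3d(ledNormal[i].x, ledNormal[i].y, ledNormal[i].z));
        cv::Mat c = R * cv::Mat(cv::Vec3d(ledPos[i].x, ledPos[i].y, ledPos[i].z)) + cand.tvec;
        if (n.dot(c) > 0) continue;   // LED points away from the camera, skip it
        for (const auto& p : obs) {
            float dx = proj[i].x - p.x, dy = proj[i].y - p.y;
            if (dx * dx + dy * dy < 9.0f) { ++cand.inliers; break; }  // within ~3 px
        }
    }
    return cand;   // the RANSAC loop keeps the candidate with the most inliers
}
```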

The first problem with cv::solvePnP() was low performance when combined with the RANSAC algorithm, which may produce thousands or even millions of test poses depending on the number of detected LEDs. cv::solvePnP requires 4 point pairs for producing a pose. The combinatorics can be reduced to 3 point pairs when the measured points are represented in 3D, so stereo matching and reconstruction were implemented to perform pose estimation more quickly. Because stereo-matched point pairs must satisfy the epipolar constraint, the input space of RANSAC can be reduced dramatically, which also increases robustness to outliers. When associating point sets between the stereo views, the best 1:1 match was found using the Hungarian algorithm for optimal matching (because one-to-many matches should not exist). The point-to-point cost metric used in the Hungarian algorithm was the reprojection error from triangulation. Finally, the optimal triangulation algorithm was used to produce the 3D points.
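The stereo association cost could be sketched roughly like this (the Hungarian solver itself is omitted; P1 and P2 denote the 3x4 CV_64F projection matrices of the stereo pair):

```cpp
// Sketch of the stereo association cost: every left/right pairing is triangulated
// and scored by its reprojection error, forming the Hungarian cost matrix.
#include <opencv2/calib3d.hpp>
#include <cmath>
#include <vector>

cv::Mat buildCostMatrix(const std::vector<cv::Point2f>& left,
                        const std::vector<cv::Point2f>& right,
                        const cv::Mat& P1, const cv::Mat& P2)   // 3x4 CV_64F projections
{
    cv::Mat cost(static_cast<int>(left.size()), static_cast<int>(right.size()), CV_64F);
    for (int i = 0; i < cost.rows; ++i) {
        for (int j = 0; j < cost.cols; ++j) {
            cv::Mat X;
            cv::triangulatePoints(P1, P2, cv::Mat(left[i]), cv::Mat(right[j]), X);
            X.convertTo(X, CV_64F);
            X /= X.at<double>(3);                     // dehomogenize the 3D point
            cv::Mat xl = P1 * X, xr = P2 * X;         // reproject into both views
            cv::Point2d pl(xl.at<double>(0) / xl.at<double>(2), xl.at<double>(1) / xl.at<double>(2));
            cv::Point2d pr(xr.at<double>(0) / xr.at<double>(2), xr.at<double>(1) / xr.at<double>(2));
            cost.at<double>(i, j) = std::hypot(pl.x - left[i].x, pl.y - left[i].y) +
                                    std::hypot(pr.x - right[j].x, pr.y - right[j].y);
        }
    }
    return cost;   // fed to the Hungarian 1:1 assignment
}
```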

In 3D reconstruction the 2D points must be undistorted, which means transforming them into a normalized image space that does not depend on the lens. In this project, the omnidir model in OpenCV was used for wide-lens modeling and undistortion. Unfortunately OpenCV had a bug in the undistortion routine which needed to be patched in the OpenCV source code before it could be used (since fixed in OpenCV). The customer stored omnidir calibration parameters in the HMD firmware, which could be read using special unpublished APIs. When they finally became available, the projections did not match very well, showing about a 30 px offset to the visible LEDs. This implies that either omnidir calibration had not been done for the device used in this project or the calibration process was not working very well.
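For reference, undistorting the detected points with OpenCV's omnidir model (from the opencv_contrib ccalib module) looks roughly like this, K, D and xi being the omnidir calibration parameters:

```cpp
// Sketch: transform distorted LED detections into normalized image coordinates
// using the omnidirectional camera model.
#include <opencv2/ccalib/omnidir.hpp>
#include <vector>

std::vector<cv::Point2f> undistortLeds(const std::vector<cv::Point2f>& distorted,
                                       const cv::Mat& K, const cv::Mat& D,
                                       const cv::Mat& xi)
{
    std::vector<cv::Point2f> normalized;
    // R = identity keeps the points in the camera's own normalized image space.
    cv::omnidir::undistortPoints(distorted, normalized, K, D, xi,
                                 cv::Mat::eye(3, 3, CV_64F));
    return normalized;
}
```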

To fix this issue, omnidir fisheye calibration software was patched together on the fly. The customer had a checkerboard panel with IR backlight available in their office basement, which enabled capturing a sufficient number of checkerboard images. 2D corners cannot be extracted from the images using the standard OpenCV checkerboard extractor, because the images have such a wide FOV. Luckily the autoCornerFinder utility was found on the Internet, which implements a corner extractor for wide FOV images. However, its source code used the legacy C API in OpenCV and required some patching before it could be used. The OpenCV examples on omnidir calibration were sufficient to build the rest of the calibration chain. autoCornerFinder citation:
Scaramuzza, D., Martinelli, A. and Siegwart, R. (2006), A Toolbox for Easily Calibrating Omnidirectional Cameras, Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2006), Beijing, China, October 2006.
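The rest of the calibration chain followed the OpenCV ccalib omnidir sample roughly along these lines (a sketch; the patched autoCornerFinder is assumed to have produced the per-image pattern and corner point Mats):

```cpp
// Rough outline of the omnidir calibration step, following the OpenCV ccalib sample.
// objectPoints/imagePoints hold per-image pattern and corner points
// (e.g. 1xN CV_64FC3 and 1xN CV_64FC2 Mats) from the corner extraction.
#include <opencv2/ccalib/omnidir.hpp>
#include <vector>

void calibrateWideFovCamera(const std::vector<cv::Mat>& objectPoints,
                            const std::vector<cv::Mat>& imagePoints,
                            cv::Size imageSize)
{
    cv::Mat K, xi, D, idx;
    std::vector<cv::Vec3d> rvecs, tvecs;
    cv::TermCriteria criteria(cv::TermCriteria::COUNT + cv::TermCriteria::EPS, 200, 1e-8);

    // K (3x3), xi and D are the parameters needed later for undistortion;
    // idx reports which input images were actually used by the solver.
    double rms = cv::omnidir::calibrate(objectPoints, imagePoints, imageSize,
                                        K, xi, D, rvecs, tvecs, 0, criteria, idx);
    (void)rms;  // RMS reprojection error, useful for sanity-checking the calibration
}
```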

">

Calibration images captured using an IR-backlit checkerboard pattern.

">

Omnidir lens model comparison between Blender and HMD stereo camera

To further avoid unnecessary combinatorics, a distance map was generated from the 2D points in a pre-processing step, which allows point distances to be found in O(1) time during RANSAC. The customer's in-house team did not understand why distance maps are useful, and it may be that this important step has since been removed from the product to make the code easier to understand.
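One way to build such a map (a sketch, not necessarily the exact production implementation) is to rasterize the detections and run a distance transform:

```cpp
// Sketch: precompute, for every pixel, the distance to the nearest detected LED,
// so distance queries during RANSAC become O(1) lookups.
#include <opencv2/imgproc.hpp>
#include <vector>

cv::Mat buildDistanceMap(const std::vector<cv::Point2f>& leds, cv::Size imageSize)
{
    // Non-zero everywhere except at LED locations; distanceTransform measures the
    // distance to the nearest zero pixel. (Detections are assumed to lie inside the image.)
    cv::Mat mask(imageSize, CV_8U, cv::Scalar(255));
    for (const auto& p : leds)
        mask.at<uchar>(cv::Point(cvRound(p.x), cvRound(p.y))) = 0;

    cv::Mat dist;
    cv::distanceTransform(mask, dist, cv::DIST_L2, cv::DIST_MASK_PRECISE);
    return dist;  // CV_32F: dist.at<float>(y, x) = distance to the nearest LED
}
```

A projected model LED at pixel (x, y) can then be scored with a single `dist.at<float>(y, x)` lookup instead of searching the detection list.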

Multi-threading support was implemented using OpenMP to speed up worst-case RANSAC executions. The search space was simply sliced across N threads, whose results were then combined.
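The slicing can be sketched roughly as follows (evaluateCandidate stands in for the pose-hypothesis scoring):

```cpp
// Sketch of slicing the RANSAC candidate space across OpenMP threads; each thread
// keeps its own best hypothesis and the results are combined at the end.
#include <functional>

struct Hypothesis { double score = -1.0; /* pose, correspondences, ... */ };

Hypothesis searchParallel(int numCandidates,
                          const std::function<Hypothesis(int)>& evaluateCandidate)
{
    Hypothesis best;
    #pragma omp parallel
    {
        Hypothesis localBest;                       // per-thread best candidate
        #pragma omp for nowait
        for (int i = 0; i < numCandidates; ++i) {
            Hypothesis h = evaluateCandidate(i);    // test one pose hypothesis
            if (h.score > localBest.score) localBest = h;
        }
        #pragma omp critical
        if (localBest.score > best.score) best = localBest;   // combine thread results
    }
    return best;
}
```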

To avoid executing full RANSAC at all, an incremental tracking mode was implemented where the 3D pose is assumed to deviate only slightly from the previous frame. In this case the 2D points are simply re-localized close to their previous locations and RANSAC can be executed with a very limited search space. Optionally, the 3D pose can be predicted one frame forward using a linear displacement in angle and translation. The incremental mode takes only a few milliseconds on any embedded device.
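The optional one-frame-ahead prediction could be sketched like this (assuming roughly constant angular and translational velocity between frames):

```cpp
// Sketch of linear pose prediction: apply the previous frame-to-frame displacement
// once more to the current pose to seed the incremental tracker.
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>

struct Pose { cv::Vec3d rvec, tvec; };

Pose predictNext(const Pose& prev, const Pose& curr)
{
    Pose pred;
    pred.tvec = curr.tvec + (curr.tvec - prev.tvec);   // linear translation step

    // Relative rotation from prev to curr, applied once more on top of curr.
    cv::Matx33d Rp, Rc;
    cv::Rodrigues(prev.rvec, Rp);
    cv::Rodrigues(curr.rvec, Rc);
    cv::Matx33d Rdelta = Rc * Rp.t();
    cv::Rodrigues(Rdelta * Rc, pred.rvec);
    return pred;
}
```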

The basic OpenCV blob detector uses thresholded input images. However, LEDs have brightness variations in grayscale and typically the center area of the LED is saturated to maximum intensity. To better take these special properties into account, the LED detection was re-implemented. The final output points are computed from the footprint pixels whose intensity is above a minimum threshold. With saturation, the output coordinates are placed at the center of the saturated area. Without saturation, the output is generated as an intensity-weighted sum of the footprint pixels. The new algorithm allows dimmer LEDs to be used.
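A condensed sketch of the saturation-aware centroid computation (illustrative; the footprint is assumed to contain the pixels above the minimum threshold):

```cpp
// Sketch of the saturation-aware LED centroid: intensity-weighted mean over the
// blob footprint, falling back to the centroid of the saturated core when present.
#include <opencv2/core.hpp>
#include <vector>

cv::Point2f ledCentroid(const cv::Mat& gray, const std::vector<cv::Point>& footprint,
                        uchar saturation = 255)
{
    double sumW = 0, sumX = 0, sumY = 0;
    double satX = 0, satY = 0;
    int satCount = 0;
    for (const auto& px : footprint) {
        uchar v = gray.at<uchar>(px);
        if (v >= saturation) { satX += px.x; satY += px.y; ++satCount; }
        sumW += v; sumX += v * px.x; sumY += v * px.y;
    }
    if (satCount > 0)   // saturated core: its center is the best position estimate
        return cv::Point2f(static_cast<float>(satX / satCount),
                           static_cast<float>(satY / satCount));
    // No saturation: intensity-weighted centroid over the footprint.
    return cv::Point2f(static_cast<float>(sumX / sumW),
                       static_cast<float>(sumY / sumW));
}
```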

">

LED brightness-weighted (saturation-aware) blob detection is more precise than plain thresholding, where nearby blobs may merge.

As a minor image-processing performance improvement, a point density filter was implemented which passes through only image regions where the local LED observation count is between 3 and N. Obviously the point density filter depends on the distance of the controller: when the controller is close to the camera, the LED observations can be scattered across the full image and filtering would not help. That is why the controller distance range needs to be specified beforehand, which fixes the allowed controller size in 2D. This distance range can then be scanned through using point density filters at various scales (see the sketch below).
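A rough sketch of such a filter using a coarse count grid and a box filter (the cell and window sizes are hypothetical parameters tied to the expected 2D controller size at the chosen distance):

```cpp
// Sketch of the point density filter: count LED detections per coarse grid cell,
// sum the counts over a window matching the expected 2D controller size, and keep
// only regions whose local count falls within [3, maxCount].
#include <opencv2/imgproc.hpp>
#include <vector>

cv::Mat densityMask(const std::vector<cv::Point2f>& leds, cv::Size imageSize,
                    int cellSize, int windowCells, int maxCount)
{
    cv::Size gridSize(imageSize.width / cellSize + 1, imageSize.height / cellSize + 1);
    cv::Mat counts = cv::Mat::zeros(gridSize, CV_32F);
    for (const auto& p : leds)
        counts.at<float>(static_cast<int>(p.y) / cellSize,
                         static_cast<int>(p.x) / cellSize) += 1.f;

    // Local observation count within a window sized by the expected controller size.
    cv::Mat local;
    cv::boxFilter(counts, local, CV_32F, cv::Size(windowCells, windowCells),
                  cv::Point(-1, -1), false /* no normalization: keep raw sums */);

    cv::Mat mask = (local >= 3) & (local <= maxCount);   // CV_8U, 255 where accepted
    return mask;
}
```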

A coarse-to-fine approach was implemented which utilized both the lower-resolution IR cameras and the higher-resolution RGB cameras. The IR cameras suffer less from noise but have lower resolution than the RGB stereo camera. To efficiently utilize both, the IR stereo camera was first used to produce a rough 3D pose which was then refined using the RGB stereo images. The baseline transform from IR to RGB camera space was defined by the 3D model, so it was immediately available. Unfortunately this approach was dropped by the customer, because they wanted to keep the IR and RGB trackers fully independent of each other. This implies speed losses with the RGB tracker, because a priori information about the VR controller pose is not available and full RANSAC must be executed more often.
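The hand-over between the two trackers is a simple pose composition; roughly (T_rgb_ir being the fixed IR-to-RGB extrinsic from the 3D model):

```cpp
// Sketch of handing a coarse IR pose over to the RGB tracker: compose the
// controller-to-IR pose with the fixed IR-to-RGB extrinsic.
#include <opencv2/core.hpp>

cv::Matx44d toMat(const cv::Matx33d& R, const cv::Vec3d& t)
{
    return { R(0,0), R(0,1), R(0,2), t(0),
             R(1,0), R(1,1), R(1,2), t(1),
             R(2,0), R(2,1), R(2,2), t(2),
             0,      0,      0,      1 };
}

// T_rgb_controller = T_rgb_ir * T_ir_controller, used as the initial guess for
// refinement against the higher-resolution RGB stereo images.
cv::Matx44d handOverPose(const cv::Matx44d& T_ir_controller, const cv::Matx44d& T_rgb_ir)
{
    return T_rgb_ir * T_ir_controller;
}
```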

Basic tracking updates only the left-side camera pose from the given 3D measurements, and any error in the 3D points will eventually show up as pose error. To increase tracking precision further, full bundle adjustment was used to optimize the pose with respect to all 2D points in a stereo view. The main upside of BA is that the 3D reconstruction error can be removed when the optimization problem is defined directly over the 2D measurements. In BA a larger cost function is defined which contains the motion parameters and the 3D point parameters, matched against fixed 2D image points in a stereo view. The lens calibration parameters were kept fixed, because they would add quite many parameters to the problem and they can usually be pre-calibrated precisely. The BA solver could in the future be extended to solve multiple subsequent poses with a motion constraint, to make it even more robust.

In bundle adjustment the 3D points are part of the optimization problem and are made parameters. It is not clear how much 3D point error exists in real life with the assumed model LED locations, but BA can address it. The customer had the Eigen and openSLAM libraries available; Eigen was used directly to avoid unnecessary kludge. The OpenCV stereo calibration module contains a messy source-code example of how stereo BA can be implemented. This base code was re-written cleanly using Eigen, and 3D point parameters were added to the cost function behind a switch. By default only the camera motion parameters were optimized when executing BA. The math formulas involved in BA were documented and provided to the customer so they could follow the code better.
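A condensed sketch of the stereo reprojection residual at the heart of the cost function (assuming undistorted, normalized 2D measurements; the Jacobians, the Gauss-Newton iteration and the optional 3D point parameters are omitted):

```cpp
// Sketch of the stereo BA reprojection residual. The full solver additionally builds
// Jacobians and iterates; 3D point parameters can optionally be added as described above.
#include <Eigen/Dense>
#include <Eigen/Geometry>
#include <vector>

struct StereoObservation {
    Eigen::Vector3d X;        // LED position in the controller model frame
    Eigen::Vector2d leftPx;   // normalized (undistorted) coordinates, left camera
    Eigen::Vector2d rightPx;  // normalized (undistorted) coordinates, right camera
};

// Residuals for one pose hypothesis: 4 values per observation (left x/y, right x/y).
Eigen::VectorXd stereoResiduals(const Eigen::Matrix3d& R, const Eigen::Vector3d& t,
                                const Eigen::Isometry3d& T_right_left,
                                const std::vector<StereoObservation>& obs)
{
    Eigen::VectorXd r(4 * obs.size());
    for (size_t i = 0; i < obs.size(); ++i) {
        Eigen::Vector3d Xl = R * obs[i].X + t;      // model frame -> left camera
        Eigen::Vector3d Xr = T_right_left * Xl;     // left camera -> right camera
        r.segment<2>(4 * i)     = Xl.hnormalized() - obs[i].leftPx;
        r.segment<2>(4 * i + 2) = Xr.hnormalized() - obs[i].rightPx;
    }
    return r;   // minimized over the motion (and optionally point) parameters
}
```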

">

The ideal pose was perturbed by rotation and translation errors. The diagrams show for both cases how motion-only optimization behaves versus full BA when the 3D point noise increases.

BA was benchmarked first using ideal input where noise was added to the pose rotation and translation and to the 3D points. With an initial pose error of about 6 degrees and 6 mm, full BA appeared to increase precision significantly once the 3D point error exceeded roughly 1-2 mm.

">

Angular and positional error illustrated with the final algorithm on Blender inputs, measured at camera level while the VR controller moves and rotates around the view over a period of time. The white dot at the bottom of each image corresponds to the camera center point and the voxels are shown in a top-down view. Top: IR camera precision. Bottom: RGB camera precision. The RGB stereo camera produces much better precision due to its higher image resolution. The final RGB stereo camera pose precision is about ±1 deg and ±2 mm.

The tracking algorithm performance was very good in incremental tracking mode (~3 ms per frame). Full-search frames with RANSAC can take more time, but with 10-20 2D points per image the performance stayed at real-time rates.

A few videos of the VR controller tracking:
Tracking 1 controller using Blender simulation images.
Tracking 2 controllers using Blender simulation images.

As a result, VR controller pose estimation became very precise with fast execution speed. The solution was handed over to the customer's in-house team of two engineers. The main idea was to transfer the technology into the customer stack and port it from Linux to Windows. This step took a surprisingly long time. Porting the code itself to Windows was fast, but a lot of time was consumed in finding a pleasant code outline as part of the customer stack. Many optimizations were removed in this process to reduce code complexity (and speed!). For example, using a distance map is a good idea performance-wise, but the customer had issues using it due to the extra complexity. This also implied that the real-time execution rate was not taken too seriously. IR and RGB camera tracking were also executed independently, which eliminated the possibility of speeding up the computation and improving precision with the coarse-to-fine approach.

An IMU sensor could constrain frame-to-frame tracking, and this VR controller had an IMU whose data could be streamed over Bluetooth. After some field testing with the customer's IMU, it was clear that both the out-of-the-box calibration and the Bluetooth transmission speed were too far off to be useful. The VR controller had a very vulnerable 3D-printed housing, which broke like an eggshell when it was dropped onto the floor during testing. The customer believed they had a precise process for calibrating the IMU, but that process never materialized during this project. Therefore an IMU-based initial guess was taken into account in the implementation, but it could not be tested properly with the existing IMU sensor, and there was no time allocated to build a calibration procedure for the IMU from scratch. To work around packet drops, the customer finally bundled 3 packets together to get most of the data through, which increased the transmission lag even further away from real-time use.

All in all, this productization step was an interesting journey into the various surprising problems customers may suffer from. Perhaps the most surprising problems were due to imprecise sensor calibrations, which alone can prevent all useful activity. During prototyping, IR+RGB wide stereo camera calibration software was written from scratch, which reduced the roughly 30 px reprojection error of the in-house calibration down to 1-2 px on the IR camera, for example, but as the project continued to the product phase, the customer switched back to their in-house calibration procedures.