We tackle the problem of camera egomotion estimation in a stereo camera system. Estimating the camera pose in a world coordinate frame is a core component of SLAM systems. Our system is designed to be similar to Parallel Tracking and Mapping (PTAM) (http://www.robots.ox.ac.uk/~gk/publications/KleinMurray2007ISMAR.pdf), which is widely used for camera tracking because it achieves real-time performance even on low-end hardware, with additional depth data from a Kinect. A video of the PTAM system presented by the authors of the paper is available here: http://www.youtube.com/watch?feature=player_embedded&v=F3s3M0mokNc#!
Our implementation consists of three separate components: one generates a dense depth map from the stereo camera system, another builds a map of the world from selected keyframes, and the third performs the actual tracking. For real-time operation, we run each of these components asynchronously on a separate hardware thread.
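A minimal sketch of this three-thread layout is shown below. The function names and loop bodies are illustrative placeholders, not our actual implementation; they only show how the three components run concurrently and shut down together.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> running{true};

// Placeholder loops; the real bodies do the work described in the text.
void depthMapLoop() {
    while (running) {
        // stereo match + densify the depth map
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
void trackingLoop() {
    while (running) {
        // estimate and refine the camera pose for each frame
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
void mappingLoop() {
    while (running) {
        // pop queued keyframes and update the map
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}

int main() {
    std::thread depth(depthMapLoop);     // thread 1: dense depth map generation
    std::thread tracking(trackingLoop);  // thread 2: real-time camera tracking
    std::thread mapping(mappingLoop);    // thread 3: asynchronous map building

    std::this_thread::sleep_for(std::chrono::seconds(1));  // stand-in for "run until the user quits"
    running = false;
    depth.join();
    tracking.join();
    mapping.join();
    return 0;
}
```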
The thread responsible for depth map generation first performs edge detection, after which a sparse depth map is computed using the stereo parameters. We then use a form of Gaussian weighting to make the depth map dense: for each point, we search a window around it for points whose depth is already known and assign each of these points a weight based on its distance from the central point, following a Gaussian distribution centred on that point. The depth of the central point is then computed as the weighted average of the depths of these points. We found that this method gives considerably better tracking results than depth maps computed using interpolation schemes such as splines.
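The following is a minimal sketch of this densification step, assuming the sparse depth map is a `CV_32F` `cv::Mat` with 0 marking unknown depth. The window size and sigma are illustrative values, not the ones used in our code.

```cpp
#include <opencv2/core.hpp>
#include <cmath>

cv::Mat densifyDepth(const cv::Mat& sparse, int window = 7, float sigma = 2.0f) {
    cv::Mat dense = sparse.clone();
    const int r = window / 2;
    for (int y = 0; y < sparse.rows; ++y) {
        for (int x = 0; x < sparse.cols; ++x) {
            if (sparse.at<float>(y, x) > 0.0f) continue;  // depth already known
            float wsum = 0.0f, dsum = 0.0f;
            for (int dy = -r; dy <= r; ++dy) {
                for (int dx = -r; dx <= r; ++dx) {
                    int ny = y + dy, nx = x + dx;
                    if (ny < 0 || ny >= sparse.rows || nx < 0 || nx >= sparse.cols) continue;
                    float d = sparse.at<float>(ny, nx);
                    if (d <= 0.0f) continue;              // skip neighbours with unknown depth
                    // Weight falls off with distance from the centre pixel,
                    // following a Gaussian profile.
                    float w = std::exp(-(dx * dx + dy * dy) / (2.0f * sigma * sigma));
                    wsum += w;
                    dsum += w * d;
                }
            }
            if (wsum > 0.0f) dense.at<float>(y, x) = dsum / wsum;  // weighted average depth
        }
    }
    return dense;
}
```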
The second thread performs the actual tracking, which is the core purpose of the system. The tracking module first computes a rough estimate of the camera pose and then refines this estimate by iterative error minimization. For the rough pose, a decaying motion model is used: the shift of the camera since the last frame is assumed to be a decayed value of the camera's motion over the previous two frames. This decayed value provides a rough estimate of the motion that serves as a seed transformation for the Iterative Closest Point (ICP) algorithm, which then finds the actual pose of the camera. ICP works by finding the 3D rigid transformation that best relates two sets of 3D points, minimizing the point-to-plane error. We extract points of interest from the incoming video frame using the SIFT feature extractor and transform these points into the world coordinate frame using the rough camera pose obtained from the motion model. We then use ICP to register this set of points against the points contained in the map and compute the actual transformation between the two, thereby obtaining the actual camera pose.
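The sketch below illustrates seeding ICP with a decayed motion model using PCL. The decay factor, the simplification of decaying only the translation, and the iteration limit are illustrative assumptions, not the exact model in our code. Note also that `pcl::IterativeClosestPoint` minimizes a point-to-point error by default; the point-to-plane variant described above would swap in a different transformation-estimation object, which is omitted here for brevity.

```cpp
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/registration/icp.h>
#include <Eigen/Core>

using Cloud = pcl::PointCloud<pcl::PointXYZ>;

// Predict the current frame-to-frame motion as a decayed copy of the last
// one (rotation kept, translation scaled down) -- illustrative simplification.
Eigen::Matrix4f predictMotion(const Eigen::Matrix4f& lastMotion, float decay = 0.8f) {
    Eigen::Matrix4f predicted = lastMotion;
    predicted.block<3, 1>(0, 3) *= decay;
    return predicted;
}

// Register the current frame's feature points against the map points,
// starting from the rough pose supplied by the motion model.
Eigen::Matrix4f refinePose(const Cloud::Ptr& framePoints,
                           const Cloud::Ptr& mapPoints,
                           const Eigen::Matrix4f& roughPose) {
    pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
    icp.setInputSource(framePoints);
    icp.setInputTarget(mapPoints);
    icp.setMaximumIterations(50);          // illustrative limit
    Cloud aligned;
    icp.align(aligned, roughPose);         // roughPose seeds the optimization
    return icp.getFinalTransformation();   // refined camera pose
}
```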
The third thread is responsible for creating a map of the environment. This is a relatively slow process and is allowed to proceed asynchronously at its own speed, without affecting the real-time nature of the tracking thread. The map consists of a set of points in a world coordinate system. The tracking thread identifies certain frames as “keyframes” and pushes them to a queue. These keyframes are then popped from the queue, one at a time, and added to the map. Each keyframe object carries a vector of keypoints, i.e., points of special interest extracted using the SIFT feature extractor. Each keyframe also carries a transformation matrix describing the transformation from the world coordinate frame to the camera frame at the time that keyframe was captured. This matrix is used to find the position of all the keyframe's keypoints in the world coordinate system, which can then be compared with the existing points in the map. When an existing map point is found close to the calculated world position of a keypoint, the two points are deemed to be the same and the position of this “map point” is refined with the newly available information. If no map point is found close to the calculated position, the keypoint is deemed to be fresh data and is added to the map as a new map point. In this manner, the map is constantly updated and refined as new information becomes available.
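The following is a minimal sketch of this keyframe-to-map integration step. The `Keyframe` struct, the merge rule (simple averaging), and the distance threshold are illustrative assumptions; the real map update in our code refines points differently.

```cpp
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/common/transforms.h>
#include <pcl/kdtree/kdtree_flann.h>
#include <Eigen/Dense>
#include <vector>

using Cloud = pcl::PointCloud<pcl::PointXYZ>;

struct Keyframe {
    Cloud::Ptr keypoints;           // SIFT keypoints in the camera frame
    Eigen::Matrix4f worldToCamera;  // world-to-camera transform at capture time
};

void integrateKeyframe(const Keyframe& kf, Cloud::Ptr& map, float maxDistSq = 0.01f) {
    // Move the keyframe's keypoints into the world coordinate frame
    // (inverse of the stored world-to-camera transform).
    Cloud worldPoints;
    pcl::transformPointCloud(*kf.keypoints, worldPoints, kf.worldToCamera.inverse());

    pcl::KdTreeFLANN<pcl::PointXYZ> tree;
    const bool haveMap = !map->empty();
    if (haveMap) tree.setInputCloud(map);

    std::vector<int> idx(1);
    std::vector<float> distSq(1);
    for (const auto& p : worldPoints.points) {
        if (haveMap && tree.nearestKSearch(p, 1, idx, distSq) > 0 && distSq[0] < maxDistSq) {
            // Close to an existing map point: refine it (here, simple averaging).
            pcl::PointXYZ& m = map->points[idx[0]];
            m.x = 0.5f * (m.x + p.x);
            m.y = 0.5f * (m.y + p.y);
            m.z = 0.5f * (m.z + p.z);
        } else {
            // No nearby map point: treat the keypoint as fresh data.
            map->push_back(p);
        }
    }
}
```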
We primarily use two packages to handle data and perform basic processing. We use OpenCV for image acquisition and edge detection. Once the dense depth map is available, we shift from images to point clouds and make extensive use of the Point Cloud Library (PCL) for data handling and SIFT feature extraction. Tracking and mapping are performed entirely on point clouds; the original images are no longer used.
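The sketch below shows this handoff: Canny edge detection with OpenCV on the incoming image, and SIFT keypoint extraction with PCL once we are working with point clouds. The thresholds and scale parameters are illustrative defaults, not the values tuned in our code.

```cpp
#include <opencv2/imgproc.hpp>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/keypoints/sift_keypoint.h>
#include <pcl/search/kdtree.h>

// Edge detection with OpenCV (input to the sparse stereo matching step).
cv::Mat detectEdges(const cv::Mat& gray) {
    cv::Mat edges;
    cv::Canny(gray, edges, 50, 150);   // illustrative thresholds
    return edges;
}

// SIFT keypoint extraction with PCL on a coloured point cloud.
pcl::PointCloud<pcl::PointWithScale>
extractSift(const pcl::PointCloud<pcl::PointXYZRGB>::Ptr& cloud) {
    pcl::SIFTKeypoint<pcl::PointXYZRGB, pcl::PointWithScale> sift;
    pcl::search::KdTree<pcl::PointXYZRGB>::Ptr tree(
        new pcl::search::KdTree<pcl::PointXYZRGB>());
    sift.setSearchMethod(tree);
    sift.setScales(0.01f, 3, 4);       // min scale, octaves, scales per octave
    sift.setMinimumContrast(0.001f);
    sift.setInputCloud(cloud);
    pcl::PointCloud<pcl::PointWithScale> keypoints;
    sift.compute(keypoints);
    return keypoints;
}
```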
The code contained in this folder has been modified to work with a single image instead of a stereo video feed, since including such a feed would make the file size too large. For testing purposes we also use a Microsoft Kinect device to obtain point clouds directly and bypass the stereo matching process entirely, for better accuracy; errors in the stereo module tend to propagate up and adversely affect tracking quality. On the day of the viva/presentation we plan to undo these changes and include the video feed instead of single images, so that the camera pose can be properly calculated.

