one of this year’s projects, and also the subject of my mini-research scholarship, was automating the process of obtaining an inverse perspective map (ipm) for an autonomous driving robot.
autonomous driving robots frequently use cameras for object and road detection. after detecting such elements in an image, the robot must be able to locate them in the real world, or else the image serves no purpose. that’s the ipm: a means of associating points in an image with points in the car coordinate system. to obtain it, one only has to understand the transformations that occur when points are projected onto an image and undo those steps for each pixel, thus obtaining its real coordinates.
a little geometry:
the first transformation to consider is the one relative to the mechanics of the camera, the way it was constructed. correcting for it involves calibrating the focal length and the alignment of the lens relative to the projective plane, the image sensor. this requires finding 4 unknowns, which are used in this way:
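a sketch of that relation, assuming the usual pinhole notation from [bradski, kaehler] rather than anything specific to this robot. the symbols are labels chosen here for the 4 unknowns: f_x and f_y are the focal lengths in pixels, c_x and c_y the offset of the optical axis on the sensor. (X, Y, Z) is the point in the real world and (x/w, y/w) the corrected image point.

```latex
\begin{equation}
\begin{bmatrix} x \\ y \\ w \end{bmatrix}
=
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix},
\qquad
(x_{\mathrm{pixel}},\, y_{\mathrm{pixel}}) = (x / w,\; y / w)
\end{equation}
```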
the left part contains the corrected points and the right part the original point coordinates in the real world, multiplied by a matrix. this matrix describes a change in scale as well as a translation. this operation gives us an image that is equivalent to the one obtained if we were using a pinhole camera.
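a minimal sketch of that operation in plain python, to make the scale and translation concrete. the function name, the parameter names (fx, fy, cx, cy) and the numeric values are all hypothetical, chosen for illustration; they are not the robot's calibration.

```python
def project_pinhole(point, fx, fy, cx, cy):
    """Project a 3D point (camera frame, Z pointing forward) to pixel coordinates."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    # Multiply by the intrinsic matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    x = fx * X + cx * Z
    y = fy * Y + cy * Z
    w = Z
    # Divide by the homogeneous coordinate to land on the image plane.
    return (x / w, y / w)


# Hypothetical intrinsics for a 640x480 sensor, for illustration only.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

on_axis = project_pinhole((0.0, 0.0, 2.0), FX, FY, CX, CY)
off_axis = project_pinhole((0.2, -0.1, 2.0), FX, FY, CX, CY)
print(on_axis, off_axis)
```

a point on the optical axis lands exactly on (cx, cy), which is the translation part; fx and fy set how many pixels a given offset from the axis is worth, which is the change in scale.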
the camera is now perfect, but the lens is still uncorrected. lenses usually produce radial distortion in the projection, and this distortion is quite intense with fisheye lenses. that’s actually the case for this robot: due to mechanical constraints, the camera is quite close to the road, and a normal lens would yield a fairly small field of view. on top of that, in practice it is hard to mount the lens perfectly parallel to the image sensor, which is a requirement for an undistorted projection. these two problems are corrected with 5 new parameters, the coefficients of the first few terms of a taylor expansion around the center of the lens [bradski, kaehler]:
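a sketch of that model, assuming the form documented for opencv. the coefficient names are not from this post: k_1, k_2, k_3 are the radial terms and p_1, p_2 the tangential ones (the lens not being parallel to the sensor). (x, y) is the ideal point in normalized pinhole coordinates and (x_d, y_d) is where the real lens puts it; correcting an image means inverting this mapping.

```latex
\begin{align}
r^2 &= x^2 + y^2 \\
x_d &= x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2\,(r^2 + 2 x^2) \\
y_d &= y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1\,(r^2 + 2 y^2) + 2 p_2 x y
\end{align}
```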
after these two transformations we say that the camera is calibrated for the intrinsic parameters. at this point, if we know the position of the camera relative to the floor, elementary geometry allows us to solve for the position of each point. usually, though, it’s hard to mount a camera at a specific angle to the floor and at a fixed height. to find these parameters we need to solve for the perspective transformation, which will allow us to obtain a bird’s eye view of whatever the camera is capturing. this transform is given by the following operation [bevilacqua, gherardi, carozza]:
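a sketch of that operation in its generic homography form, which may differ in notation from [bevilacqua, gherardi, carozza]. the names are chosen here: (x, y) is the uncorrected image point, (x', y') its position in the bird’s eye view, s an arbitrary scale factor, and the h_ij entries pack the rotations and translations.

```latex
\begin{equation}
s \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}
=
\begin{bmatrix}
h_{11} & h_{12} & h_{13} \\
h_{21} & h_{22} & h_{23} \\
h_{31} & h_{32} & h_{33}
\end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
\end{equation}
```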
on the left we have the corrected image point; on the right, the matrix that describes its rotations and translations, and the uncorrected image point. these rotations and translations are the unknowns, called the extrinsic parameters. once we have a bird’s eye view of the scene we can infer distances between points. we just need to locate one known point to fix the coordinates of all the others.
a little code:
when you don’t know these parameters, because cameras don’t come with a datasheet stating the angle between the lens and the sensor, or how much the lens distorts the image, you can obtain them by capturing known objects with identifiable points. chessboards are a good example. fortunately i had to do very little with this math introduction to calibrate the image, because opencv solves most of these problems with just a few function calls. anyway, i didn’t know that until i studied image projection, and it eventually came in handy to know the inner workings.
so, opencv provides me with: a way of detecting chessboard corners, the easily identifiable points mentioned above; given a series of chessboards, the intrinsic parameters; given four image points and their corrected projection, the extrinsic parameters.
detecting chessboard corners is just a call to cvFindChessboardCorners() away. it accepts an image and returns a list of points. this list of points can be passed, along with the image that produced them, to cvFindCornerSubPix() to get subpixel accuracy on those same points.
given a series of captures of a chessboard in different positions, which will produce a series of lists of points, opencv provides cvCalibrateCamera2(), which calculates the rotations and translations of the chessboard between captures, thus fixing all 9 intrinsic parameters. it should be noted that, for better results, at least 10 chessboard captures should be taken. using cvUndistort2() one is able to see the result of correcting an image for the intrinsic parameters. this is easily evaluated for fisheye lenses, since after the calibration there should be no visible distortion: straight lines should remain straight [devernay, faugeras]. cvUndistortPoints() corrects a sparse set of points instead of a whole image.
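what correcting a sparse set of points involves can be sketched in plain python. this illustrates the math and is not opencv’s code. it assumes the radial and tangential model documented for opencv; the function names and coefficient values are hypothetical. distort() is the forward model, and undistort() inverts it by fixed-point iteration, which is one common approach and not necessarily the exact routine opencv runs.

```python
def distort(x, y, k1, k2, p1, p2, k3):
    """Apply radial + tangential distortion to a normalized image point."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    xd = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    yd = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return xd, yd


def undistort(xd, yd, k1, k2, p1, p2, k3, iterations=20):
    """Invert distort() by fixed-point iteration, starting at the distorted point."""
    x, y = xd, yd
    for _ in range(iterations):
        r2 = x * x + y * y
        radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
        dx = 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
        dy = p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
        # Remove the tangential shift, then divide out the radial scaling.
        x = (xd - dx) / radial
        y = (yd - dy) / radial
    return x, y


# Hypothetical coefficients (mild barrel distortion), for illustration only.
COEFFS = dict(k1=-0.2, k2=0.05, p1=0.001, p2=-0.0005, k3=0.0)

ideal = (0.3, -0.2)
distorted = distort(*ideal, **COEFFS)
recovered = undistort(*distorted, **COEFFS)
print(distorted, recovered)
```

the points are in normalized coordinates, i.e. after removing the focal lengths and the optical center; the iteration converges quickly here because the distortion is small near the center of the lens.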
once calibrated for the intrinsic parameters, we can find all 8 extrinsic parameters (the matrix is normalized so that its last element equals 1) using 4 points, no three of which are collinear (their ideal and distorted positions). one can do this by laying a chessboard on the road and detecting the outermost corners. we know that these form a rectangle, so we can fix 4 coordinates that form a rectangle where those corners should be projected. opencv allows us to do this using cvGetPerspectiveTransform(). the result can be observed with a call to cvWarpPerspective(), which yields the bird’s eye view mentioned above (or the result of any other calculated perspective transform). cvPerspectiveTransform() corrects a sparse set of points instead of a whole image.
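to show why 4 point pairs are enough, here is a plain python sketch that solves for the 8 unknowns directly, assuming the last matrix element is fixed at 1. each pair contributes two linear equations, so 4 pairs give an 8x8 system. the function names and corner coordinates are hypothetical stand-ins; in the project this job is done by the opencv calls named above.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate points (three of them collinear?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def perspective_transform(src, dst):
    """Return the 3x3 matrix mapping 4 src points onto 4 dst points (h33 = 1)."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), rearranged to be linear.
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h, point):
    """Apply the matrix to one point, including the perspective division."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical pixel positions of the chessboard's outermost corners (a trapezoid
# in the camera image) and the rectangle they should occupy in the bird's eye view.
image_corners = [(220.0, 300.0), (420.0, 300.0), (560.0, 440.0), (80.0, 440.0)]
birdseye_corners = [(200.0, 100.0), (440.0, 100.0), (440.0, 380.0), (200.0, 380.0)]

H = perspective_transform(image_corners, birdseye_corners)
print([warp_point(H, p) for p in image_corners])
```

warp_point() is the sparse counterpart of warping a whole image: any pixel detected in the camera view, such as a lane marking, can be sent through the same matrix.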
so, since we now have a bird’s eye view, we only need to locate two points in the image and indicate their real coordinates. this will allow us to both find the relation between distances in pixels and distances in the real world and also fix the location of all the points.
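a sketch of that last step in plain python. it assumes the bird’s eye view differs from the car frame only by a uniform scale, a rotation and a translation (no mirroring), so two reference points determine everything. the function name, the reference pixels and their coordinates in meters are hypothetical.

```python
def pixel_to_car_map(pix_a, car_a, pix_b, car_b):
    """Build a pixel -> car-frame mapping from two reference points.

    Points are treated as complex numbers, so one complex factor carries
    both the scale (its magnitude) and the rotation (its phase).
    """
    pa, pb = complex(*pix_a), complex(*pix_b)
    ca, cb = complex(*car_a), complex(*car_b)
    if pa == pb:
        raise ValueError("the two reference pixels must be distinct")
    factor = (cb - ca) / (pb - pa)  # scale and rotation
    offset = ca - factor * pa       # translation

    def to_car(pixel):
        z = factor * complex(*pixel) + offset
        return (z.real, z.imag)

    return to_car, abs(factor)


# Hypothetical references: two bird's eye pixels known to be 0.6 m apart.
to_car, meters_per_pixel = pixel_to_car_map(
    (200.0, 380.0), (0.0, 0.0),
    (440.0, 380.0), (0.6, 0.0),
)
halfway = to_car((320.0, 380.0))
print(meters_per_pixel, halfway)
```

the magnitude of the factor is the relation between pixel distances and real distances, and the offset fixes the location of every other pixel. if the car frame is mirrored relative to the image (for instance an axis pointing up while image rows grow downwards), one coordinate would have to be flipped before applying this.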
conclusion:
this is pretty much an overview of what i’ve been doing, and it is largely based on “learning opencv: computer vision with the opencv library”, from o’reilly media. further steps include using this map so that algorithms such as “lane detection” and “car positioning” can work in car coordinates instead of image coordinates. this will not only make it easier for the programmer to write algorithms but will also make it easier to port solutions to other architectures and different camera mounts.
the project code is not public per se, but i’m free to disclose any requested parts.


