Visual Servoing Platform  version 3.3.0 under development (2020-02-17)
Tutorial: Object detection and localization

Introduction

This tutorial shows how to use keypoints to detect a known object and estimate its pose using its CAD model. The first step consists in detecting and learning keypoints located on the faces of the object. The second step matches the keypoints detected in a query image with those previously learned. The resulting matches are then used to estimate the pose of the object from the correspondences between 2D image coordinates and 3D model coordinates.

The next section presents a basic example of the detection of a teabox with a detailed description of the different steps.

Note that all the material (source code and video) described in this tutorial is part of the ViSP source code and can be downloaded using the following command:

$ svn export https://github.com/lagadic/visp.git/trunk/tutorial/detection/object

Object detection using keypoints

Preamble

If you are not familiar with these concepts, you are advised to first read Tutorial: Markerless generic model-based tracking using a color camera and Tutorial: Keypoints matching.

Principle of object detection using keypoints

The following diagrams give a quick overview of the principle.

img-learning-step.jpeg
Learning step.

The first part of the process consists in learning the characteristics of the considered object by extracting the keypoints detected on its different faces. Here we use the model-based tracker, initialized with a known initial pose, to have access to the CAD model of the object. The CAD model is then used to keep only the keypoints located on visible faces and to compute their 3D coordinates.

Note
The calculation of the 3D coordinates of a keypoint is based on a planar location hypothesis. We assume that the keypoint is located on a planar face and the Z-coordinate is retrieved according to the proportional relation between the plane equation expressed in the normalized camera frame (derived from the image coordinate) and the same plane equation expressed in the camera frame, thanks to the known pose of the object.
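
The planar hypothesis described in this note can be sketched in a few lines of standard C++, independently of ViSP. If the face lies on the plane A*X + B*Y + C*Z + D = 0 expressed in the camera frame, and (x, y) are the normalized coordinates of the keypoint, then X = x*Z and Y = y*Z, which gives Z = -D / (A*x + B*y + C). The names Camera, Plane, Point3 and backProjectOnPlane() are hypothetical and introduced only for this illustration; they are not ViSP classes, and the ViSP implementation may differ in its details.

```cpp
#include <cassert>
#include <cmath>

// Intrinsic parameters of a perspective camera without distortion.
struct Camera { double px, py, u0, v0; };

// Plane A*X + B*Y + C*Z + D = 0 expressed in the camera frame.
struct Plane { double A, B, C, D; };

// 3D point expressed in the camera frame.
struct Point3 { double X, Y, Z; };

// Back-project the pixel (u, v) onto the plane (hypothetical helper).
Point3 backProjectOnPlane(const Camera &cam, const Plane &plane, double u, double v)
{
  // Pixel coordinates to normalized coordinates.
  const double x = (u - cam.u0) / cam.px;
  const double y = (v - cam.v0) / cam.py;

  // Substituting X = x*Z and Y = y*Z in the plane equation gives Z.
  const double denom = plane.A * x + plane.B * y + plane.C;
  assert(std::fabs(denom) > 1e-12); // the viewing ray must not be parallel to the plane
  const double Z = -plane.D / denom;

  Point3 P;
  P.X = x * Z;
  P.Y = y * Z;
  P.Z = Z;
  return P;
}
```

The returned point is expressed in the camera frame. Expressing it in the object frame would additionally require the inverse of the known pose cMo, which is omitted here.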

In this example the learned data (the list of 3D coordinates and the corresponding descriptors) are saved in a file and will be used later in the detection part.

img-detection-step.jpeg
Detection step.

In a query image where we want to detect the object, we match the keypoints detected in the current image with those previously learned. The pose of the object can then be estimated from the resulting 3D/2D correspondences.

The next section presents an example of the detection and the pose estimation of a teabox.

Teabox detection and pose estimation

The following video shows the resulting detection and localization of a teabox that is learned on the first image of the video.

The corresponding code is available in tutorial-detection-object-mbt.cpp. It contains the different steps to learn the teabox object on one image (the first image of the video) and then detect and get the pose of the teabox in the rest of the video.

#include <visp3/core/vpConfig.h>
#include <visp3/core/vpIoTools.h>
#include <visp3/gui/vpDisplayGDI.h>
#include <visp3/gui/vpDisplayOpenCV.h>
#include <visp3/gui/vpDisplayX.h>
#include <visp3/io/vpVideoReader.h>
#include <visp3/mbt/vpMbGenericTracker.h>
#include <visp3/vision/vpKeyPoint.h>
int main(int argc, char **argv)
{
#if (VISP_HAVE_OPENCV_VERSION >= 0x020400)
try {
std::string videoname = "teabox.mpg";
for (int i = 0; i < argc; i++) {
if (std::string(argv[i]) == "--name" && i + 1 < argc)
videoname = std::string(argv[i + 1]);
else if (std::string(argv[i]) == "--help") {
std::cout << "\nUsage: " << argv[0] << " [--name <video name>] [--help]\n" << std::endl;
return 0;
}
}
std::string parentname = vpIoTools::getParent(videoname);
std::string objectname = vpIoTools::getNameWE(videoname);
if (!parentname.empty())
objectname = parentname + "/" + objectname;
std::cout << "Video name: " << videoname << std::endl;
std::cout << "Tracker requested config files: " << objectname << ".[init,"
#ifdef VISP_HAVE_PUGIXML
<< "xml,"
#endif
<< "cao or wrl]" << std::endl;
std::cout << "Tracker optional config files: " << objectname << ".[ppm]" << std::endl;
vpImage<unsigned char> I;
vpCameraParameters cam;
vpHomogeneousMatrix cMo;
vpVideoReader g;
g.setFileName(videoname);
g.open(I);
#if defined(VISP_HAVE_X11)
vpDisplayX display;
#elif defined(VISP_HAVE_GDI)
vpDisplayGDI display;
#elif defined(VISP_HAVE_OPENCV)
vpDisplayOpenCV display;
#else
std::cout << "No image viewer is available..." << std::endl;
return 0;
#endif
display.init(I, 100, 100, "Model-based edge tracker");
vpMbGenericTracker tracker(vpMbGenericTracker::EDGE_TRACKER);
bool usexml = false;
#ifdef VISP_HAVE_PUGIXML
if (vpIoTools::checkFilename(objectname + ".xml")) {
tracker.loadConfigFile(objectname + ".xml");
tracker.getCameraParameters(cam);
usexml = true;
}
#endif
if (!usexml) {
vpMe me;
me.setMaskSize(5);
me.setMaskNumber(180);
me.setRange(8);
me.setThreshold(10000);
me.setMu1(0.5);
me.setMu2(0.5);
tracker.setMovingEdge(me);
cam.initPersProjWithoutDistortion(839, 839, 325, 243);
tracker.setCameraParameters(cam);
tracker.setFarClippingDistance(100.0);
}
tracker.setOgreVisibilityTest(false);
if (vpIoTools::checkFilename(objectname + ".cao"))
tracker.loadModel(objectname + ".cao");
else if (vpIoTools::checkFilename(objectname + ".wrl"))
tracker.loadModel(objectname + ".wrl");
tracker.setDisplayFeatures(true);
tracker.initClick(I, objectname + ".init", true);
tracker.track(I);
#if (defined(VISP_HAVE_OPENCV_NONFREE) || defined(VISP_HAVE_OPENCV_XFEATURES2D))
std::string detectorName = "SIFT";
std::string extractorName = "SIFT";
std::string matcherName = "BruteForce";
std::string configurationFile = "detection-config-SIFT.xml";
#else
std::string detectorName = "FAST";
std::string extractorName = "ORB";
std::string matcherName = "BruteForce-Hamming";
std::string configurationFile = "detection-config.xml";
#endif
vpKeyPoint keypoint_learning;
if (usexml) {
#ifdef VISP_HAVE_PUGIXML
keypoint_learning.loadConfigFile(configurationFile);
#endif
} else {
keypoint_learning.setDetector(detectorName);
keypoint_learning.setExtractor(extractorName);
keypoint_learning.setMatcher(matcherName);
}
std::vector<cv::KeyPoint> trainKeyPoints;
double elapsedTime;
keypoint_learning.detect(I, trainKeyPoints, elapsedTime);
std::vector<vpPolygon> polygons;
std::vector<std::vector<vpPoint> > roisPt;
std::pair<std::vector<vpPolygon>, std::vector<std::vector<vpPoint> > > pair = tracker.getPolygonFaces(false);
polygons = pair.first;
roisPt = pair.second;
std::vector<cv::Point3f> points3f;
tracker.getPose(cMo);
vpKeyPoint::compute3DForPointsInPolygons(cMo, cam, trainKeyPoints, polygons, roisPt, points3f);
keypoint_learning.buildReference(I, trainKeyPoints, points3f);
keypoint_learning.saveLearningData("teabox_learning_data.bin", true);
vpDisplay::display(I);
for (std::vector<cv::KeyPoint>::const_iterator it = trainKeyPoints.begin(); it != trainKeyPoints.end(); ++it) {
vpDisplay::displayCross(I, (int)it->pt.y, (int)it->pt.x, 4, vpColor::red);
}
vpDisplay::displayText(I, 10, 10, "Learning step: keypoints are detected on visible teabox faces", vpColor::red);
vpDisplay::displayText(I, 30, 10, "Click to continue with detection...", vpColor::red);
vpDisplay::flush(I);
vpDisplay::getClick(I, true);
vpKeyPoint keypoint_detection;
if (usexml) {
#ifdef VISP_HAVE_PUGIXML
keypoint_detection.loadConfigFile(configurationFile);
#endif
} else {
keypoint_detection.setDetector(detectorName);
keypoint_detection.setExtractor(extractorName);
keypoint_detection.setMatcher(matcherName);
keypoint_detection.setMatchingRatioThreshold(0.8);
keypoint_detection.setUseRansacVVS(true);
keypoint_detection.setUseRansacConsensusPercentage(true);
keypoint_detection.setRansacConsensusPercentage(20.0);
keypoint_detection.setRansacIteration(200);
keypoint_detection.setRansacThreshold(0.005);
}
keypoint_detection.loadLearningData("teabox_learning_data.bin", true);
double error;
bool click_done = false;
while (!g.end()) {
g.acquire(I);
vpDisplay::display(I);
vpDisplay::displayText(I, 10, 10, "Detection and localization in process...", vpColor::red);
if (keypoint_detection.matchPoint(I, cam, cMo, error, elapsedTime)) {
tracker.setPose(I, cMo);
tracker.display(I, cMo, cam, vpColor::red, 2);
vpDisplay::displayFrame(I, cMo, cam, 0.025, vpColor::none, 3);
}
vpDisplay::displayText(I, 30, 10, "A click to exit.", vpColor::red);
vpDisplay::flush(I);
if (vpDisplay::getClick(I, false)) {
click_done = true;
break;
}
}
if (!click_done)
vpDisplay::getClick(I);
#if defined(VISP_HAVE_COIN3D) && (COIN_MAJOR_VERSION >= 2)
SoDB::finish();
#endif
} catch (const vpException &e) {
std::cout << "Catch an exception: " << e << std::endl;
}
#else
(void)argc;
(void)argv;
std::cout << "Install OpenCV and rebuild ViSP to use this example." << std::endl;
#endif
return 0;
}

The following lines reuse the code of tutorial-mb-edge-tracker.cpp to initialize the model-based tracker at a given pose and with the appropriate configuration.

try {
std::string videoname = "teabox.mpg";
for (int i = 0; i < argc; i++) {
if (std::string(argv[i]) == "--name" && i + 1 < argc)
videoname = std::string(argv[i + 1]);
else if (std::string(argv[i]) == "--help") {
std::cout << "\nUsage: " << argv[0] << " [--name <video name>] [--help]\n" << std::endl;
return 0;
}
}
std::string parentname = vpIoTools::getParent(videoname);
std::string objectname = vpIoTools::getNameWE(videoname);
if (!parentname.empty())
objectname = parentname + "/" + objectname;
std::cout << "Video name: " << videoname << std::endl;
std::cout << "Tracker requested config files: " << objectname << ".[init,"
#ifdef VISP_HAVE_PUGIXML
<< "xml,"
#endif
<< "cao or wrl]" << std::endl;
std::cout << "Tracker optional config files: " << objectname << ".[ppm]" << std::endl;
vpImage<unsigned char> I;
vpCameraParameters cam;
vpHomogeneousMatrix cMo;
vpVideoReader g;
g.setFileName(videoname);
g.open(I);
#if defined(VISP_HAVE_X11)
vpDisplayX display;
#elif defined(VISP_HAVE_GDI)
vpDisplayGDI display;
#elif defined(VISP_HAVE_OPENCV)
vpDisplayOpenCV display;
#else
std::cout << "No image viewer is available..." << std::endl;
return 0;
#endif
display.init(I, 100, 100, "Model-based edge tracker");
vpMbGenericTracker tracker(vpMbGenericTracker::EDGE_TRACKER);
bool usexml = false;
#ifdef VISP_HAVE_PUGIXML
if (vpIoTools::checkFilename(objectname + ".xml")) {
tracker.loadConfigFile(objectname + ".xml");
tracker.getCameraParameters(cam);
usexml = true;
}
#endif
if (!usexml) {
vpMe me;
me.setMaskSize(5);
me.setMaskNumber(180);
me.setRange(8);
me.setThreshold(10000);
me.setMu1(0.5);
me.setMu2(0.5);
tracker.setMovingEdge(me);
cam.initPersProjWithoutDistortion(839, 839, 325, 243);
tracker.setCameraParameters(cam);
tracker.setFarClippingDistance(100.0);
}
tracker.setOgreVisibilityTest(false);
if (vpIoTools::checkFilename(objectname + ".cao"))
tracker.loadModel(objectname + ".cao");
else if (vpIoTools::checkFilename(objectname + ".wrl"))
tracker.loadModel(objectname + ".wrl");
tracker.setDisplayFeatures(true);
tracker.initClick(I, objectname + ".init", true);
tracker.track(I);

The code specific to this tutorial starts here.

First, we have to choose which type of keypoints will be used. SIFT keypoints are widely used in computer vision, but depending on your version of OpenCV and because of patent restrictions, some types of keypoints may not be available. Here, we use SIFT if it is available, and otherwise a combination of the FAST keypoint detector and the ORB descriptor extractor.

#if (defined(VISP_HAVE_OPENCV_NONFREE) || defined(VISP_HAVE_OPENCV_XFEATURES2D))
std::string detectorName = "SIFT";
std::string extractorName = "SIFT";
std::string matcherName = "BruteForce";
std::string configurationFile = "detection-config-SIFT.xml";
#else
std::string detectorName = "FAST";
std::string extractorName = "ORB";
std::string matcherName = "BruteForce-Hamming";
std::string configurationFile = "detection-config.xml";
#endif

The following line declares an instance of the vpKeyPoint class:

vpKeyPoint keypoint_learning;

You can load the configuration (type of detector, extractor and matcher, RANSAC pose estimation parameters) directly from an XML configuration file:

#ifdef VISP_HAVE_PUGIXML
keypoint_learning.loadConfigFile(configurationFile);
#endif

Otherwise, the configuration must be made in the code.

keypoint_learning.setDetector(detectorName);
keypoint_learning.setExtractor(extractorName);
keypoint_learning.setMatcher(matcherName);

We then detect keypoints in the reference image that contains the object we want to learn:

std::vector<cv::KeyPoint> trainKeyPoints;
double elapsedTime;
keypoint_learning.detect(I, trainKeyPoints, elapsedTime);

But we need to keep only the keypoints located on the faces of the teabox. The model-based tracker is used for this, first to eliminate the keypoints that do not belong to the teabox, and second to obtain the plane equation of each face (and thus to be able to compute the 3D coordinates from the 2D information).

std::vector<vpPolygon> polygons;
std::vector<std::vector<vpPoint> > roisPt;
std::pair<std::vector<vpPolygon>, std::vector<std::vector<vpPoint> > > pair = tracker.getPolygonFaces(false);
polygons = pair.first;
roisPt = pair.second;
std::vector<cv::Point3f> points3f;
tracker.getPose(cMo);
vpKeyPoint::compute3DForPointsInPolygons(cMo, cam, trainKeyPoints, polygons, roisPt, points3f);
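
To illustrate the first of these two roles, here is a minimal sketch in standard C++ of how keypoints can be filtered with a point-in-polygon test (ray casting), assuming each visible face is given as a 2D polygon in the image. The names Point2, isInsidePolygon() and keepPointsOnFaces() are hypothetical and introduced for this illustration only. In the tutorial the selection is done inside vpKeyPoint::compute3DForPointsInPolygons(), whose actual implementation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point2 { double x, y; };

// Ray-casting test: count how many polygon edges are crossed by a
// horizontal ray starting at p. An odd count means p is inside.
bool isInsidePolygon(const std::vector<Point2> &polygon, const Point2 &p)
{
  const std::size_t n = polygon.size();
  if (n < 3) {
    return false; // not a polygon
  }
  bool inside = false;
  for (std::size_t i = 0, j = n - 1; i < n; j = i++) {
    const Point2 &a = polygon[i];
    const Point2 &b = polygon[j];
    const bool straddles = (a.y > p.y) != (b.y > p.y);
    if (straddles && p.x < (b.x - a.x) * (p.y - a.y) / (b.y - a.y) + a.x) {
      inside = !inside;
    }
  }
  return inside;
}

// Keep only the points that fall inside at least one face polygon.
std::vector<Point2> keepPointsOnFaces(const std::vector<Point2> &points,
                                      const std::vector<std::vector<Point2> > &faces)
{
  std::vector<Point2> kept;
  for (std::size_t k = 0; k < points.size(); ++k) {
    for (std::size_t f = 0; f < faces.size(); ++f) {
      if (isInsidePolygon(faces[f], points[k])) {
        kept.push_back(points[k]);
        break; // one face is enough
      }
    }
  }
  return kept;
}
```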

The next step is to build the reference keypoints. The descriptor of each keypoint is also extracted, so that the reference data consist of the list of keypoints, the list of descriptors and the list of 3D points.

keypoint_learning.buildReference(I, trainKeyPoints, points3f);

We save the learning data in a binary format to be able to use it later (the other possibility is an XML format, which takes more space).

keypoint_learning.saveLearningData("teabox_learning_data.bin", true);

We then visualize the result of the learning process by drawing a cross at the location of each keypoint:

vpDisplay::display(I);
for (std::vector<cv::KeyPoint>::const_iterator it = trainKeyPoints.begin(); it != trainKeyPoints.end(); ++it) {
vpDisplay::displayCross(I, (int)it->pt.y, (int)it->pt.x, 4, vpColor::red);
}
vpDisplay::displayText(I, 10, 10, "Learning step: keypoints are detected on visible teabox faces", vpColor::red);
vpDisplay::displayText(I, 30, 10, "Click to continue with detection...", vpColor::red);
vpDisplay::flush(I);
vpDisplay::getClick(I, true);

We now declare another instance of the vpKeyPoint class, dedicated this time to the detection of the teabox. The configuration is loaded from an XML file if one is used; otherwise it is set directly in the code.

vpKeyPoint keypoint_detection;
if (usexml) {
#ifdef VISP_HAVE_PUGIXML
keypoint_detection.loadConfigFile(configurationFile);
#endif
} else {
keypoint_detection.setDetector(detectorName);
keypoint_detection.setExtractor(extractorName);
keypoint_detection.setMatcher(matcherName);
keypoint_detection.setMatchingRatioThreshold(0.8);
keypoint_detection.setUseRansacVVS(true);
keypoint_detection.setUseRansacConsensusPercentage(true);
keypoint_detection.setRansacConsensusPercentage(20.0);
keypoint_detection.setRansacIteration(200);
keypoint_detection.setRansacThreshold(0.005);
}

The previously saved binary file corresponding to the teabox learning data is loaded:

keypoint_detection.loadLearningData("teabox_learning_data.bin", true);

We are now ready to detect the teabox in a query image. The function vpKeyPoint::matchPoint() returns true if the matching was successful and provides the estimated homogeneous matrix corresponding to the pose of the object. The reprojection error is also computed.

if (keypoint_detection.matchPoint(I, cam, cMo, error, elapsedTime)) {

To display the result, we set the tracker to the estimated pose and also display the location of the world frame:

tracker.setPose(I, cMo);
tracker.display(I, cMo, cam, vpColor::red, 2);
vpDisplay::displayFrame(I, cMo, cam, 0.025, vpColor::none, 3);

The pose of the detected object can then be used to initialize a tracker automatically rather than relying on a manual initialization; see Tutorial: Markerless generic model-based tracking using a color camera and Tutorial: Template tracking.

Quick explanation about some parameters used in the example

The content of the configuration file detection-config-SIFT.xml provided with this example is the following:

<?xml version="1.0"?>
<conf>
<detector>
<name>SIFT</name>
</detector>
<extractor>
<name>SIFT</name>
</extractor>
<matcher>
<name>BruteForce</name>
<matching_method>ratioDistanceThreshold</matching_method>
<matchingRatioThreshold>0.8</matchingRatioThreshold>
</matcher>
<ransac>
<useRansacVVS>1</useRansacVVS>
<useRansacConsensusPercentage>1</useRansacConsensusPercentage>
<ransacConsensusPercentage>20.0</ransacConsensusPercentage>
<nbRansacIterations>200</nbRansacIterations>
<ransacThreshold>0.005</ransacThreshold>
</ransac>
</conf>

In this configuration file, SIFT keypoints are used.

Let us explain now the configuration of the matcher:

  • a brute-force matcher compares each keypoint detected in the current image with every keypoint of the reference set and keeps the closest one in terms of descriptor distance, whereas the alternative matcher based on the FLANN library (Fast Library for Approximate Nearest Neighbors) uses optimizations to reduce the cost of the search,
  • to eliminate some possible false matches, one technique consists in keeping only the keypoints that are sufficiently discriminated according to a ratio test, that is, whose closest match is clearly better than the second closest one (here with a ratio threshold of 0.8).
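
The two points above can be sketched in standard C++ as follows. This is only an illustration of brute-force matching followed by a ratio test, assuming floating-point descriptors compared with the Euclidean distance. The names Match, l2Distance() and matchWithRatioTest() are hypothetical; the matching in the tutorial is delegated to vpKeyPoint and OpenCV, and their implementation may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

typedef std::vector<float> Descriptor;

struct Match {
  std::size_t queryIdx; // index in the current image
  std::size_t trainIdx; // index in the reference set
  double distance;      // descriptor distance of the retained match
};

// Euclidean distance between two descriptors of the same length.
double l2Distance(const Descriptor &a, const Descriptor &b)
{
  assert(a.size() == b.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
    sum += d * d;
  }
  return std::sqrt(sum);
}

// Brute force: every query descriptor is compared with every reference
// descriptor. A match is kept only if best < ratio * secondBest.
std::vector<Match> matchWithRatioTest(const std::vector<Descriptor> &query,
                                      const std::vector<Descriptor> &train, double ratio)
{
  std::vector<Match> matches;
  if (train.size() < 2) {
    return matches; // the ratio test needs two candidates
  }
  for (std::size_t q = 0; q < query.size(); ++q) {
    double best = std::numeric_limits<double>::infinity();
    double second = std::numeric_limits<double>::infinity();
    std::size_t bestIdx = 0;
    for (std::size_t t = 0; t < train.size(); ++t) {
      const double d = l2Distance(query[q], train[t]);
      if (d < best) {
        second = best;
        best = d;
        bestIdx = t;
      } else if (d < second) {
        second = d;
      }
    }
    if (best < ratio * second) {
      Match m = {q, bestIdx, best};
      matches.push_back(m);
    }
  }
  return matches;
}
```

With a ratio of 0.8, a query descriptor that is equally close to two reference descriptors is rejected as ambiguous, while a descriptor with a single clear nearest neighbor is kept.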

Now, for the RANSAC pose estimation part:

  • two methods are provided to estimate the pose in a robust way: one uses OpenCV, the other uses a virtual visual servoing approach implemented in ViSP,
  • basically, a RANSAC method repeats two steps for a given number of iterations: first, 4 points are picked randomly and used to estimate a pose; second, all the points that sufficiently "agree" with this pose (their reprojection error is below a threshold) are kept. These points are the inliers and form the consensus set; the others are outliers. If the consensus set contains enough points (here 20 % of all the points), the pose is refined and returned; otherwise another iteration is made (here 200 iterations at most).
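
The structure of this loop can be sketched in standard C++. To keep the example short and self-contained, the pose model is replaced by a much simpler one, a 2D line y = a*x + b estimated from 2 points instead of a pose estimated from 4 points; only the RANSAC logic (random minimal sample, consensus set, consensus percentage, maximum number of iterations, final refinement) mirrors the description above. All names (Point2, Line, RansacResult, fitLine(), ransacLine()) are hypothetical and not part of ViSP.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct Point2 { double x, y; };
struct Line { double a, b; }; // y = a * x + b

struct RansacResult {
  bool found;
  Line model;
  std::vector<std::size_t> inliers; // indices of the consensus set
};

// Refinement step: least-squares line fit on the selected points.
Line fitLine(const std::vector<Point2> &pts, const std::vector<std::size_t> &idx)
{
  double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
  for (std::size_t k = 0; k < idx.size(); ++k) {
    const Point2 &p = pts[idx[k]];
    sx += p.x; sy += p.y; sxx += p.x * p.x; sxy += p.x * p.y;
  }
  const double n = static_cast<double>(idx.size());
  const double denom = n * sxx - sx * sx;
  assert(denom > 0.0); // at least two distinct abscissas are required
  Line line;
  line.a = (n * sxy - sx * sy) / denom;
  line.b = (sy - line.a * sx) / n;
  return line;
}

RansacResult ransacLine(const std::vector<Point2> &pts, double threshold, double consensusPercentage,
                        unsigned int maxIterations, unsigned int seed)
{
  assert(threshold > 0.0);
  RansacResult result;
  result.found = false;
  result.model.a = 0.0;
  result.model.b = 0.0;
  if (pts.size() < 2) {
    return result;
  }
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> pick(0, pts.size() - 1);
  const std::size_t needed =
      static_cast<std::size_t>(std::ceil(consensusPercentage / 100.0 * static_cast<double>(pts.size())));

  for (unsigned int it = 0; it < maxIterations; ++it) {
    // Step 1: random minimal sample and candidate model.
    const std::size_t i = pick(rng), j = pick(rng);
    if (i == j || pts[i].x == pts[j].x) {
      continue; // degenerate sample
    }
    Line cand;
    cand.a = (pts[j].y - pts[i].y) / (pts[j].x - pts[i].x);
    cand.b = pts[i].y - cand.a * pts[i].x;

    // Step 2: consensus set, i.e. points whose residual is below the threshold.
    std::vector<std::size_t> inliers;
    for (std::size_t k = 0; k < pts.size(); ++k) {
      if (std::fabs(pts[k].y - (cand.a * pts[k].x + cand.b)) < threshold) {
        inliers.push_back(k);
      }
    }

    // Enough consensus: refine on the inliers and return.
    if (inliers.size() >= needed) {
      result.found = true;
      result.model = fitLine(pts, inliers);
      result.inliers = inliers;
      return result;
    }
  }
  return result; // no model reached the requested consensus
}
```

Note that a low consensus percentage makes the loop stop early on the first acceptable model, while a higher value is more selective but may need more iterations.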

Below you will also find the content of the detection-config.xml configuration file, also provided with this example. It selects the FAST detector and the ORB extractor.

<?xml version="1.0"?>
<conf>
<detector>
<name>FAST</name>
</detector>
<extractor>
<name>ORB</name>
</extractor>
<matcher>
<name>BruteForce-Hamming</name>
<matching_method>ratioDistanceThreshold</matching_method>
<matchingRatioThreshold>0.8</matchingRatioThreshold>
</matcher>
<ransac>
<useRansacVVS>1</useRansacVVS>
<useRansacConsensusPercentage>1</useRansacConsensusPercentage>
<ransacConsensusPercentage>20.0</ransacConsensusPercentage>
<nbRansacIterations>200</nbRansacIterations>
<ransacThreshold>0.005</ransacThreshold>
</ransac>
</conf>

Additional functionalities

How to learn keypoints from multiple images

The following video shows an extension of the previous example, where we learn a cube from 3 images and then detect and localize the cube in all the images of the video.

The corresponding source code is given in tutorial-detection-object-mbt2.cpp. If you have a look at this file you will find the following.

Before starting with the keypoints detection and learning part, we have to set the correct pose for the tracker using a predefined pose:

tracker.setPose(I, initPoseTab[i]);

One good thing to do is to refine the pose by running one iteration of the model-based tracker:

tracker.track(I);

The function vpKeyPoint::buildReference() can append the currently detected keypoints to those already present when its parameter append is set to true.

But before that, the same learning procedure must be done in order to train on multiple images. We detect keypoints on the desired image:

std::vector<cv::KeyPoint> trainKeyPoints;
double elapsedTime;
keypoint_learning.detect(I, trainKeyPoints, elapsedTime);

Then, we keep only keypoints that are located on the object faces:

std::vector<vpPolygon> polygons;
std::vector<std::vector<vpPoint> > roisPt;
std::pair<std::vector<vpPolygon>, std::vector<std::vector<vpPoint> > > pair = tracker.getPolygonFaces();
polygons = pair.first;
roisPt = pair.second;
std::vector<cv::Point3f> points3f;
tracker.getPose(cMo);
tracker.getCameraParameters(cam);
vpKeyPoint::compute3DForPointsInPolygons(cMo, cam, trainKeyPoints, polygons, roisPt, points3f);

And finally, we build the reference keypoints and we set the flag append to true to say that we want to keep the previously learned keypoints:

keypoint_learning.buildReference(I, trainKeyPoints, points3f, true, id);

How to display the matching when the learning is done on multiple images

In this section we explain how to display the matching between the keypoints detected in the current image and their correspondences in the reference images used during the learning stage, as shown in the next video:

Warning
If you want to load the learning data from a file, you have to use a learning file that contains the training images (that is, a file saved with the parameter saveTrainingImages of vpKeyPoint::saveLearningData() set to true, which is the default).

Before showing how to display the matching for all the training images, we have to assign a unique identifier (a positive integer) to the set of keypoints learned from a particular image during the training process:

keypoint_learning.buildReference(I, trainKeyPoints, points3f, true, id);

This identifier links the training keypoints to the corresponding training image.
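
The effect of the append flag and of the identifier can be pictured with the following sketch in standard C++. It is a deliberately simplified, hypothetical container (ReferenceSet, Keypoint2D and Point3D are names introduced for this illustration) and does not reflect the internal data structures of vpKeyPoint. It only shows that appending accumulates the keypoints of several images and that one identifier is stored per keypoint so that each keypoint can be traced back to its training image.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Keypoint2D { float u, v; };
struct Point3D { float X, Y, Z; };

// Hypothetical reference set: parallel lists of keypoints, 3D points
// and the identifier of the training image each keypoint comes from.
class ReferenceSet
{
public:
  void build(const std::vector<Keypoint2D> &keypoints, const std::vector<Point3D> &points3d,
             bool append, int imageId)
  {
    assert(keypoints.size() == points3d.size());
    if (!append) {
      // Without append, previously learned data are discarded.
      m_keypoints.clear();
      m_points3d.clear();
      m_imageIds.clear();
    }
    m_keypoints.insert(m_keypoints.end(), keypoints.begin(), keypoints.end());
    m_points3d.insert(m_points3d.end(), points3d.begin(), points3d.end());
    m_imageIds.insert(m_imageIds.end(), keypoints.size(), imageId);
  }

  std::size_t size() const { return m_keypoints.size(); }

  // Identifier of the training image the keypoint was learned from.
  int imageIdOf(std::size_t index) const { return m_imageIds.at(index); }

private:
  std::vector<Keypoint2D> m_keypoints;
  std::vector<Point3D> m_points3d;
  std::vector<int> m_imageIds;
};
```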

After that, the first thing to do is to create the image that will contain the keypoint matching:

keypoint_detection.createImageMatching(I, IMatching);

The previous line allocates an image with the correct size according to the number of training images used.

Then, for each new image, we have to update the matching image with the current image:

keypoint_detection.insertImageMatching(I, IMatching);
Note
The current image is inserted preferably at the center of the matching image, if possible.

And to display the matching we use:

keypoint_detection.displayMatching(I, IMatching);

We can also display the RANSAC inliers / outliers in the current image and in the matching image:

for (std::vector<vpImagePoint>::const_iterator it = ransacInliers.begin(); it != ransacInliers.end(); ++it) {
vpDisplay::displayCircle(I, *it, 4, vpColor::green);
vpImagePoint imPt(*it);
imPt.set_u(imPt.get_u() + I.getWidth());
imPt.set_v(imPt.get_v() + I.getHeight());
vpDisplay::displayCircle(IMatching, imPt, 4, vpColor::green);
}
for (std::vector<vpImagePoint>::const_iterator it = ransacOutliers.begin(); it != ransacOutliers.end(); ++it) {
vpDisplay::displayCircle(I, *it, 4, vpColor::red);
vpImagePoint imPt(*it);
imPt.set_u(imPt.get_u() + I.getWidth());
imPt.set_v(imPt.get_v() + I.getHeight());
vpDisplay::displayCircle(IMatching, imPt, 4, vpColor::red);
}

The RANSAC inliers and outliers used above are retrieved as follows:

std::vector<vpImagePoint> ransacInliers = keypoint_detection.getRansacInliers();
std::vector<vpImagePoint> ransacOutliers = keypoint_detection.getRansacOutliers();

Finally, we can also display the model in the matching image. For that, we have to modify the principal point offset of the intrinsic parameters. This is more or less a hack, as you have to manually change the principal point coordinates to make it work.

cam2.initPersProjWithoutDistortion(cam.get_px(), cam.get_py(), cam.get_u0() + I.getWidth(),
cam.get_v0() + I.getHeight());
tracker.setCameraParameters(cam2);
tracker.setPose(IMatching, cMo);
tracker.display(IMatching, cMo, cam2, vpColor::red, 2);
vpDisplay::displayFrame(IMatching, cMo, cam2, 0.05, vpColor::none, 3);
Note
You can refer to the full code in the section How to learn keypoints from multiple images for an example of how to learn from multiple images and how to display all the matches.
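
To see why changing the principal point is enough to draw the model in the matching image, recall that a perspective camera without distortion projects a 3D point (X, Y, Z) to u = u0 + px * X / Z and v = v0 + py * Y / Z. Adding an offset to (u0, v0) therefore translates every projected point by the same offset, which is exactly the shift applied to the inliers and outliers with I.getWidth() and I.getHeight(). The following standard C++ sketch checks this numerically. Camera, Pixel, project() and shiftPrincipalPoint() are hypothetical names, not ViSP classes, and the 640 x 480 image size used in the usage example is an assumption made for illustration.

```cpp
#include <cassert>
#include <cmath>

// Intrinsic parameters of a perspective camera without distortion.
struct Camera { double px, py, u0, v0; };
struct Pixel { double u, v; };

// Perspective projection of a 3D point expressed in the camera frame.
Pixel project(const Camera &cam, double X, double Y, double Z)
{
  assert(Z > 0.0); // the point must be in front of the camera
  Pixel p;
  p.u = cam.u0 + cam.px * X / Z;
  p.v = cam.v0 + cam.py * Y / Z;
  return p;
}

// Same camera with its principal point translated by (du, dv) pixels.
Camera shiftPrincipalPoint(const Camera &cam, double du, double dv)
{
  Camera shifted = cam;
  shifted.u0 += du;
  shifted.v0 += dv;
  return shifted;
}
```

For example, with the intrinsic parameters used in this tutorial (839, 839, 325, 243) and an assumed 640 x 480 image, projecting the same 3D point with the shifted camera gives a pixel located 640 columns to the right and 480 rows below the original projection.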