Visual Servoing Platform  version 3.4.0
Tutorial: Deep learning object detection

Introduction

This tutorial shows how to use the vpDetectorDNN class (DNN stands for Deep Neural Network) to perform object detection with deep learning. This class is a small wrapper over the OpenCV DNN module.

It provides convenient ways to retrieve detection bounding boxes, class ids and confidence values. For other tasks such as image classification, or for more elaborate functionality, you should use the OpenCV DNN API directly.

In the next section you will find an example that shows how to perform face detection in a single image or in images acquired from a camera connected to your computer.

Note that all the material (source code and network model) described in this tutorial is part of the ViSP source code and can be downloaded using the following command:

$ svn export https://github.com/lagadic/visp.git/trunk/tutorial/detection/dnn

Face detection

The following example, also available in tutorial-dnn-object-detection-live.cpp, detects human faces.

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

#include <visp3/core/vpConfig.h>
#include <visp3/core/vpImageConvert.h>
#include <visp3/detection/vpDetectorDNN.h>
#include <visp3/gui/vpDisplayGDI.h>
#include <visp3/gui/vpDisplayOpenCV.h>
#include <visp3/gui/vpDisplayX.h>

int main(int argc, const char *argv[])
{
#if (VISP_HAVE_OPENCV_VERSION >= 0x030403) && defined(VISP_HAVE_OPENCV_DNN)
  try {
    // Default settings: OpenCV face detector
    int opt_device = 0;
    std::string input = "";
    std::string model = "opencv_face_detector_uint8.pb";
    std::string config = "opencv_face_detector.pbtxt";
    int inputWidth = 300, inputHeight = 300;
    double meanR = 104.0, meanG = 177.0, meanB = 123.0;
    double scaleFactor = 1.0;
    bool swapRB = false;
    float confThresh = 0.5f;
    float nmsThresh = 0.4f;
    std::string labelFile = "";

    // Parse command line options
    for (int i = 1; i < argc; i++) {
      if (std::string(argv[i]) == "--device" && i+1 < argc) {
        opt_device = atoi(argv[i+1]);
      } else if (std::string(argv[i]) == "--input" && i+1 < argc) {
        input = std::string(argv[i+1]);
      } else if (std::string(argv[i]) == "--model" && i+1 < argc) {
        model = std::string(argv[i+1]);
      } else if (std::string(argv[i]) == "--config" && i+1 < argc) {
        config = std::string(argv[i+1]);
      } else if (std::string(argv[i]) == "--width" && i+1 < argc) {
        inputWidth = atoi(argv[i+1]);
      } else if (std::string(argv[i]) == "--height" && i+1 < argc) {
        inputHeight = atoi(argv[i+1]);
      } else if (std::string(argv[i]) == "--mean" && i+3 < argc) {
        meanR = atof(argv[i+1]);
        meanG = atof(argv[i+2]);
        meanB = atof(argv[i+3]);
      } else if (std::string(argv[i]) == "--scale" && i+1 < argc) {
        scaleFactor = atof(argv[i+1]);
      } else if (std::string(argv[i]) == "--swapRB") {
        swapRB = true;
      } else if (std::string(argv[i]) == "--confThresh" && i+1 < argc) {
        confThresh = (float)atof(argv[i+1]);
      } else if (std::string(argv[i]) == "--nmsThresh" && i+1 < argc) {
        nmsThresh = (float)atof(argv[i+1]);
      } else if (std::string(argv[i]) == "--labels" && i+1 < argc) {
        labelFile = std::string(argv[i+1]);
      } else if (std::string(argv[i]) == "--help" || std::string(argv[i]) == "-h") {
        std::cout << argv[0] << " --device <camera device number> --input <path to image or video>"
                                " (camera is used if input is empty) --model <path to net trained weights>"
                                " --config <path to net config file>"
                                " --width <blob width> --height <blob height>"
                                " --mean <meanR meanG meanB> --scale <scale factor>"
                                " --swapRB --confThresh <confidence threshold>"
                                " --nmsThresh <NMS threshold> --labels <path to label file>" << std::endl;
        return EXIT_SUCCESS;
      }
    }

    std::cout << "Model: " << model << std::endl;
    std::cout << "Config: " << config << std::endl;
    std::cout << "Width: " << inputWidth << std::endl;
    std::cout << "Height: " << inputHeight << std::endl;
    std::cout << "Mean: " << meanR << ", " << meanG << ", " << meanB << std::endl;
    std::cout << "Scale: " << scaleFactor << std::endl;
    std::cout << "Swap RB? " << swapRB << std::endl;
    std::cout << "Confidence threshold: " << confThresh << std::endl;
    std::cout << "NMS threshold: " << nmsThresh << std::endl;

    // Open the camera, or the image/video file given with --input
    cv::VideoCapture capture;
    if (input.empty()) {
      capture.open(opt_device);
    } else {
      capture.open(input);
    }

    // Image and display window (the display class depends on the platform)
    vpImage<vpRGBa> I;
#if defined(VISP_HAVE_X11)
    vpDisplayX d;
#elif defined(VISP_HAVE_GDI)
    vpDisplayGDI d;
#elif defined(VISP_HAVE_OPENCV)
    vpDisplayOpenCV d;
#endif

    // Create and configure the detector
    vpDetectorDNN dnn;
    dnn.readNet(model, config);
    dnn.setInputSize(inputWidth, inputHeight);
    dnn.setMean(meanR, meanG, meanB);
    dnn.setScaleFactor(scaleFactor);
    dnn.setSwapRB(swapRB);
    dnn.setConfidenceThreshold(confThresh);
    dnn.setNMSThreshold(nmsThresh);

    // Optional class labels, one per line
    std::vector<std::string> labels;
    if (!labelFile.empty()) {
      std::ifstream f_label(labelFile);
      std::string line;
      while (std::getline(f_label, line)) {
        labels.push_back(line);
      }
    }

    cv::Mat frame;
    while (true) {
      capture >> frame;
      if (frame.empty())
        break;

      // Convert the OpenCV frame to a ViSP image; initialize the display on the first frame
      if (I.getSize() == 0) {
        vpImageConvert::convert(frame, I);
        d.init(I);
        vpDisplay::setTitle(I, "DNN object detection");
      } else {
        vpImageConvert::convert(frame, I);
      }

      // Run the detection and measure how long it takes
      double t = vpTime::measureTimeMs();
      std::vector<vpRect> boundingBoxes;
      dnn.detect(I, boundingBoxes);
      t = vpTime::measureTimeMs() - t;

      vpDisplay::display(I);

      // Draw each detection with its class and confidence
      std::vector<int> classIds = dnn.getDetectionClassIds();
      std::vector<float> confidences = dnn.getDetectionConfidence();
      for (size_t i = 0; i < boundingBoxes.size(); i++) {
        vpDisplay::displayRectangle(I, boundingBoxes[i], vpColor::red, false, 2);
        std::ostringstream oss;
        if (labels.empty())
          oss << "class: " << classIds[i];
        else
          oss << labels[classIds[i]];
        oss << " - conf: " << confidences[i];
        vpDisplay::displayText(I, (int)boundingBoxes[i].getTop()-10, (int)boundingBoxes[i].getLeft()+10,
                               oss.str(), vpColor::red);
      }

      std::ostringstream oss;
      oss << "Detection time: " << t << " ms";
      vpDisplay::displayText(I, 20, 20, oss.str(), vpColor::red);
      vpDisplay::flush(I);

      // A mouse click in the window stops the loop
      if (vpDisplay::getClick(I, false))
        break;
    }
  } catch (const vpException &e) {
    std::cout << e.what() << std::endl;
  }
#else
  (void)argc;
  (void)argv;
#endif
  return EXIT_SUCCESS;
}

The default behavior is to detect human faces, but you can supply another model to detect other objects. To list the available options, run:

$ ./tutorial-dnn-object-detection-live --help

The default DNN model and config files perform human face detection:

std::string model = "opencv_face_detector_uint8.pb";
std::string config = "opencv_face_detector.pbtxt";

This model is provided by OpenCV, which describes its training as follows:

This is a brief description of training process which has been used to get res10_300x300_ssd_iter_140000.caffemodel. The model was created with SSD framework using ResNet-10 like architecture as a backbone. Channels count in ResNet-10 convolution layers was significantly dropped (2x- or 4x- fewer channels). The model was trained in Caffe framework on some huge and available online dataset.

More specifically, the model used (opencv_face_detector_uint8.pb) has been quantized to 8-bit unsigned integers (with the TensorFlow library) to reduce the size of the trained model (2.7 MB vs 10.7 MB for res10_300x300_ssd_iter_140000.caffemodel).

To create the DNN object detector:

vpDetectorDNN dnn;
dnn.readNet(model, config);
dnn.setInputSize(inputWidth, inputHeight);
dnn.setMean(meanR, meanG, meanB);
dnn.setScaleFactor(scaleFactor);
dnn.setSwapRB(swapRB);
dnn.setConfidenceThreshold(confThresh);
dnn.setNMSThreshold(nmsThresh);

model is the file containing the trained network weights; config is the file describing the network topology.

inputWidth and inputHeight are the dimensions to which the input image is resized to build the blob that is fed to the network.

meanR, meanG and meanB are the values used for mean subtraction.

scaleFactor is the multiplier applied to the pixel values to normalize the data range.

swapRB should be set to true when the model has been trained on RGB data. Since OpenCV uses the BGR convention, the R and B channels must then be swapped.

Refer to the OpenCV model zoo for the parameter values to use with a given model.
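To make the role of the mean, scale and swap parameters concrete, here is a minimal standalone sketch of the per-pixel arithmetic. It is an illustration under stated assumptions, not the vpDetectorDNN or OpenCV implementation: the names Pixel and preprocessPixel are hypothetical, the mean is assumed to be given in the same channel order as the input pixel, and the resize step is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical type for this sketch: one pixel with three channel values.
using Pixel = std::array<double, 3>;

// Simplified model of the preprocessing applied to each pixel:
// subtract the per-channel mean, multiply by the scale factor,
// then optionally swap the first and third channels (B <-> R).
// Assumption: 'mean' uses the same channel order as 'in'.
Pixel preprocessPixel(const Pixel &in, const Pixel &mean, double scaleFactor, bool swapRB)
{
  assert(scaleFactor > 0.0); // a non-positive scale would destroy or invert the data
  Pixel out;
  for (std::size_t c = 0; c < 3; ++c) {
    out[c] = scaleFactor * (in[c] - mean[c]);
  }
  if (swapRB) {
    std::swap(out[0], out[2]);
  }
  return out;
}
```

With the default values of this tutorial (mean 104, 177, 123, scale 1.0, no swap), a pixel equal to the mean maps to (0, 0, 0), which is the point of mean subtraction: the network sees data centered around zero.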

confThresh is the confidence threshold used to filter the detections after inference.

nmsThresh is the Non-Maximum Suppression (NMS) threshold. It is used to filter out multiple detections that occur at approximately the same location.
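The effect of the two thresholds can be sketched with a standalone greedy filter. This is a simplified illustration, not the code run by vpDetectorDNN (OpenCV exposes a comparable step as cv::dnn::NMSBoxes); the names Box, iou and filterDetections are hypothetical, and the overlap measure is assumed to be intersection over union.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical axis-aligned box for this sketch.
struct Box {
  double left, top, width, height;
};

// Intersection over union of two boxes, in [0, 1].
double iou(const Box &a, const Box &b)
{
  const double x1 = std::max(a.left, b.left);
  const double y1 = std::max(a.top, b.top);
  const double x2 = std::min(a.left + a.width, b.left + b.width);
  const double y2 = std::min(a.top + a.height, b.top + b.height);
  const double inter = std::max(0.0, x2 - x1) * std::max(0.0, y2 - y1);
  const double uni = a.width * a.height + b.width * b.height - inter;
  return uni > 0.0 ? inter / uni : 0.0;
}

// Returns the indices of the detections that survive both filters:
// 1. drop detections whose confidence is not above confThresh;
// 2. visit the rest by decreasing confidence and drop any box that
//    overlaps an already kept box by more than nmsThresh.
std::vector<std::size_t> filterDetections(const std::vector<Box> &boxes, const std::vector<float> &confidences,
                                          float confThresh, float nmsThresh)
{
  assert(boxes.size() == confidences.size());
  std::vector<std::size_t> order;
  for (std::size_t i = 0; i < boxes.size(); ++i) {
    if (confidences[i] > confThresh) {
      order.push_back(i);
    }
  }
  std::sort(order.begin(), order.end(),
            [&confidences](std::size_t a, std::size_t b) { return confidences[a] > confidences[b]; });

  std::vector<std::size_t> kept;
  for (std::size_t idx : order) {
    bool suppressed = false;
    for (std::size_t k : kept) {
      if (iou(boxes[idx], boxes[k]) > nmsThresh) {
        suppressed = true;
        break;
      }
    }
    if (!suppressed) {
      kept.push_back(idx);
    }
  }
  return kept;
}
```

In this model, raising confThresh removes weak detections, while lowering nmsThresh makes the filter more aggressive about merging nearby boxes into a single detection.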

After setting the correct parameters, you can easily detect objects in an image with:

std::vector<vpRect> boundingBoxes;
dnn.detect(I, boundingBoxes);

Class ids and detection confidence scores can be retrieved with:

std::vector<int> classIds = dnn.getDetectionClassIds();
std::vector<float> confidences = dnn.getDetectionConfidence();
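When a label file is supplied, a class id is used as an index into the list of labels, so a file with fewer lines than the model has classes would lead to an out-of-range access. The helper below is a hypothetical sketch (formatDetection is not part of ViSP) that builds the same text as the example but falls back to the numeric id when no label is available.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Builds the text shown next to a detection, e.g. "face - conf: 0.5".
// Falls back to "class: <id>" when the id has no matching label.
std::string formatDetection(int classId, float confidence, const std::vector<std::string> &labels)
{
  std::ostringstream oss;
  if (classId >= 0 && static_cast<std::size_t>(classId) < labels.size()) {
    oss << labels[static_cast<std::size_t>(classId)];
  } else {
    oss << "class: " << classId;
  }
  oss << " - conf: " << confidence;
  return oss.str();
}
```

In the example's display loop, the returned string could be passed to vpDisplay::displayText() in place of the locally built oss.str().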

Object detection model zoo

You can find more models in the OpenCV model zoo.