Introduction
This tutorial shows how to use the vpDetectorDNN class (DNN stands for Deep Neural Network) to perform object detection with deep learning. This class is a small wrapper over the OpenCV DNN module.
It provides convenient ways to retrieve detection bounding boxes, class ids and confidence values. For other tasks such as image classification, or for more elaborate functionality, you should use the OpenCV DNN API directly.
In the next section you will find an example that shows how to perform face detection in a single image or in images acquired from a camera connected to your computer.
Note that all the material (source code and network model) described in this tutorial is part of the ViSP source code and can be downloaded using the following command:
Face detection
The following example, also available in tutorial-dnn-object-detection-live.cpp, detects human faces.
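#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <visp3/core/vpConfig.h>
#include <visp3/detection/vpDetectorDNN.h>
#include <visp3/gui/vpDisplayGDI.h>
#include <visp3/gui/vpDisplayOpenCV.h>
#include <visp3/gui/vpDisplayX.h>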
int main(int argc, const char *argv[])
{
#if (VISP_HAVE_OPENCV_VERSION >= 0x030403) && defined(VISP_HAVE_OPENCV_DNN)
try {
int opt_device = 0;
std::string input = "";
std::string model = "opencv_face_detector_uint8.pb";
std::string config = "opencv_face_detector.pbtxt";
int inputWidth = 300, inputHeight = 300;
double meanR = 104.0, meanG = 177.0, meanB = 123.0;
double scaleFactor = 1.0;
bool swapRB = false;
float confThresh = 0.5f;
float nmsThresh = 0.4f;
std::string labelFile = "";
for (int i = 1; i < argc; i++) {
if (std::string(argv[i]) == "--device" && i+1 < argc) {
opt_device = atoi(argv[i+1]);
} else if (std::string(argv[i]) == "--input" && i+1 < argc) {
input = std::string(argv[i+1]);
} else if (std::string(argv[i]) == "--model" && i+1 < argc) {
model = std::string(argv[i+1]);
} else if (std::string(argv[i]) == "--config" && i+1 < argc) {
config = std::string(argv[i+1]);
} else if (std::string(argv[i]) == "--width" && i+1 < argc) {
inputWidth = atoi(argv[i+1]);
} else if (std::string(argv[i]) == "--height" && i+1 < argc) {
inputHeight = atoi(argv[i+1]);
} else if (std::string(argv[i]) == "--mean" && i+3 < argc) {
meanR = atof(argv[i+1]);
meanG = atof(argv[i+2]);
meanB = atof(argv[i+3]);
} else if (std::string(argv[i]) == "--scale" && i+1 < argc) {
scaleFactor = atof(argv[i+1]);
} else if (std::string(argv[i]) == "--swapRB") {
swapRB = true;
} else if (std::string(argv[i]) == "--confThresh" && i+1 < argc) {
confThresh = (float)atof(argv[i+1]);
} else if (std::string(argv[i]) == "--nmsThresh" && i+1 < argc) {
nmsThresh = (float)atof(argv[i+1]);
} else if (std::string(argv[i]) == "--labels" && i+1 < argc) {
labelFile = std::string(argv[i+1]);
} else if (std::string(argv[i]) == "--help" || std::string(argv[i]) == "-h") {
std::cout << argv[0] << " --device <camera device number> --input <path to image or video>"
" (camera is used if input is empty) --model <path to net trained weights>"
" --config <path to net config file>"
" --width <blob width> --height <blob height>"
" --mean <meanR meanG meanB> --scale <scale factor>"
" --swapRB --confThresh <confidence threshold>"
" --nmsThresh <NMS threshold> --labels <path to label file>" << std::endl;
return EXIT_SUCCESS;
}
}
std::cout << "Model: " << model << std::endl;
std::cout << "Config: " << config << std::endl;
std::cout << "Width: " << inputWidth << std::endl;
std::cout << "Height: " << inputHeight << std::endl;
std::cout << "Mean: " << meanR << ", " << meanG << ", " << meanB << std::endl;
std::cout << "Scale: " << scaleFactor << std::endl;
std::cout << "Swap RB? " << swapRB << std::endl;
std::cout << "Confidence threshold: " << confThresh << std::endl;
std::cout << "NMS threshold: " << nmsThresh << std::endl;
cv::VideoCapture capture;
if (input.empty()) {
capture.open(opt_device);
} else {
capture.open(input);
}
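// Grayscale image that receives each captured frame
vpImage<unsigned char> I;
#if defined(VISP_HAVE_X11)
vpDisplayX d;
#elif defined(VISP_HAVE_GDI)
vpDisplayGDI d;
#elif defined(VISP_HAVE_OPENCV)
vpDisplayOpenCV d;
#endif
// Configure the detector with the parameters parsed above
vpDetectorDNN dnn;
dnn.readNet(model, config);
dnn.setInputSize(inputWidth, inputHeight);
dnn.setMean(meanR, meanG, meanB);
dnn.setScaleFactor(scaleFactor);
dnn.setSwapRB(swapRB);
dnn.setConfidenceThreshold(confThresh);
dnn.setNMSThreshold(nmsThresh);
// Optional class names, one per line
std::vector<std::string> labels;
if (!labelFile.empty()) {
std::ifstream f_label(labelFile);
std::string line;
while (std::getline(f_label, line)) {
labels.push_back(line);
}
}
cv::Mat frame;
while (true) {
capture >> frame;
if (frame.empty())
break;
if (I.getSize() == 0) {
// First frame: size the image and open the display window
vpImageConvert::convert(frame, I);
d.init(I);
vpDisplay::setTitle(I, "DNN object detection");
} else {
vpImageConvert::convert(frame, I);
}
// Run the detection and measure how long it takes
double t = vpTime::measureTimeMs();
std::vector<vpRect> boundingBoxes;
dnn.detect(I, boundingBoxes);
t = vpTime::measureTimeMs() - t;
vpDisplay::display(I);
std::vector<int> classIds = dnn.getDetectionClassIds();
std::vector<float> confidences = dnn.getDetectionConfidence();
for (size_t i = 0; i < boundingBoxes.size(); i++) {
vpDisplay::displayRectangle(I, boundingBoxes[i], vpColor::red, false, 2);
std::ostringstream oss;
if (labels.empty())
oss << "class: " << classIds[i];
else
oss << labels[classIds[i]];
oss << " - conf: " << confidences[i];
vpDisplay::displayText(I, (int)boundingBoxes[i].getTop() - 10, (int)boundingBoxes[i].getLeft() + 10, oss.str(), vpColor::red);
}
std::ostringstream oss;
oss << "Detection time: " << t << " ms";
vpDisplay::displayText(I, 20, 20, oss.str(), vpColor::red);
vpDisplay::flush(I);
// A mouse click in the window stops the loop
if (vpDisplay::getClick(I, false))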
break;
}
} catch (const vpException &e) {
std::cout << e.what() << std::endl;
}
#else
(void)argc;
(void)argv;
#endif
return EXIT_SUCCESS;
}
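The caption built for each detection in the listing indexes labels with the class id directly, which assumes the label file covers every class the network can output. The following is a minimal standalone sketch of the same caption logic with a bounds check added. formatDetections is a hypothetical helper written for this illustration, not part of ViSP, and it only uses the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not a ViSP function): builds one caption per detection.
// Falls back to the numeric class id when no label exists for that id.
std::vector<std::string> formatDetections(const std::vector<int> &classIds,
                                          const std::vector<float> &confidences,
                                          const std::vector<std::string> &labels)
{
  // Class ids and confidences are expected to describe the same detections.
  assert(classIds.size() == confidences.size());

  std::vector<std::string> captions;
  captions.reserve(classIds.size());
  for (std::size_t i = 0; i < classIds.size(); ++i) {
    std::ostringstream oss;
    const int id = classIds[i];
    if (id >= 0 && static_cast<std::size_t>(id) < labels.size())
      oss << labels[static_cast<std::size_t>(id)];
    else
      oss << "class: " << id;
    oss << " - conf: " << confidences[i];
    captions.push_back(oss.str());
  }
  return captions;
}
```

With an empty labels vector this produces the same "class: <id> - conf: <value>" text as the listing, and an id beyond the end of the label file no longer reads out of bounds.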
The default behavior is to detect human faces, but you can supply another model to detect other objects. To list the available options, run:
$ ./tutorial-dnn-object-detection-live --help
The default DNN model and config files perform human face detection.
std::string model = "opencv_face_detector_uint8.pb";
std::string config = "opencv_face_detector.pbtxt";
The model is provided by OpenCV and has been trained with the following characteristics:
This is a brief description of training process which has been used to get res10_300x300_ssd_iter_140000.caffemodel. The model was created with SSD framework using ResNet-10 like architecture as a backbone. Channels count in ResNet-10 convolution layers was significantly dropped (2x- or 4x- fewer channels). The model was trained in Caffe framework on some huge and available online dataset.
More specifically, the model used (opencv_face_detector_uint8.pb) has been quantized (with the TensorFlow library) to 8-bit unsigned integers to reduce the size of the trained model (2.7 MB vs 10.7 MB for res10_300x300_ssd_iter_140000.caffemodel).
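To give an intuition for why 8-bit quantization shrinks a model, here is a small sketch of a generic affine quantization of float weights. It is only an illustration under stated assumptions: the scheme actually applied by TensorFlow to opencv_face_detector_uint8.pb is not described in this tutorial and may differ, and QuantizedWeights, quantize and dequantize are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative storage: each weight is approximated by minVal + step * code,
// where code is an 8-bit unsigned integer in [0, 255].
struct QuantizedWeights {
  float minVal;
  float step;
  std::vector<std::uint8_t> codes;
};

// Maps the range [min, max] of the weights linearly onto the 256 available codes.
QuantizedWeights quantize(const std::vector<float> &weights)
{
  assert(!weights.empty());
  const auto range = std::minmax_element(weights.begin(), weights.end());
  QuantizedWeights q;
  q.minVal = *range.first;
  const float span = *range.second - *range.first;
  q.step = span > 0.0f ? span / 255.0f : 1.0f; // avoid a zero step for constant weights
  q.codes.reserve(weights.size());
  for (float w : weights) {
    const long code = std::lround((w - q.minVal) / q.step);
    q.codes.push_back(static_cast<std::uint8_t>(std::min(255L, std::max(0L, code))));
  }
  return q;
}

// Recovers the approximate float value of weight i.
float dequantize(const QuantizedWeights &q, std::size_t i)
{
  return q.minVal + q.step * static_cast<float>(q.codes[i]);
}
```

Each weight then occupies one byte instead of a four-byte float, which is consistent with the roughly fourfold size difference quoted above, at the cost of a rounding error of at most half a step per weight.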
To create the DNN object detector:
model is the trained network weights and config is the network topology description.
inputWidth and inputHeight are the dimensions to which the input image is resized to build the blob that is fed to the network.
meanR, meanG and meanB are the values used for mean subtraction.
scaleFactor is used to normalize the data range.
swapRB should be set to true when the model has been trained on RGB data. Since OpenCV uses the BGR convention, the R and B channels must then be swapped.
You can refer directly to the OpenCV model zoo for the parameter values.
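To make the role of the mean, scale factor and swapRB parameters concrete, the sketch below applies them to a single pixel. It is a simplified model of this kind of preprocessing, not the OpenCV implementation, which also resizes the image and works on whole blobs. Pixel and preprocessPixel are hypothetical names, and the mean is assumed to be given in the channel order obtained after the optional swap.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// One pixel with three channels, in the order the image stores them
// (B, G, R for an OpenCV image).
using Pixel = std::array<float, 3>;

// Hypothetical helper: optional R/B swap, then mean subtraction, then scaling.
Pixel preprocessPixel(Pixel p, const Pixel &mean, float scaleFactor, bool swapRB)
{
  assert(scaleFactor != 0.0f); // a zero scale would erase the image content
  if (swapRB)
    std::swap(p[0], p[2]); // BGR -> RGB (or the reverse)
  for (std::size_t c = 0; c < 3; ++c)
    p[c] = (p[c] - mean[c]) * scaleFactor;
  return p;
}
```

For instance, a pixel equal to the mean maps to zero in every channel, and a scale factor of 1.0 / 255 would bring 8-bit values into a range of width one.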
confThresh is the confidence threshold used to filter the detections after inference.
nmsThresh is the Non-Maximum Suppression (NMS) threshold. It is used to filter out multiple detections that occur at approximately the same location.
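The effect of these two thresholds can be sketched with a standalone greedy non-maximum suppression based on intersection over union (IoU). This is a common textbook formulation given for illustration only; the filtering performed internally by vpDetectorDNN and OpenCV may differ in details. Detection, iou and filterDetections are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Axis-aligned box with a confidence score (hypothetical type, not a ViSP class).
struct Detection {
  float left, top, width, height;
  float confidence;
};

// Intersection over union of two boxes, in [0, 1].
float iou(const Detection &a, const Detection &b)
{
  const float x1 = std::max(a.left, b.left);
  const float y1 = std::max(a.top, b.top);
  const float x2 = std::min(a.left + a.width, b.left + b.width);
  const float y2 = std::min(a.top + a.height, b.top + b.height);
  const float inter = std::max(0.0f, x2 - x1) * std::max(0.0f, y2 - y1);
  const float uni = a.width * a.height + b.width * b.height - inter;
  return uni > 0.0f ? inter / uni : 0.0f;
}

// Returns the indices of the detections kept after confidence thresholding
// and greedy NMS, ordered by decreasing confidence.
std::vector<std::size_t> filterDetections(const std::vector<Detection> &dets,
                                          float confThresh, float nmsThresh)
{
  assert(nmsThresh >= 0.0f && nmsThresh <= 1.0f);

  // 1. Drop detections whose confidence does not exceed the threshold.
  std::vector<std::size_t> order;
  for (std::size_t i = 0; i < dets.size(); ++i)
    if (dets[i].confidence > confThresh)
      order.push_back(i);

  // 2. Visit the remaining detections from most to least confident.
  std::sort(order.begin(), order.end(), [&dets](std::size_t a, std::size_t b) {
    return dets[a].confidence > dets[b].confidence;
  });

  // 3. Keep a detection only if it does not overlap a kept one too much.
  std::vector<std::size_t> kept;
  for (std::size_t idx : order) {
    bool suppressed = false;
    for (std::size_t k : kept) {
      if (iou(dets[idx], dets[k]) > nmsThresh) {
        suppressed = true;
        break;
      }
    }
    if (!suppressed)
      kept.push_back(idx);
  }
  return kept;
}
```

Raising confThresh removes weak detections, while lowering nmsThresh merges overlapping boxes more aggressively.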
After setting the correct parameters, you can easily detect objects in an image with
std::vector<vpRect> boundingBoxes;
dnn.detect(I, boundingBoxes);
Class ids and detection confidence scores can be retrieved with
std::vector<int> classIds = dnn.getDetectionClassIds();
std::vector<float> confidences = dnn.getDetectionConfidence();
Object detection model zoo
You can find more models in the OpenCV model zoo.