Visual Servoing Platform  version 3.5.1 under development (2022-05-22)
Tutorial: Deep learning object detection on NVIDIA GPU with TensorRT


This tutorial shows how to run object detection inference using the NVIDIA TensorRT inference SDK.

For this tutorial, you'll need the ssd_mobilenet.onnx pre-trained model and the pascal-voc-labels.txt file containing the corresponding labels. Both files can be found in the visp-images dataset.

Note that the source code described in this tutorial is part of the ViSP source code and can be downloaded using the following command:

$ svn export

Before running this tutorial, you need to install:

  • CUDA (version 10.2 or higher)
  • cuDNN (version compatible with your CUDA version)
  • TensorRT (version 7.1 or higher)
  • OpenCV built from source (version 4.5.2 or higher)

Installation instructions are provided in the Prerequisites section.

The tutorial was tested on several NVIDIA hardware platforms. The following table lists the CUDA, TensorRT and cuDNN versions used with each GPU:

NVIDIA hardware    OS                           CUDA   TensorRT          cuDNN
Jetson TX2         Ubuntu 18.04 (JetPack 4.4)   10.2   7.1.3             8.0
GeForce GTX 1080   Ubuntu 16.04                 11.0   8.0 GA            8.0
Quadro RTX 6000    Ubuntu 18.04                 11.3   8.0 GA Update 1   8.2

Note: with TensorRT 8.2 EA and CUDA 11.3 on an NVIDIA Quadro RTX 6000, the tutorial did not work as expected: a large number of spurious bounding boxes were detected in any given image.


Install CUDA

CUDA is a parallel computing platform and programming model invented by NVIDIA.

  • To check whether the NVIDIA CUDA driver is already installed on your machine, on Ubuntu you can use nvidia-smi:
    $ nvidia-smi | grep CUDA
    | NVIDIA-SMI 465.27 Driver Version: 465.27 CUDA Version: 11.3 |
    Here the output shows that NVIDIA driver 465.27 is installed and that it supports CUDA up to version 11.3.
  • To know if CUDA toolkit is installed, run:
    $ cat /usr/local/cuda/version.{txt,json}
    "cuda" : {
    "name" : "CUDA SDK",
    "version" : "11.3.20210326"
    Here it shows that CUDA toolkit 11.3 is installed.
    We recommend that the NVIDIA CUDA driver and the CUDA toolkit have the same version.
  • To install the NVIDIA CUDA driver and toolkit on your machine, please follow this step-by-step guide.

Install cuDNN

Installation instructions are provided here.

For example, after downloading "cuDNN Runtime Library for Ubuntu18.04 x86_64 (Deb)", you can install it by running:

$ sudo dpkg -i libcudnn8_8.2.0.53-1+cuda11.3_amd64.deb

Install TensorRT

TensorRT is a C++ library that facilitates high-performance inference on NVIDIA GPUs. To download and install TensorRT, please follow this step-by-step guide.

Let us consider the installation of TensorRT 8.0 GA Update 1 for the x86_64 architecture. In that case you need to download the "TensorRT 8.0 GA Update 1 for Linux x86_64 and CUDA 11.0, CUDA 11.1, CUDA 11.2, 11.3" TAR package and extract its content in $VISP_WS.

$ ls $VISP_WS
TensorRT- ...

Following the installation instructions:

  • Add the absolute path to the TensorRT lib directory to the environment variable LD_LIBRARY_PATH.
  • Install the Python TensorRT wheel file.
    $ sudo apt-get install python3-pip
    $ cd $VISP_WS/TensorRT-
    $ python3 -m pip install tensorrt-
  • Install the Python UFF wheel file. This is only required if you plan to use TensorRT with TensorFlow.
    $ cd $VISP_WS/TensorRT-
    $ python3 -m pip install uff-0.6.9-py2.py3-none-any.whl
  • Install the Python graphsurgeon wheel file.
    $ cd $VISP_WS/TensorRT-
    $ python3 -m pip install graphsurgeon-0.4.5-py2.py3-none-any.whl
  • Install the Python onnx-graphsurgeon wheel file.
    $ cd $VISP_WS/TensorRT-
    $ python3 -m pip install onnx_graphsurgeon-0.3.10-py2.py3-none-any.whl

Install OpenCV from source

To run the tutorial, you need to build OpenCV from source, because it requires the cudev, cudaarithm and cudawarping extra modules, which are not included in the libopencv-contrib-dev package. To do so, proceed as follows:

  • In VISP_WS, clone opencv and opencv_contrib repos:
    $ cd $VISP_WS
    $ git clone
    $ git clone
  • Create build directory in opencv directory
    $ cd opencv && mkdir build && cd build
  • To configure the OpenCV build with the extra modules, execute the following command:
    $ cmake -DOPENCV_EXTRA_MODULES_PATH=../../opencv_contrib/modules \
    -DBUILD_opencv_cudev=ON \
    -DBUILD_opencv_cudaarithm=ON \
    -DBUILD_opencv_cudawarping=ON \
    -DCMAKE_INSTALL_PREFIX=$VISP_WS/opencv/install ../
    Note that the installation folder is set to $VISP_WS/opencv/install instead of the default /usr/local. This preserves any other OpenCV installation already present on your machine.
  • Note that if you want a more advanced way to configure the build process, you can use ccmake:
    $ ccmake -DOPENCV_EXTRA_MODULES_PATH=../../opencv_contrib/modules ../
  • At this point, you can check if cudev, cudaarithm and cudawarping extra modules are enabled as expected:
    $ grep cudev version_string.tmp
    "    To be built:                 ... cudev ...
    $ grep cudaarithm version_string.tmp
    "    To be built:                 ... cudaarithm ...
    $ grep cudawarping version_string.tmp
    "    To be built:                 ... cudawarping ...
    If this is not the case, something is wrong either in the CUDA installation or in the OpenCV configuration with cmake.
  • Launch build process:
    $ make -j$(nproc)
    $ sudo make install
  • Modify LD_LIBRARY_PATH to find the OpenCV libraries:
    $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$VISP_WS/opencv/install/lib

Build ViSP with TensorRT support

The next step is to build ViSP from source with TensorRT support enabled. As described in Get ViSP source code, we suppose here that you have the ViSP source code in the ViSP workspace folder: $VISP_WS. If you followed the Prerequisites, you should also find TensorRT and OpenCV in the same workspace.

$ ls $VISP_WS
visp opencv TensorRT-

Now, to ensure that ViSP is built with TensorRT support, create and enter the build folder before configuring ViSP with the TensorRT and OpenCV paths:

$ mkdir visp-build; cd visp-build
$ cmake ../visp \

Tutorial description

The following sections describe the tutorial in detail. The complete source code is available in the tutorial-dnn-tensorrt-live.cpp file.

Include header files

Include header files for required extra modules to handle CUDA.

#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>
#include <opencv2/cudawarping.hpp>
#include <opencv2/dnn.hpp>

Include cuda_runtime_api.h header file that defines the public host functions and types for the CUDA runtime API.

#include <cuda_runtime_api.h>

Include TensorRT header files. NvInfer.h is the top-level API file for TensorRT. NvOnnxParser.h is the API for the ONNX Parser.

#include <NvInfer.h>
#include <NvOnnxParser.h>


Prepare the input image for inference with OpenCV. First, upload the image to the GPU, resize it to match the model's input dimensions, and normalize it, with meanR, meanG and meanB being the values used for mean subtraction. Then transform the data to a tensor by copying it channel by channel to gpu_input. In the case of ssd_mobilenet.onnx, the input dimension is 1x3x300x300.

void preprocessImage(cv::Mat &img, float *gpu_input, const nvinfer1::Dims &dims, float meanR, float meanG, float meanB)
{
  if (img.empty()) {
    std::cerr << "Image is empty." << std::endl;
    return;
  }

  cv::cuda::GpuMat gpu_frame;
  // Upload image to GPU
  gpu_frame.upload(img);

  // input_dims is in NxCxHxW format.
  auto input_width = dims.d[3];
  auto input_height = dims.d[2];
  auto channels = dims.d[1];
  auto input_size = cv::Size(input_width, input_height);

  // Resize
  cv::cuda::GpuMat resized;
  cv::cuda::resize(gpu_frame, resized, input_size, 0, 0, cv::INTER_NEAREST);

  // Normalize
  cv::cuda::GpuMat flt_image;
  resized.convertTo(flt_image, CV_32FC3);
  cv::cuda::subtract(flt_image, cv::Scalar(meanR, meanG, meanB), flt_image, cv::noArray(), -1);
  cv::cuda::divide(flt_image, cv::Scalar(127.5f, 127.5f, 127.5f), flt_image, 1, -1);

  // To tensor: each GpuMat wraps one channel plane of gpu_input, so split() writes directly into it
  std::vector<cv::cuda::GpuMat> chw;
  for (int i = 0; i < channels; ++i)
    chw.emplace_back(cv::cuda::GpuMat(input_size, CV_32FC1, gpu_input + i * input_width * input_height));
  cv::cuda::split(flt_image, chw);
}
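
To make the "to tensor" step more concrete, here is a minimal CPU-only sketch of the same layout change, from interleaved HxWx3 pixels to planar 3xHxW floats, combined with the mean subtraction and the division by 127.5. The function name hwcToChw and its signature are illustrative assumptions and are not part of the tutorial source; the mean values are applied to the channels in the order in which they are stored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convert an interleaved HxWx3 8-bit image into a planar 3xHxW float tensor.
// Each value has the mean of its channel subtracted and is divided by 127.5.
// hwcToChw is a hypothetical helper used only for illustration.
std::vector<float> hwcToChw(const std::vector<unsigned char> &hwc, int height, int width, const float mean[3])
{
  const int channels = 3;
  const std::size_t plane = static_cast<std::size_t>(height) * width;
  std::vector<float> chw(channels * plane);
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      for (int c = 0; c < channels; ++c) {
        // Source index in interleaved layout, destination index in planar layout.
        const std::size_t src = (static_cast<std::size_t>(y) * width + x) * channels + c;
        const std::size_t dst = c * plane + static_cast<std::size_t>(y) * width + x;
        chw[dst] = (static_cast<float>(hwc[src]) - mean[c]) / 127.5f;
      }
    }
  }
  return chw;
}
```

With a mean of 127.5 per channel, this maps 8-bit pixel values to the range [-1, 1], which is the effect of the subtract and divide calls in preprocessImage().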


After running the inference, the dimensions of the results depend on the model used. These results must be post-processed. In the case of ssd_mobilenet.onnx, there are two outputs:

  • scores of dimension : 1x3000x21
  • boxes of dimension : 1x3000x4

The model outputs 3000 candidate bounding boxes, each with 21 scores (one per class). Since the inference results reside on the GPU, they must first be copied to the CPU. Post-processing then consists of discarding the predictions whose class confidence is too low and merging multiple detections that occur at approximately the same location. confThresh is the confidence threshold used to filter the detections after inference. nmsThresh is the non-maximum suppression (NMS) threshold, used to merge multiple detections located at approximately the same place.

std::vector<cv::Rect> postprocessResults(std::vector<void *> buffers, const std::vector<nvinfer1::Dims> &output_dims,
                                         int batch_size, int image_width, int image_height, float confThresh,
                                         float nmsThresh, std::vector<int> &classIds)
{
  // private variables of vpDetectorDNN
  std::vector<cv::Rect> m_boxes, m_boxesNMS;
  std::vector<int> m_classIds;
  std::vector<float> m_confidences;
  std::vector<int> m_indices;

  // copy results from GPU to CPU
  std::vector<std::vector<float> > cpu_outputs;
  for (size_t i = 0; i < output_dims.size(); i++) {
    cpu_outputs.push_back(std::vector<float>(getSizeByDim(output_dims[i]) * batch_size));
    cudaMemcpy(cpu_outputs[i].data(), (float *)buffers[1 + i], cpu_outputs[i].size() * sizeof(float),
               cudaMemcpyDeviceToHost);
  }

  // post process
  int N = output_dims[0].d[1], C = output_dims[0].d[2]; // (1 x N x C format); N: Number of output detection boxes
                                                        // (fixed in the model), C: Number of classes.
  for (int i = 0; i < N; i++) // for all N (boxes)
  {
    uint32_t maxClass = 0;
    float maxScore = -1000.0f;

    for (int j = 1; j < C; j++) // ignore background (classId = 0).
    {
      const float score = cpu_outputs[0][i * C + j];
      if (score < confThresh)
        continue;
      if (score > maxScore) {
        maxScore = score;
        maxClass = j;
      }
    }

    if (maxScore > confThresh) {
      int left = (int)(cpu_outputs[1][4 * i] * image_width);
      int top = (int)(cpu_outputs[1][4 * i + 1] * image_height);
      int right = (int)(cpu_outputs[1][4 * i + 2] * image_width);
      int bottom = (int)(cpu_outputs[1][4 * i + 3] * image_height);
      int width = right - left + 1;
      int height = bottom - top + 1;
      m_boxes.push_back(cv::Rect(left, top, width, height));
      m_classIds.push_back(maxClass);
      m_confidences.push_back(maxScore);
    }
  }

  cv::dnn::NMSBoxes(m_boxes, m_confidences, confThresh, nmsThresh, m_indices);
  m_boxesNMS.resize(m_indices.size());
  classIds.clear();
  for (size_t i = 0; i < m_indices.size(); ++i) {
    int idx = m_indices[i];
    m_boxesNMS[i] = m_boxes[idx];
    classIds.push_back(m_classIds[idx]); // Returning the class Id of each box kept by NMS, in the same order.
  }
  return m_boxesNMS;
}
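
The merging step is delegated to cv::dnn::NMSBoxes. As a rough illustration of the idea behind non-maximum suppression, the following self-contained sketch implements a plain greedy variant: boxes are visited by decreasing score, and a box is kept only if its intersection over union (IoU) with every box already kept does not exceed nmsThresh. The Box struct and the iou and greedyNms names are assumptions made for this example; this is not OpenCV's implementation, and details such as tie handling may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Axis-aligned box, a simplified stand-in for cv::Rect (illustrative only).
struct Box {
  int x, y, width, height;
};

// Intersection over union of two boxes.
float iou(const Box &a, const Box &b)
{
  const int x1 = std::max(a.x, b.x);
  const int y1 = std::max(a.y, b.y);
  const int x2 = std::min(a.x + a.width, b.x + b.width);
  const int y2 = std::min(a.y + a.height, b.y + b.height);
  const int inter = std::max(0, x2 - x1) * std::max(0, y2 - y1);
  const int uni = a.width * a.height + b.width * b.height - inter;
  return uni > 0 ? static_cast<float>(inter) / uni : 0.f;
}

// Greedy NMS: returns the indices of the boxes that are kept.
std::vector<int> greedyNms(const std::vector<Box> &boxes, const std::vector<float> &scores, float confThresh,
                           float nmsThresh)
{
  std::vector<int> order(boxes.size());
  std::iota(order.begin(), order.end(), 0);
  // Highest score first.
  std::stable_sort(order.begin(), order.end(), [&](int a, int b) { return scores[a] > scores[b]; });

  std::vector<int> kept;
  for (int idx : order) {
    if (scores[idx] <= confThresh)
      continue; // not confident enough
    bool keep = true;
    for (int k : kept) {
      if (iou(boxes[idx], boxes[k]) > nmsThresh) {
        keep = false; // overlaps too much with a better box
        break;
      }
    }
    if (keep)
      kept.push_back(idx);
  }
  return kept;
}
```

The returned indices play the same role as m_indices in postprocessResults(): they select which of the candidate boxes survive.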

Parse ONNX Model

The ONNX model is parsed by the following function:

bool parseOnnxModel(const std::string &model_path, TRTUniquePtr<nvinfer1::ICudaEngine> &engine,
TRTUniquePtr<nvinfer1::IExecutionContext> &context)

model_path is the path to the ONNX file.

engine is the TensorRT engine, used for executing inference on a built network.

context is the execution context, used for executing inference with the engine.
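
The TRTUniquePtr type used in this signature is not defined in the snippets shown here. One plausible definition, given as an assumption rather than as the tutorial's actual code, is a std::unique_ptr with a custom deleter that releases the object through its destroy() method, which is the release convention of the TensorRT 7 and 8.0 interface classes. The FakeEngine type below is a hypothetical stand-in that only serves to exercise the alias without TensorRT.

```cpp
#include <cassert>
#include <memory>

// Deleter that releases an object by calling its destroy() method.
struct TRTDestroy {
  template <class T> void operator()(T *obj) const
  {
    if (obj)
      obj->destroy();
  }
};

// Owning smart pointer for objects released with destroy() (assumed definition).
template <class T> using TRTUniquePtr = std::unique_ptr<T, TRTDestroy>;

// Hypothetical stand-in for a TensorRT object; counts how often destroy() runs.
struct FakeEngine {
  int *destroyed;
  void destroy()
  {
    ++(*destroyed);
    delete this;
  }
};
```

With such an alias, engine.reset(...) and leaving the scope both release the previously held object automatically.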

To parse the ONNX model, we first initialize the TensorRT context and engine. To do this, we create an instance of Builder. With the Builder, we can create a Network, which is in turn used to create the Parser.

If the GPU inference engine has already been built once, it has been serialized and saved in a cache file (with the .engine extension). In this case, the engine file is loaded, the inference runtime is created, and the engine and context are initialized from it.

if (vpIoTools::checkFilename(cache_path)) {
  char *engineStream = NULL;
  size_t engineSize = 0;

  // determine the file size of the engine
  struct stat filestat;
  stat(cache_path, &filestat);
  engineSize = filestat.st_size;

  // allocate memory to hold the engine
  engineStream = (char *)malloc(engineSize);

  // open the engine cache file from disk
  FILE *cacheFile = NULL;
  cacheFile = fopen(cache_path, "rb");

  // read the serialized engine into memory
  const size_t bytesRead = fread(engineStream, 1, engineSize, cacheFile);

  if (bytesRead != engineSize) // Problem while deserializing.
  {
    std::cerr << "Error reading serialized engine into memory." << std::endl;
    return false;
  }

  // close the plan cache
  fclose(cacheFile);

  // Recreate the inference runtime
  TRTUniquePtr<nvinfer1::IRuntime> infer{nvinfer1::createInferRuntime(gLogger)};
  engine.reset(infer->deserializeCudaEngine(engineStream, engineSize, NULL));
  context.reset(engine->createExecutionContext());

  return true;
}

Otherwise (the first time only), we parse the ONNX model and create an instance of the builder. The builder can be configured to set the amount of GPU memory available for tactic selection and to enable FP16/INT8 modes. We then create the engine and context used in the main pipeline, and serialize and save the engine for later use.

else {
  if (!vpIoTools::checkFilename(model_path)) {
    std::cerr << "Could not parse ONNX model. File not found" << std::endl;
    return false;
  }
  TRTUniquePtr<nvinfer1::IBuilder> builder{nvinfer1::createInferBuilder(gLogger)};
  TRTUniquePtr<nvinfer1::INetworkDefinition> network{
      builder->createNetworkV2(1U << (uint32_t)nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)};
  TRTUniquePtr<nvonnxparser::IParser> parser{nvonnxparser::createParser(*network, gLogger)};

  // parse ONNX
  if (!parser->parseFromFile(model_path.c_str(), static_cast<int>(nvinfer1::ILogger::Severity::kINFO))) {
    std::cerr << "ERROR: could not parse the model." << std::endl;
    return false;
  }

  TRTUniquePtr<nvinfer1::IBuilderConfig> config{builder->createBuilderConfig()};
  // allow TRT to use up to 32 MiB (32 << 20 bytes) of GPU memory for tactic selection
  config->setMaxWorkspaceSize(32 << 20);
  // use FP16 mode if possible
  if (builder->platformHasFastFp16()) {
    config->setFlag(nvinfer1::BuilderFlag::kFP16);
  }

  engine.reset(builder->buildEngineWithConfig(*network, *config));
  context.reset(engine->createExecutionContext());

  TRTUniquePtr<nvinfer1::IHostMemory> serMem{engine->serialize()};
  if (!serMem) {
    std::cout << "Failed to serialize CUDA engine." << std::endl;
    return false;
  }

  const char *serData = (char *)serMem->data();
  const size_t serSize = serMem->size();

  // allocate memory to store the bitstream
  char *engineMemory = (char *)malloc(serSize);
  if (!engineMemory) {
    std::cout << "Failed to allocate memory to store CUDA engine." << std::endl;
    return false;
  }

  memcpy(engineMemory, serData, serSize);

  // write the cache file
  FILE *cacheFile = NULL;
  cacheFile = fopen(cache_path, "wb");
  fwrite(engineMemory, 1, serSize, cacheFile);
  fclose(cacheFile);
  free(engineMemory);

  return true;
}
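
The two branches read and write the engine cache with C file I/O and manual memory management. As an alternative sketch, the same round trip can be written with standard C++ streams and std::vector, which closes the file and frees the buffer automatically on every return path. The readBinaryFile and writeBinaryFile helpers are hypothetical names introduced for this example and do not appear in the tutorial source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Read a whole binary file into memory; returns false if it cannot be opened.
bool readBinaryFile(const std::string &path, std::vector<char> &data)
{
  std::ifstream file(path, std::ios::binary);
  if (!file)
    return false;
  data.assign(std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>());
  return true;
}

// Write a buffer to a binary file; returns false on failure.
bool writeBinaryFile(const std::string &path, const char *data, std::size_t size)
{
  std::ofstream file(path, std::ios::binary);
  if (!file)
    return false;
  file.write(data, static_cast<std::streamsize>(size));
  file.flush();
  return file.good();
}
```

In the tutorial code, the buffer returned by such a read would be passed as data() and size() to deserializeCudaEngine(), and the pointer and size of the serialized engine would be passed to the write helper.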

Main pipeline

Start by parsing the model and creating engine and context.

// Parse the model and initialize the engine and the context.
TRTUniquePtr<nvinfer1::ICudaEngine> engine{nullptr};
TRTUniquePtr<nvinfer1::IExecutionContext> context{nullptr};
if (!parseOnnxModel(model_path, engine, context)) // Problem parsing Onnx model
{
  std::cout << "Make sure the model file exists. To see available models, please visit: "
            << std::endl;
  return 0;
}

Using the engine, we can get the dimensions of the input and outputs and allocate the corresponding buffers.

for (int i = 0; i < engine->getNbBindings(); ++i) {
  auto binding_size = getSizeByDim(engine->getBindingDimensions(i)) * batch_size * sizeof(float);
  cudaMalloc(&buffers[i], binding_size);
  if (engine->bindingIsInput(i)) {
    input_dims.emplace_back(engine->getBindingDimensions(i));
  } else {
    output_dims.emplace_back(engine->getBindingDimensions(i));
  }
}
if (input_dims.empty() || output_dims.empty()) {
  std::cerr << "Expect at least one input and one output for network" << std::endl;
  return -1;
}
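
The getSizeByDim() helper used to size the buffers is not shown in the snippets above. A minimal sketch of what it presumably computes, the number of elements in a tensor as the product of its dimensions, is given below. The Dims struct is a simplified stand-in for nvinfer1::Dims so that the example compiles without TensorRT, and the body of the helper is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for nvinfer1::Dims (illustrative only).
struct Dims {
  int nbDims; // number of dimensions actually used
  int d[8];   // extent of each dimension
};

// Number of elements in a tensor: the product of all its dimensions.
std::size_t getSizeByDim(const Dims &dims)
{
  std::size_t size = 1;
  for (int i = 0; i < dims.nbDims; ++i)
    size *= static_cast<std::size_t>(dims.d[i]);
  return size;
}
```

For ssd_mobilenet.onnx this gives 1 x 3 x 300 x 300 = 270000 input elements, 1 x 3000 x 21 = 63000 scores and 1 x 3000 x 4 = 12000 box coordinates; multiplying by sizeof(float) yields the size in bytes passed to cudaMalloc().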

Create a grabber to retrieve images from a webcam (or external camera), or to read them from an image sequence or a video file.

cv::VideoCapture capture;
if (input.empty()) {
  // no input file given: open the webcam or external camera
} else {
  capture.open(input);
}
if (!capture.isOpened()) { // check if we succeeded
  std::cout << "Failed to open the camera" << std::endl;
  return -1;
}
int cap_width = (int)capture.get(cv::CAP_PROP_FRAME_WIDTH);
int cap_height = (int)capture.get(cv::CAP_PROP_FRAME_HEIGHT);
capture.set(cv::CAP_PROP_FRAME_WIDTH, cap_width / opt_scale);
capture.set(cv::CAP_PROP_FRAME_HEIGHT, cap_height / opt_scale);

The main loop then performs the following steps for each frame:

  • Capture a new frame from the grabber,
  • Convert this frame to vpImage used for display,
  • Call preprocessImage() function to copy the frame to GPU and store in input buffer,
  • Perform inference with context->enqueue(),
  • Call postprocessResults() function to filter the outputs,
  • Display the image with the bounding boxes.
    while (!vpDisplay::getClick(I, false)) {
      // get frame.
      capture >> frame;
      vpImageConvert::convert(frame, I);
      int height = I.getHeight();
      int width = I.getWidth();
      vpDisplay::display(I);
      // preprocess
      double start = vpTime::measureTimeMs();
      preprocessImage(frame, (float *)buffers[0], input_dims[0], meanR, meanG, meanB);
      // inference.
      context->enqueue(batch_size, buffers.data(), 0, nullptr);
      // post-process
      boxesNMS = postprocessResults(buffers, output_dims, batch_size, width, height, confThresh, nmsThresh, classIds);
      double stop = vpTime::measureTimeMs();
      // display.
      vpDisplay::displayText(I, 10, 10, std::to_string(stop - start), vpColor::red);
      for (unsigned int i = 0; i < boxesNMS.size(); i++) {
        vpDisplay::displayRectangle(I, vpRect(boxesNMS[i].x, boxesNMS[i].y, boxesNMS[i].width, boxesNMS[i].height),
                                    vpColor::red, false, 2);
        vpDisplay::displayText(I, boxesNMS[i].y - 10, boxesNMS[i].x, labels[classIds[i]], vpColor::red);
      }
      vpDisplay::flush(I);
    }


To use this tutorial, you need a USB webcam, an ONNX file of a model, and its corresponding labels in a txt file. To start, you may download the ssd_mobilenet.onnx model and the pascal-voc-labels.txt file from here, or follow Install ViSP data set to clone the GitHub repository.

To see the options, run:

$ ./tutorial-dnn-tensorrt-live --help

Assuming you have downloaded the files (model and labels), run the following command to detect objects in images from the webcam:

$ ./tutorial-dnn-tensorrt-live --model ssd_mobilenet.onnx --labels pascal-voc-labels.txt
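
The main loop looks up the text of each detection with labels[classIds[i]], so the labels file has to be loaded into a container indexed by class Id. The sketch below shows one way this could be done, assuming that the file holds one label per line and that the line order matches the class Ids, with the background class first. Both the file layout and the readLabels name are assumptions made for this example.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read one label per line from a stream, skipping empty lines.
// The position of a label in the returned vector is taken as its class Id.
std::vector<std::string> readLabels(std::istream &in)
{
  std::vector<std::string> labels;
  std::string line;
  while (std::getline(in, line)) {
    if (!line.empty() && line.back() == '\r')
      line.pop_back(); // tolerate Windows line endings
    if (!line.empty())
      labels.push_back(line);
  }
  return labels;
}
```

To read from a file, the function can be called with a std::ifstream opened on the path given to the --labels option.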

Running the above example on an image will show results like the following:


An example of the object detection can be viewed in this video.