Business Impact

  • 42% reduction in latency
  • 73% reduction in memory usage

Customer Key Facts

  • Country: US
  • Industry: Robotics & Waste Management

Reduce Latency and Memory Utilization of Object Detection Model

The client is a leading robotics company that detects and classifies different categories of waste materials for better recycling. They trained an object detection model to detect waste products and deployed it on an NVIDIA RTX 2080 Ti using the TF-TRT framework. This deployment suffered from high latency and memory utilization, preventing them from running additional models on the same GPU.
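For context, a TF-TRT deployment of this kind typically starts from a TensorFlow SavedModel and wraps the supported subgraphs in TensorRT engines. The sketch below shows what such a conversion usually looks like; the model path and precision mode are illustrative assumptions, not the client's actual configuration.

```python
# Minimal sketch of a baseline TF-TRT conversion, assuming the detector
# was exported as a TensorFlow SavedModel. Paths and precision are hypothetical.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="waste_detector_savedmodel",  # hypothetical export path
    precision_mode=trt.TrtPrecisionMode.FP16,           # assumed precision setting
)
converter.convert()  # wraps supported subgraphs in TensorRT engines
converter.save("waste_detector_tftrt")
```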

Challenges

  • High inference time and memory usage
  • Deployment of only one model at a time

Technologies Used

  • NVIDIA TensorRT
  • NVIDIA Triton Inference Server

Solution

Quantiphi developed a new deployment strategy using TensorRT and Triton Inference Server to address these bottlenecks. We built a scalable pipeline that reduces the model's memory footprint by 73% and cuts latency by 42%, enabling the client to run more models on the same GPU and increase productivity.
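As an illustration of the serving side, a model hosted on Triton Inference Server can be queried over its standard HTTP API. The model name, tensor names, and input shape below are assumptions for illustration, not the client's actual interface.

```python
# Sketch of querying an optimized model through Triton's HTTP API.
# Model/tensor names and shapes are illustrative assumptions.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder batch of one 640x640 RGB image in NCHW layout (assumed shape).
image = np.zeros((1, 3, 640, 640), dtype=np.float32)

inputs = [httpclient.InferInput("images", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [httpclient.InferRequestedOutput("detections")]

result = client.infer(model_name="waste_detector_trt", inputs=inputs, outputs=outputs)
print(result.as_numpy("detections").shape)  # raw detection tensor from the server
```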

Results

  • Faster inference at roughly one quarter of the original memory footprint
  • Deployable and scalable within the existing inference pipeline with minimal changes, as the solution is delivered as a Docker container (see the sketch below)
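A container-based Triton deployment of this kind usually amounts to mounting a model repository into the official server image. The sketch below uses the Docker SDK for Python; the image tag, repository path, and port mapping are placeholder assumptions.

```python
# Sketch of launching Triton as a container via the Docker SDK for Python.
# Image tag, model-repository path, and ports are placeholder assumptions.
import docker

client = docker.from_env()
client.containers.run(
    "nvcr.io/nvidia/tritonserver:23.08-py3",  # placeholder release tag
    command="tritonserver --model-repository=/models",
    volumes={"/srv/model_repository": {"bind": "/models", "mode": "ro"}},
    ports={"8000/tcp": 8000, "8001/tcp": 8001, "8002/tcp": 8002},
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    detach=True,
)
```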
