Case Study

Optimizing Large CV models using TensorRT and Triton Inference Server

Robotics and Waste Management

Business Impacts

  • 42% reduction in latency
  • 73% reduction in memory usage

Customer Key Facts

  • Country: US
  • Industry: Robotics & Waste Management

Reduce Latency and Memory Utilization of Object Detection Model

The client is a leading robotics company that detects and classifies different categories of waste materials for better recycling. They trained an object detection model to detect waste products and deployed it on an NVIDIA RTX 2080 Ti using the TF-TRT framework. This deployment had high latency and memory utilization, preventing them from running more models on the same GPU.
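
For context, a TF-TRT deployment of this kind typically looks like the short sketch below. The SavedModel directories and FP16 precision mode are illustrative assumptions rather than the client's actual configuration, and the keyword-style constructor assumes a reasonably recent TensorFlow 2.x release.

```python
# Minimal TF-TRT conversion sketch (paths and precision are assumptions, not the client's setup).
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert a TensorFlow SavedModel so that supported subgraphs run as TensorRT engines.
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="waste_detector_saved_model",  # hypothetical exported model directory
    precision_mode=trt.TrtPrecisionMode.FP16,            # reduced precision on the RTX 2080 Ti
)
converter.convert()
converter.save("waste_detector_tftrt")  # hypothetical output directory

# Unsupported parts of the graph still run in TensorFlow, so the full TF runtime stays
# resident on the GPU -- one source of the memory and latency overhead described above.
```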

Challenges

  • High inference time and memory usage
  • Deployment of only one model at a time

Technologies Used

NVIDIA TensorRT

NVIDIA Triton Inference Server

Solution

Quantiphi has developed a new deployment strategy using TensorRT and Triton Inference Server to eliminate the bottlenecks in the model. We created a scalable pipeline that reduces the model's memory footprint by 73% and cuts latency by 42%. This has enabled the client to run more models on the same GPU and increase productivity.
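
As an illustration of the general approach, not the client's exact code, one common way to realize such a strategy is to export the detector to ONNX, build a native TensorRT engine, and serve the resulting plan file from a Triton model repository. The file names, precision flag, workspace size, and repository layout below are assumptions for the sketch.

```python
# Sketch: build a standalone TensorRT engine from an ONNX export (file names assumed).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("waste_detector.onnx", "rb") as f:  # hypothetical ONNX export of the detector
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # reduced precision lowers memory use and latency
# TensorRT >= 8.4 API; older releases set config.max_workspace_size instead.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

serialized_engine = builder.build_serialized_network(network, config)

# Triton loads plan files from a model repository, e.g.
#   models/waste_detector/1/model.plan  (directory layout assumed for illustration)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```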

Results

  • Faster inference speeds at roughly one-quarter of the original memory footprint.
  • Deployable and scalable within the existing inference pipeline with minimal changes, as the solution is delivered as a Docker container (see the sketch following this list).
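
To illustrate the "minimal changes" point, the existing pipeline only needs to swap its local inference call for a request to the Triton container. The model name, input/output tensor names, and image shape below are placeholders, not the client's actual configuration.

```python
# Sketch: querying the Triton container from the existing pipeline
# (model name, tensor names, and shapes are placeholders).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Preprocessed frame from the sorting line; shape and layout assumed for illustration.
image = np.zeros((1, 3, 640, 640), dtype=np.float32)

inputs = [httpclient.InferInput("input", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [httpclient.InferRequestedOutput("detections")]

result = client.infer(model_name="waste_detector", inputs=inputs, outputs=outputs)
detections = result.as_numpy("detections")  # boxes/classes/scores; format is model-specific
```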
