Optimizing Large CV models using TensorRT and Triton Inference Server
Robotics and Waste Management
Business Impacts
- 42% reduction in latency
- 73% reduction in memory usage
Customer Key Facts
- Country: US
- Industry: Robotics & Waste Management
Reduce Latency and Memory Utilization of Object Detection Model
The client is a leading robotics company that detects and classifies different categories of waste material for better recycling. They trained an object detection model to detect waste products and deployed it on an NVIDIA RTX 2080 Ti using the TF-TRT framework. This deployment suffered from high latency and memory utilization, preventing them from running additional models on the same GPU.
Challenges
- High inference time and memory usage
- Only one model could be deployed at a time
Technologies Used
NVIDIA TensorRT
NVIDIA Triton Inference Server
Solution
Quantiphi developed a new deployment strategy using TensorRT and Triton Inference Server to remove the bottlenecks in the model. We created a scalable pipeline that reduces the model's memory footprint by 73% and cuts latency by 42%, enabling the client to run more models on the same GPU and increase productivity.
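The optimization step can be sketched with TensorRT's `trtexec` tool. A minimal sketch, assuming the trained detector has been exported to ONNX; the file names and the FP16 precision flag are illustrative, not the client's actual settings:

```shell
# Build a serialized TensorRT engine from an ONNX export of the detector.
# --fp16 enables half-precision, a common source of the kind of memory
# and latency savings described above.
trtexec --onnx=model.onnx \
        --fp16 \
        --saveEngine=model.plan
```

The resulting `model.plan` engine can then be placed in a Triton model repository and served alongside other models on the same GPU.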
Results
- Faster inference at roughly one-quarter of the original memory footprint.
- Deployable and scalable within the existing inference pipeline with minimal changes, as the solution is delivered as a Docker container.
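Containerized delivery can be illustrated with the stock Triton Inference Server image. A hedged sketch, assuming a local model repository at `/opt/models`; the repository path and the `<xx.yy>` image tag are placeholders, not the client's actual values:

```shell
# Serve every model in the repository from one Triton instance on one GPU.
# Ports: 8000 = HTTP, 8001 = gRPC, 8002 = Prometheus metrics.
docker run --rm --gpus=1 \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /opt/models:/models \
  nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
  tritonserver --model-repository=/models
```

Because Triton loads every model found in the repository, adding another model to the same GPU is a matter of dropping its directory into `/opt/models` rather than changing the pipeline.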