Optimizing Large CV models using TensorRT and Triton Inference Server
Robotics and Waste Management
Business Impacts
- 42% reduction in latency
- 73% reduction in memory usage
Customer Key Facts
- Country: US
- Industry: Robotics & Waste Management
Reduce Latency and Memory Utilization of Object Detection Model
The client is a leading robotics company that detects and classifies different categories of waste material for better recycling. They trained an object detection model to detect waste products and deployed it on an NVIDIA RTX 2080 Ti using the TF-TRT framework. This deployment suffered from high latency and memory utilization, preventing them from running additional models on the same GPU.
Challenges
- High inference time and memory usage
- Only one model could be deployed at a time
Technologies Used
NVIDIA TensorRT
NVIDIA Triton Inference Server
Solution
Quantiphi developed a new deployment strategy using TensorRT and Triton Inference Server to remove the bottlenecks in the model. We created a scalable pipeline that reduces the model's memory footprint by 73% and cuts latency by 42%, enabling the client to run more models on the same GPU and increase productivity.
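The optimization step can be sketched with TensorRT's `trtexec` tool. A minimal sketch, assuming the trained detector has been exported to ONNX; the file names and the FP16 precision flag are illustrative, not the client's actual settings:

```shell
# Build a serialized TensorRT engine from an ONNX export of the detector.
# --fp16 enables half-precision, a common source of the kind of memory
# and latency savings described above.
trtexec --onnx=model.onnx \
        --fp16 \
        --saveEngine=model.plan
```

The resulting `model.plan` engine can then be placed in a Triton model repository and served alongside other models on the same GPU.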
Results
- Faster inference at roughly one-quarter of the original memory footprint.
- Deployable and scalable within the existing inference pipeline with minimal changes, as the solution is delivered as a Docker container.
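Containerized delivery can be illustrated with the stock Triton Inference Server image. A hedged sketch, assuming a local model repository at `/opt/models`; the repository path and the `<xx.yy>` image tag are placeholders, not the client's actual values:

```shell
# Serve every model in the repository from one Triton instance on one GPU.
# Ports: 8000 = HTTP, 8001 = gRPC, 8002 = Prometheus metrics.
docker run --rm --gpus=1 \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /opt/models:/models \
  nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
  tritonserver --model-repository=/models
```

Because Triton loads every model found in the repository, adding another model to the same GPU is a matter of dropping its directory into `/opt/models` rather than changing the pipeline.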