42% reduction in latency
73% reduction in memory usage
The client is a leading robotics company that detects and classifies different categories of waste materials for better recycling. They trained an object detection model to detect waste products and deployed it on an NVIDIA RTX 2080 Ti using the TF-TRT framework. This deployment strategy suffered from high latency and high memory utilization, preventing them from running more models on the same GPU.
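For context, a TF-TRT deployment along the lines of the client's original setup typically starts from a conversion like the sketch below. The SavedModel paths and the FP16 precision mode are illustrative assumptions, not details of the client's actual pipeline.

```python
# Illustrative TF-TRT conversion (TensorFlow 2.x). TF-TRT embeds TensorRT
# subgraphs inside a TensorFlow SavedModel, so the full TensorFlow runtime
# must stay resident on the GPU at inference time.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="detector_saved_model",  # hypothetical input path
    precision_mode=trt.TrtPrecisionMode.FP16,      # assumed precision mode
)
converter.convert()
converter.save("detector_tftrt")  # hypothetical output path
```

Because the converted model still executes inside TensorFlow, the TensorFlow runtime and any unconverted graph segments occupy GPU memory alongside the TensorRT engines, which is a common source of the kind of memory overhead described above.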
Quantiphi developed a new deployment strategy using TensorRT and Triton Inference Server to eliminate these bottlenecks. We built a scalable pipeline that reduces the model's memory footprint by 73% and cuts its latency by 42%. This has enabled the client to run more models on the same GPU and increase productivity.
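A minimal sketch of the TensorRT side of such a pipeline is shown below, assuming the detector has first been exported to ONNX. The file names and the FP16 flag are illustrative, not the client's actual configuration.

```python
# Illustrative standalone TensorRT engine build (TensorRT 8.x Python API).
# The resulting plan file runs without the TensorFlow runtime, which is
# where much of the memory saving comes from relative to TF-TRT.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("detector.onnx", "rb") as f:  # assumed ONNX export of the model
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # assumed precision mode

serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```

The serialized model.plan is then placed in a Triton model repository (for example, models/detector/1/model.plan with an accompanying config.pbtxt), and Triton loads and schedules it alongside other models on the same GPU.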