Deploying AI at the Edge to Improve Railroad Safety using NVIDIA Jetson and Metropolis

Cities and industries around the world are investing in vision AI technologies to automate and improve safety and operations efficiency in our physical spaces. Whether it is improving traffic congestion, providing checkout-free shopping, or automating industrial inspection — the power of AI-enabled computer vision and edge computing is critical.

In this blog post, we discuss a vision AI solution to automate operations for freight trains and explain how we leveraged the NVIDIA Metropolis platform to develop and deploy the application. We trained deep learning models to accurately detect trespassers on railroads and to identify railroad signs and markings, then optimized them for inference with the NVIDIA TensorRT SDK. We used the NVIDIA Jetson AGX Xavier to run our AI models at the edge.

Introduction to the Solution

We developed an end-to-end solution to improve the safety of freight trains by detecting trespassers on railroad tracks, identifying signals and providing insights to aid the operator during transit.

Our solution was deployed on the NVIDIA Jetson AGX Xavier, a small-form-factor computing device optimized for energy efficiency that still provides a powerful hardware stack for running end-to-end applications. The device boasts a 512-core Volta GPU with Tensor Cores, 32 GB of RAM, two Deep Learning Accelerators, a Vision Accelerator, and support for hardware-based video encoding and decoding. You can get more information on the device's specifications here.

The TensorRT framework, shown in Figure 1, is an SDK for optimizing models for high-performance deep learning inference. It provides optimizations such as reduced- and mixed-precision inference and layer fusion to decrease latency and improve model throughput.

Figure 1. The TensorRT framework provides model optimization by 1) Model Quantization, 2) Layer and Tensor Fusion, 3 – 6) Additional low-level optimizations

Problem Statement

According to the US Department of Transportation, there are about 5,800 train-car crashes each year in the United States, most of which occur at railroad crossings. These accidents cause about 600 deaths and about 2,300 injuries annually. The key challenge we set out to solve was improving safety by aiding train operators in monitoring the environment, especially during long transit times. The environment the train passes through is complex and can include unpredictable events, such as trespassers or objects on the tracks. In addition, the operator must understand a variety of traffic rules, such as flashing signals or specific letters near the signals that convey additional information.

Figure 2. The end-to-end solution was designed to meet the requirements shown in the figure: detect and track signals and any associated letter or number markings, then map them to a signal rule that conveys their meaning to the operator. The system must also identify trespassers and raise an alert when one is detected.

This presents many challenges as the solution has to be capable of detecting and tracking people as well as signals across time frames. The solution should also determine whether the detected person is a worker on site or a trespasser, as well as associated signals and their meaning. We built our solution using the NVIDIA Metropolis application stack to incorporate these capabilities.

Figure 3. Object Detection, Classification and Tracking Machine Learning models were required for identifying railroad signals and people for the end-to-end solution.

Solution Approach

Solution Requirements

To detect and monitor people and signals, we needed the end-to-end solution to be deployed on a single Jetson AGX Xavier device, with an end-to-end throughput of at least 10 FPS, as well as a web-based UI for visualization and analytics.

We therefore had to weigh a number of tradeoffs, such as picking object detection backbones like MobileNet instead of ResNet50: a smaller memory footprint and significantly lower inference latency, at the expense of some accuracy. Deep learning models also perform best with large quantities of data, so collecting enough data posed an additional challenge. In the following sections, we detail how we addressed these challenges.

Figure 4. When building the solution, the client had the following constraints that provided additional challenges when building and deploying the solution.

Software Architecture and Design Principles

The high-level software architecture, showing the flow of information between the modules, is given in Figure 5. The flow begins with an input image from an RTSP stream or video file being passed into the 1st stage detector. To keep things modular, we designed specifications for what the input and output of each module should look like, with an example given in Code Block 1 below.

    'object_detector': {
        'input': [{
            'name' : 'image',
            'shape': (batch_size, 1280, 720, 3),
            'dtype': np.float32
        }],
        'output': [{
            'name' : 'bbox',
            'shape': (batch_size, num_objs, 4),
            'dtype': np.float32
        }, {
            'name' : 'class',
            'shape': (batch_size, num_objs),
            'dtype': np.float32
        }, {
            'name' : 'scores',
            'shape': (batch_size, num_objs),
            'dtype': np.float32
        }]
    }

Code Block 1. To ensure modularity, we designed specifications on what the input and output of each module should look like to ensure that changing the internal algorithm or model does not affect the functionality of the pipeline. The code block above shows a sample of what the input and output of the 1st stage detector looks like.
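A minimal sketch of how such a specification can be enforced at a module boundary, using NumPy. The `validate` helper and the example spec below are illustrative, not the production code:

```python
import numpy as np

def validate(tensors, specs):
    """Check a dict of named arrays against a list of tensor specs
    of the form {'name': ..., 'shape': ..., 'dtype': ...}."""
    for spec in specs:
        arr = tensors[spec['name']]
        if arr.shape != spec['shape']:
            raise ValueError(f"{spec['name']}: expected shape {spec['shape']}, got {arr.shape}")
        if arr.dtype != spec['dtype']:
            raise ValueError(f"{spec['name']}: expected dtype {spec['dtype']}, got {arr.dtype}")
    return True

# Hypothetical output spec for a single-image batch with 5 detections.
output_spec = [
    {'name': 'bbox',   'shape': (1, 5, 4), 'dtype': np.float32},
    {'name': 'scores', 'shape': (1, 5),    'dtype': np.float32},
]
```

Validating at every boundary like this means a module's internal model can be swapped out without silently breaking its consumers downstream.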

To keep the modules from blocking one another under the limitations imposed by Python's Global Interpreter Lock (GIL), we ran each module in a separate process.
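A minimal sketch of this process-per-module pattern, with queues connecting the stages. The worker and queue names are illustrative, and the inference step is replaced by a dummy detection:

```python
import multiprocessing as mp

def detector_worker(in_q, out_q):
    # Runs in its own process, so it is not serialized against the
    # other modules by the main interpreter's GIL.
    for frame_id in iter(in_q.get, None):  # None is the shutdown sentinel
        # Stand-in for real inference: emit one dummy detection per frame.
        out_q.put({'frame_id': frame_id, 'bboxes': [(0, 0, 10, 10)]})
    out_q.put(None)  # forward the sentinel downstream

def run_pipeline(frame_ids):
    in_q, out_q = mp.Queue(), mp.Queue()
    worker = mp.Process(target=detector_worker, args=(in_q, out_q))
    worker.start()
    for frame_id in frame_ids:
        in_q.put(frame_id)
    in_q.put(None)  # signal end of stream
    results = list(iter(out_q.get, None))
    worker.join()
    return results
```

Because each stage owns its own process, a slow stage only delays its own queue rather than stalling the whole interpreter.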

After objects are tracked by the object tracking module, pre-processing logic separates the objects for each of the subsequent modules. For example, only detected persons are passed to the trespasser classification module, while detected signals are passed to the traffic light detector. The signal modifier classifier detects the color of the traffic lights and the corresponding number or letter code on the signal.
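This routing step can be sketched as follows; the module and class names are illustrative rather than the exact ones used in the pipeline:

```python
def route_detections(tracked_objects):
    """Split tracked detections so each downstream module only
    receives the object classes it handles."""
    routes = {'trespasser_classifier': [], 'signal_classifier': []}
    for obj in tracked_objects:
        if obj['class'] == 'person':
            routes['trespasser_classifier'].append(obj)
        elif obj['class'] == 'signal':
            routes['signal_classifier'].append(obj)
        # other classes are dropped here rather than burdening
        # downstream modules with filtering
    return routes
```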

As there can be a number of different traffic signals in any given frame (see Figure 2), a heuristic algorithm in the Signal Track Association module associates the correct traffic signal with the train track. The Traffic Signal Algorithm, along with the Business Logic, takes this information, determines whether the signals are flashing and whether any trespassers were detected, and outputs a corresponding rule code that operators can interpret to make further decisions. This information is passed to operators via an MQTT message broker.
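For illustration, one simple way to decide whether a tracked signal is flashing (not necessarily the deployed heuristic) is to count on/off transitions over a sliding window of recent frames:

```python
from collections import deque

class FlashDetector:
    """Flag a signal as flashing if its lit state toggles at least
    `min_toggles` times within the last `window` frames."""
    def __init__(self, window=30, min_toggles=4):
        self.states = deque(maxlen=window)
        self.min_toggles = min_toggles

    def update(self, lit):
        """Record this frame's lit state; return True if flashing."""
        self.states.append(bool(lit))
        pairs = zip(self.states, list(self.states)[1:])
        toggles = sum(1 for prev, cur in pairs if prev != cur)
        return toggles >= self.min_toggles
```

The window length and toggle threshold would be tuned to the frame rate and the flash frequency of the actual signals.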

Figure 5. Software architecture showing the different modules in the system. Left-hand side: each module is a Python class or function. Right-hand side: the design principles used when building the application.

ML Models – Optimization and Benchmarking

Figure 6 highlights the network architecture, resolution, and inference speed for each of the corresponding Python modules in Figure 5. We opted for a ResNet50 backbone for the first stage detector because of the need for an accurate object detector. Subsequent modules used lighter backbones due to GPU memory and end-to-end latency constraints.

The inference speed listed for the FP32 and FP16 precision modes is for the models converted using the TensorRT framework. As some operations are not supported in TensorRT, we used a hybrid TensorFlow-TensorRT model. This allowed the supported operations to be optimized by TensorRT while the unsupported operations ran in TensorFlow. INT8 precision mode was not used due to its significant drop in accuracy and long initialization time.

Figure 6. Inference benchmarks for the ML models that were used in our pipeline (see Figure 5). Using FP16 precision mode provided a drastic reduction in inference time with a minimal impact on accuracy.

Figure 7 shows the FPS of each ML model and the corresponding end-to-end FPS. We achieved an FPS greater than the minimum FPS required by the client despite the large number of deployed models.

Figure 7. Average latency and FPS of each ML model and the corresponding end-to-end FPS.
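Because the modules run as separate processes in a pipeline, end-to-end throughput is bounded by the slowest stage rather than by the sum of all stage latencies. A quick sanity check of this reasoning, using hypothetical per-stage latencies rather than our measured ones:

```python
def pipeline_fps(stage_latencies_ms):
    # Throughput of a fully pipelined system: the slowest stage sets
    # the rate at which finished frames leave the pipeline.
    return 1000.0 / max(stage_latencies_ms)

def sequential_fps(stage_latencies_ms):
    # For comparison: a single-process design pays every stage's
    # latency on every frame.
    return 1000.0 / sum(stage_latencies_ms)
```

With hypothetical stage latencies of 40 ms, 25 ms, and 60 ms, the pipelined design sustains about 16.7 FPS while a sequential one manages only 8 FPS; this gap is why the process-per-module architecture could meet the 10 FPS requirement despite running many models.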

What’s Next?

AI is transforming the freight transportation industry into a more efficient, greener, and safer operation through large-scale computer vision systems deployed onboard on NVIDIA Jetson AGX Xavier devices, taking railways a step closer to becoming autonomous.

We proposed a solution built as a set of Python modules serving various object detection, tracking, and classification machine learning models, trained and then optimized for deployment with the TensorRT framework using strategies such as FP16 precision and layer fusion. We also showed that, despite the challenges of deploying a large number of models on an edge device, we achieved an end-to-end inference speed of more than 10 FPS, within the constraints set out by the client.

To learn more about leveraging NVIDIA's stack of SDKs, see NVIDIA's TAO Toolkit for model training and optimization, the DeepStream SDK for deploying efficient, scalable AI pipelines, and NVIDIA's Metropolis platform for scalable solutions across a variety of industries.

If you’re interested in learning more about this solution, watch Quantiphi’s on-demand session from NVIDIA GTC Spring 2021 or get in touch with our experts.

Written by Muhammad Omar Abid
