
AI Applications • October 13, 2022

Generating Accurate Domain-Specific Audio Transcriptions and Translations Using NVIDIA Riva

Arturo Álvarez, an ardent soccer fan and Real Madrid supporter, is thrilled as his home team takes on Manchester United. His friends gather in his apartment for the match. Everyone’s excited! They switch on the TV and find the commentary is all in English. They can’t get the Spanish audio feed and grudgingly struggle to keep up with the fast-paced English chatter.

After struggling for a minute, Arturo turns on the Spanish Closed Captions available on his OTT subscription. Baffled, they watch the soccer action as it gets translated into Spanish subtitles LIVE. 

“How did you do it?” Arturo is asked. He says, “la magia está en la tecnología” in his thick Venezuelan accent. “The magic is in the technology.”

He is referring to an AI-based closed captioning system that can convert live commentary into translated captions in real time. This quick and in-the-moment translation makes it possible for fans across the world, especially non-English speakers, to enjoy soccer in their native language.

Let’s understand how this works and how broadcasters can expand their viewership beyond linguistic boundaries using AI-generated closed captions. 

Closed captioning is the process of embedding a transcription of the audio into a video file. The terms captioning and subtitling are often used interchangeably, but there is a slight difference: captioning refers to transcribing the audio in its original language, while subtitling translates the audio for viewers who don’t speak or understand the language used in the video.

In broadcasting, captioning is mandated by the Federal Communications Commission (FCC) for audiences who are deaf or hard of hearing. Subtitling, on the other hand, enables broadcasters to make their media fully accessible and more engaging for audiences beyond geographic boundaries. In 2021, the FCC imposed a $3.5 million fine on ViacomCBS and its subsidiary Pluto TV for violating closed captioning rules in their video streaming broadcasts.

Closed captioning can be easily achieved with speech AI, which relies on a process known as Automatic Speech Recognition (ASR). ASR is the capability of a machine to decipher spoken words and phrases from audio and transcribe them into written text.

This post discusses how NVIDIA Riva achieves world-class speech recognition for live captioning and subtitling.

Key Performance Metrics of an ASR Solution

NVIDIA Riva is a speech AI SDK for developing real-time applications such as transcription, virtual assistants, and chatbots. It includes pre-trained, state-of-the-art models on NGC, the TAO Toolkit for fine-tuning models on your domain, and optimized skills for high-performance inference.

For a live captioning application, the speech segment is streamed to the Riva server, and the text result is streamed back in real time, as shown in Fig. 1.

Fig. 1: Real-time closed captioning solution flow with (1) Riva ASR, (2) speaker diarization for speaker identification, and (3) NeMo translation

The input stream received from the live commentary is split into two streams: the ASR stream and the speaker diarization stream. ASR handles the speech-to-text conversion, whereas diarization tags each caption with the commentator who is speaking. The converted text is then translated into the viewer’s native language and included as closed captions in the live video stream being broadcast. The entire transcription cycle takes place almost instantaneously, enabling viewers to enjoy the live action in real time.

A typical Riva ASR pipeline includes a feature extractor that extracts audio features, an acoustic model and a beam search decoder based on n-gram language models for text prediction, and a punctuation model for text readability. 
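For reference, here is a minimal sketch of such a streaming client, assuming the nvidia-riva-client Python package (not shown in this post) and a Riva server reachable at localhost:50051; the audio file name and chunk size are illustrative.

```python
import riva.client

# Connect to a running Riva server (address is an assumption)
auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

streaming_config = riva.client.StreamingRecognitionConfig(
    config=riva.client.RecognitionConfig(
        encoding=riva.client.AudioEncoding.LINEAR_PCM,
        sample_rate_hertz=16000,
        language_code="en-US",
        max_alternatives=1,
        enable_automatic_punctuation=True,
    ),
    interim_results=True,
)

# Stream a local WAV file in small chunks to emulate live commentary
# ("commentary.wav" is a placeholder file name)
with riva.client.AudioChunkFileIterator("commentary.wav", chunk_n_frames=4800) as audio_chunks:
    responses = asr_service.streaming_response_generator(
        audio_chunks=audio_chunks, streaming_config=streaming_config
    )
    for response in responses:
        for result in response.results:
            if result.is_final:
                print(result.alternatives[0].transcript)
```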

What makes NVIDIA Riva beneficial to translators and broadcasters?

One can measure a solution’s performance along four dimensions: accuracy, latency, cost, and cost of customization.

  • Accuracy: The Riva ASR pipeline models are trained on an expanding dataset with thousands of hours of open and real-world data representing media and entertainment, telecom, finance, healthcare, and education. This training enables enterprises to achieve the highest accuracy possible. ASR accuracy is measured by Word Error Rate (WER): the number of errors divided by the total number of spoken words. For example, five errors in 50 spoken words gives a WER of 10% (see the sketch after this list). Our custom solution provides a WER of 5%, while industry-standard tools offer 10 to 15%
  • Latency: To maintain an engaging experience, captions should be delivered at a latency of no more than 2 seconds
  • Cost: Employing a custom solution is inexpensive compared to hiring a team of stenographers and translators. The manual effort is directly proportional to the workload: the more parallel live streams, the bigger the team of interpreters needed
  • Cost of Customization: Riva allows you to fine-tune models on domain-specific datasets and bring your own decoder and punctuation models
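As a concrete illustration of the WER formula above, here is a minimal, self-contained sketch (not part of the Riva toolchain) that computes WER as word-level edit distance divided by the number of reference words:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Example: 1 error in 10 reference words -> WER of 10%
print(word_error_rate(
    "harry kane wins a free kick on the left wing",
    "harry cane wins a free kick on the left wing",
))  # 0.1
```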

Customization for Better Accuracy and Effective Communication

To generate accurate captions, we built a custom model that understands soccer-specific jargon, including proper nouns like Ronaldo and Diego Costa as well as words specific to the sport.

We first generate a custom dataset of sports commentary by scraping football commentary from sports websites such as ESPN-Football. We combine this sports commentary dataset with a subset of the LibriSpeech LM normalized corpus to use as training data. Before training, we clean the data by converting it to lowercase, normalizing it, and removing punctuation. The final dataset size is 308.3 MB.

Sample Data:
stuart armstrong draws a foul in the penalty area
match ends norwich city <num> tottenham hotspur <num>
harry kane tottenham hotspur wins a free kick on the left wing
tony springett norwich city wins a free kick in the attacking half
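Here is a rough sketch of the cleaning step described above (the exact preprocessing script is not shown in this post): lowercasing, replacing digit runs with a `<num>` token, and stripping punctuation produces lines like the samples shown.

```python
import re

def clean_commentary_line(line: str) -> str:
    """Lowercase, replace digit runs with a <num> token, and strip punctuation."""
    line = line.lower().strip()
    line = re.sub(r"\d+", "<num>", line)          # digits -> <num> token
    line = re.sub(r"[^a-z<>\s]", " ", line)        # drop punctuation, keep the <num> token
    return re.sub(r"\s+", " ", line).strip()       # collapse repeated whitespace

print(clean_commentary_line("Match ends, Norwich City 0, Tottenham Hotspur 3."))
# match ends norwich city <num> tottenham hotspur <num>
```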

Fine-tuning the Language Model to Lower Word Error Rate (WER)

Next, we train and fine-tune. Training a model using the TAO Toolkit is as simple as configuring your spec file and running the `train` command. The `train.yaml` spec file is available for users as a reference, and its configurations can easily be overridden using the tao-launcher CLI. For instance, we override the `model.order`, `model.pruning`, and `training_ds.data_file` configurations to suit our needs.

We use the `tao n_gram train` command with the following args:
- `-e`: Path to the spec file
- `-k`: User-specified encryption key to use while saving/loading the model
- `-r`: Path to a folder where the outputs should be written. Make sure this is mapped in `tlt_mounts.json`
- Any overrides to the spec file, e.g. `model.order`

More details about these arguments are available in the TAO Toolkit Getting Started guide.

Note: All file paths correspond to the destination-mounted directory that is visible in the TAO Toolkit docker container used in the backend.

The train command produces three files, `train_n_gram.arpa`, `train_n_gram.vocab`, and `train_n_gram.kenlm_intermediate`, saved at `$RESULTS_DIR/train/checkpoints`.
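As a quick sanity check on the resulting language model, the `.arpa` file can be loaded with the open-source `kenlm` Python package (an assumption; this step is not part of the TAO workflow described here) and used to compare scores of in-domain and out-of-domain sentences.

```python
import kenlm

# Load the ARPA file produced by the train command (path is illustrative)
lm = kenlm.Model("results/train/checkpoints/train_n_gram.arpa")

# In-domain soccer commentary should score higher (less negative) than unrelated text
print(lm.score("harry kane tottenham hotspur wins a free kick on the left wing"))
print(lm.score("the quarterly earnings call is scheduled for next tuesday"))
```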

The export command converts the trained language model from `.arpa` to `.binary`, with the option of quantizing the model binary. We set `export_format` in the spec file to `RIVA` to create a `.riva` file containing the language model binary and its corresponding vocabulary.

The model is exported as `exported-model.binary` which is in a format suited for deployment in Riva.

In the example below, you’ll see how word boosting helps capture domain-specific terms more accurately.

Speech recognition is not as easy as it sounds. Developing it is full of challenges: understanding multiple accents, dialects, and domain-specific jargon, transcribing context accurately, and doing all of this even in noisy environments.

Simple example: What do you hear in this audio clip?

[Audio clip]

To give you context, this clip is from a FIFA match where the commentator says:

“Pulisic, Hudson-Odoi towards Lukaku. Romelu Lukaku makes the breakthrough for Chelsea”.

And here is what the out-of-the-box Riva ASR captured: 

“Pool sick, Hudson or doy to lak roman lea ko makes the breakthrough for Chelsea”. 

This difference underscores the need for customization and domain-specific training.

Quantiphi’s custom, noise-robust ASR solution generates high-quality transcriptions by customizing the Riva ASR pipeline. With Riva ASR, it is possible to fine-tune the acoustic, decoder/language, and punctuation and capitalization models. In addition to these training-level customizations, Riva also offers inference-level customizations such as word boosting. With word boosting, words of interest receive a higher score when decoding the output of the acoustic model, so the ASR pipeline recognizes them more easily (Fig. 3).

Fig. 3: Higher accuracy achieved with word-boosting customization of the ASR pipeline
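Here is a minimal sketch of inference-time word boosting, again assuming the nvidia-riva-client Python package; the boosted words, score, and file path are illustrative.

```python
import riva.client

auth = riva.client.Auth(uri="localhost:50051")  # assumed Riva server address
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_automatic_punctuation=True,
)

# Boost soccer-specific proper nouns so the decoder favors them
riva.client.add_word_boosting_to_config(
    config, boosted_lm_words=["Pulisic", "Hudson-Odoi", "Lukaku"], boosted_lm_score=20.0
)

with open("commentary.wav", "rb") as f:          # placeholder audio file
    response = asr_service.offline_recognize(f.read(), config)
print(response.results[0].alternatives[0].transcript)
```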

Translation for All Industries

Closed captioning enables video content to be consumed by a wider audience, and with Riva’s translation capabilities, it can be distributed to even more geographies. In early access, Riva supports translation between English and five other languages; the evolving list includes English, Spanish, Russian, German, Mandarin, and French.
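As an illustration of how translated captions could be produced with NeMo, the sketch below loads a pre-trained English-to-Spanish machine translation model; the checkpoint name `nmt_en_es_transformer12x2` is an assumption based on NGC’s published NeMo NMT models.

```python
from nemo.collections.nlp.models import MTEncDecModel

# Load a pre-trained English-to-Spanish NMT model (checkpoint name is an assumption)
model = MTEncDecModel.from_pretrained("nmt_en_es_transformer12x2")

captions = ["Romelu Lukaku makes the breakthrough for Chelsea."]
translations = model.translate(captions, source_lang="en", target_lang="es")
print(translations[0])
```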

Beyond Transcription: The Role of Speaker Diarization in Various Industries

Above and beyond accurately transcribing and translating audio, it is also important to know who is speaking each closed caption line. This is achieved with speaker diarization, the process of segmenting audio recordings by speaker label, which aims to answer the question “Who is speaking?” This can be done by leveraging NeMo, an open-source conversational AI toolkit from NVIDIA.
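Here is a minimal sketch of offline speaker diarization with NeMo’s clustering diarizer, assuming the NeMo toolkit; the config and manifest paths are illustrative.

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

# Example inference config shipped with NeMo (path is illustrative)
cfg = OmegaConf.load("diar_infer_telephonic.yaml")
cfg.diarizer.manifest_filepath = "commentary_manifest.json"  # manifest pointing at the audio
cfg.diarizer.out_dir = "diarization_output"

diarizer = ClusteringDiarizer(cfg=cfg)
diarizer.diarize()  # writes RTTM files with "who spoke when" segments per speaker
```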

AI-led closed captioning has potential applications across many industries, where it can substitute for human transcriptionists and translators.

For example, the legal industry can benefit by replacing stenographers and translators in courtrooms and facilitating effective communication in meeting rooms and town halls. The movie industry can benefit by transcribing and translating film, TV, and OTT content for viewers of different languages.

Most importantly, each industry has its own jargon that must be transcribed accurately with a low word error rate and delivered with low latency to be serviceable for its end users.

Get in touch with our experts at Quantiphi to explore AI-first closed captioning opportunities for your organization. We can help you speed up the entire process of developing a customized, accurate, and cost-effective closed captioning system for your business use case on NVIDIA LaunchPad.


Author

Ravi Konkimalla and Divya Varghese
