Transcribing large audio files with wav2vec

Pankaj Gupta

Last week, we introduced the BaseTen model zoo, a set of state-of-the-art models that can be deployed and used in applications in minutes. Sounds great in theory, but how about in practice?

Here, we’ll share how one of these pre-trained model zoo models, wav2vec, was incorporated into a user-facing audio transcription application (currently being used by the content moderation team at a large consumer tech startup!). We’ll also explain how we extended the out-of-the-box functionality of wav2vec to solve the challenge of transcribing large audio files.

Simple audio transcription

The BaseTen model zoo includes a wav2vec speech transcription model: a version of wav2vec 2.0 implemented with the Hugging Face Transformers library, pre-trained using self-supervision, and fine-tuned on transcribed speech. The model runs on a GPU for better performance.

First, we simply deployed the wav2vec model from the BaseTen model zoo.

wav2vec model zoo model

Next, we built a worklet to define the application’s business logic. In fact, when we deployed wav2vec from the model zoo, we were given the option of automatically creating a sample app. Sample apps can be great starting places to build off of. In the case of wav2vec, the sample app consists of a worklet that does two things:

  • Selects a WAV file to transcribe
  • Invokes the wav2vec model to transcribe that WAV file

Finally, we built a simple user-facing view that allows a user to do the following:

  • Input a WAV file as a media URL
  • Click a Transcribe button—this triggers the worklet where the magic happens
  • View the output of the wav2vec model: transcribed audio!

Try transcribing an audio file yourself with this simple demo app.

Audio transcription application

Going beyond simple audio transcription

The application we built using the standard wav2vec model is pretty good at transcription, but has a few issues we can improve on:

  • wav2vec expects audio samples as an array of audio values at a fixed sampling rate, which the user may not have easy access to.
  • The model crashes on large audio files because it exhausts the resources available to it.
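To make the first point concrete, here is a minimal sketch of what "an array of audio values at a fixed sampling rate" looks like. It uses only Python's standard-library wave module; the helper name read_wav_as_floats and the restriction to 16-bit PCM are assumptions for illustration, not part of the original worklet.

```python
import array
import sys
import wave


def read_wav_as_floats(path):
    """Return (samples, sampling_rate) for a 16-bit PCM WAV file.

    Samples are floats in [-1.0, 1.0); multi-channel audio is averaged to mono.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())

    ints = array.array("h")
    ints.frombytes(raw)
    if sys.byteorder == "big":
        ints.byteswap()  # WAV sample data is little-endian

    # Average the interleaved channels into a single mono stream.
    samples = [
        sum(ints[i:i + channels]) / channels / 32768.0
        for i in range(0, len(ints), channels)
    ]
    return samples, rate
```

Note that this reads the whole file at once, which is exactly what becomes a problem for large files.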

Let’s walk through how we resolved these issues with custom code, which is easy to add in BaseTen.

Audio file conversion

We converted the incoming audio file to a WAV file at the sampling rate that the wav2vec model expects. The BaseTen Python environment comes bundled with packages commonly used for machine learning, including ffmpeg, a tool for processing audio and video files.

We simply added a function that calls ffmpeg to the existing worklet:

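import subprocess

MODEL_SAMPLING_RATE = 16000  # Assumed value (not shown in the original); wav2vec 2.0 is pre-trained on 16 kHz audio

def convert_to_wav_and_resample(audio_path, temp_dir_name) -> str:
    """Convert to WAV so blocks can be streamed, resampling to the model's sampling rate."""
    wav_path = f'{temp_dir_name}/audio.wav'
    # check=True raises CalledProcessError if ffmpeg fails, instead of continuing silently
    subprocess.run(['ffmpeg', '-i', audio_path, '-ar', str(MODEL_SAMPLING_RATE), wav_path], check=True)
    return wav_path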

Now the model can take in an MP3 or even an MP4—anything that can be converted to WAV by the ffmpeg package.
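After conversion, it can be worth confirming that the file ffmpeg wrote really has the sampling rate the model expects. The following is a small sketch using the standard-library wave module; the helper name wav_has_sampling_rate and the 16 kHz default are assumptions, not part of the original worklet.

```python
import wave


def wav_has_sampling_rate(wav_path, expected_rate=16000):
    """Return True if the WAV file at wav_path uses expected_rate (in Hz)."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getframerate() == expected_rate
```

A check like this could run right after the conversion step, so that a mismatch fails early rather than producing a poor transcript.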

Handling large audio files

At a high level, here’s how we approached the problem of the application running out of resources and crashing when attempting to transcribe large audio files:

  • Read audio from the file in chunks of around 30 seconds each
  • Transcribe each chunk independently by feeding it to the wav2vec model
  • Stitch all the chunks of transcribed audio back together
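The three steps above can be sketched as a single loop. Here transcribe_chunk is a hypothetical callable standing in for the call to the wav2vec model, and joining with a single space is an assumption about how the transcripts are stitched.

```python
def transcribe_chunks(chunks, transcribe_chunk):
    """Transcribe each chunk independently and join the results in order.

    `chunks` can be any iterable (including a generator, so only one chunk
    needs to be in memory at a time). `transcribe_chunk` is a hypothetical
    stand-in for the model call and must return a string.
    """
    pieces = []
    for chunk in chunks:
        text = transcribe_chunk(chunk).strip()
        if text:  # skip chunks that produced no text, e.g. pure silence
            pieces.append(text)
    return " ".join(pieces)
```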

Reading audio in chunks

We recognized that reading the whole audio file into memory to generate chunks would increase the memory footprint of our Python process and could cause it to run out of memory. To avoid this, we read the file in a streaming manner using librosa. This kept the memory footprint low and avoided memory issues in practice.
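To illustrate the streaming idea, here is a dependency-free sketch. It deliberately swaps librosa for the standard-library wave module, so it is not the code used in the worklet; the function name stream_wav_blocks and the mono 16-bit PCM restriction are assumptions.

```python
import array
import sys
import wave


def stream_wav_blocks(path, block_seconds=1.0):
    """Yield successive blocks of float samples from a mono 16-bit PCM WAV file.

    Only one block is read from disk and held in memory at a time.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch expects mono 16-bit PCM")
        frames_per_block = max(1, int(wav.getframerate() * block_seconds))
        while True:
            raw = wav.readframes(frames_per_block)
            if not raw:
                break  # end of file
            ints = array.array("h")
            ints.frombytes(raw)
            if sys.byteorder == "big":
                ints.byteswap()  # WAV sample data is little-endian
            yield [value / 32768.0 for value in ints]
```

Because this is a generator, peak memory depends on the block size rather than on the length of the recording.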

We also wanted to avoid splitting words into separate chunks because then they would almost certainly get transcribed incorrectly. While there are libraries that help break audio at silence, we found that a simple heuristic worked well. Here’s what we do to detect points of silence between words:

  • Take the logarithm of the audio values. Perceived loudness corresponds roughly to the logarithm of amplitude, which is why audio levels are measured in decibels (dB), a logarithmic unit.
  • Find the minimum value in the window within which we want to break the file into chunks. This way, we’re breaking the clip into chunks at the most silent point, most likely between words.
  • Transcribe each audio chunk independently and then stitch them back together.
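The heuristic above might look roughly like the sketch below. One assumption goes beyond the description: instead of comparing individual samples (which cross zero constantly, even in loud speech), it compares the average level of short frames. The function name quietest_index, the frame length of 160 samples (10 ms at 16 kHz), and the small epsilon are all illustrative choices.

```python
import math


def quietest_index(samples, start, end, frame_len=160):
    """Return the sample index at the center of the quietest frame in samples[start:end].

    Loudness is measured per frame as 20 * log10(mean absolute amplitude), i.e. in dB.
    """
    best_index, best_db = start, math.inf
    for frame_start in range(start, end, frame_len):
        frame = samples[frame_start:min(frame_start + frame_len, end)]
        mean_abs = sum(abs(x) for x in frame) / len(frame)
        level_db = 20.0 * math.log10(mean_abs + 1e-10)  # epsilon avoids log10(0)
        if level_db < best_db:
            best_db = level_db
            best_index = frame_start + len(frame) // 2
    return best_index
```

Because the logarithm is monotonic, the minimum falls on the same frame with or without it; the dB conversion is kept here to mirror the description and to make the levels easy to inspect.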

Once we figured out how to split the audio into chunks, we updated our worklet. We needed to call the wav2vec model in a loop, using the context.invoke_model API, in a single Python node.

Our resulting worklet has one (big!) node that does the following:

  • Takes in an audio file URL
  • Identifies 30-second chunks in the audio file, splitting based on silence
  • Runs each chunk through the wav2vec model
  • Returns a transcript of each 30-second chunk
  • Concatenates the returned transcripts to form the full transcript

Audio transcription worklet
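One detail the list glosses over is that cuts rarely line up with the blocks read from the file, so leftover audio has to carry over into the next chunk. Here is a minimal sketch of that regrouping, under stated assumptions: regroup_into_chunks and find_cut are hypothetical names, lengths are in samples (30 seconds at 16 kHz would be 480000), and find_cut is any function that returns a cut index inside the search window. The actual worklet code and the signature of context.invoke_model are not shown in this post.

```python
def regroup_into_chunks(blocks, chunk_len, window_len, find_cut=None):
    """Regroup streamed blocks into chunks of at most chunk_len samples.

    Each cut is chosen by find_cut(buffer, window_start, window_end), which
    should return an index inside the last window_len samples of the chunk.
    With find_cut=None the audio is cut at exactly chunk_len samples.
    Audio after the cut carries over into the next chunk.
    """
    buffer = []
    for block in blocks:
        buffer.extend(block)
        while len(buffer) >= chunk_len:
            if find_cut is None:
                cut = chunk_len
            else:
                cut = find_cut(buffer, chunk_len - window_len, chunk_len)
            cut = min(max(cut, 1), chunk_len)  # always make progress, never overshoot
            yield buffer[:cut]
            buffer = buffer[cut:]
    if buffer:
        yield buffer  # whatever remains at the end of the file
```

Each yielded chunk would then be passed to the model and the returned transcripts concatenated in order.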

Because transcribing a large audio file could take more than 10 minutes, we marked the worklet for background execution. Background execution ensures we don’t run into any timeout issues, even if transcription takes many hours. It also ensures that the worklet is reliably executed even if the Python process, the Python container, or the Kubernetes pod where the process is running dies for any reason.

Let us know what you think

Try out this simple audio transcription app yourself. And if you’re ready to start building on BaseTen, join our waitlist here.