Transcribing Phone Calls in Real-time with Twilio and Vosk

This week I published a tutorial on the Twilio blog showing how you can transcribe phone calls in real time in ASP.NET Core, using an open-source speech recognition engine called Vosk.

It was a really fun project to put together, and allowed me to make use of NAudio to convert the incoming mu-law audio into linear PCM resampled to 16kHz. I kept the audio signal chain description fairly succinct in the article, but for those interested in a slightly longer explanation, here's a bit more detail on how I created the AudioConverter class.

It was a slightly unconventional signal path for NAudio: we needed to receive audio in a byte[] and get it out as a short[], but to do the resampling we needed to go to floating point along the way (as resampling involves some filtering).

The signal chain I came up with starts with a BufferedWaveProvider, because that allows us to put blocks of samples in as they arrive. It then uses ToSampleProvider to convert into an NAudio ISampleProvider, which works with floating point samples. This means it can be passed into the WdlResamplingSampleProvider, which I chose because it is a fully managed implementation and so can be used cross-platform.

But then we needed to get back to 16 bit samples. I did that with the ToWaveProvider extension method, which gets us back to an IWaveProvider, and then wrapped that in a WaveFloatTo16Provider. We still weren't done, as IWaveProvider writes into a byte[] and we wanted a short[]. Fortunately NAudio has the extremely helpful WaveBuffer utility, which allows us to effectively do a "reinterpret" cast between arrays of different types.
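To make the "reinterpret" idea concrete, here's a minimal sketch that views a byte[] as 16 bit samples without copying. It uses MemoryMarshal.Cast from the .NET standard library rather than NAudio, so it is not how WaveBuffer itself is implemented; the ReinterpretDemo class and its method names are made up purely for illustration.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ReinterpretDemo
{
    // Returns a view of the same memory as 16 bit samples (no copy is made).
    public static Span<short> AsShorts(byte[] bytes) =>
        MemoryMarshal.Cast<byte, short>(bytes.AsSpan());

    // Writes through the short view and checks the bytes underneath changed.
    public static bool WritesAreShared()
    {
        var bytes = new byte[4];
        var shorts = AsShorts(bytes);
        shorts[0] = 0x0102;
        // Byte order depends on the machine's endianness.
        return BitConverter.IsLittleEndian
            ? bytes[0] == 0x02 && bytes[1] == 0x01
            : bytes[0] == 0x01 && bytes[1] == 0x02;
    }
}
```

The point is the same as with WaveBuffer: whatever is written as bytes can be read back as shorts (and vice versa), with half as many elements in the short view.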

Here's the constructor for the AudioConverter (which we create for each call we want to transcribe). This sets up the signal chain I just described.

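using NAudio.Wave;                  // BufferedWaveProvider, WaveBuffer, WaveFloatTo16Provider
using NAudio.Wave.SampleProviders;  // WdlResamplingSampleProvider

private readonly BufferedWaveProvider bufferedWaveProvider;
private readonly IWaveProvider outputProvider;
private readonly byte[] outputBuffer;
private readonly WaveBuffer outputWaveBuffer;

public AudioConverter()
{
    // input end of the chain: 8kHz mono (WaveFormat defaults to 16 bit)
    bufferedWaveProvider = new BufferedWaveProvider(new WaveFormat(8000, 1));
    // convert to floating point samples and resample to 16kHz
    var resampler = new WdlResamplingSampleProvider(
        bufferedWaveProvider.ToSampleProvider(), 16000);
    // back to 16 bit PCM, delivered in a byte[]
    outputProvider = new WaveFloatTo16Provider(resampler.ToWaveProvider());
    outputBuffer = new byte[16000*2]; // one second of audio should be plenty
    // gives us a short[] view over the same memory as outputBuffer
    outputWaveBuffer = new WaveBuffer(outputBuffer);
}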

With this signal chain in place, every time a buffer of audio comes in, we first use the MuLawDecoder to convert from mu-law to linear 16 bit PCM, and then put the block of samples into the front of our signal chain with the AddSamples method on the BufferedWaveProvider.
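NAudio's MuLawDecoder does the decoding for us, but for the curious, here's a rough sketch of what expanding a single mu-law byte involves. This is the standard G.711 expansion written out arithmetically, not NAudio's own code, and the MuLawSketch class name is just for illustration.

```csharp
using System;

public static class MuLawSketch
{
    private const int Bias = 0x84; // 132, the offset added during mu-law encoding

    // Decodes one 8 bit mu-law byte to a 16 bit linear PCM sample.
    public static short Decode(byte muLaw)
    {
        int u = ~muLaw & 0xFF;           // mu-law bytes are transmitted bit-inverted
        int exponent = (u >> 4) & 0x07;  // 3 bit segment number
        int mantissa = u & 0x0F;         // 4 bit step within the segment
        int magnitude = (((mantissa << 3) + Bias) << exponent) - Bias;
        // the top bit (after inversion) is the sign
        return (short)((u & 0x80) != 0 ? -magnitude : magnitude);
    }
}
```

With this sketch, 0xFF and 0x7F both decode to silence (0), while 0x00 and 0x80 decode to the extremes of -32124 and 32124.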

Because I am upsampling from 8kHz to 16kHz, I know that for every sample that goes in I can expect two samples out. So we just read that number of samples out of the end of our signal chain, and take advantage of the WaveBuffer to return the data as a short[], which is what Vosk is expecting.
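The byte counts are easy to get wrong, so here's the arithmetic spelled out as a small sketch. The BufferMath helper is hypothetical and not part of the AudioConverter class; it just shows where the factor of four comes from.

```csharp
public static class BufferMath
{
    // Bytes of 16 bit PCM at 16kHz produced by a block of 8kHz mu-law bytes.
    public static int OutputBytes(int muLawBytes)
    {
        int samplesIn = muLawBytes;        // mu-law is one byte per sample
        int samplesOut = samplesIn * 2;    // 8kHz to 16kHz doubles the sample count
        return samplesOut * sizeof(short); // two bytes per 16 bit sample
    }
}
```

For example, a 160 byte block (20ms of 8kHz mu-law) becomes 640 bytes, or 320 shorts. It also means a 32000 byte output buffer can cope with incoming blocks of up to 8000 bytes, which is one second of audio.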

Here's the ConvertBuffer method, which converts each incoming block of audio from the Twilio Media Streams API.

public (short[],int) ConvertBuffer(byte[] input)
{
    var samples = input.Length; // mu-law is one byte per sample

    // mu-law at 8kHz to 16 bit linear PCM at 8kHz
    // (MuLawDecoder is in the NAudio.Codecs namespace)
    for (int i = 0; i < input.Length; i++)
    {
        outputWaveBuffer.ShortBuffer[i] =
            MuLawDecoder.MuLawToLinearSample(input[i]);
    }

    // each decoded sample takes two bytes
    bufferedWaveProvider.AddSamples(outputWaveBuffer.ByteBuffer, 0, samples*2);
    var convertedBytes = samples * 4; // to PCM and to 16kHz
    var outRead = outputProvider.Read(outputBuffer, 0, convertedBytes);
    // outRead is a byte count, so halve it to get the number of shorts
    return (outputWaveBuffer.ShortBuffer, outRead / 2);
}

I didn't do extensive testing, but I was impressed with the accuracy of the transcription from the smallest Vosk model. It was also very straightforward to configure Twilio to call a webhook for each incoming phone call, allowing us to subscribe to the audio stream.

The source code for the entire project is available here on GitHub, and of course check out the blog post for a detailed explanation of how it all works.