Transcribing Phone Calls in Real-time with Twilio and Vosk
This week I published a tutorial on the Twilio blog showing how to transcribe phone calls in real time in ASP.NET Core, using an open-source speech recognition engine called Vosk.
It was a really fun project to put together, and it let me use NAudio to convert the incoming mu-law audio into linear PCM resampled to 16kHz. I kept the description of the audio signal chain fairly succinct in the article, but for those interested in a slightly longer explanation, here's a bit more detail on how I created the AudioConverter class.
It was a slightly unconventional signal path for NAudio, as we needed to receive audio in a byte[] and get it out as a short[], but to do the resampling we needed to go to floating point in between (as resampling involves some filtering).
The signal chain I came up with starts with a BufferedWaveProvider, because that allows us to put blocks of samples in as they arrive. It then uses ToSampleProvider to convert to an NAudio ISampleProvider, which uses floating point samples. This means it can be passed into the WdlResamplingSampleProvider, which I chose because it is a fully managed implementation and so can be used cross-platform.
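To give a feel for what doubling the sample rate on floating point samples involves, here is a minimal sketch. It uses plain linear interpolation, which is a deliberately simplified stand-in and not the filtered algorithm that WdlResamplingSampleProvider implements. The NaiveUpsampler class and its UpsampleByTwo method are hypothetical names used only for illustration.

```csharp
using System;

public static class NaiveUpsampler
{
    // Doubles the sample rate by inserting the midpoint between each pair
    // of neighbouring samples. This is plain linear interpolation, not the
    // filtered resampling that a production resampler performs.
    public static float[] UpsampleByTwo(float[] input)
    {
        var output = new float[input.Length * 2];
        for (int i = 0; i < input.Length; i++)
        {
            float current = input[i];
            // Repeat the last sample when there is no following neighbour.
            float next = i + 1 < input.Length ? input[i + 1] : current;
            output[2 * i] = current;
            output[2 * i + 1] = (current + next) / 2f;
        }
        return output;
    }
}
```

Working in floating point keeps the interpolated values free of integer rounding until the final conversion back to 16 bit. A real resampler also applies a low-pass filter, which is why a dedicated implementation is the better choice in practice.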
But then we needed to get back to 16 bit samples. I did that with the ToWaveProvider extension method, which gets us back to an IWaveProvider, and then wrapped that in a WaveFloatTo16Provider. We still weren't done, as IWaveProvider writes into a byte[], and we wanted a short[]. Fortunately NAudio has the extremely helpful WaveBuffer utility, which allows us to effectively do a "reinterpret" cast between arrays of different types.
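The two ideas in this step, scaling a float sample down to 16 bit and viewing the same memory as a different element type, can be sketched without NAudio. This is only an illustration under stated assumptions: it uses MemoryMarshal.Cast from the .NET base class library as a stand-in for WaveBuffer, and SampleViews, FloatToInt16 and AsShorts are hypothetical names.

```csharp
using System;
using System.Runtime.InteropServices;

public static class SampleViews
{
    // Clamp a float sample to [-1, 1] and scale it to a 16 bit integer.
    public static short FloatToInt16(float sample)
    {
        float clamped = Math.Clamp(sample, -1f, 1f);
        return (short)(clamped * 32767f);
    }

    // View a byte[] as 16 bit samples without copying any data.
    // Writes through the returned span land in the original byte[].
    public static Span<short> AsShorts(byte[] buffer)
    {
        return MemoryMarshal.Cast<byte, short>(buffer.AsSpan());
    }
}
```

The span returned by AsShorts has half as many elements as the byte[] has bytes, and the two share storage, which is the same effect the ByteBuffer and ShortBuffer properties of WaveBuffer give us.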
Here's the constructor for the AudioConverter (which we create for each call we want to transcribe). This sets up the signal chain I just described.
using NAudio.Codecs;
using NAudio.Wave;
using NAudio.Wave.SampleProviders;

private readonly BufferedWaveProvider bufferedWaveProvider;
private readonly IWaveProvider outputProvider;
private readonly byte[] outputBuffer;
private readonly WaveBuffer outputWaveBuffer;

public AudioConverter()
{
    // input: 8kHz, mono, 16 bit PCM (after mu-law decoding)
    bufferedWaveProvider = new BufferedWaveProvider(new WaveFormat(8000, 1));
    // resample to 16kHz in floating point
    var resampler = new WdlResamplingSampleProvider(
        bufferedWaveProvider.ToSampleProvider(), 16000);
    // back to 16 bit PCM
    outputProvider = new WaveFloatTo16Provider(resampler.ToWaveProvider());
    outputBuffer = new byte[16000*2]; // one second of audio should be plenty
    // lets us view outputBuffer as either byte[] or short[]
    outputWaveBuffer = new WaveBuffer(outputBuffer);
}
With this signal chain in place, every time a buffer of audio comes in, we first use the MuLawDecoder to convert from mu-law to linear 16 bit PCM, and then put the block of samples into the front of our signal chain with the AddSamples method on the BufferedWaveProvider.
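For anyone curious what mu-law decoding involves, here is a minimal sketch of the standard G.711 mu-law expansion written out by hand. It is not NAudio's source code, and the MuLawSketch class and its Decode method are hypothetical names used only for illustration.

```csharp
using System;

public static class MuLawSketch
{
    private const int Bias = 0x84; // 132, the bias added during encoding

    // Expand one 8 bit mu-law byte into a 16 bit linear PCM sample.
    public static short Decode(byte muLaw)
    {
        // mu-law bytes are stored with all bits inverted
        int value = ~muLaw & 0xFF;
        int sign = value & 0x80;
        int exponent = (value >> 4) & 0x07;
        int mantissa = value & 0x0F;

        // Rebuild the magnitude from the segment (exponent) and step (mantissa)
        int magnitude = (((mantissa << 3) + Bias) << exponent) - Bias;
        return (short)(sign != 0 ? -magnitude : magnitude);
    }
}
```

Each input byte becomes one 16 bit sample, so a decoded block takes twice as many bytes as the mu-law block it came from.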
Then, because we are upsampling from 8kHz to 16kHz, we expect two samples out for every sample that goes in, so we just read that number of samples out of the end of our signal chain, and take advantage of the WaveBuffer to return the data as a short[], which is what Vosk is expecting.
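The byte counts involved are easy to get wrong, so here is a small sketch of the arithmetic. ChunkMath and its members are hypothetical names, and the 160 byte block mentioned below is just an illustrative size (20 ms of audio at 8kHz), not something taken from the project.

```csharp
public static class ChunkMath
{
    public const int InputRate = 8000;      // mu-law sample rate in Hz
    public const int OutputRate = 16000;    // sample rate after resampling
    public const int BytesPerPcmSample = 2; // 16 bit linear PCM
    private const int Ratio = OutputRate / InputRate;

    // Bytes of 16kHz 16 bit PCM expected from a block of 8kHz mu-law bytes.
    public static int OutputBytes(int muLawBytes)
    {
        int inputSamples = muLawBytes; // mu-law uses one byte per sample
        int outputSamples = inputSamples * Ratio;
        return outputSamples * BytesPerPcmSample;
    }

    // Largest mu-law block whose converted output fits in the given buffer.
    public static int MaxInputBytes(int outputBufferBytes)
    {
        return outputBufferBytes / BytesPerPcmSample / Ratio;
    }
}
```

For example, a 160 byte mu-law block would produce 640 bytes of output, and a 32000 byte output buffer can hold the result of converting at most 8000 mu-law bytes, which is one second of audio.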
Here's the ConvertBuffer method, which converts each incoming block of audio from the Twilio media streams API.
public (short[], int) ConvertBuffer(byte[] input)
{
    var samples = input.Length; // mu-law is one byte per sample
    // 8kHz mu-law to 8kHz 16 bit linear PCM
    for (int i = 0; i < input.Length; i++)
    {
        outputWaveBuffer.ShortBuffer[i] =
            MuLawDecoder.MuLawToLinearSample(input[i]);
    }
    // two bytes per decoded sample go into the front of the signal chain
    bufferedWaveProvider.AddSamples(outputWaveBuffer.ByteBuffer, 0, samples*2);
    // x2 for 16 bit PCM, x2 again for upsampling to 16kHz
    var convertedBytes = samples * 4;
    var outRead = outputProvider.Read(outputBuffer, 0, convertedBytes);
    // outRead is in bytes, so halve it to get the number of 16 bit samples
    return (outputWaveBuffer.ShortBuffer, outRead / 2);
}
I didn't do extensive testing, but I was impressed with the accuracy of the transcription from the smallest Vosk model. It was also very straightforward to configure Twilio to call a webhook for each incoming phone call, allowing us to subscribe to the audio stream.
The source code for the entire project is available here on GitHub, and of course check out the blog post for a detailed explanation of how it all works.