As I mentioned in my previous blog post, I’m currently working on text-to-speech models. I’m taking the “scale-it-to-the-moon” approach, so I need a lot of data.
Fortunately, speech data is pretty easy to come by. Audio books, podcasts, YouTube and large archives of speeches and presentations are available all over the internet. The problem is that this audio generally isn’t transcribed.
As you may know from interacting with Alexa, her spin-offs, or your smartphone assistant, digital speech recognition software is pretty damned good these days. My goal was to leverage this to build a huge dataset of artificially transcribed audio clips.
Searching for good speech recognition
I was amazed at how hard it is to find something decent, though. I started by looking into Google’s Speech-to-Text offering, but I am currently sitting on ~30,000 hours of audio that needs transcription, and transcribing all of it would cost about $40,000 at current pricing. No thanks.
So let’s go open source and try to do this on my own hardware. Mozilla’s DeepSpeech pops up near the top of the search results, but its underlying architecture is eight years old at this point, it’s built on TensorFlow (which is a PITA to work with), and the project hasn’t been actively maintained for more than a year.
PaddleSpeech is another option: it offers more recent models and is actively maintained. However, it is far from “easy” to extend, despite its claims, it has a dependency list a mile long, and it strikes me as being targeted primarily at a Chinese-speaking audience.
I actually spent a decent amount of time learning how to train my own speech recognition model before I stumbled upon HuggingFace’s wav2vec2.0 implementation. This model is packaged in the transformers library, and Facebook provides pretrained weights fine-tuned for ASR that achieve as low as a 2% word error rate (WER), meaning the model gets only 2% of the words it transcribes wrong. That’s near-human performance.
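To make that concrete, here’s a minimal sketch of transcribing a single clip with the transformers wav2vec2 classes. The checkpoint is one of Facebook’s public fine-tuned ASR models; the file path is just a placeholder:

```python
# Minimal sketch: single-clip transcription with transformers' wav2vec2.
# "clip.wav" is a placeholder path.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL = "facebook/wav2vec2-large-960h-lv60-self"  # public fine-tuned ASR checkpoint
processor = Wav2Vec2Processor.from_pretrained(MODEL)
model = Wav2Vec2ForCTC.from_pretrained(MODEL).eval()

# wav2vec2 expects 16 kHz mono audio.
wav, sr = torchaudio.load("clip.wav")
wav = torchaudio.functional.resample(wav.mean(dim=0), sr, 16000)

inputs = processor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy decode: pick the most likely token per frame, then collapse.
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])
```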
After tinkering around a bit, I knew this was the right way to go.
Building a batch transcriber
While Patrick von Platen (the Hugging Face contributor behind wav2vec2) has done most of the heavy lifting for us, we still need to do some coding to actually make use of this work.
To do so, I built ocotillo, a GitHub repo focused on extremely simple speech recognition. There are no bells or whistles here; this repo does one thing: speech transcription. I provide a script that will batch-transcribe an entire folder of audio files on the GPU, and an API in case you want to do your own data loading or use multiple GPUs (or the CPU). I also provide a Colab notebook which you can use to transcribe your own voice with wav2vec2 via ocotillo’s API.
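This isn’t ocotillo’s actual interface (see the repo for that), but the batch-transcription pattern it wraps looks roughly like the following sketch; the folder path and batch size are arbitrary placeholders:

```python
# Rough sketch of batch-transcribing a folder of audio files on the GPU.
# Not ocotillo's API -- just the underlying transformers pattern.
from pathlib import Path
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL = "facebook/wav2vec2-large-960h-lv60-self"
BATCH_SIZE = 8  # arbitrary; tune to your VRAM
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Wav2Vec2Processor.from_pretrained(MODEL)
model = Wav2Vec2ForCTC.from_pretrained(MODEL).to(device).eval()

def load_16k_mono(path):
    """Load an audio file, downmix to mono, resample to 16 kHz."""
    wav, sr = torchaudio.load(str(path))
    return torchaudio.functional.resample(wav.mean(dim=0), sr, 16000).numpy()

clips = sorted(Path("my_audio_folder").glob("*.wav"))  # placeholder folder
for start in range(0, len(clips), BATCH_SIZE):
    chunk = clips[start:start + BATCH_SIZE]
    # Pad clips in the batch to a common length; the processor builds the
    # attention mask so padding doesn't pollute the transcription.
    inputs = processor([load_16k_mono(p) for p in chunk],
                       sampling_rate=16000, return_tensors="pt",
                       padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    texts = processor.batch_decode(torch.argmax(logits, dim=-1))
    for path, text in zip(chunk, texts):
        print(f"{path.name}: {text}")
```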
Please see the readme in the GitHub repo for more info.
Results
So how did this quest for mass transcription turn out? Better than I could have hoped for, actually. On 3 RTX 3090 GPUs, I was able to transcribe my entire dataset in under 4 days. Back of the envelope, that’s more than 100x realtime per GPU. There’s your $40k….
Fun additional fact: wav2vec2 (and ocotillo) use CTC for transcription, which lends itself exceptionally well to text<->voice alignment. If you’re in the TTS world, you know that alignment is the single hardest problem in text-to-speech. Thanks to wav2vec2 & ocotillo, I now have alignment data for 30,000 hours of speech. I’m really excited to play with that. (Possibly) more to come…
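To illustrate why CTC makes alignment nearly free, here’s a hypothetical sketch that pulls rough character-level timestamps out of a greedy CTC decode. The ~20 ms frame stride, the file path, and the model choice are assumptions for illustration; this is not ocotillo’s alignment code:

```python
# Hypothetical sketch: rough character timestamps via greedy CTC alignment.
# Each wav2vec2 output frame covers roughly 20 ms of 16 kHz audio.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL = "facebook/wav2vec2-large-960h-lv60-self"
processor = Wav2Vec2Processor.from_pretrained(MODEL)
model = Wav2Vec2ForCTC.from_pretrained(MODEL).eval()

wav, sr = torchaudio.load("clip.wav")  # placeholder path
wav = torchaudio.functional.resample(wav.mean(dim=0), sr, 16000)
inputs = processor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

FRAME_SEC = 0.02  # assumed stride of the wav2vec2 feature encoder
blank_id = processor.tokenizer.pad_token_id  # the pad token doubles as CTC blank
ids = torch.argmax(logits, dim=-1)[0].tolist()

# CTC emits a token the moment a character starts; blanks and repeats fill
# the rest, so new non-blank tokens give us character onset times.
timestamps, prev = [], blank_id
for frame, tok in enumerate(ids):
    if tok != blank_id and tok != prev:
        char = processor.tokenizer.convert_ids_to_tokens(tok)  # "|" = word boundary
        timestamps.append((round(frame * FRAME_SEC, 2), char))
    prev = tok
print(timestamps[:10])  # e.g. [(0.38, 'H'), (0.46, 'E'), ...]
```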