I’ve updated the tortoise-tts repo with a script that automatically downloads the model weights (thanks to the HuggingFace Hub for hosting them!). I’ve also created a Colab notebook if you want to try this out on Google hardware. Make sure you pick a GPU runtime.
Sample outputs can be found in the results/ folder of the GitHub repo. Find some handpicked generations below.
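For the curious, here is a rough sketch of what "download the weights if they are missing" can look like. This is not the actual script from the repo: the function names (`hub_url`, `ensure_weights`), the cache directory, and any repo or file names you pass in are hypothetical placeholders. The only thing taken as given is the Hub's public `resolve` URL layout.

```python
import os
import urllib.request

# Public file URL layout on the HuggingFace Hub.
HF_URL_TEMPLATE = "https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def hub_url(repo_id, filename, revision="main"):
    """Build the direct download URL for one file in a Hub repo."""
    return HF_URL_TEMPLATE.format(
        repo_id=repo_id, revision=revision, filename=filename
    )


def ensure_weights(repo_id, filenames, cache_dir, fetch=urllib.request.urlretrieve):
    """Download each file into cache_dir unless it is already there.

    `fetch` is any callable taking (url, destination_path); it defaults to
    urllib's urlretrieve and is injectable so the logic can be tested offline.
    Returns a dict mapping each filename to its local path.
    """
    os.makedirs(cache_dir, exist_ok=True)
    paths = {}
    for name in filenames:
        dest = os.path.join(cache_dir, name)
        if not os.path.exists(dest):
            # Only hit the network for files that are not cached yet.
            fetch(hub_url(repo_id, name), dest)
        paths[name] = dest
    return paths
```

You would call something like `ensure_weights("some-user/some-model", ["model.pth"], ".models")` once at startup (both names made up here) and then load the checkpoints from the returned paths. Caching on disk means a second run starts immediately instead of re-downloading everything.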
I’m not done with this project. It is clear to me that the autoregressive model does an extremely good job at producing realistic prosody. I will be making a few tweaks to make it less sensitive to the conditioning clips you provide, which should improve results for all voices.
I also believe serious improvements can be made by tweaking the diffusion model, which is responsible for the subpar audio quality. That model will need to be completely retrained, which will take a few weeks (and probably a month of testing before I make that investment).
Keep checking back. This is cool.