For my next project, I want to play around in the music generation space. I think it’ll be interesting to apply some of the lessons learned building Tortoise to music. The first step is building the musical equivalent of a vocoder: a model that will transform a MEL spectrogram to waveform data. That way the…
Author: jbetker
TorToiSe Architectural Design Doc
Overview TorToiSe is a text-to-speech (TTS) program which can mimic voices given 2-4 examples. It is composed of five separately-trained neural networks that are pipelined together to produce the final output. This document will first go into detail about each of the five models that make up Tortoise, and will wrap up with a system-level…
Surrogate Losses for Diffusion Models
As I covered in my last post, I’m currently working on improving the quality of the diffusion model used to rebuild discretized audio signals for tortoise-tts. Since realizing that the diffusion model can work entirely with spectrograms, I have been re-structuring the model to be a flat transformer/resnet hybrid. One nifty thing about this set-up is…
Improving Diffusion Models for TTS
I’ve spent the majority of the last two months working on improving the diffusion model in Tortoise TTS. The model used in v1 had a few major shortcomings: Conditioning inputs were bottlenecked to a very small dimensional input into the main model, limiting their effectiveness. The model was trained on audio signals at 11kHz. To…
Tortoise TTS Update
I’ve updated the tortoise-tts repo with a script that automatically downloads model weights (thanks to the HuggingFace Hub for hosting them!). I’ve also created a colab notebook if you want to try this out on Google hardware. Make sure you pick a GPU runtime. Sample outputs can be found in the results/ folder of the…
DALL-E for TTS: TortoiseTTS
In an earlier post, I walked you through a project I’ve been working on, which I called “triforce” at the time. I’ve finished training a first pass on this collection of models and want to write about the results. Deploying this speech CLIP model on the outputs of my autoregressive speech token generator made all…
Batch speech transcription with ocotillo
As I mentioned in my previous blog post, I’m currently working on text-to-speech models. I’m taking the “scale-it-to-the-moon” approach, so I need a lot of data. Fortunately, speech data is pretty easy to come by. Audio books, podcasts, YouTube and large archives of speeches and presentations are available all over the internet. The problem is…
Triforce: A general recipe for kickass Generative Models
For the past two years, I’ve been tinkering around with generative models in my spare time. I think I’ve landed on an approach that produces by far the most compelling results available today, and which scales like big language models. I’d like to outline the approach here. First of all, I want to touch on…
Switched Convolutions – Spatial MoE for Convolutions
Abstract: I present switched convolutions: a method for scaling the parameter count of convolutions by learning a mapping across the spatial dimension that selects the convolutional kernel to be used at each location. I show how this method can be implemented in a way that has only a…
SRGANs and Batch Size
Batch size is one of the oldest hyperparameters in SGD, but it doesn’t get enough attention for super-resolution GANs. The problem starts with the fact that most SR algorithms are notorious GPU memory hogs. This is because they generally operate on high-dimensional images at high convolutional filter counts. To put this in context, the…