For my next project, I want to play around in the music generation space. I think it’ll be interesting to apply some of the lessons learned building Tortoise to music. The first step is building the musical equivalent of a vocoder: a model that will transform a MEL spectrogram to waveform data. That way the main generator(s) can work in highly reduced spectrogram space, just like Tortoise. I could just train a new Univnet model. That probably would have been the wisest choice. However, I don’t really like training GANs and I have no experience training Univnet. Finally, I really…
Overview TorToiSe is a text-to-speech (TTS) program which can mimic voices given 2-4 examples. It is composed of five separately-trained neural networks that are pipelined together to produce the final output. This document will first go into detail about each of the five models that make up Tortoise, and will wrap up with a system-level description of how they interoperate. The Autoregressive Decoder Reference Clips A list of reference clips is also provided to the model. The model uses these clips to figure out how to properly mimic the voice, intonation, prosody, etc. of the speech it is expected to…
As I covered in my last post, I’m currently working on improving the quality of the diffusion model used to rebuild discretized audio signals for tortoise-tts. Since realizing that the diffusion model can work entirely with spectrograms, I have been re-structuring the model to be a flat transformer/resnet hybrid. One nifty thing about this set-up is that I can now concatenate the diffusion inputs with the conditioning signal from the discretized audio signal and feed the whole thing into the model. This is the same thing that multiple authors working on diffusion models have done with the low-resolution inputs for super-resolution…
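To make the mechanics concrete, here is a minimal sketch of that concatenation; the tensor names and shapes are hypothetical choices of my own, not taken from the tortoise-tts code. The noised spectrogram being denoised and the time-aligned conditioning signal are simply stacked on the channel axis before entering the network.

```python
import torch

# Hypothetical names/shapes for illustration only (not the tortoise-tts code).
def build_diffusion_input(noised_mel, aligned_conditioning):
    # noised_mel:           (batch, mel_channels, time)  - spectrogram being denoised
    # aligned_conditioning: (batch, cond_channels, time) - same time resolution
    return torch.cat([noised_mel, aligned_conditioning], dim=1)

x_t = torch.randn(2, 100, 400)    # noised spectrogram at diffusion step t
cond = torch.randn(2, 512, 400)   # conditioning projected to the same length
print(build_diffusion_input(x_t, cond).shape)  # torch.Size([2, 612, 400])
```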
I’ve spent the majority of the last two months working on improving the diffusion model in Tortoise TTS. The model used in v1 had a few major shortcomings:

- Conditioning inputs were bottlenecked to a very small dimensional input into the main model, limiting their effectiveness.
- The model was trained on audio signals at 11kHz. To make this feasible, I needed to chop up the signals into small clips, which have limited context.
- The model itself was relatively shallow: the top layers only had a depth of 1, and model channels at the top levels were very restrictive.
- No processing…
I’ve updated the tortoise-tts repo with a script that automatically downloads model weights (thanks to the HuggingFace Hub for hosting them!). I’ve also created a colab notebook if you want to try this out on Google hardware. Make sure you pick a GPU runtime. Sample outputs can be found in the results/ folder of the GitHub repo. Find some handpicked generations below. I’m not done with this project. It is clear to me that the autoregressive model does an extremely good job at producing realistic prosody. I will be making a few tweaks to make it less sensitive to the…
In an earlier post, I walked you through a project I’ve been working on, which I called “triforce” at the time. I’ve finished training a first pass on this collection of models and want to write about the results. Deploying this speech CLIP model on the outputs of my autoregressive speech token generator made all of the difference. Outputs are consistently awesome, and almost always clearly convey the desired speech. Adding CLIP to the ensemble After training the three triforce models, I was having considerable difficulty with the autoregressive portion of the model. Specifically, while I would generate a lot…
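For illustration, here is a hedged sketch of what such a re-ranking step can look like. The `clip_model` callable, token shapes, and candidate counts below are stand-ins of my own, not the actual triforce/Tortoise APIs: the idea is simply to sample many candidate speech-token sequences for the same text, score each pair with the contrastive model, and keep the highest-scoring one.

```python
import torch

# Hypothetical re-ranking loop: `clip_model` stands in for the trained
# text<->speech CLIP, returning a similarity score for a (text, speech) pair.
def pick_best_candidate(clip_model, text_tokens, candidates):
    # candidates: list of 1-D tensors of speech tokens sampled from the
    # autoregressive model for the same text.
    scores = torch.stack([clip_model(text_tokens, c) for c in candidates])
    best = int(torch.argmax(scores))
    return candidates[best], scores[best]

# Usage with a stand-in scorer, just to show the shape of the loop:
dummy_clip = lambda text, speech: torch.randn(())
text = torch.randint(0, 256, (32,))
candidates = [torch.randint(0, 8192, (220,)) for _ in range(16)]
best_tokens, best_score = pick_best_candidate(dummy_clip, text, candidates)
```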
As I mentioned in my previous blog post, I’m currently working on text-to-speech models. I’m taking the “scale-it-to-the-moon” approach, so I need a lot of data. Fortunately, speech data is pretty easy to come by. Audio books, podcasts, YouTube and large archives of speeches and presentations are available all over the internet. The problem is that this audio generally isn’t transcribed. As you may know from interacting with Alexa, her spin-offs, or your smartphone assistant, digital speech recognition software is pretty damned good these days. My goal was to leverage this to build a huge dataset of artificially transcribed audio…
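As a rough illustration of what such an auto-transcription pipeline looks like: the post doesn’t name the recognizer it used, so the wav2vec2 checkpoint below is just a convenient off-the-shelf stand-in from the HuggingFace hub, not the actual system behind the dataset.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Off-the-shelf English ASR model used purely for illustration.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def transcribe(waveform_16khz):
    # waveform_16khz: 1-D float array of audio samples at 16 kHz.
    inputs = processor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]
```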
For the past two years, I’ve been tinkering around with generative models in my spare time. I think I’ve landed on an approach that produces by far the most compelling results available today, and which scales like big language models. I’d like to outline the approach here. First of all, I want to touch on something that’ll become immediately obvious: this isn’t a novel architecture or anything. In fact, it is pretty much OpenAI’s DALL-E with a diffusion upsampler attached. Instead, it’s a way of thinking about how one can (1) improve upon DALL-E and (2) universally model generative…
Switched Convolutions – Spatial MoE for Convolutions Abstract I present switched convolutions: a method for scaling the parameter count of convolutions by learning a mapping across the spatial dimension that selects the convolutional kernel to be used at each location. I show how this method can be implemented in a way that incurs only a small increase in computational complexity. Finally, I discuss applications of switched convolutions and show that applying them to a pre-trained VAE results in large gains in performance. I have open sourced all of my work on switched convolutions. It can be found here. Background…
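A minimal sketch of the core idea follows, with names and shapes of my own rather than the open-sourced implementation. It uses a soft per-pixel softmax mixture over kernels so everything stays differentiable; the hard, cheaper selection the abstract alludes to can be approximated with an argmax or gumbel-softmax over the same selector logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch only: K candidate convolutions are applied everywhere and a 1x1
# selector predicts, per spatial location, how to weight their outputs.
class SwitchedConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, num_kernels=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
            for _ in range(num_kernels)
        )
        self.selector = nn.Conv2d(in_ch, num_kernels, 1)

    def forward(self, x):
        weights = F.softmax(self.selector(x), dim=1)              # (B, K, H, W)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, K, C, H, W)
        return (weights.unsqueeze(2) * outs).sum(dim=1)           # (B, C, H, W)

y = SwitchedConv(16, 32)(torch.randn(1, 16, 64, 64))
print(y.shape)  # torch.Size([1, 32, 64, 64])
```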
Batch size is one of the oldest hyperparameters in SGD, but it doesn’t get enough attention for super-resolution GANs. The problem starts with the fact that most SR algorithms are notorious GPU memory hogs. This is because they generally operate on high-dimensional images at high convolutional filter counts. To put this in context, the final intermediate tensor of the classic RRDB model has a shape of (<bs>x64x128x128), or over 33M floats at a batch size of 32. This one tensor consumes more than 10% of the model’s total memory usage! To cope with this high memory usage, SR papers…
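A quick back-of-the-envelope check of that number:

```python
# Final RRDB activation at batch size 32: (32, 64, 128, 128).
batch, channels, height, width = 32, 64, 128, 128
floats = batch * channels * height * width
print(f"{floats:,} floats")                     # 33,554,432 floats
print(f"{floats * 4 / 2**20:.0f} MiB in fp32")  # 128 MiB for this single activation
```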
SRFlow is a really neat adaptation of normalizing flows for the purpose of image super-resolution. It is particularly compelling because it potentially trains SR networks with only a single negative-log-likelihood loss. Thanks to a reference implementation from the authors of the paper, I was able to bring a trainable SRFlow network into DLAS. I’ve had some fun playing around with the models I have trained with this architecture, but I’ve also had some problems that I want to document here. First of all – the good First of all – SRFlow does work. It produces images that are perceptually better…
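For reference, here is a sketch of what that single loss looks like under my own assumptions about the interface; the `flow` callable below is a placeholder for the conditional invertible network, not the reference implementation or the DLAS port.

```python
import math
import torch

# Negative log-likelihood via change of variables: the flow maps the HR image
# (conditioned on the LR image) to a latent z and returns the accumulated
# log-determinant of its Jacobian.
def srflow_nll(flow, hr, lr):
    z, logdet = flow(hr, lr)
    # Log-likelihood of z under a standard Gaussian prior, summed per image.
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=[1, 2, 3])
    # Minimize -(log p(z) + log|det J|), i.e. maximize log p(hr | lr).
    return -(log_pz + logdet).mean()
```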
Abstract Modern image super-resolution techniques generally use multiple losses when training. Many techniques use a GAN loss to aid in producing high-frequency details. This GAN loss comes at the cost of producing high-frequency artifacts and distortions in the output image. In this post, I propose a simple regularization method for reducing those artifacts in any SRGAN model. Background on SR Losses Most SR models use composite losses to achieve realistic outputs. A pixel-wise loss and/or a perceptual loss coerces the generator to produce images that look structurally similar to the input low-resolution image. With only these losses, the network converges…
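To make the setup concrete, here is a generic sketch of such a composite generator loss. The weights and the particular perceptual and discriminator networks are placeholders of my own, not the configuration from any specific paper or from DLAS.

```python
import torch
import torch.nn.functional as F

# Generic SRGAN-style generator objective: pixel + perceptual + adversarial terms.
def sr_generator_loss(sr, hr, perceptual_net, discriminator,
                      w_pix=1.0, w_percep=1.0, w_gan=5e-3):
    pixel_loss = F.l1_loss(sr, hr)
    # Perceptual loss: L1 distance in the feature space of a frozen network (e.g. VGG).
    percep_loss = F.l1_loss(perceptual_net(sr), perceptual_net(hr))
    # GAN loss: the generator is rewarded when the discriminator calls its output real.
    logits = discriminator(sr)
    gan_loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return w_pix * pixel_loss + w_percep * percep_loss + w_gan * gan_loss
```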
At the beginning of this year, I started working on image super-resolution on a whim: could I update some old analog-TV quality videos I have archived away to look more like modern videos? This has turned out to be a rabbit hole far deeper than I could have imagined. It started out by learning about modern image super-resolution techniques. To this end, I started with a popular GitHub repo called ‘mmsr’. This repo no longer exists, and has since been absorbed into mmediting, but at the time it was a very well-written ML trainer library containing all of the components…
Computing optical flow is an important part of video understanding. There are many ways to train a model to compute this, but one of the more compelling methods is to:

1. Feed a model an image pair.
2. Have it predict optical flow.
3. Apply that optical flow to the original image.
4. Compute a pixel-wise loss against the second image.

In order to use this algorithm, however, you need a differentiable way to do step (3), typically called an “image warp”. Tensorflow has just such an operation in contrib, but to my knowledge Pytorch does not. After digging around for a while today, I…
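One common way to build a differentiable warp in PyTorch is to construct a sampling grid from the flow and hand it to torch.nn.functional.grid_sample. The sketch below is a generic version of that idea, not necessarily the exact solution this post lands on.

```python
import torch
import torch.nn.functional as F

# Backward warp: for each output pixel, sample the input image at the location
# shifted by the predicted flow (in pixel offsets).
def warp(image, flow):
    # image: (B, C, H, W); flow: (B, 2, H, W) holding (dx, dy) per pixel.
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=image.device),
                            torch.arange(w, device=image.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)  # (1, 2, H, W)
    coords = base + flow                                       # absolute sample positions
    # grid_sample expects coordinates normalized to [-1, 1], ordered (x, y).
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)            # (B, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)

# Zero flow returns the original image; a real flow reconstructs the second frame.
img2_pred = warp(torch.randn(1, 3, 64, 64), torch.zeros(1, 2, 64, 64))
```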
Batch normalization has a simple goal: stabilize the gradients of large computational graphs. In doing so, this technique has enabled the deep learning renaissance that almost every major ML breakthrough in the last 5 years has relied on. The concept is sound: by normalizing the mean and variance of the inputs to nearly every layer in a neural network, the gradients of that network rarely explode during the backward pass. The end result is that many neural networks can be easily trained with gradient techniques that would otherwise have never converged. So why am I calling it a hack? Let’s dig in….
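In miniature, the per-layer operation looks like this (the running statistics used at inference time and the learned affine parameters are omitted for brevity):

```python
import torch

# Normalize each feature to zero mean and unit variance using statistics
# computed over the batch dimension.
def batch_norm(x, eps=1e-5):
    # x: (batch, features)
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(8, 4) * 3 + 10
print(batch_norm(x).mean(dim=0), batch_norm(x).std(dim=0))  # ~0 and ~1 per feature
```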
After finishing my last project, I wanted to understand generative networks a bit better. In particular, GANs interest me because there doesn’t seem to be much research on them going on in the language modeling space. To build up my GAN chops, I decided to try to figure out image repair and super-resolution. My reasoning was actually pretty simple: I have a large collection of old VHS-quality Good Eats episodes that I enjoy watching with my family. Modern flat screens really bring out how inadequate the visual quality of these old videos is, however. Wouldn’t it be…
About a month ago, I decided to take the plunge into learning how to fine-tune a language generation model. One use-case of language generation that I found particularly compelling was abstractive document summarization. A lot of the papers currently available that deal with abstractive summarization and transformers work by truncating the input text to the maximum sequence length of the model. In the post-Transformer-XL world, I thought it’d be neat to fix that limitation. XLNet and TransformerXL are the two recurrent language models currently available in the Transformers NLP library. “Recurrent” in this context means that they were…
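To illustrate the “recurrent” property, here is a minimal sketch using the Transformer-XL classes as they have existed in the Transformers library; the chunking shown is my own illustration, not the summarization method this post develops.

```python
from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer

# Transformer-XL carries a memory of hidden states (`mems`) across segments,
# so a long document can be consumed chunk by chunk instead of being truncated.
tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")

text = "A very long document that will not fit in a single segment ..."
ids = tokenizer(text, return_tensors="pt")["input_ids"]

mems = None
for chunk in ids.split(128, dim=1):   # feed the document in 128-token segments
    out = model(chunk, mems=mems)
    mems = out.mems                   # hidden states reused by the next segment
```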
My desire to understand how the mind works started when I was choosing what I wanted to do in college, in 2000. Back then I was a nerdy kid who was pretty good with computers, but who had grown an insatiable interest in figuring out how the mind ticked. Not knowing a whole lot about the world, I figured my way into progressing this puzzle was the field of psychology. As a result, I joined UCSB as a biology major, with an expressed interest in both psychology and psychiatry. Two years later, my passion for working with computers…