Obligatory: the views and opinions expressed in this post are my own and do not represent the views and opinions of my employer. In light of all the hype going around about ChatGPT, I wanted to offer my “hot take” on what the next 2-5 years of the web look like. One aspect of the rise of generative models that isn’t getting enough attention is the long-term effect on the information economy. I think that being able to automatically produce arbitrary content that is indistinguishable from human-generated content, at scale, is the death knell of the web…
I’m going to take a stab at nailing down what I believe to be the five fundamental components of a deep neural network. I think there’s value in understanding complex systems at a simple, piecewise level. If you’re new to the field, I hope the intuitions I’ve built up over the last few years help you!

Data Representation

The unit of data representation in a DNN is a vector. Vectors are called many different things: embeddings, tensors, activations, hidden states. They’re all just lists of floating point numbers, each representing some single thing.

Storage

The learned weights of…
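To make the “just a list of floats” point concrete, here is a minimal PyTorch sketch; the dimensions are illustrative, not taken from the post:

```python
import torch

# An "embedding" for a single token: nothing more than a list of floating point
# numbers that stands in for one thing (a word, an audio frame, an image patch).
token_embedding = torch.randn(768)

# Hidden states / activations are the same idea, stacked into a batch:
# (batch, sequence, features) -- still just floats underneath.
hidden_states = torch.randn(2, 16, 768)

print(token_embedding.shape)   # torch.Size([768])
print(hidden_states.dtype)     # torch.float32
```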
Since joining OpenAI, I’ve had the distinct pleasure of interacting with some of the smartest people on the planet on the subject of generative models. In these conversations, I am often struck by how many different ways there are to “understand” how diffusion works. I don’t think most folks’ understanding of this paradigm is “right” or “wrong”: they are just different. I think there is distinct value in having different viewpoints here: an engineer’s perspective might be more useful for deploying these things in real products, whereas a mathematician’s conceptualization may aid improvements in the core technology. I’d…
I’ve been meaning to write this for a couple of months now, but simply haven’t found the time. Life has gotten quite busy for me lately, and I hope to explain why. First, the elephant in the room – I have left Google and finally stepped into the ML industry. I’ve accepted a position as a research engineer at OpenAI. To say that I am over the moon about this would be to understate it. This is, quite literally, my dream job. Somehow I have convinced someone to pay me to do the exact thing that I spend most of…
In machine learning research, there is often a stated desire to build “end to end” training pipelines, where all of the models cohesively learn from a single training objective. In the past, it has been demonstrated that such models perform better than ones assembled from multiple components, each trained with its own loss. The reasoning behind this notion is sound: every time you break up a model into separate parts, you necessarily introduce a new lossy medium between them. The prevailing theory is that these losses build up and produce an altogether inferior model at the end of the pipeline…
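A minimal sketch of the contrast being drawn here, using toy modules and losses of my own choosing rather than anything from the post:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder, decoder = nn.Linear(32, 16), nn.Linear(16, 32)
x = torch.randn(8, 32)

# (a) A pipeline of separately trained parts: each stage optimizes its own loss,
#     and the intermediate representation z becomes a lossy hand-off point.
z = encoder(x)
loss_stage1 = z.pow(2).mean()                     # toy stage-specific objective
loss_stage2 = F.mse_loss(decoder(z.detach()), x)  # detach(): no gradient crosses the boundary

# (b) End to end: one objective, gradients flow through every component.
loss_e2e = F.mse_loss(decoder(encoder(x)), x)
loss_e2e.backward()
```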
Lab notes is a way for me to openly blog about the things I am building and the methods I plan to use to build them. Everything written here should be treated with a healthy amount of skepticism. I’ve been researching something this week that shows a lot of promise, and I really wanted to write about it. I call them “cheater latents”. They’re inspired by something I observed in Tortoise: in an early version of Tortoise, I trained the AR model by using the output clip itself as the conditioning…
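The excerpt cuts off before the details, but the training setup it describes can be sketched roughly as follows; the module names and shapes are hypothetical stand-ins, not Tortoise's actual code:

```python
import torch
import torch.nn as nn

cond_encoder = nn.Linear(80, 512)     # hypothetical: turns a clip into a conditioning latent
ar_stub = nn.Linear(512 + 512, 512)   # hypothetical stand-in for the AR decoder

target_clip_mel = torch.randn(1, 80)  # the very clip the model is being trained to produce
text_latent = torch.randn(1, 512)

# The "cheat": during training, the conditioning latent is derived from the target clip itself.
cond_latent = cond_encoder(target_clip_mel)
pred = ar_stub(torch.cat([text_latent, cond_latent], dim=-1))
```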
Lab notes is a way for me to openly blog about the things I am building and the methods I plan to use to build them. Everything written here should be treated with a healthy amount of skepticism. I wanted to write about something I built about a month ago that I think is really neat and would like to return to someday. A quick disclaimer: I think there is a strong probability that formal research on this idea already exists; I vaguely recall reading something similar about…
A lot of people have asked about the computers I used to train TorToiSe. I’ve been meaning to snap some pictures, but it’s never “convenient” to turn these servers off, so I keep procrastinating. We had some severe thunderstorms today here in the Front Range which forced me to shut down my servers. I took the opportunity to take some photos.

A little history to start

Building out my servers has been a long, multi-year effort. I started the process right around the time that NVIDIA launched their Ampere GPU lineup. I was fortunate to be able to grab 6…
For my next project, I want to play around in the music generation space. I think it’ll be interesting to apply some of the lessons learned building Tortoise to music. The first step is building the musical equivalent of a vocoder: a model that will transform a MEL spectrogram to waveform data. That way the main generator(s) can work in highly reduced spectrogram space, just like Tortoise. I could just train a new UnivNet model. That probably would have been the wisest choice. However, I don’t really like training GANs and I have no experience training UnivNet. Finally, I really…
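For a sense of why working in spectrogram space is such a large reduction, here is a rough sketch using torchaudio; the parameter values are common defaults, not necessarily the ones used for this project:

```python
import torch
import torchaudio

sample_rate = 22050
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80
)

waveform = torch.randn(1, sample_rate * 4)   # ~4 seconds of audio: 88,200 samples
spec = to_mel(waveform)                      # -> (1, 80, ~345): far fewer elements

print(waveform.numel(), "waveform samples vs", spec.numel(), "spectrogram bins")
# A vocoder is the learned inverse of this mapping: MEL spectrogram -> waveform.
```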
Overview

TorToiSe is a text-to-speech (TTS) program which can mimic voices given 2-4 examples. It is composed of five separately-trained neural networks that are pipelined together to produce the final output. This document will first go into detail about each of the five models that make up Tortoise, and will wrap up with a system-level description of how they interoperate.

The Autoregressive Decoder

Reference Clips

A list of reference clips is also provided to the model. The model uses these clips to figure out how to properly mimic the voice, intonation, prosody, etc. of the speech it is expected to…
As I covered in my last post, I’m currently working on improving the quality of the diffusion model used to rebuild discretized audio signals for tortoise-tts. Since realizing that the diffusion model can work entirely with spectrograms, I have been re-structuring the model to be a flat transformer/resnet hybrid. One nifty thing about this set-up is that I can now concatenate the diffusion inputs with the conditioning signal from the discretized audio signal and feed the whole thing into the model. This is the same thing that multiple authors working on diffusion models have done with the low-resolution inputs for super-resolution…
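A minimal sketch of that concatenation, with illustrative shapes rather than the actual model dimensions:

```python
import torch

# x_t: the noisy signal being denoised at diffusion timestep t.
# cond: a conditioning signal aligned to the same length (e.g. derived from the
#       discretized audio codes); random numbers here.
x_t = torch.randn(4, 80, 512)     # (batch, channels, frames)
cond = torch.randn(4, 80, 512)

model_input = torch.cat([x_t, cond], dim=1)   # (4, 160, 512)
# The denoiser just sees extra input channels, the same way diffusion
# super-resolution models concatenate the (upsampled) low-resolution image.
```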
I’ve spent the majority of the last two months working on improving the diffusion model in Tortoise TTS. The model used in v1 had a few major shortcomings:

- Conditioning inputs were bottlenecked to a very low-dimensional input into the main model, limiting their effectiveness.
- The model was trained on audio signals at 11kHz. To make this feasible, I needed to chop up the signals into small clips, which have limited context.
- The model itself was relatively shallow: the top layers only had a depth of 1, and model channels at the top levels were very restricted.
- No processing…
I’ve updated the tortoise-tts repo with a script that automatically downloads model weights (thanks to the HuggingFace Hub for hosting them!). I’ve also created a colab notebook if you want to try this out on Google hardware. Make sure you pick a GPU runtime. Sample outputs can be found in the results/ folder of the GitHub repo. Find some handpicked generations below. I’m not done with this project. It is clear to me that the autoregressive model does an extremely good job at producing realistic prosody. I will be making a few tweaks to make it less sensitive to the…
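The script itself isn't reproduced here, but the general pattern for pulling weights from the Hub looks something like this; the repo id and filename are placeholders, not the actual ones the script uses:

```python
from huggingface_hub import hf_hub_download

# Downloads the file (or returns the cached copy) and gives back its local path.
path = hf_hub_download(repo_id="someuser/some-model", filename="autoregressive.pth")
print("weights cached at", path)
```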
In an earlier post, I walked you through a project I’ve been working on, which I called “triforce” at the time. I’ve finished training a first pass on this collection of models and want to write about the results. Deploying this speech CLIP model on the outputs of my autoregressive speech token generator made all of the difference. Outputs are consistently awesome, and almost always clearly convey the desired speech.

Adding CLIP to the ensemble

After training the three triforce models, I was having considerable difficulty with the autoregressive portion of the model. Specifically, while I would generate a lot…
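A rough sketch of the generate-many-then-rerank pattern described above; the model interfaces are hypothetical stand-ins, not the actual triforce code:

```python
import torch

def rerank_with_clip(ar_model, clip_model, text, num_candidates=16, top_k=1):
    # 1) Sample many candidate speech-token sequences from the autoregressive generator.
    candidates = [ar_model.sample(text) for _ in range(num_candidates)]
    # 2) Score each candidate against the text with the speech CLIP model.
    scores = torch.stack([clip_model.similarity(text, c) for c in candidates])
    # 3) Keep only the best-scoring outputs.
    best = scores.topk(top_k).indices
    return [candidates[i] for i in best]
```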
As I mentioned in my previous blog post, I’m currently working on text-to-speech models. I’m taking the “scale-it-to-the-moon” approach, so I need a lot of data. Fortunately, speech data is pretty easy to come by. Audio books, podcasts, YouTube and large archives of speeches and presentations are available all over the internet. The problem is that this audio generally isn’t transcribed. As you may know from interacting with Alexa, her spin-offs, or your smartphone assistant, digital speech recognition software is pretty damned good these days. My goal was to leverage this to build a huge dataset of artificially transcribed audio…
For the past two years, I’ve been tinkering around with generative models in my spare time. I think I’ve landed on an approach that produces by far the most compelling results available today, and which scales like big language models. I’d like to outline the approach here. First of all, I want to touch on something that’ll become immediately obvious: this isn’t a novel architecture or anything. In fact, it is pretty much OpenAI’s DALL-E with a diffusion upsampler attached. Instead, it’s a way of thinking about how one can (1) improve upon DALL-E and (2) universally model generative…
Switched Convolutions – Spatial MoE for Convolutions

Abstract

I present switched convolutions: a method for scaling the parameter count of convolutions by learning a mapping across the spatial dimension that selects the convolutional kernel to be used at each location. I show how this method can be implemented in a way that incurs only a small increase in computational complexity. Finally, I discuss applications of switched convolutions and show that applying them to a pre-trained VAE results in large gains in performance. I have open sourced all of my work on switched convolutions. It can be found here.

Background…
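A minimal sketch of the idea, assuming soft per-pixel selection over K candidate kernels; this is a simplification of mine, not the open-sourced implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchedConv2d(nn.Module):
    """K candidate conv kernels; a learned per-pixel selector decides which
    kernel's output to use at each spatial location (soft selection here)."""
    def __init__(self, in_ch, out_ch, num_kernels=4, kernel_size=3):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
             for _ in range(num_kernels)]
        )
        self.selector = nn.Conv2d(in_ch, num_kernels, 1)  # one logit per kernel per pixel

    def forward(self, x):
        outs = torch.stack([conv(x) for conv in self.experts], dim=1)  # (B, K, C, H, W)
        weights = F.softmax(self.selector(x), dim=1).unsqueeze(2)      # (B, K, 1, H, W)
        return (outs * weights).sum(dim=1)                             # (B, C, H, W)

y = SwitchedConv2d(16, 32)(torch.randn(2, 16, 64, 64))   # -> (2, 32, 64, 64)
```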
Batch size is one of the oldest hyperparameters in SGD, but it doesn’t get enough attention for super-resolution GANs. The problem starts with the fact that most SR algorithms are notorious GPU memory hogs. This is because they generally operate on high-dimensional images at high convolutional filter counts. To put this in context, the final intermediate tensor of the classic RRDB model has a shape of (<bs>x64x128x128), or over 33M floats at a batch size of 32. This one tensor consumes more than 10% of the model’s total memory usage! To cope with this high memory usage, SR papers…
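A quick back-of-the-envelope check of that figure:

```python
# Final RRDB intermediate tensor at batch size 32, fp32:
bs, c, h, w = 32, 64, 128, 128
num_floats = bs * c * h * w
print(num_floats)                          # 33,554,432 floats
print(num_floats * 4 / 1024 ** 2, "MiB")   # 128 MiB at 4 bytes per float
```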
SRFlow is a really neat adaptation of normalizing flows for the purpose of image super-resolution. It is particularly compelling because it potentially trains SR networks with only a single negative-log-likelihood loss. Thanks to a reference implementation from the authors of the paper, I was able to bring a trainable SRFlow network into DLAS. I’ve had some fun playing around with the models I have trained with this architecture, but I’ve also had some problems that I want to document here.

First of all – the good

First of all – SRFlow does work. It produces images that are perceptually better…
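For reference, the single negative-log-likelihood objective a normalizing flow trains with looks like this in its generic change-of-variables form; this is not SRFlow's actual code:

```python
import math
import torch

def flow_nll(z, log_det_jacobian):
    # log p(z) under a standard Gaussian base distribution, per sample
    log_pz = -0.5 * (z ** 2).sum(dim=[1, 2, 3]) - 0.5 * z[0].numel() * math.log(2 * math.pi)
    return -(log_pz + log_det_jacobian).mean()

z = torch.randn(4, 3, 32, 32)   # latent produced by the flow (toy values)
log_det = torch.zeros(4)        # accumulated log|det Jacobian| per sample (toy values)
loss = flow_nll(z, log_det)
```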
Abstract

Modern image super-resolution techniques generally use multiple losses when training. Many techniques use a GAN loss to aid in producing high-frequency details. This GAN loss comes at the cost of producing high-frequency artifacts and distortions on the source image. In this post, I propose a simple regularization method for reducing those artifacts in any SRGAN model.

Background on SR Losses

Most SR models use composite losses to achieve realistic outputs. A pixel-wise loss and/or a perceptual loss coerces the generator to produce images that look structurally similar to the input low-resolution image. With only these losses, the network converges…
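A sketch of the kind of composite objective being described, with illustrative loss weights and a caller-supplied feature extractor:

```python
import torch
import torch.nn.functional as F

def generator_loss(sr, hr, disc_fake_logits, feature_extractor):
    pixel = F.l1_loss(sr, hr)                                             # structural fidelity
    perceptual = F.l1_loss(feature_extractor(sr), feature_extractor(hr))  # feature-space similarity
    adversarial = F.binary_cross_entropy_with_logits(                     # "make it look real"
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    return pixel + 0.1 * perceptual + 5e-3 * adversarial                  # weights are illustrative
```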