After finishing my last project, I wanted to understand generative networks a bit better. In particular, GANs interest me because there doesn’t seem to be much research on them going on in the language modeling space.
To build up my GAN chops, I decided to try to figure out image repair and super-resolution. My reasoning was actually pretty simple: I have a large collection of old VHS-quality Good Eats episodes that I enjoy watching with my family. Modern flat screens really bring out how inadequate the visual quality of these old videos is, however. Wouldn’t it be great if I could use machine learning to “fix” these videos to provide a better experience for myself and my family? How hard could it be?
Turns out, really hard.
State of SISR
SISR stands for single-image super-resolution. It is the most basic form of super-resolution and has been around for decades. It is appealing because collecting training data is extremely easy: just find a source of high-quality images, downsample them, and train a model to reverse that operation.
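The data recipe above can be sketched in a few lines. This is a minimal illustration, not the ESRGAN pipeline: it uses simple average pooling for the downsampling step (real pipelines typically use bicubic resampling via Pillow or OpenCV), and the 4x scale factor is an assumption on my part.

```python
import numpy as np

def downsample(hr: np.ndarray, scale: int = 4) -> np.ndarray:
    """Create a low-resolution image by average-pooling an HR image.

    hr: array of shape (H, W, C) with H and W divisible by `scale`.
    """
    h, w, c = hr.shape
    # Group pixels into scale x scale blocks, then average each block.
    return hr.reshape(h // scale, scale, w // scale, scale, c).mean(axis=(1, 3))

# A training pair is then (downsample(hr), hr): the model learns LR -> HR.
```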
SISR has gone through the usual trends of data science. Methods run the spectrum from simple mathematical upsampling to PSNR-trained convolutional neural networks to GAN approaches. I decided to start with the last of these, specifically a technique that Wang et al. call “ESRGAN”.
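For context on that middle category: PSNR (peak signal-to-noise ratio) is the pixel-fidelity metric those CNN approaches optimize, usually by minimizing MSE against the ground-truth image. A minimal sketch of the metric itself:

```python
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB between two images of the same shape."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher is better; the well-known catch is that maximizing PSNR rewards blurry averages over plausible detail, which is exactly the gap GAN-based methods like ESRGAN try to close.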
This choice was driven primarily by the existence of the excellent ESRGAN GitHub project. This code is well designed and documented, and it has been a pleasure to build on top of.
Although my goal is eventually video super-resolution, my initial investigation into the field showed that video SR is largely built on top of image SR (big shocker!). Therefore, I decided to start by really understanding SISR.
Challenges of Super Resolution (and image generation)
Training a deep GAN on image super-resolution is a hardware-challenged problem. I plan to dive into this a bit more in a future article, but TL;DR: these models benefit from training on large images, but large images consume utterly insane amounts of GPU memory during the training passes. Thus, we are forced to train on small snippets of the images. When you take small snippets, you lose context that the model would otherwise use to make better SR “decisions”.
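The "small snippets" training described above usually means sampling random aligned crops from each LR/HR pair. Here is a sketch of that sampling step; the patch size and scale factor are illustrative assumptions, not values from ESRGAN.

```python
import random
import numpy as np

def random_paired_crop(lr: np.ndarray, hr: np.ndarray,
                       lr_patch: int = 32, scale: int = 4):
    """Take a random LR patch and the exactly corresponding HR patch.

    lr: (H, W, C) low-res image; hr: (H*scale, W*scale, C) high-res image.
    """
    h, w, _ = lr.shape
    y = random.randrange(h - lr_patch + 1)
    x = random.randrange(w - lr_patch + 1)
    lr_crop = lr[y:y + lr_patch, x:x + lr_patch]
    # The HR crop covers the same spatial region, scaled up.
    hr_crop = hr[y * scale:(y + lr_patch) * scale,
                 x * scale:(x + lr_patch) * scale]
    return lr_crop, hr_crop
```

Keeping the crops aligned is the whole trick: the model only ever sees a 32-pixel window of context, which is precisely the limitation described above.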
This is coupled with the fact that convolutional networks are typically parameter-poor. Put another way: they can be hard to train because the models just don’t have the capacity and structure to generalize to the enormous variety found in the world of images.
The result of this is often hidden away by research papers. They present only the best results of highly-specialized networks that can do one thing very well, but absolutely fail on anything else. The famous StyleGAN, for example, can only produce one type of image (and one subset of those images to boot). Edge cases produce atrocious results.
Super-resolution does not have the luxury of specialization. An effective SR model must adapt to a wide variety of image contents. Even though you can restrict the domain of the images you are upsampling (for example, Good Eats frames in my case), the variety will still be staggering.
The ESRGAN authors wisely worked around this problem by specifically designing their model to recognize and reconstruct image textures. This can produce great results for the majority of an image, but begins to fall apart when you attempt to super-resolve high-frequency parts of an image – like hair or eyes that have no detail in the LR image.
Super Resolution for Pre-trained Image Models
One facet of SR that is particularly interesting to me is the possibility that, as a technique, it might be used to train models on image understanding. Large NLP models are typically trained on next-token prediction, and you can consider SR to be the image analog of that task.
I can’t shake the feeling that natural image understanding is fundamentally limited by our current image processing techniques. I feel that the whole field is on the cusp of a breakthrough, and SR might very well be the basis of that breakthrough.
Of course, there’s a caveat: images are insanely complex. The adage “a picture is worth a thousand words” comes to mind here. If effective NLP models require billions of parameters, how many parameters are required for true image understanding?
I started my deep dive into SISR just as the COVID pandemic began to take off in North America in 2020. I’m writing this a little more than 3 months in, and I feel that I’ve learned a lot in the process.
You’re probably wondering what the point of this article is. It’s the introduction to a series of articles on musings, findings, and explorations in the world of SR. For such an obvious field of ML application, SR doesn’t have a whole lot of documentation. My hope is that the things I’ve learned can be useful to others exploring the field. Stay tuned!