Super Resolution

Switched Convolutions – Spatial MoE for Convolutions



I present switched convolutions: a method for scaling the parameter count of convolutions by learning a mapping across the spatial dimension that selects the convolutional kernel to be used at each location. I show how this method can be implemented with only a small increase in computational complexity. Finally, I discuss applications of switched convolutions and show that applying them to a pre-trained VAE results in large gains in performance.

I have open sourced all of my work on switched convolutions. It can be found here.


Despite the growing popularity of autoregressive models based on Transformers for image processing tasks, CNNs remain the most efficient way to perform image processing. 

One disadvantage of CNNs is that it is difficult to effectively scale their parameter count. This is normally done by either increasing the depth of the network or increasing the number of channels in the intermediate states. The problem with scaling either of these numbers is that doing so increases computational complexity by O(n^2) for 2-D convolutions because every parameter is repeatedly applied across every spatial index.

Another option for scaling is to move back to stacked dense layers for processing images. The problem with this approach is it does not encode the translational invariance bias that gives convolutions their prowess at processing diverse images.

In the language modeling space, an interesting idea was put forward by the Mixture of Experts (MoE) paper: scale the parameter count of a model by “deactivating” most of the parameters for any given input. A second paper, “Switch Transformers”, extends this idea by proposing modifications that allow a MoE model to scale parameters while achieving a near-fixed computational cost. The resulting model is termed “sparse”: it uses the inputs to dynamically select which parameters to use for any given computation, and most parameters go unused for any given input.

I aim to apply the MoE paradigm to convolutions.

Switched Convolutions

A switched convolution is a convolution which is composed of b independent kernels. Computing the convolution is similar to a standard convolution, except that each spatial input location uses a single one of the b kernels.

Ideally, the mechanism that selects which kernel to use for each spatial location would be learned. I adapt the sparse routing mechanism from Switch Transformers to achieve this, and propose a novel normalization layer that promotes proportional usage of all kernels. 

This drawing visualizes how a switched conv works:


The selector is a parameterized function responsible for producing a discrete mapping from the input space to the kernel selection space. It converts an input image into a set of spatially-aligned integers, which are used to select which convolutional kernel is applied at each image location.

The selector can be attached to any input, but in the experiments discussed in this post, I always attach it to the previous layer in the network. It is worth noting that I have tried using separate networks for generating selector inputs, but they have proven difficult to train and do not produce better results.

Here is a sketch of the internals of a selector:

Switch Processing Network

A NN is embedded within the switch to allow it to segment the image into like zones which will use the same convolutional kernels. It can be useful to think of this network like the dense layers applied to the transformer attention inputs.

The switch NN can be implemented using any type or number of NN layers capable of adjusting the input channel count, for example a 1×1 convolution, a lambda layer or even a transformer.

Switch Norm

The objective of the switch norm is to promote load balancing across all kernels. Without a switch norm, switched convolutions tend to collapse into using a single kernel. The switch norm replaces the load balancing loss proposed in the MoE and Switch Transformers papers. I tried a similar load balancing loss with switched convolutions, but found the normalization method superior.

The switch norm works similarly to a batch normalization across the selector dimension, except instead of operating across a batch, it operates across a large accumulated set of outputs, p. Every time the switch norm produces a new output, it adds that output to p. To keep memory in check, the accumulator is implemented as a rotating buffer of size q.

Effectively, this simple norm ensures that the average usage of each kernel across q samples and the entire spatial domain of the input is even. As long as q is big enough, there is still ample room for specialization from the selector, but no one kernel will ever dominate a switched conv. 
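To make the rotating buffer concrete, here is a minimal NumPy sketch of the idea. It is not the reference implementation: it assumes the selector emits per-location probabilities of shape (batch, b, H, W), that p stores the per-kernel average of each output, and that normalizing means dividing by the buffer average and then renormalizing over the kernel axis. The class name and all of those details are assumptions made for illustration.

```python
import numpy as np

class SwitchNormSketch:
    """Hypothetical rotating-buffer normalizer over the kernel (selector) axis."""

    def __init__(self, breadth, q=256):
        # p: rotating buffer of per-kernel usage, initialised to uniform usage.
        self.p = np.ones((q, breadth)) / breadth
        self.index = 0

    def __call__(self, probs):
        # probs: (batch, breadth, H, W), summing to 1 over the breadth axis.
        norm = self.p.mean(axis=0).reshape(1, -1, 1, 1)
        out = probs / norm                          # damp kernels that were used a lot
        out = out / out.sum(axis=1, keepdims=True)  # renormalise over kernels
        # Record this output's average per-kernel usage, overwriting the oldest slot.
        self.p[self.index] = out.mean(axis=(0, 2, 3))
        self.index = (self.index + 1) % len(self.p)
        return out
```

Because p is part of the object's state, it would have to be saved and restored with the model, which matches the note below about inference.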

I used a value of q=256 for most of my experiments. Future work should explore adjusting this hyperparameter as I did not tinker with it much.

It is important to note that the rotating buffer p becomes a parameter for any network using switch normalization. Even though gradients do not flow to it, it develops a characteristic signal over time. Attempting to perform inference without using a saved p always produces poor results.

A reference implementation for the switch norm can be found here.

Differentiable Argmax or Hard Routing

The argmax function, which returns the integer index of the greatest element along the specified axis, is not normally a differentiable function. In implementing switched convolutions, I produce a “differentiable argmax” function.

The forward pass behaves identically to the standard numpy argmax() function. The numeric value of the input that was fed into diff_argmax is recorded.

In the backwards pass, the gradients are first divided by the input recorded by the forward pass. Then, the gradient is set to zero for all but the max element along the specified axis.
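The forward and backward rules described above can be sketched in plain NumPy. This is an illustration only, not the linked PyTorch implementation: the function names are hypothetical, the forward pass also returns a one-hot mask (an assumption, to make the backward pass easy to express), and the recorded input is assumed to be non-zero at the max element (for example, softmax probabilities).

```python
import numpy as np

def diff_argmax_forward(x, axis=1):
    """Return argmax indices, a one-hot mask of the max element, and the saved input."""
    idx = np.argmax(x, axis=axis)
    one_hot = np.zeros_like(x)
    np.put_along_axis(one_hot, np.expand_dims(idx, axis), 1.0, axis=axis)
    return idx, one_hot, x.copy()

def diff_argmax_backward(grad_out, one_hot, saved_x):
    """Divide the gradient by the recorded input, then zero all but the max element."""
    # Use 1.0 as a dummy divisor at non-max positions; those entries are zeroed anyway.
    safe_x = np.where(one_hot == 1, saved_x, 1.0)
    return np.where(one_hot == 1, grad_out / safe_x, 0.0)
```

With an input row of [0.1, 0.7, 0.2] and an incoming gradient of ones, only the middle element receives a gradient, equal to 1 / 0.7.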

The gradients coming out of diff_argmax are a bit odd: they are exceptionally sparse and you might think that entire kernels would “die” off. This is what the switch norm prevents, however.

A reference diff_argmax implementation for PyTorch can be found here.

Switched Convolution

The actual switched convolution iterates across each spatial location and uses the output of the selector to determine which convolutional kernel to apply at that location.

Naive Implementation

A simple way to compute the switched convolution output is to perform b standard convolutions, one per kernel, then multiply their outputs by the one-hot output of the selector and sum the results:

Such a method can even be used without hard routing. In my experiments this does not perform much better than hard routing.
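As an illustration of the naive approach, here is a small NumPy sketch for a single image. The helper names, the tensor layouts, and the use of zero padding with odd kernel sizes are assumptions for this example; it is meant to show the computation, not to be fast.

```python
import numpy as np

def conv2d_same(x, w):
    """Cross-correlate x (C_in, H, W) with w (C_out, C_in, k, k), zero padded, odd k."""
    k = w.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    H, W = x.shape[1:]
    out = np.zeros((w.shape[0], H, W))
    for i in range(k):
        for j in range(k):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], xp[:, i:i + H, j:j + W])
    return out

def naive_switched_conv(x, kernels, selection):
    """kernels: (b, C_out, C_in, k, k); selection: (H, W) integer kernel index map."""
    b = kernels.shape[0]
    # Run every kernel over the whole image: (b, C_out, H, W).
    outs = np.stack([conv2d_same(x, kernels[i]) for i in range(b)])
    # One-hot selector mask: (b, H, W).
    one_hot = (np.arange(b)[:, None, None] == selection[None]).astype(x.dtype)
    # Keep only the selected kernel's output at each location.
    return (outs * one_hot[:, None]).sum(axis=0)
```

Replacing the one-hot mask with soft selector weights gives the variant without hard routing mentioned above.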

CUDA implementation

It is worth noting that since only one kernel is active per spatial location, the switched convolution only needs to calculate one dot product per spatial location – exactly the same as a standard convolution.

In contrast to Switch Transformers, which require distributed training processes to start seeing a scaling advantage, switched convolutions can be optimized on a single GPU. However, the larger kernel size and pseudo-random access into the kernel have a significant effect on how quickly a switched convolution can run.

A naive CUDA kernel that implements this can be found here. This custom kernel could use a significant amount of optimization (for example, it does not use tensor cores) but currently operates at ~15% the speed of a normal convolution when accounting for both the forward and backward passes with b=8. This means it is net-faster than the naive implementation at b=8, and improves linearly from there. It also has significantly better memory utilization properties because it saves considerably fewer intermediate tensors for backprop.


Training models with switched convolutions works best with large batch sizes. This makes sense: switched convolutions are very sparse and their parameters will only accrue meaningful gradients across a large set of examples. For example, if b=8, each parameter in the switched conv is generally only receiving about 1/8th of the gradient signal.

While it is possible to train a model incorporating switched convolutions from scratch, it is tedious since the signals that the selector function feeds off of are exceptionally noisy in the early stages of training.

For this reason, I use a different, staged approach to training models with switched convolutions: first, I train a standard CNN model. After this has converged, I convert a subset of the convolutions in that model to switched convolutions and continue training. This has several advantages:

  1. First stage training can be fast: smaller batch sizes can be used alongside simpler computations.
  2. Since the selector functions are only brought online in the second stage, they start training on fairly “mature” latents.

Converting a standard convolution to a switched convolution is simple: copy the kernel parameters across the switch breadth (b) and add a selector. Once you start training, the kernel parameters across the breadth dimension will naturally diverge and specialize as directed by the selector.
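A small NumPy sketch of the conversion step, using a 1×1 convolution to keep it short (that simplification and the function names are assumptions for illustration). It shows why the conversion is safe: right after copying, every kernel is identical, so the switched output matches the original convolution no matter what the selector picks.

```python
import numpy as np

def convert_to_switched(weight, breadth):
    """Copy a (C_out, C_in) 1x1 conv weight across the switch breadth: (b, C_out, C_in)."""
    return np.repeat(weight[None], breadth, axis=0)

def switched_conv1x1(x, kernels, selection):
    """x: (C_in, H, W); kernels: (b, C_out, C_in); selection: (H, W) integer map."""
    # kernels[selection] gathers one kernel per location: (H, W, C_out, C_in).
    return np.einsum('hwoc,chw->ohw', kernels[selection], x)

rng = np.random.default_rng(0)
weight = rng.normal(size=(5, 3))            # trained standard 1x1 conv
x = rng.normal(size=(3, 4, 4))
selection = rng.integers(0, 8, size=(4, 4))  # arbitrary selector output
kernels = convert_to_switched(weight, breadth=8)

standard = np.einsum('oc,chw->ohw', weight, x)
switched = switched_conv1x1(x, kernels, selection)
print(np.allclose(standard, switched))
```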

Uses & Demonstration

In experimenting with switched convolutions, I have seen the most success in applying them to generative networks. This is intuitive: they offer a way to decouple the expressive nature of the convolution in a generative network from a receptive understanding of what the network is actually working on. For example, a selector can learn to apply different kernels to “draw” hair, eyes, and skin – which all have different textures.


To demonstrate how effective switched convolutions are at improving network performance, I apply them to the stage 1 VQVAE network. I first train a vanilla stage 1 VQVAE to convergence:

I then convert the network by replacing 4 convolutions in both the encoder and decoder with switched convolutions that use b=8 and a selector composed of a lambda layer followed by a 1×1 convolution:

The result is a 20% improvement in loss, accounting for both the pixel-MSE reconstruction loss and the commitment loss.

Other Tests

It is worth noting that VQVAE is likely underparameterized for the data I used in this experiment. Inserting switched convolutions in a similar manner into other networks did not show as much success. Here are some notable things I tried:

  1. Classification networks: inserted switched convs in the upper (high resolution) layers of ResNet-50. Performance slightly degraded.
  2. Segmentation networks: inserted switched convs in the high resolution backbone layers. Performance did not change.
  3. StyleGAN2: inserted switched convs in the generator. Performance degraded. (This is a special case because of the way conv weights interact with the mapping network).
  4. Super-resolution: A 5-layer deep switched conv network of breadth 8 was found to have competitive performance with the 23-layer deep RRDB network from the ESRGAN paper.

Visualizing the Selector Outputs

It is trivial to output the maps produced by the selectors as a colormap. This can be instructive as it shows how the network learns to partition the images. Here are some example selector maps from the high resolution decoder selector from the VQVAE I trained:

As you can see, these selector maps generally seem to resemble edge detectors in function. They also seem to perform shading in generative networks, for example the arms in the third image.

Future Work

At this point, I don’t believe switched convolutions have demonstrated enough value to support continued research as I have currently formulated them. That being said, I still think the concept has value and I would like to revisit them in the future.

In particular, I am not satisfied with the way the selectors operate. This is purely a heuristic, but I believe the power of switched convs would be best expressed when the semantics of the image are separated from the texture. That is to say – I would have liked to have regions of the image that exhibit different textures (e.g. hair, eyes, skin, background) selected differently.

One project I am currently pondering is working on an unsupervised auto-segmenter. Something in the vein of Pixel-Level Contrastive Learning. If I could train a network that produces useful semantic latents at the per-pixel level, it could likely be applied at the input of the selector in switched convolutions to great effect.

Super Resolution

SRGANs and Batch Size

Batch size is one of the oldest hyperparameters in SGD, but it doesn’t get enough attention for super-resolution GANs.

The problem starts with the fact that most SR algorithms are notorious GPU memory hogs. This is because they generally operate on high-dimensional images at high convolutional filter counts.

To put this in context, the final intermediate tensor of the classic RRDB model has a shape of (<bs>x64x128x128) or over 33M floats at a batch size of 32. This one tensor consumes more than 10% of the model’s total memory usage!
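The element count is easy to check by hand. The snippet below works it out and also converts it to bytes, assuming 32-bit floats (the precision is an assumption; the text above only gives the element count).

```python
# Shape of the final intermediate RRDB tensor at a batch size of 32.
batch, channels, height, width = 32, 64, 128, 128

floats = batch * channels * height * width
bytes_fp32 = floats * 4            # assuming 4 bytes per float (FP32)
mib = bytes_fp32 / 2**20

print(floats)  # 33554432
print(mib)     # 128.0
```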

To cope with this high memory usage, SR papers often recommend training with minuscule batch sizes in the regime of 4-16 samples per batch. This is wholly inadequate, as I will discuss in this article.

Larger batches are (almost) always better

Training SR models with larger batch sizes results in an immediate, permanent improvement in performance of every SR model I have trained thus far. I discovered this on a whim with a custom model I was developing, but found out later that it applies to RRDB and SRResNet as well. Here is an example plot:

Perceptual loss of two identical models trained on different batch sizes. Blue line is batch-size=16. Red line is batch-size=64. Blue anomaly is caused by an overflow during 16-bit training.

The plot above conveys my experience in general: a larger batch size does not just accelerate training, it permanently improves it. This difference is visible in the resulting images as well. Models trained on larger batch sizes exhibit fewer artifacts and more coherent fine image structures (e.g. eyes, hair, ears, fingers).

Here is an interesting anecdote from a recent experience I had with this: I am training an ESRGAN model and decided to move from training on 128×128 HQ images to 256×256. To accomplish this, I re-used the same model and added a layer to the discriminator. I decided to speed things up by reducing the batch size by a factor of 2. After nearly a week of training and many tens of thousands of iterations, the results were worse than what I had started with. After doubling the batch size, the model finally began to visually improve again.

Recommendations for larger batches

I’ve done some comparisons between the same model with different batch sizes. The performance improvement that comes with increasing batch size is nearly linear between batch-size=[16,128]. I have not experimented heavily past 128 due to my own computational budget limitations. Any model I am serious about these days gets a batch size of 128, though.

Accommodating Large Batches

As mentioned earlier, the authors of SR papers have good reason to recommend smaller batch sizes: the RRDB network proposed in ESRGAN consumes about 10GB of VRAM with a batch size of 16!

As I’ve worked on more SR topics, I’ve come up with several workarounds that can help you scale your batch sizes up.

  1. Gradient Accumulation – You can easily synthesize arbitrarily large batch sizes using a technique called gradient accumulation. This simply involves repeatedly summing the gradients from multiple backwards passes into your parameters before performing an optimizer step. This can affect models that use batch statistics, but shouldn’t matter for SRGAN models because they shouldn’t be using batch normalization. Gradient accumulation is controlled in DLAS using the mega_batch_factor configuration parameter.
  2. Gradient Checkpointing – This is an unfortunately named and underutilized feature of PyTorch that allows you to prune out most of the intermediate tensors your model produces from GPU memory. This comes at the cost of having to re-compute these intermediate tensors in the backwards pass. Trust me: this is much faster than you think it is. The performance penalty of gradient checkpointing is often negligible simply because it allows you to fully utilize your GPU where you would otherwise only be partially using it. Gradient checkpointing is enabled in DLAS using the checkpointing_enabled configuration parameter.
  3. Mixed Precision – This is fairly old hat by now, but training in FP16 or in mixed precision mode will result in far lower memory usage. It can be somewhat of a pain, though, as evidenced above. Torch has recently made this a first-class feature.
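To illustrate the first item, here is a framework-free sketch of gradient accumulation on a one-parameter linear model. The function names are hypothetical and this is not how DLAS implements mega_batch_factor; it only shows that averaging the gradients of equally sized micro-batches gives the same update as one pass over the full batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_step(w, xs, ys, micro_batch, lr):
    """One optimizer step built from several smaller backward passes."""
    accum = 0.0
    chunks = 0
    for start in range(0, len(xs), micro_batch):
        # Each micro-batch fits in memory on its own; its gradient is summed in.
        accum += grad_mse(w, xs[start:start + micro_batch], ys[start:start + micro_batch])
        chunks += 1
    # Averaging over chunks matches the full-batch mean when chunks are equal sized.
    return w - lr * accum / chunks

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x + 1.0 for x in xs]
full_batch = 0.5 - 0.01 * grad_mse(0.5, xs, ys)
accumulated = accumulated_step(0.5, xs, ys, micro_batch=2, lr=0.01)
print(abs(full_batch - accumulated) < 1e-9)
```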

(By the way, all of these are implemented in DLAS – my generative network trainer. Check that out if you are interested in trying these out without spending many hours tweaking knobs.)

DLAS Super Resolution

Training SRFlow in DLAS (and why you shouldn’t)

SRFlow is a really neat adaptation of normalizing flows for the purpose of image super-resolution. It is particularly compelling because it potentially trains SR networks with only a single negative-log-likelihood loss.

Thanks to a reference implementation from the authors of the paper, I was able to bring a trainable SRFlow network into DLAS. I’ve had some fun playing around with the models I have trained with this architecture, but I’ve also had some problems that I want to document here.

First of all – the good

First of all – SRFlow does work. It produces images that are perceptually better than PSNR-trained models and don’t have artifacts like GAN-trained ones. For this reason, I think this is a very promising research direction, especially if we can figure out more effective image processing operations that have tractable determinants.

Before I dig into the “bad”, I want to provide a “pressure relief” for the opinions I express here. These are not simple networks to train or understand. It is very likely that I have done something wrong in my experiments. Everything I state and do is worth being double-checked (and a lot of it is trivial to do so for those who are actually interested).

The Bad

Model Size

SRFlow starts with a standard RRDB backbone, and tacks on a normalizing flow network. This comes at significant computational cost. RRDB is no lightweight already, and the normalizing flow net is much, much worse. These networks have a step time about 4x what I was seeing with ESRGAN networks. It is worth noting that reported GPU utilization while training SRFlow networks is far lower than I am used to, averaging about 50%. I believe this is due to inefficiencies in the model code (which I took from the authors). I was tempted to make improvements here, but preferred to keep backwards compatibility so I could use the authors’ pretrained model.

Aside from training slowly, SRFlow has a far higher memory burden. On my RTX3090 with 24G of VRAM, I was running OOM when trying to perform inference on images about 1000x1000px in size (on the HQ end).


While SRFlow generally produces aesthetically pleasing results, every trained model I have used generates subtle blocky artifacts. These artifacts are most visible in uniform textures. Here are two good examples of what I am talking about:

Examples of SRFlow artifacts. Images artificially blown up 250% for better visualization.

I have encountered these artifacts in other generative models I have trained in the past. They result from 1×1 convolutions which cannot properly integrate small differences in latent representation that neighboring pixels might contain. Unfortunately, SRFlow can only use 1×1 convolutions because this is the only type of convolution that is invertible.

Technically speaking, there is no reason why we could not eliminate these artifacts using additional “consistency” filters trained on the SRFlow output. I think it is worth knowing about them, though, since they point at a deeper problem with the architecture.

The SRFlow architecture currently has poor convergence

This one is a bit more complicated. You first need to understand the objective function of normalizing flows: to map a latent space (in this case, HQ images conditional on their LQ counterparts) to a distribution indistinguishable from gaussian noise.

To show why I think that SRFlow does a poor job at this, I will use the pretrained 8x face upsampler model provided by the authors. To demonstrate the problems with this model, I pulled a random face from the FFHQ dataset and downsampled it 8x:

I then went to the Jupyter notebook found in the authors’ repo and did a few upsample tests with the CelebA_8x model. Here is the best result:

Note that it is missing a lot of high frequency details and has some of the blocky artifacts discussed earlier.

I then converted that same model into my repo, and ran a script I have been using to play with these models. One thing I can do with this script is generate the “mean” face for any LR input (simple really, you just feed a tensor full of zeros to the gaussian input). Here is the output from that:

So what you are seeing here is what the model thinks the “most likely” HQ image is for the given LQ input. For reference, here is the image difference between the original HQ and the mean:

Note that the mean is missing a lot of the high-frequency details. My original suspicion for why this is happening is that the network is encoding these details into the Z vector that it is supposed to be converting to a gaussian distribution. To test this, I plotted the std(dim=1) and mean(dim=1) of the Z vectors at the end of the network (dim 1 is channel/filter dimension):

In a well trained normalizing flow, these would be indistinguishable from noise. As you can see, they are not: the Z vector contains a ton of structural information about the underlying HQ image. This tells me that the network is unable to properly capture these high frequency details and map them to a believable function.

This is, in general, my experience with SRFlow. I presented one image above, but the same behavior is exhibited in pretty much all inputs I have tested and extends to every other SRFlow network I have trained or worked with. The best I can ever get out of the network is images with Z=0, which produces appealing, “smoothed” images that beat out PSNR losses, but it misses all of the high-frequency details that a true SR algorithm should be creating. No amount of noise at the Z-input produces these details: the network simply does not learn how to convert these high frequency details into true gaussian noise.

It is worth noting that I brought this up with the authors. They gave this response to my comments, which provides some reasons why I may be seeing these issues. I can buy into these reasons, but they point to limitations with SRFlow that render it much less useful than other types of SR networks.


I think the idea behind SRFlow has some real merit. I hope that the authors or others continue this line of research and find architectures that do a better job converging. For the time being, however, I will continue working with GANs for super-resolution.

Concepts Super Resolution

Translational Regularization for Image Super Resolution


Modern image super-resolution techniques generally use multiple losses when training. Many techniques use a GAN loss to aid in producing high-frequency details. This GAN loss comes at a cost of producing high-frequency artifacts and distortions on the source image. In this post, I propose a simple regularization method for reducing those artifacts in any SRGAN model.

Background on SR Losses

Most SR models use composite losses to achieve realistic outputs. A pixel-wise loss and/or a perceptual loss coerces the generator to produce images that look structurally similar to the input low-resolution image. With only these losses, the network converges on producing high-resolution images that are essentially the numerical mean of all of the training data.

To humans, this results in an output image that is blurred and overly smoothed. High-frequency details like pock-marks, individual hair strands, fine scratches, etc. are not represented in the high-resolution images. These images can be appealing to the eye, but they are also clearly artificial.

Extreme examples of images upsampled using only pixel losses. The network learns how to form sharp edges, but completely fails at producing high frequency details.

To improve on this situation, adding a GAN loss was proposed in the SRGAN paper from 2017. This loss is effective in bringing back many high-frequency details, but comes at a cost: the generator eventually begins to learn to “trick” the discriminator by adding high-frequency artifacts in the image.

Examples of GAN artifacts. They often appear in areas of high-frequency details like eyes and hair. For the hair, notice the “strands” the generator is applying that go against the actual flow of the hair.

These artifacts range from mild to extremely bothersome. I have observed everything from eyebrows being removed from faces to hands or feet being distorted into giant blobs, even when the structural information for those features was present in the low-resolution images. Images generated from GAN SR networks are therefore generally more realistic than their perceptual counterparts, but are even more unsuited for general use since their failure mode is so severe.

Existing Solutions to SRGAN Artifacts

There are many proposed solutions to GAN artifacts. To name a few:

SPSR trains two separate networks: one built on top of images that have been fed through an edge detector and one on the raw image. The logic is that the network is induced to preserve the structure of the low-resolution image throughout the upsampling process.

TecoGAN (and other video SR architectures) improve on this by adding temporal coherence losses, which force the generator to be self-consistent across multiple frames.

GLEAN uses a pretrained generative network trained with only a GAN loss to guide the SRGAN process towards realistic high-frequency textures.

Posing the loss in the frequency domain or after a wavelet transform has also been explored as a solution to the problem.

Of these, I have found that the TecoGAN approach leads to the most impressive reduction in GAN artifacts. It is particularly intriguing because even though the intention of the paper was to improve temporal consistency, the authors also achieved superior single-image super-resolution.

Exploring Self Consistency Losses

The main divergence between SRGAN and TecoGAN is the pingpong loss proposed in the TecoGAN paper. This loss is derived by feeding a series of warped video frames recursively forward then backward through the generative network. The two versions of the same high-resolution video frame, from before and after this recursive feedforward, are compared with a simple pixel loss. The idea is that artifacts introduced by the network will necessarily grow during the “ping-pong” process, causing inconsistent outputs which could then be trained away.

This type of self-consistency loss is more powerful than the standard L1/L2 loss against a fixed target because the network can learn to be self-consistent from the gradients of the feedforward passes that produced both the images. For example, the network can learn to fix the problem of growing artifacts by suppressing those artifacts early on (in the first pass of the network) or suppressing their growth by accumulating a better statistical understanding of the underlying natural image. Either way, downstream quality improves.

Self Consistency Losses for Single Image Super Resolution

The same recursive redundancy loss can be performed for single images as well. The basic method to do this is to take an HQ image and derive two LQ images that share some region from that HQ image. Then, feed these LQ images through your generator and compare the same regions in the generated results.

There are actually many ways you can do this. Basically any image augmentation you might read about in DiffAug or similar papers works. For the purposes of image SR, you should probably steer away from color shifts or blurs, but translation, rotation and zooms are great methods.

Having tried all three, I have had particular success with translation. The following simple algorithm has had a noticeable effect on image quality for all of my SR networks:

Example crops from top-left and bottom-right corners of an HQ image.
  1. For any given HQ image, crop LQ patches from each corner of the image. For example, from a 256px image, extract 4 224px patches.
  2. Randomly pick any single corner image to feed forward through the network for the normal losses (e.g. L1, perceptual, GAN).
  3. Pull the region from (2) that is shared with all corner crops.
  4. Randomly pick a second corner crop, feed it forward, and crop out the region of the image that is shared with all corner crops.
  5. Perform an L1 loss between the results from (4) and (3).
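The bookkeeping for the corner crops and their shared region can be sketched as follows. The function names are hypothetical, and coordinates are in pixels of the image being cropped; if the generator upsamples its input by some factor, the shared-region coordinates in the generated result would need to be scaled by that factor.

```python
def corner_crops(size, crop):
    """(top, left) offsets of the four corner crops of a square image."""
    d = size - crop
    return {
        "top_left": (0, 0),
        "top_right": (0, d),
        "bottom_left": (d, 0),
        "bottom_right": (d, d),
    }

def shared_region_in_crop(size, crop, offset):
    """Region shared by all four corner crops, as (y0, x0, y1, x1) in crop coordinates.

    In full-image coordinates the shared region is [d, crop) on both axes,
    which is only non-empty when 2 * crop > size.
    """
    d = size - crop
    top, left = offset
    return (d - top, d - left, crop - top, crop - left)

offsets = corner_crops(256, 224)
print(shared_region_in_crop(256, 224, offsets["top_left"]))      # (32, 32, 224, 224)
print(shared_region_in_crop(256, 224, offsets["bottom_right"]))  # (0, 0, 192, 192)
```

For the 256px example with 224px crops, every crop shares the same 192×192 region, which is what steps (3) and (4) cut out before the L1 loss in step (5).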

This algorithm can be further improved upon by selecting crops that don’t necessarily need to start in the image corners, but I am not sure that the improvements warrant the additional complexity. Shear and zoom can also be added, but this also adds complexity (particularly regarding pixel alignment). I have tried zoom losses and they did not add significant performance gains.

Example validation performance gains on an L1-perceptual loss from a VGG-16 network between two networks. The red line represents a baseline network without the translational consistency loss. The blue line re-starts training of the baseline network at step 30k with the translational consistency loss added. Performance gains are ~1-2%. Heuristic perceptual gains are much higher due to fewer artifacts.

One note about this loss: it should not be applied to an SR network until after it begins to produce coherent images. Applying the loss from the start of training results in networks that never converge because their initial outputs are so noisy that the translational loss dominates the landscape. The TecoGAN authors noted the same result with their ping-pong loss, as an example.

DLAS Super Resolution

Deep Learning Art School (DLAS)

At the beginning of this year, I started working on image super-resolution on a whim: could I update some old analog-TV quality videos I have archived away to look more like modern videos? This has turned out to be a rabbit hole far deeper than I could have imagined.

It started out by learning about modern image super-resolution techniques. To this end, I started with a popular GitHub repo called ‘mmsr’. This repo no longer exists, and has since been absorbed into mmediting, but at the time it was a very well-written ML trainer library containing all of the components needed to set up an SR-training pipeline.

As my SR (and GAN) journey continued, I often needed to make sweeping alterations to the trainer code. This frustrated me, because it invalidated old experiments or added a ton of labor (and messy code) to keep them relevant. It was doubly-insulting because MMSR at its core was designed to be configuration-driven. As a “good” SWE before an ML practitioner, I started coming up with a plan to massively overhaul MMSR.

Deep Learning Art School is the manifestation of that plan. With it, I have wholly embraced a configuration-driven ML training pipeline that is targeted at research and experimentation. It was originally designed with training image super-resolution models in mind, but I have been able to easily build configurations that train everything from pure GANs to object detectors to image recognizers with very small changes to the plugin API. I now edit the core training code so infrequently that I am considering breaking it off into its own repo (or turning it into a Python tool). This has been a design success beyond my wildest dreams.

The repo still has some rough edges, to be sure. Most of that is due to two things:

  1. In the original design, I never imagined I would be using it outside of image SR. There are many unnecessary hard-coded points that make this assumption and make other workflows inconvenient.
  2. I did not bother to write tests for the original implementation. I just never thought it would be as useful as it turned out to be.

In the next couple of months, I plan to slowly chip away at these problems. This tool has been incredible for me as a way to bootstrap my way into implementing pretty much any image-related paper or idea I can come up with, and I want to share it with the world.

Expect to hear more from me about this repo going forwards, but here are some reference implementations of SOTA papers that might whet your appetite for what DLAS is and what it can do:

  • SRFlow implementation – I pulled in the model source code from the author’s repo, made a few minor changes, and was able to train it!
  • GLEAN implementation – I hand-coded this one based on information from the paper and successfully reproduced some of what they accomplished in the paper (haven’t had a chance to test everything yet).
  • ESRGAN implementation – Not new by any measure, but shows what the DLAS way of accomplishing this “classic” method looks like.

Super Resolution

Diving into Super Resolution

After finishing my last project, I wanted to understand generative networks a bit better. In particular, GANs interest me because there doesn’t seem to be much research on them going on in the language modeling space.

To build up my GAN chops, I decided to try to figure out image repair and super-resolution. My reasoning was actually pretty simple: I have a large collection of old VHS quality Good Eats episodes that I enjoy watching with my family. Modern flat screens really bring out how inadequate the visual quality of these types of old videos is, however. Wouldn’t it be great if I could use machine learning to “fix” these videos to provide a better experience for myself and my family? How hard could it be?

Turns out, really hard.

State of SISR

SISR stands for single-image super-resolution. It is the most basic form of super-resolution that has been around for decades. It is appealing because it is extremely easy to collect data for it: just find a source of high quality images, downsample them and train a model to reverse that operation.
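As a rough illustration of that data recipe, here is a minimal sketch that builds one (low-res, high-res) training pair from a grayscale image stored as a list of rows. The function names are made up for this example, and the box-filter average is a stand-in I chose to keep the code dependency-free; real pipelines typically downsample with bicubic interpolation from an image library.

```python
def downsample_box(image, scale):
    """Downsample a 2-D grayscale image by averaging each scale x scale block."""
    height, width = len(image), len(image[0])
    if height % scale or width % scale:
        raise ValueError("image dimensions must be divisible by scale")
    low_res = []
    for top in range(0, height, scale):
        row = []
        for left in range(0, width, scale):
            # Collect every pixel in the current block and average them.
            block = [image[y][x]
                     for y in range(top, top + scale)
                     for x in range(left, left + scale)]
            row.append(sum(block) / len(block))
        low_res.append(row)
    return low_res


def make_training_pair(high_res, scale=2):
    """Return (model input, training target) for one high-quality image."""
    return downsample_box(high_res, scale), high_res


hr = [[0, 2, 4, 6],
      [2, 4, 6, 8],
      [1, 1, 1, 1],
      [3, 3, 3, 3]]
lr, target = make_training_pair(hr, scale=2)
print(lr)  # [[2.0, 6.0], [2.0, 2.0]]
```

The model is then trained to map `lr` back to `target`. The choice of downsampling method matters, since the model only learns to reverse the specific degradation it is shown.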

SISR has gone through the usual trends of data science. Methods run the spectrum from simple mathematical upsampling to PSNR-trained convolutional neural networks to GAN approaches. I decided to start with the last of these, specifically a technique that Wang et al. call “ESRGAN”.

This choice was driven primarily by the existence of the excellent ESRGAN GitHub project. This code is well designed and documented and has been a pleasure to work on top of.

Although my goal is eventually video super-sampling, my initial investigation into the field showed that video SR is just a subset of image SR (big shocker!). Therefore, I decided to start by really understanding SISR.

Challenges of Super Resolution (and image generation)

Training a deep GAN on image super-resolution is a hardware-challenged problem. I plan to dive into this a bit more in a future article, but TL;DR: these models benefit from training on large images, but large images consume utterly insane amounts of GPU memory during the training passes. Thus, we are forced to train on small snippets of the images. When you take small snippets, you lose context that the model would otherwise use to make better SR “decisions”.
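To make the "small snippets" concrete, here is a minimal sketch of paired random cropping, assuming images are stored as lists of rows and the high-res image is exactly `scale` times larger than the low-res one. The function and parameter names are illustrative, not taken from any particular trainer.

```python
import random


def paired_random_crop(low_res, high_res, lr_patch, scale, rng=random):
    """Cut matching patches out of an aligned low-res / high-res image pair."""
    lr_height, lr_width = len(low_res), len(low_res[0])
    # Pick the crop origin in low-res coordinates (randint is inclusive).
    top = rng.randint(0, lr_height - lr_patch)
    left = rng.randint(0, lr_width - lr_patch)
    lr_crop = [row[left:left + lr_patch]
               for row in low_res[top:top + lr_patch]]
    # Scale the origin and size so the high-res patch covers the same region.
    hr_top, hr_left, hr_patch = top * scale, left * scale, lr_patch * scale
    hr_crop = [row[hr_left:hr_left + hr_patch]
               for row in high_res[hr_top:hr_top + hr_patch]]
    return lr_crop, hr_crop


# Toy images whose "pixels" are their own (y, x) coordinates.
lr_image = [[(y, x) for x in range(4)] for y in range(4)]
hr_image = [[(y, x) for x in range(8)] for y in range(8)]
lr_crop, hr_crop = paired_random_crop(lr_image, hr_image, lr_patch=2, scale=2,
                                      rng=random.Random(0))
```

Each pair covers only a small window of the full frame, which is exactly where the lost context comes from.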

This is coupled with the fact that convolutional networks are typically parameter-poor. Put another way: they can be hard to train because the models just don’t have the capacity and structure to generalize to the enormous variety found in the world of images.

The result of this is often hidden away by research papers. They present only the best results of highly-specialized networks that can do one thing very well, but absolutely fail on anything else. The famous StyleGAN, for example, can only produce one type of image (and one subset of those images to boot). Edge cases produce atrocious results.

Super-resolution does not have the luxury of specialization. An effective SR model must adapt to a wide variety of image contents. Even though you can restrict the domain of the images you are upsampling (for example, Good Eats frames in my case), the variety will still be staggering.

The ESRGAN authors wisely worked around this problem by specifically designing their model to recognize and reconstruct image textures. This can produce great results for the majority of an image, but begins to fall apart when you attempt to super-resolve high-frequency parts of an image – like hair or eyes that have no detail in the LR image.

Super Resolution for Pre-trained Image Models

One facet of SR that is particularly interesting to me is the possibility that, as a technique, it might be used to train models on image understanding. Large NLP models are largely trained on next token prediction, and you can consider SR to be the image-analog to this task.

I can’t shake the feeling that natural image understanding is fundamentally limited by our current image processing techniques. I feel that the whole field is on the cusp of a breakthrough, and SR might very well be the basis of that breakthrough.

Of course, there’s a caveat: images are insanely complex. The adage “a picture is worth a thousand words” comes to mind here. If effective NLP models require billions of parameters – how many parameters are required for true image understanding?

Going Forwards

I started my deep dive into SISR just as the COVID pandemic began to take off in North America in 2020. I’m writing this a little more than 3 months in, and I feel that I’ve learned a lot in the process.

You’re probably wondering what the point of this article is. It’s an introduction to a series of articles on musings, findings, and explorations in the world of SR. For such an obvious field of ML application, SR doesn’t have a whole lot of documentation. My hope is that the things I’ve learned can be useful to others exploring the field. Stay tuned!