DLAS Super Resolution

Training SRFlow in DLAS (and why you shouldn’t)

SRFlow is a really neat adaptation of normalizing flows for the purpose of image super-resolution. It is particularly compelling because it can potentially train SR networks with only a single negative-log-likelihood loss.

Thanks to a reference implementation from the authors of the paper, I was able to bring a trainable SRFlow network into DLAS. I’ve had some fun playing around with the models I have trained with this architecture, but I’ve also had some problems that I want to document here.

First of all – the good

SRFlow does work. It produces images that are perceptually better than PSNR-trained models and don’t have artifacts like GAN-trained ones. For this reason, I think this is a very promising research direction, especially if we can figure out more effective image processing operations that have tractable determinants.

Before I dig into the “bad”, I want to provide a “pressure relief” for the opinions I express here. These are not simple networks to train or understand. It is very likely that I have done something wrong in my experiments. Everything I state and do is worth double-checking (and much of it is trivial to double-check for those who are actually interested).

The Bad

Model Size

SRFlow starts with a standard RRDB backbone and tacks a normalizing flow network onto it. This comes at significant computational cost: RRDB is no lightweight to begin with, and the normalizing flow net is much, much worse. These networks have a step time about 4x what I was seeing with ESRGAN networks. It is worth noting that reported GPU utilization while training SRFlow networks is far lower than I am used to, averaging about 50%. I believe this is due to inefficiencies in the model code (which I took from the authors). I was tempted to make improvements here, but preferred to keep backwards compatibility so I could use the authors’ pretrained model.

Aside from training slowly, SRFlow has a far higher memory burden. On my RTX 3090 with 24GB of VRAM, I was running OOM when trying to perform inference on images about 1000x1000px in size (on the HQ end).


Blocky Artifacts

While SRFlow generally produces aesthetically pleasing results, every trained model I have used generates subtle blocky artifacts. These artifacts are most visible in uniform textures. Here are two good examples of what I am talking about:

Examples of SRFlow artifacts. Images artificially blown up 250% for better visualization.

I have encountered these artifacts in other generative models I have trained in the past. They result from 1×1 convolutions, which cannot properly integrate small differences in the latent representations of neighboring pixels. Unfortunately, SRFlow can only use 1×1 convolutions because this is the only type of convolution that is invertible.
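For concreteness, here is a minimal sketch of a Glow-style invertible 1×1 convolution of the kind flow networks rely on. This is my own illustration (class name and orthogonal initialization included), not the authors’ code; the key constraint is that the kernel is a single C×C matrix, so it mixes channels at each pixel but cannot look at neighboring pixels at all.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InvConv1x1(nn.Module):
    """Sketch of a Glow-style invertible 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        # Initialize with a random orthogonal matrix so the conv starts invertible
        w = torch.linalg.qr(torch.randn(channels, channels))[0]
        self.weight = nn.Parameter(w)

    def forward(self, x):
        b, c, h, w = x.shape
        # Log-determinant of the Jacobian: each of the h*w pixel positions
        # contributes log|det W|
        logdet = h * w * torch.slogdet(self.weight)[1]
        y = F.conv2d(x, self.weight.view(c, c, 1, 1))
        return y, logdet

    def inverse(self, y):
        c = y.shape[1]
        # Inverting the conv is just convolving with the inverse matrix
        w_inv = torch.inverse(self.weight)
        return F.conv2d(y, w_inv.view(c, c, 1, 1))
```

Because the receptive field is a single pixel, neighboring pixels whose latents differ slightly get mixed independently, which is consistent with the blockiness described above.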

Technically speaking, there is no reason why we could not eliminate these artifacts using additional “consistency” filters trained on the SRFlow output. I think it is worth knowing about them, though, since they point at a deeper problem with the architecture.

The SRFlow architecture currently has poor convergence

This one is a bit more complicated. You first need to understand the objective function of normalizing flows: to map the data distribution (in this case, HQ images conditioned on their LQ counterparts) to a latent distribution indistinguishable from gaussian noise.
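As a sketch of that objective: by the change-of-variables formula, the log-likelihood of an image decomposes into a standard-gaussian prior term on the latent plus the log-determinant of the flow’s Jacobian. The function below is my own illustration of that decomposition, not SRFlow’s actual loss code:

```python
import math
import torch

def flow_nll(z, logdet, num_pixels):
    """Per-dimension NLL of a normalizing flow (illustrative sketch).

    z:          latent tensor [B, C, H, W] the flow produced from an HQ image
    logdet:     per-element sum of log|det Jacobian| over all flow steps, [B]
    num_pixels: total dimensionality per element (C * H * W)
    """
    # log-density of z under a standard gaussian prior N(0, I)
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=(1, 2, 3))
    # change of variables: log p(x) = log p(z) + log|det dz/dx|
    log_px = log_pz + logdet
    # negate and normalize to nats per dimension
    return -(log_px / num_pixels).mean()
```

Minimizing this pushes the flow to make z look like standard gaussian noise; if structure from the image survives in z, the prior term is paying a likelihood penalty for it.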

To show why I think that SRFlow does a poor job at this, I will use the pretrained 8x face upsampler model provided by the authors. To demonstrate the problems with this model, I pulled a random face from the FFHQ dataset and downsampled it 8x:

I then went to the Jupyter notebook found in the authors’ repo and did a few upsample tests with the CelebA_8x model. Here is the best result:
Note that it is missing a lot of high frequency details and has some of the blocky artifacts discussed earlier.

I then converted that same model into my repo, and ran a script I have been using to play with these models. One thing I can do with this script is generate the “mean” face for any LR input (simple really, you just feed a tensor full of zeros to the gaussian input). Here is the output from that:
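In case it is useful, here is roughly what that “mean face” trick looks like. The model call signature below is hypothetical and stands in for the actual SRFlow reverse pass; the point is simply that a zero tensor is the mode/mean of the standard gaussian prior, so decoding it yields the model’s “most likely” HQ image:

```python
import torch

def mean_prediction(model, lq, scale=8, z_channels=3):
    """Decode the gaussian prior's mean (z = 0) into an HQ image.

    'model' is assumed to be a trained SRFlow-style network whose reverse
    pass maps (LQ image, latent z) -> HQ image. The keyword names
    (lr=, z=, reverse=) are illustrative, not a real API.
    """
    b, c, h, w = lq.shape
    # z = 0 is the most likely point under N(0, I)
    z = torch.zeros(b, z_channels, h * scale, w * scale)
    with torch.no_grad():
        hq = model(lr=lq, z=z, reverse=True)
    return hq
```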

So what you are seeing here is what the model thinks the “most likely” HQ image is for the given LQ input. For reference, here is the image difference between the original HQ and the mean:

Note that the mean is missing a lot of the high-frequency details. My original suspicion for why this is happening is that the network is encoding these details into the Z vector that it is supposed to be converting to a gaussian distribution. To test this, I plotted the std(dim=1) and mean(dim=1) of the Z vectors at the end of the network (dim 1 is the channel/filter dimension):
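The reduction itself is trivial; sketched below with illustrative names. Each reduction collapses the channel dimension, leaving one H×W map per image that can be rendered with something like matplotlib’s imshow:

```python
import torch

def latent_channel_stats(z):
    """Per-pixel statistics of a flow latent z of shape [B, C, H, W].

    For a well-trained flow, both returned maps should look like
    featureless noise; visible image structure means the flow failed
    to gaussianize those details.
    """
    z_std = z.std(dim=1)    # std across channels at each pixel, [B, H, W]
    z_mean = z.mean(dim=1)  # mean across channels at each pixel, [B, H, W]
    return z_std, z_mean
```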

In a well trained normalizing flow, these would be indistinguishable from noise. As you can see, they are not: the Z vector contains a ton of structural information about the underlying HQ image. This tells me that the network is unable to properly capture these high frequency details and map them to a believable gaussian distribution.

This is, in general, my experience with SRFlow. I presented one image above, but the same behavior is exhibited on pretty much all inputs I have tested with, and it extends to every other SRFlow network I have trained or worked with. The best I can ever get out of the network is images with Z=0, which produce appealing, “smoothed” images that beat out PSNR losses, but miss all of the high-frequency details that a true SR algorithm should be creating. No amount of noise at the Z-input produces these details: the network simply does not learn how to convert these high frequency details into true gaussian noise.

It is worth noting that I brought this up with the authors. They gave this response to my comments, which provides some reasons why I may be seeing these issues. I can buy into these reasons, but they point to limitations with SRFlow that render it much less useful than other types of SR networks.


I think the idea behind SRFlow has some real merit. I hope that the authors or others continue this line of research and find architectures that do a better job converging. For the time being, however, I will continue working with GANs for super-resolution.


Deep Learning Art School (DLAS)

At the beginning of this year, I started working on image super-resolution on a whim: could I update some old analog-TV quality videos I have archived away to look more like modern videos? This has turned out to be a rabbit hole far deeper than I could have imagined.

It started out by learning about modern image super-resolution techniques. To this end, I started with a popular GitHub repo called ‘mmsr’. This repo no longer exists, and has since been absorbed into mmediting, but at the time it was a very well-written ML trainer library containing all of the components needed to set up an SR-training pipeline.

As my SR (and GAN) journey continued, I often needed to make sweeping alterations to the trainer code. This frustrated me, because it invalidated old experiments or added a ton of labor (and messy code) to keep them relevant. It was doubly insulting because MMSR at its core was designed to be configuration-driven. As a “good” SWE before becoming an ML practitioner, I started coming up with a plan to massively overhaul MMSR.

Deep Learning Art School is the manifestation of that plan. With it, I have wholly embraced a configuration-driven ML training pipeline that is targeted at research and experimentation. It was originally designed with training image super-resolution models in mind, but I have been able to easily build configurations that train everything from pure GANs to object detectors to image recognizers with very small changes to the plugin API. I now edit the core training code so infrequently that I am considering breaking it off into its own repo (or turning it into a Python tool). This has been a design success beyond my wildest dreams.

The repo still has some rough edges, to be sure. Most of that is due to two things:

  1. In the original design, I never imagined I would be using it outside of image SR. There are many unnecessary hard-coded points that make this assumption and make other workflows inconvenient.
  2. I did not bother to write tests for the original implementation. I just never thought it would be as useful as it turned out to be.

In the next couple of months, I plan to slowly chip away at these problems. This tool has been incredible for me as a way to bootstrap my way into implementing pretty much any image-related paper or idea I can come up with, and I want to share it with the world.

Expect to hear more from me about this repo going forwards, but here are some reference implementations of SOTA papers that might whet your appetite for what DLAS is and what it can do:

  • SRFlow implementation – I pulled in the model source code from the authors’ repo, made a few minor changes, and was able to train it!
  • GLEAN implementation – I hand-coded this one based on information from the paper and successfully reproduced some of its results (I haven’t had a chance to test everything yet).
  • ESRGAN implementation – Not new by any measure, but shows what the DLAS way of accomplishing this “classic” method looks like.