I’m going to take a stab at nailing down what I believe to be the five fundamental components of a deep neural network. I think there’s value in understanding complex systems at a simple, piecewise level. If you’re new to the field, I hope that these understandings I’ve built up over the last few years help you!
The unit of data representation in a DNN is the vector. Vectors go by many different names: embeddings, tensors, activations, hidden states. Each is just a list of floating point numbers that represents some single thing.
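As a quick sketch of what that means in practice (the word, dimensionality, and values here are all made up for illustration):

```python
import numpy as np

# A hypothetical 4-dimensional embedding for the word "cat".
# The numbers are arbitrary; in a real network they are learned.
cat_embedding = np.array([0.12, -0.58, 1.03, 0.44], dtype=np.float32)

# Whatever we call it (embedding, activation, hidden state), it is
# just a flat list of floats standing in for a single thing.
print(cat_embedding.shape)  # (4,)
```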
The learned weights of the neural network are where data is stored. Specifically, it is becoming clear that the layers collectively known as “linear”, “affine”, or “MLPs” are principally responsible for information storage. This information is imparted onto the vectors by way of matrix multiplications.
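To make "imparted by matrix multiplication" concrete, here is a toy linear layer; the dimensions are arbitrary and the weights are random stand-ins for what training would learn:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy linear/affine layer. Its weight matrix W is where the
# network's learned information lives.
d_in, d_out = 8, 16
W = rng.normal(size=(d_in, d_out)).astype(np.float32)  # learned weights
b = np.zeros(d_out, dtype=np.float32)                  # learned bias

x = rng.normal(size=(d_in,)).astype(np.float32)  # an input vector

# The stored information is imparted onto the vector via matmul.
y = x @ W + b
print(y.shape)  # (16,)
```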
When it comes to storage, parameter count is of paramount concern. If you are designing a neural network that you expect will need to store a lot of information (for example, to model a high-entropy dataset), you would do well to ensure that you have enough (or wide enough) MLPs.
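A back-of-the-envelope example of what "enough MLP parameters" means, using the common transformer convention of a 4x hidden expansion (the model width here is just illustrative):

```python
# Parameter count for one standard transformer MLP block with the
# conventional 4x expansion. d_model is an illustrative choice.
d_model = 4096
d_hidden = 4 * d_model

up_proj = d_model * d_hidden    # first affine: project up
down_proj = d_hidden * d_model  # second affine: project back down
mlp_params = up_proj + down_proj

print(f"{mlp_params:,} parameters in one MLP block")  # 134,217,728
```

Multiply by the number of layers and the MLPs quickly dominate a model's total parameter budget, which is exactly what you want if storage capacity is the goal.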
A fairly obvious but nonetheless critical component of any neural network is the ability for it to mix information across a spatial domain. This concept is universal: different words in a sentence, pixels in an image, pressure readings in an audio signal, amino acids in a protein. Almost every ML problem boils down to modeling how the building blocks of a modality interact with each other.
Neural networks learn information entanglement any time you allow two (or more) parts of a whole to interact with one another. This can be as simple as adding two things together, or as complicated as self-attention.
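The "as simple as adding two things together" end of that spectrum looks like this (the vectors are arbitrary):

```python
import numpy as np

# Two parts of a whole, e.g. two token vectors.
a = np.array([1.0, 2.0])
b = np.array([0.5, -1.0])

# The simplest possible entanglement: after this addition, the
# result carries information from both parts at once.
mixed = a + b
print(mixed)  # [1.5 1. ]
```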
Entanglement can involve any number of parts, but traditionally we have seen the best performance from pairwise interactions (i.e. two parts interacting).
Self-attention is the gold standard here. Most modern ML engineering boils down to figuring out how to efficiently apply self-attention to various modalities.
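A minimal single-head self-attention sketch makes the pairwise structure visible: the score matrix explicitly compares every position against every other position. (No masking, multiple heads, or learned output projection here; shapes are illustrative.)

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Bare-bones single-head self-attention. x: (seq_len, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # scores[i, j] is a pairwise interaction between positions i and j.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Softmax over each row turns scores into mixing weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Every position's output is a mixture of every position's value.
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```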
Convolution is another popular way to achieve information entanglement. The problem with convolutions is that they cannot (easily) express pairwise relationships between distant positions, since each output only sees a fixed local window, and thus they are less efficient at learning many types of relationships than self-attention.
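The locality constraint is easy to see in a 1D convolution (the signal and kernel here are arbitrary):

```python
import numpy as np

# A 1D convolution mixes information only within a fixed window.
# Contrast with self-attention, where every position can interact
# with every other position directly.
signal = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.25, 0.5, 0.25])  # 3-wide smoothing window

out = np.convolve(signal, kernel, mode="valid")
print(out)  # each output depends on only 3 neighboring inputs
```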
In order to be usable, neural networks must be trainable! Right now, our best known method of training a neural network is back-propagation. To make this feasible, you must ensure that your neural network is numerically stable in both the forward pass and the backward pass, all the way back to the inputs.
This is where normalization and residual layers come in. The goal of normalization is to ensure that the output of any module in a neural network is (at least initially) confined to a standard normal distribution (i.e. the outputs have mean 0 and standard deviation 1). If the outputs of all operations obey this rule, the network weights will generally receive stable gradients and the entire system will be trainable.
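A layer-norm-style sketch of what "mean 0, standard deviation 1" looks like (the learned scale and shift parameters of a real LayerNorm are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize a vector to mean 0, std 1 (no learned scale/shift)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

x = np.array([2.0, 4.0, 6.0, 8.0])
y = layer_norm(x)
print(y.mean(), y.std())  # approximately 0.0 and 1.0
```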
For most practical purposes, normalization should only occur at the start of a residual branch.
Residual branches are another way of stabilizing the gradients that train a neural network. In a residual network, all weights have access to the same training signal at back-propagation time. The easiest way to implement an effective residual network is to ensure that all of the weights of the last affine operation in each residual branch are initialized to zero. The ideal neural network consists of an embedding layer, a single trunk with however many zero-initialized residual layers you want, and an unembedding layer at the end. This topology is practically guaranteed to be numerically stable and to train well.
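A sketch of one such residual branch, with the last affine zero-initialized (dimensions and the norm-then-MLP ordering are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_hidden = 8, 32

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# One residual branch: norm -> affine -> nonlinearity -> affine.
W1 = rng.normal(scale=d**-0.5, size=(d, d_hidden))
W2 = np.zeros((d_hidden, d))  # last affine initialized to zero

def residual_block(x):
    h = np.maximum(layer_norm(x) @ W1, 0.0)  # ReLU
    # Because W2 is zero at init, the branch contributes nothing
    # and the whole block starts out as the identity function.
    return x + h @ W2

x = rng.normal(size=(d,))
assert np.allclose(residual_block(x), x)  # identity at initialization
```

Starting every block at the identity means the network begins as a clean pass-through from embedding to unembedding, and each branch only gradually learns to contribute.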
An effective neural network must be able to represent complex data distributions. Complex data distributions are, practically by definition, non-linear.
The sole purpose of non-linearities (ReLU, GELU, softmax, etc) is to force neural networks to operate in non-linear domains. Put another way: non-linearities give the output distributions of neural networks their shape!
I’m of the opinion that this is by far the simplest aspect of designing neural networks, because the rule is pretty cut and dried: no two linear operations (e.g. matrix multiplies) should be placed in series on the same data without a nonlinearity between them, and every nonlinearity should be followed by a normalization layer. That’s it. If you follow these rules it doesn’t even matter much which non-linearity you choose: the worst possible choice and the best possible choice are generally within a 5% performance gap.
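Applying the rule repeatedly gives a stack of linear, then nonlinearity, then normalization. A sketch (using a tanh approximation of GELU; the width and depth are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def gelu(x):
    # Common tanh approximation of GELU; per the rule above, the
    # specific nonlinearity chosen here matters little.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = rng.normal(size=(d,))
for _ in range(3):
    W = rng.normal(scale=d**-0.5, size=(d, d))
    # Linear -> nonlinearity -> normalization; never two linears
    # back-to-back on the same data.
    x = layer_norm(gelu(x @ W))

print(x.shape)  # (16,)
```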