A Transformer is a stack of alternating Attention and MLP layers through which data, embedded as high-dimensional vectors, is fed. A Mixture of Experts (MoE) Transformer replaces each MLP layer with an “MoE layer”. Let’s dive into what that means.
The “MLP” is one of the oldest neural network architectures, consisting of two linear transformations with a non-linearity between them. First, an embedding vector is expanded via the first transformation. Next, a non-linearity is applied to the expanded vector. Finally, the result is contracted back to the original dimensionality.
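As a rough sketch in PyTorch (d_model and d_hidden are illustrative names for the embedding and expanded dimensions, and GELU stands in for whichever non-linearity a given model uses):

```python
import torch.nn as nn

class MLP(nn.Module):
    """The standard Transformer MLP: expand, apply a non-linearity, contract."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)    # expand the embedding
        self.act = nn.GELU()                      # non-linearity (GELU is one common choice)
        self.down = nn.Linear(d_hidden, d_model)  # contract back to the original dimension

    def forward(self, x):
        return self.down(self.act(self.up(x)))
```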
An “MoE” layer is a collection of n_experts MLP layers (the “experts”) plus a “router”. The router is an additional linear transform that outputs a probability distribution over the experts, from which the top k are selected. The embedding is fed through each of the k selected MLP layers, and their collective outputs are summed together.
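A minimal sketch of that routing logic in PyTorch; n_experts and k are illustrative hyperparameters, and note that many implementations (this sketch included) weight each selected expert’s output by its router probability before summing. Real systems also add load-balancing losses and batched expert dispatch, which are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int):
        super().__init__()
        # Each expert is an ordinary MLP like the one sketched above.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # scores every expert for every token
        self.k = k

    def forward(self, x):
        # x: (n_tokens, d_model); each token is routed independently
        probs = F.softmax(self.router(x), dim=-1)          # distribution over the experts
        weights, idx = torch.topk(probs, self.k, dim=-1)   # keep only the top-k experts per token
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):
            for w, e in zip(weights[t], idx[t]):
                # Run the token through each selected expert and sum the (weighted) outputs.
                out[t] += w * self.experts[int(e)](x[t])
        return out
```

A production implementation would dispatch tokens to experts in batched matrix multiplies rather than looping, but the control flow above is the essence of the layer.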
Effectively this means that, for each forward pass through the MoE layer, only k/n_experts of the layer’s parameters are actively used. This property of the MoE is commonly referred to as “sparsity”, with the sparsity factor of an MoE layer being n_experts/k.
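To put illustrative numbers on that: with n_experts = 64 and k = 4, each token activates only 4 of the 64 expert MLPs, so roughly 1/16 of the expert parameters are used per forward pass and the sparsity factor is 16.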
When you measure the efficiency of a neural network by the test loss it achieves when trained on the Pareto-optimal blend of data and compute (per Chinchilla), you will find that MoE Transformers are considerably more compute-efficient than vanilla Transformers. This is an example of a compute multiplier. The multiplier tends to increase with sparsity (though with diminishing returns!). Importantly, these efficiency gains persist when scaling the Transformer itself by adding layers or expanding the embedding dimension.
That last property (efficiency gains that persist with scale) is actually quite rare; most architectural modifications to the Transformer tend to diminish in effect with scale. That’s because, at large enough scale, a neural network can learn to mimic the inductive biases granted by most architectural decisions through brute-force training.
The fact that MoE has great scaling properties indicates that something deeper is at play with this architectural construct. That something turns out to be sparsity itself: it is a new free parameter in the scaling laws, and sparsity=1 (a dense model) is suboptimal. Put another way, Chinchilla scaling laws focus on the relationship between data and compute, but MoEs give us another lever: the number of parameters in a neural network. Previously compute and parameter count were proportional; sparsity allows us to modulate that ratio.
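A quick back-of-the-envelope sketch of that decoupling, using made-up dimensions purely for illustration: total (stored) expert parameters scale with n_experts, while active (computed) parameters per token scale only with k.

```python
# Hypothetical MoE layer dimensions, chosen only to illustrate the parameter/compute split.
d_model, d_hidden = 4096, 14336
n_experts, k = 64, 4

params_per_expert = 2 * d_model * d_hidden            # up- and down-projection weights
total_expert_params = n_experts * params_per_expert   # what you must store (capacity)
active_expert_params = k * params_per_expert          # what each token actually computes with

print(f"total expert params per layer:  {total_expert_params / 1e9:.2f}B")
print(f"active expert params per token: {active_expert_params / 1e9:.2f}B")
# Growing n_experts increases the first number without touching the second:
# more parameters, same per-token compute.
```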
This is the true magic behind the MoE Transformer, and why every big lab has been moving to them. It is not that we are somehow teaching a bunch of smaller subnetworks to specialize and then ensembling them; that is the type of anthropomorphism that plagued a lot of architectural research in the late 2010s. Instead, it is all about the ability to add free parameters to our models so we can train and serve them more efficiently. And it seems to be the case that most LLMs are quite underparameterized.
There’s a sad side effect to all this: at inference time with low batch sizes, on the hardware that is currently commercially available, Transformers are notoriously memory-bound. That is to say, we are often only using a small fraction of the compute cores because they are constantly waiting for network weights to be loaded from VRAM. This problem gets much worse with MoE Transformers: there are simply more weights to load. You need more VRAM and will be more memory-bound (read: worse performance). The open source community is starting to see this with the DeepSeek and Llama 4 models.
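Here is a rough bandwidth roofline in the same spirit, with assumed numbers (the bandwidth, parameter counts, and precision below are illustrative, not measurements of any particular chip or model): at batch size 1, every decoded token has to stream the active weights out of VRAM, so memory bandwidth alone caps tokens per second, while the full parameter count dictates how much VRAM you need in the first place.

```python
# Back-of-the-envelope decode ceiling from memory bandwidth alone (all numbers assumed).
bandwidth_bytes_per_s = 1.0e12   # ~1 TB/s of HBM/GDDR bandwidth on a single accelerator
active_params = 17e9             # parameters touched per decoded token (the MoE "active" count)
total_params = 100e9             # parameters that must sit in VRAM (much larger for an MoE)
bytes_per_param = 2              # fp16 / bf16 weights

weight_bytes_per_token = active_params * bytes_per_param
ceiling_tokens_per_s = bandwidth_bytes_per_s / weight_bytes_per_token
vram_for_weights_gb = total_params * bytes_per_param / 1e9

print(f"bandwidth-bound ceiling at batch size 1: ~{ceiling_tokens_per_s:.0f} tokens/s")
print(f"VRAM needed just to hold the weights:    ~{vram_for_weights_gb:.0f} GB")
```

A dense model with the same active parameter count would hit a similar per-token ceiling but need only a fraction of the VRAM; that gap is the MoE serving tax.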
Instead, MoE Transformers lend themselves best to highly distributed serving environments with ultra-large batch sizes and correspondingly high latency per token. This makes me sad, both because I like low-latency systems and because I’m a fan of local models. I sincerely hope we see some hardware innovations in the coming years that help to ameliorate this.