ReLU² wins: sparse LLMs

Paper

i know about transformers and llms, but for this paper, what are they talking about when they say sparse/sparsity?

In this paper, "sparse" or "sparsity" refers to a pattern in neural networks where only a portion of neurons are actively contributing to the computation for any given input, while many others have negligible impact and could be skipped during inference.

The key concepts related to sparsity in this paper are:

  1. Sparse activation - The phenomenon where certain model parameters contribute very weakly for a given input, meaning these parameters can be excluded during inference with minimal impact on the final result
  2. Sparse computation - A technique to optimize LLM inference in low-resource scenarios by dynamically skipping the computation of inactive neurons
  3. Magnitude-based sparsity - Traditionally, sparse activation was identified in ReLU-based models by looking only at neurons with exactly zero activation values. The paper broadens this definition to neuron output magnitudes: even if a neuron's activation isn't exactly zero, an output magnitude below a chosen threshold is treated as negligible and the neuron can be skipped
  4. Long-tail distribution - The researchers found that neuron output magnitudes follow a long-tail distribution, with many neurons producing outputs so small that they contribute minimally to the final result
  5. Sparsity ratio - The proportion of neurons that can be treated as inactive (output magnitude below the threshold) out of the total number of neurons (see the sketch after this list)

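As a rough illustration of items 3 and 5, here is a minimal PyTorch sketch of a magnitude-based sparsity ratio. The threshold value and the toy activation tensor are illustrative assumptions, not the paper's exact measurement procedure.

```python
import torch

def magnitude_sparsity(neuron_outputs: torch.Tensor, threshold: float) -> float:
    """Fraction of neurons whose output magnitude falls below `threshold`.

    `neuron_outputs` is assumed to hold the per-neuron outputs of one FFN
    layer for a single token (shape: [num_neurons]).
    """
    inactive = neuron_outputs.abs() < threshold
    return inactive.float().mean().item()

# Toy data with a long-tail-like shape: most magnitudes end up tiny.
torch.manual_seed(0)
outputs = torch.randn(4096) * torch.rand(4096) ** 4
print(f"sparsity ratio @ 0.01: {magnitude_sparsity(outputs, 0.01):.2%}")
```
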
The paper explores which activation functions (ReLU, SwiGLU, ReGLU, ReLU²) result in the best trade-offs between performance and sparsity, with the goal of making LLM inference more efficient while maintaining accuracy.
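
For reference, here is a hedged sketch of how those FFN variants are commonly defined (following the usual GLU-variant formulations; the paper's exact parameterization, e.g. bias terms or hidden sizes, may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    """Transformer feed-forward block with a selectable activation (illustrative)."""

    def __init__(self, d_model: int, d_ff: int, variant: str = "relu2"):
        super().__init__()
        self.variant = variant
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        # GLU-style variants (ReGLU, SwiGLU) use an extra gate projection.
        self.gate = nn.Linear(d_model, d_ff) if variant in ("reglu", "swiglu") else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.up(x)
        if self.variant == "relu":
            a = F.relu(h)
        elif self.variant == "relu2":      # ReLU²: square of ReLU
            a = F.relu(h) ** 2
        elif self.variant == "reglu":      # ReGLU: ReLU(up) ⊙ gate
            a = F.relu(h) * self.gate(x)
        elif self.variant == "swiglu":     # SwiGLU: SiLU(up) ⊙ gate
            a = F.silu(h) * self.gate(x)
        else:
            raise ValueError(f"unknown variant: {self.variant}")
        return self.down(a)
```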

if relu2 is just the square of relu then isn't it the same sparsity as relu, in that it's 0 iff relu is 0?

You've made an excellent observation! You're absolutely right - in terms of zero-detection, ReLU² has identical sparsity to ReLU, since:

ReLU²(x) = (max(0, x))² = 0 if and only if ReLU(x) = 0

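A quick numerical check (a toy sketch with synthetic pre-activations, not real LLM statistics) illustrates both halves of the point: the zero patterns of ReLU and ReLU² coincide exactly, yet under a fixed magnitude threshold they diverge, because squaring shrinks activations in (0, 1):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(100_000)   # toy pre-activations (assumption, not measured data)
relu = F.relu(x)
relu2 = relu ** 2

# 1) Identical zero patterns: ReLU²(x) == 0 exactly when ReLU(x) == 0.
assert torch.equal(relu == 0, relu2 == 0)

# 2) With a magnitude threshold (0.1 here, chosen arbitrarily), more ReLU²
#    outputs fall below it, since a² < a for a in (0, 1).
t = 0.1
print(f"ReLU  sparsity @ {t}: {(relu < t).float().mean().item():.2%}")
print(f"ReLU² sparsity @ {t}: {(relu2 < t).float().mean().item():.2%}")
```
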
The paper's approach actually goes beyond the traditional definition that only looks at exact zeros. They redefine sparsity based on output magnitude thresholds: