Scalable-Softmax Is Superior for Attention, 2025
Helps with long-context attention: standard softmax attention tends to flatten out as the number of attended tokens grows, and scaling the logits with the context length counteracts that.
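A minimal sketch of the idea as I understand it (helper names are mine, and if I remember right the paper treats the scale s as a learnable per-head parameter rather than a fixed constant): multiply the logits by s·log(n), where n is the number of attended positions, before the usual softmax.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scalable_softmax(z, s=1.0):
    # Scale the logits by s * log(n), where n is the number of attended
    # positions, so the distribution doesn't flatten out as n grows.
    n = z.shape[-1]
    return softmax(s * np.log(n) * z)

# With plain softmax, the weight on the one "relevant" position shrinks as n
# grows; with the log(n) scaling it stays roughly stable.
for n in (16, 1024, 65536):
    z = np.zeros(n)
    z[0] = 2.0
    print(n, softmax(z)[0], scalable_softmax(z)[0])
```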



ReLU2 wins: sparse LLMs, 2024
Paper
I know about transformers and LLMs, but for this paper, what are they talking about when they say sparse/sparsity?
In this paper, "sparse" or "sparsity" refers to the pattern where, for any given input, only a portion of the network's neurons actively contribute to the computation, while many others have negligible impact and could be skipped during inference.
The key concepts related to sparsity in this paper are:
- Sparse activation - The phenomenon where certain model parameters contribute very weakly for a given input, meaning these parameters can be excluded during inference with minimal impact on the final result
- Sparse computation - A technique to optimize LLM inference in low-resource scenarios by dynamically skipping the computation of inactive neurons
- Traditionally, sparse activation was identified in ReLU-based models by looking for neurons with exactly zero activation values. The paper broadens this to neuron output magnitudes: even if a neuron's activation isn't exactly zero, an output magnitude below a certain threshold is treated as negligible and the neuron can be skipped (see the sketch after this list)
- The researchers discovered that neuron output magnitudes follow a "long-tail" distribution, where many neurons produce outputs with very small magnitudes that contribute minimally to the final result
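To make the "skip low-magnitude neurons" idea concrete, here is a toy numpy sketch of a magnitude-based sparse FFN pass. The function names, the placement of ReLU², the threshold `tau`, and the contribution measure (|activation| × norm of the neuron's down-projection row) are my own simplification, not the paper's exact recipe.

```python
import numpy as np

def relu2(x):
    # ReLU^2: square of the ReLU output.
    return np.maximum(x, 0.0) ** 2

def sparse_ffn_forward(x, W_up, W_down, tau=1e-2):
    """Toy magnitude-based sparse FFN forward pass for a single token.

    x:      (d_model,)        input hidden state
    W_up:   (d_model, d_ff)   up projection
    W_down: (d_ff, d_model)   down projection
    tau:    magnitude threshold below which a neuron counts as inactive
    """
    h = relu2(x @ W_up)                              # (d_ff,) neuron activations
    # Rough "output magnitude" of each neuron: how much it can move the output.
    magnitude = np.abs(h) * np.linalg.norm(W_down, axis=1)
    active = magnitude > tau                         # long tail => few survive
    # Only active neurons' rows of W_down are used; a real sparse kernel would
    # never load or multiply the inactive rows at all.
    return h[active] @ W_down[active]

rng = np.random.default_rng(0)
d_model, d_ff = 64, 256
x = rng.normal(size=d_model)
W_up = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)
W_down = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)
y_sparse = sparse_ffn_forward(x, W_up, W_down)
y_dense = relu2(x @ W_up) @ W_down
print("relative error from skipping:", np.linalg.norm(y_sparse - y_dense) / np.linalg.norm(y_dense))
```

With random weights like these the long tail is mild; the paper's point is that in trained ReLU²-style models the output-magnitude distribution is heavily long-tailed, so a large fraction of neurons fall under the threshold and can be skipped with little effect on the result.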