Papers

Scalable-Softmax Is Superior for Attention, 2025

Helps attention generalize to long contexts: replaces the standard softmax in attention with Scalable-Softmax (SSMax), which scales the attention logits by s · log n (n = context length) so the attention distribution doesn't flatten out as the context grows.
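
A minimal sketch of the idea, assuming the paper's SSMax formulation (plain softmax with the logits multiplied by s · log n, where n is the context length). In the paper s is a learnable parameter; the value 0.43 below is just an arbitrary placeholder.

```python
import torch

def ssmax(scores: torch.Tensor, s: float = 0.43) -> torch.Tensor:
    """Scalable-Softmax: softmax over logits scaled by s * log(n).

    scores: attention logits of shape (..., n) for a context of length n.
    The log(n) factor grows with the context, which keeps the output
    distribution from flattening as n increases ("attention fading").
    """
    n = scores.size(-1)
    scale = s * torch.log(torch.tensor(float(n)))
    return torch.softmax(scale * scores, dim=-1)

# Plain softmax dilutes a single strong logit as n grows; SSMax does not.
for n in (16, 1024, 65536):
    logits = torch.zeros(n)
    logits[0] = 5.0  # one clearly relevant position
    print(n, torch.softmax(logits, -1)[0].item(), ssmax(logits)[0].item())
```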

ReLU² Wins: Discovering Efficient Activation Functions for Sparse LLMs, 2024

Paper

I know about transformers and LLMs, but for this paper, what are they talking about when they say sparse/sparsity?

In this paper, "sparse" or "sparsity" means activation sparsity: for any given input, only a fraction of the neurons in the model's feed-forward layers produce nonzero (or non-negligible) activations, so the computation for the remaining neurons can be skipped during inference.
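
As a concrete illustration, here is a hedged sketch of activation sparsity in a feed-forward block using squared ReLU (ReLU², the activation the paper argues for); the layer sizes and names (`ReLU2FFN`, `up`, `down`) are made up for the demo.

```python
import torch
import torch.nn as nn

class ReLU2FFN(nn.Module):
    """Toy feed-forward block with a squared-ReLU activation.

    ReLU zeros out negative pre-activations, and squaring preserves
    those exact zeros, so for any single input only a subset of the
    hidden neurons fires -- this is the activation sparsity the paper
    measures and exploits.
    """
    def __init__(self, d_model: int = 64, d_hidden: int = 256):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.up(x)) ** 2  # ReLU^2: exact zeros survive
        return self.down(h)

ffn = ReLU2FFN()
x = torch.randn(1, 64)
h = torch.relu(ffn.up(x)) ** 2
print(f"{(h == 0).float().mean().item():.0%} of hidden neurons are exactly zero")
# Entries of h that are exactly zero contribute nothing to ffn.down(h),
# so a sparse inference kernel can skip the matching weights entirely.
```

With a smooth activation like GELU those values would only be near zero rather than exactly zero, which is why the choice of activation matters for how much computation can safely be skipped.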