ReLU² wins: sparse LLMs

Paper

i know about transformers and llms, but for this paper, what are they talking about when they say sparse/sparsity?

In this paper, "sparse" or "sparsity" refers to a pattern in neural networks where only a portion of neurons are actively contributing to the computation for any given input, while many others have negligible impact and could be skipped during inference.

The key concepts related to sparsity in this paper are:

  1. Sparse activation - The phenomenon where certain model parameters contribute very weakly for a given input, meaning these parameters can be excluded during inference with minimal impact on the final result
  2. Sparse computation - A technique to optimize LLM inference in low-resource scenarios by dynamically skipping the computation of inactive neurons
  3. Magnitude-based sparsity - Traditionally, sparse activation was identified in ReLU-based models by looking only at neurons with exactly zero activation values. The paper broadens this definition to neuron output magnitudes: even if a neuron's activation isn't exactly zero, an output magnitude below a chosen threshold is treated as negligible and the neuron can be skipped
  4. Long-tail distribution - The researchers found that neuron output magnitudes follow a long-tail distribution, with many neurons producing outputs so small that they contribute minimally to the final result
  5. Sparsity ratio - The proportion of neurons that can be treated as inactive (output magnitude below the threshold) out of the total number of neurons (see the sketch after this list)

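As a rough illustration of items 3 and 5, here is a minimal PyTorch sketch of a magnitude-based sparsity ratio. The threshold value and the toy activation tensor are illustrative assumptions, not the paper's exact measurement procedure.

```python
import torch

def magnitude_sparsity(neuron_outputs: torch.Tensor, threshold: float) -> float:
    """Fraction of neurons whose output magnitude falls below `threshold`.

    `neuron_outputs` is assumed to hold the per-neuron outputs of one FFN
    layer for a single token (shape: [num_neurons]).
    """
    inactive = neuron_outputs.abs() < threshold
    return inactive.float().mean().item()

# Toy data with a long-tail-like shape: most magnitudes end up tiny.
torch.manual_seed(0)
outputs = torch.randn(4096) * torch.rand(4096) ** 4
print(f"sparsity ratio @ 0.01: {magnitude_sparsity(outputs, 0.01):.2%}")
```
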
The paper explores which activation functions (ReLU, SwiGLU, ReGLU, ReLU²) result in the best trade-offs between performance and sparsity, with the goal of making LLM inference more efficient while maintaining accuracy.
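
For reference, here is a hedged sketch of how those FFN variants are commonly defined (following the usual GLU-variant formulations; the paper's exact parameterization, e.g. bias terms or hidden sizes, may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    """Transformer feed-forward block with a selectable activation (illustrative)."""

    def __init__(self, d_model: int, d_ff: int, variant: str = "relu2"):
        super().__init__()
        self.variant = variant
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        # GLU-style variants (ReGLU, SwiGLU) use an extra gate projection.
        self.gate = nn.Linear(d_model, d_ff) if variant in ("reglu", "swiglu") else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.up(x)
        if self.variant == "relu":
            a = F.relu(h)
        elif self.variant == "relu2":      # ReLU²: square of ReLU
            a = F.relu(h) ** 2
        elif self.variant == "reglu":      # ReGLU: ReLU(up) ⊙ gate
            a = F.relu(h) * self.gate(x)
        elif self.variant == "swiglu":     # SwiGLU: SiLU(up) ⊙ gate
            a = F.silu(h) * self.gate(x)
        else:
            raise ValueError(f"unknown variant: {self.variant}")
        return self.down(a)
```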

if relu2 is just the square of relu then isn't it the same sparsity as relu, in that it's 0 iff relu is 0?

You've made an excellent observation! You're absolutely right - in terms of zero-detection, ReLU² has identical sparsity to ReLU, since:

ReLU²(x) = (max(0, x))² = 0 if and only if ReLU(x) = 0

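A quick numerical check (a toy sketch with synthetic pre-activations, not real LLM statistics) illustrates both halves of the point: the zero patterns of ReLU and ReLU² coincide exactly, yet under a fixed magnitude threshold they diverge, because squaring shrinks activations in (0, 1):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(100_000)   # toy pre-activations (assumption, not measured data)
relu = F.relu(x)
relu2 = relu ** 2

# 1) Identical zero patterns: ReLU²(x) == 0 exactly when ReLU(x) == 0.
assert torch.equal(relu == 0, relu2 == 0)

# 2) With a magnitude threshold (0.1 here, chosen arbitrarily), more ReLU²
#    outputs fall below it, since a² < a for a in (0, 1).
t = 0.1
print(f"ReLU  sparsity @ {t}: {(relu < t).float().mean().item():.2%}")
print(f"ReLU² sparsity @ {t}: {(relu2 < t).float().mean().item():.2%}")
```
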
The paper's approach actually goes beyond the traditional definition that only looks at exact zeros. They redefine sparsity based on output magnitude thresholds: