Meta-learning, i.e. learning to learn (e.g. neural architecture search).
i know about transformers and llms, but for this paper, what are they talking about when they say sparse/sparsity?
In this paper, "sparse" or "sparsity" refers to a pattern in neural networks where only a portion of neurons are actively contributing to the computation for any given input, while many others have negligible impact and could be skipped during inference.
The paper builds several concepts around this notion of sparsity. In particular, it explores which activation functions (ReLU, SwiGLU, ReGLU, ReLU²) offer the best trade-off between performance and sparsity, with the goal of making LLM inference more efficient while maintaining accuracy.
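For reference, here is a minimal sketch of these activation functions and a naive zero-based sparsity measurement, assuming their standard definitions (the gated variants split the hidden vector into a value half and a gate half). This is an illustration on random inputs, not the paper's setup:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_squared(x):
    # ReLU^2: the ReLU output, squared
    return np.maximum(0.0, x) ** 2

def reglu(x):
    # ReGLU: split features into a value half and a gate half; gate with ReLU
    v, g = np.split(x, 2, axis=-1)
    return v * relu(g)

def swiglu(x):
    # SwiGLU: same split, but the gate uses SiLU, i.e. g * sigmoid(g)
    v, g = np.split(x, 2, axis=-1)
    return v * (g / (1.0 + np.exp(-g)))

def zero_sparsity(a):
    # Fraction of outputs that are exactly zero and could be skipped
    return float(np.mean(a == 0.0))

x = np.random.randn(1024, 512)
for name, fn in [("ReLU", relu), ("ReLU^2", relu_squared),
                 ("ReGLU", reglu), ("SwiGLU", swiglu)]:
    print(f"{name:7s} zero-sparsity: {zero_sparsity(fn(x)):.2f}")
```

On Gaussian inputs this prints roughly 0.5 for ReLU, ReLU², and ReGLU, and close to 0 for SwiGLU, since SiLU is almost never exactly zero.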
if relu2 is just the square of relu then isn't it the same sparsity as relu, in that it's 0 iff relu is 0?
Good observation, and you're right: in terms of exact zeros, ReLU² has identical sparsity to ReLU, since:
ReLU²(x) = (max(0, x))² = 0 if and only if ReLU(x) = 0
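Here is a quick numerical check of this, plus a preview of how the picture changes once near-zero magnitudes are counted (a NumPy sketch; the 0.1 threshold is arbitrary and illustrative, not the paper's criterion):

```python
import numpy as np

x = np.random.randn(1_000_000)
relu_out = np.maximum(0.0, x)
relu2_out = relu_out ** 2

# Exactly-zero outputs coincide, so zero-based sparsity is identical.
assert np.array_equal(relu_out == 0.0, relu2_out == 0.0)
print("exact-zero sparsity, ReLU:  ", np.mean(relu_out == 0.0))
print("exact-zero sparsity, ReLU^2:", np.mean(relu2_out == 0.0))

# Under a magnitude threshold, squaring pushes small positive activations
# (those below 1) even closer to zero, so ReLU^2 looks sparser.
tau = 0.1  # illustrative threshold, not the paper's choice
print("|out| < tau, ReLU:  ", np.mean(np.abs(relu_out) < tau))
print("|out| < tau, ReLU^2:", np.mean(np.abs(relu2_out) < tau))
```

The two exact-zero fractions match (about 0.5 on Gaussian input), while the below-threshold fraction is noticeably higher for ReLU², which is the kind of difference a magnitude-based definition of sparsity can capture.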
The paper, however, goes beyond the traditional definition, which counts only exact zeros, and instead redefines sparsity in terms of output magnitude thresholds: