Rho Loss, 2022: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt
Idea: curriculum that prioritizes points that have high training loss but low holdout loss
Motivation: prioritize points that are (1) not yet learnt, (2) learnable (not noise/mislabeled), and (3) worth learning (relevant to the holdout distribution)
2 and 3 are covered by “irreducible holdout loss”
Irreducible holdout loss: the loss on a point under a model trained on the holdout set, i.e. the loss that remains even after training on other data
Formula: rho-loss = L[y | x; Dt] - L[y | x; Dho], where Dt is the training data seen so far, Dho is a held-out set, and L[y | x; D] is the loss on (x, y) under a model trained on D
The exact criterion is intuitive but too slow to compute directly: "It would be too expensive to naively train on every candidate point and evaluate the holdout loss each time."
Approximation: don't retrain per candidate; score points with the current model's training loss, and approximate the irreducible holdout loss with a separate (possibly smaller) model trained once on the holdout set
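A minimal sketch of the resulting selection rule (names like `model`, `irreducible_losses`, and the top-k size are illustrative, not from the paper's code):

```python
import torch
import torch.nn.functional as F

def rho_loss_select(model, xb, yb, irreducible_losses, k):
    """Pick the k points in a candidate batch with the highest RHO loss.

    irreducible_losses: per-example holdout losses for this batch, precomputed
    once by a model trained on the holdout set (illustrative sketch).
    """
    with torch.no_grad():
        logits = model(xb)
        train_losses = F.cross_entropy(logits, yb, reduction="none")
    rho = train_losses - irreducible_losses   # high = not yet learnt, but learnable/worth learning
    top = torch.topk(rho, k).indices          # train only on these points
    return xb[top], yb[top]
```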
Meta learning, i.e. learning to learn. E.g. neural architecture search
Shifting from VQ to simpler LFQ / FSQ tokenizer. Also allows much larger codebooks.
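Rough sketch of the FSQ idea (level counts and the exact bounding function are illustrative; the paper handles even level counts with a half-step offset):

```python
import torch

def fsq_quantize(z, levels=(7, 5, 5, 5)):
    """Finite Scalar Quantization sketch: bound each latent dim, round it to a
    fixed integer grid, and pass gradients straight through. The implicit
    codebook size is the product of `levels` (7*5*5*5 = 875 here), so there is
    no codebook to learn and it can be made very large."""
    levels = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (levels - 1) / 2
    bounded = torch.tanh(z) * half        # each dim bounded to (-half, half)
    quantized = torch.round(bounded)      # snap to the integer grid
    # straight-through estimator: forward uses quantized, backward uses bounded
    return bounded + (quantized - bounded).detach()
```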
Learning independent causal mechanisms, Parascandolo 2018
Only one expert is selected per input (argmax over the experts' scores), and only that expert is updated, so only it gets better at that task. Eventual specialization: each expert ends up winning, and training on, inputs from a single mechanism/transformation
These mechanisms can transfer to other domains, e.g. beyond MNIST. Experts can generalize well because each one focuses on a narrow task and can therefore use a simpler architecture with fewer parameters, which generalizes better.
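A minimal sketch of the winner-take-all update, assuming a discriminator that scores how close an expert's output is to the clean data distribution (names are illustrative, and the winner is picked per batch here for brevity rather than per example):

```python
import torch

def competitive_step(experts, optimizers, discriminator, x_transformed):
    """One winner-take-all update: every expert proposes an output, the
    discriminator scores them, and only the winning expert is trained."""
    with torch.no_grad():
        scores = torch.stack([discriminator(e(x_transformed)).mean() for e in experts])
    winner = int(scores.argmax())               # expert the discriminator likes most
    out = experts[winner](x_transformed)
    loss = -discriminator(out).mean()           # winner tries to look like clean data
    optimizers[winner].zero_grad()
    loss.backward()
    optimizers[winner].step()                   # only the winner improves
    return winner, loss.item()
```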
Neural GPUs, 2016
Both Neural GPUs and Neural Turing Machines handle the same problem: learning algorithms by example, where the inputs and outputs are arbitrarily long strings over a finite alphabet. The model NTMs use is an LSTM controller along with an array-structured external memory, where reading and writing operations involve a soft attention over the whole array. The authors demonstrate that this external memory allows the controller to handle longer input and output strings than a vanilla LSTM. The Neural GPU paper introduces a model that doesn't use an external memory but instead changes the recurrent cell definition so that state vectors suffice. There are several changes involved, but the most significant is that cell outputs are functions of several convolutional layers applied to the state vector instead of a matrix multiplication of the state vector with learned parameters. This enables Neural GPUs to learn binary addition and multiplication on 2000-digit inputs with no error, which in their experiments was not achievable with a simplified NTM (one that uses purely content-based addressing, whereas the full NTM read and write operations also involve interpolation and rotation steps). This is an intuitive result because many algorithms (such as grade-school arithmetic) involve repeated processing of local information, which convolutions handle more naturally and with less work than general matrix multiplication.
Note that the main ideas in these papers aren’t incompatible. A Neural GPU style controller with an NTM style external memory might be a good model for some problems that could benefit from more memory.
(source)
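A rough sketch of a Neural GPU style convolutional gated recurrent cell (channel count and kernel size are illustrative):

```python
import torch
import torch.nn as nn

class CGRU(nn.Module):
    """Convolutional GRU cell in the spirit of the Neural GPU: the state is a
    (channels, width, height) grid, and all gates are convolutions of the
    state rather than matrix multiplies."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.update = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.reset = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.candidate = nn.Conv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, s):
        u = torch.sigmoid(self.update(s))        # update gate
        r = torch.sigmoid(self.reset(s))         # reset gate
        c = torch.tanh(self.candidate(r * s))    # candidate state
        return u * s + (1 - u) * c               # convex mix of old and new state
```

The paper runs the cell for a number of steps tied to the input length, so repeated local convolutional updates can propagate information across the whole state grid, which is what lets it carry out digit-by-digit arithmetic.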
Neural Turing Machine (NTM), DeepMind 2014
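A minimal sketch of the content-based addressing mentioned in the comparison above (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def content_addressing_read(memory, key, beta):
    """NTM-style content-based read: weight each memory row by the softmax of
    its cosine similarity to the controller's key, sharpened by the key
    strength beta, then return the weighted sum of rows."""
    # memory: (N, M) rows, key: (M,), beta: positive scalar
    sims = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)   # (N,)
    weights = torch.softmax(beta * sims, dim=0)                   # attention over rows
    return weights @ memory                                       # (M,) read vector
```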