Meta-learning, i.e. learning to learn, e.g. neural architecture search.
Shifting from VQ to the simpler LFQ / FSQ tokenizers (lookup-free / finite scalar quantization), which drop the learned codebook and also allow much larger codebooks.
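A minimal sketch of the FSQ idea (assumed shapes and helper names, not any paper's actual code; real implementations also add a straight-through gradient for training): each latent dimension is squashed, rounded to a few fixed levels, and the digits are read as a token id, so the "codebook" is implicit and can be made very large just by adding dimensions or levels.

```python
import numpy as np

def fsq_tokenize(z, levels):
    """FSQ sketch: bound each latent dim with tanh, round it to one of
    `levels[i]` uniform levels, and read the digits as one integer token."""
    z = np.tanh(np.asarray(z, dtype=float))               # squash to (-1, 1)
    L = np.asarray(levels)
    digits = np.round((z + 1) / 2 * (L - 1)).astype(int)  # 0..L-1 per dim
    q = digits * 2.0 / (L - 1) - 1                        # quantized latent in [-1, 1]
    token = 0
    for d, l in zip(digits, levels):                      # mixed-radix digits -> token id
        token = token * l + int(d)
    return q, token

levels = [8, 8, 8]   # implicit codebook of 8*8*8 = 512 codes, nothing learned
q, token = fsq_tokenize([0.3, -1.2, 2.0], levels)
```

Scaling `levels` to e.g. `[8, 8, 8, 5, 5, 5]` gives a 64,000-entry codebook with no codebook-collapse problem, which is the practical appeal over VQ.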
Learning independent causal mechanisms, Parascandolo 2018
Only one expert is selected per example (argmax of the selection score), and only that expert is trained, so only it becomes better at the task. This drives eventual specialization.
These mechanisms can transfer to other domains, e.g. beyond MNIST. Experts can generalize well because each one focuses on a narrow task and can therefore use a simpler architecture with fewer parameters, which generalizes better.
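A toy winner-take-all sketch of this training signal (my simplification: affine experts and a paired reconstruction loss, whereas the paper uses neural experts scored by an adversarial discriminator). The key property is that only the argmax/argmin expert receives a gradient update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unknown "mechanisms" corrupt clean data; each expert is an affine
# map y = a*x + b trying to undo some mechanism.
mechanisms = [lambda x: x + 5.0,    # translation
              lambda x: -x]         # sign flip
experts = [rng.normal(size=2) for _ in range(4)]

def train_step(experts, lr=0.05):
    x = rng.normal()                         # clean sample (toy supervision)
    xc = mechanisms[rng.integers(2)](x)      # corrupted by a random mechanism
    losses = [(a * xc + b - x) ** 2 for a, b in experts]
    w = int(np.argmin(losses))               # winner-take-all selection
    a, b = experts[w]
    err = a * xc + b - x
    experts[w] = np.array([a - lr * 2 * err * xc,   # gradient step on the
                           b - lr * 2 * err])       # winner ONLY
    return w

before = [e.copy() for e in experts]
winner = train_step(experts)
```

Because losers are frozen each step, the winner on a mechanism keeps improving on it, which is the positive-feedback loop behind the specialization described above.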
Neural GPUs, 2016
Both Neural GPUs and Neural Turing Machines tackle the same problem: learning algorithms by example, where the inputs and outputs are arbitrarily long strings over a finite alphabet.

NTMs use an LSTM controller along with an array-structured external memory, where read and write operations involve soft attention over the whole array. The authors demonstrate that this external memory allows the controller to handle longer input and output strings than a vanilla LSTM.

The Neural GPU paper introduces a model that doesn’t use an external memory; instead it changes the recurrent cell definition so that state vectors suffice. There are several changes involved, but the most significant is that cell outputs are functions of several convolutional layers applied to the state vector, instead of a matrix multiplication of the state vector with learned parameters. This lets the Neural GPU learn binary addition and multiplication on 2,000-digit inputs with no error, which in their experiments was not achievable with a simplification of NTMs (one that does purely content-based addressing, whereas the full NTM read and write operations also involve interpolation and rotation steps). This is an intuitive result, because many algorithms (such as grade-school arithmetic) involve repeated processing of local information, which convolutions perform more naturally and with less work than general matrix multiplication.
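The convolutional recurrent cell can be sketched roughly as a CGRU-style update (assumed shapes and parameter names; the actual Neural GPU adds further details such as stacked cells and gate tweaks). Gates and the candidate state come from convolutions over the state "tape" rather than dense matmuls:

```python
import numpy as np

def conv1d_same(s, W, b):
    # 1-D convolution over the state tape: s is (width, m), W is (k, m, m);
    # zero padding keeps the tape width fixed across steps.
    k, m, _ = W.shape
    pad = k // 2
    sp = np.pad(s, ((pad, pad), (0, 0)))
    out = np.stack([sum(sp[i + j] @ W[j] for j in range(k))
                    for i in range(len(s))])
    return out + b

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cgru_step(s, params):
    # GRU-shaped update where every transform is a convolution of the state.
    u = sigmoid(conv1d_same(s, *params["update"]))    # update gate
    r = sigmoid(conv1d_same(s, *params["reset"]))     # reset gate
    c = np.tanh(conv1d_same(r * s, *params["cand"]))  # candidate state
    return u * s + (1 - u) * c

rng = np.random.default_rng(1)
width, m, k = 8, 4, 3                   # tape length, channels, kernel size
params = {name: (rng.normal(scale=0.1, size=(k, m, m)), np.zeros(m))
          for name in ("update", "reset", "cand")}
s = rng.normal(size=(width, m))         # embedded input lives in the state
s_next = cgru_step(s, params)
```

Running the same cell for O(width) steps gives every tape position a chance to exchange local information with its neighbors, which is what makes digit-by-digit algorithms like carrying in addition learnable.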
Note that the main ideas in these papers aren’t incompatible: a Neural GPU-style controller with an NTM-style external memory might be a good model for some problems that could benefit from more memory.
(source)
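The purely content-based addressing used in the NTM simplification mentioned above can be sketched as follows (hypothetical helper names; the full NTM additionally interpolates with the previous weights and applies a shift/rotation and sharpening):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_read(M, key, beta):
    # Content-based addressing: attend over all memory rows by cosine
    # similarity to the key, sharpened by the scalar beta.
    sims = M @ key / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    w = softmax(beta * sims)   # soft read weights over the whole array
    return w @ M, w            # read vector is a weighted sum of rows

M = np.array([[1.0, 0.0, 0.0],    # toy 3-slot memory
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
read, w = content_read(M, key=np.array([0.9, 0.1, 0.0]), beta=10.0)
```

Because every read touches every row, this addressing alone cannot express "move one slot left", which is what the interpolation and rotation steps of the full NTM add.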
Neural Turing machine (NTM), Deepmind 2014