Paper
The paper presents Actor-Learner Distillation (ALD), a technique for reinforcement learning that addresses the computational constraints of running complex models during acting/inference.
Core Problem
In reinforcement learning, agents must act while they learn. Many real-world applications (robotics, distributed RL) place strict latency constraints on the acting policy: the model must produce an action within a fixed time budget. This rules out large, powerful models such as transformers, which achieve better sample efficiency but are too expensive to run at action time.
The ALD Solution
ALD uses two separate models (both sketched in code below):
- Learner model: Large capacity model (e.g., transformer) that trains using RL but never directly acts in the environment
- Actor model: Small, fast model (e.g., LSTM) that collects data by acting in the environment
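A minimal PyTorch-style sketch of the two models, assuming discrete actions; the class names, hidden sizes, and layer counts are illustrative choices, not the paper's actual architectures.

```python
import torch
import torch.nn as nn


class ActorLSTM(nn.Module):
    """Small, fast model that acts in the environment."""

    def __init__(self, obs_dim, num_actions, hidden=128):
        super().__init__()
        self.core = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, num_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs_seq, state=None):
        # obs_seq: (batch, time, obs_dim)
        h, state = self.core(obs_seq, state)
        return self.policy_head(h), self.value_head(h).squeeze(-1), state


class LearnerTransformer(nn.Module):
    """Large-capacity model that is trained with RL but never acts."""

    def __init__(self, obs_dim, num_actions, d_model=256, layers=4, heads=8):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.core = nn.TransformerEncoder(layer, layers)
        self.policy_head = nn.Linear(d_model, num_actions)
        self.value_head = nn.Linear(d_model, 1)

    def forward(self, obs_seq):
        # Positional encoding and a causal attention mask are omitted for brevity.
        h = self.core(self.embed(obs_seq))
        return self.policy_head(h), self.value_head(h).squeeze(-1)
```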
The key innovation is continual online distillation between these models during training:
- The actor model collects trajectories using its fast inference
- The learner model trains on this data using standard RL algorithms
- Simultaneously, the actor model is trained to mimic the learner through distillation losses (a loss sketch follows this list):
  - Policy distillation: KL divergence between the actor and learner policies
  - Value distillation: MSE between the actor and learner value functions
- The learner is also regularized toward the actor policy for smoother optimization
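A hedged sketch of the three loss terms described above, again in PyTorch. The KL directions, the placement of gradient stopping, and the omission of loss weights are assumptions for illustration and may differ from the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def distillation_losses(actor_logits, learner_logits,
                        actor_values, learner_values):
    """Return (policy_distill, value_distill, learner_reg) loss terms.

    Gradients are stopped on the "target" side of each term, so the actor
    is pulled toward the learner while the learner is only regularized
    toward the actor (illustrative choice, not necessarily the paper's).
    """
    actor_dist = torch.distributions.Categorical(logits=actor_logits)
    learner_dist = torch.distributions.Categorical(logits=learner_logits)

    # Policy distillation: actor matches the (fixed) learner policy.
    fixed_learner = torch.distributions.Categorical(
        logits=learner_logits.detach())
    policy_distill = torch.distributions.kl_divergence(
        fixed_learner, actor_dist).mean()

    # Value distillation: actor value regresses onto the learner's value.
    value_distill = F.mse_loss(actor_values, learner_values.detach())

    # Learner regularization: keep the learner close to the (fixed) actor policy.
    fixed_actor = torch.distributions.Categorical(logits=actor_logits.detach())
    learner_reg = torch.distributions.kl_divergence(
        learner_dist, fixed_actor).mean()

    return policy_distill, value_distill, learner_reg
```

In practice these terms would be added, with separate coefficients, to the actor's and learner's respective training objectives.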
Key Results
On memory-intensive environments (I-Maze, Meta-Fetch):
- ALD achieves sample efficiency close to that of the large transformer learner
- Maintains wall-clock training time comparable to that of the fast LSTM actor
- Outperforms an asymmetric actor-critic baseline that uses an LSTM for the policy and a transformer for the value function
- The ratio of distillation steps per RL step is critical: more distillation steps improve sample efficiency (see the training-loop sketch below)