Paper
The paper presents Actor-Learner Distillation (ALD), a technique for reinforcement learning that addresses the computational constraints of running complex models during acting/inference.
Core Problem
In reinforcement learning, agents must act while they learn. Many real-world applications (robotics, distributed RL) place strict latency constraints on the acting policy: the model must produce an action within a fixed time budget. This rules out large, powerful models such as transformers, which achieve better sample efficiency but are too expensive to run at action time.
The ALD Solution
ALD uses two separate models (both sketched in code below):
- Learner model: Large capacity model (e.g., transformer) that trains using RL but never directly acts in the environment
- Actor model: Small, fast model (e.g., LSTM) that collects data by acting in the environment
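A minimal PyTorch-style sketch of the two models, assuming discrete actions; the class names, hidden sizes, and layer counts are illustrative choices, not the paper's actual architectures.

```python
import torch
import torch.nn as nn


class ActorLSTM(nn.Module):
    """Small, fast model that acts in the environment."""

    def __init__(self, obs_dim, num_actions, hidden=128):
        super().__init__()
        self.core = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, num_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs_seq, state=None):
        # obs_seq: (batch, time, obs_dim)
        h, state = self.core(obs_seq, state)
        return self.policy_head(h), self.value_head(h).squeeze(-1), state


class LearnerTransformer(nn.Module):
    """Large-capacity model that is trained with RL but never acts."""

    def __init__(self, obs_dim, num_actions, d_model=256, layers=4, heads=8):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.core = nn.TransformerEncoder(layer, layers)
        self.policy_head = nn.Linear(d_model, num_actions)
        self.value_head = nn.Linear(d_model, 1)

    def forward(self, obs_seq):
        # Positional encoding and a causal attention mask are omitted for brevity.
        h = self.core(self.embed(obs_seq))
        return self.policy_head(h), self.value_head(h).squeeze(-1)
```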
The key innovation is continual online distillation between these models during training:
- The actor model collects trajectories using its fast inference
- The learner model trains on this data using standard RL algorithms
- Simultaneously, the actor model is trained to mimic the learner through distillation losses (a loss sketch follows this list):
  - Policy distillation: KL divergence between the actor and learner policies
  - Value distillation: MSE between the actor and learner value functions
- The learner is also regularized toward the actor policy for smoother optimization
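A hedged sketch of the three loss terms described above, again in PyTorch. The KL directions, the placement of gradient stopping, and the omission of loss weights are assumptions for illustration and may differ from the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def distillation_losses(actor_logits, learner_logits,
                        actor_values, learner_values):
    """Return (policy_distill, value_distill, learner_reg) loss terms.

    Gradients are stopped on the "target" side of each term, so the actor
    is pulled toward the learner while the learner is only regularized
    toward the actor (illustrative choice, not necessarily the paper's).
    """
    actor_dist = torch.distributions.Categorical(logits=actor_logits)
    learner_dist = torch.distributions.Categorical(logits=learner_logits)

    # Policy distillation: actor matches the (fixed) learner policy.
    fixed_learner = torch.distributions.Categorical(
        logits=learner_logits.detach())
    policy_distill = torch.distributions.kl_divergence(
        fixed_learner, actor_dist).mean()

    # Value distillation: actor value regresses onto the learner's value.
    value_distill = F.mse_loss(actor_values, learner_values.detach())

    # Learner regularization: keep the learner close to the (fixed) actor policy.
    fixed_actor = torch.distributions.Categorical(logits=actor_logits.detach())
    learner_reg = torch.distributions.kl_divergence(
        learner_dist, fixed_actor).mean()

    return policy_distill, value_distill, learner_reg
```

In practice these terms would be added, with separate coefficients, to the actor's and learner's respective training objectives.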
Key Results
On memory-intensive environments (I-Maze, Meta-Fetch):
- ALD achieves sample efficiency close to that of the large transformer learner
- Maintains wall-clock training time comparable to that of the fast LSTM actor
- Outperforms an asymmetric actor-critic baseline that uses an LSTM for the policy and a transformer for the value function
- The ratio of distillation steps per RL step is critical: more distillation steps improve sample efficiency (see the training-loop sketch below)