Paper

The paper presents Actor-Learner Distillation (ALD), a technique for reinforcement learning that addresses the computational constraints of running complex models during acting/inference.

Core Problem

In reinforcement learning, agents must act while they learn. Many real-world applications (robotics, distributed RL) place strict latency constraints on the acting policy: the model must produce an action within a fixed time budget. This rules out large, powerful models such as transformers, which achieve better sample efficiency but are too expensive to run at acting time.

The ALD Solution

ALD uses two separate models:

  1. A small, fast actor model that can select actions within the latency budget while interacting with the environment
  2. A large, sample-efficient learner model (e.g., a transformer) that is trained on the collected data without acting-time latency constraints
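
A minimal sketch of what these two models might look like, written in PyTorch with hypothetical sizes and discrete actions (the architectures and hyperparameters here are illustrative assumptions, not the paper's exact configuration): a small recurrent actor cheap enough to meet the acting latency budget, and a larger transformer learner trained without that constraint.

```python
import torch
import torch.nn as nn


class SmallActor(nn.Module):
    """Small recurrent policy, kept cheap enough to meet the per-step latency budget."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, h=None):
        # obs_seq: (batch, time, obs_dim)
        out, h = self.rnn(obs_seq, h)
        return self.policy_head(out), h  # action logits per timestep, recurrent state


class LargeLearner(nn.Module):
    """Larger transformer policy/value model trained off the actor's trajectories."""

    def __init__(self, obs_dim: int, n_actions: int, d_model: int = 256,
                 n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.policy_head = nn.Linear(d_model, n_actions)
        self.value_head = nn.Linear(d_model, 1)

    def forward(self, obs_seq):
        x = self.encoder(self.embed(obs_seq))
        return self.policy_head(x), self.value_head(x)  # action logits, value estimates
```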

The key innovation is continual online distillation between these models during training:

  1. The actor model collects trajectories using its fast inference
  2. The learner model trains on this data using standard RL algorithms
  3. Simultaneously, the actor model is trained to mimic the learner's policy through distillation losses (one possible form is sketched after this list)
  4. The learner is also regularized toward the actor policy, which smooths its optimization
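
A hedged sketch of how the distillation terms in steps 3 and 4 could be written, assuming discrete actions and that both models produce action logits for the same batch of trajectories; the KL-style cross-entropy form and the weights alpha and beta are illustrative assumptions, not necessarily the paper's exact objectives.

```python
import torch
import torch.nn.functional as F


def actor_distillation_loss(actor_logits: torch.Tensor,
                            learner_logits: torch.Tensor) -> torch.Tensor:
    """Step 3: train the small actor to mimic the (detached) learner policy."""
    learner_probs = F.softmax(learner_logits.detach(), dim=-1)  # teacher targets, no gradient
    actor_log_probs = F.log_softmax(actor_logits, dim=-1)
    # Cross-entropy against the learner's distribution; equals KL(learner || actor) up to a constant.
    return -(learner_probs * actor_log_probs).sum(dim=-1).mean()


def learner_regularization_loss(learner_logits: torch.Tensor,
                                actor_logits: torch.Tensor) -> torch.Tensor:
    """Step 4: regularize the learner toward the (detached) actor policy."""
    actor_probs = F.softmax(actor_logits.detach(), dim=-1)
    learner_log_probs = F.log_softmax(learner_logits, dim=-1)
    return -(actor_probs * learner_log_probs).sum(dim=-1).mean()


# Illustrative use inside the shared training loop (rl_loss is whatever RL objective
# the learner uses, e.g. a policy-gradient loss; alpha and beta are tunable weights):
#
#   learner_loss = rl_loss + beta * learner_regularization_loss(learner_logits, actor_logits)
#   actor_loss = alpha * actor_distillation_loss(actor_logits, learner_logits)
```

Detaching the teacher's logits in each term keeps each objective from back-propagating into the other model, which matches the division of roles described in the list above.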

Key Results

On memory-intensive environments (I-Maze, Meta-Fetch):