This paper introduces Pass-at-k Policy Optimization (PKPO), a technique to improve reinforcement learning by optimizing for the best reward among multiple samples rather than the average reward.
Rather than evaluating pass@k on a single fixed set of $k$ samples, we can draw $n \geq k$ samples and compute an unbiased estimator of pass@k from them.
Let's use a concrete example: $n=5$ attempts, $c=2$ correct, $k=3$ picks.
Your attempts look like: [❌, ✓, ❌, ✓, ❌]
Question: If you pick 3 attempts randomly, what's the chance you get at least one ✓?
Easier to calculate the complement: what's the chance you get NO ✓'s (all ❌)? For that, all 3 picks must come from the 3 incorrect attempts: $\binom{3}{3}/\binom{5}{3} = 1/10$. So pass@3 $= 1 - 1/10 = 0.9$, and in general:
The formula: $\text{pass@k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$
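This formula is a one-liner. Here's a minimal sketch (function name `pass_at_k` is my own choice) that reproduces the worked example above:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that a uniformly random subset of k of the n
    attempts (c of which are correct) contains at least one ✓."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so every size-k
        # subset must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(5, 2, 3))  # the worked example: n=5, c=2, k=3 → 0.9
```

The guard clause matters: when $n - c < k$, $\binom{n-c}{k}$ is zero and pass@k is exactly 1.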
The clever part: when computing gradients, each individual attempt needs its own scalar "reward". They assign it by asking: "how much does this attempt contribute to the pass@k?"
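One simple way to make "contribution to pass@k" concrete is a leave-one-out difference: score each attempt by how much the pass@k estimate drops when that attempt is removed. This is an illustrative sketch of the idea, not necessarily the paper's exact reward transformation (PKPO derives its own unbiased per-sample rewards):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def per_sample_rewards(results: list[int], k: int) -> list[float]:
    """Hypothetical leave-one-out attribution: reward_i is the
    pass@k of the full set minus the pass@k with attempt i removed.
    results is 1 for ✓, 0 for ❌."""
    n, c = len(results), sum(results)
    full = pass_at_k(n, c, k)
    return [full - pass_at_k(n - 1, c - r, k) for r in results]

print(per_sample_rewards([0, 1, 0, 1, 0], k=3))
```

On the running example this gives each ✓ a positive reward (removing it hurts pass@3) and each ❌ a negative one (removing it would actually raise pass@3, since the remaining set is more likely to yield a ✓).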