Chat

This paper introduces Pass-at-k Policy Optimization (PKPO), a technique to improve reinforcement learning by optimizing for the best reward among multiple samples rather than the average reward.

The Setup

We can get an unbiased estimator of this.

The Counting Logic

Let's use a concrete example: n=5 attempts, c=2 correct, k=3 picks

Your attempts look like: [❌, ✓, ❌, ✓, ❌]

Question: If you pick 3 attempts randomly, what's the chance you get at least one ✓?

Easier to calculate: What's the chance you get NO ✓'s (all ❌)?

So:

The formula: $\text{pass@k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$

Why This Works for Gradients

The clever part: When computing gradients, they need to assign a "reward" to each individual attempt. They do this by asking: "How much does this attempt contribute to the pass@k?"