This paper introduces Pass-at-k Policy Optimization (PKPO), a technique to improve reinforcement learning by optimizing for the best reward among multiple samples rather than the average reward.
Rather than evaluating pass@k on a single fixed set of $k$ samples, we can draw $n \geq k$ samples and compute an unbiased estimator of pass@k from them.
Let's use a concrete example: $n=5$ attempts, $c=2$ correct, $k=3$ picks.
Your attempts look like: [❌, ✓, ❌, ✓, ❌]
Question: If you pick 3 attempts randomly, what's the chance you get at least one ✓?
Easier to calculate the complement: what's the chance you get NO ✓'s (all ❌)? For that, all 3 picks must come from the 3 incorrect attempts: $\binom{3}{3}/\binom{5}{3} = 1/10$. So pass@3 $= 1 - 1/10 = 0.9$, and in general:
The formula: $\text{pass@k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$
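This formula is a one-liner. Here's a minimal sketch (function name `pass_at_k` is my own choice) that reproduces the worked example above:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that a uniformly random subset of k of the n
    attempts (c of which are correct) contains at least one ✓."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so every size-k
        # subset must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(5, 2, 3))  # the worked example: n=5, c=2, k=3 → 0.9
```

The guard clause matters: when $n - c < k$, $\binom{n-c}{k}$ is zero and pass@k is exactly 1.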
The clever part: when computing gradients, each individual attempt needs its own scalar "reward". They assign it by asking: "how much does this attempt contribute to the pass@k?"
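One simple way to make "contribution to pass@k" concrete is a leave-one-out difference: score each attempt by how much the pass@k estimate drops when that attempt is removed. This is an illustrative sketch of the idea, not necessarily the paper's exact reward transformation (PKPO derives its own unbiased per-sample rewards):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def per_sample_rewards(results: list[int], k: int) -> list[float]:
    """Hypothetical leave-one-out attribution: reward_i is the
    pass@k of the full set minus the pass@k with attempt i removed.
    results is 1 for ✓, 0 for ❌."""
    n, c = len(results), sum(results)
    full = pass_at_k(n, c, k)
    return [full - pass_at_k(n - 1, c - r, k) for r in results]

print(per_sample_rewards([0, 1, 0, 1, 0], k=3))
```

On the running example this gives each ✓ a positive reward (removing it hurts pass@3) and each ❌ a negative one (removing it would actually raise pass@3, since the remaining set is more likely to yield a ✓).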