Thinking in RL

By Xuhui Zhou · Apr 13, 2026

RL is happening everywhere. The RL algorithms powering today's frontier models are being actively debated, reimagined, and simplified at a remarkable pace. This blog aims to summarize them at a relatively high conceptual level and tries to find synergies among the "million" RL algorithms. Many of the high-level thoughts, summaries, and metaphors come purely from my own perspective (which is mostly about LLMs), so they may or may not be absolutely true. Up for a discussion or debate anytime haha.

If you want a more comprehensive treatment alongside this post, these are the resources I'd recommend:

Before diving in, here's the landscape of RL algorithms for LLM training as of early 2026.

The RL Formulation for Language Models

Before the algorithms, we need the setup. Generating text with a language model maps naturally to a Markov Decision Process (MDP):

  • State $s_t$: the full prefix up to position $t$, i.e., the prompt plus any tokens already generated
  • Action $a_t$: the next token chosen from the vocabulary
  • Policy $\pi_\theta(a_t \mid s_t)$: the language model parameterized by $\theta$
  • Reward $R(\tau)$: a scalar signal at the end of the episode, e.g., +1 if a math answer is correct, 0 otherwise
  • Episode / trajectory: a complete generation $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$

The training objective is to maximize the expected reward over trajectories drawn from the policy:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\bigl[R(\tau)\bigr]$$

Three observations here: (1) This objective, in the LLM setting, is essentially mode-seeking. For example, in RLHF, if a certain text format tends to earn more preference (aka reward) from the human judges involved in training, then you will see that format (e.g., bullet points) show up really frequently.

(2) Rewards in LLM training are almost always sparse and outcome-level. There is no per-token signal telling the model "this word was a good choice." A trajectory might be thousands of tokens long, and the model only learns after the final token whether it did well. This makes the credit assignment problem particularly hard. That's why on-policy distillation, which supplies a dense per-token signal from a teacher, is trending these days.

(3) The reward here is a scalar value! However, for complex things in real life, judgments can come along multiple dimensions, and in this setting we need some way to squeeze them into a single scalar.

Nevertheless, the whole thing is really powerful, in a way that is starting to crack into almost "every" corner of human life.

The Policy Gradient Theorem

The Policy Gradient Theorem is the mathematical backbone of every algorithm we'll discuss:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot Q^{\pi}(s_t, a_t)\right]$$

where $Q^{\pi}(s_t, a_t)$ is the action-value function, i.e., the expected total reward from taking action $a_t$ in state $s_t$ and following policy $\pi$ thereafter.

The term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is the score function (also called the log-likelihood gradient). It points in the direction that makes the chosen token $a_t$ more probable under the current policy. Multiplying by $Q^{\pi}$ then says: scale that direction by how good the action actually is. If $Q$ is high, push hard toward this action. If low, pull away.

The problem is that we don't have $Q^{\pi}$; we need to estimate it from rollouts. How we do that estimation is exactly what distinguishes the algorithms we'll cover.
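To make "score function times $Q$" concrete, here is a minimal sketch for a toy softmax policy over a three-token vocabulary. The logits, the chosen action, and the $Q$ values are all made-up numbers, and the gradient is taken with respect to the logits rather than real model weights.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score_function(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: onehot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def policy_gradient_term(logits, action, q_value):
    """One term of the policy gradient: score function scaled by Q."""
    return [q_value * g for g in score_function(logits, action)]

logits = [0.0, 0.0, 0.0]  # hypothetical uniform policy over 3 tokens
up = policy_gradient_term(logits, action=1, q_value=2.0)     # good action
down = policy_gradient_term(logits, action=1, q_value=-2.0)  # bad action
print(up, down)
```

With a positive $Q$ the logit of the chosen token gets a positive gradient entry and the others get negative ones; flipping the sign of $Q$ flips the whole direction.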

The Advantage Function

Rather than estimating $Q^{\pi}$ directly, sometimes we prefer to work with the advantage function:

$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$$

where $V^{\pi}(s_t) = \mathbb{E}_{a \sim \pi}[Q^{\pi}(s_t, a)]$ is the state value function, aka how good state $s_t$ is on average, regardless of which action is taken.

The advantage asks a sharper question: is this particular action better or worse than what the policy would do on average in this state? That's more informative than the raw $Q$ value, because $Q$ carries a lot of "how good is this state" signal mixed in. A trajectory with reward 0.8 might be unremarkable if the policy routinely achieves 0.9 from this prompt, or exceptional if the policy usually only manages 0.3.

This substitution doesn't introduce any bias. Subtracting $V^{\pi}(s_t)$ from $Q^{\pi}(s_t, a_t)$ has zero effect on the expected gradient, because the expected score function is zero:

$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta\, 1 = 0$$

Since $V^{\pi}(s)$ does not depend on the sampled action, it factors out of the expectation and multiplies that zero.

This is the zero-gradient property, and it's the key insight we'll return to.
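A quick numeric check of that property, as a sketch on a toy softmax policy with hypothetical logits and $Q$ values: the exact expected gradient is the same with and without a baseline that does not depend on the action.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_gradient(logits, q_values, baseline=0.0):
    """Exact E_a[ grad_logits log pi(a) * (Q(a) - baseline) ] for a softmax policy."""
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a, p_a in enumerate(probs):
        for i in range(n):
            score = (1.0 if i == a else 0.0) - probs[i]
            grad[i] += p_a * score * (q_values[a] - baseline)
    return grad

logits = [0.2, -1.0, 0.5]   # hypothetical policy logits
q_values = [1.0, 0.0, 0.3]  # hypothetical action values
probs = softmax(logits)
v = sum(p * q for p, q in zip(probs, q_values))  # V = E_a[Q(a)]

g_raw = expected_gradient(logits, q_values)              # weights are Q
g_adv = expected_gradient(logits, q_values, baseline=v)  # weights are Q - V
print(g_raw, g_adv)
```

The two gradients agree up to floating-point error. What changes with a baseline is the variance of the sampled estimate, not its expectation.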

[Figure advantage-estimation not found] shows concretely how the same 8 rollouts look under each algorithm. REINFORCE, with no baseline, sees all positive advantages (since rewards are non-negative). The other methods center the advantages around zero, producing a more informative training signal — some rollouts are pushed up, others pushed down.

Online Policy Gradient

Online Policy Gradient is the blue branch from [Figure algorithm-tree not found]. Every algorithm in this branch (PPO, GRPO, REINFORCE, REINFORCE++, and the many variants below them) shares the same gradient shape: score-function × advantage. They differ only in how they estimate the advantage and how they regularize each step. We work from the most expensive (PPO, with a learned critic) down to the simplest (REINFORCE, no baseline at all).

PPO: The Trusted Workhorse

Proximal Policy Optimization was the algorithm behind early RLHF and remains the most thoroughly studied option. It makes two key innovations on top of basic policy gradient:

A learned critic

PPO trains a separate value network $V_\phi(s)$ to explicitly estimate $V^{\pi}$. In practice this is typically a copy of the policy LLM with a scalar head, updated at each training step to minimize:

$$\mathcal{L}_{\text{critic}} = \mathbb{E}_t\!\left[\bigl(V_\phi(s_t) - \hat{V}_t^{\text{target}}\bigr)^2\right]$$

With a good $V_\phi$, the advantage estimate $\hat{A}_t = R_t - V_\phi(s_t)$ is tight, giving low-variance gradients.

Generalized Advantage Estimation

Rather than the raw residual $R_t - V_\phi(s_t)$, PPO uses GAE to reduce variance further:

$$\hat{A}_t^{\text{GAE}} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

The quantity $\delta_t$ is the TD residual, and it is itself a one-sample estimate of the advantage.

The parameter $\lambda \in [0,1]$ trades bias for variance: $\lambda=0$ gives the one-step TD estimate (low variance, some bias), $\lambda=1$ gives the full Monte Carlo return (unbiased, higher variance).

A subtlety in the LLM setting: as noted above, there is no genuine per-token reward, so $r_t = 0$ for every intermediate token and $r_T = R$ only at the terminal token. Plugging this into GAE, all the $\delta_t$ for $t < T$ reduce to $\gamma V_\phi(s_{t+1}) - V_\phi(s_t)$, and only $\delta_T$ carries the actual reward signal. The critic $V_\phi$ is doing all the heavy lifting: it learns to predict the eventual outcome from each prefix, and GAE then uses those predictions to spread the terminal reward backward across tokens. This is the real reason PPO needs a well-trained critic in LLM-RL: without it, there is no per-token learning signal at all.
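The sparse-reward case is easy to trace by hand. The sketch below computes GAE with the usual backward recursion on a four-token rollout. The critic values are invented, $\gamma = 1$ is an assumption, and the value after the final token is taken to be 0 because the episode has ended.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation via the backward recursion.

    `values` has len(rewards) + 1 entries; the last one is the value after
    the final token (assumed 0 here, since the episode is over).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Sparse, outcome-level reward: zero everywhere except the last token.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.9, 0.0]  # hypothetical critic predictions

adv_td = gae(rewards, values, lam=0.0)  # one-step TD residuals
adv_mc = gae(rewards, values, lam=1.0)  # telescopes to R - V(s_t) when gamma = 1
print(adv_td)
print(adv_mc)
```

With $\lambda = 0$ every intermediate advantage is just a difference of two critic predictions, which is the sense in which the critic supplies the per-token signal.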

Clipped objective

PPO adds a constraint to prevent the policy update from being too large in any single step:

$$\mathcal{L}_{\text{PPO}} = -\frac{1}{\sum_i |\tau_i|} \sum_{i=1}^{N} \sum_{t=1}^{|\tau_i|} \min\!\Bigl(r_{i,t}(\theta)\hat{A}_{i,t},\;\text{clip}\bigl(r_{i,t}(\theta), 1{-}\epsilon, 1{+}\epsilon\bigr)\hat{A}_{i,t}\Bigr)$$

where the sum runs over $N$ rollouts $\tau_i$ in the batch and all tokens within each, and $r_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\text{old}}(a_{i,t} \mid s_{i,t})}$ is the per-token importance ratio.
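The behavior of the min/clip pair is easiest to see one token at a time. This is a minimal sketch of the per-token term inside the sum (the quantity being maximized, before the minus sign and the averaging). The value $\epsilon = 0.2$ and the log-probabilities are illustrative assumptions.

```python
import math

def ppo_token_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate for a single token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)            # importance ratio r(theta)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # clip(r, 1 - eps, 1 + eps)
    return min(ratio * advantage, clipped * advantage)

logp_old = 0.0  # hypothetical old log-prob; only the difference matters

# The ratio has grown to 1.5: a positive advantage is capped at (1 + eps) * A,
# while a negative advantage keeps the full, unclipped penalty.
grow_good = ppo_token_objective(math.log(1.5), logp_old, advantage=1.0)
grow_bad = ppo_token_objective(math.log(1.5), logp_old, advantage=-1.0)

# The ratio has shrunk to 0.5: the mirror image of the case above.
shrink_good = ppo_token_objective(math.log(0.5), logp_old, advantage=1.0)
shrink_bad = ppo_token_objective(math.log(0.5), logp_old, advantage=-1.0)
print(grow_good, grow_bad, shrink_good, shrink_bad)
```

The clip only ever removes incentive to move further in the direction the update already went; it never hides a penalty for having moved the wrong way.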

The cost: PPO requires maintaining and training a critic network that is often the same size as the policy. For a 70B-parameter LLM, that is a second 70B model in memory, plus all the optimizer states. This makes PPO expensive.

GRPO: Eliminating the Critic

Group Relative Policy Optimization, introduced in DeepSeekMath and later central to DeepSeek-R1, makes one clean simplification: estimate the baseline from the policy's own rollouts instead of a critic.

For each prompt $q$, GRPO samples a group of $G$ responses $\{\tau_1, \ldots, \tau_G\}$ from the current policy and uses their mean as the baseline:

$$\hat{A}_i = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)}$$

No critic. No value network. The group mean is the baseline.

Notice what $R_i$ is doing here: it's standing in for $Q$. The action is the entire response, and the episode ends the moment it's produced, so the observed return $R_i$ is exactly the Monte Carlo action-value $Q^{\pi}(q, \tau_i)$. With the group mean estimating $V^{\pi}(q)$, the GRPO advantage $R_i - \text{mean}(R_j)$ is just $Q - V$, with both terms estimated by sampling instead of learned by a critic.

Think of it as grading on a curve. The $G$ responses are students sitting the same exam $q$, and their mean reward is the class average; the advantage asks the only useful question (did this response beat its cohort?) instead of the meaningless "was $0.6$ good in the abstract?" Because that average is shared by the whole group and never looks at which response is rollout $i$, the zero-gradient property says subtracting it keeps the gradient unbiased; it only sharpens the contrast. And the cohort average is itself a sample estimate of the prompt's expected reward $V^{\pi}(q) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid q]$, so the group quietly plays the role of PPO's critic: no value network needed, and the bigger $G$ is, the better the estimate.

Dividing by std just puts every prompt at the same volume. Otherwise an easy prompt scoring $0.5 \pm 0.01$ whispers while one spanning $0.0$ to $1.0$ shouts, and the update only hears the loud one, even though picking the best sibling is just as real a lesson on the quiet prompt.
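Here is a small sketch of the group-relative advantage on two made-up prompts, one "quiet" and one "loud". The small `eps` in the denominator and the use of the population standard deviation are assumptions for the sketch, not something the formula above specifies.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Center rewards on the group mean and scale by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]

quiet = grpo_advantages([0.49, 0.51, 0.49, 0.51])  # rewards barely differ
loud = grpo_advantages([0.0, 1.0, 0.0, 1.0])       # rewards span the full range
flat = grpo_advantages([1.0, 1.0, 1.0, 1.0])       # every rollout scored the same
print(quiet, loud, flat)
```

After normalization the quiet and loud prompts produce advantages of nearly the same magnitude, which is the "same volume" effect. When every rollout in a group gets the same reward, all advantages are zero and that prompt contributes no gradient.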

GRPO's clipped objective also adds a token-level KL term against a reference policy $\pi_{\text{ref}}$ to prevent reward hacking:

$$\mathcal{L}_{\text{GRPO}} = -\frac{1}{\sum_i |\tau_i|} \sum_{i=1}^{G} \sum_{t=1}^{|\tau_i|} \Bigl[\min\bigl(r_{i,t}(\theta)\hat{A}_i,\;\text{clip}(r_{i,t}(\theta), 1{-}\epsilon, 1{+}\epsilon)\hat{A}_i\bigr) - \beta\,D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\Bigr]$$

where $r_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\text{old}}(a_{i,t} \mid s_{i,t})}$ is the per-token importance ratio for rollout $i$, the GRPO-indexed counterpart of $r_t(\theta)$ from PPO.

The tradeoff: No critic memory, but you now sample $G$ responses per prompt per step. If $G = 8$, rollout cost increases 8×. For most setups the memory saving (no second model) outweighs the extra rollout cost.

REINFORCE and REINFORCE++

Both algorithms go even simpler — no critic and no group sampling.

REINFORCE is the original policy gradient algorithm from Williams (1992). Use the actual episode return as the $Q$ estimate:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_t \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \cdot R^{(i)}$$

No baseline, no critic, no clipping. Each token gets the same gradient weight (the full episode reward). This is unbiased: the sample gradient equals the true gradient in expectation. But it has very high variance: with non-negative rewards every sampled trajectory is pushed up, and if all trajectories happen to score similarly (common early in training), the differences between them that actually carry the learning signal are tiny compared with the noise, so learning stalls.

REINFORCE++ makes two simple fixes: token-level KL regularization folded into the reward, and a batch-mean baseline.

First, shape each rollout's reward with a token-level KL penalty against the reference:

$$\tilde{R}_i = R_i - \beta \sum_t \log \frac{\pi_\theta(a_t^{(i)} \mid s_t^{(i)})}{\pi_{\text{ref}}(a_t^{(i)} \mid s_t^{(i)})}$$

Then center across the batch using the mean of those shaped rewards:

$$\hat{A}_i = \tilde{R}_i - \text{mean}(\{\tilde{R}_j\}_{j=1}^B)$$

where $B$ is the full batch of diverse prompts (not the same prompt $G$ times, as in GRPO).

The KL term at the token level penalizes the policy for drifting from the reference model token-by-token, not just at the sequence level. This is a finer-grained constraint that helps prevent the policy from collapsing onto degenerate solutions.
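The two steps can be sketched in a few lines. Everything numeric here is hypothetical: two rollouts of two tokens each, invented per-token log-probabilities under the policy and the reference, and $\beta = 0.1$.

```python
def shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Outcome reward minus beta times the summed per-token log-ratio."""
    log_ratio_sum = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward - beta * log_ratio_sum

def batch_advantages(shaped_rewards):
    """Center shaped rewards on the mean of the whole batch."""
    mean = sum(shaped_rewards) / len(shaped_rewards)
    return [r - mean for r in shaped_rewards]

# Rollout 0: correct answer, but the policy has drifted from the reference.
r0 = shaped_reward(1.0, logp_policy=[-1.0, -0.5], logp_ref=[-1.5, -1.0])
# Rollout 1: wrong answer, no drift at all.
r1 = shaped_reward(0.0, logp_policy=[-2.0, -1.0], logp_ref=[-2.0, -1.0])

advantages = batch_advantages([r0, r1])
print(r0, r1, advantages)
```

The drifted rollout keeps most of its reward but pays a small KL toll before the batch mean is subtracted.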

Putting It All Together

Here is a summary of what each algorithm requires and what it achieves:

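| Algorithm | Critic? | Baseline | Clipping | Variance |
|---|---|---|---|---|
| REINFORCE | No | None ($b=0$) | No | High |
| REINFORCE++ | No | Batch mean | No | Medium |
| GRPO | No | Group mean + std-norm | Yes | Low–medium |
| PPO | Yes | $V^{\pi}(s)$ via critic | Yes | Low |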

For most practical LLM training today, GRPO or REINFORCE++ are preferred over PPO precisely because they avoid the second model. For a 70B policy, a 70B critic brings the total to ~140B parameters to keep in GPU memory, plus separate optimizer states. GRPO trades that memory cost for more rollouts per step, which is typically the better deal.

Preference Optimization

The green branch from [Figure algorithm-tree not found] takes a different route from Online Policy Gradient: skip the rollouts entirely. There's no critic, no policy-gradient sample, no separately-trained reward model; there is just a supervised loss over preference pairs $(y_w, y_l)$. What looks like SFT is actually solving the same KL-regularized RL objective the Online PG branch is iterating toward, but in closed form.

DPO: closed-form RL as a supervised loss

Start with the KL-regularized RL objective, the same one Online PG implicitly maximizes when it adds the $\beta \cdot D_{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ anchor:

$$\max_\pi\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\big[r(x, y)\big] - \beta \cdot D_{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big)$$

This has a closed-form optimum:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)$$

Rearrange to express the reward as a function of $\pi^*$ and $\pi_{\text{ref}}$:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Plug into the Bradley-Terry preference model $P(y_w \succ y_l \mid x) = \sigma\!\big(r(x, y_w) - r(x, y_l)\big)$. The $\log Z(x)$ terms cancel because they don't depend on $y$. What's left is the DPO loss:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

This is literally "KL-regularized RL, expressed as a supervised log-likelihood." The optimal policy of an Online PG run with reward $r$ and anchor $\pi_{\text{ref}}$ is exactly the policy that minimizes $\mathcal{L}_{\text{DPO}}$ on preference data labeled by that reward. DPO skips the rollouts and the policy gradient and goes straight to the supervised loss whose minimum is the RL optimum.

So when you train with DPO you're not doing supervised learning instead of RL — you're doing the closed-form solution to the same KL-regularized RL objective the Online PG branch is iterating toward.
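For a single preference pair, the DPO loss needs only four sequence log-probabilities. The sketch below uses invented log-prob values and $\beta = 0.1$ as illustrative assumptions, and evaluates $-\log \sigma(\cdot)$ in a numerically stable form.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # Implicit reward margin: beta * (log-ratio of chosen - log-ratio of rejected)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin) = log(1 + exp(-margin)), written to avoid overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: zero margin, loss is log 2.
at_ref = dpo_loss(logp_w=-5.0, logp_l=-7.0, ref_logp_w=-5.0, ref_logp_l=-7.0)

# Policy raised the chosen response and lowered the rejected one: loss falls.
improved = dpo_loss(logp_w=-4.0, logp_l=-8.0, ref_logp_w=-5.0, ref_logp_l=-7.0)
print(at_ref, improved)
```

Only the log-ratios against the reference enter the loss, so a policy that already prefers the chosen response as much as the reference does still starts at $\log 2$.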

The cost: DPO needs preference data $(x, y_w, y_l)$ with explicit pairwise labels, and the implicit reward is whatever the labelers agreed about, baked into the dataset. No reward model needs to be learned and no rollouts need to be sampled, but you've paid for the preferences upfront.

Variants

The DPO variants in [Figure algorithm-tree not found] all share DPO's structural choice (target = closed-form RL optimum, divergence = supervised log-likelihood) and tweak one of the moving parts:

  • SimPO (Meng et al. 2024) drops $\pi_{\text{ref}}$ from the loss entirely and uses the length-normalized average log-prob as the implicit reward. Faster to train, no reference model in memory, slightly looser alignment with the original RL optimum.
  • KTO (Ethayarajh et al. 2024) replaces pairwise preferences with absolute "good" / "bad" labels per sample (binary classification framing). Easier label collection, comparable performance.
  • IPO (Azar et al. 2024) replaces the sigmoid in DPO with an MSE-style loss to address overfitting on near-deterministic preference data — essentially adds smoothing.
  • ORPO (Hong et al. 2024) folds an explicit SFT term into the DPO loss so you can collapse the SFT and preference-tuning stages into one pass.
  • Online DPO runs DPO on a stream of fresh preferences (sampled and labeled during training) instead of a fixed dataset, recovering some of the on-policy benefits.

These all live on the same axis as Online PG; they just trade rollouts for preference data and a closed-form solution.

Self-Training

The amber branch from [Figure algorithm-tree not found] sits at the simplest end of the tree. There's no policy gradient, no KL term, no preference data, no critic. Sample rollouts, filter by reward, SFT on the survivors, repeat.

That's it. The "RL" lives entirely in the filtering step.

The iterated SFT loop

For $T$ rounds:

  1. Sample $N$ rollouts from current $\pi_\theta$ on each prompt.
  2. Score them with a reward function (or a correctness check, or a test-passing oracle, etc.).
  3. Keep the top-$k$ per prompt, or all rewards above threshold → call this filtered set $\mathcal{D}_t$.
  4. SFT $\pi_\theta$ on $\mathcal{D}_t$ → new $\pi_\theta$.
  5. Repeat.

The per-round objective is the plain SFT loss on the filtered set:

$$\mathcal{L}_t(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_t}\!\left[\log \pi_\theta(y \mid x)\right], \quad \mathcal{D}_t = \{(x, y) : y \sim \pi_{\theta_{t-1}}(\cdot \mid x),\; R(x, y) \geq \tau\}$$

Each iteration is forward KL minimization to a self-curated target distribution biased toward high reward. The rejection-sampling filter is doing the work of the policy gradient: high-reward samples get all the gradient mass, low-reward samples get none. As $\pi_\theta$ improves, $\mathcal{D}_t$ improves, which improves $\pi_\theta$ further.
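The loop is simple enough to simulate on a toy problem. In this sketch the "policy" is a categorical distribution over three candidate answers, and "SFT" is replaced by a stand-in: move the policy halfway toward the empirical distribution of the surviving samples. The reward function, sample count, and step size are all hypothetical.

```python
import random

def self_training_round(probs, reward_fn, n_samples, threshold, rng, step=0.5):
    """One sample -> filter -> fit round on a categorical toy policy."""
    actions = list(range(len(probs)))
    samples = rng.choices(actions, weights=probs, k=n_samples)
    kept = [a for a in samples if reward_fn(a) >= threshold]
    if not kept:
        return probs  # nothing passed the filter; policy is unchanged
    empirical = [kept.count(a) / len(kept) for a in actions]
    # Toy stand-in for SFT: step part-way toward the MLE fit of the survivors.
    return [(1 - step) * p + step * e for p, e in zip(probs, empirical)]

def reward_fn(action):
    return 1.0 if action == 2 else 0.0  # only answer 2 is "correct"

rng = random.Random(0)
probs = [0.6, 0.3, 0.1]  # the correct answer starts out rare
history = [probs[2]]
for _ in range(5):
    probs = self_training_round(probs, reward_fn, 64, 1.0, rng)
    history.append(probs[2])
print(history)
```

The probability of the correct answer never decreases from round to round, even though no gradient of the reward was ever computed; all the reward information entered through the filter.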

Compared to the Online PG branch, Self-Training has lower variance per step (no per-token gradient on every rollout) but coarser credit assignment (sequence-level filter, no per-token signal). You can think of it as the policy-gradient equivalent of REINFORCE with a hard threshold instead of a learned baseline — all-or-nothing advantages instead of continuous ones.

STaR, ReST, RAFT

The three named methods differ mainly in what gets filtered and how the loop is scheduled:

  • STaR (Zelikman et al. 2022): the original. For reasoning tasks with verifiable answers: sample reasoning chains, keep the ones that arrive at the correct final answer, SFT on the (problem, correct chain) pairs. Plus a "rationalization" trick: for problems the model gets wrong, condition on the gold answer to generate a fake-but-plausible chain, and include those in $\mathcal{D}_t$ too. Bootstraps reasoning ability without supervised reasoning traces.
  • ReST (Gulcehre et al. 2023): Reinforced Self-Training. Generalizes STaR's loop to arbitrary reward-model-scored samples (not just verifiable correctness). Two nested loops: an outer "Grow" loop that samples new data from the current policy, and an inner "Improve" loop that filters at progressively higher thresholds and runs SFT.
  • RAFT (Dong et al. 2023): Reward-rAnked Fine-Tuning. The most direct: sample $N$, take top-$k$ by reward, SFT, repeat. No threshold scheduling, no rationalization trick. The simplest possible iterated-SFT loop.

The reason this branch works at all is the same reason Online PG works: as long as the filter is correlated with reward, the policy improves on average each round. Self-Training trades the precision of a continuous advantage for the simplicity of "just SFT, repeatedly."

Linking back to SFT

The three branches above (Online Policy Gradient, Preference Optimization, Self-Training) cover the algorithm tree from [Figure algorithm-tree not found]. SFT itself sits outside that tree — it's the loss everything starts from.

These four (SFT plus the three branches) aren't actually different paradigms. Every one of them reduces to the same shape:

Choose a target distribution. Minimize a divergence to it.

What changes is which target and which divergence.

SFT: forward KL to the data

The standard SFT loss is cross-entropy:

$$\mathcal{L}_{\text{SFT}} = -\log \pi_\theta(y^* \mid x)$$

Cross-entropy decomposes as $H(p, q) = H(p) + D_{KL}(p \,\|\, q)$. For SFT the target distribution $p$ is one-hot on the gold token, so $H(p) = 0$ and cross-entropy equals KL exactly:

$$\mathcal{L}_{\text{SFT}} = D_{KL}(p_{\text{data}} \,\|\, \pi_\theta)$$

That's forward KL — data on the left, model on the right. Forward KL is mode-covering: it punishes the model for assigning low probability where the data has mass, so the model is forced to spread itself to cover everything in the dataset. This is the baseline. Every other branch is a way of choosing a different target distribution and minimizing some divergence to that — when the data alone isn't enough.
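The claim that cross-entropy equals forward KL for a one-hot target can be checked directly. This sketch uses a made-up three-token model distribution.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) = -sum_i p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

one_hot = [0.0, 1.0, 0.0]  # gold token is index 1
model = [0.2, 0.7, 0.1]    # hypothetical model distribution

ce = cross_entropy(one_hot, model)
fwd_kl = kl(one_hot, model)
print(ce, fwd_kl, entropy(one_hot))
```

Both quantities reduce to $-\log 0.7$, the negative log-probability the model assigns to the gold token, because the entropy of a one-hot target is zero.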

Forward KL vs reverse KL

The $\beta \cdot D_{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ anchor used by Online PG is reverse KL, with the model on the left and the reference on the right. Reverse KL is mode-seeking: only places where the model puts mass contribute to the penalty. The model is free to drop mass on anything; the only thing it can't do without paying is invent outputs $\pi_{\text{ref}}$ never produced.

This is why post-RL policies are sharper than their SFT base. SFT spread the mass to cover everything reasonable in the data; reverse-KL-anchored RL lets the model concentrate mass on whichever subset of that cover the reward favors. This is real and observable: entropy drops, perplexity on its own samples falls, output diversity at any given temperature shrinks. People sometimes call this "mode collapse" and treat it as a bug. It's the design working as specified — the KL direction was chosen precisely to allow collapse onto high-reward modes while preventing the worse failure of the model walking off SFT's support entirely.
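A tiny numeric illustration of the two directions, with hypothetical distributions: a target with two modes separated by a low-probability valley, and two candidate models, one that spreads over everything and one that commits to a single mode.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

target = [0.495, 0.01, 0.495]  # two modes with a valley in between
spread = [0.34, 0.32, 0.34]    # covers both modes and the valley
peaked = [0.98, 0.01, 0.01]    # commits to one mode

# Forward KL: target on the left, model on the right.
forward = {"spread": kl(target, spread), "peaked": kl(target, peaked)}
# Reverse KL: model on the left, target on the right.
reverse = {"spread": kl(spread, target), "peaked": kl(peaked, target)}
print(forward)
print(reverse)
```

Forward KL prefers the spread model, since missing one of the target's modes is heavily punished. Reverse KL prefers the peaked model, since putting mass in the valley costs more than ignoring a mode.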

(Aside: classical RL, as in Atari, MuJoCo, or gym, has no $\pi_{\text{ref}}$ in the loss at all. Just the clipped surrogate. You start from a random policy, there's no good behavior to anchor to, and drift is the goal. The frozen-SFT KL anchor is a contribution of LLM-RL, not PPO itself.)

Two kinds of mode-seeking

A reasonable question after the KL-direction story: is RL just a mode-seeking operation? Two effects say yes, and they're worth separating.

Reason 1: the fixed point of reward maximization. Forget the KL anchor for a moment. The bare objective $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ has a degenerate optimum: a point mass on $\arg\max_a R(a)$. The policy that maximizes expected reward puts all its probability on the single best action (or one of them, if ties). This has nothing to do with KL. Maximizing an expectation over a bounded reward collapses the distribution onto the argmax. Pure RL, in this sense, is maximally mode-seeking: the fixed point is a delta function. The path during training doesn't have to be that, but the destination is.

Reason 2: the reverse-KL anchor. Layered on top of Reason 1 in LLM-RL is the $-\beta \cdot D_{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ term, the mode-seeking direction of KL, as we just saw. This doesn't cause mode-seeking (reward max already does that); it constrains where the collapse lands. Without the anchor, the policy is free to collapse anywhere, including onto out-of-distribution junk that happens to game the reward. With it, the policy can only collapse onto a subset of $\pi_{\text{ref}}$'s support. The anchor doesn't prevent mode-seeking; it just channels where the collapse goes.

So the LLM-RL setup stacks two mode-seeking pressures: reward-max picks the high-reward mode, the reverse-KL anchor picks the modes inside $\pi_{\text{ref}}$'s support.

The cases where RL is not mode-seeking are exactly the cases where one of these is broken. Maximum-entropy RL ($\max\,\mathbb{E}[R] + \beta\, H(\pi)$, as in soft actor-critic) replaces the argmax fixed point with a softmax over rewards, concentrated but not collapsed. Off-policy methods with a diverse behavior prior inherit some of that breadth. Explicit diversity rewards add a term that resists collapse. None of these are active in the standard RLHF or GRPO loop. That's why post-RL policies are notably sharp: the mode-seeking is intentional, on two axes at once, and nothing in the recipe pushes back.

The unifying picture

Pulling all four together:

| Branch | Target | Divergence | Gradient flow | Mode behavior |
|---|---|---|---|---|
| SFT | The data | Forward KL | Direct on demos | Mode-covering |
| Online PG | Reward-shaped, anchored to ref policy | Reverse KL + reward max | Score-function on rollouts | Mode-seeking |
| Preference Opt | Closed-form RL optimum | Supervised log-likelihood | Direct on preference pairs | Mode-seeking (fits the reverse-KL optimum) |
| Self-Training | Reward-filtered self-samples | Forward KL | Iterated SFT on the filter | Mode-seeking (covers within the reward filter) |

Each branch answers the same question: which of the two, the target or the divergence, am I willing to pay for?

  • Cheap target, expensive divergence: Online PG. The target is just "high reward, anchored to SFT" — easy to specify, but you pay with rollouts and noisy gradients.
  • Expensive target, cheap divergence: Preference Optimization. The target requires preference data (and an implicit reward model embedded in the labelers), but once you have it, the loss is plain supervised.
  • Iterative bootstrap: Self-Training. Don't pay for either upfront — let the policy and the target distribution co-evolve through filtering.
  • No target synthesis at all: SFT. The target is given (the data) and the divergence is the cheapest possible.

That's why "SFT vs RL" is the wrong axis. The right axis is: how much work are you willing to do to construct the target distribution, and what divergence are you willing to compute against it? Everything else — the critic, the clip, the KL coefficient, the preference labelers, the rejection filter — is engineering in service of those two choices.

Closing Thoughts

All four categories of algorithms are fundamentally solving the same problem: estimate the gradient of $J(\theta) = \mathbb{E}[R(\tau)]$. What separates them is how carefully they estimate the advantage, and at what cost.

The elegant thing about the zero-gradient property is that it gives enormous freedom in choosing the baseline:

  • GRPO: the mean reward of a group of same-prompt rollouts is a cheap, unbiased estimate of $V^{\pi}$
  • REINFORCE++: a batch-level constant helps, and token-level KL tightens the constraint where it matters

You see, deep theory and practical engineering are not always in tension. Sometimes, the right theoretical framing reveals that a simple heuristic is actually doing exactly the right thing.

Let's stand a bit higher. Every branch picks a target distribution and minimizes a divergence to it: SFT to the data, Self-Training to its reward-filtered samples, Preference Optimization to the closed-form $\pi^*$. Even Online PG, the branch that looks like it maximizes reward rather than matching a distribution, could be interpreted as doing the same thing. The KL-regularized objective is, up to a constant, a divergence to the very same $\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\,\exp\!\big(\tfrac{1}{\beta} r(x, y)\big)$ we derived for DPO:

$$\underbrace{\mathbb{E}_{y \sim \pi}\big[r(x, y)\big] - \beta\, D_{KL}\big(\pi \,\|\, \pi_{\text{ref}}\big)}_{\text{reward, KL-anchored}} \;=\; \underbrace{-\beta\, D_{KL}\big(\pi \,\|\, \pi^*\big)}_{\text{divergence to the target}} \;+\; \text{const}$$

where the constant is $\beta \log Z(x)$. Maximizing reward is minimizing reverse KL to a target. The reward defines the target distribution of the policy by tilting the reference policy by $\exp(r/\beta)$, with $\beta$ the temperature of the tilt: small $\beta$ concentrates $\pi^*$ on high reward, large $\beta$ leaves it near $\pi_{\text{ref}}$.
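The identity is easy to verify numerically on a small discrete example. The reference distribution, rewards, $\beta$, and the test policy below are all arbitrary made-up values; the check is that both sides agree for a policy that is not the optimum.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

ref = [0.5, 0.3, 0.2]      # hypothetical reference policy over 3 responses
rewards = [0.0, 1.0, 0.5]  # hypothetical rewards
beta = 0.5

# Tilted target: pi*(y) = ref(y) * exp(r(y) / beta) / Z
weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
z = sum(weights)
target = [w / z for w in weights]

def kl_regularized_objective(policy):
    """E_pi[r] - beta * KL(pi || ref)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl(policy, ref)

def divergence_form(policy):
    """-beta * KL(pi || pi*) + beta * log Z."""
    return -beta * kl(policy, target) + beta * math.log(z)

policy = [0.2, 0.5, 0.3]  # an arbitrary policy, not the optimum
gap = abs(kl_regularized_objective(policy) - divergence_form(policy))
best = kl_regularized_objective(target)
print(gap, best, beta * math.log(z))
```

The gap is zero up to floating-point error, and the objective evaluated at the tilted target equals $\beta \log Z$, which is its maximum since the KL term is then zero.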

That turns "SFT vs RL" from a question of philosophy into a question of access: do you have samples from your target, or only a way to score samples? Demonstrations are samples from the target, so forward KL is computable directly, aka SFT. A reward is only a scorer; you can't draw from the target, so the best you can do is sample your own policy and reweight by reward, which is reverse KL on rollouts, aka RL. The divergence direction isn't chosen for its mode-seeking or mode-covering flavor; it's forced by what you can sample, and the mode behavior follows.

Ground this in two tasks that pull in opposite directions.

Take a verifiable task: $r = 1$ when the answer checks out, $0$ otherwise. The tilt $\exp(r/\beta)$ multiplies correct answers by $\exp(1/\beta)$ and wrong ones by $\exp(0) = 1$, so as $\beta \to 0$ it blows up on the correct answers while the wrong ones wash out under the normalizer. What's left is the base model's own distribution, restricted to correct answers. This is exactly what you get by sampling from $\pi_{\text{ref}}$ and throwing the failures away. The cold-temperature reward optimum is literally rejection sampling, which is why filtered self-training fits the same target.
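The cold-temperature limit can be seen on a three-answer toy problem. The reference probabilities and the choice of which answers are correct are invented; the tilt is computed with a max-shift so that small $\beta$ does not overflow.

```python
import math

def tilted_target(ref, rewards, beta):
    """pi*(y) proportional to ref(y) * exp(r(y) / beta), with a max-shift for stability."""
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

ref = [0.5, 0.3, 0.2]      # hypothetical base-model probabilities
rewards = [0.0, 1.0, 1.0]  # answer 0 is wrong, answers 1 and 2 are correct

warm = tilted_target(ref, rewards, beta=10.0)  # large beta: stays near ref
cold = tilted_target(ref, rewards, beta=0.01)  # small beta: wrong answer washes out

# Rejection sampling: renormalize ref over the correct answers only.
correct_mass = sum(p for p, r in zip(ref, rewards) if r == 1.0)
rejection = [p / correct_mass if r == 1.0 else 0.0 for p, r in zip(ref, rewards)]
print(warm, cold, rejection)
```

At small $\beta$ the tilted target matches the rejection-sampling distribution, and the relative odds between the two correct answers are still those of the base model.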

A learned safety reward is the soft version of the same move: it scores refusals on harmful prompts highly, so the tilt piles $\pi^*$'s mass onto refusal-shaped responses. But this task has essentially one acceptable behavior: refuse, and you want it every single time, not a diverse spread of ways to handle a jailbreak. So here you actually want the policy to collapse onto that one mode, which is exactly the mode-seeking reverse-KL RL gives you. The diversity loss that looks like RL's weakness is the feature.

Same machinery, opposite wishes: math has many correct answers and you'd rather keep them all; safety has basically one right behavior and you'd rather lock onto it. The reward says where the target sits; the divergence says whether you fan out across it or collapse onto it.

Which is the whole post in one line: the gradient of $J(\theta)$ walks the policy toward a target distribution the reward quietly defines.

There's one last lens that makes the same point from underneath. A policy assigns a probability $\pi(y \mid x)$, and any probability is a code: the string $y$ can be written in $-\log \pi(y \mid x)$ bits. Forward KL to the data is then just the excess bits the model spends describing it, $D_{KL}(p_{\text{data}} \,\|\, \pi) = \mathbb{E}_{\text{data}}[-\log \pi] - H(p_{\text{data}})$, so SFT is nothing but "make the model the shortest description of the data your function class can express." That's the Minimum Description Length principle (yeah, Kolmogorov complexity), and it's why the same objective that trains a language model also makes it a state-of-the-art compressor.

The reward branch is the same statement, re-priced. Read the optimum $\pi^*$ in bits:

$$-\log \pi^*(y \mid x) = -\log \pi_{\text{ref}}(y \mid x) - \tfrac{1}{\beta}\, r(x, y) + \log Z(x)$$

The codelength under the optimal policy is the prior's codelength minus reward (in nats) over $\beta$. Reward literally buys shorter codes, and $\beta$ is the exchange rate between bits and reward. The anchor $\beta \cdot D_{KL}(\pi \,\|\, \pi_{\text{ref}})$ is the price, in bits, of moving off the prior. So pretraining pays the enormous compression cost of describing the corpus, and RL pays only a small, KL-bounded surcharge to re-rank what already lives in $\pi_{\text{ref}}$'s support.

Which is maybe the most compact way to hold the whole tree in your head: every method is searching for the shortest code for its target, and reward is just a discount on the codelength of the outcomes you want to see.

Citation

Please cite this work as:

Xuhui Zhou, “Thinking in RL”, 2026.

Or use the BibTeX citation:

@misc{zhou2026rl,
  author = {Xuhui Zhou},
  title = {Thinking in RL},
  year = {2026},
  howpublished = {\url{https://xuhuizhou.com/blog/thinking-in-rl}},
}