Thinking in RL

By Xuhui Zhou · Apr 13, 2026

RL is happening everywhere. The RL algorithms powering today's frontier models are being actively debated, reimagined, and simplified at a remarkable pace. This blog aims to summarize them at a relatively high conceptual level and tries to find synergies among the "million" RL algorithms. Many of the high-level thoughts, summaries, and metaphors come from my own perspective (which is mostly about LLMs), so they may or may not be absolutely true. Up for a discussion or debate anytime haha.

If you want a more comprehensive treatment alongside this post, these are the resources I'd recommend:

A (Long) Peek into Reinforcement Learning — Lilian Weng's blog post. One of the clearest RL overviews I know.
DeepMind × UCL RL Lecture Series — comprehensive lectures covering RL from the ground up.
Stanford CS234: Reinforcement Learning — the full course, including policy gradient lectures.

Before diving in, here's the landscape of RL algorithms for LLM training as of early 2026.

The RL Formulation for Language Models

Before the algorithms, we need the setup. Generating text with a language model maps naturally to a Markov Decision Process (MDP):

State $s_t$ : the full prefix up to position $t$ — prompt plus any tokens already generated
Action $a_t$ : the next token chosen from the vocabulary
Policy $\pi_\theta(a_t \mid s_t)$ : the language model parameterized by $\theta$
Reward $R(\tau)$ : a scalar signal at the end of the episode — e.g., +1 if a math answer is correct, 0 otherwise
Episode / trajectory: a complete generation $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$

The training objective is to maximize the expected reward over trajectories drawn from the policy:

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\bigl[R(\tau)\bigr]$
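To make the objective concrete, here is a minimal toy sketch (not a real LLM). Everything in it is an assumption made for illustration: a two-token vocabulary, a fixed policy that ignores the prefix, a 4-token horizon, and a hypothetical reward of 1 when the response contains at least three "1" tokens. With a setup this small, $J(\theta)$ can be computed exactly by enumerating trajectories and compared against a Monte Carlo estimate from rollouts.

```python
import itertools
import random

VOCAB = ("0", "1")
HORIZON = 4      # tokens generated per episode (toy assumption)
P_ONE = 0.6      # toy policy: probability of emitting "1" at every step


def policy_prob(token: str) -> float:
    """pi(a_t | s_t); this toy policy ignores the prefix s_t."""
    return P_ONE if token == "1" else 1.0 - P_ONE


def reward(trajectory: tuple) -> float:
    """Sparse outcome reward, available only once the episode is complete."""
    return 1.0 if trajectory.count("1") >= 3 else 0.0


def exact_J() -> float:
    """J = sum over all trajectories of p(tau) * R(tau)."""
    total = 0.0
    for traj in itertools.product(VOCAB, repeat=HORIZON):
        p = 1.0
        for tok in traj:
            p *= policy_prob(tok)
        total += p * reward(traj)
    return total


def monte_carlo_J(num_rollouts: int, seed: int = 0) -> float:
    """Estimate J by sampling rollouts from the policy and averaging rewards."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_rollouts):
        traj = tuple("1" if rng.random() < P_ONE else "0" for _ in range(HORIZON))
        total += reward(traj)
    return total / num_rollouts
```

In a real LLM the enumeration is impossible (the trajectory space is astronomically large), which is why every algorithm below works from sampled rollouts only.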

Three observations here: (1) In the LLM setting, this objective is essentially mode-seeking. For example, in RLHF, if a certain text format generally earns more preference (aka reward) from the human judges involved in training, you will see that format (e.g., bullet points) show up really frequently.

(2) Rewards in LLM training are almost always sparse and outcome-level. There is no per-token signal telling the model "this word was a good choice." A trajectory might be thousands of tokens long, and the model only learns after the final token whether it did well. This makes the credit assignment problem particularly hard, and it is why on-policy distillation is trending so much these days.

(3) The reward here is a scalar value! However, complex real-life outcomes are judged along multiple dimensions, and in this setting we somehow need a way to squeeze them into a single scalar.

Nevertheless, the whole setup is really powerful, in a way that is starting to crack into almost "every" corner of human life.

The Policy Gradient Theorem

The Policy Gradient Theorem is the mathematical backbone of every algorithm we'll discuss:

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot Q^{\pi}(s_t, a_t)\right]$

where $Q^{\pi}(s_t, a_t)$ is the action-value function, i.e., the expected total reward from taking action $a_t$ in state $s_t$ and following policy $\pi$ thereafter.

Where does this come from? We start from $J(\theta) = \mathbb{E}_\tau[R(\tau)]$ and take the gradient. The log-derivative trick ($\nabla p = p \cdot \nabla \log p$) gives $\nabla_\theta J = \mathbb{E}_\tau[\nabla_\theta \log p_\theta(\tau) \cdot R(\tau)]$. The trajectory log-prob expands as $\log p(s_0) + \sum_t \log \pi_\theta(a_t|s_t) + \sum_t \log p(s_{t+1}|s_t,a_t)$; the initial-state and transition terms don't depend on $\theta$ and vanish under the gradient, leaving $\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)$. This gives REINFORCE: $\mathbb{E}_\tau[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R(\tau)]$. Finally, causality: a reward $r_{t'}$ with $t' < t$ was collected before action $a_t$, so $\mathbb{E}[\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot r_{t'}] = 0$ for past rewards. Dropping them replaces $R(\tau)$ with the return-to-go $R_t = \sum_{t' \geq t} r_{t'}$. Since $\mathbb{E}[R_t \mid s_t, a_t] = Q^\pi(s_t, a_t)$ by definition, $R_t$ is just a Monte Carlo sample of $Q^\pi$.

The term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is the score function (also called the log-likelihood gradient). It points in the direction that makes the chosen token $a_t$ more probable under the current policy. Multiplying by $Q^{\pi}$ then says: scale that direction by how good the action actually is. If $Q$ is high, push hard toward this action. If low, pull away.

The problem is that we don't have $Q^{\pi}$ — we need to estimate it from rollouts. How we do that estimation is exactly what distinguishes the algorithms we'll cover.
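The theorem is easy to check numerically in the smallest possible case. The sketch below assumes a one-step problem (a softmax policy over three actions, parameterized directly by its logits, with made-up rewards), so $Q^{\pi}(a)$ is simply the reward of action $a$. It computes the score-function expectation exactly by summing over actions and compares it with a finite-difference gradient of $J$.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    """J(theta) = sum_a pi(a) * R(a) for a one-step problem."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def score(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - probs[k] for k in range(len(logits))]


def policy_gradient(logits, rewards):
    """E_a[ score(a) * Q(a) ], computed exactly by summing over actions."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p, q) in enumerate(zip(probs, rewards)):
        s = score(logits, a)
        for k in range(len(logits)):
            grad[k] += p * s[k] * q
    return grad


def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Central-difference gradient of J, used only as an independent check."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        grad.append((expected_reward(up, rewards) - expected_reward(down, rewards)) / (2 * eps))
    return grad


LOGITS = [0.2, -0.5, 1.0]     # hypothetical policy parameters
REWARDS = [1.0, 0.0, 0.3]     # hypothetical per-action rewards
pg = policy_gradient(LOGITS, REWARDS)
fd = finite_difference_gradient(LOGITS, REWARDS)
```

The two gradients agree to numerical precision. In practice the sum over actions is replaced by samples from the policy, which is where the variance discussed in the rest of the post comes from.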

The Advantage Function

Rather than estimating $Q^{\pi}$ directly, sometimes we prefer to work with the advantage function:

$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$

where $V^{\pi}(s_t) = \mathbb{E}_{a \sim \pi}[Q^{\pi}(s_t, a)]$ is the state value function, aka how good state $s_t$ is on average, regardless of which action is taken.

The advantage asks a sharper question: is this particular action better or worse than what the policy would do on average in this state? That's more informative than the raw $Q$ value, because $Q$ carries a lot of "how good is this state" signal mixed in. A trajectory with reward 0.8 might be unremarkable if the policy routinely achieves 0.9 from this prompt, or exceptional if the policy usually only manages 0.3.

This substitution doesn't introduce any bias. Subtracting $V^{\pi}(s_t)$ from $Q^{\pi}(s_t, a_t)$ has zero effect on the expected gradient, because:

$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta\, 1 = 0$

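This is the zero-gradient property, and it's the key insight we'll return to. Because $V^{\pi}(s_t)$ does not depend on the sampled action, it factors out of the expectation, so the subtracted term contributes $V^{\pi}(s_t) \cdot \mathbb{E}_{a}[\nabla_\theta \log \pi_\theta(a \mid s_t)] = 0$ to the expected gradient.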
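A small exact computation shows both halves of the claim: the baseline leaves the expected gradient untouched, and it can shrink the variance a lot. This is a sketch under toy assumptions (a uniform softmax policy over three actions and made-up rewards that are all positive and close together); the function names are illustrative only.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score(probs, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def estimator_stats(probs, rewards, baseline):
    """Mean and total variance of the one-sample estimator score(a) * (R(a) - baseline)."""
    dim = len(probs)
    mean = [0.0] * dim
    second_moment = 0.0
    for a, p in enumerate(probs):
        g = [s * (rewards[a] - baseline) for s in score(probs, a)]
        for k in range(dim):
            mean[k] += p * g[k]
        second_moment += p * sum(x * x for x in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


probs = softmax([0.0, 0.0, 0.0])
rewards = [0.8, 0.9, 1.0]                              # hypothetical rewards
value = sum(p * r for p, r in zip(probs, rewards))     # V(s), the average reward

# The zero-gradient property: the expected score is the zero vector.
expected_score = [
    sum(p * score(probs, a)[k] for a, p in enumerate(probs)) for k in range(3)
]
mean_raw, var_raw = estimator_stats(probs, rewards, baseline=0.0)
mean_adv, var_adv = estimator_stats(probs, rewards, baseline=value)
```

Here `mean_raw` and `mean_adv` are identical, while `var_adv` is far smaller than `var_raw`: with no baseline every action is pushed up by a reward near 0.9, and the useful differences between actions are a small part of that signal.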

The advantage-estimation figure shows concretely how the same 8 rollouts look under each algorithm. REINFORCE, with no baseline, sees only non-negative advantages (since rewards are non-negative). The other methods center the advantages around zero, producing a more informative training signal: some rollouts are pushed up, others pushed down.

Online Policy Gradient

Online Policy Gradient is the blue branch of the algorithm-tree figure. Every algorithm in this branch (PPO, GRPO, REINFORCE, REINFORCE++, and the many variants below them) shares the same gradient shape: score-function × advantage. They differ only in how they estimate the advantage and how they regularize each step. We work from the most expensive (PPO, with a learned critic) down to the simplest (REINFORCE, no baseline at all).

PPO: The Trusted Workhorse

Proximal Policy Optimization was the algorithm behind early RLHF and remains the most thoroughly studied option. It makes two key innovations on top of basic policy gradient:

A learned critic

PPO trains a separate value network $V_\phi(s)$ to explicitly estimate $V^{\pi}$ . In practice this is typically a copy of the policy LLM with a scalar head, updated at each training step to minimize:

$\mathcal{L}_{\text{critic}} = \mathbb{E}_t\!\left[\bigl(V_\phi(s_t) - \hat{V}_t^{\text{target}}\bigr)^2\right]$

With a good $V_\phi$ , the advantage estimate $\hat{A}_t = R_t - V_\phi(s_t)$ is tight, giving low-variance gradients.

Generalized Advantage Estimation

Rather than the raw residual $R_t - V_\phi(s_t)$, PPO uses GAE (Schulman et al., 2015; the GAE paper, which is often cited alongside PPO) to reduce variance further:

$\hat{A}_t^{\text{GAE}} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$

The quantity $\delta_t$ is the TD residual, and it is itself a one-sample estimate of the advantage.

Why is $\delta_t$ an advantage estimate? By definition $A(s_t,a_t) = Q(s_t,a_t) - V(s_t)$, and $Q(s_t,a_t) = \mathbb{E}[r_t + \gamma V(s_{t+1})]$, so $\mathbb{E}[\delta_t \mid s_t, a_t] = A(s_t, a_t)$ whenever $V_\phi$ matches $V^{\pi}$. This bootstrapping idea of updating an estimate from another estimate, rather than waiting for the full return, is temporal-difference (TD) learning, the bridge between Monte Carlo and dynamic programming. This video has the most intuitive explanation I've found.

The parameter $\lambda \in [0,1]$ trades bias for variance: $\lambda=0$ gives the one-step TD estimate (low variance, some bias), $\lambda=1$ gives the full Monte Carlo return (unbiased, higher variance).

A subtlety in the LLM setting: as noted above, there is no genuine per-token reward. $r_t = 0$ for every intermediate token, and $r_T = R$ only at the terminal token. (In implementations like TRL/OpenRLHF, the per-token KL penalty against $\pi_{\text{ref}}$ is often folded into $r_t$ as a dense shaping term, which is why you'll see non-zero intermediate $r_t$ in code even though the task reward itself is purely terminal.) Plugging this into GAE, all the $\delta_t$ for $t < T$ reduce to $\gamma V_\phi(s_{t+1}) - V_\phi(s_t)$, and only $\delta_T$ carries the actual reward signal. The critic $V_\phi$ is doing all the heavy lifting: it learns to predict the eventual outcome from each prefix, and GAE then uses those predictions to spread the terminal reward backward across tokens. This is the real reason PPO needs a well-trained critic in LLM-RL: without it, there is no per-token learning signal at all.
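The GAE sum is usually computed with a backward recursion, $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. Below is a minimal sketch of that recursion on a terminal-reward episode. The critic values are made-up numbers, indexing is 0-based (so the last token is `T - 1`), and the value after the final token is assumed to be 0.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation via A_t = delta_t + gamma * lam * A_{t+1}.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_{T-1}); the value after the last token is taken as 0.
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
    return advantages


# Sparse, outcome-level reward: zero for every token except the last one.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.7, 0.9]    # hypothetical critic predictions per prefix

adv_mc = gae(rewards, values, gamma=1.0, lam=1.0)   # Monte Carlo limit: R - V(s_t)
adv_td = gae(rewards, values, gamma=1.0, lam=0.0)   # one-step TD residuals only
```

With $\lambda = 1$ every token's advantage is the final reward minus the critic's prediction at that prefix; with $\lambda = 0$ each token only sees how much the critic's prediction changed after it. Either way, the intermediate signal comes entirely from the critic values.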

Clipped objective

PPO adds a constraint to prevent the policy update from being too large in any single step:

$\mathcal{L}_{\text{PPO}} = -\frac{1}{\sum_i |\tau_i|} \sum_{i=1}^{N} \sum_{t=1}^{|\tau_i|} \min\!\Bigl(r_{i,t}(\theta)\hat{A}_{i,t},\;\text{clip}\bigl(r_{i,t}(\theta), 1{-}\epsilon, 1{+}\epsilon\bigr)\hat{A}_{i,t}\Bigr)$

where the sum runs over $N$ rollouts $\tau_i$ in the batch and all tokens within each, and $r_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\text{old}}(a_{i,t} \mid s_{i,t})}$ is the per-token importance ratio.
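The clipped term is easiest to understand by evaluating it at a few ratios. This is a plain-Python sketch of the formula above, taking per-token log-probabilities as nested lists; a real implementation would use batched tensors and autograd, which this does not attempt.

```python
import math


def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative token-level mean of the clipped term over a batch of rollouts.

    Each argument is a list of rollouts, and each rollout is a list of
    per-token values, so longer rollouts contribute more tokens to the mean.
    """
    total, count = 0.0, 0
    for new_seq, old_seq, adv_seq in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old, adv in zip(new_seq, old_seq, adv_seq):
            total += clipped_term(lp_new, lp_old, adv, eps)
            count += 1
    return -total / count


# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * A.
capped_gain = clipped_term(math.log(1.5), 0.0, advantage=1.0)
# Ratio 0.5 with a negative advantage: capped at 0.8 * A.
capped_loss = clipped_term(math.log(0.5), 0.0, advantage=-1.0)
# Ratio 0.5 with a positive advantage: the min keeps the unclipped, smaller value.
unclipped = clipped_term(math.log(0.5), 0.0, advantage=1.0)
```

The asymmetry is the point: the objective stops rewarding the policy for moving further in a direction it has already moved, but it never hides a move that made things worse.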

The cost: PPO requires maintaining and training a critic network that is often the same size as the policy. For a 70B-parameter LLM, that is a second 70B model in memory, plus all the optimizer states. This makes PPO expensive.

GRPO: Eliminating the Critic

Group Relative Policy Optimization, introduced in DeepSeekMath and later central to DeepSeek-R1, makes one clean simplification: estimate the baseline from the policy's own rollouts instead of a critic.

For each prompt $q$ , GRPO samples a group of $G$ responses $\{\tau_1, \ldots, \tau_G\}$ from the current policy and uses their mean as the baseline:

$\hat{A}_i = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)}$

No critic. No value network. The group mean is the baseline.

Notice what $R_i$ is doing here: it's standing in for $Q$. The action is the entire response, and the episode ends the moment it's produced, so the observed return $R_i$ is exactly the Monte Carlo action-value $Q^{\pi}(q, \tau_i)$. (This is the same fact used to derive REINFORCE above, that the return-to-go is a Monte Carlo sample of $Q^\pi$, specialized to the one-step case where the action is the whole trajectory and the reward is terminal, so the return is the reward.) With the group mean estimating $V^{\pi}(q)$, the GRPO advantage $R_i - \text{mean}(R_j)$ is just $Q - V$, with both terms estimated by sampling instead of learned by a critic.

Think of it as grading on a curve. The $G$ responses are students sitting the same exam $q$, and their mean reward is the class average; the advantage asks the only useful question ("did this response beat its cohort?") instead of the meaningless "was $0.6$ good in the abstract?" Because that average is shared by the whole group rather than tailored to rollout $i$, the zero-gradient property says subtracting it keeps the gradient unbiased; it only sharpens the contrast. (Strictly, the mean includes $R_i$ itself, which rescales the expected gradient by $(G-1)/G$ without changing its direction.) And the cohort average is itself a sample estimate of the prompt's expected reward $V^{\pi}(q) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid q]$, so the group quietly plays the role of PPO's critic: no value network needed, and the bigger $G$ is, the better the estimate.

Dividing by std just puts every prompt at the same volume. Otherwise an easy prompt scoring $0.5{\pm}0.01$ whispers while one spanning $0.0$ – $1.0$ shouts, and the update only hears the loud one, even though picking the best sibling is just as real a lesson on the quiet prompt.

This is pure rescaling, not a correctness fix, which is exactly why PPO has no such term: its trained critic already returns advantages on a calibrated, per-state scale, so there's nothing to re-normalize. The std is also a known wart (it over-weights very low-variance prompts), which is why Dr. GRPO drops it.
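The group-relative advantage fits in a few lines. The sketch below makes two assumptions that the formula leaves open: it uses the population standard deviation, and it adds a small `eps` to the denominator so that a group with identical rewards yields zeros instead of a division by zero. The mean-only variant mirrors the Dr. GRPO choice of dropping the std.

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """(R_i - mean) / std over one prompt's group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)      # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


def mean_centered_advantages(rewards):
    """Mean-only baseline, i.e. the same advantage without the std division."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


loud_prompt = [0.0, 1.0]       # rewards span the full range
quiet_prompt = [0.49, 0.51]    # rewards barely differ

loud_adv = grpo_advantages(loud_prompt)
quiet_adv = grpo_advantages(quiet_prompt)
quiet_raw = mean_centered_advantages(quiet_prompt)
```

After std-normalization both prompts produce advantages of about -1 and +1, so they carry equal weight in the update. Without it, the quiet prompt's advantages are about ±0.01, fifty times smaller than the loud prompt's ±0.5.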

GRPO's clipped objective also adds a token-level KL term against a reference policy $\pi_{\text{ref}}$ to prevent reward hacking:

$\mathcal{L}_{\text{GRPO}} = -\frac{1}{\sum_i |\tau_i|} \sum_{i=1}^{G} \sum_{t=1}^{|\tau_i|} \Bigl[\min\bigl(r_{i,t}(\theta)\hat{A}_i,\;\text{clip}(r_{i,t}(\theta), 1{-}\epsilon, 1{+}\epsilon)\hat{A}_i\bigr) - \beta\,D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\Bigr]$

where $r_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\text{old}}(a_{i,t} \mid s_{i,t})}$ is the per-token importance ratio for rollout $i$ — the GRPO-indexed counterpart of $r_t(\theta)$ from PPO.

The tradeoff: No critic memory, but you now sample $G$ responses per prompt per step. If $G = 8$ , rollout cost increases 8×. For most setups the memory saving (no second model) outweighs the extra rollout cost.

REINFORCE and REINFORCE++

Both algorithms go even simpler — no critic and no group sampling.

REINFORCE is the original policy gradient algorithm from Williams (1992); remarkably, the core algorithm is still in active use more than 30 years later. It uses the actual episode return as the $Q$ estimate:

$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_t \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \cdot R^{(i)}$

No baseline, no critic, no clipping. Each token gets the same gradient weight (the full episode reward). This is unbiased: the sample gradient equals the true gradient in expectation. But it has very high variance: if all trajectories happen to score similarly (common early in training), every sampled response is pushed up by roughly the same amount, the signal about which responses were actually better is buried in noise, and learning stalls.

REINFORCE++ (Hu et al., 2025; proposed in open-source RL work and reported to match PPO at lower cost) makes two simple fixes: token-level KL regularization folded into the reward, and a batch-mean baseline.

First, shape each rollout's reward with a token-level KL penalty against the reference:

$\tilde{R}_i = R_i - \beta \sum_t \log \frac{\pi_\theta(a_t^{(i)} \mid s_t^{(i)})}{\pi_{\text{ref}}(a_t^{(i)} \mid s_t^{(i)})}$

Then center across the batch using the mean of those shaped rewards:

$\hat{A}_i = \tilde{R}_i - \text{mean}(\{\tilde{R}_j\}_{j=1}^B)$

where $B$ is the full batch of diverse prompts (not the same prompt $G$ times, as in GRPO).

The KL term at the token level penalizes the policy for drifting from the reference model token-by-token, not just at the sequence level. This is a finer-grained constraint that helps prevent the policy from collapsing onto degenerate solutions.
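The two steps above can be sketched directly from the formulas. This is a simplified illustration with made-up log-probabilities, not the reference REINFORCE++ implementation; the function names and the value of `beta` are assumptions.

```python
def shaped_reward(task_reward, logp_policy, logp_ref, beta=0.1):
    """R_tilde = R - beta * sum_t [log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t)]."""
    kl_sum = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return task_reward - beta * kl_sum


def batch_mean_advantages(shaped_rewards):
    """Center shaped rewards with the mean over the whole batch of prompts."""
    mean = sum(shaped_rewards) / len(shaped_rewards)
    return [r - mean for r in shaped_rewards]


# Rollout 0: correct answer, but the policy has drifted from the reference.
r0 = shaped_reward(1.0, logp_policy=[-1.0, -2.0], logp_ref=[-1.5, -2.5], beta=0.1)
# Rollout 1 (a different prompt): wrong answer, no drift from the reference.
r1 = shaped_reward(0.0, logp_policy=[-1.0, -1.0], logp_ref=[-1.0, -1.0], beta=0.1)

advantages = batch_mean_advantages([r0, r1])
```

The drift costs rollout 0 a tenth of its reward (1.0 becomes 0.9) before centering. Because the baseline is shared across different prompts, an easy prompt and a hard prompt are compared on the same scale, which is the main difference from GRPO's per-prompt group mean.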

Putting It All Together

Here is a summary of what each algorithm requires and what it achieves:

Algorithm	Critic?	Baseline	Clipping	Variance
REINFORCE	No	None ( $b=0$ )	No	High
REINFORCE++	No	Batch mean	No	Medium
GRPO	No	Group mean + std-norm	Yes	Low–medium
PPO	Yes	$V^{\pi}(s)$ via critic	Yes	Low

For most practical LLM training today, GRPO or REINFORCE++ is preferred over PPO precisely because they avoid the second model. For a 70B policy, adding a 70B critic means ~140B parameters to keep in GPU memory, plus separate optimizer states. GRPO trades that memory cost for more rollouts per step, which is typically the better deal.
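A back-of-the-envelope calculation shows the scale of that cost. The byte counts below are assumptions, not figures from any specific training stack: bf16 weights and gradients (2 bytes each) plus fp32 master weights and two fp32 Adam moments (12 bytes), a common mixed-precision accounting of 16 bytes per trainable parameter. Activations, KV caches, and the frozen reference model are ignored.

```python
BYTES_BF16 = 2
BYTES_FP32 = 4


def weights_gb(num_params, bytes_per_param=BYTES_BF16):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def training_state_gb(num_params):
    """bf16 weights + bf16 grads + fp32 master weights + two fp32 Adam moments."""
    per_param = BYTES_BF16 + BYTES_BF16 + 3 * BYTES_FP32   # 16 bytes (assumed)
    return num_params * per_param / 1e9


POLICY_PARAMS = 70e9
policy_only = training_state_gb(POLICY_PARAMS)            # critic-free methods
policy_plus_critic = training_state_gb(2 * POLICY_PARAMS) # PPO with a same-size critic
```

Under these assumptions the policy alone needs about 1,120 GB of training state and a same-size critic doubles it to about 2,240 GB, before any sharding. The exact numbers depend heavily on the optimizer and parallelism setup, but the doubling does not.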

Preference Optimization

The green branch of the algorithm-tree figure takes a different route from Online Policy Gradient: skip the rollouts entirely. There's no critic, no policy-gradient sample, no separately-trained reward model, just a supervised loss over preference pairs $(y_w, y_l)$. What looks like SFT is actually solving the same KL-regularized RL objective the Online PG branch is iterating toward, but in closed form.

DPO: closed-form RL as a supervised loss

Start with the KL-regularized RL objective — the same one Online PG implicitly maximizes when it adds the $\beta \cdot D_{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ anchor:

$\max_\pi\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\big[r(x, y)\big] - \beta \cdot D_{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big)$

This has a closed-form optimum:

$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)$

Rearrange to express the reward as a function of $\pi^*$ and $\pi_{\text{ref}}$ :

$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$

Plug into the Bradley-Terry preference model $P(y_w \succ y_l \mid x) = \sigma\!\big(r(x, y_w) - r(x, y_l)\big)$. The $\log Z(x)$ terms cancel because they don't depend on $y$. What's left is the DPO loss (Rafailov et al., 2023, Direct Preference Optimization: Your Language Model is Secretly a Reward Model; the full derivation is in §4):

$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$

This is literally "KL-regularized RL, expressed as a supervised log-likelihood." The optimal policy of an Online PG run with reward $r$ and anchor $\pi_{\text{ref}}$ is exactly the policy that minimizes $\mathcal{L}_{\text{DPO}}$ on preference data labeled by that reward. DPO skips the rollouts and the policy gradient and goes straight to the supervised loss whose minimum is the RL optimum.

So when you train with DPO you're not doing supervised learning instead of RL — you're doing the closed-form solution to the same KL-regularized RL objective the Online PG branch is iterating toward.
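For a single preference pair, the DPO loss needs only four sequence-level log-probabilities. The sketch below is a plain-Python version of the formula with made-up numbers; a real implementation would sum token log-probs from the policy and the frozen reference model and average over a batch.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss from sequence-level log-probabilities."""
    implicit_reward_w = beta * (logp_w - ref_logp_w)   # beta * log(pi / pi_ref) for y_w
    implicit_reward_l = beta * (logp_l - ref_logp_l)   # beta * log(pi / pi_ref) for y_l
    return -log_sigmoid(implicit_reward_w - implicit_reward_l)


# At initialization the policy equals the reference: the margin is 0, the loss is log 2.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen response's log-prob relative to the reference lowers the loss.
loss_after_update = dpo_loss(-9.0, -12.0, -10.0, -12.0)
```

Note that only the log-ratios against the reference matter: a response that is already likely under $\pi_{\text{ref}}$ earns no implicit reward for staying likely.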

The cost: DPO needs preference data $(x, y_w, y_l)$ with explicit pairwise labels, and the implicit reward is whatever the labelers agreed about, baked into the dataset. No reward model needs to be learned and no rollouts need to be sampled — but you've paid for the preferences upfront.

Variants

The DPO variants in the algorithm-tree figure all share DPO's structural choice (target = closed-form RL optimum, divergence = supervised log-likelihood) and tweak one of the moving parts:

SimPO (Meng et al. 2024) drops $\pi_{\text{ref}}$ from the loss entirely and uses the length-normalized average log-prob as the implicit reward. Faster to train, no reference model in memory, slightly looser alignment with the original RL optimum.
KTO (Ethayarajh et al. 2024) replaces pairwise preferences with absolute "good" / "bad" labels per sample (binary classification framing). Easier label collection, comparable performance.
IPO (Azar et al. 2024) replaces the sigmoid in DPO with an MSE-style loss to address overfitting on near-deterministic preference data — essentially adds smoothing.
ORPO (Hong et al. 2024) folds an explicit SFT term into the DPO loss so you can collapse the SFT and preference-tuning stages into one pass.
Online DPO runs DPO on a stream of fresh preferences (sampled and labeled during training) instead of a fixed dataset, recovering some of the on-policy benefits.

These all live on the same axis as Online PG; they just trade rollouts for preference data and a closed-form solution.

Self-Training

The amber branch of the algorithm-tree figure sits at the simplest end of the tree. There's no policy gradient, no KL term, no preference data, no critic. Sample rollouts, filter by reward, SFT on the survivors, repeat.

That's it. The "RL" lives entirely in the filtering step.

The iterated SFT loop

For $T$ rounds:

Sample $N$ rollouts from current $\pi_\theta$ on each prompt.
Score them with a reward function (or a correctness check, or a test-passing oracle, etc.).
Keep the top- $k$ per prompt, or all rollouts with reward above a threshold → call this filtered set $\mathcal{D}_t$ .
SFT $\pi_\theta$ on $\mathcal{D}_t$ → new $\pi_\theta$ .
Repeat.

The per-round objective is the plain SFT loss on the filtered set (here $\tau$ denotes the reward threshold, not a trajectory):

$\mathcal{L}_t(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_t}\!\left[\log \pi_\theta(y \mid x)\right], \quad \mathcal{D}_t = \{(x, y) : y \sim \pi_{\theta_{t-1}}(\cdot \mid x),\; R(x, y) \geq \tau\}$

Each iteration is forward KL minimization to a self-curated target distribution biased toward high reward. The rejection-sampling filter is doing the work of the policy gradient — high-reward samples get all the gradient mass, low-reward samples get none. As $\pi_\theta$ improves, $\mathcal{D}_t$ improves, which improves $\pi_\theta$ further.
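The loop can be sketched on a toy problem where the "policy" is just a categorical distribution over four candidate answers to one prompt. Everything here is an assumption for illustration: the verifier-style reward, the sample count, and especially the update rule, which stands in for an SFT step by interpolating halfway toward the maximum-likelihood fit of the filtered samples.

```python
import random


def self_training_round(probs, rewards, num_samples, threshold, step, rng):
    """Sample from the current policy, filter by reward, then move the policy
    toward the empirical distribution of the surviving samples."""
    answers = list(range(len(probs)))
    samples = rng.choices(answers, weights=probs, k=num_samples)
    kept = [y for y in samples if rewards[y] >= threshold]
    if not kept:                       # nothing passed the filter: skip the update
        return probs
    empirical = [kept.count(y) / len(kept) for y in answers]
    # Stand-in for SFT: interpolate toward the filtered data's MLE fit.
    return [(1 - step) * p + step * q for p, q in zip(probs, empirical)]


rng = random.Random(0)
REWARDS = [0.0, 1.0, 0.0, 0.0]     # hypothetical verifier: only answer 1 is correct
policy = [0.4, 0.2, 0.3, 0.1]      # initial policy over the four answers
history = [policy[1]]              # probability of the correct answer per round
for _ in range(5):
    policy = self_training_round(
        policy, REWARDS, num_samples=64, threshold=1.0, step=0.5, rng=rng
    )
    history.append(policy[1])
```

The probability of the correct answer climbs each round even though no gradient of the reward is ever taken; the filter alone decides which samples the model is allowed to imitate.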

Compared to the Online PG branch, Self-Training has lower variance per step (no per-token gradient on every rollout) but coarser credit assignment (sequence-level filter, no per-token signal). You can think of it as the policy-gradient equivalent of REINFORCE with a hard threshold instead of a learned baseline — all-or-nothing advantages instead of continuous ones.

STaR, ReST, RAFT

The three named methods differ mainly in what gets filtered and how the loop is scheduled:

STaR (Zelikman et al. 2022) — the original. For reasoning tasks with verifiable answers: sample reasoning chains, keep the ones that arrive at the correct final answer, SFT on the (problem, correct chain) pairs. Plus a "rationalization" trick: for problems the model gets wrong, condition on the gold answer to generate a fake-but-plausible chain, and include those in $\mathcal{D}_t$ too. Bootstraps reasoning ability without supervised reasoning traces.
ReST (Gulcehre et al. 2023) — Reinforced Self-Training. Generalizes STaR's loop to arbitrary reward-model-scored samples (not just verifiable correctness). Two nested loops: an outer "Grow" loop that samples new data from the current policy, and an inner "Improve" loop that filters at progressively higher thresholds and runs SFT.
RAFT (Dong et al. 2023) — Reward-rAnked Fine-Tuning. The most direct: sample $N$ , take top- $k$ by reward, SFT, repeat. No threshold scheduling, no rationalization trick. The simplest possible iterated-SFT loop.

The reason this branch works at all is the same reason Online PG works: as long as the filter is correlated with reward, the policy improves on average each round. Self-Training trades the precision of a continuous advantage for the simplicity of "just SFT, repeatedly."

Linking back to SFT

The three branches above (Online Policy Gradient, Preference Optimization, Self-Training) cover the whole algorithm-tree figure. SFT itself sits outside that tree; it's the loss everything starts from.

These four (SFT plus the three branches) aren't actually different paradigms. Every one of them reduces to the same shape:

Choose a target distribution. Minimize a divergence to it.

What changes is which target and which divergence.

SFT: forward KL to the data

The standard SFT loss is cross-entropy:

$\mathcal{L}_{\text{SFT}} = -\log \pi_\theta(y^* \mid x)$

Cross-entropy decomposes as $H(p, q) = H(p) + D_{KL}(p \,\|\, q)$ . For SFT the target distribution $p$ is one-hot on the gold token, so $H(p) = 0$ and cross-entropy equals KL exactly:

$\mathcal{L}_{\text{SFT}} = D_{KL}(p_{\text{data}} \,\|\, \pi_\theta)$

That's forward KL — data on the left, model on the right. Forward KL is mode-covering: it punishes the model for assigning low probability where the data has mass, so the model is forced to spread itself to cover everything in the dataset. This is the baseline. Every other branch is a way of choosing a different target distribution and minimizing some divergence to that — when the data alone isn't enough.
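The decomposition is quick to verify numerically. This sketch uses a made-up three-token distribution for the model and checks both the general identity and the one-hot special case where the entropy term vanishes.

```python
import math


def entropy(p):
    """H(p), skipping zero-probability entries (0 * log 0 is taken as 0)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl(p, q):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


model = [0.7, 0.2, 0.1]          # hypothetical next-token distribution
soft_target = [0.5, 0.3, 0.2]    # a target with non-zero entropy
one_hot = [1.0, 0.0, 0.0]        # the SFT case: all mass on the gold token

gap_soft = cross_entropy(soft_target, model) - kl(soft_target, model)
gap_one_hot = cross_entropy(one_hot, model) - kl(one_hot, model)
sft_loss = cross_entropy(one_hot, model)
```

For the soft target the gap equals $H(p)$; for the one-hot target it is zero, and the loss reduces to $-\log 0.7$, the negative log-probability of the gold token.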

Forward KL vs reverse KL

The $\beta \cdot D_{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ anchor used by Online PG is reverse KL — model on the left, reference on the right. Reverse KL is mode-seeking: only places where the model puts mass contribute to the penalty. The model is free to drop mass on anything; the only thing it can't do without paying is invent outputs $\pi_{\text{ref}}$ never produced.

This is why post-RL policies are sharper than their SFT base. SFT spread the mass to cover everything reasonable in the data; reverse-KL-anchored RL lets the model concentrate mass on whichever subset of that cover the reward favors. This is real and observable: entropy drops, perplexity on its own samples falls, output diversity at any given temperature shrinks. People sometimes call this "mode collapse" and treat it as a bug. It's the design working as specified — the KL direction was chosen precisely to allow collapse onto high-reward modes while preventing the worse failure of the model walking off SFT's support entirely.
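The two KL directions can be seen disagreeing on a three-outcome example. The numbers are made up: a target with two modes separated by a low-probability valley, one candidate that spreads its mass everywhere, and one that commits to a single mode. This compares two fixed candidates rather than running an optimizer.

```python
import math


def kl(p, q):
    """D_KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


target = [0.49, 0.02, 0.49]       # two modes with a valley in between
q_cover = [1 / 3, 1 / 3, 1 / 3]   # spreads mass over everything
q_mode = [0.96, 0.02, 0.02]       # commits to a single mode

# Forward KL: target on the left, candidate on the right (the SFT direction).
forward = {"cover": kl(target, q_cover), "mode": kl(target, q_mode)}
# Reverse KL: candidate on the left, target on the right (the RL-anchor direction).
reverse = {"cover": kl(q_cover, target), "mode": kl(q_mode, target)}
```

Forward KL prefers the covering candidate by a wide margin, because the single-mode candidate puts almost no mass on the second mode. Reverse KL prefers the single-mode candidate, because the covering candidate wastes a third of its mass in the valley and ignoring a mode costs nothing.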

(Aside: classical RL — Atari, MuJoCo, gym — has no $\pi_{\text{ref}}$ in the loss at all. Just the clipped surrogate. You start from a random policy, there's no good behavior to anchor to, and drift is the goal. The frozen-SFT KL anchor is a contribution of LLM-RL, not PPO itself.)

Two kinds of mode-seeking

A reasonable question after the KL-direction story: is RL just a mode-seeking operation? Two effects say yes, and they're worth separating.

Reason 1: the fixed point of reward maximization. Forget the KL anchor for a moment. The bare objective $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ has a degenerate optimum — a point mass on $\arg\max_a R(a)$ . The policy that maximizes expected reward puts all its probability on the single best action (or one of them, if ties). This has nothing to do with KL. Maximizing an expectation over a bounded reward collapses the distribution onto the argmax. Pure RL, in this sense, is maximally mode-seeking: the fixed point is a delta function. The path during training doesn't have to be that, but the destination is.

Reason 2: the reverse-KL anchor. Layered on top of Reason 1 in LLM-RL is the $-\beta \cdot D_{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ term — the mode-seeking direction of KL, as we just saw. This doesn't cause mode-seeking (reward max already does that); it constrains where the collapse lands. Without the anchor, the policy is free to collapse anywhere — including onto out-of-distribution junk that happens to game the reward. With it, the policy can only collapse onto a subset of $\pi_{\text{ref}}$ 's support. The anchor doesn't prevent mode-seeking; it just channels where the collapse goes.

So the LLM-RL setup stacks two mode-seeking pressures: reward-max picks the high-reward mode, the reverse-KL anchor picks the modes inside $\pi_{\text{ref}}$ 's support.

The cases where RL is not mode-seeking are exactly the cases where one of these is broken. Maximum-entropy RL ( $\max\,\mathbb{E}[R] + \beta\, H(\pi)$ , as in soft actor-critic) replaces the argmax fixed point with a softmax over rewards — concentrated but not collapsed. Off-policy methods with a diverse behavior prior inherit some of that breadth. Explicit diversity rewards add a term that resists collapse. None of these are active in the standard RLHF or GRPO loop. That's why post-RL policies are notably sharp — the mode-seeking is intentional, on two axes at once, and nothing in the recipe pushes back.
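The maximum-entropy case can be checked in a one-step setting. With made-up rewards and $\beta = 0.5$, the sketch below evaluates $\mathbb{E}[R] + \beta H(\pi)$ for the softmax policy and for the greedy point mass; the optimum value $\beta \log \sum_a \exp(R(a)/\beta)$ is the standard closed form for this objective.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def maxent_objective(policy, rewards, beta):
    """E_pi[R] + beta * H(pi)."""
    expected = sum(p * r for p, r in zip(policy, rewards))
    return expected + beta * entropy(policy)


rewards = [1.0, 0.5, 0.0]     # hypothetical per-action rewards
beta = 0.5
soft_policy = softmax([r / beta for r in rewards])
greedy_policy = [1.0, 0.0, 0.0]
log_partition = math.log(sum(math.exp(r / beta) for r in rewards))

soft_value = maxent_objective(soft_policy, rewards, beta)
greedy_value = maxent_objective(greedy_policy, rewards, beta)
```

The softmax policy scores about 1.20 against 1.0 for the greedy policy, and it still puts the most mass on the best action: concentrated but not collapsed. With the entropy bonus removed, the greedy policy would win.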

The unifying picture

Pulling all four together:

Branch	Target	Divergence	Gradient flow	Mode behavior
SFT	The data	Forward KL	Direct on demos	Mode-covering
Online PG	Reward-shaped, anchored to ref policy	Reverse KL + reward max	Score-function on rollouts	Mode-seeking
Preference Opt	Closed-form RL optimum	Supervised log-likelihood	Direct on preference pairs	Mode-seeking (fits the reverse-KL optimum)
Self-Training	Reward-filtered self-samples	Forward KL	Iterated SFT on the filter	Mode-seeking (covers within the reward filter)

The branches answer one question: which of the two, the target or the divergence, am I willing to pay for?

Cheap target, expensive divergence: Online PG. The target is just "high reward, anchored to SFT" — easy to specify, but you pay with rollouts and noisy gradients.
Expensive target, cheap divergence: Preference Optimization. The target requires preference data (and an implicit reward model embedded in the labelers), but once you have it, the loss is plain supervised.
Iterative bootstrap: Self-Training. Don't pay for either upfront — let the policy and the target distribution co-evolve through filtering.
No target synthesis at all: SFT. The target is given (the data) and the divergence is the cheapest possible.

That's why "SFT vs RL" is the wrong axis. The right axis is: how much work are you willing to do to construct the target distribution, and what divergence are you willing to compute against it? Everything else — the critic, the clip, the KL coefficient, the preference labelers, the rejection filter — is engineering in service of those two choices.

Closing Thoughts

All four online policy-gradient algorithms (PPO, GRPO, REINFORCE, REINFORCE++) are fundamentally solving the same problem: estimate the gradient of $J(\theta) = \mathbb{E}[R(\tau)]$ . What separates them is how carefully they estimate the advantage, and at what cost.

The elegant thing about the zero-gradient property is that it gives enormous freedom in choosing the baseline:

GRPO: the mean reward of a group of same-prompt rollouts is a cheap, unbiased estimate of $V^{\pi}$.
REINFORCE++: a batch-level constant helps, and token-level KL tightens the constraint where it matters.

You see, deep theory and practical engineering are not always in tension. Sometimes, the right theoretical framing reveals that a simple heuristic is actually doing exactly the right thing.

If we stand a bit higher, every branch picks a target distribution and minimizes a divergence to it: SFT to the data, Self-Training to its reward-filtered samples, Preference Optimization to the closed-form $\pi^*$ . Even Online PG, the branch that looks like it maximizes reward rather than matching a distribution, can be interpreted as doing the same thing. The KL-regularized objective is, up to a constant, a divergence to the very same $\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\,\exp\!\big(\tfrac{1}{\beta} r(x, y)\big)$ we derived for DPO:

$\underbrace{\mathbb{E}_{y \sim \pi}\big[r(x, y)\big] - \beta\, D_{KL}\big(\pi \,\|\, \pi_{\text{ref}}\big)}_{\text{reward, KL-anchored}} \;=\; \underbrace{-\beta\, D_{KL}\big(\pi \,\|\, \pi^*\big)}_{\text{divergence to the target}} \;+\; \text{const}$

Maximizing reward is minimizing reverse KL to a target. The reward shapes the policy's target distribution by tilting the reference policy by $\exp(r/\beta)$ , with $\beta$ the temperature of the tilt: small $\beta$ concentrates $\pi^*$ on high reward, large $\beta$ leaves it near $\pi_{\text{ref}}$ .
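The identity holds for any policy, not just the optimum, and the constant is $\beta \log Z(x)$. Here is a numeric check on a three-outcome toy problem; the reference distribution, rewards, and candidate policy are made-up numbers.

```python
import math


def kl(p, q):
    """D_KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def tilted_target(ref, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z, returned together with Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z


ref = [0.5, 0.3, 0.2]         # hypothetical reference policy over three responses
rewards = [0.0, 1.0, 0.4]     # hypothetical rewards
beta = 0.5
target, z = tilted_target(ref, rewards, beta)


def regularized_objective(policy):
    """E_pi[r] - beta * KL(pi || pi_ref)."""
    expected = sum(p * r for p, r in zip(policy, rewards))
    return expected - beta * kl(policy, ref)


def divergence_form(policy):
    """-beta * KL(pi || pi*) + beta * log Z."""
    return -beta * kl(policy, target) + beta * math.log(z)


candidate = [0.2, 0.5, 0.3]   # an arbitrary policy that is not the optimum
```

Both forms give the same number for `candidate`, and the objective is maximized exactly at `target`, where the KL term vanishes and only $\beta \log Z$ remains.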

That turns "SFT vs RL" from a question of philosophy into a question of access: do you have samples from your target, or only a way to score samples? Demonstrations are samples from the target, so forward KL is computable directly, aka SFT. A reward is only a scorer; you can't draw from the target, so the best you can do is sample your own policy and reweight by reward, which is reverse KL on rollouts, aka RL. The divergence direction isn't chosen for its mode-seeking or mode-covering flavor; it's forced by what you can sample, and the mode behavior follows.

Ground this in two tasks that pull in opposite directions.

Take a verifiable task: $r = 1$ when the answer checks out, $0$ otherwise. The tilt $\exp(r/\beta)$ multiplies correct answers by $\exp(1/\beta)$ and wrong ones by $\exp(0) = 1$ , so as $\beta \to 0$ it blows up on the correct answers while the wrong ones wash out under the normalizer. What's left is the base model's own distribution, restricted to correct answers. This is exactly what you get by sampling from $\pi_{\text{ref}}$ and throwing the failures away. The cold-temperature reward optimum is literally rejection sampling, which is why filtered self-training fits the same target.
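The cold-temperature limit is easy to see with numbers. The sketch assumes a made-up reference policy over three answers, of which the last two are correct, and compares the tilted target at a warm and a cold $\beta$ against plain rejection sampling.

```python
import math


def tilted_target(ref, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    r_max = max(rewards)   # subtract the max reward to avoid overflow at small beta
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


def reject_failures(ref, rewards):
    """Distribution of pi_ref samples after discarding every r = 0 sample."""
    kept = [p if r == 1.0 else 0.0 for p, r in zip(ref, rewards)]
    z = sum(kept)
    return [k / z for k in kept]


ref = [0.5, 0.3, 0.2]        # hypothetical base-model distribution over answers
rewards = [0.0, 1.0, 1.0]    # verifier: the last two answers are correct

warm = tilted_target(ref, rewards, beta=1.0)
cold = tilted_target(ref, rewards, beta=0.02)
filtered = reject_failures(ref, rewards)
```

At $\beta = 1$ the wrong answer still keeps roughly a quarter of the mass. At $\beta = 0.02$ the target matches the rejection-sampled distribution (0, 0.6, 0.4) to many decimal places, and the two correct answers keep their original 3:2 ratio from the base model.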

A learned safety reward is the soft version of the same move: it scores refusals on harmful prompts highly, so the tilt piles $\pi^*$ 's mass onto refusal-shaped responses. But this task has essentially one acceptable behavior: refuse, and you want it every single time, not a diverse spread of ways to handle a jailbreak. So here you actually want the policy to collapse onto that one mode, which is exactly the mode-seeking reverse-KL RL gives you. The diversity loss that looks like RL's weakness is the feature.

Same machinery, opposite wishes: math has many correct answers and you'd rather keep them all; safety has basically one right behavior and you'd rather lock onto it. The reward says where the target sits; the divergence says whether you fan out across it or collapse onto it.

Which is the whole post in one line: the gradient of $J(\theta)$ walks the policy toward a target distribution the reward quietly defines.

There's one last lens that makes the same point from underneath. A policy assigns a probability $\pi(y \mid x)$ , and any probability is a code: the string $y$ can be written in $-\log_2 \pi(y \mid x)$ bits. Forward KL to the data is then just the excess bits the model spends describing it, $D_{KL}(p_{\text{data}} \,\|\, \pi) = \mathbb{E}_{\text{data}}[-\log \pi] - H(p_{\text{data}})$ , so SFT is nothing but "make the model the shortest description of the data your function class can express." That's the Minimum Description Length principle (yeah, Kolmogorov complexity), and it's why the same objective that trains a language model also makes it a state-of-the-art compressor. (See Ilya Sutskever, An Observation on Generalization, Simons Institute, Aug 2023, which argues that unsupervised learning works because fitting the data well is compressing it, with the explicit link to Kolmogorov complexity; and Delétang et al., 2023, Language Modeling Is Compression: log-loss is bits-per-byte, and large language models turn out to be strong general-purpose compressors.)

The reward branch is the same statement, re-priced. Read the optimum $\pi^*$ in bits:

$-\log \pi^*(y \mid x) = -\log \pi_{\text{ref}}(y \mid x) - \tfrac{1}{\beta}\, r(x, y) + \log Z(x)$

The codelength under the optimal policy is the prior's codelength minus reward (in nats) over $\beta$ , plus the normalizer $\log Z(x)$ , which is shared by every $y$ for a given prompt. Reward literally buys shorter codes, and $\beta$ is the exchange rate between bits and reward. The anchor $\beta \cdot D_{KL}(\pi \,\|\, \pi_{\text{ref}})$ is the price, in bits, of moving off the prior. So pretraining pays the enormous compression cost of describing the corpus, and RL pays only a small, KL-bounded surcharge to re-rank what already lives in $\pi_{\text{ref}}$ 's support.
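The codelength reading can be checked on the same kind of toy problem. The numbers are made up, and lengths are converted from nats to bits by dividing by $\ln 2$; the sketch computes each outcome's codelength under $\pi^*$ directly and via the decomposition above.

```python
import math


def bits(prob):
    """Codelength of an outcome with the given probability, in bits."""
    return -math.log2(prob)


ref = [0.5, 0.3, 0.2]         # hypothetical prior (reference policy)
rewards = [0.0, 1.0, 0.4]     # hypothetical rewards, in nats after dividing by beta
beta = 0.5

weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
z = sum(weights)
target = [w / z for w in weights]      # pi*(y)

NATS_TO_BITS = 1.0 / math.log(2)
# Prior codelength, minus the reward discount, plus the shared normalizer.
decomposed = [
    bits(p) - NATS_TO_BITS * r / beta + NATS_TO_BITS * math.log(z)
    for p, r in zip(ref, rewards)
]
direct = [bits(p) for p in target]
```

The two lists agree. The high-reward outcome ends up with a shorter code than it had under the prior, while the zero-reward outcome gets a longer one: it receives no discount but still pays the shared $\log Z$ term.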

Which is maybe the most compact way to hold the whole tree in your head: every method is searching for the shortest code for its target, and reward is just a discount on the codelength of the outcomes you want to see.

Citation

Please cite this work as:

Xuhui Zhou, “Thinking in RL”, 2026.

Or use the BibTeX citation:

@misc{zhou2026rl,
  author = {Xuhui Zhou},
  title = {Thinking in RL},
  year = {2026},
  howpublished = {\url{https://xuhuizhou.com/blog/thinking-in-rl}},
}