Learning with Verbal Feedback

By Xuhui Zhou and Weiwei Sun · May 31, 2026

After scores and checkmarks, the next reward is a sentence.

Imagine learning tennis from a single number. You serve, you rally, and at the end of each point a coach who has watched silently holds up a card: 0.6. No "your toss is too low," no "you're late on the backhand" — just a score, point after point, for months. You would eventually improve, because even a number carries some gradient. But you would improve absurdly slowly, and you would have no idea what you were doing right or wrong.

This is, more or less, how we train language models with reinforcement learning. And almost everyone agrees it is strange — yet it is the standard recipe. This post is about the slow, ongoing correction of that strangeness: the arc along which the feedback signal in RL for language models has been getting more expressive, and where we think it goes next. The short version of our thesis is that the reward is becoming a sentence.

We will end at a specific paper — Ditto, which folds verbal feedback directly into the RL loop for the messiest domain we know of — simulating human behavior. But Ditto only makes sense as the endpoint of a much longer story, so we will spend most of the post getting there.

Recap: RL as a gradient on a scalar

In the companion post, Thinking in RL, we argued that every RL algorithm for language models is, underneath, the same move: estimate the gradient of

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\bigl[R(\tau)\bigr]$$

where $R(\tau)$ is a scalar reward assigned to a trajectory. PPO, GRPO, REINFORCE, and DPO differ in how they estimate the advantage and regularize the step, but they all differentiate the same object: the expectation of a number.
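
For reference, here is the standard score-function (REINFORCE) form of that gradient, written as a worked identity and not as anything specific to this post. The baseline $b$ and the sample count $N$ are symbols introduced here for illustration; $b$ is assumed not to depend on the sampled trajectory.

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\bigl[R(\tau)\bigr] \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\bigl[R(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\bigr] \\
  &\approx \frac{1}{N} \sum_{i=1}^{N} \bigl(R(\tau_i) - b\bigr)\, \nabla_\theta \log \pi_\theta(\tau_i),
  \qquad \tau_i \sim \pi_\theta
\end{aligned}
```

The only place the feedback enters is the factor $R(\tau_i) - b$, which is why the rest of the machinery can stay fixed while the thing behind $R$ changes.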

This whole post is about that number refusing to stay a number. Read the arc as the story of $R$ slowly acquiring the bandwidth of language while the algorithms barely change.

The scalar era and its bottleneck

The modern recipe begins by deciding, deliberately, that the human teacher should never have to say what they mean. Christiano et al. reduce human teaching to ordinal comparisons between trajectory segments, then fit a single scalar reward to those comparisons. Ziegler, Stiennon, and then InstructGPT carry this into language models and make it the industry default.
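
To make "fit a single scalar reward to those comparisons" concrete, here is a minimal sketch that assumes a Bradley-Terry preference model, the usual choice in this line of work. The function names, the toy comparisons, and the plain gradient-descent loop are illustrative assumptions, not the setup of any cited paper.

```python
import math


def bradley_terry_loss(r_preferred: float, r_rejected: float) -> float:
    """Negative log-likelihood that the preferred item wins under Bradley-Terry."""
    margin = r_preferred - r_rejected
    # -log(sigmoid(margin)) = log(1 + exp(-margin)); guard against overflow.
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin


def fit_scalar_rewards(comparisons, n_items, lr=0.5, steps=500):
    """Fit one scalar per item from (winner, loser) index pairs by gradient descent."""
    rewards = [0.0] * n_items
    for _ in range(steps):
        grads = [0.0] * n_items
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(rewards[winner] - rewards[loser])))
            # Gradient of -log(p_win) with respect to each reward.
            grads[winner] -= 1.0 - p_win
            grads[loser] += 1.0 - p_win
        rewards = [r - lr * g / len(comparisons) for r, g in zip(rewards, grads)]
    return rewards


# Toy data: item 0 usually beats item 1, and both beat item 2.
toy_comparisons = [(0, 1), (0, 1), (1, 0), (0, 2), (1, 2)]
toy_rewards = fit_scalar_rewards(toy_comparisons, n_items=3)
```

Whatever reasons the annotators had for each comparison, all that survives the fit is one number per item.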

The defining act of this era is compression. Everything a human cares about — is it correct, is it kind, is it the right length, did it dodge the question — gets squeezed into one number so the policy gradient has something scalar to chase. It works astonishingly well. But the bottleneck is real and, by now, formally understood. Casper et al. catalogue how a single scalar reward model cannot represent a population with diverse or multi-objective values; Siththaranjan et al. prove that standard preference learning silently performs a Borda-count vote over hidden context, discarding the disagreement instead of representing it.
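
A toy illustration of what a Borda-style aggregate does to disagreement may help here. It is a sketch of the voting rule itself, not a reproduction of the cited proof; the two annotator groups, their weights, and the response labels are invented for the example.

```python
def borda_scores(group_rankings, group_weights):
    """Weighted Borda count: each item scores the number of items ranked below it."""
    scores = {item: 0.0 for item in group_rankings[0]}
    for ranking, weight in zip(group_rankings, group_weights):
        for position, item in enumerate(ranking):
            scores[item] += weight * (len(ranking) - 1 - position)
    return scores


# Hypothetical hidden context: two groups of annotators with opposite tastes.
rankings = [
    ["concise", "balanced", "detailed"],   # 60% of annotators, best first
    ["detailed", "balanced", "concise"],   # 40% of annotators, best first
]
scores = borda_scores(rankings, [0.6, 0.4])
aggregate_order = sorted(scores, key=scores.get, reverse=True)
```

In this toy case the aggregate order matches the majority group exactly, and nothing in the three resulting numbers records that 40% of annotators held the reverse ranking.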

This is $\mathbb{E}[R(\tau)]$ in its purest, lossiest form: a number that is, by construction, a vote whose ballots we burned.

Two ways the scalar tried to grow

Faced with the bottleneck, the field tried two repairs — both of which kept the reward a number.

Decompose it. Instead of one holistic score, train several dimension-specific reward models and hand out dense, segment-level rewards. Wu et al.'s fine-grained RLHF is the clean example; process reward models push the same idea down to individual reasoning steps. Richer, yes — but still a vector of scalars, still no words.

Make it honest. The reasoning era took the opposite tack: stop trying to learn a faithful reward and instead pick a domain where the reward can be checked. RL with verifiable rewards — DeepSeekMath's GRPO, DeepSeek-R1, Tulu 3 — replaces the hackable reward model with a programmatic correctness check. This is the scalar's apotheosis: when correctness is verifiable, the number loses almost nothing, and it works spectacularly.
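
As a minimal sketch of what a programmatic correctness check can look like, the function below scores a math completion against a reference answer. The "Answer:" marker convention and both function names are assumptions made for this example; real RLVR pipelines use their own answer formats and checkers.

```python
from fractions import Fraction
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if it is absent."""
    marker = "Answer:"  # assumed output convention, not a standard
    index = completion.rfind(marker)
    if index == -1:
        return None
    return completion[index + len(marker):].strip()


def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the final answer equals the reference exactly, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        # Fraction parses "1/2" and "0.5" to the same exact rational value.
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answers earn nothing
```

For example, `verifiable_reward("Half of one is ... Answer: 0.5", "1/2")` returns 1.0. There is no learned model here for the policy to exploit, though a policy can still game a checker that is too permissive.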

But "make it honest" only works where honesty is checkable. It draws a boundary, and the rest of this post lives on the far side of it. On one side are math and code, where a scalar is honest; on the other are questions like "was this a believable, socially skilled human?", where it is not.

The figure for this section (not available in this copy) is the whole domain argument in one image. In a verifiable task, the scalar and a full verbal critique agree: the number is an honest summary. In a subjective task, the scalar 0.7 sits next to a long, divergent critique, because it has compressed away the very thing being judged. Verbal feedback helps most exactly where the scalar hides the most.

Language sneaks in — and gets escorted back out

Language did find its way into the loop, repeatedly. The pattern is always the same: it enters as a critique and leaves as a number.

Constitutional AI is the pivotal example. A plain-English principle drives a verbal self-critique-and-revise; the model rewrites its answer to better satisfy the principle. But the language is an intermediate: the revised pair is immediately converted into a preference and distilled into a scalar reward model. RLAIF formalizes the AI-feedback substitution and shows it scales; Self-Rewarding and Meta-Rewarding push the judge to be the model itself, judging its own outputs and then its own judgments.

Even the reasoning-era verifiers started to talk before scoring. Critique-out-Loud reward models and generative verifiers verbalize a critique and then emit a number; rubrics-as-rewards replace the binary check with a structured language rubric to push RLVR into fuzzier domains.

In every one of these, language is allowed in the front door and escorted out the back. The critique does real work, then we throw away the words and keep the number. The obvious question — the one this whole post is circling — is: what if the words are allowed to stay?

The verbal thread, branch one: feedback at inference time

There is a parallel tradition that keeps the feedback in language and never converts it to a reward at all. Its first branch does so at inference time, with no weight updates.

Reflexion is the conceptual anchor. It reframes the RL loop itself as verbal: the agent acts, a critic produces a verbal reflection on what went wrong, and the "policy update" is simply appending that reflection to an episodic memory the agent reads on its next attempt. Self-Refine strips it to the minimum — one frozen model generates, critiques itself, and revises, no training at all; CRITIC grounds the self-critique in external tools so it is reliable rather than hallucinated.
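
The loop described above can be sketched in a few lines. This is a toy under loud assumptions: the actor, evaluator, and critic are stand-in functions playing a number-guessing game, where a real system would call a language model for each, and none of the names come from the Reflexion codebase.

```python
from typing import Callable, List, Tuple


def reflexion_loop(
    act: Callable[[List[str]], int],
    evaluate: Callable[[int], bool],
    reflect: Callable[[int], str],
    max_attempts: int = 8,
) -> Tuple[int, List[str]]:
    """Retry a task, appending a verbal reflection to memory after each failure."""
    memory: List[str] = []          # the only thing that changes between attempts
    attempt = act(memory)
    for _ in range(max_attempts - 1):
        if evaluate(attempt):
            break
        memory.append(reflect(attempt))   # the "policy update" is just text
        attempt = act(memory)
    return attempt, memory


TARGET = 7  # hidden from the actor; only the critic sees it


def toy_act(memory: List[str]) -> int:
    """Stand-in actor: narrows its guess using the reflections it has read."""
    low, high = 0, 16
    for note in memory:
        words = note.split()
        guess, direction = int(words[1]), words[-1]
        if direction == "low":
            low = guess + 1
        else:
            high = guess - 1
    return (low + high) // 2


def toy_evaluate(attempt: int) -> bool:
    return attempt == TARGET


def toy_reflect(attempt: int) -> str:
    return f"guess {attempt} was too {'low' if attempt < TARGET else 'high'}"


final_attempt, notes = reflexion_loop(toy_act, toy_evaluate, toy_reflect)
```

Nothing in the actor's parameters changes across attempts. All of the improvement lives in `notes`, which is discarded when the episode ends.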

These are genuinely powerful, and they prove the point that language is a sufficient learning signal. But the lesson lives in the context window and dies with the episode. The model re-derives "don't make this mistake" every single time it sees a fresh instance; nothing accumulates in the weights. It is learning written entirely in sand.

The verbal thread, branch two: internalizing feedback into weights

The second branch makes the feedback stick by turning it into a gradient.

Chain-of-Hindsight is the earliest clean instance: condition on a (generation, feedback) pair and train next-token prediction on the improved continuation, for feedback of any polarity. The ILF (imitation learning from language feedback) lineage makes it a real RLHF alternative: take a piece of language feedback, sample candidate refinements conditioned on it, keep the best, and fine-tune on it. REFINER trains a generator against a learned critic that emits structured natural-language feedback on each reasoning step.

What makes this branch principled — rather than a pile of heuristics — is that it sits on three older ideas, which we will see fused in Ditto:

  • Privileged information. The feedback is information available to the teacher during training but absent at test time. Learning under privileged information (LUPI) is exactly this asymmetry, and it is known to speed up learning, not just enable it.
  • Distillation. Keeping a privileged teacher's competence in an unprivileged student is just distillation, and the two ideas were formally unified as "generalized distillation."
  • Hindsight relabeling. Re-running an episode as if you had known the right thing all along turns a failure into supervision. Hindsight Experience Replay did this for goals in control; HIR lifted it to instructions for language models.

Recently the two threads — verbal feedback and the RL loop — have started to merge outright. CTRL and Critique-RL make the verbal critique itself the object of RL, optimized for how much it helps a downstream revision; Critique-GRPO augments numerical GRPO with natural-language critiques, on the explicit argument that a scalar "cannot convey why a response fails or how to fix it." And the sharpest statement of the counter-thesis simply abandons the scalar: condition the policy directly on the feedback.

Ditto: letting the words stay

Ditto is where this thread becomes load-bearing in the place the scalar fails hardest — subjective, multi-turn human-behavior simulation — and where the verbal feedback is finally allowed to stay as the operative training signal.

The setup keeps the policy-gradient machinery from the GRPO era and changes only what the judge returns. For each rollout on a prompt $x$, the judge returns not just a number but a pair: a scalar reward $r$ and structured verbal feedback $h$.

The mechanism, concretely:

  1. Student rollout. Sample $y_0 \sim \pi_\theta(\cdot \mid x)$, the unaided attempt, and judge it to get $(r_0, h)$.
  2. Hindsight teacher rollout. Condition the same policy on the critique: $y_1 \sim \pi_\theta(\cdot \mid x, h)$, and judge it to get $r_1$. Because $y_1$ got to read what was wrong with $y_0$, it is reliably better. This is hindsight relabeling in language.
  3. Joint GRPO. Put $\{y_0, y_1\}$ in one group and compute group-relative advantages $\hat{A}_i = (r_i - \mu)/\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the rewards in the group, exactly as in GRPO. Add an extra term $\mathcal{L}_{\text{fb}}$ that does a second GRPO update on the feedback-conditioned rollouts alone:

$$\mathcal{L}_{\text{Ditto}} = \mathcal{L}_{\text{group}}(\{y_0, y_1\}) + \lambda\,\mathcal{L}_{\text{fb}}(y_1)$$
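
The three steps and the combined loss can be sketched numerically as follows. This is an illustration under stated assumptions, not Ditto's implementation: it uses the population standard deviation, an unclipped REINFORCE-style surrogate in place of the full GRPO objective, and plain floats for log-probabilities. The function names, `eps`, and the default `lam` are made up for the example, and the post does not specify how many rollouts of each kind are sampled or which conditioning is used to score $y_1$ in the joint term.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """A_i = (r_i - mean) / (std + eps), computed within one group of rollouts."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


def surrogate_loss(logprobs: List[float], advantages: List[float]) -> float:
    """Unclipped policy-gradient surrogate: -mean(A_i * log pi(y_i))."""
    return -statistics.fmean([a * lp for a, lp in zip(advantages, logprobs)])


def ditto_style_loss(
    student_rewards: List[float],
    teacher_rewards: List[float],
    student_logprobs: List[float],
    teacher_logprobs: List[float],
    lam: float = 1.0,
) -> float:
    """L_group over student and teacher rollouts, plus lam * L_fb on teacher rollouts."""
    joint_adv = group_relative_advantages(student_rewards + teacher_rewards)
    group_term = surrogate_loss(student_logprobs + teacher_logprobs, joint_adv)
    fb_adv = group_relative_advantages(teacher_rewards)
    fb_term = surrogate_loss(teacher_logprobs, fb_adv)
    return group_term + lam * fb_term


# One student rollout (r_0 = 0.4) and one feedback-conditioned rollout (r_1 = 0.8).
pair_advantages = group_relative_advantages([0.4, 0.8])
```

With exactly one rollout of each kind, the normalized advantages come out at roughly -1 and +1 whatever the size of the reward gap, so the joint term simply pushes probability from the unaided attempt toward the feedback-conditioned one. Larger groups are needed for the magnitude of the gap to matter, and for the feedback-only term to be nonzero.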

The feedback-conditioned policy $\pi_\theta(\cdot \mid x, h)$ is an implicit teacher whose improved behavior gets distilled, on-policy, into the unconditioned student $\pi_\theta(\cdot \mid x)$. The critique $h$ is privileged information: present during training, gone at test time. By the end, the model behaves like the feedback-conditioned teacher on a fresh prompt in a single forward pass, with no judge, no critique, and nothing to re-derive. The lesson is in the weights, not in sand.

This is the precise contrast with Reflexion. Same insight — reflect in language, then improve — opposite ontology. Reflexion's update is text in a buffer that evaporates at the end of the episode; Ditto's update is a gradient that persists and transfers.

And because the feedback is language, it has headroom a saturating scalar does not. As the policy improves, the critiques get sharper — early ones flag blunt failures, later ones target subtle ones — so the feedback-conditioned teacher keeps pulling ahead of the student rather than collapsing into it.

Why this domain, why now

The reason Ditto lives in human-behavior simulation, and not in math, is that human simulation is where the scalar is most dishonest. Simulating a believable user, patient, learner, or persona is judged along many axes that do not reduce to a common currency — and, increasingly, it is something we genuinely rely on. Generative agents are judged on believability by human raters; social agents in SOTOPIA are scored on seven subjective dimensions with free-text rationales; theory-of-mind benchmarks reveal that scalar accuracy can hide illusory competence, a right answer for the wrong reason.

To measure this, Ditto introduces SOUL (Simulation gym Of hUman-Like behavior): 10 tasks across six categories — Theory of Mind, character role play, social skill, learner simulation, user simulation, and persona simulation. These are deliberately the tasks where a scalar reward is least informative.

Trained with Ditto, an 8B model raises its average score from 0.533 to 0.726, a 36% relative improvement over the base policy, and beats GPT-5.4 on 6 of the 10 tasks despite being far smaller. The average is the least interesting number, though; what matters is where the gains land.

The blowouts are on Social Skill (Sotopia, 0.28 → 0.47) and User Simulation (UserLLM, 0.47 → 0.93; MirrorBench, 0.55 → 0.71) — the most interactive, most multi-dimensional tasks on the board. GPT-5.4 still leads on the role-play tasks that lean more on stored knowledge than on moment-to-moment social calibration. That split is the thesis: verbal feedback buys almost nothing where the scalar was already honest, and almost everything where it was hiding the most.

A few honest caveats

We find the arc compelling, but it is worth being clear about what verbal feedback does not buy you.

The feedback here comes from an LLM judge, not a human. That puts Ditto in the same family as RLAIF: the policy can only become as nuanced as the critiques it trains on, so the judge's blind spots become the policy's blind spots. It also opens a language-space version of reward hacking — verbosity, sycophancy, gaming the critic's stylistic tells — that we are only beginning to understand.

It is not free, either. Each prompt now needs two rollouts — student and feedback-conditioned teacher — so you pay roughly double the generation cost per step. The bet is that the better target is worth more than the extra samples; on SOUL it clearly was, but that is workload-dependent.

And internalization cuts both ways. Because the feedback dimensions get baked into the weights, you cannot cheaply change your mind about what you want at test time the way you can by re-prompting a system that reads feedback live (the Reflexion trade, in reverse). If the target moves, you retrain.

Finally, beating GPT-5.4 on SOUL is not the same as being indistinguishable from a person — and the illusory-ToM worry applies to our own model too. The real test of verbal feedback is whether the student internalized the reasoning the critique pointed at, or just the answer. That is the thing we most want to verify next.

Coda: R is a sentence now

Return to the gradient we started with, $\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$. Across the entire arc (RLHF, RLAIF, RLVR, and now verbal feedback) that gradient barely changes. What changes is the thing we put inside the expectation. For a decade it was a scalar, because a scalar is what made the gradient easy. The arc is the field slowly admitting that the native form of human feedback was always language, and engineering its way back to it.

No coach hands an athlete a 0.62. They say you're late on the backhand. The interesting frontier in RL for language models is the same correction, finally arriving: the reward has gotten the bandwidth of a sentence.

Citation

Please cite this work as:

Xuhui Zhou and Weiwei Sun, “Learning with Verbal Feedback”, 2026.

Or use the BibTeX citation:

@misc{zhou2026verbal,
  author = {Xuhui Zhou and Weiwei Sun},
  title = {Learning with Verbal Feedback},
  year = {2026},
  howpublished = {\url{https://xuhuizhou.com/blog/learning-with-verbal-feedback}},
}