Scale AI · ML Research Discussion

On the Quest for User-Effective AI Agents

Productivity • Proactivity • Personalization

Xuhui Zhou · Carnegie Mellon University · Language Technologies Institute · April 8, 2026
01

AI agents are increasingly capable—
but they disappoint real users

Current benchmarks evaluate task success, but real users care about the experience of working with an agent.

02

Our Vision: The PPP Framework

We argue that effective real-world agents require optimizing three dimensions beyond task success:

Productivity
Complete tasks successfully and efficiently
Proactivity
Ask essential clarifying questions when instructions are underspecified
Personalization
Adapt to diverse user preferences and working styles
03

A Research Arc Toward User-Effective Agents

  • Ambig-SWE (ICLR 2026): Can agents detect ambiguity and ask the right questions?
  • PPP (arXiv 2025): Training agents to be productive, proactive, and personalized
  • TOM-SWE (arXiv 2025): Persistent user mental modeling for coding agents
  • Sim2Real (arXiv 2026): Do LLM user simulators actually resemble real users?
04
Agents are powerful — but what happens when the user’s instructions are incomplete? Can agents learn to ask?
05
ICLR 2026
Ambig-SWE: Interactive Agents to Overcome Underspecificity in Software Engineering
Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, Graham Neubig
06

How SWE-Bench Works

[Figure: the SWE-Bench pipeline. A GitHub issue ("data leak in GBDT due to warm start (non-histogram-based…)") plus a codebase (📁 sklearn/ · 📄 reqs.txt · 📁 examples/ · 📄 setup.cfg · 📄 README.rst · 📄 setup.py) is given to a 🤖 language model, which produces a 🔄 generated PR (+20 −12 across gradient_boosting.py, helper.py, utils/). 📋 Unit tests (e.g. join_struct_col, vstack_struct, dstack_struct, matrix_transform, euclidean_diff) are run pre- and post-patch.]

Given a GitHub issue + codebase, the agent generates a PR. Success = failing tests now pass, passing tests still pass.

Key assumption: the GitHub issue is complete and well-specified.

07

But Real Users Write Vague Instructions

We craft Ambig-SWE from SWE-Bench Verified:

  • Start from 500 curated, well-specified GitHub issues
  • Use GPT-4o to summarize each into a vague version — preserve the domain but strip the detail
  • Creates controlled information asymmetry: user holds the full issue, agent sees only the summary
Well-specified (SWE-Bench)
"data leak in GBDT due to warm start: when using warm_start=True with GradientBoostingClassifier, training data from previous fit calls leaks into new trees via residuals stored in…"
Underspecified (Ambig-SWE)
"There's a data leak issue with warm start in the gradient boosting module."
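The summarization step above can be sketched as a prompt builder. The wording of the template is a hypothetical reconstruction, not the paper's actual GPT-4o prompt:

```python
def build_vagueify_prompt(issue_text: str) -> str:
    """Turn a well-specified issue into its vague counterpart.
    Hypothetical template: keep the domain, strip the detail."""
    return (
        "Summarize the following GitHub issue in one short sentence. "
        "Keep the general problem area (module, symptom), but remove all "
        "specifics: parameter names, stack traces, and reproduction steps.\n\n"
        f"Issue:\n{issue_text}"
    )

prompt = build_vagueify_prompt(
    "data leak in GBDT due to warm start: when using warm_start=True ..."
)
```

Sending this prompt to the summarizer yields the underspecified version, while the user simulator keeps the full issue, creating the information asymmetry.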
08

Three Questions About Agent Proactivity

  • 🔍 Detect: Can agents distinguish well-specified from underspecified instructions?
  • Ask: When agents ask, do they extract useful information from the user?
  • Leverage: Does the interaction actually improve task outcomes?
Three evaluation settings:

  • Full: agent sees the complete issue (upper bound)
  • Hidden: agent sees the vague summary only (lower bound)
  • Interactive: vague summary + can ask the user (the treatment)
09

Interaction Helps Massively — But Agents Never Ask

+74%
improvement with interaction
on underspecified tasks

Interaction recovers 80% of full-specification performance through clarification alone.

But: false-negative rate (FNR) in detecting ambiguity:

  • Haiku 3.5: 0.97
  • Sonnet 3.5: 0.81
  • Llama 70B: 0.57
  • Deepseek-v2: 0.31
Agentic-tuned models almost never flag ambiguity. Non-agentic models ask — but with high false-positive rates.

The value of interaction is clear. The open question: how do we get agents to ask the right questions at the right time?
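The detection metric above reduces to a one-liner. Here FNR is treated as the fraction of underspecified tasks where the agent never flags the ambiguity (a minimal sketch; the benchmark's exact detection criterion may differ):

```python
def false_negative_rate(flagged: list[bool]) -> float:
    """FNR over underspecified tasks: the fraction where the agent
    failed to flag the ambiguity (never asked a clarifying question)."""
    return sum(1 for f in flagged if not f) / len(flagged)

# An agent that asks on only 3 of 100 underspecified tasks:
false_negative_rate([True] * 3 + [False] * 97)  # -> 0.97
```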

10
Interaction is valuable, but agents never initiate it. How do we train agents to ask the right questions at the right time?
11
arXiv 2025
PPP: Training Proactive and Personalized LLM Agents
Weiwei Sun, Xuhui Zhou, Weihua Du, Xingyao Wang, Sean Welleck, Graham Neubig, Maarten Sap, Yiming Yang
12

Why Task Success Alone Fails

Training only for task success actively degrades proactivity and personalization:

[Figure: training curves, PPP (ours) vs. task-success-only. Productivity stays ~the same for both runs, but Proactivity and Personalization diverge: the task-success-only run degrades on both dimensions.]

Solution: multi-objective RL

  • UserVille — interactive environment with LLM user simulators, configurable preferences
  • Three rewards — Rprod (task success) + Rproact (ask useful questions) + Rpers (match user style)
  • Joint optimization prevents any dimension from being sacrificed
13

How PPP Training Works

[Figure: the PPP training loop. A vague prompt ("fix the leak bug") goes to the agent πθ; a user simulator holds the full spec and answers ask_user calls. The rollout τ = (a₁, o₁, …, aₜ, oₜ) mixes tool calls, ask_user calls, and observations, and is scored with R(τ) = R_prod (do tests pass?) + R_proact (low-effort questions?) + R_pers (matches user style?).]

GRPO with token-level credit

Sample $G$ rollouts per prompt, group-relative advantage:

$$\hat{A}_{i,t} = \frac{\operatorname{clip}(R_i, 0, 1) - \operatorname{mean}(\{R_j\})}{\operatorname{std}(\{R_j\})}$$

Clipped PPO-style objective:

$$\mathcal{J} = \sum_{i,t} \min\!\Big\{r_{i,t}(\theta)\,\hat{A}_{i,t},\;\operatorname{clip}\!\big(r_{i,t}(\theta), 1{-}\epsilon, 1{+}\epsilon\big)\,\hat{A}_{i,t}\Big\}$$

  • $R_{\text{proact}}$ — rewards low-effort Q's (e.g. "which color?"); penalizes high-effort ones
  • $R_{\text{pers}}$ — user sim scores alignment with stated preference (language, format, style)
  • Training — SWE-Gym tasks in UserVille; precise prompts → vague; diverse user profiles (brevity, language, expertise); base model: Seed-OSS-36B-Inst
  • Eval — SWE-Bench-Verified (Func-Loc) + BrowseComp+; 8 unseen user preferences for generalization
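The group-relative advantage above can be sketched in a few lines, assuming the group mean and std are taken over the clipped rewards (the formula leaves this ambiguous) and using a small eps to guard the zero-variance case:

```python
import math

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """A_i = (clip(R_i, 0, 1) - mean) / std, shared by every token t of
    rollout i. Assumption: group statistics use the clipped rewards."""
    clipped = [min(max(r, 0.0), 1.0) for r in rewards]
    mean = sum(clipped) / len(clipped)
    std = math.sqrt(sum((c - mean) ** 2 for c in clipped) / len(clipped))
    return [(c - mean) / (std + eps) for c in clipped]

# A group of two rollouts, one failing (R=0) and one passing (R=1):
group_relative_advantages([0.0, 1.0])  # ≈ [-1.0, 1.0]
```

Each token of a rollout then gets its rollout's advantage inside the clipped PPO-style objective.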
14

PPP Across Three Dimensions

[Figure: radar chart (scale 40–80) over Productivity, Proactivity, and Personalization: GPT-5 vs. PPP (ours, 36B open-source).]

                  GPT-5    PPP     Δ
Productivity       55.8    56.3   +0.4
Proactivity        36.6    75.6   +38.9
Personalization    13.0    89.3   +76.3

+21.6 average over GPT-5 on SWE-Func-Loc

15

Trained Interaction is the Key to Real-World Utility

F1 on SWE-Func-Loc

                     Baseline   + PPP RL
Precise, no int.       58.5       64.2
Vague, no int.         36.5       44.0
Vague + int.           36.0       57.7

Vague + Interaction + RL (57.7) recovers to near precise baseline (58.5)

Interaction quality over training

[Figure: counts of low-, medium-, and high-effort questions over 200 training steps, for PPP (ours) vs. a run without $R_{\text{proact}}$.]

PPP: low-effort Q's rise, medium rise-then-fall (learns to ask better). Baseline: all categories explode (gets “lazy”, offloads work to user).

16
PPP makes agents smart collaborators within a session. But when the conversation ends, the agent forgets everything. Can agents remember who they serve?
17
arXiv 2025
TOM-SWE: User Mental Modeling for Software Engineering Agents
Xuhui Zhou, Valerie Chen, Zora Zhiruo Wang, Graham Neubig, Maarten Sap, Xingyao Wang
18

A Dual-Agent Architecture

TOM-SWE Architecture
  • SWE Agent handles code generation & execution (the standard coding agent)
  • ToM Agent (lightweight LLM) infers user goals, constraints, and preferences
  • Hierarchical Memory persists the user model across sessions — the agent remembers you
  • Key insight: agents need a model of who they work for, not just what they work on
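The cross-session persistence idea can be sketched as below. The class name, JSON store, and flat key-value user model are illustrative assumptions; TOM-SWE's actual memory is hierarchical and maintained by the ToM agent:

```python
import json
import tempfile
from pathlib import Path

class UserModelMemory:
    """Sketch of persisting an inferred user model across sessions
    (hypothetical API, not the paper's implementation)."""

    def __init__(self, store: Path):
        self.store = store
        self.models = json.loads(store.read_text()) if store.exists() else {}

    def update(self, user_id: str, inference: dict) -> None:
        # Merge inferences from the current session into the persistent model.
        self.models.setdefault(user_id, {}).update(inference)
        self.store.write_text(json.dumps(self.models))

    def recall(self, user_id: str) -> dict:
        # Loaded at session start so the SWE agent can condition on it.
        return self.models.get(user_id, {})

store = Path(tempfile.mkdtemp()) / "user_models.json"
UserModelMemory(store).update("u1", {"verbosity": "terse", "language": "en"})
# A later session reloads the same user model:
assert UserModelMemory(store).recall("u1")["verbosity"] == "terse"
```

The design point is the separation of concerns: the ToM agent writes to this store, the SWE agent only reads from it.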
19

Introducing the Stateful SWE Benchmark

Stateful SWE Benchmark Overview
Step 1
Collect 453 real human-agent sessions
Step 2
Create 15 developer profiles from 75 preferences
Step 3
Pair issues + profiles with 1.5M token histories
20

Stateful SWE Benchmark Results

Success rate (%) — ToM-enhanced agents consistently outperform across all models:

              CodeActAgent   RAGCodeActAgent   TomCodeActAgent (Ours)
Claude 4          18.1            38.4               59.7
Claude 3.7        18.7            14.4               46.5
Qwen3             14.2            16.9               44.3

Claude 4 + ToM: 59.7% vs. 18.1% — a 3.3x improvement. Real-world validation: 86% useful in 3-week study (17 devs, 209 sessions).

21

Cheap, Effective, and Validated in Practice

Cost vs. resolved rate

[Figure: average cost per session vs. resolved rate, with and without the ToM agent, across GPT-5 variants (nano, mini, full) and Claude (3.7, 4).]
$0.17
per session — only 16% of SWE avg ($1.08)

3-week real-world deployment:

  • 17 professional developers using ToM-enhanced CLI daily
  • 209 coding sessions on real tasks
  • 86% acceptance rate of ToM agent suggestions

User modeling is a distinct, efficient capability — even small LLMs dramatically boost performance as ToM agents.

22
PPP and TOM-SWE both train and validate with LLM user simulators. But what if those simulators are lying to us?
23
arXiv 2026
Mind the Sim2Real Gap in User Simulation for Agentic Tasks
Xuhui Zhou*, Weiwei Sun*, Qianou Ma, Yiqing Xie, Jiarui Liu, Weihua Du, Sean Welleck, Yiming Yang, Graham Neubig, Sherry Tongshuang Wu, Maarten Sap
24

The Experiment: τ-bench with Real Humans

Why τ-bench?

  • Full interactive loop: LLM user simulator + tool-augmented agent + automatic reward
  • Customer service across airline (bookings, cancellations) and retail (returns, orders)
  • Enables controlled comparison: swap LLM simulator with real humans, keep agent & reward fixed

Human annotation

  • 451 participants (ages 18–80, recruited via Prolific) across 165 tasks
  • Role-play as customers, interact with the same agent used in τ-bench
  • 3 independent batches on same tasks → measure human–human agreement as ceiling
  • Post-task survey: 5-way task success + 8 quality dimensions (efficiency, human-likeness, reuse, etc.)
25

A Taxonomy of Sim2Real Gaps

Taxonomy of Sim2Real Gaps
Behavioral Gap
D1 Communication · D2 Information · D3 Clarification · D4 Error Reaction
Evaluative Gap
Success criteria alignment + quality dimension scores across 8 axes
26

Measuring the Gap: User-Sim Index

We aggregate behavioral and evaluative dimensions into a single 0–100 score:

$$\text{USI} = \frac{1}{6}\big(\text{D1} + \text{D2} + \text{D3} + \text{D4} + (1{-}\text{ECE}){\times}100 + \text{Eval}\big)$$

  • D1–D4: Sørensen–Dice coefficients per behavioral dimension (higher = closer to human)
  • ECE: outcome calibration error (lower = better)
  • Eval: $(1{-}\text{MAE}){\times}100$ evaluative alignment
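The USI formula plugs in directly. The Dice helper below shows how a behavioral dimension score could be derived from overlap sets; this is illustrative, not the paper's exact annotation pipeline:

```python
def dice(a: set, b: set) -> float:
    """Sørensen–Dice overlap between two behavior sets, scaled to 0-100.
    Illustrative: D1-D4 come from annotated behavior distributions."""
    if not a and not b:
        return 100.0
    return 200.0 * len(a & b) / (len(a) + len(b))

def user_sim_index(d1: float, d2: float, d3: float, d4: float,
                   ece: float, mae: float) -> float:
    """USI per the formula above: mean of the four behavioral scores,
    outcome calibration (1 - ECE) * 100, and evaluative alignment
    (1 - MAE) * 100."""
    return (d1 + d2 + d3 + d4
            + (1.0 - ece) * 100.0
            + (1.0 - mae) * 100.0) / 6.0

# A perfectly human-like, perfectly calibrated simulator scores 100:
user_sim_index(100, 100, 100, 100, 0.0, 0.0)  # -> 100.0
```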

31 LLM simulators benchmarked

  • 18 proprietary — GPT, Claude, Gemini families
  • 9 open-source — DeepSeek, Llama, Qwen
  • 4 specialized — CoSER, UserLM, HumanLike, HumanLM (fine-tuned for user simulation)

Human inter-annotator USI: 92.9. Best LLM simulator: 76.0. A 16.9-point gap that no model closes.

27

The “Easy Mode” Problem

USI vs Chatbot Arena

USI vs. Chatbot Arena Elo. Only GPT family shows correlation ($r{=}0.91$). Claude & Gemini: no significant relationship.

  • Too cooperative — GPT-4o: 49% polite vs. 29% for humans; 1% short turns vs. 15.3%
  • Front-load information — simulators volunteer all details upfront; real users say “Hi, help with a return under Sarah”
  • Never frustrated — humans: “you already asked me that”; simulators quietly pivot
  • Inflates success — top LLM simulators: 77.8% agent success vs. 63.6% human baseline

GPT-5.1 as evaluator overestimates human-likeness by 55% and overall quality by 18%. Rule-based rewards are orthogonal to human-perceived quality across all 8 dimensions.

28

Implications for Benchmark Design

Human-in-the-Loop
LLM simulators are useful for scale, but final benchmarks must include real human participants. USI quantifies how much to trust simulator results.
Diversity-Aware Simulation
Current simulators suffer mode collapse. Calibrate against real human behavioral distributions, not just average accuracy.
Report the Gap
A model scoring 78% with LLM users but 64% with real users tells a very different story. Report sim-human gaps alongside scores.

USI provides a principled way to audit any simulator-based benchmark — directly relevant to Scale AI’s evaluation infrastructure.

29

The Full Picture

Work Question Key Result Scale
Ambig-SWE Can agents ask? +74% with interaction SWE-Bench Verified
PPP Can we train for it? +21.6 F1 over GPT-5 Multi-objective RL
TOM-SWE Can agents remember? 59.7% vs 18.1% (3.3x) 453 sessions, 17 devs
Sim2Real Are evals valid? LLMs = "easy mode" 451 humans, 31 LLMs

A progression from benchmarking the problem → training for it → persistent modeling → validating evaluation

30

Design Implications for Real-World Agents

31

Open Questions & Future Work

Particularly relevant for Scale AI: agents that navigate ambiguity in enterprise contexts need exactly these capabilities.

Thank You

Looking forward to your questions and discussion!

Xuhui Zhou xuhuiz@cs.cmu.edu · Carnegie Mellon University