Scale AI · ML Research Discussion

On the Quest of User-Effective AI Agents

Productivity • Proactivity • Personalization

Xuhui Zhou · Carnegie Mellon University · Language Technologies Institute · April 8, 2026

01

AI agents are increasingly capable—
but they disappoint real users

Misunderstanding ambiguity — making unwarranted assumptions instead of asking clarifying questions
Ignoring preferences — one-size-fits-all execution that doesn't adapt to individual user styles
Failing to collaborate — executing commands in isolation rather than engaging in iterative dialogue
Creating safety risks — tool misuse and wasted resources from underspecified instructions

Current benchmarks evaluate task success, but real users care about the experience of working with an agent.

02

Our Vision: The PPP Framework

We argue that effective real-world agents require optimizing three dimensions beyond task success:

⚙

Productivity

Complete tasks successfully and efficiently

❓

Proactivity

Ask essential clarifying questions when instructions are underspecified

✎

Personalization

Adapt to diverse user preferences and working styles

03

A Research Arc Toward User-Effective Agents

ICLR 2026

Ambig-SWE: Can agents detect ambiguity and ask the right questions?

arXiv 2025

PPP: Training agents to be productive, proactive, and personalized

arXiv 2025

TOM-SWE: Persistent user mental modeling for coding agents

arXiv 2026

Sim2Real: Do LLM user simulators actually resemble real users?

04

Agents are powerful — but what happens when the user’s instructions are incomplete? Can agents learn to ask?

05

ICLR 2026

Ambig-SWE: Interactive Agents to Overcome Underspecificity in Software Engineering

Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, Graham Neubig

06

How SWE-Bench Works

◉ Issue

data leak in GBDT due to warm start (non-histogram-based…)

○ Codebase

📁 sklearn/ · 📁 examples/ · 📄 README.rst · 📄 reqs.txt · 📄 setup.cfg · 📄 setup.py

→

🤖 Language Model

↓

🔄 Generated PR +20 −12

📄 gradient_boosting.py ✚ · 📄 helper.py ◊ · 📁 utils ➖

→

📋 Unit Tests

Pre	Post	Test
✘	✔	join_struct_col
✘	✔	vstack_struct
✘	✔	dstack_struct
✔	✔	matrix_transform
✔	✔	euclidean_diff

Given a GitHub issue + codebase, the agent generates a PR. Success = failing tests now pass, passing tests still pass.
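
The success criterion above can be written as a small check. This is a sketch rather than SWE-Bench's actual harness: the function name `is_resolved` and the dictionary layout are assumptions made for illustration, and the test names are the ones shown in the table on this slide.

```python
from typing import Dict, Iterable


def is_resolved(post_results: Dict[str, bool],
                fail_to_pass: Iterable[str],
                pass_to_pass: Iterable[str]) -> bool:
    """True if every previously failing test now passes and no
    previously passing test has regressed.

    post_results maps a test name to whether it passes after the PR.
    A test missing from post_results is treated as failing.
    """
    fixed = all(post_results.get(name, False) for name in fail_to_pass)
    kept = all(post_results.get(name, False) for name in pass_to_pass)
    return fixed and kept


# Test names taken from the slide; the pass/fail values mirror its "Post" column.
post = {
    "join_struct_col": True,
    "vstack_struct": True,
    "dstack_struct": True,
    "matrix_transform": True,
    "euclidean_diff": True,
}
failing_before = ["join_struct_col", "vstack_struct", "dstack_struct"]
passing_before = ["matrix_transform", "euclidean_diff"]

resolved = is_resolved(post, failing_before, passing_before)

# A PR that fixes the issue but breaks an existing test does not count.
regressed = is_resolved({**post, "euclidean_diff": False},
                        failing_before, passing_before)
```

Here `resolved` is `True` and `regressed` is `False`: fixing the target tests is not enough if an existing test breaks.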

Key assumption: the GitHub issue is complete and well-specified.

07

But Real Users Write Vague Instructions

We craft Ambig-SWE from SWE-Bench Verified:

Start from 500 curated, well-specified GitHub issues
Use GPT-4o to summarize each into a vague version — preserve the domain but strip the detail
Creates controlled information asymmetry: user holds the full issue, agent sees only the summary

Well-specified (SWE-Bench)

"data leak in GBDT due to warm start: when using warm_start=True with GradientBoostingClassifier, training data from previous fit calls leaks into new trees via residuals stored in…"

Underspecified (Ambig-SWE)

"There's a data leak issue with warm start in the gradient boosting module."

08

Three Questions About Agent Proactivity

🔍

Detect

Can agents distinguish well-specified from underspecified instructions?

❓

Ask

When agents ask, do they extract useful information from the user?

⚡

Leverage

Does the interaction actually improve task outcomes?

Full

Agent sees complete issue
(upper bound)

Hidden

Agent sees vague summary only
(lower bound)

Interactive

Vague summary + can ask user
(the treatment)

09

Interaction Helps Massively — But Agents Never Ask

+74%

improvement with interaction
on underspecified tasks

Interaction recovers 80% of full-specification performance through clarification alone.

But: FNR (failure to detect ambiguity)

Model	FNR
Haiku 3.5	0.97
Sonnet 3.5	0.81
Llama 70B	0.57
Deepseek-v2	0.31

Agentic-tuned models almost never flag ambiguity. Non-agentic models ask — but with high false-positive rates.
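
To make the two error types concrete, here is a minimal sketch of how a false-negative rate and a false-positive rate could be computed for ambiguity detection. It assumes each task carries a ground-truth flag (underspecified or not) and a record of whether the agent asked a question; the function name and the toy data are illustrative and not taken from the paper.

```python
from typing import Sequence, Tuple


def detection_rates(is_underspecified: Sequence[bool],
                    agent_asked: Sequence[bool]) -> Tuple[float, float]:
    """Return (FNR, FPR) for ambiguity detection.

    FNR: share of underspecified tasks where the agent did NOT ask.
    FPR: share of well-specified tasks where the agent asked anyway.
    """
    pairs = list(zip(is_underspecified, agent_asked))
    fn = sum(1 for vague, asked in pairs if vague and not asked)
    tp = sum(1 for vague, asked in pairs if vague and asked)
    fp = sum(1 for vague, asked in pairs if not vague and asked)
    tn = sum(1 for vague, asked in pairs if not vague and not asked)
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr, fpr


# Toy data (hypothetical): four vague tasks, four well-specified ones.
labels = [True, True, True, True, False, False, False, False]
asked = [True, False, False, False, True, False, False, False]

fnr, fpr = detection_rates(labels, asked)  # (0.75, 0.25)
```

An agent that never asks has an FNR of 1.0 and an FPR of 0.0, which is why both rates are needed to judge whether questions come at the right time.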

The value of interaction is clear. The open question: how do we get agents to ask the right questions at the right time?

10

Interaction is valuable, but agents never initiate it. How do we train agents to ask the right questions at the right time?

11

arXiv 2025

PPP: Training Proactive and Personalized LLM Agents

Weiwei Sun, Xuhui Zhou, Weihua Du, Xingyao Wang, Sean Welleck, Graham Neubig, Maarten Sap, Yiming Yang

12

Why Task Success Alone Fails

Training only for task success actively degrades proactivity and personalization.

Solution: multi-objective RL

UserVille — interactive environment with LLM user simulators, configurable preferences
Three rewards — R_prod (task success) + R_proact (ask useful questions) + R_pers (match user style)
Joint optimization prevents any dimension from being sacrificed
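
The three-reward combination listed above can be sketched as follows. The slide only says the rewards are added, so the per-term weights (defaulting to 1.0), the class name, and the example scores are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class RolloutScores:
    task_success: float      # R_prod
    proactivity: float       # R_proact
    personalization: float   # R_pers


def combined_reward(scores: RolloutScores,
                    w_prod: float = 1.0,
                    w_proact: float = 1.0,
                    w_pers: float = 1.0) -> float:
    """Weighted sum of the three rewards; weights are an assumption."""
    return (w_prod * scores.task_success
            + w_proact * scores.proactivity
            + w_pers * scores.personalization)


# Hypothetical rollouts that both solve the task.
guessed = RolloutScores(task_success=1.0, proactivity=0.0, personalization=0.0)
asked_well = RolloutScores(task_success=1.0, proactivity=0.5, personalization=0.5)

r_guessed = combined_reward(guessed)        # 1.0
r_asked_well = combined_reward(asked_well)  # 2.0
```

Under a success-only reward the two rollouts would be indistinguishable; with the joint reward, the one that also interacts well is preferred.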

13

How PPP Training Works

GRPO with token-level credit

Sample $G$ rollouts per prompt, group-relative advantage:

$$\hat{A}_{i,t} = \frac{\operatorname{clip}(R_i, 0, 1) - \operatorname{mean}(\{R_j\})}{\operatorname{std}(\{R_j\})}$$

Clipped PPO-style objective:

$$\mathcal{J} = \sum_{i,t} \min\!\Big\{r_{i,t}(\theta)\,\hat{A}_{i,t},\;\operatorname{clip}\!\big(r_{i,t}(\theta), 1{-}\epsilon, 1{+}\epsilon\big)\,\hat{A}_{i,t}\Big\}$$
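
A minimal pure-Python sketch of the two formulas above, following them as written: the reward is clipped to [0, 1], and the mean and standard deviation are taken over the group's rewards. Since the right-hand side of the advantage does not depend on $t$, the sketch shares one advantage across all tokens of a rollout. The small `eps` in the denominator, the use of the population standard deviation, and the default $\epsilon = 0.2$ are assumptions, not values from the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Advantage per rollout: (clip(R_i, 0, 1) - mean(R)) / std(R)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption
    return [(min(max(r, 0.0), 1.0) - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """One token's contribution to the clipped PPO-style objective."""
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)


def objective(ratios: Sequence[Sequence[float]],
              advantages: Sequence[float],
              epsilon: float = 0.2) -> float:
    """Sum over rollouts i and tokens t; ratios[i][t] is r_{i,t}(theta)."""
    return sum(clipped_term(r, advantages[i], epsilon)
               for i, row in enumerate(ratios)
               for r in row)


# Toy group of G = 4 rollouts with hypothetical rewards.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])  # about [1, -1, 1, -1]

gain_capped = clipped_term(1.5, 1.0)   # 1.2: upside is capped at 1 + epsilon
loss_kept = clipped_term(0.5, -1.0)    # -0.8: the more pessimistic value wins
```

The `min` makes the objective a lower bound on the unclipped surrogate, so a large policy ratio cannot inflate the gain from a positively rewarded rollout.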

$R_{\text{proact}}$ — rewards low-effort Q's (e.g. "which color?"); penalizes high-effort ones
$R_{\text{pers}}$ — user sim scores alignment with stated preference (language, format, style)
Training — SWE-Gym tasks in UserVille; precise prompts → vague; diverse user profiles (brevity, language, expertise); base model: Seed-OSS-36B-Inst
Eval — SWE-Bench-Verified (Func-Loc) + BrowseComp+; 8 unseen user preferences for generalization

14

PPP Across Three Dimensions

Dimension	GPT-5	PPP (Ours, 36B open-source)	Δ
Productivity	55.8	56.3	+0.4
Proactivity	36.6	75.6	+38.9
Personalization	13.0	89.3	+76.3

+21.6 average over GPT-5 on SWE-Func-Loc

15

Trained Interaction is the Key to Real-World Utility

F1 on SWE-Func-Loc

Setting	Baseline	+ PPP RL
Precise, no interaction	58.5	64.2
Vague, no interaction	36.5	44.0
Vague + interaction	36.0	57.7

Vague + Interaction + RL (57.7) recovers to near precise baseline (58.5)

Interaction quality over training

PPP: low-effort Q's rise, medium rise-then-fall (learns to ask better). Baseline: all categories explode (gets “lazy”, offloads work to user).

16

PPP makes agents smart collaborators within a session. But when the conversation ends, the agent forgets everything. Can agents remember who they serve?

17

arXiv 2025

TOM-SWE: User Mental Modeling for Software Engineering Agents

Xuhui Zhou, Valerie Chen, Zora Zhiruo Wang, Graham Neubig, Maarten Sap, Xingyao Wang

18

A Dual-Agent Architecture

SWE Agent handles code generation & execution (the standard coding agent)
ToM Agent (lightweight LLM) infers user goals, constraints, and preferences

Hierarchical Memory persists the user model across sessions — the agent remembers you
Key insight: agents need a model of who they work for, not just what they work on
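
The division of labor above can be sketched in a few lines. Everything here is hypothetical: the class names, the two memory tiers, and the keyword rules that stand in for the lightweight LLM are illustration only and do not reflect the actual TOM-SWE implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HierarchicalMemory:
    """Two illustrative tiers: raw per-session notes and a distilled profile."""
    session_notes: Dict[str, List[str]] = field(default_factory=dict)
    profile: Dict[str, str] = field(default_factory=dict)

    def record(self, session_id: str, note: str) -> None:
        self.session_notes.setdefault(session_id, []).append(note)


class ToMAgent:
    """Stand-in for the lightweight LLM; keyword rules replace the model."""

    def __init__(self, memory: HierarchicalMemory) -> None:
        self.memory = memory

    def observe(self, session_id: str, user_message: str) -> None:
        self.memory.record(session_id, user_message)
        text = user_message.lower()
        if "pytest" in text:
            self.memory.profile["test_framework"] = "pytest"
        if "no new dependencies" in text:
            self.memory.profile["dependencies"] = "avoid adding"

    def brief(self) -> Dict[str, str]:
        """User model handed to the coding agent."""
        return dict(self.memory.profile)


class SWEAgent:
    """Stand-in for the coding agent; it only builds the prompt it would use."""

    def build_prompt(self, task: str, user_model: Dict[str, str]) -> str:
        prefs = "; ".join(f"{k}={v}" for k, v in sorted(user_model.items()))
        return f"Task: {task}\nKnown user preferences: {prefs or 'none'}"


memory = HierarchicalMemory()
tom = ToMAgent(memory)

# Session 1: the user states preferences once.
tom.observe("s1", "Please use pytest and no new dependencies.")

# Session 2: a new task, with the profile carried over from session 1.
prompt = SWEAgent().build_prompt("Fix the flaky date parser", tom.brief())
```

The point of the split is that the profile outlives the session: the second task's prompt already carries preferences the user never repeated.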

19

Introducing the Stateful SWE Benchmark

Step 1

Collect 453 real human-agent sessions

Step 2

Create 15 developer profiles from 75 preferences

Step 3

Pair issues + profiles with 1.5M token histories

20

Stateful SWE Benchmark Results

Success rate (%) — ToM-enhanced agents consistently outperform across all models:

Model	CodeActAgent	RAGCodeActAgent	TomCodeActAgent (Ours)
Claude 4	18.1	38.4	59.7
Claude 3.7	18.7	14.4	46.5
Qwen3	14.2	16.9	44.3

Claude 4 + ToM: 59.7% vs. 18.1% — a 3.3x improvement. Real-world validation: 86% useful in 3-week study (17 devs, 209 sessions).

21

Cheap, Effective, and Validated in Practice

[Figure: cost vs. resolved rate; legend: w/o ToM, GPT-5, Claude]

$0.17 per session, only 16% of the SWE agent average ($1.08)

3-week real-world deployment:

17 professional developers using ToM-enhanced CLI daily
209 coding sessions on real tasks
86% acceptance rate of ToM agent suggestions

User modeling is a distinct, efficient capability — even small LLMs dramatically boost performance as ToM agents.

22

PPP and TOM-SWE both train and validate with LLM user simulators. But what if those simulators are lying to us?

23

arXiv 2026

Mind the Sim2Real Gap in User Simulation for Agentic Tasks

Xuhui Zhou*, Weiwei Sun*, Qianou Ma, Yiqing Xie, Jiarui Liu, Weihua Du, Sean Welleck, Yiming Yang, Graham Neubig, Sherry Tongshuang Wu, Maarten Sap

24

The Experiment: τ-bench with Real Humans

Why τ-bench?

Full interactive loop: LLM user simulator + tool-augmented agent + automatic reward
Customer service across airline (bookings, cancellations) and retail (returns, orders)
Enables controlled comparison: swap LLM simulator with real humans, keep agent & reward fixed

Human annotation

451 participants (ages 18–80, recruited via Prolific) across 165 tasks
Role-play as customers, interact with the same agent used in τ-bench
3 independent batches on same tasks → measure human–human agreement as ceiling
Post-task survey: 5-way task success + 8 quality dimensions (efficiency, human-likeness, reuse, etc.)

25

A Taxonomy of Sim2Real Gaps

Behavioral Gap

D1 Communication · D2 Information · D3 Clarification · D4 Error Reaction

Evaluative Gap

Success criteria alignment + quality dimension scores across 8 axes

26

Measuring the Gap: User-Sim Index

We aggregate behavioral and evaluative dimensions into a single 0–100 score:

$$\text{USI} = \frac{1}{6}\big(\text{D1} + \text{D2} + \text{D3} + \text{D4} + (1{-}\text{ECE}){\times}100 + \text{Eval}\big)$$

D1–D4: Sørensen–Dice coefficients per behavioral dimension (higher = closer to human)
ECE: outcome calibration error (lower = better)
Eval: $(1{-}\text{MAE}){\times}100$ evaluative alignment
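
The aggregation above can be sketched as below. The USI function follows the formula on this slide and assumes D1 to D4 are already on a 0 to 100 scale and that ECE and MAE lie in [0, 1]. The Dice helper is one common multiset reading of the Sørensen–Dice coefficient applied to behavior labels; the paper's exact computation may differ, and all labels and numbers are hypothetical.

```python
from collections import Counter
from typing import Sequence


def dice_score(human_labels: Sequence[str], sim_labels: Sequence[str]) -> float:
    """Sørensen–Dice overlap between two label multisets, scaled to 0-100."""
    h, s = Counter(human_labels), Counter(sim_labels)
    overlap = sum((h & s).values())          # per-label minimum counts
    total = sum(h.values()) + sum(s.values())
    # Two empty label lists are treated as full agreement (an assumption).
    return 100.0 * 2 * overlap / total if total else 100.0


def user_sim_index(d_scores: Sequence[float], ece: float, mae: float) -> float:
    """USI = (D1 + D2 + D3 + D4 + (1 - ECE) * 100 + (1 - MAE) * 100) / 6."""
    if len(d_scores) != 4:
        raise ValueError("expected exactly four behavioral scores (D1..D4)")
    calibration = (1.0 - ece) * 100.0
    evaluative = (1.0 - mae) * 100.0
    return (sum(d_scores) + calibration + evaluative) / 6.0


# Hypothetical behavior labels for one dimension.
human = ["short", "polite", "short", "question"]
simulator = ["polite", "polite", "long", "question"]
d1 = dice_score(human, simulator)  # 50.0

# Hypothetical component scores for one simulator.
usi = user_sim_index([80.0, 70.0, 60.0, 50.0], ece=0.1, mae=0.2)  # about 71.67
```

Because the six components are averaged with equal weight, a simulator cannot reach a high USI through behavioral similarity alone; poor calibration or evaluative alignment pulls the score down just as much.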

31 LLM simulators benchmarked

18 proprietary — GPT, Claude, Gemini families
9 open-source — DeepSeek, Llama, Qwen
4 specialized — CoSER, UserLM, HumanLike, HumanLM (fine-tuned for user simulation)

Human inter-annotator USI: 92.9. Best LLM simulator: 76.0. A 16.9-point gap that no model closes.

27

The “Easy Mode” Problem

USI vs. Chatbot Arena Elo. Only GPT family shows correlation ($r{=}0.91$). Claude & Gemini: no significant relationship.

Too cooperative — GPT-4o: 49% polite vs. 29% for humans; 1% short turns vs. 15.3%
Front-load information — simulators volunteer all details upfront; real users say “Hi, help with a return under Sarah”
Never frustrated — humans: “you already asked me that”; simulators quietly pivot
Inflates success — top LLM simulators: 77.8% agent success vs. 63.6% human baseline

GPT-5.1 as evaluator overestimates human-likeness by 55% and overall quality by 18%. Rule-based rewards are orthogonal to human-perceived quality across all 8 dimensions.

28

Implications for Benchmark Design

Human-in-the-Loop

LLM simulators are useful for scale, but final benchmarks must include real human participants. USI quantifies how much to trust simulator results.

Diversity-Aware Simulation

Current simulators suffer mode collapse. Calibrate against real human behavioral distributions, not just average accuracy.

Report the Gap

A model scoring 78% with LLM users but 64% with real users tells a very different story. Report sim-human gaps alongside scores.

USI provides a principled way to audit any simulator-based benchmark — directly relevant to Scale AI’s evaluation infrastructure.

29

The Full Picture

Work	Question	Key Result	Scale
Ambig-SWE	Can agents ask?	+74% with interaction	SWE-Bench Verified
PPP	Can we train for it?	+21.6 average over GPT-5	Multi-objective RL
TOM-SWE	Can agents remember?	59.7% vs 18.1% (3.3x)	453 sessions, 17 devs
Sim2Real	Are evals valid?	LLMs = "easy mode"	451 humans, 31 LLMs

A progression from benchmarking the problem → training for it → persistent modeling → validating evaluation

30

Design Implications for Real-World Agents

Interaction is a feature, not a failure — agents that ask questions outperform those that guess
User modeling must be persistent — one-shot personalization isn't enough; agents need memory
Multi-objective training is essential — optimizing only for task success misses the user experience
Validate with real humans — LLM simulators are useful but systematically overestimate agent quality

31

Open Questions & Future Work

Scaling user modeling — can ToM approaches work across tasks, tools, and domains?
Better user simulators — closing the Sim2Real gap through calibrated, diverse simulators
Safety + user-effectiveness — how do we balance proactivity with appropriate caution?
Enterprise deployment — user-effective agents in legal, enterprise, and ambiguous domains

Particularly relevant for Scale AI: agents that navigate ambiguity in enterprise contexts need exactly these capabilities.

Thank You

Looking forward to your questions and discussion!

Ambig-SWE — arxiv.org/abs/2502.13069
PPP — arxiv.org/abs/2511.02208
TOM-SWE — arxiv.org/abs/2510.21903
Sim2Real — arxiv.org/abs/2603.11245

Xuhui Zhou xuhuiz@cs.cmu.edu · Carnegie Mellon University
