The Quest of User-Effective AI Agents

By Xuhui Zhou and Weiwei Sun · Nov 2, 2025

A robot pushing a large boulder up a mountain
The feeling of using an AI agent that is not user-effective

It's not the year of AI agents, but the decade of AI agents.

Andrej Karpathy, “We're summoning ghosts, not building animals”

The past year has seen remarkable progress in AI agents. From coding assistants to research tools, these systems demonstrate increasingly sophisticated capabilities on standardized benchmarks. Yet the dream of one person commanding dozens of AI “interns” for 10× productivity remains elusive. In reality, a user might wait for an agent to generate thousands of lines of code, only to discover that it misunderstood the task or produced an overly complex, unusable implementation. This often triggers a long cycle of rewrites, prompt tweaks, and “try again” attempts, but alas, all in vain. Worse, impressive benchmark scores don't seem to prevent these failures. What's missing?

The dominant paradigm in training and evaluating AI agents centers on task success, overlooking the fundamental goal of supporting real-world users. This post outlines our efforts to rethink that paradigm for more user-effective AI agents.

The PPP principles: defining user-effectiveness

We characterize user-effectiveness through three core dimensions: the PPP Principles. The first P is Productivity. While existing benchmarks focus primarily on whether an agent can accurately follow instructions, we argue that speed is equally important. Thus, productivity encompasses both high accuracy and high efficiency.

But wait, is this enough? Not every task is a simple one-off, and not every instruction is complete or correct. Often, users themselves aren’t entirely sure what they want, let alone how to describe it precisely. In these cases, the agent needs to ask questions, clarify goals, and guide the user through the process of refining instructions. That’s where the second P, Proactivity, comes in. Furthermore, a proactive agent doesn’t just seek missing information; it also helps users understand how to best work with the agent, what it produces, and how to use those results effectively. For instance, if the user asks the agent to “write a website,” the agent should first ask what kind of website the user would like to build, given the highly under-specified instruction. After finishing the task, the agent should help the user understand how to use it, handle edge cases, and recognize limitations. This kind of proactivity not only improves outcomes but also builds trust between the user and the agent.

Finally, every user has their own way of working with AI agents. Some prefer to give high-level language instructions; others like to co-create step by step. Some are comfortable letting the agent take risks; others want close oversight. Even stylistic preferences, like a “pythonic” versus “C-style” approach, can differ widely. There is no one-size-fits-all agentic solution, which is why we need the third P: Personalization. To be truly user-effective, agents must be agile and flexible enough to adapt to different users, contexts, and collaboration styles.

Why not focus only on Productivity? Our research shows (through the training experiments detailed in the section on reinforcing user-effective AI agents with PPP-inspired rewards below) that optimizing solely for Productivity may yield short-term gains in overall user effectiveness (across all three PPP dimensions). However, as optimization continues, it inevitably undermines the other two Ps: Proactivity and Personalization (see the RL training curves discussed in that section).

This need for a multi-dimensional framework is echoed in recent research on Scaling Collaborative Effort. The authors argue that two additional metrics are needed to truly measure how well agents collaborate with humans:

  1. User Effort — how much cognitive and investigative work users invest in the collaboration process, which may involve actively building an understanding of the task or the agent’s reasoning process, or simply answering the agent’s clarification prompts;
  2. Utility of Joint Actions — how much the joint human and agent team can accomplish together, reminiscent of joint human-AI team performance studied in prior literature.

This perspective resonates deeply with our PPP Principles: Productivity captures the efficiency of joint outcomes, Proactivity shapes the interaction that drives user engagement, and Personalization determines how seamlessly the agent adapts to each user’s level of effort and collaboration style.

Proactivity and Personalization

We could consider current agentic benchmarks as Partially Observable Markov Decision Processes (POMDPs), as the agent must reason under uncertainty about the environment's true state (e.g., SWE-bench and BrowseComp). However, many of these benchmarks capture only half of the real-world problem. When agents must collaborate with human users, a second, critical source of partial observability is the user's state. The agent's task is no longer just a POMDP over the environment, but one that must also account for the user. It must reason under uncertainty about the user's unobservable mental state.
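To make this framing concrete, here is one way to write it down (this notation is ours, a sketch rather than a definition taken from any specific benchmark): the hidden state factors into an environment part and a user part, and the agent’s belief must track both,

$$s_t=\big(s^{\text{env}}_t,\,s^{\text{user}}_t\big),\qquad b_t(s_t)=P\big(s^{\text{env}}_t,\,s^{\text{user}}_t \mid o_{1:t},\,a_{1:t-1}\big),$$

where $s^{\text{env}}_t$ is the repository or web state and $s^{\text{user}}_t$ bundles the user's intent, knowledge, and preferences. Most existing benchmarks hand the agent a complete specification, which effectively collapses the uncertainty over $s^{\text{user}}_t$.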

We believe that building truly user-effective AI agents requires moving beyond solving just the environmental POMDP. Agents must also model the human POMDP, inferring the user’s intent and goals, knowledge and belief states, and preferences and constraints across long-horizon sessions. Below we describe two of our research projects that bring simulated users into the development of AI agents. Starting with software engineering tasks, the first study examines the proactive behavior of AI agents, while the second explores both their proactivity and their ability to personalize to individual users.

Interactive Agents to Overcome Ambiguity

In 2024, we initiated a project to investigate the question-asking behavior of coding assistants. We found that most software engineering (SWE) agents are not good at asking questions. In fact, they barely ask any questions at all.

Specifically, we took the SWE-bench-verified dataset and transformed each complete, clear task instruction into an incomplete, ambiguous one through LLM-based perturbation. We then gave the original instruction to an LLM-simulated human user (mimicking a scenario where the user has more context and knowledge about the task than the agent). The perturbed instruction was given to the agent, which we asked to complete the task while explicitly informing it that it could ask the user questions if it was uncertain about any part of the task. Comparing across models, agents powered by leading agentic models rarely asked questions (which is why they have a zero false positive rate). Interestingly, models not specifically tuned for agentic tasks were the ones that actually initiated questions, though often in incorrect ways.
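To make the setup concrete, here is a minimal sketch of one evaluation episode. The function names, prompts, and interfaces are illustrative assumptions on our part, not the actual study code.

```python
# Sketch of the ambiguity study: the simulated user holds the original, precise
# issue; the agent only sees a perturbed, under-specified version and may ask
# clarification questions. All names and prompts here are illustrative.
from dataclasses import dataclass, field
from typing import Callable

LLM = Callable[[str], str]  # any text-completion function

def perturb_instruction(llm: LLM, original_issue: str) -> str:
    """Ask an LLM to strip key details so the issue becomes ambiguous."""
    return llm(
        "Rewrite this GitHub issue so that key details (file names, expected "
        "behavior, reproduction steps) are missing or vague:\n\n" + original_issue
    )

@dataclass
class SimulatedUser:
    """Answers questions using the full ground-truth issue the agent cannot see."""
    llm: LLM
    ground_truth: str
    questions: list = field(default_factory=list)

    def answer(self, question: str) -> str:
        self.questions.append(question)
        return self.llm(
            f"You filed this issue:\n{self.ground_truth}\n\n"
            f"The assistant asks: {question}\nAnswer briefly and truthfully."
        )

def run_episode(agent_step: Callable, llm: LLM, original_issue: str) -> dict:
    """agent_step(transcript) returns ('ask', question) or ('patch', diff)."""
    user = SimulatedUser(llm=llm, ground_truth=original_issue)
    transcript = [("task", perturb_instruction(llm, original_issue))]
    while True:
        kind, content = agent_step(transcript)
        if kind == "ask":
            transcript.append(("user", user.answer(content)))
        else:  # a proposed patch ends the episode; grade it with the SWE-bench tests
            return {"patch": content, "questions_asked": len(user.questions)}
```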

Through this study, we identified a fundamental gap in how today's AI agents are trained and evaluated, a missing piece that limits their real-world effectiveness. We are now on a quest to address that gap.

ToM-SWE: user mental modeling for software engineering agents

ToM-SWE is our first attempt to model the user's mental state and continuously learn from the user's feedback in a complex, long-horizon setting. As shown in Figure 3, we propose a dual-agent framework in which the code agent and the ToM agent can be powered by different LLMs and manage different tasks. Specifically, the ToM agent communicates with the code agent while also agentically managing a hierarchical database that persists the user's previous conversation history and user models.

ToM-SWE system architecture showing user interaction with SWE Agent, ToM Agent, and hierarchical memory

Figure 3: Overview of the ToM-SWE framework: the SWE agent handles code generation and execution, while the ToM agent focuses on user modeling and intent inference. The SWE agent consults the ToM agent to predict the user's mental state before suggesting technical actions. Meanwhile, the ToM agent maintains an external hierarchical memory system to persist the user's state and update user models after each session (with update_memory action).
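For intuition, the division of labor might look roughly like the sketch below. The interface names (ToMAgent, update_memory, and so on) follow the figure, but the code itself is our simplification, not the released implementation.

```python
# Illustrative interface for the dual-agent setup: the SWE agent consults the
# ToM agent before acting, and the ToM agent persists what it learns into a
# hierarchical memory after each session. This is a sketch, not ToM-SWE itself.
from dataclasses import dataclass, field
from typing import Callable

LLM = Callable[[str], str]

@dataclass
class HierarchicalMemory:
    """Raw sessions -> session summaries -> a compact user model."""
    raw_sessions: list = field(default_factory=list)
    session_summaries: list = field(default_factory=list)
    user_model: dict = field(default_factory=dict)   # e.g. {"verbosity": "low"}

class ToMAgent:
    def __init__(self, llm: LLM, memory: HierarchicalMemory):
        self.llm, self.memory = llm, memory

    def predict_mental_state(self, user_message: str) -> str:
        """Infer intent / knowledge / preferences before the SWE agent acts."""
        return self.llm(
            f"User model so far: {self.memory.user_model}\n"
            f"Latest message: {user_message}\n"
            "Describe the likely intent, what the user already knows, and how "
            "they prefer to be addressed."
        )

    def update_memory(self, session_transcript: list) -> None:
        """After a session, fold the transcript into summaries and the user model."""
        self.memory.raw_sessions.append(session_transcript)
        summary = self.llm("Summarize this coding session:\n" + "\n".join(session_transcript))
        self.memory.session_summaries.append(summary)
        self.memory.user_model["latest_summary"] = summary  # simplified update rule

class SWEAgent:
    def __init__(self, llm: LLM, tom: ToMAgent):
        self.llm, self.tom = llm, tom

    def act(self, user_message: str) -> str:
        mental_state = self.tom.predict_mental_state(user_message)
        return self.llm(
            f"Task: {user_message}\nUser analysis: {mental_state}\n"
            "Propose the next technical action, or a clarifying question if needed."
        )
```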

To evaluate an agent's ability to model the user's mental state over the long run, we built a benchmark dataset called Stateful SWE Bench: we collected 453 real-world developer-agent sessions and created 15 distinct “developer profiles”, each with unique communication styles (like verbosity) and coding preferences (like testing habits). The benchmark then uses an LLM-powered simulator to “act” as these different profiles. The initial user instruction is further perturbed to create a more challenging setting, e.g., a single-sentence, vague description of a complex GitHub issue.

The agents have to correctly query the user for clarification about the task (PPP Principle 2: Proactivity). Furthermore, the agents are given access to the user's past conversation history and must learn to adapt from those past interactions: ask a “low verbosity” user too many questions, and their satisfaction score will drop. This pushes agents to move beyond task completion and become effective collaborators that can model and adapt to their users (PPP Principle 3: Personalization).
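To give a flavor of what a profile might encode, here is a toy schema and satisfaction rule. The field names and numbers are our illustration; the actual benchmark drives an LLM-powered simulator rather than a hand-written rule.

```python
# Illustrative developer profile and a toy satisfaction rule: asking a
# low-verbosity user many questions should cost the agent.
from dataclasses import dataclass

@dataclass(frozen=True)
class DeveloperProfile:
    name: str
    verbosity: str            # "low" | "medium" | "high"
    testing_habit: str        # e.g. "always add unit tests" vs. "minimal tests"
    style: str                # e.g. "pythonic" vs. "C-style"

def satisfaction_penalty(profile: DeveloperProfile, questions_asked: int) -> float:
    """Toy rule: terse users tolerate fewer questions before satisfaction drops."""
    budget = {"low": 1, "medium": 3, "high": 6}[profile.verbosity]
    return max(0, questions_asked - budget) * 0.5

taciturn_dev = DeveloperProfile("dev_03", verbosity="low",
                                testing_habit="minimal tests", style="pythonic")
print(satisfaction_penalty(taciturn_dev, questions_asked=4))  # 1.5
```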

Our ToM-SWE agent significantly outperforms baselines on Stateful SWE Bench. More importantly, our cost-efficiency analysis reveals that even small LLMs, when powering the ToM agent, dramatically boost performance. This result is key: it suggests that modeling the user's mental state is a distinct and critical capability, one that can be powered by smaller, more efficient models.

To validate this in a real-world setting, we ran a three-week study where 17 developers used our ToM-enhanced CLI for their own daily coding tasks. Across 209 sessions, developers accepted (fully or partially) the ToM agent's suggestions 86% of the time, confirming its practical, real-world utility.

The success of ToM-SWE validates our hypothesis: modeling the “human POMDP” is a key driver for Proactivity and Personalization, and our benchmarks provide a way to evaluate it. A more fundamental challenge remains: how to scalably train agents to be Proactive and Personalized from the start?

Reinforcing User-Effective AI Agents with PPP-Inspired Rewards

The PPP principles give us a clear target, but a critical question remains: how do we actually train agents to be Productive, Proactive, and Personalized? As noted earlier, optimizing for productivity (task success) alone isn’t merely insufficient; it can actively hurt proactivity and personalization.

The core bottleneck is the lack of a scalable training environment that can provide informative user feedback. We can’t hire enough human users to interact with an agent for thousands of hours. To solve this, we built UserVille, an interactive environment populated with diverse, preference-aware LLM-based user simulators, as a scalable and challenging training ground for agents to learn in.

UserVille provides training signals that standard benchmarks lack by introducing ambiguity, user preferences, and user-effort labeling. It converts precise task prompts into underspecified ones to create controlled information asymmetry; uses diverse simulated user profiles, such as those preferring brevity, language constraints, or varying expertise levels; and turns this user-side feedback into one of the reward signals for training the agent.

From Interaction to Learning Signal

Each episode begins with a paired prompt: (1) a precise “ground truth” task specification (held exclusively by the user simulator), and (2) a deliberately vague prompt (given to the agent). During an episode, the agent may call tools (e.g., browse, bash) and a special ask_user tool.
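Conceptually, an episode specification might look like this. The field names and the example task are ours; the released environment may organize things differently.

```python
# Schematic of one episode: the precise prompt lives only inside the user
# simulator, while the agent receives the vague prompt plus its tools.
from dataclasses import dataclass

@dataclass(frozen=True)
class EpisodeSpec:
    precise_prompt: str          # ground truth, visible only to the user simulator
    vague_prompt: str            # what the agent actually receives
    user_profile: str            # e.g. "prefers brief answers, Italian only"
    tools: tuple = ("browse", "bash", "ask_user")

episode = EpisodeSpec(
    precise_prompt="Locate the function handling retry backoff in utils/net.py "
                   "and raise the cap from 30s to 60s.",
    vague_prompt="Fix the retry thing, it gives up too early.",
    user_profile="low verbosity; answers in one sentence",
)
```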

Given the vague user prompt $q$, the agent produces a multi-turn trajectory

$$\tau=(a_1,o_1,\ldots,a_T,o_T),\qquad p_{\theta}(\tau\mid q)=\prod_{t=1}^{T}\pi_{\theta}\big(a_t\mid q,\tau_{<t}\big),$$

where each $a_t$ may be a tool call or an ask_user call, and $o_t$ is the corresponding observation or answer from the user simulator. More specifically, the user simulator answers questions subject to a user preference, and labels every question with a user-effort class (low, medium, or high) as well as how well the question aligns with the user's personal preferences.

This produces three rewards per trajectory $\tau$: Productivity ($R_{\text{prod}}$): task success (e.g., passing tests); Proactivity ($R_{\text{proact}}$): whether the agent asks essential questions that are easy for the user to answer, while discouraging questions that could be answered by interacting with the environment or would be hard for the user to figure out; Personalization ($R_{\text{pers}}$): whether the agent's behavior aligns with the stated user-specific preference. We then sum them:

$$R(\tau)=R_{\text{prod}}+R_{\text{proact}}+R_{\text{pers}}.$$
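In code, the per-trajectory reward could be assembled roughly as below. The effort scores, the handling of non-essential questions, and the equal weighting are simplifications we chose for illustration; the actual reward design in the paper may differ.

```python
# Sketch of combining the three PPP reward terms for one trajectory.
# The user simulator labels each question with an effort class and whether it
# was essential; the numbers below are illustrative, not the paper's values.
EFFORT_SCORE = {"low": 1.0, "medium": 0.3, "high": -0.5}

def ppp_reward(tests_passed: bool, questions: list, preference_followed: bool) -> float:
    """questions: list of dicts like {"effort": "low", "essential": True}."""
    r_prod = 1.0 if tests_passed else 0.0
    r_proact = 0.0
    for q in questions:
        sign = 1.0 if q["essential"] else -1.0   # discourage needless questions
        r_proact += sign * EFFORT_SCORE[q["effort"]]
    r_pers = 1.0 if preference_followed else 0.0
    return r_prod + r_proact + r_pers            # R(tau) = R_prod + R_proact + R_pers

print(ppp_reward(True, [{"effort": "low", "essential": True}], True))  # 3.0
```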

We optimize with a GRPO-style clipped objective and token-level credit assignment. For each prompt $q$ we sample $G$ rollouts $\{\tau_i\}_{i=1}^{G}$ from $\pi_{\text{old}}$, compute the total reward $R_i$ per rollout, and form a group-relative advantage $\hat{A}_{i,t}=\frac{R_i-\operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}$ (the standard GRPO normalization).

Let $r_{i,t}(\theta)=\frac{\pi_{\theta}(\tau_{i,t}\mid q,\tau_{i,<t})}{\pi_{\text{old}}(\tau_{i,t}\mid q,\tau_{i,<t})}$ denote the token-level probability ratio.

The objective is

$$\mathcal{J}=\frac{1}{\sum_i|\tau_i|}\sum_{i=1}^{G}\sum_{t=1}^{|\tau_i|}\min\Big\{r_{i,t}(\theta)\,\hat{A}_{i,t},\;\operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i,t}\Big\}.$$

This encourages full-trajectory behaviors that finish tasks, ask minimally-disruptive essential questions, and adhere to user preferences.
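For readers who prefer code, here is a compact sketch of that objective with group-relative advantages and the clipped ratio. It mirrors the equations above but is not the actual training code; real training adds batching, masking, and typically a KL term.

```python
# Toy GRPO-style loss for one prompt with G sampled rollouts.
# rewards[i] is R_i; logp_new/logp_old are per-token log-probs of each rollout.
import torch

def grpo_loss(rewards, logp_new, logp_old, eps=0.2):
    """rewards: length-G list; logp_new, logp_old: lists of (T_i,) tensors."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # group-relative A_i
    total, n_tokens = 0.0, 0
    for A_i, lp_new, lp_old in zip(adv, logp_new, logp_old):
        ratio = torch.exp(lp_new - lp_old)                     # r_{i,t}(theta)
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
        total = total + torch.minimum(ratio * A_i, clipped * A_i).sum()
        n_tokens += lp_new.numel()
    return -(total / n_tokens)   # negate: the optimizer minimizes

# Example: 3 rollouts with different rewards and token counts.
loss = grpo_loss(
    rewards=[3.0, 1.0, 0.0],
    logp_new=[torch.randn(5, requires_grad=True), torch.randn(4), torch.randn(6)],
    logp_old=[torch.randn(5), torch.randn(4), torch.randn(6)],
)
loss.backward()
```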

Key Results and Findings

We start by applying PPP training to Seed-OSS-36B-Instruct, a 36B open-source model, producing PPP-36B. This model not only outperforms GPT-5 on two real-world complex tasks, but also shows much stronger proactivity (here we focus specifically on asking clarifying questions when needed) and better alignment with users’ personalized preferences. The results are exciting, and so are the insights we’ve gained:

  1. Does interaction help when prompts are vague?

Our first key finding: agent-user interaction is essential for handling vague prompts, but only when agents are trained to interact effectively.

Without interaction, the vague prompt causes a 20.39-point F1 drop (64.50 → 44.11). Simply allowing the base model to ask questions doesn't help; it needs to be trained with proper interaction objectives. With PPP training, the agent nearly recovers the full performance, demonstrating that effective interaction can bridge the information gap.

  2. How does PPP compare to baselines across all three dimensions?

We found that frontier LLMs perform unevenly. For example, GPT-5 scores high on productivity (55.83 F1 on SWE) but much lower on proactivity (36.60) and personalization (12.96). This supports our hypothesis that optimizing only for task success doesn't create agents that truly work well with users.

Our PPP approach delivers large gains, improving on GPT-5's average score by +21.6 across SWE-Func-Loc and BrowseComp+. The boosts are especially large for proactivity (+38.9 on SWE) and personalization (+76.3 on SWE).

We also confirmed that all three objectives are essential: removing any one of them clearly hurts performance. It's also interesting to see the trade-offs across dimensions. For example, optimizing only productivity + proactivity lowers personalization (-21.82), and optimizing only productivity + personalization reduces proactivity (-5.95). Surprisingly, the full PPP setup balances all three effectively.

  3. What are some key interactive behaviors that emerge from PPP training?

We first found that PPP-trained agents learn to distinguish precise from vague prompts: they ask questions in 100% of vague SWE instances but only 6% of precise ones. This confirms that the agent learned to identify ambiguity rather than ask indiscriminately; asking only when necessary is a critical capability for minimally disruptive agents.

We also discovered an “increase-then-decrease” learning dynamic, in which agents learn to ask better (lower-effort) questions over the course of training.

In those learning curves, the PPP method shows a clear increase-then-decrease pattern for medium-effort questions: their rate rises from 0.13 to 0.38 (as the agent learns to ask more), then falls to 0.04 (as it learns to ask better). Low-effort questions consistently increase from 0.35 to 0.92. In contrast, the Prod-only baseline shows continuous degradation: medium-effort questions climb to 0.53 and high-effort questions to 0.31. It's almost as if the agent is becoming “lazy”, offloading hard work to the user rather than doing its own exploration.

Finally, to test generalization, we evaluate PPP-36B on 8 unseen user preferences (e.g., language: Italian, multilingual, ALL CAPS; formatting: JSON, no commas, three sentences; style: humor, code snippets). We find that PPP-36B generalizes well, achieving 87.7% personalization accuracy. In contrast, training the model with only the task-success reward causes performance to collapse from 69.8% to 48.2%.
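To make these preferences tangible, here are a few toy rule-based adherence checks. The actual evaluation presumably uses an LLM judge or a richer rubric; this only illustrates what “following a preference” can mean mechanically.

```python
# Toy checks for a few of the unseen preferences mentioned above; illustrative only.
import json
import re

def follows_all_caps(text: str) -> bool:
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)

def follows_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def follows_no_commas(text: str) -> bool:
    return "," not in text

def follows_three_sentences(text: str) -> bool:
    return len(re.findall(r"[.!?](?:\s|$)", text.strip())) == 3

reply = "ALL TESTS PASS. PATCH APPLIED. LET ME KNOW IF ANYTHING BREAKS."
print(follows_all_caps(reply), follows_three_sentences(reply))  # True True
```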

Looking forward

This blog post was a collaboration between the authors and the very AI agents we study. The process was... interesting. We felt both the frustration of their flaws and the undeniable power of their speed, a speed no human developer can match. Despite their problems, one thing is clear: there is no going back. That leaves us asking: how can we push the boundary of AI agents to be more user-effective and collaborative? Here, we share our thoughts and open questions, hoping to inspire new research in this direction. We split the discussion into three critical aspects of reinforcement learning: context, priors, and environment.

More human user context

Probably the quickest fix is to bridge the context gap. Most collaborations break down because agents are blind to the huge chunk of context users hold just “below the surface” of an instruction. This context comes from many places: the website we just browsed, the paper we just read, the people we just talked to, the food we just ate, and so on. By designing better systems, we can capture more of this human user context, for example by capturing the user's screen as visual context or managing conversational memory to understand intent over time.

We could push the boundary of collecting user context even further by capturing more fine-grained user behaviors: for example, tracking the trajectory of the user's mouse and keyboard movements, collecting the user's voice commands, and recording background environmental sounds. In this case, human users wouldn't need to actively and explicitly tell the agent what to do; the collected context would already carry a lot of information about the user's intent and goals.

Yet with that much context, how do we efficiently and effectively fuse these disparate, multi-modal sources into a single, unified “confluence of contexts”? How do we reason about the user's mental state and intent from such context? And how do we provide safeguards to prevent that context from being leaked or misused?

Better user priors

Unlike formal systems such as mathematics, human behavior isn't always logical; it is often highly random and chaotic. Because of this, learning a robust user prior is essential for effective social navigation; capturing observable context alone will never be enough. We cannot see the “higher-level mental states”, i.e., a user's true goal, intent, or belief. Furthermore, in long-horizon sessions, simply logging all observations would cause the context window to “explode.”

Humans solve this exact problem daily. We don't have full context on each other, yet we collaborate effectively. As Michael Tomasello’s work shows, we do this by forming mental models of others and engaging in shared intentionality, i.e., the capacity to build joint intentions and commitments that enable genuine cooperation. This mental modeling is the key to unlocking user-effective AI and creating that “I know what you want” feeling. The pivotal question is how. Current LLM training paradigms are starved of the necessary data (like inner monologue or social reasoning), which is rarely verbalized due to reporting biases.

Can we create synthetic data to fill this gap? For example, we could learn to induce a structured model of social dynamics (like beliefs, intentions, and actions) from the “lossy, free-form narratives” of real-world interaction, which allows us to continue pretraining models on such recovered synthetic data. Some open questions here: how to ensure the quality of such synthetic data? And how could we model the underlying uncertainties of social reasoning?

More realistic user RL environments

A great deal of research has focused on building RL environments for AI agents, both for task completion and for human-AI interaction using off-the-shelf LLMs as simulated users. This approach, however, raises a crucial question: Are these LLM simulators actually doing a faithful job of simulating human users?

Recent investigations suggest the answer is no. The behaviors of SOTA LLMs are often very different from those of human users. Furthermore, they frequently suffer from “mode collapse,” producing predictable behavior with too little diversity to challenge the agent in realistic ways.

This problem points to an obvious solution: we must build better, more realistic user simulators. This is the goal of recent work like User-LM, which fine-tunes language models on large-scale, real-world human-chatbot interaction data to create more faithful and diverse personas.

But even a perfectly faithful text simulator isn't enough. A real user is not just a text-generation machine; they are an agent in their own right, driven by their own intentions and goals and carrying diverse background context. This suggests we could create more realistic environments that simulate a world of users, each with their own distinct, long-horizon goals, memory, and background beliefs. However, it is still unclear how the quality of the user simulator influences the trained agent's interaction behavior. Can we find tasks where training against poor user simulators makes a significant difference in the agent's interaction behavior? And how do we judge whether a user simulator is good enough?

Conclusion

We are at an exciting inflection point. Agents are finally generalizable and powerful enough for true human collaboration. We know that humans and AI, originating from vastly different “training paradigms”, have complementary strengths. Yet, the dominant public narrative focuses on replacement, amplifying anxiety and fear. This fear-driven narrative misses the more robust and sustainable future: one built on collaboration.

The quest for user-effective AI agents is not just about better models or higher benchmark scores; it requires fundamentally rethinking how we design, evaluate, and deploy AI systems that work seamlessly with users. We are actively working on these aspects, and if you are interested in this direction, please feel free to reach out to us.

Citation

Please cite this work as:

Xuhui Zhou and Weiwei Sun, “The Quest of User-Effective AI Agents”, 2025.

Or use the BibTeX citation:

@misc{zhou2025usereffective,
  author = {Xuhui Zhou and Weiwei Sun},
  title = {The Quest of User-Effective AI Agents},
  year = {2025},
  howpublished = {\url{https://xuhuizhou.github.io/blog/on-the-quest-of-user-effective-ai-agents}},
}

Thanks to Xingyao, Saujas, Sanidhya, Shannon, Valerie, Zhiqiu, Hao, Graham, Maarten for feedback and thoughtful discussions ❤️ 🤝. Email me (xuhuiz@cs.cmu.edu) if you have thoughts or comments.