The ability to perform high-agency tasks is important, but it is just as important to ensure that agents execute tasks competently, reliably, and consistently when deployed in high-value use cases.
Why is customer support such a challenging space?
Over the past few months, Large Language Models (LLMs) have advanced significantly. Products like ‘computer use’ from Anthropic and OpenAI, and Deep Research from OpenAI, demonstrate LLMs’ increasing capability in high-agency tasks.
High-agency agents are those whose actions are primarily self-governed, constrained only by their environment and a goal.
However, most examples of high-agency agents operate in ideal environments that provide complete knowledge to the agent and are ‘patient’ with erroneous or flaky interactions. That is, the agent has access to a complete snapshot of its environment at all times, and the environment is forgiving of its mistakes.
This contrasts sharply with customer support agents like Fin, which generally have knowledge gaps since they are configured by real humans, and which interact with humans who may be incoherent, and are often impatient and frustrated. In addition, these agents are highly constrained in how much time they can spend solving a problem, as latency is a core user-experience issue.
What are the customer’s expectations from agents?
Fin has excelled at addressing informational queries using its state-of-the-art RAG framework. This framework has achieved resolution rates in the high 60s for customers with well-developed knowledge bases.
However, as Fin resolves more informational queries, we are seeing rising demand for it to solve complex problems. Achieving this introduces many requirements, such as conversational debugging by gathering personalised context, fetching personalised data from external APIs, asking questions, executing business decision logic, and much more.
Typical use cases from our customers include:
- Subscription/order management: renewing, cancelling, pausing subscriptions
- Credit refunds, typically conditional on account state and/or metadata
- Context gathering, and decision logic based on gathered context values
- For sensitive issues, gathering personal context from the end user, before escalating to humans to finally resolve them
The hidden requirement behind all of these use cases, which are typically quite sensitive for the business, is that customers require a very high level of reliability and control.
This means they want to tune how Fin asks questions, takes branching decisions, and calls external APIs. But they also want a guarantee that Fin will make the same decision reliably and consistently over hundreds of conversations, no matter how users express themselves!
This means that a successful agent should (1) possess high agency, (2) solve highly complex problems, and (3) do so with very high levels of reliability.
To begin to address these customer expectations, it becomes essential to develop a robust method for measuring agent performance.
We discuss our measurement approach in the context of our work to build our ‘Give Fin a Task’ (GFAT) agent, an agent within Fin designed to let customers achieve the high reliability they want by constraining some agency.
Measuring agents

Fig. 1: Balancing the AI agent’s agency and control with reliability for complex tasks is hard. Customers typically want high reliability, even if it comes at the cost of some agency. “Give Fin a Task” is a variant of AI agent that fits this requirement, and with its composable nature, customers can progressively solve more complex tasks.
Assessing the competence of customer support agents is challenging because performance depends on both task complexity and the level of agency an agent has (Fig. 1).
A practical way to measure reliability is to evaluate how consistently an agent completes a given task when repeated multiple times. This is particularly relevant for customer support, where agents frequently handle repetitive requests, such as processing refunds or cancellations.
Simulated Task Testing
To measure agent performance, we simulate a “Task” where the agent interacts with a simulated “end user” over multiple turns. The simulated user, powered by another LLM, follows a predefined script to mimic real customer interactions while attempting to resolve a specific support issue. Each set of such interactions forms a “test,” which can be repeated as needed. A test consists of:
- Expected outcomes – Clearly defined success criteria for the test, such as an expected order status, a specific value like a refund amount, or required API calls (e.g., getShopifyOrders) to retrieve additional information.
- Simulated user prompt – Defines how the user interacts with the agent, including their persona and communication style.
- Stopping condition – Reached when the agent fails to meet the task requirements or exceeds the maximum number of interaction turns.
The test concludes when either the expected outcomes are met or the stopping condition is reached. For example, in an order shipment inquiry, the agent must extract the correct user ID, call the appropriate function to retrieve the order status, interpret the response, and clearly communicate it to the user. Only when all these expectations are met is the test considered a pass.
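To make this concrete, here is a minimal sketch of what such a test harness might look like. The field names and the agent/simulated-user interfaces (`respond`, `open`) are illustrative assumptions, not Fin’s actual internals:

```python
from dataclasses import dataclass

@dataclass
class TaskTest:
    """One simulated support task. All fields are illustrative."""
    expected_outcomes: dict   # e.g. {"order_status": "shipped"}
    expected_calls: set       # e.g. {"getShopifyOrders"}
    user_prompt: str          # persona + script for the simulated user
    max_turns: int = 10       # stopping condition

def outcomes_met(expected: dict, observed: dict) -> bool:
    return all(observed.get(k) == v for k, v in expected.items())

def run_test(agent, simulated_user, test: TaskTest) -> bool:
    """Pass only if every expected outcome and API call is observed
    before the turn budget runs out."""
    observed, calls = {}, set()
    message = simulated_user.open(test.user_prompt)  # first user message
    for _ in range(test.max_turns):
        # The agent replies and may call tools, updating observed/calls.
        reply = agent.respond(message, observed, calls)
        if (outcomes_met(test.expected_outcomes, observed)
                and test.expected_calls <= calls):
            return True   # expected outcomes met
        message = simulated_user.respond(reply)
    return False          # stopping condition reached
```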
Realistic User Simulation
To reflect real-world interactions with Fin, we model customer behaviors such as impatience, brevity, and incomplete information. These factors, combined with task complexity and imperfect API specifications, create a realistic testing environment.
By simulating diverse agentic tasks, we gain insights into how different LLM agent architectures perform under real-world conditions.
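As an illustration, a simulated-user prompt might look like the following. The wording, persona, and order details are hypothetical, not our production prompts:

```python
# Hypothetical persona prompt for the LLM playing the end user.
SIMULATED_USER_PROMPT = """
You are a customer contacting support about order #8231.
Persona: impatient; replies in one short sentence; never volunteers
details unless asked directly; cannot remember the email used at checkout.
Goal: find out why the order has not shipped, and request a refund if
it is delayed. End the conversation once you get a clear answer.
"""
```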
The Metric
Most benchmarks use pass@k, which measures the percentage of tasks successfully completed at least once in k repetitions. However, this is not a useful measure of reliability for customer support agents.
Instead, we need the percentage of tasks successfully completed every time across k repetitions. This metric is widely known as pass^k. It is much stricter, as it requires consistent success rather than a single correct attempt. The choice of k is indicative rather than absolute: it helps identify trends but does not perfectly reflect real-world expectations. In practice, no two customer interactions are identical, making absolute replication unlikely.
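The contrast between the two metrics is easy to see in code. Below is a minimal sketch using the standard without-replacement estimators computed from n recorded trials with c successes; the trial counts in the example are made up for illustration:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k: chance that at least one of k samples (drawn without
    replacement from n trials with c successes) succeeds."""
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """pass^k: chance that ALL k samples succeed, i.e. the agent
    solves the task every single time."""
    return math.comb(c, k) / math.comb(n, k)

# A task solved 7 times out of 10 trials looks strong under pass@k
# but weak under pass^k:
print(round(pass_at_k(10, 7, 3), 3))   # 0.992
print(round(pass_hat_k(10, 7, 3), 3))  # 0.292
```

The same underlying trial data can therefore look near-perfect under pass@k while revealing serious inconsistency under pass^k, which is exactly the gap that matters for repeated support requests.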
Agency, Control, and Reliability
Agency and control trade off against each other: the freer the agent is to take actions, the less our customers can control its behaviour. That much is intuitive. But as our benchmark experiments show, agency and reliability also trade off. Allowing an agent to take long-horizon decisions bounded only by the availability of tools is a recipe for unreliable behaviour.
The graph below shows the pass rate metric computed across all our tests at different values of k. It suggests that anyone expecting an LLM-based agent to solve a high-agency task reliably over multiple customer support conversations will be disappointed by the success rates. But even this graph tells only half the story.

Fig. 2: Task pass rates across all tests for high-agency function calling agents, at different values of k for repetitions. The dotted trend line indicates that the reliability of these agents falls rapidly.
To get deeper insight into the agent’s behaviour, we classified the tasks in our benchmark into three categories (a small code sketch of this rubric follows the list):
- Simple Tasks: Involve independent context gathering or straightforward action calls. This might include a one-off status check for an order.
- Moderate Tasks: Require conditional branching or chaining, such as using one action’s output as input for another. This might involve validating some information from the user: first authenticating them, then checking their account, and finally gathering the context needed for validation.
- Complex Tasks: Involve multiple pathways to achieve a goal, imperfect information (e.g., a user forgetting credentials), and extensive context gathering and chaining. These are much more complicated tasks where we are dealing with a user who has incomplete information, a user who wants to make several changes to their state, or simply a difficult user.
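For illustration, the rubric can be captured as a simple enum used to tag benchmark tests so pass^k can be reported per category; the task names below are hypothetical:

```python
from enum import Enum

class TaskComplexity(Enum):
    SIMPLE = "simple"       # independent context gathering, one-off calls
    MODERATE = "moderate"   # conditional branching or chained actions
    COMPLEX = "complex"     # multiple pathways, imperfect information

# Hypothetical benchmark tests tagged by the rubric above,
# enabling per-category pass^k reporting (as in Fig. 3).
BENCHMARK_TAGS = {
    "check_order_status": TaskComplexity.SIMPLE,
    "validate_user_then_update": TaskComplexity.MODERATE,
    "refund_with_forgotten_credentials": TaskComplexity.COMPLEX,
}
```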
When we run the experiments separately on simple and complex tasks as classified by the above rubric, the pass^k metric looks very different. While agents perform in line with the overall results of Fig. 2 on simple tasks, their performance drops by a huge margin on complex tasks. Function calling agents using older models like Claude Sonnet perform slightly better on simple tasks than reasoning models like o3. Reasoning models gain some advantage on complex tasks, but not enough to be optimistic.

Fig. 3: Difference in the agent’s pass^k performance on simple vs complex tasks.
The above results show that, in the context of customer support, LLMs and their agentic capabilities are not yet reliable enough to solve the full range of task complexities our customers might experience. Low reliability can directly impact CSAT and escalations, while making agent interactions more frustrating for our end users.
So what’s the alternative at this point?
We believe that the primary challenge for LLM-based support agents in the immediate future is to maximise their performance along the agency and complexity dimensions (Fig. 1), progressively moving them towards the top-right corner of this grid. But there is ample evidence today that reliability is inversely proportional to an agent’s agency, and this tradeoff seems non-negotiable for the current state of LLMs. Maximising agency on highly open-ended and complex customer support tasks would therefore likely come at the cost of the agent’s reliability, and thus its impact on resolutions.
That said, the results suggest that if we constrain the agent’s autonomy and frame each task as a simple, well-structured instruction, minimising ambiguity and room for interpretation, we might significantly boost the agent’s reliability, potentially reaching resolution rates in the high 60s, comparable to those seen with informational queries. This could serve as a strong stopgap for an agentic customer support tool until we reach a fully capable agent that can competently handle open-ended, high-agency tasks.
So we need to make a bet. Initial customer research shows that our customers are willing to accept a lower level of agency (and a higher level of control) in Fin if it means a much more reliable product. With this in mind, we have designed the “Give Fin a Task” agent, which will be an integral part of the Fin Tasks feature in Intercom.



Fig. 4: Performance on tasks strictly structured as steps, for two agent variants: an agent designed to strictly follow steps-based execution of a steps-based task structure (the GFAT agent), and a conventional function calling agent that still uses the steps-based task structure. We used Claude 3.5 Sonnet V2 to generate these results.
This agentic product would strongly lean on the following key concepts:
- Control: Since the agent leverages the pre-existing and familiar workflow-like environment, a teammate would find it easy to build complex workflows with agentic blocks that do narrower, well-defined tasks, while achieving a bigger and more complex end goal via composition.
- Agency: Although we trade off agency for control, we make these agents highly expressive via steps-based instructions: each Give Fin a Task agent is configured with a sequence of steps to fulfil a task, where each step is an executable instruction (a sketch of such a configuration follows this list). The agent is designed to follow these steps reliably, allowing high levels of reliability for narrower and moderately complex tasks, while still maintaining all the conversational benefits of an LLM-powered agent.
- Composability: The product would allow teammates to break down highly complicated tasks into narrower and more reliable tasks, thereby pushing the reliability of each individual task much higher than an unconstrained agentic approach. This would also allow teammates to compose much more complex workflows out of many such individual “Give Fin a Task” blocks.
- Feedback loops: Based on research done within the team, we will build tooling that:
  - Helps the teammate write task prompts that maximise reliability without meaningfully compromising agency.
  - Provides insights into the reliability of a configured task prompt, and suggests improvements.
  - Provides meaningful metrics around the agentic blocks in the workflow once they are in production.
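To illustrate the steps-based instruction style referenced above, here is a hypothetical task configuration. The step wording, field names, and action names (getShopifyOrders, issueRefund) are assumptions for illustration, not the actual Fin Tasks schema:

```python
# Hypothetical steps-based task configuration for a GFAT-style agent.
# Each step is a narrow, executable instruction; some steps bind to
# an external action the agent may call.
REFUND_TASK_STEPS = [
    {"step": 1, "instruction": "Ask the user for their order number."},
    {"step": 2, "instruction": "Look up the order.",
     "action": "getShopifyOrders"},
    {"step": 3, "instruction": "If the order is older than 30 days, explain "
                               "it is outside the refund window and stop."},
    {"step": 4, "instruction": "Otherwise issue the refund and confirm the "
                               "amount to the user.",
     "action": "issueRefund"},
    {"step": 5, "instruction": "If anything is ambiguous, escalate to a human."},
]
```

Because each step is a narrow, executable instruction, the agent’s room for interpretation, and hence its agency, is deliberately bounded; this is where the reliability gains shown in Fig. 4 come from.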
This bet has shown a lot of promise in our early benchmarking tests, as seen in Fig. 4. The Give Fin a Task agent shows significantly higher pass^k performance (k = 3, 4, 5) on tasks configured using the steps-based structure and executed by the steps-executor (GFAT) agent, which has a tempered level of agency: the steps-based structure of its instructions restricts what it can do, reducing its agency compared with pure function calling agents.
Performance on simple tasks shows a very high degree of reliability, and we see a considerable gain on moderate tasks for the steps executor. The earlier, almost exponential drop in performance with increasing repetitions nearly vanishes, indicating much more stable and reliable agent behaviour. The benefit of the steps-based task structure also extends to pure function calling agents, whose reliability receives a considerable boost across all three task categories, especially on simple tasks.
Conclusions
In summary, the tradeoffs between agency, control, and reliability are central to deploying effective customer support agents. Our analysis shows that while high-agency agents are capable of handling complex tasks, their performance suffers in terms of consistency, a critical factor for customer satisfaction and adoption. By quantifying reliability through a stricter metric (pass^k) and simulating realistic customer interactions inspired by real customer use cases, we expose the limitations of current LLM-based systems in open-ended environments.
Our solution lies in strategically restricting agency through controlled, modular task configurations. The “Give Fin a Task” agent is a prime example of this approach: by emphasizing step-based instructions and allowing complex tasks to be composed via simpler agentic blocks in workflows, we can achieve higher reliability and ultimately a better customer experience. Although inherent limitations in current LLM technology persist, balancing agency with enhanced control offers a promising pathway to improve performance in real-world customer support scenarios.