Field notes

What to automate first: how we rank operational opportunities

The first agent system a team builds will shape how the team feels about agent systems for years. Build the right one and the next three engagements get internal sponsorship before they’re even scoped. Build the wrong one and the team becomes politely sceptical about everything that follows, in a way that is very difficult to recover from.

This post is the framework we use to pick. It’s the spine of every Diagnosis we run, the one we’d hand to a team that wanted to do their own evaluation without us. The framework is unromantic on purpose. The decisions that determine whether an agent system pays off are mostly not about the agent.

The trap: starting with the loudest pain

Walk into any growth-stage company and ask the leadership where AI should be applied. You will get a confident answer. It will almost always be wrong.

The wrong answer correlates strongly with whichever operational pain has been loudest in the last quarter. Customer support is overwhelmed: build an AI support agent. Sales is missing follow-ups: build an AI SDR. The CFO complained about reporting overhead: build an AI analyst. These are not bad ideas in the abstract, but the loudness of a pain has very little to do with whether agent autonomy is the right intervention for it. Loud pains are often loud because they are visible to leadership, not because they are tractable.

The disciplined move is to ignore the loudness signal entirely in the first pass and rank opportunities on five orthogonal axes instead. The loudness ranking can come back at the end as a tiebreaker. It should not drive the analysis.

The five axes

We score every candidate workflow on five things. Each is a number from 1 to 5, with explicit anchors so the scoring is repeatable across analysts.

Leverage. How many person-hours per week does the workflow consume across the team? A workflow that eats forty hours a week of an operator’s time is more leveraged than one that eats four. Score 1 for under two hours; 5 for over forty. This is the easiest axis to estimate badly because teams chronically underreport coordination overhead. We always validate with calendar audits, not with self-reports.

Frequency. How often does an instance of this workflow occur? Frequency matters because agent systems have a fixed cost (architecture, evaluation, observability) that amortises over invocations. A workflow that runs ten thousand times a month rewards investment that a workflow running ten times a month cannot. Score 1 for monthly or rarer; 5 for many times per hour.

Tractability. How well-defined is the workflow? Can you write down, on one page, the inputs, the steps, the decision points, and the desired outputs? If yes, it’s tractable. If not, the workflow is fuzzy and the agent system will inherit the fuzziness, plus add new fuzziness of its own. Score 1 for “we know it when we see it”; 5 for “we have a written runbook the team already follows.” Tractable workflows are the bread and butter of agent systems. Intractable ones are where Enable engagements live, not Build engagements.

Reversibility. When the agent system makes a mistake, how easy is the mistake to detect and undo? An agent that drafts emails for human review is highly reversible: the human catches errors before sending. An agent that posts content to a customer’s public profile is barely reversible: the post is live before anyone notices. An agent that issues refunds is reversible in the books but not in the customer’s perception. Score 1 for “irreversible and externally visible”; 5 for “no external action without human confirmation, fully audited, undoable in seconds.”

Strategic adjacency. Does this workflow touch the part of the business where new operational capacity matters most over the next twelve months? An agent that frees up engineering time when engineering is the binding constraint is differently valuable from one that frees up engineering time when sales is the binding constraint. Score 1 for “purely cost-saving on a non-strategic function”; 5 for “directly increases throughput on the function that defines the year.”

The composite is multiplicative, not additive

The interesting move is in how the axes combine. We multiply, not add. A workflow that scores 5/5/5/5/1 (high leverage, high frequency, tractable, reversible, but strategically irrelevant) gets a composite of 625. A workflow that scores 5/5/5/1/5 (high leverage, high frequency, tractable, strategically central, but largely irreversible) gets the same. A workflow scoring 5/5/3/3/3 gets 675 and outranks both, even though it would lose under additive scoring (19 against 21), because the multiplicative score punishes any axis with a low value while the additive score lets strengths paper over it.

The reason is that the failure modes of an agent system are not additive. A workflow that’s irreversible will eventually fail catastrophically; the leverage and frequency that made it attractive will become the magnifier on the failure. A workflow that’s intractable will either never get built well or will be built fragile and rewritten every quarter. Multiplicative scoring forces the analysis to address the weakest axis first, rather than averaging it out behind strengths elsewhere.

The threshold we use: composite under 200, deprioritise. 200 to 500, candidate but examine the lowest axis carefully. Over 500, serious candidate. Over 1,500, this is what we recommend you build.
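For concreteness, here is a minimal sketch of the scoring arithmetic in Python. The three score profiles are the hypothetical ones from the comparison above; the function and variable names are ours for illustration, not part of any tooling we ship.

    from math import prod

    # Illustrative only: three hypothetical score profiles from the comparison above.
    # Axis order: leverage, frequency, tractability, reversibility, strategic adjacency,
    # each scored 1-5 against the anchors described earlier.
    candidates = {
        "spiky, strategically irrelevant": (5, 5, 5, 5, 1),
        "spiky, largely irreversible":     (5, 5, 5, 1, 5),
        "balanced":                        (5, 5, 3, 3, 3),
    }

    def composite(scores):
        # Multiplicative, not additive: one low axis drags the whole score down.
        return prod(scores)

    def verdict(score):
        # Thresholds from the framework: under 200 deprioritise, 200-500 examine the
        # lowest axis, over 500 serious candidate, over 1,500 recommend building first.
        if score > 1500:
            return "recommend building this first"
        if score > 500:
            return "serious candidate"
        if score >= 200:
            return "candidate; examine the lowest axis carefully"
        return "deprioritise"

    for name, scores in candidates.items():
        c = composite(scores)
        print(f"{name}: sum={sum(scores)}, product={c} -> {verdict(c)}")
    # The balanced profile (675) outranks both spiky profiles (625),
    # even though it loses to each of them under addition (19 against 21).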

The axes nobody mentions

Three secondary considerations matter a great deal, and they rarely make it into anyone’s framework.

The current operator’s enthusiasm. Every workflow has a human owner today. Their relationship to the work matters enormously to whether the agent system succeeds, because they will be the one who notices when it misbehaves, who tunes its prompts as the work evolves, who advocates for or against it internally. An owner who is exhausted by the work and excited to hand it off is the best possible collaborator. An owner who is proud of the work and quietly worried about being made redundant is a good collaborator if you handle the dynamics well, and a saboteur if you don’t. We talk to the operator before scoring the workflow. Their enthusiasm shifts the recommendation more than any single axis.

The data realities. A workflow that depends on data trapped in someone’s head, or in a CRM nobody updates, or in a tool the team has been meaning to migrate away from for two years, will not be automatable until those data realities are addressed. Score the data substrate honestly. Many “obvious” automation candidates are actually data-cleanup projects with an automation reward at the end. We will sometimes recommend that work too, but we will scope it as data cleanup, not as agent-system delivery, because mislabelling it sets the wrong expectations.

Existing tooling collisions. If the workflow is already partially served by a SaaS tool the team uses, the agent system has to integrate, replace, or coexist. Each option has a different cost. The most painful Diagnoses are the ones where the right answer is “your existing tool already does most of this; the right move is to use it properly first.” Sometimes the answer to “what should we automate first” is “nothing yet; you bought the automation already and aren’t using it.”

What the framework rules out, and why that’s the point

Applying this framework well rules out the majority of “obvious” first projects.

It rules out building an AI support agent at most companies, because support workflows score poorly on tractability (every conversation is different) and reversibility (a wrong response is visible to the customer immediately).

It rules out the AI SDR for most companies, because the workflow scores poorly on tractability for any team that hasn’t already systematised its outbound, and the strategic adjacency depends on whether outbound is actually the binding constraint or just the easiest function to point at.

It rules out the AI executive assistant in most cases, because the leverage on a single executive is limited by how much of their work can actually be done without their judgement, which turns out to be less than people assume.

It tends to surface, instead, workflows nobody wanted to talk about: internal coordination work that runs across functions, data movement and reconciliation between tools, structured research that engineers and analysts are doing manually because nobody has built the right tooling, recurring synthesis tasks that fall on whoever has bandwidth in a given week. These workflows are unglamorous. They are also, almost without exception, where the highest-leverage agent systems live.

How we use the framework in a Diagnosis

Inside a real Diagnosis engagement, the framework is the spine of the analysis but not the whole of it. We score every candidate workflow we can identify (typically thirty to seventy across a growth-stage company), rank them, then select the top five for deeper investigation: stakeholder interviews, data audits, integration scoping, rough cost estimation. The output is a ranked recommendation document, and the recommended first build is the one that scores highest after the deeper investigation, not just after the initial scoring.

The framework’s main job, in other words, is not to produce the answer. It is to discipline the search so that the answer isn’t biased by which pain was loudest the week we walked in. The deeper investigation does the rest.

If you want to do a version of this exercise yourself before talking to anyone, the minimum viable version is: list every workflow that takes more than two hours of someone’s week, score each one on the five axes, multiply, and look at the top three. You will discover that one of them is not what you would have guessed, and that is the one most worth examining.
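A minimal sketch of that exercise, with placeholder workflows and scores invented purely for illustration:

    # A toy inventory: every workflow that takes more than two hours of someone's
    # week, scored 1-5 on the five axes. Names and scores are placeholders.
    workflows = [
        # (name, leverage, frequency, tractability, reversibility, adjacency)
        ("weekly cross-team status synthesis", 4, 3, 4, 5, 3),
        ("CRM-to-billing reconciliation",      3, 4, 5, 4, 3),
        ("outbound prospect research",         4, 4, 2, 4, 4),
        ("support ticket triage",              5, 5, 2, 2, 3),
    ]

    # Multiply the axes, rank, and look at the top three.
    ranked = sorted(
        ((name, lev * freq * trac * rev * adj)
         for name, lev, freq, trac, rev, adj in workflows),
        key=lambda pair: pair[1],
        reverse=True,
    )

    for name, score in ranked[:3]:
        print(f"{score:>5}  {name}")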

The rest is execution. But the framing decides whether execution is even pointed at the right thing, and getting the framing wrong is the most expensive mistake a team can make in their first agent engagement. Loud pains will still be there next quarter. Build the right thing first.