<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Marco's Substack]]></title><description><![CDATA[Enterprise AI adoption, Agentic systems, and the operations that make them work.]]></description><link>https://marcosnewsletter.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!eppd!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f4d75f5-0eca-4493-a3d3-c318edf2ee4c_648x648.png</url><title>Marco&apos;s Substack</title><link>https://marcosnewsletter.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 17 May 2026 03:09:12 GMT</lastBuildDate><atom:link href="https://marcosnewsletter.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Marco]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[marcosnewsletter@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[marcosnewsletter@substack.com]]></itunes:email><itunes:name><![CDATA[Marco]]></itunes:name></itunes:owner><itunes:author><![CDATA[Marco]]></itunes:author><googleplay:owner><![CDATA[marcosnewsletter@substack.com]]></googleplay:owner><googleplay:email><![CDATA[marcosnewsletter@substack.com]]></googleplay:email><googleplay:author><![CDATA[Marco]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Person-Led AI Has a Ceiling. Operational AI Has Compounding.]]></title><description><![CDATA[Last week I split AI agents into Person-Led AI and Operational AI. This week, the deeper question: who runs the task. 
The difference decides whether your AI investment compounds or evaporates.]]></description><link>https://marcosnewsletter.substack.com/p/person-led-ai-ceiling-operational-ai-compounding</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/person-led-ai-ceiling-operational-ai-compounding</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Wed, 13 May 2026 22:35:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FQHU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930f16d1-ecf0-4842-aa8b-dce78e8a9566_1730x909.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The conversation could have ended there &#8594; Most professional conversations about AI in 2026 stop at &#8220;we use it for X and we are seeing gains&#8221;.</p><p>The gain is real. The conversation is incomplete.</p><p>What I tried to add to that exchange is the distinction that decides whether AI compounds inside an organization or stays trapped inside the people who use it. Most enterprise leaders have not yet been given clean language for it.</p><p>That distinction is not about the tool. It is about who runs the task.</p><p>When you open the computer to ask Claude to draft a specification, you have run the task. The AI helped, you guided, the output is yours. This is Person-Led AI. It is real, it is valuable, and it has a ceiling.</p><p>When the agent runs the task on its own, on a configured trigger, and contacts you only when it needs your specific input, you have stopped running the task. This is Operational AI. It is harder to set up. It compounds.</p><blockquote><p>Person-Led AI is real, valuable, and has a ceiling. Operational AI is harder to set up, and it compounds.</p></blockquote><p>Both are legitimate. Both are useful. 
They are not interchangeable, and the difference will explain a large share of the AI value capture gap that enterprise reports keep documenting.</p><div><hr></div><h2>The operational definition of who runs the task</h2><p>The simplest test for any AI deployment is the question of who initiates.</p><p>In Person-Led AI, a person opens the computer, writes a prompt, attaches a file, asks for help. The trigger is human and ad hoc. The decision to engage the AI is made fresh each time, by the person who needs the work done. The work happens because someone chose, in that moment, to engage.</p><p>In Operational AI, the agent starts on its own. The trigger is configured. It can be temporal (every Monday morning), event-driven (a new ticket arrives), threshold-driven (a metric crosses a value), or systemic (another system signals a state change).</p><p>No one decides, in that moment, to engage the AI. The AI was already engaged.</p><p>This sounds like a small distinction. It is not. The shift from &#8220;a person initiates&#8221; to &#8220;a trigger initiates&#8221; changes the economics, the operating model, the failure mode, and the kind of value that gets created.</p><p>Take the monthly business review as a concrete example.</p><p>In a Person-Led AI setup, on the first of the month a finance analyst opens her tools, queries the data warehouse, pulls metrics into a deck, drafts narratives, sends to her manager for review. Claude or Copilot speeds up each of these steps. The analyst is faster. The work still depends entirely on her starting it.</p><p>In an Operational AI setup, on the first of the month an agent runs. It pulls the same data, generates the same draft, and contacts the analyst with specific questions where it needs domain expertise: a variance it cannot explain, a one-off transaction that breaks pattern, a narrative judgment call. The analyst answers the specific questions. The agent finishes the deck and routes it.</p><blockquote><p>Same output. 
Same monthly cadence. Two profoundly different operating models. When the analyst leaves, the first model leaves with her. The second model survives.</p></blockquote><p>In the first, the value of the work scales with the analyst. In the second, the value scales with the system.</p><div><hr></div><h2>Person-Led AI is real, useful, and bounded</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FQHU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930f16d1-ecf0-4842-aa8b-dce78e8a9566_1730x909.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FQHU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930f16d1-ecf0-4842-aa8b-dce78e8a9566_1730x909.png 424w, https://substackcdn.com/image/fetch/$s_!FQHU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930f16d1-ecf0-4842-aa8b-dce78e8a9566_1730x909.png 848w, https://substackcdn.com/image/fetch/$s_!FQHU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930f16d1-ecf0-4842-aa8b-dce78e8a9566_1730x909.png 1272w, https://substackcdn.com/image/fetch/$s_!FQHU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930f16d1-ecf0-4842-aa8b-dce78e8a9566_1730x909.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FQHU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930f16d1-ecf0-4842-aa8b-dce78e8a9566_1730x909.png" width="1456" height="765" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/930f16d1-ecf0-4842-aa8b-dce78e8a9566_1730x909.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:765,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2609683,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/196538566?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930f16d1-ecf0-4842-aa8b-dce78e8a9566_1730x909.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FQHU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930f16d1-ecf0-4842-aa8b-dce78e8a9566_1730x909.png 424w, https://substackcdn.com/image/fetch/$s_!FQHU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930f16d1-ecf0-4842-aa8b-dce78e8a9566_1730x909.png 848w, https://substackcdn.com/image/fetch/$s_!FQHU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930f16d1-ecf0-4842-aa8b-dce78e8a9566_1730x909.png 1272w, https://substackcdn.com/image/fetch/$s_!FQHU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F930f16d1-ecf0-4842-aa8b-dce78e8a9566_1730x909.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>I want to be careful here, because the easy mistake is to read this as &#8220;Person-Led AI is bad, Operational AI is good&#8221;. 
That is not what I am saying.</p><p>Person-Led AI is generating real, measurable value at enterprise scale. The numbers are unambiguous.</p><p>According to <a href="https://menlovc.com/perspective/2025-the-state-of-generative-ai-in-the-enterprise/">Menlo Ventures&#8217; 2025 State of Generative AI</a>, copilots dominate the horizontal AI market: $7.2 billion of an $8.4 billion total, or 86%. Microsoft Copilot, Claude for Work, ChatGPT Enterprise, and adjacent tools are responsible for the majority of enterprise AI spend in 2025-2026.</p><p>They are responsible because they work.</p><p>A developer using Copilot Workspace ships features faster. A finance analyst using Claude drafts variance commentary in minutes instead of hours. A marketing manager using ChatGPT Enterprise iterates on positioning across audiences before lunch.</p><p>These are real productivity gains, captured by individuals, that compound at the level of individual output. According to <a href="https://learnexperts.ai/blog/linkedin-workplace-learning-report/">LinkedIn&#8217;s Workplace Learning Report</a>, AI literacy is the number one rising skill on the platform in 2025.</p><p>Person-Led AI is not a fad. It is the new baseline.</p><p>But it has a ceiling.</p><p>Person-Led AI scales linearly with the people who use it. If you have 500 employees who each save 30 minutes a day with Copilot, you have created 250 hours of daily productivity. Real, measurable, valuable.</p><p>Now multiply: how do you get to 2,500 hours? You hire ten times more people. The model is per-seat, the value is per-person, the ceiling is your headcount.</p><p>McKinsey&#8217;s <a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">State of AI 2025</a> puts a sharper edge on this. 
Only 6% of organizations qualify as high performers, defined as those generating 5% or more EBIT impact from AI.</p><p>The 94% that do not capture meaningful value are not, in most cases, organizations that failed at Person-Led AI. They are organizations that succeeded at Person-Led AI and stopped there. They bought copilot licenses, their people got faster, and the financial metrics did not move.</p><p>The gap between individual productivity and enterprise value is the gap between Person-Led AI and what comes next.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">If this is the kind of distinction you want to keep working with, consider subscribing. I publish data-driven breakdowns on enterprise AI maturity every Tuesday, no hype, no vendor pitches.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>Operational AI is where the work runs without you</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YCQ-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b58c57d-b121-450a-b957-2fd63fce36b0_1730x909.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!YCQ-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b58c57d-b121-450a-b957-2fd63fce36b0_1730x909.png 424w, https://substackcdn.com/image/fetch/$s_!YCQ-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b58c57d-b121-450a-b957-2fd63fce36b0_1730x909.png 848w, https://substackcdn.com/image/fetch/$s_!YCQ-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b58c57d-b121-450a-b957-2fd63fce36b0_1730x909.png 1272w, https://substackcdn.com/image/fetch/$s_!YCQ-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b58c57d-b121-450a-b957-2fd63fce36b0_1730x909.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YCQ-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b58c57d-b121-450a-b957-2fd63fce36b0_1730x909.png" width="1456" height="765" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b58c57d-b121-450a-b957-2fd63fce36b0_1730x909.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:765,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2766594,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/196538566?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b58c57d-b121-450a-b957-2fd63fce36b0_1730x909.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YCQ-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b58c57d-b121-450a-b957-2fd63fce36b0_1730x909.png 424w, https://substackcdn.com/image/fetch/$s_!YCQ-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b58c57d-b121-450a-b957-2fd63fce36b0_1730x909.png 848w, https://substackcdn.com/image/fetch/$s_!YCQ-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b58c57d-b121-450a-b957-2fd63fce36b0_1730x909.png 1272w, https://substackcdn.com/image/fetch/$s_!YCQ-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b58c57d-b121-450a-b957-2fd63fce36b0_1730x909.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>Operational AI is harder to build, harder to govern, and harder to explain to a budget committee. It is also where the value compounds.</p><p>The pattern is not theoretical. It is in production now, at scale, in companies most enterprise leaders have heard of.</p><p>Salesforce Agentforce, in nearly all its production deployments, is event-triggered. An incoming email, a form submission, a sensor alert kicks off a multi-step workflow that the agent runs to completion or to escalation. According to <a href="https://www.innovadeltech.com/salesforce-agentforce-implementation-guide/">implementation data from the field</a>, well-configured deployments achieve 40-60% deflection on routine inquiry categories.</p><p>The customer never knows a human did not handle it. 
No human did.</p><p>GitHub Copilot Coding Agent, <a href="https://github.blog/news-insights/product-news/github-copilot-meet-the-new-coding-agent/">generally available since 2025</a>, inverts the developer relationship with AI. You assign an issue to the agent. The agent works in the background, writes code, runs tests, opens a pull request. You review later.</p><p>Early adopters report 30-70% productivity gains for routine development tasks. The trigger is the issue assignment. The work runs autonomously.</p><p>Datadog&#8217;s <a href="https://www.itprotoday.com/it-operations/datadog-dash-2025-autonomous-agents-transform-it-observability-operations">Bits AI suite</a> deployed an autonomous SRE Agent that performs hypothesis-driven root cause analysis in under a minute. It runs continuously, triggered by anomalies in observability data. Datadog claims thousands of internal engineering hours saved per week.</p><p>No on-call engineer initiates the agent&#8217;s work. The infrastructure does.</p><p>Financial services may be the most aggressive adopter. According to <a href="https://aws.amazon.com/blogs/awsmarketplace/agentic-ai-solutions-in-financial-services/">AWS data on agentic AI in financial services</a>, 53% of financial institutions are running AI agents in production in 2025. Trades execute on signal, not on a trader&#8217;s command. Risk models adjust on threshold breaches, not on a quant&#8217;s morning coffee.</p><p>The reported ROI is $3.50 per $1 invested, with 35% cost reduction and 55% higher operational efficiency.</p><p>The shift is happening fast. 
Gartner predicts that <a href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025">40% of enterprise applications will feature embedded AI agents by end of 2026</a>, up from less than 5% in 2025.</p><p>Menlo Ventures reports that in Q1 FY2026 alone, 160,000 organizations created over 400,000 custom agents using Microsoft Copilot Studio. The agent platform market, currently around $750 million, is forecast to reach $80-100 billion by 2030 according to <a href="https://www.gartner.com/en/articles/hype-cycle-for-agentic-ai">Gartner&#8217;s Hype Cycle for Agentic AI</a>.</p><p>The ratio between Person-Led AI and Operational AI in enterprise spend will not hold.</p><p>The reason it will not hold is that Operational AI compounds in a way Person-Led AI cannot.</p><p>When the SRE Agent at Datadog catches a regression at 3 AM and resolves it before anyone wakes up, the value is captured by the system. When the Salesforce Agentforce instance handles 60% of routine tickets without human involvement, the value is captured by the workflow. When the financial agent rebalances on signal, the value is captured by the architecture.</p><p>None of this depends on a specific person being good at their job. The value sits in the infrastructure.</p><div><hr></div><h2>This is not a new framework, it is a translation</h2><p>Before I go further, a credit owed to people who have been thinking about this for decades.</p><p>The distinction between Person-Led AI and Operational AI maps onto a much older concept from automation engineering. 
<a href="https://ieeexplore.ieee.org/document/844354">Parasuraman and Sheridan</a>, in a foundational 2000 paper, distinguished between human-in-the-loop systems, where the human is a constant active participant, and human-on-the-loop systems, where the human supervises and intervenes only on escalation.</p><p>Their model has been the canonical reference for autonomy governance in safety-critical systems for 25 years.</p><p>What I am calling Person-Led AI is a strict version of human-in-the-loop, where the human is not just validating but executing every step. What I am calling Operational AI is human-on-the-loop with a particular twist: the agent contacts the human proactively, rather than the human checking in.</p><p>I am not proposing this as a discovery. I am translating a concept that automation engineers have used for a generation into language that lands in enterprise AI strategy meetings.</p><blockquote><p>Loop control authority does not survive a budget review. Who runs the task does.</p></blockquote><p>The distinction is the same. The vocabulary is calibrated for the room where the spend is approved.</p><p>I mention this for two reasons. First, intellectual honesty: the foundational thinking is not mine. Second, defensive value for the reader. When a critic argues that Person-Led versus Operational is a simplification of automation theory, the answer is yes, deliberately. The simplification is the point.</p><div><hr></div><h2>The four confusions the lens prevents</h2><p>Four common confusions that the lens of &#8220;who runs the task&#8221; surfaces and helps avoid.</p><p><strong>The naming confusion.</strong> Person-Led AI does not mean &#8220;AI for personal use&#8221; in the consumer sense. A team of fifty analysts using Claude to write reports is Person-Led AI. A single executive using ChatGPT at home is Person-Led AI. 
The &#8220;person&#8221; refers to who runs the task: a person, in both cases.</p><p>Hold &#8220;who runs the task&#8221; in your head whenever you read the term, and the framework lands faster.</p><p><strong>The dichotomy confusion.</strong> No real deployment is purely one or the other. Most workflows have moments of Person-Led AI and moments of Operational AI woven together.</p><p>The framework is not a label that goes on a system. It is a question that points at where the fulcrum of the work sits. If the fulcrum is the person initiating and finishing, Person-Led AI is the dominant mode. If the fulcrum is the agent initiating and finishing, Operational AI is the dominant mode.</p><p>Most systems sit somewhere on the spectrum, but the dominant mode determines what you have to invest in: training, infrastructure, governance.</p><p><strong>The choice-versus-constraint confusion.</strong> Often &#8220;who runs the task&#8221; is not a strategic decision. It is an artifact of what the vendor exposes, what the team can configure, and what the existing workflow tolerates.</p><p>An organization that bought a copilot product expecting Operational AI returns may discover that the configuration surface only supports Person-Led AI flows. The framework is useful for diagnosing the present (what do you have today?) and for choosing the next twelve months (where do you want the fulcrum to move?). These are different uses, and they should not be conflated.</p><p><strong>The model-capability confusion.</strong> This framework does not describe what models can do. It describes how organizations choose to deploy them.</p><p>As the underlying models become more capable (memory, planning, meta-cognition), the technical bar for Operational AI lowers. The organizational bar does not. 
Choosing to let an agent run a workflow without you is, and will remain, an organizational decision more than a technological one.</p><p>The framework does not expire when the next model arrives.</p><div><hr></div><h2>A diagnostic you can run this week</h2><p>Before deciding what to do next, audit what you have today. The diagnostic is short, and the questions are deliberately simple.</p><p>Pick one AI deployment that exists in your organization right now. A Copilot rollout, a customer service bot, an analytics agent, anything in production. Then answer five questions about it.</p><p>First, who initiates the task? A person, every time? A trigger, every time? A mix?</p><p>Second, who finishes the task? Does the loop close on the person&#8217;s review, or does the agent close it on its own?</p><p>Third, if the people who currently run this work all left tomorrow, does the value persist? Or does it walk out the door with them?</p><p>Fourth, what does the trigger configuration look like?</p><blockquote><p>If it is &#8220;a person decides to use the tool&#8221;, you have Person-Led AI regardless of what the vendor calls it. If it is &#8220;a configured event makes the system act&#8221;, you have Operational AI regardless of how simple the underlying model is.</p></blockquote><p>Fifth, what is the cost model? Per-seat licensing scales with headcount, which is the shape of Person-Led AI. Token usage and infrastructure costs scale with workload, which is the shape of Operational AI. The cost shape often reveals what the deployment actually is.</p><p>Run this on three or four deployments. 
The pattern will become clear.</p><p>Most enterprises discover, when they do this honestly, that the overwhelming majority of what they have is Person-Led AI, that they have been positioning some of it as Operational AI in conversations with leadership, and that the gap between actual and described is contributing to the disappointment in returns.</p><div><hr></div><h2>The honest caveat: why Operational AI fails today</h2><p>I have to be careful here, because I do not want to sell Operational AI as a guaranteed win. The data does not support that claim, and the failures are instructive.</p><p><a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027">Gartner predicts</a> that more than 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls.</p><p>According to <a href="https://fortune.com/2025/12/09/harvard-business-review-survey-only-6-percent-companies-trust-ai-agents/">Harvard Business Review survey data</a>, only 6% of companies fully trust AI agents to handle core business processes. ASAPP reports that <a href="https://www.asapp.com/blog/inside-the-ai-agent-failure-era-what-cx-leaders-must-know">88% of customer service AI agent projects never reach full production</a>.</p><p>Klarna, the most public case of aggressive Operational AI deployment in customer service, <a href="https://www.digitalapplied.com/blog/klarna-reverses-ai-layoffs-replacing-700-workers-backfired">reversed course in 2025</a> and began rehiring staff after the cost-driven optimization compromised customer experience.</p><p>These failures are not random. They share a pattern.</p><p>In every case I have looked at, Operational AI failed for the same reason: no one in the organization had the authority to guide the entire transition. Tool selection sat with IT. Workflow redesign sat with operations. 
Governance sat with legal. Talent sat with HR. Budget sat with finance.</p><p>The agent was deployed, and the surrounding organizational redesign was not.</p><blockquote><p>Operational AI does not fail because the technology fails. It fails because the organization treats it as a tool purchase rather than as a transition.</p></blockquote><p>The cancelled projects, the rolled-back deployments, the trust gaps, all share a structural cause. There was no one whose job it was to own the arc from technology selection through workflow redesign through governance through talent. There was no one with the authority to say no to a deployment that was not ready, or yes to a deployment that needed organizational change before it could land.</p><p>The companies that capture value with Operational AI have one thing in common.</p><p>Someone, with title and authority, owns the transition end to end. Sometimes it is a Chief AI Officer with real scope. Sometimes it is a VP of Operations with the mandate to redesign workflows and the budget to back it. Sometimes it is a small team that sits across functions.</p><p>The role matters less than the authority. What matters is that the transition has a single owner.</p><p>This is the conversation I am most interested in for the months ahead. What does that authority look like? What competencies does it require, on the agent side and on the human side? What does the organizational map of an Operational AI deployment actually look like, and how does it get built?</p><p>I will return to these questions in upcoming posts.</p><div><hr></div><h2>A position I should state clearly</h2><p>Before closing, an explicit positioning that affects what comes next in this newsletter.</p><p>Person-Led AI has real, durable value. It is not going away. The 86% of horizontal AI spending captured by copilots is not wasted, and the productivity gains are real. 
If you are running a Person-Led AI deployment well, keep doing it.</p><p>But the frontier of enterprise value, the place where AI compounds rather than evaporates, is Operational AI.</p><p>That is where I will be focusing my analysis in the months ahead. The framework I am building, the case studies I will discuss, and the diagnostic tools I will share will all be oriented toward helping enterprise leaders move from Person-Led-dominated portfolios to portfolios where Operational AI captures meaningful value.</p><p>This is a position, and I am stating it openly because the alternative is leaving it implicit, which would be worse.</p><p>Readers who want pure neutrality on the Person-Led versus Operational question may find the focus of upcoming posts uncomfortable. Readers who want a lens for moving capital and attention toward the part of AI that compounds will find this useful.</p><div><hr></div><h2>What to do this week</h2><p>Three actions, in order of leverage.</p><p><strong>First, run the diagnostic above on your three largest AI deployments.</strong> Discover, honestly, what you actually have. The pattern that emerges is more useful than any individual finding. If three out of three turn out to be Person-Led AI, you have a portfolio question. If one is Operational and two are Person-Led, you have a sequencing question. If you cannot tell which is which, you have a clarity question that needs to be solved before any other decision.</p><p><strong>Second, identify one recurring task in your organization that runs today on human initiation, and could plausibly run on a configured trigger instead.</strong> Monthly reports, weekly digests, ticket routing, compliance summaries, anything that happens at predictable cadence. Do not commit to building it yet. Just identify it, and notice what stops the trigger from being configured today. The blocker is usually not technological. 
It is usually that nobody owns the redesign.</p><p><strong>Third, identify who in your organization has the authority to guide a transition from Person-Led-dominant AI to Operational-dominant AI.</strong> This is more specific than &#8220;who owns AI strategy&#8221;.</p><p>The owner of the transition needs authority that crosses four domains at once. Technology selection (which platforms, which vendors). Workflow redesign (which processes get reshaped, which get retired). Governance and risk (what the agent is and is not allowed to do, who answers when it fails). Talent (who gets hired, who gets retrained, who exits, what the new roles look like).</p><p>Most organizations have someone who owns one or two of these domains. Almost none have someone with formal authority across all four.</p><p>The transition requires that person, or a small team that collectively holds those four authorities. Without it, the technology arrives, the workflow does not adapt, the governance lags, the people are caught off guard, and the deployment quietly drifts into Person-Led AI dressed up as something else.</p><p>If you cannot name the person in your organization who has this authority, that is your first job.</p><blockquote><p>Tools come later. The owner comes first.</p></blockquote><p>The most expensive errors in enterprise AI in 2026 are happening because the organizations buying Operational AI did not have anyone capable of guiding the transition.</p><p>The map of categories from last week, and the lens of who runs the task from this week, are tools to make the conversation possible. 
The work, as always, is in the rooms where the conversation gets had.</p><div><hr></div><h2>Sources &amp; Further Reading</h2><ol><li><p><a href="https://ieeexplore.ieee.org/document/844354">Parasuraman and Sheridan - A Model for Types and Levels of Human Interaction with Automation</a> (2000) - the foundational reference for human-in-the-loop and human-on-the-loop autonomy frameworks</p></li><li><p><a href="https://menlovc.com/perspective/2025-the-state-of-generative-ai-in-the-enterprise/">Menlo Ventures - 2025 State of Generative AI in the Enterprise</a> - copilots at 86% of horizontal AI spend, agent platforms at 10%, 160,000 organizations creating 400,000 custom agents in Q1 FY2026</p></li><li><p><a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">McKinsey - The State of AI 2025</a> - 6% high performers attributing 5% or more EBIT to AI, workflow redesign as multiplier</p></li><li><p><a href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025">Gartner - 40% of Enterprise Applications Will Feature AI Agents by 2026</a> - the steepest enterprise adoption curve on record</p></li><li><p><a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027">Gartner - Over 40% of Agentic AI Projects Will Be Canceled by End of 2027</a> - cancellation forecast, root causes</p></li><li><p><a href="https://www.gartner.com/en/articles/hype-cycle-for-agentic-ai">Gartner - Hype Cycle for Agentic AI</a> - market sizing, agent washing, peak of inflated expectations</p></li><li><p><a href="https://www.innovadeltech.com/salesforce-agentforce-implementation-guide/">Salesforce Agentforce - Implementation Field Data</a> - 40-60% deflection rates in production</p></li><li><p><a 
href="https://github.blog/news-insights/product-news/github-copilot-meet-the-new-coding-agent/">GitHub - Copilot Coding Agent</a> - autonomous code generation, 30-70% productivity gains</p></li><li><p><a href="https://www.itprotoday.com/it-operations/datadog-dash-2025-autonomous-agents-transform-it-observability-operations">Datadog - Bits AI Autonomous Agents</a> - autonomous SRE, root cause analysis under one minute</p></li><li><p><a href="https://aws.amazon.com/blogs/awsmarketplace/agentic-ai-solutions-in-financial-services/">AWS - Agentic AI in Financial Services</a> - 53% of financial institutions running AI agents in production, $3.50 ROI per $1 invested</p></li><li><p><a href="https://fortune.com/2025/12/09/harvard-business-review-survey-only-6-percent-companies-trust-ai-agents/">Harvard Business Review - Only 6% of Companies Fully Trust AI Agents</a> - trust gap data</p></li><li><p><a href="https://www.asapp.com/blog/inside-the-ai-agent-failure-era-what-cx-leaders-must-know">ASAPP - Inside the AI Agent Failure Era</a> - 88% of customer service AI agent projects never reach full production</p></li><li><p><a href="https://www.digitalapplied.com/blog/klarna-reverses-ai-layoffs-replacing-700-workers-backfired">Klarna - The Reversal</a> - the canonical Operational AI rollback case</p></li><li><p><a href="https://learnexperts.ai/blog/linkedin-workplace-learning-report/">LinkedIn - Workplace Learning Report 2025</a> - AI literacy as the number one rising skill</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Css3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73ad1fb-011e-4364-8112-e9fb2e6ee5b8_954x946.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Css3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73ad1fb-011e-4364-8112-e9fb2e6ee5b8_954x946.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Css3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73ad1fb-011e-4364-8112-e9fb2e6ee5b8_954x946.png" width="954" height="946" class="sizing-normal" alt="" loading="lazy"></picture></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[Most AI Agents Are Not AI Agents]]></title><description><![CDATA[Four products labeled "AI Agent". Only two actually are. 
Here's the map, and the distinction that determines whether your AI investment compounds.]]></description><link>https://marcosnewsletter.substack.com/p/most-ai-agents-are-not-ai-agents</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/most-ai-agents-are-not-ai-agents</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Tue, 05 May 2026 12:56:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5fZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8991220c-3a99-439e-bc57-c034720de56e_1729x910.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5fZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8991220c-3a99-439e-bc57-c034720de56e_1729x910.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5fZT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8991220c-3a99-439e-bc57-c034720de56e_1729x910.png 424w, https://substackcdn.com/image/fetch/$s_!5fZT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8991220c-3a99-439e-bc57-c034720de56e_1729x910.png 848w, https://substackcdn.com/image/fetch/$s_!5fZT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8991220c-3a99-439e-bc57-c034720de56e_1729x910.png 1272w, https://substackcdn.com/image/fetch/$s_!5fZT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8991220c-3a99-439e-bc57-c034720de56e_1729x910.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5fZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8991220c-3a99-439e-bc57-c034720de56e_1729x910.png" width="1456" height="766" class="sizing-normal" alt="" fetchpriority="high"></picture></div></a></figure></div><p>For the past year I have watched the term &#8220;AI agent&#8221; do something unusual in the enterprise vocabulary. It has become the most common word in AI strategy meetings and, simultaneously, the least well-defined. 
Different people use it to mean different things, often within the same conversation, without realizing they are not talking about the same product, the same investment, or the same risk profile.</p><p>This is not a vocabulary problem. It is a budget problem. When the term means four different things, the request for funding you approve under one definition becomes the deployment that fails under another. The vendor who promised category four delivers category one. The internal team funded to build category two ends up trying to build category four because the executive sponsor expected something else. The post-mortem blames execution. The actual failure was a definition.</p><p>What follows is the map I built for myself over the past eighteen months as I worked across enterprise AI deployments. It separates the four meanings of &#8220;AI agent&#8221; that circulate today, places real vendors and products in each category, and then introduces a strategic distinction (Personal AI versus Operational AI) that determines whether the investment compounds or evaporates. 
I have never seen the four categories laid out together in a way that survives contact with a real procurement decision. This is my attempt.</p><div><hr></div><h2>The four categories</h2><p>The categories are ordered by autonomy. Each one is a real, deployable system today. Each one has serious vendors selling it. Each one has appropriate use cases. The error is not in choosing a category. The error is in not knowing which category you are choosing.</p><h3>Category 1: LLM Chatbot</h3><p>The first and most basic meaning of &#8220;AI agent&#8221; is the conversational interface to a large language model. Type a question, get an answer. The system has no persistent state between sessions, no access to external tools, no ability to act in the world beyond producing text.</p><p>Examples sit in plain view: ChatGPT in a browser, Claude.ai, Gemini in its consumer interface, the customer service chatbot on a retailer&#8217;s website. When a vendor demos &#8220;our AI agent&#8221; and shows a chat window where a user types a question and receives an answer, this is the category being shown.</p><p>The technical literature is consistent on what this is and what it is not. <a href="https://lilianweng.github.io/posts/2023-06-23-agent/">Lilian Weng</a>, formerly head of safety systems at OpenAI and now at Thinking Machines Lab, defined a true agent in 2023 as &#8220;LLM + memory + planning + tool use&#8221;. A pure chatbot has the LLM and nothing else. It is a powerful system. It is not, by Weng&#8217;s definition or anyone else&#8217;s, an agent in the technical sense.</p><p>The category is real and useful. It accelerates research, drafting, ideation, customer self-service for simple inquiries. The category is not, however, what most enterprise sponsors imagine when they fund &#8220;agentic AI&#8221;. 
And this confusion is where the cost begins.</p><h3>Category 2: AI Assistant or Copilot</h3><p>The second meaning is the AI assistant or copilot, where the system has access to tools (read your email, search your documents, draft a response) but operates under continuous human control. The user remains the decision-maker. The AI proposes, the user disposes.</p><p>Microsoft&#8217;s own documentation draws the line clearly. <a href="https://www.microsoft.com/en-us/microsoft-copilot/copilot-101/copilot-ai-agents">Microsoft says of Copilot</a> that it is &#8220;your personal, private assistant that works solely for you, enhancing your capabilities&#8221;, and that it &#8220;can only see files, emails, and data that the logged-in user has permission to access&#8221; and &#8220;cannot take action without the user&#8217;s input&#8221;. GitHub Copilot for code, Microsoft 365 Copilot for office work, Cursor for development, Notion AI for documents: all four are category 2 systems. The human remains in the loop on every meaningful action.</p><p>Microsoft itself contrasts this with category 3 or 4 systems, which they call agents and describe as &#8220;expert systems that operate autonomously on behalf of a process or company&#8221;. The vendor that built the largest copilot product on the market is the same vendor telling buyers, in their own marketing copy, that copilot is not an agent. Few buyers read carefully enough to notice.</p><p>The category is enormously useful. The deployment risk is low because the human catches errors. The productivity gains are real and measurable at the individual level. What category 2 does not do, and what no honest vendor claims it does, is operate processes without human supervision. It makes one person faster. It does not make a function autonomous.</p><h3>Category 3: Agentic Workflow</h3><p>The third category is where the technical definitions start to bite. 
An agentic workflow is a system in which an LLM and tools are orchestrated through a predefined sequence of steps, where the LLM may decide which step to take next within the boundaries of that sequence, but where the overall workflow is engineered in advance.</p><p><a href="https://www.anthropic.com/research/building-effective-agents">Anthropic, in their December 2024 guide to building effective agents</a>, draws the cleanest distinction in the literature. They write: &#8220;Workflows are systems where LLMs and tools are orchestrated through predefined code paths.&#8221; The system can be sophisticated. It can call external APIs, query databases, send emails, update CRM records. What makes it a workflow rather than a true agent is that the path is determined by the engineer, not by the model.</p><p>This is, in 2026, the category that most vendors are actually selling when they say &#8220;agent&#8221;. Salesforce Agentforce, in nearly all its production deployments, is a category 3 system. <a href="https://www.salesforce.com/agentforce/metrics/">Salesforce&#8217;s own metrics</a> report that Agentforce serves 18,500 enterprise customers and processes more than three billion automated workflows monthly. Note the word: workflows. The number is large because the underlying systems are well-bounded, predictable, and engineered. They do not decide what to do. They execute what was designed.</p><p>The category is the workhorse of enterprise AI in 2026. It is also the category most often misrepresented to executives, who hear &#8220;agent&#8221; and imagine category 4, then approve budgets sized for category 4, then receive category 3, then declare the project a partial success because expectations were set against the wrong reference.</p><h3>Category 4: Autonomous Agent</h3><p>The fourth category is the autonomous agent in the strict technical sense. An LLM dynamically directs its own processes and tool usage. 
The system decides what step to take next based on the result of the previous step. It plans, it self-corrects, it can pursue a goal across many steps without being told the path in advance.</p><p><a href="https://www.anthropic.com/research/building-effective-agents">Anthropic continues the same passage</a>: &#8220;Agents are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.&#8221; <a href="https://www.deeplearning.ai/courses/agentic-ai/">Andrew Ng&#8217;s four agentic design patterns</a> (Reflection, Tool Use, Planning, and Multi-Agent Collaboration) all sit in this category. The system reflects on its own output, decides to use a tool, plans multiple steps ahead, or coordinates with other specialized agents to complete a task.</p><p>This is the category that gets the magazine covers. AutoGPT in 2023, OpenAI Operator in 2025, the multi-agent research systems built on top of LangGraph and CrewAI, the experimental autonomous coding agents at the frontier labs. The systems in this category are real. They also fail more often, cost more to operate, are harder to govern, and require more sophisticated infrastructure than any of the other three.</p><p><a href="https://www.gartner.com/en/articles/hype-cycle-for-agentic-ai">Gartner&#8217;s 2026 Hype Cycle for Agentic AI</a> places autonomous agents at the Peak of Inflated Expectations and predicts that <a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027">over 40% of agentic AI projects will be canceled by the end of 2027</a>, citing escalating costs, unclear business value, and inadequate risk controls. The category is where the most ambitious deployments live, and also where the highest failure rates concentrate. 
Both facts are true at the same time.</p><p>Anthropic&#8217;s own guidance to teams considering building in this category is worth quoting directly. They recommend &#8220;finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all.&#8221; A frontier lab is telling the market: most of you do not need this category. Most of you should be building category 2 or category 3.</p><div><hr></div><h2>The strategic distinction: Personal AI versus Operational AI</h2><p>The four categories above are the technical map. Useful, but not yet sufficient. A technical map tells you what each category is. It does not tell you what each category does for your organization, nor where the value accumulates after deployment. For that you need a second layer of distinction, which is where this article shifts from descriptive to strategic.</p><p>The four categories collapse cleanly into two strategic types.</p><p>Categories 1 and 2 are what I have been calling Personal AI. They amplify an individual. The output is a more productive person. 
The value, when realized, sits inside the individual: their faster drafting, their better research, their improved code, their accelerated decisions. When that individual leaves the company, the value leaves with them, because the value was never codified into the organization&#8217;s operating infrastructure.</p><p>Categories 3 and 4 are Operational AI. They run a process. The output is a workflow that completes without the human bottleneck that previously gated it. The value, when realized, sits inside the infrastructure of the process: the SLA that holds when volume triples, the response time that stays low at midnight, the cost-per-transaction that drops because no human is involved in the routine cases. When individuals leave, the value remains, because the value was codified into how the work runs.</p><p>I introduced this distinction in the first article of this series, &#8220;From ChatGPT to Operational AI&#8221;, and the more deployments I observe the more central it becomes to predicting which projects compound and which evaporate. Other thinkers have proposed parallel distinctions. <a href="https://blogs.infosys.com/emerging-technology-solutions/artificial-intelligence/why-ai-in-the-enterprise-is-not-enterprise-ai-the-operating-model-difference-that-most-organizations-miss.html">Infosys, for example, distinguishes &#8220;AI in the Enterprise&#8221; from &#8220;Enterprise AI&#8221;</a> through a lens of governance and decision authority. The framing I use here, developed independently, distinguishes by where value accumulates rather than by who has decision rights. The two lenses are complementary, but they produce different recommendations on where to invest and how to measure return.</p><p>The single most damaging confusion in enterprise AI in 2026 is the mismatch between purchase and expectation across these two types. 
Companies buy Personal AI (categories 1 and 2) and expect Operational AI returns (the kind of P&amp;L impact that comes from categories 3 and 4). The numbers cannot work. They were never going to work. The investment was sized for one outcome and the technology is engineered for another.</p><p><a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">McKinsey&#8217;s State of AI 2025</a> puts a sharp number on this. Only 6% of organizations qualify as high performers generating 5% or more EBIT impact from AI initiatives. The high performers are 3.6 times more likely than others to pursue transformational change, and 3 times more likely to have fundamentally redesigned individual workflows when deploying AI. The pattern is unambiguous. Organizations that capture meaningful value are operating in the Operational AI lane, with the appropriate categorical investment. The 94% that do not are, in significant part, organizations that bought Personal AI and waited for Operational AI returns. The wait is structural. It will not arrive.</p><div><hr></div><h2>The next threshold: function-level automation</h2><p>There is a further step beyond Operational AI that is starting to appear in the most aggressive deployments, and naming it correctly matters because the gap between Operational AI and this next state is a change of nature, not of scale.</p><p>I call it function-level automation. The pattern: a category 4 system, or a coordinated set of category 3 and 4 systems, is deployed against an entire business function (customer service, marketing analytics, parts of finance), and that function operates without dedicated human headcount on routine operations. Humans remain involved in strategic supervision, exception handling, escalation, and policy. Routine execution runs autonomously.</p><p>This is not the same as Operational AI applied to a single workflow. 
Function-level automation is the threshold at which an organization stops &#8220;using AI&#8221; and starts being &#8220;partially executed by AI&#8221;. A change of category, with implications for governance, risk, headcount planning, and competitive positioning that are different from anything Operational AI raised.</p><p>Two cases worth studying together, because they are opposite extremes and both are real.</p><p>The aspirational case is Shopify. In a <a href="https://www.cnbc.com/2025/04/07/shopify-ceo-prove-ai-cant-do-jobs-before-asking-for-more-headcount.html">March 2025 internal memo</a> that became widely public, CEO Tobi Lutke wrote that &#8220;before asking for more headcount and resources, teams must demonstrate why they cannot get what they want done using AI&#8221;, and that AI usage would become &#8220;a baseline expectation&#8221; embedded in performance reviews. The policy was supported by behavior: between 2022 and 2024, <a href="https://fortune.com/2025/04/08/shopify-ceo-ai-automation-no-new-hires-tech-jobs/">Shopify&#8217;s headcount declined from 11,600 to 8,100</a> while the company maintained roughly 21% annual growth. Function-level automation, deliberately pursued, with measurable financial and operational consequences.</p><p>The cautionary case is Klarna. Between 2022 and 2024, Klarna eliminated approximately 700 customer service positions and replaced them with an AI assistant developed in partnership with OpenAI. CEO Sebastian Siemiatkowski <a href="https://www.entrepreneur.com/business-news/klarna-ceo-reverses-course-by-hiring-more-humans-not-ai/491396">told Bloomberg in 2025 that the company was reversing course</a>, citing increased customer complaints, lower satisfaction ratings, and persistent frustration with &#8220;generic, repetitive, and insufficiently nuanced&#8221; replies. His diagnosis of the failure: &#8220;cost was a too predominant evaluation factor&#8221;. Klarna&#8217;s new model is hybrid. 
AI handles basic inquiries, humans handle anything requiring empathy, discretion, or escalation. Function-level automation attempted, partially reversed, with the lesson absorbed publicly.</p><p>The two cases are bookends. Shopify is the proof that function-level automation can work at scale when the function is amenable. Klarna is the proof that the same approach, applied to a function where empathy and judgment are part of the value, breaks. The lesson is not that one company succeeded and another failed. The lesson is that function-level automation is a categorical decision, not a degree of Operational AI deployment. The questions it raises (which functions, with what governance, accepting which downsides) are different in kind from the questions Operational AI raises.</p><p>The further implication, and this is the part most enterprise leaders have not yet absorbed: function-level automation will not arrive uniformly across the economy. It will arrive function-by-function, company-by-company, with significant variance based on whether the function can be specified well enough to be automated. The companies that reach the threshold thoughtfully will compound advantages. The companies that reach it accidentally, like Klarna did, will absorb expensive corrections.</p><div><hr></div><h2>The costly confusions</h2><p>Three patterns of mismatch dominate the failed deployments I have observed. Each follows from confusing one category with another. Each is preventable with the map.</p><p>The first mismatch is buying category 2 (Copilot) and expecting category 3 or 4 (Operational AI) returns. A CFO approves Microsoft 365 Copilot licenses for 500 employees, expecting a measurable productivity dividend that translates to headcount efficiency or revenue impact. At 12 months, individuals are faster but processes look the same. Personal AI was deployed. Operational AI returns were expected. 
The mismatch is structural, not executional, and no amount of &#8220;change management&#8221; will close it.</p><p>The second mismatch is buying a category labeled &#8220;agentic&#8221; from a vendor that is selling category 1 or 2 with new branding. <a href="https://www.gartner.com/en/articles/hype-cycle-for-agentic-ai">Gartner has named this practice &#8220;agent washing&#8221;</a> and estimates that, of the thousands of vendors claiming agentic capabilities, only roughly 130 actually deliver autonomous, goal-pursuing systems. The other 95% are existing products with new marketing. The buyer pays category 4 prices, receives category 1 or 2 functionality, and discovers the gap only after deployment. The defense is procurement discipline: ask the vendor explicitly which category they sit in, and ask them to demonstrate dynamic decision-making rather than predefined workflows.</p><p>The third mismatch is over-engineering: deploying category 4 (autonomous agent) on a problem that category 2 or 3 would solve more cheaply and reliably. This pattern is most common when an internal technical team is excited about new capabilities. Multi-agent systems built for tasks that needed a copilot. Autonomous agents wired up for workflows that were already well-specified. The result is high complexity, high latency, high operational cost, and no incremental benefit over a simpler design. Anthropic&#8217;s own guidance, again worth quoting: &#8220;find the simplest solution possible, and only increase complexity when needed&#8221;. The discipline of choosing the right category is at least as important as choosing the right vendor.</p><div><hr></div><h2>Using the map next week</h2><p>Three actions you can take in the next seven days to put the map to work.</p><p>First, classify the next AI proposal that lands on your desk before reading anything else. Which of the four technical categories is it? Personal AI or Operational AI? 
Is it function-level, or does it touch only a single workflow? The classification will take five minutes and will reframe every other question you ask about the proposal.</p><p>Second, ask your current AI vendors which category they sit in, and ask them to demonstrate it rather than describe it. A vendor selling category 4 should be able to show you the model dynamically choosing tools and revising plans. A vendor selling category 3 should show you the workflow specification and admit that the path is engineered. A product in category 1 or 2 should not be marketed as &#8220;agentic&#8221;. The honest answers will tell you who to trust.</p><p>Third, audit your existing AI deployments against the Personal versus Operational distinction. How much of your current AI spend is Personal AI? How much is Operational AI? What returns is each generating, and against what expectations were they originally funded? Most organizations discover, when they run this audit honestly, that they are heavily over-invested in Personal AI and under-invested in Operational AI, and that the return profile they were promised cannot be achieved with the current category mix.</p><p>The map does not solve the problem. It surfaces it. From there, the work is choosing where to invest next, which vendors to trust, and what conversations to have inside the organization. But none of that work is possible without first agreeing on what the words mean. The most expensive errors in enterprise AI in 2026 are happening because the people in the room are using the same word to describe different products.</p><p>The work of clarifying terms is unglamorous. It is also, in my observation across the last year of deployments, the most reliable predictor of which AI investments compound. Better thinking, in front of better technology, produces compounding value. Imprecise thinking, in front of any technology, produces fluffy returns. The category names are a small contribution to better thinking. 
Use them, refine them, push back on them. But do not enter the next AI conversation without them.</p><div><hr></div><h2>Sources &amp; Further Reading</h2><ol><li><p><a href="https://www.anthropic.com/research/building-effective-agents">Anthropic - Building Effective Agents</a> (December 2024, updated 2025) - the canonical distinction between workflows and agents</p></li><li><p><a href="https://lilianweng.github.io/posts/2023-06-23-agent/">Lilian Weng - LLM Powered Autonomous Agents</a> (June 2023) - the foundational definition Agent = LLM + memory + planning + tool use</p></li><li><p><a href="https://www.deeplearning.ai/courses/agentic-ai/">Andrew Ng - DeepLearning.AI Agentic Design Patterns</a> - the four design patterns (Reflection, Tool Use, Planning, Multi-Agent Collaboration)</p></li><li><p><a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">McKinsey - The State of AI 2025: Agents, Innovation, and Transformation</a> (November 2025) - 6% high performers, workflow redesign as 2.8x EBIT multiplier, scaling agent statistics</p></li><li><p><a href="https://www.gartner.com/en/articles/hype-cycle-for-agentic-ai">Gartner - 2026 Hype Cycle for Agentic AI</a> - agent washing, Peak of Inflated Expectations, deployment statistics</p></li><li><p><a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027">Gartner - Over 40% of Agentic AI Projects Will Be Canceled by End of 2027</a> (June 2025) - failure rate prediction and causes</p></li><li><p><a href="https://www.microsoft.com/en-us/microsoft-copilot/copilot-101/copilot-ai-agents">Microsoft - Copilot and AI Agents</a> - vendor&#8217;s own definitions distinguishing Copilot from autonomous agents</p></li><li><p><a href="https://www.entrepreneur.com/business-news/klarna-ceo-reverses-course-by-hiring-more-humans-not-ai/491396">Klarna - CEO Reverses Course on AI Customer Service</a> 
(Entrepreneur, 2025) - the canonical reverse case</p></li><li><p><a href="https://www.cnbc.com/2025/04/07/shopify-ceo-prove-ai-cant-do-jobs-before-asking-for-more-headcount.html">Shopify - CEO&#8217;s AI Memo on Headcount Policy</a> (CNBC, April 2025) - the aspirational function-level automation case</p></li><li><p><a href="https://www.salesforce.com/agentforce/metrics/">Salesforce - Agentforce Production Metrics</a> - the largest production deployment of category 3 systems in 2026</p></li><li><p><a href="https://blogs.infosys.com/emerging-technology-solutions/artificial-intelligence/why-ai-in-the-enterprise-is-not-enterprise-ai-the-operating-model-difference-that-most-organizations-miss.html">Infosys - Why &#8220;AI in the Enterprise&#8221; Is Not Enterprise AI</a> - parallel framework distinguishing assistive intelligence from decision-participating intelligence</p></li></ol><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Marco's Substack is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Stop Investing in Better AI. Start Investing in Better Thinking.]]></title><description><![CDATA[Last week I argued that the regulatory frameworks your CISO follows have nothing to say about AI agents. 
This week: even with perfect governance, most AI investments would still fail. Because the real]]></description><link>https://marcosnewsletter.substack.com/p/stop-investing-in-better-ai-invest-in-better-thinking</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/stop-investing-in-better-ai-invest-in-better-thinking</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Tue, 28 Apr 2026 10:43:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MjjL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6028dcbc-ba35-42ea-9f6e-c52d670169dd_1729x910.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MjjL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6028dcbc-ba35-42ea-9f6e-c52d670169dd_1729x910.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MjjL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6028dcbc-ba35-42ea-9f6e-c52d670169dd_1729x910.png 424w, https://substackcdn.com/image/fetch/$s_!MjjL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6028dcbc-ba35-42ea-9f6e-c52d670169dd_1729x910.png 848w, https://substackcdn.com/image/fetch/$s_!MjjL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6028dcbc-ba35-42ea-9f6e-c52d670169dd_1729x910.png 1272w, 
https://substackcdn.com/image/fetch/$s_!MjjL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6028dcbc-ba35-42ea-9f6e-c52d670169dd_1729x910.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MjjL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6028dcbc-ba35-42ea-9f6e-c52d670169dd_1729x910.png" width="1456" height="766" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6028dcbc-ba35-42ea-9f6e-c52d670169dd_1729x910.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:766,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2739254,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/195599910?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6028dcbc-ba35-42ea-9f6e-c52d670169dd_1729x910.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MjjL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6028dcbc-ba35-42ea-9f6e-c52d670169dd_1729x910.png 424w, https://substackcdn.com/image/fetch/$s_!MjjL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6028dcbc-ba35-42ea-9f6e-c52d670169dd_1729x910.png 848w, 
https://substackcdn.com/image/fetch/$s_!MjjL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6028dcbc-ba35-42ea-9f6e-c52d670169dd_1729x910.png 1272w, https://substackcdn.com/image/fetch/$s_!MjjL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6028dcbc-ba35-42ea-9f6e-c52d670169dd_1729x910.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For the past several months I have been working through a question that I cannot get out of my head.</p><p>Every credible study 
on enterprise AI in 2025-2026 reports the same broad pattern: high adoption, low economic impact, rising abandonment rates. I have read these reports carefully. I have traced the citations back to the primary research. I have laid the surveys side by side, looked at sample sizes, compared methodologies. The pattern is not a research artifact. It is the dominant story of enterprise AI right now.</p><p>What confuses me is not the data. It is the response.</p><p>Most of the public conversation around AI in 2026 is still about better models, longer context windows, faster inference, smarter agents. The implicit assumption is that the technology is not yet good enough, and that the next release will close the gap. But the data does not say the technology is not good enough. The data says the technology has been deployed at scale and has failed to compound. Those are different problems. They have different solutions.</p><p>The more I sit with this, the more convinced I become that the framing is upside down. We treat AI value capture as a technology question because the technology is the visible part. We chase model upgrades and new platforms because those are easy to see and easy to buy. The actual variable that determines whether AI investment compounds is upstream of the model, sits inside the organization, and is harder to see because it is intangible.</p><p>This is the argument I want to lay out in this piece: most enterprise AI failures are not failures of technology. They are failures of systematic thinking. The model is not the bottleneck. The clarity that should come before the model is.</p><div><hr></div><h2>The numbers do not describe a technology problem</h2><p>Start with the headline statistic that has been circulating since last summer. 
MIT&#8217;s Project NANDA, in its July 2025 <em>State of AI in Business 2025</em> report, found that 95% of enterprise AI pilots delivered no measurable P&amp;L impact, despite $30-40 billion in investment over two years.</p><p>That single figure has been quoted in roughly every AI strategy deck I have seen since. What gets quoted less often is the context alongside it.</p><p>S&amp;P Global&#8217;s <em>Voice of the Enterprise</em> survey of 1,006 IT and business leaders, published in March 2025, reported that 42% of companies had abandoned the majority of their AI initiatives before reaching production. The previous year that number was 17%. The abandonment rate had multiplied by 2.5 in twelve months. A separate finding from the same survey: 46% of proofs of concept were scrapped between PoC and production deployment.</p><p>McKinsey&#8217;s <em>State of AI 2025</em> tells a similar story from a different angle. 88% of organizations now use AI in at least one function. Only 39% report any EBIT impact from it. And just 6% qualify as what McKinsey calls &#8220;high performers&#8221; on AI value capture. Adoption is high. Impact is low. The gap is widening, not narrowing.</p><p>BCG, in its September 2025 report <em>The Widening AI Value Gap</em>, classified only 5% of companies as &#8220;future-built&#8221; on AI. Those 5% achieve 1.7x revenue growth and 3.6x three-year shareholder returns compared to peers. The other 95% are split between 35% scaling without winning and 60% with minimal gains.</p><p>Forrester, in its <em>2026 Predictions</em> released in October 2025, predicted that enterprises will defer 25% of planned AI spending to 2027. The reason is not that the technology disappointed. It is that CFOs are losing patience with strategies that cannot show ROI linkage.</p><p>Five independent sources, five different methodologies, one consistent picture. The pattern is structural. It is not a story of immature technology that will mature in another quarter. 
It is a story of organizations that adopted AI faster than they figured out what to do with it.</p><p>If that were a technology problem, you would expect different vendors, different model generations, different deployment patterns to produce different outcomes. They do not. The failure rate is roughly stable across model families, across no-code and code-first platforms, across cloud providers, across consulting partners. What stays constant across all the variation is the human side of the equation.</p><div><hr></div><h2>What MIT actually concluded (and almost nobody quotes)</h2><p>The 95% number from MIT travels everywhere. The conclusion MIT itself drew from that number travels almost nowhere. Worth reading the report&#8217;s own framing carefully.</p><p>MIT did not describe the gap as a technology gap. It described it as a &#8220;learning gap&#8221; and a problem of &#8220;organizational integration.&#8221; The single most striking finding, buried under the headline number, is that 73% of failed AI projects had no clear executive alignment on what success would look like.</p><p>Read that again. Three out of four failed pilots could not produce, when asked, a written agreement among the executives who had funded them about what would constitute a successful outcome. Not a model accuracy target. Not a budget threshold. A definition of success.</p><p>If you do not have that, no model will save you. A perfect model produces an ambiguous result, and the ambiguous result gets relitigated by stakeholders with different unspoken priors. Eventually the project gets quietly defunded, not because it failed, but because nobody can defend why it succeeded.</p><p>A second finding from the same MIT study reinforces the diagnosis. AI deployments that used external vendor partnerships succeeded 67% of the time. Internal builds succeeded 33% of the time. Half the rate.</p><p>The intuitive read is that vendors are technically better. That read is wrong. 
The teams building internally have access to the same models, the same tooling, often better data. The difference is that vendor engagements force a contract. Contracts force a written scope. Written scope forces the buyer to articulate what they want before money changes hands. The vendor advantage is not engineering. It is the discipline that the procurement process imposes on the buyer&#8217;s thinking.</p><p>McKinsey&#8217;s <em>State of AI 2025</em> offers a third data point pointing in the same direction. Organizations that redesigned their workflows to integrate AI achieved significantly more value than those that simply layered AI onto existing processes. McKinsey quantifies this as a 2.8x multiplier on EBIT impact. Workflow redesign, by definition, requires articulating how work currently flows. Most organizations have never done that exercise. AI forces them to, and the ones who actually do it win.</p><p>Three independent findings, three different research teams, one consistent direction: the predictor of success is the upstream clarity of the human side, not any property of the AI itself.</p><div><hr></div><h2>AI is an amplifier, not a tool</h2><p>Here is the mental model that, in my experience, makes the data click into place: AI is not a tool. It is an amplifier.</p><p>A tool does a defined thing. A hammer drives nails. A spreadsheet calculates sums. The output is bounded by the tool&#8217;s design. You can be vague about your intent and still get a usable result, because the tool&#8217;s purpose is narrow enough to compensate.</p><p>An amplifier is different. It takes whatever you put in and makes it louder. If what you put in is signal, you get more signal. If what you put in is noise, you get more noise. The amplifier itself is neutral. The character of the output is set entirely by the input.</p><p>This is what enterprise AI behaves like in 2026. Feed it a workflow you understand precisely, and it will execute that workflow at scale. 
Feed it a vague brief, and it will produce something that looks like an output but contains nothing actionable. The system does not refuse. It does not error. It produces what I have come to call the &#8220;fluffy&#8221; failure mode: text that has the form of a result but the substance of filler.</p><p>The fluffy failure mode is the mode that should worry executives most, because it is invisible until you read carefully. A spectacular AI failure ends a project. A fluffy failure can continue indefinitely. The dashboard ships. The agent runs. The reports get generated. Nothing in the system signals that the outputs are empty.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s9YJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834d8014-8ffe-4faa-9451-6a6f0f946ab5_1729x910.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s9YJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834d8014-8ffe-4faa-9451-6a6f0f946ab5_1729x910.png 424w, https://substackcdn.com/image/fetch/$s_!s9YJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834d8014-8ffe-4faa-9451-6a6f0f946ab5_1729x910.png 848w, https://substackcdn.com/image/fetch/$s_!s9YJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834d8014-8ffe-4faa-9451-6a6f0f946ab5_1729x910.png 1272w, https://substackcdn.com/image/fetch/$s_!s9YJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834d8014-8ffe-4faa-9451-6a6f0f946ab5_1729x910.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!s9YJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834d8014-8ffe-4faa-9451-6a6f0f946ab5_1729x910.png" width="1456" height="766" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/834d8014-8ffe-4faa-9451-6a6f0f946ab5_1729x910.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:766,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2956762,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/195599910?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834d8014-8ffe-4faa-9451-6a6f0f946ab5_1729x910.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s9YJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834d8014-8ffe-4faa-9451-6a6f0f946ab5_1729x910.png 424w, https://substackcdn.com/image/fetch/$s_!s9YJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834d8014-8ffe-4faa-9451-6a6f0f946ab5_1729x910.png 848w, https://substackcdn.com/image/fetch/$s_!s9YJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834d8014-8ffe-4faa-9451-6a6f0f946ab5_1729x910.png 1272w, https://substackcdn.com/image/fetch/$s_!s9YJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834d8014-8ffe-4faa-9451-6a6f0f946ab5_1729x910.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p> Only a human, comparing what they wanted with what they got, can detect the gap. And humans, busy and trusting, often do not.</p><p>The reason this happens is mechanical, not mysterious. Modern language models, agentic or otherwise, are interpretation engines. When given a clear, bounded specification, they execute it. When given an ambiguous brief, they do what they were trained to do: produce plausible-sounding output that matches the surface pattern of the request. Plausible-sounding output is not the same as useful output. 
It rarely surfaces the friction in your business, because friction is specific and language models reach for the generic.</p><p>Examples are easy to find once you know what to look for. The market analysis report that reads cleanly but tells you nothing about your specific market. The customer support agent that handles 10,000 tickets and escalates the same 50% it always did, because the escalation criteria were never defined precisely enough for the agent to do anything but match the historical pattern. The strategy memo that quotes your industry, your geography, your stakeholders, and somehow ends up indistinguishable from the strategy memo for any other company in your sector.</p><p>These are not technology failures. The model worked. The pipeline ran. The output was produced. What was missing was the human-side specification of what counts as a useful answer in this specific organization, with this specific data, for this specific decision.</p><blockquote><p>If you&#8217;re finding this useful, consider subscribing. I publish data-driven breakdowns on enterprise AI maturity every Tuesday, no hype, no vendor pitches. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://marcosnewsletter.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><p>There is a related point worth naming. AI does not just amplify your input. It exposes what your input actually was. A vague brief, fed to a junior analyst, comes back with follow-up questions. The analyst asks for clarification, and the act of clarifying surfaces the vagueness for the person who wrote the brief. AI does not ask. It produces. 
The producer of the brief is left with output but not with the diagnostic moment in which the brief itself gets better.</p><p>This is why I have started to think of AI as a mirror as much as an amplifier. It reflects, with brutal literalness, the quality of thinking that went into the request. The organizations that struggle most with AI are not, in my observation, the ones with the worst data or the smallest budgets. They are the ones whose internal language for what they do has always been imprecise, and who are now meeting a system that cannot translate imprecision into value.</p><div><hr></div><h2>What systematic thinking actually means</h2><p>If &#8220;systematic thinking&#8221; sounds like consulting jargon, that is partly because the consulting industry has overused the phrase. Strip the jargon away and the working definition is concrete.</p><p>Systematic thinking is the ability to externalize three things in writing: the decisions a process contains, the criteria those decisions use, and the reasoning behind those criteria. It is not raw intelligence. It is not strategic vision. It is not even, primarily, a skill in writing. It is the discipline of moving knowledge that lives in someone&#8217;s head into a form that another mind, human or otherwise, can follow.</p><p>Most organizations have very little of it. Most executives discover this only when they try to use AI. The discovery is uncomfortable, because the absence of systematic thinking is not anyone&#8217;s fault. It is the natural state of organizations that grew through human improvisation, where the people who knew how things worked simply did them, and the doing was good enough.</p><p>A simple test. Imagine a smart, capable new hire arrives at your company on Monday morning. They have the right background but no internal knowledge of your specific operation. You assign them a workflow your team owns. 
Can they, from the documentation that exists, perform that workflow correctly without asking anyone? Or do they need a colleague to fill in the gaps in real time?</p><p>The answer is almost always the second. The colleague is the carrier of the unwritten rules. Improvisation, conversation, judgment, and tribal knowledge fill the space between the formal documentation and the actual work. This is fine for a human-only system, because the colleague is always available. It breaks down the moment you replace the colleague with an AI agent. The agent has the formal documentation. It does not have the colleague.</p><p>Two consequences follow from this.</p><p>The first is that organizations with strong, written process standards get amplified at the level of process. AI in those organizations can take over functions, not just tasks. Workflows can run with limited human intervention because the workflows were specifiable in the first place. This is the path that produces the McKinsey 2.8x multiplier. It is also the path that almost nobody is on.</p><p>The second is that organizations whose work depends on improvisation and tacit knowledge get amplified at the level of person. A skilled employee who learns to use ChatGPT well becomes more productive. The same employee, in the same organization, doing the same work, just faster. The amplification is real, but it does not propagate. It does not become organizational capability. The improvement stays inside the individual.</p><p>Both modes can be valuable. Neither one is a moral position. But executives who fund AI initiatives expecting the first mode and operating in conditions that only support the second will be disappointed forever. The technology cannot give you organizational capability when the organization itself runs on tacit individual capability. 
It can only give you faster individuals.</p><div><hr></div><h2>A diagnostic question</h2><p>The most useful diagnostic I know for AI readiness is a single question, applied to a specific workflow before any AI investment is committed.</p><p>The question: where does process end, and personal judgment begin?</p><p>Ask this of the team that owns the workflow. There are three honest answers.</p><p>The first answer is &#8220;I can draw the line.&#8221; The team knows which steps are governed by rules and which are reserved for human judgment, and they can articulate the boundary. In this state, AI deployment can be targeted and bounded. The rule-governed parts can be automated. The judgment parts can be supported with information and decision aids. Both layers can be measured separately. This is the rare and valuable state.</p><p>The second answer is &#8220;the line is blurry, it depends.&#8221; The team knows the work but cannot describe precisely when judgment overrides process or vice versa. In this state, AI works at task level but not at workflow level. Individual prompts produce useful output. Attempts to automate the entire workflow fail because the system cannot distinguish between deviation that is wisdom and deviation that is error. This is where most organizations actually live.</p><p>The third answer is &#8220;I have never thought about it that way.&#8221; The team operates entirely on accumulated personal judgment, with no internal model of where process ends. In this state, AI deployment will reliably fail &#8220;fluffy&#8221; until the team is forced to articulate the workflow, at which point the deployment usually becomes unnecessary because the act of articulating already revealed the better path. This is the state of most pilots that quietly die after twelve months.</p><p>The diagnostic is cheap. It takes one meeting. 
It identifies, before any procurement begins, which workflows are ready for which kind of AI investment, which workflows need clarification work first, and which workflows should not be touched until the team can describe its own work to itself.</p><p>The cost of skipping this diagnostic is the cost of the failed pilot. Multiplied by whatever fraction of your AI portfolio is currently funded against workflows in the second and third state.</p><div><hr></div><h2>What to do Monday morning</h2><p>The next AI investment your organization makes does not need to be a model upgrade, a new platform, or a partnership with a different consulting firm. The highest-leverage thing you can do, before any technology decision, is sit with one team that owns one workflow and write the workflow down.</p><p>Inputs. Decisions. Decision criteria. Outputs. Exceptions. The conditions under which a step is skipped, repeated, or escalated. The signals that determine which path the workflow takes. The downstream consequences of each output.</p><p>If the team can produce that document in a few hours, the workflow is ready for AI. The investment will compound. If the team cannot, the workflow is not ready. Funding an AI deployment against it will join the 95% that delivered no measurable P&amp;L impact.</p><p>Most leaders, faced with this exercise, discover something they did not want to know: their organization has fewer well-defined workflows than they thought, and the ones they do have are not the ones generating the most economic value. The valuable work was being done through skilled improvisation by experienced people. AI cannot amplify what was never made explicit.</p><p>The good news is that the path forward is not technological. It is editorial. Every workflow you can articulate is a workflow you can choose to amplify. 
Every workflow you cannot articulate is a workflow whose value depends on the specific people doing it today, and is not transferrable to any system, AI or otherwise.</p><p>The 5% of companies BCG calls &#8220;future-built&#8221; are not winning because they have access to better models. The models are commodities. They are winning because they did the unglamorous work of writing down how their business actually runs, and only then asked AI to help.</p><p>The bottleneck has never been the technology. It is, and has always been, the quality of the thinking that goes in front of it. Better AI without better thinking produces faster fluffy. Better thinking, even with off-the-shelf AI, produces compounding value.</p><p>Stop investing in better AI. Start investing in better thinking. The order matters more than the tools.</p><div><hr></div><h2>Sources &amp; Further Reading</h2><ol><li><p><a href="https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf">MIT Project NANDA: The GenAI Divide - State of AI in Business 2025</a> (July 2025)</p></li><li><p><a href="https://www.spglobal.com/market-intelligence/en/news-insights/research/ai-experiences-rapid-adoption-but-with-mixed-outcomes-highlights-from-vote-ai-machine-learning">S&amp;P Global Market Intelligence: Voice of the Enterprise - AI &amp; Machine Learning 2025</a> (March 2025)</p></li><li><p><a href="https://media-publications.bcg.com/The-Widening-AI-Value-Gap-Sept-2025.pdf">BCG: The Widening AI Value Gap</a> (September 2025)</p></li><li><p><a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">McKinsey: The State of AI in 2025 - Agents, Innovation, and Transformation</a></p></li><li><p><a href="https://www.forrester.com/blogs/predictions-2026-ai-moves-from-hype-to-hard-hat-work/">Forrester: Predictions 2026 - AI Moves From Hype To Hard Hat Work</a> (October 2025)</p></li></ol>]]></content:encoded></item><item><title><![CDATA[105 Days to EU AI Act 
Enforcement. The Regulators Still Don’t Know What an Agent Is]]></title><description><![CDATA[Last week I argued that AI agents are not employees and need systems-level governance. This week: the systems-level governance guidance doesn&#8217;t exist. Not in NIST. Not in the EU AI Act. Not in ISO 420]]></description><link>https://marcosnewsletter.substack.com/p/ai-agent-governance-vacuum-nist-eu-iso</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/ai-agent-governance-vacuum-nist-eu-iso</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Tue, 21 Apr 2026 12:33:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-6QP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7493d3-253e-4730-9ff1-0c0675614e2a_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-6QP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7493d3-253e-4730-9ff1-0c0675614e2a_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-6QP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7493d3-253e-4730-9ff1-0c0675614e2a_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!-6QP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7493d3-253e-4730-9ff1-0c0675614e2a_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!-6QP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7493d3-253e-4730-9ff1-0c0675614e2a_1536x1024.png 
1272w, https://substackcdn.com/image/fetch/$s_!-6QP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7493d3-253e-4730-9ff1-0c0675614e2a_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-6QP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7493d3-253e-4730-9ff1-0c0675614e2a_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b7493d3-253e-4730-9ff1-0c0675614e2a_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3547412,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/194695731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7493d3-253e-4730-9ff1-0c0675614e2a_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-6QP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7493d3-253e-4730-9ff1-0c0675614e2a_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!-6QP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7493d3-253e-4730-9ff1-0c0675614e2a_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!-6QP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7493d3-253e-4730-9ff1-0c0675614e2a_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!-6QP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7493d3-253e-4730-9ff1-0c0675614e2a_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>On August 2, 2026, the European Commission starts enforcing the high-risk provisions of the AI Act. 
Penalties run up to 35 million euros or 7% of global revenue. That&#8217;s 105 days from today.</p><p>In the same window, Gartner projects that 40% of enterprise applications will include AI agents, up from less than 5% a year ago. Many of those deployments will fall squarely inside the Act&#8217;s high-risk categories: credit scoring, fraud detection, employment screening, essential services.</p><p>Here is what the EU AI Office published on its own FAQ page about how the Act applies to these agents:</p><blockquote><p>&#8220;Developments related to AI agents are recent and fast evolving, and the European Commission&#8217;s regulatory considerations are only preliminary at this stage.&#8221;</p></blockquote><p>That is the enforcement authority, four months before penalties begin, telling enterprises that its considerations are preliminary.</p><p>This is the governance vacuum. And it is not an abstract regulatory problem. It is an operational, legal, and financial risk that is already showing up in boardrooms, insurance renewals, and vendor contracts.</p><div><hr></div><h2>What the three frameworks your CISO already follows actually say about agents</h2><p>Walk into any enterprise security committee and three frameworks get cited: NIST AI RMF, the EU AI Act, ISO/IEC 42001. They are the default language for enterprise AI governance.</p><p>Here is what they say about autonomous agents that chain decisions, call external tools, and take actions with limited human oversight.</p><p><strong>NIST AI RMF 1.0 (January 2023)</strong>: nothing. The framework treats AI as a static system that produces outputs for humans to review. It has no concept of an agent that loops, plans, calls tools, or takes action. The companion GenAI Profile released in July 2024 identifies 12 risks specific to generative AI (confabulation, CBRN information, bias) but does not address agents as a distinct architectural pattern.</p><p>NIST itself acknowledged the gap only two months ago. 
On February 17, 2026, the NIST Center for AI Standards and Innovation formally launched the AI Agent Standards Initiative. The public comment period on the concept paper closed April 2. Final standards are years away.</p><p><strong>The EU AI Act (entered force August 1, 2024)</strong>: no specific agent category. The Act classifies AI systems as prohibited, high-risk, limited-risk, or minimal-risk. Enterprises must map their agents to these categories themselves. The Act&#8217;s Annex III lists high-risk use cases but says nothing about what makes an autonomous agent performing those tasks different from a classifier or a scoring model.</p><p>Enterprises building agents for credit decisions, insurance underwriting, or recruitment must guess which high-risk obligations apply and how, while the authority enforcing the rules publicly describes its thinking as preliminary.</p><p><strong>ISO/IEC 42001:2023</strong>: a management system standard, not a controls framework. It asks organizations to establish governance structures, risk management processes, and lifecycle controls for AI. It does not define what &#8220;good&#8221; looks like for delegation chains, tool-use permissions, runtime monitoring, autonomy limits, or shutdown procedures for an agent that goes rogue.</p><p>You can be ISO 42001 certified and still have zero specific controls on your agents. The standard certifies that you have a process. It does not tell you what the process should contain.</p><p>Three frameworks, three silences. And the enforcement clock is running.</p><div><hr></div><h2>The sector where this gets expensive first: banking</h2><p>Financial services is where the vacuum hits hardest, because banks have been deploying AI agents faster than almost any other regulated sector.</p><p>HSBC reports a 50% reduction in false positives on fraud detection after moving to agent-based anomaly detection. Citi, UBS, DBS, and ING report 20-40% cost reductions on compliance workflows. 
JPMorgan Chase runs an enterprise ML platform called OmniAI across fraud, risk, and regulatory operations. McKinsey&#8217;s latest research on financial crime describes agent-led transformations of KYC and AML as already underway.</p><p>Then look at what the bank regulators have written about any of this.</p><p>The Federal Reserve&#8217;s foundational model risk management guidance, Supervisory Letter SR 11-7, was published on April 4, 2011. It predates modern deep learning. It assumes models are validated periodically, produce scores, and feed into human decisions. It does not contemplate a system that plans, calls external tools, and takes action in a continuous loop.</p><p>Banks now have to reconcile 2011 validation language with 2026 autonomous systems. That reconciliation happens inside each bank, with no consistent supervisory standard.</p><p>The European Banking Authority published a factsheet on November 21, 2025 describing the AI Act&#8217;s implications for banking. Here is the operative sentence:</p><blockquote><p>&#8220;The EBA has not identified any immediate need to introduce any new or review existing EBA Guidelines.&#8221;</p></blockquote><p>The supervisor of European banks, writing four months before high-risk enforcement begins, announced it would not update its guidelines. Banks are on their own.</p><p>This is not a story about regulators being slow. It is a story about regulators being candid that they do not yet know what to say.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">If you&#8217;re finding this useful, consider subscribing. 
I publish data-driven breakdowns on enterprise AI maturity every Tuesday, no hype, no vendor pitches.</p></div></div></div><div><hr></div><h2>Three risks the vacuum is creating right now</h2><p>The absence of regulatory guidance does not mean the absence of risk. It means the risk is moving to places enterprise boards rarely look until it is too late.</p><p><strong>Risk 1: Insurance is tightening faster than governance is maturing.</strong></p><p>Cyber carriers are adding AI endorsements to clarify what is covered and what is not. Professional liability (E&amp;O) policies are adding explicit exclusions for losses caused by AI-driven decision-making, algorithmic non-compliance, and autonomous system failures. ISACA&#8217;s 2025 analysis on AI and cyber insurance is direct about the trend: coverage is narrowing.</p><p>The practical effect is that governance maturity is becoming a prerequisite for insurability. If an enterprise cannot demonstrate a governance program that covers its agents, some carriers will quietly exclude agent-driven losses, raise premiums, or decline to renew. This is happening now, in renewal cycles across 2026.</p><p><strong>Risk 2: Vendor contracts require compliance with frameworks that do not cover agents.</strong></p><p>Large enterprises increasingly require their vendors to be &#8220;compliant with NIST AI RMF&#8221; or &#8220;aligned with ISO/IEC 42001&#8221; in AI procurement clauses. What this means for an agent-based product is undefined. 
Vendors sign clauses they cannot fully satisfy because the frameworks they are promising compliance with do not specify controls for the thing they are selling.</p><p>This creates legal exposure on both sides. The buyer cannot verify compliance. The vendor cannot prove it. If the agent causes harm, the contract terms will be litigated against frameworks that were silent on the technology.</p><p><strong>Risk 3: Liability precedent is being set by what exists, not by what fits.</strong></p><p>When an AI agent causes measurable harm in a regulated industry in 2026, the first incident will not wait for NIST to finish its standards initiative or the EU AI Office to issue agent-specific guidance. Courts, regulators, and plaintiffs will reach for whatever governance framework exists.</p><p>In banking, that means SR 11-7 becomes the de facto agent regulation, interpreted by supervisors and courts who have no better option. In healthcare, it means HIPAA and the FDA&#8217;s software-as-medical-device guidance. In employment, it means Title VII and the EEOC&#8217;s algorithmic hiring positions. None of these were designed for agents.</p><p>The precedents being set in these early cases will harden into expectations. Enterprises that cannot show systematic thinking about agent governance, documented in writing, with clear decision rights and boundaries, will find themselves defending against frameworks they did not choose and did not design for.</p><div><hr></div><h2>What is emerging, and why it is not enough yet</h2><p>The vacuum is being filled, slowly, by industry-led frameworks. These are useful. They are not a substitute for regulatory guidance.</p><p><strong>OWASP Top 10 for Agentic Applications (December 2025)</strong> is the first peer-reviewed threat framework specifically for autonomous agents. 
Developed by over 100 industry experts over eight months, it catalogs risk categories including prompt injection, insecure tool integration, insufficient access controls, unintended autonomy, and inadequate monitoring. It is the closest thing to a shared baseline the industry has.</p><p><strong>MITRE ATLAS</strong> added agent-focused techniques in version 5.4.0 in February 2026, including tool poisoning and privilege escalation in agent workflows. It is a threat model for red-teaming, not a governance standard.</p><p><strong>Cloud Security Alliance</strong> published a research note on April 3, 2026 that explicitly names the agent governance gap and proposes controls. Useful for internal program design, not binding.</p><p><strong>The Agentic AI Foundation</strong> (Block, Anthropic, OpenAI, Google, December 2025) is developing interoperability standards like Model Context Protocol and the Agent2Agent protocol. These address how agents communicate, not how they are governed.</p><p>These are all credible efforts. None of them are regulations. A bank that adopts OWASP Top 10 for Agentic Applications is doing more than the minimum. It is also building on documents that no regulator has endorsed as sufficient. If an enforcement action arrives in August, OWASP compliance will be a defense, not a shield.</p><p>The industry is ahead of the regulators. That is unusual and it will not last. The question is how many enterprises will take harm in the gap before the official frameworks catch up.</p><div><hr></div><h2>The honest counterpoint</h2><p>The vacuum is not equally dangerous everywhere. There is a class of agent deployments where existing governance frameworks, stretched a bit, work fine.</p><p>An internal productivity agent that summarizes documents inside a contained environment, with no external actions, no access to customer data, and no autonomous decisions, does not need regulatory-grade governance. 
Applying the EU AI Act&#8217;s high-risk machinery to this kind of tool is overkill and will slow useful adoption.</p><p>The vacuum matters for a specific class of agents: those that take actions in the real world, make decisions with financial or legal consequences, operate in regulated industries, or touch customer-facing processes. For those, the absence of guidance is material and current.</p><p>The practical implication is triage. Not every agent needs the same governance treatment. But the triage itself is a governance act, and most enterprises are not doing it systematically.</p><div><hr></div><h2>What to do Monday morning</h2><p>Three things you can do this week, even without waiting for regulators.</p><p><strong>First, classify your agents by risk using the EU AI Act categories as a thinking tool</strong>, even if you are not in scope of EU enforcement. The Act&#8217;s four-tier structure (prohibited, high-risk, limited, minimal) is a useful forcing function. Go through your agent portfolio. Assign each agent to a tier. Document the reasoning in one paragraph per agent. The exercise surfaces the agents where the governance vacuum matters most.</p><p><strong>Second, adopt OWASP Top 10 for Agentic Applications as your working baseline</strong> until official frameworks catch up. It is the most credible industry document. It is free. Your security team can use it as a checklist for every agent in production or in development. When auditors and regulators eventually arrive with their own frameworks, you will have evidence of systematic thinking, not ad-hoc choices.</p><p><strong>Third, document your reasoning in writing for each high-risk agent.</strong> What controls are in place. What decisions the agent can make autonomously. What requires human approval. Who is accountable. How anomalies are detected. How the agent gets shut down. 
Not because a regulator will read these documents today, but because when one does, enterprises that show systematic reasoning get credit and enterprises that do not become easier targets.</p><p>These three actions are the minimum viable governance for agents in 2026. They will not make you compliant with frameworks that do not yet exist. They will put you in the top decile of enterprises that can defend their decisions when someone eventually asks.</p><p>I am developing a structured approach to this gap called AIMF (AI Maturity Framework), now a registered trademark at the EU Intellectual Property Office. It turns the classification, baseline adoption, and documentation into a repeatable process with scoring and a roadmap. I will share more on it in the coming weeks.</p><p>The regulatory vacuum is going to close. It will close faster than most enterprises expect, and it will close in ways that advantage organizations that were already thinking systematically about agent governance. The window to build that muscle, before the first enforcement action makes governance a fire drill, is measured in months.</p><div><hr></div><p><em>Next week: the real cost of AI agents. Token economics, infrastructure overhead, operational staffing. Most enterprises budget for agents like they budget for ML models. They are off by 3 to 5 times. 
Here is where the real costs hide.</em></p><div><hr></div><h2>Sources &amp; Further Reading</h2><ol><li><p><a href="https://www.nist.gov/caisi/ai-agent-standards-initiative">NIST AI Agent Standards Initiative</a> (February 2026)</p></li><li><p><a href="https://www.federalregister.gov/documents/2026/01/08/2026-00206/request-for-information-regarding-security-considerations-for-artificial-intelligence-agents">Federal Register RFI on AI Agent Security</a> (January 2026)</p></li><li><p><a href="https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/">OWASP Top 10 for Agentic Applications</a> (December 2025)</p></li><li><p><a href="https://artificialintelligenceact.eu/implementation-timeline/">EU AI Act Implementation Timeline</a></p></li><li><p><a href="https://ai-act-service-desk.ec.europa.eu/en/faq">EU AI Office FAQ</a></p></li><li><p><a href="https://www.eba.europa.eu/sites/default/files/2025-11/d8b999ce-a1d9-4964-9606-971bbc2aaf89/AI%20Act%20implications%20for%20the%20EU%20banking%20sector.pdf">EBA Factsheet on EU AI Act Implications for Banking</a> (November 2025)</p></li><li><p><a href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025">Gartner: 40% of Enterprise Apps Will Feature AI Agents by 2026</a> (August 2025)</p></li><li><p><a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027">Gartner: 40% of Agentic AI Projects Will Be Canceled by 2027</a> (June 2025)</p></li><li><p><a href="https://www.mckinsey.com/capabilities/risk-and-resilience/our-insights/how-agentic-ai-can-change-the-way-banks-fight-financial-crime">McKinsey: How Agentic AI Can Change the Way Banks Fight Financial Crime</a> (2025)</p></li><li><p><a href="https://atlas.mitre.org/">MITRE ATLAS Framework</a> (version 5.4.0, February 
2026)</p></li><li><p><a href="https://labs.cloudsecurityalliance.org/research/csa-research-note-ai-agent-governance-framework-gap-20260403/">CSA Research Note: AI Agent Governance Framework Gap</a> (April 2026)</p></li><li><p><a href="https://www.isaca.org/resources/news-and-trends/isaca-now-blog/2025/cyber-insurance-in-crisis-with-ai-blind-spots">ISACA: Cyber Insurance in Crisis with AI Blind Spots</a> (2025)</p></li><li><p><a href="https://www.federalreserve.gov/supervisionreg/srletters/sr1107.htm">Federal Reserve SR 11-7 Model Risk Management Guidance</a> (2011)</p></li><li><p><a href="https://cloudsecurityalliance.org/press-releases/2025/12/18/csa-and-google-cloud-study-finds-governance-maturity-is-strongest-predictor-of-ai-readiness">CSA and Google Cloud: Governance Maturity as Strongest Predictor of AI Readiness</a> (December 2025)</p></li><li><p><a href="https://www.iso.org/standard/42001">ISO/IEC 42001:2023 AI Management System Standard</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[AI Agents Are Not Employees. Stop Designing Org Charts for Them.]]></title><description><![CDATA[Your AI Agent Doesn't Need a Performance Review. It Needs a Scaffold. Last week I mapped the pilot graveyard - 95% of AI pilots delivering zero P&L impact. 
This week: the organizational design mistake]]></description><link>https://marcosnewsletter.substack.com/p/ai-agents-not-employees-org-design</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/ai-agents-not-employees-org-design</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Tue, 14 Apr 2026 10:51:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!BTDG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98eee1ab-dd11-4ee5-b57d-886b281ef850_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BTDG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98eee1ab-dd11-4ee5-b57d-886b281ef850_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BTDG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98eee1ab-dd11-4ee5-b57d-886b281ef850_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!BTDG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98eee1ab-dd11-4ee5-b57d-886b281ef850_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!BTDG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98eee1ab-dd11-4ee5-b57d-886b281ef850_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!BTDG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98eee1ab-dd11-4ee5-b57d-886b281ef850_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!BTDG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98eee1ab-dd11-4ee5-b57d-886b281ef850_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98eee1ab-dd11-4ee5-b57d-886b281ef850_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3458536,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/193664889?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98eee1ab-dd11-4ee5-b57d-886b281ef850_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BTDG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98eee1ab-dd11-4ee5-b57d-886b281ef850_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!BTDG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98eee1ab-dd11-4ee5-b57d-886b281ef850_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!BTDG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98eee1ab-dd11-4ee5-b57d-886b281ef850_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!BTDG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98eee1ab-dd11-4ee5-b57d-886b281ef850_1536x1024.png 
1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Scroll through LinkedIn on any given week and you&#8217;ll see some version of the same image: a traditional org chart, hierarchy intact, with robot icons replacing a few human positions. &#8220;AI Task Agent&#8221; reports to &#8220;AI Supervisor.&#8221; &#8220;Human &amp; AI Teams&#8221; sit in one branch, pure AI teams in another. The CEO stays at the top. The structure stays the same. Only the faces change.</p><p>The posts accompanying these images ask reasonable-sounding questions. How do we manage AI agents? What does performance management look like for them? 
Do agents &#8220;progress&#8221; in capability the way employees grow in their careers? How do we give them feedback?</p><p>These feel like the right questions. They&#8217;re not. They&#8217;re HR questions applied to systems that aren&#8217;t employees. And that category error - treating agents as workers who slot into existing hierarchies - is about to produce the same failure pattern we just mapped in the <a href="https://marcosnewsletter.substack.com/p/pilot-graveyard-ai-failure-rate?r=e6ccc">pilot graveyard</a>, except at a scale and speed that most organizations aren&#8217;t prepared for.</p><div><hr></div><h2>The hierarchy was designed for human constraints. Agents don&#8217;t have them.</h2><p>The org chart is one of the most durable management inventions of the industrial age. It exists because humans have real cognitive limits: a manager can effectively oversee 5-8 direct reports. Information degrades as it passes through hierarchical layers. Coordination costs increase with every node in the network. Synchronous communication is expensive and slow.</p><p>Every structural choice in a traditional hierarchy - span of control, reporting lines, management layers - is an optimization against these human constraints.</p><p>AI agents have none of them.</p><p>An agent doesn&#8217;t need a manager to prioritize its work. It doesn&#8217;t forget context between conversations (if designed properly). It doesn&#8217;t need to be in the room when a decision is made. It can process input from dozens of sources simultaneously. It can be invoked, configured, composed with other agents, and torn down in seconds.</p><p><a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">McKinsey&#8217;s research on the agentic organization</a> describes what mature agent deployments actually look like: small teams of 2-5 humans orchestrating 50-100 specialized agents. That&#8217;s not a traditional hierarchy with robot icons. 
That&#8217;s a fundamentally different organizational architecture where the ratio of humans to agents makes the old reporting structure irrelevant.</p><p>The viral org charts get this exactly backwards. They take a structure optimized for human limitations and swap in entities that don&#8217;t share those limitations. It&#8217;s like redesigning a factory around horses, then replacing the horses with trucks but keeping the same barn layout.</p><div><hr></div><h2>Neither tool nor employee: the category agents actually occupy</h2><p>The deeper problem with the &#8220;agents as employees&#8221; model isn&#8217;t just structural. It&#8217;s conceptual. And <a href="https://sloanreview.mit.edu/projects/the-emerging-agentic-enterprise-how-leaders-must-navigate-a-new-age-of-ai/">MIT Sloan&#8217;s research</a> names the tension precisely.</p><p>Organizations have two established mental models for things that do work. <strong>Tools</strong> are owned, controlled, and predictable. You buy them, deploy them, and they perform the same function until they depreciate. <strong>Workers</strong> are autonomous, managed through contracts and oversight, capable of judgment and adaptation. You hire them, develop them, and hold them accountable.</p><p>AI agents are both simultaneously - and neither fully.</p><p>Like tools, agents are owned. There&#8217;s no employment contract, no rights negotiation, no career aspiration. They can be duplicated, shut down, or replaced with a better version overnight. Like workers, agents act autonomously. They make decisions, adapt to context, and produce outputs that weren&#8217;t explicitly programmed. They can pursue sub-goals, interact with external systems, and take actions that surprise their operators.</p><p>MIT&#8217;s framing of this tension is direct: &#8220;How do you supervise something designed to work autonomously?&#8221; The ownership model of tools says you control it. The management model of workers says you guide it. 
Agents require both - programmatic constraints AND ongoing oversight - which is exactly what neither HR frameworks nor IT asset management were built to provide.</p><p>The org chart metaphor collapses this into one model: treat agents like employees, give them roles and managers and performance reviews. The result is that organizations apply the wrong governance framework to entities that need a purpose-built one.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">If you&#8217;re finding this useful, consider subscribing. I publish data-driven breakdowns on enterprise AI maturity every Tuesday - no hype, no vendor pitches. </p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>What happens when you manage agents like employees</h2><p>The consequences of the wrong mental model are already showing up in the data. And they&#8217;re not theoretical.</p><p><a href="https://labs.cloudsecurityalliance.org/research/csa-research-note-ai-agent-governance-framework-gap-20260403/">The Cloud Security Alliance&#8217;s April 2026 research</a> on enterprise agent governance found that <strong>80% of organizations deploying AI agents have documented &#8220;risky agent behaviors&#8221;</strong> - including unauthorized system access, data exposure, and actions that exceeded intended scope. 
Among companies with more than $1 billion in revenue, <strong>64% reported losses exceeding $1 million from AI system failures</strong> in 2025.</p><p>The pattern is specific and instructive. <strong>34% of these incidents occurred in systems where the agent had behavioral instructions in its prompt - but zero programmatic enforcement.</strong> The agent was told what to do in natural language. Nobody built the guardrails to ensure it actually did it.</p><p>This is the employee model in action. When you onboard a human employee, you give them instructions, context, and trust. You expect them to exercise judgment within reasonable boundaries. If they make a mistake, you course-correct through feedback. The system works because human employees have social incentives, career stakes, and built-in caution about overstepping authority.</p><p>Agents have none of these natural constraints. An agent with broad database permissions will use those permissions without hesitation, without social awareness, without the instinct that tells a human &#8220;maybe I should check before running a write query on the production database.&#8221; When <a href="https://www.bcg.com/publications/2025/what-happens-ai-stops-asking-permission">BCG documented a vulnerability</a> where an AI agent had SQL write privileges to a service account exposing 3.17 trillion rows, the root cause wasn&#8217;t a technology failure. It was an organizational assumption: the agent was provisioned with the same kind of trust you&#8217;d extend to a senior employee. No human employee would ever have had unaudited write access to 3.17 trillion rows. But the employee mental model made it feel natural to grant it.</p><p>The multi-agent coordination problem is even more revealing. 
<a href="https://cogentinfo.com/resources/when-ai-agents-collide-multi-agent-orchestration-failure-playbook-for-2026/">Research on multi-agent deployment failures</a> found that <strong>36.9% of failures occur at coordination seams</strong> - the handoff points between agents. The failure modes are distinctly non-human: infinite loops where agents with conflicting instructions bounce tasks indefinitely, &#8220;hallucinated consensus&#8221; where agents appear to agree on fabricated data, and rigid escalation chains where outdated assumptions propagate across the system under stress.</p><p>No amount of &#8220;performance reviews&#8221; or &#8220;management feedback&#8221; addresses these failure modes. They require engineering solutions: programmatic guardrails, monitoring infrastructure, kill switches, and governance layers that operate at machine speed.</p><div><hr></div><h2>The right questions to ask about agents</h2><p>The questions that matter aren&#8217;t about managing agents like people. They&#8217;re about governing agents like systems.</p><p>Instead of &#8220;How do we give feedback to an agent?&#8221;, ask: <strong>&#8220;Have we systematized our own thinking enough to specify what &#8216;good output&#8217; means for this task?&#8221;</strong> Giving feedback to an agent is prompt engineering. The real challenge is that most organizations can&#8217;t clearly articulate what good output looks like for human employees doing the same work. If you can&#8217;t define it for a person, you definitely can&#8217;t define it for a system.</p><p>Instead of &#8220;Do agents grow like employees?&#8221;, ask: <strong>&#8220;Who has the authority to decide what an agent can do autonomously and what requires human approval?&#8221;</strong> Agents don&#8217;t grow. They get replaced by better versions. 
The organizational question isn&#8217;t about their development path - it&#8217;s about decision rights, escalation protocols, and the boundary between autonomous action and human oversight.</p><p>Instead of &#8220;How do we measure agent performance?&#8221;, ask: <strong>&#8220;What output metrics define success, and how quickly can a human verify them?&#8221;</strong> Agent performance isn&#8217;t measured in annual reviews. It&#8217;s measured in real-time output validation, error rates, and alignment with defined objectives. This is operational monitoring, not talent management.</p><p>These aren&#8217;t semantic distinctions. They determine what infrastructure you build. The employee model leads to HR-style governance: periodic reviews, subjective assessments, trust-based permissions. The systems model leads to operational governance: real-time monitoring, programmatic constraints, least-privilege access, automated validation. The data is clear on which approach works: organizations applying traditional governance to agentic systems <a href="https://www.thinking.inc/en/blue-ocean/agentic/enterprise-agent-governance/">miss 60-70% of agent-specific risk vectors</a> - including action authorization, decision chain accountability, and emergent multi-agent behavior.</p><div><hr></div><h2>What agent-ready organizational design actually looks like</h2><p>If agents don&#8217;t sit in org charts, where do they go?</p><p>The answer requires a shift from static hierarchy to dynamic configuration. A mature agent-ready organization doesn&#8217;t have a fixed org chart with agent positions. It has an operational architecture where teams assemble and reconfigure based on the task at hand. Today, a project team might be 2 humans and 5 agents. Tomorrow, a different project runs with 1 human and 12 agents under a different orchestration pattern. 
The composition is fluid because agents, unlike employees, can be instantiated and decommissioned in minutes.</p><p>This isn&#8217;t hypothetical. It&#8217;s the direction major frameworks are pointing. McKinsey&#8217;s six shifts for the agentic organization describe the structural move from &#8220;hierarchies to outcome-oriented models&#8221; where humans focus on orchestration, not task execution. New roles are emerging to match: Agent Orchestrators who manage workflows and coordinate multi-agent systems, Hybrid Managers who supervise mixed human-agent teams, and Domain Leaders who define objectives and handle edge cases. <a href="https://sloanreview.mit.edu/article/agentic-ai-at-scale-redefining-management-for-a-superhuman-workforce/">MIT Sloan&#8217;s research</a> found that 45% of organizations with extensive agentic AI adoption expect reductions in middle management layers - not because agents replaced managers, but because the coordination work that justified those layers is now handled differently.</p><p>The governance layer is equally different. <a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027">Gartner predicts</a> that by 2028, 40% of CIOs will deploy &#8220;Guardian Agents&#8221; - AI systems that monitor, challenge, and contain the actions of other agents. This isn&#8217;t agent-as-employee with a robot manager. It&#8217;s a programmatic oversight layer that inspects prompts, tool calls, and outputs in real time, enforcing policy boundaries before actions execute. 
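</p><p>A minimal sketch of what &#8220;enforcing policy boundaries before actions execute&#8221; can look like in practice - illustrative only, with hypothetical names (<code>ToolCall</code>, <code>POLICY</code>, <code>authorize</code>), not any vendor&#8217;s API:</p><pre><code>from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:          # hypothetical: one proposed agent action
    tool: str            # e.g. "sql.query"
    action: str          # "read" or "write"
    target: str          # e.g. "crm.accounts"

# Least-privilege policy: only grants listed here exist.
# Note: there is no ("sql.query", "write") entry at all.
POLICY = {
    ("sql.query", "read"): {"crm.accounts", "billing.invoices"},
}

def authorize(call: ToolCall) -> bool:
    """Allow the call only if it matches an explicit grant."""
    return call.target in POLICY.get((call.tool, call.action), set())

# An in-scope read passes; a write is denied no matter what the
# prompt instructed, because the grant simply does not exist.
assert authorize(ToolCall("sql.query", "read", "crm.accounts"))
assert not authorize(ToolCall("sql.query", "write", "crm.accounts"))</code></pre><p>The enforcement lives outside the model: the agent can be told anything in natural language, but the permission check runs in ordinary code before the action executes.</p><p>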
Early adoption is already underway: <a href="https://www.opsinsecurity.com/blog/gartner-market-guide-guardian-agents">17% of CIOs have deployed guardian agents and 42% plan to within one year</a>.</p><p>The infrastructure that makes this work is what I call the <strong>operational scaffold</strong> - the five-layer architecture that sits between &#8220;we have an AI model&#8221; and &#8220;we have an operational AI system.&#8221; It includes input constraints (what data the agent can access), decision rules (what autonomy thresholds apply), output standardization (how results are validated), human-in-the-loop checkpoints (when a human must approve), and operational monitoring (how drift, errors, and anomalies are caught). Without this scaffold, you have an agent on the org chart. With it, you have an agent in production.</p><div><hr></div><h2>The honest counterpoint</h2><p>The employee metaphor isn&#8217;t useless. Microsoft&#8217;s <a href="https://www.microsoft.com/en-us/worklab/work-trend-index/2025-the-year-the-frontier-firm-is-born">Frontier Firm framework</a> deliberately uses &#8220;digital colleague&#8221; as a framing, and there&#8217;s a pragmatic reason: it helps organizations culturally accept a new category of workforce participant. When you tell a VP of Operations that agents are &#8220;like really good colleagues who never sleep,&#8221; adoption barriers drop. The metaphor is a communication bridge.</p><p>The danger is when you build governance on that bridge. When the CFO asks &#8220;how many FTEs can we replace with agents?&#8221;, the employee metaphor becomes a costing model that ignores inference economics, integration overhead, and the governance infrastructure that agents require and humans don&#8217;t. 
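</p><p>A back-of-the-envelope sketch of why the per-FTE comparison misleads - every number here is invented for illustration, not a benchmark:</p><pre><code># The license line is the only cost the "FTE replacement" math sees;
# the other components are what the employee metaphor hides.
costs = {
    "licensing": 300_000,     # annual license spend (invented figure)
    "inference": 220_000,     # usage-based model costs
    "integration": 260_000,   # wiring agents into real workflows and data
    "governance": 180_000,    # monitoring, guardrails, audit infrastructure
}

true_tco = sum(costs.values())                 # 960_000
license_share = costs["licensing"] / true_tco  # 0.3125

print(f"License line:  ${costs['licensing']:,}")   # $300,000
print(f"True TCO:      ${true_tco:,}")             # $960,000
print(f"License share: {license_share:.0%}")       # 31%</code></pre><p>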
<a href="https://agentmodeai.com/the-hidden-costs-of-agentic-ai-a-cfos-guide-to-true-tco-and-roi-modeling/">Research on hidden AI costs</a> found that 73% of enterprises don&#8217;t understand the true total cost of ownership of AI systems, with hidden costs comprising 70% of actual investment beyond licensing.</p><p>Use the metaphor to build cultural readiness. Don&#8217;t use it to architect governance. Those are different jobs, and confusing them is how 40% of agentic AI projects end up canceled.</p><div><hr></div><h2>What to do Monday morning</h2><p>If your organization is deploying or planning AI agents, three things you can do this week.</p><p><strong>First, audit your agents&#8217; permissions.</strong> Apply least-privilege by default, not trust. If an agent has write access to a production database, ask why. If the answer is &#8220;it might need it,&#8221; that&#8217;s the employee model talking. Systems get exactly the permissions they need, documented and reviewed.</p><p><strong>Second, for every agent in production, define the output specification in writing.</strong> Not &#8220;help with customer queries.&#8221; Instead: &#8220;Respond to billing inquiries using data from the CRM, escalate refund requests over $500 to a human, never access payment card data.&#8221; If you can&#8217;t write this specification, the agent isn&#8217;t ready for production - because you haven&#8217;t clarified your own thinking about what the task requires.</p><p><strong>Third, stop calling agents &#8220;team members&#8221; in internal documentation.</strong> Language shapes thinking, and thinking shapes governance. When agents are &#8220;team members,&#8221; they get trust-based permissions. When they&#8217;re operational systems, they get engineering-grade guardrails. The second framing saves money, reduces risk, and actually works.</p><p>The org chart with robot icons is a comforting vision. 
It suggests that the agentic future looks like today, just with better colleagues. The research says otherwise. The future organization is structurally different - more fluid, more dynamic, more dependent on operational scaffolds than reporting lines. Building toward that future starts with letting go of the metaphor that&#8217;s holding back the design.</p><div><hr></div><p><em>Next week: the governance vacuum. NIST, the EU AI Act, and ISO 42001 have zero specific guidance on autonomous AI agents. Meanwhile, every enterprise is deploying them. What this regulatory gap means for your risk profile - and what to do before the regulators catch up.</em></p><div><hr></div><h2>Sources &amp; Further Reading</h2><ol><li><p><a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">McKinsey: The State of AI in 2025 - Agents, Innovation, and Transformation</a> (2025)</p></li><li><p><a href="https://sloanreview.mit.edu/projects/the-emerging-agentic-enterprise-how-leaders-must-navigate-a-new-age-of-ai/">MIT Sloan: The Emerging Agentic Enterprise</a> (2025-2026)</p></li><li><p><a href="https://sloanreview.mit.edu/article/agentic-ai-at-scale-redefining-management-for-a-superhuman-workforce/">MIT Sloan: Agentic AI at Scale - Redefining Management for a Superhuman Workforce</a> (2026)</p></li><li><p><a href="https://labs.cloudsecurityalliance.org/research/csa-research-note-ai-agent-governance-framework-gap-20260403/">Cloud Security Alliance: AI Agent Governance Framework Gap</a> (April 2026)</p></li><li><p><a href="https://www.bcg.com/publications/2025/what-happens-ai-stops-asking-permission">BCG: What Happens When AI Stops Asking Permission</a> (2025)</p></li><li><p><a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027">Gartner: Over 40% of Agentic AI Projects Will Be Canceled by End of 2027</a> (June 2025)</p></li><li><p><a 
href="https://cogentinfo.com/resources/when-ai-agents-collide-multi-agent-orchestration-failure-playbook-for-2026/">Cogent Info: When AI Agents Collide - Multi-Agent Orchestration Failure Playbook</a> (2026)</p></li><li><p><a href="https://www.thinking.inc/en/blue-ocean/agentic/enterprise-agent-governance/">Enterprise Agent Governance: Bridging the Gap</a> (2026)</p></li><li><p><a href="https://www.opsinsecurity.com/blog/gartner-market-guide-guardian-agents">Gartner Market Guide: Guardian Agents</a> (2025-2026)</p></li><li><p><a href="https://www.microsoft.com/en-us/worklab/work-trend-index/2025-the-year-the-frontier-firm-is-born">Microsoft: The Frontier Firm - Work Trend Index 2025</a> (2025)</p></li><li><p><a href="https://agentmodeai.com/the-hidden-costs-of-agentic-ai-a-cfos-guide-to-true-tco-and-roi-modeling/">Hidden Costs of Agentic AI: A CFO&#8217;s Guide to True TCO</a> (2026)</p></li></ol>]]></content:encoded></item><item><title><![CDATA[The Pilot Graveyard: $30 Billion Spent on AI Pilots. 95% Delivered Nothing.]]></title><description><![CDATA[Last week I analyzed 17 AI maturity frameworks and found none of them work. 
This week: the pattern that kills AI pilots before they ever reach production - and the structure that breaks the cycle.]]></description><link>https://marcosnewsletter.substack.com/p/pilot-graveyard-ai-failure-rate</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/pilot-graveyard-ai-failure-rate</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Tue, 07 Apr 2026 12:28:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WE3E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79af426-2d08-4536-b585-e31e3268ef6f_1672x940.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last week I made the case that <a href="https://open.substack.com/pub/marcosnewsletter/p/ai-maturity-frameworks-comparison?r=e6ccc&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">17 AI maturity frameworks exist and none of them agree on what &#8220;mature&#8221; means</a>. The natural follow-up question is: if we can&#8217;t even measure maturity, what happens to the companies that try to build AI anyway?</p><p>The answer, according to MIT&#8217;s July 2025 <a href="https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf">State of AI in Business report</a>, is that <strong>95% of enterprise GenAI pilots deliver zero measurable P&amp;L impact</strong>. Not &#8220;underperform.&#8221; Not &#8220;take longer than expected.&#8221; Zero.</p><p>That&#8217;s $30-40 billion in enterprise AI spending producing nothing that shows up on the income statement. 
And the trend isn&#8217;t improving.</p><div><hr></div><h2>The numbers are getting worse, not better</h2><p><a href="https://www.spglobal.com/market-intelligence/en/news-insights/research/ai-experiences-rapid-adoption-but-with-mixed-outcomes-highlights-from-vote-ai-machine-learning">S&amp;P Global&#8217;s Voice of the Enterprise survey</a> found that <strong>42% of companies abandoned the majority of their AI initiatives in 2025</strong> - up from 17% the year before. That&#8217;s a 2.5x increase in one year. The average company now scraps 46% of its AI proofs-of-concept before they reach production.</p><p>Meanwhile, <a href="https://media-publications.bcg.com/The-Widening-AI-Value-Gap-Sept-2025.pdf">BCG&#8217;s September 2025 research</a> shows the gap between winners and everyone else is accelerating. The 5% of companies BCG classifies as &#8220;future-built&#8221; achieve <strong>1.7x revenue growth and 3.6x three-year total shareholder returns</strong> compared to their peers. The competitive gap has widened consistently since 2016 and shows no signs of closing.</p><p><a href="https://www.forrester.com/blogs/predictions-2026-ai-moves-from-hype-to-hard-hat-work/">Forrester predicts</a> that enterprises will defer <strong>25% of planned AI spending to 2027</strong>, with CFOs demanding stronger ROI evidence before approving budgets. The hype cycle is correcting, and the correction is painful.</p><p>The pattern is consistent across every major research house: adoption is near-universal (88% of companies use AI somewhere, per <a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">McKinsey</a>), but only 39% report any measurable business impact. The same McKinsey data shows that companies who redesign workflows around AI - rather than bolting AI onto existing processes - are 2.8x more likely to capture that value. AI isn&#8217;t failing because companies aren&#8217;t trying. 
It&#8217;s failing because the 95% and the 5% are trying in fundamentally different ways.</p><div><hr></div><h2>MIT found the root cause. It&#8217;s not what vendors tell you.</h2><p>The most important line in the MIT report is this: <strong>&#8220;Most GenAI systems do not retain feedback, adapt to context, or improve over time.&#8221;</strong></p><p>Read that again. The systems don&#8217;t learn. They don&#8217;t adapt. They don&#8217;t get better. Each pilot exists in isolation, disconnected from the workflow it was supposed to improve, with no feedback loop between what the AI produces and what the business actually needs.</p><p>MIT&#8217;s diagnosis isn&#8217;t about infrastructure, regulation, or talent. It&#8217;s about <strong>organizational learning</strong> - or, more precisely, the absence of it. The study found that &#8220;most GenAI pilots are disconnected from real workflows and lack meaningful feedback loops.&#8221; The pilot runs in a sandbox. Someone demonstrates it in a meeting. Everyone nods. Then it sits there, untouched, because nobody built the bridge between &#8220;technically works&#8221; and &#8220;operationally useful.&#8221;</p><p>This is the critical distinction. The failure isn&#8217;t at the model layer. The failure is at the integration layer - the point where AI capability meets organizational process. And that failure has a specific, predictable pattern:</p><p><strong>Step 1:</strong> An executive sees a demo or reads a report. Excitement builds. A team gets budget to &#8220;explore AI.&#8221;</p><p><strong>Step 2:</strong> The team picks a use case, builds a pilot, and demonstrates it. The demo looks impressive. The numbers look promising in the controlled environment.</p><p><strong>Step 3:</strong> Someone asks &#8220;how do we put this in production?&#8221; The room goes quiet. The pilot was built on sample data, not production data. The workflow it&#8217;s supposed to improve has twelve edge cases nobody mapped. 
The people who actually do the work weren&#8217;t involved in the design.</p><p><strong>Step 4:</strong> The pilot lingers. Updates slow down. The team gets reassigned. Six months later, a new executive arrives with a new AI initiative. The cycle restarts.</p><p>This isn&#8217;t one company&#8217;s story. This is the default enterprise experience with AI. And it compounds.</p><p>Now contrast that with what BCG&#8217;s top 5% do. Same Step 1 - an executive gets excited. But at Step 2, the team picks a use case attached to a real production workflow, not an abstract opportunity. At Step 3, production readiness isn&#8217;t an afterthought because the pilot was built against real data and real process constraints from the start. At Step 4, there&#8217;s an owner who stays accountable and a feedback loop that improves the system every week. Same starting point. Radically different sequence.</p><div><hr></div><h2>Innovation paralysis: the psychological trap nobody names</h2><p>There&#8217;s a layer beneath the operational failure that the research doesn&#8217;t fully capture, but that anyone working in enterprise AI recognizes immediately.</p><p>The AI market generates a constant stream of new tools, new models, new startups, and new claims. Every week there&#8217;s a product launch that promises to &#8220;manage your entire business&#8221; or &#8220;replace your workforce.&#8221; Every conference keynote shows a demo that looks like the future.</p><p>For enterprise decision-makers, this creates a specific form of paralysis. You can&#8217;t evaluate every new tool. You can&#8217;t pilot every promising model. You can&#8217;t keep up with the release cadence even if you tried. So you do what feels responsible: you run lots of small experiments. You &#8220;explore&#8221; broadly. You keep your options open.</p><p>The result is what I call <strong>innovation paralysis</strong> - the paradox where chasing every innovation blocks all innovation. 
Companies run dozens of proofs-of-concept, spend heavily on simulations and experiments, but never implement anything at scale. There&#8217;s always a newer model, a better tool, a more promising approach just around the corner.</p><p>MIT&#8217;s finding that <strong>&#8220;the normalization of perpetual piloting has become the most visible failure of 2025&#8221;</strong> names the symptom. Innovation paralysis names the cause.</p><p>The irony is that beneath all this noise, there&#8217;s a clear enterprise roadmap unfolding. Products like Google&#8217;s Gemini Enterprise, Salesforce&#8217;s Agentforce, and Microsoft&#8217;s Copilot follow real business needs, not hype cycles. The enterprise-grade adoption path is slower, more practical, and more productive than the experimental one. But it&#8217;s invisible to companies stuck chasing the next breakthrough.</p><div><hr></div><h2>Scar tissue: the compounding cost of failure</h2><p>Every failed pilot leaves a mark. Not just on the budget, but on the organization.</p><p><a href="https://www.crema.us/blog/agency-scar-tissue-why-digital-transformations-fail-and-how-to-heal-the-wounds">Crema calls this &#8220;organizational scar tissue&#8221;</a> - the lingering emotional and structural damage from failed transformation initiatives. It shows up as cynicism in meetings (&#8220;we tried AI last year and it didn&#8217;t work&#8221;), as resistance to new proposals (&#8220;show me the ROI before I&#8217;ll even consider it&#8221;), and as a general organizational immune response to anything labeled &#8220;AI.&#8221;</p><p>The math is brutal. A failed pilot costs $500k-$2M directly. But the downstream damage - two to three years of organizational resistance to the next initiative, and the next one requiring two to three times longer to build trust - compounds the real cost to tens of millions over a five-year horizon.</p><p>Transformation fatigue is the compound interest of failure. 
After two or three rounds of &#8220;this time it&#8217;s different&#8221; that turns out not to be different, the organization&#8217;s willingness to change drops to near zero. The problem isn&#8217;t that teams don&#8217;t believe AI works. It&#8217;s that they&#8217;ve learned to expect failure from their organization&#8217;s way of implementing it.</p><p>This is the vicious cycle: bad implementation creates scar tissue, scar tissue creates resistance, resistance blocks the next initiative, and the failure repeats.</p><p>The 5% that scale AI successfully have the opposite dynamic. Each win - even a small one - creates organizational proof that &#8220;this works for us.&#8221; That proof reduces resistance for the next initiative, which ships faster, which builds more proof. Scar tissue compounds failure. Momentum compounds success. The question is which cycle your organization is feeding.</p><div><hr></div><h2>The pattern that works: narrow, fast, real</h2><p>The research converges on one structural insight: <strong>the companies that capture value from AI start narrow and scale gradually.</strong> Not narrow as in &#8220;small ambition.&#8221; Narrow as in &#8220;one specific, high-value workflow, done properly, with measurable results.&#8221;</p><p>That 2.8x multiplier I mentioned earlier - the gap between companies redesigning workflows and those just adding AI on top - comes down to three specific characteristics. The difference isn&#8217;t sophistication. It&#8217;s focus.</p><p>The pattern that works has three characteristics:</p><p><strong>1. It targets a specific pain, not a broad opportunity.</strong> &#8220;We&#8217;ll use AI to improve customer experience&#8221; fails. &#8220;We&#8217;ll reduce our average handle time in the support queue from 10 minutes to 8 minutes&#8221; succeeds. The first is a direction. The second is a measurable outcome with a clear workflow attached.</p><p>The numbers from focused implementations are consistent: customer service handle time drops by 20% ($120k/year in savings for mid-sized teams). Lead scoring conversion improves by 22% (3.2-month payback). Churn prediction accuracy improves enough to reduce churn by 9% in a single quarter (under 2-month payback).</p><p><strong>2. It connects to real workflows from day one.</strong> The biggest difference between a pilot that scales and a pilot that dies is whether it was built inside the production workflow or next to it. 
MIT&#8217;s core finding - pilots disconnected from real workflows don&#8217;t learn and don&#8217;t improve - has a practical implication: the pilot has to touch real data, real users, and real processes from the start.</p><p>This doesn&#8217;t mean shipping untested AI into production. It means the pilot is designed with the production environment in mind: real data inputs, real user feedback, real process constraints. A pilot that works on sample data in a sandbox but hasn&#8217;t been tested against the twelve edge cases in the actual workflow is not a pilot. It&#8217;s a demo.</p><p><strong>3. It&#8217;s designed to scale, not to prove a point.</strong> There&#8217;s a fundamental design difference between a proof-of-concept and what some practitioners call a <strong>Minimum Viable Transformation</strong> (MVT). A PoC answers the question &#8220;can this work?&#8221; An MVT answers &#8220;can this work at scale, in production, within our organizational constraints?&#8221;</p><p>The MVT includes things a typical PoC skips: pre-defined success metrics with Go/No-Go decision gates. A feedback loop that routes user corrections back into the system. A plan for what happens when the model is wrong (because it will be wrong). An owner who stays accountable past the demo.</p><p>One signal from the MIT study reinforces this: <strong>vendor-built tools succeed 67% of the time, while internal custom builds succeed only 33%.</strong> This isn&#8217;t an argument for outsourcing everything. It&#8217;s an argument for speed and specialization over customization. Focused, well-scoped implementations with existing tools outperform ambitious custom builds by a factor of two.</p><div><hr></div><h2>Designing the exit from the pilot graveyard</h2><p>The sequence matters. Organizations stuck in the pilot graveyard typically try to fix the problem by doing more of what caused it: more planning, more strategy, more comprehensive AI initiatives. 
The research suggests the opposite.</p><p><a href="https://www.kotterinc.com/methodology/8-steps/">John Kotter&#8217;s change management research</a> offers a useful framework here. Step 6 of his 8-step process is &#8220;Generate Short-Term Wins,&#8221; and his reasoning is precise: <strong>&#8220;Wins are the molecules of results. They must be recognized, collected, and communicated - early and often - to track progress and energize volunteers to persist.&#8221;</strong></p><p>Applied to AI: the first priority isn&#8217;t a comprehensive strategy. It&#8217;s one visible, measurable win that heals scar tissue and builds organizational confidence that &#8220;this time, it actually is different.&#8221;</p><p>The practical sequence:</p><p><strong>Days 1-14:</strong> Identify 2-3 workflows where the pain is high, the technical complexity is low, and the ROI is clearly measurable. This doesn&#8217;t require a six-month assessment. It requires someone who understands the operations and can spot where time and money are being lost. In most organizations, the people doing the actual work can tell you in 30 minutes where the worst bottlenecks are.</p><p><strong>Days 15-90:</strong> Build and deploy the first focused implementation. Not a proof-of-concept. An actual working system integrated into the real workflow, with real feedback loops, being used by real people. Measure the before and after. Publish the results internally.</p><p><strong>Day 90+:</strong> Use the momentum from the first win to scope the second. Each successful implementation reduces organizational resistance and creates advocates who pull the next project forward. This is where the scaling begins - not from a top-down mandate, but from demonstrated value.</p><p>The pattern is: Discovery (find the opportunity), Integration (embed AI in the specific workflow), Amplification (extend to adjacent processes), Transformation (rearchitect at scale). 
Each phase funds and justifies the next.</p><div><hr></div><h2>The honest counterpoint</h2><p>The objection to this approach is predictable: &#8220;Quick wins create silos. You need a holistic strategy and solid data foundation first.&#8221; This is a reasonable concern, and in some contexts it&#8217;s valid.</p><p>But <a href="https://exec-ed.berkeley.edu/2025/09/beyond-roi-are-we-using-the-wrong-metric-in-measuring-ai-success/">UC Berkeley&#8217;s research on AI success metrics</a> suggests the reality is more nuanced. Comprehensive strategies that never ship are the defining characteristic of pilot graveyards. McKinsey&#8217;s data shows that organizations starting with specific, high-value workflows and scaling gradually actually achieve more holistic results than those attempting comprehensive transformation from day one.</p><p>The key is intent. A quick win built as a terminal point - &#8220;we automated this one thing, done&#8221; - does create silos. A quick win built as the first step of a deliberate scaling path - with shared infrastructure, common data standards, and documented patterns - creates a foundation. The difference is in the design, not the scope.</p><p>There&#8217;s also the agent dimension. Gartner predicts that <a href="https://www.gartner.com/en/articles/ai-maturity-model">40% of agentic AI projects will be canceled by 2027</a> due to escalating costs and unclear business value. This next wave of AI complexity makes the argument for focused, validated implementations even stronger, not weaker. If companies can&#8217;t scale basic GenAI pilots, the organizational challenge of autonomous agents will be orders of magnitude harder.</p><div><hr></div><h2>What to do Monday morning</h2><p>If your organization has more than three AI pilots running right now, pick the one closest to a measurable business outcome. Audit it against three questions: Is it connected to a real production workflow? Does it have a feedback loop? 
Is there someone accountable for its results past the demo? If the answer to any of these is no, you&#8217;ve identified why it&#8217;s not scaling.</p><p>Then find the one workflow in your operation where the pain is highest and the solution is simplest. Start there. Measure before and after. Publish the results internally. Let the momentum do what strategy decks never could.</p><p>The pilot graveyard isn&#8217;t destiny. It&#8217;s a design flaw. And design flaws can be fixed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WE3E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79af426-2d08-4536-b585-e31e3268ef6f_1672x940.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WE3E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79af426-2d08-4536-b585-e31e3268ef6f_1672x940.png 424w, https://substackcdn.com/image/fetch/$s_!WE3E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79af426-2d08-4536-b585-e31e3268ef6f_1672x940.png 848w, https://substackcdn.com/image/fetch/$s_!WE3E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79af426-2d08-4536-b585-e31e3268ef6f_1672x940.png 1272w, https://substackcdn.com/image/fetch/$s_!WE3E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79af426-2d08-4536-b585-e31e3268ef6f_1672x940.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!WE3E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79af426-2d08-4536-b585-e31e3268ef6f_1672x940.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f79af426-2d08-4536-b585-e31e3268ef6f_1672x940.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2459309,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/192209107?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79af426-2d08-4536-b585-e31e3268ef6f_1672x940.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WE3E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79af426-2d08-4536-b585-e31e3268ef6f_1672x940.png 424w, https://substackcdn.com/image/fetch/$s_!WE3E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79af426-2d08-4536-b585-e31e3268ef6f_1672x940.png 848w, https://substackcdn.com/image/fetch/$s_!WE3E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79af426-2d08-4536-b585-e31e3268ef6f_1672x940.png 1272w, https://substackcdn.com/image/fetch/$s_!WE3E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79af426-2d08-4536-b585-e31e3268ef6f_1672x940.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><p><em>Next week: AI agents aren&#8217;t chatbots. They don&#8217;t sit in org charts. And they can&#8217;t be managed like employees. 
Why the &#8220;agentic org chart&#8221; going viral on LinkedIn is wrong - and what organizational design for AI agents actually looks like.</em></p><div><hr></div><h2>Sources &amp; Further Reading</h2><ol><li><p><a href="https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf">MIT Project NANDA: The GenAI Divide - State of AI in Business 2025</a> (July 2025)</p></li><li><p><a href="https://www.spglobal.com/market-intelligence/en/news-insights/research/ai-experiences-rapid-adoption-but-with-mixed-outcomes-highlights-from-vote-ai-machine-learning">S&amp;P Global Market Intelligence: Voice of the Enterprise - AI &amp; Machine Learning 2025</a> (March 2025)</p></li><li><p><a href="https://media-publications.bcg.com/The-Widening-AI-Value-Gap-Sept-2025.pdf">BCG: The Widening AI Value Gap</a> (September 2025)</p></li><li><p><a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">McKinsey: The State of AI in 2025</a> (2025)</p></li><li><p><a href="https://www.crema.us/blog/agency-scar-tissue-why-digital-transformations-fail-and-how-to-heal-the-wounds">Crema: Agency Scar Tissue - How To Heal From Failed Digital Transformations</a></p></li><li><p><a href="https://www.kotterinc.com/methodology/8-steps/">Dr. John Kotter: The 8-Step Process for Leading Change</a></p></li><li><p><a href="https://www.forrester.com/blogs/predictions-2026-ai-moves-from-hype-to-hard-hat-work/">Forrester: Predictions 2026 - AI Moves From Hype To Hard Hat Work</a> (October 2025)</p></li><li><p><a href="https://exec-ed.berkeley.edu/2025/09/beyond-roi-are-we-using-the-wrong-metric-in-measuring-ai-success/">UC Berkeley: Beyond ROI - Are We Using the Wrong Metric in Measuring AI Success?</a> (September 2025)</p></li></ol>]]></content:encoded></item><item><title><![CDATA[17 AI Maturity Frameworks Exist. None of Them Work. 
Here's Why.]]></title><description><![CDATA[I analyzed 17 AI maturity frameworks from Gartner, McKinsey, Accenture, Microsoft, and others. They don't agree on how many levels exist, what to measure, or how to score it. None of them covers AI agents.]]></description><link>https://marcosnewsletter.substack.com/p/ai-maturity-frameworks-comparison</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/ai-maturity-frameworks-comparison</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Tue, 31 Mar 2026 12:04:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!NHjQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78e1ef5-a53e-47ec-8704-2cd047007b51_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last week I made the case that organizational maturity - not data quality - is the real reason AI projects fail. The research is clear: 95% of AI pilots produce zero P&amp;L impact, and the root cause is &#8220;flawed enterprise integration,&#8221; not bad models or messy databases.</p><p>The natural next question is: how mature are we, exactly? If maturity is the problem, you need a way to measure it.</p><p>So you go looking for a framework. And you find seventeen of them.</p><p>That&#8217;s not a market. 
That&#8217;s a red flag.</p><div><hr></div><h2>17 frameworks, 17 definitions of &#8220;mature&#8221;</h2><p>I spent the past month pulling apart every major AI maturity framework I could find: <a href="https://www.gartner.com/en/articles/ai-maturity-model">Gartner</a>, <a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-how-organizations-are-rewiring-to-capture-value">McKinsey</a>, <a href="https://www.accenture.com/us-en/insights/artificial-intelligence/ai-maturity-and-transformation">Accenture</a>, <a href="https://learn.microsoft.com/en-us/ai/playbook/strategy/maturity-model">Microsoft</a>, <a href="https://cloud.google.com/architecture/ai-ml/ai-adoption-framework">Google Cloud</a>, <a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/gen-ai-enterprise-adoption-framework/maturity-model.html">AWS</a>, <a href="https://www2.deloitte.com/us/en/insights/topics/digital-transformation/ai-and-business-transformation-study.html">Deloitte</a>, <a href="https://www.pwc.com/us/en/services/consulting/library/artificial-intelligence-predictions.html">PwC</a>, <a href="https://www.ibm.com/think/topics/ai-ladder">IBM</a>, <a href="https://cisr.mit.edu/publication/2025_0801_EnterpriseAIMaturityUpdate_WoernerSebastianWeillKaganer">MIT CISR</a>, <a href="https://www.salesforce.com/blog/agentic-maturity-model/">Salesforce</a>, <a href="https://blogs.mulesoft.com/ai/agentic-enterprise-maturity-model/">MuleSoft</a>, <a href="https://onereach.ai/resources/enterprise-ai-agent-maturity-model">OneReach</a>, <a href="https://www.bain.com/insights/building-the-foundation-for-agentic-ai/">Bain</a>, <a href="https://www.nist.gov/artificial-intelligence/ai-risk-management-framework">NIST</a>, <a href="https://www.iso.org/standard/81230.html">ISO 42001</a>, and <a href="https://genai.owasp.org/resource/ai-maturity-assessment/">OWASP</a>.</p><p>The first thing you notice is that they can&#8217;t agree on the most basic structural question: how 
many steps does the staircase have?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8b8j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc1cc853-e646-49b0-b5a0-a93a7465cb26_650x906.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8b8j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc1cc853-e646-49b0-b5a0-a93a7465cb26_650x906.png 424w, https://substackcdn.com/image/fetch/$s_!8b8j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc1cc853-e646-49b0-b5a0-a93a7465cb26_650x906.png 848w, https://substackcdn.com/image/fetch/$s_!8b8j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc1cc853-e646-49b0-b5a0-a93a7465cb26_650x906.png 1272w, https://substackcdn.com/image/fetch/$s_!8b8j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc1cc853-e646-49b0-b5a0-a93a7465cb26_650x906.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8b8j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc1cc853-e646-49b0-b5a0-a93a7465cb26_650x906.png" width="650" height="906" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc1cc853-e646-49b0-b5a0-a93a7465cb26_650x906.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:906,&quot;width&quot;:650,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:70764,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/192076160?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc1cc853-e646-49b0-b5a0-a93a7465cb26_650x906.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8b8j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc1cc853-e646-49b0-b5a0-a93a7465cb26_650x906.png 424w, https://substackcdn.com/image/fetch/$s_!8b8j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc1cc853-e646-49b0-b5a0-a93a7465cb26_650x906.png 848w, https://substackcdn.com/image/fetch/$s_!8b8j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc1cc853-e646-49b0-b5a0-a93a7465cb26_650x906.png 1272w, https://substackcdn.com/image/fetch/$s_!8b8j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc1cc853-e646-49b0-b5a0-a93a7465cb26_650x906.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Three levels, four levels, five levels, six levels. Or no levels. Or binary pass/fail. OneReach&#8217;s top level is called &#8220;Organizational AGI&#8221; - a concept that doesn&#8217;t exist yet. Gartner and Google both call their top level &#8220;Transformational&#8221; but mean completely different things by it. NIST doesn&#8217;t believe in levels at all; it uses current-state vs. target-state profiles with gap analysis. And ISO 42001 reduces the entire question to a compliance audit: you&#8217;re either certified or you&#8217;re not.</p><p><strong>If the industry can&#8217;t agree on how many steps the staircase has, you&#8217;re not climbing the same staircase.</strong></p><div><hr></div><h2>Same company, different score</h2><p>The structural disagreement goes deeper than level counts. 
These frameworks measure fundamentally different things.</p><p>Accenture evaluates 8 dimensions and produces a score from 0 to 100 - where the median organization lands at 36 and &#8220;AI Achievers&#8221; average 64. MIT CISR calculates a &#8220;Total AI Effectiveness&#8221; percentage. Google assigns you a phase (Tactical, Strategic, or Transformational). PwC classifies you into a stage based on responsible AI practices. ISO says you have 38 controls and you either pass the audit or you don&#8217;t.</p><p>Here&#8217;s the measurement landscape across all 17:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gWYm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdec92c-48f7-4c42-b805-2d7dfd060221_680x730.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gWYm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdec92c-48f7-4c42-b805-2d7dfd060221_680x730.png 424w, https://substackcdn.com/image/fetch/$s_!gWYm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdec92c-48f7-4c42-b805-2d7dfd060221_680x730.png 848w, https://substackcdn.com/image/fetch/$s_!gWYm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdec92c-48f7-4c42-b805-2d7dfd060221_680x730.png 1272w, https://substackcdn.com/image/fetch/$s_!gWYm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdec92c-48f7-4c42-b805-2d7dfd060221_680x730.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!gWYm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdec92c-48f7-4c42-b805-2d7dfd060221_680x730.png" width="680" height="730" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bfdec92c-48f7-4c42-b805-2d7dfd060221_680x730.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:730,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67234,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/192076160?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdec92c-48f7-4c42-b805-2d7dfd060221_680x730.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gWYm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdec92c-48f7-4c42-b805-2d7dfd060221_680x730.png 424w, https://substackcdn.com/image/fetch/$s_!gWYm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdec92c-48f7-4c42-b805-2d7dfd060221_680x730.png 848w, https://substackcdn.com/image/fetch/$s_!gWYm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdec92c-48f7-4c42-b805-2d7dfd060221_680x730.png 1272w, https://substackcdn.com/image/fetch/$s_!gWYm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdec92c-48f7-4c42-b805-2d7dfd060221_680x730.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Run the same organization through Accenture&#8217;s model and you might get a score of 42 - &#8220;Implemented AI.&#8221; Run it through PwC&#8217;s model and you&#8217;re &#8220;Strategic stage&#8221; because you have strong responsible AI practices. Put it through ISO 42001 and you fail certification because three of the 38 controls aren&#8217;t documented. 
Apply the NIST framework and there&#8217;s no score at all - just a list of gaps between where you are and where you want to be.</p><p>Four frameworks, four verdicts, zero overlap.</p><p>This matters because these assessments drive real decisions. They determine where budget goes, which projects get approved, and how leadership reports AI progress to the board. When the tool you use to measure maturity changes the measurement itself, you don&#8217;t have assessment. You have astrology with better slide decks.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><strong>If this analysis is useful, subscribe to get the next one.</strong> Every Tuesday I publish data-driven breakdowns on enterprise AI maturity, agent strategy, and operations. No hype, no vendor pitches.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>The vendor problem hiding in plain sight</h2><p>There&#8217;s a structural reason these frameworks disagree, and it&#8217;s not academic.</p><p>Salesforce&#8217;s maturity model measures readiness to adopt Agentforce. Microsoft&#8217;s agentic model measures readiness for Copilot and Azure AI Services. Google Cloud&#8217;s framework assesses progress toward Vertex AI adoption. IBM&#8217;s AI Ladder leads to watsonx. AWS&#8217;s model is explicitly about Amazon Bedrock.</p><p>These aren&#8217;t neutral assessments. 
They&#8217;re sales funnels with maturity language.</p><p>The pattern is consistent: score low on dimension X, and the recommended action is to buy product Y - from the same company that built the framework. <a href="https://www.accenture.com/us-en/insights/artificial-intelligence/ai-maturity-and-transformation">Accenture&#8217;s model</a>, developed with Carnegie Mellon&#8217;s SEI, is the most academically grounded of the vendor models. But even their assessment feeds directly into Accenture&#8217;s consulting and implementation services.</p><p>The independent frameworks - NIST, ISO, OWASP - don&#8217;t have this vendor bias. But they have a different problem: they&#8217;re designed for compliance and risk management, not for operational maturity assessment. NIST tells you where your risks are. ISO tells you if your management system is auditable. Neither tells you how to get from &#8220;we have a chatbot in customer service&#8221; to &#8220;AI is embedded in how we operate.&#8221;</p><p>So the vendor models measure operational readiness but point you toward their platform. The independent models avoid vendor bias but don&#8217;t measure operational readiness. Nobody covers both.</p><div><hr></div><h2>The agent-shaped hole</h2><p>All of this might be a solvable problem if the frameworks at least covered the same ground. 
But there&#8217;s one area where the gap is catastrophic: AI agents.</p><p><a href="https://www2.deloitte.com/us/en/insights/topics/digital-transformation/ai-and-business-transformation-study.html">Deloitte&#8217;s 2026 survey</a> of 3,235 enterprise leaders found that <strong>73% plan to deploy AI agents within two years.</strong> But only <strong>21% have governance frameworks mature enough to handle them.</strong> That&#8217;s a 52-percentage-point gap between ambition and readiness.</p><p><a href="https://www.gartner.com/en/articles/ai-agents">Gartner predicts</a> that <strong>40% of agentic AI projects will be canceled by 2027</strong> - not because the technology fails, but because organizations aren&#8217;t prepared.</p><p>And the maturity frameworks? Most were built before the agentic shift.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fOOM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8bffce7-5d74-44e6-9225-222dbc6da243_680x906.png"><img src="https://substackcdn.com/image/fetch/$s_!fOOM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8bffce7-5d74-44e6-9225-222dbc6da243_680x906.png" width="680" height="906" alt="" loading="lazy"></a></figure></div><p>Only Microsoft has a dedicated, structured agent maturity model. 
Salesforce and OneReach have agent-specific models too, but they&#8217;re tied to proprietary platforms. IBM has the most advanced agent governance tooling. OWASP covers agent security. The rest either added agent modules recently, mention agents in passing, or ignore them entirely.</p><p>Meanwhile, every enterprise is deploying agents. The frameworks were supposed to tell you if you&#8217;re ready. Most of them don&#8217;t even know you&#8217;re asking the question.</p><div><hr></div><h2>The deeper problem nobody is naming</h2><p>Even if one framework were perfect - right number of levels, right dimensions, no vendor bias, full agent coverage - it would still miss the most important signal.</p><p>Every framework produces a single, organization-wide maturity score. &#8220;You are at Level 3.&#8221; &#8220;Your AI Effectiveness is 62%.&#8221; &#8220;You are in the Strategic phase.&#8221;</p><p>That number is as misleading as GDP per capita.</p><p>An organization at &#8220;Level 2 average&#8221; where every team sits between 1.5 and 2.5 is fundamentally different from one at &#8220;Level 2 average&#8221; where the data science team is at 4, marketing is at 3, and operations and HR are at 0. The second organization has the same average score but radically different operational reality. The data science team is deploying production models. Operations is running everything manually. There&#8217;s no cross-functional synergy because half the teams can&#8217;t participate.</p><p><strong>The standard deviation between teams matters more than the mean.</strong></p><p>An organization with low average but tight variance can grow systematically. Train everyone, improve together, build processes that scale across departments. 
An organization with the same average but high variance has a political problem first and a maturity problem second: the advanced teams resent the laggards, the laggards feel left behind, and every cross-functional AI initiative stalls because the weakest link determines the pace.</p><p>No framework measures this. Not one of the 17 I analyzed even mentions intra-organizational variance as a dimension. They all assume the organization is a monolith that can be assigned a single score.</p><p>Ask any VP of Engineering whether every team in their org is at the same maturity level and they&#8217;ll laugh. Yet that&#8217;s exactly what these frameworks assume when they produce a single number.</p><div><hr></div><h2>What this means for you</h2><p>I&#8217;m not going to pretend there&#8217;s an easy fix here. If 17 frameworks from the world&#8217;s largest consulting firms, tech companies, and standards bodies haven&#8217;t converged on a shared definition, adding an eighteenth won&#8217;t solve it.</p><p>But there are three things worth doing right now.</p><p><strong>First, stop treating any single framework as ground truth.</strong> If you&#8217;re using Gartner&#8217;s model because &#8220;it&#8217;s Gartner,&#8221; understand that you&#8217;re getting Gartner&#8217;s definition of maturity, measured by Gartner&#8217;s dimensions, through Gartner&#8217;s lens. That&#8217;s one perspective, not the answer.</p><p><strong>Second, check the agent question.</strong> If your organization is planning to deploy AI agents - and statistically, it probably is - ask whether your current assessment framework has anything to say about it. If the answer is &#8220;not really,&#8221; your assessment is already outdated. 
The thing you&#8217;re about to deploy is the thing your measurement tool doesn&#8217;t cover.</p><p><strong>Third, look inside before you look outside.</strong> Before you pick a framework, ask a simpler question: <strong>do the teams in your organization even agree on what &#8220;AI mature&#8221; means?</strong> If the ML team thinks maturity is about model performance, product thinks it&#8217;s about feature adoption, and legal thinks it&#8217;s about compliance, no external framework will resolve that disagreement. It&#8217;ll just paper over it with a score.</p><p>The framework problem isn&#8217;t going to be solved by a better framework. It&#8217;s going to be solved by getting clearer about what you actually need to measure, why, and for whom.</p><p>That&#8217;s a harder conversation. It&#8217;s also a more useful one.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NHjQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78e1ef5-a53e-47ec-8704-2cd047007b51_1536x1024.png"><img src="https://substackcdn.com/image/fetch/$s_!NHjQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78e1ef5-a53e-47ec-8704-2cd047007b51_1536x1024.png" width="1456" height="971" alt="" loading="lazy"></a></figure></div>]]></content:encoded></item><item><title><![CDATA[The Organizational Maturity Gap Nobody Wants to Talk 
About]]></title><description><![CDATA[43% of companies blame data quality. 43% blame lack of technical maturity. 44% can&#8217;t name who owns AI. One of these is the symptom. The other two are the disease.]]></description><link>https://marcosnewsletter.substack.com/p/the-organizational-maturity-gap-nobody</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/the-organizational-maturity-gap-nobody</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Tue, 24 Mar 2026 12:11:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ihuG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0a7a57-fa72-4506-9072-4b7aa3e2bd33_1424x839.png" length="0" type="image/png"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ihuG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0a7a57-fa72-4506-9072-4b7aa3e2bd33_1424x839.png"><img src="https://substackcdn.com/image/fetch/$s_!ihuG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0a7a57-fa72-4506-9072-4b7aa3e2bd33_1424x839.png" width="1424" height="839" alt="" fetchpriority="high"></a></figure></div><p>Your data isn&#8217;t the problem.</p><p>That&#8217;s the conclusion MIT landed on after studying 350+ employees across 
300+ deployments and conducting 150 interviews at companies trying to scale AI. The researchers expected to find the usual suspects: messy data, siloed systems, legacy infrastructure. Instead, they found something more fundamental. <strong>The root cause of <a href="https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo">95% of AI pilots producing zero P&amp;L impact</a> was &#8220;flawed enterprise integration.&#8221;</strong> Not bad data. Not weak models. Not missing tools. The failure point was organizational.</p><p><a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-how-organizations-are-rewiring-to-capture-value">McKinsey&#8217;s latest state-of-AI research</a> confirms it. Only <strong>6% of companies</strong> are actually capturing value from their AI investments at scale. The gap between that 6% and the <a href="https://www.bcg.com/publications/2025/are-you-generating-value-from-ai-the-widening-gap">60% getting zero value</a> isn&#8217;t technology. It&#8217;s maturity. How companies redesign workflows, allocate budget, clarify ownership, and weave AI into their operating model. That&#8217;s all org design, not data science.</p><p>Yet walk into most vendor conversations and you&#8217;ll hear the same refrain: fix your data, implement our platform, hire our consulting team. Data quality is real (63% of companies lack AI-ready data). But it&#8217;s not the primary blocker. It&#8217;s a symptom of something deeper: unclear ownership, no data governance, and organizational structures that weren&#8217;t built for AI decision-making.</p><div><hr></div><h2>The evidence is there. We&#8217;re just looking at the wrong things.</h2><p><a href="https://www.pertamapartners.com/insights/ai-project-failure-statistics-2026">S&amp;P Global</a> asked enterprise leaders why their AI initiatives underperform. 
Here&#8217;s what they found:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y4w5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5556b3d0-6300-4978-b7c1-71d1d49824e9_710x480.png"><img src="https://substackcdn.com/image/fetch/$s_!y4w5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5556b3d0-6300-4978-b7c1-71d1d49824e9_710x480.png" width="710" height="480" alt="" loading="lazy"></a></figure></div><p>Read that table carefully. Data quality and technical maturity are cited at identical rates. And nearly half the room can&#8217;t name who owns AI in their organization. That&#8217;s not a technology problem. That&#8217;s an operating problem.</p><p>The concerning part: these percentages have gotten worse, not better. S&amp;P Global found that <strong>42% of companies have abandoned AI initiatives</strong> in the past two years, up from 17% two years prior. When you ask what they&#8217;re abandoning, it&#8217;s not projects that failed because of messy databases. 
It&#8217;s projects that failed because no one agreed on what success looked like, who was responsible for delivery, or how the work fit into existing operational rhythms.</p><div><hr></div><h2>MIT&#8217;s &#8220;flawed enterprise integration&#8221;: the real diagnosis</h2><p>The MIT finding deserves to sit in your head. Researchers at <a href="https://cisr.mit.edu/publication/2025_0801_EnterpriseAIMaturityUpdate_WoernerSebastianWeillKaganer">MIT&#8217;s Center for Information Systems Research</a> studied why so many AI pilots generate impressive demos but zero business impact. They tracked 300+ deployments and found that the organizations that successfully moved pilots to production had one thing in common: they <strong>redesigned how AI fit into their existing workflows, team structures, and decision-making processes.</strong></p><p>The ones that stalled had not. They treated AI as a technology insertion problem. Bolted it onto the side of existing operations. Left the same decision-makers in charge. Didn&#8217;t change how handoffs happened. Didn&#8217;t redefine what a &#8220;success metric&#8221; meant for a transformed process.</p><p>That&#8217;s &#8220;flawed enterprise integration.&#8221; The technology works. The data is fine. The problem is the organization isn&#8217;t designed to use it.</p><p>This explains why one of the starkest statistics in AI adoption research barely gets mentioned: <strong>67% of companies that bought pre-built AI solutions from vendors reported positive ROI, versus 33% that built internally.</strong> That gap isn&#8217;t because vendor solutions are technologically superior. It&#8217;s because vendors force organizational structure. They come in, audit your processes, demand you assign an owner, require you to define success upfront, and build governance into their onboarding. You have no choice but to integrate AI into your org design. Internal teams? 
They often just add another tool to the workflow.</p><div><hr></div><h2>High performers do five things differently</h2><p>McKinsey studied which companies actually capture value from AI and found a stark split. The top 6% aren&#8217;t smarter. They&#8217;re just organized differently.</p><p><strong>Workflow redesign:</strong> 55% of high performers redesign core workflows when deploying AI, compared to 20% of others. That&#8217;s 2.8x higher. They treat AI not as a tool to bolt on, but as a reason to rethink how work gets done. Instead of &#8220;let&#8217;s use AI to optimize what we&#8217;re already doing,&#8221; they ask &#8220;how should we do this differently?&#8221;</p><p><strong>Budget allocation:</strong> High performers allocate <strong>70% of their AI budget to people and process redesign</strong>, with 30% to technology. The rest of the industry does the opposite: 93% to technology, 7% to people.</p><p><strong>AI ambition:</strong> High performers are 3.6x more likely to pursue <strong>transformational ambition</strong> with AI rather than optimization. They&#8217;re not trying to shave 10% off a process. They&#8217;re trying to fundamentally change their competitive position.</p><p><strong>Digital budget concentration:</strong> High performers are 5x more likely to allocate 20% or more of their total digital budget to AI. They&#8217;re not dabbling. 
They&#8217;re committing organizational resources at a scale that forces integration.</p><p><strong>Governance clarity:</strong> Almost universally, high performers have explicitly defined who owns AI decisions, what success looks like, and how it connects to business metrics before they deploy.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6uE3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886a9981-0f36-43af-b547-b25dacd760a8_780x432.png"><img src="https://substackcdn.com/image/fetch/$s_!6uE3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886a9981-0f36-43af-b547-b25dacd760a8_780x432.png" width="780" height="432" alt="" loading="lazy"></a></figure></div><div><hr></div><blockquote><p><strong>If this analysis is useful, subscribe to get the next one.</strong> Every Tuesday I publish data-driven breakdowns on enterprise AI maturity, agent strategy, and operations. No hype, no vendor pitches.</p><p class="button-wrapper"><a class="button primary" href="https://marcosnewsletter.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><h2>The maturity stages: where most companies are stuck</h2><p>MIT&#8217;s Center for Information Systems Research created a four-stage maturity model for enterprise AI. 
Where is your organization?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZExH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105f557f-bbf6-4c75-a52e-4795d4797a9f_960x466.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZExH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105f557f-bbf6-4c75-a52e-4795d4797a9f_960x466.png 424w, https://substackcdn.com/image/fetch/$s_!ZExH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105f557f-bbf6-4c75-a52e-4795d4797a9f_960x466.png 848w, https://substackcdn.com/image/fetch/$s_!ZExH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105f557f-bbf6-4c75-a52e-4795d4797a9f_960x466.png 1272w, https://substackcdn.com/image/fetch/$s_!ZExH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105f557f-bbf6-4c75-a52e-4795d4797a9f_960x466.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZExH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105f557f-bbf6-4c75-a52e-4795d4797a9f_960x466.png" width="960" height="466" 
alt="" loading="lazy"></picture></div></a></figure></div><p>Notice the problem: <strong>62% of companies are in Stages 1 or 2.</strong> They&#8217;re below the performance baseline. The jump from Stage 2 to Stage 3 is where financial performance starts to recover. Yet one-third of companies never make it there. They stay in the &#8220;managed pilots&#8221; phase indefinitely, spending millions on tooling and getting nothing in return.</p><p>The 7% in Stage 4 aren&#8217;t there because they have better data or smarter engineers. They got there because they rebuilt their operating model around AI. They defined clear roles. They aligned incentives. They made decisions faster. 
They integrated AI into how work actually gets done.</p><div><hr></div><h2>The governance multiplier nobody talks about</h2><p>Google Cloud and the <a href="https://cloudsecurityalliance.org/press-releases/2025/12/18/csa-and-google-cloud-study-finds-governance-maturity-is-strongest-predictor-of-ai-readiness">Cloud Security Alliance</a> studied the relationship between governance maturity and AI adoption outcomes. They found that companies with <strong>comprehensive governance frameworks are 2x more likely to adopt agentic AI</strong> successfully (46% adoption vs 25% for companies with basic governance). They&#8217;re also far more likely to keep projects running for 3+ years: 45% do, versus 20% of everyone else.</p><p>The catch: only 26% of companies have comprehensive governance today. Most are flying blind. They deployed an LLM-powered chatbot somewhere, nobody documented the decision process, nobody owns the model&#8217;s performance, nobody has a way to shut it down if it breaks, and six months later it&#8217;s doing things nobody intended.</p><p><strong>Governance isn&#8217;t bureaucracy. It&#8217;s the system that lets you move fast safely.</strong> It&#8217;s what lets you deploy more, not less. High performers use governance to eliminate meetings, clarify decisions, and move projects from pilot to production in months instead of years.</p><div><hr></div><h2>The data quality trap: it&#8217;s a symptom, not the disease</h2><p>Let&#8217;s be direct about data quality, because it deserves an honest look.</p><p>Yes, 63% of enterprises say their data isn&#8217;t AI-ready. Yes, data quality is hard. Yes, your databases are probably messy. But here&#8217;s what&#8217;s really happening: companies with unclear ownership have no one responsible for managing data standards. Companies without governance have no mechanism to enforce data quality. 
Companies that haven&#8217;t redesigned workflows have no clear definition of what &#8220;ready&#8221; even means.</p><p>Fix the org design, and data quality improves.</p><p>Why? Because when someone actually owns AI outcomes (which forces them to own the data pipeline), when there&#8217;s governance that requires metadata and lineage documentation, when processes are redesigned around data rather than legacy workflows, data quality stops being a nice-to-have and becomes a required input.</p><p>The companies that fixed data quality first and expected organizational benefits got almost nothing. The ones that redesigned their org and treated data quality as part of that redesign actually succeeded. The causality runs in the opposite direction from the one vendors want you to believe.</p><div><hr></div><h2>The perception gap: you can&#8217;t fix what you can&#8217;t see</h2><p><a href="https://www.bcg.com/publications/2025/are-you-generating-value-from-ai-the-widening-gap">BCG found</a> that <strong>55% of organizations think they&#8217;re further along the AI maturity journey than they actually are.</strong> And it&#8217;s not off by a little. Half the room believes they&#8217;re at Stage 3 when they&#8217;re actually in Stage 2. Some believe they&#8217;re Stage 3 when they&#8217;re barely past Stage 1.</p><p><strong>How does this happen? Easy.</strong></p><ul><li><p>When you buy an enterprise AI platform, you see yourself as an &#8220;AI adopter.&#8221;</p></li><li><p>When you run a successful pilot, you see yourself as &#8220;deploying AI.&#8221;</p></li><li><p>When a team somewhere uses an LLM in production, you see yourself as &#8220;integrated.&#8221; </p></li></ul><p><em><strong>But none of that means your organization is designed for AI</strong>. It just means you have AI running somewhere.</em></p><p>The maturity stages aren&#8217;t about whether AI exists in your company. 
They&#8217;re about whether your organizational structure, governance, decision-making processes, and budget allocation are designed around AI. Most companies have the first without the second.</p><div><hr></div><h2>What to do Monday morning</h2><p><strong>If you lead operations:</strong> Run a maturity audit. Not with a vendor. With your own team. Where is your organization actually in this progression? Don&#8217;t estimate. Look at real metrics: how many workflows have been redesigned? How many projects have moved from pilot to production? How many people wake up on Monday with AI ownership in their job description? Be honest. Most companies discover they&#8217;re one stage behind where they thought.</p><p><strong>If you lead engineering:</strong> Assess the vendor paradox. If you&#8217;ve been building internally and getting nowhere, and vendors consistently outperform, that&#8217;s not a technical signal. That&#8217;s an organizational one. Consider whether buying a complete solution forces the structural change you need. Sometimes the answer is &#8220;yes.&#8221;</p><p><strong>If you lead AI/ML:</strong> Map your maturity stage honestly. Then identify which transition is blocking you. Most get stuck on Stage 2 to Stage 3 (the shift from &#8220;pilots&#8221; to &#8220;integrated operations&#8221;). That transition requires redesigning workflows, not upgrading models. Figure out which workflow redesign, if successful, would unlock the most value. That&#8217;s your first move.</p><p><strong>Three diagnostic questions for everyone:</strong></p><ol><li><p>Can three different people in your organization tell you who owns AI outcomes? If not, you don&#8217;t have ownership. You have sponsors.</p></li><li><p>If a major AI initiative failed tomorrow, how many people would feel it in their quarterly results? If fewer than five, it&#8217;s not integrated enough. It&#8217;s still a side project.</p></li><li><p>What percentage of your digital budget goes to AI? 
Compare that to what percentage of your strategy is about AI. If there&#8217;s a gap of more than 5%, you&#8217;re not committing organizational resources proportional to the bet.</p></li></ol><div><hr></div><p>The 95% failure rate isn&#8217;t a data problem or a model problem. The 6% that succeed aren&#8217;t smarter. They&#8217;re organized differently. They&#8217;ve redesigned their operating model around AI instead of bolting AI onto an existing one.</p><p>That&#8217;s hard. It requires admitting that the org charts don&#8217;t fit anymore. It requires clarifying ownership when that&#8217;s uncomfortable. It requires redesigning workflows instead of optimizing them. It requires moving budget from technology to people.</p><p>But it&#8217;s doable. And it&#8217;s the only path that actually generates value.</p><p>Stop blaming data. Start redesigning the organization.</p>]]></content:encoded></item><item><title><![CDATA[40% of Agentic AI Projects Will Be Canceled. Here’s Why Yours Is Probably One of Them.]]></title><description><![CDATA[48% of enterprises plan to add $2M+ to agentic AI budgets this year. 8.6% have agents in production. 40% of projects will be canceled by 2027. 
The math doesn&#8217;t work - and here&#8217;s exactly where it break]]></description><link>https://marcosnewsletter.substack.com/p/40-of-agentic-ai-projects-will-be</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/40-of-agentic-ai-projects-will-be</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Tue, 17 Mar 2026 12:05:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PuyW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc08738ed-8005-4cda-b520-8c6cf528db0a_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PuyW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc08738ed-8005-4cda-b520-8c6cf528db0a_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PuyW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc08738ed-8005-4cda-b520-8c6cf528db0a_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!PuyW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc08738ed-8005-4cda-b520-8c6cf528db0a_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!PuyW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc08738ed-8005-4cda-b520-8c6cf528db0a_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!PuyW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc08738ed-8005-4cda-b520-8c6cf528db0a_2752x1536.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PuyW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc08738ed-8005-4cda-b520-8c6cf528db0a_2752x1536.png" width="1456" height="813" alt="" fetchpriority="high"></picture></div></a></figure></div><p>48% of enterprises plan to add at least <a href="https://axis-intelligence.com/agentic-ai-adoption-statistics-2026/">$2M to agentic AI budgets</a> this year. Executive interest is at record levels. Every vendor pitch now includes the word &#8220;agent.&#8221;</p><p>The production reality is different. 
<a href="https://globenewswire.com/news-release/2026/03/05/3250181/31982/en/Only-7-of-Enterprises-Say-Their-Data-Is-Completely-Ready-for-AI-According-to-New-Report-from-Cloudera-and-Harvard-Business-Review-Analytic-Services.html">8.6% of companies have production-ready AI agents</a>, according to a Cloudera and Harvard Business Review Analytic Services survey of 120,000+ enterprise respondents. Not 8.6% experimenting, not 8.6% piloting. In production, processing real data, making real decisions.</p><p>Gartner predicts <a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027">40% of agentic AI projects will be canceled by 2027</a>. The three reasons are specific: escalating costs, unclear business value, and inadequate risk controls.</p><p>Those aren&#8217;t vague failure modes. They&#8217;re structural problems with names, causes, and observable patterns. Here&#8217;s what they look like up close.</p><div><hr></div><h2>Failure mode 1: The cost explosion nobody budgeted for</h2><p>Enterprise teams budget for AI agents the way they budget for SaaS subscriptions: licensing fees, infrastructure, maybe some integration work. The actual cost structure of agentic AI is fundamentally different.</p><p>An AI agent doesn&#8217;t make one call to a model. It reasons iteratively. It calls tools. It checks its own work, revises, and retries. A simple customer query that takes 100 tokens in a single-pass chatbot <a href="https://medium.com/@ajayverma23/optimizing-genai-and-agentic-ai-balancing-cost-quality-and-latency-in-production-0242ef84a306">expands to 2,000-5,000 tokens</a> when an agent processes it with tool calls, chain-of-thought reasoning, and output validation. At scale, monthly token bills dwarf infrastructure spend.</p><p>The growth is non-linear. 
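</p>

<p>A back-of-envelope model makes that non-linearity concrete. This is a sketch, not a benchmark: every token count and the price below are illustrative assumptions, to be replaced with your own workload&#8217;s numbers.</p>

```python
# Back-of-envelope agent cost model. Every number here is an
# illustrative assumption, not a benchmark.

PRICE_PER_1K_TOKENS = 0.01  # assumed blended input/output price, USD

def task_cost(tokens_per_pass: int, reasoning_cycles: int = 1,
              tool_calls: int = 0, tool_call_tokens: int = 500) -> float:
    """Cost of one task: each reasoning cycle re-processes context,
    and each tool call adds its own request/response tokens."""
    tokens = tokens_per_pass * reasoning_cycles + tool_calls * tool_call_tokens
    return tokens / 1000 * PRICE_PER_1K_TOKENS

# Single-pass chatbot: one short exchange, ~100 tokens.
chatbot = task_cost(tokens_per_pass=100)

# Agent: the same query with chain-of-thought, tool calls, and validation.
agent = task_cost(tokens_per_pass=1000, reasoning_cycles=4, tool_calls=2)

print(f"per task: ${chatbot:.4f} vs ${agent:.4f} ({agent / chatbot:.0f}x)")
print(f"per day at 10,000 tasks: ${chatbot * 10_000:.0f} vs ${agent * 10_000:.0f}")
```

<p>Multiply the agent numbers by retries, failed attempts, and growing context windows, and the gap widens further. 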
<a href="https://online.stevens.edu/blog/hidden-economics-ai-agents-token-costs-latency/">A reasoning loop running 10 cycles consumes 50x the tokens</a> of a single linear pass. This isn&#8217;t a scaling problem you can forecast from a pilot. Pilots run on curated inputs, happy paths, and manageable volumes. Production runs on messy data, edge cases, and thousands of concurrent users. The cost curve bends sharply upward at exactly the point where you&#8217;ve already committed to the architecture.</p><p>Research from <a href="https://galileo.ai/blog/hidden-cost-of-agentic-ai">Galileo AI</a> puts the number at <strong>$5-8 per task</strong> for unconstrained agents solving software engineering issues. That&#8217;s per task, not per user per month. For a customer support operation handling 10,000 tickets per day, the economics get painful fast.</p><p><a href="https://www.eccouncil.org/cybersecurity-exchange/ai-program-manager/enterprise-ai-budget-traps/">85% of organizations misestimate AI project costs by more than 10%</a>. Nearly 25% miss forecasts by more than 50%. For agentic systems specifically, <a href="https://hypersense-software.com/blog/2026/01/12/hidden-costs-ai-agent-development/">the typical budget underestimation is 40-60%</a> when you account for the full total cost of ownership: monitoring, guardrails, human oversight, security maintenance, and ongoing tuning.</p><p>The pattern is consistent. Enterprises apply GenAI cost assumptions to agentic systems - simple LLM pricing models - when the actual cost includes orchestrators, governance layers, multiple agents running in tandem, and operational overhead that scales faster than usage. By the time the real costs surface, the project is already deep in production and the sunk-cost fallacy kicks in.</p><div><hr></div><h2>Failure mode 2: Agent-washing and the vendor trust gap</h2><p>The agentic AI market is experiencing a labeling crisis. 
Every automation tool, chatbot, and workflow builder now calls itself an &#8220;agent.&#8221; The term has become meaningless as a product descriptor.</p><p>Of the thousands of vendors now marketing agentic AI capabilities, <a href="https://www.outreach.io/resources/blog/agent-washing-ai-projects-fail-guide">approximately 130 are building actual agentic systems</a> - tools that can reason, adapt to novel situations, maintain state across interactions, and make autonomous decisions within defined boundaries. The rest are rebranding scripted automation, RPA flows, and basic chatbots. Analysts have started calling it &#8220;agent-washing.&#8221;</p><p>The adoption numbers reflect this confusion. <a href="https://www.deloitte.com/us/en/insights/topics/technology-management/tech-trends/2026/agentic-ai-strategy.html">30% of enterprises say they&#8217;re exploring agentic AI. 38% report they&#8217;re piloting. But only 11% have agents actively running in production</a>. The gap between &#8220;we&#8217;re piloting&#8221; and &#8220;it&#8217;s in production&#8221; is the largest in any enterprise technology category right now. For comparison, cloud adoption saw a 2-3 year gap. Agentic AI may see 4-5 years, because the infrastructure requirements are harder to meet.</p><p>The consequence for enterprise buyers is expensive. Teams purchase tools marketed as &#8220;agentic,&#8221; build workflows around them, discover the tools can&#8217;t actually handle production complexity, and either rebuild or abandon. 
The CIO team funded an &#8220;agentic AI initiative.&#8221; What they got was a chatbot with a nicer interface.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><strong>If this analysis is useful, subscribe to get the next one.</strong> Every Tuesday I publish data-driven breakdowns on enterprise AI maturity, agent strategy, and operations. No hype, no vendor pitches.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Failure mode 3: The governance vacuum</h2><p>This is the failure mode that kills projects after they&#8217;re built rather than before. And it&#8217;s the one most enterprises are least prepared for.</p><p><a href="https://secureprivacy.ai/blog/ai-risk-compliance-2026/">78% of enterprises now embed AI in their strategic plans</a>. Only 19% have fully implemented governance frameworks. That&#8217;s a 4x gap between ambition and controls - and it widens for agentic systems, which require governance architectures that most frameworks haven&#8217;t even defined yet.</p><p>The numbers at the operational level are worse. <a href="https://hackernoon.com/agentic-ai-governance-frameworks-2026-risks-oversight-and-emerging-standards">64% of organizations experimenting with agentic AI</a> have no formal agent monitoring or escalation protocols. 
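</p>

<p>&#8220;Formal escalation protocol&#8221; sounds heavier than it is. The minimum viable version is a gate that decides, per proposed action, whether the agent may proceed on its own. A sketch in Python, where the thresholds, categories, and Action fields are hypothetical placeholders, not a standard:</p>

```python
# Minimal escalation gate for agent actions. The thresholds, action
# categories, and Action fields are hypothetical placeholders.
from dataclasses import dataclass

IRREVERSIBLE = {"refund", "contract_change", "data_deletion"}

@dataclass
class Action:
    category: str         # e.g. "refund", "draft_reply"
    confidence: float     # agent's self-reported confidence, 0..1
    dollar_impact: float  # estimated financial exposure

def route(action: Action) -> str:
    """Decide 'execute', 'review', or 'block' for one proposed action."""
    if action.category in IRREVERSIBLE and action.dollar_impact > 1_000:
        return "block"    # never autonomous at this exposure
    if action.confidence < 0.8 or action.dollar_impact > 100:
        return "review"   # queue for a named human owner
    return "execute"      # low-risk and reversible: proceed

print(route(Action("draft_reply", confidence=0.95, dollar_impact=0.0)))
print(route(Action("refund", confidence=0.99, dollar_impact=5_000.0)))
```

<p>The specific thresholds don&#8217;t matter. What matters is that the rule for when a human gets involved is written down, testable, and owned by someone before the agent touches production.</p>

<p>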
Agents are making decisions in test environments with no structure for what happens when those decisions affect real customers, real revenue, or real compliance obligations.</p><p><a href="https://globenewswire.com/news-release/2026/03/05/3250181/31982/en/Only-7-of-Enterprises-Say-Their-Data-Is-Completely-Ready-for-AI-According-to-New-Report-from-Cloudera-and-Harvard-Business-Review-Analytic-Services.html">Only 7% of enterprises report their data is completely ready for AI adoption</a>. 27% say their data is not ready at all. Agents need reliable, structured, governed data to make good decisions. When they get bad data, they don&#8217;t just produce bad output - they take bad actions. At enterprise scale, a single agent making autonomous decisions on corrupted data can cascade through downstream systems before anyone notices.</p><p>The incidents are already happening. <a href="https://www.helpnetsecurity.com/2026/03/03/enterprise-ai-agent-security-2026/">64% of companies with $1B+ annual revenue have lost more than $1M to AI-related failures</a>. These aren&#8217;t hypothetical risks from research papers. They&#8217;re production incidents where agents with insufficient guardrails took actions they shouldn&#8217;t have, accessed data they shouldn&#8217;t have, or made decisions that nobody had designed an override for.</p><p>The regulatory environment is catching up, but enterprises aren&#8217;t ready for it. The <a href="https://www.venable.com/insights/publications/2026/02/agentic-ai-is-here-legal-compliance-and-governance">EU AI Act&#8217;s high-risk compliance deadline is August 2, 2026</a>. The <a href="https://www.mayerbrown.com/en/insights/publications/2026/02/governance-of-agentic-artificial-intelligence-systems">EU Product Liability Directive</a>, with its implementation deadline of December 9, 2026, explicitly classifies software and AI systems as &#8220;products&#8221; subject to strict liability for defects. 
When an agent fails and causes damage, someone is legally responsible. Most enterprises haven&#8217;t determined who that is.</p><div><hr></div><h2>What the 6% do differently</h2><p>McKinsey&#8217;s State of AI data identifies a small group of &#8220;AI high performers&#8221; - organizations attributing 5% or more of EBIT to AI. They represent roughly 6% of the companies surveyed. Their practices are distinct and measurable.</p><p><strong>They start with governance, not technology.</strong> A <a href="https://cloudsecurityalliance.org/press-releases/2025/12/18/csa-and-google-cloud-study-finds-governance-maturity-is-strongest-predictor-of-ai-readiness">Cloud Security Alliance and Google Cloud study</a> found that governance maturity is the single strongest predictor of AI readiness. Organizations with comprehensive governance policies are <a href="https://www.dataversity.net/articles/building-a-practical-framework-for-ai-governance-maturity-in-the-enterprise/">2x more likely to adopt agentic AI</a> than those with partial guidelines, and nearly 4x more likely than those with developing policies. The adoption rates are striking: 46% early adoption among companies with comprehensive governance, 25% with partial guidelines, 12% with developing policies.</p><p>This reverses the common sequence. Most enterprises buy tools first and build governance later. The 6% build governance first and buy tools that fit within it.</p><p><strong>They invest in people and processes, not just software.</strong> <a href="https://blog.exceeds.ai/ai-budget-allocation-tracking/">High-performing organizations allocate roughly 70% of AI budgets to people, processes, and organizational readiness</a> - not tools. 40-50% goes specifically to talent and training in early stages. 
The typical enterprise inverts this ratio, spending 60-70% on software and infrastructure and hoping the organizational change follows.</p><p><strong>They pursue transformational change, not efficiency.</strong> McKinsey&#8217;s data shows high performers are <a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">3x more likely to pursue transformational change</a> rather than incremental efficiency gains. They don&#8217;t ask &#8220;how can AI make this process faster?&#8221; They ask &#8220;should this process exist in its current form?&#8221; That question leads to workflow redesign, which is the single strongest correlate with AI-driven EBIT impact.</p><p><strong>They build confidence through structure.</strong> Organizations with mature governance report <a href="https://cloudsecurityalliance.org/blog/2025/12/18/ai-security-governance-your-maturity-multiplier">48% confidence in protecting AI systems</a>, compared to 16-23% for those with partial or developing frameworks. This 2-3x confidence gap matters because it determines whether leadership approves production deployment or keeps projects in permanent pilot mode. Governance doesn&#8217;t slow you down. It&#8217;s the prerequisite for moving fast with institutional support.</p><div><hr></div><h2>The regulatory clock most enterprises ignore</h2><p>The governance vacuum isn&#8217;t just an operational risk. It&#8217;s becoming a legal one on a specific timeline.</p><p><strong>August 2, 2026</strong>: EU AI Act high-risk compliance deadline. Organizations deploying AI in areas classified as high-risk - hiring, credit scoring, law enforcement, critical infrastructure - must have conformity assessments, risk management systems, and human oversight in place.</p><p><strong>December 9, 2026</strong>: EU Product Liability Directive implementation deadline. 
AI systems are explicitly classified as &#8220;products.&#8221; Defects that cause damage trigger strict liability - the standard applied to manufacturing defects, not the negligence standard typically applied to software.</p><p><strong>Already active</strong>: The FTC&#8217;s <a href="https://secureprivacy.ai/blog/ai-risk-compliance-2026/">Operation AI Comply</a> is targeting deceptive AI marketing claims. If your agent doesn&#8217;t do what you tell customers it does, enforcement is already underway.</p><p>In the U.S., regulatory fragmentation persists - no unified federal framework, with the FTC, NIST, and Department of Commerce interpreting AI within existing mandates. <a href="https://www.venable.com/insights/publications/2026/02/agentic-ai-is-here-legal-compliance-and-governance">Colorado&#8217;s AI Act takes effect June 30, 2026</a>. Singapore released a <a href="https://hackernoon.com/agentic-ai-governance-frameworks-2026-risks-oversight-and-emerging-standards">draft governance framework for agentic AI</a> at Davos in January.</p><p>The direction is clear even where the specifics vary: regulation is converging on accountability for autonomous AI systems. Enterprises that haven&#8217;t built governance architectures by mid-2026 will be retrofitting under deadline pressure - the most expensive and error-prone way to comply.</p><div><hr></div><h2>The honest counterpoint</h2><p>The failure narrative is real, but so are the acceleration signals.</p><p>Production agent deployment <a href="https://codewave.com/insights/ai-enterprise-adoption-2026/">nearly doubled in four months</a> - from 7.2% to 13.2% between August and December 2025. Infrastructure costs are falling: NVIDIA&#8217;s Blackwell architecture enables significant inference cost reductions, and open-source models are driving token costs down. 
<a href="https://www.pwc.com/us/en/tech-effect/ai-analytics/ai-agent-survey.html">66% of organizations adopting agents report measurable productivity value</a>, according to PwC.</p><p>These signals matter, but they need context. &#8220;Measurable productivity value&#8221; is a lower bar than P&amp;L impact. Inference cost reductions don&#8217;t account for the full operational cost of production agents - orchestration, monitoring, guardrails, and human oversight. And the deployment acceleration, while encouraging, is starting from a base of 7.2%. Doubling to 13.2% still leaves 87% of enterprises without production agents.</p><p>The optimistic case is that the infrastructure is maturing and costs are declining. The realistic case is that infrastructure maturity and cost reduction are necessary conditions, not sufficient ones. Without governance, organizational readiness, and honest assessment of total costs, cheaper tools just mean cheaper failures.</p><div><hr></div><h2>Diagnosing your position</h2><p>Five questions that separate the 8.6% from the 91.4%:</p><p><strong>1. How many of your AI agents are in production, processing real data, making real decisions?</strong> Not in a sandbox, not with curated inputs, not monitored by the team that built them. In production, with production data, under normal business operations. If the answer is zero, your agentic AI initiative is still an experiment regardless of how much you&#8217;ve invested.</p><p><strong>2. Can you quantify your actual per-task cost for any deployed agent?</strong> Not the LLM API pricing from the vendor&#8217;s website. The actual cost: tokens consumed per task including reasoning loops, tool calls, retries, and failed attempts. Infrastructure. Monitoring. Human review time. If you don&#8217;t know this number, you can&#8217;t forecast whether the system is economically viable at scale.</p><p><strong>3. 
Do you have a governance framework that exists independently of any specific AI tool or vendor?</strong> Validation rules, escalation policies, audit trails, human-in-the-loop patterns. If your governance is embedded in a vendor&#8217;s platform, you don&#8217;t have governance - you have a feature that disappears when you switch vendors.</p><p><strong>4. Who is accountable when an agent makes a wrong decision in production?</strong> Not &#8220;who built the agent&#8221; or &#8220;who approved the project.&#8221; Who is legally and operationally accountable for the downstream impact? If the answer takes more than five seconds, you have a liability gap.</p><p><strong>5. What&#8217;s your timeline for EU AI Act compliance?</strong> If you&#8217;re deploying agents in high-risk categories and don&#8217;t have a conformity assessment plan, you&#8217;re five months from a regulatory deadline with no roadmap.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;76e93c4a-b09b-4f88-b819-288c8005baa9&quot;,&quot;duration&quot;:null}"></div><p></p><div><hr></div><h2>What to do Monday morning</h2><p><strong>If you lead operations:</strong> Pull the actual cost data on your agent deployments. Not the budget, not the forecast - the actual spend. Compare per-task costs to the value each task delivers. If you can&#8217;t do this calculation, that&#8217;s your first problem. Build the measurement before you scale the system.</p><p><strong>If you lead engineering:</strong> Audit your vendor stack for agent-washing. For each tool marketed as &#8220;agentic,&#8221; test whether it can handle a novel situation it wasn&#8217;t explicitly designed for. Can it reason about an edge case? Can it recover from an unexpected input? If the answer is no, you have an automation tool, not an agent. 
Plan accordingly.</p><p><strong>If you lead AI governance:</strong> Map the gap between your current governance framework and the EU AI Act&#8217;s high-risk requirements. Then assess whether your framework covers agentic-specific risks: autonomous decision-making, multi-agent coordination, cascading failures, data access scope. If it doesn&#8217;t, you have five months to close that gap.</p><p>The pattern in the data is unambiguous. Enterprises that build governance before deploying agents, invest in organizational readiness before buying tools, and rigorously assess costs before scaling are the ones landing in the 6%. Everyone else is funding the 40% cancellation rate.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jQOl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa810e8ec-8a5d-4efc-ad2a-2cf52559eab7_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jQOl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa810e8ec-8a5d-4efc-ad2a-2cf52559eab7_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!jQOl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa810e8ec-8a5d-4efc-ad2a-2cf52559eab7_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!jQOl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa810e8ec-8a5d-4efc-ad2a-2cf52559eab7_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!jQOl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa810e8ec-8a5d-4efc-ad2a-2cf52559eab7_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jQOl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa810e8ec-8a5d-4efc-ad2a-2cf52559eab7_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a810e8ec-8a5d-4efc-ad2a-2cf52559eab7_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:719852,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/190152431?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa810e8ec-8a5d-4efc-ad2a-2cf52559eab7_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jQOl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa810e8ec-8a5d-4efc-ad2a-2cf52559eab7_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!jQOl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa810e8ec-8a5d-4efc-ad2a-2cf52559eab7_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!jQOl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa810e8ec-8a5d-4efc-ad2a-2cf52559eab7_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!jQOl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa810e8ec-8a5d-4efc-ad2a-2cf52559eab7_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[Your AI Roadmap Is a Single Line. 
It Should Be Three - (The Three Horizons of AI Adoption)]]></title><description><![CDATA[Most enterprises plan AI as a single track: experiment, then scale. The companies capturing value run three horizons in parallel.]]></description><link>https://marcosnewsletter.substack.com/p/your-ai-roadmap-is-a-single-line</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/your-ai-roadmap-is-a-single-line</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Mon, 09 Mar 2026 06:02:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!byG3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c971c1-6e87-4894-9a51-dea40e090d19_911x518.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!byG3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c971c1-6e87-4894-9a51-dea40e090d19_911x518.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!byG3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c971c1-6e87-4894-9a51-dea40e090d19_911x518.png 424w, https://substackcdn.com/image/fetch/$s_!byG3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c971c1-6e87-4894-9a51-dea40e090d19_911x518.png 848w, https://substackcdn.com/image/fetch/$s_!byG3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c971c1-6e87-4894-9a51-dea40e090d19_911x518.png 1272w, 
https://substackcdn.com/image/fetch/$s_!byG3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c971c1-6e87-4894-9a51-dea40e090d19_911x518.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!byG3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c971c1-6e87-4894-9a51-dea40e090d19_911x518.png" width="911" height="518" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38c971c1-6e87-4894-9a51-dea40e090d19_911x518.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:518,&quot;width&quot;:911,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:285752,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/190126609?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c971c1-6e87-4894-9a51-dea40e090d19_911x518.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!byG3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c971c1-6e87-4894-9a51-dea40e090d19_911x518.png 424w, https://substackcdn.com/image/fetch/$s_!byG3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c971c1-6e87-4894-9a51-dea40e090d19_911x518.png 848w, 
https://substackcdn.com/image/fetch/$s_!byG3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c971c1-6e87-4894-9a51-dea40e090d19_911x518.png 1272w, https://substackcdn.com/image/fetch/$s_!byG3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c971c1-6e87-4894-9a51-dea40e090d19_911x518.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In a <a 
href="https://open.substack.com/pub/marcosnewsletter/p/from-chatgpt-to-operational-ai-the?r=e6ccc&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">previous post</a>, I described the gap between passive AI (employees using ChatGPT) and operational AI (AI embedded in business processes with scaffolding). The scaffold - input constraints, decision rules, output standardization, human-in-the-loop, monitoring - is what separates a working demo from a production system.</p><p>That post answered the question &#8220;what do I need to build?&#8221; This one answers &#8220;when do I build it - and what comes after?&#8221;</p><p>The short answer: you need three parallel tracks, not one sequential plan. Most enterprises are running a single-horizon AI strategy. The 6% capturing real value are running three.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;34c7565f-bfa3-4559-a51c-937551c420fc&quot;,&quot;duration&quot;:null}"></div><div><hr></div><h2>The 88-6 gap</h2><p><a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">McKinsey&#8217;s 2025 State of AI</a> reports that 88% of organizations use AI in at least one function. Generative AI adoption specifically jumped from 33% to 72% in a single year. By any adoption metric, enterprise AI is mainstream.</p><p>Value capture is not. Only 6% of organizations qualify as &#8220;AI high performers&#8221; - those attributing 5% or more of EBIT to AI. The other 82% have adopted the technology without capturing the economics.</p><p>The usual explanation is execution: companies aren&#8217;t implementing well enough. But the <a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">McKinsey data</a> points to something more structural. Among high performers, the strongest differentiator isn&#8217;t model sophistication or talent density. 
It&#8217;s <strong>workflow redesign</strong> - 55% of high performers fundamentally redesign business processes around AI, compared to 20% of everyone else. High performers are also 5x more likely to allocate 20%+ of their digital budget to AI.</p><p>These aren&#8217;t companies that experiment better. They plan differently. They treat AI not as a tool to layer onto existing processes but as infrastructure that requires new processes, new roles, and new governance - built across multiple time horizons simultaneously.</p><div><hr></div><h2>Three horizons, defined</h2><p>McKinsey&#8217;s <a href="https://www.itonics-innovation.com/blog/702010-rule-of-innovation">Three Horizons framework</a> was originally designed for innovation portfolio management. It maps well to enterprise AI because the fundamental challenge is the same: how to allocate resources across current operations, near-term transformation, and long-term disruption.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Marco's Substack is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Applied to AI adoption:</p><p><strong>Horizon 1: Experimentation.</strong> Employees use AI tools for individual productivity. ChatGPT, Copilot, Gemini. Pilots run in sandboxes with curated data. 
Value is real but localized - faster emails, quicker code reviews, better first drafts. No integration with core systems. No governance beyond acceptable use policies.</p><p><strong>Horizon 2: Operational AI.</strong> AI is embedded in business processes. It receives structured inputs, applies business rules, produces standardized outputs, and connects to downstream systems. Humans validate at checkpoints, not at every step. This is the scaffold layer - the architecture that turns ad hoc usage into repeatable, governed workflows.</p><p><strong>Horizon 3: Agentic Systems at Scale.</strong> Multiple AI agents coordinate across domains. An orchestrator delegates tasks to specialist agents. Agents monitor other agents (guardian agents). The system handles cross-functional workflows autonomously, with human oversight at strategic decision points rather than operational ones.</p><table><thead><tr><th></th><th>H1: Experimentation</th><th>H2: Operational AI</th><th>H3: Agentic Systems</th></tr></thead><tbody><tr><td><strong>AI role</strong></td><td>Utility (human decides)</td><td>System component (AI acts within constraints)</td><td>Autonomous coordinator (AI orchestrates)</td></tr><tr><td><strong>Integration</strong></td><td>None - browser tab</td><td>Embedded in specific workflows</td><td>Cross-system, cross-domain</td></tr><tr><td><strong>Data requirements</strong></td><td>Whatever the user pastes in</td><td>Structured pipelines, validated inputs</td><td>Unified data fabric, real-time access</td></tr><tr><td><strong>Governance</strong></td><td>Acceptable use policy</td><td>Business rules, HITL validation, audit trails</td><td>Guardian agents, policy enforcement, observability</td></tr><tr><td><strong>People</strong></td><td>Individual users</td><td>Process owners, AI operators</td><td>Agent orchestrators, AI governance specialists</td></tr><tr><td><strong>Typical timeline</strong></td><td>0-6 months</td><td>6-18 months</td><td>18-36 months</td></tr></tbody></table><p>The horizons aren&#8217;t sequential stages where you finish one before starting the next. They&#8217;re parallel investment tracks. The infrastructure you build in H1 and H2 becomes the foundation for H3. 
Skip or delay H2 investment, and H3 becomes impossible to execute on timeline.</p><div><hr></div><h2>Where everyone actually is</h2><p>The distribution is heavily skewed toward H1. According to <a href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025">Gartner&#8217;s 2025 AI adoption data</a>, 62% of organizations are experimenting with AI agents, 23% are scaling them in at least one function, and <a href="https://www.pwc.com/us/en/industries/tmt/library/trust-and-safety-outlook/rise-and-risks-of-agentic-ai.html">only 11% have agentic AI in production</a>.</p><p>Most enterprise &#8220;AI roadmaps&#8221; are really H1 extension plans: more use cases for ChatGPT, more pilots in more departments, more prompt engineering workshops. The roadmap moves laterally across the organization rather than vertically through the horizons. Companies add breadth (more experiments) without adding depth (operational infrastructure).</p><p><a href="https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/">MIT&#8217;s 2025 research</a> puts numbers on the cost of this approach: 95% of AI pilots delivered no measurable P&amp;L impact. 88% never reached production. The gap between a working pilot and an operational system is precisely the H1-to-H2 transition - and most organizations have no plan for crossing it.</p><p>The failure modes are predictable. Data quality issues surface when pilots meet real systems (43% of failures). Integration complexity is underestimated by 2-3x (31% of technical failures). Use cases are chosen for technical novelty rather than business value (43% of strategic failures). These aren&#8217;t pilot problems. 
They&#8217;re infrastructure problems that only become visible when you try to move from H1 to H2.</p><div><hr></div><h2>The infrastructure dependency chain</h2><p>Here&#8217;s why the horizons can&#8217;t be sequential: what you build in H2 is a prerequisite for H3, and what you build in H1 should be designed with H2 in mind.</p><p><a href="https://mitsloan.mit.edu/ideas-made-to-matter/5-heavy-lifts-deploying-ai-agents">MIT Sloan research</a> found that 80% of the effort in deploying AI agents is consumed by data engineering, stakeholder alignment, and governance - not model development. The model is the easy part. The infrastructure is the hard part. And infrastructure takes time to build, test, and harden.</p><p>Consider what H3 (agentic systems) requires that can&#8217;t be retrofitted:</p><p><strong>Governed data access.</strong> Agents need real-time access to data across systems. If your H1 pilots run on manually curated datasets and your production data lives in silos, there&#8217;s no shortcut. <a href="https://www.informatica.com/resources/articles/enterprise-ai-agent-engineering.html">90% of organizations with AI workflows in production</a> use an integration platform. The other 10% built their own. Either way, it takes 12-18 months to do well.</p><p><strong>Governance frameworks.</strong> An employee using ChatGPT who gets a bad answer wastes five minutes. An autonomous agent with bad data executes dozens of flawed decisions before anyone notices. The governance infrastructure - validation rules, escalation policies, audit trails, human-in-the-loop patterns - must exist before agents get decision authority. You build this in H2, not H3.</p><p><strong>Observability and monitoring.</strong> <a href="https://www.informatica.com/resources/articles/enterprise-ai-agent-engineering.html">91% of AI models experience quality degradation over time</a>. For H3 systems where agents act autonomously, degradation detection must be automated. 
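</p><p>To make &#8220;automated&#8221; concrete, here is a minimal sketch - every name, threshold, and window size is an illustrative assumption, not a cited standard: track each agent&#8217;s rolling task success rate and escalate to a human the moment it drifts below the baseline accepted at deployment.</p>

```python
from collections import deque

class DegradationMonitor:
    """Flags an agent whose rolling success rate drifts below its baseline.

    Illustrative sketch: the class name, window size, and 5-point tolerance
    are assumptions for this example, not part of any cited framework.
    """

    def __init__(self, baseline: float, window: int = 100, tolerance: float = 0.05):
        self.baseline = baseline              # success rate accepted at deployment
        self.tolerance = tolerance            # allowed drop before escalation
        self.outcomes = deque(maxlen=window)  # rolling window of task outcomes

    def record(self, success: bool) -> bool:
        """Record one task outcome; return True when degradation is detected."""
        self.outcomes.append(1 if success else 0)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False                      # not enough history to judge yet
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate < self.baseline - self.tolerance
```

<p>In production a check like this would run per agent and per task type, and a detection would page a human rather than let the agent keep executing.</p><p>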
<a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-11-gartner-predicts-that-guardian-agents-will-capture-10-15-percent-of-the-agentic-ai-market-by-2030">Gartner predicts</a> that guardian agents - AI systems that monitor, challenge, and contain other agents - will capture 10-15% of the agentic AI market by 2030. But the monitoring patterns start in H2, with simpler operational systems.</p><p><strong>API-led architecture.</strong> Traditional enterprise systems were designed for deterministic, synchronous workflows. Agents need asynchronous communication, state management across boundaries, and conflict resolution. Event-driven architectures and API-first design are H2 investments that become H3 requirements.</p><p>Each of these takes time. Companies that start building them only when they&#8217;re &#8220;ready for H3&#8221; are already 12-24 months behind.</p><div><hr></div><h2>The 70-20-10 allocation</h2><p>McKinsey&#8217;s original Three Horizons framework suggests a resource allocation of roughly 70% to H1 (core operations), 20% to H2 (adjacent growth), and 10% to H3 (transformational bets). Applied to AI investment, the ratios shift - but the principle holds.</p><p>Most enterprises are running something closer to 95-5-0: nearly all resources on experimentation and pilots, minimal investment in operational infrastructure, nothing on Agentic architecture.</p><p>High performers, based on the <a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">McKinsey data</a>, allocate differently. They spend 20%+ of digital budgets on AI (5x the median), and they direct that spending toward workflow redesign and integration - not more pilots. A reasonable allocation for a company serious about all three horizons might look like:</p><p><strong>H1 (50-60%): Keep experimenting, but with structure.</strong> Run pilots that are designed to test H2 infrastructure, not just model capabilities. 
Every pilot should answer: &#8220;What would it take to put this in production?&#8221; If the answer is &#8220;rebuild everything,&#8221; the pilot is testing the wrong thing.</p><p><strong>H2 (25-35%): Build the scaffold.</strong> Data pipelines, governance frameworks, HITL patterns, API connectivity, monitoring. This is the unsexy work that makes everything else possible. Fund it like infrastructure, not like innovation.</p><p><strong>H3 (10-15%): Start small, start now.</strong> Prototype multi-agent patterns. Evaluate orchestration frameworks. Build a guardian agent for one high-risk process. You&#8217;re not deploying at scale - you&#8217;re building institutional knowledge about what H3 will require.</p><p>The key insight from the high-performer data isn&#8217;t that they spend more on AI. It&#8217;s that they spend differently - investing in the operational layers that connect experimentation to business impact.</p><div><hr></div><h2>2026 is the inflection year</h2><p>Multiple signals converge on 2026 as a watershed for enterprise AI.</p><p><a href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025">Gartner predicts</a> that 40% of enterprise applications will feature task-specific AI agents by 2026, up from less than 5% in 2025. The <a href="https://www.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2026/ai-agent-orchestration.html">autonomous AI agent market</a> could reach $8.5 billion this year. 
And <a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027">Gartner separately predicts</a> that 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear business value, or inadequate risk controls.</p><p>Read those cancellation drivers: escalating costs (no infrastructure planning), unclear value (no workflow redesign), inadequate risk controls (no governance). They&#8217;re symptoms of companies jumping to H3 without building H2.</p><p>The deadline isn&#8217;t technical - it&#8217;s financial. CFOs have funded two years of experimentation. <a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">92% of companies</a> plan to increase AI investment over the next three years. But the patience for indefinite pilots is ending. Companies that haven&#8217;t moved from experimentation to operations by late 2026 will face pressure to justify continued AI spending against zero P&amp;L impact.</p><p>A caveat worth noting: much of what gets labeled &#8220;Agentic&#8221; in 2026 will be supervised AI operating within tight constraints - closer to advanced H2 than true H3 autonomy. <a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027">Gartner estimates</a> that only about 130 of thousands of &#8220;Agentic AI vendors&#8221; offer real Agentic capabilities. The rest is rebranding - what analysts are calling &#8220;agent washing.&#8221; This doesn&#8217;t change the planning framework. If anything, it reinforces it: the companies that built real operational infrastructure will be positioned to adopt real agentic capabilities. 
The ones chasing vendor demos will discover they bought H1 tools dressed up as H3.</p><div><hr></div><h2>What to do Monday morning</h2><p>Three diagnostic questions:</p><ol><li><p><strong>What percentage of your AI usage is in production, processing real data, integrated into real workflows?</strong> If less than 15%, your roadmap is an H1 plan regardless of what it says.</p></li><li><p><strong>Do you have a governance framework - validation rules, escalation policies, audit trails - that exists independently of any specific AI tool?</strong> If not, you have no H2 foundation. Every Agentic project you start will need to build this from scratch.</p></li><li><p><strong>Who owns the H2-to-H3 transition?</strong> Not &#8220;who sponsors AI&#8221; or &#8220;who runs pilots.&#8221; Who is responsible for the operational infrastructure that connects experimentation to autonomous systems? If that person doesn&#8217;t exist, the transition won&#8217;t happen.</p></li></ol><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/p/your-ai-roadmap-is-a-single-line/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://marcosnewsletter.substack.com/p/your-ai-roadmap-is-a-single-line/comments"><span>Leave a comment</span></a></p><p>Three actions:</p><p><strong>If you lead operations:</strong> Audit your AI pilots for production-readiness. For each pilot, document what&#8217;s missing: data pipeline, governance, integration, monitoring. That gap list is your H2 investment plan.</p><p><strong>If you lead engineering:</strong> Design your next integration project with agent compatibility in mind. Event-driven architecture, API-first design, bounded scopes with clear permissions. These are H2 investments that compound in H3.</p><p><strong>If you lead ML/AI:</strong> Stop optimizing models in isolation. 
Partner with ops and engineering on the scaffold - the validation rules, the HITL patterns, the monitoring infrastructure. <a href="https://mitsloan.mit.edu/ideas-made-to-matter/5-heavy-lifts-deploying-ai-agents">80% of the work that makes agents operational</a> is the part your team probably isn&#8217;t doing.</p><p>The AI adoption race is over. Most enterprises have given their people access to the tools. The next race - who builds the operational infrastructure that makes those tools deliver business value - is the one that determines who&#8217;s in the 6% and who&#8217;s in the 82% that adopted AI and got nothing for it.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Marco's Substack is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[From ChatGPT to Operational AI: The Architecture Shift Nobody Plans For]]></title><description><![CDATA[Most enterprises have mastered AI adoption. 
Almost none have built the infrastructure to make it operational.]]></description><link>https://marcosnewsletter.substack.com/p/from-chatgpt-to-operational-ai-the</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/from-chatgpt-to-operational-ai-the</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Fri, 06 Mar 2026 16:29:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!nf5D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070f4e1c-746a-432c-ba0d-272efe4a025f_1059x575.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nf5D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070f4e1c-746a-432c-ba0d-272efe4a025f_1059x575.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nf5D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070f4e1c-746a-432c-ba0d-272efe4a025f_1059x575.png 424w, https://substackcdn.com/image/fetch/$s_!nf5D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070f4e1c-746a-432c-ba0d-272efe4a025f_1059x575.png 848w, https://substackcdn.com/image/fetch/$s_!nf5D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070f4e1c-746a-432c-ba0d-272efe4a025f_1059x575.png 1272w, https://substackcdn.com/image/fetch/$s_!nf5D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070f4e1c-746a-432c-ba0d-272efe4a025f_1059x575.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nf5D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070f4e1c-746a-432c-ba0d-272efe4a025f_1059x575.png" width="1059" height="575" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/070f4e1c-746a-432c-ba0d-272efe4a025f_1059x575.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:575,&quot;width&quot;:1059,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:415366,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/190109433?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d9c3e81-9ca9-4e9d-8027-10c3d3f954a7_1222x765.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nf5D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070f4e1c-746a-432c-ba0d-272efe4a025f_1059x575.png 424w, https://substackcdn.com/image/fetch/$s_!nf5D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070f4e1c-746a-432c-ba0d-272efe4a025f_1059x575.png 848w, https://substackcdn.com/image/fetch/$s_!nf5D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070f4e1c-746a-432c-ba0d-272efe4a025f_1059x575.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nf5D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070f4e1c-746a-432c-ba0d-272efe4a025f_1059x575.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>ChatGPT Enterprise usage grew 8x last year. The tool reached <a href="https://openai.com/index/the-state-of-enterprise-ai-2025-report/">92% Fortune 500 penetration</a> in under 24 months. Employees are writing emails, summarizing documents, generating code. Adoption, by any measure, is working.</p><p>Business impact is not. 
Only <a href="https://www.forrester.com/blogs/predictions-2026-the-shift-from-ai-hype-to-hard-business-outcomes/">13% of companies report positive EBITDA impact</a> from AI. Fewer than one-third of enterprise decision-makers can tie AI investments to P&amp;L growth. And 42% of companies scrapped most of their AI initiatives in 2025 - up from 17% the year before.</p><p>The gap between adoption and impact has a structural explanation. Organizations have built <strong>passive AI</strong> - employees using language models ad hoc - and assumed that business results would follow. They don&#8217;t. What follows is a plateau: high usage, shallow integration, no measurable return.</p><p>The missing piece isn&#8217;t a better model. It&#8217;s an <strong>operational scaffold</strong> - the architecture that turns ad hoc AI usage into repeatable, governed, integrated business processes. Most teams never build it. The ones that do scale 3x faster than their peers.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;0c21e7da-bcc5-4c65-b331-91b8edbe9942&quot;,&quot;duration&quot;:null}"></div><p></p><div><hr></div><h2>Passive AI and Operational AI are different systems</h2><p>The enterprise AI conversation collapses two fundamentally different things into one.</p><p><strong>Passive AI</strong> is a utility. An employee opens ChatGPT, asks a question, gets an answer, decides what to do with it. The human remains the decision-maker. The AI is a tool, like a calculator or a search engine. If the output is wrong, the human catches it (or doesn&#8217;t - the stakes are usually low).</p><p><strong>Operational AI</strong> is a system. AI is integrated into a business process. It receives structured inputs, applies decision rules, produces standardized outputs, and feeds results into downstream workflows. Humans validate at checkpoints, not at every step. 
The AI has some degree of authority to act.</p><table><thead><tr><th></th><th>Passive AI</th><th>Operational AI</th></tr></thead><tbody><tr><td><strong>Who decides</strong></td><td>Human</td><td>AI (within constraints)</td></tr><tr><td><strong>Integration</strong></td><td>None - browser tab</td><td>Embedded in workflows and systems</td></tr><tr><td><strong>Input</strong></td><td>Free-form prompts</td><td>Structured, validated data</td></tr><tr><td><strong>Output</strong></td><td>Unstructured text</td><td>Standardized, schema-validated</td></tr><tr><td><strong>Failure mode</strong></td><td>Human catches bad output</td><td>System catches bad output (or doesn&#8217;t)</td></tr><tr><td><strong>Infrastructure needed</strong></td><td>Login credentials</td><td>Scaffold: rules, validation, HITL, monitoring</td></tr></tbody></table><p>This distinction matters because the infrastructure requirements are completely different. Passive AI needs almost nothing - give people access and they&#8217;ll use it. Operational AI needs an entire architecture layer that most organizations haven&#8217;t designed, let alone built.</p><p>The <a href="https://openai.com/index/the-state-of-enterprise-ai-2025-report/">OpenAI Enterprise report</a> shows this gap in practice: 40% of knowledge workers use AI tools personally, but describe them as unreliable in enterprise systems. Among ChatGPT Enterprise users, 19% have never used data analysis features, 14% have never tried reasoning tools. Usage is wide but shallow - passive by default.</p><div><hr></div><h2>The pilot-to-production cliff</h2><p>The data on AI pilot failures is brutal.</p><p><a href="https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/">95% of enterprise gen-AI pilots deliver zero measurable return</a> on P&amp;L, according to MIT&#8217;s 2025 State of AI in Business report. 88% of pilots never reach production. The average organization abandoned 46% of its proof-of-concepts before they became operational.</p><p>These aren&#8217;t model failures. Pilots typically work fine in controlled environments with clean data and motivated teams. 
The failure happens at the transition - when a working prototype meets real systems, real data quality issues, real compliance requirements, and real edge cases.</p><p>The root causes, <a href="https://dotkonnekt.com/blogs/enterprise-ai/why-ai-pilots-fail-strategic-insights-to-drive-enterprise-ai-adoption">as documented across multiple enterprise studies</a>, are structural:</p><p><strong>Data quality and governance (43% of failures).</strong> Pilots succeed on manually curated datasets. Production requires integration with live systems where data is inconsistent, incomplete, and owned by nobody in particular.</p><p><strong>Integration complexity (31% of technical failures).</strong> Pilots run in isolation. Production requires API connectivity, workflow integration, compliance alignment, and authentication across multiple systems. Teams consistently underestimate this by 2-3x.</p><p><strong>Strategic misalignment (43% of failures).</strong> Use cases chosen for technical novelty rather than business value. No clear ROI defined before funding, no clear owner after launch.</p><p><strong>Organizational readiness (66% cite this as a barrier).</strong> Lack of AI skills and leadership clarity. But this is necessary, not sufficient - skilled teams fail too when the infrastructure isn&#8217;t there.</p><p>The pattern is consistent: pilots work because humans compensate for missing infrastructure. Production fails because at scale, humans can&#8217;t compensate fast enough.</p><div><hr></div><h2>What an operational scaffold actually looks like</h2><p>An operational scaffold is the infrastructure layer between &#8220;the model works&#8221; and &#8220;the system runs in production.&#8221; It&#8217;s not a single tool or framework. 
It&#8217;s five interconnected layers that constrain, validate, and govern AI behavior within a business process.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Marco's Substack is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><strong>Layer 1: Input constraints.</strong> What data can the AI access? Is the input valid, complete, in the right format? Are there access controls that enforce data boundaries? This layer prevents garbage-in problems and limits the AI&#8217;s exposure to data it shouldn&#8217;t process.</p><p><strong>Layer 2: Decision rules and guardrails.</strong> What business rules must the AI follow? Which actions can it take autonomously, and which require escalation? At what confidence threshold does it hand off to a human? This is where domain knowledge gets encoded - not in the prompt, but in the system architecture.</p><p><strong>Layer 3: Output standardization.</strong> Does the output match the expected schema? Does it pass quality checks? What happens when validation fails - retry, fallback to a simpler approach, or escalate? This layer makes AI output consumable by downstream systems and humans.</p><p><strong>Layer 4: Human-in-the-loop validation.</strong> Where are the checkpoints? What approval workflows exist for high-stakes decisions? 
How do humans provide feedback that improves the system over time? What audit trail exists for compliance and debugging?</p><p><strong>Layer 5: Operational infrastructure.</strong> How do you monitor the system in production? How do you keep data pipelines fresh? How do you version the system and roll back when something breaks? This is the layer that keeps operational AI operational - not just on launch day, but on day 200.</p><p><a href="https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-are-ai-guardrails/">McKinsey&#8217;s AI guardrails framework</a> maps a similar architecture: input guardrails, runtime policy enforcement, output validation, and configuration controls. The terminology varies across frameworks. The structure is consistent.</p><p>None of these layers are optional. Skip input validation and your model processes corrupted data. Skip decision rules and the AI takes actions outside its authority. Skip output standardization and downstream systems can&#8217;t consume the results. Skip HITL and you lose the ability to catch systematic errors. Skip monitoring and you won&#8217;t know when the system degrades.</p><p>Most teams build the model. Some build one or two of these layers. Almost nobody builds all five before launching a pilot. And then they wonder why the pilot doesn&#8217;t scale.</p><div><hr></div><h2>How I built this for monthly business reviews</h2><p>I&#8217;ll make this concrete with a system I built: an operational AI pipeline for Monthly Business Reviews (MBRs) in a professional services organization.</p><p>The old process: a team spent roughly 20 hours per month collecting data from Salesforce and project management tools, validating numbers across regions, writing narrative descriptions for each KPI, and formatting everything into a leadership-ready document. The work was manual, error-prone, and repetitive.</p><p>The operational AI system reduced this to roughly 2 hours - a 90% reduction. 
But the model was maybe 15% of the work. The other 85% was the scaffold.</p><p><strong>Input layer.</strong> Structured data extraction from source systems with validation rules - 10 specific checks that flag inconsistencies before any narrative generation starts. Regional sums must match totals. Month-over-month trends must be directionally consistent with the underlying data. Missing data gets flagged, not silently ignored.</p><p><strong>Decision rules.</strong> The system applies business-specific logic: what counts as an improvement vs. a decline, which KPIs are bonus-linked and require extra scrutiny, how to weight regional performance against global targets. These rules aren&#8217;t in the prompt - they&#8217;re encoded in the validation pipeline.</p><p><strong>Output standardization.</strong> Every narrative follows a defined structure: current performance, trend context, root cause analysis, forward-looking implications. The format is consistent across KPIs and months, making it consumable by leadership without retraining their reading patterns.</p><p><strong>Human-in-the-loop.</strong> A data validation report goes to the operator before narratives generate. The operator reviews flagged inconsistencies, corrects source data if needed, and approves the validated dataset. After narrative generation, a human reviews the output against the data. The system doesn&#8217;t publish autonomously.</p><p><strong>Operational infrastructure.</strong> The pipeline tracks its own accuracy over time. Corrections feed back into the validation rules. 
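</p><p>To make the input layer concrete, here is a minimal sketch of the kind of pre-generation check it runs. The field names, dataset shape, and tolerance are illustrative assumptions, not the actual pipeline, which encodes 10 business-specific checks:</p>

```python
# Illustrative sketch of pre-generation input validation.
# Field names and the 0.01 tolerance are assumptions for this example;
# the real pipeline encodes 10 business-specific checks.

def validate_mbr_inputs(kpis):
    """Return a list of flags; narratives generate only when it is empty."""
    flags = []
    for kpi in kpis:
        # Missing regional data gets flagged, never silently ignored.
        missing = [r for r, v in kpi["regions"].items() if v is None]
        if missing:
            flags.append(f"{kpi['name']}: missing data for {missing}")
            continue  # remaining checks need complete data
        # Regional sums must match the reported global total.
        regional_sum = sum(kpi["regions"].values())
        if abs(regional_sum - kpi["global_total"]) > 0.01:
            flags.append(
                f"{kpi['name']}: regions sum to {regional_sum}, "
                f"reported total is {kpi['global_total']}"
            )
    return flags


kpis = [
    {"name": "Utilization", "global_total": 100.0,
     "regions": {"EMEA": 40.0, "AMER": 35.0, "APAC": 20.0}},
]
print(validate_mbr_inputs(kpis))  # one flag: regions sum to 95.0, not 100.0
```

<p>The control flow is the point: a non-empty flag list halts narrative generation and routes the operator back to the source data - the human-in-the-loop checkpoint described above.</p><p>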
The system versions its outputs so you can compare this month&#8217;s narratives against last month&#8217;s methodology.</p><p>Without the scaffold, this would be &#8220;use ChatGPT to write some paragraphs about KPIs.&#8221; With the scaffold, it&#8217;s a governed system that leadership trusts enough to present to the executive team.</p><div><hr></div><h2>The 3x multiplier</h2><p>The evidence that scaffolded approaches outperform ad hoc ones is strong.</p><p>Organizations with formal AI operating models achieve <a href="https://www.gcaie.org/post/the-ai-operating-model-playbook-redesigning-the-enterprise-for-scaled-ai">3.1x faster scaling and 4.4x higher value capture</a> compared to peers running ad hoc projects. They report 2.4x higher returns on AI investments. 72% of organizations successfully using AI at scale follow a defined operating framework.</p><p>The <a href="https://openai.com/index/the-state-of-enterprise-ai-2025-report/">OpenAI Enterprise report</a> adds another data point: purchasing AI tools from vendors and building partnerships succeeds 67% of the time, while internal builds succeed only 33%. The difference? Vendors bring pre-built operational scaffolds. You don&#8217;t start from scratch on input validation, guardrails, and monitoring - the vendor has already built it.</p><p>Frontier firms - the top quartile of AI adopters - show what operational integration looks like at scale. They achieve 1.7x revenue growth, 3.6x greater total shareholder return, and 1.6x EBIT margin compared to laggards. These aren&#8217;t cultural differences. They&#8217;re infrastructure differences. Frontier firms have built the scaffolds that let AI operate within business processes, not just alongside them.</p><p>Gartner predicts <a href="https://nexaitech.com/multi-ai-agent-architecutre-patterns-for-scale/">40% of agentic AI projects will be canceled by 2027</a> due to escalating costs, unclear business value, or inadequate risk controls. 
Read those failure modes: they&#8217;re all symptoms of missing scaffolding. Cost escalates because you retrofit governance after the fact. Value is unclear because you never defined how the AI connects to business outcomes. Risk controls are inadequate because you never designed them into the architecture.</p><p>The 60% that survive will be the ones that treated the scaffold as a first-class requirement, not an afterthought.</p><div><hr></div><h2>A fair objection: not everything needs a scaffold</h2><p>Some ad hoc AI usage genuinely works. <a href="https://www.nature.com/articles/s41562-026-02409-4">Research published in Nature Human Behaviour</a> shows that LLM-assisted writing improves outcomes even in informal settings - 18% of financial services complaints are now LLM-assisted, and 24% of corporate press releases are LLM-attributable.</p><p>This is real value, and it doesn&#8217;t require an operational scaffold. When a human writes a memo with ChatGPT&#8217;s help and then edits the result, the human is the scaffold. They validate the input (their own knowledge), apply decision rules (their judgment), standardize the output (their editing), and serve as the quality check.</p><p>The distinction is between <strong>utility</strong> and <strong>authority</strong>. When AI is a utility - humans decide, AI assists - light scaffolding works. When AI has authority - it acts, decides, or integrates into automated workflows - heavy scaffolding is required. The problem isn&#8217;t that organizations use passive AI. The problem is that they expect passive AI to deliver operational results.</p><div><hr></div><h2>What to do Monday morning</h2><p><strong>If you lead operations:</strong> Audit your AI pilots. How many are in production, processing real data, integrated into real workflows? If the answer is less than 15%, you have a scaffold problem, not a model problem. 
Fund the infrastructure, not another pilot.</p><p><strong>If you lead engineering:</strong> Stop thinking of AI as a feature you ship. Think of it as a system you operate. Design backwards from production constraints: what governs this AI? What validates its output? Who approves high-stakes decisions? What happens when it fails? Build observability into the initial design.</p><p><strong>If you lead ML:</strong> Your model is 10% of what production requires. The other 90% is input validation, decision rules, output standardization, human-in-the-loop workflows, and monitoring. Learn the frameworks - <a href="https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-are-ai-guardrails/">McKinsey&#8217;s guardrails</a>, <a href="https://hai.stanford.edu/news/humans-loop-design-interactive-ai-systems">Stanford&#8217;s HITL patterns</a> - that turn models into operational systems. Partner with ops and engineering. You can&#8217;t build a scaffold alone.</p><p>The AI adoption race is largely over. Most enterprises have given their employees access to language models. The operational race - who can turn those models into governed, integrated, production-grade systems - is just starting. The scaffold is where it&#8217;s won.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Marco's Substack is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[AI that doesn’t touch the P&L isn’t innovation, it’s overhead]]></title><description><![CDATA[Firms don&#8217;t need more experiments, they need workflows that raise margins.]]></description><link>https://marcosnewsletter.substack.com/p/ai-that-doesnt-touch-the-p-and-l</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/ai-that-doesnt-touch-the-p-and-l</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Fri, 26 Sep 2025 06:08:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!p4f6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda17fa51-4913-421d-98f7-813279ab2b59_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There&#8217;s a quiet frustration spreading across professional services firms. It&#8217;s not the technology itself. It&#8217;s what happens after the glossy demos, the pilots that win applause in town halls, the decks that promise &#8220;efficiency gains.&#8221;</p><p>The frustration shows up in the P&amp;L. Hours spent testing models that never reach production. Faster deliverables that paradoxically shrink fees. API invoices that grow without a clear link to client value. A portfolio of PoCs that look innovative on paper but, in reality, are margin erosion in disguise.</p><p>You can feel it in the Monday meetings. Leaders nod at &#8220;innovation updates&#8221; but privately wonder why gross margin hasn&#8217;t moved a single point. 
Teams are tired of context-switching between experiments while utilization targets remain stubborn. The real question isn&#8217;t <em>can AI do this?</em> It&#8217;s <em>should we even be doing this with AI at all?</em></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p4f6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda17fa51-4913-421d-98f7-813279ab2b59_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p4f6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda17fa51-4913-421d-98f7-813279ab2b59_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!p4f6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda17fa51-4913-421d-98f7-813279ab2b59_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!p4f6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda17fa51-4913-421d-98f7-813279ab2b59_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!p4f6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda17fa51-4913-421d-98f7-813279ab2b59_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p4f6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda17fa51-4913-421d-98f7-813279ab2b59_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da17fa51-4913-421d-98f7-813279ab2b59_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2107705,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/172873027?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda17fa51-4913-421d-98f7-813279ab2b59_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p4f6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda17fa51-4913-421d-98f7-813279ab2b59_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!p4f6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda17fa51-4913-421d-98f7-813279ab2b59_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!p4f6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda17fa51-4913-421d-98f7-813279ab2b59_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!p4f6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda17fa51-4913-421d-98f7-813279ab2b59_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The pilots that never pay back</h2><p>The pain runs deeper than wasted hours. Consider how it manifests:</p><ul><li><p><strong>Decades of delivery playbooks get bypassed</strong> in the rush to launch &#8220;AI pilots&#8221; that aren&#8217;t scoped, priced, or risk-assessed.</p></li><li><p><strong>Client-facing deliverables risk errors</strong> that no one priced into the engagement, but everyone will be accountable for.</p></li><li><p><strong>Pressure from sales to &#8220;show AI&#8221;</strong> forces operators to demo half-baked use cases that consume margin instead of creating it.</p></li></ul><p>And yet, it feels lonely to say this out loud. 
Everyone else seems to be celebrating their pilots, right?</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://marcosnewsletter.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>You are not the only one</h2><p>Here&#8217;s the truth: you&#8217;re far from alone.</p><p>BCG and Wharton recently ran a controlled experiment with 758 consultants. GPT-4 increased productivity by <strong>25%</strong> and improved quality on 18 tasks. But on &#8220;out-of-distribution&#8221; tasks, performance actually dropped. Translation: without proper gating and governance, AI doesn&#8217;t just fail to help - it actively hurts outcomes.</p><p>GitHub&#8217;s own Copilot study showed developers completing tasks <strong>55% faster</strong>. Yet firms that didn&#8217;t redesign their unit economics saw cycle times fall while fee structures stayed flat. Faster work simply compressed revenue.</p><p>Gartner projects that by 2026, <strong>80% of enterprises will run GenAI APIs in production</strong>. That means the competitive floor is rising quickly. The bar won&#8217;t be &#8220;do you use AI?&#8221; It will be &#8220;is your AI driving margin or bleeding it?&#8221;</p><p>The reality is simple: service firms that treat AI as a shiny pilot program will watch their margins shrink. 
Those that treat AI as a product line, with unit economics, SLOs, and governance, will widen their edge.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o-pu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9897b5a2-6c5e-4f83-9a78-5d676acc0357_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o-pu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9897b5a2-6c5e-4f83-9a78-5d676acc0357_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!o-pu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9897b5a2-6c5e-4f83-9a78-5d676acc0357_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!o-pu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9897b5a2-6c5e-4f83-9a78-5d676acc0357_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!o-pu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9897b5a2-6c5e-4f83-9a78-5d676acc0357_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o-pu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9897b5a2-6c5e-4f83-9a78-5d676acc0357_1024x1536.png" width="1024" height="1536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9897b5a2-6c5e-4f83-9a78-5d676acc0357_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2666680,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/172873027?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9897b5a2-6c5e-4f83-9a78-5d676acc0357_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o-pu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9897b5a2-6c5e-4f83-9a78-5d676acc0357_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!o-pu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9897b5a2-6c5e-4f83-9a78-5d676acc0357_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!o-pu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9897b5a2-6c5e-4f83-9a78-5d676acc0357_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!o-pu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9897b5a2-6c5e-4f83-9a78-5d676acc0357_1024x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>A different way forward: the TRIAGE map</h2><p>Instead of chasing every possible workflow, think about AI like an operator would: with constraints, priorities, and a hard eye on economics.</p><p>That&#8217;s where the <strong>TRIAGE framework</strong> comes in. It&#8217;s a way to select and industrialize use cases that actually move margins.</p><ul><li><p><strong>Throughput</strong>: How much cycle time or lead time could this cut?</p></li><li><p><strong>Risk</strong>: What happens if it fails? Is it client-facing (P0) or back-office (P2)? Do we have human-in-the-loop in place?</p></li><li><p><strong>Inputs</strong>: What data is needed? Do we have permission, residency, and confidentiality aligned?</p></li><li><p><strong>Accuracy</strong>: How do we measure quality? 
Golden sets, sampling, error thresholds?</p></li><li><p><strong>Governance</strong>: Are guardrails, audit logs, and approvals coded into the workflow?</p></li><li><p><strong>Economics</strong>: Does the cost of inference and integration improve realization rate, win rate, or pricing, or does it cannibalize them?</p></li></ul><p>Once you score your workflows, you can literally plot them: Throughput vs Risk. The &#8220;quick wins&#8221; pop out: high throughput, low risk. That&#8217;s where you should start.</p><p>This isn&#8217;t theory. Firms applying a structured filter like TRIAGE are moving from <strong>dozens of scattered pilots</strong> to <strong>five workflows that change margin profiles in 90 days</strong>. That&#8217;s the difference between &#8220;showing AI&#8221; and <em>selling AI</em>.</p><div><hr></div><h2>Why this matters right now</h2><p>Margins in services are under relentless pressure: flat contracts, rising delivery costs, and client expectations inflated by Microsoft and Google Copilot rollouts. The EU AI Act is adding compliance obligations. API costs are dropping, but that just tempts more experimentation without guardrails.</p><p>The worst outcome? Competitive parity. Everyone has AI, but no one makes money from it.</p><p>The best outcome? You design AI as a profit lever. Faster turnaround becomes a premium upsell, not a freebie. Enhanced accuracy becomes a differentiator, not an unbilled improvement. Guardrails and governance become a selling point, not a compliance headache.</p><div><hr></div><h2>What you can do this week</h2><p>Here&#8217;s a simple starting point: <strong>list ten recurring processes</strong> in your delivery engine. Score them on Throughput and Risk only. Plot them on a 2x2 grid. Circle the five that land in &#8220;high throughput, low risk.&#8221;</p><p>That&#8217;s your shortlist. Not thirty ideas. Not endless pilots. Just five.</p><p>Then, and only then, design SLOs, guardrails, and a shadow P&amp;L. 
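The shortlist exercise above (score each recurring process on Throughput and Risk, then keep only the high-throughput, low-risk quadrant) can be sketched in a few lines of Python. The workflow names, the 1-5 scores, and the quadrant thresholds below are illustrative assumptions, not data from this article:

```python
# Hypothetical sketch of the Throughput-vs-Risk quick-wins filter.
# Scores: throughput 1-5 (cycle time it could cut), risk 1-5 (blast radius if it fails).

workflows = [
    # (name, throughput, risk) -- invented examples
    ("invoice triage",        5, 1),
    ("client deliverable QA", 4, 5),
    ("meeting summaries",     3, 1),
    ("contract redlining",    4, 4),
    ("status reporting",      4, 2),
]

def quick_wins(items, min_throughput=4, max_risk=2):
    """Return names in the 'high throughput, low risk' quadrant -- the shortlist."""
    return [name for name, t, r in items if t >= min_throughput and r <= max_risk]

print(quick_wins(workflows))  # -> ['invoice triage', 'status reporting']
```

Rescoring the same list each quarter keeps the shortlist honest as risk profiles and data access change; everything outside the quadrant waits until it earns its way in.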
Within two weeks, you&#8217;ll know if these workflows are margin-positive or margin-negative. Kill fast or scale fast. Nothing in between.</p><div><hr></div><h2>A practical step to try tomorrow</h2><p>In your next leadership meeting, skip the &#8220;AI update&#8221; slide. Instead, bring one number: <em>realization rate</em>. Ask, &#8220;Which AI workflows are actually improving this?&#8221; You&#8217;ll be amazed how quickly the noise falls away.</p><div><hr></div><h2>Let&#8217;s talk</h2><p>I&#8217;d love to hear from you:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/p/ai-that-doesnt-touch-the-p-and-l/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://marcosnewsletter.substack.com/p/ai-that-doesnt-touch-the-p-and-l/comments"><span>Leave a comment</span></a></p><ul><li><p>Where have you seen AI pilots quietly erode margin instead of creating it?</p></li><li><p>Which workflows in your firm would score highest on Throughput and lowest on Risk right now?</p></li><li><p>If you had to kill 80% of your AI pilots tomorrow, which 20% would you keep, and why?</p></li></ul><p>Drop your thoughts in the comments. The sharper, the better. 
The real value is in seeing how others are navigating this same map.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Your “Strategy” Is Just a To-Do List (and It’s Costing You Growth)]]></title><description><![CDATA[Pretty roadmaps won&#8217;t save you. Outcomes do. Here&#8217;s how to tell if you&#8217;re running on strategy or just a glorified to-do list.]]></description><link>https://marcosnewsletter.substack.com/p/your-strategy-is-just-a-to-do-list</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/your-strategy-is-just-a-to-do-list</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Wed, 24 Sep 2025 06:08:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2c1Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d481833-c441-4272-8da3-f66b2066747e_864x484.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let me start with something positive: every team I&#8217;ve worked with is full of people who genuinely want to do good work. No one wakes up hoping to create more bureaucracy, more slide decks, or more noise. They want impact. They want clarity. 
They want to move the ball forward.</p><p>And yet every week I see the same silent killer inside organizations: <strong>teams confuse a plan with a strategy.</strong></p><div><hr></div><h2>The moment it all blurs</h2><p>It usually starts in a meeting room (or more likely, a Zoom call). Someone asks: <em>&#8220;What&#8217;s our strategy?&#8221;</em></p><p>What follows isn&#8217;t strategy: it&#8217;s a Gantt chart. A backlog. A roadmap.</p><p>In other words: a plan.</p><p>On paper, the distinction looks obvious. <strong>Strategy is the &#8220;why&#8221; and &#8220;where to play.&#8221; A plan is the &#8220;how&#8221; and &#8220;when.&#8221;</strong> But in reality, the two get mixed all the time, and the cost is high.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>The pain of mistaking a plan for a strategy</h2><p>If you&#8217;ve ever been in one of these situations, you know the feeling:</p><ul><li><p>A &#8220;strategy deck&#8221; that&#8217;s really just a 40-slide list of initiatives.</p></li><li><p>Teams pivoting direction mid-quarter because the latest request feels urgent.</p></li><li><p>Deliverables shipped on time, but with no noticeable change in customer behavior or revenue.</p></li><li><p>Endless meetings where leaders debate tasks, not outcomes.</p></li></ul><p>It&#8217;s exhausting. It&#8217;s confusing. 
And most importantly, it&#8217;s wasteful.</p><p>PwC found that <strong>67% of managers confuse operational planning with strategy.</strong> That means two out of three managers are pushing their teams forward with what amounts to <em>organized noise.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2c1Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d481833-c441-4272-8da3-f66b2066747e_864x484.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2c1Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d481833-c441-4272-8da3-f66b2066747e_864x484.webp 424w, https://substackcdn.com/image/fetch/$s_!2c1Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d481833-c441-4272-8da3-f66b2066747e_864x484.webp 848w, https://substackcdn.com/image/fetch/$s_!2c1Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d481833-c441-4272-8da3-f66b2066747e_864x484.webp 1272w, https://substackcdn.com/image/fetch/$s_!2c1Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d481833-c441-4272-8da3-f66b2066747e_864x484.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2c1Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d481833-c441-4272-8da3-f66b2066747e_864x484.webp" width="864" height="484" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d481833-c441-4272-8da3-f66b2066747e_864x484.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:484,&quot;width&quot;:864,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22690,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/171987571?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d481833-c441-4272-8da3-f66b2066747e_864x484.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2c1Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d481833-c441-4272-8da3-f66b2066747e_864x484.webp 424w, https://substackcdn.com/image/fetch/$s_!2c1Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d481833-c441-4272-8da3-f66b2066747e_864x484.webp 848w, https://substackcdn.com/image/fetch/$s_!2c1Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d481833-c441-4272-8da3-f66b2066747e_864x484.webp 1272w, https://substackcdn.com/image/fetch/$s_!2c1Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d481833-c441-4272-8da3-f66b2066747e_864x484.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>You&#8217;re not alone</h2><p>This isn&#8217;t a rare mistake; it&#8217;s systemic.</p><ul><li><p>McKinsey data shows companies that tightly connect strategy and planning are <strong>3x more likely to outperform competitors</strong> on revenue growth.</p></li><li><p>Bain found that organizations focusing on outcomes (customer retention, churn reduction, market share) have a <strong>2.5x higher project success rate</strong> than those tracking outputs (deliverables, task completion).</p></li><li><p>Gartner predicts that by 2027, <strong>70% of operational planning will be automated by AI.</strong> Which means the last real human edge will be in strategy: choosing <em>why</em> and <em>where</em>, not <em>how</em>.</p></li></ul><p>When you zoom out, the picture becomes clear: companies don&#8217;t fail because they lack 
plans. They fail because their plans aren&#8217;t anchored to a strategy.</p><div><hr></div><h2>Common traps I see every day</h2><p>The same mistakes repeat across industries:</p><ul><li><p><strong>Turning vision into activity.</strong> A bold idea gets diluted into a laundry list of initiatives, none of which connect to the core purpose.</p></li><li><p><strong>Adding detail without power.</strong> Teams think more detail = more control, but what they&#8217;re really doing is burying the few decisions that matter.</p></li><li><p><strong>Planning in silos.</strong> Each department builds its own plan, with no view of dependencies or shared goals.</p></li></ul><p>The end result? Momentum without meaning. Lots of movement, little impact.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QBYO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c160fe-311c-492c-8a7c-7aa98b367ed7_766x430.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QBYO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c160fe-311c-492c-8a7c-7aa98b367ed7_766x430.webp 424w, https://substackcdn.com/image/fetch/$s_!QBYO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c160fe-311c-492c-8a7c-7aa98b367ed7_766x430.webp 848w, https://substackcdn.com/image/fetch/$s_!QBYO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c160fe-311c-492c-8a7c-7aa98b367ed7_766x430.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!QBYO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c160fe-311c-492c-8a7c-7aa98b367ed7_766x430.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QBYO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c160fe-311c-492c-8a7c-7aa98b367ed7_766x430.webp" width="766" height="430" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4c160fe-311c-492c-8a7c-7aa98b367ed7_766x430.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:430,&quot;width&quot;:766,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:15300,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/171987571?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c160fe-311c-492c-8a7c-7aa98b367ed7_766x430.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QBYO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c160fe-311c-492c-8a7c-7aa98b367ed7_766x430.webp 424w, https://substackcdn.com/image/fetch/$s_!QBYO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c160fe-311c-492c-8a7c-7aa98b367ed7_766x430.webp 848w, 
https://substackcdn.com/image/fetch/$s_!QBYO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c160fe-311c-492c-8a7c-7aa98b367ed7_766x430.webp 1272w, https://substackcdn.com/image/fetch/$s_!QBYO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c160fe-311c-492c-8a7c-7aa98b367ed7_766x430.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>What actually works</h2><p>Here&#8217;s the shift I&#8217;ve seen in teams that get this 
right:</p><ol><li><p><strong>They ask &#8220;why&#8221; before &#8220;how.&#8221;</strong><br>If the only answer to &#8220;why are we doing this?&#8221; is &#8220;because it&#8217;s in the roadmap,&#8221; you&#8217;re in trouble. Outcomes must drive activities, not the other way around.</p></li><li><p><strong>They keep strategy short.</strong><br>No 40-slide decks. Two pages max. If you can&#8217;t explain your strategy over coffee, no plan will save you.</p></li><li><p><strong>They measure what matters.</strong><br>Activity metrics are fine for daily management, but what moves the business are outcome metrics: cycle times shrinking, margins improving, customer loyalty rising.</p></li></ol><p>These teams don&#8217;t have less planning; they have <em>better anchored planning.</em></p><div><hr></div><h2>The checklist test</h2><p>If you&#8217;re wondering whether you&#8217;re really working on a strategy or just a plan, here&#8217;s a quick test. Run your work against these five questions:</p><ol><li><p><strong>Clarity</strong> &#8211; Can you explain your strategy in two pages or less?</p></li><li><p><strong>Focus</strong> &#8211; Have you clearly said <em>what not to do</em>?</p></li><li><p><strong>Connection</strong> &#8211; Do your plans tie directly back to strategic outcomes?</p></li><li><p><strong>Measurement</strong> &#8211; Are you tracking outcomes (impact, results) instead of just outputs (tasks, deliverables)?</p></li><li><p><strong>Review</strong> &#8211; Do you revisit the strategy regularly without rewriting it from scratch every time?</p></li></ol><p>If you can&#8217;t check most of these boxes, chances are you&#8217;re running on a plan, not a strategy.</p><div><hr></div><h2>A practical move for tomorrow</h2><p>Here&#8217;s something you can try right away: <strong>for your next project kickoff, ban the words &#8220;deliverable&#8221; and &#8220;timeline&#8221; from the first 20 minutes.</strong></p><p>Instead, ask: <em>&#8220;What will be different if 
this project succeeds?&#8221;</em></p><p>If the answer is still just &#8220;we&#8217;ll ship X feature by Y date,&#8221; go one level deeper. Push until you hit a customer behavior, a business metric, or a competitive position. Only then start talking about the plan.</p><div><hr></div><h2>Closing on a positive note</h2><p>Here&#8217;s the good news: this isn&#8217;t rocket science. Most teams already have the raw ingredients for a real strategy; they just bury it under activity. When you bring the focus back to outcomes, you unlock energy. Conversations shift from tasks to impact. People stop feeling like they&#8217;re chasing deadlines and start feeling like they&#8217;re building something that matters.</p><p>And that&#8217;s what makes work worth doing.</p><div><hr></div><h3>Let&#8217;s talk</h3><p>Now I want to throw the question back to you:</p><ul><li><p>In your daily work, do you see <strong>more plans with no strategy, or strategies that never make it into a plan</strong>?</p></li><li><p>What&#8217;s the hardest part of keeping those two connected in your organization?</p></li><li><p>And if you could change just one thing tomorrow to strengthen that connection, what would it be?</p></li></ul><p>I&#8217;d love to hear your take: drop a comment and let&#8217;s spark the kind of conversation that moves this from theory into practice.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Marco, Future&#8209;Ops Insights is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[RPA, AI, or Agents? The Messy Truth About Choosing Automation]]></title><description><![CDATA[The hardest part isn&#8217;t the technology. It&#8217;s knowing which problem you&#8217;re really trying to solve.]]></description><link>https://marcosnewsletter.substack.com/p/rpa-ai-or-agents-the-messy-truth</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/rpa-ai-or-agents-the-messy-truth</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Mon, 22 Sep 2025 06:08:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!LCri!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d61d02-dd6d-4150-8fda-b188a55d6f00_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It always starts the same way. A team is drowning in repetitive work, leaders decide it&#8217;s time to &#8220;automate,&#8221; and suddenly everyone has a different definition of what that means. Some picture rule-based bots cranking through invoices. Others imagine sleek AI systems predicting customer churn. A few are already whispering about autonomous agents that can run entire workflows on their own.</p><p>And then comes the paralysis. Which one do we actually need? 
What&#8217;s hype, what&#8217;s real, and what will quietly collapse after a year of maintenance headaches?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LCri!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d61d02-dd6d-4150-8fda-b188a55d6f00_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LCri!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d61d02-dd6d-4150-8fda-b188a55d6f00_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!LCri!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d61d02-dd6d-4150-8fda-b188a55d6f00_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!LCri!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d61d02-dd6d-4150-8fda-b188a55d6f00_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!LCri!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d61d02-dd6d-4150-8fda-b188a55d6f00_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LCri!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d61d02-dd6d-4150-8fda-b188a55d6f00_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77d61d02-dd6d-4150-8fda-b188a55d6f00_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2411644,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/172564033?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d61d02-dd6d-4150-8fda-b188a55d6f00_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LCri!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d61d02-dd6d-4150-8fda-b188a55d6f00_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!LCri!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d61d02-dd6d-4150-8fda-b188a55d6f00_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!LCri!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d61d02-dd6d-4150-8fda-b188a55d6f00_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!LCri!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d61d02-dd6d-4150-8fda-b188a55d6f00_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h2>The Pain of Picking the Wrong Tool</h2><p>Many organizations jump straight into automation projects with the wrong match of tool to problem. The fallout is costly.</p><ul><li><p><strong>Rigid RPA bots</strong> break when even a minor system update shifts a field in the interface. 
Teams spend more time fixing bots than they save.</p></li><li><p><strong>AI pilots</strong> end up stalled because no one prepared the right data, and the models give outputs that look impressive but don&#8217;t actually help anyone make better decisions.</p></li><li><p><strong>Autonomous agents</strong> promise freedom but in practice spin into loops or generate answers that no one can trust without heavy supervision.</p></li></ul><p>The result: leaders who invested millions are left asking why the promised efficiency gains never materialized. Employees feel frustrated: they wanted help, not another layer of complexity.</p><div><hr></div><h2>You&#8217;re Not Alone in This</h2><p>If this sounds familiar, it&#8217;s because it is. According to Deloitte, around <strong>70% of companies exploring automation now seek more flexible and autonomous solutions</strong> after realizing the limitations of rule-based approaches. Meanwhile, Accenture reports that <strong>55% of developers using GitHub Copilot finished coding tasks faster</strong>, but many still relied on human review to catch mistakes.</p><p>Even at the cutting edge, early adopters admit reality doesn&#8217;t always match the pitch. In 2024, pilot projects with autonomous agents showed promise (think virtual junior consultants drafting reports, or IT agents fixing routine issues), but they also revealed brittleness, hallucinations, and a clear need for human oversight.</p><p>So if your own projects feel messy or inconclusive, know this: it&#8217;s not a personal failure. It&#8217;s the stage the entire industry is in.</p><div><hr></div><h2>The Real Question: What Problem Are You Solving?</h2><p>The temptation is to treat &#8220;automation&#8221; as a single bucket. 
In practice, each category, <strong>traditional automation, AI, and autonomous agents</strong>, solves a very different problem.</p><ul><li><p><strong>Traditional automation (RPA, scripts):</strong> Think of it as the specialist for <strong>repetitive, structured, and high-volume tasks</strong>. It thrives on invoices, reconciliations, and copy-paste operations between systems. When stability and predictability matter, RPA is unbeatable. But once a process requires judgment or adaptation, it falls apart.</p></li><li><p><strong>AI (machine learning, NLP, predictive models):</strong> This is the analyst, the pattern-recognizer, the one that makes sense of <strong>messy or unstructured data</strong>. It&#8217;s how law firms using tools like Luminance cut contract review time by 60%, and how consultancies forecast market shifts faster. The caveat: AI is only as strong as the data behind it, and bias or poor governance can derail results.</p></li><li><p><strong>Autonomous agents:</strong> These are still the apprentices, learning fast but not yet fully trusted. They shine in <strong>complex, multi-step workflows</strong> where flexibility is essential, like diagnosing tech issues end-to-end or coordinating compliance checks across thousands of documents. But think of them like a promising intern: smart, fast, but still needing a mentor&#8217;s eye.</p></li></ul><p>When you frame the choice this way, the fog clears. It&#8217;s less about which technology is &#8220;best&#8221; and more about which one fits the shape of your problem.</p><div><hr></div><h2>Why Mixing Matters</h2><p>The future isn&#8217;t one or the other: it&#8217;s all of them, working together. Some of the most impressive results already come from <strong>hybrid approaches</strong>.</p><p>A US health insurer tripled document processing by combining bots for routine intake with AI models for flagging ambiguous cases. 
A global credit card company saved <strong>$160 million</strong> by blending RPA for payables with AI for fraud detection. And in professional services, firms are layering chatbots for client FAQs on top of AI-powered analytics, with humans handling the final review.</p><p>It&#8217;s like building a team: you wouldn&#8217;t hire only accountants or only strategists. You need the right mix of skills for the job. Automation is no different.</p><div><hr></div><h2>A Practical Way Forward</h2><p>Here&#8217;s where optimism meets pragmatism: the solution is rarely about chasing the newest tool. It&#8217;s about starting with a clear problem definition, then matching it to the simplest technology that solves it.</p><ul><li><p>If your process is stable, repetitive, and structured, <strong>start with RPA</strong>.</p></li><li><p>If the challenge lies in patterns, predictions, or messy data, <strong>AI is your friend</strong>.</p></li><li><p>If the problem is complex, dynamic, and can&#8217;t be easily scripted, <strong>experiment with agents</strong>, but keep humans in the loop.</p></li></ul><p>This doesn&#8217;t just save money. It builds confidence across your team. People see quick wins from automation, trust grows, and the organization is more willing to experiment with advanced tools down the line.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://marcosnewsletter.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>One Action You Can Take This Week</h2><p>Instead of debating abstractly which automation to pursue, pick <strong>one process in your team that everyone complains about</strong>. Map it out on a whiteboard. Ask: is it structured and repetitive, data-driven and messy, or complex and dynamic? 
That single classification will point you toward whether it needs RPA, AI, or an agent experiment.</p><p>It&#8217;s not about solving everything at once. It&#8217;s about getting one match right and proving to your team that automation can work <em>for</em> them, not against them.</p><div><hr></div><h2>Let&#8217;s Talk</h2><p>What&#8217;s the one process in your organization that drains the most energy?<br>Have you ever tried to automate it, and if so, which approach did you choose?<br>Where did it succeed, and where did it break?</p><p>I&#8217;d love to hear your stories. Drop them in the comments: not just the wins, but the failures too. Those are the ones that usually teach us the most.</p><h1>Full Detailed Report, plus an Executive Summary at the end:</h1><h1>Comparing AI Deployment Models: On-Premises Hardware vs Cloud vs Turnkey Solutions</h1><p>Artificial Intelligence can be deployed in several ways, each with its own benefits and adoption patterns. The three primary models are <strong>AI on physical hardware (on-premises)</strong>, <strong>AI on the cloud</strong>, and <strong>turnkey AI solutions</strong>. Below we explore each model in depth, including recent global trends (focused on the past year) and industry usage statistics.</p>
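<p>The whiteboard triage described above (structured and repetitive, data-driven and messy, or complex and dynamic) can be sketched as a simple decision rule. This is an illustrative sketch only; the function name and flags are assumptions, not a real tool, and real triage still involves judgment:</p>

```python
def pick_automation(structured: bool, repetitive: bool,
                    messy_data: bool, dynamic: bool) -> str:
    """Map the whiteboard classification of a process to a starting technology.

    Illustrative heuristic only: names and thresholds are assumptions.
    """
    if dynamic:
        # Complex, hard-to-script workflows: experiment with agents,
        # but keep humans in the loop
        return "agent experiment (human in the loop)"
    if messy_data:
        # Patterns, predictions, unstructured inputs: AI/ML is the fit
        return "AI"
    if structured and repetitive:
        # Stable, rule-based, high-volume work: start with RPA
        return "RPA"
    return "map the process more carefully first"
```

<p>For example, invoice intake (structured and repetitive, clean data, nothing dynamic) would come out as RPA, which matches the guidance above.</p>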
      <p>
          <a href="https://marcosnewsletter.substack.com/p/rpa-ai-or-agents-the-messy-truth">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[Should You Own or Rent Your AI? The Hidden Costs Behind Hardware, Cloud, and ChatGPT]]></title><description><![CDATA[The honest math behind control, speed, and cost in the AI era]]></description><link>https://marcosnewsletter.substack.com/p/should-you-own-or-rent-your-ai-the</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/should-you-own-or-rent-your-ai-the</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Mon, 15 Sep 2025 06:08:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rR_h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ce4a00d-63af-4350-91ee-5ddef113eea9_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let&#8217;s start on a positive note: it&#8217;s an extraordinary time to be alive if you&#8217;re curious about AI. For the first time, powerful models that once required research labs and million-dollar budgets are now accessible to startups, mid-sized firms, even solo entrepreneurs. You can spin up a model in the cloud within minutes, or, if you&#8217;re ambitious, download one onto a spare GPU at home. The possibilities are wide open.</p><p>But with possibility comes friction. And that&#8217;s where the story begins.</p><div><hr></div><h2>The Problem Nobody Likes to Admit</h2><p>We&#8217;ve all felt that pang of doubt when choosing between building in-house and relying on a service. With AI, the stakes feel even higher.</p><p>Imagine this: your team wants to experiment with generative models. Marketing sees the chance to generate on-brand content instantly. Operations dreams of smarter forecasting. Legal, meanwhile, is pacing in the corner, whispering &#8220;GDPR&#8221; under their breath.</p><p>The enthusiasm is real, but so is the paralysis. 
<strong>Do you spend six figures on GPUs, hire engineers, and pray your hardware doesn&#8217;t become obsolete in two years? Or do you rent elastic power from AWS and risk waking up one morning with a bill that looks like the GDP of a small town? Or do you just swipe your credit card for ChatGPT Plus and accept the trade-offs?</strong></p><p>Each option feels like both an opportunity and a trap.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rR_h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ce4a00d-63af-4350-91ee-5ddef113eea9_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rR_h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ce4a00d-63af-4350-91ee-5ddef113eea9_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!rR_h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ce4a00d-63af-4350-91ee-5ddef113eea9_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!rR_h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ce4a00d-63af-4350-91ee-5ddef113eea9_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!rR_h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ce4a00d-63af-4350-91ee-5ddef113eea9_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rR_h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ce4a00d-63af-4350-91ee-5ddef113eea9_1024x1536.png" 
width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ce4a00d-63af-4350-91ee-5ddef113eea9_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1802797,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/172562100?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ce4a00d-63af-4350-91ee-5ddef113eea9_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rR_h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ce4a00d-63af-4350-91ee-5ddef113eea9_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!rR_h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ce4a00d-63af-4350-91ee-5ddef113eea9_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!rR_h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ce4a00d-63af-4350-91ee-5ddef113eea9_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!rR_h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ce4a00d-63af-4350-91ee-5ddef113eea9_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h2>Pain Points in Sharp Relief</h2><p>The more you dig, the more the anxiety builds.</p><ul><li><p><strong>Hardware lock-in</strong>: Buying servers feels bold until you realize NVIDIA has already launched the &#8220;next&#8221; card three months later.</p></li><li><p><strong>Cloud sticker shock</strong>: Those $2/hour GPU rentals look harmless until you realize your team left three instances running over the weekend.</p></li><li><p><strong>Data worries</strong>: You can&#8217;t shake the feeling that pushing client health records or banking logs into a third-party API is asking for tomorrow&#8217;s compliance nightmare.</p></li><li><p><strong>Skill gaps</strong>: You may have brilliant analysts, but who in your company is actually capable of fine-tuning a 70B parameter 
model?</p></li></ul><p>What makes it worse is that you&#8217;re not alone in this paralysis. A 2024 Deloitte survey found that <strong>53% of executives say AI adoption is slowed primarily by &#8220;infrastructure uncertainty.&#8221;</strong> Not algorithms. Not talent. Infrastructure.</p><p>That&#8217;s telling. The bottleneck isn&#8217;t whether AI works; it&#8217;s where it should live.</p><div><hr></div><h2>Stories from the Field</h2><ul><li><p><strong>The fintech startup drowning in cloud bills</strong><br>FeroTrize, a venture-backed fintech startup delivering real-time lending services at scale, built its platform on a robust AWS-native architecture. With rapid market adoption, the company was processing thousands of transactions per minute. But as innovation surged, so did their AWS bill, ballooning from $15,000 to $48,000 per month in just 18 months. As their CTO Marcus Chen admitted: <em>&#8220;We were focused on rapid growth and feature development, but our AWS bills were growing even faster than our customer base.&#8221;</em> <em><a href="https://www.stackgenie.io/how-stackgenie-helped-a-fintech-startup-cut-aws-cloud-costs-by-40-with-eks/?utm_source=chatgpt.com">Source</a></em></p></li><li><p><strong>German hospitals resisting the cloud</strong><br><strong>Asklepios Kliniken</strong>, one of Germany&#8217;s largest private hospital groups, rolled out AI company <strong>Aidoc&#8217;s aiOS&#8482; platform</strong> across <strong>25 hospitals</strong>, but deployed it in a way that kept control local to ensure data sovereignty. They spent &#8364;800,000 building the on-premise AI cluster for patient data analysis. Twelve months later, the GPUs were maxed out, and they had to rent additional cloud compute anyway. <em><a href="https://www.aidoc.com/about/news/asklepios-and-aidoc-set-new-standard-for-patient-care/?utm_source=chatgpt.com">Source</a></em></p></li></ul><p>Different choices, different consequences. 
Both paths are messy, but it&#8217;s important to see that companies of all sizes are making these bets and adjusting along the way.</p><div><hr></div><h2>The Escape: Clarity on Trade-Offs</h2><p>The good news? There isn&#8217;t a single &#8220;right&#8221; answer. There are just <strong>trade-offs</strong>, and clarity about those trade-offs is liberating.</p><ul><li><p><strong>On-premise</strong> gives you sovereignty. You own the stack, the data never leaves your walls, and you can customize deeply. <strong>The trade-off</strong>: upfront cost and maintenance headaches.</p></li><li><p><strong>Cloud</strong> gives you speed and scale. You rent world-class GPUs without tying up capital. <strong>The trade-off:</strong> recurring costs and dependence on a provider&#8217;s rules.</p></li><li><p><strong>As-a-Service (ChatGPT, Claude, Gemini)</strong> gives you instant productivity. No DevOps, no capital expense. <strong>The trade-off</strong>: zero control over the model, and limited data privacy.</p></li></ul><p>Seen clearly, the question stops being &#8220;what&#8217;s best?&#8221; and becomes &#8220;what&#8217;s best for me right now?&#8221;</p><div><hr></div><h2>A Framework That Actually Helps</h2><p>When you strip away the noise, a practical lens emerges:</p><ol><li><p><strong>If you&#8217;re prototyping or solo</strong> &#8594; SaaS models (ChatGPT, Claude) make sense. Your risk is low, and your speed is high.</p></li><li><p><strong>If you&#8217;re scaling a product where AI is the core</strong> &#8594; Cloud GPUs plus open-source models like LLaMA 3 or Mistral are your friend. You can scale elastically while keeping some customization.</p></li><li><p><strong>If your data is mission-critical (healthcare, finance, defense)</strong> &#8594; On-premise is still the fortress. It&#8217;s expensive, yes, but sovereignty often is.</p></li></ol><p>This framework doesn&#8217;t eliminate trade-offs, but it gives you a starting map. 
And a map is better than wandering blind.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://marcosnewsletter.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>A Practical Note</h2><p>If you&#8217;re standing at this crossroads right now, here&#8217;s one small but powerful step you can take this week:</p><p><strong>Run a pilot project with a single use case, using two different infrastructures.</strong></p><p>For example: try a customer-support summarizer on ChatGPT, and simultaneously test a fine-tuned open-source model on rented cloud GPUs. Track cost, latency, and output quality for just one week. You&#8217;ll learn more from that side-by-side than from any whitepaper.</p><p>That&#8217;s your low-risk way to feel the trade-offs in your bones, not just in a slide deck.</p><div><hr></div><p>I&#8217;ll leave you with this thought: we often overcomplicate &#8220;strategy&#8221; when a cheap, fast experiment can cut through the fog.</p><p>Now I&#8217;d love to hear from you:<br>&#128073; <strong>If you had to bet on one setup today (hardware, cloud, or SaaS), where would you place your chips, and why?</strong><br>&#128073; <strong>What&#8217;s the single biggest fear holding you back from committing?</strong></p><p>Drop your thoughts in the comments; I want to surface the real stories, not just the glossy case studies.</p><h1>Full Report on AI, automation, and autonomous agents in professional services: when to use what</h1><p>In professional services, technologies like traditional automation, artificial intelligence, and autonomous agents are reshaping operations. They promise higher efficiency, lower costs, and better quality, but each is suited to different scenarios. 
The real advantage comes from knowing when to use rule-based automation, when to rely on AI and machine learning, and when to experiment with autonomous agents. Many firms still win with automation on repetitive tasks, while a growing share is piloting more flexible, autonomous approaches to stay competitive. Below is a practical, research-informed overview of each category, with guidance on when to use it, common use cases, and real examples aligned to services operations.</p>
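<p>The side-by-side pilot suggested earlier (track cost, latency, and output quality for two infrastructures over one week) needs nothing more than a simple log. Here is a minimal sketch of what that could look like; the class and field names are illustrative assumptions, not part of any real tool:</p>

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Run:
    """One task run during the week-long pilot (illustrative fields)."""
    setup: str        # e.g. "SaaS" (ChatGPT) or "cloud GPU" (open-source model)
    cost_usd: float   # API or GPU-hour cost attributed to this run
    latency_s: float  # wall-clock time to a usable output
    quality: int      # 1-5 reviewer score on the output

def summarize(runs):
    """Aggregate pilot runs per infrastructure setup for a side-by-side view."""
    summary = {}
    for setup in {r.setup for r in runs}:
        rs = [r for r in runs if r.setup == setup]
        summary[setup] = {
            "total_cost_usd": round(sum(r.cost_usd for r in rs), 2),
            "avg_latency_s": round(mean(r.latency_s for r in rs), 2),
            "avg_quality": round(mean(r.quality for r in rs), 2),
        }
    return summary
```

<p>After a week of logging each run, the summary gives you the cost/latency/quality comparison directly, which is usually enough to feel the trade-offs without a formal study.</p>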
      <p>
          <a href="https://marcosnewsletter.substack.com/p/should-you-own-or-rent-your-ai-the">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[AI isn’t the strategy. Leadership is.]]></title><description><![CDATA[The firms pulling ahead in AI share four mandates. Miss them, and you&#8217;re funding experiments. Nail them, and you&#8217;re compounding advantage.]]></description><link>https://marcosnewsletter.substack.com/p/ai-isnt-the-strategy-leadership-is</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/ai-isnt-the-strategy-leadership-is</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Fri, 12 Sep 2025 06:08:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yHd3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809378b5-7be8-4fac-ae28-3179b4a93109_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let&#8217;s be honest: <strong>AI in business is at risk of becoming the next &#8220;digital transformation&#8221; clich&#233;.</strong> Everyone talks about it, budgets are being poured in, but when you look closely, most initiatives are still sitting in pilot mode or, worse, producing little more than shiny slide decks.</p><p>If you&#8217;ve felt the frustration of sinking time and resources into AI only to watch outcomes stall, you&#8217;re not alone. A recent Gartner study found that <strong>over 80% of AI projects never make it past experimentation.</strong> That&#8217;s not a technology failure; it&#8217;s a leadership one.</p><p>And that&#8217;s the real problem: executives are buying tools, but not building the conditions for transformation.</p><div><hr></div><h2>The Pain Point Nobody Likes to Admit</h2><p>Picture this. A boardroom presentation announces the company&#8217;s &#8220;AI journey.&#8221; The team proudly showcases a chatbot prototype, a few automations in the back office, and some dashboards that promise predictive insights.</p><p>Then, six months later, the results are flat. 
Sales hasn&#8217;t grown, costs haven&#8217;t gone down, and customer satisfaction hasn&#8217;t budged. Everyone involved feels deflated. Some even whisper that AI is overhyped.</p><p>The truth? The organization didn&#8217;t fail because of the AI. It failed because leadership treated AI as a <strong>tech decision</strong>, not a <strong>strategic mandate.</strong></p><p>This is the pain point: too many leaders are delegating AI to IT or innovation labs without anchoring it to business outcomes. It&#8217;s like installing solar panels on a collapsing roof: the technology is fine, but the foundation isn&#8217;t there.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yHd3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809378b5-7be8-4fac-ae28-3179b4a93109_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yHd3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809378b5-7be8-4fac-ae28-3179b4a93109_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!yHd3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809378b5-7be8-4fac-ae28-3179b4a93109_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!yHd3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809378b5-7be8-4fac-ae28-3179b4a93109_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!yHd3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809378b5-7be8-4fac-ae28-3179b4a93109_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!yHd3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809378b5-7be8-4fac-ae28-3179b4a93109_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/809378b5-7be8-4fac-ae28-3179b4a93109_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2557797,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/172156491?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809378b5-7be8-4fac-ae28-3179b4a93109_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yHd3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809378b5-7be8-4fac-ae28-3179b4a93109_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!yHd3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809378b5-7be8-4fac-ae28-3179b4a93109_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!yHd3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809378b5-7be8-4fac-ae28-3179b4a93109_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!yHd3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809378b5-7be8-4fac-ae28-3179b4a93109_1536x1024.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>Proof That You&#8217;re Not Alone</h2><p>This pattern repeats everywhere. McKinsey reports that while <strong>65% of companies say they&#8217;ve adopted AI in some form, only 23% report measurable ROI.</strong> In other words, most firms are dabbling, but very few are capturing value.</p><p>Even among Fortune 500 companies, success is uneven. Some (think JPMorgan or Walmart) are achieving billions in efficiency and revenue gains. Others are spinning their wheels, unable to get past fragmented pilots.</p><p>The gap comes down to leadership, not technology. 
The winners have <strong>clear vision, strong foundations, empowered people, and disciplined execution.</strong> The laggards don&#8217;t.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://marcosnewsletter.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>Four Leadership Mandates</h2><p>The good news? The pattern is clear enough to learn from. Across industries, the companies breaking through with AI share four common leadership actions.</p><div><hr></div><h3>1. Lead with a Clear, Top-Down Vision</h3><p>The number one reason AI projects die is lack of vision. Without a north star, efforts fragment into disconnected experiments.</p><p>Successful leaders define the &#8220;why&#8221; up front: <em>Are we using AI to cut cycle times in half? To expand market share by 30%? To reduce customer churn by 15%?</em></p><p>And they articulate it relentlessly. When Satya Nadella reframed Microsoft as an &#8220;AI-first&#8221; company, it wasn&#8217;t a slogan. It became a mandate guiding everything from product to hiring to partnerships.</p><p><strong>Your move:</strong> If you can&#8217;t explain in one sentence how AI ties directly to your company&#8217;s competitive advantage, you don&#8217;t have a strategy; you have a hobby.</p><div><hr></div><h3>2. Build the Foundations: Data and Integration</h3><p>Here&#8217;s the harsh reality: AI is only as good as the data it learns from. And most companies have messy, siloed, low-integrity data.</p><p>Deploying sophisticated AI on top of bad data doesn&#8217;t just fail; it creates <strong>sophisticated failures.</strong></p><p>That&#8217;s why leading firms invest heavily in centralized, clean, and integrated data ecosystems before scaling AI. 
They also avoid point-solution chaos by building platforms with open APIs, so systems can talk to each other.</p><p>Amazon didn&#8217;t dominate because it had the best algorithms; it dominated because it had the most complete, unified data about customers.</p><p><strong>Your move:</strong> Don&#8217;t just ask &#8220;What AI tool should we buy?&#8221; Ask &#8220;Do we have the foundation for AI to actually work?&#8221;</p><div><hr></div><h3>3. Invest in an AI-Ready Workforce and Culture</h3><p>AI doesn&#8217;t replace people; it <strong>reshapes their roles.</strong> But if your team sees AI as a threat, adoption dies before it begins.</p><p>The companies pulling ahead are making AI literacy a baseline skill, not a niche one. According to PwC, employees who feel AI-fluent report <strong>96% higher productivity and job satisfaction.</strong></p><p>Even more important, these organizations create <strong>psychological safety</strong>: the freedom to reinvent processes without fear of failure. Middle managers aren&#8217;t told to &#8220;optimize&#8221; old workflows; they&#8217;re encouraged to design entirely new ones.</p><p><strong>Your move:</strong> Train your workforce not just to <em>use</em> AI, but to <em>think</em> AI.</p><div><hr></div><h3>4. Execute in Phases, Tied to ROI</h3><p>The final mandate is discipline. Too many leaders chase &#8220;big bang&#8221; transformations. They throw millions at AI and hope scale will equal success.</p><p>It rarely does. The leaders who succeed take a <strong>phased, ROI-driven approach.</strong></p><p>They start with one high-impact pilot: &#8220;Increase win rate from 22% to 25% in 90 days.&#8221; When that works, they expand. Each phase validates the next.</p><p>It&#8217;s not glamorous, but it&#8217;s effective.</p><p><strong>Your move:</strong> Anchor every AI initiative to a measurable outcome. 
If you can&#8217;t define the metric, don&#8217;t fund the project.</p><div><hr></div><h2>Why This Matters Now</h2><p>The shift to AI-native organizations isn&#8217;t a distant trend; it&#8217;s already reshaping markets. Firms that build these four leadership muscles will widen the gap fast.</p><p>Those that don&#8217;t? They&#8217;ll keep spending money on AI while watching competitors compound advantages. And because AI advantages compound, the window to catch up is shorter than most leaders realize.</p><div><hr></div><h2>A Practical Step to Take This Week</h2><p>Here&#8217;s one actionable tip: <strong>pick one AI initiative in your organization and force it through a simple filter.</strong></p><p>Ask:</p><ol><li><p>Does it tie to a clear business outcome?</p></li><li><p>Is our data foundation strong enough to support it?</p></li><li><p>Do our people know how to use and extend it?</p></li><li><p>Can we measure ROI in the next 90 days?</p></li></ol><p>If the answer is &#8220;no&#8221; to more than one, pause and fix the fundamentals before moving forward. You&#8217;ll save resources, but more importantly, you&#8217;ll build credibility when AI projects <em>do</em> succeed.</p><div><hr></div><h2>Your Turn</h2><p>Now, I want to hear from you:</p><ul><li><p>Which of these four mandates feels the hardest in your organization right now, and why?</p></li><li><p>Have you seen a project die because leadership treated AI as tech instead of strategy?</p></li><li><p>Or, on the flip side, have you seen a leader set a clear mandate that unlocked real results?</p></li></ul><p>Drop your thoughts in the comments; I read every one. The best insights often come from readers like you, not the article itself.</p><p>And remember: <strong>AI isn&#8217;t the strategy. 
Leadership is.</strong><p></p>]]></content:encoded></item><item><title><![CDATA[The Death of Junior Hours: Why Consulting Needs an AI Engine Now]]></title><description><![CDATA[Fixed-fee projects are bleeding margins. Here&#8217;s how firms are replacing &#8220;hours&#8221; with an AI department that never sleeps.]]></description><link>https://marcosnewsletter.substack.com/p/the-death-of-junior-hours-why-consulting</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/the-death-of-junior-hours-why-consulting</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Wed, 10 Sep 2025 06:08:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oIPW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ece5e96-dcf4-4caf-9022-93a75a907518_974x687.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most consulting firms I know are running on a fragile equation: <strong>fixed-fee contracts + armies of junior hours + inconsistent first drafts</strong>. 
It&#8217;s a formula that worked in the 2000s, when clients tolerated bloated decks and longer timelines. Today, it&#8217;s a slow bleed. Procurement is tougher, margins are thinner, and clients are asking why they should pay for &#8220;training wheels&#8221; time when generative AI can deliver baseline work in minutes.</p><p>And here&#8217;s the uncomfortable truth: in this new equation, <strong>junior hours are worth less than ever, but expectations for speed and quality are higher than ever</strong>.</p><div><hr></div><h2>The hidden pain every firm is feeling</h2><p>You know the symptoms. A partner signs a fixed-fee deal with a client who&#8217;s already benchmarked you against three other firms. The scope is ambitious, but your slide factory is overloaded. First drafts take weeks. Quality is inconsistent. Teams quietly pad hours just to hit deadlines.</p><p>Margins erode invisibly until the final tally: a project that looked profitable on paper ends up barely covering costs.</p><p>Meanwhile, junior analysts, once the backbone of leverage, are under fire. 
Clients push back: <em>&#8220;Why am I paying for four associates to churn slides when I know AI can draft the same content?&#8221;</em> Some procurement teams now refuse to pay for more than a token number of junior hours.</p><p>This leaves firms trapped: you can&#8217;t deliver without them, but you can&#8217;t justify their cost.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oIPW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ece5e96-dcf4-4caf-9022-93a75a907518_974x687.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oIPW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ece5e96-dcf4-4caf-9022-93a75a907518_974x687.png 424w, https://substackcdn.com/image/fetch/$s_!oIPW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ece5e96-dcf4-4caf-9022-93a75a907518_974x687.png 848w, https://substackcdn.com/image/fetch/$s_!oIPW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ece5e96-dcf4-4caf-9022-93a75a907518_974x687.png 1272w, https://substackcdn.com/image/fetch/$s_!oIPW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ece5e96-dcf4-4caf-9022-93a75a907518_974x687.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oIPW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ece5e96-dcf4-4caf-9022-93a75a907518_974x687.png" width="974" height="687" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ece5e96-dcf4-4caf-9022-93a75a907518_974x687.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:687,&quot;width&quot;:974,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1374709,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/172095315?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ece5e96-dcf4-4caf-9022-93a75a907518_974x687.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oIPW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ece5e96-dcf4-4caf-9022-93a75a907518_974x687.png 424w, https://substackcdn.com/image/fetch/$s_!oIPW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ece5e96-dcf4-4caf-9022-93a75a907518_974x687.png 848w, https://substackcdn.com/image/fetch/$s_!oIPW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ece5e96-dcf4-4caf-9022-93a75a907518_974x687.png 1272w, https://substackcdn.com/image/fetch/$s_!oIPW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ece5e96-dcf4-4caf-9022-93a75a907518_974x687.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>You&#8217;re not alone; this is systemic</h2><p>This isn&#8217;t just a boutique-firm issue. Look at what&#8217;s happening around the industry:</p><ul><li><p><strong>Klarna</strong> rolled out an AI assistant that now handles <strong>66% of customer chats, the equivalent of 700 full-time agents</strong>, with higher satisfaction scores than humans.</p></li><li><p><strong>GitHub Copilot</strong>: developers completed coding tasks <strong>55% faster</strong> with AI, in controlled tests.</p></li><li><p><strong>McKinsey&#8217;s 2024 AI report</strong>: 65% of companies already use GenAI regularly, with early adopters reporting measurable cost savings and productivity gains.</p></li></ul><p>The message is clear: AI isn&#8217;t a gadget anymore; it&#8217;s a <strong>production engine</strong>. 
If Klarna can run a customer service department at one-third the cost, why should a CFO keep paying for &#8220;junior slides&#8221;?</p><div><hr></div><h2>The wrong way to use AI in consulting</h2><p>Most firms are still tinkering. A few analysts experiment with ChatGPT on the side. Someone builds a copilot to draft an RFP response. A practice lead tries an off-the-shelf summarizer. These are point solutions: useful, but fragile.</p><p>The problem: <strong>AI framed as a team helper, not as a governed profit center</strong>. Without rules, metrics, or pricing models, AI remains a toy. Procurement notices. Clients notice. Eventually, so will your P&amp;L.</p><div><hr></div><h2>A different path: the AI Engagement Engine</h2><p>Imagine flipping the frame. Instead of &#8220;analysts using AI,&#8221; think of <strong>a digital operations layer inside your firm</strong>.</p><p>This layer:</p><ul><li><p>Takes structured briefs and turns them into deliverables: competitive teardowns, cost baselines, sensitivity analyses, storyboards.</p></li><li><p>Runs on <strong>A-SLAs</strong> (AI Service Level Agreements) that define latency, confidence thresholds, and escalation rules.</p></li><li><p>Produces transparent metrics: error rates, adherence to templates, factual grounding.</p></li><li><p>Escalates only the tough edge cases to human consultants.</p></li></ul><p>The result: you replace the weakest link, unscalable junior hours, with a <strong>machine-assisted baseline that&#8217;s faster, cheaper, and defensible to clients</strong>.</p><p>This isn&#8217;t sci-fi. Firms can stand up a pilot in 60 days. Call it an <strong>AI Engagement Engine</strong>: a 24/7 department of micro-services with governance baked in.</p><div><hr></div><h2>A framework to make it real</h2><p>I call it <strong>AMPP</strong>: Automate, Manage, Prove, Price.</p><ol><li><p><strong>Automate</strong>: Identify 8&#8211;12 repeatable micro-services (RFP drafts, spend analysis, teardown slides). 
Define canonical inputs and &#8220;Definition of Done&#8221; outputs.</p></li><li><p><strong>Manage</strong>: Build a RACI around who leads each task (AI-led, human-led, or dual). Set A-SLAs, guardrails for sensitive data, and a &#8220;kill switch&#8221; for rollback.</p></li><li><p><strong>Prove</strong>: Run the engine against 50 historical cases. Measure factuality, adherence to templates, and groundedness. Track correction rates.</p></li><li><p><strong>Price</strong>: Shift from hourly billing to <strong>outcome pricing</strong>: a floor fee plus a share of automation savings or compressed delivery time.</p></li></ol><p>The pilot canvas can be small: one client, two micro-services, 200 indexed documents, a lead partner plus a QA reviewer. The goal? Less than 30% of the manual time to first draft, and fewer than 10% unnecessary escalations.</p><div><hr></div><h2>Why this matters now</h2><p>Margins are being squeezed harder than ever. Procurement teams are benchmarking everything. EU regulation on AI is tightening. And the truth is brutal: <strong>the firms that turn AI into an operating engine today will dictate tomorrow&#8217;s prices and timelines</strong>.</p><p>Everyone else will be stuck defending hours that clients no longer believe in.</p><div><hr></div><h2>A practical tip you can use this week</h2><p>Don&#8217;t start with a grand strategy. <strong>Pick one repeatable service</strong> that every associate in your practice already dreads: say, drafting the baseline of an RFP.</p><ul><li><p>Collect 50 past examples.</p></li><li><p>Feed them into a private RAG index with your firm&#8217;s style guides and templates.</p></li><li><p>Define one simple SLA: if model confidence is below 0.25, escalate to human review.</p></li><li><p>Time the difference between the AI draft and the manual draft.</p></li></ul><p>That&#8217;s your first proof point. 
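The confidence-gate SLA in that list can be sketched in a few lines of Python. This is a toy illustration: the 0.25 floor comes from the tip above, but the Draft fields and routing labels are invented for the sketch, not any vendor's API.

```python
# Minimal sketch of the escalation SLA: drafts below a confidence
# floor go to a human reviewer; everything else ships as the
# machine-assisted baseline. Field and label names are illustrative.
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.25  # per the SLA: below this, escalate

@dataclass
class Draft:
    text: str
    confidence: float  # model's self-reported confidence, 0.0-1.0

def route(draft: Draft) -> str:
    """Return 'human_review' or 'auto_publish' per the SLA."""
    if draft.confidence < CONFIDENCE_FLOOR:
        return "human_review"  # the tough edge cases
    return "auto_publish"      # the machine-assisted baseline

print(route(Draft("Baseline RFP section...", 0.18)))  # human_review
print(route(Draft("Standard spend table...", 0.91)))  # auto_publish
```

The point of writing the rule down this way is that the threshold becomes an auditable, versioned artifact rather than a judgment call buried in someone's workflow.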
From there, you can scale.</p><div><hr></div><h2>Your turn</h2><p>I&#8217;m curious:</p><ul><li><p>Do you see your firm still clinging to the &#8220;junior hours&#8221; model?</p></li><li><p>If procurement slashed junior billing tomorrow, how exposed would your margins be?</p></li><li><p>And what&#8217;s the one micro-service you&#8217;d automate first if you had a governed AI engine running in the background?</p></li></ul><p>Drop a comment; I&#8217;d love to hear how your teams are wrestling with this.</p>]]></content:encoded></item><item><title><![CDATA[Archaeology Meets AI and the Ops Lesson on Clarity in Chaos]]></title><description><![CDATA[If AI can uncover lost cities in the jungle, what excuse do we have for not seeing the patterns in our own operations?]]></description><link>https://marcosnewsletter.substack.com/p/archaeology-meets-ai-and-the-ops</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/archaeology-meets-ai-and-the-ops</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Mon, 08 Sep 2025 06:08:22 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!23yn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F841cf7a3-63de-4942-9a64-2e6a9731fc87_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let&#8217;s begin with something bright and alive: imagine hundreds of digital explorers, mostly strangers, scattered around the globe, huddled over satellite maps, chat-AI outputs, and cluttered desktop screens, all united by a curious itch: <em>what&#8217;s hiding beneath the Amazon canopy?</em> That&#8217;s our starting point: a challenge that is at once technical, creative, and deeply human. It&#8217;s the kind of story that makes you lean in, not because you have to, but because you <em>want</em> to.</p><h2>When Data Feels Like Dirt Under Your Fingernails</h2><p>Here&#8217;s the real pain we often shrug off: wading through endless, complex datasets. Imagine being handed LIDAR scans, multispectral satellite images, temperature readings, colonial-era survey papers, and oral histories from Indigenous communities, and told to surface something meaningful. It&#8217;s dense. It&#8217;s messy. It&#8217;s overwhelming.</p><p>You&#8217;ve felt that, right? Tapping out with &#8220;data fatigue&#8221; while sorting through CSVs. Or freezing mid-scroll because your visualization isn&#8217;t communicating anything more than your confusion. That moment when you think this mountain of noise will never turn into insight. The <strong>OpenAI to Z community</strong> felt that, too. And they didn&#8217;t flinch.</p><h2>We Are Not Alone: Thousands of Us Facing the Same Wilderness</h2><p>Here&#8217;s what matters: you&#8217;re not the only one staring into the void. Over <strong>8,000 Kagglers</strong> joined the challenge, each wrestling with the same impossibly rich, messy material, each hungering for a breakthrough in the noise. 
More than <strong>200 submissions</strong>, each a unique lens on the same problem, emerged, ranging from predictive maps of potential archaeological hotspots to entirely novel discovery methods born of curiosity and experimentation (<a href="https://www.kaggle.com/c/openai-to-z-challenge?utm_source=chatgpt.com">Kaggle</a>).</p><p>That&#8217;s not a static leaderboard; it&#8217;s a forest of approaches, blossoming simultaneously. It&#8217;s hope in iteration. It&#8217;s proof that at the moment you think you&#8217;re alone in your overwhelm, dozens, hundreds, or thousands are right there with you.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!23yn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F841cf7a3-63de-4942-9a64-2e6a9731fc87_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!23yn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F841cf7a3-63de-4942-9a64-2e6a9731fc87_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!23yn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F841cf7a3-63de-4942-9a64-2e6a9731fc87_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!23yn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F841cf7a3-63de-4942-9a64-2e6a9731fc87_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!23yn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F841cf7a3-63de-4942-9a64-2e6a9731fc87_1024x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!23yn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F841cf7a3-63de-4942-9a64-2e6a9731fc87_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/841cf7a3-63de-4942-9a64-2e6a9731fc87_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1800455,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://marcosnewsletter.substack.com/i/172858009?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F841cf7a3-63de-4942-9a64-2e6a9731fc87_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!23yn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F841cf7a3-63de-4942-9a64-2e6a9731fc87_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!23yn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F841cf7a3-63de-4942-9a64-2e6a9731fc87_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!23yn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F841cf7a3-63de-4942-9a64-2e6a9731fc87_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!23yn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F841cf7a3-63de-4942-9a64-2e6a9731fc87_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>From Frustration to Flight: The Turning Point</h2><p>Then there&#8217;s the winning team, &#8220;Black Bean,&#8221; who turned that load of raw data and noise into <strong>67 candidate patches</strong>, each about a square mile, identified as promising sites for future archaeological exploration (<a 
href="https://www.nationalgeographic.com/science/article/amazon-openai-competition-archaeology-machine-learning-artificial-intelligence?utm_source=chatgpt.com">National Geographic</a>).</p><p>They didn&#8217;t just lean on brute-force machine learning. Instead, they:</p><ul><li><p>Fused <strong>public LiDAR datasets, Google Earth Engine satellite imagery, and NASA elevation data</strong>.</p></li><li><p>Leveraged <strong>GPT-4o</strong> to recognize patterns in known sites and extrapolate to new, unexplored areas, especially those clustered near waterways, which makes intuitive sense for ancient civilizations (<a href="https://www.nationalgeographic.com/science/article/amazon-openai-competition-archaeology-machine-learning-artificial-intelligence?utm_source=chatgpt.com">National Geographic</a>).</p></li><li><p>Built something that&#8217;s <strong>reproducible</strong>, so archaeologists can actually follow their steps.</p></li></ul><p>That is elegance rising from chaos. It&#8217;s the oxygen of clarity breathing through overwhelm. And it&#8217;s organic: rooted in reason and grounded in common sense, even as it leveraged bleeding-edge AI.</p><h2>What This Means for You (And Me)</h2><p>What I take away, and what might help you, is that the impossibly complex problems we face are also opportunities for collective ingenuity. You don&#8217;t have to single-handedly solve them. You can start by:</p><ul><li><p>Embracing complexity instead of shrinking from it.</p></li><li><p>Mining adjacent disciplines or data sources you wouldn&#8217;t normally use.</p></li><li><p>Letting pattern-recognition tools do the heavy lifting, but never outsourcing your insight.</p></li><li><p>Designing for reproducibility, not just results.</p></li></ul><p>The hackathon wasn&#8217;t about perfect accuracy. It was about creative portfolios: multiple ways forward. 
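To make the waterway-proximity idea concrete, here is a toy sketch of ranking candidate patches by blending a model score with distance to water. The blend weights, the scoring formula, and the sample patches are all invented for illustration; the actual Black Bean pipeline fused LiDAR, satellite imagery, and elevation data at far greater scale.

```python
# Toy sketch: rank candidate patches by a blend of model score and
# proximity to the nearest waterway. Weights and data are invented
# for illustration only, not the competition team's actual method.
def rank_patches(patches, w_model=0.7, w_water=0.3):
    """patches: list of (patch_id, model_score, km_to_nearest_river).
    Returns patch ids, best candidate first."""
    def blended(patch):
        _, score, km = patch
        proximity = 1.0 / (1.0 + km)  # decays with distance to water
        return w_model * score + w_water * proximity
    return [pid for pid, _, _ in sorted(patches, key=blended, reverse=True)]

candidates = [
    ("A", 0.80, 12.0),  # strong model score, far from water
    ("B", 0.75, 0.5),   # good score, right on a river
    ("C", 0.40, 0.2),   # weak score, close to water
]
print(rank_patches(candidates))  # ['B', 'A', 'C']
```

Even this crude blend reorders the candidates: a patch with a slightly weaker model score but river access outranks a stronger but landlocked one, which is exactly the kind of prior the team encoded.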
That&#8217;s your template for experimentation in your daily challenges.</p><h2>A Tiny Action Tonight That Changes Everything</h2><p>Here&#8217;s a small, immediate move you can make:</p><ol><li><p>Pick one project or problem that&#8217;s stalled because of data overwhelm.</p></li><li><p>Ask yourself: <strong>what is </strong><em><strong>one</strong></em><strong> unconventional data source or approach you&#8217;ve dismissed that could actually unlock a new perspective?</strong></p></li><li><p>Then spend just <strong>20 minutes</strong> exploring it. Sketch a map of &#8220;what might this show me&#8230;?&#8221; Use a search engine, a ChatGPT query, or a colleague&#8217;s suggestion; you&#8217;re not committing, just poking at possibilities.</p></li></ol><p>That little nudge, letting curiosity win rather than demanding perfection, is enough to spark clarity.</p><div><hr></div><p>So here we are again, at an end that feels like a new beginning. You&#8217;re still overwhelmed, yes, but now you have company, inspiration, and a tiny experiment waiting.</p><p><strong>What unconventional angle might be lying dormant in your work, if only you gave it 20 minutes of curiosity</strong>? 
I&#8217;d love to hear what sparks for you; drop a comment and let&#8217;s ignite the conversation: <strong>what buried connection are </strong><em><strong>you</strong></em><strong> ready to bring to light?</strong></p><div><hr></div><h1>Full Detailed Report + Executive Summary at the End</h1><h1>Embracing Complexity and AI in Operations for Professional Services</h1><p><strong>Introduction:</strong><br>In today&#8217;s tech-driven landscape, operations teams, especially in professional services, face both unprecedented complexity and opportunity. A recent internal <strong>AI hackathon</strong> at a tech firm demonstrated how embracing new AI tools can streamline workflows and spark innovation. However, the hackathon is just a starting point. The real focus is how <strong>operations</strong> in tech and professional service organizations can strategically leverage AI over the next three years. This report examines four key principles for operational excellence in an AI-enabled era, combining strategic vision with practical action. Each principle is grounded in real cases and current practices, showing that these ideas are <strong>feasible</strong> and can lead to concrete improvements rather than being merely theoretical.</p>
      <p>
          <a href="https://marcosnewsletter.substack.com/p/archaeology-meets-ai-and-the-ops">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Why 80% of AI projects fail (and how to avoid it)]]></title><description><![CDATA[Most AI initiatives collapse before launch not because of tech, but because of unclear outcomes.]]></description><link>https://marcosnewsletter.substack.com/p/why-80-of-ai-projects-fail-and-how</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/why-80-of-ai-projects-fail-and-how</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Fri, 05 Sep 2025 06:08:26 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4b79e6af-df2a-4728-87ff-5ea162970f9d_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It usually starts with excitement. A new budget cycle, a fresh initiative, the CEO inspired by the latest headline about how &#8220;AI is transforming every industry.&#8221; A team gets the green light for an AI project, the consultants are called in, and the mood is optimistic.</p><p>But underneath the enthusiasm, there&#8217;s often something unspoken: the expectation that AI will act like a magic pill. A quick fix for old inefficiencies. A way to leapfrog years of messy operations and finally land in the future.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Marco&#8217;s Substack! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>That&#8217;s the myth. And it&#8217;s a dangerous one.</p><div><hr></div><h2>When Expectations Collide with Reality</h2><p>If you&#8217;ve ever been inside one of these projects, you know the story. The kickoff workshop feels full of possibility: everyone talking about automation, intelligence, cost savings. But then comes the first real roadblock.</p><ul><li><p>The data isn&#8217;t clean enough.</p></li><li><p>The workflows aren&#8217;t standardized.</p></li><li><p>Nobody can agree on the actual outcome the project should deliver.</p></li></ul><p>What happens next is almost predictable. Meetings drag on. The vendor insists on more custom work. The &#8220;pilot&#8221; never fully launches. And in the end, the project gets shelved quietly, labeled as &#8220;interesting exploration&#8221; rather than the transformation it was supposed to be.</p><p>People rarely admit it, but the real frustration is this: <strong>AI wasn&#8217;t a cure for broken processes. It just made the weaknesses more visible.</strong></p><div><hr></div><h2>You&#8217;re Not Alone: The Data Proves It</h2><p>This isn&#8217;t just anecdotal. Gartner has found that <strong>80% of AI projects never make it past the pilot phase</strong>. Forrester has reported that many initiatives stall because they lack clear business outcomes from the start.</p><p>And when you look closer, the failures don&#8217;t come from the algorithms themselves. 
They come from human factors: vague goals, unrealistic expectations, and the hope that technology can &#8220;cover&#8221; structural gaps in how work actually happens.</p><p>If you&#8217;ve felt the sting of an AI project that didn&#8217;t live up to the hype, you&#8217;re far from alone. In fact, you&#8217;re in the majority.</p><div><hr></div><h2>The Quick Fix vs. The Long Game</h2><p>The trap is easy to see in hindsight. Companies treat AI like a <strong>patch</strong>: a way to fix inefficiencies without addressing the underlying causes.</p><ul><li><p>Need better customer service? Let&#8217;s put AI on top of the chaos.</p></li><li><p>Struggling with supply chain delays? Let&#8217;s add AI dashboards and hope they solve it.</p></li><li><p>Sales data all over the place? Maybe an AI assistant can make sense of it.</p></li></ul><p>But AI isn&#8217;t duct tape. It doesn&#8217;t stick neatly over structural problems. It&#8217;s more like a magnifying glass: it amplifies what&#8217;s already there. If the foundation is shaky, the cracks widen.</p><p>The organizations that actually win with AI approach it differently. They don&#8217;t see it as a shortcut. They see it as a <strong>strategic investment</strong> that only works when paired with clear, measurable outcomes.</p><div><hr></div><h2>What the Winners Do Differently</h2><p>Look at the companies that have made AI more than a buzzword. They start not with &#8220;let&#8217;s add AI,&#8221; but with <strong>&#8220;what outcome do we need to achieve?&#8221;</strong></p><ul><li><p>Reduce manual work so teams get back 10 hours a week.</p></li><li><p>Shorten customer onboarding time by 30%.</p></li><li><p>Increase forecast accuracy by 15%.</p></li></ul><p>Once the outcomes are clear, AI becomes a tool to reach them, not the other way around. And suddenly, the hype turns into something more sustainable: progress you can measure.</p><p>The best part? 
This shift in mindset doesn&#8217;t just make AI more effective; it makes teams more aligned. People stop chasing shiny tools and start focusing on results that matter.</p><div><hr></div><h2>A Small, Practical Step for Right Now</h2><p>Here&#8217;s something you can try immediately. Next time an AI proposal crosses your desk, ask one simple question:</p><p><strong>&#8220;What&#8217;s the outcome we&#8217;re committing to, even if we didn&#8217;t use AI?&#8221;</strong></p><p>If the answer is fuzzy, pause. That&#8217;s your signal: the project is based on hope, not strategy. If the answer is clear, then AI has a real chance to add value because it&#8217;s tied to something concrete.</p><div><hr></div><h1>Let&#8217;s Open This Up</h1><p>I&#8217;ll leave you with this: the myth of &#8220;AI will fix everything&#8221; isn&#8217;t going away. The headlines will keep coming, the pressure to &#8220;do something with AI&#8221; will stay. But the real opportunity belongs to the people who resist the myth, cut through the noise, and focus on outcomes first.</p><p>Now I want to hear from you:</p><ul><li><p>Have you ever been in an AI project that fell flat? What went wrong?</p></li><li><p>Do you think most organizations are chasing AI as a patch, or investing in it as a strategy?</p></li><li><p>What&#8217;s the clearest, most outcome-driven AI initiative you&#8217;ve seen?</p></li></ul><p>Drop your thoughts in the comments; I&#8217;m curious to hear your stories and perspectives.</p>]]></content:encoded></item><item><title><![CDATA[Professional Services as We Know It Is Dead - Here’s What Comes Next]]></title><description><![CDATA[The AI wave won&#8217;t destroy Professional Services, it will expose who&#8217;s ready to evolve and who gets left behind.]]></description><link>https://marcosnewsletter.substack.com/p/professional-services-as-we-know</link><guid isPermaLink="false">https://marcosnewsletter.substack.com/p/professional-services-as-we-know</guid><dc:creator><![CDATA[Marco]]></dc:creator><pubDate>Wed, 03 Sep 2025 06:08:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d14f1a67-675b-4d5d-9b3b-656dabbaa0f8_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>The Dawn of a Brighter Professional Services Era</h1><p>I hope this finds you curious and cautiously optimistic. Let&#8217;s talk about a shift that feels both unsettling and exciting: what happens when AI starts doing nearly half of what you'd once think was purely human work?</p><h2>The Hidden Load You Carry</h2><p>Imagine your team at the start of the week: juggling scoping documents, detailed delivery spreadsheets, project kickoff decks, status&#8239;updates, invoice checks, repeatable configurations, quality reviews, <strong>all the small tasks</strong> that devour time and life. You feel like you&#8217;re firefighting, not strategizing. 
Maybe you catch yourself muttering, &#8220;If I could spend more time thinking, not just doing&#8230;&#8221;.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://marcosnewsletter.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Marco, Future&#8209;Ops Insights is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>That&#8217;s the <strong>real pain</strong>. When nearly 40&#8211;50% of operational, repeatable tasks, especially language-based and documentation-heavy work, are swallowed by AI, or are ready to be, it doesn&#8217;t just shift who does the work. It rattles your sense of purpose in the professional services world.</p><h2>You're Far From Alone</h2><p>You&#8217;re not the only one staring at this transformation. Accenture research estimates that around <strong>40% of working hours</strong>, especially in tasks that rely on language, can be impacted by LLMs, with about <strong>65%</strong> of those hours ripe for automation or augmentation (<a href="https://www.accenture.com/content/dam/accenture/final/accenture-com/document/Accenture-A-New-Era-of-Generative-AI-for-Everyone.pdf?utm_source=chatgpt.com">Accenture</a>).</p><p>And this isn&#8217;t some distant, hypothetical future. 
Accenture reports that <strong>74% of organizations</strong> investing in generative AI and automation are meeting or exceeding expectations, and these &#8220;reinvention&#8209;ready&#8221; firms are already achieving <strong>2.4&#215; productivity</strong> improvements and <strong>2.5&#215; revenue growth</strong> (<a href="https://newsroom.accenture.com/news/2024/new-accenture-research-finds-that-companies-with-ai-led-processes-outperform-peers?utm_source=chatgpt.com">Accenture Newsroom</a>).</p><p>TSIA points at the move toward AI&#8209;powered, subscription&#8209;oriented, outcome&#8209;centric models (what they call PS 2.0) as the way forward. It&#8217;s not just about automation, but reshaping delivery models to continuously drive measurable outcomes (<a href="https://www.tsia.com/research/professional-services-2-0?utm_source=chatgpt.com">TSIA</a>).</p><h2>When Systems Do the Routine, You Can Lead</h2><p>The light at the end of the tunnel: AI doesn&#8217;t erase the need for human insight; it frees us to focus on <strong>strategy, creativity, empathy, interpretation</strong>. It demands new roles, new expertise, and new business models.</p><p>Accenture&#8217;s &#8220;reinvention services&#8221; is a case in point: they&#8217;re blending strategy, consulting, tech, and operations into one AI&#8209;powered unit. One client example? An AI&#8209;enabled ship that predicts its own maintenance, manages energy, and communicates with ports (<a href="https://www.businessinsider.com/accenture-ai-reinvention-services-earnings-ceo-julie-sweet-2025-6?utm_source=chatgpt.com">Business Insider</a>). That&#8217;s not replacing people; it&#8217;s redefining how people and systems collaborate.</p><p>And this isn&#8217;t theoretical. 
The World Economic Forum notes that roles like AI/ML specialists, data analysts and transformation strategists are <strong>growing rapidly</strong>, even as more clerical or secretarial duties recede (<a href="https://www.weforum.org/stories/2023/05/jobs-lost-created-ai-gpt/?utm_source=chatgpt.com">World Economic Forum</a>).</p><h2>What&#8217;s the Path Forward in Practice</h2><p>Here&#8217;s where we zoom in. Skipping flashy promises for grounded, purposeful strategy:</p><ol><li><p><strong>Redesign your delivery model</strong>: Shift from &#8220;billable hours&#8221; to &#8220;PS&#8209;as&#8209;a&#8209;service,&#8221; where outcomes are the unit of value (<a href="https://www.tsia.com/research/professional-services-2-0?utm_source=chatgpt.com">TSIA</a>).</p></li><li><p><strong>Conceive new roles</strong>: Intelligence&#8209;curators, AI&#8209;workflow architects, client outcome designers: roles that blend domain knowledge with AI&#8209;fluency.</p></li><li><p><strong>Build your AI governance and strategy stack</strong>: Align senior leadership, roadmap, KPIs, training, and feedback loops. McKinsey finds this kind of setup correlates strongly with actual bottom&#8209;line impact (<a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai?utm_source=chatgpt.com">McKinsey &amp; Company</a>).</p></li><li><p><strong>Embed learning and incentives</strong>: Per Accenture and McKinsey findings, employees are eager to use AI, but they need training, seamless integration, and encouragement, not just mandates (<a href="https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/superagency-in-the-workplace-empowering-people-to-unlock-ais-full-potential-at-work?utm_source=chatgpt.com">McKinsey &amp; Company</a>, <a href="https://www.accenture.com/us-en/insights/consulting/gen-ai-reinventing-enterprise-models?utm_source=chatgpt.com">Accenture</a>).</p></li></ol><p>This isn&#8217;t about reacting to AI&#8217;s arrival. 
It&#8217;s a deliberate invitation to <strong>reinvent how we deliver value</strong> in professional services.</p><div><hr></div><h3>Action Happens Now: A Practical Kick-Start</h3><p>Here&#8217;s a <strong>concrete move you can take today</strong>: <strong>run a 90-minute &#8220;AI-task mapping session&#8221; with your team.</strong> Grab a whiteboard (virtual or physical). Ask: <em>Which tasks do we repeat most? Which bleed us dry?</em> Then, categorize them:</p><ul><li><p>Automatable</p></li><li><p>Augmentable</p></li><li><p>High-value, human-only work</p></li></ul><p>Within an hour and a half, you&#8217;ll not only illuminate where AI could rescue bandwidth, but also discover where your people&#8217;s strategic thinking should go next. That simple session often unlocks the jump to PS 2.0.</p><p>You&#8217;re not just surviving this transition; you can steer it.</p><p>I&#8217;d love to hear from you: <strong>In your work today, where do you already feel the drag of repetitive tasks? What would you do if 50% of that workload were suddenly handled for you?</strong> Share your examples or wildest dreams in the comments. Let&#8217;s unpack what &#8220;reinvention&#8221; could look like, together.</p><p>Here&#8217;s to building smarter, more human-centered professional services.</p>]]></content:encoded></item></channel></rss>