Most of Your AI Work Is About to Need Redoing

Every workflow you ship is built against a model. When the model gets shut off, the workflow needs rewriting.

By The Benchmark 7 min read
Most of Your AI Work Is About to Need Redoing

OpenAI told you the work doesn't carry over

On October 23, 2026 — a Friday five months from now — four OpenAI models go dark on the same day. That is not a rumor. It is the date on OpenAI's own published deprecations page, set against four model versions a great many enterprise workflows are running on right now. Every prompt your team wrote against any of them needs to be found, rewritten against the replacement, and re-tested before the lights go out.

OpenAI is not subtle about which part of that is your job. Its own prompt-guidance page tells customers, in writing, to "avoid carrying over every instruction from an older prompt stack." The reason it gives: "Legacy prompts often over-specify the process because earlier models needed more help staying on track. With GPT-5.5, that can add noise, narrow the model's search space, or lead to overly mechanical answers."

That is the vendor telling you the work does not move. The four retiring identifiers — gpt-4-0613, gpt-4o-2024-05-13, gpt-4.1-nano-2025-04-14, and o4-mini-2025-04-16 — are listed on the deprecations page with one shared shutdown date. Every prompt running against any of them needs to be reworked against the replacement, because the replacement does not read the way the old model does.

OpenAI knows it. The same docs ship an automated migration command — $openai-docs migrate this project to gpt-5.5 — because the vendor itself acknowledges customers have rework on their hands. Tooling that exists is tooling somebody decided was necessary.

Five months out, here is the question. Does your team know which workflows run on the retiring models, who owns the rewrite, and whether anyone has started? If the answer is "we'll check," October 23 is closer than it looks.

The clock you didn't set

Microsoft publishes the math in its own documentation. General-availability models in Azure Foundry come with a retirement date set programmatically at 18 months from launch. At the 12-month mark, the model moves to a deprecated stage — existing customers can keep using it, but new customers are blocked.

Preview models move faster. You get at least 30 days' notice before you're force-upgraded to a newer preview, pushed to general availability, or simply cut off. There is no option to stay on the preview model you built against.

That is not a rumor. That is the vendor telling you, in writing, when your work stops working.

Now look at how fast you're building against that clock. Gartner projected in August 2025 that 40% of enterprise apps would have task-specific AI agents by 2026, up from less than 5% in 2025. That is roughly an eightfold jump in the AI surface area inside companies, in twelve months. Each of those integrations is a prompt built against a specific model version. Each one carries an expiration date your company probably hasn't written down.

Call it prompt debt: the rework bill that arrives whenever the vendor decides your model is done. Nobody on the buying side priced the rework in. It compounds because every new AI workflow your team ships adds another future obligation to a list nobody's keeping.

Here's the part nobody's pricing in. The companies that deployed hardest in 2024 and 2025 are now sitting on a retirement wave landing in 2026 and 2027. Whether they know it depends entirely on whether anyone kept a list.

Your prompts don't move with the model

Most teams assume the prompt is a one-time cost. You build it, you test it, it works, you move on. The model upgrade gets framed as the vendor's problem.

It isn't. MIT Sloan reported on a controlled experiment with about 1,900 participants. When users were given a more advanced model, half the performance gain came not from the model — it came from the prompts the users rewrote to take advantage of it. The prompts that worked on the prior model had to change.

That is not a calibration nudge. That is the workflow your team is running every day. A prompt that pulls clauses out of a vendor contract today is reading patterns specific to one model's reasoning. The replacement reasons differently. The output drifts. Microsoft's own retirement guidance tells you what to do about it, in the only sentence on the page that gets quoted verbatim: "Apply prompt engineering and fine-tuning as needed to match prior accuracy."

That work doesn't happen on its own. Somebody has to know what the prompt was supposed to do. Somebody has to know what correct output looks like. Somebody has to test it. None of that is on the vendor's plate.

For one workflow, this is a manageable hour or two. For an organization that has shipped dozens of these across legal, HR, procurement, and customer service over eighteen months, it is a queue nobody staffed and no budget line covers.

Whether your team wrote down how the work actually got done before AI ran it is the single biggest variable in whether any of this survives. Stanford's Digital Economy Lab studied 51 enterprise AI deployments and found exactly that — the difference between success and failure wasn't model quality. It was whether anyone had actually mapped the workflow first. If your team documented the process, the prompts survive a model swap. If your team papered over the messy parts, the model swap finds every one of them.

Only about a quarter of companies are getting real value out of AI. The ones who are spent most of their effort on people and process, not tooling. BCG surveyed roughly 1,000 senior executives in 2024 and found 74% had nothing to show. The winners ran it 70/30 — process redesign first, tooling second. The companies running it the other way are the same ones piling up prompt debt the fastest, and they've got nothing under it to absorb the bill.

Klarna already showed you what that looks like operationally. The company replaced work equivalent to around 700 customer service agents with an AI assistant, then publicly reversed course in 2025 and started hiring humans back.

Everyone read that as a customer service problem. It is also a prompt-debt problem. The cost of keeping the AI workflows reliable exceeded the cost of the people. The headline got the rehire. It missed the rework.

Auto-upgrade doesn't get you off the hook

Microsoft does auto-upgrade some deployments. Standard, Data Zone Standard, and Global Standard tiers get the model swapped underneath your workflow and you don't lift a finger. If the swap is automatic, where's the debt?

Two places. First, auto-upgrade moves the model. It does not check the output. Microsoft's own page tells you the customer is on the hook for prompt rework. The infrastructure is invisible; the validation isn't. In a compliance workflow, "invisible swap, undegraded output" is the assumption you do not get to make.

Second, the carve-out doesn't cover everyone. Provisioned deployments — the ones companies use when they need guaranteed throughput — are not auto-upgraded at all. The migration is manual. If your team is running anything serious enough to provision dedicated capacity for, you are running the upgrade by hand.

The objection narrows the surface of prompt debt. It does not eliminate it. The auto-upgrade gets you a free swap. The MIT Sloan finding tells you the prompts on top still need work. Microsoft's own guidance tells you it's your work to do.

What the list will tell you

RAND's 2024 study of AI project failures found that AI projects fail at roughly twice the rate of non-AI IT projects, north of 80%. The primary failure mode wasn't the model. It was workflow integration — undocumented exceptions, unclear handoffs, no clear owner for the edge cases. The same failure mode comes back at retirement. The migration becomes a second integration project, against a workflow nobody's owned for fourteen months, run by a team that may not be the team that built it.

Here is what the October 23 shutdown is about to test. Not the model. The list. The companies that know which workflows run on the four retiring models, who owns each one, and what "correct output" looks like before the swap have a five-month rework project on their hands. The companies that don't are going to find out one drifted output at a time, starting October 24. Software teams have version control and dependency alerts for exactly this reason. Prompt inventories live in shared drives, in team wikis, in the head of the person who built the workflow — until that person changes jobs.

You don't need a transformation program to fix this. You need three things on a spreadsheet, before your next AI deployment ships: the model version each workflow runs on, the date that model retires, and the name of the person responsible for the rework when it does. The control belongs to the function head. Nobody upstream has to sign off.

The companies that survive the retirement wave will be the ones that kept the list.

The ones that didn't will discover their prompt debt the way Klarna discovered the limits of its automation — operationally, after the bill has already arrived.

You didn't buy a tool. You signed up for a maintenance schedule the vendor controls. The clock is already running. The only question is whether anyone on your team is reading it.

Sources

Subscribe

The writing senior operators send each other.

The Benchmark is a subscription publication. New writing arrives in your inbox each week. The full archive — essays, briefing notes, and reviews — is for subscribers.

No takes. No frameworks borrowed from a deck. The work is the synthesis — what's actually causing the thing you're already watching, and which of the moves on the table hold up under pressure.

01

For directors, VPs, chiefs, founders.

Written for the operator in the seam — context-rich, running execution, expected to influence strategy without owning it. You already see the systems. The publication gives you the language and the evidence to name them in the room.

02

Diagnoses, not prescriptions.

Each piece works out what's causing what you're already watching, and lays out the realistic moves. We are not in the business of telling you how to run your function. You make the call.

03

Research, not filler.

No takes. No seven-tweet summaries. No frameworks lifted from someone else's deck. Every piece is built on primary sources and named so the argument holds up the first time you read it.