HITL&ER – A Theoretical Framework for the Decline of Human Oversight in AI-Generated Code

The Slow, Inevitable Death of “Someone Needs to Check the AI’s Homework”

Look, the whole “human in the loop” thing in AI-generated code? It’s dying a gory, horrific death… only not dramatically, not overnight — but measurably, and with increasing speed, driven by benchmark data that’s honestly kind of alarming, real-world deployment numbers, and the simple fact that developers are just… trusting the machine more and more. It’s like watching your child learn to ride a bicycle after you’ve taken the training wheels off. There will be injuries; some milder than death.

This paper pulls together what we currently know (like the flesh of bicycle-related injuries under sutures) to sketch out a four-phase model of oversight decline, figure out what’s actually slowing it down, and — perhaps most importantly — identify the irreducible chunk of governance that no amount of AI capability is going to make go away before 2040.

The core argument: code review as a form of HITL is basically heading toward extinction in everyday software domains by the early 2030s. But don’t pop the champagne yet, because execution review — runtime-level oversight — is quietly ascending to fill the void.

The legal, ethical, and security-mandated human-in-the-loop? That one’s not going anywhere. Ever, probably.


1. Introduction: Yes, This Is Actually Happening

The whole concept of “human in the loop” has served two purposes — partly as quality control, and partly as an honest-to-God admission that AI wasn’t (and, the thinking went, probably never would be) good enough to be trusted on its own. Which, for most of the 2010s, was a pretty fair assessment. AI code generation was basically a novelty. A neat magic trick. Something you’d demo at a conference and then quietly put back in the box when you got home, never using it in production — because you know the mechanics of the magic trick, unlike the dopes you duped at the conference.

By 2026, though? Roughly 41% of all code produced globally is AI-generated, with 84% of developers reporting daily use of AI coding tools. Yes, it’s primarily laziness, because we’re human — and therefore it’s not a novelty anymore. That’s a structural shift, and it demands we actually sit down and ask the terribly uncomfortable question: at what point does the HITL requirement stop being a practical necessity and start being just… bureaucratic theatre (so the coders can become even lazier)?

And this isn’t an abstract philosophical question like the existence of the Flying Spaghetti Monster, by the way. It has very concrete implications — for software engineers who’d perhaps like to know whether their job will still exist in a decade (spoiler: nope), for shysters trying to figure out whom to sue when AI-generated code burns down a hospital’s records system (including the hospital actually built around it), for cybersecurity people who are already exhausted, and for the regulatory bodies across the OECD scrambling to draft governance frameworks fast enough to actually matter — which they won’t, once their quarterly pentester calls them on the golf course to let them know the breach already happened.

This paper isn’t here to tell you that human oversight is useless. It’s here to argue that what oversight looks like is going to transform in ways that most people aren’t prepared for — and that there will come a time, in the not-so-distant future, when we’ll have to let go and let our “little ones” move on once they’re all grow’d up.

One distinction has to be nailed down right at the start, because conflating these two things leads to complete nonsense conclusions. “Human in the loop” covers two structurally different activities:

  • Code review — a human looks at AI-generated code before it runs and decides whether it’s acceptable
  • Execution review — a human (or system) monitors, traces, and audits what an AI actually does at runtime

These are not the same thing. Their trajectories are, in fact, going in opposite directions. As AI gets better at writing correct code, code review becomes less necessary. But as AI agents get more autonomous and their behaviour more complex, execution review becomes more necessary, not less. Conflating the two leads to the (wrong) conclusion that HITL is disappearing wholesale. It isn’t. One form of it is receding; another is growing to replace it.


2. The Numbers, Since We’re Apparently Doing This

2.1 Benchmarks: The Steep and Mildly Terrifying Trajectory

The SWE-bench Verified leaderboard is probably the most useful public yardstick we have for measuring genuine autonomous coding capability. It tests whether AI systems can resolve real-world GitHub issues — multi-file reasoning, context understanding, test validation — under standardised conditions. Not theoretical toy problems. Real ones.

The numbers went from 4.4% in early 2023 to 71.7% by end of 2024. That’s a roughly 16-fold improvement in under two years. By mid-2025, the Refact.ai Agent hit 74.4% under strict pass@1 conditions. The Epoch AI analysis of 2025 forecasting accuracy found that AI progress exceeded predictions in most areas — with SWE-bench being one of the few where performance slightly underperformed the most aggressive forecasts. A rare plateau signal, or perhaps just a speed bump. We’ll come back to that.

2.2 Anthropic’s Real-World Data (February 2026) — And It’s Kind of Wild

The most directly relevant dataset here is Anthropic’s February 2026 study “Measuring AI Agent Autonomy in Practice”, which dug through millions of real Claude Code sessions. The headline findings:

  • The 99.9th percentile of autonomous session duration nearly doubled between October 2025 and January 2026 — from under 25 minutes to over 45 minutes
  • This growth was gradual and model-agnostic — in layman’s terms, it wasn’t just one new model release causing a jump; it’s trust accumulation, deployment maturation, people getting comfortable (so it might be due to a gradual shift in human mindset, too)
  • Human interventions per session dropped from 5.4 to 3.3 over the same period, while task success rates on the hardest problems doubled — so less oversight produced better results. (see where this is going?) Which is, depending on your perspective, either reassuring or deeply unsettling… take your pick.
  • Among new Claude Code users, roughly 20% of sessions use full auto-approve mode; among users with 750+ sessions, that’s over 40%

Anthropic’s own interpretation is worth noting. They argue that “autonomy co-constructed by the model, the user, and the product” is more appropriate than fixed HITL mandates, and that the AI’s own uncertainty detection — pausing to ask clarifying questions — already outperforms mandated human checkpoints at the same risk thresholds. Critically, the AI asks clarifying questions more than twice as often as humans intervene on the most complex tasks. Which somewhat inverts the entire conventional assumption about who initiates oversight.
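A toy way to see that inversion: aggregate per-session counts of human interventions versus model-initiated clarifying questions, and a ratio above 1.0 means the model, not the human, is the one initiating oversight. The session schema below is hypothetical, not Anthropic’s.

```python
def oversight_metrics(sessions: list[dict]) -> dict:
    """Aggregate who initiates oversight: human interventions versus
    model-initiated clarifying questions, averaged per session."""
    n = len(sessions)
    interventions = sum(s["human_interventions"] for s in sessions) / n
    questions = sum(s["model_questions"] for s in sessions) / n
    return {
        "interventions_per_session": interventions,
        "questions_per_session": questions,
        # > 1.0 means the model, not the human, initiates oversight more often
        "model_initiation_ratio": questions / interventions,
    }

# Invented numbers, shaped like the "more than twice as often" finding:
sessions = [
    {"human_interventions": 3, "model_questions": 7},
    {"human_interventions": 4, "model_questions": 8},
]
m = oversight_metrics(sessions)  # ratio ≈ 2.14: model-led oversight
```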

2.3 Where Agentic AI Is Actually Being Used

Software engineering currently accounts for approximately 50% of all agentic tool-use actions in production APIs — making it the leading domain by a significant margin. About 80% of current API actions are subject to protective measures like permission restrictions and human approval, while irreversible actions constitute only 0.8% of all actions. That sounds reassuring until you notice that a growing subset of frontier deployments involves financial transactions, medical record updates, and security privilege escalation — domains where that 0.8% irreversibility carries consequences bat-shit disproportionate to its volume.

2.4 The Rise of Execution Review (The Thing Everyone’s Ignoring)

As pre-execution code review retreats, something structurally different has emerged to partially replace it: execution review. Execution tracing exposes the sequence of reasoning steps an agent followed, the tools or functions it invoked, and the inputs and outputs at each stage of execution, enabling root-cause analysis that static code inspection simply cannot provide. This matters because agentic AI systems are nondeterministic — the same codebase can produce wildly different execution paths depending on context, live data, and model state. As one framing captures it, “the ‘why’ lives inside the trace — not the codebase”.
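A minimal sketch of what execution tracing means in practice, assuming a hypothetical agent whose tools are plain Python functions: a decorator records every tool invocation, its inputs, its output (or failure), and a timestamp, so the “why” can be read back from the trace later.

```python
import functools
import time

TRACE: list[dict] = []  # in-memory execution trace (the "why" record)

def traced(tool):
    """Wrap an agent tool so every invocation lands in the trace:
    which tool ran, with what inputs, producing what output, when."""
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        entry = {"tool": tool.__name__, "args": args, "kwargs": kwargs,
                 "ts": time.time()}
        try:
            entry["output"] = tool(*args, **kwargs)
            return entry["output"]
        except Exception as exc:
            entry["error"] = repr(exc)   # failures are part of the trace too
            raise
        finally:
            TRACE.append(entry)
    return wrapper

@traced
def lookup_balance(account: str) -> int:
    # stand-in for a real tool call (API, database, shell command)
    return {"acct-1": 250}.get(account, 0)

lookup_balance("acct-1")
# Root-cause analysis now reads TRACE, not the codebase.
```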

This has real compliance implications. In traditional software, you keep execution logs for 30–90 days and can reconstruct decisions from version-controlled source logic. That model collapses with agentic AI for very obvious reasons. When deterministic business logic is intertwined with nondeterministic agent behaviour, execution traces can’t just be short-term operational logs anymore — they have to become long-term accountability artefacts: the evidentiary basis for answering why this loan was denied, why this clinical recommendation was made, why this vulnerability was introduced (or whether it already existed when the code was written).
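If traces are to serve as long-term accountability artefacts rather than disposable logs, tamper-evidence matters. One common pattern (sketched here with illustrative field names, not any particular standard) is hash-chaining each record to its predecessor, so after-the-fact edits break the chain.

```python
import hashlib
import json

def append_trace(log: list[dict], event: dict) -> None:
    """Append an event, chaining each record to the hash of the previous
    one -- a trace meant to outlive a 30-day debug-log retention window."""
    prev = log[-1]["hash"] if log else "genesis"
    body = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    log.append({"prev": prev, "event": event,
                "hash": hashlib.sha256(body.encode()).hexdigest()})

def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited record invalidates the chain."""
    prev = "genesis"
    for rec in log:
        body = json.dumps({"prev": prev, "event": rec["event"]}, sort_keys=True)
        if rec["prev"] != prev or rec["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev = rec["hash"]
    return True

log: list[dict] = []
append_trace(log, {"decision": "loan_denied", "reason": "dti_ratio"})
append_trace(log, {"decision": "alert_raised"})
assert verify_chain(log)
log[0]["event"]["decision"] = "loan_approved"  # tamper attempt
assert not verify_chain(log)                   # the chain breaks
```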

The current state of affairs is, frankly, a bit of a charlie foxtrot. As of early 2026, only 47.1% of organisations’ AI agents are actively monitored or secured — meaning more than half of all deployed agents are running without execution-level visibility. Emerging frameworks like the MI9 runtime governance architecture are attempting to close this gap by integrating telemetry capture, authorisation monitoring, conformance checking, drift detection, and containment execution within a unified framework — but as of this writing, nothing has achieved widespread production adoption. And why the hell would it have? We’re talking about an emerging landscape.


3. The Four-Phase Model (A Timeline No One Will Stick To, But Here We Are)

Drawing from all of the above and observable industry trends, here’s a theorised four-phase model. Each phase describes the concurrent state of both code review and execution review, because the whole point is that these two aren’t the same thing and their trajectories diverge significantly.

Phase I — The Co-pilot Era (2022–2027)

In this phase, the human is the primary agent. AI is a subordinate tool. The workflow is unambiguous: human formulates intent → AI generates candidate code → human reviews and merges. AI agents in software development as of early 2026 handle boilerplate, test generation, and refactoring autonomously, but strategic decisions, architecture, and production deployment remain firmly in human hands.

Code review: Universal. Culturally uncontested. Nobody ships AI-generated code without looking at it first.

Execution review: Nascent. Trace logs exist for debugging purposes but aren’t treated as governance artefacts. The oversight burden is front-loaded onto code review because the execution monitoring infrastructure isn’t there yet.

Phase II — Supervised Autonomy (2027–2030)

As SWE-bench-style performance approaches 90%+ and multi-file reasoning matures, AI systems will handle complete features, isolated bug fixes, and dependency management end-to-end without per-step human review. Human oversight retreats to a checkpoint model: architectural reviews, security audits, production gate approvals. Former Google CEO Eric Schmidt predicted in April 2025 that “the vast majority of programmers will be replaced by AI programmers” within a year — which is probably overstating the speed, but the direction is roughly right (as Google Maps is) for this phase.

Code review: Selective rather than universal — reserved for security-sensitive modules, architectural boundaries, and production deployments.

Execution review: Formalised. Organisations start retaining execution traces as compliance records. Audit trail engines capturing reasoning paths, tool calls, escalations, and handoffs become standard architectural components.

Phase III — Exception-Based Oversight (2030–2035)

Human involvement shifts from reviewing code outputs to reviewing decision boundaries — the policies that define what the AI is permitted to execute without authorisation. Gartner projects that 80% of software engineers will need reskilling for non-coding roles by 2040 (see? They won’t be out of a job after all), implying that engineering labour in this phase migrates from code production to AI governance, systems design, and exception handling. This is either a terrifying or an exciting prediction, depending on where you’re sitting. I, for one, believe they can be retrained.

Code review: Exists only at system architecture level. The pull request as a governance mechanism is largely obsolete for commodity domains.

Execution review: This is the primary form of HITL now. Humans review behavioural drift reports, anomaly alerts, and execution summaries. Continuous monitoring with real-time alerting and intervention capabilities when behaviour deviates from expectations replaces the pre-merge pull request as the dominant oversight mechanism.
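As a toy illustration of the kind of behavioural-drift alerting this phase relies on, here is a naive baseline-deviation check. Real systems use far richer signals, and the three-sigma threshold below is arbitrary, chosen only for the sketch.

```python
from statistics import mean, stdev

def drift_alerts(baseline: list[float], recent: list[float],
                 threshold: float = 3.0) -> list[int]:
    """Flag indices of recent observations sitting more than `threshold`
    standard deviations from the baseline -- the kind of anomaly report
    a Phase III human reviews instead of a pull request."""
    mu, sigma = mean(baseline), stdev(baseline)
    return [i for i, x in enumerate(recent)
            if abs(x - mu) > threshold * sigma]

# Baseline: tool-call latencies (ms) from known-good agent sessions.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
recent = [101, 99, 250, 100]            # one wildly off-profile call
assert drift_alerts(baseline, recent) == [2]
```

The human never sees the code that produced latency sample 2; they see the alert, and they decide whether it warrants intervention.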

Phase IV — Governance-Mandated Residual (2035+)

Technical necessity for human code review in commodity software largely dissolves. The residual HITL persists not because AI can’t do the job reliably, but because no legal framework assigns liability to a non-human agent. One February 2026 analysis captures this neatly: “AI writes the code; humans own the system” — a principle that will harden into regulation rather than dissipate.

Code review: Legally mandated only in high-stakes domains — critical infrastructure, medical, defence, finance. Absent everywhere else.

Execution review: Legally mandated as a universal audit trail requirement, even in domains where no human intervention occurs at the code generation level. Think of it as the technical equivalent of a flight data recorder — mandatory not because the pilot is incompetent, but because accountability requires evidence.


4. What’s Actually Slowing This Down

4.1 Benchmarks Hit a Wall at the Edges

SWE-bench progress is not linear at the margins. Epoch AI’s January 2026 forecast accuracy analysis identifies the final ~20% of hard, novel, multi-system problems as a genuine capability frontier — one where performance improvements are diminishing relative to compute investment. These “last-mile” problems are disproportionately the most important ones in production: security-critical components, distributed systems logic, race conditions, cross-cutting architectural concerns. An AI that solves 80% of benchmark tasks is not, it turns out, 80% ready for zero-oversight production deployment. Perhaps obvious in retrospect, but worth stating plainly — and an argument I’m currently making in a separate analysis of the market consolidation that will follow if/when the “AI bubble” bursts, out of which newer, more performant models will be born (think DeepSeek).

4.2 The Attack Surface Is Getting Bigger, Not Smaller

Autonomous coding agents introduce novel vulnerability classes that flat-out don’t exist in human-authored code workflows — which is making insurance companies shit the bed as we speak. Prompt injection appears in over 73% of production AI deployments, and just five malicious documents can manipulate an AI agent 90% of the time via RAG poisoning. The CVE-2025-53773 remote code execution vulnerability in GitHub Copilot, assigned a CVSS score of 9.6, exemplifies precisely how AI coding agents operating with fewer human checkpoints can be weaponised through their own tooling interfaces. Reducing HITL without resolving these vulnerabilities doesn’t produce efficiency; it produces an expanded blast radius — if you’re a pessimist. We can also flip that take and argue, at the same time, that OpenClaw has taught us a lot about secure communication channels and the reduction of other attack surfaces… so delay that heart attack just a bit longer.

4.3 OWASP Has Formally Called This Out

The OWASP Top 10 for LLM Applications (2025) formally codifies Excessive Agency (LLM06:2025) as a primary risk category: the condition where an LLM is granted too much autonomy, permission, or functionality, leading to unintended actions beyond its intended scope (not all of which have to be bad). OWASP’s recommended mitigations include applying the principle of least privilege, requiring human-in-the-loop oversight for high-risk actions, and implementing multi-step verification for automated high-impact decisions. The fact that a leading open-source security framework has codified HITL as a formal security control — not merely a quality heuristic — signals that its decline is not purely a technical question, but something we’ll have to work on continuously. No, I’m not saying we’re going to build the plane while it’s flying — but we will be improvising along the way, and VERY fast.
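The least-privilege-plus-HITL pattern OWASP recommends can be sketched as a default-deny authorisation gate. The tool names and risk tiers below are invented for illustration; they are not part of any OWASP artefact.

```python
# Risk tiers per tool: least privilege plus mandatory human sign-off
# for high-impact actions (illustrative policy, not a standard).
PERMISSIONS = {
    "read_file":     {"risk": "low",  "allowed": True},
    "run_tests":     {"risk": "low",  "allowed": True},
    "deploy_prod":   {"risk": "high", "allowed": True},
    "drop_database": {"risk": "high", "allowed": False},  # never autonomous
}

def authorize(action: str, human_approved: bool = False) -> str:
    """Return 'allowed', 'escalate' (HITL checkpoint), or 'denied'."""
    policy = PERMISSIONS.get(action)
    if policy is None or not policy["allowed"]:
        return "denied"                  # default-deny: least privilege
    if policy["risk"] == "high" and not human_approved:
        return "escalate"                # HITL checkpoint for high risk
    return "allowed"

assert authorize("read_file") == "allowed"
assert authorize("deploy_prod") == "escalate"
assert authorize("deploy_prod", human_approved=True) == "allowed"
assert authorize("drop_database", human_approved=True) == "denied"
```

The interesting property is the middle state: “escalate” is exactly where the residual HITL lives once code review has receded.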

4.4 Nobody Knows Who’s Legally Responsible (This is TOTALLY FUN!)

No jurisdiction currently assigns actionable legal liability to an autonomous AI system for defective software output (with the EU scrambling like blind men discussing colour). Until a legal infrastructure for AI-generated code accountability is established — through product liability reform, software assurance legislation, or AI-specific tort frameworks — an identifiable human must remain in the loop to serve as the party who gets sued. Anthropic itself noted in its February 2026 study that policymakers should focus on “whether humans are in a position to intervene effectively” rather than mandating specific interaction patterns — an implicit acknowledgement that the HITL requirement is becoming a governance construct rather than a purely technical one. Hey, someone has got to take one for the team and pick up that bar of soap, right? RIGHT? Well, not really — because wearing THIS hat is going to cause significant headaches. The speed at which AI generates code far exceeds human review capacity, which will leave only those few unlucky enough to have thought things through to the end available for such high-stakes positions… and boom, there’s the next speed bump.

4.5 You Can’t Remove Code Review Until Execution Review Is Actually Ready

This dependency is underappreciated and it’s kind of a really big fucking deal. Removing pre-execution human oversight before robust execution monitoring is in place doesn’t reduce governance burden — it displaces it invisibly. The top code execution risks in agentic AI systems in 2026 include privilege escalation through tool chaining, exfiltration via autonomous API calls, and undetected data poisoning mid-task — none of which are visible in static code review. Moxo’s February 2026 agentic observability analysis states the dependency directly: “Orchestration fails when humans are removed. It works when they’re supported” — supported, crucially, by trace logs that correlate agent actions with human intent across the full workflow.

This creates a hard sequencing constraint: Phase II and Phase III transitions cannot proceed safely until execution tracing infrastructure is standardised, legally recognised, and operationally integrated. Organisations that reduce code review oversight ahead of that infrastructure maturity aren’t accelerating the transition — they’re operating in an ungoverned gap that expands their legal liability rather than reducing it.
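That sequencing constraint can be stated as a crude readiness gate: don’t relax pre-execution review until execution monitoring actually covers your fleet. The thresholds below are illustrative, not any published standard.

```python
def may_reduce_code_review(monitored_agents: int, total_agents: int,
                           trace_retention_days: int) -> bool:
    """Sequencing gate: relaxing pre-execution review is only sane once
    execution-review infrastructure is in place. Thresholds (95% coverage,
    one year of trace retention) are illustrative assumptions."""
    coverage = monitored_agents / total_agents if total_agents else 0.0
    return coverage >= 0.95 and trace_retention_days >= 365

# At the ~47% monitoring coverage reported for early 2026, the gate
# stays shut; an organisation at 96% coverage with two-year retention
# could defensibly start the Phase II transition.
assert not may_reduce_code_review(monitored_agents=47, total_agents=100,
                                  trace_retention_days=90)
assert may_reduce_code_review(96, 100, 730)
```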


5. Discussion: The Bit That’s Not Going Away

The four-phase model converges on a conclusion that is probably less dramatic than the AI discourse usually wants: HITL won’t disappear. It will metamorphose. The trajectory from universal code review to exception-based oversight to governance-mandated accountability mirrors how other engineering disciplines matured. Aircraft are overwhelmingly flown by autopilot systems that technically require no human input for most of the flight (keep in mind how long that took, and what assistive tech was necessary). And yet, two licensed pilots are legally required in the cockpit of every commercial aircraft. Not because the autopilot can’t do it. But because when it fails, society needs a human face on the failure (which in the aviation industry is kinda ironic, because the pilots die, too).

Execution review deepens the analogy quite a bit. A modern commercial aircraft generates continuous flight data and cockpit voice recordings — the execution trace — that persist as mandatory legal artefacts regardless of how little the pilots actually did. The FAA doesn’t mandate those because it thinks pilots are incompetent. It mandates them because accountability requires evidence (even if they also died in the crash). AI-generated code in production is heading toward the same institutional conclusion: the execution trace, not the code diff, will become the primary accountability document.

For high-stakes software domains — think critical infrastructure, medical device firmware, nuclear facility management, weapons systems, large-scale financial settlement — a governance-mandated HITL will persist well past 2040, as projections on AI capability timelines from bodies like Oak Ridge National Laboratory suggest, not because AI lacks the capability but because democratic accountability structures require a human face (preferably alive, other than in the civilian aviation industry) on consequential decisions. That’s not a technical argument. It’s a political and ethical one, and those tend to be more durable.


6. Conclusion: HITL Isn’t Dying, It’s Moulting to HITL&ER (and eventually disappearing altogether?)

The “human in the loop” requirement in AI-generated code is not approaching a binary off-switch. It’s undergoing a functional decomposition: the code review component will become obsolete for most software domains by roughly 2030–2032, while the execution review component simultaneously matures into mandatory governance infrastructure. Both converge, eventually, on a residual accountability layer — legally and ethically mandated — that will persist indefinitely in high-stakes domains.

The empirical evidence — SWE-bench’s 16x improvement in two years, Anthropic’s data showing autonomous session durations doubling in three months, OWASP formally codifying excessive AI agency as a primary security risk, and only 47.1% of deployed AI agents currently being monitored — collectively defines a transition point somewhere between 2028 and 2032. After that, HITL in software engineering shifts from an operational default to a deliberate policy choice (for better or for worse, depending on which jurisdiction you live in).

The organisations that navigate this safely will be those that treat the decline of code review and the rise of execution review not as sequential events but as a single managed substitution. What remains after that point isn’t a legacy of limitation. It’s a feature of accountable governance. A distinction worth keeping hold of, probably.
