Agent-First Systems and the Future of Software

On harnesses, verifiability, and why human-in-the-loop is not the answer for safe AI agents

/ January 2, 2026 / 12 min read / 2274 words

Over the holidays, I built an AI Assistant using Cloudflare Durable Objects. The source code is available at meleksomai/os, and you can try it by sending an email to hello@somai.me. Building on that experience, this essay explores how agent-first systems are not merely a new feature layer atop existing tools, but a complete shift in the way we design and interact with systems.
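
For readers curious what such a system might look like in code, below is a minimal sketch of an email-triggered assistant on Cloudflare Workers and Durable Objects. It is illustrative rather than a description of the actual meleksomai/os implementation: the ASSISTANT binding name, the /ingest route, and the storage layout are assumptions made for the example.

```ts
// worker.ts — hypothetical sketch; binding and route names are illustrative,
// not taken from the meleksomai/os repository.
export interface Env {
  ASSISTANT: DurableObjectNamespace;
}

export default {
  // Email Workers entry point: invoked when a message arrives at the routed address.
  async email(message: ForwardableEmailMessage, env: Env, ctx: ExecutionContext) {
    // One Durable Object per sender keeps conversation state isolated and serialized.
    const id = env.ASSISTANT.idFromName(message.from);
    const stub = env.ASSISTANT.get(id);

    // message.raw is a ReadableStream; Response.text() is a simple way to drain it.
    const body = await new Response(message.raw).text();
    await stub.fetch("https://assistant/ingest", {
      method: "POST",
      body: JSON.stringify({
        from: message.from,
        subject: message.headers.get("subject"),
        body,
      }),
    });
  },
};

// The Durable Object holds per-user memory and would invoke the model behind a harness.
export class Assistant {
  constructor(private state: DurableObjectState, private env: Env) {}

  async fetch(request: Request): Promise<Response> {
    const email = (await request.json()) as {
      from: string;
      subject: string | null;
      body: string;
    };

    // Persist the conversation so later turns have context.
    const history = (await this.state.storage.get<string[]>("history")) ?? [];
    history.push(email.body);
    await this.state.storage.put("history", history);

    // ...call the LLM with history + tools, then send a reply (omitted here).
    return new Response("ok");
  }
}
```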

Over the past few months, and more precisely following the release of Claude Opus 4.5 and Gemini 3, I have become increasingly convinced that the future of software lies in building agent-first systems.1 By agent-first systems, I mean software in which the task to be performed is increasingly removed from the end-user. As AI agents powered by LLMs become more capable of long-context reasoning, tool use, and multi-turn interaction, the tasks they can handle are becoming progressively (and then suddenly) the property of the agent. Agents manage intent interpretation, memory, decision-making, and action execution, further removing the user from the task itself. The shift could be thought of as analogous to the way modern high-level programming languages such as Python and JavaScript abstract away memory allocation and garbage collection. However, the similarity ends there.

The magnitude of this paradigm shift is revealing itself through the radical change in the interfaces through which we as humans interact with machines. The first inflection point2 and glimpse toward agent-based software is without doubt ChatGPT. The conversational interface made it easy to express intent, iterate on ideas, and experience the capabilities of the underlying model. The second and more powerful Aha! moment came more recently, when I started using AI-assisted coding with Codex, Claude Code, and Amp. All of these tools are built around the command-line interface (CLI).3 In my day-to-day programming, these CLI-native tools seem to surpass the more complex IDE-based tools such as VS Code Copilot, Cursor, or Zed. This trend toward simpler, more direct interfaces is not coincidental. Taking the example of IDEs, it is certainly tempting to keep the human in control of the majority of the coding and to have the LLM as a sidebar assistant. While that is a logical step, I share the view Andrej Karpathy expressed in a recent tweet: the struggle to increase productivity with LLMs is perhaps more of a human skill issue. I will add that it is also a tools issue, and that we should not try to force LLMs into our routine tools but instead adopt the tools that are natural for an LLM to operate in.

This is not merely confined to programming. Across industries, the current interfaces of modern software, from spreadsheets to dashboards, are often complex, cumbersome, and frustrating to use. The interface has become the thing rather than the medium for achieving a goal. In medicine today, physicians spend more time interacting with software than performing their core work, including delivering direct patient care.4 They have to navigate menus, click buttons, and fill out forms. Modern software interfaces are thus designed to keep humans as the primary agents of understanding and action, with machines serving as fast, precise, but ultimately inert objects.5 They are perfected to guide and structure human reasoning, not to replace it, since the assumption was that machines cannot reason.

Agent-first systems challenge this order, and the shift from sophisticated6 interfaces toward simpler ones is the prelude to a more fundamental shift in how we build software.

But this reconfiguration of human and machine roles does not arrive without friction. To move toward agent-first systems, we must confront a new class of questions that sit uncomfortably between engineering and philosophy. The challenge with the new paradigm is not simply enabling agents to do more, but ensuring that their expanding scope of action remains aligned with human values. Because LLMs are optimized for next-token prediction rather than grounded agency, alignment cannot be guaranteed by the model's learned parameters alone. In their current architecture, LLMs are unlikely to be trusted to perform complex, critical tasks such as treating a patient. Hence, additional layers of techniques and tools must be considered. Left unsolved, both the opportunity cost and the risk will remain significant. This is exemplified by the current gap between the excitement around the most recent LLMs and the fact that “95% of organizations are getting zero return” from AI initiatives.7 As long as we lack answers to these questions, the new wave of agent-first systems will be difficult to implement at scale.

Again, I don't think it is solely the responsibility of frontier model companies such as OpenAI and Anthropic to solve these challenges. Instead, these constraints suggest that the limiting factor for deploying agent-first systems is becoming less about model intelligence (or, if you are a skeptic, performance) alone. What determines whether an LLM can act safely and usefully is increasingly the environment in which it operates. The problem of alignment will require, at least for the foreseeable future, integrating LLMs within an environment that allows them to operate safely: a harness that constrains, guides, and supervises a language model's behavior, rather than expecting those properties to be embedded within and expressed by the model itself. The harness is the environment around the model that makes it usable, safe, and reliable in practice. The combination of the harness and the LLM is what creates an agent-first system. Techniques such as the Model Context Protocol (MCP), Agent Skills, and spec-driven development have emerged in the last few months as tools for building effective harnesses. The role of the harness is to ensure, systematically and reliably, that the LLM operates within safe bounds and, above all, that its behavior is verifiable.
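
To make the harness idea concrete, here is a minimal sketch in TypeScript. It is not the API of MCP, Agent Skills, or any particular framework; the ToolCall and Tool shapes, the allowlist, and the per-tool verify step are assumptions chosen to illustrate what "constrain, supervise, and verify" can mean mechanically.

```ts
// A minimal, hypothetical harness loop: the model proposes actions,
// the harness decides whether they run and whether their outcomes are accepted.

type ToolCall = { tool: string; args: Record<string, unknown> };
type ModelStep = { kind: "call"; call: ToolCall } | { kind: "final"; answer: string };

interface Tool {
  name: string;
  run(args: Record<string, unknown>): Promise<string>;
  // Machine-checkable success criterion for this tool's outcome.
  verify(args: Record<string, unknown>, output: string): boolean;
}

async function runWithHarness(
  model: (transcript: string[]) => Promise<ModelStep>, // any LLM wrapper
  tools: Map<string, Tool>,                            // explicit allowlist
  maxSteps = 10,
): Promise<string> {
  const transcript: string[] = [];
  const audit: object[] = []; // every action is recorded for traceability

  for (let step = 0; step < maxSteps; step++) {
    const proposal = await model(transcript);
    if (proposal.kind === "final") return proposal.answer;

    const tool = tools.get(proposal.call.tool);
    if (!tool) {
      // Constrain: anything outside the allowlist never executes.
      transcript.push(`refused: unknown tool ${proposal.call.tool}`);
      continue;
    }

    const output = await tool.run(proposal.call.args);
    const ok = tool.verify(proposal.call.args, output);
    audit.push({ step, call: proposal.call, ok });
    transcript.push(ok ? `ok: ${output}` : "rejected: verification failed");
  }
  throw new Error("step budget exhausted without a verified answer");
}
```

The important property is that nothing the model proposes executes unless the harness recognizes the tool, and no outcome is accepted unless a machine-checkable verifier passes.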

Verifiability has emerged as a central driver of recent advances in LLM capabilities, particularly with Reinforcement Learning with Verifiable Rewards (RLVR).8 Results such as DeepSeek's use of RLVR through Group Relative Policy Optimization (GRPO) demonstrate that when task outcomes can be reliably evaluated, models can be trained to preferentially select correct trajectories, yielding substantial gains in domains like mathematical reasoning.9 The broader insight is structural: LLMs improve most reliably in environments where success criteria are explicit, testable, and machine-verifiable. This suggests that the environments in which LLMs operate must be designed to allow for verifiable outcomes if we are to unlock their full potential as agents. Conversely, tasks whose outcomes cannot be reliably specified, observed, or verified remain fundamentally fragile for LLMs. In such contexts, deploying LLMs without an adequate harness is not merely ineffective—it is potentially unsafe.
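
As a toy illustration of what "verifiable" means here, consider a reward the environment can compute without human judgment. The task format, the "#### <answer>" convention, and the exact-match check below are assumptions made for the sketch; production RLVR pipelines use richer checkers (unit tests, proof checkers, simulators), but they share this shape.

```ts
// Toy verifiable reward: correctness is decided deterministically by the
// environment, which is the property RLVR-style training (e.g., GRPO) exploits.

interface Task {
  prompt: string;
  check(answer: string): boolean; // deterministic, machine-verifiable success criterion
}

// Example: exact-match arithmetic. The model must end its output with "#### <number>".
function arithmeticTask(a: number, b: number): Task {
  return {
    prompt: `Compute ${a} + ${b}. End with "#### <answer>".`,
    check: (answer) => {
      const match = answer.match(/####\s*(-?\d+)\s*$/);
      return match !== null && Number(match[1]) === a + b;
    },
  };
}

// Binary reward over a sampled completion.
function reward(task: Task, completion: string): number {
  return task.check(completion) ? 1 : 0;
}

// Usage: score a group of completions for the same prompt, as a
// group-relative method would before computing advantages.
const task = arithmeticTask(17, 25);
const samples = ["...reasoning... #### 42", "...reasoning... #### 41"];
console.log(samples.map((s) => reward(task, s))); // [1, 0]
```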

Any machine constructed for the purpose of making decisions, if it does not possess the power of learning, will be completely literal-minded. Woe to us if we let it decide our conduct, unless we have previously examined the laws of its action, and know fully that its conduct will be carried out on principles acceptable to us!

Norbert Wiener, The Human Use of Human Beings: Cybernetics and Society

A common, understandable, but ultimately naive approach is to deploy agents only within tightly scoped environments, with a human in the loop able to verify or intervene at every step. This reflects a defensive posture shaped by legitimate concerns around safety, risk, and accountability. However, as AI capabilities continue to advance, the limiting factor will increasingly be neither the model nor the task, but the human oversight or the constraints of the environment itself. Crucially, this is not an argument against human involvement in ensuring safety. Rather, it challenges a deeper and often implicit assumption: that continuous human intervention can serve as a reliable, scalable mechanism for safety in agent-first systems. In practice, this assumption breaks down at scale. A case in point is the recent Waymo incident. When an outage occurred in San Francisco, multiple Waymo cars stalled and became inoperable, blocking roads and intersections, and Waymo had to cease operations for several hours. The failure, as was later revealed, was not in the AI engine but in Waymo's control system. Waymo relied on a human-in-the-loop fleet response as part of its safety harness, a solution that appeared robust until it buckled under the sheer weight of scale. This episode underscores that in critical systems, human-in-the-loop is not a viable control strategy. It collapses precisely when scale, speed, and coordination matter most, and it merely shifts risk from model error to coordination failure between machines and humans.10 The same pattern will repeat across domains unless we move beyond human-in-the-loop as a safety crutch and toward environments, and harnesses, that make autonomy governable by design.

Health care, for instance, operates under uncertainty, delayed outcomes, and asymmetric risk. Yet it is precisely this complexity that makes health care so compelling: the cost of human cognitive overload is real, and the promise of delegation is enormous. The limiting factor is not that agents cannot reason about medical knowledge, generate differential diagnoses, or propose treatment plans with accuracy,10 but that the environments in health care are poorly structured for verifiable reward mechanisms and largely underprepared for safe autonomy. The current proliferation of AI governance playbooks, health care AI model cards, and institutional AI policies, while well-intentioned, risks deflecting attention from the harder, more consequential work. Emphasizing oversight by committee and printed documentation over the design of operational environments in which autonomy can be exercised, constrained, and audited in practice may provide the appearance of control while doing little to advance the safe and effective deployment of agent-first systems. Until the surrounding systems, including clinical workflows, information architectures, audit mechanisms, and governance models, are redesigned to support machine verification, supervision, and traceability, agent-first systems in health care will remain limited by their environment rather than by the models themselves.

The transition to agent-first systems is as much a question of design philosophy as of technical capability. What is at stake is not whether agents can reason, but whether we can build environments in which their reasoning, however primitive, operates safely, transparently, and in service of human intent. This requires moving beyond the reflex to keep humans in every loop and toward the harder work of designing systems where autonomy is structurally governable. The human role shifts from producer of knowledge to supervisor of outcomes.11 The future of software lies not in interfaces that guide human thought, but in agents that perform tasks based on human objectives. The challenge before us is to build the harnesses that make this new paradigm safe, reliable, and aligned with our values. If (and when) we succeed, what emerges is a model in which knowledge production is increasingly delegated and knowledge itself becomes more abundant.

The next generation of software will be built around agents, not user interfaces. For builders, I offer a simple litmus test: remove the agent from your system. If the product collapses, you are on the right track, though collapse alone does not guarantee that you are building an agent-first system. If the system remains operable, AI is a feature rather than a foundation, and you are certainly not building an agent-first system.

Footnotes

  1. This assumes that frontier LLMs will continue to improve in capability, reliability, and safety over time, and that, whether we achieve AGI or not, these models will be powerful enough to handle a wide range of tasks that we currently rely on traditional software for.

  2. While I describe LLM progress in terms of a step function, the underlying improvements are smoother and more predictable. What feels like a sudden capability jump is often a perceptual bias shaped by our limited human capacity to measure change over a continuous spectrum. I highly recommend the following paper:
    Schaeffer, Rylan, Brando Miranda, and Sanmi Koyejo. 2023. “Are Emergent Abilities of Large Language Models a Mirage?” arXiv [Cs.AI]. arXiv. http://arxiv.org/abs/2304.15004.

  3. I highly recommend Ghostty as a terminal. It is such a delightful work by Mitchell Hashimoto. That said, I have seen a trend toward building complex UIs in the terminal using TUI frameworks like Ink, and I worry that this could lead to anti-patterns where we recreate the complexity of traditional UIs in a CLI context.

  4. Hill, Robert G., Jr, Lynn Marie Sears, and Scott W. Melanson. 2013. “4000 Clicks: A Productivity Analysis of Electronic Medical Records in a Community Hospital ED.” The American Journal of Emergency Medicine 31 (11): 1591-94.

  5. It is worth noticing how suddenly our view of what a good user experience looks like has shifted. Just a year ago, building a web app with a sophisticated and well-crafted UI was considered innovative; consider Linear, Figma, Vercel, and others. Now, it feels outdated and mostly an anti-pattern compared to agent-first systems.

  6. By sophisticated here I do not mean elegant, minimal, or refined in the sense of simplicity. I mean sophisticated in the cruder sense of feature-dense, highly parameterized systems that push complexity directly onto the user.

  7. Challapally, Aditya, Chris Pease, Ramesh Raskar, and Pradyumna Chari. 2025. “The GenAI Divide: State of AI in Business 2025.” MIT NANDA. https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf.

  8. Wen, Xumeng, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, et al. 2025. “Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs.” arXiv [Cs.AI]. arXiv. http://arxiv.org/abs/2506.14245.

  9. Shao, Zhihong, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv [Cs.CL]. arXiv. http://arxiv.org/abs/2402.03300.

  10. Chiodo, Maurice, Dennis Müller, Paul Siewert, Jean-Luc Wetherall, Zoya Yasmine, and John Burden. 2025. “Formalising Human-in-the-Loop: Computational Reductions, Failure Modes, and Legal-Moral Responsibility.” arXiv [Cs.CY]. arXiv. https://doi.org/10.48550/arXiv.2505.10426.

  11. I attended a health care-focused AI roundtable with Greg Beckman from OpenAI, where he stated that “we should think about our role when the cost of knowledge gets to zero with LLMs.”