
Over the past two years, the pace of innovation for AI code assistance has been nothing short of astounding. We’ve moved from “enhanced autocomplete” systems to ecosystems of AI agents capable of completing complex tasks and cranking out prodigious amounts of code. At the same time, developers are being asked to build, test, and deploy applications that rely on specialized accelerator hardware to run training or inference workloads.
Between the volume of new code and the diversity of hardware required to run it, we’re putting more load than ever on our software testing infrastructure. Given that many larger open source projects already struggle to afford their existing continuous integration (CI) test bills, we need new strategies to ensure projects and teams can deliver quality code. This requires a fundamental shift: we must reduce the burden on traditional CI systems by bringing more testing and validation closer to the developer, be it human or agent-based.
Various groups in the open source community have been laying the foundations for this shift, among them Shipwright, the CNCF Sandbox container build framework project I work on. With these efforts combined, I’m optimistic that we can forge a future for software development in the age of agentic AI that’s resilient, scalable, and no less trustworthy than what we expect today.
The Demand for Testing Compute
The current edge of generative AI software development is multi-agent orchestration. Experiments such as gastown envision teams of agents working together, with each agent given a specific role or skill. Frameworks like OpenClaw reinforce this notion of agent specialization: just like a real software engineering team, multi-agent workflows need bots with differentiated expertise whose value multiplies when their powers combine. But amid all this autonomous activity, what holds our machines accountable for building the right thing and leaving behind a system that is maintainable? For many on this frontier, the answer is “spec-driven development” powered by clear architecture rules, automated testing, continuous integration, and rapid deployment.
In this model, the demand for “testing compute” will increase exponentially under current best practices. Many projects set themselves up to execute all tests when change requests arrive, or to run no tests at all when code is submitted in a “draft” or “work in progress” state. Tests in CI environments are often defined in YAML or other configuration files that are not portable to local development environments. I have seen my own projects struggle with “push and pray” validation of CI configuration, as well as test execution that is nearly impossible to replicate outside of the CI environment. This won’t work for multi-agent software development. Instead, tests need to “work on my machine,” running locally to the fullest extent possible so that validation occurs before code is submitted.
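One way to keep CI configuration portable is to treat the CI YAML as a thin wrapper around the same entry point a contributor (or agent) runs locally. The sketch below uses a hypothetical GitHub Actions workflow and an assumed `make test` target; the file path and job names are illustrative, not from any specific project:

```yaml
# .github/workflows/test.yml (illustrative; adapt names to your project)
name: tests
on: [pull_request]
jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # All test logic lives behind `make test`, not in this YAML,
      # so the same command behaves identically on a laptop and in CI.
      - run: make test
```

Because the workflow only invokes `make test`, a human or agent can perform the identical validation locally before pushing, keeping “push and pray” iterations off the CI bill.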
This strategy of decentralizing CI offers two critical advantages. First, shifting some of the testing load onto the parties creating it encourages contributors—be they human or agent—to be more careful about the volume and quality of their contributions. Code validated locally through an agent instruction file or an old-fashioned contributor guide ensures that the compute dollars spent on CI go toward high-value code. Second, consistent validation experiences can reduce the test burden for software that leverages specialized hardware (such as model training and inference). Tests that work on any machine can pass core business logic checks on less expensive commodity systems, reducing the risk that CI checks fail on more expensive hardware. This focus on an accountable, local feedback loop is non-negotiable for the age of agentic AI.
Multi-Architecture Becomes a Requirement
The innovation of LLMs and their underlying inference engines has disrupted our fundamental assumptions about hardware. Over the past two decades, the software industry has tried to pull off the magic trick of making hardware disappear, from virtual machines to Kubernetes and “serverless” platforms. Through their unique hardware requirements, AI systems have demanded that we halt and reverse these patterns.
“Works on my machine” must now also mean delivering code that can be run on any machine, regardless of the hardware running underneath it. Multi-architecture (multiarch) support has shifted from a “nice-to-have” feature to a hard requirement across almost every language ecosystem. ARM CPU chips—once considered a “niche” for mobile devices—are now mainstream for daily software development and production deployments. Furthermore, applications that run training or inference workloads will need their own flavors and variants for specialized accelerator hardware. The InstructLab project, for example, maintains multiple container images that are tailored to specific GPU providers. Meanwhile, much of the software engineering world still struggles with teams that mix ARM-based Apple Silicon machines with those running Linux or Windows on x86_64 architectures.
This demand for multiarch and hardware specialization is where modern, cloud native tools step in. The Shipwright project is designed to help teams produce container artifacts that “work on any machine” with its upcoming API for multiarch builds. Once this feature is added to the Build Kubernetes Custom Resource (CR), developers will be able to execute multiarch container builds without worrying about the intricacies of container image indexes and Kubernetes node selection. The Build CR also offers finer-grained scheduling control through the use of standard Kubernetes Node Selectors and Tolerations. This allows developers to target nodes with specific attributes—for example, a GPU-enabled node required for model training. With these features combined, developers will receive a single image reference that is portable to any machine. These capabilities are an essential first step toward enabling the fully decentralized, local CI that the age of AI demands.
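As a sketch of the scheduling controls described above, a Build targeting a GPU-enabled node might look like the following. The field layout follows Shipwright’s v1beta1 Build API as I understand it, and the repository, registry, and GPU node label are hypothetical; consult the project’s API reference for the authoritative schema:

```yaml
apiVersion: shipwright.io/v1beta1
kind: Build
metadata:
  name: model-server
spec:
  source:
    type: Git
    git:
      url: https://github.com/example/model-server  # hypothetical repo
  strategy:
    name: buildah
    kind: ClusterBuildStrategy
  # Standard Kubernetes scheduling controls: pin the build pod
  # to a node with the accelerator hardware it needs.
  nodeSelector:
    nvidia.com/gpu.present: "true"  # hypothetical node label
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  output:
    image: registry.example.com/model-server:latest  # hypothetical registry
```

The node selector and tolerations are the same primitives Kubernetes uses for any pod, which is what makes this approach portable across clusters with heterogeneous hardware.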
The Future of CI and Agentic AI
The work we’ve done around multiarch in Shipwright demonstrates how modern, cloud native tools are essential for the age of AI. However, as agentic AI systems continue to increase the frequency and stakes of engineering challenges, the most critical lesson remains that AI does not replace fundamental engineering practices—it makes them more important than ever. The path forward will require adapting our practices and tools. Here are three areas where we can focus our efforts.
- Standardize Agent Rules and Documentation
The future of software engineering is multi-agent AI systems coordinating to implement a desired feature or behavior. Knowledge of how to implement these features consistently must be embedded in rules documented in codebases. Today, every AI agent vendor has its own convention for specifying these rules, which isn’t just silly; it is toil for engineers. For open source, this is even worse. It is time for the industry to standardize on conventions for codebase rules that benefit agents and their human contributor counterparts. Maintainers, for their part, will need to concisely write down (in plain English) rules and requirements that may previously have spread only through word of mouth and mentorship.
- Prioritize Local Execution
“Tests passing on my machine” will be vital to these agentic AI workflows. More can certainly be done to make CI testing locally reproducible. Current test orchestration providers like Jenkins, Tekton, and GitHub Actions can do better by providing means for test scripts and actions to be locally executed. Such a feature set is far more feasible now that container technology is ubiquitous. I am holding myself accountable here—Shipwright too is guilty of not providing a local build experience. This gap must be closed, as replicating the cloud CI environment locally is a critical need for controlling costs and ensuring tests are executed against high-quality contributions.
- Reduce Friction in Test Feedback
Debugging a failing test is a rite of passage for most software engineers. Nearly all samples, tutorials, and training on automated testing include code that implicitly assumes a “happy path.” The result is that when tests fail unexpectedly, most output does not provide clear indicators as to where and why the error occurred. Fixing these errors without context requires developers to parse substantial log files, navigate stack traces, and step through code logic to determine what went wrong. Today’s AI tools are limited by the amount of context they can ingest, and large contexts are known to substantially degrade the performance and accuracy of LLM outputs. Thankfully, developers can take action now by providing failure descriptions in their tests. Almost all test assertion frameworks support this feature; by treating every check as a user-facing error, developers can provide clues that let agents (and their future selves) fix tests faster.
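As an illustration of failure descriptions, here is a minimal sketch in Python; the function under test and its values are hypothetical:

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test: apply a percentage discount."""
    return round(price * (1 - percent / 100), 2)


def test_apply_discount():
    result = apply_discount(200.0, 15)
    expected = 170.0
    # A bare `assert result == expected` fails with no context.
    # A failure description tells an agent (or your future self)
    # what broke, with which inputs, and where to start looking.
    assert result == expected, (
        f"apply_discount(200.0, 15) returned {result}, expected {expected}; "
        "check the percentage-to-fraction conversion"
    )


test_apply_discount()
```

When the assertion fails, the message surfaces the inputs, the actual and expected values, and a hint at the likely cause, so the fix rarely requires re-reading the whole log.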
The daunting pace of agentic AI may tempt us to conclude that we’re facing a brand new set of problems, but in fact, these new technologies are really only accelerating existing, fundamental challenges in modern software engineering. The complexity of hardware architectures, the explosion of code volume, and the need for resource optimization demand modern tooling and reproducible testing. By spreading out the load of CI testing and thinking critically about how code is verified, we might come to find that even in the age of AI, all flakes are shallow.
