Michael Tromba

Testing is fucking awesome

I used to have a pretty negative sentiment toward testing. Unit tests, integration tests, anything that required writing code to test other code I had already written.

Not because they are not useful.

But because I found that in the majority of cases, the cost-benefit ratio of writing them was unfavorable given the context of the things I was building as a solo, bootstrapped founder.

I would still write them for specific, fragile, high-risk code paths as a way of clearly defining expectations, guardrailing my implementations with a red-green approach, and preventing future regressions.

E.g. if I was writing some kind of glob matcher, sure - I'd want to make sure it wasn't behaving stupidly.
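For the glob matcher case, the red-green suite I mean is tiny. A sketch, assuming a hypothetical matchGlob(pattern, input) that supports only * and ? (names and behavior are mine, not from any real library):

```typescript
// Hypothetical matchGlob: "*" matches any run of characters, "?" exactly one.
function matchGlob(pattern: string, input: string): boolean {
  // Translate the glob into an anchored regex, escaping everything else.
  const regex = pattern
    .split("")
    .map((ch) =>
      ch === "*" ? ".*" : ch === "?" ? "." : ch.replace(/[.+^${}()|[\]\\]/g, "\\$&")
    )
    .join("");
  return new RegExp(`^${regex}$`).test(input);
}

// Red-green: write the expectations first, watch them fail, then make them pass.
console.assert(matchGlob("*.test.ts", "glob.test.ts") === true);
console.assert(matchGlob("*.test.ts", "glob.ts") === false);
console.assert(matchGlob("file?.md", "file1.md") === true);
console.assert(matchGlob("file?.md", "file12.md") === false);
```

A handful of assertions like these is cheap to write and catches exactly the "behaving stupidly" failure modes.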

But in most cases they were a huge waste of time and introduced even more surface area for bugs that I'd have to maintain indefinitely.

They also added a layer of cement around my codebases, making things way harder to change - the defining hallmark of a shitty codebase.

But not anymore.

Now, I cannot imagine a world without them. I write a ton of tests - of all kinds.

And by write, I mean Opus 4.6 or GPT 5.4 writes them for me.

Why? The cost-benefit ratio has completely flipped.

The costs are now negligible, and the benefits have been multiplied.

1. They have become effectively costless.

The old costs:

  • Cognitive burden and time required to write them
  • New surface area for bugs
  • Double the code to maintain over time
  • Cement around my codebase

...are nowhere near as costly as they were before coding agents.

Now, there is close to zero cognitive burden to writing them or even thinking of which cases to write in the first place. The LLM is exceptional at defining cases.

The increased surface area and maintenance is now almost costless - the AI can write new tests, add cases, and fix broken cases in seconds.

They are no longer cement. The minute a test is no longer serving its purpose, the agent can easily bulldoze over it and write a new one from scratch. rm x.test.ts, done.

2. The benefits are now amplified.

Feedback loops are the essential backbone of effective agentic systems.

When the agent creates a mutation, it needs a way to understand the impacts of its mutation, validate its correctness, and confirm it has not introduced any regressions.

Automated testing is the perfect tool for the job.

Sure, you could tell your agent to verify its work and hope it listens. And this works to some degree, especially with the frontier models. It will run for 10 mins straight executing little ephemeral shell scripts and commands and reading / judging the effects of its work.

But that process is fragile, fundamentally not reproducible, and does nothing to prevent future regressions once you have defined expected behaviors.

Alternatively, you can instruct it to write a comprehensive test suite that captures all important code paths and edge cases for the module in question.

And on top of that, you can wire up your testing harness to a precommit hook (or some other similarly useful engineering lifecycle trigger) and have the tests run automatically prior to any potentially bug-containing code being shipped.
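As a sketch, the hook file itself can be a couple of lines (shown here assuming an npm project with lint and test scripts; husky users would put the same commands in .husky/pre-commit):

```sh
#!/bin/sh
# .git/hooks/pre-commit - a non-zero exit from any check aborts the commit.
npm run lint && npm test
```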

Now you have not only created feedback for your coding agent so that it can more effectively write code that conforms to clearly defined expected behaviors - you have also created a living specification for the given module to follow for as long as it exists within your codebase. If its behavior suddenly changes or breaks, you will know immediately, before it hits users.

The amount of code we ship is accelerating exponentially

I no longer review every line of code that hits my codebase. At most I'll skim through and only look more deeply into the risky bits, mostly to ensure that the architecture is sound and maintainable and that my codebase isn't trending toward becoming a big bowl of spaghetti. But I rarely read the actual code tokens beyond entity identifiers and file names. AI is generally good at naming things, so I can quickly deduce what and how a thing works just by glancing at the names and general flow of things.

If this makes you cringe, and you're still reviewing line-by-line - that's likely because you have not yet adopted and seen for yourself the power of well-engineered testing & verification systems within your codebase. The reason I do not need to read the code is that I already know it works and does what it says it does because we have sharply defined test suites which confirm that better than I could with my own two eyes, reading line-by-line.

Ok maybe not always better - but sure as hell faster. 100 or 1000x faster.

My brain validates the high-level intent/architecture (how it works), and automated systems validate the rest (IF it works).

A few pointers

Iteratively creating test suites with the agent

The process of integrating tests into a codebase often looks like walking through a manual verification and testing process with the agent within a chat session. This allows it to very easily and efficiently write and run those ephemeral test scripts on the fly and iterate on them until they function properly and test the target behaviors effectively. I may nudge the AI a few times with prompts like: "Have you tested all critical edge cases?", "What about if x happens, did you verify that works?", "Anything else we should verify in order to be 100% confident this works as expected and is prod-ready?"

After a few rounds of that (and, more than likely, squashing a few discovered bugs along the way), I then instruct the AI to turn the manual testing process we just did into a test suite. Within moments we now have a suite that will always run prior to any changes that could potentially break and regress the behavior of what we built.

Design your systems to be amenable to verification in the first place

If you have a shitty spaghetti architecture with trashy interdependency littered in every corner, good luck integrating automated testing.

Thinking in terms of expected behaviors and verification loops from the get-go forces you to design your systems in a high quality way and naturally nudges you (and your agents) away from poor designs.

Give your agent arms & legs

agent-browser is a CLI tool released by Vercel Labs that lets you provide your agent with a clean, context-optimized interface to fully control a browser. This unlocks the ability for your agent to do iterative end-to-end testing.

There are other tools on the block, and new ones coming out every day, too. But at the time of writing this, I've found agent-browser and the agent skill that ships with it to be extremely effective and useful.

After implementing a feature, I'll often have the agent a) define a full QA process that we should follow to ensure the real application behaves as expected, and then b) use the agent-browser skill/CLI to run the QA process itself, taking screenshots along the way to verify the UIs render properly (and so that the agent can present those screenshots to me at the end).

Once we've gone through the manual process, same as before - I have the agent serialize it into an automated end-to-end test suite.

Run automated testing at high-risk steps in your engineering lifecycle via hooks

I use precommit hooks in every project I work on now. I have my agent write them so that they automatically run all relevant quality checks (lints & tests) prior to sending any code upstream.

Neither I nor the agent has to remember to regression-test things. It just happens.

Two tips for this:

  1. If you have a large codebase with lots of surface area/tests, design your runner to only evaluate the affected code (staged, committed, etc.)

  2. Eliminate context-bloating noise. Ensure the automated testing succeeds quietly. In other words, if everything is passing, there is no need to log anything to the console other than that fact. Many testing / linting systems were designed prior to the concept of agent context windows existing. Don't distract your agent with irrelevant token bloat.
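To make tip 1 concrete, here is a minimal sketch of the affected-only filter, assuming a convention where src/foo.ts is covered by src/foo.test.ts (the function and variable names are mine, not from any real runner):

```typescript
// Given staged file paths (e.g. from `git diff --cached --name-only`),
// select only the sibling *.test.ts suites so the hook skips unrelated code.
function affectedTests(stagedFiles: string[], allTests: string[]): string[] {
  const touched = new Set(
    stagedFiles.map((f) => f.replace(/\.test\.ts$/, ".ts").replace(/\.ts$/, ""))
  );
  return allTests.filter((t) => touched.has(t.replace(/\.test\.ts$/, "")));
}

const staged = ["src/glob.ts", "src/auth.ts"];
const suites = ["src/glob.test.ts", "src/billing.test.ts", "src/auth.test.ts"];

const toRun = affectedTests(staged, suites);
console.assert(
  JSON.stringify(toRun) ===
    JSON.stringify(["src/glob.test.ts", "src/auth.test.ts"])
);

// Tip 2 applies here too: hand `toRun` to your runner and print nothing
// extra when everything passes - only surface output on failure.
```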

Make it easy for your agent

In my projects, I have converged on a development-environment pattern where I define specific QA email addresses (e.g. qa@mydomain.ai) that, when logged into within a dev environment, a) automatically authenticate the user without requiring a password, email magic links, etc. and b) reset the account to a fresh state for QA purposes, wiping any side effects created by prior QA testing runs.

I remove all friction so my agent can just pop open the browser and start using the product.
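A minimal sketch of that QA login path - QA_EMAILS, resetAccount, and createSession are hypothetical stand-ins for your own auth and database helpers:

```typescript
// Dev-only QA accounts that bypass normal auth and reset on every login.
const QA_EMAILS = new Set(["qa@mydomain.ai"]);

interface Session {
  email: string;
  fresh: boolean;
}

function loginForDev(email: string, env: string): Session | null {
  // Only ever short-circuit auth outside production.
  if (env !== "development" || !QA_EMAILS.has(email)) return null;
  resetAccount(email); // wipe side effects left by prior QA runs
  return createSession(email);
}

function resetAccount(_email: string): void {
  /* delete the account's rows/fixtures in the dev database */
}

function createSession(email: string): Session {
  return { email, fresh: true };
}

console.assert(loginForDev("qa@mydomain.ai", "development")?.fresh === true);
console.assert(loginForDev("qa@mydomain.ai", "production") === null);
console.assert(loginForDev("user@example.com", "development") === null);
```

The environment check is the load-bearing part of the design: the bypass must be structurally impossible to hit in production.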

On top of this, I also give my agent a custom QA account-state seeding CLI it can use to load specific account states, so it can more rapidly test the exact parts of the application it needs to without having to manually create those states by clicking around in the UI forever.
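The seeding CLI can be sketched as a simple map from state names to setup routines - the state names and helper bodies here are made up for illustration:

```typescript
// Each named state maps to a setup routine, so the agent can run e.g.
// `qa-seed trial-expired` instead of clicking through the UI to get there.
type Seeder = (email: string) => void;

const seeders: Record<string, Seeder> = {
  "fresh": (_email) => {
    /* no-op: the reset-on-login already gives a clean account */
  },
  "trial-expired": (_email) => {
    /* mark the account's trial as ended in the dev database */
  },
  "has-projects": (_email) => {
    /* insert a few sample projects for the account */
  },
};

function seed(state: string, email = "qa@mydomain.ai"): boolean {
  const run = seeders[state];
  if (!run) return false; // unknown state: the real CLI should fail loudly
  run(email);
  return true;
}

console.assert(seed("trial-expired") === true);
console.assert(seed("nonexistent-state") === false);
```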

To train the agent on using this system effectively, it is as simple as adding a skill, qa-testing, which documents what I've described here.

Not only is doing this incredibly useful for the manual shell testing your agent does, but you can reuse these same systems in the automated end-to-end test suites that your agent writes.