For the past two weeks, I’ve been deep in a spec-driven approach to working with AI agents. At first, it felt great — like we were rolling forward on a sturdy cart, shipping fast and checking boxes.
But then I started testing.
I went spelunking into the codebase, expecting progress. What I found was something else entirely.
The code looked fine. It compiled. It had types. It followed common patterns.
But under the surface, it had quietly metastasized — like cancer.
Each iteration met the task I’d assigned, but seemed to build independently of the last. Slightly different types proliferated. Functions with similar names and almost-identical logic were duplicated across files. Some singletons followed the established pattern; some didn’t. Parameters were inconsistent. The drift was real.
There were even two versions of the same function… in the same file.
It wasn’t malicious or “wrong” in the way broken code is wrong. It was subtle. Slow. Creeping.
The worst part? It looked well-formed. But it wasn’t cohesive.
So I started deleting.
I detangled types, merged functions, and audited patterns. I rewrote rules and began tracking type flow between packages. I’m now setting stricter boundaries on where new types can be introduced and when existing ones should be reused.
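To give a concrete flavor of the kind of audit I mean: a quick sketch that flags functions defined in more than one place, including twice in the same file. This is illustrative, not the tooling I actually used, and it assumes a Python codebase purely for the example.

```python
import ast
from collections import defaultdict
from pathlib import Path


def find_duplicate_functions(root: str) -> dict:
    """Map each function name to every location that defines it,
    keeping only names defined in more than one place."""
    definitions = defaultdict(list)
    for path in Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                definitions[node.name].append(str(path))
    # A name listed twice for the same path means two versions in one file.
    return {name: files for name, files in definitions.items() if len(files) > 1}
```

Even a crude scan like this surfaces the almost-identical duplicates that type checkers and linters happily wave through, because each copy is individually well-formed.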
The plan moving forward:
- Take smaller steps again
- Write clearer specs, but also layer in opinionated rules
- Be deliberate about type ownership and propagation
- And most importantly: review the code like I would if it came from another dev on my team
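On type ownership specifically: one way to make the boundary machine-enforced rather than spec-only is a tool like import-linter, which fails CI when a package imports from somewhere it shouldn’t. The package names below are hypothetical; the contract just encodes “types flow downward, never sideways or up”:

```ini
[importlinter]
root_package = mypackage

[importlinter:contract:type_flow]
name = Types flow downward only
type = layers
layers =
    mypackage.api
    mypackage.services
    mypackage.types
```

The point is that an opinionated rule an agent can violate silently becomes a check it cannot.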
The big takeaway?
Progress can feel real — and look real — while hiding foundational rot.
I’m still excited about agents. They’re amazing for narrow tasks and very focused changes. But for larger systems, we need more than just specs and linters. We need vigilant reviewers. We need agents that watch for architectural drift and repeated logic. Maybe a “solutions architect” agent that’s tasked with seeing the bigger picture.
Until then, it’s me. Watching. Reviewing. Learning.
Right now I’m still just using Cursor, Codex as a passive contributor, and ChatGPT for second opinions. Not even an advanced toolchain. But after this milestone, I plan to try more robust multi-agent setups to see what’s out there.
It’s not all bleak.
This work is still powerful — just easy to misuse.
And sometimes… deceptively good at hiding the cracks.