
Anveshak

What if failed AI runs weren't wasted?

The Problem

A coding agent spends twenty minutes working through a bug. It reads the right files, finds the right function, even identifies the root cause halfway through. Then its context window fills up. It compacts what it remembers, loses the thread, and spirals — retrying approaches it already rejected, re-reading files it already understood. Eventually it gives up. You hit retry and it starts over from scratch, as if nothing happened.

That failed run contained the answer. The agent found the problematic file, traced the logic, narrowed it down. Research on failed agent trajectories shows the agent still identifies the correct file in 72-81% of cases. The diagnostic work was done. The agent just couldn't hold it together long enough to finish.

Today, that work is thrown away. Observability tools like LangSmith and AgentPrism let humans inspect what happened, but the traces themselves are treated as exhaust — something to visualize in a dashboard, not something another system can act on. The only recovery options are to retry from scratch (expensive, amnesiac) or to compress the trace into a summary and keep going (lossy, degraded). Neither option uses what the agent already figured out.

As coding agents move into production workflows — handling pull requests, debugging CI failures, resolving incidents — the volume of failed runs is growing fast. Each one costs real compute and real time. Throwing that work away and starting over is not just inefficient. It's a compounding cost that scales with every deployment.

What We're Exploring

We think failed agent traces are not waste. They're forensic documents — rich, structured records of diagnostic work that a second agent can read, interpret, and finish.

The insight is simple: the original agent failed because of context limitations, not because the task was beyond it. It ran out of room to think. But the full record of what it did is still there in the trace. A second agent — smaller, cheaper, purpose-built for reading long forensic records rather than solving problems from scratch — should be able to pick up where the first one left off. Read the autopsy report. Find what the original agent found but couldn't retain. Produce the fix.

The picture we're working toward: a failed agent run produces a trace. Instead of discarding it or retrying blind, you hand the trace to a lightweight recovery agent. It reads through the full record — every file the original agent opened, every hypothesis it formed, every error it hit — and extracts the diagnostic signal that was lost to context degradation. It produces the patch the original agent couldn't finish. The cost is a fraction of a full retry.
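To make that shape concrete, here is a minimal sketch of what such a pipeline could look like. Everything in it is an assumption for illustration: the TraceEvent schema, the extract_signal filter, and the recovery_model callable stand in for pieces that don't exist yet, and the filter in particular hand-waves the hard part discussed below.

```python
from dataclasses import dataclass

@dataclass
class TraceEvent:
    """One step from a failed run, normalized across agent scaffolds (hypothetical schema)."""
    kind: str      # e.g. "file_read", "tool_call", "hypothesis", "error"
    content: str   # raw text of the step
    tokens: int    # rough size, used to budget the recovery context


def extract_signal(trace: list[TraceEvent], budget: int = 8_000) -> list[TraceEvent]:
    """Keep diagnostically dense events, drop mechanical noise.

    Deliberately naive placeholder: real signal extraction is the open
    question discussed below, not a solved filtering step.
    """
    candidates = [e for e in trace if e.kind in {"hypothesis", "error", "root_cause_note"}]
    kept, used = [], 0
    for event in candidates:
        if used + event.tokens > budget:
            break
        kept.append(event)
        used += event.tokens
    return kept


def recover(trace: list[TraceEvent], task: str, recovery_model) -> str:
    """Hand the distilled trace to a smaller model and ask it to finish the job."""
    signal = extract_signal(trace)
    prompt = (
        f"Task: {task}\n\n"
        "A previous agent failed on this task. Its most relevant steps were:\n\n"
        + "\n\n".join(e.content for e in signal)
        + "\n\nUsing that diagnostic work, produce the patch it could not finish."
    )
    # `recovery_model` is any callable mapping a prompt to a patch -- a stand-in here.
    return recovery_model(prompt)
```

The point of the sketch is where the token budget goes: the recovery agent never holds the whole trace in context, only the slice that carries diagnostic weight.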

Getting there raises questions we find genuinely hard:

  • Signal versus noise. A failed trace can run to 100K+ tokens. Most of it is mechanical — file reads, boilerplate, dead ends. How does a recovery agent distinguish the diagnostic signal from the noise without domain-specific heuristics that would limit generality?
  • Context rot diagnosis. The original agent lost the thread at some point. Can a second agent reliably identify where the degradation happened — the moment the agent stopped connecting its early observations to its later actions — and use that as a starting point for recovery?
  • Generality across agent scaffolds. Different agents produce radically different trace formats. An approach that only works on one agent's output is a parlor trick. Can trace forensics generalize across the major coding agent architectures without format-specific engineering?
We're building an evaluation that directly measures recovery: take traces from failed coding agent runs, hand them to the recovery agent, and compare its success rate against a fresh agent that gets no trace at all. The trace either helps or it doesn't. That's the test.
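A minimal version of that comparison might look like the following. The run_recovery_agent, run_fresh_agent, and passes_tests callables are hypothetical placeholders for whatever agent harness and test oracle the benchmark actually uses, not an existing API.

```python
def compare_recovery(cases, run_recovery_agent, run_fresh_agent, passes_tests):
    """Compare trace-informed recovery against a blind retry on the same failed tasks.

    `cases` is a list of (task, failed_trace) pairs collected from prior failed runs.
    `run_recovery_agent(task, trace)` and `run_fresh_agent(task)` each return a
    candidate patch; `passes_tests(task, patch)` judges it against the task's tests.
    """
    with_trace = without_trace = 0
    for task, failed_trace in cases:
        if passes_tests(task, run_recovery_agent(task, failed_trace)):
            with_trace += 1
        if passes_tests(task, run_fresh_agent(task)):
            without_trace += 1
    total = len(cases)
    return {
        "recovery_success_rate": with_trace / total,
        "fresh_retry_success_rate": without_trace / total,
    }
```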

Status

Active Research