Kshamta
What can your vibe coding tool actually ship?
The Problem
A founder opens a vibe coding tool, types "build me a SaaS dashboard with auth and payments," and gets a working app in three minutes. The demo is impressive. The landing page looks polished. Then a real user visits on their phone and the layout breaks. The login form has no rate limiting. The page has no meta tags. The app works, but it can't ship.
Nobody is measuring this gap.
The benchmarks that exist were built for a different world. SWE-Bench, the gold standard for coding evaluation, tests patch-level fixes on existing codebases — the opposite of building something from a blank prompt. The few benchmarks that do target app generation test the underlying model, not the product a user actually interacts with. Two tools running the same model can produce wildly different results because of their scaffolding, deployment pipeline, and interface design. And the evaluations that do look at the output only check one dimension — one recent study found that most vibe-coded apps are functionally correct but fewer than one in ten are secure. Nobody is testing the full surface that determines whether an app can actually ship: accessibility, security, mobile layout, SEO, design coherence.
There is a deeper problem. Even if a tool nails the initial build, what happens when the user says "now add dark mode" or "now swap the contact form for a blog"? Most tools can generate an app. Far fewer can evolve one. The ability to handle multi-round changes without breaking existing functionality is where toy demos and real products diverge — and no benchmark tests it.
What We're Exploring
We think the right way to evaluate vibe coding tools is behavioral, not structural. Don't inspect the code. Don't assume a stack. Score the running app in the browser, the same way a user would experience it.
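To make "score the running app" concrete, here is a minimal sketch of a behavioral check, assuming Playwright as the browser driver; the viewport size and the specific assertions are illustrative stand-ins, not the benchmark's actual suite.

```typescript
// Minimal sketch: load the deployed app at a phone-sized viewport and score
// only what renders in the browser. No source access, no stack assumptions.
// Assumes Playwright; the viewport and checks are illustrative.
import { chromium } from 'playwright';

async function checkDeployedApp(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage({ viewport: { width: 390, height: 844 } });
  await page.goto(url, { waitUntil: 'networkidle' });

  const hasTitle = (await page.title()).trim().length > 0;
  const hasMetaDescription =
    (await page.locator('meta[name="description"]').count()) > 0;
  // Horizontal overflow at mobile width is the classic "layout breaks on a phone" failure.
  const overflowsOnMobile = await page.evaluate(
    () => document.documentElement.scrollWidth > document.documentElement.clientWidth
  );

  await browser.close();
  return { hasTitle, hasMetaDescription, overflowsOnMobile };
}
```

Because a harness like this only ever sees the deployed URL, the same checks apply whether the tool generated Next.js, Rails, or something hand-rolled.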
We're building Kshamta as an open benchmark that asks three questions about every tool. First, how complex an app can it build? A static landing page is table stakes. A full SaaS product with authentication, database persistence, and real-time features is a different challenge entirely, and most tools hit a ceiling well before they get there. Second, can it evolve what it built? After the initial app works, the benchmark sends follow-up prompts — add a feature, delete a feature, change the data model — and checks whether each round of changes breaks what was already working. Third, is the output production-ready? Every deployed app gets audited on the things that determine whether it can actually go live: can a screen reader navigate it, are secrets exposed, does the layout survive a phone screen, will search engines find it.
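To show what the second question looks like in practice, here is one possible shape for the evolution loop; ToolHarness, sendPrompt, and runCheckSuite are hypothetical placeholders for whatever interface a given tool and the audit suite expose, not an API that exists today.

```typescript
// Hypothetical sketch of the evolution loop: after the initial build passes,
// each follow-up prompt is applied one at a time and the entire check suite
// is re-run, so a round that breaks previously working features is caught.
type CheckResults = Record<string, boolean>; // check name -> pass/fail

interface ToolHarness {
  // Placeholder interface, not a real Kshamta or tool API.
  sendPrompt(prompt: string): Promise<string>;        // resolves to the redeployed URL
  runCheckSuite(url: string): Promise<CheckResults>;  // the full behavioral audit
}

interface RoundResult {
  prompt: string;
  results: CheckResults;
  regressions: string[]; // checks that passed before this round and fail now
}

async function evaluateEvolution(
  harness: ToolHarness,
  initialUrl: string,
  followUps: string[] // e.g. ["now add dark mode", "now swap the contact form for a blog"]
): Promise<RoundResult[]> {
  let previous = await harness.runCheckSuite(initialUrl);
  const rounds: RoundResult[] = [];

  for (const prompt of followUps) {
    const url = await harness.sendPrompt(prompt);
    const current = await harness.runCheckSuite(url);
    const regressions = Object.keys(previous).filter(
      (name) => previous[name] && !current[name]
    );
    rounds.push({ prompt, results: current, regressions });
    previous = current;
  }
  return rounds;
}
```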
The result is a multi-axis profile for each tool. A tool that builds beautiful static sites but falls apart when you ask for authentication gets a score that shows exactly where the ceiling is — not a single leaderboard number that hides what matters.
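As an illustration of what that profile could contain, here is one possible shape sketched in TypeScript; the axis names and complexity tiers are examples, not a finalized schema.

```typescript
// Illustrative result shape: scores reported per axis and per complexity tier,
// rather than collapsed into a single leaderboard number.
// Tier and axis names here are examples only.
type ComplexityTier = 'static-site' | 'crud-app' | 'full-saas';

interface AuditScores {
  accessibility: number;   // 0..1, from the deployed-app audit
  security: number;
  mobileLayout: number;
  seo: number;
  designCoherence: number;
}

interface ToolProfile {
  tool: string;
  // Which complexity tiers produced a working build at all (the ceiling).
  buildCeiling: Record<ComplexityTier, boolean>;
  // Fraction of follow-up rounds that landed without breaking prior features.
  evolutionSurvival: number;
  // Production-readiness audit, reported only for tiers the tool could build.
  audits: Partial<Record<ComplexityTier, AuditScores>>;
}
```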
The benchmark is designed for public release. The evaluation is fully deterministic, the submission format is a deployed URL, and the test suite is open so any tool team can run it themselves.
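Nothing about the format is finalized, but under those constraints a self-run submission could be as simple as a deployed URL per task plus the follow-up prompts used, so anyone can re-run the deterministic checks; every field name below is illustrative.

```typescript
// Hypothetical submission record: the deployed URL is the only artifact the
// benchmark inspects, and the prompt sequence makes the run reproducible.
// Field names and values are illustrative only.
interface Submission {
  tool: string;
  task: string;                 // e.g. "saas-dashboard-with-auth-and-payments"
  deployedUrl: string;          // scored in the browser, behaviorally
  followUpPrompts: string[];    // the evolution rounds, in order
}

const example: Submission = {
  tool: 'example-vibe-tool',
  task: 'saas-dashboard-with-auth-and-payments',
  deployedUrl: 'https://example-app.example.com',
  followUpPrompts: ['now add dark mode', 'now swap the contact form for a blog'],
};
```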