Evaluation & Benchmarks
You can't build what you can't measure.
The AI industry ships fast and measures later — if it measures at all. Memory features launch without standardized tests for whether they actually improve user experience. Agents are deployed without benchmarks that test them against the messy, contradictory environments they'll face in production. Models claim multilingual capability without evaluations that go beyond surface-level translation.
We believe that building the right evaluation comes before building the right system. A well-designed benchmark doesn't just score existing tools — it redirects an entire field by showing researchers what actually matters. MMLU reshaped how people thought about model knowledge. We want to do the same for the capabilities the industry is currently ignoring.
Our benchmarks are designed to be open, reproducible, and adopted. We build evaluation frameworks that any team can plug into, and we release everything — datasets, harnesses, baselines — so the community can build on top of our work.
Projects
Hindsight
Document AI models often extract the right values but bind them to the wrong fields, and no standard benchmark measures this failure mode. We build an evaluation that tests concept binding across paystubs, invoices, contracts, and tax forms.
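A minimal sketch of the scoring idea follows, assuming predictions and gold labels arrive as field-to-value dictionaries; the field names and data shapes are invented for illustration, not Hindsight's actual schema or metric.

```python
# Hypothetical sketch: score concept binding, not just value extraction.
# Field names and data shapes are illustrative, not Hindsight's real schema.

def binding_score(predicted: dict[str, str], gold: dict[str, str]) -> dict[str, float]:
    """Compare value-level recall against field-level (binding) accuracy."""
    gold_values = set(gold.values())
    # Value recall: did the model surface the right strings at all?
    value_hits = sum(1 for v in predicted.values() if v in gold_values)
    # Binding accuracy: is each value attached to the correct field?
    binding_hits = sum(1 for field, v in predicted.items() if gold.get(field) == v)
    n = len(gold) or 1
    return {
        "value_recall": value_hits / n,
        "binding_accuracy": binding_hits / n,
    }

# Example: the gross pay amount is extracted correctly but bound to the wrong field.
gold = {"gross_pay": "4,250.00", "net_pay": "3,180.50"}
pred = {"net_pay": "4,250.00", "gross_pay": "3,180.50"}
print(binding_score(pred, gold))  # value_recall = 1.0, binding_accuracy = 0.0
```

The toy example shows exactly the gap the benchmark targets: perfect value recall with zero binding accuracy.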
Kshamta
Vibe coding tools produce apps that work in demos but fall short on mobile layouts, security, and accessibility. We build a benchmark that scores the running app in the browser, not the generated code.
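As a hedged sketch of what browser-level scoring could look like, the example below uses Playwright to load the app at a phone-sized viewport, check for a basic security header, and run an axe-core accessibility scan. The URL, viewport, CDN link, and check list are placeholders, not Kshamta's actual rubric.

```python
# Hypothetical harness sketch: judge the deployed app in a real browser,
# not the generated source. URL, viewport, and checks are illustrative.
from playwright.sync_api import sync_playwright

APP_URL = "http://localhost:3000"  # placeholder for the app under test

def score_running_app(url: str) -> dict[str, bool]:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        # Phone-sized viewport: demos usually only get exercised on desktop.
        page = browser.new_context(viewport={"width": 390, "height": 844}).new_page()
        response = page.goto(url)

        results = {
            "loads": response is not None and response.ok,
            # Basic security signal: is a Content-Security-Policy header present?
            "has_csp": bool(response and response.headers.get("content-security-policy")),
            # Mobile layout signal: no horizontal overflow at phone width.
            "no_horizontal_scroll": not page.evaluate(
                "document.documentElement.scrollWidth > document.documentElement.clientWidth"
            ),
        }

        # Accessibility signal: inject axe-core and count violations.
        page.add_script_tag(url="https://cdn.jsdelivr.net/npm/axe-core@4/axe.min.js")
        axe = page.evaluate("axe.run()")
        results["no_a11y_violations"] = len(axe["violations"]) == 0

        browser.close()
    return results

print(score_running_app(APP_URL))
```

The point of the design is that every check runs against rendered behavior; the same checks apply whether the code was written by a person or generated by a model.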
SMRITI
AI memory systems are benchmarked on recall, not on whether they actually change behavior for each user. We build an evaluation that tests whether systems adapt tone, timing, and restraint across sessions.
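A minimal sketch of a cross-session adaptation check is below. It assumes the memory-backed assistant is exposed as a `chat(session_id, message)` callable and that a separate judge scores reply verbosity; both, along with the toy stand-ins at the bottom, are hypothetical rather than SMRITI's actual harness.

```python
# Hypothetical sketch: measure whether a stated preference changes behavior in a
# later session, rather than whether the system can merely recall the preference.
# `chat` and `judge_verbosity` stand in for the system under test and a judge.
from typing import Callable

def adaptation_delta(
    chat: Callable[[str, str], str],          # (session_id, user_message) -> reply
    judge_verbosity: Callable[[str], float],  # reply -> verbosity score in [0, 1]
) -> float:
    """Positive result: replies got shorter after the user asked for brevity."""
    # Session 1: measure baseline verbosity, then state a lasting preference.
    baseline = judge_verbosity(chat("session-1", "How do I reverse a list in Python?"))
    chat("session-1", "Please keep your answers short. I dislike long explanations.")

    # Session 2, a fresh conversation: the preference is never restated.
    later = judge_verbosity(chat("session-2", "How do I merge two dicts in Python?"))

    return baseline - later

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real harness would wire in
    # the memory-backed assistant and an LLM or rubric-based judge here.
    def fake_chat(session: str, msg: str) -> str:
        return "Short answer." if session == "session-2" else "A very long reply. " * 20

    def fake_judge(reply: str) -> float:
        return min(len(reply) / 400, 1.0)

    print(adaptation_delta(fake_chat, fake_judge))  # ~0.97: behavior adapted
```

The same pattern extends to timing and restraint: issue a preference in one session, then score whether later sessions honor it without being reminded.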