Hindsight
What if your AI reads the number right but understands it wrong?
The Problem
A model reads a paystub and reports $4,523.67. The number is perfect. But is that net pay or gross pay? In mortgage underwriting, getting the wrong one means an $833/month error in qualifying income. The loan gets approved when it shouldn't, or denied when it should. The model didn't misread anything. It misunderstood.
This happens everywhere structured documents matter. A model extracts a date from a contract perfectly but confuses the execution date with the effective date. It reads an invoice total correctly but can't tell the pre-tax subtotal from the post-tax amount due. These are not extraction errors. They are comprehension errors. The model saw the right text and assigned it to the wrong concept.
Nobody is measuring this. Every document AI benchmark in production today measures one of a few things: character-level accuracy, field extraction F1, structural fidelity, or schema validity. Benchmarks like OmniDocBench, DocILE, CORD, and the recent ExtractBench all ask some version of the same question: did the model extract the right value? None of them ask: did the model bind it to the right concept in the first place?
This means every model comparison, every vendor evaluation, every "state-of-the-art" result on document understanding is potentially overstating readiness. The gap between extraction accuracy and actual comprehension is invisible — because nobody has built the instrument to see it.
What We're Exploring
We think the field has been measuring precision of extraction without measuring correctness of understanding. These are different things, and the difference has real consequences.
We're building a benchmark that tests concept-binding — whether a model can assign an extracted value to the correct semantic concept, not just extract the value itself. The benchmark covers four document domains: paystubs, invoices, legal contracts, and W-2 tax forms. Each domain has its own set of concepts that are genuinely confusable. Paystubs have gross versus net, regular versus overtime. Contracts have termination-for-cause versus termination-for-convenience. W-2 forms, where the IRS mandates fixed field positions, serve as a control — if models still confuse concepts there, the problem runs deeper than layout variation.
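As a rough sketch of what a single test item could look like (the class and field names below are our illustrative assumptions, not a finalized schema), each item pairs an extracted value with the set of concepts it could plausibly be bound to:

```python
from dataclasses import dataclass

# Hypothetical sketch of one benchmark item. The class and field names are
# illustrative assumptions, not the benchmark's actual schema.
@dataclass
class ConceptBindingItem:
    domain: str                    # "paystub", "invoice", "contract", or "w2"
    document_id: str               # which source document the value came from
    value: str                     # the extracted surface string, e.g. "$4,523.67"
    candidate_concepts: list[str]  # the genuinely confusable concepts for this value
    gold_concept: str              # the concept the value actually denotes

# A paystub example: the value is easy to read either way; the test is
# whether the model binds it to net pay rather than gross pay.
item = ConceptBindingItem(
    domain="paystub",
    document_id="paystub_0001",
    value="$4,523.67",
    candidate_concepts=["gross_pay", "net_pay"],
    gold_concept="net_pay",
)
```

The point of the structure is that the value itself is held fixed; only the binding is under test.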
The picture we're working toward: you run a model through this benchmark and get back not just an accuracy score, but a confusion matrix showing exactly which concepts the model mixes up with which. A mortgage underwriter can see that a model reliably confuses biweekly and semi-monthly pay. A legal team can see that a model conflates two types of termination clauses. The failure modes become visible and specific, not hidden inside an aggregate F1 score.
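A toy sketch of that output, using made-up predictions, shows how tallying (gold concept, predicted concept) pairs surfaces the specific confusions instead of burying them in one aggregate number:

```python
from collections import Counter

# Minimal sketch of the per-concept confusion matrix we have in mind.
# Each pair is (gold concept, concept the model chose); the data is made up.
predictions = [
    ("net_pay", "gross_pay"),
    ("net_pay", "net_pay"),
    ("gross_pay", "gross_pay"),
    ("biweekly", "semi_monthly"),
    ("semi_monthly", "semi_monthly"),
]

confusion = Counter(predictions)  # (gold, predicted) -> count

for (gold, predicted), count in sorted(confusion.items()):
    flag = "" if gold == predicted else "  <- confusion"
    print(f"{gold:>12} read as {predicted:<12} x{count}{flag}")
```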
Getting there raises questions we find genuinely open:
The core empirical question is the delta: how large is the gap between traditional field extraction scores and concept-binding accuracy? If the gap is large and consistent across domains, it changes how the field should evaluate document AI systems.
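Concretely, the delta falls out of scoring the same items two ways: once counting any correctly read value, and once requiring the correct concept as well. The records below are invented purely to illustrate the computation:

```python
# Illustrative computation of the delta, with made-up per-item records.
# value_correct: the model produced the right string;
# concept_correct: it also bound that string to the right concept.
records = [
    {"value_correct": True,  "concept_correct": True},
    {"value_correct": True,  "concept_correct": False},  # read right, understood wrong
    {"value_correct": True,  "concept_correct": True},
    {"value_correct": False, "concept_correct": False},
]

extraction_acc = sum(r["value_correct"] for r in records) / len(records)
binding_acc = sum(r["value_correct"] and r["concept_correct"] for r in records) / len(records)
delta = extraction_acc - binding_acc  # the gap a value-only benchmark never reports

print(f"extraction accuracy:      {extraction_acc:.2f}")  # 0.75
print(f"concept-binding accuracy: {binding_acc:.2f}")     # 0.50
print(f"delta:                    {delta:.2f}")           # 0.25
```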