SMRITI
Does your AI actually know you, or just remember you?
The Problem
Every major AI memory provider today benchmarks on the same thing: recall. "What is the user allergic to?" The answer is "Peanuts." The system gets a point. That is not personalization. That is a flashcard.
Real personalization is behavioral. You have been talking to an AI for weeks. It knows you default to casual language, think in analogies, and hate long-winded explanations. Then you ask it a hard technical question. Does it explain the way you learn, or does it give you the same generic textbook answer it gives everyone else? No benchmark in the world tests for this. Not LoCoMo, which the entire memory industry uses as its scoreboard. Not LongMemEval, PersonaMem, or MemoryAgentBench, which push further into multi-session reasoning and preference tracking but still frame every evaluation as a question-answering task: given this user's history, pick the correct answer.
The gap is not academic. Memory providers -- Mem0, Supermemory, Zep, Cognee -- are competing to prove they enable personalization. They have no scoreboard for it. They benchmark on recall because that is all anyone has built a benchmark for. Meanwhile, the systems shipping to users are getting judged on a metric that has nothing to do with what users actually experience.
What We're Exploring
We think there is a fundamental difference between remembering a fact about a user and adapting behavior because of it. Current benchmarks collapse the two. We're building one that keeps them apart.
SMRITI is a benchmark for behavioral personalization. It provides multi-session user journeys across diverse personas and tests whether AI systems change how they act -- not just what they recall. Does the system adapt its tone to match how the user communicates? Does it surface relevant information without being asked? When a user's preferences evolve over months of interaction, does the system update or go stale? When a memory would be inappropriate to surface, does it stay quiet?
These are the questions we find genuinely hard.
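To make the recall-versus-behavior distinction concrete, here is a minimal sketch of what a behavioral test case could look like. The schema, field names, and category labels below are illustrative assumptions, not SMRITI's published format.

```python
# Illustrative only: an assumed shape for a behavioral test case,
# not the benchmark's actual schema.
from dataclasses import dataclass


@dataclass
class BehavioralTestCase:
    persona_id: str              # which multi-session user journey this belongs to
    session_history: list[str]   # prior sessions the memory system has ingested
    prompt: str                  # the new user turn under evaluation
    category: str                # e.g. "tone_adaptation", "proactive_surfacing",
                                 #      "preference_evolution", "appropriate_silence"
    rubric: str                  # what an adapted response should do, judged by an
                                 # evaluator rather than by exact-answer matching


# A recall benchmark stops at: answer == "peanuts" -> 1 point.
# A behavioral case instead asks whether the response's style and behavior
# reflect the persona's history:
example = BehavioralTestCase(
    persona_id="persona_042",
    session_history=["...weeks of casual, analogy-heavy conversations..."],
    prompt="Can you explain how a B-tree rebalances?",
    category="tone_adaptation",
    rubric="Uses analogies and a casual register; avoids generic textbook phrasing.",
)
```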
We're designing SMRITI to be as easy to run as the benchmarks the industry already uses: an open dataset, an evaluation harness that plugs into existing memory providers, and a public leaderboard where systems are scored per category rather than collapsed into a single number.
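As a sketch of how that harness could plug into a provider and report per-category scores, the adapter protocol, judge signature, and aggregation below are assumptions (reusing the test-case sketch above), not SMRITI's actual API.

```python
# Illustrative sketch only: assumed adapter surface and per-category scoring,
# not a published interface.
from collections import defaultdict
from typing import Callable, Iterable, Protocol


class MemoryProvider(Protocol):
    """Minimal surface a memory backend adapter (Mem0, Zep, ...) would expose."""

    def ingest(self, user_id: str, session: str) -> None: ...
    def respond(self, user_id: str, prompt: str) -> str: ...


def score_run(
    provider: MemoryProvider,
    cases: Iterable[BehavioralTestCase],
    judge: Callable[[str, str], float],  # (response, rubric) -> score in [0, 1]
) -> dict[str, float]:
    """Average judge scores per behavioral category; never collapse to one number."""
    per_category: dict[str, list[float]] = defaultdict(list)
    for case in cases:
        # Replay the persona's history into the provider, then ask the new turn.
        for session in case.session_history:
            provider.ingest(case.persona_id, session)
        response = provider.respond(case.persona_id, case.prompt)
        per_category[case.category].append(judge(response, case.rubric))
    return {cat: sum(scores) / len(scores) for cat, scores in per_category.items()}
```

Keeping the scores split by category is the point: a system can be strong at recall-style proactive surfacing and still fail tone adaptation, and a single blended number would hide exactly the behavior this benchmark exists to measure.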