The TEMPR Pipeline: How Our AI Decides What to Remember

Every AI system handling ongoing conversations faces the same brutal question: what do you keep, and what do you throw away? Get it wrong in either direction and you've built something useless. Keep too much, and the context window bloats, latency spikes, and the model starts hallucinating connections between things that have no business being connected. Keep too little, and users repeat themselves endlessly, trust evaporates, and the whole thing feels dumb.

We spent a long time getting this wrong before we got it right. The result is what we call the TEMPR pipeline — Tag, Evaluate, Merge, Prune, Retrieve. It's the decision engine sitting underneath our AI products, running in near real-time, determining what a session should remember, what it should compress, and what it should let go.

This is the explainer I wish had existed when we started building it.

What does TEMPR actually stand for?

Five stages: Tag (classify incoming information by type and salience), Evaluate (score each tagged item against retention criteria), Merge (consolidate overlapping or redundant memories into single canonical records), Prune (drop low-score items that haven't been retrieved within a decay window), and Retrieve (fetch the right memories at the right time, ranked by contextual relevance). Every piece of information entering an AI session passes through all five, in order. Nothing gets stored raw. Nothing gets deleted arbitrarily.

And it has to be fast — sub-200ms for the full cycle on a typical session update.

The design assumption underneath all of it: memory isn't a filing cabinet. It's a living graph that changes shape as sessions evolve.

Why does tagging matter before anything else?

Tagging is the foundation. Misclassify something at the start and every downstream stage compounds that error.

The Tag stage classifies each incoming fragment across four dimensions: type (factual, preferential, procedural, or ephemeral), entity (who or what it's about), confidence (how certain the extraction was), and salience (how likely this is to come up again).

Ephemeral tags — "user said they're in a hurry right now" — get a short decay timer set immediately. Factual tags — "user's company name is Bridgewater" — get flagged for long-term storage and deduplication. Preferential tags — "user prefers bullet points over prose" — get merged into a persistent user profile layer that survives across sessions.

We use a small, fast classification model for this, not the main LLM. Tag runs in about 40ms. It doesn't need to be clever. It just needs to be consistent.
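As a rough illustration of what a tagged fragment might look like, here is a minimal sketch in Python. It is not the production implementation: the names `MemoryType`, `Tag`, and `route`, and the routing labels, are all hypothetical. Only the four dimensions and the per-type handling come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryType(Enum):
    FACTUAL = "factual"
    PREFERENTIAL = "preferential"
    PROCEDURAL = "procedural"
    EPHEMERAL = "ephemeral"


@dataclass(frozen=True)
class Tag:
    """The four dimensions the Tag stage assigns to a fragment."""
    type: MemoryType
    entity: str        # who or what the fragment is about
    confidence: float  # how certain the extraction was, 0..1
    salience: float    # how likely this is to come up again, 0..1


def route(tag: Tag) -> str:
    """Return a hypothetical routing label for a tagged fragment."""
    if tag.type is MemoryType.EPHEMERAL:
        return "short_decay"      # short decay timer set immediately
    if tag.type is MemoryType.FACTUAL:
        return "long_term_dedup"  # long-term storage plus deduplication
    if tag.type is MemoryType.PREFERENTIAL:
        return "profile_layer"    # persistent user profile layer
    return "default"              # assumption: procedural handling is not described here
```

For example, `route(Tag(MemoryType.EPHEMERAL, "user", 0.9, 0.1))` returns `"short_decay"`. Keeping the tag immutable (`frozen=True`) is a design choice of this sketch: downstream stages read it but never rewrite it.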

How does the Evaluate stage decide what's worth keeping?

Evaluate scores each tagged item on three axes: recency (when was it mentioned?), frequency (has it come up before?), and utility (did the model actually use it in a response?). Each axis produces a score between 0 and 1, combined with weighted coefficients we tuned over roughly three months of production data.

Utility turned out to be the most important signal by a significant margin — and that genuinely surprised us. We'd originally weighted recency highest, which felt intuitive. But in practice, something mentioned once and then used in every subsequent response is far more valuable than something mentioned five minutes ago and never touched again. We were wrong about that for longer than I'd like to admit.
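To make the weighting concrete, here is a small sketch of a three-axis score as a normalized weighted sum. The coefficient values are placeholders chosen only so that utility dominates. They are not the tuned production numbers, and `Weights` and `combined_score` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # Placeholder coefficients (assumption), with utility weighted highest.
    recency: float = 0.2
    frequency: float = 0.3
    utility: float = 0.5


def combined_score(recency: float, frequency: float, utility: float,
                   w: Weights = Weights()) -> float:
    """Combine three axis scores (each in 0..1) into a single 0..1 score."""
    for name, value in (("recency", recency),
                        ("frequency", frequency),
                        ("utility", utility)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    total = w.recency + w.frequency + w.utility
    weighted = (w.recency * recency
                + w.frequency * frequency
                + w.utility * utility)
    return weighted / total  # normalize so the result stays in 0..1


# Mentioned long ago but used constantly vs. mentioned just now and never used.
used_constantly = combined_score(recency=0.2, frequency=0.1, utility=1.0)
just_mentioned = combined_score(recency=1.0, frequency=0.1, utility=0.0)
```

With these placeholder weights, `used_constantly` comes out higher than `just_mentioned`, which is the behavior the paragraph above describes.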

Evaluate also flags items for the Merge stage if they're semantically similar to something already in the memory store. That handoff — Evaluate to Merge — is where the pipeline earns its keep, because without it you end up with five slightly different versions of the same fact living in memory simultaneously.

What goes wrong if you skip the Merge step?

We skipped a proper Merge step in our first version. Didn't even have one. We just appended to a growing context string and hoped the LLM would sort out contradictions on its own.

It didn't.

You'd end up with things like: "user works at Acme Corp" stored alongside "user's employer is Acme" stored alongside "user mentioned Acme earlier" — three separate memories, eating tokens, occasionally confusing the model into treating them as distinct facts.

Merge takes flagged items from Evaluate, runs a semantic similarity check against the existing memory graph, and either consolidates them into the canonical record or creates a new one if no match clears a threshold. We landed on 0.82 cosine similarity after a lot of testing. Lower than that and you get false merges. Higher and you get duplicate proliferation.

The merged record inherits the highest utility and frequency scores from its source items, so consolidation doesn't accidentally demote something important. You can see how this plays out in production in our work on Forge, where memory coherence was critical to the whole product experience.
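The merge decision can be sketched roughly as follows. This is an illustration under stated assumptions rather than the real implementation: the memory graph is simplified to a flat list, embeddings are plain lists of floats supplied by the caller, and `Record`, `cosine`, and `merge` are hypothetical names. The 0.82 threshold and the rule that the merged record inherits the highest scores come from the text above.

```python
import math
from dataclasses import dataclass

MERGE_THRESHOLD = 0.82  # cosine similarity threshold from the article


@dataclass
class Record:
    text: str
    embedding: list[float]  # assumption: computed upstream by some embedding model
    utility: float
    frequency: float


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge(store: list[Record], item: Record,
          threshold: float = MERGE_THRESHOLD) -> Record:
    """Fold item into its closest match, or append it as a new canonical record."""
    best = max(store, key=lambda r: cosine(r.embedding, item.embedding),
               default=None)
    if best is not None and cosine(best.embedding, item.embedding) >= threshold:
        # The canonical record inherits the highest scores of its sources.
        best.utility = max(best.utility, item.utility)
        best.frequency = max(best.frequency, item.frequency)
        return best
    store.append(item)
    return item
```

A linear scan over the store is fine for a sketch; at larger graph sizes a vector index would be the more likely choice.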

Why does pruning need its own stage?

Pruning isn't just deletion. It's a scheduled, criteria-driven decay process, and it runs on a different clock from everything else — which is exactly why it needs to be its own stage. Tag, Evaluate, Merge, and Retrieve all fire per-event, triggered by new input. Prune runs on a timer: every 60 seconds in active sessions, every 10 minutes in idle ones.

Items get pruned when their combined score drops below a floor threshold AND they haven't been retrieved within a configurable decay window. The defaults: 15 minutes for ephemeral items, 7 days for factual ones, indefinite for preferential ones.

This is also where I made one call I'd still defend: Prune is fully auditable. Every deletion is logged with the reason — score too low, decay window exceeded, explicit user request. When things went wrong (and they did), we could trace exactly why a memory disappeared. That audit trail saved us weeks of debugging. If you're building something similar, don't skip the audit log. It's not glamorous, but it's how you diagnose problems at 2am without losing your mind.
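Here is a minimal sketch of the prune rule and its audit trail. The timer intervals and decay windows are the defaults quoted above. The floor value, the `Memory` fields, the log entry shape, and the choice to never prune types without a configured window are assumptions made for illustration.

```python
from dataclasses import dataclass

# Decay windows in seconds; None means "indefinite".
DECAY_WINDOW_SECONDS = {
    "ephemeral": 15 * 60,
    "factual": 7 * 24 * 60 * 60,
    "preferential": None,
}
SCORE_FLOOR = 0.3  # assumption: the real floor threshold is not given


@dataclass
class Memory:
    key: str
    type: str
    score: float           # combined score from Evaluate
    last_retrieved: float  # unix timestamp in seconds


def prune_interval(active: bool) -> int:
    """Seconds between prune runs: 60 in active sessions, 600 in idle ones."""
    return 60 if active else 600


def prune(memories: list[Memory], now: float, audit_log: list[dict],
          floor: float = SCORE_FLOOR) -> list[Memory]:
    """Return surviving memories, logging every deletion with its reason."""
    kept = []
    for m in memories:
        # Types with no configured window are never treated as stale (assumption).
        window = DECAY_WINDOW_SECONDS.get(m.type)
        stale = window is not None and (now - m.last_retrieved) > window
        if m.score < floor and stale:  # both conditions must hold
            audit_log.append({
                "key": m.key,
                "reason": "score below floor and decay window exceeded",
                "score": m.score,
                "pruned_at": now,
            })
        else:
            kept.append(m)
    return kept
```

Passing `now` in explicitly, rather than reading the clock inside `prune`, keeps the function deterministic, which makes a pruning decision easy to replay from the audit log.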

Our Telehance project pushed us hardest on pruning logic. Long async conversations made every pruning decision visible to users in a way that shorter sessions never had — there was nowhere to hide a bad call.

How does Retrieve know what to surface?

Retrieve is the stage users experience without knowing it exists. When a new message arrives, Retrieve queries the memory graph and returns a ranked list of relevant memories to inject into the prompt context. Ranking combines the utility and frequency scores from Evaluate with a query-relevance score computed at retrieval time.

We cap injection at 800 tokens of memory content. That cap matters — it forces Retrieve to be genuinely selective rather than dumping everything and hoping for the best. The top-ranked memories get injected in a structured format that tells the model what kind of memory each item is — factual, preferential, procedural — so it knows how much weight to give each one.
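One way to picture the cap is a greedy selection over ranked candidates, sketched below. Several things here are assumptions: the three scores are averaged with equal weights, token counts are precomputed by the caller, an item that does not fit is skipped in favor of smaller lower-ranked ones, and the `[type] text` rendering is a made-up format. Only the 800-token cap and the three ranking inputs come from the text.

```python
from dataclasses import dataclass

MAX_MEMORY_TOKENS = 800  # injection cap from the article


@dataclass(frozen=True)
class Candidate:
    text: str
    type: str         # "factual", "preferential", or "procedural"
    utility: float    # from Evaluate
    frequency: float  # from Evaluate
    relevance: float  # query relevance, computed at retrieval time
    tokens: int       # assumption: token count supplied by the caller


def rank_score(c: Candidate) -> float:
    """Equal-weight average of the three signals (placeholder weighting)."""
    return (c.utility + c.frequency + c.relevance) / 3


def select(candidates: list[Candidate],
           budget: int = MAX_MEMORY_TOKENS) -> list[Candidate]:
    """Take candidates in rank order, skipping any that would exceed the budget."""
    chosen = []
    used = 0
    for c in sorted(candidates, key=rank_score, reverse=True):
        if used + c.tokens <= budget:
            chosen.append(c)
            used += c.tokens
    return chosen


def render(chosen: list[Candidate]) -> str:
    """Label each memory with its type so the model can weigh it accordingly."""
    return "\n".join(f"[{c.type}] {c.text}" for c in chosen)
```

The skip-and-continue behavior is one possible policy. Stopping at the first candidate that does not fit would be stricter about rank order but would tend to leave more of the budget unused.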

We're currently testing dynamic injection caps that adjust based on available context window space and query complexity, which should improve performance on longer sessions. Get in touch if that kind of architecture problem is something you're actively wrestling with — we've made enough mistakes here to have useful opinions.

You can see how different Retrieve strategies affect real products in our UI/UX skills work, where surfacing the right memory at the right moment is the difference between a product that feels intelligent and one that just feels like a search engine with a chat interface.

What does TEMPR look like running in production?

In a live session, TEMPR processes every user turn in under 200ms end-to-end. Tag and Evaluate run in parallel after initial classification — that's the single biggest latency saving we found. Merge and Retrieve run sequentially, because Retrieve needs the post-Merge graph to be accurate. Prune runs async on its timer and doesn't block anything.

The memory graph for an average session tops out around 40–60 nodes. Complex, multi-session users hit 200+ nodes — that's where the scoring and pruning logic really proves its worth. Without it, retrieval from those sessions would be unusably slow.

We run TEMPR across multiple products now. Each has its own tuned coefficients — the right weighting for a customer support product is different from the right weighting for a research assistant. But the five stages don't change. The architecture holds.

If you're building AI products and wrestling with memory, take a look at what we build at Nuclear Marmalade — or if you're an engineer who wants to work on problems like this, we're hiring.

Key Takeaways

TEMPR stands for Tag, Evaluate, Merge, Prune, Retrieve: five stages that together decide what an AI session should and shouldn't remember
Utility score (did the model actually use this?) matters more than recency — that one surprised us and changed how we weight the Evaluate stage
Skipping Merge early on created duplicate memory records that confused the model and wasted tokens — don't skip Merge
Prune needs its own clock, separate from the event-driven stages — and every deletion should be logged so you can debug at 2am without losing your mind
The five-stage structure stays constant across products; only the coefficients change to fit the use case