Is Implicit Caching Prompt Retention?

OpenRouter monitors every endpoint, from every provider, to understand its current data retention policy. We publish this information in our docs and in our API, which also supports ZDR routing rules. But data retention is not black and white: Google, in particular, has taken the stance that models with implicit prompt caching do in fact constitute “retention”. Most providers would disagree, so we set out to form an independent opinion.
What is implicit caching?
Caching in general matters primarily because it can save an enormous amount of money. Every token an LLM generates depends on everything that came before it; recomputing those dependencies from scratch is wasteful. By storing the key and value tensors that represent the model’s intermediate activations (the “KV cache”), providers can skip prompt prefill and continue the conversation using far less compute. Implicit caching simply means that this cache is maintained automatically, behind the scenes, without explicit user control.
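To make the compute saving concrete, here is a minimal sketch of prefill versus cached decode using the open-source Hugging Face transformers library. The model and prompt are illustrative; production providers implement the same idea inside their own serving stacks rather than through this API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Every token an LLM generates depends on everything that came before it."
inputs = tokenizer(prompt, return_tensors="pt")

# Prefill: one forward pass over the whole prompt produces the per-layer
# key/value tensors (the "KV cache") for every prompt token.
with torch.no_grad():
    out = model(**inputs, use_cache=True)
past_key_values = out.past_key_values

# Decode: each new token needs only the cached keys/values plus the single
# newest token; the prompt itself is never re-processed.
next_token = out.logits[:, -1:].argmax(dim=-1)
attention_mask = torch.cat([inputs["attention_mask"], torch.ones_like(next_token)], dim=1)
with torch.no_grad():
    out = model(
        input_ids=next_token,
        attention_mask=attention_mask,
        past_key_values=past_key_values,
        use_cache=True,
    )
```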
Explicit caching accomplishes the same thing, but the user places cache-control markers on the sections of the prompt they want the provider to cache. It’s explicit that the data is being stored for reuse. For that reason, explicit caching is never considered data retention. The question is whether implicit caching, which happens automatically within the inference system, should be treated as retention under Zero Data Retention (ZDR) standards.
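For illustration, this is roughly what explicit caching looks like with Anthropic-style cache_control markers; exact field names vary by provider and API version, and LONG_REFERENCE_DOCUMENT is a placeholder for the large, reusable prefix.

```python
import anthropic

LONG_REFERENCE_DOCUMENT = "..."  # placeholder: the large prompt prefix worth caching

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=256,
    system=[
        {
            "type": "text",
            "text": LONG_REFERENCE_DOCUMENT,
            # The caller explicitly marks this block as cacheable:
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize section 3 of the document."}],
)
```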
Three Tiers of Storage
There are (roughly) three hierarchical tiers of storage & memory used for serving large models (a simplified paging sketch follows the list):
- On-device memory (HBM / GPU VRAM): This is where the active inference happens. Keys and values live here while tokens are being generated in real time. No controversy here; obviously the KV tensors need to be resident on the GPU for output generation.
- Host memory (CPU DRAM): When GPU memory fills up or context switches occur, the KV cache may spill over to host DRAM. In fact, many inference configurations are rapidly paging KV cache in and out of on-device memory between tokens so that each inference consumer gets a consistent stream of tokens back, without being blocked by a long generation. Again, this is volatile memory and no one would consider it “retention” of the prompt data.
- NVMe / SSD paging: The majority of commercial inference setups extend caching one step further, paging KV tensors to fast SSD storage via NVMe. This allows models to handle more concurrent sessions than the sum of their GPU and DRAM capacity would normally allow. This is, largely, where true “prompt caching” happens, and where the controversy resides.
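Providers do not publish their exact implementations, so the following is only an illustrative sketch: tier names, capacities, and the least-recently-used policy are assumptions, but it shows how KV-cache entries can be demoted from HBM to DRAM to NVMe and promoted back on a hit.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class Tier:
    name: str
    capacity: int  # max number of cached prompt prefixes held in this tier
    entries: OrderedDict = field(default_factory=OrderedDict)  # prefix hash -> KV tensors


class TieredKVCache:
    def __init__(self):
        self.tiers = [Tier("hbm", 4), Tier("dram", 32), Tier("nvme", 512)]

    def put(self, prefix_hash: str, kv_tensors) -> None:
        self._insert(0, prefix_hash, kv_tensors)

    def get(self, prefix_hash: str):
        # On a hit in a lower tier, promote the entry back up to HBM.
        for tier in self.tiers:
            if prefix_hash in tier.entries:
                kv_tensors = tier.entries.pop(prefix_hash)
                self._insert(0, prefix_hash, kv_tensors)
                return kv_tensors
        return None  # cache miss: the prompt must be prefilled from scratch

    def _insert(self, level: int, prefix_hash: str, kv_tensors) -> None:
        if level >= len(self.tiers):
            return  # fell off the last tier: the entry is gone, nothing persists
        tier = self.tiers[level]
        tier.entries[prefix_hash] = kv_tensors
        if len(tier.entries) > tier.capacity:
            victim_hash, victim_kv = tier.entries.popitem(last=False)
            self._insert(level + 1, victim_hash, victim_kv)  # demote the LRU entry
```

The detail that matters for the retention debate is the base case of _insert: once an entry ages out of the final tier it simply disappears; nothing in this sketch writes to durable, long-lived storage.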
What’s in the box???
It should be the case that only the KV tensors (floating-point matrices that encode the result of pre-filling the prompt into the LLM) get paged out to the attached SSD. A one-way hash key derived from the incoming prompt is stored alongside the tensors, so that as new prompts arrive, a lookup can be performed to see whether a precomputed KV cache can be loaded from the SSD. The token IDs and text fragments of the original prompts should not be stored alongside them.
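As an illustrative sketch (the function names and the dict standing in for the SSD store are assumptions), the lookup keys on a one-way digest of the prompt-prefix token IDs, so only the digest and the floating-point tensors are ever written out:

```python
import hashlib

import numpy as np


def prefix_hash(token_ids: list[int]) -> str:
    # One-way digest of the prompt prefix; the digest cannot be inverted to
    # recover the tokens or the original text.
    raw = b"".join(t.to_bytes(4, "little") for t in token_ids)
    return hashlib.sha256(raw).hexdigest()


def page_out(ssd_store: dict[str, np.ndarray], token_ids: list[int], kv_tensors: np.ndarray) -> None:
    # Only {digest -> KV tensors} is persisted; the token IDs themselves are dropped.
    ssd_store[prefix_hash(token_ids)] = kv_tensors


def try_reuse(ssd_store: dict[str, np.ndarray], token_ids: list[int]) -> np.ndarray | None:
    # A new request with an identical prefix hashes to the same key and reuses
    # the precomputed KV cache, skipping prefill for those tokens.
    return ssd_store.get(prefix_hash(token_ids))
```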
We have verified with the hyperscale clouds hosting proprietary models that indeed no tokens or text are being paged out to SSDs. We have also validated this specifically with Groq and Cerebras (who, in fact, offer implicit caching but store everything in DRAM; no SSDs involved at all!). We have yet to find a provider that is paging text or tokens to SSDs, but we intend to continue this verification process for all providers.
Lastly, it is worth noting that many more providers implement this SSD architecture than publicly offer cached pricing or implicit caching. Caching KV tensors this way makes inference more efficient; providers are incentivized to implement it, and they are not obligated to pass the savings on or to publicly offer lower prices for cached prompts. This is why we are continuing to dig into provider setups and verify how this truly works under the hood, not just for providers publicly offering implicit caching.
The Question of Retention
So, is implicit caching data retention? We don’t think so. While caches do technically preserve derived data for a brief period, they do so only for performance continuity, and not in a form that is recoverable or meaningful as user data. The KV tensors are ephemeral, unaligned with raw tokens, and in most architectures are cleared when the session terminates or the cache is evicted. Additionally, these SSDs sit physically alongside the GPU racks; the data is not, for example, being retained in log storage or some other system elsewhere in the provider’s cloud.
KV tensors have historically been viewed as challenging or impossible to reverse into tokens or plaintext; they are model-dependent representations that lose alignment with discrete vocabulary information once the forward pass completes. Even with direct access to the tensors, recovering the original prompt remains difficult. That said, this assumption has been challenged recently, and KV cache data should be treated as something worth protecting - but we don’t believe that ephemeral storage via on-premise SSDs meaningfully increases the attack surface compared to having the exact same values in CPU DRAM.
So could a sufficiently motivated actor (say, a state-level adversary with full physical access and unlimited time) recover information from paged-out caches? Perhaps, but this is true of any transient compute memory. And unlike the tensors in the volatile memory, the data on the SSDs is typically encrypted. Practically speaking, the information is not retained in any operational or extractable sense.
Provider Differences and Interpretation
We believe that Google’s stance - that Gemini’s implicit caching disqualifies it from ZDR - is overly conservative, and that it is not what customers are actually asking about when they evaluate a provider’s ZDR posture.
OpenRouter is taking a pragmatic view. Based on our understanding of how implicit caching actually works - and our conversations with providers - we consider ephemeral KV caching not to constitute data retention. Therefore, endpoints with implicit caching but no other form of persistent storage qualify as ZDR-compliant on our platform.
Still, transparency matters. We document precisely which endpoints support implicit caching, so customers who are concerned about this form of storage can avoid them. If you disagree with our interpretation, you can simply configure your routing rules accordingly.
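For example, a request can express that preference through provider routing options. The sketch below follows OpenRouter’s chat completions API, but the specific preference fields shown (zdr, data_collection) should be confirmed against the current OpenRouter docs before you rely on them.

```python
import requests

response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer <OPENROUTER_API_KEY>"},
    json={
        "model": "openrouter/auto",
        "messages": [{"role": "user", "content": "Hello"}],
        # Provider routing preferences; verify exact field names in the docs.
        "provider": {
            "zdr": True,                # only route to ZDR-qualified endpoints
            "data_collection": "deny",  # exclude providers that retain prompt data
        },
    },
    timeout=30,
)
print(response.json())
```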
Conclusion
Implicit caching is a performance optimization, not a data retention mechanism. The distinction matters: ZDR should protect user data from persistence, not from transient in-memory tensors that vanish and are unavailable for inspection outside of the inference process. We do not believe our customers intend to “turn off caching” when they indicate a preference for ZDR.
But most importantly, OpenRouter will continue working with our providers to ensure proper protection of our customers’ most sensitive data, and to invest in understanding, documenting, and making this information actionable for them.