← all writing

The 1M context window is a trap (for now)

Bigger context windows are exciting and useless without a retrieval strategy. A pragmatic guide to picking what to put in.

When Gemini announced 1M token context windows, the discourse split into two camps: “RAG is dead” and “RAG is not dead.” Both camps were mostly arguing about the wrong thing.

The relevant question isn’t whether a large context window can hold your documents. It’s whether the model can attend to them usefully.

The lost-in-the-middle problem

Research from 2023 established what practitioners had already noticed: language models attend poorly to information in the middle of long contexts. Performance peaks when relevant information is at the beginning or end. Stuff it in the middle and accuracy drops.

This doesn’t go away with 1M token windows. It gets worse. The middle is now much larger.

Stuffing your entire codebase into a context window and asking “what does this function do” works. Asking “are there any security vulnerabilities” does not, reliably.

Context windows and cost

At current pricing, a 1M token context call costs real money. If you’re making thousands of calls per day, sending 200k tokens of context each time adds up fast.

Retrieval is cheap. A hybrid search + rerank pipeline adds ~50ms and returns 2,000 tokens of highly relevant context. That’s almost always better than 200,000 tokens of broadly relevant context — for both accuracy and cost.

When large context windows actually help

There are cases where large context windows are genuinely the right tool:

  • Code review — reviewing a full diff including all touched files benefits from coherent context.
  • Document summarization — processing a single long document end-to-end.
  • Multi-turn conversations — keeping a long conversation history without summarization loss.

In all these cases, the key property is that the relevant context is known and bounded. You’re not searching for a needle; you’re processing a known artifact.

The retrieval strategy that survives

For most production use cases:

  1. Hybrid search (BM25 + dense retrieval) to get top-50 candidates.
  2. Cross-encoder reranker to score and select top-5.
  3. Inject the top-5 chunks at the beginning of the context, before any conversation history.
  4. Track which chunks were used — this is how you improve the retrieval pipeline over time.

This pipeline is boring, reliable, and debuggable. That’s what you want in production.


Large context windows are a tool, not a strategy. The strategy is still: figure out exactly what the model needs to know, and put exactly that in the context.

← back to writing Subscribe to The Cue →