Context window management when 128k still isn't enough

Larger context windows were supposed to make context engineering obsolete. They didn’t. The needle-in-a-haystack benchmarks show that models can find a single fact in 128k tokens; the real failures happen when you ask them to use that fact in a multi-step reasoning chain. Recall is not reasoning, and the gap between the two grows with context size.

Where long context degrades silently

Information in the middle of a long context is recalled less reliably than information at the start or end, the well-documented "lost in the middle" effect. This is not a small effect; it is the dominant performance drop in long-context tasks. Reasoning quality drops faster than recall: a model that can find a fact at 100k tokens often cannot use that fact correctly at the same depth. And API costs scale linearly with input tokens, so the price of "just put everything in context" gets ugly fast.
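A quick way to see this in your own stack is a depth-sweep probe: plant one fact (the needle) at varying depths in filler text and check whether the model can actually use it, not just quote it. Below is a minimal sketch under stated assumptions: ask_model is a placeholder you would wire to your own provider, and build_context and depth_sweep are hypothetical helper names, not any library's API.

```python
def ask_model(prompt: str) -> str:
    # Placeholder: wire this to your LLM provider of choice.
    raise NotImplementedError

def build_context(needle: str, filler: list[str], depth: float) -> str:
    """Insert the needle sentence at a fractional depth (0.0 = start, 1.0 = end)."""
    i = int(depth * len(filler))
    return " ".join(filler[:i] + [needle] + filler[i:])

def depth_sweep(needle: str, question: str, expected: str,
                filler: list[str],
                depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict[float, bool]:
    """Return pass/fail per depth.

    Run this twice: once with a recall probe ("what is X?") and once with a
    reasoning probe ("what follows from X?"). The second degrades faster.
    """
    results = {}
    for d in depths:
        ctx = build_context(needle, filler, d)
        answer = ask_model(f"{ctx}\n\nQuestion: {question}")
        results[d] = expected.lower() in answer.lower()
    return results
```

Sweeping depths rather than testing one position is the point: a model that passes at 0.0 and 1.0 but fails at 0.5 is exhibiting exactly the mid-context degradation described above.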

What works better than maximal context

Selective retrieval still beats stuffing. A vector search that returns the top 8 chunks usually outperforms putting the entire corpus in context, even when the corpus fits. Hierarchical summarization (summarize each chunk, then reason over the merged summaries) uses an order of magnitude less compute for tasks where the full text isn't needed. Reserve the long-context budget for cases where the structure of the document genuinely matters.
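Both patterns are a few lines once you have an embedding model and a summarization call. In the sketch below, embed and summarize are placeholders for your own stack (not any particular library's API), and top_k_chunks and hierarchical_summary are hypothetical helper names.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: your embedding model goes here.
    raise NotImplementedError

def summarize(text: str) -> str:
    # Placeholder: your LLM summarization call goes here.
    raise NotImplementedError

def top_k_chunks(query: str, chunks: list[str], k: int = 8) -> list[str]:
    """Spend the context budget on the k most relevant chunks, not the corpus."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    vecs = [embed(c) for c in chunks]
    # Cosine similarity against the normalized query vector.
    scores = [float(np.dot(q, v / np.linalg.norm(v))) for v in vecs]
    ranked = sorted(zip(scores, range(len(chunks))), reverse=True)
    return [chunks[i] for _, i in ranked[:k]]

def hierarchical_summary(chunks: list[str]) -> str:
    """Level 1: compress each chunk. Level 2: reason over the merged digest."""
    partials = [summarize(c) for c in chunks]
    return summarize("\n\n".join(partials))
```

Either way, the prompt ends up holding k chunks or one digest instead of the whole corpus, which is where the order-of-magnitude savings comes from.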

The context window is a budget, not a free shelf. Spend it like a budget.
