What 'context window' really means, and why the numbers keep going up
Rohan Mehta of Pulse AI explains AI context windows, why bigger isn't better, and the real number builders should care about beyond the 1M token hype.

Last month, I was wrestling with a project for a client, trying to build a chatbot that could act as an expert on their company's entire library of technical documentation. My first thought was simple: just copy-paste everything into the prompt. I fed a few hundred pages of PDFs into the latest model API, watched my token counter spin up like a casino slot machine, and held my breath. The initial answer was impressive. But when I asked a follow-up question about a specific detail buried deep in the middle of a troubleshooting guide, the model confidently hallucinated. It gave me a plausible-sounding, but completely wrong, answer. The promise of a gigantic 'context window' felt like a magic solution, but my experience showed me the messy reality. The numbers are going up, but our understanding of what they mean needs to catch up.
The 'context window' is one of the most talked-about specs in AI, but it's also one of the most misunderstood. In simple terms, it's the model's short-term memory. It's the maximum amount of information—both your prompt and the model's own generated response—that the model can 'see' at any one time. Anything outside this window is forgotten. This memory isn't measured in words, but in 'tokens'. A token is a piece of a word. For example, a short word like 'cat' is usually a single token, while 'chatting' might be split into two: 'chat' and 'ting'. As a rule of thumb, for English text, 100 tokens equals about 75 words. So a model with a 100,000-token context window can process roughly 75,000 words, more than the full text of *The Great Gatsby*.
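To make tokens concrete, here is a quick sketch using OpenAI's open-source `tiktoken` library to count and inspect them. The `cl100k_base` encoding is the one used by GPT-4-era OpenAI models; other providers tokenise text differently, so treat any count as an estimate.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "The context window is the model's short-term memory."
tokens = enc.encode(text)

print(f"{len(tokens)} tokens for {len(text.split())} words")
# Decode each token individually to see how words get split up.
print([enc.decode([t]) for t in tokens])
```

Run this on your own prompts before sending them to an API, and the 100-tokens-per-75-words rule of thumb stops being abstract.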
We are in the middle of a context window arms race. A couple of years ago, we were excited about the roughly 4,000 tokens of OpenAI's GPT-3.5. Then came GPT-4 with 8K and 32K versions, and GPT-4 Turbo with 128K. It felt like a massive leap. Then, earlier this year, the landscape shifted dramatically. Anthropic launched its Claude 3 model family, offering a 200,000-token window as standard. Not to be outdone, Google dropped a bomb with Gemini 1.5 Pro, demonstrating a functioning one million token context window, which they just recently extended to two million for developers. That’s enough to process about 1.5 million words, or the entire *Lord of the Rings* trilogy several times over. These numbers are staggering, but they can be as misleading as the megapixel count on a cheap smartphone camera.
Here’s the dirty secret: a model’s ability to *contain* information is not the same as its ability to *use* that information effectively. This brings us to the 'lost in the middle' problem. A 2023 study from Stanford researchers, aptly titled 'Lost in the Middle', showed that many LLMs exhibit a U-shaped retrieval curve. They are excellent at recalling facts placed at the very beginning or the very end of their context window, but their performance drops off significantly for information buried in the middle. My own failed experiment with the documentation bot was a perfect example. The model 'saw' the entire text, but when it came to plucking out a single detail from that vast middle section, it got lazy and guessed. It’s like a student cramming for an exam by reading a textbook cover-to-cover in one night; they’ll remember the first and last chapters, but everything in between is a blur.
This is why we need to distinguish between *context length* and *effective context*. The first is the marketing number—1 million tokens! The second is what actually matters—how much of that can the model reliably use? To measure this, researchers developed a clever benchmark called 'Needle in a Haystack' (NIAH). In this test, a single, specific piece of information (the 'needle'), like "The best snack to eat while coding is a samosa," is inserted into a huge volume of random text (the 'haystack'). The model is then asked to retrieve that specific fact. The researchers can move the needle to different positions—the beginning, middle, or end—to see how well the model performs. Both Google and Anthropic have touted near-perfect recall on NIAH tests for Gemini 1.5 Pro and Claude 3, respectively, even at massive scales. This is a genuinely impressive engineering feat, proving that they've made huge strides in solving the 'lost in the middle' problem.
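If you want to probe your own setup, a toy version of NIAH takes only a few lines. This is a minimal sketch, not the official benchmark: `ask_model` is a placeholder for whichever API you actually call, and the filler text is deliberately boring.

```python
NEEDLE = "The best snack to eat while coding is a samosa."
QUESTION = "What is the best snack to eat while coding?"

def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    pos = int(len(filler) * depth)
    return " ".join(filler[:pos] + [needle] + filler[pos:])

def ask_model(prompt: str) -> str:
    """Placeholder: call your LLM API of choice here."""
    raise NotImplementedError

filler = [f"Filler sentence number {i} about nothing in particular." for i in range(5000)]

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack(filler, NEEDLE, depth) + f"\n\nQuestion: {QUESTION}"
    answer = ask_model(prompt)
    print(f"needle at {depth:.0%}: {'samosa' in answer.lower()}")
```

Sweep the depth and the haystack size, plot the pass rate, and you get exactly the recall heatmaps the labs publish.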
So if the new models have near-perfect recall, why not just use a million-token window for every single task? The answer comes down to two very practical, and very painful, constraints: latency and cost. Sending a million tokens to a model isn't instantaneous. It takes time to process that much data, and the response will be slower. For a conversational chatbot, that delay can kill the user experience. More importantly, it is incredibly expensive. API pricing is typically based on the number of input and output tokens. Using a massive context window for a simple task is like renting a 10-tonne truck to deliver a single tiffin box. You wouldn't. If you only need to analyze a 20-page document, which fits comfortably within a 16,000-token window, using a 200K or 1M window is burning money for no reason. Your cloud bill will thank you for being frugal.
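The arithmetic is worth doing explicitly. The price below is a made-up placeholder, not any vendor's actual rate; the ratio is what matters.

```python
# Hypothetical price; substitute your provider's real input-token rate.
PRICE_PER_1M_INPUT_TOKENS = 3.00  # dollars

def daily_input_cost(tokens_per_query: int, queries_per_day: int) -> float:
    return tokens_per_query / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS * queries_per_day

# The same 1,000 queries per day, two very different bills:
print(f"16K context: ${daily_input_cost(16_000, 1000):,.2f}/day")     # $48.00/day
print(f"1M context:  ${daily_input_cost(1_000_000, 1000):,.2f}/day")  # $3,000.00/day
```

A roughly 60x difference for the same task, before you've even counted output tokens.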
There's another, more subtle problem with giant context windows: distraction. Giving a model more context isn't always better, especially if that context is noisy, contradictory, or contains irrelevant information. Imagine asking an employee for a summary of one specific customer email, but to do so, you first make them read every single email the company received that day. By the time they get to your actual question, their brain is saturated. They might confuse details, blend facts from different sources, or get pulled down a rabbit hole by an irrelevant but interesting-sounding tangent. LLMs can behave similarly. Overly large and unfocused context can degrade reasoning and increase the likelihood of the model fixating on the wrong details, leading to less accurate and less helpful responses.
This is precisely why Retrieval-Augmented Generation, or RAG, isn't going anywhere. For the uninitiated, RAG is a technique where you don't stuff your entire knowledge base into the model's context. Instead, you keep your documents in an external, searchable database (often a vector database). When a user asks a question, your system first performs a rapid search to find the handful of most relevant text chunks. Then, you feed *only those relevant chunks* into the LLM's prompt, along with the original question. The RAG system acts as a highly efficient librarian, finding the exact books the model needs instead of making it read the entire library for every query.
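Here is a minimal sketch of that retrieval step. It uses TF-IDF from scikit-learn as a stand-in for a proper embedding model and vector database, and the chunks are invented examples; in production you would swap in real embeddings, but the shape of the pipeline is the same.

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Knowledge base, pre-split into chunks, each tagged with its source.
chunks = [
    {"source": "troubleshooting.pdf, p.42",
     "text": "To reset the device, hold the power button for ten seconds."},
    {"source": "install-guide.pdf, p.3",
     "text": "The installer requires 2 GB of free disk space."},
    {"source": "faq.pdf, p.7",
     "text": "Warranty claims must be filed within 30 days of purchase."},
]

vectorizer = TfidfVectorizer()
chunk_matrix = vectorizer.fit_transform(c["text"] for c in chunks)

def retrieve(question: str, k: int = 2) -> list[dict]:
    """Return the k chunks most similar to the question."""
    scores = cosine_similarity(vectorizer.transform([question]), chunk_matrix)[0]
    return [chunks[i] for i in scores.argsort()[::-1][:k]]

question = "How do I reset the device?"
relevant = retrieve(question)

# Feed ONLY the relevant chunks to the model, keeping the sources
# so the final answer can cite them.
context = "\n".join(f"[{c['source']}] {c['text']}" for c in relevant)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Note that each chunk carries its source along with it, which sets up the verifiability argument below.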
Some big-context evangelists claim RAG is now a clunky, obsolete workaround. I completely disagree. For any serious enterprise application, RAG offers two killer advantages that large context alone cannot. First is verifiability. With RAG, you know exactly which source documents were used to generate an answer. You can display these sources to the user, allowing them to verify the information. This is non-negotiable for legal, medical, or financial applications where 'trust me, the AI said so' is not an acceptable citation. Second is scalability and freshness. Your company might have petabytes of data that are updated daily. With RAG, you simply update the search index—a fast and cheap operation. With a pure context-window approach, handling that scale and dynamism becomes an expensive, slow-moving nightmare.
So, after all this, what is the 'context window' number that you, as a builder or a professional, should actually care about? It’s not the headline one or two million token figure. That number is a statement of technical capability, a North Star for the research labs. The number that matters for your project is the *effective, affordable, and fast* context window. For many, many applications today—from customer service bots to document summarizers—that sweet spot lives somewhere between 32,000 and 200,000 tokens. That's large enough to handle complex documents, full transcripts, and lengthy conversations without the crippling cost and latency of the bleeding edge.
We’re fortunate to be moving from an era where context was a severe limitation to one where it is a powerful resource to be managed with intelligence and strategy. My advice is simple: don't be hypnotized by the vanity metrics. Think like an engineer, not a marketer. For each task, ask yourself: 'What is the *minimum* amount of context needed to get a great result?' Start there. Use the powerful, mid-range windows offered by models like Claude 3 or GPT-4. Explore Gemini 1.5 Pro’s massive window for specialized tasks that truly require it, like analyzing an entire codebase or processing hours of video. And for everything else that involves a vast, dynamic library of knowledge, embrace the targeted elegance of RAG. The future of building with AI isn't about using the biggest hammer; it's about choosing the right tool for the job.
Why it matters
- The advertised context window (e.g., 1M tokens) is a measure of length, not a guarantee of the model's ability to recall information from it.
- Large context windows carry significant latency and cost penalties, making them overkill and inefficient for many common tasks.
- Retrieval-Augmented Generation (RAG) remains essential for its cost-effectiveness, verifiability, and ability to handle massive, dynamic knowledge bases.