LLM Context Window: how it is consumed and why it matters

    Created:2026-06-11
    Updated:2026-06-11

    Every time you send a message to an AI coding assistant, your message is just the last brick in a much bigger wall. By the time your prompt arrives, the model has already read a system prompt, tool definitions, instruction files, and the whole conversation so far.

    I wanted to know what that wall actually looks like, not guess. So I inspected the real requests my coding agent sends and read the research behind the behaviors I kept hearing about. This post is the result.

    In this post, I cover:

    • what the context window actually is
    • in what order context fills up, based on what I could verify
    • what happens when context fills up, gets poisoned, or biases the model

    Quick answer

    The context window is a single, ordered sequence of tokens. Stable configuration (system prompt, tools, instruction files) sits at the front, your conversation accumulates after it, and your latest message lands near the very end. That ordering is not cosmetic: it drives caching, attention biases, and most of the weird failure modes people complain about.

    What is the context window?

    The context window is the maximum number of tokens a model can attend to in a single request. Everything the model "knows" about your session must fit inside it: instructions, code, chat history, tool results, and the response it is currently generating.

    Two things surprised me when I first dug into this:

    • The model has no memory outside the window. Each request is stateless. The "conversation" exists only because the client re-sends the whole history every single turn.
    • Input and output share the budget. A long, cluttered session leaves less room for everything, which is one reason quality tends to degrade late in a session.

    How the context actually fills up

    I did not want to repeat folklore here, so I checked what my own setup sends. VS Code lets you inspect the exact prompt GitHub Copilot builds (the chat debug logs show the full request), and the layout was consistent across sessions.

    Here is the order I observed, from the front of the context to the back:

    PositionContentExamples
    1System promptidentity, behavioral rules, safety policies
    2Tool definitionsevery tool the agent can call, with full JSON schemas
    3Instruction and config filescopilot-instructions.md, AGENTS.md, skill listings, memory notes
    4Workspace contextfolder structure, open file, environment info
    5Conversation historyall previous turns, including tool calls and their results
    6Your current messagethe thing you just typed, plus attachments
    cached prefix
    ~9% of the windowcached prefix

    The model's standing orders. Identity, behavioral rules, and safety policies. You never see it, but it is read first, on every single request.

    Contains: Identity, behavioral rules, safety policiesPosition effect: High-attention front of the window

    ~11% of the windowcached prefix

    Every tool the agent can call, with its full JSON schema. Each connected MCP server adds permanent weight here, whether you use it or not.

    Contains: Tool names, descriptions, full JSON schemasPosition effect: Grows with every MCP server you connect

    ~9% of the windowcached prefix

    Your copilot-instructions.md, AGENTS.md, skill listings, and memory notes. Re-sent with every message — treat these files like a hot code path.

    Contains: copilot-instructions.md, AGENTS.md, skills, memoryPosition effect: Not free: rides along with every request

    ~7% of the window

    A snapshot of where you are: folder structure, the currently open file, and environment info like your OS.

    Contains: Folder tree, open file, environment infoPosition effect: Changes as you move around the project

    ~20% of the window

    Every previous turn, including tool calls and their verbose results. The biggest and fastest-growing region — and the unreliable middle where attention goes to die.

    Contains: All previous turns, tool calls, tool resultsPosition effect: Lost-in-the-middle zone: lowest attention

    ~6% of the window

    The thing you just typed, plus attachments. It lands at the very end of the sequence — the second high-attention spot. Use it: repeat anything critical here.

    Contains: Your prompt, attached files, selectionsPosition effect: High-attention end of the window

    next messages…

    💡 Tap a region to see what lives there. Sizes are rough token proportions of a typical agent request.

    NOTE

    The exact composition differs between products, and vendors change it over time. The general shape, however, is common across agentic tools: stable configuration first, dynamic conversation last. If you want certainty for your own setup, inspect the actual requests instead of trusting blog posts, including this one.

    Two practical consequences follow directly from this layout.

    Your config files are not free. Every instruction file, every skill description, and every connected MCP tool schema is re-sent with every single request. A bloated copilot-instructions.md or a dozen MCP servers quietly tax every message you send. This is also why too many tools can hurt quality: the Berkeley Function-Calling Leaderboard and follow-up analyses show models struggling more as tool sets grow, especially when irrelevant tools are present.

    The stable front enables prompt caching. Providers cache the unchanged prefix of a request, so a system prompt and tool definitions that stay byte-identical between turns are much cheaper to process than fresh tokens. That is one practical reason configuration sits at the front: the part that never changes gets cached, the part that always changes goes last.

    What happens when the context fills up

    A long agentic session can burn through context surprisingly fast. Tool outputs are the usual culprit: one failing test run or one verbose log dump can be worth thousands of tokens.

    When the window approaches its limit, agents do not simply stop. They compact: the client summarizes older parts of the conversation and replaces them with a shorter description. Claude Code calls this auto-compact; Copilot summarizes conversation history similarly.

    Compaction keeps the session alive, but it is lossy. The summary preserves what the summarizer considered important, not necessarily what you considered important. If the agent suddenly "forgets" a decision from an hour ago, there is a good chance that decision did not survive a compaction pass.

    Bias #1: lost in the middle

    You may have heard that models pay more attention to the beginning and end of the context. This one is well documented. The paper Lost in the Middle (Liu et al., 2023) tested models on retrieving relevant information placed at different positions in long inputs and found a clear U-shaped curve: performance is highest when the relevant information is at the beginning or end of the context, and degrades significantly when it sits in the middle.

    This maps directly onto the layout from earlier. The system prompt and instruction files occupy the high-attention front. Your latest message occupies the high-attention end. The middle of a long session, where most of your earlier discussion lives, is exactly the region models use least reliably.

    The practical advice writes itself: if something from twenty messages ago still matters, repeat it in your current message instead of assuming the model will fish it out of the middle.

    Bias #2: early turns lay the foundation

    The second thing I had only "heard" was that the first messages anchor the model's behavior for the rest of the session. This also turns out to be measurable.

    A Microsoft and Salesforce team showed it in LLMs Get Lost in Multi-Turn Conversation. They took benchmark tasks and split the same information across multiple chat turns instead of one complete prompt. Average performance dropped by 39% across the models they tested. Their diagnosis: models make assumptions and attempt solutions in early turns before they have all the information, then keep relying on those early attempts. In their words, when LLMs take a wrong turn in a conversation, they get lost and do not recover.

    So the folklore is roughly right, with a sharper edge: it is not just that early messages are influential. It is that early mistakes are sticky, because they stay in the context and get re-read on every subsequent turn.

    This is why "arguing" with a confused model rarely works. Every correction you send is appended after the wrong answer, and the wrong answer is still sitting there, still being attended to. Starting a fresh session with a better first prompt is usually cheaper than rehabilitating a derailed one.

    When context turns against you

    Researchers and practitioners have catalogued several distinct ways a long context degrades output. Drew Breunig's How Long Contexts Fail is the cleanest taxonomy I found:

    • Context poisoning: a hallucination or error enters the context and gets repeatedly referenced. The Gemini 2.5 technical report described this while the model played Pokémon: once misinformation about the game state poisoned its goals, the agent fixated on impossible objectives for a very long time.
    • Context distraction: the context grows so large that the model leans on its history instead of its training. The same Gemini report observed that well past 100k tokens, the agent favored repeating past actions over synthesizing new plans. A Databricks study found correctness dropping around 32k tokens for Llama 3.1 405B, far below its advertised window.
    • Context confusion: superfluous content (irrelevant docs, unused tool definitions) degrades the response, because the model has to attend to everything you put in front of it.
    • Context clash: parts of the context contradict each other, for example an early wrong attempt clashing with later corrections. This is the multi-turn failure mode from the previous section.

    The common thread: a bigger window is capacity, not comprehension. Models misbehave long before the window is technically full.

    What I changed in my own workflow

    Knowing how the context fills up changed a few habits for me:

    • Front-load the first message. Constraints, file references, and expected output format go into the opening prompt, not into turn five.
    • Restart instead of arguing. If the session takes a wrong turn, I start fresh and write a better first prompt. The research says derailed sessions rarely recover, and my experience agrees.
    • Keep instruction files lean. Every line of copilot-instructions.md rides along with every request. I treat it like a hot code path.
    • Re-state what matters. Anything critical from earlier in a long session gets repeated in the current message, where attention is strongest.
    • Watch tool sprawl. MCP servers and skills are useful, but each one adds permanent weight to the front of every request.

    Final takeaway

    The context window is not a bag of facts the model rummages through. It is an ordered, finite, position-sensitive sequence where configuration sits at the cached front, your conversation accumulates in the unreliable middle, and your latest message lands at the high-attention end.

    Once you see that structure, most of the "weird" LLM behaviors stop being mysterious. Compaction loses details because summaries are lossy. Early mistakes persist because they are re-read every turn. Mid-session instructions get ignored because the middle is where attention goes to die.

    You cannot change how the window works. You can absolutely change what you put in it, and where.

    FAQ

    1. 1.LLM Context Window: how it is consumed and why it matters
    2. 2.Caveman Skill Review: Does It Really Save Tokens?

    RECOMMENDED FOR YOU