Token Efficiency in AI Coding
This series is all about how you, as a developer, can optimize token efficiency when working with AI agents. I'll try to share general tips and tricks and benchmark promising libraries and tools. Let's start with the basics: what is a token, and when can we talk about token efficiency?
What is a token?
In the context of AI language models, a token is a unit of text that the model processes. It can be as small as a single character or as large as a word or even a phrase, depending on the tokenization method used. For example, the sentence "Hello, world!" might be tokenized into ["Hello", ",", "world", "!"]. The number of tokens in a prompt or response affects cost, because usage is typically billed per token, and performance, because many models limit how many tokens they can process in a single request. This is where token efficiency comes into play.
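To make the "Hello, world!" example concrete, here is a toy sketch that splits text into words and punctuation marks. The `toy_tokenize` function is a hypothetical helper written for this post, not a real tokenizer: production models use subword schemes such as byte-pair encoding, so their actual token boundaries and counts will differ.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens (toy illustration only)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = toy_tokenize("Hello, world!")
print(tokens)       # ['Hello', ',', 'world', '!']
print(len(tokens))  # 4
```

If you need real counts, use the tokenizer that ships with the model you are calling instead of a regex like this one.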
Token efficiency
Token efficiency refers to the practice of optimizing the number of tokens used in prompts and responses when interacting with AI models.
Generally speaking, token efficiency is about getting the most value out of the tokens you use. If you can achieve the same result with a shorter prompt and fewer round trips, you are being more token efficient.
Why token efficiency matters
Before we jump into the details, let's quickly cover why token efficiency matters:
- Cost: Many AI tools charge based on the number of tokens processed. Using unnecessary tokens can balloon your costs.
- Performance: Models have token limits. If your session with an AI agent runs too long, the context may get compressed, which can drop important information and lead to worse results.
- Relevance: Using too many tokens can dilute the focus of the prompt, making it harder for the model to understand the core task and generate relevant responses.
- User Experience: In interactive applications, long responses can overwhelm users. Keeping responses concise can improve readability and engagement.
- Efficiency: Optimizing token usage can lead to faster response times and more efficient use of computational resources.
There is one more factor that sits somewhere between all of the points above: compounding error. LLMs are probabilistic systems, which means every extra step is another chance for something to go slightly wrong.
Even if a model were 99% accurate at each step, after 50 stages of interaction the probability of a fully correct end-to-end result would be about 60%. If you assume 95% step-level accuracy instead, that same calculation drops below 8% after 50 stages.
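The numbers above come from multiplying the per-step accuracy by itself once per step. A minimal sketch of that calculation, assuming every step succeeds or fails independently with the same probability (a simplification; the function name is just illustrative):

```python
def end_to_end_success(step_accuracy: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    return step_accuracy ** steps


for accuracy in (0.99, 0.95):
    result = end_to_end_success(accuracy, 50)
    print(f"{accuracy:.0%} per step, 50 steps -> {result:.1%}")
# 99% per step, 50 steps -> 60.5%
# 95% per step, 50 steps -> 7.7%
```

Real agent sessions are messier than this model, since agents can notice and repair some mistakes, but the direction holds: fewer steps leave fewer chances for errors to stack up.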
Different LLMs, different levels of token efficiency
Different models can produce very different token footprints for the same task. That matters because most usage-based AI products charge for some combination of:
- the number of tokens in the prompt (input)
- the number of tokens in the response (output)
- cached input tokens reused from earlier context or prompt prefixes
A few examples:
| Model | Input tokens | Cached input tokens | Output tokens |
|---|---|---|---|
| GPT-5.4 | 2.50 USD | 0.25 USD | 15.00 USD |
| GPT-5.5 | 5.00 USD | 0.50 USD | 30.00 USD |
| Opus 4.6 | 5.00 USD | 0.50 USD | 25.00 USD |
| Opus 4.7 | 5.00 USD | 0.50 USD | 25.00 USD |
| Opus 4.8 | 5.00 USD | 0.50 USD | 25.00 USD |
Note: prices per 1M tokens.
Keep in mind that these are raw per-token prices. Some LLMs need more "reasoning" tokens than others to reach the same result, so the price per token is not the only factor to consider when optimizing for token efficiency.
If you want to better understand how efficient LLMs are in terms of token usage and cost, I highly recommend checking out the Artificial Analysis website.
Here is a concrete example: GPT-5.5 has a higher list price per token than Opus 4.7, but it generates far fewer tokens for the same task, so the total bill can still come out lower.
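To see how a pricier model can still be cheaper per task, here is a rough sketch of the arithmetic. Everything in it is an assumption made for illustration: the two pricing profiles, the token counts, and the names `Pricing` and `request_cost` are hypothetical and not measurements of any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in USD per 1M tokens (illustrative values, not a real price list)."""
    input_price: float
    cached_input_price: float
    output_price: float


def request_cost(pricing: Pricing, input_tokens: int,
                 cached_tokens: int, output_tokens: int) -> float:
    """Total cost in USD for one task, summed over the three token types."""
    total = (input_tokens * pricing.input_price
             + cached_tokens * pricing.cached_input_price
             + output_tokens * pricing.output_price)
    return total / 1_000_000


# Hypothetical profiles: same input prices, different output prices.
PRICIER_MODEL = Pricing(input_price=5.00, cached_input_price=0.50, output_price=30.00)
CHEAPER_MODEL = Pricing(input_price=5.00, cached_input_price=0.50, output_price=25.00)

# Assumed usage: the pricier model needs 4,000 output tokens, the cheaper one 10,000.
pricier_cost = request_cost(PRICIER_MODEL, 20_000, 80_000, 4_000)
cheaper_cost = request_cost(CHEAPER_MODEL, 20_000, 80_000, 10_000)

print(f"{pricier_cost:.2f}")  # 0.26
print(f"{cheaper_cost:.2f}")  # 0.39
```

With these made-up numbers, the model with the higher output price finishes the task for less, simply because it writes fewer tokens.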
General optimization strategies
If you wanted to create a new tool to save tokens, where would you start?
I would start by understanding the problem surface. Take a simple observation: if an LLM generates a shorter answer, it produces fewer output tokens (costs saved!). Ideas this general are not tied to one tool, so many tools are likely to share them.
This raises another question: can tools individually save tokens but together cause problems (e.g. context pollution, too many tool calls, etc.)? I think so, and I'll try to prove that someday. But for now, let's focus on the general strategies that can be applied to optimize token efficiency when working with AI agents:
- Input/Output Optimization: Craft prompts that are concise and focused on a single, well-defined task. Avoid unnecessary details and instructions that the model can infer on its own.
- Improve code/file search: In a complex codebase, it can be challenging for an AI agent to find the relevant files and understand the relationships between them. Skills or tools that help with code search and understanding can save tokens by reducing how much irrelevant information the model has to process.
- Context focus: Define a clear goal and stop criteria for the AI agent. Don't make never-ending final touches to the codebase; just create a new session and start with fresh context. This reduces input tokens for subsequent requests and lowers the risk of context pollution and compounding error.
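- Adjust LLM and reasoning effort: If the task is simple, don't use the most powerful model with the largest context window at the highest reasoning effort. A smaller model or a lower effort setting can often reach the same result with fewer tokens.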
- Is there more? I hope so. I'll modify this list as I discover more strategies.
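The "Context focus" strategy above is easier to appreciate with a rough estimate of how input tokens accumulate. The sketch below assumes each request resends the whole conversation so far, with no caching or context compression, and that a fresh session needs no extra tokens to get back up to speed. The function name and the 1,000-tokens-per-turn figure are illustrative assumptions.

```python
def total_input_tokens(tokens_per_turn: list[int]) -> int:
    """Sum the input tokens sent across a session where each turn resends all history."""
    context = 0  # tokens currently in the conversation
    total = 0    # input tokens sent over the whole session
    for new_tokens in tokens_per_turn:
        context += new_tokens
        total += context
    return total


one_long_session = total_input_tokens([1_000] * 10)
two_fresh_sessions = 2 * total_input_tokens([1_000] * 5)

print(one_long_session)    # 55000
print(two_fresh_sessions)  # 30000
```

Under these assumptions, the same ten turns cost noticeably fewer input tokens when split into two sessions, because the later turns no longer carry the full history of the earlier ones.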
AI context focus
As a human, if you wrote a 10-page resume, what are the odds that a recruiter would still spot the achievements you care about most? The outcome would likely be very different if the same information fit on one well-structured page with the important points clearly highlighted. It also takes far less time and attention to process.
Now replace the recruiter with an AI agent and the resume with a prompt.
If you want the best results from an AI agent, you need it focused on the task at hand instead of spending tokens on irrelevant information.
A clear goal in the prompt is the first step. You also want to minimize unnecessary tool calls and reduce distractions from unrelated files, instructions, or side quests. This becomes especially important in longer sessions, where the useful context can get diluted over time.
Tools and libraries I'm planning to cover in this series
In this series, I will cover tools and libraries that can help improve token efficiency when working with AI agents. The list will likely include:
