Caveman Skill Review: Does It Really Save Tokens?

    In this article you will learn:

    • What the "caveman skill" is and what it brings to the table
    • How it can help you optimize your AI usage
    • My observations and experiments with the caveman skill

    What is the "caveman skill"?

    A few weeks back, I had a conversation with my colleague about the upcoming changes in the Copilot billing model. That single change made us both start actively looking for ways to optimize token usage in our AI coding workflows.

    The first tool we found and started testing was the caveman skill.

    Features promised by the caveman skill are:

    • Reduce the number of output tokens by 65% on average
    • Select compression level (lite, full, ultra) to balance between terseness and readability
    • Compress files, code explanations, error messages, and more
    • Preview stats of saved tokens

    One important caveat before we go further: caveman is about output compression. It does not reduce reasoning tokens, and it does not magically make large context cheaper. I still wanted to test it because output is a meaningful part of how I use Copilot. I read most generated responses carefully, compare options, and often ask Copilot to summarize, review, or explain things back to me. In that workflow, shorter output is not just about cost. It is also about how fast I can scan and verify the answer.

    Simple example

    Let's start with a fairly simple example.

    Command: "Your task is to read all my blog posts and summarize each one in one sentence."
    Context: folder with .mdx files that are rendered to static pages on my blog
    Model: GPT-5.4 with reasoning level set to "High"

    Without caveman skill:

    Copilot AI Credits consumed: 22
    List with summaries: 615 tokens / 3073 characters

    With caveman skill:

    Copilot AI Credits consumed: 18.6
    List with summaries: 531 tokens / 2645 characters

    If you switch the model reasoning level to "Medium", the difference is even smaller: 17.5 credits without caveman versus 14.9 with it.
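    The relative savings are easier to compare as percentages. Here is a minimal sketch that uses only the measurements above. The `percent_saved` helper and the dictionary layout are hypothetical names for illustration, not something caveman provides.

```python
def percent_saved(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Measurements from the runs above: (without caveman, with caveman)
runs = {
    "credits (High)": (22.0, 18.6),
    "credits (Medium)": (17.5, 14.9),
    "output tokens (High)": (615, 531),
    "output characters (High)": (3073, 2645),
}

# Compute the saving for every measurement
savings = {name: percent_saved(before, after) for name, (before, after) in runs.items()}

for name, value in savings.items():
    print(f"{name}: {value:.1f}% saved")
```

    Every value lands between roughly 13% and 16%, well below the 65% average promised for output tokens.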

    Well, it does not seem like much of a difference 🤔

    What you have to consider is the specificity of the task. The LLM just reads post files in batches and then produces slightly modified one-sentence summaries. That is it. The task is simple, it does not require much reasoning, and therefore the outputs are already relatively small. Caveman does not have much to optimize here.

    This is also a good moment to clarify what exactly I am measuring. I know caveman saves only output tokens. It does not reduce the model's reasoning cost. But output still matters to me, because it is part of the way I work with Copilot. I spend time reading generated text, checking whether it makes sense, and deciding what to do next. If caveman can keep that part shorter without hurting quality, it is still valuable in my workflow, even if it is not the whole credit story.

    Knowing that, let's try to make the task more complex.

    Caveman somewhat useful

    Command: "Your task is to list out all features of my portfolio website, group them by category, and write them down in a bullet point list. Make sure not to miss any"
    Context: no initial context
    Model: GPT-5.4 with reasoning level set to "High"

    Without caveman skill: 74.4 credits
    With caveman skill: 58.2 credits

    NOTE

    This instruction was designed to be a little tricky for the LLM. I use the graphify library in my project (it was used in both the caveman and non-caveman runs), but new features were added after the graph report was last generated. The LLM therefore had to not only read the graph report, but also understand it and figure out that some features were missing from it. The task requires real reasoning, and it also tests whether caveman degrades the quality of the output.

    A quick calculation shows that the caveman skill reduced credit usage by about 22% in this case (74.4 to 58.2 credits). That is not even close to the headline numbers from the caveman README, but it is better than the "vanilla" run. At least we did not lose any output quality, which is a good sign.
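    The 22% figure follows directly from the two credit counts. The sketch below reproduces it and adds a back-of-the-envelope estimate of why it sits below the README headline: if only output is compressed, the overall saving is roughly the output share of the bill multiplied by the output compression. Treating credits as proportional to tokens and taking the 65% headline at face value are both simplifying assumptions, and `implied_output_share` is a hypothetical helper, so read the result as an illustration, not a measurement.

```python
def percent_saved(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


def implied_output_share(overall_saving_pct: float, output_compression_pct: float) -> float:
    """Share of the bill that output would need to have, in percent.

    Assumes overall_saving = output_share * output_compression, i.e. that
    only output gets cheaper and that credits scale linearly with tokens.
    """
    return overall_saving_pct / output_compression_pct * 100


# Credits from the portfolio-features run: without vs. with caveman
overall = percent_saved(74.4, 58.2)

# ASSUMPTION: output really is compressed by the advertised 65%
share = implied_output_share(overall, 65.0)

print(f"Overall credit saving: {overall:.1f}%")
print(f"Implied output share of the bill: {share:.1f}%")
```

    Under those assumptions, output would account for about a third of the credits in this run, which fits the caveat that reasoning and context are not compressed at all.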

    /caveman ultra

    Default compression level is "full".

    You can pick "ultra" mode instead. This is the most aggressive compression level, and it is designed to produce the shortest possible output. When you try it out, the difference is obvious: it just uses simple words. I like this mode. You might hate it. It is not for everyone, but it is definitely worth trying if you want to save as many tokens as possible and you don't care about the style of the output.

    That said, I have my doubts about whether that simplicity might lead to important details being lost. I will update this article if any obvious quality issues arise.

    Things to consider

    I had issues installing this skill. It might be because I use Copilot in VS Code instead of Claude Code, which the skill was built for.

    I had issues invoking this skill as well: in about 40% of the sessions where I started the prompt with "Talk like a caveman" or similar, the skill was not invoked. The most reliable way to invoke it was to start with "use caveman skill" or "activate caveman skill". If you have similar issues, try those phrases.

    Another thing worth calling out is how I personally use LLMs. I try to stay close to the technical details, I read most generated outputs carefully, and I usually jump into the code myself instead of blindly accepting everything. That is exactly why I care about output so much. Caveman saves only output tokens, but output is still part of my real Copilot cost and a big part of my day-to-day friction.

    So when I compare runs in this article, I am not claiming that caveman solves the whole billing problem. I am testing whether it helps with the part that affects me directly: the amount of generated text I have to read, verify, and pay for during normal Copilot usage. In my case, that is worth measuring, even if my average session is relatively short and does not capture every token category.

    Over the next couple of weeks, I want to test caveman in more complex scenarios and see where it helps the most. If you have ideas for prompts or workflows worth benchmarking, let me know. That would help me test this in situations that look closer to real Copilot usage, not just controlled examples.

    Stay tuned 🚀
