What does the caveman skill actually optimize?

The caveman skill mainly optimizes output length. It pushes the assistant toward shorter, denser replies, which can reduce output tokens and make responses faster to scan.

Does the caveman skill always reduce GitHub Copilot AI credit usage?

No. In these GitHub Copilot tests, caveman consistently reduced output size, but total AI credit usage still depended on task complexity, reasoning level, context, and what the agent did during the session.

When does the caveman skill help the most?

It helps the most on explanation-heavy tasks such as debugging summaries, code reviews, and long chat answers where the default response would otherwise be verbose.

Is caveman worth using in GitHub Copilot?

If you care about shorter answers and faster scanning, probably yes. If you expect a guaranteed drop in total Copilot cost, the results are less predictable and should be judged per workflow.

Caveman Skill Review: Does It Really Save Tokens?

Does caveman reduce reasoning tokens or large-context cost?

No. Caveman is an output-compression tool. It does not reduce reasoning tokens, and it does not make large context windows cheaper on its own.

The interesting part about the caveman skill is not whether it can make an AI sound funny. It is whether shorter answers actually make GitHub Copilot cheaper or just easier to read.

After testing it in a few Copilot workflows, my takeaway is fairly specific: caveman is good at compressing the part of the answer I actually have to read, but that does not automatically translate into lower total AI credit usage.

In this review, I cover:

what the caveman skill actually does
where output savings looked meaningful
where Copilot AI credit savings were small, messy, or inconsistent

Quick verdict

If your goal is shorter, denser answers, caveman is useful. If your goal is to reliably reduce total GitHub Copilot AI credit usage, the results are much less predictable.

What is the "caveman skill"?

A few weeks back, I had a conversation with a colleague about the upcoming changes to the Copilot billing model. That single change made us both actively look for ways to optimize token usage in our AI coding workflows.

The first tool we found and started testing was the caveman skill.

The caveman skill promises the following features:

Reduce the number of output tokens by around 65% on average
Select a compression level (lite, full, ultra) to balance terseness and readability
Compress files, code explanations, error messages, and more
Preview stats of saved tokens

The core pitch is simple: keep the useful part of the answer and cut the filler.

One important caveat before we go further: caveman is about output compression. It does not reduce reasoning tokens, and it does not magically make large context cheaper. I still wanted to test it because output is a meaningful part of how I use Copilot. I read most generated responses carefully, compare options, and often ask Copilot to summarize, review, or explain things back to me. In that workflow, shorter output is not just about cost. It is also about how fast I can scan and verify the answer.
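To make that caveat concrete, here is a minimal sketch of how output-only compression dilutes into total savings. It assumes, purely for illustration, that session cost is proportional to the plain sum of input, reasoning, and output tokens. The token counts are hypothetical, not measurements from my tests, and the real Copilot credit formula may weight these categories differently. Only the 65% figure comes from the caveman pitch above.

```python
def total_savings(input_tokens: int, reasoning_tokens: int,
                  output_tokens: int, output_compression: float) -> float:
    """Fraction of all tokens saved when only the output is compressed.

    Assumption: every token category counts equally toward cost.
    """
    before = input_tokens + reasoning_tokens + output_tokens
    after = input_tokens + reasoning_tokens + output_tokens * (1 - output_compression)
    return (before - after) / before


# Hypothetical session: mostly context, a little reasoning, a little output.
savings = total_savings(
    input_tokens=8000,
    reasoning_tokens=1000,
    output_tokens=1000,
    output_compression=0.65,  # headline output reduction claimed by caveman
)
print(f"Total tokens saved: {savings:.1%}")
```

In this made-up session the output is only a tenth of all tokens, so even a 65% output reduction moves the total by a few percent. The smaller the output share, the smaller the effect on the whole session.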

Test 1: Simple summarization prompt

Let's start with a fairly simple example.

Here I am comparing both Copilot AI credits and output length, because those are the two signals I can measure most directly inside Copilot.

Prompt: "Your task is to read all my blog posts and summarize each one in one sentence"
Context: folder with .mdx files that are rendered to static pages on my blog
Model: GPT-5.4
Reasoning level: High

Run	AI credits used	Output tokens	Output characters
Without caveman skill	22	615	3073
With caveman skill	18.6	531	2645

If you switch the model reasoning level to Medium, the difference becomes even smaller: 17.5 vs 14.9 AI credits without and with caveman.

That barely looks like a difference at all.

What you have to consider is the nature of the task. The LLM just reads post files in batches and then produces slightly modified one-sentence summaries. That is it. The task is simple, it does not require much reasoning, and the outputs are already relatively small. Caveman does not have much to optimize here.
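For reference, the Test 1 numbers can be turned into relative savings with a few lines of Python. This is just a sketch of the arithmetic; the helper name `percent_reduction` is my own, and the values are copied from the table and the Medium run above.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# (without caveman, with caveman), copied from the Test 1 results
test1 = {
    "AI credits (High)": (22, 18.6),
    "Output tokens (High)": (615, 531),
    "Output characters (High)": (3073, 2645),
    "AI credits (Medium)": (17.5, 14.9),
}

for metric, (without, with_caveman) in test1.items():
    print(f"{metric}: {percent_reduction(without, with_caveman):.1f}% lower")
```

Every metric lands at roughly 14% to 15%, far below the 65% headline figure.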

This is also a good moment to clarify what exactly I am measuring. I know caveman saves only output tokens. It does not reduce the model's reasoning cost. But output still matters to me, because it is part of the way I work with Copilot. I spend time reading generated text, checking whether it makes sense, and deciding what to do next. If caveman can keep that part shorter without hurting quality, it is still valuable in my workflow, even if it is not the whole credit story. That is why I care about both the credits visible in Copilot and the size of the response itself.

Knowing that, let's try to make the task more complex.

Test 2: Portfolio feature audit

Prompt: "Your task is to list out all features of my portfolio website, group them by category, and write them down in a bullet point list. Make sure not to miss any"
Context: no initial context
Model: GPT-5.4
Reasoning level: High

Run	AI credits used	Output tokens	Output characters
Without caveman skill	74.4	not recorded	not recorded
With caveman skill	58.2	not recorded	not recorded

NOTE

This prompt was designed to be a little tricky for the LLM. I use the graphify library in my project, and it was used in both the caveman and non-caveman runs, but new features were added after the last graph report was generated. So the LLM had to not only read the graph report, but also understand it and figure out that some features were missing. This makes it a better test of both reasoning and whether caveman degrades output quality.

A quick calculation, (74.4 - 58.2) / 74.4, shows that the caveman skill reduced AI credit usage by about 22% in this case. Note that this is a credit figure, since output tokens were not recorded for this test. It is still not even close to the headline numbers from the caveman README, but that is also the point: savings depend heavily on how verbose the baseline task is. It was still clearly better than the "vanilla" run, and we did not lose any output quality, which is a good sign.

Test 3: Debugging-style prompt

I also tested a scenario that should be much closer to my real Copilot workflow: investigating a concrete issue in my own codebase.

Prompt: "investigate React 19 lint failures in a single interactive diagram component, identify the likely root cause, and propose the smallest safe fix"
Context: Serilog architecture diagram component
Model: GPT-5.4
Reasoning level: Medium

Run	AI credits used	Output tokens	Output characters
Without caveman skill	13.2	349	1617
With caveman skill	13.9	210	1068

This is a much better example of where caveman actually shines. The final answer was clearly shorter: around 40% fewer output tokens and about 34% fewer characters in the final bullet list. That is a meaningful difference in scan time and output size.

At the same time, the credit numbers are a good reminder that shorter output does not automatically mean lower total session cost. In this pair of runs, the caveman session actually used slightly more credits.

There is an important caveat, though: the two sessions were not perfectly identical. In the non-caveman run, Copilot investigated the files but did not run the linter in PowerShell. That means I would treat this as a good comparison of final output compression, but not as a clean apples-to-apples comparison of total Copilot credits.

So my takeaway from this test is pretty specific: for debugging-style prompts, caveman can significantly shrink the part of the answer I actually have to read, even when the overall credit usage does not improve in a neat, predictable way.
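The split between output size and credits in Test 3 is easy to see when the table is expressed as signed relative changes. The sketch below only redoes that arithmetic; `relative_change` is a helper name I made up, and the inputs are the values from the Test 3 table.

```python
def relative_change(before: float, after: float) -> float:
    """Signed change from `before` to `after` in percent (negative = smaller)."""
    return (after - before) / before * 100


# (without caveman, with caveman), copied from the Test 3 table
test3 = {
    "AI credits": (13.2, 13.9),
    "Output tokens": (349, 210),
    "Output characters": (1617, 1068),
}

changes = {metric: relative_change(*pair) for metric, pair in test3.items()}
for metric, change in changes.items():
    print(f"{metric}: {change:+.1f}%")
```

Output tokens and characters come out at about -40% and -34%, while AI credits come out at about +5%, which is the mismatch described above.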

Should you use /caveman ultra?

The default compression level is "full".

You can pick "ultra" mode instead. This is the most aggressive compression level, and it is designed to produce the shortest possible output. When I tried it, the difference was obvious. It just uses simpler words and cuts even more connective tissue. I like this mode. You might hate it. It is not for everyone, but it is definitely worth trying if you want to save as many tokens as possible and do not care much about style.

That said, I do have some doubts about whether that simplicity could eventually hide important details. I will update this article if any obvious quality issues show up.

Things to consider before using caveman in Copilot

I had issues installing this skill. That might be due to the fact that I use Copilot in VS Code instead of Claude Code, which was the environment it was originally built for.

I also had issues invoking this skill. In roughly 40% of my sessions where I started the prompt with "Talk like a caveman" or something similar, the skill was not invoked. The most reliable way to invoke it was to start with "use caveman skill" or "activate caveman skill". If you have similar issues, try those phrases first.

Another thing worth calling out is how I personally use LLMs. I try to stay close to the technical details, I read most generated outputs carefully, and I usually jump into the code myself instead of blindly accepting everything. That is exactly why I care about output so much. Caveman saves only output tokens, but output is still part of my real Copilot cost and a big part of my day-to-day friction.

So when I compare runs in this article, I am not claiming that caveman solves the whole billing problem. I am testing whether it helps with the part that affects me directly: the amount of generated text I have to read, verify, and pay for during normal Copilot usage. In my case, that is worth measuring, even if my average session is relatively short and does not capture every token category. These comparisons are practical workflow comparisons, not a perfect lab-grade audit of every token category behind the scenes.

Final takeaway

Caveman looks most useful when you judge it as an output-compression tool, not as a universal cost-reduction tool.

It can absolutely make Copilot responses shorter and faster to scan. What it cannot do is guarantee lower total Copilot AI credit usage, because those credits still depend on reasoning, context, and the overall shape of the session.

If you spend a lot of time reading long Copilot answers, that may still be enough reason to use it.
