Why do output tokens cost more than input tokens?

Input tokens are processed in parallel during the prefill phase, while output tokens are generated one at a time during decode — each requiring a full forward pass through the model. Decode is memory-bandwidth-bound and sequential, so it's fundamentally slower and more expensive. Across major providers output tokens cost roughly 3–10x more than input.

What is prefix (prompt) caching and how much does it save?

Prefix caching stores the prefill state of a reused prompt prefix so the provider skips recomputing it on later requests. On Anthropic, cache reads cost about 0.1x the base input rate — a 90% discount — while the one-time cache write costs 1.25x (5-min TTL) or 2x (1-hour TTL). For RAG and agent apps it commonly cuts input cost by 60–90%.

Why does the same sentence have different token counts on different models?

Each model family has its own learned tokenizer trained on different data. Claude's tokenizer, GPT's, and Llama's split the same text differently, so the same prompt bills differently on each. You can't eyeball it — you have to measure on the model you actually use.

Why does pasting code or a UUID cost so many tokens?

Tokenizers give common patterns short representations and break rare ones into pieces. Code, logs, JSON, and UUIDs are full of punctuation and random-looking strings the tokenizer can't compress, so they explode the token count compared to plain English of the same length.

Does the same model cost the same everywhere?

No. For open-weight models like Llama, multiple providers host identical weights at very different per-token prices and latencies. Picking a model is only half the decision — you also choose who serves it.

Tokens Are Compute — Why Your LLM Bill Is Really a GPU Bill

TL;DR

When you call an LLM you aren't sending text — you're buying compute. Every token in and out is work a GPU has to do, and that work is your bill, your latency, everything.
Words are a human unit. Characters are a transport unit. Tokens are the model's compute unit — and compute is what you pay for.
Inference splits into prefill (reads your whole prompt in parallel — fast) and decode (generates one token at a time, sequentially — slow). This is why output tokens cost 3–10x more than input. It's physics, not pricing strategy.
Most "the model forgot my instruction" bugs are context problems, not model problems. Audit what's in the context window before blaming the model.
Four cost levers most teams never pull: trim what you send, prefix caching (up to ~90% off input), model tiering, and provider shopping for open-weight models.

Why this matters

Most engineers think about LLMs in terms of words, characters, or messages. The model sees none of that. The model sees tokens — and tokens are what you actually pay for, what decides how fast your application runs, and what blows up your bill as you scale.

Here's the core idea, and everything else in this post comes back to it:

Important: What looks short to you can be long to the model. What looks long can be surprisingly short. You can't eyeball token cost — you have to measure.

If you build anything serious with LLMs, you have to learn to read your prompts the way the model reads them: in tokens, not words.

What a token actually is

The fastest way to build intuition is to run text through a tokenizer and watch how the model splits it. A few examples that surprise people:

Input	Looks like	Tokens	Why
`The quick brown fox jumps over the lazy dog`	9 words	~9	Common words each get their own token
`tokenizationally`	1 word	~5	Rare word — split into known sub-pieces
A UUID like `a1b2c3d4-...`	1 ID	~20	Random chars and dashes don't compress
One line of Python (~40 chars)	1 line	~15	Every `_`, `:`, `(`, `*` counts

⚠️ Warning: A single UUID can eat ~20 tokens. If your retrieval layer pulls five documents each prefixed with a couple of UUIDs, that's ~200 tokens of pure metadata tax before you've said anything useful.

This is also why pasting a stack trace costs far more than pasting the same length of English prose. The tokenizer doesn't weigh meaning; it splits text into symbols, and punctuation-heavy text splits into more of them.

How tokenization works (and why)

Why is `is` one token but `tokenizationally` five? Because tokenizers are learned. You take a huge pile of text and run an algorithm that finds the patterns showing up most often. Common words such as `the`, `is`, and `and` appear everywhere, so giving each a short representation saves compute later. Rare words never earned a dedicated token, so the tokenizer falls back to breaking them into smaller known chunks.

The technical name for this is byte pair encoding (BPE). You don't need to memorize the name — just the rule:

Common patterns are cheap. Rare patterns are expensive. And "common" is defined by the training data, which is mostly English text from the internet.

That has three real consequences:

Different models have different tokenizers. Claude's tokenizer isn't GPT's isn't Llama's. The same sentence counts differently on each. Even within one family this shifts — Opus 4.7 ships with a new tokenizer that can generate up to 35% more tokens for the same input text compared to Opus 4.6, with per-token prices unchanged but effective cost per request rising accordingly.
Non-English text tokenizes worse. Languages like Hindi, Mandarin, and Arabic often take 2–3x more tokens per character than English. If you build for a global audience, some languages are fundamentally more expensive.
Code, logs, JSON, and UUIDs explode token counts. Anything dense with punctuation and random-looking strings costs disproportionately.
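
To make the "common is cheap, rare is expensive" rule concrete, here is a toy sketch of the BPE idea in plain Python. It is not any provider's real tokenizer: the tiny corpus, the function names (`train_bpe`, `encode`), and the merge count are all made up for illustration. It repeatedly merges the most frequent adjacent pair, so frequent words collapse into one token while unseen strings stay split into single characters.

```python
from collections import Counter


def _merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(_merge(symbols, best))] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Apply the learned merges, in training order, to a new word."""
    symbols = list(word)
    for pair in merges:
        symbols = _merge(symbols, pair)
    return symbols


# Hypothetical training corpus dominated by three common words.
corpus = ["the"] * 50 + ["is"] * 40 + ["and"] * 30
merges = train_bpe(corpus, num_merges=10)

print(encode("the", merges))   # ['the']
print(encode("then", merges))  # ['the', 'n']
print(encode("zqxj", merges))  # ['z', 'q', 'x', 'j']
```

Real tokenizers work on bytes, train on vastly more text, and keep tens of thousands of merges, but the cost pattern is the same: what the training data saw often is short, and everything else is pieces.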

The journey of an API call

So what physically happens when your code calls `client.chat.completions.create(...)`? It's five steps. Along the way you'll pick up the terms you'll see constantly in LLM work: inference, prefill, decode, context window.

First, the distinction that frames everything:

Training is when the model learns. Billions of pages, huge GPU clusters, months of time, tens of millions of dollars. Happens once.
Inference is when the model runs. Every API call, every chat message, every prompt your app sends. The model executes what it learned — it doesn't learn anything new.

Inference is commonly estimated at over 90% of total LLM operational cost, because training happens once but inference runs on every request. When we talk cost, latency, and scaling, we're always talking inference. Training is somebody else's problem.

Breaking it down:

1. Tokenization — your string hits the provider, the tokenizer turns it into a sequence of integer token IDs. The model consumes numbers, not text.
2. Embedding — each token ID maps to a high-dimensional vector. You don't need the math; just know meaning gets encoded as a list of numbers.
3. Prefill — the model reads your entire input at once. All tokens go through in parallel, building the internal state needed to start generating. The work is large matrix multiplications that parallelize well, which makes prefill compute-bound. The wait between hitting enter and the first character appearing (time to first token) is mostly prefill.
4. Decode — the model generates the response autoregressively: it predicts one token, appends it to the sequence, and repeats, so each new token depends on all previous ones. Within a single response this is sequential and cannot be parallelized across tokens, and every output token is a full pass through the model.
5. Detokenization — output IDs map back to text and stream to your client. Fast and basically free.

Here's why all that matters: input and output tokens are handled completely differently inside the model. Input is processed in parallel in one shot; output is processed sequentially, token by token.

Important: Output tokens cost roughly 3–10x more than input tokens across major providers. This isn't an arbitrary markup by providers like OpenAI or Anthropic; it reflects the physical constraints of serving models on GPU hardware.

The asymmetry is brutal at the hardware level. At small batch sizes the decode cost per token can be as high as ~200 times the prefill cost per token. Modern GPUs are built around fast tensor cores fed by high-bandwidth memory; prefill uses both, while decode leaves the tensor cores mostly idle and saturates memory bandwidth streaming the weights to produce a single token at a time.
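
A back-of-the-envelope way to see the memory-bandwidth limit: at batch size 1, each decoded token has to stream every weight from memory once, so weight bytes divided by bandwidth gives a floor on per-token latency. The numbers below (a 70B-parameter model, 16-bit weights, 3.35 TB/s of bandwidth) are illustrative assumptions, and the sketch ignores the KV cache, batching, and multi-GPU sharding.

```python
def decode_ms_per_token_floor(params_billion, bytes_per_param, mem_bandwidth_tb_s):
    """Lower bound on batch-1 decode latency: all weights are read once per token."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    seconds = model_bytes / (mem_bandwidth_tb_s * 1e12)
    return seconds * 1000.0


# Assumed values, chosen only to illustrate the scale.
floor_ms = decode_ms_per_token_floor(params_billion=70, bytes_per_param=2,
                                     mem_bandwidth_tb_s=3.35)
print(f"{floor_ms:.1f} ms per token, at most {1000 / floor_ms:.0f} tokens/s")
# 41.8 ms per token, at most 24 tokens/s
```

Providers batch many requests together so that one pass over the weights serves many tokens, which is how real per-token costs land well below this single-stream floor.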

This is also why two calls that look identical to your user can take wildly different times. One is a short question with a short answer. The other stuffs a 5,000-token document into the prompt and asks for a long answer. Same chat box, same button — but to the GPU they're completely different workloads.
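
A minimal two-parameter latency model shows how far apart those two calls can land. The throughput figures (5,000 prefill tokens/s, 50 decode tokens/s) and the token counts are placeholder assumptions, not measurements from any provider; swap in numbers you observe on your own stack.

```python
def estimate_latency_s(input_tokens, output_tokens,
                       prefill_tok_per_s=5000.0, decode_tok_per_s=50.0):
    """Return (time_to_first_token, total_time) in seconds.

    Prefill time scales with input length; decode time scales with
    output length at a much lower per-token rate.
    """
    ttft = input_tokens / prefill_tok_per_s
    total = ttft + output_tokens / decode_tok_per_s
    return ttft, total


short_ttft, short_total = estimate_latency_s(input_tokens=50, output_tokens=30)
long_ttft, long_total = estimate_latency_s(input_tokens=5000, output_tokens=800)

print(f"short question: first token {short_ttft:.2f}s, done in {short_total:.2f}s")
print(f"long document:  first token {long_ttft:.2f}s, done in {long_total:.2f}s")
```

Under these assumptions the long call is dominated by decode, so trimming the requested output moves total latency far more than trimming the prompt does.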

Three "model bugs" that are really token bugs

Before optimizing cost, recognize the failures that get blamed on the model but live in your context window.

The model forgets a constraint. Your system prompt worked perfectly in testing, then production sometimes ignores the rule. What happened? You crowded the context. System prompt at the top, then long conversation history, then five retrieved documents — by the time the model generates, the original instruction is buried thousands of tokens earlier. Attention dilutes. The instruction is technically there but no longer salient.

The same pattern explains hallucination (you didn't spend tokens on grounding) and inconsistency (context bloat dilutes the relevant signal). Same cause, same fix:

💡 Tip: When models forget, hallucinate, or behave inconsistently, don't blame the model first. Audit your context. Ask what's actually in there, and whether each token is earning its place. Spend your tokens on purpose.
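
One way to run that audit is to break the context into named sections and look at each one's share of the total. This is a sketch: `context_report` and the section names are hypothetical, and `rough_count` (whitespace word count) is only a stand-in so the example runs without dependencies. In practice, pass in your model's real token counter.

```python
def context_report(sections, count_tokens):
    """Return (name, tokens, share_of_total) rows, largest section first."""
    counts = [(name, count_tokens(text)) for name, text in sections.items()]
    total = sum(c for _, c in counts) or 1  # avoid dividing by zero on empty input
    rows = [(name, c, c / total) for name, c in counts]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def rough_count(text):
    """Placeholder counter: whitespace-separated words, NOT real tokens."""
    return len(text.split())


sections = {
    "system_prompt": "Always answer in one sentence.",
    "history": "user said hello " * 200,
    "retrieved_docs": "policy text " * 100,
    "user_message": "Can I get a refund?",
}

report = context_report(sections, rough_count)
for name, tokens, share in report:
    print(f"{name:15s} {tokens:5d} {share:6.1%}")
```

In this made-up context the instruction that matters is under 1% of what the model reads, which is the crowding problem described above in numeric form.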

Where the money goes

Every LLM has two prices: reading your input, and generating your output — and output costs 3–10x more per token because of decode. GPT-4o lists $2.50 per million input tokens against $10.00 per million output tokens, a 4x premium on the memory-bound phase.

Which means verbosity is a burn rate, not a style choice. A model answering in two paragraphs when one sentence would do costs you 3–5x more per response. When you build, decide the answer shape yourself: set max_tokens, use structured outputs, don't let the model write whatever length it wants.
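
The per-request arithmetic, using the GPT-4o list prices quoted above as defaults. The token counts for the "terse" and "chatty" answers are invented for illustration.

```python
def request_cost_usd(input_tokens, output_tokens,
                     in_price_per_m=2.50, out_price_per_m=10.00):
    """Cost of one request given per-million-token prices."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000


# Same 1,000-token prompt, two answer lengths (assumed values).
terse = request_cost_usd(input_tokens=1000, output_tokens=40)
chatty = request_cost_usd(input_tokens=1000, output_tokens=400)

print(f"terse:  ${terse:.4f}")   # terse:  $0.0029
print(f"chatty: ${chatty:.4f}")  # chatty: $0.0065
```

With the prompt held constant, the longer answer more than doubles the bill for the request, and the whole difference comes from output tokens.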

The four cost levers most teams never pull

1. Trim what you send

Your system prompt goes out on every request, so every word is a tax you pay again and again. A prompt twice as long isn't a one-time cost — it's double the input bill on every call, forever. Cut anything the model doesn't need, and cap max_tokens to the smallest number that still works.
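
To see the recurring tax in dollars, multiply it out. The traffic figure (100,000 requests a day) and the $2.50 per million input price are assumptions for the sketch; `monthly_prompt_cost_usd` is a made-up helper.

```python
def monthly_prompt_cost_usd(prompt_tokens, requests_per_day,
                            in_price_per_m=2.50, days=30):
    """Input cost of resending the same system prompt on every request."""
    total_tokens = prompt_tokens * requests_per_day * days
    return total_tokens * in_price_per_m / 1_000_000


before = monthly_prompt_cost_usd(prompt_tokens=2000, requests_per_day=100_000)
after = monthly_prompt_cost_usd(prompt_tokens=1000, requests_per_day=100_000)

print(f"2,000-token prompt: ${before:,.0f}/month")  # $15,000/month
print(f"1,000-token prompt: ${after:,.0f}/month")   # $7,500/month
```

At that volume, every 1,000 tokens trimmed from the system prompt is worth $7,500 a month before any caching is applied.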

2. Prefix caching — the biggest hidden lever

Before the model generates a single token, it has to prefill your entire prompt. If you send the same long system prompt on every request, you pay for that same prefill work repeatedly. Prefix caching stops the waste: the first time the provider processes your prompt it stores the prefill state, and on the next request — if the start of the prompt matches — it loads the cached state and only processes the new part.

On Anthropic, cache hits cost 90% less than standard input tokens: the write costs 1.25x standard input for a 5-minute TTL or 2.0x for a 1-hour TTL, and any subsequent request within the window pays only 0.10x. Compared with sending the prefix uncached every time, each read saves 0.9x, so the 5-minute write premium (0.25x) is recovered by the first cache read and the 1-hour premium (1.0x) by the second.
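
The break-even is easy to check with a few lines of arithmetic. This sketch assumes the multipliers quoted above, that without caching every request pays the full 1.0x input price for the prefix, and that every read lands inside the TTL. `breakeven_reads` and `relative_cost` are made-up helper names. Under those assumptions the write premium is recovered within the first read or two.

```python
import math


def relative_cost(reads, write_mult, read_mult=0.10):
    """(cached, uncached) prefix cost for one write followed by `reads` reads,
    expressed in multiples of the normal input price for that prefix."""
    cached = write_mult + reads * read_mult
    uncached = 1.0 + reads  # first request plus every later one at full price
    return cached, uncached


def breakeven_reads(write_mult, read_mult=0.10):
    """Smallest number of reads at which caching is strictly cheaper."""
    # write_mult + n * read_mult < 1 + n  =>  n > (write_mult - 1) / (1 - read_mult)
    threshold = (write_mult - 1.0) / (1.0 - read_mult)
    return math.floor(threshold) + 1


print("5-minute TTL:", breakeven_reads(1.25), "read(s)")
print("1-hour TTL:  ", breakeven_reads(2.0), "read(s)")
```

The model is deliberately simple: it ignores reads that miss the TTL (which trigger a fresh write) and any non-prefix tokens, both of which push the real break-even later.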

The savings are not hypothetical. Anthropic's prompt caching reduces costs by up to 90% and latency by up to 85% for long prompts; OpenAI applies caching automatically by default and discounts cached input tokens (by 50% on GPT-4o).

⚠️ Warning: Caching requires exact matching. Any change inside the cached prefix, even a single whitespace character, breaks the cache from that point on. Put your stable content (system prompt, tools, docs) first and your variable content (the user message) last.
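
A small sketch of why ordering matters. It compares how much leading text two consecutive requests share under a cache-friendly layout and under one that puts a per-request value first. The prompt strings and function names are invented, and real caches match on tokens rather than characters, so treat the character counts as a proxy.

```python
import os

SYSTEM = "You are a support assistant. Answer in one sentence.\n"
DOCS = "Refund policy: 30 days with receipt.\n"


def cache_friendly(user_msg):
    """Stable content first, variable content last."""
    return SYSTEM + DOCS + "User: " + user_msg


def cache_hostile(user_msg, request_id):
    """A per-request value at the very top changes the prefix every time."""
    return f"[request {request_id}]\n" + SYSTEM + DOCS + "User: " + user_msg


def shared_prefix_len(a, b):
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


good = shared_prefix_len(cache_friendly("Where is my order?"),
                         cache_friendly("Can I get a refund?"))
bad = shared_prefix_len(cache_hostile("Where is my order?", "a1"),
                        cache_hostile("Can I get a refund?", "b2"))

print("reusable prefix, stable-first layout:", good, "chars")
print("reusable prefix, request id on top:  ", bad, "chars")
```

In the second layout only the literal `[request ` is shared, so the system prompt and documents would be prefilled from scratch on every call.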

3. Model tiering

Not every request needs your most expensive model. Simple classification, pulling a field out of text, routing, a short summary — that runs fine on a smaller, cheaper model. Reserve the frontier model for hard reasoning.

It's the same idea as a CPU cache hierarchy: you don't hit main memory for every operation because a faster, cheaper layer usually has what you need. Don't reach for the flagship on every request — match the model to the difficulty of the task.

Task	Tier	Why
Classification, field extraction, routing	Small / cheap	Deterministic, low reasoning load
Summaries, simple rewrites, FAQ answers	Mid	Some nuance, not hard reasoning
Multi-step reasoning, agents, complex code	Frontier	Worth the premium
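
The table translates into a very small router. This is a sketch under assumptions: the task labels, the tier names, and the model identifiers (`example-small` and so on) are placeholders, not real model names, and something upstream is assumed to label each request with a task type.

```python
# Task type -> tier, mirroring the table above.
TIER_BY_TASK = {
    "classification": "small",
    "field_extraction": "small",
    "routing": "small",
    "summary": "mid",
    "rewrite": "mid",
    "faq": "mid",
    "multi_step_reasoning": "frontier",
    "agent": "frontier",
    "complex_code": "frontier",
}

# Placeholder model identifiers; substitute the ones you actually use.
MODEL_BY_TIER = {
    "small": "example-small",
    "mid": "example-mid",
    "frontier": "example-frontier",
}


def pick_model(task_type):
    """Return the model for a task; unknown tasks fall back to the frontier tier."""
    tier = TIER_BY_TASK.get(task_type, "frontier")
    return MODEL_BY_TIER[tier]


print(pick_model("classification"))  # example-small
print(pick_model("summary"))         # example-mid
print(pick_model("something_new"))   # example-frontier
```

Defaulting unknown task types to the frontier tier trades cost for safety; the opposite default is reasonable if unlabeled traffic is mostly trivial.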

4. Shop providers for open-weight models

The same model can cost wildly different amounts depending on who serves it. Take Llama 3.3, an open-weight model hosted by Together AI, Fireworks, Groq, DeepInfra, and others. Same weights and nominally the same outputs (quantization and serving settings can vary), but very different price and latency per provider.

Important: Picking a model is only half the decision — you also pick the provider running it. Most teams wire up one SDK on day one, never revisit it, and quietly overpay for months.

If you don't want to integrate five APIs and normalize five error formats yourself, a routing layer (e.g. OpenRouter, which speaks the OpenAI chat-completions schema so you only change the base URL) abstracts that away and lets you switch models with a single string. [Mentioned in the source as the sponsor — included here as one option, not an endorsement.]
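
Provider shopping reduces to a small selection problem: among the offers that meet your latency budget, pick the cheapest for your typical request shape. The offers below are entirely hypothetical (provider names, prices, and latencies are invented), so benchmark real ones before relying on any ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    provider: str
    in_price_per_m: float   # USD per million input tokens
    out_price_per_m: float  # USD per million output tokens
    p50_latency_s: float    # measured median latency for your workload


def cheapest_within_budget(offers, input_tokens, output_tokens, max_latency_s):
    """Cheapest offer whose latency fits the budget, or None if none qualifies."""
    eligible = [o for o in offers if o.p50_latency_s <= max_latency_s]
    if not eligible:
        return None
    return min(
        eligible,
        key=lambda o: input_tokens * o.in_price_per_m
        + output_tokens * o.out_price_per_m,
    )


# Invented numbers for the same open-weight model on three hosts.
offers = [
    Offer("provider-a", 0.90, 0.90, 1.2),
    Offer("provider-b", 0.20, 0.60, 2.5),
    Offer("provider-c", 0.60, 0.80, 0.4),
]

interactive = cheapest_within_budget(offers, 2000, 300, max_latency_s=1.5)
batch = cheapest_within_budget(offers, 2000, 300, max_latency_s=3.0)
print(interactive.provider)  # provider-c
print(batch.provider)        # provider-b
```

The same request shape picks different providers for interactive and batch traffic, which is a reason to revisit the choice per workload rather than once per project.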

Production checklist

Measure tokens on your actual model — never eyeball; tokenizers differ across and within families.
Trim the system prompt — it ships on every request; every word is a recurring tax.
Cap max_tokens and use structured outputs — output is the 3–10x-expensive direction.
Cache stable prefixes — system prompt, tool definitions, and reused documents first; variable user input last.
Tier your models — route easy work to cheap models, reserve the frontier model for hard reasoning.
Benchmark providers for open-weight models — same weights, different price and latency.
Audit the context window when the model "misbehaves" — most forgetting/hallucination is context bloat.
Set fallback chains so a provider outage shifts traffic mid-flight instead of paging you at 2 a.m.
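
The last checklist item, a fallback chain, can be sketched in a few lines. Everything here is hypothetical: `call_with_fallback` is a made-up helper, and the two provider functions are fakes standing in for real SDK calls. A production version would also want timeouts, retry limits, and narrower exception handling.

```python
def call_with_fallback(providers, prompt):
    """Try each (name, callable) in order; return the first success.

    Raises RuntimeError with the collected error messages if all fail.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose for this sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Fake providers for demonstration only.
def primary_down(prompt):
    raise TimeoutError("timed out")


def backup_up(prompt):
    return "ok: " + prompt


served_by, reply = call_with_fallback(
    [("primary", primary_down), ("backup", backup_up)], "hello"
)
print(served_by, reply)  # backup ok: hello
```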

Conclusion

I've stopped thinking of LLM prompts as text and started thinking of them as compute budgets. Every one of these levers is the same underlying decision: which tokens you'll pay to include, and which you'll refuse to carry.

Engineers who build with this in mind ship systems that scale — predictable bills, tight latency, applications that keep working as they grow. Engineers who don't tend to learn it the hard way, in an incident review where the root cause turns out to be that they spent their compute budget on punctuation, UUIDs, and blobs.

Start small: open a tokenizer, paste in one of your real production prompts, and look at it the way the model does. You'll almost certainly find tokens that aren't earning their place.
