Tokens are the basic units of text that AI models process. A token is roughly 0.75 words in English, so 1,000 tokens is approximately 750 words, or about three double-spaced pages of text. Tokenization is the process of splitting text into tokens, which are then converted into numerical IDs that the model can process.
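That rule of thumb is easy to turn into a back-of-the-envelope estimator. The function below is a rough heuristic for English prose only (the name `estimate_tokens` is illustrative); exact counts always require the model's actual tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~0.75 English words per token (~1.33 tokens per word)."""
    return round(len(text.split()) / 0.75)

# 9 words -> about 12 tokens under the heuristic
print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 12
```

This is only useful for ballpark budgeting; tokenizer output for code, non-English text, or unusual words diverges sharply from the word-count rule.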
What is a token exactly?
AI models don't process characters or whole words — they use tokens, chunks of text produced by a tokenizer algorithm. The most common is Byte-Pair Encoding (BPE), which starts with individual characters and iteratively merges the most frequent adjacent pairs.
Tokenizing text with tiktoken (OpenAI's tokenizer for GPT-4)
import tiktoken
# GPT-4 tokenizer (cl100k_base, ~100k vocabulary)
enc = tiktoken.get_encoding("cl100k_base")
def tokenize(text: str):
    ids = enc.encode(text)
    tokens = [enc.decode([i]) for i in ids]
    return ids, tokens
# Example 1: Common English words
ids, tokens = tokenize("The quick brown fox")
print(f"Tokens: {tokens}") # ['The', ' quick', ' brown', ' fox']
print(f"IDs: {ids}") # [791, 4062, 14198, 39935]
print(f"Count: {len(ids)}") # 4
# Example 2: Rare/technical words split into subwords
ids, tokens = tokenize("tokenization")
print(f"Tokens: {tokens}") # ['token', 'ization'] → 2 tokens for 1 word!
# Example 3: Unicode (non-English is more expensive)
ids, tokens = tokenize("नमस्ते") # Hindi for "hello"
print(f"Tokens: {tokens}") # many more tokens than the English equivalent
# Example 4: Code
ids, tokens = tokenize("def hello_world():")
print(f"Tokens: {tokens}") # ['def', ' hello', '_world', '():']
# Rule of thumb: 1 token ≈ 4 chars in English
test = "Hello, how are you doing today?"
print(f"Chars: {len(test)}, Tokens: {len(enc.encode(test))}")
# Chars: 31, Tokens: 8

Why "strawberry" stumped GPT-4
GPT-4 tokenizes "strawberry" as ["straw", "berry"] — two separate tokens. Since models process tokens, not characters, they literally cannot see the letters within a token. This is why early LLMs failed at tasks like "count the r's in strawberry" — the 'r' in 'straw' isn't directly visible as a character.
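A toy sketch makes the opacity concrete. No tokenizer is needed; the two-token split is the one quoted above (real splits vary by tokenizer and by whether a leading space is present):

```python
# Illustrative only: the model receives whole-token units, not characters.
tokens = ["straw", "berry"]  # the split quoted above; actual splits vary

# We can count letters because we can look inside the strings...
word = "".join(tokens)
print(word.count("r"))  # 3

# ...but from the model's point of view each token is an opaque ID,
# so "how many r's?" must be answered from learned associations,
# not by inspecting characters it never sees.
print(len(tokens))  # 2 indivisible units
```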
How Byte-Pair Encoding (BPE) works
BPE builds a vocabulary by iteratively merging the most frequent character pairs. Here's the algorithm:
Simplified BPE training algorithm from scratch
from collections import Counter
def get_stats(vocab):
"""Count all adjacent symbol pairs across the vocabulary."""
pairs = Counter()
for word, freq in vocab.items():
symbols = word.split()
for i in range(len(symbols) - 1):
pairs[(symbols[i], symbols[i+1])] += freq
return pairs
def merge_vocab(pair, vocab):
"""Merge the most frequent pair in all vocabulary entries."""
bigram = ' '.join(pair)
replacement = ''.join(pair)
return {
word.replace(bigram, replacement): freq
for word, freq in vocab.items()
}
# Initial vocabulary: words split into characters + </w> end marker
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}
print("Initial vocab:", vocab)
# Run BPE for 10 merge operations
merges = []
for step in range(10):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    merges.append(best)
    print(f"Step {step+1}: Merge {best} → {''.join(best)}")
# Step 1: Merge ('e', 's') → es
# Step 2: Merge ('es', 't') → est
# Step 3: Merge ('est', '</w>') → est</w>
# Step 4: Merge ('l', 'o') → lo
# ...
# Eventually: 'newest' becomes a single token!

Input tokens vs output tokens
AI platforms count both input tokens (everything sent to the model) and output tokens (everything the model generates). Output tokens are more expensive to produce because generation is sequential: the entire input can be processed in one parallel prefill pass, while each output token requires its own forward pass through the model.
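That asymmetry can be captured in a toy cost model. This is a sketch of the counting argument only, not how real inference engines are implemented (they batch, cache, and speculate in ways this ignores):

```python
def forward_passes(prompt_tokens: int, generated_tokens: int) -> int:
    """Toy cost model: one batched prefill pass covers the whole prompt,
    then each generated token needs its own sequential decode pass."""
    prefill_passes = 1
    decode_passes = generated_tokens  # one pass per output token
    return prefill_passes + decode_passes

# A 10,000-token prompt with a 500-token reply takes 501 passes,
# and every pass after the first exists only to produce output.
print(forward_passes(10_000, 500))  # 501
```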
| Content | Approx. tokens |
|---|---|
| Typical user question | 50–150 tokens |
| This glossary article (full) | ~2,000 tokens |
| A textbook chapter | 8,000–15,000 tokens |
| A 300-page PDF | 120,000–200,000 tokens |
| Long coding session | 50,000+ tokens |
Estimating token counts before making API calls
import tiktoken
def count_tokens(text: str, model: str = "gpt-4o") -> int:
"""Count tokens before sending to API to estimate cost."""
enc = tiktoken.encoding_for_model(model)
return len(enc.encode(text))
def estimate_cost(input_text: str, output_tokens: int = 500):
"""Estimate API cost in USD (GPT-4o pricing, 2025)."""
input_tokens = count_tokens(input_text)
input_cost = input_tokens * 2.50 / 1_000_000 # $2.50/M input tokens
output_cost = output_tokens * 10.0 / 1_000_000 # $10.0/M output tokens
total = input_cost + output_cost
print(f"Input tokens: {input_tokens:,}")
print(f"Output tokens: {output_tokens:,} (estimated)")
print(f"Input cost: ${input_cost:.4f}")
print(f"Output cost: ${output_cost:.4f}")
print(f"Total: ${total:.4f}")
return total
document = "This is a long research paper..." * 100 # simulated
estimate_cost(document)

Practice questions
- What is the approximate token cost to process a 10-page academic paper (about 4,000 words) with Claude? (Answer: English text averages ~1.3 tokens per word, so 4,000 words × 1.3 ≈ 5,200 input tokens. At Claude Sonnet 4.6 pricing (~$3/million input tokens), processing the paper costs ~$0.016. A 500-token response at ~$15/million output tokens adds ~$0.0075, for a total of ≈ $0.023 per query, or ~$23/day at 1,000 queries per day. Token counting is essential for cost estimation in production AI applications.)
- Why do different languages use tokens at different rates? (Answer: BPE vocabularies are built to compress common sequences. English has the most representation in training data, so it gets the most efficient tokenization (~1.3 tokens/word). European languages: 1.5–2 tokens/word. Asian languages (Chinese, Japanese, Korean): characters often map 1:1 to tokens, but each character carries more information than an English letter, so per-concept cost is comparable. Arabic: right-to-left script with ligatures, 2–4 tokens per word. This significantly affects API costs for multilingual applications.)
- What are special tokens and why must they be handled carefully? (Answer: Special tokens are reserved vocabulary entries with specific semantic roles: [CLS] (BERT classification token), [SEP] (BERT separator), [PAD] (padding), [UNK] (unknown), [MASK] (BERT masking), <|endoftext|> (GPT end of text), <|im_start|>/<|im_end|> (chat template turn markers). If special tokens are treated as regular text (e.g., appearing literally in user input), models may behave unexpectedly. Production systems must sanitize user inputs to prevent special-token injection.)
- Token window vs context window: are these the same thing? (Answer: Often used interchangeably but technically distinct. Token window = maximum number of tokens the model can process (input + output combined). Context window = the effective context the model can attend to. For some architectures with sliding-window attention, the effective context can be smaller than the token window for older tokens. Claude's 200K context window means input and output together can total 200,000 tokens; within that window, each token attends to all earlier tokens via causal attention.)
- What is tokenization fertility and why is it a fairness concern? (Answer: Fertility = the number of tokens per word (or per character) in a language. Low fertility (English): efficient, cheap API cost, more information per context window. High fertility (non-English): more expensive, reduced effective context, information may be truncated. For a multilingual product, users writing in Yoruba or Vietnamese pay more per word than English users and get shorter effective contexts: a linguistic inequality embedded in the economic model of LLM APIs.)
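The arithmetic in the first question can be checked directly. This uses the per-million-token prices assumed in that answer, which are illustrative, not authoritative:

```python
# Worked arithmetic for the paper-processing question above.
words = 4000
tokens_per_word = 1.3                          # rough English average
input_tokens = words * tokens_per_word         # ~5,200 input tokens
output_tokens = 500                            # assumed response length

input_cost = input_tokens * 3 / 1_000_000      # $3/M input tokens (assumed)
output_cost = output_tokens * 15 / 1_000_000   # $15/M output tokens (assumed)
total = input_cost + output_cost

print(f"{input_tokens:.0f} input tokens, total ≈ ${total:.3f} per query")
print(f"≈ ${total * 1000:.0f}/day at 1,000 queries/day")
```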