Tokens are the basic units of text that AI models process. A token is roughly 0.75 words in English, so 1,000 tokens is approximately 750 words, or about three double-spaced pages of text. Tokenization is the process of splitting text into tokens, which are then converted into numerical IDs that the model can process.
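That rule of thumb is easy to turn into a back-of-the-envelope estimator. The function below is a rough heuristic for English prose only (the name `estimate_tokens` is illustrative); exact counts always require the model's actual tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~0.75 English words per token (~1.33 tokens per word)."""
    return round(len(text.split()) / 0.75)

# 9 words -> about 12 tokens under the heuristic
print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 12
```

This is only useful for ballpark budgeting; tokenizer output for code, non-English text, or unusual words diverges sharply from the word-count rule.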
What is a token exactly?
AI models don't process characters or whole words — they use tokens, chunks of text produced by a tokenizer algorithm. The most common is Byte-Pair Encoding (BPE), which starts with individual characters and iteratively merges the most frequent adjacent pairs.
Tokenizing text with tiktoken (OpenAI's tokenizer for GPT-4)
import tiktoken
# GPT-4 tokenizer (cl100k_base, ~100k vocabulary)
enc = tiktoken.get_encoding("cl100k_base")
def tokenize(text: str):
    ids = enc.encode(text)
    tokens = [enc.decode([i]) for i in ids]
    return ids, tokens
# Example 1: Common English words
ids, tokens = tokenize("The quick brown fox")
print(f"Tokens: {tokens}") # ['The', ' quick', ' brown', ' fox']
print(f"IDs: {ids}") # [791, 4062, 14198, 39935]
print(f"Count: {len(ids)}") # 4
# Example 2: Rare/technical words split into subwords
ids, tokens = tokenize("tokenization")
print(f"Tokens: {tokens}") # ['token', 'ization'] → 2 tokens for 1 word!
# Example 3: Unicode (non-English is more expensive)
ids, tokens = tokenize("नमस्ते") # Hindi for "hello"
print(f"Tokens: {tokens}") # many more tokens than the English equivalent
# Example 4: Code
ids, tokens = tokenize("def hello_world():")
print(f"Tokens: {tokens}") # ['def', ' hello', '_world', '():']
# Rule of thumb: 1 token ≈ 4 chars in English
test = "Hello, how are you doing today?"
print(f"Chars: {len(test)}, Tokens: {len(enc.encode(test))}")
# Chars: 31, Tokens: 8

Why "strawberry" stumped GPT-4
GPT-4 tokenizes "strawberry" as ["straw", "berry"] — two separate tokens. Since models process tokens, not characters, they literally cannot see the letters within a token. This is why early LLMs failed at tasks like "count the r's in strawberry" — the 'r' in 'straw' isn't directly visible as a character.
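A toy sketch makes the opacity concrete. No tokenizer is needed; the two-token split is the one quoted above (real splits vary by tokenizer and by whether a leading space is present):

```python
# Illustrative only: the model receives whole-token units, not characters.
tokens = ["straw", "berry"]  # the split quoted above; actual splits vary

# We can count letters because we can look inside the strings...
word = "".join(tokens)
print(word.count("r"))  # 3

# ...but from the model's point of view each token is an opaque ID,
# so "how many r's?" must be answered from learned associations,
# not by inspecting characters it never sees.
print(len(tokens))  # 2 indivisible units
```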
How Byte-Pair Encoding (BPE) works
BPE builds a vocabulary by iteratively merging the most frequent character pairs. Here's the algorithm:
Simplified BPE training algorithm from scratch
from collections import Counter
def get_stats(vocab):
"""Count all adjacent symbol pairs across the vocabulary."""
pairs = Counter()
for word, freq in vocab.items():
symbols = word.split()
for i in range(len(symbols) - 1):
pairs[(symbols[i], symbols[i+1])] += freq
return pairs
def merge_vocab(pair, vocab):
"""Merge the most frequent pair in all vocabulary entries."""
bigram = ' '.join(pair)
replacement = ''.join(pair)
return {
word.replace(bigram, replacement): freq
for word, freq in vocab.items()
}
# Initial vocabulary: words split into characters + </w> end marker
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}
print("Initial vocab:", vocab)
# Run BPE for 10 merge operations
merges = []
for step in range(10):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    merges.append(best)
    print(f"Step {step+1}: Merge {best} → {''.join(best)}")
# Step 1: Merge ('e', 's') → es
# Step 2: Merge ('es', 't') → est
# Step 3: Merge ('est', '</w>') → est</w>
# Step 4: Merge ('l', 'o') → lo
# ...
# Eventually: 'newest' becomes a single token!

Input tokens vs output tokens
AI platforms count both input tokens (everything sent to the model) and output tokens (everything the model generates). Output tokens are more expensive to produce because generation is sequential: the entire input can be processed in one parallel prefill pass, while each output token requires its own forward pass through the model.
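That asymmetry can be captured in a toy cost model. This is a sketch of the counting argument only, not how real inference engines are implemented (they batch, cache, and speculate in ways this ignores):

```python
def forward_passes(prompt_tokens: int, generated_tokens: int) -> int:
    """Toy cost model: one batched prefill pass covers the whole prompt,
    then each generated token needs its own sequential decode pass."""
    prefill_passes = 1
    decode_passes = generated_tokens  # one pass per output token
    return prefill_passes + decode_passes

# A 10,000-token prompt with a 500-token reply takes 501 passes,
# and every pass after the first exists only to produce output.
print(forward_passes(10_000, 500))  # 501
```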
| Content | Approx. tokens |
|---|---|
| Typical user question | 50–150 tokens |
| This glossary article (full) | ~2,000 tokens |
| A textbook chapter | 8,000–15,000 tokens |
| A 300-page PDF | 120,000–200,000 tokens |
| Long coding session | 50,000+ tokens |
Estimating token counts before making API calls
import tiktoken
def count_tokens(text: str, model: str = "gpt-4o") -> int:
"""Count tokens before sending to API to estimate cost."""
enc = tiktoken.encoding_for_model(model)
return len(enc.encode(text))
def estimate_cost(input_text: str, output_tokens: int = 500):
"""Estimate API cost in USD (GPT-4o pricing, 2025)."""
input_tokens = count_tokens(input_text)
input_cost = input_tokens * 2.50 / 1_000_000 # $2.50/M input tokens
output_cost = output_tokens * 10.0 / 1_000_000 # $10.0/M output tokens
total = input_cost + output_cost
print(f"Input tokens: {input_tokens:,}")
print(f"Output tokens: {output_tokens:,} (estimated)")
print(f"Input cost: ${input_cost:.4f}")
print(f"Output cost: ${output_cost:.4f}")
print(f"Total: ${total:.4f}")
return total
document = "This is a long research paper..." * 100 # simulated
estimate_cost(document)

Practice questions
- What is the approximate token cost to process a 10-page academic paper (about 4,000 words) with Claude? (Answer: English text averages ~1.3 tokens per word, so 4,000 words × 1.3 ≈ 5,200 input tokens. At Claude Sonnet 4.6 pricing (~$3/million input tokens), processing the paper costs ~$0.016. A 500-token response at ~$15/million output tokens adds ~$0.0075, for a total of ≈ $0.023 per query, or ~$23/day at 1,000 queries per day. Token counting is essential for cost estimation in production AI applications.)
- Why do different languages use tokens at different rates? (Answer: BPE vocabularies are built to compress common sequences. English has the most representation in training data, so it gets the most efficient tokenization (~1.3 tokens/word). European languages: 1.5–2 tokens/word. Asian languages (Chinese, Japanese, Korean): characters often map 1:1 to tokens, but each character carries more information than an English letter, so per-concept cost is comparable. Arabic: right-to-left script with ligatures, 2–4 tokens per word. This significantly affects API costs for multilingual applications.)
- What are special tokens and why must they be handled carefully? (Answer: Special tokens are reserved vocabulary entries with specific semantic roles: [CLS] (BERT classification token), [SEP] (BERT separator), [PAD] (padding), [UNK] (unknown), [MASK] (BERT masking), <|endoftext|> (GPT end of text), <|im_start|>/<|im_end|> (chat template turn markers). If special tokens are treated as regular text (e.g., appearing literally in user input), models may behave unexpectedly. Production systems must sanitize user inputs to prevent special-token injection.)
- Token window vs context window: are these the same thing? (Answer: Often used interchangeably but technically distinct. Token window = maximum number of tokens the model can process (input + output combined). Context window = the effective context the model can attend to. For some architectures with sliding-window attention, the effective context can be smaller than the token window for older tokens. Claude's 200K context window means input and output together can total 200,000 tokens; within that window, each token attends to all earlier tokens via causal attention.)
- What is tokenization fertility and why is it a fairness concern? (Answer: Fertility = the number of tokens per word (or per character) in a language. Low fertility (English): efficient, cheap API cost, more information per context window. High fertility (non-English): more expensive, reduced effective context, information may be truncated. For a multilingual product, users writing in Yoruba or Vietnamese pay more per word than English users and get shorter effective contexts: a linguistic inequality embedded in the economic model of LLM APIs.)
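The arithmetic in the first question can be checked directly. This uses the per-million-token prices assumed in that answer, which are illustrative, not authoritative:

```python
# Worked arithmetic for the paper-processing question above.
words = 4000
tokens_per_word = 1.3                          # rough English average
input_tokens = words * tokens_per_word         # ~5,200 input tokens
output_tokens = 500                            # assumed response length

input_cost = input_tokens * 3 / 1_000_000      # $3/M input tokens (assumed)
output_cost = output_tokens * 15 / 1_000_000   # $15/M output tokens (assumed)
total = input_cost + output_cost

print(f"{input_tokens:.0f} input tokens, total ≈ ${total:.3f} per query")
print(f"≈ ${total * 1000:.0f}/day at 1,000 queries/day")
```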