
Transformer Architecture

The neural network design that powers every modern LLM.


Definition

The Transformer is a neural network architecture introduced in the 2017 paper 'Attention Is All You Need' by Vaswani et al. at Google. It replaced recurrent neural networks (RNNs) as the dominant architecture for NLP by processing entire sequences in parallel using attention mechanisms, enabling massive scale and dramatically faster training.

The revolution: from RNNs to Transformers

Before Transformers, sequence modeling used RNNs and LSTMs — architectures that processed tokens one-by-one in sequence. This made training slow (no parallelization) and made long-range dependencies hard to capture because information decayed over many steps.

The Transformer's key innovation: eliminate recurrence entirely and use self-attention instead. Every token directly attends to every other token in a single operation — enabling full GPU parallelization and capturing global dependencies without decay.
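The core operation can be sketched in a few lines of NumPy — a single attention head with the learned query/key/value projections omitted for brevity (a simplification; real layers project the input first):

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): every token scores every token
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ V                   # each output mixes ALL value vectors

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))         # 5 tokens, 16-dim embeddings
out = self_attention(x, x, x)            # self-attention: Q = K = V = x
print(out.shape)                         # (5, 16)
```

Note that the `(n, n)` score matrix is computed in one matrix multiply — this is the full parallelism, and also the source of the O(n²) memory cost discussed below.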

| Property | RNN/LSTM | Transformer |
|---|---|---|
| Parallelization | ❌ Sequential (step-by-step) | ✅ Fully parallel across tokens |
| Long-range dependencies | ⚠️ Degrades with distance | ✅ O(1) path between any two tokens |
| Training speed | Slow (GPU underutilized) | Fast (GPU fully utilized) |
| Memory | O(n) — the hidden state | O(n²) — attention matrix |
| Scale | Hard to scale past 1B params | Scales to trillions of parameters |

Inside a Transformer block

Each Transformer block (stacked 12–96+ times in modern models) contains these components, shown here in the pre-norm order used by GPT-2 and most modern LLMs:

  1. Layer Norm — normalize the input (stabilizes training)
  2. Multi-head self-attention — every token attends to every other token in parallel
  3. Residual connection — add the block input to the attention output (skip connection)
  4. Layer Norm — normalize again
  5. Feed-forward network — two linear layers with a nonlinearity (GELU in GPT-style models; the original paper used ReLU), applied independently to each token position
  6. Residual connection — add the block input to the FFN output

A single Transformer block in PyTorch (simplified)

import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()

        # Multi-head self-attention
        self.attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=dropout,
            batch_first=True
        )

        # Feed-forward network: expand 4× then project back
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),    # d_ff = 4 × d_model typically
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

        # Layer normalizations
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, causal_mask=None):
        # x shape: (batch, seq_len, d_model)

        # Pre-norm + self-attention + residual
        normed = self.norm1(x)
        attn_out, _ = self.attn(
            normed, normed, normed,
            attn_mask=causal_mask,      # None for encoder, causal for decoder
            is_causal=(causal_mask is not None)
        )
        x = x + self.dropout(attn_out)   # residual connection

        # Pre-norm + FFN + residual
        x = x + self.ffn(self.norm2(x))  # residual connection

        return x


# Minimal GPT-2 style decoder
class GPTDecoder(nn.Module):
    def __init__(self, vocab_size=50257, d_model=768, n_heads=12, n_layers=12):
        super().__init__()
        self.embed     = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(1024, d_model)     # learned positional
        self.blocks    = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff=4*d_model)
            for _ in range(n_layers)
        ])
        self.norm_out  = nn.LayerNorm(d_model)
        self.lm_head   = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, input_ids):
        B, T = input_ids.shape
        pos = torch.arange(T, device=input_ids.device)

        x = self.embed(input_ids) + self.pos_embed(pos)

        # Causal mask: each token only sees itself and prior tokens
        causal_mask = nn.Transformer.generate_square_subsequent_mask(
            T, device=input_ids.device   # keep the mask on the same device as x
        )

        for block in self.blocks:
            x = block(x, causal_mask)

        logits = self.lm_head(self.norm_out(x))  # (B, T, vocab_size)
        return logits

# GPT-2 Small shapes: 12 layers, 12 heads, d_model=768
model = GPTDecoder()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")  # ~163M
# (GPT-2 itself reaches ~124M by tying lm_head weights to the token embedding;
#  this model leaves them untied for clarity)
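Once trained, a decoder like this generates text autoregressively: feed the sequence in, take the logits at the last position, append the chosen token, repeat. A minimal greedy-decoding sketch (`model` stands for any module mapping token ids to `(B, T, vocab)` logits, such as the `GPTDecoder` above):

```python
import torch

@torch.no_grad()
def generate(model: torch.nn.Module, input_ids: torch.Tensor,
             max_new_tokens: int, max_len: int = 1024) -> torch.Tensor:
    """Greedy decoding: repeatedly append the most probable next token."""
    model.eval()
    for _ in range(max_new_tokens):
        logits = model(input_ids[:, -max_len:])                  # (B, T, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # last position only
        input_ids = torch.cat([input_ids, next_id], dim=1)
    return input_ids
```

Real samplers add temperature, top-k/top-p, and a KV cache so earlier positions are not recomputed each step — this loop is O(n²) per generated token without one.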

Positional encodings

Self-attention treats input as a set, not a sequence — it has no inherent sense of order. Positional encodings add position information to token embeddings. The original Transformer used sinusoidal encodings:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Even dimensions use sine, odd dimensions use cosine, creating a unique pattern for each position. Together these create position-unique vectors that generalize to unseen sequence lengths.

Modern models use RoPE (Rotary Position Embedding) — a more powerful approach that encodes relative positions by rotating the attention query and key vectors.

RoPE advantage

RoPE encodes relative rather than absolute positions, which makes it easier to extend to longer contexts than seen during training. LLaMA, Mistral, and most modern LLMs use RoPE. The original sinusoidal PE is now rarely used.
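The rotation can be sketched in NumPy. Each pair of dimensions (2i, 2i+1) is rotated by the angle pos·θᵢ with θᵢ = 10000^(−2i/d); because rotations compose, the dot product of a rotated query and key depends only on their relative offset. This is an illustrative standalone sketch — real implementations apply it inside each attention head to Q and K:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (seq_len, d), d even."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    theta = base ** (-np.arange(0, d, 2) / d)     # (d/2,) per-pair frequencies
    angles = pos * theta                          # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]               # split into rotation pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin            # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.ones((6, 8))
print(rope(q).shape)   # (6, 8)
```

The relative-position property is easy to check: for fixed query and key vectors, `rope(q)[m] @ rope(k)[n]` is the same for any positions with the same offset n − m.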

Sinusoidal positional encoding from the original paper

import numpy as np
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """
    Compute the original sinusoidal positional encodings from
    'Attention Is All You Need' (Vaswani et al., 2017).
    """
    PE = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]      # (max_len, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]        # (1, d_model/2)

    div_term = 10000 ** (2 * i / d_model)

    PE[:, 0::2] = np.sin(position / div_term)          # even dims: sin
    PE[:, 1::2] = np.cos(position / div_term)          # odd dims: cos

    return torch.FloatTensor(PE)

pe = sinusoidal_positional_encoding(max_len=512, d_model=512)
print(f"Positional encoding shape: {pe.shape}")   # (512, 512)

# Usage: add to token embeddings before feeding to Transformer
# x = token_embeddings + pe[:seq_len, :]

Decoder-only vs Encoder-only vs Encoder-Decoder

| Architecture | Attention type | Examples | Best for |
|---|---|---|---|
| Decoder-only | Causal (masked) self-attention | GPT, LLaMA, Claude, Gemini | Text generation, chat, completion |
| Encoder-only | Bidirectional self-attention | BERT, RoBERTa, DeBERTa | Classification, embeddings, NLU tasks |
| Encoder-Decoder | Bidirectional enc + causal dec + cross-attn | T5, BART, mT5, Whisper | Translation, summarization, seq2seq |

The Transformer architecture from the 2017 paper was encoder-decoder (designed for translation). GPT (2018) showed that decoder-only models scale better for general language modeling. BERT (2018) showed encoder-only models excel at understanding tasks. Modern frontier LLMs (GPT-4, Claude, LLaMA) are all decoder-only.
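Mechanically, the only difference between bidirectional (encoder-style) and causal (decoder-style) attention is a mask on the score matrix before the softmax. A NumPy sketch (the boolean True-means-blocked convention here is chosen for illustration):

```python
import numpy as np

def attention_weights(scores: np.ndarray, causal: bool) -> np.ndarray:
    """Turn raw (n, n) attention scores into weights, optionally causally masked."""
    if causal:
        n = scores.shape[-1]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly above diagonal
        scores = np.where(future, -np.inf, scores)          # block future positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)                # rows sum to 1

scores = np.zeros((4, 4))  # uniform scores for illustration
print(attention_weights(scores, causal=False)[0])  # row 0 attends to all 4 tokens
print(attention_weights(scores, causal=True)[0])   # row 0 attends only to itself
```

Setting blocked positions to −∞ makes their softmax weight exactly zero, so a causal model literally cannot look at future tokens — which is what makes next-token training valid.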

Practice questions

  1. What problem does the self-attention mechanism solve that RNNs could not? (Answer: RNNs process tokens sequentially — token at position t depends on all previous tokens only through the hidden state, which becomes a bottleneck for long-range dependencies. Self-attention computes relationships between every pair of tokens in O(1) steps regardless of distance. Token at position 1 and token at position 512 have direct attention connections. RNNs need 511 sequential steps to propagate information between them — gradients vanish. Transformers process all positions in parallel, making both training and long-range dependency learning dramatically more efficient.)
  2. Why is scaled dot-product attention divided by √d_k? (Answer: Dot products QK^T grow in magnitude with d_k (key dimension). For d_k=64, expected dot product magnitude is √64=8. Without scaling, these large values push softmax into saturation regions (nearly one-hot distributions) where gradients are extremely small. Dividing by √d_k keeps pre-softmax values in a reasonable range (variance ≈ 1), maintaining healthy softmax gradients throughout training. This is why 'scaled' dot-product attention — the scaling is essential for stable training.)
  3. What is the difference between encoder-only, decoder-only, and encoder-decoder transformer architectures? (Answer: Encoder-only (BERT): bidirectional — every token attends to every other token. Best for understanding/classification tasks (NER, sentiment, QA). Decoder-only (GPT, LLaMA, Claude): causal — each token only attends to previous tokens. Best for generation. Encoder-decoder (T5, BART): encoder processes input bidirectionally, decoder generates output attending to encoder outputs via cross-attention. Best for seq2seq tasks (translation, summarisation) where input and output are different sequences.)
  4. What is positional encoding and why do transformers need it? (Answer: Self-attention is permutation-invariant — shuffling input tokens produces the same attention values (just reordered). The model has no notion of token order without explicit position information. Positional encoding adds position-dependent values to token embeddings before attention. Sinusoidal PE (original): PE(pos, 2i) = sin(pos/10000^(2i/d_model)). Each position has a unique, smooth pattern the model can use to infer order. Rotary PE (RoPE, used in LLaMA): encodes relative positions via rotation matrices applied to Q and K, enabling better length generalisation.)
  5. A transformer with 12 layers, 12 attention heads, and d_model=768. How many parameters does the attention mechanism contribute? (Answer: Per attention head: Q, K, V projections each are d_model × (d_model/heads) = 768 × 64. Output projection: d_model × d_model = 768 × 768. Per layer: 4 × d_model² = 4 × 768² = 2,359,296 parameters. For 12 layers: 12 × 2,359,296 = 28,311,552 ≈ 28M parameters. This is BERT-base (110M total) — attention contributes ~25% of parameters; the rest is in feed-forward layers (FFN ≈ 4 × d_model × d_model × 2 per layer).)
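The arithmetic in question 5 can be checked directly (weight matrices only; biases and layer norms omitted):

```python
d_model, n_heads, n_layers = 768, 12, 12

# Q, K, V projections: n_heads heads of size d_model/n_heads each, which
# together form a d_model x d_model matrix apiece; output projection is a fourth.
attn_per_layer = 4 * d_model * d_model
attn_total = n_layers * attn_per_layer

# FFN: expand to 4*d_model, then project back down
ffn_per_layer = d_model * (4 * d_model) + (4 * d_model) * d_model
ffn_total = n_layers * ffn_per_layer

print(f"attention: {attn_total:,}")   # attention: 28,311,552
print(f"ffn:       {ffn_total:,}")    # ffn:       56,623,104
```

The FFN's 56.6M versus attention's 28.3M confirms the point in the answer: most of a Transformer's parameters live in the feed-forward layers, not the attention.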

