
Recurrent Neural Network (RNN) & LSTM

The predecessor to Transformers for sequential data.


Definition

Recurrent Neural Networks (RNNs) are a class of neural networks designed for sequential data (text, time series, audio) by maintaining a hidden state that summarizes all previous inputs. Long Short-Term Memory (LSTM) networks added gating mechanisms to RNNs to better capture long-range dependencies. While largely superseded by Transformers for NLP, RNNs remain relevant for streaming/real-time applications.

How RNNs process sequences

Unlike feedforward networks that process fixed-size inputs independently, RNNs maintain a hidden state h_t that carries information through time. At each step, the new state depends on both the current input and the previous state:

Vanilla RNN update rule:

    h_t = tanh(W_xh · x_t + W_hh · h_{t−1} + b_h)

The same weight matrices (W_hh, W_xh) are shared across all time steps, enabling the network to process sequences of any length.

Minimal RNN cell from scratch vs PyTorch

import torch
import torch.nn as nn

# Manual RNN cell (illustrative)
class ManualRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_xh = nn.Linear(input_size, hidden_size)
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x, h_prev):
        # x: (batch, input_size)  h_prev: (batch, hidden_size)
        return torch.tanh(self.W_xh(x) + self.W_hh(h_prev))

# PyTorch built-in (optimized CUDA kernels, handles batching)
rnn = nn.RNN(input_size=64, hidden_size=128, num_layers=2,
             batch_first=True, bidirectional=True)
# Input: (batch, seq_len, input_size)
x = torch.randn(32, 50, 64)  # batch=32, seq_len=50
output, h_n = rnn(x)  # output: (32, 50, 256), h_n: (4, 32, 128)

The vanishing gradient problem in RNNs

Backpropagation Through Time (BPTT) unrolls the RNN over T steps and applies the chain rule. The gradient at step 1 requires multiplying T − 1 Jacobians together. If the recurrent weight matrix's eigenvalues have magnitude below 1, gradients shrink geometrically:

Gradient of the loss with respect to an early hidden state:

    ∂L/∂h_1 = ∂L/∂h_T · ∏_{t=2..T} ∂h_t/∂h_{t−1},  where ∂h_t/∂h_{t−1} = diag(tanh′) · W_hh

With T = 100 steps and eigenvalues of magnitude below 1, the product shrinks to 0 exponentially: the model effectively forgets early inputs.

Practical consequence

A vanilla RNN trained on sentences > ~20 tokens long will fail to learn long-range dependencies — a pronoun 50 words after its antecedent, a closing bracket 30 characters after the opening. LSTMs were invented specifically to solve this.
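A minimal sketch of this effect, using an untrained nn.RNN with hypothetical sizes (input 16, hidden 32): backpropagate from only the last timestep's output and measure how much gradient reaches the first input as the sequence grows.

```python
import torch
import torch.nn as nn

# Untrained vanilla RNN (illustrative sizes: input 16, hidden 32).
torch.manual_seed(0)
rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)

norms = {}
for T in (10, 50, 100):
    x = torch.randn(1, T, 16, requires_grad=True)
    output, _ = rnn(x)
    # Backprop from the LAST timestep's output only...
    output[:, -1].sum().backward()
    # ...and measure the gradient that reaches the FIRST input.
    norms[T] = x.grad[:, 0].norm().item()
    print(f"T={T:3d}  grad norm at step 1: {norms[T]:.2e}")
```

With PyTorch's default initialization the recurrent matrix's spectral radius is well below 1, so the printed norms collapse toward zero as T grows: exactly the vanishing-gradient behavior described above.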

LSTM: gated memory cells

LSTM (Hochreiter & Schmidhuber, 1997) adds a cell state C_t — a 'memory highway' with only linear interactions — controlled by three learned gates:

LSTM cell state update:

    C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t

The forget gate f_t (sigmoid, range 0–1) decides what fraction of the old memory to keep; the input gate i_t decides what new content to write. Because the cell state is updated through purely linear interactions, gradients can flow backward through time without vanishing.

| Gate | Range | Learns to |
| --- | --- | --- |
| Forget gate f_t | [0, 1] | Erase irrelevant memory (e.g., reset subject after a period) |
| Input gate i_t | [0, 1] | Write new information selectively |
| Candidate C̃_t | [−1, 1] | Compute what new content to potentially write |
| Output gate o_t | [0, 1] | Control what portion of memory to expose as the hidden state |
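A minimal LSTM cell in the style of the ManualRNNCell above (illustrative sketch; like PyTorch's own implementation, it computes all four gate pre-activations with one fused linear layer):

```python
import torch
import torch.nn as nn

# Illustrative LSTM cell, mirroring the ManualRNNCell earlier.
class ManualLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear layer produces i, f, g, o pre-activations at once.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, state):
        h_prev, c_prev = state
        z = self.gates(torch.cat([x, h_prev], dim=1))
        i, f, g, o = z.chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)            # candidate C̃_t, range (−1, 1)
        c = f * c_prev + i * g       # cell state: linear 'memory highway'
        h = o * torch.tanh(c)        # hidden state: gated view of the memory
        return h, c

cell = ManualLSTMCell(input_size=64, hidden_size=128)
x = torch.randn(32, 64)
h0 = torch.zeros(32, 128)
c0 = torch.zeros(32, 128)
h1, c1 = cell(x, (h0, c0))  # h1, c1: (32, 128)
```

Note how the cell state update `c = f * c_prev + i * g` contains no squashing non-linearity on c_prev itself, which is what keeps the gradient path open.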

GRUs and modern variants

GRU (Gated Recurrent Unit, Cho et al., 2014) simplifies LSTM by merging forget+input gates into a single update gate and eliminating the separate cell state — fewer parameters, similar performance:

| Architecture | Parameters | Gates | Key strength |
| --- | --- | --- | --- |
| Vanilla RNN | Fewest | None | Simple, fast; only for very short sequences |
| LSTM | Most | 3 gates + cell state | Best long-range memory; the pre-Transformer gold standard for sequences |
| GRU | Middle | 2 gates (reset, update) | Nearly LSTM quality; faster training, less memory |
| Bidirectional LSTM | 2× LSTM | 3 gates per direction | Sees full context; best for classification/NER (not generation) |
| Stacked LSTM | N× LSTM | 3 gates per layer | Hierarchical feature extraction; was SOTA in NLP before Transformers |
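The parameter counts in the table can be checked directly. With the same illustrative sizes (input 64, hidden 128), a GRU holds exactly 3× and an LSTM exactly 4× the weights of a vanilla RNN, one weight block per gate:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Same illustrative sizes (input 64, hidden 128) for a fair comparison.
rnn  = nn.RNN(64, 128)
gru  = nn.GRU(64, 128)
lstm = nn.LSTM(64, 128)

print(n_params(rnn), n_params(gru), n_params(lstm))
# → 24832 74496 99328  (GRU = 3×, LSTM = 4× the vanilla RNN)
```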

Still relevant in 2025

LSTMs/GRUs remain in use for streaming/low-latency scenarios where you process one token at a time without materializing the full sequence. On-device keyword spotting, real-time audio processing, and IoT sensor data often use LSTMs because they require O(1) memory and compute per step, whereas a Transformer's per-step cost grows with context length (its attention is O(T²) over the full sequence).
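The constant-per-step property can be sketched with PyTorch's nn.LSTMCell (illustrative sizes; the 1000-step stream is synthetic): only the fixed-size (h, c) state is carried, so memory does not grow with the length of the stream.

```python
import torch
import torch.nn as nn

# Streaming inference sketch: one input at a time, fixed-size state.
cell = nn.LSTMCell(input_size=8, hidden_size=32)   # illustrative sizes
h = torch.zeros(1, 32)
c = torch.zeros(1, 32)

stream = (torch.randn(1, 8) for _ in range(1000))  # e.g. sensor readings
with torch.no_grad():
    for x_t in stream:
        h, c = cell(x_t, (h, c))   # O(1) work per step; state shape fixed
print(h.shape)  # state is still (1, 32) after 1000 steps
```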

Why Transformers replaced RNNs for NLP

| Dimension | RNN / LSTM | Transformer |
| --- | --- | --- |
| Training parallelism | Sequential: step t depends on step t−1 | Fully parallel: all positions processed simultaneously |
| Training speed | Slow; the sequential bottleneck leaves GPUs underutilized | Much faster; parallelism saturates GPUs |
| Long-range dependencies | O(n) path length between distant tokens; hard to learn | O(1): direct attention between any two tokens |
| Memory during training | O(n): stores T hidden states | O(n²) attention matrix (reducible with FlashAttention) |
| Inference (streaming) | O(1) per step; ideal for real time | KV cache grows with context: O(n) memory, O(n) work per step |
| Interpretability | Hidden state is opaque | Attention weights are somewhat interpretable |

When to still use RNNs

Real-time audio (speech recognition on device), IoT sensor streams, on-chip keyword detection, robot motor control — anywhere you need constant-time per-step inference with a fixed memory footprint. Also: state space models (Mamba, RWKV) are 2024–2025 architectures that combine RNN-like O(1) inference with near-Transformer quality.

Practice questions

  1. What is the vanishing gradient problem in RNNs and why does it prevent learning long-range dependencies? (Answer: During backpropagation through time (BPTT), gradients are multiplied by the recurrent weight matrix W at each timestep. If the spectral radius of W < 1, gradients decay exponentially backward. For a 100-step sequence: gradient at step 1 ≈ (0.8)^100 ≈ 2×10⁻¹⁰ — effectively zero. The network cannot learn that what happened 50 timesteps ago matters for the current prediction. LSTM and GRU solve this with gated memory cells that maintain gradients across long sequences.)
  2. What is backpropagation through time (BPTT) and what is truncated BPTT? (Answer: BPTT: unroll the RNN for all T timesteps, apply standard backpropagation on the unrolled computational graph. For T=1000 steps, this creates a graph 1000 layers deep — memory-intensive and prone to vanishing/exploding gradients. Truncated BPTT: backpropagate only k steps (e.g., k=20) while still processing the full sequence forward. Reduces memory and gradient instability at the cost of not learning dependencies > k steps. Standard practice for training LSTMs/GRUs on long sequences.)
  3. How does the LSTM cell state (C_t) differ from the hidden state (h_t)? (Answer: Cell state C_t: the 'memory' of the LSTM — a direct information highway through time with minimal transformation (only multiplicative gates, no non-linear activation). Information can flow unchanged across many timesteps — the gradient highway that solves vanishing gradients. Hidden state h_t: the 'output' of the LSTM at each step — derived from cell state via tanh and output gate. Passed to the next timestep AND to the output layer. C_t is internal memory; h_t is the observable representation.)
  4. What is teacher forcing in RNN training and what is the exposure bias problem it creates? (Answer: Teacher forcing: during training, feed the ground truth token as input to the next RNN step (even if the model predicted wrong). Advantages: faster convergence, stable gradients. Exposure bias problem: during inference, there is no ground truth to feed — the model must use its own previous predictions. Training distribution (perfect inputs) ≠ inference distribution (own outputs). Accumulated errors cause performance degradation on long sequences. Solution: scheduled sampling (gradually replace teacher inputs with model outputs during training).)
  5. When would you choose an RNN/LSTM over a Transformer for sequence modelling? (Answer: Practical advantages of RNN/LSTM in 2025: (1) Online/streaming processing: RNNs process sequences incrementally with O(1) memory per step — Transformers require the full sequence in memory. (2) Very long sequences where Transformer O(n²) attention is prohibitive. (3) Edge deployment: RNNs are extremely memory-efficient for sequential prediction on constrained devices. (4) Some time series tasks: LSTM architectures specifically designed for temporal forecasting remain competitive. General NLP: Transformers universally win.)
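The truncated BPTT described in question 2 can be sketched as follows (assumed toy sizes; k = 20-step windows). The full sequence is processed forward, but the carried state is detach()-ed at each window boundary, so each backward pass stops after at most k steps:

```python
import torch
import torch.nn as nn

# Truncated BPTT sketch (illustrative sizes; k = 20 steps per window).
torch.manual_seed(0)
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
head = nn.Linear(20, 1)
opt = torch.optim.SGD(list(lstm.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(4, 100, 10)   # batch=4, seq_len=100 (synthetic data)
y = torch.randn(4, 100, 1)
k = 20
state = None
for start in range(0, 100, k):
    xb, yb = x[:, start:start + k], y[:, start:start + k]
    out, state = lstm(xb, state)
    loss = nn.functional.mse_loss(head(out), yb)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Detach so the next window's backward pass stops at this boundary.
    state = tuple(s.detach() for s in state)
```

Forward information still flows across the whole sequence through the carried state; only the gradient is cut, which is what bounds memory and limits learnable dependencies to roughly k steps.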

