
Recurrent Neural Network (RNN) & LSTM

The predecessor to Transformers for sequential data.


Definition

Recurrent Neural Networks (RNNs) are a class of neural networks designed for sequential data (text, time series, audio) by maintaining a hidden state that summarizes all previous inputs. Long Short-Term Memory (LSTM) networks added gating mechanisms to RNNs to better capture long-range dependencies. While largely superseded by Transformers for NLP, RNNs remain relevant for streaming/real-time applications.

How RNNs process sequences

Unlike feedforward networks that process fixed-size inputs independently, RNNs maintain a hidden state h_t that carries information through time. At each step, the new state depends on both the current input and the previous state:

Vanilla RNN update rule:

    h_t = tanh(W_xh · x_t + W_hh · h_{t−1} + b_h)

The same weight matrices (W_hh, W_xh) are shared across all time steps, enabling the network to process sequences of any length.

Minimal RNN cell from scratch vs PyTorch

import torch
import torch.nn as nn

# Manual RNN cell (illustrative)
class ManualRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_xh = nn.Linear(input_size, hidden_size)
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x, h_prev):
        # x: (batch, input_size)  h_prev: (batch, hidden_size)
        return torch.tanh(self.W_xh(x) + self.W_hh(h_prev))

# PyTorch built-in (optimized CUDA kernels, handles batching)
rnn = nn.RNN(input_size=64, hidden_size=128, num_layers=2,
             batch_first=True, bidirectional=True)
# Input: (batch, seq_len, input_size)
x = torch.randn(32, 50, 64)  # batch=32, seq_len=50
output, h_n = rnn(x)  # output: (32, 50, 256), h_n: (4, 32, 128)

The vanishing gradient problem in RNNs

Backpropagation Through Time (BPTT) unrolls the RNN over T steps and applies the chain rule. The gradient at step 1 requires multiplying T − 1 Jacobians together. If the recurrent weight matrix's eigenvalues have magnitude below 1, gradients shrink geometrically:

Gradient of the loss with respect to an early hidden state:

    ∂L/∂h_1 = ∂L/∂h_T · ∏_{t=2..T} ∂h_t/∂h_{t−1},  where ∂h_t/∂h_{t−1} = diag(tanh′) · W_hh

With T = 100 steps and eigenvalues of magnitude below 1, the product shrinks to 0 exponentially: the model effectively forgets early inputs.

Practical consequence

A vanilla RNN trained on sentences > ~20 tokens long will fail to learn long-range dependencies — a pronoun 50 words after its antecedent, a closing bracket 30 characters after the opening. LSTMs were invented specifically to solve this.
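A minimal sketch of this effect, using an untrained nn.RNN with hypothetical sizes (input 16, hidden 32): backpropagate from only the last timestep's output and measure how much gradient reaches the first input as the sequence grows.

```python
import torch
import torch.nn as nn

# Untrained vanilla RNN (illustrative sizes: input 16, hidden 32).
torch.manual_seed(0)
rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)

norms = {}
for T in (10, 50, 100):
    x = torch.randn(1, T, 16, requires_grad=True)
    output, _ = rnn(x)
    # Backprop from the LAST timestep's output only...
    output[:, -1].sum().backward()
    # ...and measure the gradient that reaches the FIRST input.
    norms[T] = x.grad[:, 0].norm().item()
    print(f"T={T:3d}  grad norm at step 1: {norms[T]:.2e}")
```

With PyTorch's default initialization the recurrent matrix's spectral radius is well below 1, so the printed norms collapse toward zero as T grows: exactly the vanishing-gradient behavior described above.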

LSTM: gated memory cells

LSTM (Hochreiter & Schmidhuber, 1997) adds a cell state C_t — a 'memory highway' with only linear interactions — controlled by three learned gates:

LSTM cell state update:

    C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t

The forget gate f_t (sigmoid, range 0–1) decides what fraction of the old memory to keep; the input gate i_t decides what new content to write. Because the cell state is updated through purely linear interactions, gradients can flow backward through time without vanishing.

| Gate | Range | Learns to |
| --- | --- | --- |
| Forget gate f_t | [0, 1] | Erase irrelevant memory (e.g., reset subject after a period) |
| Input gate i_t | [0, 1] | Write new information selectively |
| Candidate C̃_t | [−1, 1] | Compute what new content to potentially write |
| Output gate o_t | [0, 1] | Control what portion of memory to expose as the hidden state |
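A minimal LSTM cell in the style of the ManualRNNCell above (illustrative sketch; like PyTorch's own implementation, it computes all four gate pre-activations with one fused linear layer):

```python
import torch
import torch.nn as nn

# Illustrative LSTM cell, mirroring the ManualRNNCell earlier.
class ManualLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear layer produces i, f, g, o pre-activations at once.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, state):
        h_prev, c_prev = state
        z = self.gates(torch.cat([x, h_prev], dim=1))
        i, f, g, o = z.chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)            # candidate C̃_t, range (−1, 1)
        c = f * c_prev + i * g       # cell state: linear 'memory highway'
        h = o * torch.tanh(c)        # hidden state: gated view of the memory
        return h, c

cell = ManualLSTMCell(input_size=64, hidden_size=128)
x = torch.randn(32, 64)
h0 = torch.zeros(32, 128)
c0 = torch.zeros(32, 128)
h1, c1 = cell(x, (h0, c0))  # h1, c1: (32, 128)
```

Note how the cell state update `c = f * c_prev + i * g` contains no squashing non-linearity on c_prev itself, which is what keeps the gradient path open.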

GRUs and modern variants

GRU (Gated Recurrent Unit, Cho et al., 2014) simplifies LSTM by merging forget+input gates into a single update gate and eliminating the separate cell state — fewer parameters, similar performance:

| Architecture | Parameters | Gates | Key strength |
| --- | --- | --- | --- |
| Vanilla RNN | Fewest | None | Simple, fast; only for very short sequences |
| LSTM | Most | 3 gates + cell state | Best long-range memory; the pre-Transformer gold standard for sequences |
| GRU | Middle | 2 gates (reset, update) | Nearly LSTM quality; faster training, less memory |
| Bidirectional LSTM | 2× LSTM | 3 gates per direction | Sees full context; best for classification/NER (not generation) |
| Stacked LSTM | N× LSTM | 3 gates per layer | Hierarchical feature extraction; was SOTA in NLP before Transformers |
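The parameter counts in the table can be checked directly. With the same illustrative sizes (input 64, hidden 128), a GRU holds exactly 3× and an LSTM exactly 4× the weights of a vanilla RNN, one weight block per gate:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Same illustrative sizes (input 64, hidden 128) for a fair comparison.
rnn  = nn.RNN(64, 128)
gru  = nn.GRU(64, 128)
lstm = nn.LSTM(64, 128)

print(n_params(rnn), n_params(gru), n_params(lstm))
# → 24832 74496 99328  (GRU = 3×, LSTM = 4× the vanilla RNN)
```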

Still relevant in 2025

LSTMs/GRUs remain in use for streaming/low-latency scenarios where you process one token at a time without materializing the full sequence. On-device keyword spotting, real-time audio processing, and IoT sensor data often use LSTMs because they require O(1) memory and compute per step, whereas a Transformer's per-step cost grows with context length (its attention is O(T²) over the full sequence).
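The constant-per-step property can be sketched with PyTorch's nn.LSTMCell (illustrative sizes; the 1000-step stream is synthetic): only the fixed-size (h, c) state is carried, so memory does not grow with the length of the stream.

```python
import torch
import torch.nn as nn

# Streaming inference sketch: one input at a time, fixed-size state.
cell = nn.LSTMCell(input_size=8, hidden_size=32)   # illustrative sizes
h = torch.zeros(1, 32)
c = torch.zeros(1, 32)

stream = (torch.randn(1, 8) for _ in range(1000))  # e.g. sensor readings
with torch.no_grad():
    for x_t in stream:
        h, c = cell(x_t, (h, c))   # O(1) work per step; state shape fixed
print(h.shape)  # state is still (1, 32) after 1000 steps
```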

Why Transformers replaced RNNs for NLP

| Dimension | RNN / LSTM | Transformer |
| --- | --- | --- |
| Training parallelism | Sequential: step t depends on step t−1 | Fully parallel: all positions processed simultaneously |
| Training speed | Slow; the sequential bottleneck leaves GPUs underutilized | Much faster; parallelism saturates GPUs |
| Long-range dependencies | O(n) path length between distant tokens; hard to learn | O(1): direct attention between any two tokens |
| Memory during training | O(n): stores T hidden states | O(n²) attention matrix (reducible with FlashAttention) |
| Inference (streaming) | O(1) per step; ideal for real time | KV cache grows with context: O(n) memory, O(n) work per step |
| Interpretability | Hidden state is opaque | Attention weights are somewhat interpretable |

When to still use RNNs

Real-time audio (speech recognition on device), IoT sensor streams, on-chip keyword detection, robot motor control — anywhere you need constant-time per-step inference with a fixed memory footprint. Also: state space models (Mamba, RWKV) are 2024–2025 architectures that combine RNN-like O(1) inference with near-Transformer quality.

Practice questions

  1. What is the vanishing gradient problem in RNNs and why does it prevent learning long-range dependencies? (Answer: During backpropagation through time (BPTT), gradients are multiplied by the recurrent weight matrix W at each timestep. If the spectral radius of W < 1, gradients decay exponentially backward. For a 100-step sequence: gradient at step 1 ≈ (0.8)^100 ≈ 2×10⁻¹⁰ — effectively zero. The network cannot learn that what happened 50 timesteps ago matters for the current prediction. LSTM and GRU solve this with gated memory cells that maintain gradients across long sequences.)
  2. What is backpropagation through time (BPTT) and what is truncated BPTT? (Answer: BPTT: unroll the RNN for all T timesteps, apply standard backpropagation on the unrolled computational graph. For T=1000 steps, this creates a graph 1000 layers deep — memory-intensive and prone to vanishing/exploding gradients. Truncated BPTT: backpropagate only k steps (e.g., k=20) while still processing the full sequence forward. Reduces memory and gradient instability at the cost of not learning dependencies > k steps. Standard practice for training LSTMs/GRUs on long sequences.)
  3. How does the LSTM cell state (C_t) differ from the hidden state (h_t)? (Answer: Cell state C_t: the 'memory' of the LSTM — a direct information highway through time with minimal transformation (only multiplicative gates, no non-linear activation). Information can flow unchanged across many timesteps — the gradient highway that solves vanishing gradients. Hidden state h_t: the 'output' of the LSTM at each step — derived from cell state via tanh and output gate. Passed to the next timestep AND to the output layer. C_t is internal memory; h_t is the observable representation.)
  4. What is teacher forcing in RNN training and what is the exposure bias problem it creates? (Answer: Teacher forcing: during training, feed the ground truth token as input to the next RNN step (even if the model predicted wrong). Advantages: faster convergence, stable gradients. Exposure bias problem: during inference, there is no ground truth to feed — the model must use its own previous predictions. Training distribution (perfect inputs) ≠ inference distribution (own outputs). Accumulated errors cause performance degradation on long sequences. Solution: scheduled sampling (gradually replace teacher inputs with model outputs during training).)
  5. When would you choose an RNN/LSTM over a Transformer for sequence modelling? (Answer: Practical advantages of RNN/LSTM in 2025: (1) Online/streaming processing: RNNs process sequences incrementally with O(1) memory per step — Transformers require the full sequence in memory. (2) Very long sequences where Transformer O(n²) attention is prohibitive. (3) Edge deployment: RNNs are extremely memory-efficient for sequential prediction on constrained devices. (4) Some time series tasks: LSTM architectures specifically designed for temporal forecasting remain competitive. General NLP: Transformers universally win.)
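The truncated BPTT described in question 2 can be sketched as follows (assumed toy sizes; k = 20-step windows). The full sequence is processed forward, but the carried state is detach()-ed at each window boundary, so each backward pass stops after at most k steps:

```python
import torch
import torch.nn as nn

# Truncated BPTT sketch (illustrative sizes; k = 20 steps per window).
torch.manual_seed(0)
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
head = nn.Linear(20, 1)
opt = torch.optim.SGD(list(lstm.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(4, 100, 10)   # batch=4, seq_len=100 (synthetic data)
y = torch.randn(4, 100, 1)
k = 20
state = None
for start in range(0, 100, k):
    xb, yb = x[:, start:start + k], y[:, start:start + k]
    out, state = lstm(xb, state)
    loss = nn.functional.mse_loss(head(out), yb)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Detach so the next window's backward pass stops at this boundary.
    state = tuple(s.detach() for s in state)
```

Forward information still flows across the whole sequence through the carried state; only the gradient is cut, which is what bounds memory and limits learnable dependencies to roughly k steps.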

