Deep learning transforms raw data into intelligent systems across three mega-domains: Natural Language Processing (understanding and generating text — chatbots, translation, summarisation), Speech Recognition (converting audio to text and text to speech — voice assistants, transcription), and Recommendation Systems (personalising content — Netflix, Spotify, Amazon, YouTube). Each domain has evolved from hand-crafted features to deep learning pipelines that achieve superhuman performance on narrow tasks.
NLP applications of deep learning
| NLP Application | Deep Learning model | Input | Output |
|---|---|---|---|
| Machine Translation | Transformer (MarianMT, NLLB) | Source language text | Target language text |
| Text Summarisation | BART, T5, Pegasus | Long document | Short summary |
| Question Answering | BERT-SQuAD, RoBERTa | Context + question | Answer span or generated text |
| Sentiment Analysis | Fine-tuned BERT/DistilBERT | Review/tweet text | Positive/Negative/Neutral |
| Named Entity Recognition | BERT with token classification | Text | Token labels (PER, ORG, LOC) |
| Text Generation | GPT-4, Claude, Llama, Mistral | Prompt | Continuation / response |
| Text Classification | BERT, FastText | Document | Category label |
NLP deep learning pipeline with Hugging Face
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
import torch
# ── Sentiment Analysis ──
sentiment_clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
results = sentiment_clf([
    "This product is absolutely amazing, I love it!",
    "Worst purchase I have ever made. Do not buy.",
])
for r in results:
    print(f"{r['label']:<10} ({r['score']:.2%})")
# ── Named Entity Recognition ──
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
text = "Apple CEO Tim Cook announced new products at the iPhone 16 event in Cupertino."
entities = ner(text)
for e in entities:
    print(f"{e['entity_group']}: {e['word']} ({e['score']:.2%})")
# ORG: Apple, PER: Tim Cook, MISC: iPhone 16, LOC: Cupertino
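The `aggregation_strategy="simple"` option above merges per-token BIO labels into whole entities. A minimal sketch of that merging logic in plain Python (the word and tag lists here are illustrative, not actual pipeline output):

```python
def bio_to_spans(words, tags):
    """Merge BIO tags (B-X begins an entity, I-X continues it, O is outside)
    into (entity_type, text) spans."""
    spans = []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            spans.append([tag[2:], [word]])           # start a new entity
        elif tag.startswith("I-") and spans and spans[-1][0] == tag[2:]:
            spans[-1][1].append(word)                 # extend the current entity
    return [(etype, " ".join(ws)) for etype, ws in spans]

words = ["Tim", "Cook", "is", "the", "CEO", "of", "Apple"]
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG"]
print(bio_to_spans(words, tags))  # [('PER', 'Tim Cook'), ('ORG', 'Apple')]
```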
# ── Text Summarisation ──
summariser = pipeline("summarization", model="facebook/bart-large-cnn")
article = """The transformer architecture revolutionised natural language processing
by replacing recurrent networks with self-attention mechanisms. This allows
parallel processing of entire sequences simultaneously, dramatically reducing
training time and enabling the development of much larger language models..."""
summary = summariser(article, max_length=80, min_length=30)[0]['summary_text']
print(f"Summary: {summary}")
# ── Zero-Shot Classification ──
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The new iPhone has a longer battery life and better camera.",
    candidate_labels=["technology", "sports", "politics", "finance"],
)
print(f"Label: {result['labels'][0]} ({result['scores'][0]:.2%})")
Speech recognition and synthesis
Automatic Speech Recognition (ASR) converts audio waveforms to text. The pipeline: audio → mel spectrogram (feature extraction) → deep model (CNN+RNN or Transformer) → text via a CTC or attention decoder. Modern state of the art: Whisper (OpenAI), an encoder-decoder transformer trained on 680k hours of multilingual audio, reaching near-human transcription accuracy on English benchmarks and supporting 99 languages.
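The CTC decoding step mentioned above can be illustrated with a minimal greedy decoder: take the most likely symbol per audio frame, collapse consecutive repeats, then drop the blank token. This is a sketch of plain CTC decoding, not Whisper's decoder (Whisper uses an attention-based decoder instead):

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Collapse repeated symbols, then remove CTC blank tokens."""
    decoded, prev = [], None
    for sym in frame_labels:
        if sym != prev and sym != blank:
            decoded.append(sym)
        prev = sym
    return "".join(decoded)

# Hypothetical per-frame argmax output of an acoustic model:
frames = list("hh-e-ll-l-oo")
print(ctc_greedy_decode(frames))  # hello
```

Note how the blank token lets CTC distinguish a genuine double letter ("ll") from one letter stretched across frames.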
Speech recognition with OpenAI Whisper
# OpenAI Whisper: end-to-end speech recognition
# Architecture: CNN feature extractor + Transformer encoder + decoder
import whisper
import numpy as np
# Load model (sizes: tiny/base/small/medium/large)
model = whisper.load_model("base")  # 74M params, ~1 GB VRAM, roughly 16x faster than large
# Transcribe audio file
result = model.transcribe("audio.mp3")
print(f"Text: {result['text']}")
print(f"Language: {result['language']}")
# With timestamps
result_ts = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result_ts['segments']:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s]: {segment['text']}")
# Lower-level API: manual audio loading and feature extraction
import torch
audio = whisper.load_audio("audio.mp3") # Load and resample to 16kHz
audio = whisper.pad_or_trim(audio) # Pad/trim to 30s (max segment)
mel = whisper.log_mel_spectrogram(audio).to(model.device) # 80-band mel
# Whisper architecture for speech:
#   Audio → 80-channel mel spectrogram → CNN → Transformer Encoder
#   Encoder states → Transformer Decoder → tokens → text
# Text-to-Speech (TTS) with deep learning, e.g. Coqui TTS:
# from TTS.api import TTS
# tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
# tts.tts_to_file("Hello, this is deep learning generated speech!", file_path="output.wav")
Recommendation systems with deep learning
Modern recommendation systems use deep learning embeddings to represent users and items in a shared latent space — items similar to what a user likes are nearby in embedding space. Collaborative Filtering (matrix factorisation with neural embeddings) powers Netflix and Spotify. Two-Tower models encode queries and candidates into embedding spaces for fast retrieval. Transformers for recommendations (BERT4Rec, SASRec) model sequential user behaviour.
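The Two-Tower pattern mentioned above can be sketched as follows. This is an illustrative toy with made-up dimensions, not a production design; real systems index the precomputed item embeddings with an approximate nearest-neighbour library such as FAISS rather than a full matrix multiply:

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    """Toy two-tower retrieval model: separate encoders map users and
    items into one shared embedding space; relevance = dot product."""
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.user_tower = nn.Sequential(nn.Embedding(n_users, 64), nn.Linear(64, dim))
        self.item_tower = nn.Sequential(nn.Embedding(n_items, 64), nn.Linear(64, dim))

    def score(self, user_ids, item_ids):
        return (self.user_tower(user_ids) * self.item_tower(item_ids)).sum(-1)

model = TwoTower(n_users=1000, n_items=5000)
# Offline: precompute every item embedding once
with torch.no_grad():
    item_index = model.item_tower(torch.arange(5000))   # (5000, 32)
# Online: run only the user tower, then a fast dot-product search
with torch.no_grad():
    u = model.user_tower(torch.tensor([42]))            # (1, 32)
    scores = u @ item_index.T                           # (1, 5000)
    top10 = torch.topk(scores, k=10).indices.squeeze(0)
print(top10.tolist())
```

The speed win is in the split: the expensive item tower runs offline in batch, so serving a request costs one user-tower forward pass plus a nearest-neighbour lookup.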
Neural collaborative filtering with PyTorch
import torch
import torch.nn as nn
class NeuralCF(nn.Module):
    """Neural Collaborative Filtering: learns user-item interactions via embeddings."""
    def __init__(self, n_users, n_items, embed_dim=64, hidden_dims=(128, 64, 32)):
        super().__init__()
        # Separate embeddings for users and items
        self.user_embed = nn.Embedding(n_users, embed_dim)
        self.item_embed = nn.Embedding(n_items, embed_dim)
        # MLP layers for learning complex interactions
        layers = []
        input_dim = embed_dim * 2  # Concat user + item embeddings
        for h_dim in hidden_dims:
            layers += [nn.Linear(input_dim, h_dim), nn.ReLU(), nn.Dropout(0.2)]
            input_dim = h_dim
        layers.append(nn.Linear(input_dim, 1))
        layers.append(nn.Sigmoid())  # Output: probability of interaction
        self.mlp = nn.Sequential(*layers)

    def forward(self, user_ids, item_ids):
        user_emb = self.user_embed(user_ids)        # (batch, embed_dim)
        item_emb = self.item_embed(item_ids)        # (batch, embed_dim)
        x = torch.cat([user_emb, item_emb], dim=1)  # (batch, 2*embed_dim)
        return self.mlp(x).squeeze(1)               # (batch,) interaction probability
# Training
n_users, n_items = 10000, 50000
model = NeuralCF(n_users, n_items)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
loss_fn = nn.BCELoss()
# Sample: user 42 interacted with items 100, 200, 300 (positive)
# Negative sampling: user 42 did NOT interact with 400, 500, 600
users_pos = torch.tensor([42, 42, 42])
items_pos = torch.tensor([100, 200, 300])
users_neg = torch.tensor([42, 42, 42])
items_neg = torch.tensor([400, 500, 600])
users = torch.cat([users_pos, users_neg])
items = torch.cat([items_pos, items_neg])
labels = torch.cat([torch.ones(3), torch.zeros(3)])
preds = model(users, items)
loss = loss_fn(preds, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"Batch loss: {loss.item():.4f}")
# Inference: get top-N recommendations for user 42
user_tensor = torch.tensor([42] * n_items)
item_tensor = torch.arange(n_items)
with torch.no_grad():
    scores = model(user_tensor, item_tensor)
    top_items = torch.topk(scores, k=10).indices
print(f"Top 10 recommendations for user 42: {top_items.tolist()}")
Practice questions
- Named Entity Recognition uses token classification. What does this mean? (Answer: Instead of one label per sentence (like sentiment analysis), NER assigns a label to EVERY token: "Tim/B-PER Cook/I-PER is/O the/O CEO/O of/O Apple/B-ORG". B- = beginning of entity, I- = inside entity, O = not an entity. It is a sequence-labelling task: one classification per token.)
- Whisper uses a mel spectrogram as input. What is a mel spectrogram? (Answer: A mel spectrogram converts audio into a 2D image (time × frequency) using the mel scale — a perceptual scale that matches human hearing (logarithmic). 80 frequency bands, sampled every ~10ms, gives a (80, T) matrix. CNNs can extract features from this representation just like from images.)
- Why do recommendation systems use embeddings instead of one-hot encoding for users and items? (Answer: With 10M users and 50M items, one-hot vectors are 60M-dimensional — intractable. Embeddings map each entity to a dense 64-512 dimensional vector that captures latent characteristics. Similar users/items have similar embeddings. Also enables learning complex non-linear interactions via MLP layers.)
- BERT-based NLP models are "fine-tuned" for downstream tasks. What does this mean? (Answer: BERT is pre-trained on masked language modelling (self-supervised). For a downstream task (sentiment, NER, QA), you add a task-specific head (linear layer) and continue training on the labelled task data with a small learning rate. The entire model updates, but starting from pre-trained weights — much better than random initialisation.)
- Two-Tower architecture in recommendations: what are the two towers and why is it fast? (Answer: Tower 1: User encoder — maps user context to embedding. Tower 2: Item encoder — maps item features to embedding. Similarity = dot product of both embeddings. Speed: item embeddings are pre-computed and indexed (FAISS). At query time: only user tower runs online, then fast nearest-neighbour search retrieves top items. O(1) per user, not O(n_items).)
On LumiChats
LumiChats is built on all three application domains: NLP (transformer-based text generation), speech (Whisper-powered voice input), and recommendations (surfacing relevant features and responses based on your context). Understanding these applications explains the capabilities and limitations of every AI system you use.
Try it free