Deep Learning Applications — NLP, Speech Recognition & Recommendation Systems
Deep learning transforms raw data into intelligent systems across three major domains: Natural Language Processing (understanding and generating text — chatbots, translation, summarization), Speech Recognition (converting audio to text and text to speech — voice assistants, transcription), and Recommendation Systems (personalizing content — Netflix, Spotify, Amazon, YouTube). Each domain has evolved from hand-crafted features to end-to-end deep learning pipelines that match or exceed human performance on narrow, well-defined tasks.
Where deep learning creates real-world impact — language, voice, and personalization.
Category: Deep Learning & Neural Networks
NLP applications of deep learning
| NLP Application | Deep Learning Model | Input | Output |
|---|---|---|---|
| Machine Translation | Transformer (MarianMT, NLLB) | Source language text | Target language text |
| Text Summarization | BART, T5, Pegasus | Long document | Short summary |
| Question Answering | BERT-SQuAD, RoBERTa | Context + question | Answer span or generated text |
| Sentiment Analysis | Fine-tuned BERT/DistilBERT | Review/tweet text | Positive/Negative/Neutral |
| Named Entity Recognition | BERT with token classification | Text | Token labels (PER, ORG, LOC) |
| Text Generation | GPT-4, Claude, Llama, Mistral | Prompt | Continuation / response |
| Text Classification | BERT, FastText | Document | Category label |
from transformers import pipeline
# ── Sentiment Analysis ──
sentiment_clf = pipeline("sentiment-analysis",
model="distilbert-base-uncased-finetuned-sst-2-english")
results = sentiment_clf([
"This product is absolutely amazing, I love it!",
"Worst purchase I have ever made. Do not buy.",
])
for r in results:
print(f"{r['label']:<10} ({r['score']:.2%})")
# ── Named Entity Recognition ──
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
text = "Apple CEO Tim Cook announced new products at the iPhone 16 event in Cupertino."
entities = ner(text)
for e in entities:
print(f"{e['entity_group']}: {e['word']} ({e['score']:.2%})")
# ORG: Apple, PER: Tim Cook, MISC: iPhone 16, LOC: Cupertino
# ── Text Summarization ──
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = """The transformer architecture revolutionised natural language processing
by replacing recurrent networks with self-attention mechanisms. This allows
parallel processing of entire sequences simultaneously, dramatically reducing
training time and enabling the development of much larger language models..."""
summary = summarizer(article, max_length=80, min_length=30)[0]['summary_text']
print(f"Summary: {summary}")
# ── Zero-Shot Classification ──
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
"The new iPhone has a longer battery life and better camera.",
candidate_labels=["technology", "sports", "politics", "finance"]
)
print(f"Label: {result['labels'][0]} ({result['scores'][0]:.2%})")
Speech recognition and synthesis
Automatic Speech Recognition (ASR) converts audio waveforms to text. The pipeline: audio → mel spectrogram (feature extraction) → deep model (CNN+RNN or Transformer) → text via a CTC or attention decoder. A prominent modern model is Whisper (OpenAI): an encoder-decoder transformer trained on 680,000 hours of multilingual audio that supports transcription in roughly 99 languages and approaches human-level accuracy on English benchmarks.
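The CTC decoding mentioned above is easy to illustrate. Below is a minimal sketch of greedy CTC decoding; the toy vocabulary and frame probabilities are invented for the example (Whisper itself uses an attention decoder instead). A CTC model emits one token per audio frame, so decoding collapses consecutive repeats and then drops the blank token.
import torch

def ctc_greedy_decode(log_probs, blank=0):
    """Greedy CTC decode. log_probs: (T, vocab) frame-level log-probabilities."""
    best = log_probs.argmax(dim=-1).tolist()  # most likely token per frame
    decoded, prev = [], blank
    for tok in best:
        if tok != prev and tok != blank:      # collapse repeats, skip blanks
            decoded.append(tok)
        prev = tok
    return decoded

# Toy example: 6 frames, vocabulary {0: blank, 1: 'H', 2: 'I'}
frames = torch.log(torch.tensor([
    [0.1, 0.8, 0.1],   # 'H'
    [0.1, 0.8, 0.1],   # 'H' again (collapsed as a repeat)
    [0.8, 0.1, 0.1],   # blank
    [0.1, 0.1, 0.8],   # 'I'
    [0.8, 0.1, 0.1],   # blank
    [0.8, 0.1, 0.1],   # blank
]))
print(ctc_greedy_decode(frames))  # [1, 2] → "HI"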
# OpenAI Whisper: end-to-end speech recognition
# Architecture: CNN feature extractor + Transformer encoder + decoder
import whisper
import numpy as np
# Load model (tiny/base/small/medium/large)
model = whisper.load_model("base")  # 74M params, ~1 GB VRAM, ~16x speed relative to large
# Transcribe audio file
result = model.transcribe("audio.mp3")
print(f"Text: {result['text']}")
print(f"Language: {result['language']}")
# With timestamps
result_ts = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result_ts['segments']:
print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s]: {segment['text']}")
# Lower-level API: manual audio loading and feature extraction
audio = whisper.load_audio("audio.mp3") # Load and resample to 16kHz
audio = whisper.pad_or_trim(audio) # Pad/trim to 30s (max segment)
mel = whisper.log_mel_spectrogram(audio).to(model.device) # 80-band mel
# Whisper architecture for speech:
# Audio → 80-channel mel spectrogram → CNN → Transformer Encoder
# ↓
# Transformer Decoder → tokens → text
# Text-to-Speech (TTS) with deep learning, e.g. Coqui TTS:
# from TTS.api import TTS
# tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
# tts.tts_to_file(text="Hello, this is deep learning generated speech!", file_path="output.wav")
Recommendation systems with deep learning
Modern recommendation systems use deep learning embeddings to represent users and items in a shared latent space — items similar to what a user likes are nearby in embedding space. Collaborative filtering with neural embeddings (a learned generalization of matrix factorization) powers services like Netflix and Spotify. Two-Tower models encode queries and candidates into the same embedding space for fast retrieval (a minimal sketch follows the NeuralCF example below). Transformers for recommendation (BERT4Rec, SASRec) model sequential user behavior.
import torch
import torch.nn as nn
class NeuralCF(nn.Module):
"""Neural Collaborative Filtering: learns user-item interactions via embeddings."""
    def __init__(self, n_users, n_items, embed_dim=64, hidden_dims=(128, 64, 32)):
super().__init__()
# Separate embeddings for users and items
self.user_embed = nn.Embedding(n_users, embed_dim)
self.item_embed = nn.Embedding(n_items, embed_dim)
# MLP layers for learning complex interactions
layers = []
input_dim = embed_dim * 2 # Concat user + item embeddings
for h_dim in hidden_dims:
layers += [nn.Linear(input_dim, h_dim), nn.ReLU(), nn.Dropout(0.2)]
input_dim = h_dim
layers.append(nn.Linear(input_dim, 1))
layers.append(nn.Sigmoid()) # Output: probability of interaction
self.mlp = nn.Sequential(*layers)
def forward(self, user_ids, item_ids):
user_emb = self.user_embed(user_ids) # (batch, embed_dim)
item_emb = self.item_embed(item_ids) # (batch, embed_dim)
x = torch.cat([user_emb, item_emb], dim=1) # (batch, 2×embed_dim)
return self.mlp(x).squeeze(1) # (batch,) — interaction probability
# Training
n_users, n_items = 10000, 50000
model = NeuralCF(n_users, n_items)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
loss_fn = nn.BCELoss()
# Sample: user 42 interacted with items 100, 200, 300 (positive)
# Negative sampling: user 42 did NOT interact with 400, 500, 600
users_pos = torch.tensor([42, 42, 42])
items_pos = torch.tensor([100, 200, 300])
users_neg = torch.tensor([42, 42, 42])
items_neg = torch.tensor([400, 500, 600])
users = torch.cat([users_pos, users_neg])
items = torch.cat([items_pos, items_neg])
labels = torch.cat([torch.ones(3), torch.zeros(3)])
optimizer.zero_grad()                # clear gradients from any previous step
preds = model(users, items)
loss = loss_fn(preds, labels)
loss.backward(); optimizer.step()
print(f"Batch loss: {loss.item():.4f}")
# Inference: get top-N recommendations for user 42
user_tensor = torch.tensor([42] * n_items)
item_tensor = torch.arange(n_items)
model.eval()                         # disable dropout for deterministic scoring
with torch.no_grad():
scores = model(user_tensor, item_tensor)
top_items = torch.topk(scores, k=10).indices
print(f"Top 10 recommendations for user 42: {top_items.tolist()}")
Practice questions
- Named Entity Recognition uses token classification. What does this mean? (Answer: Instead of one label per sentence (like sentiment analysis), NER assigns a label to EVERY token: "Tim/B-PER Cook/I-PER is/O the/O CEO/O of/O Apple/B-ORG". B- = beginning of an entity, I- = inside an entity, O = not an entity. It is a sequence labeling task: classification at the token level, not the sentence level.)
- Whisper uses a mel spectrogram as input. What is a mel spectrogram? (Answer: A mel spectrogram converts audio into a 2D array (frequency × time) using the mel scale — a perceptual scale that matches human hearing (logarithmic). 80 frequency bands, sampled every ~10ms, gives an (80, T) matrix. CNNs can extract features from this representation just like from images.)
- Why do recommendation systems use embeddings instead of one-hot encoding for users and items? (Answer: With 10M users and 50M items, one-hot vectors are 60M-dimensional — intractable. Embeddings map each entity to a dense 64-512 dimensional vector that captures latent characteristics. Similar users/items have similar embeddings. Also enables learning complex non-linear interactions via MLP layers.)
- BERT-based NLP models are "fine-tuned" for downstream tasks. What does this mean? (Answer: BERT is pre-trained with masked language modeling (self-supervised). For a downstream task (sentiment, NER, QA), you add a task-specific head (a linear layer) and continue training on the labeled task data with a small learning rate. The entire model updates, but it starts from pre-trained weights — much better than random initialization. See the sketch after these questions.)
- Two-Tower architecture in recommendations: what are the two towers and why is it fast? (Answer: Tower 1: the user encoder maps user context to an embedding. Tower 2: the item encoder maps item features to an embedding. Similarity = dot product of the two embeddings. Speed: item embeddings are pre-computed and indexed (e.g. FAISS). At query time only the user tower runs online, followed by a fast approximate nearest-neighbor search: one forward pass plus a sub-linear lookup per query, instead of scoring all n_items with a full network.)
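As a companion to the fine-tuning question above, here is a minimal sketch using Hugging Face Transformers. AutoModelForSequenceClassification loads pre-trained BERT weights and adds a randomly initialized classification head; the two-example batch below is placeholder data for illustration only.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# One fine-tuning step on a toy labeled batch (placeholder data)
batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])                               # 1 = positive, 0 = negative
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # small LR, as the answer notes
loss = model(**batch, labels=labels).loss                   # cross-entropy from the new head
loss.backward(); optimizer.step()
print(f"Fine-tuning step loss: {loss.item():.4f}")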
LumiChats is built on all three application domains: NLP (transformer-based text generation), speech (Whisper-powered voice input), and recommendations (surfacing relevant features and responses based on your context). Understanding these applications explains the capabilities and limitations of every AI system you use.