
Contextual & Sentence Embeddings — ELMo, BERT, Sentence-BERT

From static word vectors to dynamic representations that understand context.


Definition

Static word embeddings (Word2Vec, GloVe) assign one vector per word regardless of context — 'bank' has the same vector in 'river bank' and 'bank account'. Contextual embeddings (ELMo, BERT, GPT) generate different vectors for the same word based on its surrounding context, dramatically improving performance on disambiguation tasks. Sentence embeddings (Sentence-BERT, E5, GTE) map entire sentences to dense vectors, enabling semantic similarity search, clustering, and retrieval-augmented generation. These are the core of modern NLP systems.

Real-life analogy: The chameleon word

The word 'bank' changes meaning completely based on context. Static embeddings: 'bank' always points to the same location on the word map — somewhere between 'finance' and 'river'. Contextual embeddings: 'bank' in 'I deposited money at the bank' is pulled toward the finance cluster. 'Bank' in 'we sat on the river bank' is pulled toward the geography cluster. The same word gets a completely different vector based on the sentence around it.

From static to contextual embeddings

Static vs contextual embeddings — the bank example

import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# ── STATIC: Word2Vec (same vector regardless of context) ──
try:
    import gensim.downloader as api
    w2v = api.load('word2vec-google-news-300')
    bank_vec = w2v['bank']   # Always the same 300-dim vector
    print(f"Static 'bank' vector (same for all contexts): {bank_vec[:5]}")
    print(f"Similarity: bank ~ money: {w2v.similarity('bank', 'money'):.3f}")
    print(f"Similarity: bank ~ river: {w2v.similarity('bank', 'river'):.3f}")
    # Both ~0.35 — static embedding is an average of all senses
except Exception as exc:
    print(f"Word2Vec step skipped ({exc}) — install gensim to run it")

# ── CONTEXTUAL: BERT embeddings (different vectors per context) ──
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model     = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def get_contextual_embedding(sentence: str, target_word: str) -> np.ndarray:
    """Get BERT embedding for target_word in the context of sentence."""
    inputs = tokenizer(sentence, return_tensors='pt')
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    # Find position of target word
    target_idx = next((i for i, t in enumerate(tokens) if target_word.lower() in t), None)
    if target_idx is None: return None

    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, target_idx].numpy()

# Same word, different contexts
sent_finance = "I deposited my salary at the bank yesterday"
sent_river   = "We sat on the grassy bank by the river"

bank_finance = get_contextual_embedding(sent_finance, 'bank')
bank_river   = get_contextual_embedding(sent_river,   'bank')

if bank_finance is not None and bank_river is not None:
    sim = cosine_similarity([bank_finance], [bank_river])[0][0]
    print(f"\nBERT 'bank' contextual similarity (finance vs river): {sim:.3f}")
    # ~0.55 — different contexts → different vectors (static would be 1.0)

Sentence embeddings with Sentence-BERT

Standard BERT embeddings work well for word-level tasks but poorly for sentence-level similarity. Scoring every sentence pair with BERT as a cross-encoder requires a forward pass per pair, and naively comparing [CLS] tokens (or averaged token outputs) from independent passes does not produce well-separated sentence embeddings. Sentence-BERT (SBERT) fine-tunes BERT with a Siamese network on NLI + STS data, producing sentence embeddings where cosine similarity directly reflects semantic similarity.
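Under the hood, SBERT typically turns BERT's per-token outputs into a single sentence vector by mean pooling over the attention mask, so padding tokens are ignored. A minimal numpy sketch of that pooling step (toy 3-dimensional vectors stand in for real BERT outputs):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, excluding padding positions (mask == 0)."""
    mask   = attention_mask[:, :, None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)      # (batch, dim)
    counts = mask.sum(axis=1).clip(min=1e-9)            # avoid division by zero
    return summed / counts

# Toy batch: 1 sentence, 4 tokens (the last is padding), 3-dim vectors
tokens = np.array([[[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [9., 9., 9.]]])
mask   = np.array([[1, 1, 1, 0]])
print(mean_pool(tokens, mask))   # padding row is excluded from the average
```

The padding vector [9, 9, 9] has no effect on the result — only the three real tokens are averaged.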

Sentence-BERT for semantic similarity and document search

from sentence_transformers import SentenceTransformer, util
import numpy as np

# Load pre-trained sentence transformer
model = SentenceTransformer('all-MiniLM-L6-v2')  # 80MB, 384-dim, fast

# ── Semantic similarity ──
sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",     # Semantically similar
    "The dog played in the park",     # Different animal, different action
    "Stock markets fell sharply today", # Completely different topic
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Compare all pairs
for i in range(len(sentences)):
    for j in range(i+1, len(sentences)):
        sim = util.cos_sim(embeddings[i], embeddings[j]).item()
        print(f"{sim:.3f}: '{sentences[i][:40]}' ↔ '{sentences[j][:40]}'")
# 0.854: cat sat... ↔ feline rested...  (high similarity)
# 0.221: cat sat... ↔ stock markets...  (low similarity)

# ── Semantic document search (basis of RAG) ──
corpus = [
    "Paris is the capital city of France and its largest metropolitan area",
    "The Eiffel Tower is an iron lattice tower in Paris built in 1889",
    "Python is a high-level programming language used for data science",
    "Machine learning uses statistical methods to enable computers to learn",
    "The Seine river flows through Paris and into the English Channel",
]
corpus_embeddings = model.encode(corpus, normalize_embeddings=True)

def semantic_search(query: str, top_k: int = 3):
    query_emb = model.encode(query, normalize_embeddings=True)
    scores    = util.cos_sim(query_emb, corpus_embeddings)[0]
    top       = scores.topk(top_k)
    print(f"\nQuery: '{query}'")
    for score, idx in zip(top.values, top.indices):
        print(f"  {score:.3f}: {corpus[idx]}")

semantic_search("What can I visit in Paris?")
semantic_search("How do I program a machine learning model?")
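Because the snippets above pass normalize_embeddings=True, every vector has unit length, so cosine similarity reduces to a plain dot product — and searching a whole corpus becomes a single matrix multiply. A minimal numpy sketch of that trick (random vectors stand in for real embeddings):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng    = np.random.default_rng(0)
corpus = normalize(rng.normal(size=(1000, 384)))   # stand-in corpus embeddings
query  = normalize(rng.normal(size=(384,)))        # stand-in query embedding

scores = corpus @ query                 # unit vectors: dot product == cosine
top3   = np.argsort(-scores)[:3]        # indices of the 3 most similar docs
print(top3, scores[top3])
```

At larger scale the same idea backs ANN indexes (FAISS, HNSW), which approximate this exact search in sublinear time.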
Choosing an embedding model

| Model | Type | Dimensions | Best for | Speed |
|-------|------|------------|----------|-------|
| Word2Vec / GloVe | Static word | 100-300 | Word analogies, quick baselines | Very fast |
| ELMo | Contextual word (biLSTM) | 1024 | Word disambiguation | Medium |
| BERT [CLS] token | Contextual sentence (not optimal) | 768 | Not recommended for sentence similarity | Slow |
| Sentence-BERT (SBERT) | Contextual sentence (fine-tuned) | 384-768 | Sentence similarity, search, RAG | Fast |
| OpenAI text-embedding-3-large | Contextual sentence | 3072 | Production semantic search | API call |

Practice questions

  1. Why does BERT produce better word representations than Word2Vec for polysemous words? (Answer: BERT is contextual — the embedding of "bank" in "river bank" is computed using the full sentence context via self-attention. Every layer refines the representation based on surrounding words. Word2Vec produces one static average vector trained to be near both "money" and "river" — blurring both senses.)
  2. Why is comparing [CLS] tokens from BERT poor for sentence similarity? (Answer: BERT was not trained to produce meaningful sentence embeddings in the [CLS] token — it was trained for masked LM and NSP (next sentence prediction), so the [CLS] representation is not optimised for geometric similarity. Sentence-BERT fine-tunes BERT on NLI + STS tasks with a Siamese network specifically to make cosine similarity of its pooled sentence embeddings — typically a mean over token outputs — meaningful.)
  3. What is the key training technique that makes Sentence-BERT embeddings useful? (Answer: Siamese network training on NLI + Semantic Textual Similarity (STS) data. Two sentences go through the same BERT model independently. The loss pulls similar sentences close together and pushes dissimilar sentences apart in embedding space — learning a similarity-preserving projection.)
  4. In RAG (Retrieval-Augmented Generation), what role do sentence embeddings play? (Answer: Documents are encoded to embeddings at indexing time. At query time, the user query is encoded to an embedding. Fast ANN (approximate nearest neighbour) search finds the most similar document chunks. Those chunks are inserted as context for the LLM to generate a grounded, factual answer.)
  5. ELMo uses a bidirectional LSTM, BERT uses a bidirectional Transformer. What advantage does BERT have? (Answer: BERT captures long-range dependencies better — self-attention lets every token attend directly to every other token in a single step, whereas an LSTM must propagate information through O(n) sequential steps. BERT processes the entire sentence in parallel, while ELMo processes it sequentially and struggles with very long dependencies. BERT also scales better to larger models.)
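Question 3's training objective can be made concrete. In SBERT's regression setup, both sentences pass through the same encoder and the loss is the squared error between their cosine similarity and a gold similarity label. A toy numpy illustration — the 3-d "embeddings" below are made up, not real encoder outputs:

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical outputs of the shared (Siamese) encoder
emb_anchor     = np.array([0.9, 0.1, 0.0])
emb_paraphrase = np.array([0.8, 0.2, 0.1])   # gold similarity label: 1.0
emb_unrelated  = np.array([0.7, 0.3, 0.2])   # gold similarity label: 0.0

# Regression objective: squared error between predicted cosine and gold label
loss_pos = (cos(emb_anchor, emb_paraphrase) - 1.0) ** 2   # small — pair already close
loss_neg = (cos(emb_anchor, emb_unrelated) - 0.0) ** 2    # large — loss pushes these apart
print(f"similar pair loss:    {loss_pos:.4f}")
print(f"dissimilar pair loss: {loss_neg:.4f}")
```

Gradient descent on this loss pulls labelled-similar pairs together and pushes labelled-dissimilar pairs apart, which is exactly what makes cosine similarity of the resulting embeddings meaningful.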

On LumiChats

Sentence embeddings power LumiChats document search — when you paste a PDF and ask questions, your query and document chunks are encoded to sentence embeddings, and the closest chunks are retrieved to ground the answer. The same technology powers the 'Related Terms' feature you see on this glossary page.
