Word embeddings are dense, low-dimensional vector representations of words where semantically similar words are close together in vector space. Unlike one-hot or bag-of-words (BoW) vectors, embeddings capture meaning: synonyms cluster together, analogies have geometric structure (king − man + woman ≈ queen), and unseen words can be handled via sub-word components. Word2Vec (Google, 2013), GloVe (Stanford, 2014), and FastText (Facebook, 2016) are the three foundational static embedding methods that transformed NLP and directly led to contextual embeddings (ELMo, BERT, GPT).
Real-life analogy: The city map
Imagine every word is a point on a map of concepts. Words that mean similar things are placed in the same neighbourhood: cat, kitten, dog, puppy are all near each other in the 'animals' district. Paris, London, Berlin are in the 'European capitals' district. The distance and direction between points encode relationships: the vector from 'man' to 'woman' is the same as the vector from 'king' to 'queen'. Word embeddings are exactly this map — learnt automatically from billions of words of text.
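The "same direction" claim can be made concrete with toy numbers. The 2-D coordinates below are invented for illustration (real embeddings are learnt and 100-300 dimensional), chosen so the gender offset is identical in both word pairs:

```python
import numpy as np

# Hand-picked, hypothetical 2-D "concept map" coordinates (illustrative only)
vectors = {
    "man":   np.array([1.0, 1.0]),
    "woman": np.array([1.0, 3.0]),
    "king":  np.array([5.0, 1.0]),
    "queen": np.array([5.0, 3.0]),
}

# The man -> woman offset equals the king -> queen offset...
print(vectors["woman"] - vectors["man"])   # [0. 2.]
print(vectors["queen"] - vectors["king"])  # [0. 2.]

# ...so king - man + woman lands exactly on queen
result = vectors["king"] - vectors["man"] + vectors["woman"]
print(np.allclose(result, vectors["queen"]))  # True
```

In a real embedding space the offsets only match approximately, which is why analogy queries return the *nearest* word rather than an exact hit.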
Word2Vec — learning from context
Word2Vec (Mikolov et al., Google 2013) trains a shallow neural network on one of two tasks: CBOW (Continuous Bag of Words) predicts the centre word from its context window. Skip-gram predicts context words from the centre word. Neither task is the real goal — the weights of the hidden layer are the word embeddings, learnt as a by-product.
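To make the Skip-gram task concrete, here is a minimal sketch of how (centre, context) training pairs are extracted from a sentence. The helper function is hypothetical (gensim does this internally); it only shows which pairs the network is trained on:

```python
# Sketch of Skip-gram training-pair extraction (window = 2)
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, centre in enumerate(tokens):
        # every word within `window` positions of the centre is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(sentence)
print([p for p in pairs if p[0] == "sat"])
# [('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
```

CBOW uses the same windows but in the other direction: the four context words jointly predict "sat".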
Skip-gram maximises the log probability of context words given centre word w_t over a window of size c. Training uses negative sampling to avoid computing the full softmax over the entire vocabulary.
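In symbols (notation from Mikolov et al., 2013, where T is the corpus length and c the window size), the Skip-gram objective is:

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p(w_{t+j} \mid w_t)
```

Negative sampling replaces the expensive softmax in $p(w_{t+j} \mid w_t)$ with a cheap binary objective: score the true context word $w_O$ high and $k$ randomly drawn "negative" words low,

```latex
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]
```

where $\sigma$ is the sigmoid, $v_{w_I}$ is the centre-word vector, and $P_n(w)$ is the noise distribution over the vocabulary.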
Training Word2Vec with gensim
from gensim.models import Word2Vec

# Tokenised corpus: list of sentences (lists of words)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "is", "the", "ruler"],
    ["man", "works", "at", "the", "office"],
    ["woman", "works", "at", "the", "office"],
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
    ["the", "dog", "barks", "at", "the", "man"],  # added so "dog" is in the vocabulary
]
model = Word2Vec(
    sentences,
    vector_size=50,  # embedding dimensions (typically 100-300)
    window=3,        # context window size
    min_count=1,     # ignore words with freq < min_count
    sg=1,            # 1 = Skip-gram, 0 = CBOW
    epochs=100,
)
# Semantic similarity (scores on a toy corpus this small vary run to run)
print(model.wv.most_similar("king", topn=3))
# e.g. [('queen', 0.97), ('ruler', 0.92), ('kingdom', 0.88)]
# Word analogy: king - man + woman ≈ queen
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"])
print(result[0])  # e.g. ('queen', 0.96)
# Cosine similarity
print(model.wv.similarity("paris", "berlin"))  # relatively high (both capitals)
print(model.wv.similarity("paris", "dog"))     # relatively low (unrelated)
GloVe and FastText
GloVe (Global Vectors, Stanford 2014) takes a different approach: instead of a prediction task, it directly factorises the word co-occurrence matrix of the entire corpus. GloVe embeddings encode global corpus statistics — not just local context windows — making them particularly good for syntactic relationships.
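The "count globally, then factorise" idea can be sketched in a few lines. This is a toy illustration only: real GloVe optimises a weighted least-squares objective over log co-occurrence counts, not a plain SVD, and all names below are made up for the example:

```python
import numpy as np

# Toy corpus and a symmetric co-occurrence count matrix (window = 1)
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):          # immediate left/right neighbours
            if 0 <= j < len(sent):
                X[idx[w], idx[sent[j]]] += 1

# Factorise (log-scaled) global counts; the left factors act as word vectors.
U, S, Vt = np.linalg.svd(np.log1p(X))
embeddings = U[:, :2] * S[:2]             # 2-dimensional word vectors
print({w: embeddings[idx[w]].round(2) for w in vocab})
```

Because "cat" and "dog" appear in identical contexts here, their rows in X are identical, so they get identical vectors — the factorisation recovers similarity purely from global counts.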
FastText (Facebook 2016) extends Word2Vec by representing each word as a bag of character n-grams plus the word itself. With boundary markers < and > and n = 3, 'apple' = {<ap, app, ppl, ple, le>} plus the full-word token <apple>. A word's embedding is the sum of its n-gram vectors, so out-of-vocabulary and rare words still get meaningful embeddings built from pieces they share with known words.
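The n-gram decomposition is easy to reproduce. A minimal sketch (3-grams only; real FastText uses n = 3 to 6 plus the full word):

```python
# FastText-style character n-gram extraction
def char_ngrams(word, n=3):
    padded = "<" + word + ">"  # boundary markers, as in FastText
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("apple"))   # ['<ap', 'app', 'ppl', 'ple', 'le>']
print(char_ngrams("apples"))  # shares 'app', 'ppl', 'ple' with "apple"
```

Shared n-grams are why a rare inflected form like "apples" inherits meaning from the common form "apple" even if it was barely seen in training.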
| Method | Approach | OOV words? | Best at | Dimensions |
|---|---|---|---|---|
| Word2Vec | Prediction (CBOW/Skip-gram) | No | Semantic analogy tasks | 100-300 |
| GloVe | Matrix factorisation | No | Syntactic tasks, global stats | 50-300 |
| FastText | Sub-word n-grams | Yes (via char n-grams) | Morphology, multilingual | 100-300 |
| ELMo/BERT | Deep bidirectional LM | Yes (sub-word) | Contextual meaning, NLU | 768-1024 |
Static vs contextual embeddings
Word2Vec, GloVe, and FastText are static: the word "bank" has one embedding regardless of whether it means river bank or financial bank. BERT and GPT produce contextual embeddings: the same word gets different vectors depending on its sentence context. For most modern NLP tasks, contextual embeddings (BERT, GPT) significantly outperform static ones.
Practice questions
- What is the dimensionality problem with one-hot vectors that Word2Vec solves? (Answer: One-hot vectors have dimension |V| (10k-100k+) and are orthogonal — all words are equidistant. Word2Vec uses 100-300 dimensions and encodes semantic similarity via cosine distance.)
- In Skip-gram, given the sentence "the cat sat on the mat" with window=2 and centre word "sat", what are the training pairs? (Answer: (sat, cat), (sat, the), (sat, on), (sat, the) — all words within distance 2.)
- Why does FastText outperform Word2Vec on rare words? (Answer: FastText represents words via character n-grams. Rare words share n-grams with common words, so their embeddings inherit some meaning even with few training examples.)
- What does the analogy "Paris - France + Germany = ?" test in word embeddings? (Answer: Berlin. Tests that "capital of" relationships are encoded as consistent vectors. Result = model.wv.most_similar(positive=["Paris", "Germany"], negative=["France"]).)
- GloVe is called "Global" because: (Answer: It factors the global word co-occurrence matrix of the entire corpus, rather than only looking at local context windows like Word2Vec.)
On LumiChats
LumiChats uses contextual embeddings (the modern successors to Word2Vec) to power semantic search and RAG. When you search your documents, the system compares dense vector similarity — the same principle behind Word2Vec analogies, but with BERT-quality contextual understanding.
Try it free