
Word Embeddings — Word2Vec, GloVe & FastText

Dense vectors that capture semantic meaning — king - man + woman = queen.


Definition

Word embeddings are dense, low-dimensional vector representations of words where semantically similar words are close together in vector space. Unlike one-hot or BoW vectors, embeddings capture meaning: synonyms cluster together, analogies have geometric structure (king - man + woman = queen), and unseen words can be handled via sub-word components. Word2Vec (Google, 2013), GloVe (Stanford, 2014), and FastText (Facebook, 2016) are the three foundational static embedding methods that transformed NLP and directly led to contextual embeddings (ELMo, BERT, GPT).

Real-life analogy: The city map

Imagine every word is a point on a map of concepts. Words that mean similar things are placed in the same neighbourhood: cat, kitten, dog, puppy are all near each other in the 'animals' district. Paris, London, Berlin are in the 'European capitals' district. The distance and direction between points encode relationships: the vector from 'man' to 'woman' is the same as the vector from 'king' to 'queen'. Word embeddings are exactly this map — learnt automatically from billions of words of text.

Word2Vec — learning from context

Word2Vec (Mikolov et al., Google 2013) trains a shallow neural network on one of two tasks: CBOW (Continuous Bag of Words) predicts the centre word from its context window. Skip-gram predicts context words from the centre word. Neither task is the real goal — the weights of the hidden layer are the word embeddings, learnt as a by-product.
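The (centre, context) pair generation that Skip-gram trains on can be sketched in a few lines of Python; the function name `skipgram_pairs` is illustrative, not part of any library:

```python
def skipgram_pairs(tokens, window=2):
    # Build (centre, context) training pairs for every position,
    # taking all neighbours within `window` words of the centre
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

sentence = ["the", "king", "rules", "the", "kingdom"]
print([p for p in skipgram_pairs(sentence) if p[0] == "rules"])
# [('rules', 'the'), ('rules', 'king'), ('rules', 'the'), ('rules', 'kingdom')]
```

Note that duplicate context words (here "the") each yield their own training pair.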

Skip-gram maximises the log probability of context words given centre word w_t over a window of size c. Training uses negative sampling to avoid computing the full softmax over the entire vocabulary.
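In the standard notation of Mikolov et al. (2013), the objective just described can be written as:

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p(w_{t+j} \mid w_t)
```

where negative sampling replaces each $\log p(w_O \mid w_I)$ term with $\log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \log \sigma(-{v'_{w_i}}^{\top} v_{w_I})$, so only $k$ sampled "negative" words are scored instead of the whole vocabulary.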

Training Word2Vec with gensim

from gensim.models import Word2Vec

# Tokenised corpus: list of sentences (lists of words)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "is", "the", "ruler"],
    ["man", "works", "at", "the", "office"],
    ["woman", "works", "at", "the", "office"],
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # embedding dimensions (typically 100-300)
    window=3,         # context window size
    min_count=1,      # ignore words with freq < min_count
    sg=1,             # 1 = Skip-gram, 0 = CBOW
    epochs=100,
)

# Semantic similarity (outputs below are illustrative; a six-sentence
# corpus is far too small for stable neighbours or scores)
print(model.wv.most_similar("king", topn=3))
# e.g. [('queen', 0.97), ('ruler', 0.92), ('kingdom', 0.88)]

# Word analogy: king - man + woman = queen
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"])
print(result[0])   # e.g. ('queen', 0.96)

# Cosine similarity (both words must be in the vocabulary)
print(model.wv.similarity("paris", "berlin"))   # high (both capitals)
print(model.wv.similarity("paris", "office"))   # low (unrelated)

GloVe and FastText

GloVe (Global Vectors, Stanford 2014) takes a different approach: instead of a prediction task, it directly factorises the word co-occurrence matrix of the entire corpus. GloVe embeddings encode global corpus statistics — not just local context windows — making them particularly good for syntactic relationships.
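GloVe's weighted least-squares fit of log co-occurrence counts can be sketched in NumPy. The counts, dimensions, and learning rate below are toy assumptions for illustration; the reference implementation runs AdaGrad over corpus-scale counts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy symmetric co-occurrence counts X[i, j] (made-up numbers)
vocab = ["king", "queen", "man", "woman"]
X = np.array([[0., 8., 5., 1.],
              [8., 0., 1., 5.],
              [5., 1., 0., 6.],
              [1., 5., 6., 0.]])

V, d = len(vocab), 5
W  = rng.normal(scale=0.1, size=(V, d))   # word vectors
Wc = rng.normal(scale=0.1, size=(V, d))   # context vectors
b, bc = np.zeros(V), np.zeros(V)          # bias terms

def f(x, x_max=10.0, alpha=0.75):
    # GloVe weighting: down-weights rare pairs, caps frequent ones
    return min((x / x_max) ** alpha, 1.0)

lr, losses = 0.05, []
for epoch in range(200):
    loss = 0.0
    for i in range(V):
        for j in range(V):
            if X[i, j] == 0:
                continue                   # only observed pairs contribute
            # Residual of the model w_i . w~_j + b_i + b~_j against log X_ij
            diff = W[i] @ Wc[j] + b[i] + bc[j] - np.log(X[i, j])
            g = 2 * lr * f(X[i, j]) * diff
            loss += f(X[i, j]) * diff ** 2
            W[i], Wc[j] = W[i] - g * Wc[j], Wc[j] - g * W[i]
            b[i] -= g
            bc[j] -= g
    losses.append(loss)

print(f"loss: {losses[0]:.2f} -> {losses[-1]:.2f}")
```

After training, W + Wc (GloVe sums the two sets of vectors) would serve as the embeddings.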

FastText (Facebook 2016) extends Word2Vec by representing each word as a bag of character n-grams plus the word itself. With 3-grams and boundary markers, 'apple' becomes {<ap, app, ppl, ple, le>} plus the special whole-word token <apple>. This handles morphologically rich languages (Turkish, Finnish), rare words, and misspellings gracefully: even words never seen in training can be represented by averaging their character n-gram embeddings.

Method      Approach                      OOV words?               Best at                         Dimensions
Word2Vec    Prediction (CBOW/Skip-gram)   No                       Semantic analogy tasks          100-300
GloVe       Matrix factorisation          No                       Syntactic tasks, global stats   50-300
FastText    Sub-word n-grams              Yes (via char n-grams)   Morphology, multilingual        100-300
ELMo/BERT   Deep bidirectional LM         Yes (sub-word)           Contextual meaning, NLU         768-1024

Static vs contextual embeddings

Word2Vec, GloVe, and FastText are static: the word "bank" has one embedding regardless of whether it means river bank or financial bank. BERT and GPT produce contextual embeddings: the same word gets different vectors depending on its sentence context. For most modern NLP tasks, contextual embeddings (BERT, GPT) significantly outperform static ones.
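The distinction can be shown with a toy NumPy sketch. The 'contextualiser' below is a made-up stand-in (real contextual models use deep attention layers), but it captures the key property: the output vector depends on the sentence, not just the word:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = {"bank": 0, "river": 1, "money": 2, "the": 3}
E = rng.normal(size=(len(vocab), 4))   # static embedding table

def static_vec(word, sentence):
    # Static lookup: the sentence argument is ignored entirely
    return E[vocab[word]]

def contextual_vec(word, sentence):
    # Toy contextualiser: mix the word vector with the sentence mean
    ids = [vocab[w] for w in sentence]
    return 0.5 * E[vocab[word]] + 0.5 * E[ids].mean(axis=0)

s1 = ["the", "river", "bank"]
s2 = ["the", "money", "bank"]

# Static: identical vectors for "bank" in both senses
print(np.allclose(static_vec("bank", s1), static_vec("bank", s2)))        # True
# Contextual: different vectors depending on the surrounding words
print(np.allclose(contextual_vec("bank", s1), contextual_vec("bank", s2)))  # False
```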

Practice questions

  1. What is the dimensionality problem with one-hot vectors that Word2Vec solves? (Answer: One-hot vectors have dimension |V| (10k-100k+) and are orthogonal — all words are equidistant. Word2Vec uses 100-300 dimensions and encodes semantic similarity via cosine distance.)
  2. In Skip-gram, given the sentence "the cat sat on the mat" with window=2 and centre word "sat", what are the training pairs? (Answer: (sat, cat), (sat, the), (sat, on), (sat, the) — all words within distance 2.)
  3. Why does FastText outperform Word2Vec on rare words? (Answer: FastText represents words via character n-grams. Rare words share n-grams with common words, so their embeddings inherit some meaning even with few training examples.)
  4. What does the analogy "Paris - France + Germany = ?" test in word embeddings? (Answer: Berlin. Tests that "capital of" relationships are encoded as consistent vectors. Result = model.wv.most_similar(positive=["Paris", "Germany"], negative=["France"]).)
  5. GloVe is called "Global" because: (Answer: It factors the global word co-occurrence matrix of the entire corpus, rather than only looking at local context windows like Word2Vec.)

On LumiChats

LumiChats uses contextual embeddings (the modern successors to Word2Vec) to power semantic search and RAG. When you search your documents, the system compares dense vector similarity — the same principle behind Word2Vec analogies, but with BERT-quality contextual understanding.
