Word embeddings are dense, low-dimensional vector representations of words where semantically similar words are close together in vector space. Unlike one-hot or bag-of-words (BoW) vectors, embeddings capture meaning: synonyms cluster together, analogies have geometric structure (king − man + woman ≈ queen), and unseen words can be handled via sub-word components. Word2Vec (Google, 2013), GloVe (Stanford, 2014), and FastText (Facebook, 2016) are the three foundational static embedding methods that transformed NLP and directly led to contextual embeddings (ELMo, BERT, GPT).
Real-life analogy: The city map
Imagine every word is a point on a map of concepts. Words that mean similar things are placed in the same neighbourhood: cat, kitten, dog, puppy are all near each other in the 'animals' district. Paris, London, Berlin are in the 'European capitals' district. The distance and direction between points encode relationships: the vector from 'man' to 'woman' is the same as the vector from 'king' to 'queen'. Word embeddings are exactly this map — learnt automatically from billions of words of text.
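The "same direction" claim can be made concrete with toy numbers. The 2-D coordinates below are invented for illustration (real embeddings are learnt and 100-300 dimensional), chosen so the gender offset is identical in both word pairs:

```python
import numpy as np

# Hand-picked, hypothetical 2-D "concept map" coordinates (illustrative only)
vectors = {
    "man":   np.array([1.0, 1.0]),
    "woman": np.array([1.0, 3.0]),
    "king":  np.array([5.0, 1.0]),
    "queen": np.array([5.0, 3.0]),
}

# The man -> woman offset equals the king -> queen offset...
print(vectors["woman"] - vectors["man"])   # [0. 2.]
print(vectors["queen"] - vectors["king"])  # [0. 2.]

# ...so king - man + woman lands exactly on queen
result = vectors["king"] - vectors["man"] + vectors["woman"]
print(np.allclose(result, vectors["queen"]))  # True
```

In a real embedding space the offsets only match approximately, which is why analogy queries return the *nearest* word rather than an exact hit.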
Word2Vec — learning from context
Word2Vec (Mikolov et al., Google 2013) trains a shallow neural network on one of two tasks: CBOW (Continuous Bag of Words) predicts the centre word from its context window. Skip-gram predicts context words from the centre word. Neither task is the real goal — the weights of the hidden layer are the word embeddings, learnt as a by-product.
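To make the Skip-gram task concrete, here is a minimal sketch of how (centre, context) training pairs are extracted from a sentence. The helper function is hypothetical (gensim does this internally); it only shows which pairs the network is trained on:

```python
# Sketch of Skip-gram training-pair extraction (window = 2)
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, centre in enumerate(tokens):
        # every word within `window` positions of the centre is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(sentence)
print([p for p in pairs if p[0] == "sat"])
# [('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
```

CBOW uses the same windows but in the other direction: the four context words jointly predict "sat".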
Skip-gram maximises the log probability of context words given centre word w_t over a window of size c. Training uses negative sampling to avoid computing the full softmax over the entire vocabulary.
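In symbols (notation from Mikolov et al., 2013, where T is the corpus length and c the window size), the Skip-gram objective is:

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p(w_{t+j} \mid w_t)
```

Negative sampling replaces the expensive softmax in $p(w_{t+j} \mid w_t)$ with a cheap binary objective: score the true context word $w_O$ high and $k$ randomly drawn "negative" words low,

```latex
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]
```

where $\sigma$ is the sigmoid, $v_{w_I}$ is the centre-word vector, and $P_n(w)$ is the noise distribution over the vocabulary.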
Training Word2Vec with gensim
from gensim.models import Word2Vec

# Tokenised corpus: list of sentences (lists of words)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "is", "the", "ruler"],
    ["man", "works", "at", "the", "office"],
    ["woman", "works", "at", "the", "office"],
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
    ["the", "dog", "barks", "at", "the", "man"],  # added so "dog" is in the vocabulary
]
model = Word2Vec(
    sentences,
    vector_size=50,  # embedding dimensions (typically 100-300)
    window=3,        # context window size
    min_count=1,     # ignore words with freq < min_count
    sg=1,            # 1 = Skip-gram, 0 = CBOW
    epochs=100,
)
# Semantic similarity (scores on a toy corpus this small vary run to run)
print(model.wv.most_similar("king", topn=3))
# e.g. [('queen', 0.97), ('ruler', 0.92), ('kingdom', 0.88)]
# Word analogy: king - man + woman ≈ queen
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"])
print(result[0])  # e.g. ('queen', 0.96)
# Cosine similarity
print(model.wv.similarity("paris", "berlin"))  # relatively high (both capitals)
print(model.wv.similarity("paris", "dog"))     # relatively low (unrelated)
GloVe and FastText
GloVe (Global Vectors, Stanford 2014) takes a different approach: instead of a prediction task, it directly factorises the word co-occurrence matrix of the entire corpus. GloVe embeddings encode global corpus statistics — not just local context windows — making them particularly good for syntactic relationships.
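The "count globally, then factorise" idea can be sketched in a few lines. This is a toy illustration only: real GloVe optimises a weighted least-squares objective over log co-occurrence counts, not a plain SVD, and all names below are made up for the example:

```python
import numpy as np

# Toy corpus and a symmetric co-occurrence count matrix (window = 1)
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):          # immediate left/right neighbours
            if 0 <= j < len(sent):
                X[idx[w], idx[sent[j]]] += 1

# Factorise (log-scaled) global counts; the left factors act as word vectors.
U, S, Vt = np.linalg.svd(np.log1p(X))
embeddings = U[:, :2] * S[:2]             # 2-dimensional word vectors
print({w: embeddings[idx[w]].round(2) for w in vocab})
```

Because "cat" and "dog" appear in identical contexts here, their rows in X are identical, so they get identical vectors — the factorisation recovers similarity purely from global counts.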
FastText (Facebook 2016) extends Word2Vec by representing each word as a bag of character n-grams plus the word itself. With boundary markers < and > and n = 3, 'apple' = {<ap, app, ppl, ple, le>} plus the full-word token <apple>. A word's embedding is the sum of its n-gram vectors, so out-of-vocabulary and rare words still get meaningful embeddings built from pieces they share with known words.
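The n-gram decomposition is easy to reproduce. A minimal sketch (3-grams only; real FastText uses n = 3 to 6 plus the full word):

```python
# FastText-style character n-gram extraction
def char_ngrams(word, n=3):
    padded = "<" + word + ">"  # boundary markers, as in FastText
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("apple"))   # ['<ap', 'app', 'ppl', 'ple', 'le>']
print(char_ngrams("apples"))  # shares 'app', 'ppl', 'ple' with "apple"
```

Shared n-grams are why a rare inflected form like "apples" inherits meaning from the common form "apple" even if it was barely seen in training.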
| Method | Approach | OOV words? | Best at | Dimensions |
|---|---|---|---|---|
| Word2Vec | Prediction (CBOW/Skip-gram) | No | Semantic analogy tasks | 100-300 |
| GloVe | Matrix factorisation | No | Syntactic tasks, global stats | 50-300 |
| FastText | Sub-word n-grams | Yes (via char n-grams) | Morphology, multilingual | 100-300 |
| ELMo/BERT | Deep bidirectional LM | Yes (sub-word) | Contextual meaning, NLU | 768-1024 |
Static vs contextual embeddings
Word2Vec, GloVe, and FastText are static: the word "bank" has one embedding regardless of whether it means river bank or financial bank. BERT and GPT produce contextual embeddings: the same word gets different vectors depending on its sentence context. For most modern NLP tasks, contextual embeddings (BERT, GPT) significantly outperform static ones.
Practice questions
- What is the dimensionality problem with one-hot vectors that Word2Vec solves? (Answer: One-hot vectors have dimension |V| (10k-100k+) and are orthogonal — all words are equidistant. Word2Vec uses 100-300 dimensions and encodes semantic similarity via cosine distance.)
- In Skip-gram, given the sentence "the cat sat on the mat" with window=2 and centre word "sat", what are the training pairs? (Answer: (sat, cat), (sat, the), (sat, on), (sat, the) — all words within distance 2.)
- Why does FastText outperform Word2Vec on rare words? (Answer: FastText represents words via character n-grams. Rare words share n-grams with common words, so their embeddings inherit some meaning even with few training examples.)
- What does the analogy "Paris - France + Germany = ?" test in word embeddings? (Answer: Berlin. Tests that "capital of" relationships are encoded as consistent vectors. Result = model.wv.most_similar(positive=["Paris", "Germany"], negative=["France"]).)
- GloVe is called "Global" because: (Answer: It factors the global word co-occurrence matrix of the entire corpus, rather than only looking at local context windows like Word2Vec.)
On LumiChats
LumiChats uses contextual embeddings (the modern successors to Word2Vec) to power semantic search and RAG. When you search your documents, the system compares dense vector similarity — the same principle behind Word2Vec analogies, but with BERT-quality contextual understanding.
Try it free