Embeddings are dense numerical vector representations of text, images, or other data that capture their semantic meaning. Similar concepts produce embeddings that are mathematically close in high-dimensional space, allowing AI systems to perform semantic search, clustering, classification, and retrieval based on meaning rather than keyword matching.
## What an embedding looks like
An embedding is simply a list of floating-point numbers — a vector. The length of this list is called the embedding dimension. Modern text embedding models produce vectors with 768 to 3,072 dimensions.
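As a toy illustration (the numbers below are made up, not produced by any real model), an embedding is just an array of floats whose length is the embedding dimension:

```python
import numpy as np

# A made-up 8-dimensional "embedding" -- real models produce hundreds
# or thousands of dimensions, but the structure is identical.
toy_embedding = np.array([0.12, -0.45, 0.33, 0.08, -0.91, 0.27, 0.05, -0.16])

print(toy_embedding.shape)  # (8,) -- the embedding dimension
print(toy_embedding.dtype)  # float64
```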
Generating a real embedding with OpenAI text-embedding-3-small
```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # uses OPENAI_API_KEY env variable

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",  # 1536 dimensions
        input=text,
    )
    return np.array(response.data[0].embedding)

king = embed("king")
queen = embed("queen")
banana = embed("banana")

print(f"Embedding shape: {king.shape}")  # (1536,)
print(f"First 5 values: {king[:5]}")
# e.g. [ 0.021 -0.083  0.045 -0.012  0.067]
# 1536 numbers, each individually meaningless,
# but together they encode the word's meaning
```

**Shape matters:** These 1,536 numbers have no individually interpretable meaning. What matters is the geometric relationship: words with similar meanings live close together in this 1,536-dimensional space.
## Cosine similarity: measuring semantic distance
To compare how similar two embeddings are, we use cosine similarity — the cosine of the angle between the two vectors. This is preferred over Euclidean distance for embeddings because it measures directional similarity, not magnitude:
Cosine similarity ranges from −1 (opposite meaning) through 0 (unrelated) to 1 (identical meaning).
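A quick sanity check with toy 2-D vectors (illustrative only) shows both the range endpoints and why magnitude does not matter, only direction:

```python
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0])

print(cos_sim(v, 3 * v))   # ~1.0: same direction, 3x the magnitude -- still "identical"
print(cos_sim(v, -v))      # ~-1.0: opposite direction
print(cos_sim(np.array([1.0, 0.0]),
              np.array([0.0, 1.0])))  # 0.0: orthogonal (unrelated)
```

Note that `v` and `3 * v` are far apart by Euclidean distance, yet their cosine similarity is 1 — exactly the directional behaviour we want for embeddings.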
Computing cosine similarity between embeddings
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Using the embeddings from the previous example:
print(f"king ↔ queen:  {cosine_similarity(king, queen):.4f}")   # ~0.87
print(f"king ↔ banana: {cosine_similarity(king, banana):.4f}")  # ~0.21
print(f"king ↔ king:   {cosine_similarity(king, king):.4f}")    # 1.0000

# For bulk comparisons, normalize first (cosine = dot product of unit vectors):
def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

king_n, queen_n = normalize(king), normalize(queen)
# Now: similarity = np.dot(king_n, queen_n)
```

## The famous king − man + woman = queen
A celebrated property of well-trained embeddings, first demonstrated with Word2Vec in 2013, is that semantic relationships correspond to arithmetic in vector space:
The gender relationship is encoded as a consistent geometric direction in embedding space.
Demonstrating word arithmetic with embeddings
```python
# Semantic arithmetic in embedding space
man = embed("man")
woman = embed("woman")

# Compute the "analogy" vector
target = king - man + woman

# Find which word's embedding is closest to the result
candidates = {"king": king, "queen": queen, "woman": woman, "banana": banana}
similarities = {
    word: cosine_similarity(target, vec)
    for word, vec in candidates.items()
}

best = max(similarities, key=similarities.get)
print("Nearest to (king - man + woman):", best)  # queen ✓
# Output: queen (similarity ~0.89)
# This works because the gender direction (man → woman)
# is consistent across semantic spaces
```

Modern large embedding models (like text-embedding-3-large) capture far richer, contextual semantics. The same word "bank" gets completely different embeddings depending on whether the surrounding context is about finance or rivers, because modern embedders process context, not just isolated words.
## Building a semantic search system
Here's a minimal but production-realistic semantic search implementation — the same core logic used in LumiChats Study Mode:
Minimal semantic search with embeddings + cosine similarity
```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str]) -> np.ndarray:
    """Embed multiple texts in one API call (efficient)."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return np.array([r.embedding for r in response.data])

# --- 1. Index your knowledge base ---
documents = [
    "Photosynthesis converts sunlight into glucose in plant cells.",
    "The mitochondria is the powerhouse of the cell.",
    "DNA replication occurs during the S phase of the cell cycle.",
    "The Eiffel Tower was built in 1889 in Paris, France.",
]
doc_embeddings = embed_batch(documents)  # shape: (4, 1536)

# Normalize for fast cosine similarity via dot product
norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
doc_embeddings_norm = doc_embeddings / norms

# --- 2. Query ---
def search(query: str, top_k: int = 3):
    query_emb = embed_batch([query])[0]                # (1536,)
    query_emb = query_emb / np.linalg.norm(query_emb)  # normalize
    # Cosine similarity = dot product when both are normalized
    scores = doc_embeddings_norm @ query_emb           # (4,)
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [(documents[i], float(scores[i])) for i in top_indices]

results = search("how do plants make food?")
for doc, score in results:
    print(f"Score {score:.3f}: {doc}")

# Score 0.847: Photosynthesis converts sunlight into glucose in plant cells. ✓
# Score 0.432: The mitochondria is the powerhouse of the cell.
# Score 0.381: DNA replication occurs during the S phase of the cell cycle.
```

**Production tip:** In production, store embeddings in a vector database (pgvector, Pinecone, Qdrant) instead of NumPy arrays; they handle millions of vectors with millisecond search using HNSW indexing.
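Before reaching for a vector database, an in-memory index can go surprisingly far. Here is a sketch (random unit vectors stand in for real embeddings, and the corpus size and dimension are arbitrary) of a top-k search that avoids a full sort by using `np.argpartition`:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in corpus: 5,000 random unit vectors of dimension 512
corpus = rng.standard_normal((5_000, 512))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def top_k(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k most similar corpus vectors, best first."""
    query = query / np.linalg.norm(query)
    scores = corpus @ query
    # argpartition selects the k largest in O(n); we then sort only those k
    idx = np.argpartition(scores, -k)[-k:]
    return idx[np.argsort(scores[idx])[::-1]]

query = corpus[123] + 0.01 * rng.standard_normal(512)  # near-duplicate of doc 123
print(top_k(query, k=5)[0])  # 123 -- the nearest neighbour is the original
```

This brute-force scan is exact and fast enough for tens of thousands of vectors; HNSW-style approximate indexes become worthwhile once the corpus reaches millions.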
## Embedding model comparison (2025)
| Model | Dimensions | MTEB Score | Best for |
|---|---|---|---|
| text-embedding-3-large (OpenAI) | 3,072 | 64.6 | General purpose, highest quality |
| text-embedding-3-small (OpenAI) | 1,536 | 62.3 | Cost-efficient, fast |
| Cohere Embed v3 | 1,024 | 64.5 | Multilingual, strong retrieval |
| voyage-3 (Voyage AI) | 1,024 | 67.1 | Code, technical retrieval |
| BGE-M3 (open-source) | 1,024 | 63.5 | Self-hosted, multilingual |
| mxbai-embed-large (open-source) | 1,024 | 64.7 | Self-hosted, cost-free |
**Model mismatch:** You must embed queries and documents with the same model. Mixing models produces meaningless comparisons; the vector spaces are completely incompatible even when the dimensions happen to match.
## Use cases beyond RAG
- Recommendation systems — embed user history and items; find items closest to the user's "taste vector"
- Duplicate detection — find near-identical documents in a corpus (cosine similarity > 0.97)
- Classification — train a lightweight classifier (logistic regression, SVM) on top of frozen embeddings — often beats fine-tuning for small datasets
- Clustering — K-Means or HDBSCAN over embeddings to discover semantic groups without labels
- Cross-lingual search — multilingual models embed English and Hindi into the same space; search Hindi docs with an English query
- Anomaly detection — inputs far from the distribution of "normal" embeddings may indicate unusual or adversarial inputs
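The duplicate-detection idea above can be sketched in plain NumPy. Random vectors stand in for real embeddings, and the 0.97 threshold is the one suggested in the list:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings: doc 3 is a near-duplicate of doc 0, the rest unrelated
base = rng.standard_normal((5, 256))
base[3] = base[0] + 0.01 * rng.standard_normal(256)

emb = base / np.linalg.norm(base, axis=1, keepdims=True)

# Pairwise cosine similarity matrix (unit vectors -> just a matmul)
sim = emb @ emb.T

# Report pairs above the duplicate threshold (upper triangle, i < j)
threshold = 0.97
dupes = [(i, j)
         for i in range(len(emb))
         for j in range(i + 1, len(emb))
         if sim[i, j] > threshold]
print(dupes)  # [(0, 3)]
```

The same pairwise-similarity matrix also feeds clustering: it is exactly what K-Means or HDBSCAN operate on when grouping documents without labels.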
## Practice questions
- What is the dot product between two unit vectors and why does cosine similarity use it? (Answer: Cosine similarity = (A·B)/(||A||·||B||). For unit vectors (||A||=||B||=1): cosine_similarity = A·B directly. Range: -1 (opposite directions) to +1 (same direction), 0 (orthogonal/unrelated). Used for embeddings because length is normalised — similarity reflects angular relationship (semantic closeness) not vector magnitude. In practice: normalise embeddings before cosine comparison; this also makes nearest neighbour search faster (dot product only).)
- What is the difference between sentence embeddings and word embeddings? (Answer: Word embeddings (Word2Vec, GloVe): one vector per word in the vocabulary, context-independent. 'bank' has the same vector whether 'river bank' or 'bank account.' Sentence embeddings (SBERT, OpenAI text-embedding): one vector per sentence/passage, context-aware. The vector for 'I went to the bank to deposit money' reflects the financial sense. Sentence embeddings are generated by averaging token embeddings or using [CLS] token from a transformer — capturing the full contextual meaning of the input.)
- What is semantic search vs keyword search and when is each appropriate? (Answer: Keyword search (BM25/TF-IDF): finds documents containing the exact query terms. Fast, interpretable, handles technical terms and product IDs exactly. Fails when query uses different vocabulary than documents. Semantic search (embedding similarity): finds semantically similar documents even with different vocabulary — 'affordable car' matches documents about 'budget vehicle' and 'cheap automobile.' Slower (requires embedding + ANN lookup). Use keyword for: exact product IDs, medical codes, legal citations. Use semantic for: user intent queries, cross-lingual retrieval, FAQ matching.)
- What is the 'curse of dimensionality' problem for high-dimensional embeddings? (Answer: In high dimensions (768, 1536, 3072), the volume grows exponentially — almost all points become roughly equidistant from each other. Nearest neighbour search loses discriminative power: the difference between the closest and farthest neighbour becomes proportionally small. Practical effect: at very high dimensions, cosine similarities cluster around 0 for all pairs. Mitigations: dimensionality reduction (PCA to 256 dims), Matryoshka embeddings (encode most information in first 256 dims), and ANN algorithms (HNSW) that navigate the manifold structure rather than brute-force comparing all distances.)
- What is fine-tuning embedding models and when is it necessary? (Answer: Pretrained embeddings (text-embedding-3-large, BAAI/bge): trained on general web text. May not capture domain-specific similarity. Fine-tuning: train on (query, positive_document, negative_document) triplets from your domain using contrastive loss. When necessary: (1) Highly specialised vocabulary (legal, medical, chemical). (2) Custom similarity notion (you want 'similar' to mean something specific). (3) When out-of-box retrieval quality is below 70% accuracy. Fine-tuned embedding models often improve RAG retrieval by 10–20 percentage points on domain-specific tasks.)
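The concentration effect described in the curse-of-dimensionality answer is easy to see empirically. With random vectors (a crude stand-in for unrelated texts), pairwise cosine similarities bunch ever more tightly around 0 as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(7)

def pairwise_cosine_spread(dim: int, n: int = 200) -> float:
    """Standard deviation of pairwise cosine similarities of n random vectors."""
    v = rng.standard_normal((n, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    sim = v @ v.T
    # Keep only distinct pairs (upper triangle, excluding the diagonal)
    return float(sim[np.triu_indices(n, k=1)].std())

for dim in (2, 64, 1536):
    print(f"dim={dim:>5}: spread ~ {pairwise_cosine_spread(dim):.3f}")
# The spread shrinks roughly like 1/sqrt(dim): random pairs in 1536
# dimensions are all nearly orthogonal (cosine ~ 0)
```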
## On LumiChats
LumiChats uses text-embedding-3-large (OpenAI's best embedding model, 3072 dimensions) for Study Mode and Memory. Document chunks and memories are stored as embeddings in pgvector and retrieved using cosine similarity search.