Embeddings
Embeddings are dense numerical vector representations of text, images, or other data that capture their semantic meaning. Similar concepts produce embeddings that are mathematically close in high-dimensional space, allowing AI systems to perform semantic search, clustering, classification, and retrieval based on meaning rather than keyword matching.
How AI understands meaning, not just words.
Category: AI Fundamentals
What an embedding looks like
An embedding is simply a list of floating-point numbers — a vector. The length of this list is called the embedding dimension. Modern text embedding models typically produce vectors with 768 to 3,072 dimensions.
from openai import OpenAI
import numpy as np

client = OpenAI()  # uses OPENAI_API_KEY env variable

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",  # 1536 dimensions
        input=text
    )
    return np.array(response.data[0].embedding)

king = embed("king")
queen = embed("queen")
banana = embed("banana")

print(f"Embedding shape: {king.shape}")  # (1536,)
print(f"First 5 values: {king[:5]}")
# e.g. [ 0.021, -0.083, 0.045, -0.012, 0.067]
# 1536 numbers — each individually meaningless,
# but together they encode the word's meaning
Shape matters: These 1536 numbers have no individually interpretable meaning. What matters is the geometric relationship: words with similar meanings live close together in this 1536-dimensional space.
Cosine similarity: measuring meaning distance
To compare how similar two embeddings are, we use cosine similarity — the cosine of the angle between the two vectors. This is preferred over Euclidean distance for embeddings because it measures directional similarity, not magnitude:
\cos(\theta) = \frac{A \cdot B}{\|A\| \cdot \|B\|}
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Using the embeddings from the previous example:
print(f"king ↔ queen: {cosine_similarity(king, queen):.4f}")    # ~0.87
print(f"king ↔ banana: {cosine_similarity(king, banana):.4f}")  # ~0.21
print(f"king ↔ king: {cosine_similarity(king, king):.4f}")      # 1.0000

# For bulk comparisons, normalize first (cosine = dot product of unit vectors):
def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

king_n, queen_n = normalize(king), normalize(queen)
# Now: similarity = np.dot(king_n, queen_n)
The famous king − man + woman = queen
A celebrated property of well-trained embeddings (first shown in Word2Vec, 2013) is that semantic relationships correspond to arithmetic in vector space:
\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}
# Semantic arithmetic in embedding space
man = embed("man")
woman = embed("woman")

# Compute the "analogy" vector
target = king - man + woman

# Find which word's embedding is closest to the result.
# By convention the input words (king, man, woman) are excluded;
# otherwise 'king' itself usually scores highest.
candidates = {"queen": queen, "banana": banana}
similarities = {
    word: cosine_similarity(target, vec)
    for word, vec in candidates.items()
}

best = max(similarities, key=similarities.get)
print("Nearest to (king - man + woman):", best)  # queen ✓
# Output: queen (similarity ~0.89)

# This works because the gender direction (man → woman)
# is roughly consistent across the embedding space
Modern large embedding models (like text-embedding-3-large) capture far richer, contextual semantics. The same word 'bank' gets completely different embeddings depending on whether the surrounding context is about finance or rivers — because modern embedders process context, not just isolated words.
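You can check this contextual behavior directly. Here is a minimal sketch reusing embed() and cosine_similarity() from above; the exact scores are illustrative and vary by model:

# Same surface word "bank" in two different contexts
finance = embed("I opened a bank account to deposit my paycheck.")
river = embed("We sat on the grassy bank of the river watching boats.")

money = embed("loans, deposits, and financial institutions")
water = embed("rivers, shorelines, and flowing water")

print(f"finance 'bank' ↔ money: {cosine_similarity(finance, money):.3f}")  # higher
print(f"finance 'bank' ↔ water: {cosine_similarity(finance, water):.3f}")  # lower
print(f"river 'bank'   ↔ water: {cosine_similarity(river, water):.3f}")    # higher
print(f"river 'bank'   ↔ money: {cosine_similarity(river, money):.3f}")    # lower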
Building a semantic search system
Here's a minimal but production-realistic semantic search implementation — the same core logic used in LumiChats Study Mode:
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str]) -> np.ndarray:
    """Embed multiple texts in one API call (efficient)."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return np.array([r.embedding for r in response.data])

# --- 1. Index your knowledge base ---
documents = [
    "Photosynthesis converts sunlight into glucose in plant cells.",
    "The mitochondria is the powerhouse of the cell.",
    "DNA replication occurs during the S phase of the cell cycle.",
    "The Eiffel Tower was built in 1889 in Paris, France.",
]
doc_embeddings = embed_batch(documents)  # shape: (4, 1536)

# Normalize for fast cosine similarity via dot product
norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
doc_embeddings_norm = doc_embeddings / norms

# --- 2. Query ---
def search(query: str, top_k: int = 3):
    query_emb = embed_batch([query])[0]                # (1536,)
    query_emb = query_emb / np.linalg.norm(query_emb)  # normalize
    # Cosine similarity = dot product when both are normalized
    scores = doc_embeddings_norm @ query_emb           # (4,)
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [(documents[i], float(scores[i])) for i in top_indices]

results = search("how do plants make food?")
for doc, score in results:
    print(f"Score {score:.3f}: {doc}")

# Score 0.847: Photosynthesis converts sunlight into glucose in plant cells. ✓
# Score 0.432: The mitochondria is the powerhouse of the cell.
# Score 0.381: DNA replication occurs during the S phase of the cell cycle.
Production tip: In production, store embeddings in a vector database (pgvector, Pinecone, Qdrant) instead of NumPy arrays — they handle millions of vectors with millisecond search using HNSW indexing.
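As a rough sketch of what the pgvector version can look like, assuming Postgres with the vector extension installed and the psycopg driver (the connection string, table name, and helper functions here are hypothetical):

import psycopg

# Hypothetical connection string; adjust for your database
conn = psycopg.connect("postgresql://localhost/mydb", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute(
    "CREATE TABLE IF NOT EXISTS chunks ("
    "  id bigserial PRIMARY KEY,"
    "  content text NOT NULL,"
    "  embedding vector(1536) NOT NULL)"
)
# HNSW index for approximate nearest-neighbor search at scale
conn.execute(
    "CREATE INDEX IF NOT EXISTS chunks_hnsw_idx "
    "ON chunks USING hnsw (embedding vector_cosine_ops)"
)

def to_vector_literal(emb: list[float]) -> str:
    # pgvector's text input format: '[0.1,0.2,...]'
    return "[" + ",".join(map(str, emb)) + "]"

def index_chunk(content: str, emb: list[float]) -> None:
    conn.execute(
        "INSERT INTO chunks (content, embedding) VALUES (%s, %s::vector)",
        (content, to_vector_literal(emb)),
    )

def vector_search(query_emb: list[float], top_k: int = 3):
    # <=> is pgvector's cosine distance, so 1 - distance = cosine similarity
    return conn.execute(
        "SELECT content, 1 - (embedding <=> %s::vector) AS score "
        "FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (to_vector_literal(query_emb), to_vector_literal(query_emb), top_k),
    ).fetchall()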
Embedding model comparison (2025)
| Model | Dimensions | MTEB Score | Best for |
|---|---|---|---|
| text-embedding-3-large (OpenAI) | 3,072 | 64.6 | General purpose, highest quality |
| text-embedding-3-small (OpenAI) | 1,536 | 62.3 | Cost-efficient, fast |
| Cohere Embed v3 | 1,024 | 64.5 | Multilingual, strong retrieval |
| voyage-3 (Voyage AI) | 1,024 | 67.1 | Code, technical retrieval |
| BGE-M3 (open-source) | 1,024 | 63.5 | Self-hosted, multilingual |
| mxbai-embed-large (open-source) | 1,024 | 64.7 | Self-hosted, cost-free |
Model mismatch: You must embed queries and documents with the same model. Mixing models produces meaningless comparisons — the vector spaces are completely incompatible even when the dimensions match.
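To see why, here is a sketch using the dimensions parameter supported by the text-embedding-3 models, which forces both to emit 1536-dimensional vectors (the printed score is arbitrary; that is the point):

def embed_with(model: str, text: str) -> np.ndarray:
    # text-embedding-3 models accept a dimensions parameter,
    # so both models can be forced to 1536 dimensions here
    response = client.embeddings.create(model=model, input=text, dimensions=1536)
    return np.array(response.data[0].embedding)

a = embed_with("text-embedding-3-small", "photosynthesis")
b = embed_with("text-embedding-3-large", "photosynthesis")

# Identical input, identical dimension count, different models:
# the axes of the two spaces do not line up, so this score is noise
print(cosine_similarity(a, b))  # not ~1.0, despite identical input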
Use cases beyond RAG
- Recommendation systems — embed user history and items; find items closest to the user's "taste vector"
- Duplicate detection — find near-identical documents in a corpus (cosine similarity > 0.97)
- Classification — train a lightweight classifier (logistic regression, SVM) on top of frozen embeddings — often beats fine-tuning for small datasets (see the sketch after this list)
- Clustering — K-Means or HDBSCAN over embeddings to discover semantic groups without labels
- Cross-lingual search — multilingual models embed English and Hindi into the same space; search Hindi docs with an English query
- Anomaly detection — inputs far from the distribution of "normal" embeddings may indicate unusual or adversarial inputs
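To make the classification bullet concrete, here is a minimal sketch using scikit-learn on top of the embed_batch helper from the search example. The texts and labels are made-up toy data; real use needs far more examples per class:

from sklearn.linear_model import LogisticRegression

# Toy labeled dataset (illustrative only)
texts = [
    "My package never arrived and tracking shows no updates.",
    "Where is my order? It has been two weeks.",
    "I was charged twice for the same subscription.",
    "Please refund the duplicate payment on my card.",
]
labels = ["shipping", "shipping", "billing", "billing"]

X = embed_batch(texts)  # frozen embeddings as feature vectors
clf = LogisticRegression(max_iter=1000).fit(X, labels)

query = embed_batch(["Why does my invoice show an extra charge?"])
print(clf.predict(query))  # ['billing']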
Practice questions
- What is the dot product between two unit vectors and why does cosine similarity use it? (Answer: Cosine similarity = (A·B)/(||A||·||B||). For unit vectors (||A||=||B||=1): cosine_similarity = A·B directly. Range: -1 (opposite directions) to +1 (same direction), 0 (orthogonal/unrelated). Used for embeddings because length is normalized — similarity reflects angular relationship (semantic closeness) not vector magnitude. In practice: normalize embeddings before cosine comparison; this also makes nearest neighbor search faster (dot product only).)
- What is the difference between sentence embeddings and word embeddings? (Answer: Word embeddings (Word2Vec, GloVe): one vector per word in the vocabulary, context-independent. 'bank' has the same vector whether in 'river bank' or 'bank account.' Sentence embeddings (SBERT, OpenAI text-embedding): one vector per sentence/passage, context-aware. The vector for 'I went to the bank to deposit money' reflects the financial sense. Sentence embeddings are typically produced by mean-pooling token embeddings or by using the [CLS] token from a transformer — capturing the full contextual meaning of the input.)
- What is semantic search vs keyword search and when is each appropriate? (Answer: Keyword search (BM25/TF-IDF): finds documents containing the exact query terms. Fast, interpretable, handles technical terms and product IDs exactly. Fails when query uses different vocabulary than documents. Semantic search (embedding similarity): finds semantically similar documents even with different vocabulary — 'affordable car' matches documents about 'budget vehicle' and 'cheap automobile.' Slower (requires embedding + ANN lookup). Use keyword for: exact product IDs, medical codes, legal citations. Use semantic for: user intent queries, cross-lingual retrieval, FAQ matching.)
- What is the 'curse of dimensionality' problem for high-dimensional embeddings? (Answer: In high dimensions (768, 1536, 3072), the volume grows exponentially — almost all points become roughly equidistant from each other. Nearest neighbor search loses discriminative power: the difference between the closest and farthest neighbor becomes proportionally small. Practical effect: at very high dimensions, cosine similarities cluster around 0 for all pairs. Mitigations: dimensionality reduction (PCA to 256 dims), Matryoshka embeddings (encode most information in first 256 dims), and ANN algorithms (HNSW) that navigate the manifold structure rather than brute-force comparing all distances. A quick numerical demo of this concentration effect follows this list.)
- What is fine-tuning embedding models and when is it necessary? (Answer: Pretrained embeddings (text-embedding-3-large, BAAI/bge): trained on general web text. May not capture domain-specific similarity. Fine-tuning: train on (query, positive_document, negative_document) triplets from your domain using contrastive loss. When necessary: (1) Highly specialized vocabulary (legal, medical, chemical). (2) Custom similarity notion (you want 'similar' to mean something specific). (3) When out-of-box retrieval quality is below 70% accuracy. Fine-tuned embedding models often improve RAG retrieval by 10–20 percentage points on domain-specific tasks.)
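The concentration effect from the curse-of-dimensionality answer above is easy to verify with plain NumPy; random vectors stand in for unrelated inputs, and no API calls are needed:

import numpy as np

rng = np.random.default_rng(0)

for dim in (3, 64, 1536):
    # 500 random unit vectors; cosine similarity reduces to a dot product
    vecs = rng.standard_normal((500, dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    off_diag = sims[~np.eye(500, dtype=bool)]
    print(f"dim={dim:5d}  mean={off_diag.mean():+.3f}  std={off_diag.std():.3f}")

# The spread shrinks roughly like 1/sqrt(dim): random pairs concentrate
# near similarity 0, which is why raw distances lose contrast in high
# dimensions unless the embedding model places related texts along
# shared directions.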
LumiChats uses text-embedding-3-large (OpenAI's best embedding model, 3072 dimensions) for Study Mode and Memory. Document chunks and memories are stored as embeddings in pgvector and retrieved using cosine similarity search.