
Imagine you're building a customer support tool that searches through thousands of previous support tickets to find answers. A user types: "my app keeps freezing on startup." Your search system dutifully looks for tickets containing those exact words — and returns nothing useful, because the relevant tickets say things like "application crashes on launch" and "program hangs at boot screen." Same problem, completely different words. Traditional keyword search just failed you.
This is the fundamental gap that embeddings solve. Embeddings are a technique that transforms text — or any other data — into lists of numbers that capture meaning, not just spelling. When two pieces of text mean similar things, their number representations end up mathematically close to each other. Suddenly, "freezing on startup" and "crashes on launch" become neighbors in a mathematical space, and your search system finds exactly what it should.
By the end of this lesson, you'll understand how that transformation works, why it works, and how to actually use it in code. This is foundational knowledge for building RAG (Retrieval-Augmented Generation) systems, AI agents, and any application that needs to understand language rather than just match characters.
What you'll learn:
Before we dive into embeddings, it helps to really feel the limitation of the alternative.
Traditional search works by matching strings. When you search a database for "invoice payment", the system looks for documents containing those exact characters. This is fast and simple, but language is messy. People describe the same concept in wildly different ways:
A keyword system treats each of these as completely different. There's no awareness that they're semantically related — that they mean similar things.
There's a deeper issue too: keyword search has no concept of context. The word "bank" means something totally different in "river bank" versus "savings bank." Keyword search can't distinguish them. Embeddings can, because they encode the surrounding context of words, not just the words themselves.
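To make the limitation concrete, here is a minimal sketch of an all-words keyword matcher. The `tickets` list and the `keyword_search` helper are made up for illustration. Real search engines add stemming and ranking, but the blind spot is the same: no shared words, no match.

```python
# Toy keyword search: a document matches only if it contains every query word.
# The tickets and the helper name are illustrative, not from a real system.
tickets = [
    "application crashes on launch",
    "program hangs at boot screen",
    "invoice payment was declined",
]


def keyword_search(query: str, documents: list[str]) -> list[str]:
    """Return documents that contain every word of the query (case-insensitive)."""
    query_words = set(query.lower().split())
    return [doc for doc in documents if query_words <= set(doc.lower().split())]


# Same problem as the first two tickets, but different words: nothing matches.
print(keyword_search("my app keeps freezing on startup", tickets))

# Exact words present: the match is found.
print(keyword_search("invoice payment", tickets))
```

The first query comes back empty even though two of the tickets describe the same problem.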
An embedding is a fixed-length list of numbers that represents a piece of text. That's it. If you run the sentence "the dog chased the ball" through an embedding model, you get back something like:
[0.023, -0.187, 0.441, 0.012, -0.303, ... , 0.198]
This list might have 384 numbers, or 768, or 1536 — depending on which embedding model you use. Each individual number doesn't have a clean human-readable meaning like "how dog-related is this sentence." Instead, the pattern of all the numbers together encodes the semantic content.
Think of it like GPS coordinates. A single latitude number doesn't tell you much — 51.5 doesn't mean "London" by itself. But the combination of latitude and longitude pinpoints an exact location. Embeddings work similarly: the full list of numbers together pinpoints a location in a high-dimensional semantic space.
We call this list of numbers a vector. In mathematics, a vector is just an ordered list of numbers. The space that all these vectors live in is called a vector space or embedding space. For a model producing 768-dimensional embeddings, this is a space with 768 axes — impossible to visualize, but perfectly calculable.
The critical property that makes this useful: the model is trained so that text with similar meanings produces vectors that are close together in this space. "Dog" and "puppy" will have vectors that are nearby. "Dog" and "quarterly earnings report" will have vectors far apart.
You don't need to build an embedding model — you'll use pre-trained ones. But understanding how they learn to place meaning in space will help you use them more wisely.
Modern embedding models are built on transformer neural networks — the same architecture behind GPT and other large language models. During training, the model processes enormous amounts of text and learns from patterns in that text.
The training signal often comes from tasks like: "given these two sentences, predict whether they say similar things." After seeing millions of examples — scientific papers paired with their abstracts, questions paired with their answers, product reviews with their ratings — the model's internal weights adjust until the vector it produces for similar content ends up in similar regions of the vector space.
What emerges is remarkable. The model learns to encode concepts, topics, sentiment, and even analogical relationships into geometric structure. One famous example comes from earlier word-embedding models such as word2vec: if you take the vector for "king", subtract the vector for "man", and add the vector for "woman", you end up very close to the vector for "queen." Meaning has become arithmetic.
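The arithmetic is easier to see with a toy. The sketch below uses hand-made two-dimensional vectors, where one axis stands for "royalty" and the other for "gender". These values are invented purely for illustration. A real model learns hundreds of dimensions with no such tidy labels, and the analogy only holds approximately.

```python
# Hand-made 2-D "embeddings": axis 0 = royalty, axis 1 = gender (+1 male, -1 female).
# All values are invented for illustration; nothing here comes from a trained model.
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def subtract(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def nearest_word(vector, vocabulary):
    """Return the word whose vector has the smallest squared Euclidean distance."""
    def squared_distance(word):
        return sum((x - y) ** 2 for x, y in zip(vector, vocabulary[word]))
    return min(vocabulary, key=squared_distance)


# king - man + woman
result = add(subtract(toy_vectors["king"], toy_vectors["man"]), toy_vectors["woman"])
print(result, "is closest to", nearest_word(result, toy_vectors))
```

In this toy space the result lands exactly on "queen". With learned embeddings it would only land nearby.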
Key insight: You're not programming these relationships manually. They emerge automatically from training on language patterns. This is why embeddings are so powerful — they capture the full richness of how language is actually used.
Let's make this concrete. We'll use the sentence-transformers library, which provides high-quality pre-trained embedding models that run locally and are free to use.
First, install the library:
pip install sentence-transformers
Now let's generate embeddings for a few sentences:
from sentence_transformers import SentenceTransformer
# Load a pre-trained embedding model
# all-MiniLM-L6-v2 is small, fast, and surprisingly capable
model = SentenceTransformer('all-MiniLM-L6-v2')
# A few sentences about similar and different topics
sentences = [
    "The application crashes immediately after launch.",
    "My software freezes when I try to open it.",
    "I can't get the program to start without it hanging.",
    "The payment failed and I was charged twice.",
    "I got billed twice for the same order.",
]
# Generate embeddings — returns a NumPy array of shape (5, 384)
embeddings = model.encode(sentences)
print(f"Shape of embeddings array: {embeddings.shape}")
print(f"Each embedding has {embeddings.shape[1]} dimensions")
print(f"\nFirst few values of embedding 1:\n{embeddings[0][:8]}")
Output:
Shape of embeddings array: (5, 384)
Each embedding has 384 dimensions
First few values of embedding 1:
[ 0.0231 -0.0412 0.0887 0.1203 -0.0654 0.0321 0.0911 -0.0187]
You've just turned five sentences into five vectors. Each sentence is now a point in a 384-dimensional space. But how do we check whether the "crashing" sentences are actually close to each other and far from the "billing" sentences?
Once you have vectors, you need a way to measure how close they are. The most common method for embeddings is cosine similarity.
Here's the intuition: imagine two arrows (vectors) pointing out from the origin. If they point in nearly the same direction, they're similar. If they point in opposite directions, they're dissimilar. Cosine similarity is the cosine of the angle between them, so it depends only on direction, not on how long the arrows are.
Why angle rather than straight-line distance? Because embedding models tend to encode meaning in the direction of a vector more than its magnitude (length). Two vectors can be at very different distances from the origin but point in the same direction — which would mean they carry similar meaning.
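Cosine similarity produces a score between -1 and 1: a score of 1 means the vectors point in exactly the same direction, 0 means they are perpendicular (unrelated), and -1 means they point in opposite directions.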
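Under the hood, the score is the dot product of the two vectors divided by the product of their lengths. Here is a from-scratch sketch in plain Python, using tiny made-up vectors so the landmark values are easy to check. In practice you would rely on a library routine, but the arithmetic is the same idea.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different lengths: score is (approximately) 1
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))

# Perpendicular vectors: score is 0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))

# Opposite directions: score is (approximately) -1
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))
```

The first pair differs in length by a factor of two yet still scores about 1, which is why vector magnitude does not affect the result.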
Let's calculate this for our five sentences:
from sentence_transformers import util
# Compute pairwise cosine similarity for all sentences
cosine_scores = util.cos_sim(embeddings, embeddings)
print("Cosine Similarity Matrix:")
print("(Rows and columns correspond to our 5 sentences)\n")
# Print each row with scores rounded to two decimals
for i, row in enumerate(cosine_scores):
    scores = ", ".join(f"{score:.2f}" for score in row)
    print(f"Sentence {i+1}: [{scores}]")
Output:
Cosine Similarity Matrix:
(Rows and columns correspond to our 5 sentences)
Sentence 1: [1.00, 0.81, 0.77, 0.12, 0.15]
Sentence 2: [0.81, 1.00, 0.79, 0.10, 0.13]
Sentence 3: [0.77, 0.79, 1.00, 0.09, 0.11]
Sentence 4: [0.12, 0.10, 0.09, 1.00, 0.84]
Sentence 5: [0.15, 0.13, 0.11, 0.84, 1.00]
Look at that. Sentences 1, 2, and 3 (all about crashing/freezing) have similarity scores of 0.77–0.81 with each other, but only 0.09–0.15 with sentences 4 and 5 (about billing). Meanwhile, sentences 4 and 5 score 0.84 with each other.
The model has never been explicitly told "crashing" and "freezing" are related. It learned this from reading vast amounts of text where these concepts appeared in similar contexts.
Now let's put this together into something genuinely useful: a function that takes a query and finds the most semantically relevant result from a collection of documents.
from sentence_transformers import SentenceTransformer, util
import torch
model = SentenceTransformer('all-MiniLM-L6-v2')
# A small knowledge base of support articles
knowledge_base = [
    "To reset your password, click 'Forgot Password' on the login screen and follow the email instructions.",
    "If the application crashes on startup, try clearing the cache folder located in AppData/Local.",
    "Billing discrepancies and duplicate charges are handled by the billing team at billing@support.com.",
    "To export your data, navigate to Settings > Data Management > Export and choose your format.",
    "If the app is running slowly, check that your system meets the minimum RAM requirement of 8GB.",
    "Two-factor authentication can be enabled under Security Settings in your account dashboard.",
    "To cancel your subscription, go to Account > Subscription > Cancel Plan.",
    "If videos won't play, ensure your graphics drivers are up to date and hardware acceleration is enabled.",
]
# Pre-compute embeddings for all documents in our knowledge base
# In a real system, you'd store these in a vector database
kb_embeddings = model.encode(knowledge_base, convert_to_tensor=True)
def semantic_search(query: str, top_k: int = 3) -> list:
    """
    Find the top_k most semantically similar documents to a query.
    """
    # Embed the query using the same model
    query_embedding = model.encode(query, convert_to_tensor=True)
    # Calculate similarity between query and all documents
    scores = util.cos_sim(query_embedding, kb_embeddings)[0]
    # Take the highest scores, sorted descending (capped at the number of documents)
    top_results = torch.topk(scores, k=min(top_k, len(knowledge_base)))
    results = []
    for score, idx in zip(top_results.values, top_results.indices):
        results.append({
            "document": knowledge_base[int(idx)],
            "score": round(score.item(), 3)
        })
    return results
# Try it out
query = "my program keeps hanging when I open it"
print(f"Query: '{query}'\n")
print("Top matches:")
for i, result in enumerate(semantic_search(query), 1):
    print(f"\n{i}. Score: {result['score']}")
    print(f"   {result['document']}")
Output:
Query: 'my program keeps hanging when I open it'
Top matches:
1. Score: 0.743
If the application crashes on startup, try clearing the cache folder...
2. Score: 0.521
If the app is running slowly, check that your system meets the minimum RAM...
3. Score: 0.412
To reset your password, click 'Forgot Password' on the login screen...
The query "keeps hanging when I open it" correctly surfaces the crash/startup article first, even though none of those exact words appear in the document.
Important: Notice that we encode the knowledge base once and reuse those embeddings for every query. Embedding is computationally expensive relative to a similarity lookup. In production systems, you'd store pre-computed embeddings in a vector database (like Pinecone, Weaviate, or pgvector) and only compute the query embedding at search time.
Not all embedding models are equal, and choosing wisely matters. Here's what to consider:
Dimensionality: More dimensions generally means more expressive power, but also more storage and slower search. all-MiniLM-L6-v2 produces 384-dimensional vectors and is great for getting started. OpenAI's text-embedding-3-large produces 3072-dimensional vectors and generally scores higher on retrieval benchmarks, at the cost of eight times the storage per vector.
Domain: General-purpose models work surprisingly well across domains, but specialized models exist. If you're building search over legal documents, a model fine-tuned on legal text will outperform a general model.
Maximum token length: Most embedding models have a maximum input length, often 512 tokens (roughly 380 words). Some are shorter: all-MiniLM-L6-v2 truncates input at 256 word pieces by default. If you try to embed a 5,000-word document, everything past the limit is dropped. For long documents, you need a chunking strategy: split the document into overlapping passages and embed each passage separately.
API vs. local: OpenAI's embedding API is convenient and produces excellent results, but costs money and requires network calls. sentence-transformers models run locally, are free, and are fast enough for many use cases.
# Using OpenAI's embedding API (requires: pip install openai)
from openai import OpenAI
client = OpenAI() # Uses OPENAI_API_KEY environment variable
response = client.embeddings.create(
input="The application crashes immediately after launch.",
model="text-embedding-3-small" # 1536 dimensions, cost-effective
)
embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}") # 1536
Tip: Always use the same model to embed both your documents and your queries. An embedding from all-MiniLM-L6-v2 and an embedding from OpenAI's model live in completely different vector spaces; comparing them gives meaningless results.
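The chunking strategy mentioned under maximum token length can be sketched in a few lines. The version below is a minimal sketch that splits on whitespace and counts words, which only approximates a model's token count. The function name `chunk_words` and the default sizes are assumptions for illustration, not recommended settings.

```python
def chunk_words(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Tiny demonstration with a 10-word "document"
demo_text = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(demo_text, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk can then be passed to the embedding model in place of the full document, with the chunk text stored alongside its vector so a match can be traced back to its source passage.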
Build a semantic search system over a realistic dataset. Here's your challenge:
Scenario: You work at a recipe platform. Users search for recipes using natural language, but your database only has recipe titles and descriptions. Build a semantic search function that handles queries like "something warm for a cold night" matching recipes like "hearty beef stew" or "spiced lentil soup."
Step 1: Create a list of at least 15 recipe descriptions. Include a mix of cuisines and meal types. Make them specific enough to be interesting — not just "pasta dish" but "creamy carbonara with pancetta and black pepper."
Step 2: Generate embeddings for all recipe descriptions and store them.
Step 3: Write a find_recipes(query, top_k=3) function that takes a natural language query and returns the top matches with their similarity scores.
Step 4: Test your function with at least five queries that use different vocabulary than what's in your recipe descriptions. For example:
Step 5: Find a query where the results surprise you — where the model finds a connection you didn't expect. Write a short note explaining why you think the model made that connection.
Stretch goal: Add a threshold filter so your function only returns results above a certain similarity score (e.g., 0.4). Return a message like "No close matches found" when nothing meets the threshold.
Mistake 1: Comparing embeddings from different models
If you embed your knowledge base with one model and your queries with another, you'll get nonsensical similarity scores. The vectors live in different spaces. Always use the same model, consistently, throughout a project.
Mistake 2: Embedding entire long documents
Most models silently truncate input that exceeds their token limit. You might embed a 10-page PDF and only the first ~380 words actually get encoded. For long documents, split them into chunks of 200-400 words (with some overlap between chunks) and embed each chunk separately.
Mistake 3: Treating cosine similarity scores as absolute thresholds
A similarity score of 0.7 might be a great match in one context and a terrible one in another. It depends on your domain, your model, and how diverse your documents are. Always calibrate thresholds empirically by testing with real queries.
Mistake 4: Re-embedding the knowledge base on every search
This is a performance killer. Embedding is the expensive step. Compute embeddings once, save them (to disk, or better, a vector database), and load them at search time. Only compute the query embedding on the fly.
# Save embeddings to disk (so you don't recompute them)
import numpy as np
import torch
# Save (move the tensor to the CPU first; .numpy() fails on GPU tensors)
np.save('kb_embeddings.npy', kb_embeddings.cpu().numpy())
# Load later
loaded_embeddings = torch.tensor(np.load('kb_embeddings.npy'))
Mistake 5: Assuming higher dimensions always means better results
For your specific task, a well-tuned smaller model often beats a larger general-purpose one. Always benchmark on your actual data before committing to a larger, more expensive model.
Troubleshooting: "My search returns irrelevant results"
Before assuming the model is bad, check: Are your documents specific enough? A knowledge base full of vague, short descriptions won't embed well. Are you chunking appropriately? Is your query using the kind of language the model was trained on? Try rephrasing queries and see if results change. If they do, the model is working, and your queries might just need to be clearer.
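Returning to Mistake 3, here is one possible way to calibrate a threshold empirically. The sketch assumes you have run some real queries, recorded the similarity score of each returned document, and labelled by hand whether it was actually relevant. The `labelled_scores` values below are invented for illustration, and picking the cutoff that maximizes accuracy is just one simple criterion among several you could choose.

```python
def best_threshold(scored_pairs):
    """Pick the cutoff that classifies the most labelled pairs correctly.

    scored_pairs: non-empty list of (similarity_score, is_relevant) tuples
    gathered from hand-labelled test queries.
    """
    # Only the observed scores need to be tried as candidate cutoffs
    candidates = sorted({score for score, _ in scored_pairs})

    def accuracy(threshold):
        correct = sum(
            (score >= threshold) == is_relevant
            for score, is_relevant in scored_pairs
        )
        return correct / len(scored_pairs)

    # On ties, max() keeps the first (lowest) candidate
    return max(candidates, key=accuracy)


# Invented example labels: (cosine score, was the result actually relevant?)
labelled_scores = [
    (0.82, True),
    (0.64, True),
    (0.58, False),
    (0.41, False),
    (0.35, False),
]
print(best_threshold(labelled_scores))
```

With these made-up labels the chosen cutoff is 0.64. A different model or document set would give a different value, which is exactly why the threshold should be measured rather than assumed.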
You now understand what embeddings are and why they matter. Let's recap the core ideas:
all-MiniLM-L6-v2 (local, free) or OpenAI's text-embedding-3-small (API, paid) are ready to use.
This is the foundation of everything in the RAG (Retrieval-Augmented Generation) world. When a RAG system "retrieves relevant context before generating an answer," the retrieval step is almost always semantic search, which is exactly what you've built here.
Next steps in this learning path:
Vector Databases: Learn how to store and query embeddings at scale using tools like Chroma, Weaviate, or pgvector, because comparing a query against every vector held in memory becomes slow and memory-hungry as the collection grows.
Chunking Strategies: When your documents are longer than a few paragraphs, how you split them dramatically affects retrieval quality. The next lesson covers fixed-size chunking, recursive chunking, and semantic chunking.
Building a RAG Pipeline: Once you can retrieve relevant passages with embeddings, the next step is passing those passages to a language model to generate grounded, accurate answers.
The mental model you've built today — that meaning can be a location in space, and that proximity in that space means semantic similarity — will be the foundation for every one of those topics.