Embeddings Explained: What They Are and How to Use Them with Python and Ollama (2026)
Embeddings are one of the most useful building blocks in modern AI applications, yet they are often explained poorly. This guide starts from zero — what an embedding actually is, why it works, and how to put it to use in Python using Ollama (free, local) or OpenAI.
By the end you will have working code for semantic search, duplicate detection, and storing embeddings in PostgreSQL with pgvector.
What Are Embeddings?
An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Texts with similar meanings get similar vectors.
"Python is a programming language" → [0.12, -0.45, 0.89, 0.03, ...]
"Python is used for coding" → [0.11, -0.43, 0.91, 0.02, ...] # very similar
"The Eiffel Tower is in Paris" → [-0.67, 0.23, -0.12, 0.78, ...] # very different
A typical embedding model produces a vector with hundreds or thousands of dimensions. You never interpret individual numbers — what matters is the geometric relationship between vectors. Texts that mean similar things cluster together in that high-dimensional space.
Use cases:
- Semantic search (find similar documents)
- Recommendation systems
- Duplicate detection
- Text classification
- RAG (Retrieval-Augmented Generation)
Embeddings are not the same as word counts or TF-IDF. Those methods care about which words appear; embeddings care about meaning. "automobile" and "car" are very different strings, but their embeddings will be nearly identical, as the quick preview below shows.
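As a preview, using the embed() and cosine_similarity() helpers built later in this guide:
# Uses embed() and cosine_similarity() from the sections below.
print("automobile" == "car")  # False: the strings share almost nothing
print(cosine_similarity(embed("automobile"), embed("car")))  # high score (exact value varies by model)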
Generate Embeddings with Ollama (Free, Local)
Ollama runs open-source models on your machine. The nomic-embed-text model is fast, produces 768-dimensional vectors, and requires no API key.
ollama pull nomic-embed-text   # a fast, widely used general-purpose embedding model
Call it via HTTP:
import requests

def embed(text: str) -> list[float]:
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
    )
    response.raise_for_status()  # fail loudly if the Ollama server is not running
    return response.json()["embedding"]
vector = embed("Python is a programming language")
print(f"Dimensions: {len(vector)}") # 768
print(f"First 5 values: {vector[:5]}")
Or Use the Official Ollama Python Library
pip install ollama
import ollama

response = ollama.embeddings(
    model="nomic-embed-text",
    prompt="Python is a programming language",
)
vector = response["embedding"]
The library handles connection pooling and is cleaner for production code. Under the hood it calls the same local HTTP API.
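If Ollama runs on a non-default host or port, the library also exposes a client object (the address shown here is the default, as an assumption):
import ollama

client = ollama.Client(host="http://localhost:11434")
vector = client.embeddings(model="nomic-embed-text", prompt="hello")["embedding"]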
Generate with OpenAI
If you prefer a cloud model or need higher-quality embeddings for a critical application:
from openai import OpenAI
client = OpenAI()
def embed_openai(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",  # 1536 dimensions, very cheap
        input=text,
    )
    return response.data[0].embedding
text-embedding-3-small costs roughly $0.02 per million tokens — extremely cheap. text-embedding-3-large (3072 dimensions) is more accurate for tasks where quality matters more than cost.
The rest of this guide works with either provider. Just swap embed() for embed_openai() and adjust the vector dimension where needed.
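For example, a simple toggle (USE_OPENAI is a hypothetical environment variable, not part of either SDK):
import os

# Hypothetical pattern: pick a provider once, then call embed_fn() everywhere.
embed_fn = embed_openai if os.getenv("USE_OPENAI") else embed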
Cosine Similarity: Measure How Similar Two Texts Are
The standard way to compare two embeddings is cosine similarity. It measures the angle between two vectors: 1.0 means identical direction (same meaning), 0.0 means perpendicular (unrelated), and -1.0 means opposite.
import numpy as np
def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
v1 = embed("Python is a programming language")
v2 = embed("Python is used for coding")
v3 = embed("The Eiffel Tower is in Paris")
print(cosine_similarity(v1, v2)) # ~0.95 (very similar)
print(cosine_similarity(v1, v3)) # ~0.20 (unrelated)
Install NumPy if needed:
pip install numpy
Cosine similarity is usually preferred over Euclidean distance for text embeddings because it compares only the direction of the vectors, ignoring their magnitude. In practice that means a short sentence and a long paragraph about the same topic can still score high.
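A quick way to see why: scaling a vector changes its Euclidean distance to the original but not its cosine similarity (illustrative values only):
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = 10 * v  # same direction, ten times the magnitude

print(np.linalg.norm(v - w))                # ~33.7: Euclidean distance treats these as far apart
print(cosine_similarity(list(v), list(w)))  # 1.0: the angle between them is zero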
Semantic Search: Find Similar Documents
With a small document set, you can do semantic search entirely in memory using NumPy:
import numpy as np
documents = [
    "Python is a programming language",
    "Docker containers run applications",
    "Machine learning uses neural networks",
    "Linux is an open-source operating system",
    "Flask is a Python web framework",
]
# Pre-compute embeddings for all documents
doc_embeddings = [embed(doc) for doc in documents]
def search(query: str, top_k: int = 3) -> list[tuple[str, float]]:
    query_embedding = embed(query)
    scores = [
        (doc, cosine_similarity(query_embedding, doc_emb))
        for doc, doc_emb in zip(documents, doc_embeddings)
    ]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:top_k]
results = search("web development with Python")
for doc, score in results:
    print(f"{score:.3f}: {doc}")

# Example output (exact scores vary by model):
# 0.891: Flask is a Python web framework
# 0.743: Python is a programming language
# 0.312: Docker containers run applications
The key insight: you pre-compute document embeddings once and store them. At query time, you only embed the query (one call), then compute similarity against all stored vectors. This is fast enough for thousands of documents in pure Python.
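If the same texts get embedded repeatedly, for example across script runs or repeated queries, a simple in-memory cache avoids redundant model calls. A minimal sketch:
embedding_cache: dict[str, list[float]] = {}

def embed_cached(text: str) -> list[float]:
    # Exact-match cache: identical text reuses the stored vector instead of calling the model.
    if text not in embedding_cache:
        embedding_cache[text] = embed(text)
    return embedding_cache[text]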
Keyword Search vs Semantic Search
The difference becomes obvious when the query uses different words than the documents:
# Keyword search: misses synonyms and related concepts
def keyword_search(query, docs):
    return [d for d in docs if query.lower() in d.lower()]

print(keyword_search("ML", documents))  # []: no document contains the literal string "ML"

# Semantic search: understands meaning
results = search("ML")
print(results[0])  # top hit: "Machine learning uses neural networks" (exact score varies by model)
Semantic search also handles misspellings, synonyms, and paraphrases gracefully — anything the embedding model learned during training.
Store Embeddings in NumPy (Small Scale)
For up to tens of thousands of documents, NumPy arrays stored on disk are sufficient:
import numpy as np
import pickle
# Save
matrix = np.array(doc_embeddings)
np.save("embeddings.npy", matrix)
with open("documents.pkl", "wb") as f:
pickle.dump(documents, f)
# Load
matrix = np.load("embeddings.npy")
with open("documents.pkl", "rb") as f:
documents = pickle.load(f)
For vectorized similarity search over the whole matrix at once (much faster than a Python loop):
def fast_search(query: str, matrix: np.ndarray, documents: list[str], top_k: int = 3):
    q = np.array(embed(query))
    # Dot product of query with every row; normalise both sides
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(q)
    similarities = matrix @ q / norms
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return [(documents[i], float(similarities[i])) for i in top_indices]
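Usage mirrors the earlier search() function (the query string is just an example):
matrix = np.load("embeddings.npy")
for doc, score in fast_search("open-source operating systems", matrix, documents):
    print(f"{score:.3f}: {doc}")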
Store in PostgreSQL with pgvector (Production)
For production systems with millions of documents, you need a vector database. pgvector is a PostgreSQL extension that adds vector types and approximate nearest-neighbour search — no separate infrastructure required if you already use Postgres.
-- Install extension
CREATE EXTENSION vector;
-- Create table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(768)  -- match your model's dimensions
);
-- Create index for fast search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);
import psycopg2
import json
conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
cur = conn.cursor()
def store_document(content: str):
    embedding = embed(content)
    cur.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
        (content, json.dumps(embedding)),
    )
    conn.commit()

def semantic_search(query: str, limit: int = 5):
    query_embedding = embed(query)
    cur.execute(
        """
        SELECT content, 1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (json.dumps(query_embedding), json.dumps(query_embedding), limit),
    )
    return cur.fetchall()
The <=> operator computes cosine distance, and 1 - distance converts it to a similarity score where 1.0 is most similar. The ivfflat index performs approximate nearest-neighbour search, which keeps queries fast even with millions of rows; for best recall, create the index after bulk-loading your data rather than on an empty table.
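Because ivfflat is approximate, pgvector exposes a probes setting to trade speed for recall (10 here is an arbitrary example value; tune it against your data):
# More probes = better recall but slower queries (pgvector's ivfflat.probes setting).
cur.execute("SET ivfflat.probes = 10")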
Install the Python driver:
pip install psycopg2-binary
Batch Embed for Efficiency
Embedding one document at a time is slow when processing large datasets. The /api/embeddings endpoint accepts a single prompt per request, so the helper below still makes one call per text; the grouping mainly gives you a convenient place to add progress reporting or parallelism, and mirrors the genuinely batched OpenAI version further down:
def batch_embed(texts: list[str], batch_size: int = 32) -> list[list[float]]:
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for text in batch:  # still one HTTP request per text (see note above)
            embeddings.append(embed(text))
    return embeddings
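Newer Ollama releases also expose a batch endpoint (/api/embed, surfaced as ollama.embed() in recent versions of the Python library) that accepts a list of inputs in one call. A minimal sketch, assuming a recent Ollama version:
import ollama

# Assumes a recent Ollama release with the batch /api/embed endpoint.
response = ollama.embed(model="nomic-embed-text", input=["first text", "second text"])
vectors = response["embeddings"]  # one vector per input, in order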
With the OpenAI API you can pass a list directly to client.embeddings.create(input=batch) and get all embeddings in one network round-trip, which is significantly faster:
def batch_embed_openai(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        embeddings.extend([item.embedding for item in response.data])
    return embeddings
Practical: Deduplicate Documents
Embeddings make it easy to find near-duplicate content even when the wording differs:
def find_duplicates(documents: list[str], threshold: float = 0.95) -> list[tuple[int, int, float]]:
    embeddings = batch_embed(documents)
    duplicates = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):  # compare every pair once
            sim = cosine_similarity(embeddings[i], embeddings[j])
            if sim > threshold:
                duplicates.append((i, j, sim))
    return duplicates
docs = [
    "How do I install Python on Ubuntu?",
    "Steps to install Python on Ubuntu Linux",
    "What is the capital of France?",
]

dupes = find_duplicates(docs, threshold=0.90)
for i, j, sim in dupes:
    print(f"Duplicate ({sim:.2f}): '{docs[i]}' and '{docs[j]}'")
# Example output (exact score varies by model):
# Duplicate (0.94): 'How do I install Python on Ubuntu?' and 'Steps to install Python on Ubuntu Linux'
This is useful for cleaning training data, deduplicating knowledge bases, or flagging repeated support tickets.
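The nested loop above compares every pair in Python, which gets slow beyond a few thousand documents. The same pairwise comparison can be done as a single matrix product; a sketch using NumPy:
import numpy as np

E = np.array(batch_embed(documents))
E = E / np.linalg.norm(E, axis=1, keepdims=True)  # normalise each row to unit length
S = E @ E.T                                       # S[i, j] = cosine similarity of docs i and j
pairs = np.argwhere(np.triu(S, k=1) > 0.95)       # upper triangle only: each pair once, no self-matches
This still needs O(n^2) memory for the similarity matrix, so for very large collections use the pgvector approach instead.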
Choosing the Right Model
| Model | Dimensions | Cost | Best for |
|---|---|---|---|
| nomic-embed-text (Ollama) | 768 | Free (local) | Local dev, private data |
| text-embedding-3-small (OpenAI) | 1536 | Very cheap | Production, cloud |
| text-embedding-3-large (OpenAI) | 3072 | Cheap | High-accuracy retrieval |
| mxbai-embed-large (Ollama) | 1024 | Free (local) | Better quality local option |
For most applications, nomic-embed-text with Ollama is the right starting point: zero cost, runs offline, and good enough for semantic search over tens of thousands of documents. Switch to an OpenAI model when you need higher retrieval quality or are deploying somewhere that running a local model is impractical.
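Whichever model you choose, verify its output dimension before wiring it into storage, since the vector(N) column in the pgvector table must match it exactly:
dim = len(embed("dimension check"))
print(dim)  # e.g. 768 for nomic-embed-text, 1536 for text-embedding-3-small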