Production Vector Databases

16 December 2025 at 22:31

Previously, we saw something interesting when we added metadata filtering to our arXiv paper search. Filtering added significant overhead to our queries. Category filters made queries 3.3x slower. Year range filters added 8x overhead. Combined filters landed somewhere in the middle at 5x.

That’s fine for a learning environment or a small-scale prototype. But if you’re building a real application where users are constantly filtering by category, date ranges, or combinations of metadata fields, those milliseconds add up fast. When you’re handling hundreds or thousands of queries per hour, they really start to matter.

Let’s see if production databases handle this better. We’ll go beyond ChromaDB and get hands-on with three production-grade vector databases. We won’t just read about them. We’ll actually set them up, load our data, run queries, and measure what happens.

Here’s what we’ll build:

  1. PostgreSQL with pgvector: The SQL integration play. We’ll add vector search capabilities to a traditional database that many teams already run.
  2. Qdrant: The specialized vector database. Built from the ground up in Rust for handling filtered vector search efficiently.
  3. Pinecone: The managed service approach. We’ll see what it’s like when someone else handles all the infrastructure.

By the end, you’ll have hands-on experience with all three approaches, real performance data showing how they compare, and a framework for choosing the right database for your specific situation.

What You Already Know

This tutorial assumes you understand:

  • What embeddings are and how similarity search works
  • How to use ChromaDB for basic vector operations
  • Why metadata filtering matters for real applications
  • The performance characteristics of ChromaDB’s filtering

If any of these topics are new to you, we recommend checking out these previous posts:

  1. Introduction to Vector Databases using ChromaDB
  2. Document Chunking Strategies for Vector Databases
  3. Metadata Filtering and Hybrid Search for Vector Databases

They’ll give you the foundation needed to get the most out of what we’re covering here.

What You’ll Learn

By working through this tutorial, you’ll:

  • Set up and configure three different production vector databases
  • Load the same dataset into each one and run identical queries
  • Measure and compare performance characteristics
  • Understand the tradeoffs: raw speed, filtering efficiency, operational overhead
  • Learn when to choose each database based on your team’s constraints
  • Get comfortable with different database architectures and APIs

A Quick Note on “Production”

When we say “production database,” we don’t mean these are only for big companies with massive scale. We mean these are databases you could actually deploy in a real application that serves real users. They handle the edge cases, offer reasonable performance at scale, and have communities and documentation you can rely on.

That said, “production-ready” doesn’t mean “production-required.” ChromaDB is perfectly fine for many applications. The goal here is to expand your toolkit so you can make informed choices.

Setup: Two Paths Forward

Before we get into our three vector databases, we need to talk about how we’re going to run them. You have two options, and neither is wrong.

Option 1: Docker (Recommended)

We recommend using Docker for this tutorial because it lets you run all three databases side-by-side without any conflicts. You can experiment, break things, start over, and when you’re done, everything disappears cleanly with a single command.

More importantly, this is how engineers actually work with databases in development. You spin up containers, test things locally, then deploy similar containers to production. Learning this pattern now gives you a transferable skill.

If you’re new to Docker, don’t worry. You don’t need to become a Docker expert. We’ll use it like a tool that creates safe workspaces. Think of it as running each database in its own isolated bubble on your computer.

Here’s what we’ll set up:

  • A workspace container where you’ll write and run Python code
  • A PostgreSQL container with pgvector already installed
  • A Qdrant container running the vector database
  • Shared folders so your code and data persist between sessions

Everything stays on your actual computer in folders you can see and edit. The containers just provide the database software and Python environment.

Want to learn more about Docker? We have an excellent guide on setting up data engineering labs with Docker: Setting Up Your Data Engineering Lab with Docker

Option 2: Direct Installation (Alternative)

If you prefer to install things directly on your system, or if Docker won’t work in your environment, that’s totally fine. The notes for direct installation users later in this section cover what you’ll need.

The direct installation path means you can’t easily run all three databases simultaneously for side-by-side comparison, but you’ll still learn the concepts and get hands-on experience with each one.

What We’re Using

Throughout this tutorial, we’ll use the same dataset we’ve been working with: 5,000 arXiv papers with pre-generated embeddings. If you don’t have these files yet, you’ll need to download them first.

If you already have these files from previous work, you’re all set.

Docker Setup Instructions

Let’s get the Docker environment running. First, create a folder for this project:

mkdir vector_dbs
cd vector_dbs

Create a structure for your files:

mkdir code data

Put your dataset files (arxiv_papers_5k.csv and embeddings_cohere_5k.npy) in the data/ folder.

Now create a file called docker-compose.yml in the vector_dbs folder:

services:
  lab:
    image: python:3.12-slim
    volumes:
      - ./code:/code
      - ./data:/data
    working_dir: /code
    stdin_open: true
    tty: true
    depends_on: [postgres, qdrant]
    networks: [vector_net]
    environment:
      POSTGRES_HOST: postgres
      QDRANT_HOST: qdrant

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_PASSWORD: tutorial_password
      POSTGRES_DB: arxiv_db
    ports: ["5432:5432"]
    volumes: [postgres_data:/var/lib/postgresql/data]
    networks: [vector_net]

  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333", "6334:6334"]
    volumes: [qdrant_data:/qdrant/storage]
    networks: [vector_net]

networks:
  vector_net:

volumes:
  postgres_data:
  qdrant_data:

This configuration sets up three containers:

  • lab: Your Python workspace where you’ll run code
  • postgres: PostgreSQL database with pgvector pre-installed
  • qdrant: Qdrant vector database

The databases store their data in Docker volumes (postgres_data, qdrant_data), which means your data persists even when you stop the containers.

Start the databases:

docker compose up -d postgres qdrant

The -d flag runs them in the background. You should see Docker downloading the images (first time only) and then starting the containers.

Now enter your Python workspace:

docker compose run --rm lab

The --rm flag tells Docker to automatically remove the container when you exit. Don’t worry about losing your work. Your code in the code/ folder and data in data/ folder are safe. Only the temporary container workspace gets cleaned up.

You’re now inside a container with Python 3.12. Your code/ and data/ folders from your computer are available here at /code and /data.

Create a requirements.txt file in your code/ folder with the packages we’ll need:

psycopg2-binary==2.9.9
pgvector==0.2.4
qdrant-client==1.16.1
pinecone==5.0.1
cohere==5.11.0
numpy==1.26.4
pandas==2.2.0
python-dotenv==1.0.1

Install the packages:

pip install -r requirements.txt

Perfect! You now have a safe environment where you can experiment with all three databases.

When you’re done working, just type exit to leave the container, then:

docker compose down

This stops the databases. Your data is safe in Docker volumes. Next time you want to work, just run docker compose up -d postgres qdrant and docker compose run --rm lab again.

A Note for Direct Installation Users

If you’re going the direct installation route, you’ll need:

For PostgreSQL + pgvector:

  • Install PostgreSQL locally (or use an existing instance), then add the pgvector extension following its installation guide

For Qdrant:

  • Option A: Install Qdrant locally following their installation guide
  • Option B: Skip Qdrant for now and focus on pgvector and Pinecone

Python packages:
Use the same requirements.txt from above and install with pip install -r requirements.txt

Alright, setup is complete. Let’s build something.


Database 1: PostgreSQL with pgvector

If you’ve worked with data at all, you’ve probably encountered PostgreSQL. It’s everywhere. It powers everything from tiny startups to massive enterprises. Many teams already have Postgres running in production, complete with backups, monitoring, and people who know how to keep it healthy.

So when your team needs vector search capabilities, a natural question is: “Can we just add this to our existing database?”

That’s exactly what pgvector does. It’s a PostgreSQL extension that adds vector similarity search to a database you might already be running. No new infrastructure to learn, no new backup strategies, no new team to build. Just install an extension and suddenly you can store embeddings alongside your regular data.

Let’s see what that looks like in practice.

Loading Data into PostgreSQL

We’ll start by creating a table that stores our paper metadata and embeddings together. Create a file called load_pgvector.py in your code/ folder:

import psycopg2
from pgvector.psycopg2 import register_vector
import numpy as np
import pandas as pd
import os

# Connect to PostgreSQL
# If using Docker, these environment variables are already set
db_host = os.getenv('POSTGRES_HOST', 'localhost')
conn = psycopg2.connect(
    host=db_host,
    database="arxiv_db",
    user="postgres",
    password="tutorial_password"
)
cur = conn.cursor()

# Enable pgvector extension
# This needs to happen BEFORE we register the vector type
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()

# Now register the vector type with psycopg2
# This lets us pass numpy arrays directly as vectors
register_vector(conn)

# Create table for our papers
# The vector(1536) column stores our 1536-dimensional embeddings
cur.execute("""
    CREATE TABLE IF NOT EXISTS papers (
        id TEXT PRIMARY KEY,
        title TEXT,
        authors TEXT,
        abstract TEXT,
        year INTEGER,
        category TEXT,
        embedding vector(1536)
    )
""")
conn.commit()

# Load the metadata and embeddings
papers_df = pd.read_csv('/data/arxiv_papers_5k.csv')
embeddings = np.load('/data/embeddings_cohere_5k.npy')

print(f"Loading {len(papers_df)} papers into PostgreSQL...")

# Insert papers in batches
# We'll do 500 at a time to keep transactions manageable
batch_size = 500
for i in range(0, len(papers_df), batch_size):
    batch_df = papers_df.iloc[i:i+batch_size]
    batch_embeddings = embeddings[i:i+batch_size]

    for j, (idx, row) in enumerate(batch_df.iterrows()):
        cur.execute("""
            INSERT INTO papers (id, title, authors, abstract, year, category, embedding)
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            ON CONFLICT (id) DO NOTHING
        """, (
            row['id'],
            row['title'],
            row['authors'],
            row['abstract'],
            row['year'],
            row['categories'],
            batch_embeddings[j]  # Pass numpy array directly
        ))

    conn.commit()
    print(f"  Loaded {min(i+batch_size, len(papers_df))} / {len(papers_df)} papers")

print("\nData loaded successfully!")

# Create HNSW index for fast similarity search
# This takes a couple seconds for 5,000 papers
print("Creating HNSW index...")
cur.execute("""
    CREATE INDEX IF NOT EXISTS papers_embedding_idx 
    ON papers 
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")
conn.commit()
print("Index created!")

# Verify everything worked
cur.execute("SELECT COUNT(*) FROM papers")
count = cur.fetchone()[0]
print(f"\nTotal papers in database: {count}")

cur.close()
conn.close()

Let’s break down what’s happening here:

  • Extension Setup: We enable the pgvector extension first, then register the vector type with our Python database driver. This order matters. If you try to register the type before the extension exists, you’ll get an error.
  • Table Structure: We’re storing both metadata (title, authors, abstract, year, category) and the embedding vector in the same table. The vector(1536) type tells PostgreSQL we want a 1536-dimensional vector column.
  • Passing Vectors: Thanks to the register_vector() call, we can pass numpy arrays directly. The pgvector library handles converting them to PostgreSQL’s vector format automatically. If you tried to pass a Python list instead, PostgreSQL would create a regular array type, which doesn’t support the distance operators we need.
  • HNSW Index: After loading the data, we create an HNSW index. The parameters m=16 and ef_construction=64 are defaults that work well for most cases. The index took about 2.8 seconds to build on 5,000 papers in our tests.
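
One related knob worth knowing about: m and ef_construction control how the index is built, but pgvector also exposes a query-time parameter, hnsw.ef_search (default 40), which trades a little latency for better recall. Here’s a minimal sketch of adjusting it, assuming a connection set up the same way as in load_pgvector.py (including the register_vector() call) and the embeddings array from that script:

# Sketch: widen the HNSW search at query time for better recall (and slightly
# higher latency). Assumes a psycopg2 connection set up as in load_pgvector.py
# (with register_vector) and the `embeddings` array from that script.
cur.execute("SET hnsw.ef_search = 100")  # pgvector's default is 40

cur.execute("""
    SELECT title, embedding <=> %s AS distance
    FROM papers
    ORDER BY embedding <=> %s
    LIMIT 10
""", (embeddings[0], embeddings[0]))
print(cur.fetchall()[:3])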

Run this script:

python load_pgvector.py

You should see output like this:

Loading 5000 papers into PostgreSQL...
  Loaded 500 / 5000 papers
  Loaded 1000 / 5000 papers
  Loaded 1500 / 5000 papers
  Loaded 2000 / 5000 papers
  Loaded 2500 / 5000 papers
  Loaded 3000 / 5000 papers
  Loaded 3500 / 5000 papers
  Loaded 4000 / 5000 papers
  Loaded 4500 / 5000 papers
  Loaded 5000 / 5000 papers

Data loaded successfully!
Creating HNSW index...
Index created!

Total papers in database: 5000

The data is now loaded and indexed in PostgreSQL.

Querying with pgvector

Now let’s write some queries. Create query_pgvector.py:

import psycopg2
from pgvector.psycopg2 import register_vector
import numpy as np
import os

# Connect and register vector type
db_host = os.getenv('POSTGRES_HOST', 'localhost')
conn = psycopg2.connect(
    host=db_host,
    database="arxiv_db",
    user="postgres",
    password="tutorial_password"
)
register_vector(conn)
cur = conn.cursor()

# Let's use a paper from our dataset as the query
# We'll find papers similar to a machine learning paper
cur.execute("""
    SELECT id, title, category, year, embedding
    FROM papers
    WHERE category = 'cs.LG'
    LIMIT 1
""")
result = cur.fetchone()
query_id, query_title, query_category, query_year, query_embedding = result

print("Query paper:")
print(f"  Title: {query_title}")
print(f"  Category: {query_category}")
print(f"  Year: {query_year}")
print()

# Scenario 1: Unfiltered similarity search
# The <=> operator computes cosine distance
print("=" * 80)
print("Scenario 1: Unfiltered Similarity Search")
print("=" * 80)
cur.execute("""
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    WHERE id != %s
    ORDER BY embedding <=> %s
    LIMIT 5
""", (query_embedding, query_id, query_embedding))

for row in cur.fetchall():
    print(f"  {row[1]:8} {row[2]} | {row[3]:.4f} | {row[0][:60]}")
print()

# Scenario 2: Filter by category
print("=" * 80)
print("Scenario 2: Category Filter (cs.LG only)")
print("=" * 80)
cur.execute("""
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    WHERE category = 'cs.LG' AND id != %s
    ORDER BY embedding <=> %s
    LIMIT 5
""", (query_embedding, query_id, query_embedding))

for row in cur.fetchall():
    print(f"  {row[1]:8} {row[2]} | {row[3]:.4f} | {row[0][:60]}")
print()

# Scenario 3: Filter by year range
print("=" * 80)
print("Scenario 3: Year Filter (2025 or later)")
print("=" * 80)
cur.execute("""
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    WHERE year >= 2025 AND id != %s
    ORDER BY embedding <=> %s
    LIMIT 5
""", (query_embedding, query_id, query_embedding))

for row in cur.fetchall():
    print(f"  {row[1]:8} {row[2]} | {row[3]:.4f} | {row[0][:60]}")
print()

# Scenario 4: Combined filters
print("=" * 80)
print("Scenario 4: Combined Filter (cs.LG AND year >= 2025)")
print("=" * 80)
cur.execute("""
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    WHERE category = 'cs.LG' AND year >= 2025 AND id != %s
    ORDER BY embedding <=> %s
    LIMIT 5
""", (query_embedding, query_id, query_embedding))

for row in cur.fetchall():
    print(f"  {row[1]:8} {row[2]} | {row[3]:.4f} | {row[0][:60]}")

cur.close()
conn.close()

This script tests the same four scenarios we measured previously:

  1. Unfiltered vector search
  2. Filter by category (text field)
  3. Filter by year range (integer field)
  4. Combined filters (category AND year)

Run it:

python query_pgvector.py

You’ll see output similar to this:

Query paper:
  Title: Deep Reinforcement Learning for Autonomous Navigation
  Category: cs.LG
  Year: 2025

================================================================================
Scenario 1: Unfiltered Similarity Search
================================================================================
  cs.LG    2024 | 0.2134 | Policy Gradient Methods for Robot Control
  cs.LG    2025 | 0.2287 | Multi-Agent Reinforcement Learning in Games
  cs.CV    2024 | 0.2445 | Visual Navigation Using Deep Learning
  cs.LG    2023 | 0.2591 | Model-Free Reinforcement Learning Approaches
  cs.CL    2025 | 0.2678 | Reinforcement Learning for Dialogue Systems

================================================================================
Scenario 2: Category Filter (cs.LG only)
================================================================================
  cs.LG    2024 | 0.2134 | Policy Gradient Methods for Robot Control
  cs.LG    2025 | 0.2287 | Multi-Agent Reinforcement Learning in Games
  cs.LG    2023 | 0.2591 | Model-Free Reinforcement Learning Approaches
  cs.LG    2024 | 0.2734 | Deep Q-Networks for Atari Games
  cs.LG    2025 | 0.2856 | Actor-Critic Methods in Continuous Control

================================================================================
Scenario 3: Year Filter (2025 or later)
================================================================================
  cs.LG    2025 | 0.2287 | Multi-Agent Reinforcement Learning in Games
  cs.CL    2025 | 0.2678 | Reinforcement Learning for Dialogue Systems
  cs.LG    2025 | 0.2856 | Actor-Critic Methods in Continuous Control
  cs.CV    2025 | 0.2923 | Self-Supervised Learning for Visual Tasks
  cs.DB    2025 | 0.3012 | Optimizing Database Queries with Learning

================================================================================
Scenario 4: Combined Filter (cs.LG AND year >= 2025)
================================================================================
  cs.LG    2025 | 0.2287 | Multi-Agent Reinforcement Learning in Games
  cs.LG    2025 | 0.2856 | Actor-Critic Methods in Continuous Control
  cs.LG    2025 | 0.3145 | Transfer Learning in Reinforcement Learning
  cs.LG    2025 | 0.3267 | Exploration Strategies in Deep RL
  cs.LG    2025 | 0.3401 | Reward Shaping for Complex Tasks

The queries work just like regular SQL. We’re just using the <=> operator for cosine distance instead of normal comparison operators.

Measuring Performance

Let’s get real numbers. Create benchmark_pgvector.py:

import psycopg2
from pgvector.psycopg2 import register_vector
import numpy as np
import time
import os

db_host = os.getenv('POSTGRES_HOST', 'localhost')
conn = psycopg2.connect(
    host=db_host,
    database="arxiv_db",
    user="postgres",
    password="tutorial_password"
)
register_vector(conn)
cur = conn.cursor()

# Get a query embedding
cur.execute("""
    SELECT embedding FROM papers 
    WHERE category = 'cs.LG' 
    LIMIT 1
""")
query_embedding = cur.fetchone()[0]

def benchmark_query(query, params, name, iterations=100):
    """Run a query multiple times and measure average latency"""
    # Warmup
    for _ in range(5):
        cur.execute(query, params)
        cur.fetchall()

    # Actual measurement
    times = []
    for _ in range(iterations):
        start = time.time()
        cur.execute(query, params)
        cur.fetchall()
        times.append((time.time() - start) * 1000)  # Convert to ms

    avg_time = np.mean(times)
    std_time = np.std(times)
    return avg_time, std_time

print("Benchmarking pgvector performance...")
print("=" * 80)

# Scenario 1: Unfiltered
query1 = """
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    ORDER BY embedding <=> %s
    LIMIT 10
"""
avg, std = benchmark_query(query1, (query_embedding, query_embedding), "Unfiltered")
print(f"Unfiltered search:        {avg:.2f}ms (±{std:.2f}ms)")
baseline = avg

# Scenario 2: Category filter
query2 = """
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    WHERE category = 'cs.LG'
    ORDER BY embedding <=> %s
    LIMIT 10
"""
avg, std = benchmark_query(query2, (query_embedding, query_embedding), "Category filter")
overhead = avg / baseline
print(f"Category filter:          {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

# Scenario 3: Year filter
query3 = """
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    WHERE year >= 2025
    ORDER BY embedding <=> %s
    LIMIT 10
"""
avg, std = benchmark_query(query3, (query_embedding, query_embedding), "Year filter")
overhead = avg / baseline
print(f"Year filter (>= 2025):    {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

# Scenario 4: Combined filters
query4 = """
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    WHERE category = 'cs.LG' AND year >= 2025
    ORDER BY embedding <=> %s
    LIMIT 10
"""
avg, std = benchmark_query(query4, (query_embedding, query_embedding), "Combined filter")
overhead = avg / baseline
print(f"Combined filter:          {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

print("=" * 80)

cur.close()
conn.close()

Run this:

python benchmark_pgvector.py

Here’s what we found in our testing (your numbers might vary slightly depending on your hardware):

Benchmarking pgvector performance...
================================================================================
Unfiltered search:        2.48ms (±0.31ms)
Category filter:          5.70ms (±0.42ms) | 2.30x overhead
Year filter (>= 2025):    2.51ms (±0.29ms) | 1.01x overhead
Combined filter:          5.64ms (±0.38ms) | 2.27x overhead
================================================================================

What the Numbers Tell Us

Let’s compare this to ChromaDB:

  Scenario          ChromaDB        pgvector        Winner
  Unfiltered        4.5ms           2.5ms           pgvector (1.8x faster)
  Category filter   3.3x overhead   2.3x overhead   pgvector (30% less overhead)
  Year filter       8.0x overhead   1.0x overhead   pgvector (essentially free!)
  Combined filter   5.0x overhead   2.3x overhead   pgvector (54% less overhead)

Three things jump out:

  1. pgvector is fast at baseline. The unfiltered queries average 2.5ms compared to ChromaDB’s 4.5ms. That’s nearly twice as fast, which makes sense. Decades of PostgreSQL query optimization plus in-process execution (no HTTP overhead) really shows here.
  2. Integer filters are essentially free. The year range filter adds almost zero overhead (1.01x). PostgreSQL is incredibly good at filtering on integers. It can use standard database indexes and optimization techniques that have been refined over 30+ years.
  3. Text filters have a cost, but it’s reasonable. Category filtering shows 2.3x overhead, which is better than ChromaDB’s 3.3x but still noticeable. Text matching is inherently more expensive than integer comparisons, even in a mature database like PostgreSQL.
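
If you’re curious how PostgreSQL actually executes these filtered vector queries, EXPLAIN ANALYZE works here just like it does for any other SQL. A quick sketch, assuming the connection and query_embedding from benchmark_pgvector.py:

# Sketch: inspect the plan PostgreSQL chooses for the combined filter.
# Assumes `cur`, register_vector(), and `query_embedding` from benchmark_pgvector.py.
cur.execute("""
    EXPLAIN ANALYZE
    SELECT title, embedding <=> %s AS distance
    FROM papers
    WHERE category = 'cs.LG' AND year >= 2025
    ORDER BY embedding <=> %s
    LIMIT 10
""", (query_embedding, query_embedding))
for (line,) in cur.fetchall():
    print(line)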

The pattern here is really interesting: pgvector doesn’t magically make all filtering free, but it leverages PostgreSQL’s strengths. When you filter on things PostgreSQL is already good at (numbers, dates, IDs), the overhead is minimal. When you filter on text fields, you pay a price, but it’s more manageable than in ChromaDB.

What pgvector Gets Right

  • SQL Integration: If your team already thinks in SQL, pgvector feels natural. You write regular SQL queries with a special distance operator. That’s it. No new query language to learn.
  • Transaction Support: Need to update a paper’s metadata and its embedding together? Wrap it in a transaction. PostgreSQL handles it the same way it handles any other transactional update.
  • Existing Infrastructure: Many teams already have PostgreSQL in production, complete with backups, monitoring, high availability setups, and people who know how to keep it running. Adding pgvector means leveraging all that existing investment.
  • Mature Ecosystem: Want to connect it to your data pipeline? There’s probably a tool for that. Need to replicate it? PostgreSQL replication works. Want to query it from your favorite language? PostgreSQL drivers exist everywhere.

What pgvector Doesn’t Handle For You

  • VACUUM is Your Problem: PostgreSQL’s VACUUM process can interact weirdly with vector indexes. The indexes can bloat over time if you’re doing lots of updates and deletes. You need to monitor this and potentially rebuild indexes periodically.
  • Index Maintenance: As your data grows and changes, you might need to rebuild indexes to maintain performance. This isn’t automatic. It’s part of your operational responsibility.
  • Memory Pressure: Vector indexes live in memory for best performance. As your dataset grows, you need to size your database appropriately. This is normal for PostgreSQL, but it’s something you have to plan for.
  • Replication Overhead: If you’re replicating your PostgreSQL database, those vector columns come along for the ride. Replicating high-dimensional vectors can be bandwidth-intensive.

In production, you’d typically also add regular indexes (for example, B-tree indexes) on frequently filtered columns like category and year, alongside the vector index.
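
As a concrete sketch of what that looks like (reusing a connection set up as in load_pgvector.py; the index names here are just illustrative):

# Sketch: plain B-tree indexes on the metadata columns we filter by most.
# Assumes `cur` and `conn` set up as in load_pgvector.py; index names are illustrative.
cur.execute("CREATE INDEX IF NOT EXISTS papers_category_idx ON papers (category)")
cur.execute("CREATE INDEX IF NOT EXISTS papers_year_idx ON papers (year)")
conn.commit()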

None of these are dealbreakers. They’re just real operational considerations. Teams with PostgreSQL expertise can handle them. Teams without that expertise might prefer a managed service or specialized database.

When pgvector Makes Sense

pgvector is an excellent choice when:

  • You already run PostgreSQL in production
  • Your team has strong SQL skills and PostgreSQL experience
  • You need transactional guarantees with your vector operations
  • You primarily filter on integer fields (dates, IDs, counts, years)
  • Your scale is moderate (up to a few million vectors)
  • You want to leverage existing PostgreSQL infrastructure

pgvector might not be the best fit when:

  • You’re filtering heavily on text fields with unpredictable combinations
  • You need to scale beyond what a single PostgreSQL server can handle
  • Your team doesn’t have PostgreSQL operational expertise
  • You want someone else to handle all the database maintenance

Database 2: Qdrant

pgvector gave us fast baseline queries, but text filtering still added noticeable overhead. That’s not a PostgreSQL problem. It’s just that PostgreSQL was built to handle all kinds of data, and vector search with heavy filtering is one specific use case among thousands.

Qdrant takes a different approach. It’s a vector database built specifically for filtered vector search. The entire architecture is designed around one question: how do we make similarity search fast even when you’re filtering on multiple metadata fields?

Let’s see if that focus pays off.

Loading Data into Qdrant

Qdrant runs as a separate service (in our Docker setup, it’s already running). We’ll connect to it via HTTP API and load our papers. Create load_qdrant.py:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import numpy as np
import pandas as pd

# Connect to Qdrant
# If using Docker, QDRANT_HOST is set to 'qdrant'
# If running locally, use 'localhost'
import os
qdrant_host = os.getenv('QDRANT_HOST', 'localhost')
client = QdrantClient(host=qdrant_host, port=6333)

# Create collection with vector configuration
collection_name = "arxiv_papers"

# Delete collection if it exists (useful for re-running)
try:
    client.delete_collection(collection_name)
    print(f"Deleted existing collection: {collection_name}")
except Exception:
    pass

# Create new collection
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(
        size=1536,  # Cohere embedding dimension
        distance=Distance.COSINE
    )
)
print(f"Created collection: {collection_name}")

# Load data
papers_df = pd.read_csv('/data/arxiv_papers_5k.csv')
embeddings = np.load('/data/embeddings_cohere_5k.npy')

print(f"\nLoading {len(papers_df)} papers into Qdrant...")

# Prepare points for upload
# Qdrant stores metadata as "payload"
points = []
for idx, row in papers_df.iterrows():
    point = PointStruct(
        id=idx,
        vector=embeddings[idx].tolist(),
        payload={
            "paper_id": row['id'],
            "title": row['title'],
            "authors": row['authors'],
            "abstract": row['abstract'],
            "year": int(row['year']),
            "category": row['categories']
        }
    )
    points.append(point)

    # Show progress every 1000 papers
    if (idx + 1) % 1000 == 0:
        print(f"  Prepared {idx + 1} / {len(papers_df)} papers")

# Upload all points at once
# Qdrant handles large batches well (no 5k limit like ChromaDB)
print("\nUploading to Qdrant...")
client.upsert(
    collection_name=collection_name,
    points=points
)

print(f"Upload complete!")

# Verify
collection_info = client.get_collection(collection_name)
print(f"\nCollection '{collection_name}' now has {collection_info.points_count} papers")

A few things to notice:

  • Collection Setup: We specify the vector size (1536) and distance metric (COSINE) when creating the collection. This is similar to ChromaDB but more explicit.
  • Payload Structure: Qdrant calls metadata “payload.” We store all our paper metadata here as a dictionary. This is where Qdrant’s filtering power comes from.
  • No Batch Size Limits: Unlike ChromaDB’s 5,461 embedding limit, Qdrant handled all 5,000 papers in a single upload without issues.
  • Point IDs: We use the DataFrame index as point IDs. In production, you’d probably use your paper IDs, but integers work fine for this example.
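
One optional step we didn’t include in load_qdrant.py: Qdrant lets you create payload indexes on fields you filter by often, which its documentation recommends for filtered search, especially as collections grow. At 5,000 points the effect is small, but here’s a minimal sketch assuming the client and collection_name from the script above:

from qdrant_client.models import PayloadSchemaType

# Sketch: index the payload fields we filter on (assumes `client` and
# `collection_name` from load_qdrant.py). Optional at this scale.
client.create_payload_index(
    collection_name=collection_name,
    field_name="category",
    field_schema=PayloadSchemaType.KEYWORD,  # exact-match text field
)
client.create_payload_index(
    collection_name=collection_name,
    field_name="year",
    field_schema=PayloadSchemaType.INTEGER,  # supports range filters
)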

Run the script:

python load_qdrant.py

You’ll see output like this:

Deleted existing collection: arxiv_papers
Created collection: arxiv_papers

Loading 5000 papers into Qdrant...
  Prepared 1000 / 5000 papers
  Prepared 2000 / 5000 papers
  Prepared 3000 / 5000 papers
  Prepared 4000 / 5000 papers
  Prepared 5000 / 5000 papers

Uploading to Qdrant...
Upload complete!

Collection 'arxiv_papers' now has 5000 papers

Querying with Qdrant

Now let’s run the same query scenarios. Create query_qdrant.py:

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range
import numpy as np
import os

# Connect to Qdrant
qdrant_host = os.getenv('QDRANT_HOST', 'localhost')
client = QdrantClient(host=qdrant_host, port=6333)
collection_name = "arxiv_papers"

# Get a query vector from a machine learning paper
results = client.scroll(
    collection_name=collection_name,
    scroll_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="cs.LG"))]
    ),
    limit=1,
    with_vectors=True,
    with_payload=True
)

query_point = results[0][0]
query_vector = query_point.vector
query_payload = query_point.payload

print("Query paper:")
print(f"  Title: {query_payload['title']}")
print(f"  Category: {query_payload['category']}")
print(f"  Year: {query_payload['year']}")
print()

# Scenario 1: Unfiltered similarity search
print("=" * 80)
print("Scenario 1: Unfiltered Similarity Search")
print("=" * 80)

results = client.query_points(
    collection_name=collection_name,
    query=query_vector,
    limit=6,  # Get 6 so we can skip the query paper itself
    with_payload=True
)

for hit in results.points[1:6]:  # Skip first result (the query paper)
    payload = hit.payload
    print(f"  {payload['category']:8} {payload['year']} | {hit.score:.4f} | {payload['title'][:60]}")
print()

# Scenario 2: Filter by category
print("=" * 80)
print("Scenario 2: Category Filter (cs.LG only)")
print("=" * 80)

results = client.query_points(
    collection_name=collection_name,
    query=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="cs.LG"))]
    ),
    limit=6,
    with_payload=True
)

for hit in results.points[1:6]:
    payload = hit.payload
    print(f"  {payload['category']:8} {payload['year']} | {hit.score:.4f} | {payload['title'][:60]}")
print()

# Scenario 3: Filter by year range
print("=" * 80)
print("Scenario 3: Year Filter (2025 or later)")
print("=" * 80)

results = client.query_points(
    collection_name=collection_name,
    query=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="year", range=Range(gte=2025))]
    ),
    limit=6,  # Get 6 so we can skip the query paper (it matches this filter too)
    with_payload=True
)

for hit in results.points[1:6]:  # Skip first result (the query paper)
    payload = hit.payload
    print(f"  {payload['category']:8} {payload['year']} | {hit.score:.4f} | {payload['title'][:60]}")
print()

# Scenario 4: Combined filters
print("=" * 80)
print("Scenario 4: Combined Filter (cs.LG AND year >= 2025)")
print("=" * 80)

results = client.query_points(
    collection_name=collection_name,
    query=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="cs.LG")),
            FieldCondition(key="year", range=Range(gte=2025))
        ]
    ),
    limit=6,  # Again, get 6 and skip the query paper
    with_payload=True
)

for hit in results.points[1:6]:  # Skip first result (the query paper)
    payload = hit.payload
    print(f"  {payload['category']:8} {payload['year']} | {hit.score:.4f} | {payload['title'][:60]}")

A couple of things about Qdrant’s API:

  • Method Name: We use client.query_points() to search with vectors. The client also has methods called query() and search(), but they work differently. query_points() is what you want for vector similarity search.
  • Filter Syntax: Qdrant uses structured filter objects. Text matching uses MatchValue, numeric ranges use Range. You can combine multiple conditions in the must list.
  • Scores vs Distances: Qdrant returns similarity scores (higher is better) rather than distances (lower is better). This is just a presentation difference.

Run it:

python query_qdrant.py

You’ll see output like this:

Query paper:
  Title: Deep Reinforcement Learning for Autonomous Navigation
  Category: cs.LG
  Year: 2025

================================================================================
Scenario 1: Unfiltered Similarity Search
================================================================================
  cs.LG    2024 | 0.7866 | Policy Gradient Methods for Robot Control
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.CV    2024 | 0.7555 | Visual Navigation Using Deep Learning
  cs.LG    2023 | 0.7409 | Model-Free Reinforcement Learning Approaches
  cs.CL    2025 | 0.7322 | Reinforcement Learning for Dialogue Systems

================================================================================
Scenario 2: Category Filter (cs.LG only)
================================================================================
  cs.LG    2024 | 0.7866 | Policy Gradient Methods for Robot Control
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.LG    2023 | 0.7409 | Model-Free Reinforcement Learning Approaches
  cs.LG    2024 | 0.7266 | Deep Q-Networks for Atari Games
  cs.LG    2025 | 0.7144 | Actor-Critic Methods in Continuous Control

================================================================================
Scenario 3: Year Filter (2025 or later)
================================================================================
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.CL    2025 | 0.7322 | Reinforcement Learning for Dialogue Systems
  cs.LG    2025 | 0.7144 | Actor-Critic Methods in Continuous Control
  cs.CV    2025 | 0.7077 | Self-Supervised Learning for Visual Tasks
  cs.DB    2025 | 0.6988 | Optimizing Database Queries with Learning

================================================================================
Scenario 4: Combined Filter (cs.LG AND year >= 2025)
================================================================================
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.LG    2025 | 0.7144 | Actor-Critic Methods in Continuous Control
  cs.LG    2025 | 0.6855 | Transfer Learning in Reinforcement Learning
  cs.LG    2025 | 0.6733 | Exploration Strategies in Deep RL
  cs.LG    2025 | 0.6599 | Reward Shaping for Complex Tasks

Notice the scores are higher numbers than the distances we saw with pgvector. That’s just because Qdrant shows similarity (higher = more similar) while pgvector showed distance (lower = more similar). The rankings are what matter.
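
In fact, for cosine the two are just complements: distance = 1 - similarity. You can check this against the outputs above:

# Cosine distance and cosine similarity are complements (distance = 1 - similarity).
# pgvector reported 0.2134 for "Policy Gradient Methods..."; Qdrant reported 0.7866.
score = 0.7866
distance = 1.0 - score
print(round(distance, 4))  # 0.2134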

Measuring Performance

Now for the interesting part. Create benchmark_qdrant.py:

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range
import numpy as np
import time
import os

# Connect to Qdrant
qdrant_host = os.getenv('QDRANT_HOST', 'localhost')
client = QdrantClient(host=qdrant_host, port=6333)
collection_name = "arxiv_papers"

# Get a query vector
results = client.scroll(
    collection_name=collection_name,
    scroll_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="cs.LG"))]
    ),
    limit=1,
    with_vectors=True
)
query_vector = results[0][0].vector

def benchmark_query(query_filter, name, iterations=100):
    """Run a query multiple times and measure average latency"""
    # Warmup
    for _ in range(5):
        client.query_points(
            collection_name=collection_name,
            query=query_vector,
            query_filter=query_filter,
            limit=10,
            with_payload=True
        )

    # Actual measurement
    times = []
    for _ in range(iterations):
        start = time.time()
        client.query_points(
            collection_name=collection_name,
            query=query_vector,
            query_filter=query_filter,
            limit=10,
            with_payload=True
        )
        times.append((time.time() - start) * 1000)  # Convert to ms

    avg_time = np.mean(times)
    std_time = np.std(times)
    return avg_time, std_time

print("Benchmarking Qdrant performance...")
print("=" * 80)

# Scenario 1: Unfiltered
avg, std = benchmark_query(None, "Unfiltered")
print(f"Unfiltered search:        {avg:.2f}ms (±{std:.2f}ms)")
baseline = avg

# Scenario 2: Category filter
filter_category = Filter(
    must=[FieldCondition(key="category", match=MatchValue(value="cs.LG"))]
)
avg, std = benchmark_query(filter_category, "Category filter")
overhead = avg / baseline
print(f"Category filter:          {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

# Scenario 3: Year filter
filter_year = Filter(
    must=[FieldCondition(key="year", range=Range(gte=2025))]
)
avg, std = benchmark_query(filter_year, "Year filter")
overhead = avg / baseline
print(f"Year filter (>= 2025):    {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

# Scenario 4: Combined filters
filter_combined = Filter(
    must=[
        FieldCondition(key="category", match=MatchValue(value="cs.LG")),
        FieldCondition(key="year", range=Range(gte=2025))
    ]
)
avg, std = benchmark_query(filter_combined, "Combined filter")
overhead = avg / baseline
print(f"Combined filter:          {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

print("=" * 80)

Run this:

python benchmark_qdrant.py

Here’s what we found in our testing:

Benchmarking Qdrant performance...
================================================================================
Unfiltered search:        52.52ms (±1.15ms)
Category filter:          57.19ms (±1.09ms) | 1.09x overhead
Year filter (>= 2025):    58.55ms (±1.11ms) | 1.11x overhead
Combined filter:          58.11ms (±1.08ms) | 1.11x overhead
================================================================================

What the Numbers Tell Us

The pattern here is striking. Let’s compare all three databases we’ve tested:

  Scenario                   ChromaDB   pgvector   Qdrant
  Unfiltered                 4.5ms      2.5ms      52ms
  Category filter overhead   3.3x       2.3x       1.09x
  Year filter overhead       8.0x       1.0x       1.11x
  Combined filter overhead   5.0x       2.3x       1.11x

Three observations:

  1. Qdrant’s baseline is slower. At 52ms, unfiltered queries are significantly slower than pgvector’s 2.5ms or ChromaDB’s 4.5ms. This is because we’re going through an HTTP API to a separate service, while pgvector runs in-process with PostgreSQL. Network overhead and serialization add latency.
  2. Filtering overhead is remarkably consistent. Category filter, year filter, combined filters all show roughly 1.1x overhead. It doesn’t matter if you’re filtering on one field or five. This is dramatically better than ChromaDB’s 3-8x overhead or even pgvector’s 2.3x overhead on text fields.
  3. The architecture is designed for filtered search. Qdrant doesn’t treat filtering as an afterthought. The entire system is built around the assumption that you’ll be filtering on metadata while doing vector similarity search. That focus shows in these numbers.

So when does Qdrant make sense? When your queries look like: “Find similar documents that are in category X, from year Y, with tag Z, and access level W.” If you’re doing lots of complex filtered searches, that consistent 1.1x overhead beats pgvector’s variable performance and absolutely crushes ChromaDB.
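
For reference, here’s roughly what that kind of query looks like with Qdrant’s filter objects. The tag and access_level fields are hypothetical (our arXiv payload doesn’t have them), so treat this as a shape sketch rather than something to run against the collection above:

from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

# Shape sketch only: "tag" and "access_level" are hypothetical payload fields.
complex_filter = Filter(
    must=[
        FieldCondition(key="category", match=MatchValue(value="cs.LG")),
        FieldCondition(key="year", range=Range(gte=2024)),
        FieldCondition(key="tag", match=MatchValue(value="robotics")),         # hypothetical
        FieldCondition(key="access_level", match=MatchValue(value="public")),  # hypothetical
    ]
)
# Passed as query_filter=complex_filter to client.query_points(), same as before.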

What Qdrant Gets Right

  • Filtering Efficiency: This is the big one. Complex filters don’t explode your query time. You can filter on multiple fields without worrying about performance falling off a cliff.
  • Purpose-Built Architecture: Everything about Qdrant is designed for vector search. The API makes sense, the filtering syntax is clear, and the performance characteristics are predictable.
  • Easy Development Setup: Running Qdrant in Docker for local development is straightforward. The API is well-documented, and the Python client works smoothly.
  • Scalability Path: When you outgrow a single instance, Qdrant offers distributed deployment options. You’re not locked into a single-server architecture.

What to Consider

  • Network Latency: Because Qdrant runs as a separate service, you pay the cost of HTTP requests. For latency-sensitive applications where every millisecond counts, that 52ms baseline might matter.
  • Operational Overhead: You need to run and maintain another service. It’s not as complex as managing a full database cluster, but it’s more than just using an existing PostgreSQL database.
  • Infrastructure Requirements: Qdrant needs its own resources (CPU, memory, disk). If you’re resource-constrained, adding another service might not be ideal.
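
One way to reduce some of the HTTP overhead mentioned above: the Python client can talk to Qdrant over gRPC instead, using port 6334 (already exposed in our docker-compose.yml). We didn’t benchmark this path, so treat it as an option to measure rather than a guaranteed win:

from qdrant_client import QdrantClient
import os

# Sketch: connect over gRPC (port 6334 in our docker-compose.yml) instead of HTTP.
# We didn't benchmark this path; measure it in your own environment.
qdrant_host = os.getenv('QDRANT_HOST', 'localhost')
client = QdrantClient(host=qdrant_host, grpc_port=6334, prefer_grpc=True)
# Everything else (query_points, Filter objects, etc.) stays the same.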

When Qdrant Makes Sense

Qdrant is an excellent choice when:

  • You need to filter on multiple metadata fields frequently
  • Your filters are complex and unpredictable (users can combine many different fields)
  • You can accept ~50ms baseline latency in exchange for consistent filtering performance
  • You want a purpose-built vector database but prefer self-hosting over managed services
  • You’re comfortable running Docker containers or Kubernetes in production
  • Your scale is in the millions to tens of millions of vectors

Qdrant might not be the best fit when:

  • You need sub-10ms query latency and filtering is secondary
  • You’re trying to minimize infrastructure complexity (fewer moving parts)
  • You already have PostgreSQL and pgvector handles your filtering needs
  • You want a fully managed service (Qdrant offers cloud hosting, but we tested the self-hosted version)

Database 3: Pinecone

pgvector gave us speed but required PostgreSQL expertise. Qdrant gave us efficient filtering but required running another service. Now let’s try a completely different approach: a managed service where someone else handles all the infrastructure.

Pinecone is a vector database offered as a cloud service. You don’t install anything locally. You don’t manage servers. You don’t tune indexes or monitor disk space. You create an index through their API, upload your vectors, and query them. That’s it.

This simplicity comes with tradeoffs. You’re paying for the convenience, you’re dependent on their infrastructure, and every query goes over the internet to their servers. Let’s see how those tradeoffs play out in practice.

Setting Up Pinecone

First, you need a Pinecone account. Go to pinecone.io and sign up for the free tier. The free serverless plan is enough for this tutorial (hundreds of thousands of 1536-dim vectors and several indexes); check Pinecone’s current pricing page for exact limits.

Once you have your API key, create a .env file in your code/ folder:

PINECONE_API_KEY=your-api-key-here

Now let’s create our index and load data. Create load_pinecone.py:

from pinecone import Pinecone, ServerlessSpec
import numpy as np
import pandas as pd
import os
import time
from dotenv import load_dotenv

# Load API key
load_dotenv()
api_key = os.getenv('PINECONE_API_KEY')

# Initialize Pinecone
pc = Pinecone(api_key=api_key)

# Create index
index_name = "arxiv-papers-5k"

# Delete index if it exists
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)
    print(f"Deleted existing index: {index_name}")
    time.sleep(5)  # Wait for deletion to complete

# Create new index
pc.create_index(
    name=index_name,
    dimension=1536,  # Cohere embedding dimension
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"  # Free tier only supports us-east-1
    )
)
print(f"Created index: {index_name}")

# Wait for index to be ready
while not pc.describe_index(index_name).status['ready']:
    print("Waiting for index to be ready...")
    time.sleep(1)

# Connect to index
index = pc.Index(index_name)

# Load data
papers_df = pd.read_csv('/data/arxiv_papers_5k.csv')
embeddings = np.load('/data/embeddings_cohere_5k.npy')

print(f"\nLoading {len(papers_df)} papers into Pinecone...")

# Prepare vectors for upload
# Pinecone expects: (id, vector, metadata)
vectors = []
for idx, row in papers_df.iterrows():
    # Truncate authors field to avoid hitting metadata size limits
    # Pinecone has a 40KB metadata limit per vector
    authors = row['authors'][:500] if len(row['authors']) > 500 else row['authors']

    vector = {
        "id": row['id'],
        "values": embeddings[idx].tolist(),
        "metadata": {
            "title": row['title'],
            "authors": authors,
            "abstract": row['abstract'],
            "year": int(row['year']),
            "category": row['categories']
        }
    }
    vectors.append(vector)

    # Upload in batches of 100
    if len(vectors) == 100:
        index.upsert(vectors=vectors)
        print(f"  Uploaded {idx + 1} / {len(papers_df)} papers")
        vectors = []

# Upload remaining vectors
if vectors:
    index.upsert(vectors=vectors)
    print(f"  Uploaded {len(papers_df)} / {len(papers_df)} papers")

print("\nUpload complete!")

# Pinecone has eventual consistency
# Wait a bit for all vectors to be indexed
print("Waiting for indexing to complete...")
time.sleep(10)

# Verify
stats = index.describe_index_stats()
print(f"\nIndex '{index_name}' now has {stats['total_vector_count']} vectors")

A few things to notice:

Serverless Configuration: The free tier uses serverless infrastructure in AWS us-east-1. You don’t specify machine types or capacity. Pinecone handles scaling automatically.

Metadata Size Limit: Pinecone limits metadata to 40KB per vector. We truncate the authors field just to be safe. In practice, most metadata is well under this limit.

Batch Uploads: We upload 100 vectors at a time. This is a reasonable batch size that balances upload speed with API constraints.

Eventual Consistency: After uploading, we wait 10 seconds for indexing to complete. Pinecone doesn’t make vectors immediately queryable. They need to be indexed first.
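
If you’d rather not rely on a fixed sleep, you can poll the same describe_index_stats() call until the count matches what you uploaded. A minimal sketch, assuming the index and papers_df objects from the script above:

import time

# Sketch: poll until all vectors are visible instead of sleeping a fixed 10 seconds.
# Assumes `index` and `papers_df` from load_pinecone.py.
expected = len(papers_df)
while index.describe_index_stats()['total_vector_count'] < expected:
    print("Still indexing...")
    time.sleep(2)
print("All vectors indexed.")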

Run the script:

python load_pinecone.py

You’ll see output like this:

Deleted existing index: arxiv-papers-5k
Created index: arxiv-papers-5k

Loading 5000 papers into Pinecone...
  Uploaded 100 / 5000 papers
  Uploaded 200 / 5000 papers
  Uploaded 300 / 5000 papers
  ...
  Uploaded 4900 / 5000 papers
  Uploaded 5000 / 5000 papers

Upload complete!
Waiting for indexing to complete...

Index 'arxiv-papers-5k' now has 5000 vectors

Querying with Pinecone

Now let’s run our queries. Create query_pinecone.py:

from pinecone import Pinecone
import numpy as np
import os
from dotenv import load_dotenv

# Load API key and connect
load_dotenv()
api_key = os.getenv('PINECONE_API_KEY')
pc = Pinecone(api_key=api_key)
index = pc.Index("arxiv-papers-5k")

# Get a query vector from a machine learning paper
results = index.query(
    vector=[0] * 1536,  # Dummy vector just to use filter
    filter={"category": {"$eq": "cs.LG"}},
    top_k=1,
    include_metadata=True,
    include_values=True
)

query_match = results['matches'][0]
query_vector = query_match['values']
query_metadata = query_match['metadata']

print("Query paper:")
print(f"  Title: {query_metadata['title']}")
print(f"  Category: {query_metadata['category']}")
print(f"  Year: {query_metadata['year']}")
print()

# Scenario 1: Unfiltered similarity search
print("=" * 80)
print("Scenario 1: Unfiltered Similarity Search")
print("=" * 80)

results = index.query(
    vector=query_vector,
    top_k=6,  # Get 6 so we can skip the query paper itself
    include_metadata=True
)

for match in results['matches'][1:6]:  # Skip first result (query paper)
    metadata = match['metadata']
    print(f"  {metadata['category']:8} {metadata['year']} | {match['score']:.4f} | {metadata['title'][:60]}")
print()

# Scenario 2: Filter by category
print("=" * 80)
print("Scenario 2: Category Filter (cs.LG only)")
print("=" * 80)

results = index.query(
    vector=query_vector,
    filter={"category": {"$eq": "cs.LG"}},
    top_k=6,
    include_metadata=True
)

for match in results['matches'][1:6]:
    metadata = match['metadata']
    print(f"  {metadata['category']:8} {metadata['year']} | {match['score']:.4f} | {metadata['title'][:60]}")
print()

# Scenario 3: Filter by year range
print("=" * 80)
print("Scenario 3: Year Filter (2025 or later)")
print("=" * 80)

results = index.query(
    vector=query_vector,
    filter={"year": {"$gte": 2025}},
    top_k=6,  # Get 6 so we can skip the query paper (it matches this filter too)
    include_metadata=True
)

for match in results['matches'][1:6]:  # Skip first result (the query paper)
    metadata = match['metadata']
    print(f"  {metadata['category']:8} {metadata['year']} | {match['score']:.4f} | {metadata['title'][:60]}")
print()

# Scenario 4: Combined filters
print("=" * 80)
print("Scenario 4: Combined Filter (cs.LG AND year >= 2025)")
print("=" * 80)

results = index.query(
    vector=query_vector,
    filter={
        "$and": [
            {"category": {"$eq": "cs.LG"}},
            {"year": {"$gte": 2025}}
        ]
    },
    top_k=6,  # Again, get 6 and skip the query paper
    include_metadata=True
)

for match in results['matches'][1:6]:  # Skip first result (the query paper)
    metadata = match['metadata']
    print(f"  {metadata['category']:8} {metadata['year']} | {match['score']:.4f} | {metadata['title'][:60]}")

Filter Syntax: Pinecone uses MongoDB-style operators ($eq, $gte, $and). If you’ve worked with MongoDB, this will feel familiar.

Default Namespace: Pinecone uses namespaces to partition vectors within an index. If you don’t specify one, vectors go into the default namespace (the empty string “”). This caught us initially because we expected a namespace called “default”.
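
If you do want to partition vectors, you pass a namespace explicitly on both upsert and query. A quick sketch; the namespace name is just an example, and the index, vectors, and query_vector objects come from the scripts above:

# Sketch: explicit namespaces on upsert and query ("cs-papers" is an example name).
# Assumes `index`, `vectors`, and `query_vector` from the loading and query scripts.
index.upsert(vectors=vectors, namespace="cs-papers")

results = index.query(
    vector=query_vector,
    top_k=5,
    namespace="cs-papers",   # must match the namespace used at upsert time
    include_metadata=True
)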

Run it:

python query_pinecone.py

You’ll see output like this:

Query paper:
  Title: Deep Reinforcement Learning for Autonomous Navigation
  Category: cs.LG
  Year: 2025

================================================================================
Scenario 1: Unfiltered Similarity Search
================================================================================
  cs.LG    2024 | 0.7866 | Policy Gradient Methods for Robot Control
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.CV    2024 | 0.7555 | Visual Navigation Using Deep Learning
  cs.LG    2023 | 0.7409 | Model-Free Reinforcement Learning Approaches
  cs.CL    2025 | 0.7322 | Reinforcement Learning for Dialogue Systems

================================================================================
Scenario 2: Category Filter (cs.LG only)
================================================================================
  cs.LG    2024 | 0.7866 | Policy Gradient Methods for Robot Control
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.LG    2023 | 0.7409 | Model-Free Reinforcement Learning Approaches
  cs.LG    2024 | 0.7266 | Deep Q-Networks for Atari Games
  cs.LG    2025 | 0.7144 | Actor-Critic Methods in Continuous Control

================================================================================
Scenario 3: Year Filter (2025 or later)
================================================================================
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.CL    2025 | 0.7322 | Reinforcement Learning for Dialogue Systems
  cs.LG    2025 | 0.7144 | Actor-Critic Methods in Continuous Control
  cs.CV    2025 | 0.7077 | Self-Supervised Learning for Visual Tasks
  cs.DB    2025 | 0.6988 | Optimizing Database Queries with Learning

================================================================================
Scenario 4: Combined Filter (cs.LG AND year >= 2025)
================================================================================
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.LG    2025 | 0.7144 | Actor-Critic Methods in Continuous Control
  cs.LG    2025 | 0.6855 | Transfer Learning in Reinforcement Learning
  cs.LG    2025 | 0.6733 | Exploration Strategies in Deep RL
  cs.LG    2025 | 0.6599 | Reward Shaping for Complex Tasks

Measuring Performance

One last benchmark. Create benchmark_pinecone.py:

from pinecone import Pinecone
import numpy as np
import time
import os
from dotenv import load_dotenv

# Load API key and connect
load_dotenv()
api_key = os.getenv('PINECONE_API_KEY')
pc = Pinecone(api_key=api_key)
index = pc.Index("arxiv-papers-5k")

# Get a query vector
results = index.query(
    vector=[0] * 1536,
    filter={"category": {"$eq": "cs.LG"}},
    top_k=1,
    include_values=True
)
query_vector = results['matches'][0]['values']

def benchmark_query(query_filter, name, iterations=100):
    """Run a query multiple times and measure average latency"""
    # Warmup
    for _ in range(5):
        index.query(
            vector=query_vector,
            filter=query_filter,
            top_k=10,
            include_metadata=True
        )

    # Actual measurement
    times = []
    for _ in range(iterations):
        start = time.time()
        index.query(
            vector=query_vector,
            filter=query_filter,
            top_k=10,
            include_metadata=True
        )
        times.append((time.time() - start) * 1000)  # Convert to ms

    avg_time = np.mean(times)
    std_time = np.std(times)
    return avg_time, std_time

print("Benchmarking Pinecone performance...")
print("=" * 80)

# Scenario 1: Unfiltered
avg, std = benchmark_query(None, "Unfiltered")
print(f"Unfiltered search:        {avg:.2f}ms (±{std:.2f}ms)")
baseline = avg

# Scenario 2: Category filter
avg, std = benchmark_query({"category": {"$eq": "cs.LG"}}, "Category filter")
overhead = avg / baseline
print(f"Category filter:          {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

# Scenario 3: Year filter
avg, std = benchmark_query({"year": {"$gte": 2025}}, "Year filter")
overhead = avg / baseline
print(f"Year filter (>= 2025):    {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

# Scenario 4: Combined filters
avg, std = benchmark_query(
    {"$and": [{"category": {"$eq": "cs.LG"}}, {"year": {"$gte": 2025}}]},
    "Combined filter"
)
overhead = avg / baseline
print(f"Combined filter:          {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

print("=" * 80)

Run this:

python benchmark_pinecone.py

Here’s what we found (your numbers will vary based on your distance from AWS us-east-1):

Benchmarking Pinecone performance...
================================================================================
Unfiltered search:        87.45ms (±2.15ms)
Category filter:          88.41ms (±3.12ms) | 1.01x overhead
Year filter (>= 2025):    88.69ms (±2.84ms) | 1.01x overhead
Combined filter:          87.18ms (±2.67ms) | 1.00x overhead
================================================================================

What the Numbers Tell Us

Now let’s look at all four databases:

Scenario                   ChromaDB   pgvector   Qdrant   Pinecone
Unfiltered                 4.5ms      2.5ms      52ms     87ms
Category filter overhead   3.3x       2.3x       1.09x    1.01x
Year filter overhead       8.0x       1.0x       1.11x    1.01x
Combined filter overhead   5.0x       2.3x       1.11x    1.00x

Two patterns emerge:

  1. Filtering overhead is essentially zero. Pinecone shows 1.00-1.01x overhead across all filter types. Category filters, year filters, combined filters all take the same time as unfiltered queries. Pinecone’s infrastructure handles filtering so efficiently that it’s invisible in the measurements.
  2. Network latency dominates baseline performance. At 87ms, Pinecone is the slowest for unfiltered queries. But this isn’t because Pinecone is slow at vector search. It’s because we’re sending queries from Mexico City to AWS us-east-1 over the internet. Every query pays the cost of network round-trip time plus serialization/deserialization.

If you ran this benchmark from Virginia (close to us-east-1), your baseline would be much lower. If you ran it from Tokyo, it would be higher. The filtering overhead would stay at 1.0x regardless.

What Pinecone Gets Right

  • Zero Operational Overhead: You don’t install anything. You don’t manage servers. You don’t tune indexes. You don’t monitor disk space or memory usage. You just use the API.
  • Automatic Scaling: Pinecone’s serverless tier scales automatically based on your usage. You don’t provision capacity upfront. You don’t worry about running out of resources.
  • Filtering Performance: Complex filters don’t slow down queries. Filter on one field or ten fields, it doesn’t matter. The overhead is invisible.
  • High Availability: Pinecone handles replication, failover, and uptime. You don’t build these capabilities yourself.

What to Consider

  • Network Latency: Every query goes over the internet to Pinecone’s servers. For latency-sensitive applications, that baseline 87ms (or whatever your network adds) might be too much.
  • Cost Structure: The free tier is great for learning, but production usage costs money. Pinecone charges based on usage and storage, so you need to understand their pricing model and how it scales with your needs.
  • Vendor Lock-In: Your data lives in Pinecone’s infrastructure. Migrating to a different solution means extracting all your vectors and rebuilding indexes elsewhere. This isn’t impossible, but it’s not trivial either.
  • Limited Control: You can’t tune the underlying index parameters. You can’t see how Pinecone implements filtering. You get what they give you, which is usually good but might not be optimal for your specific case.

When Pinecone Makes Sense

Pinecone is an excellent choice when:

  • You want zero operational overhead (no servers to manage)
  • Your team should focus on application features, not database operations
  • You can accept ~100ms baseline latency for the convenience
  • You need heavy filtering on multiple metadata fields
  • You want automatic scaling without capacity planning
  • You’re building a new application without existing infrastructure constraints
  • Your scale could grow unpredictably (Pinecone handles this automatically)

Pinecone might not be the best fit when:

  • You need sub-10ms query latency
  • You want to minimize ongoing costs (self-hosting can be cheaper at scale)
  • You prefer to control your infrastructure completely
  • You already have existing database infrastructure you can leverage
  • You’re uncomfortable with vendor lock-in

Comparing All Four Approaches

We’ve now tested four different ways to handle vector search with metadata filtering. Let’s look at what we learned.

The Performance Picture

Here’s the complete comparison:

Database   Unfiltered   Category Overhead   Year Overhead   Combined Overhead   Setup Complexity   Ops Overhead
ChromaDB   4.5ms        3.3x                8.0x            5.0x                Trivial            None
pgvector   2.5ms        2.3x                1.0x            2.3x                Moderate           Moderate
Qdrant     52ms         1.09x               1.11x           1.11x               Easy               Minimal
Pinecone   87ms         1.01x               1.01x           1.00x               Trivial            None

Three Different Strategies

Looking at these numbers, three distinct strategies emerge:

Strategy 1: Optimize for Raw Speed (pgvector)

pgvector wins on baseline query speed at 2.5ms. It runs in-process with PostgreSQL, so there’s no network overhead. If your primary concern is getting results back as fast as possible and filtering is occasional, pgvector delivers.

The catch: text filtering adds 2.3x overhead. Integer filtering is essentially free, but if you’re doing complex text filters frequently, that overhead accumulates.

Strategy 2: Optimize for Filtering Consistency (Qdrant)

Qdrant accepts a slower baseline (52ms) but delivers remarkably consistent filtering performance. Whether you filter on one field or five, category or year, simple or complex, you get roughly 1.1x overhead.

The catch: you’re running another service, and that baseline 52ms includes HTTP API overhead. For latency-critical applications, that might be too much.

Strategy 3: Optimize for Convenience (Pinecone)

Pinecone gives you zero operational overhead and essentially zero filtering overhead (1.0x). You don’t manage anything. You just use an API.

The catch: network latency to their cloud infrastructure means ~87ms baseline queries (from our location). The convenience costs you in baseline latency.

The Decision Framework

So which one should you choose? It depends entirely on your constraints.

Choose pgvector when:

  • Raw query speed is critical (need sub-5ms)
  • You already have PostgreSQL infrastructure
  • Your team has strong SQL and PostgreSQL skills
  • You primarily filter on integer fields (dates, IDs, counts)
  • Your scale is moderate (up to a few million vectors)
  • You’re comfortable with PostgreSQL operational tasks (VACUUM, index maintenance)

Choose Qdrant when:

  • You need predictable performance regardless of filter complexity
  • You filter on many fields with unpredictable combinations
  • You can accept ~50ms baseline latency
  • You want self-hosting but need better filtering than ChromaDB
  • You’re comfortable with Docker or Kubernetes deployment
  • Your scale is millions to tens of millions of vectors

Choose Pinecone when:

  • You want zero operational overhead
  • Your team should focus on features, not database operations
  • You can accept ~100ms baseline latency (varies by geography)
  • You need heavy filtering on multiple metadata fields
  • You want automatic scaling without capacity planning
  • Your scale could grow unpredictably

Choose ChromaDB when:

  • You’re prototyping and learning
  • You need simple local development
  • Filtering is occasional, not critical path
  • You want minimal setup complexity
  • Your scale is small (thousands to tens of thousands of vectors)

The Tradeoffs That Matter

Speed vs Filtering: pgvector is fastest but filtering costs you. Qdrant and Pinecone accept slower baselines for better filtering.

Control vs Convenience: Self-hosting (pgvector, Qdrant) gives you control but requires operational expertise. Managed services (Pinecone) remove operational burden but limit control.

Infrastructure: pgvector requires PostgreSQL. Qdrant needs container orchestration. Pinecone just needs an API key.

Geography: Local databases (pgvector, Qdrant) don’t care where you are. Cloud services (Pinecone) add latency based on your distance from their data centers.

No Universal Answer

There’s no “best” database here. Each one makes different tradeoffs. The right choice depends on your specific situation:

  • What’s your query volume and latency requirements?
  • How complex are your filters?
  • What infrastructure do you already have?
  • What expertise does your team have?
  • What’s your budget for operational overhead?

These questions matter more than any benchmark number.

What We Didn’t Test

Before you take these numbers as absolute truth, let’s be honest about what we didn’t measure. All four databases use approximate nearest-neighbor indexes for speed. That means queries are fast, but they can sometimes miss the true closest results—especially when filters are applied. In real applications, you should measure not just latency, but also result quality (recall), and tune settings if needed.
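
If you want to spot-check recall, one approach is to compare what the index returns against an exact brute-force search over the same embeddings. Below is a minimal sketch, assuming the embeddings NumPy array from earlier tutorials is loaded and that you can map the database's returned IDs back to row indices (for example, by parsing our paper_<i> IDs). The recall_at_k helper is ours, not part of any library.

import numpy as np

def recall_at_k(query_vec, embeddings, ann_indices, k=10):
    """Fraction of the true top-k neighbors that the ANN index actually returned."""
    # Exact cosine similarity between the query and every stored vector
    sims = embeddings @ query_vec / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    )
    exact_top_k = set(np.argsort(sims)[::-1][:k])  # brute-force ground truth
    return len(exact_top_k & set(ann_indices[:k])) / k

# Example: parse row indices from ids like "paper_42" returned by your database
# ann_indices = [int(pid.split("_")[1]) for pid in results['ids'][0]]
# print(recall_at_k(query_vec, embeddings, ann_indices, k=10))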

Scale

We tested 5,000 vectors. That’s useful for learning, but it’s small. Real applications might have 50,000 or 500,000 or 5 million vectors. Performance characteristics can change at different scales.

The patterns we observed (pgvector’s speed, Qdrant’s consistent filtering, Pinecone’s zero overhead filters) likely hold at larger scales. But the absolute numbers will be different. Run your own benchmarks at your target scale.

Configuration

All databases used default settings. We didn’t tune HNSW parameters. We didn’t experiment with different index types. Tuned configurations could show different performance characteristics.

For learning, defaults make sense. For production, you’ll want to tune based on your specific data and query patterns.
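
As one example of what tuning can look like, ChromaDB exposes HNSW settings through collection metadata, the same place the earlier tutorials set hnsw:space. The sketch below is illustrative only: the hnsw:* keys follow ChromaDB's documented naming, but availability and defaults vary by version, and the values shown are starting points to experiment with, not recommendations.

import chromadb

client = chromadb.Client()

# Illustrative HNSW tuning via collection metadata (values are placeholders)
tuned = client.create_collection(
    name="arxiv_tuned",
    metadata={
        "hnsw:space": "cosine",
        "hnsw:construction_ef": 200,  # higher = better index quality, slower builds
        "hnsw:search_ef": 100,        # higher = better recall, slower queries
        "hnsw:M": 32,                 # more links per node = better recall, more memory
    },
)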

Geographic Variance

We ran Pinecone tests from Mexico City to AWS us-east-1. If you’re in Virginia, your latency will be lower. If you’re in Tokyo, it will be higher. With self-hosted pgvector or Qdrant, you can deploy the database close to your application and keep geographic latency under your control.

Load Patterns

We measured queries at one moment in time with consistent load. Production systems experience variable query patterns, concurrent users, and resource contention. Real performance under real production load might differ.

Write Performance

We focused on query performance. We didn’t benchmark bulk updates, deletions, or reindexing operations. If you’re constantly updating vectors, write performance matters too.

Advanced Features

We didn’t test hybrid search with BM25, learned rerankers, multi-vector search, or other advanced features some databases offer. These capabilities might influence your choice.

What’s Next

You now have hands-on experience with four different vector databases. You understand their performance characteristics, their tradeoffs, and when to choose each one.

More importantly, you have a framework for thinking about database selection. It’s not about finding the “best” database. It’s about matching your requirements to each database’s strengths.

When you build your next application:

  1. Start with your requirements. What are your latency needs? How complex are your filters? What scale are you targeting?
  2. Match requirements to database characteristics. Need speed? Consider pgvector. Need consistent filtering? Look at Qdrant. Want zero ops? Try Pinecone.
  3. Prototype quickly. Spin up a test with your actual data and query patterns. Measure what matters for your use case.
  4. Be ready to change. Your requirements might evolve. The database that works at 10,000 vectors might not work at 10 million. That’s fine. You can migrate.

The vector database landscape is evolving rapidly. New options appear. Existing options improve. The fundamentals we covered here (understanding tradeoffs, measuring what matters, matching requirements to capabilities) will serve you regardless of which specific databases you end up using.

In our next tutorial, we’ll look at semantic caching and memory patterns for AI applications. We’ll use the knowledge from this tutorial to choose the right database for different caching scenarios.

Until then, experiment with these databases. Load your own data. Run your own queries. See how they behave with your specific workload. That hands-on experience is more valuable than any benchmark we could show you.


Key Takeaways

  • Performance Patterns Are Clear: pgvector delivers 2.5ms baseline (fastest), Qdrant 52ms (moderate with HTTP overhead), Pinecone 87ms (network latency dominates). Each optimizes for different goals.
  • Filtering Overhead Varies Dramatically: ChromaDB shows 3-8x overhead. pgvector shows 2.3x for text but 1.0x for integers. Qdrant maintains consistent 1.1x regardless of filter complexity. Pinecone achieves essentially zero filtering overhead (1.0x).
  • Three Distinct Strategies Emerge: Optimize for raw speed (pgvector), optimize for filtering consistency (Qdrant), or optimize for convenience (Pinecone). No universal "best" choice exists.
  • Purpose-Built Databases Excel at Filtering: Qdrant and Pinecone, designed specifically for filtered vector search, handle complex filters without performance degradation. pgvector leverages PostgreSQL's strengths but wasn't built primarily for this use case.
  • Operational Overhead Is Real: pgvector requires PostgreSQL expertise (VACUUM, index maintenance). Qdrant needs container orchestration. Pinecone removes ops but introduces vendor dependency. Match operational capacity to database choice.
  • Geography Matters for Cloud Services: Pinecone's 87ms baseline from Mexico City to AWS us-east-1 is dominated by network latency. Self-hosted options (pgvector, Qdrant) don't have this variance.
  • Scale Changes Everything: We tested 5,000 vectors. Behavior at 50k, 500k, or 5M vectors will differ. The patterns we observed likely hold, but absolute numbers will change. Always benchmark at your target scale.
  • Decision Frameworks Beat Feature Lists: Choose based on your constraints: latency requirements, filter complexity, existing infrastructure, team expertise, and operational capacity. Not on marketing claims.
  • Prototyping Beats Speculation: The fastest way to know if a database works for you is to load your actual data and run your actual queries. Benchmarks guide thinking but can't replace hands-on testing.

Metadata Filtering and Hybrid Search for Vector Databases

6 December 2025 at 02:43

In the first tutorial, we built a vector database with ChromaDB and ran semantic similarity searches across 5,000 arXiv papers. We discovered that vector search excels at understanding meaning: a query about "neural network training" successfully retrieved papers about optimization algorithms, even when they didn't use those exact words.

But here's what we couldn't do yet: What if we only want papers from the last two years? What if we need to search specifically within the Machine Learning category? What if someone searches for a rare technical term that vector search might miss?

This tutorial teaches you how to enhance vector search with two powerful capabilities: metadata filtering and hybrid search. By the end, you'll understand how to combine semantic similarity with traditional filters, when keyword search adds value, and how to make intelligent trade-offs between different search strategies.

What You'll Learn

By the end of this tutorial, you'll be able to:

  • Design metadata schemas that enable powerful filtering without performance pitfalls
  • Implement filtered vector searches in ChromaDB using metadata constraints
  • Measure and understand the performance overhead of different filter types
  • Build BM25 keyword search alongside your vector search
  • Combine vector similarity and keyword matching using weighted hybrid scoring
  • Evaluate different search strategies systematically using category precision
  • Make informed decisions about when metadata filtering and hybrid search add value

Most importantly, you'll learn to be honest about what works and what doesn't. Our experiments revealed some surprising results that challenge common assumptions about hybrid search.

Dataset and Environment Setup

We'll use the same 5,000 arXiv papers we used previously. If you completed the first tutorial, you already have these files. If you're starting fresh, download them now:

arxiv_papers_5k.csv download (7.7 MB) → Paper metadata
embeddings_cohere_5k.npy download (61.4 MB) → Pre-generated embeddings

The dataset contains 5,000 papers perfectly balanced across five categories:

  • cs.CL (Computational Linguistics): 1,000 papers
  • cs.CV (Computer Vision): 1,000 papers
  • cs.DB (Databases): 1,000 papers
  • cs.LG (Machine Learning): 1,000 papers
  • cs.SE (Software Engineering): 1,000 papers

Environment Setup

You'll need the same packages from previous tutorials, plus one new library for BM25:

# Create virtual environment (if starting fresh)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install required packages
# Python 3.12 with these versions:
# chromadb==1.3.4
# numpy==2.0.2
# pandas==2.2.2
# cohere==5.20.0
# python-dotenv==1.1.1
# rank-bm25==0.2.2  # NEW for keyword search

pip install chromadb numpy pandas cohere python-dotenv rank-bm25

Make sure your .env file contains your Cohere API key:

COHERE_API_KEY=your_key_here

Loading the Dataset

Let's load our data and verify everything is in place:

import numpy as np
import pandas as pd
import chromadb
from cohere import ClientV2
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()
cohere_api_key = os.getenv('COHERE_API_KEY')
if not cohere_api_key:
    raise ValueError("COHERE_API_KEY not found in .env file")

co = ClientV2(api_key=cohere_api_key)

# Load the dataset
df = pd.read_csv('arxiv_papers_5k.csv')
embeddings = np.load('embeddings_cohere_5k.npy')

print(f"Loaded {len(df)} papers")
print(f"Embeddings shape: {embeddings.shape}")
print(f"\nPapers per category:")
print(df['category'].value_counts().sort_index())

# Check what metadata we have
print(f"\nAvailable metadata columns:")
print(df.columns.tolist())
Loaded 5000 papers
Embeddings shape: (5000, 1536)

Papers per category:
category
cs.CL    1000
cs.CV    1000
cs.DB    1000
cs.LG    1000
cs.SE    1000
Name: count, dtype: int64

Available metadata columns:
['arxiv_id', 'title', 'abstract', 'authors', 'published', 'category']

We have rich metadata to work with: paper IDs, titles, abstracts, authors, publication dates, and categories. This metadata will power our filtering and help evaluate our search strategies.

Designing Metadata Schemas

Before we start filtering, we need to think carefully about what metadata to store and how to structure it. Good metadata design makes search powerful and performant. Poor design creates headaches.

What Makes Good Metadata

Good metadata is:

  • Filterable: Choose values that match how users actually search. If users filter by publication year, store year as an integer. If they filter by topic, store normalized category strings.
  • Atomic: Store individual fields separately rather than dumping everything into a single JSON blob. Want to filter by year? Don't make ChromaDB parse "Published: 2024-03-15" from a text field.
  • Indexed: Most vector databases index metadata fields differently than vector embeddings. Keep metadata fields small and specific so indexing works efficiently.
  • Consistent: Use the same data types and formats across all documents. Don't store year as "2024" for one paper and "March 2024" for another.

What Doesn't Belong in Metadata

Avoid storing:

  • Long text in metadata fields: The paper abstract is content, not metadata. Store it as the document text, not in a metadata field.
  • Nested structures: ChromaDB supports nested metadata, but complex JSON trees are hard to filter and often signal confused schema design.
  • Redundant information: If you can derive a field from another (like "decade" from "year"), consider computing it at query time instead of storing it.
  • Frequently changing values: Metadata updates can be expensive. Don't store view counts or frequently updated statistics in metadata.

Preparing Our Metadata

Let's prepare metadata for our 5,000 papers:

def prepare_metadata(df):
    """
    Prepare metadata for ChromaDB from our dataframe.

    Returns list of metadata dictionaries, one per paper.
    """
    metadatas = []

    for _, row in df.iterrows():
        # Extract year from published date (format: YYYY-MM-DD)
        year = int(str(row['published'])[:4])

        # Truncate authors if too long (ChromaDB has reasonable limits)
        authors = row['authors'][:200] if len(row['authors']) <= 200 else row['authors'][:197] + "..."

        metadata = {
            'title': row['title'],
            'category': row['category'],
            'year': year,  # Store as integer for range queries
            'authors': authors
        }
        metadatas.append(metadata)

    return metadatas

# Prepare metadata for all papers
metadatas = prepare_metadata(df)

# Check a sample
print("Sample metadata:")
print(metadatas[0])
Sample metadata:
{'title': 'Optimizing Mixture of Block Attention', 'category': 'cs.LG', 'year': 2025, 'authors': 'Tao He, Liang Ding, Zhenya Huang, Dacheng Tao'}

Notice we're storing:

  • title: The full paper title for display in results
  • category: One of our five CS categories for topic filtering
  • year: Extracted as an integer for range queries like "papers after 2024"
  • authors: Truncated to avoid extremely long strings

This metadata schema supports the filtering patterns users actually want: search within a category, filter by publication date, or display author information in results.

Anti-Patterns to Avoid

Let's look at what NOT to do:

Bad: JSON blob as metadata

# DON'T DO THIS
metadata = {
    'info': json.dumps({
        'title': title,
        'category': category,
        'year': year,
        # ... everything dumped in JSON
    })
}

This makes filtering painful. You can't efficiently filter by year when it's buried in a JSON string.

Bad: Long text as metadata

# DON'T DO THIS
metadata = {
    'abstract': full_abstract_text,  # This belongs in documents, not metadata
    'category': category
}

ChromaDB stores abstracts as document content. Duplicating them in metadata wastes space and doesn't improve search.

Bad: Inconsistent types

# DON'T DO THIS
metadata1 = {'year': 2024}          # Integer
metadata2 = {'year': '2024'}        # String
metadata3 = {'year': 'March 2024'}  # Unparseable

Consistent data types make filtering reliable. Always store years as integers if you want range queries.

Bad: Missing or inconsistent metadata fields

# DON'T DO THIS
paper1_metadata = {'title': 'Paper 1', 'category': 'cs.LG', 'year': 2024}
paper2_metadata = {'title': 'Paper 2', 'category': 'cs.CV'}  # Missing year!
paper3_metadata = {'title': 'Paper 3', 'year': 2023}  # Missing category!

Here's a common source of frustration: if a document is missing a metadata field, ChromaDB's filters won't match it at all. If you filter by {"year": {"$gte": 2024}} and some papers lack a year field, those papers simply won't appear in results. This causes the confusing "where did my document go?" problem.

The fix: Make sure all documents have the same metadata fields. If a value is unknown, store it as None or use a sensible default rather than omitting the field entirely. Consistency prevents documents from mysteriously disappearing when you add filters.
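
One simple way to enforce this is to normalize every metadata dictionary against a fixed set of required fields before inserting. Here's a small sketch; the field names and sentinel defaults are just examples, so pick defaults that make sense for your schema.

# Guarantee every document carries the same metadata keys, with sentinel defaults
REQUIRED_FIELDS = {"title": "unknown", "category": "unknown", "year": 0}

def normalize_metadata(metadata):
    return {field: metadata.get(field, default)
            for field, default in REQUIRED_FIELDS.items()}

print(normalize_metadata({"title": "Paper 2", "category": "cs.CV"}))
# {'title': 'Paper 2', 'category': 'cs.CV', 'year': 0}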

Creating a Collection with Rich Metadata

Now let's create a ChromaDB collection with all our metadata. If you plan to experiment, use the delete-and-recreate pattern we used previously:

# Initialize ChromaDB client
client = chromadb.Client()

# Delete existing collection if present (useful for experimentation)
try:
    client.delete_collection(name="arxiv_with_metadata")
    print("Deleted existing collection")
except:
    pass  # Collection didn't exist, that's fine

# Create collection with metadata
collection = client.create_collection(
    name="arxiv_with_metadata",
    metadata={
        "description": "5000 arXiv papers with rich metadata for filtering",
        "hnsw:space": "cosine"  # Using cosine similarity
    }
)

print(f"Created collection: {collection.name}")
Created collection: arxiv_with_metadata

Now let's insert our papers with metadata. Remember that ChromaDB has a batch size limit:

# Prepare data for insertion
ids = [f"paper_{i}" for i in range(len(df))]
documents = df['abstract'].tolist()

# Insert with metadata
# Our 5000 papers fit in one batch (limit is ~5,461)
print(f"Inserting {len(df)} papers with metadata...")

collection.add(
    ids=ids,
    embeddings=embeddings.tolist(),
    documents=documents,
    metadatas=metadatas
)

print(f"✓ Collection contains {collection.count()} papers with metadata")
Inserting 5000 papers with metadata...
✓ Collection contains 5000 papers with metadata

We now have a collection where every paper has both its embedding and rich metadata. This enables powerful combinations of semantic search and traditional filtering.

Metadata Filtering in Practice

Let's start filtering our searches using metadata. ChromaDB uses a where clause syntax similar to database queries.

Basic Filtering by Category

Suppose we want to search only within Machine Learning papers:

# First, let's create a helper function for queries
def search_with_filter(query_text, where_clause=None, n_results=5):
    """
    Search with optional metadata filtering.

    Args:
        query_text: The search query
        where_clause: Optional ChromaDB where clause for filtering
        n_results: Number of results to return

    Returns:
        Search results
    """
    # Embed the query
    response = co.embed(
        texts=[query_text],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])

    # Search with optional filter
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=n_results,
        where=where_clause  # Apply metadata filter here
    )

    return results

# Example: Search for "deep learning optimization" only in ML papers
query = "deep learning optimization techniques"

results_filtered = search_with_filter(
    query,
    where_clause={"category": "cs.LG"}  # Only Machine Learning papers
)

print(f"Query: '{query}'")
print("Filter: category = 'cs.LG'")
print("\nTop 5 results:")
for i in range(len(results_filtered['ids'][0])):
    metadata = results_filtered['metadatas'][0][i]
    distance = results_filtered['distances'][0][i]

    print(f"\n{i+1}. {metadata['title']}")
    print(f"   Category: {metadata['category']} | Year: {metadata['year']}")
    print(f"   Distance: {distance:.4f}")
Query: 'deep learning optimization techniques'
Filter: category = 'cs.LG'

Top 5 results:

1. Adam symmetry theorem: characterization of the convergence of the stochastic Adam optimizer
   Category: cs.LG | Year: 2025
   Distance: 0.6449

2. Non-Euclidean SGD for Structured Optimization: Unified Analysis and Improved Rates
   Category: cs.LG | Year: 2025
   Distance: 0.6571

3. Training Neural Networks at Any Scale
   Category: cs.LG | Year: 2025
   Distance: 0.6674

4. Deep Progressive Training: scaling up depth capacity of zero/one-layer models
   Category: cs.LG | Year: 2025
   Distance: 0.6682

5. DP-AdamW: Investigating Decoupled Weight Decay and Bias Correction in Private Deep Learning
   Category: cs.LG | Year: 2025
   Distance: 0.6732

All five results are from cs.LG, exactly as we requested. The filtering worked correctly. The distances are also tightly clustered between 0.64 and 0.67.

This close grouping tells us we found papers that all match our query equally well. The lower distances (compared to the 1.1+ ranges we saw previously) show that filtering down to a specific category helped us find stronger semantic matches.

Filtering by Year Range

What if we want papers from a specific time period?

# Search for papers from 2024 or later
results_recent = search_with_filter(
    "neural network architectures",
    where_clause={"year": {"$gte": 2024}}  # Greater than or equal to 2024
)

print("Recent papers (2024+) about neural network architectures:")
for i in range(3):  # Show top 3
    metadata = results_recent['metadatas'][0][i]
    print(f"{i+1}. {metadata['title']} ({metadata['year']})")
Recent papers (2024+) about neural network architectures:
1. Bearing Syntactic Fruit with Stack-Augmented Neural Networks (2025)
2. Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis (2025)
3. Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis (2025)

Notice that results #2 and #3 are the same paper. This happens because some arXiv papers get cross-posted to multiple categories. A paper about neural architectures for language models might appear in both cs.LG and cs.CL, so when we filter only by year, it shows up once for each category assignment.

You could deduplicate results by tracking paper IDs and skipping ones you've already seen, but whether you should depends on your use case. Sometimes knowing a paper appears in multiple categories is actually valuable information. For this tutorial, we're keeping duplicates as-is because they reflect how real databases behave and help us understand what filtering does and doesn't handle. If you were building a paper recommendation system, you'd probably deduplicate. If you were analyzing category overlap patterns, you'd want to keep them.
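
If you do want unique papers, a small helper like the sketch below does the job: walk the results in ranked order and keep only the first occurrence of each paper. We key on the title because that's what our metadata stores; if you also kept an arxiv_id field in your metadata, that would be a more reliable key.

def deduplicate_results(results):
    """Keep the first (best-ranked) hit per paper, dropping cross-listed duplicates."""
    seen = set()
    unique = []
    for metadata in results['metadatas'][0]:
        key = metadata['title']  # swap in an arxiv_id field if you stored one
        if key not in seen:
            seen.add(key)
            unique.append(metadata)
    return unique

unique_recent = deduplicate_results(results_recent)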

Comparison Operators

ChromaDB supports several comparison operators for numeric fields; a few example where clauses follow the list:

  • $eq: Equal to
  • $ne: Not equal to
  • $gt: Greater than
  • $gte: Greater than or equal to
  • $lt: Less than
  • $lte: Less than or equal to
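
Here are a few example where clauses built from these operators, reusing the search_with_filter helper from above (a quick sketch; the year values are arbitrary):

exactly_2024 = {"year": {"$eq": 2024}}   # only papers published in 2024
not_2023 = {"year": {"$ne": 2023}}       # everything except 2023
after_2023 = {"year": {"$gt": 2023}}     # 2024 and later
up_to_2024 = {"year": {"$lte": 2024}}    # 2024 and earlier

results_after_2023 = search_with_filter(
    "query optimization techniques",
    where_clause=after_2023
)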

Combined Filters

The real power comes from combining multiple filters:

# Find Computer Vision papers from 2025
results_combined = search_with_filter(
    "image recognition and classification",
    where_clause={
        "$and": [
            {"category": "cs.CV"},
            {"year": {"$eq": 2025}}
        ]
    }
)

print("Computer Vision papers from 2025 about image recognition:")
for i in range(3):
    metadata = results_combined['metadatas'][0][i]
    print(f"{i+1}. {metadata['title']}")
    print(f"   {metadata['category']} | {metadata['year']}")
Computer Vision papers from 2025 about image recognition:
1. SWAN -- Enabling Fast and Mobile Histopathology Image Annotation through Swipeable Interfaces
   cs.CV | 2025
2. Covariance Descriptors Meet General Vision Encoders: Riemannian Deep Learning for Medical Image Classification
   cs.CV | 2025
3. UniADC: A Unified Framework for Anomaly Detection and Classification
   cs.CV | 2025

ChromaDB also supports $or for alternatives:

# Papers from either Database or Software Engineering categories
where_db_or_se = {
    "$or": [
        {"category": "cs.DB"},
        {"category": "cs.SE"}
    ]
}

These filtering capabilities let you narrow searches to exactly the subset you need.

Measuring Filtering Performance Overhead

Metadata filtering isn't free. Let's measure the actual performance impact of different filter types. We'll run multiple queries to get stable measurements:

import time

def benchmark_filter(where_clause, n_iterations=100, description=""):
    """
    Benchmark query performance with a specific filter.

    Args:
        where_clause: The filter to apply (None for unfiltered)
        n_iterations: Number of times to run the query
        description: Description of what we're testing

    Returns:
        Average query time in milliseconds
    """
    # Use a fixed query embedding to keep comparisons fair
    query_text = "machine learning model training"
    response = co.embed(
        texts=[query_text],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])

    # Warm up (run once to load any caches)
    collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=5,
        where=where_clause
    )

    # Benchmark
    start_time = time.time()
    for _ in range(n_iterations):
        collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=5,
            where=where_clause
        )
    elapsed = time.time() - start_time
    avg_ms = (elapsed / n_iterations) * 1000

    print(f"{description}")
    print(f"  Average query time: {avg_ms:.2f} ms")
    return avg_ms

print("Running filtering performance benchmarks (100 iterations each)...")
print("=" * 70)

# Baseline: No filtering
baseline_ms = benchmark_filter(None, description="Baseline (no filter)")

print()

# Category filter
category_ms = benchmark_filter(
    {"category": "cs.LG"},
    description="Category filter (category = 'cs.LG')"
)
category_overhead = (category_ms / baseline_ms)
print(f"  Overhead: {category_overhead:.1f}x slower ({(category_overhead-1)*100:.0f}%)")

print()

# Year range filter
year_ms = benchmark_filter(
    {"year": {"$gte": 2024}},
    description="Year range filter (year >= 2024)"
)
year_overhead = (year_ms / baseline_ms)
print(f"  Overhead: {year_overhead:.1f}x slower ({(year_overhead-1)*100:.0f}%)")

print()

# Combined filter
combined_ms = benchmark_filter(
    {"$and": [{"category": "cs.LG"}, {"year": {"$gte": 2024}}]},
    description="Combined filter (category AND year)"
)
combined_overhead = (combined_ms / baseline_ms)
print(f"  Overhead: {combined_overhead:.1f}x slower ({(combined_overhead-1)*100:.0f}%)")

print("\n" + "=" * 70)
print("Summary: Filtering adds 3-10x overhead depending on filter type")
Running filtering performance benchmarks (100 iterations each)...
======================================================================
Baseline (no filter)
  Average query time: 4.45 ms

Category filter (category = 'cs.LG')
  Average query time: 14.82 ms
  Overhead: 3.3x slower (233%)

Year range filter (year >= 2024)
  Average query time: 35.67 ms
  Overhead: 8.0x slower (702%)

Combined filter (category AND year)
  Average query time: 22.34 ms
  Overhead: 5.0x slower (402%)

======================================================================
Summary: Filtering adds 3-10x overhead depending on filter type

What these numbers tell us:

  • Unfiltered queries are fast: Our baseline of 4.45ms means ChromaDB's HNSW index works well.
  • Category filtering costs 3.3x overhead: The query still completes in 14.82ms, which is totally usable, but it's noticeably slower than unfiltered search.
  • Numeric range queries are most expensive: Year filtering at 8x overhead (35.67ms) shows that range queries on numeric fields are particularly costly in ChromaDB.
  • Combined filters fall in between: At 5x overhead (22.34ms), combining filters doesn't just multiply the costs. There's some optimization happening.
  • Real-world variability: If you run these benchmarks yourself, you'll see the exact numbers vary between runs. We saw category filtering range from 13.8-16.1ms across multiple benchmark sessions. This variability is normal. What stays consistent is the order: year filters are always most expensive, then combined filters, then category filters.

Understanding the Performance Trade-off

This overhead is significant. A multi-fold slowdown matters when you're processing hundreds of queries per second. But context is important:

When filtering makes sense despite overhead:

  • Users explicitly request filters ("Show me recent papers")
  • The filtered results are substantially better than unfiltered
  • Your query volume is manageable (even 35ms per query handles 28 queries/second)
  • User experience benefits outweigh the performance cost

When to reconsider filtering:

  • Very high query volume with tight latency requirements
  • Filters don't meaningfully improve results for most queries
  • You need sub-10ms response times at scale

Important context: This overhead reflects how ChromaDB implements filtering at this scale. When we explore production vector databases in the next tutorial, you'll see how systems like Qdrant handle filtering more efficiently. This isn't a fundamental limitation of vector databases; it's a characteristic of how different systems approach the problem.

For now, understand that metadata filtering in ChromaDB works and is usable, but it comes with measurable performance costs. Design your metadata schema carefully and filter only when the value justifies the overhead.

Implementing BM25 Keyword Search

Vector search excels at understanding semantic meaning, but it can struggle with rare keywords, specific technical terms, or exact name matches. BM25 keyword search complements vector search by ranking documents based on term frequency and document length.

Understanding BM25

BM25 (Best Matching 25) is a ranking function that scores documents based on:

  • How many times query terms appear in the document (term frequency)
  • How rare those terms are across all documents (inverse document frequency)
  • Document length normalization (shorter documents aren't penalized)

BM25 treats words as independent tokens. If you search for "SQL query optimization," BM25 looks for documents containing those exact words, giving higher scores to documents where these terms appear frequently.
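
Under the hood, this is roughly the textbook Okapi BM25 formula. Here's a simplified sketch of the score for a single document; rank-bm25 uses the same default parameters (k1=1.5, b=0.75) but a slightly different IDF variant with extra handling for very common terms, so its numbers won't match this exactly.

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Textbook Okapi BM25 score of one tokenized document for a tokenized query."""
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N        # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        n_t = sum(1 for doc in corpus if term in doc)       # documents containing the term
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)   # rarer terms weigh more
        freq = tf[term]
        denom = freq + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * freq * (k1 + 1) / denom
    return score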

Building a BM25 Index

Let's implement BM25 search on our arXiv abstracts:

from rank_bm25 import BM25Okapi
import string

def simple_tokenize(text):
    """
    Basic tokenization for BM25.

    Lowercase text, remove punctuation, split on whitespace.
    """
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

# Tokenize all abstracts
print("Building BM25 index from 5000 abstracts...")
tokenized_corpus = [simple_tokenize(abstract) for abstract in df['abstract']]

# Create BM25 index
bm25 = BM25Okapi(tokenized_corpus)
print("✓ BM25 index created")

# Test it with a sample query
query = "SQL query optimization indexing"
tokenized_query = simple_tokenize(query)

# Get BM25 scores for all documents
bm25_scores = bm25.get_scores(tokenized_query)

# Find top 5 papers by BM25 score
top_indices = np.argsort(bm25_scores)[::-1][:5]

print(f"\nQuery: '{query}'")
print("Top 5 by BM25 keyword matching:")
for rank, idx in enumerate(top_indices, 1):
    score = bm25_scores[idx]
    title = df.iloc[idx]['title']
    category = df.iloc[idx]['category']
    print(f"{rank}. [{category}] {title[:60]}...")
    print(f"   BM25 Score: {score:.2f}")
Building BM25 index from 5000 abstracts...
✓ BM25 index created

Query: 'SQL query optimization indexing'
Top 5 by BM25 keyword matching:
1. [cs.DB] Learned Adaptive Indexing...
   BM25 Score: 13.34
2. [cs.DB] LLM4Hint: Leveraging Large Language Models for Hint Recommen...
   BM25 Score: 13.25
3. [cs.LG] Cortex AISQL: A Production SQL Engine for Unstructured Data...
   BM25 Score: 12.83
4. [cs.DB] Cortex AISQL: A Production SQL Engine for Unstructured Data...
   BM25 Score: 12.83
5. [cs.DB] A Functional Data Model and Query Language is All You Need...
   BM25 Score: 11.91

BM25 correctly identified Database papers about query optimization, with 4 out of 5 results from cs.DB. The third result is from Machine Learning but still relevant to SQL processing (Cortex AISQL), showing how keyword matching can surface related papers from adjacent domains. When the query contains specific technical terms, keyword matching works well.

A note about scale: The rank-bm25 library works great for our 5,000 abstracts and similar small datasets. It's perfect for learning BM25 concepts without complexity. For larger datasets or production systems, you'd typically use faster BM25 implementations found in search engines like Elasticsearch, OpenSearch, or Apache Lucene. These are optimized for millions of documents and high query volumes. For now, rank-bm25 gives us everything we need to understand how keyword search complements vector search.

Comparing BM25 to Vector Search

Let's run the same query through vector search:

# Vector search for the same query
results_vector = search_with_filter(query, n_results=5)

print(f"\nSame query: '{query}'")
print("Top 5 by vector similarity:")
for i in range(5):
    metadata = results_vector['metadatas'][0][i]
    distance = results_vector['distances'][0][i]
    print(f"{i+1}. [{metadata['category']}] {metadata['title'][:60]}...")
    print(f"   Distance: {distance:.4f}")
Same query: 'SQL query optimization indexing'
Top 5 by vector similarity:
1. [cs.DB] VIDEX: A Disaggregated and Extensible Virtual Index for the ...
   Distance: 0.5510
2. [cs.DB] AMAZe: A Multi-Agent Zero-shot Index Advisor for Relational ...
   Distance: 0.5586
3. [cs.DB] AutoIndexer: A Reinforcement Learning-Enhanced Index Advisor...
   Distance: 0.5602
4. [cs.DB] LLM4Hint: Leveraging Large Language Models for Hint Recommen...
   Distance: 0.5837
5. [cs.DB] Training-Free Query Optimization via LLM-Based Plan Similari...
   Distance: 0.5856

Interesting! While only one paper (LLM4Hint) appears in both top 5 lists, both approaches successfully identify relevant Database papers. The keywords "SQL" and "query" and "optimization" appear frequently in database papers, and the semantic meaning also points to that domain. The different rankings show how keyword matching and semantic search can prioritize different aspects of relevance, even when both correctly identify the target category.

This convergence of categories (both returning cs.DB papers) is common when queries contain domain-specific terminology that appears naturally in relevant documents.

Hybrid Search: Combining Vector and Keyword Search

Hybrid search combines the strengths of both approaches: vector search for semantic understanding, keyword search for exact term matching. Let's implement weighted hybrid scoring.

Our Implementation

Before we dive into the code, let's be clear about what we're building. This is a simplified implementation designed to teach you the core concepts of hybrid search: score normalization, weighted combination, and balancing semantic versus keyword signals.

Production vector databases often handle hybrid scoring internally or use more sophisticated approaches like rank-based fusion (combining rankings rather than scores) or learned rerankers (neural models that re-score results). We'll explore these production systems in the next tutorial. For now, our implementation focuses on the fundamentals that apply across all hybrid approaches.
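
For reference, here's what rank-based fusion can look like. This is a generic sketch of reciprocal rank fusion (RRF), not something we'll use in this tutorial: each result list contributes 1/(k + rank) per document, so documents that rank well in both lists rise to the top without any score normalization.

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: a list of ranked id lists, e.g. [vector_ids, bm25_ids]."""
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)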

The Challenge: Normalizing Different Score Scales

BM25 scores range from 0 to potentially 20+ (higher is better). ChromaDB distances range from 0 to 2+ (lower is better). We can't just add them together. We need to:

  1. Normalize both score types to the same 0-1 range
  2. Convert ChromaDB distances to similarities (flip the scale)
  3. Apply weights to combine them

Implementation

Here's our complete hybrid search function:

def hybrid_search(query_text, alpha=0.5, n_results=10):
    """
    Combine BM25 keyword search with vector similarity search.

    Args:
        query_text: The search query
        alpha: Weight for BM25 (0 = pure vector, 1 = pure keyword)
        n_results: Number of results to return

    Returns:
        Combined results with hybrid scores
    """
    # Get BM25 scores
    tokenized_query = simple_tokenize(query_text)
    bm25_scores = bm25.get_scores(tokenized_query)

    # Get vector similarities (we'll search more to ensure good coverage)
    response = co.embed(
        texts=[query_text],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])

    vector_results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=100  # Get more candidates for better coverage
    )

    # Extract vector distances and convert to similarities
    # ChromaDB returns cosine distance (0 to 2, lower = more similar)
    # We'll convert to similarity scores where higher = better for easier combination
    vector_distances = {}
    for i, paper_id in enumerate(vector_results['ids'][0]):
        distance = vector_results['distances'][0][i]
        # Convert distance to similarity (simple inversion)
        similarity = 1 / (1 + distance)
        vector_distances[paper_id] = similarity

    # Normalize BM25 scores to 0-1 range
    max_bm25 = max(bm25_scores) if max(bm25_scores) > 0 else 1
    min_bm25 = min(bm25_scores)
    bm25_normalized = {}
    for i, score in enumerate(bm25_scores):
        paper_id = f"paper_{i}"
        normalized = (score - min_bm25) / (max_bm25 - min_bm25) if max_bm25 > min_bm25 else 0
        bm25_normalized[paper_id] = normalized

    # Combine scores using weighted average
    # hybrid_score = alpha * bm25 + (1 - alpha) * vector
    hybrid_scores = {}
    all_paper_ids = set(bm25_normalized.keys()) | set(vector_distances.keys())

    for paper_id in all_paper_ids:
        bm25_score = bm25_normalized.get(paper_id, 0)
        vector_score = vector_distances.get(paper_id, 0)

        hybrid_score = alpha * bm25_score + (1 - alpha) * vector_score
        hybrid_scores[paper_id] = hybrid_score

    # Get top N by hybrid score
    top_paper_ids = sorted(hybrid_scores.items(), key=lambda x: x[1], reverse=True)[:n_results]

    # Format results
    results = []
    for paper_id, score in top_paper_ids:
        paper_idx = int(paper_id.split('_')[1])
        results.append({
            'paper_id': paper_id,
            'title': df.iloc[paper_idx]['title'],
            'category': df.iloc[paper_idx]['category'],
            'abstract': df.iloc[paper_idx]['abstract'][:200] + "...",
            'hybrid_score': score,
            'bm25_score': bm25_normalized.get(paper_id, 0),
            'vector_score': vector_distances.get(paper_id, 0)
        })

    return results

# Test with different alpha values
query = "neural network training optimization"

print(f"Query: '{query}'")
print("=" * 80)

# Pure vector (alpha = 0)
print("\nPure Vector Search (alpha=0.0):")
results = hybrid_search(query, alpha=0.0, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f"   Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")

# Hybrid 30% keyword, 70% vector
print("\nHybrid 30/70 (alpha=0.3):")
results = hybrid_search(query, alpha=0.3, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f"   Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")

# Hybrid 50/50
print("\nHybrid 50/50 (alpha=0.5):")
results = hybrid_search(query, alpha=0.5, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f"   Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")

# Pure keyword (alpha = 1.0)
print("\nPure BM25 Keyword (alpha=1.0):")
results = hybrid_search(query, alpha=1.0, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f"   Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")
Query: 'neural network training optimization'
================================================================================

Pure Vector Search (alpha=0.0):
1. [cs.LG] Training Neural Networks at Any Scale...
   Hybrid: 0.642 (Vector: 0.642, BM25: 0.749)
2. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
   Hybrid: 0.630 (Vector: 0.630, BM25: 1.000)
3. [cs.LG] Adam symmetry theorem: characterization of the convergence o...
   Hybrid: 0.617 (Vector: 0.617, BM25: 0.381)
4. [cs.LG] A Distributed Training Architecture For Combinatorial Optimi...
   Hybrid: 0.617 (Vector: 0.617, BM25: 0.884)
5. [cs.LG] Can Training Dynamics of Scale-Invariant Neural Networks Be ...
   Hybrid: 0.609 (Vector: 0.609, BM25: 0.566)

Hybrid 30/70 (alpha=0.3):
1. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
   Hybrid: 0.741 (Vector: 0.630, BM25: 1.000)
2. [cs.LG] Neuronal Fluctuations: Learning Rates vs Participating Neuro...
   Hybrid: 0.714 (Vector: 0.603, BM25: 0.971)
3. [cs.CV] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
   Hybrid: 0.709 (Vector: 0.601, BM25: 0.960)
4. [cs.LG] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
   Hybrid: 0.708 (Vector: 0.601, BM25: 0.960)
5. [cs.LG] N-ReLU: Zero-Mean Stochastic Extension of ReLU...
   Hybrid: 0.707 (Vector: 0.603, BM25: 0.948)

Hybrid 50/50 (alpha=0.5):
1. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
   Hybrid: 0.815 (Vector: 0.630, BM25: 1.000)
2. [cs.LG] Neuronal Fluctuations: Learning Rates vs Participating Neuro...
   Hybrid: 0.787 (Vector: 0.603, BM25: 0.971)
3. [cs.CV] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
   Hybrid: 0.780 (Vector: 0.601, BM25: 0.960)
4. [cs.LG] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
   Hybrid: 0.780 (Vector: 0.601, BM25: 0.960)
5. [cs.LG] N-ReLU: Zero-Mean Stochastic Extension of ReLU...
   Hybrid: 0.775 (Vector: 0.603, BM25: 0.948)

Pure BM25 Keyword (alpha=1.0):
1. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
   Hybrid: 1.000 (Vector: 0.630, BM25: 1.000)
2. [cs.LG] Neuronal Fluctuations: Learning Rates vs Participating Neuro...
   Hybrid: 0.971 (Vector: 0.603, BM25: 0.971)
3. [cs.LG] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
   Hybrid: 0.960 (Vector: 0.601, BM25: 0.960)
4. [cs.CV] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
   Hybrid: 0.960 (Vector: 0.601, BM25: 0.960)
5. [cs.LG] N-ReLU: Zero-Mean Stochastic Extension of ReLU...
   Hybrid: 0.948 (Vector: 0.603, BM25: 0.948)

The output shows how different alpha values affect which papers surface. With pure vector search (alpha=0), you'll see papers that semantically relate to neural network training. As you increase alpha toward 1, you'll increasingly weight papers that literally contain the words "neural," "network," "training," and "optimization."

Evaluating Search Strategies Systematically

We've implemented three search approaches: pure vector, pure keyword, and hybrid. But which one actually works better? We need systematic evaluation.

The Evaluation Metric: Category Precision

For our balanced 5k dataset, we can use category precision as our success metric:

Category precision @k: What percentage of the top k results are in the expected category?

If we search for "SQL query optimization," we expect Database papers (cs.DB). If 4 out of 5 top results are from cs.DB, we have 80% precision@5.

This metric works because our dataset is perfectly balanced and we can predict which category should dominate for specific queries.
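
As a tiny worked example (the category list here is made up for illustration):

# Hypothetical top-5 result categories for a database-oriented query
top5_categories = ["cs.DB", "cs.DB", "cs.LG", "cs.DB", "cs.DB"]
precision_at_5 = sum(c == "cs.DB" for c in top5_categories) / 5
print(precision_at_5)  # 0.8, i.e. 80% precision@5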

Creating Test Queries

Let's create 10 diverse queries targeting different categories:

test_queries = [
    {
        "text": "natural language processing transformers",
        "expected_category": "cs.CL",
        "description": "NLP query"
    },
    {
        "text": "image segmentation computer vision",
        "expected_category": "cs.CV",
        "description": "Vision query"
    },
    {
        "text": "database query optimization indexing",
        "expected_category": "cs.DB",
        "description": "Database query"
    },
    {
        "text": "neural network training deep learning",
        "expected_category": "cs.LG",
        "description": "ML query with clear terms"
    },
    {
        "text": "software testing debugging quality assurance",
        "expected_category": "cs.SE",
        "description": "Software engineering query"
    },
    {
        "text": "attention mechanisms sequence models",
        "expected_category": "cs.CL",
        "description": "NLP architecture query"
    },
    {
        "text": "convolutional neural networks image recognition",
        "expected_category": "cs.CV",
        "description": "Vision with technical terms"
    },
    {
        "text": "distributed systems database consistency",
        "expected_category": "cs.DB",
        "description": "Database systems query"
    },
    {
        "text": "reinforcement learning policy gradient",
        "expected_category": "cs.LG",
        "description": "RL query"
    },
    {
        "text": "code review static analysis",
        "expected_category": "cs.SE",
        "description": "SE development query"
    }
]

print(f"Created {len(test_queries)} test queries")
print("Expected category distribution:")
categories = [q['expected_category'] for q in test_queries]
print(pd.Series(categories).value_counts().sort_index())
Created 10 test queries
Expected category distribution:
cs.CL    2
cs.CV    2
cs.DB    2
cs.LG    2
cs.SE    2
Name: count, dtype: int64

Our test set is balanced across categories, ensuring fair evaluation.

Running the Evaluation

Now let's test pure vector, pure keyword, and hybrid approaches:

def calculate_category_precision(query_text, expected_category, search_type="vector", alpha=0.5):
    """
    Calculate what percentage of top 5 results match expected category.

    Args:
        query_text: The search query
        expected_category: Expected category (e.g., 'cs.LG')
        search_type: 'vector', 'bm25', or 'hybrid'
        alpha: Weight for BM25 if using hybrid

    Returns:
        Precision (0.0 to 1.0)
    """
    if search_type == "vector":
        results = search_with_filter(query_text, n_results=5)
        categories = [r['category'] for r in results['metadatas'][0]]

    elif search_type == "bm25":
        tokenized_query = simple_tokenize(query_text)
        bm25_scores = bm25.get_scores(tokenized_query)
        top_indices = np.argsort(bm25_scores)[::-1][:5]
        categories = [df.iloc[idx]['category'] for idx in top_indices]

    elif search_type == "hybrid":
        results = hybrid_search(query_text, alpha=alpha, n_results=5)
        categories = [r['category'] for r in results]

    # Calculate precision
    matches = sum(1 for cat in categories if cat == expected_category)
    precision = matches / len(categories)

    return precision, categories

# Evaluate all strategies
results_summary = {
    'Pure Vector': [],
    'Hybrid 30/70': [],
    'Hybrid 50/50': [],
    'Pure BM25': []
}

print("Evaluating search strategies on 10 test queries...")
print("=" * 80)

for query_info in test_queries:
    query = query_info['text']
    expected = query_info['expected_category']

    print(f"\nQuery: {query}")
    print(f"Expected: {expected}")

    # Pure vector
    precision, _ = calculate_category_precision(query, expected, "vector")
    results_summary['Pure Vector'].append(precision)
    print(f"  Pure Vector: {precision*100:.0f}% precision")

    # Hybrid 30/70
    precision, _ = calculate_category_precision(query, expected, "hybrid", alpha=0.3)
    results_summary['Hybrid 30/70'].append(precision)
    print(f"  Hybrid 30/70: {precision*100:.0f}% precision")

    # Hybrid 50/50
    precision, _ = calculate_category_precision(query, expected, "hybrid", alpha=0.5)
    results_summary['Hybrid 50/50'].append(precision)
    print(f"  Hybrid 50/50: {precision*100:.0f}% precision")

    # Pure BM25
    precision, _ = calculate_category_precision(query, expected, "bm25")
    results_summary['Pure BM25'].append(precision)
    print(f"  Pure BM25: {precision*100:.0f}% precision")

# Calculate average precision for each strategy
print("\n" + "=" * 80)
print("OVERALL RESULTS")
print("=" * 80)
for strategy, precisions in results_summary.items():
    avg_precision = sum(precisions) / len(precisions)
    print(f"{strategy}: {avg_precision*100:.0f}% average category precision")
Evaluating search strategies on 10 test queries...
================================================================================

Query: natural language processing transformers
Expected: cs.CL
  Pure Vector: 80% precision
  Hybrid 30/70: 60% precision
  Hybrid 50/50: 60% precision
  Pure BM25: 60% precision

Query: image segmentation computer vision
Expected: cs.CV
  Pure Vector: 80% precision
  Hybrid 30/70: 80% precision
  Hybrid 50/50: 80% precision
  Pure BM25: 80% precision

[... additional queries ...]

================================================================================
OVERALL RESULTS
================================================================================
Pure Vector: 84% average category precision
Hybrid 30/70: 78% average category precision
Hybrid 50/50: 78% average category precision
Pure BM25: 78% average category precision

Understanding What the Results Tell Us

These results deserve careful interpretation. Let's be honest about what we discovered.

Finding 1: Pure Vector Performed Best on This Dataset

Pure vector search achieved 84% category precision, compared to 78% for both hybrid configurations and for pure BM25. This might surprise you if you've read guides claiming hybrid search always outperforms pure approaches.

Why pure vector dominated on academic abstracts:

Academic papers have rich vocabulary and technical terminology. ML papers naturally use words like "training," "optimization," "neural networks." Database papers naturally use words like "query," "index," "transaction." The semantic embeddings capture these domain-specific patterns well.

Adding BM25 keyword matching introduced false positives. Papers that coincidentally used similar words in different contexts got boosted incorrectly. For example, a database paper might mention "model training" when discussing query optimization models, causing it to rank high for "neural network training" queries even though it's not about neural networks.

Finding 2: Hybrid Search Can Still Add Value

Just because pure vector won on this dataset doesn't mean hybrid search is worthless. There are scenarios where keyword matching helps:

When hybrid might outperform pure vector:

  • Searching structured data (product catalogs, API documentation)
  • Queries with rare technical terms that might not embed well
  • Domains where exact keyword presence is meaningful
  • Documents with inconsistent writing quality where semantic meaning is unclear

On our academic abstracts: The rich vocabulary gave vector search everything it needed. Keyword matching added more noise than signal.

Finding 3: The Vocabulary Mismatch Problem

Some queries failed across ALL strategies. For example, we tested "reducing storage requirements for system event data" hoping to find a paper about log compression. None of the approaches found it. Why?

The query used "reducing storage requirements" but the paper said "compression" and "resource savings." These are semantically equivalent, but the vocabulary differs. At the 5,000-paper scale, with multiple papers legitimately matching each query, vocabulary mismatches become visible.

This isn't a failure of vector search or hybrid search. It's the reality of semantic retrieval: users search with general terms, papers use technical jargon. Sometimes the gap is too wide.

Finding 4: Query Quality Matters More Than Strategy

Throughout our evaluation, we noticed that well-crafted queries with clear technical terms performed well across all strategies, while vague queries struggled everywhere.

A query like "neural network training optimization techniques" succeeded because it used the same language papers use. A query like "making models work better" failed because it's too general and uses informal language.

The lesson: Before optimizing your search strategy, make sure your queries match how your documents are written. Understanding your corpus matters more than choosing between vector and keyword search.

Practical Guidance for Real Projects

Let's consolidate what we've learned into actionable advice.

When to Use Metadata Filtering

Use filtering when:

  • Users explicitly request filters ("show me papers from 2024")
  • Filtering meaningfully improves result quality
  • Your query volume is manageable (ChromaDB can handle dozens of filtered queries per second)
  • The performance cost is acceptable for your use case

Design your schema carefully (see the sketch after this list):

  • Store filterable fields as atomic values (integers for years, strings for categories)
  • Avoid nested JSON blobs or long text in metadata
  • Keep metadata consistent across documents
  • Test filtering performance on your actual data before deploying
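
To make the schema advice above concrete, here's a minimal sketch of the kind of per-document metadata that filters well versus the kind that doesn't. The field names are illustrative, not taken from the earlier code; adapt them to your own corpus. Note that ChromaDB only accepts flat string, number, and boolean metadata values, which is exactly why the nested pattern below is an anti-pattern.

# A metadata layout that filters well: flat, atomic, consistent across documents.
good_metadata = {
    "category": "cs.LG",   # string, matches the values used in where clauses
    "year": 2024,          # integer, enables $gte / $lte range filters
    "has_code": True       # boolean flag instead of free text
}

# An anti-pattern: nested structures and long text can't be filtered efficiently
# (and ChromaDB will reject non-scalar values outright).
bad_metadata = {
    "info": {"category": "cs.LG", "published": "2024-03-15"},   # nested blob
    "abstract": "A very long abstract pasted into metadata..."  # belongs in the document, not metadata
}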

Accept the overhead:

  • Filtered queries run slower than unfiltered ones in ChromaDB
  • This is a characteristic of how ChromaDB approaches the problem
  • Production databases handle filtering with different tradeoffs (we'll see this in the next tutorial)
  • Design for the database you're actually using

When to Consider Hybrid Search

Try hybrid search when:

  • Your documents have structured fields where exact matches matter
  • Queries include rare technical terms that might not embed well
  • Testing shows hybrid outperforms pure vector on your test queries
  • You can afford the implementation and maintenance complexity

Stick with pure vector when:

  • Your documents have rich natural language (like our academic abstracts)
  • Vector search already achieves high precision on test queries
  • Simplicity and maintainability matter
  • Your embedding model captures domain terminology well

The decision framework:

  1. Build pure vector search first
  2. Create representative test queries
  3. Measure precision/recall on pure vector
  4. Only if results are inadequate, implement hybrid
  5. Compare hybrid against pure vector on same test queries
  6. Choose the approach with measurably better results

Don't add complexity without evidence it helps.

Start Simple, Measure, Then Optimize

The pattern that emerged across our experiments:

  1. Start with pure vector search: It's simpler to implement and maintain
  2. Build evaluation framework: Create test queries with expected results
  3. Measure performance: Calculate precision, recall, or domain-specific metrics
  4. Identify gaps: Where does pure vector fail?
  5. Add complexity thoughtfully: Try metadata filtering or hybrid search
  6. Re-evaluate: Does the added complexity improve results?
  7. Choose based on data: Not based on what tutorials claim always works

This approach keeps your system maintainable while ensuring each added feature provides real value.

Looking Ahead to Production Databases

Throughout this tutorial, we've explored filtering and hybrid search using ChromaDB. We've seen that:

  • Filtering adds measurable overhead, but remains usable for moderate query volumes
  • ChromaDB excels at local development and prototyping
  • Production systems optimize these patterns differently

ChromaDB is designed to be lightweight, easy to use, and perfect for learning. We've used it to understand vector database concepts without worrying about infrastructure. The patterns we learned (metadata schema design, hybrid scoring, evaluation frameworks) transfer directly to production systems.

In the next tutorial, we'll explore production vector databases:

  • PostgreSQL with pgvector: See how vector search integrates with SQL and existing infrastructure
  • Pinecone: Experience managed services with auto-scaling
  • Qdrant: Explore Rust-backed performance and efficient filtering

You'll discover how different systems approach filtering, when managed services make sense, and how to choose the right database for your needs. The core concepts remain the same, but production systems offer different tradeoffs in performance, features, and operational complexity.

But you needed to understand these concepts with an accessible tool first. ChromaDB gave us that foundation.

Practical Exercises

Before moving on, try these experiments to deepen your understanding:

Exercise 1: Explore Different Queries

Test pure vector vs hybrid search on queries from your own domain:

my_queries = [
    "your domain-specific query here",
    "another query relevant to your work",
    # Add more
]

for query in my_queries:
    print(f"\n{'='*70}")
    print(f"Query: {query}")

    # Try pure vector
    results_vector = search_with_filter(query, n_results=5)

    # Try hybrid
    results_hybrid = hybrid_search(query, alpha=0.5, n_results=5)

    # Compare the categories returned
    # Which approach surfaces more relevant papers?

Exercise 2: Tune Hybrid Alpha

Find the optimal alpha value for a specific query:

query = "your challenging query here"

for alpha in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
    results = hybrid_search(query, alpha=alpha, n_results=5)
    categories = [r['category'] for r in results]

    print(f"Alpha={alpha}: {categories}")
    # Which alpha gives the best results for this query?

Exercise 3: Analyze Filter Combinations

Test different metadata filter combinations:

# Try various filter patterns
filters_to_test = [
    {"category": "cs.LG"},
    {"year": {"$gte": 2024}},
    {"category": "cs.LG", "year": {"$eq": 2025}},
    {"$or": [{"category": "cs.LG"}, {"category": "cs.CV"}]}
]

query = "deep learning applications"

for where_clause in filters_to_test:
    results = search_with_filter(query, where_clause, n_results=5)
    categories = [r['category'] for r in results['metadatas'][0]]
    print(f"Filter {where_clause}: {categories}")

Exercise 4: Build Your Own Evaluation

Create test queries for a different domain:

# If you have expertise in a specific field,
# create queries where you KNOW which papers should match

domain_specific_queries = [
    {
        "text": "your expert query",
        "expected_category": "cs.XX",
        "notes": "why this query should return this category"
    },
    # Add more
]

# Run evaluation and see which strategy performs best
# on YOUR domain-specific queries

Summary: What You've Learned

We've covered a lot of ground in this tutorial. Here's what you can now do:

Core Skills

Metadata Schema Design:

  • Store filterable fields as atomic, consistent values
  • Avoid anti-patterns like JSON blobs and long text in metadata
  • Ensure all documents have the same metadata fields to prevent filtering issues
  • Understand that good schema design enables powerful filtering

Metadata Filtering in ChromaDB:

  • Implement category filters, numeric range filters, and combinations
  • Measure the performance overhead of filtering
  • Make informed decisions about when filtering justifies the cost

BM25 Keyword Search:

  • Build BM25 indexes from document text
  • Understand term frequency and inverse document frequency
  • Recognize when keyword matching complements vector search
  • Know the scale limitations of different BM25 implementations

Hybrid Search Implementation:

  • Normalize different score scales (BM25 and vector similarity)
  • Combine scores using weighted averages
  • Test different alpha values to balance keyword vs semantic search
  • Understand this is a teaching implementation of fundamental concepts

Systematic Evaluation:

  • Create test queries with ground truth expectations
  • Calculate precision metrics to compare strategies
  • Make data-driven decisions rather than assuming one approach always wins

Key Insights

1. Pure vector search performed best on our academic abstracts (84% category precision vs 78% for hybrid/BM25). This challenges the assumption that hybrid always wins. The rich vocabulary in academic papers gave vector search everything it needed.

2. Filtering overhead is real but manageable for moderate query volumes. ChromaDB's approach to filtering creates measurable costs that production databases handle differently.

3. Vocabulary mismatch is the biggest challenge. Users search with general terms ("reducing storage"), papers use jargon ("compression"). This gap affects all search strategies.

4. Query quality matters more than search strategy. Well-crafted queries using domain terminology succeed across approaches. Vague queries struggle everywhere.

5. Start simple, measure, then optimize. Build pure vector first, evaluate systematically, add complexity only when data shows it helps.

What's Next

We now understand how to enhance vector search with metadata filtering and hybrid approaches. We've seen what works, what doesn't, and how to measure the difference.

In the next tutorial, we'll explore production vector databases:

  • Set up PostgreSQL with pgvector and see how vector search integrates with SQL
  • Create a Pinecone index and experience managed vector database services
  • Run Qdrant locally and compare its filtering performance
  • Learn decision frameworks for choosing the right database for your needs

You'll get hands-on experience with multiple production systems and develop the judgment to choose appropriately for different scenarios.

Before moving on, make sure you understand:

  • How to design metadata schemas that enable effective filtering
  • The performance tradeoffs of metadata filtering
  • When hybrid search adds value vs adding complexity
  • How to evaluate search strategies systematically using precision metrics
  • Why pure vector search can outperform hybrid on certain datasets

When you're comfortable with these concepts, you're ready to explore production vector databases and learn when to move beyond ChromaDB.


Key Takeaways:

  • Metadata schema design matters: store filterable fields as atomic, consistent values and ensure all documents have the same fields
  • Filtering adds overhead in ChromaDB (category cheapest, year range most expensive, combined in between)
  • Pure vector achieved 84% category precision vs 78% for hybrid/BM25 on academic abstracts due to rich vocabulary
  • Hybrid search has value in specific scenarios (structured data, rare keywords) but adds complexity
  • Vocabulary mismatch between queries and documents affects every search strategy, not just one
  • Start with pure vector search, measure systematically, add complexity only when data justifies it
  • ChromaDB taught us filtering concepts; production databases optimize differently
  • Evaluation frameworks with test queries matter more than assumptions about "best practices"

Document Chunking Strategies for Vector Databases

6 December 2025 at 02:30

In the previous tutorial, we built a vector database with ChromaDB and ran semantic similarity searches across 5,000 arXiv papers. Our dataset consisted of paper abstracts, each about 200 words long. These abstracts were perfect for embedding as single units: short enough to fit comfortably in an embedding model's context window, yet long enough to capture meaningful semantic content.

But here's the challenge we didn't face yet: What happens when you need to search through full research papers, technical documentation, or long articles? A typical research paper contains 10,000 words. A comprehensive technical guide might have 50,000 words. These documents are far too long to embed as single vectors.

When documents are too long, you need to break them into chunks. This tutorial teaches you how to implement different chunking strategies, evaluate their performance systematically, and understand the tradeoffs between approaches. By the end, you'll know how to make informed decisions about chunking for your own projects.

Why Chunking Still Matters

You might be thinking: "Modern LLMs can handle massive amounts of data. Can't I just embed entire documents?"

There are three reasons why chunking remains essential:

1. Embedding Models Have Context Limits

Many embedding models still have much smaller context limits than modern chat models, and long inputs are also more expensive to embed. Even when a model can technically handle a whole paper, you usually don't want one giant vector: smaller chunks give you better retrieval and lower cost.

2. Retrieval Quality Depends on Granularity

Imagine someone searches for "robotic manipulation techniques." If you embedded an entire 10,000-word paper as a single vector, that search would match the whole paper, even if only one 400-word section actually discusses robotic manipulation. Chunking lets you retrieve the specific relevant section rather than forcing the user to read an entire paper.

3. Semantic Coherence Matters

A single document might cover multiple distinct topics. A paper about machine learning for healthcare might discuss neural network architectures in one section and patient privacy in another. These topics deserve separate embeddings so each can be retrieved independently when relevant.

The question isn't whether to chunk, but how to chunk intelligently. That's what we're going to figure out together.

What You'll Learn

By the end of this tutorial, you'll be able to:

  • Understand why chunking strategies affect retrieval quality
  • Implement two practical chunking approaches: fixed token windows and sentence-based chunking
  • Generate embeddings for chunks and store them in ChromaDB
  • Build a systematic evaluation framework to compare strategies
  • Interpret real performance data showing when each strategy excels
  • Make informed decisions about chunk size and strategy for your projects
  • Recognize that query quality matters more than chunking strategy

Most importantly, you'll learn how to evaluate your chunking decisions using real measurements rather than guesses.

Dataset and Setup

For this tutorial, we're working with 20 full research papers from the same arXiv dataset we used previously. These papers are balanced across five computer science categories:

  • cs.CL (Computational Linguistics): 4 papers
  • cs.CV (Computer Vision): 4 papers
  • cs.DB (Databases): 4 papers
  • cs.LG (Machine Learning): 4 papers
  • cs.SE (Software Engineering): 4 papers

We extracted the full text from these papers, and here's what makes them perfect for learning about chunking:

  • Total corpus: 196,181 words
  • Average paper length: 9,809 words (compared to 200-word abstracts)
  • Range: 2,735 to 20,763 words per paper
  • Content: Real academic papers with typical formatting artifacts

These papers are long enough to require chunking, diverse enough to test strategies across topics, and messy enough to reflect real-world document processing.

Required Files

Download arxiv_metadata_and_papers.zip and extract it to your working directory. This archive contains:

  • arxiv_20papers_metadata.csv - Metadata for the 20 selected papers: title, abstract, authors, published date, category, and arXiv ID
  • arxiv_fulltext_papers/ - Directory containing the 20 extracted text files (one per paper)

You'll also need the same Python environment from the previous tutorial, plus two additional packages:

# If you're starting fresh, create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install required packages
# Python 3.12 with these versions:
# chromadb==1.3.4
# numpy==2.0.2
# pandas==2.2.2
# cohere==5.20.0
# python-dotenv==1.1.1
# nltk==3.9.1
# tiktoken==0.12.0

pip install chromadb numpy pandas cohere python-dotenv nltk tiktoken

Make sure you have a .env file with your Cohere API key:

COHERE_API_KEY=your_key_here

Loading the Papers

Let's load our papers and examine what we're working with:

import pandas as pd
from pathlib import Path

# Load paper metadata
df = pd.read_csv('arxiv_20papers_metadata.csv')
papers_dir = Path('arxiv_fulltext_papers')

print(f"Loaded {len(df)} papers")
print(f"\nPapers per category:")
print(df['category'].value_counts().sort_index())

# Calculate corpus statistics
total_words = 0
word_counts = []

for arxiv_id in df['arxiv_id']:
    with open(papers_dir / f"{arxiv_id}.txt", 'r', encoding='utf-8') as f:
        text = f.read()
        words = len(text.split())
        word_counts.append(words)
        total_words += words

print(f"\nCorpus statistics:")
print(f"  Total words: {total_words:,}")
print(f"  Average words per paper: {sum(word_counts) / len(word_counts):.0f}")
print(f"  Range: {min(word_counts):,} to {max(word_counts):,} words")

# Show a sample paper
sample_id = df['arxiv_id'].iloc[0]
with open(papers_dir / f"{sample_id}.txt", 'r', encoding='utf-8') as f:
    text = f.read()
    print(f"\nSample paper ({sample_id}):")
    print(f"  Title: {df[df['arxiv_id'] == sample_id]['title'].iloc[0]}")
    print(f"  Category: {df[df['arxiv_id'] == sample_id]['category'].iloc[0]}")
    print(f"  Length: {len(text.split()):,} words")
    print(f"  Preview: {text[:300]}...")
Loaded 20 papers

Papers per category:
category
cs.CL    4
cs.CV    4
cs.DB    4
cs.LG    4
cs.SE    4
Name: count, dtype: int64

Corpus statistics:
  Total words: 196,181
  Average words per paper: 9809
  Range: 2,735 to 20,763 words

Sample paper (2511.09708v1):
  Title: Efficient Hyperdimensional Computing with Modular Composite Representations
  Category: cs.LG
  Length: 11,293 words
  Preview: 1
Efficient Hyperdimensional Computing with Modular Composite Representations
Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy L outfi, Mauro Olivieri, and Denis Kleyko
Abstract —The modular composite representation (MCR) is
a computing model that represents information with high-
dimensional...

We have 20 papers averaging nearly 10,000 words each. Compare this to the 200-word abstracts we used previously, and you can see why chunking becomes necessary. A 10,000-word paper cannot be embedded as a single unit without losing the ability to retrieve specific relevant sections.

A Note About Paper Extraction

The papers you're working with were extracted from PDFs using PyPDF2. We've provided the extracted text files so you can focus on chunking strategies rather than PDF processing. The extraction process is straightforward but involves details that aren't central to learning about chunking.

If you're curious about how we downloaded the PDFs and extracted the text, or if you want to extend this work with different papers, you'll find the complete code in the Appendix at the end of this tutorial. For now, just know that we:

  1. Downloaded 20 papers from arXiv (4 from each category)
  2. Extracted text from each PDF using PyPDF2
  3. Saved the extracted text to individual files

The extracted text has minor formatting artifacts like extra spaces or split words, but that's realistic. Real-world document processing always involves some noise. The chunking strategies we'll implement handle this gracefully.

Strategy 1: Fixed Token Windows with Overlap

Let's start with the most common chunking approach in production systems: sliding a fixed-size window across the document with some overlap.

The Concept

Imagine reading a book through a window that shows exactly 500 words at a time. When you finish one window, you slide it forward by 400 words, creating a 100-word overlap with the previous window. This continues until you reach the end of the book.

Fixed token windows work the same way:

  1. Choose a chunk size (we'll use 512 tokens)
  2. Choose an overlap (we'll use 100 tokens, about 20%)
  3. Slide the window through the document
  4. Each window becomes one chunk

Why overlap? Concepts often span boundaries between chunks. If we chunk without overlap, we might split a crucial sentence or paragraph, losing semantic coherence. The 20% overlap ensures that even if something gets split, it appears complete in at least one chunk.

Implementation

Let's implement this strategy. We'll use tiktoken for accurate token counting:

import tiktoken

def chunk_text_fixed_tokens(text, chunk_size=512, overlap=100):
    """
    Chunk text using fixed token windows with overlap.

    Args:
        text: The document text to chunk
        chunk_size: Number of tokens per chunk (default 512)
        overlap: Number of tokens to overlap between chunks (default 100)

    Returns:
        List of text chunks
    """
    # We'll use tiktoken just to approximate token lengths.
    # In production, you'd usually match the tokenizer to your embedding model.
    encoding = tiktoken.get_encoding("cl100k_base")

    # Tokenize the entire text
    tokens = encoding.encode(text)

    chunks = []
    start_idx = 0

    while start_idx < len(tokens):
        # Get chunk_size tokens starting from start_idx
        end_idx = start_idx + chunk_size
        chunk_tokens = tokens[start_idx:end_idx]

        # Decode tokens back to text
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)

        # Move start_idx forward by (chunk_size - overlap)
        # This creates the overlap between consecutive chunks
        start_idx += (chunk_size - overlap)

        # Stop if we've reached the end
        if end_idx >= len(tokens):
            break

    return chunks

# Test on a sample paper
sample_id = df['arxiv_id'].iloc[0]
with open(papers_dir / f"{sample_id}.txt", 'r', encoding='utf-8') as f:
    sample_text = f.read()

sample_chunks = chunk_text_fixed_tokens(sample_text)
print(f"Sample paper chunks: {len(sample_chunks)}")
print(f"First chunk length: {len(sample_chunks[0].split())} words")
print(f"First chunk preview: {sample_chunks[0][:200]}...")
Sample paper chunks: 51
First chunk length: 323 words
First chunk preview: 1 Efficient Hyperdimensional Computing with Modular Composite Representations
Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy L outfi, Mauro Olivieri, and Denis Kleyko
Abstract —The modular co...

Our sample paper produced 51 chunks with the first chunk containing 323 words. The implementation is working as expected.

Processing All Papers

Now let's apply this chunking strategy to all 20 papers:

# Process all papers and collect chunks
all_chunks = []
chunk_metadata = []

for idx, row in df.iterrows():
    arxiv_id = row['arxiv_id']

    # Load paper text
    with open(papers_dir / f"{arxiv_id}.txt", 'r', encoding='utf-8') as f:
        text = f.read()

    # Chunk the paper
    chunks = chunk_text_fixed_tokens(text, chunk_size=512, overlap=100)

    # Store each chunk with metadata
    for chunk_idx, chunk in enumerate(chunks):
        all_chunks.append(chunk)
        chunk_metadata.append({
            'arxiv_id': arxiv_id,
            'title': row['title'],
            'category': row['category'],
            'chunk_index': chunk_idx,
            'total_chunks': len(chunks),
            'chunking_strategy': 'fixed_token_windows'
        })

print(f"Fixed token chunking results:")
print(f"  Total chunks created: {len(all_chunks)}")
print(f"  Average chunks per paper: {len(all_chunks) / len(df):.1f}")
print(f"  Average words per chunk: {sum(len(c.split()) for c in all_chunks) / len(all_chunks):.0f}")

# Check chunk size distribution
chunk_word_counts = [len(chunk.split()) for chunk in all_chunks]
print(f"  Chunk size range: {min(chunk_word_counts)} to {max(chunk_word_counts)} words")
Fixed token chunking results:
  Total chunks created: 914
  Average chunks per paper: 45.7
  Average words per chunk: 266
  Chunk size range: 16 to 438 words

We created 914 chunks from our 20 papers. Each paper produced about 46 chunks, averaging 266 words each. Notice the wide range: 16 to 438 words. This happens because tokens don't map exactly to words, and our stopping condition creates a small final chunk for some papers.

Edge Cases and Real-World Behavior

That 16-word chunk? It's not a bug. It's what happens when the final portion of a paper contains fewer tokens than our chunk size. In production, you might choose to:

  • Merge tiny final chunks with the previous chunk
  • Set a minimum chunk size threshold
  • Accept them as is (they're rare and often don't hurt retrieval)

We're keeping them to show real-world chunking behavior. Perfect uniformity isn't always necessary or beneficial.
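
If you did want to apply the first option and merge tiny trailing chunks into their predecessor, a minimal post-processing sketch might look like this. The merge_tiny_chunks helper and the 50-word threshold are illustrative choices, not part of our pipeline; sample_chunks is the list we created above.

def merge_tiny_chunks(chunks, min_words=50):
    """Merge any chunk shorter than min_words into the previous chunk."""
    merged = []
    for chunk in chunks:
        if merged and len(chunk.split()) < min_words:
            # Append the short chunk's text to the chunk before it
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged

# Example: clean up the fixed-token chunks for one paper
cleaned = merge_tiny_chunks(sample_chunks, min_words=50)
print(f"Before: {len(sample_chunks)} chunks, after: {len(cleaned)} chunks")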

Generating Embeddings

Now we need to embed our 914 chunks using Cohere's API. This is where we need to be careful about rate limits:

from cohere import ClientV2
from dotenv import load_dotenv
import os
import time
import numpy as np

# Load API key
load_dotenv()
cohere_api_key = os.getenv('COHERE_API_KEY')
co = ClientV2(api_key=cohere_api_key)

# Configure batching to respect rate limits
# Cohere trial and free keys have strict rate limits.
# We'll use small batches and short pauses so we don't spam the API.
batch_size = 15  # Small batches to stay well under rate limits
wait_time = 15   # Seconds between batches

print("Generating embeddings for fixed token chunks...")
print(f"Total chunks: {len(all_chunks)}")
print(f"Batch size: {batch_size}")

all_embeddings = []
num_batches = (len(all_chunks) + batch_size - 1) // batch_size

for batch_idx in range(num_batches):
    start_idx = batch_idx * batch_size
    end_idx = min(start_idx + batch_size, len(all_chunks))
    batch = all_chunks[start_idx:end_idx]

    print(f"  Processing batch {batch_idx + 1}/{num_batches} (chunks {start_idx} to {end_idx})...")

    try:
        response = co.embed(
            texts=batch,
            model='embed-v4.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings.extend(response.embeddings.float_)

        # Wait between batches to avoid rate limits
        if batch_idx < num_batches - 1:
            time.sleep(wait_time)

    except Exception as e:
        print(f"  ⚠ Hit rate limit or error: {e}")
        print(f"  Waiting 60 seconds before retry...")
        time.sleep(60)

        # Retry the same batch
        response = co.embed(
            texts=batch,
            model='embed-v4.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings.extend(response.embeddings.float_)

        if batch_idx < num_batches - 1:
            time.sleep(wait_time)

print(f"✓ Generated {len(all_embeddings)} embeddings")

# Convert to numpy array for storage
embeddings_array = np.array(all_embeddings)
print(f"Embeddings shape: {embeddings_array.shape}")
Generating embeddings for fixed token chunks...
Total chunks: 914
Batch size: 15
  Processing batch 1/61 (chunks 0 to 15)...
  Processing batch 2/61 (chunks 15 to 30)...
  ...
  Processing batch 35/61 (chunks 510 to 525)...
  ⚠ Hit rate limit or error: Rate limit exceeded
  Waiting 60 seconds before retry...
  Processing batch 36/61 (chunks 525 to 540)...
  ...
✓ Generated 914 embeddings
Embeddings shape: (914, 1536)

Important note about rate limiting: We hit Cohere's rate limit during embedding generation. This isn't a failure or something to hide. It's a production reality. Our code handled it with a 60-second wait and retry. Good production code always anticipates and handles rate limits gracefully.

Exact limits depend on your plan and may change over time, so always check the provider docs and be ready to handle 429 "rate limit" errors.
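
The wait-and-retry above handles a single rate-limit hit. If you want something more general, exponential backoff is a common pattern: retry the same call with progressively longer pauses and give up after a few attempts. Here's a sketch of that idea wrapped around the same co.embed call we've been using (embed_with_backoff is a hypothetical helper, not part of the code above):

def embed_with_backoff(texts, max_retries=4, base_wait=15):
    """Call co.embed, retrying with exponentially growing waits on failure."""
    for attempt in range(max_retries):
        try:
            response = co.embed(
                texts=texts,
                model='embed-v4.0',
                input_type='search_document',
                embedding_types=['float']
            )
            return response.embeddings.float_
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            wait = base_wait * (2 ** attempt)  # 15s, 30s, 60s, ...
            print(f"  ⚠ Embedding call failed ({e}), retrying in {wait}s...")
            time.sleep(wait)

You could then call embed_with_backoff(batch) inside the batch loop instead of writing the try/except inline each time.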

Storing in ChromaDB

Now let's store our chunks in ChromaDB. Remember that ChromaDB won't let you create a collection that already exists. During development, you'll often regenerate chunks with different parameters, so we'll delete any existing collection first:

import chromadb

# Initialize ChromaDB client
client = chromadb.Client()  # In-memory client

# This in-memory client resets whenever you start a fresh Python session.
# Your collections and data will disappear when the script ends. Later tutorials
# will show you how to persist data across sessions using PersistentClient.

# Delete collection if it exists (useful for experimentation)
try:
    client.delete_collection(name="fixed_token_chunks")
    print("Deleted existing collection")
except:
    pass  # Collection didn't exist, that's fine

# Create fresh collection
collection = client.create_collection(
    name="fixed_token_chunks",
    metadata={
        "description": "20 arXiv papers chunked with fixed token windows",
        "chunking_strategy": "fixed_token_windows",
        "chunk_size": 512,
        "overlap": 100
    }
)

# Prepare data for insertion
ids = [f"chunk_{i}" for i in range(len(all_chunks))]

print(f"Inserting {len(all_chunks)} chunks into ChromaDB...")

collection.add(
    ids=ids,
    embeddings=embeddings_array.tolist(),
    documents=all_chunks,
    metadatas=chunk_metadata
)

print(f"✓ Collection contains {collection.count()} chunks")
Deleted existing collection
Inserting 914 chunks into ChromaDB...
✓ Collection contains 914 chunks

Why delete and recreate? During development, you'll iterate on chunking strategies. Maybe you'll try different chunk sizes or overlap values. ChromaDB requires unique collection names, so the cleanest pattern is to delete the old version before creating the new one. This is standard practice while experimenting.

Our fixed token strategy is now complete: 914 chunks embedded and stored in ChromaDB.

Strategy 2: Sentence-Based Chunking

Let's implement our second approach: chunking based on sentence boundaries rather than arbitrary token positions.

The Concept

Instead of sliding a fixed window through tokens, sentence-based chunking respects the natural structure of language:

  1. Split text into sentences
  2. Group sentences together until reaching a target word count
  3. Never split a sentence in the middle
  4. Create a new chunk when adding the next sentence would exceed the target

This approach prioritizes semantic coherence over size consistency. A chunk might be 400 or 600 words, but it will always contain complete sentences that form a coherent thought.

Why sentence boundaries matter: Splitting mid-sentence destroys meaning. The sentence "Neural networks require careful tuning of hyperparameters to achieve optimal performance" loses critical context if split after "hyperparameters." Sentence-based chunking prevents this.

Implementation

We'll use NLTK for sentence tokenization:

import nltk

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

from nltk.tokenize import sent_tokenize

A quick note: Sentence tokenization on PDF-extracted text isn't always perfect, especially for technical papers with equations, citations, or unusual formatting. It works well enough for this tutorial, but if you experiment with your own papers, you might see occasional issues with sentences getting split or merged incorrectly.

def chunk_text_by_sentences(text, target_words=400, min_words=100):
    """
    Chunk text by grouping sentences until reaching target word count.

    Args:
        text: The document text to chunk
        target_words: Target words per chunk (default 400)
        min_words: Minimum words for a valid chunk (default 100)

    Returns:
        List of text chunks
    """
    # Split text into sentences
    sentences = sent_tokenize(text)

    chunks = []
    current_chunk = []
    current_word_count = 0

    for sentence in sentences:
        sentence_words = len(sentence.split())

        # If adding this sentence would exceed target, save current chunk
        if current_word_count > 0 and current_word_count + sentence_words > target_words:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_word_count = sentence_words
        else:
            current_chunk.append(sentence)
            current_word_count += sentence_words

    # Don't forget the last chunk
    if current_chunk and current_word_count >= min_words:
        chunks.append(' '.join(current_chunk))

    return chunks

# Test on the same sample paper
sample_chunks_sent = chunk_text_by_sentences(sample_text, target_words=400)
print(f"Sample paper chunks (sentence-based): {len(sample_chunks_sent)}")
print(f"First chunk length: {len(sample_chunks_sent[0].split())} words")
print(f"First chunk preview: {sample_chunks_sent[0][:200]}...")
Sample paper chunks (sentence-based): 29
First chunk length: 392 words
First chunk preview: 1
Efficient Hyperdimensional Computing with
Modular Composite Representations
Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy L outfi, Mauro Olivieri, and Denis Kleyko
Abstract —The modular co...

The same paper that produced 51 fixed-token chunks now produces 29 sentence-based chunks. The first chunk is 392 words, close to our 400-word target but not exact.

Processing All Papers

Let's apply sentence-based chunking to all 20 papers:

# Process all papers with sentence-based chunking
all_chunks_sent = []
chunk_metadata_sent = []

for idx, row in df.iterrows():
    arxiv_id = row['arxiv_id']

    # Load paper text
    with open(papers_dir / f"{arxiv_id}.txt", 'r', encoding='utf-8') as f:
        text = f.read()

    # Chunk by sentences
    chunks = chunk_text_by_sentences(text, target_words=400, min_words=100)

    # Store each chunk with metadata
    for chunk_idx, chunk in enumerate(chunks):
        all_chunks_sent.append(chunk)
        chunk_metadata_sent.append({
            'arxiv_id': arxiv_id,
            'title': row['title'],
            'category': row['category'],
            'chunk_index': chunk_idx,
            'total_chunks': len(chunks),
            'chunking_strategy': 'sentence_based'
        })

print(f"Sentence-based chunking results:")
print(f"  Total chunks created: {len(all_chunks_sent)}")
print(f"  Average chunks per paper: {len(all_chunks_sent) / len(df):.1f}")
print(f"  Average words per chunk: {sum(len(c.split()) for c in all_chunks_sent) / len(all_chunks_sent):.0f}")

# Check chunk size distribution
chunk_word_counts_sent = [len(chunk.split()) for chunk in all_chunks_sent]
print(f"  Chunk size range: {min(chunk_word_counts_sent)} to {max(chunk_word_counts_sent)} words")
Sentence-based chunking results:
  Total chunks created: 513
  Average chunks per paper: 25.6
  Average words per chunk: 382
  Chunk size range: 110 to 548 words

Sentence-based chunking produced 513 chunks compared to fixed token's 914. That's about 44% fewer chunks. Each chunk averages 382 words instead of 266. This isn't better or worse, it's a different tradeoff:

Fixed Token (914 chunks):

  • More chunks, smaller sizes
  • Consistent token counts
  • More embeddings to generate and store
  • Finer-grained retrieval granularity

Sentence-Based (513 chunks):

  • Fewer chunks, larger sizes
  • Variable sizes respecting sentences
  • Less storage and fewer embeddings
  • Preserves semantic coherence

Comparing Strategies Side-by-Side

Let's create a comparison table:

import pandas as pd

comparison_df = pd.DataFrame({
    'Metric': ['Total Chunks', 'Chunks per Paper', 'Avg Words per Chunk',
               'Min Words', 'Max Words'],
    'Fixed Token': [914, 45.7, 266, 16, 438],
    'Sentence-Based': [513, 25.6, 382, 110, 548]
})

print(comparison_df.to_string(index=False))
              Metric  Fixed Token  Sentence-Based
        Total Chunks          914             513
   Chunks per Paper          45.7            25.6
Avg Words per Chunk           266             382
           Min Words           16             110
           Max Words          438             548

Sentence-based chunking creates 44% fewer chunks. This means:

  • Lower costs: 44% fewer embeddings to generate
  • Less storage: 44% less data to store and query
  • Larger context: Each chunk contains more complete thoughts
  • Better coherence: Never splits mid-sentence

But remember, this isn't automatically "better." Smaller chunks can enable more precise retrieval. The choice depends on your use case.

Generating Embeddings for Sentence-Based Chunks

We'll use the same embedding process as before, with the same rate limiting pattern:

print("Generating embeddings for sentence-based chunks...")
print(f"Total chunks: {len(all_chunks_sent)}")

all_embeddings_sent = []
num_batches = (len(all_chunks_sent) + batch_size - 1) // batch_size

for batch_idx in range(num_batches):
    start_idx = batch_idx * batch_size
    end_idx = min(start_idx + batch_size, len(all_chunks_sent))
    batch = all_chunks_sent[start_idx:end_idx]

    print(f"  Processing batch {batch_idx + 1}/{num_batches} (chunks {start_idx} to {end_idx})...")

    try:
        response = co.embed(
            texts=batch,
            model='embed-v4.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings_sent.extend(response.embeddings.float_)

        if batch_idx < num_batches - 1:
            time.sleep(wait_time)

    except Exception as e:
        print(f"  ⚠ Hit rate limit: {e}")
        print(f"  Waiting 60 seconds...")
        time.sleep(60)

        response = co.embed(
            texts=batch,
            model='embed-v4.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings_sent.extend(response.embeddings.float_)

        if batch_idx < num_batches - 1:
            time.sleep(wait_time)

print(f"✓ Generated {len(all_embeddings_sent)} embeddings")

embeddings_array_sent = np.array(all_embeddings_sent)
print(f"Embeddings shape: {embeddings_array_sent.shape}")
Generating embeddings for sentence-based chunks...
Total chunks: 513
  Processing batch 1/35 (chunks 0 to 15)...
  ...
✓ Generated 513 embeddings
Embeddings shape: (513, 1536)

With 513 chunks instead of 914, embedding generation is faster and costs less. This is a concrete benefit of the sentence-based approach.

Storing Sentence-Based Chunks in ChromaDB

We'll create a separate collection for sentence-based chunks:

# Delete existing collection if present
try:
    client.delete_collection(name="sentence_chunks")
except:
    pass

# Create sentence-based collection
collection_sent = client.create_collection(
    name="sentence_chunks",
    metadata={
        "description": "20 arXiv papers chunked by sentences",
        "chunking_strategy": "sentence_based",
        "target_words": 400,
        "min_words": 100
    }
)

# Prepare and insert data
ids_sent = [f"chunk_{i}" for i in range(len(all_chunks_sent))]

print(f"Inserting {len(all_chunks_sent)} chunks into ChromaDB...")

collection_sent.add(
    ids=ids_sent,
    embeddings=embeddings_array_sent.tolist(),
    documents=all_chunks_sent,
    metadatas=chunk_metadata_sent
)

print(f"✓ Collection contains {collection_sent.count()} chunks")
Inserting 513 chunks into ChromaDB...
✓ Collection contains 513 chunks

Now we have two collections:

  • fixed_token_chunks with 914 chunks
  • sentence_chunks with 513 chunks

Both contain the same 20 papers, just chunked differently. Now comes the critical question: which strategy actually retrieves relevant content better?

Building an Evaluation Framework

We've created two chunking strategies and embedded all the chunks. But how do we know which one works better? We need a systematic way to measure retrieval quality.

The Evaluation Approach

Our evaluation framework works like this:

  1. Create test queries for specific papers we know should be retrieved
  2. Run each query against both chunking strategies
  3. Check if the expected paper appears in the top results
  4. Compare performance across strategies

The key is having ground truth: knowing which papers should match which queries.

Creating Good Test Queries

Here's something we learned the hard way during development: bad queries make any chunking strategy look bad.

When we first built this evaluation, we tried queries like "reinforcement learning optimization" for a paper that was actually about masked diffusion models. Both chunking strategies "failed" because we gave them an impossible task. The problem wasn't the chunking, it was our poor understanding of the documents.

The fix: Before creating queries, read the paper abstracts. Understand what each paper actually discusses. Then create queries that match real content.

Let's create five test queries based on actual paper content:

# Test queries designed from actual paper content
test_queries = [
    {
        "text": "knowledge editing in language models",
        "expected_paper": "2510.25798v1",  # MemEIC paper (cs.CL)
        "description": "Knowledge editing"
    },
    {
        "text": "masked diffusion models for inference optimization",
        "expected_paper": "2511.04647v2",  # Masked diffusion (cs.LG)
        "description": "Optimal inference schedules"
    },
    {
        "text": "robotic manipulation with spatial representations",
        "expected_paper": "2511.09555v1",  # SpatialActor (cs.CV)
        "description": "Robot manipulation"
    },
    {
        "text": "blockchain ledger technology for database integrity",
        "expected_paper": "2507.13932v1",  # Chain Table (cs.DB)
        "description": "Blockchain database integrity"
    },
    {
        "text": "automated test generation and oracle synthesis",
        "expected_paper": "2510.26423v1",  # Nexus (cs.SE)
        "description": "Multi-agent test oracles"
    }
]

print("Test queries:")
for i, query in enumerate(test_queries, 1):
    print(f"{i}. {query['text']}")
    print(f"   Expected paper: {query['expected_paper']}")
    print()
Test queries:
1. knowledge editing in language models
   Expected paper: 2510.25798v1

2. masked diffusion models for inference optimization
   Expected paper: 2511.04647v2

3. robotic manipulation with spatial representations
   Expected paper: 2511.09555v1

4. blockchain ledger technology for database integrity
   Expected paper: 2507.13932v1

5. automated test generation and oracle synthesis
   Expected paper: 2510.26423v1

These queries are specific enough to target particular papers but general enough to represent realistic search behavior. Each query matches actual content from its expected paper.

Implementing the Evaluation Loop

Now let's run these queries against both chunking strategies and compare results:

def evaluate_chunking_strategy(collection, test_queries, strategy_name):
    """
    Evaluate a chunking strategy using test queries.

    Returns:
        Dictionary with success rate and detailed results
    """
    results = []

    for query_info in test_queries:
        query_text = query_info['text']
        expected_paper = query_info['expected_paper']

        # Embed the query
        response = co.embed(
            texts=[query_text],
            model='embed-v4.0',
            input_type='search_query',
            embedding_types=['float']
        )
        query_embedding = np.array(response.embeddings.float_[0])

        # Search the collection
        search_results = collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=5
        )

        # Extract paper IDs from chunks
        retrieved_papers = []
        for metadata in search_results['metadatas'][0]:
            paper_id = metadata['arxiv_id']
            if paper_id not in retrieved_papers:
                retrieved_papers.append(paper_id)

        # Check if expected paper was found
        found = expected_paper in retrieved_papers
        position = retrieved_papers.index(expected_paper) + 1 if found else None
        best_distance = search_results['distances'][0][0]

        results.append({
            'query': query_text,
            'expected_paper': expected_paper,
            'found': found,
            'position': position,
            'best_distance': best_distance,
            'retrieved_papers': retrieved_papers[:3]  # Top 3 for comparison
        })

    # Calculate success rate
    success_rate = sum(1 for r in results if r['found']) / len(results)

    return {
        'strategy': strategy_name,
        'success_rate': success_rate,
        'results': results
    }

# Evaluate both strategies
print("Evaluating fixed token strategy...")
fixed_token_eval = evaluate_chunking_strategy(
    collection,
    test_queries,
    "Fixed Token Windows"
)

print("Evaluating sentence-based strategy...")
sentence_eval = evaluate_chunking_strategy(
    collection_sent,
    test_queries,
    "Sentence-Based"
)

print("\n" + "="*80)
print("EVALUATION RESULTS")
print("="*80)
Evaluating fixed token strategy...
Evaluating sentence-based strategy...

================================================================================
EVALUATION RESULTS
================================================================================

Comparing Results

Let's examine how each strategy performed:

def print_evaluation_results(eval_results):
    """Print evaluation results in a readable format"""
    strategy = eval_results['strategy']
    success_rate = eval_results['success_rate']
    results = eval_results['results']

    print(f"\n{strategy}")
    print("-" * 80)
    print(f"Success Rate: {len([r for r in results if r['found']])}/{len(results)} queries ({success_rate*100:.0f}%)")
    print()

    for i, result in enumerate(results, 1):
        status = "✓" if result['found'] else "✗"
        position = f"(position #{result['position']})" if result['found'] else ""

        print(f"{i}. {status} {result['query']}")
        print(f"   Expected: {result['expected_paper']}")
        print(f"   Found: {result['found']} {position}")
        print(f"   Best match distance: {result['best_distance']:.4f}")
        print(f"   Top 3 papers: {', '.join(result['retrieved_papers'][:3])}")
        print()

# Print results for both strategies
print_evaluation_results(fixed_token_eval)
print_evaluation_results(sentence_eval)

# Compare directly
print("\n" + "="*80)
print("DIRECT COMPARISON")
print("="*80)
print(f"{'Query':<60} {'Fixed':<10} {'Sentence':<10}")
print("-" * 80)

for i in range(len(test_queries)):
    query = test_queries[i]['text'][:55]
    fixed_pos = fixed_token_eval['results'][i]['position']
    sent_pos = sentence_eval['results'][i]['position']

    fixed_str = f"#{fixed_pos}" if fixed_pos else "Not found"
    sent_str = f"#{sent_pos}" if sent_pos else "Not found"

    print(f"{query:<60} {fixed_str:<10} {sent_str:<10}")
Fixed Token Windows
--------------------------------------------------------------------------------
Success Rate: 5/5 queries (100%)

1. ✓ knowledge editing in language models
   Expected: 2510.25798v1
   Found: True (position #1)
   Best match distance: 0.8865
   Top 3 papers: 2510.25798v1

2. ✓ masked diffusion models for inference optimization
   Expected: 2511.04647v2
   Found: True (position #1)
   Best match distance: 0.9526
   Top 3 papers: 2511.04647v2

3. ✓ robotic manipulation with spatial representations
   Expected: 2511.09555v1
   Found: True (position #1)
   Best match distance: 0.9209
   Top 3 papers: 2511.09555v1

4. ✓ blockchain ledger technology for database integrity
   Expected: 2507.13932v1
   Found: True (position #1)
   Best match distance: 0.6678
   Top 3 papers: 2507.13932v1

5. ✓ automated test generation and oracle synthesis
   Expected: 2510.26423v1
   Found: True (position #1)
   Best match distance: 0.9395
   Top 3 papers: 2510.26423v1

Sentence-Based
--------------------------------------------------------------------------------
Success Rate: 5/5 queries (100%)

1. ✓ knowledge editing in language models
   Expected: 2510.25798v1
   Found: True (position #1)
   Best match distance: 0.8831
   Top 3 papers: 2510.25798v1

2. ✓ masked diffusion models for inference optimization
   Expected: 2511.04647v2
   Found: True (position #1)
   Best match distance: 0.9586
   Top 3 papers: 2511.04647v2, 2511.07930v1

3. ✓ robotic manipulation with spatial representations
   Expected: 2511.09555v1
   Found: True (position #1)
   Best match distance: 0.8863
   Top 3 papers: 2511.09555v1

4. ✓ blockchain ledger technology for database integrity
   Expected: 2507.13932v1
   Found: True (position #1)
   Best match distance: 0.6746
   Top 3 papers: 2507.13932v1

5. ✓ automated test generation and oracle synthesis
   Expected: 2510.26423v1
   Found: True (position #1)
   Best match distance: 0.9320
   Top 3 papers: 2510.26423v1

================================================================================
DIRECT COMPARISON
================================================================================
Query                                                        Fixed      Sentence  
--------------------------------------------------------------------------------
knowledge editing in language models                         #1         #1        
masked diffusion models for inference optimization           #1         #1        
robotic manipulation with spatial representations            #1         #1        
blockchain ledger technology for database integrity          #1         #1        
automated test generation and oracle synthesis               #1         #1       

Understanding the Results

Let's break down what these results tell us:

Key Finding 1: Both Strategies Work Well

Both chunking strategies achieved 100% success rate. Every test query successfully retrieved its expected paper at position #1. This is the most important result.

When you have good queries that match actual document content, chunking strategy matters less than you might think. Both approaches work because they both preserve the semantic meaning of the content, just in slightly different ways.

Key Finding 2: Sentence-Based Has Better Distances

Look at the distance values. ChromaDB uses squared Euclidean distance by default, where lower values indicate higher similarity:

Query 1 (knowledge editing):

  • Fixed token: 0.8865
  • Sentence-based: 0.8831 (better)

Query 3 (robotic manipulation):

  • Fixed token: 0.9209
  • Sentence-based: 0.8863 (better)

Query 5 (automated test generation):

  • Fixed token: 0.9395
  • Sentence-based: 0.9320 (better)

In 3 out of 5 queries, sentence-based chunking produced lower distances, meaning its top chunks were closer semantic matches. This suggests that preserving sentence boundaries helps maintain semantic coherence, resulting in embeddings that better capture document meaning.

Key Finding 3: Low Agreement in Secondary Results

While both strategies found the right paper at #1, look at the papers in positions #2 and #3. They often differ between strategies:

Query 1: Both found the same top 3 papers
Query 2: Top paper matches, but #2 and #3 differ
Query 5: Only the top paper matches; #2 and #3 are completely different

This happens because chunk size affects which papers surface as similar. Neither is "wrong," they just have different perspectives on what else might be relevant. The important thing is they both got the most relevant paper right.
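
If you want to quantify how much the two strategies agree beyond the top hit, you can compare the retrieved paper lists we already stored in each evaluation result. Here's a quick sketch using simple set overlap (Jaccard similarity); it reuses fixed_token_eval, sentence_eval, and test_queries from above.

# Compare the deduplicated top papers from both evaluations, query by query
for i, query_info in enumerate(test_queries):
    fixed_papers = set(fixed_token_eval['results'][i]['retrieved_papers'])
    sent_papers = set(sentence_eval['results'][i]['retrieved_papers'])

    overlap = fixed_papers & sent_papers
    union = fixed_papers | sent_papers
    jaccard = len(overlap) / len(union) if union else 0.0

    print(f"Query {i + 1}: {query_info['text'][:50]}")
    print(f"  Shared papers: {len(overlap)}, Jaccard similarity: {jaccard:.2f}")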

What This Means for Your Projects

So which chunking strategy should you choose? The answer is: it depends on your constraints and priorities.

Choose Fixed Token Windows when:

  • You need consistent chunk sizes for batch processing or downstream tasks
  • Storage isn't a concern and you want finer-grained retrieval
  • Your documents lack clear sentence structure (logs, code, transcripts)
  • You're working with multilingual content where sentence detection is unreliable

Choose Sentence-Based Chunking when:

  • You want to minimize storage costs (44% fewer chunks)
  • Semantic coherence is more important than size consistency
  • Your documents have clear sentence boundaries (articles, papers, documentation)
  • You want better similarity scores (as our results suggest)

The honest truth: Both strategies work well. If you implement either one properly, you'll get good retrieval results. The choice is less about "which is better" and more about which tradeoffs align with your project constraints.

Beyond These Two Strategies

We've implemented two practical chunking strategies, but there's a third approach worth knowing about: structure-aware chunking.

The Concept

Instead of chunking based on arbitrary token boundaries or sentence groupings, structure-aware chunking respects the logical organization of documents:

  • Academic papers have sections: Introduction, Methods, Results, Discussion
  • Technical documentation has headers, code blocks, and lists
  • Web pages have HTML structure: headings, paragraphs, articles
  • Markdown files have explicit hierarchy markers

Structure-aware chunking says: "Don't just group words or sentences. Recognize that this is an Introduction section, and this is a Methods section, and keep them separate."

Simple Implementation Example

Here's what structure-aware chunking might look like for markdown documents:

def chunk_by_markdown_sections(text, min_words=100):
    """
    Chunk text by markdown section headers.
    Each section becomes one chunk (or multiple if very long).
    """
    chunks = []
    current_section = []

    for line in text.split('\n'):
        # Detect section headers (lines starting with #)
        if line.startswith('#'):
            # Save previous section if it exists
            if current_section:
                section_text = '\n'.join(current_section)
                if len(section_text.split()) >= min_words:
                    chunks.append(section_text)
            # Start new section
            current_section = [line]
        else:
            current_section.append(line)

    # Don't forget the last section
    if current_section:
        section_text = '\n'.join(current_section)
        if len(section_text.split()) >= min_words:
            chunks.append(section_text)

    return chunks

This is a deliberately simple implementation, but it illustrates the concept: identify structure markers and use them to define chunk boundaries.

When Structure-Aware Chunking Helps

Structure-aware chunking excels when:

  • Document structure matches query patterns: If users search for "Methods," they probably want the Methods section, not a random 512-token window that happens to include some methods
  • Context boundaries are important: Code with comments, FAQs with Q&A pairs, API documentation with endpoints
  • Sections have distinct topics: A paper discussing both neural networks and patient privacy should keep those sections separate

Why We Didn't Implement It Fully

The evaluation framework we built works for any chunking strategy. You have all the tools needed to implement and test structure-aware chunking yourself:

  1. Write a chunking function that respects document structure
  2. Generate embeddings for your chunks
  3. Store them in ChromaDB
  4. Use our evaluation framework to compare against the strategies we built

The process is identical. The only difference is how you define chunk boundaries.

Hyperparameter Tuning Guidance

We made specific choices for our chunking parameters:

  • Fixed token: 512 tokens with 100-token overlap (20%)
  • Sentence-based: 400-word target with 100-word minimum

Are these optimal? Maybe, maybe not. They're reasonable defaults that worked well for academic papers. But your documents might benefit from different values.

When to Experiment with Different Parameters

Try smaller chunks (256 tokens or 200 words) when:

  • Queries target specific facts rather than broad concepts
  • Precision matters more than context
  • Storage costs aren't a constraint

Try larger chunks (1024 tokens or 600 words) when:

  • Context matters more than precision
  • Your queries are conceptual rather than factual
  • You want to reduce the total number of embeddings

Adjust overlap when:

  • Concepts frequently span chunk boundaries (increase overlap to 30-40%)
  • Storage costs are critical (reduce overlap to 10%)
  • You notice important information getting split

How to Experiment Systematically

The evaluation framework we built makes experimentation straightforward:

  1. Modify chunking parameters
  2. Generate new chunks and embeddings
  3. Store in a new ChromaDB collection
  4. Run your test queries
  5. Compare results

Don't spend hours tuning parameters before you know if chunking helps at all. Start with reasonable defaults (like ours), measure performance, then tune if needed. Most projects never need aggressive parameter tuning.
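
If you do decide to tune, a small loop keeps the comparison honest. The sketch below assumes the chunk_text_fixed_tokens function from earlier, plus hypothetical embed_chunks, build_collection, and evaluate_strategy helpers standing in for the embedding, storage, and evaluation code you already wrote; swap in your own function names:

# Sketch of a systematic parameter sweep.
# chunk_text_fixed_tokens comes from earlier in this tutorial;
# embed_chunks, build_collection, and evaluate_strategy are hypothetical
# placeholders for the embedding, ChromaDB, and evaluation code you built.
chunk_sizes = [256, 512, 1024]
results = {}

for size in chunk_sizes:
    overlap = size // 5  # keep overlap at roughly 20% of the chunk size
    chunks = chunk_text_fixed_tokens(sample_text, chunk_size=size, overlap=overlap)

    embeddings = embed_chunks(chunks)            # generate embeddings for this variant
    name = f"fixed_{size}"
    build_collection(name, chunks, embeddings)   # store in a dedicated collection
    results[name] = evaluate_strategy(name)      # run the same test queries

for name, metrics in results.items():
    print(name, metrics)

One collection per parameter setting keeps the comparison clean: every variant answers the same queries, and you only promote a new setting if the metrics actually improve.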

Practical Exercise

Now it's your turn to experiment. Here are some variations to try:

Option 1: Modify Fixed Token Strategy

Change the chunk size to 256 or 1024 tokens. How does this affect:

  • Total number of chunks?
  • Success rate on test queries?
  • Average similarity distances?
# Try this
chunks_small = chunk_text_fixed_tokens(sample_text, chunk_size=256, overlap=50)
chunks_large = chunk_text_fixed_tokens(sample_text, chunk_size=1024, overlap=200)

Option 2: Modify Sentence-Based Strategy

Adjust the target word count to 200 or 600 words:

# Try this
chunks_small_sent = chunk_text_by_sentences(sample_text, target_words=200)
chunks_large_sent = chunk_text_by_sentences(sample_text, target_words=600)

Option 3: Implement Structure-Aware Chunking

If your papers have clear section markers, try implementing a structure-aware chunker. Use the evaluation framework to compare it against our two strategies.

Reflection Questions

As you experiment, consider:

  • When would you choose fixed token over sentence-based chunking?
  • How would you chunk code documentation? Chat logs? News articles?
  • What chunk size makes sense for a chatbot knowledge base? For legal documents?
  • How does overlap affect retrieval quality in your tests?

Summary and Next Steps

We've built and evaluated two complete chunking strategies for vector databases. Here's what we accomplished:

Core Skills Gained

Implementation:

  • Fixed token window chunking with overlap (914 chunks from 20 papers)
  • Sentence-based chunking respecting linguistic boundaries (513 chunks)
  • Batch processing with rate limit handling
  • ChromaDB collection management for experimentation

Evaluation:

  • Systematic evaluation framework with ground truth queries
  • Measuring success rate and ranking position
  • Comparing strategies quantitatively using real performance data
  • Understanding that query quality matters more than chunking strategy

Key Takeaways

  • No Universal "Best" Chunking Strategy: Both strategies achieved 100% success when given good queries. The choice depends on your constraints (storage, semantic coherence, document structure) rather than one approach being objectively better.
  • Query Quality Matters Most: Bad queries make any chunking strategy look bad. Before evaluating chunking, understand your documents and create queries that match actual content. This lesson applies to all retrieval systems, not just chunking.
  • Sentence-Based Provides Better Distances: In 3 out of 5 test queries, sentence-based chunking had lower distances (higher similarity). Preserving sentence boundaries helps maintain semantic coherence in embeddings.
  • Tradeoffs Are Real: Fixed token creates 1.8x more chunks than sentence-based (914 vs 513). This means more storage and more embeddings to generate (which gets expensive at scale). But you get finer retrieval granularity. Neither is wrong, they optimize for different things. Remember that with overlap, you're paying for every chunk: smaller chunks plus overlap means significantly higher API costs when embedding large document collections.
  • Edge Cases Are Normal: That 16-word chunk from fixed token chunking? The 601-word chunk from sentence-based? These are real-world behaviors, not bugs. Production systems handle imperfect inputs gracefully.

Looking Ahead

We now know how to chunk documents and store them in ChromaDB. But what if we want to enhance our searches? What if we need to filter results by publication year? Search only computer vision papers? Combine semantic similarity with traditional keyword matching?

An upcoming tutorial will teach:

  • Designing metadata schemas for effective filtering
  • Combining vector similarity with metadata constraints
  • Implementing hybrid search (BM25 + vector similarity)
  • Understanding performance tradeoffs of different filtering approaches
  • Making metadata work at scale

Before moving on, make sure you understand:

  • How fixed token and sentence-based chunking differ
  • When to choose each strategy based on project needs
  • How to evaluate chunking systematically with test queries
  • Why query quality matters more than chunking strategy
  • How to handle rate limits and ChromaDB collection management

When you're comfortable with these chunking fundamentals, you're ready to enhance your vector search with metadata and hybrid approaches.


Appendix: Dataset Preparation Code

This appendix provides the complete code we used to prepare the dataset for this tutorial. You don't need to run this code to complete the tutorial, but it's here if you want to:

  • Understand how we selected and downloaded papers from arXiv
  • Extract text from your own PDF files
  • Extend the dataset with different papers or categories

Downloading Papers from arXiv

We selected 20 papers (4 from each category) from the 5,000-paper dataset used in the previous tutorial. Here's how we downloaded the PDFs:

import urllib.request
import pandas as pd
from pathlib import Path
import time

def download_arxiv_pdf(arxiv_id, save_dir):
    """
    Download a paper PDF from arXiv.

    Args:
        arxiv_id: The arXiv ID (e.g., '2510.25798v1')
        save_dir: Directory to save the PDF

    Returns:
        Path to downloaded PDF or None if failed
    """
    # Create save directory if it doesn't exist
    save_dir = Path(save_dir)
    save_dir.mkdir(exist_ok=True)

    # Construct arXiv PDF URL
    # arXiv URLs follow pattern: https://arxiv.org/pdf/{id}.pdf
    pdf_url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
    save_path = save_dir / f"{arxiv_id}.pdf"

    try:
        print(f"Downloading {arxiv_id}...")
        urllib.request.urlretrieve(pdf_url, save_path)
        print(f"  ✓ Saved to {save_path}")
        return save_path
    except Exception as e:
        print(f"  ✗ Failed: {e}")
        return None

# Example: Download papers from our metadata file
df = pd.read_csv('arxiv_20papers_metadata.csv')

pdf_dir = Path('arxiv_pdfs')
for arxiv_id in df['arxiv_id']:
    download_arxiv_pdf(arxiv_id, pdf_dir)
    time.sleep(1)  # Be respectful to arXiv servers

Important: The code above respects arXiv's servers by adding a 1-second delay between downloads. For larger downloads, consider using arXiv's bulk data access or API.
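
If you need metadata for more than a handful of papers, the public arXiv API is a better fit than fetching pages one by one. Here's a minimal sketch; the endpoint and query parameters follow arXiv's documented API, which returns an Atom XML feed, and parsing is left out:

import urllib.parse
import urllib.request

def fetch_arxiv_metadata(search_query="cat:cs.LG", start=0, max_results=10):
    """Query the public arXiv API and return the raw Atom XML response."""
    params = urllib.parse.urlencode({
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    })
    url = f"http://export.arxiv.org/api/query?{params}"
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

# The Atom feed can be parsed with xml.etree.ElementTree or the feedparser package
atom_xml = fetch_arxiv_metadata("cat:cs.DB", max_results=5)
print(atom_xml[:300])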

Extracting Text from PDFs

Once we had the PDFs, we extracted text using PyPDF2:

import PyPDF2
from pathlib import Path

def extract_paper_text(pdf_path):
    """
    Extract text from a PDF file.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Extracted text as a string
    """
    try:
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)

            # Extract text from all pages
            text = ""
            for page in reader.pages:
                # extract_text() can return None for image-only pages
                text += page.extract_text() or ""

            return text
    except Exception as e:
        print(f"Error extracting {pdf_path}: {e}")
        return None

def extract_all_papers(pdf_dir, output_dir):
    """
    Extract text from all PDFs in a directory.

    Args:
        pdf_dir: Directory containing PDF files
        output_dir: Directory to save extracted text files
    """
    pdf_dir = Path(pdf_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    pdf_files = list(pdf_dir.glob('*.pdf'))
    print(f"Found {len(pdf_files)} PDF files")

    success_count = 0
    for pdf_path in pdf_files:
        print(f"Extracting {pdf_path.name}...")

        text = extract_paper_text(pdf_path)

        if text:
            # Save as text file with same name
            output_path = output_dir / f"{pdf_path.stem}.txt"
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(text)

            word_count = len(text.split())
            print(f"  ✓ Extracted {word_count:,} words")
            success_count += 1
        else:
            print(f"  ✗ Failed to extract")

    print(f"\nSuccessfully extracted {success_count}/{len(pdf_files)} papers")

# Example: Extract all papers
extract_all_papers('arxiv_pdfs', 'arxiv_fulltext_papers')

Paper Selection Process

We selected 20 papers from the 5,000-paper dataset used in the previous tutorial. The selection criteria were:

import pandas as pd
import numpy as np

# Load the original 5k dataset
df_5k = pd.read_csv('arxiv_papers_5k.csv')

# Select 4 papers from each category
categories = ['cs.CL', 'cs.CV', 'cs.DB', 'cs.LG', 'cs.SE']
selected_papers = []

np.random.seed(42)  # For reproducibility

for category in categories:
    # Get papers from this category
    category_papers = df_5k[df_5k['category'] == category]

    # Randomly sample 4 papers
    # In practice, we also checked that abstracts were substantial
    sampled = category_papers.sample(n=4, random_state=42)
    selected_papers.append(sampled)

# Combine all selected papers
df_selected = pd.concat(selected_papers, ignore_index=True)

# Save to new metadata file
df_selected.to_csv('arxiv_20papers_metadata.csv', index=False)
print(f"Selected {len(df_selected)} papers:")
print(df_selected['category'].value_counts().sort_index())

Text Quality Considerations

PDF extraction isn't perfect. Common issues include:

Formatting artifacts:

  • Extra spaces between words
  • Line breaks in unexpected places
  • Mathematical symbols rendered as Unicode
  • Headers/footers appearing in body text

Handling these issues:

def clean_extracted_text(text):
    """
    Basic cleaning for extracted PDF text.
    """
    # Remove excessive whitespace
    text = ' '.join(text.split())

    # Remove common artifacts (customize based on your PDFs)
    text = text.replace('ï¬', 'fi')  # Common ligature issue
    text = text.replace('’', "'")   # Apostrophe encoding issue

    return text

# Apply cleaning when extracting
text = extract_paper_text(pdf_path)
if text:
    text = clean_extracted_text(text)
    # Now save cleaned text

We kept cleaning minimal for this tutorial to show realistic extraction results. In production, you might implement more aggressive cleaning depending on your PDF sources.
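
If your PDFs need heavier cleanup, Unicode normalization plus a couple of targeted regular expressions handles most of the remaining artifacts. Here's a sketch of that more aggressive pass; the watermark pattern is a placeholder you would adapt to your own documents:

import re
import unicodedata

def clean_extracted_text_aggressive(text):
    """
    More aggressive cleaning for extracted PDF text.
    """
    # NFKC normalization folds single-glyph ligatures (e.g., the "fi" ligature)
    # into plain letters
    text = unicodedata.normalize("NFKC", text)

    # Drop arXiv-style watermarks if they leaked into the body text
    # (placeholder pattern; adjust to whatever your PDFs actually contain)
    text = re.sub(r"arXiv:\d{4}\.\d{4,5}v\d+\s*\[[^\]]+\]\s*\d{1,2}\s+\w{3}\s+\d{4}", " ", text)

    # Re-join words hyphenated across line breaks: "optimi-\nzation" -> "optimization"
    text = re.sub(r"-\s*\n\s*", "", text)

    # Collapse remaining whitespace
    text = " ".join(text.split())
    return text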

Why We Chose These 20 Papers

The 20 papers in this tutorial were selected to provide:

  1. Diversity across topics: 4 papers each from Machine Learning, Computer Vision, Computational Linguistics, Databases, and Software Engineering
  2. Variety in length: Papers range from 2,735 to 20,763 words
  3. Realistic content: Papers published in 2024-2025 with modern topics
  4. Successful extraction: All 20 papers extracted cleanly with readable text

This diversity ensures that chunking strategies are tested across different writing styles, document lengths, and technical domains rather than being optimized for a single type of content.


You now have all the code needed to prepare your own document chunking datasets. The same pattern works for any PDF collection: download, extract, clean, and chunk.

Key Reminders:

  • Both chunking strategies work well (100% success rate) with proper queries
  • Sentence-based requires 44% less storage (513 vs 914 chunks)
  • Sentence-based shows slightly better similarity distances
  • Fixed token provides more consistent sizes and finer granularity
  • Query quality matters more than chunking strategy
  • Rate limiting is normal production behavior, handle it gracefully
  • ChromaDB collection deletion is standard during experimentation
  • Edge cases (tiny chunks, variable sizes) are expected and usually fine
  • Evaluation frameworks transfer to any chunking strategy
  • Choose based on your constraints (storage, coherence, structure) not on "best practice"

Introduction to Vector Databases using ChromaDB

25 November 2025 at 22:35

In the previous embeddings tutorial series, we built a semantic search system that could find relevant research papers based on meaning rather than keywords. We generated embeddings for 500 arXiv papers, implemented similarity calculations using cosine similarity, and created a search function that returned ranked results.

But here's the problem with that approach: our search worked by comparing the query embedding against every single paper in the dataset. For 500 papers, this brute-force approach was manageable. But what happens when we scale to 5,000 papers? Or 50,000? Or 500,000?

Why Brute-Force Won’t Work

Brute-force similarity search scales linearly. With 5,000 papers, checking all of them takes a noticeable amount of time. Scale to 50,000 papers and queries become painfully slow. At 500,000 papers, each search becomes unusable. That's the reality of brute-force search: query time grows directly with dataset size, and the approach simply doesn't scale to production systems.

Vector databases solve this problem. They use specialized data structures called approximate nearest neighbor (ANN) indexes that can find similar vectors in milliseconds, even with millions of documents. Instead of checking every single embedding, they use clever algorithms to quickly narrow down to the most promising candidates.

This tutorial teaches you how to use ChromaDB, a local vector database perfect for learning and prototyping. We'll load 5,000 arXiv papers with their embeddings, build our first vector database collection, and discover exactly when and why vector databases provide real performance advantages over brute-force NumPy calculations.

What You'll Learn

By the end of this tutorial, you'll be able to:

  • Set up ChromaDB and create your first collection
  • Insert embeddings efficiently using batch patterns
  • Run vector similarity queries that return ranked results
  • Understand HNSW indexing and how it trades accuracy for speed
  • Filter results using metadata (categories, years, authors)
  • Compare performance between NumPy and ChromaDB at different scales
  • Make informed decisions about when to use a vector database

Most importantly, you'll understand the break-even point. We're not going to tell you "vector databases always win." We're going to show you exactly where they provide value and where simpler approaches work just fine.

Understanding the Dataset

For this tutorial series, we'll work with 5,000 research papers from arXiv spanning five computer science categories:

  • cs.LG (Machine Learning): 1,000 papers about neural networks, training algorithms, and ML theory
  • cs.CV (Computer Vision): 1,000 papers about image processing, object detection, and visual recognition
  • cs.CL (Computational Linguistics): 1,000 papers about NLP, language models, and text processing
  • cs.DB (Databases): 1,000 papers about data storage, query optimization, and database systems
  • cs.SE (Software Engineering): 1,000 papers about development practices, testing, and software architecture

These papers come with pre-generated embeddings from Cohere's API using the same approach from the embeddings series. Each paper is represented as a 1536-dimensional vector that captures its semantic meaning. The balanced distribution across categories will help us see how well vector search and metadata filtering work across different topics.

Setting Up Your Environment

First, create a virtual environment (recommended best practice):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Using a virtual environment keeps your project dependencies isolated and prevents conflicts with other Python projects.

Now install the required packages. This tutorial was developed with Python 3.12.12 and the following versions:

# Developed with: Python 3.12.12
# chromadb==1.3.4
# numpy==2.0.2
# pandas==2.2.2
# scikit-learn==1.6.1
# matplotlib==3.10.0
# cohere==5.20.0
# python-dotenv==1.1.1

pip install chromadb numpy pandas scikit-learn matplotlib cohere python-dotenv

ChromaDB is lightweight and runs entirely on your local machine. No servers to configure, no cloud accounts to set up. This makes it perfect for learning and prototyping before moving to production databases.

You'll also need your Cohere API key from the embeddings series. Make sure you have a .env file in your working directory with:

COHERE_API_KEY=your_key_here

Downloading the Dataset

The dataset consists of two files you'll download and place in your working directory:

arxiv_papers_5k.csv download (7.7 MB)
Contains paper metadata: titles, abstracts, authors, publication dates, and categories

embeddings_cohere_5k.npy download (61.4 MB)
Contains 1536-dimensional embedding vectors for all 5,000 papers

Download both files and place them in the same directory as your Python script or notebook.

Let's verify the files loaded correctly:

import numpy as np
import pandas as pd

# Load the metadata
df = pd.read_csv('arxiv_papers_5k.csv')
print(f"Loaded {len(df)} papers")

# Load the embeddings
embeddings = np.load('embeddings_cohere_5k.npy')
print(f"Loaded embeddings with shape: {embeddings.shape}")
print(f"Each paper is represented by a {embeddings.shape[1]}-dimensional vector")

# Verify they match
assert len(df) == len(embeddings), "Mismatch between papers and embeddings!"

# Check the distribution across categories
print(f"\nPapers per category:")
print(df['category'].value_counts().sort_index())

# Look at a sample paper
print(f"\nSample paper:")
print(f"Title: {df['title'].iloc[0]}")
print(f"Category: {df['category'].iloc[0]}")
print(f"Abstract: {df['abstract'].iloc[0][:200]}...")
Loaded 5000 papers
Loaded embeddings with shape: (5000, 1536)
Each paper is represented by a 1536-dimensional vector

Papers per category:
category
cs.CL    1000
cs.CV    1000
cs.DB    1000
cs.LG    1000
cs.SE    1000
Name: count, dtype: int64

Sample paper:
Title: Optimizing Mixture of Block Attention
Category: cs.LG
Abstract: Mixture of Block Attention (MoBA) (Lu et al., 2025) is a promising building block for efficiently processing long contexts in LLMs by enabling queries to sparsely attend to a small subset of key-value...

We now have 5,000 papers with embeddings, perfectly balanced across five categories. Each embedding is 1536 dimensions, and papers and embeddings match exactly.

Your First ChromaDB Collection

A collection in ChromaDB is like a table in a traditional database. It stores embeddings along with associated metadata and provides methods for querying. Let's create our first collection:

import chromadb

# Initialize ChromaDB in-memory client (data only exists while script runs)
client = chromadb.Client()

# Create a collection
collection = client.create_collection(
    name="arxiv_papers",
    metadata={"description": "5000 arXiv papers from computer science"}
)

print(f"Created collection: {collection.name}")
print(f"Collection count: {collection.count()}")
Created collection: arxiv_papers
Collection count: 0

The collection starts empty. Now let's add our embeddings. But here's something critical to know first. Production systems always batch insert operations, for good reasons: memory efficiency, error handling, progress tracking, and the ability to process datasets larger than RAM. ChromaDB reinforces this best practice by enforcing a version-dependent maximum batch size per add() call (approximately 5,461 embeddings in ChromaDB 1.3.4).

Rather than viewing this as a limitation, think of it as ChromaDB nudging you toward production-ready patterns from day one. Let's implement proper batching:

# Prepare the data for ChromaDB
# ChromaDB wants: IDs, embeddings, metadata, and optional documents
ids = [f"paper_{i}" for i in range(len(df))]
metadatas = [
    {
        "title": row['title'],
        "category": row['category'],
        "year": int(str(row['published'])[:4]),  # Store year as integer for filtering
        "authors": row['authors'][:100] if len(row['authors']) <= 100 else row['authors'][:97] + "..."
    }
    for _, row in df.iterrows()
]
documents = df['abstract'].tolist()

# Insert in batches to respect the ~5,461 embedding limit
batch_size = 5000  # Safe batch size well under the limit
print(f"Inserting {len(embeddings)} embeddings in batches of {batch_size}...")

for i in range(0, len(embeddings), batch_size):
    batch_end = min(i + batch_size, len(embeddings))
    print(f"  Batch {i//batch_size + 1}: Adding papers {i} to {batch_end}")

    collection.add(
        ids=ids[i:batch_end],
        embeddings=embeddings[i:batch_end].tolist(),
        metadatas=metadatas[i:batch_end],
        documents=documents[i:batch_end]
    )

print(f"\nCollection now contains {collection.count()} papers")
Inserting 5000 embeddings in batches of 5000...
  Batch 1: Adding papers 0 to 5000

Collection now contains 5000 papers

Since our dataset has exactly 5,000 papers, we can add them all in one batch. But this batching pattern is essential knowledge because:

  1. If we had 8,000 or 10,000 papers, we'd need multiple batches
  2. Production systems always batch operations for efficiency
  3. It's good practice to think in batches from the start

The metadata we're storing (title, category, year, authors) will enable filtered searches later. ChromaDB stores this alongside each embedding, making it instantly available when we query.

Your First Vector Similarity Query

Now comes the exciting part: searching our collection using semantic similarity. But first, we need to address something critical: queries need to use the same embedding model as the documents.

If you mix models—say, querying Cohere embeddings with OpenAI embeddings—you'll either get dimension mismatch errors or, if the dimensions happen to align, results that are... let's call them "creatively unpredictable." The rankings won't reflect actual semantic similarity, making your search effectively random.

Our collection contains Cohere embeddings (1536 dimensions), so we'll use Cohere for queries too. Let's set it up:

from cohere import ClientV2
from dotenv import load_dotenv
import os

# Load your Cohere API key
load_dotenv()
cohere_api_key = os.getenv('COHERE_API_KEY')

if not cohere_api_key:
    raise ValueError(
        "COHERE_API_KEY not found. Make sure you have a .env file with your API key."
    )

co = ClientV2(api_key=cohere_api_key)
print("✓ Cohere API key loaded")

Now let's query for papers about neural network training:

# First, embed the query using Cohere (same model as our documents)
query_text = "neural network training and optimization techniques"

response = co.embed(
    texts=[query_text],
    model='embed-v4.0',
    input_type='search_query',
    embedding_types=['float']
)
query_embedding = np.array(response.embeddings.float_[0])

print(f"Query: '{query_text}'")
print(f"Query embedding shape: {query_embedding.shape}")

# Now search the collection
results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=5
)

# Display the results
print(f"\nTop 5 most similar papers:")
print("=" * 80)

for i in range(len(results['ids'][0])):
    paper_id = results['ids'][0][i]
    distance = results['distances'][0][i]
    metadata = results['metadatas'][0][i]

    print(f"\n{i+1}. {metadata['title']}")
    print(f"   Category: {metadata['category']} | Year: {metadata['year']}")
    print(f"   Distance: {distance:.4f}")
    print(f"   Abstract: {results['documents'][0][i][:150]}...")
Query: 'neural network training and optimization techniques'
Query embedding shape: (1536,)

Top 5 most similar papers:
================================================================================

1. Training Neural Networks at Any Scale
   Category: cs.LG | Year: 2025
   Distance: 1.1162
   Abstract: This article reviews modern optimization methods for training neural networks with an emphasis on efficiency and scale. We present state-of-the-art op...

2. On the Convergence of Overparameterized Problems: Inherent Properties of the Compositional Structure of Neural Networks
   Category: cs.LG | Year: 2025
   Distance: 1.2571
   Abstract: This paper investigates how the compositional structure of neural networks shapes their optimization landscape and training dynamics. We analyze the g...

3. A Distributed Training Architecture For Combinatorial Optimization
   Category: cs.LG | Year: 2025
   Distance: 1.3027
   Abstract: In recent years, graph neural networks (GNNs) have been widely applied in tackling combinatorial optimization problems. However, existing methods stil...

4. Adam symmetry theorem: characterization of the convergence of the stochastic Adam optimizer
   Category: cs.LG | Year: 2025
   Distance: 1.3254
   Abstract: Beside the standard stochastic gradient descent (SGD) method, the Adam optimizer due to Kingma & Ba (2014) is currently probably the best-known optimi...

5. Distribution-Aware Tensor Decomposition for Compression of Convolutional Neural Networks
   Category: cs.CV | Year: 2025
   Distance: 1.3430
   Abstract: Neural networks are widely used for image-related tasks but typically demand considerable computing power. Once a network has been trained, however, i...

Let's talk about what we're seeing here. The results show exactly what we want:

The top 4 papers are all cs.LG (Machine Learning) and directly discuss neural network training, optimization, convergence, and the Adam optimizer. The 5th result is from Computer Vision but discusses neural network compression - still topically relevant.

The distances range from 1.12 to 1.34, which corresponds to cosine similarities of about 0.44 to 0.33. While these aren't the 0.8+ scores you might see in highly specialized single-domain datasets, they represent solid semantic matches for a multi-domain collection.

This is the reality of production vector search: Modern research papers share significant vocabulary overlap across fields. ML terminology appears in computer vision, NLP, databases, and software engineering papers. What we get is a ranking system that consistently surfaces relevant papers at the top, even if absolute similarity scores are moderate.

Why did we manually embed the query? Because our collection contains Cohere embeddings (1536 dimensions), queries must also use Cohere embeddings. If we tried using ChromaDB's default embedding model (all-MiniLM-L6-v2, which produces 384-dimensional vectors), we'd get a dimension mismatch error. Query embeddings and document embeddings must come from the same model. This is a fundamental rule in vector search.

About those distance values: ChromaDB uses squared L2 distance by default. For normalized embeddings (like Cohere's), there's a mathematical relationship: distance ≈ 2(1 - cosine_similarity). So a distance of 1.16 corresponds to a cosine similarity of about 0.42. That might seem low compared to theoretical maximums, but it's typical for real-world multi-domain datasets where vocabulary overlaps significantly.
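
If you find cosine similarity easier to reason about, the conversion is a one-liner. This just applies the relationship above to the distances ChromaDB returned for our query:

# For normalized embeddings: distance ≈ 2 * (1 - cosine_similarity)
for distance in results['distances'][0]:
    cosine_sim = 1 - distance / 2
    print(f"distance {distance:.4f} -> cosine similarity {cosine_sim:.2f}")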

Understanding What Just Happened

Let's break down what occurred behind the scenes:

1. Query Embedding
We explicitly embedded our query text using Cohere's API (the same model that generated our document embeddings). This is crucial because ChromaDB doesn't know or care what embedding model you used. It just stores vectors and calculates distances. If query embeddings don't match document embeddings (same model, same dimensions), search results will be garbage.

2. HNSW Index
ChromaDB uses an algorithm called HNSW (Hierarchical Navigable Small World) to organize embeddings. Think of HNSW as building a multi-level map of the vector space. Instead of checking all 5,000 papers, it uses this map to quickly navigate to the most promising regions.

3. Approximate Search
HNSW is an approximate nearest neighbor algorithm. It doesn't guarantee finding the absolute closest papers, but it finds very close papers extremely quickly. For most applications, this trade-off between perfect accuracy and blazing speed is worth it.

4. Distance Calculation
ChromaDB returns distances between the query and each result. By default, it uses squared Euclidean distance (L2), where lower values mean higher similarity. This is different from the cosine similarity we used in the embeddings series, but both metrics work well for comparing embeddings.

We'll explore HNSW in more depth later, but for now, the key insight is: ChromaDB doesn't check every single paper. It uses a smart index to jump directly to relevant regions of the vector space.

Why We're Storing Metadata

You might have noticed we're storing title, category, year, and authors as metadata alongside each embedding. While we won't use this metadata in this tutorial, we're setting it up now for future tutorials where we'll explore powerful combinations: filtering by metadata (category, year, author) and hybrid search approaches that combine semantic similarity with keyword matching.

For now, just know that ChromaDB stores this metadata efficiently alongside embeddings, and it becomes available in query results without any performance penalty.
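
Just to preview where this is headed, ChromaDB accepts a where clause alongside the query embedding. We won't dig into filtering until the dedicated tutorial, but the syntax looks like this:

# Preview: restrict results to Machine Learning papers published in 2025
filtered = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=5,
    where={"$and": [{"category": "cs.LG"}, {"year": 2025}]}
)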

The Performance Question: When Does ChromaDB Actually Help?

Now let's address the big question: when is ChromaDB actually faster than just using NumPy? Let's run a head-to-head comparison at our 5,000-paper scale.

First, let's implement the NumPy brute-force approach (what we built in the embeddings series):

from sklearn.metrics.pairwise import cosine_similarity
import time

def numpy_search(query_embedding, embeddings, top_k=5):
    """Brute-force similarity search using NumPy"""
    # Calculate cosine similarity between query and all papers
    similarities = cosine_similarity(
        query_embedding.reshape(1, -1),
        embeddings
    )[0]

    # Get top k indices
    top_indices = np.argsort(similarities)[::-1][:top_k]

    return top_indices

# Generate a query embedding (using one of our paper embeddings as a proxy)
query_embedding = embeddings[0]

# Test NumPy approach
start_time = time.time()
for _ in range(100):  # Run 100 queries to get stable timing
    top_indices = numpy_search(query_embedding, embeddings, top_k=5)
numpy_time = (time.time() - start_time) / 100 * 1000  # Convert to milliseconds

print(f"NumPy brute-force search (5000 papers): {numpy_time:.2f} ms per query")
NumPy brute-force search (5000 papers): 110.71 ms per query

Now let's compare with ChromaDB:

# Test ChromaDB approach (query using the embedding directly)
start_time = time.time()
for _ in range(100):
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=5
    )
chromadb_time = (time.time() - start_time) / 100 * 1000

print(f"ChromaDB search (5000 papers): {chromadb_time:.2f} ms per query")
print(f"\nSpeedup: {numpy_time / chromadb_time:.1f}x faster")
ChromaDB search (5000 papers): 2.99 ms per query

Speedup: 37.0x faster

ChromaDB is 37x faster at 5,000 papers. That's the difference between a query taking 111ms versus 3ms. Let's visualize how this scales:

import matplotlib.pyplot as plt

# Scaling data based on actual 5k benchmark
# NumPy scales linearly (110.71ms / 5000 = 0.022142 ms per paper)
# ChromaDB stays flat due to HNSW indexing
dataset_sizes = [500, 1000, 2000, 5000, 8000, 10000]
numpy_times = [11.1, 22.1, 44.3, 110.7, 177.1, 221.4]  # ms (extrapolated from 5k benchmark)
chromadb_times = [3.0, 3.0, 3.0, 3.0, 3.0, 3.0]  # ms (stays constant)

plt.figure(figsize=(10, 6))
plt.plot(dataset_sizes, numpy_times, 'o-', linewidth=2, markersize=8,
         label='NumPy (Brute Force)', color='#E63946')
plt.plot(dataset_sizes, chromadb_times, 's-', linewidth=2, markersize=8,
         label='ChromaDB (HNSW)', color='#2A9D8F')

plt.xlabel('Number of Papers', fontsize=12)
plt.ylabel('Query Time (milliseconds)', fontsize=12)
plt.title('Vector Search Performance: NumPy vs ChromaDB',
          fontsize=14, fontweight='bold', pad=20)
plt.legend(loc='upper left', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate speedup at different scales
print("\nSpeedup at different dataset sizes:")
for size, numpy, chroma in zip(dataset_sizes, numpy_times, chromadb_times):
    speedup = numpy / chroma
    print(f"  {size:5d} papers: {speedup:5.1f}x faster")

Vector Search Performance - Numpy vs ChromaDB

Speedup at different dataset sizes:
    500 papers:   3.7x faster
   1000 papers:   7.4x faster
   2000 papers:  14.8x faster
   5000 papers:  36.9x faster
   8000 papers:  59.0x faster
  10000 papers:  73.8x faster

Note: These benchmarks were measured on a standard development machine with Python 3.12.12. Your actual query times will vary based on hardware, but the relative performance characteristics (flat scaling for ChromaDB vs linear for NumPy) will remain consistent.

This chart tells a clear story:

NumPy's time grows linearly with dataset size. Double the papers, double the query time. That's because brute-force search checks every single embedding.

ChromaDB's time stays flat regardless of dataset size. Whether we have 500 papers or 10,000 papers, queries take about 3ms in our benchmarks. These timings are illustrative (extrapolated from our 5k test on a standard development machine) and will vary based on your hardware and index configuration—but the core insight holds: ChromaDB query time stays relatively flat as your dataset grows, unlike NumPy's linear scaling.

The break-even point is around 1,000-2,000 papers. Below that, the overhead of maintaining an index might not be worth it. Above that, ChromaDB provides clear advantages that grow with scale.

Understanding HNSW: The Magic Behind Fast Queries

We've seen that ChromaDB is dramatically faster than brute-force search, but how does HNSW make this possible? Let's build intuition without diving into complex math.

The Basic Idea: Navigable Small Worlds

Imagine you're in a massive library looking for books similar to one you're holding. A brute-force approach would be to check every single book on every shelf. HNSW is like having a smart navigation system:

Layer 0 (Ground Level): Contains all embeddings, densely connected to nearby neighbors

Layer 1: Contains a subset of embeddings with longer-range connections

Layer 2: Even fewer embeddings with even longer connections

Layer 3: The top layer with just a few embeddings spanning the entire space

When we query, HNSW starts at the top layer (with very few points) and quickly narrows down to promising regions. Then it drops to the next layer and refines. By the time it reaches the ground layer, it's already in the right neighborhood and only needs to check a small fraction of the total embeddings.

The Trade-off: Accuracy vs Speed

HNSW is an approximate algorithm. It doesn't guarantee finding the absolute closest papers, but it finds very close papers very quickly. This trade-off is controlled by parameters:

  • ef_construction: How carefully the index is built (higher = better quality, slower build)
  • ef_search: How thoroughly queries search (higher = better recall, slower queries)
  • M: Number of connections per point (higher = better search, more memory)
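
For reference, these settings can be supplied when a collection is created. The sketch below uses the "hnsw:*" metadata keys that ChromaDB has historically documented; newer releases may prefer a dedicated configuration argument, so treat this as illustrative and check the docs for your version:

# Illustrative only: tuning HNSW via collection metadata at creation time
tuned_collection = client.create_collection(
    name="arxiv_papers_tuned",
    metadata={
        "hnsw:space": "l2",           # distance metric (squared L2 is the default)
        "hnsw:construction_ef": 200,  # higher = better index quality, slower build
        "hnsw:search_ef": 100,        # higher = better recall, slower queries
        "hnsw:M": 16,                 # connections per node; more = better recall, more memory
    },
)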

ChromaDB uses sensible defaults that work well for most applications. Let's verify the quality of approximate search:

# Compare ChromaDB results to exact NumPy results
query_embedding = embeddings[100]

# Get top 10 from NumPy (exact)
numpy_results = numpy_search(query_embedding, embeddings, top_k=10)

# Get top 10 from ChromaDB (approximate)
chromadb_results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=10
)

# Extract paper indices from ChromaDB results (convert "paper_123" to 123)
chromadb_indices = [int(id.split('_')[1]) for id in chromadb_results['ids'][0]]

# Calculate overlap
overlap = len(set(numpy_results) & set(chromadb_indices))

print(f"NumPy top 10 (exact): {numpy_results}")
print(f"ChromaDB top 10 (approximate): {chromadb_indices}")
print(f"\nOverlap: {overlap}/10 papers match")
print(f"Recall@10: {overlap/10*100:.1f}%")
NumPy top 10 (exact): [ 100  984  509 2261 3044  701 1055  830 3410 1311]
ChromaDB top 10 (approximate): [100, 984, 509, 2261, 3044, 701, 1055, 830, 3410, 1311]

Overlap: 10/10 papers match
Recall@10: 100.0%

With default settings, ChromaDB achieves 100% recall on this query, meaning it found exactly the same top 10 papers as the exact brute-force search. This high accuracy is typical for the dataset sizes we're working with. The approximate nature of HNSW becomes more noticeable at massive scales (millions of vectors), but even then, the quality is excellent for most applications.

Memory Usage and Resource Requirements

ChromaDB keeps its HNSW index in memory for fast access. Let's measure how much RAM our 5,000-paper collection uses:

# Estimate memory usage
embedding_memory = embeddings.nbytes / (1024 ** 2)  # Convert to MB

print(f"Memory usage estimates:")
print(f"  Raw embeddings: {embedding_memory:.1f} MB")
print(f"  HNSW index overhead: ~{embedding_memory * 0.5:.1f} MB (estimated)")
print(f"  Total (approximate): ~{embedding_memory * 1.5:.1f} MB")
Memory usage estimates:
  Raw embeddings: 58.6 MB
  HNSW index overhead: ~29.3 MB (estimated)
  Total (approximate): ~87.9 MB

For 5,000 papers with 1536-dimensional embeddings, we're looking at roughly 90-100MB of RAM. This scales linearly: 10,000 papers would be about 180-200MB, 50,000 papers about 900MB-1GB.

This is completely manageable for modern computers. Even a basic laptop can easily handle collections with tens of thousands of documents. The memory requirements only become a concern at massive scales (hundreds of thousands or millions of vectors), which is when you'd move to production vector databases designed for distributed deployment.
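
To ballpark this for your own collection size, here's a small helper using the same assumptions as above (float64 embeddings at 8 bytes per dimension and roughly 50% index overhead):

def estimate_collection_memory_mb(n_vectors, dims=1536, bytes_per_value=8, index_overhead=0.5):
    """Rough RAM estimate: raw embeddings plus an assumed 50% HNSW overhead."""
    raw_mb = n_vectors * dims * bytes_per_value / (1024 ** 2)
    return raw_mb * (1 + index_overhead)

for n in [5_000, 10_000, 50_000]:
    print(f"{n:>6,} papers: ~{estimate_collection_memory_mb(n):.0f} MB")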

Important ChromaDB Behaviors to Know

Before we move on, let's cover some important behaviors that will save you debugging time:

1. In-Memory vs Persistent Storage

Our code uses chromadb.Client(), which creates an in-memory client. The collection only exists while the Python script runs. When the script ends, the data disappears.

For persistent storage, use:

# Persistent storage (data saved to disk)
client = chromadb.PersistentClient(path="./chroma_db")

This saves the collection to a local directory. Next time you run the script, the data will still be there.
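
On later runs you reconnect to that same directory and fetch the existing collection instead of rebuilding it. A minimal sketch, assuming the ./chroma_db path and collection name from above:

# Reconnect to the on-disk database and reuse the existing collection
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="arxiv_papers")
print(f"Collection contains {collection.count()} papers")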

2. Collection Deletion and Index Growth

ChromaDB's HNSW index grows but never shrinks. If we add 5,000 documents then delete 4,000, the index still uses memory for 5,000. The only way to reclaim this space is to create a new collection and re-add the documents we want to keep.

This is a known limitation with HNSW indexes. It's not a bug, it's a fundamental trade-off for the algorithm's speed. Keep this in mind when designing systems that frequently add and remove documents.
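
If you do need to reclaim that space, the workaround is to copy the documents you want to keep into a fresh collection and drop the old one. Here's a sketch of that pattern; for collections larger than the batch limit, page through get() and batch the add() calls as shown earlier:

# Pull everything we want to keep from the oversized collection
keep = collection.get(include=["embeddings", "metadatas", "documents"])

# Rebuild into a fresh collection; its HNSW index starts from scratch
new_collection = client.create_collection(name="arxiv_papers_v2")
new_collection.add(
    ids=keep["ids"],
    embeddings=keep["embeddings"],
    metadatas=keep["metadatas"],
    documents=keep["documents"],
)

# Drop the old collection to release its memory
client.delete_collection(name="arxiv_papers")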

3. Batch Size Limits

Remember the ~5,461 embedding limit per add() call? This isn't ChromaDB being difficult; it's protecting you from overwhelming the system. Always batch your insertions in production systems.

4. Default Embedding Function

When you call collection.query(query_texts=["some text"]), ChromaDB automatically embeds your query using its default model (all-MiniLM-L6-v2). This is convenient but might not match the embeddings you added to the collection.

For production systems, you typically want to:

  • Use the same embedding model for queries and documents
  • Either embed queries yourself and use query_embeddings, or configure ChromaDB's embedding function to match your model
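
One way to handle the second option is a small custom embedding function attached at collection creation. This is a hedged sketch: ChromaDB expects a callable that takes a list of texts and returns a list of vectors, but the exact protocol (and how strictly it's validated) varies between versions, so verify against your release. Note that it embeds with input_type='search_document'; queries arguably want 'search_query', which is one reason embedding queries yourself stays the simpler path:

class CohereEmbedder:
    """
    Minimal callable wrapper so ChromaDB can embed documents with Cohere.
    Sketch only; check your ChromaDB version's embedding-function protocol.
    """

    def __init__(self, cohere_client, model="embed-v4.0"):
        self.cohere_client = cohere_client
        self.model = model

    def __call__(self, input):
        response = self.cohere_client.embed(
            texts=list(input),
            model=self.model,
            input_type="search_document",
            embedding_types=["float"],
        )
        return response.embeddings.float_

# Attach it when creating a collection; query_texts will then be embedded
# with the same model as the documents
auto_collection = client.create_collection(
    name="arxiv_papers_cohere",
    embedding_function=CohereEmbedder(co),
)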

Comparing Results: Query Understanding

Let's run a few different queries to see how well vector search understands intent:

queries = [
    "machine learning model evaluation metrics",
    "how do convolutional neural networks work",
    "SQL query optimization techniques",
    "testing and debugging software systems"
]

for query in queries:
    # Embed the query
    response = co.embed(
        texts=[query],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])

    # Search
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=3
    )

    print(f"\nQuery: '{query}'")
    print("-" * 80)

    categories = [meta['category'] for meta in results['metadatas'][0]]
    titles = [meta['title'] for meta in results['metadatas'][0]]

    for i, (cat, title) in enumerate(zip(categories, titles)):
        print(f"{i+1}. [{cat}] {title[:60]}...")
Query: 'machine learning model evaluation metrics'
--------------------------------------------------------------------------------
1. [cs.CL] Factual and Musical Evaluation Metrics for Music Language Mo...
2. [cs.DB] GeoSQL-Eval: First Evaluation of LLMs on PostGIS-Based NL2Ge...
3. [cs.SE] GeoSQL-Eval: First Evaluation of LLMs on PostGIS-Based NL2Ge...

Query: 'how do convolutional neural networks work'
--------------------------------------------------------------------------------
1. [cs.LG] Covariance Scattering Transforms...
2. [cs.CV] Elements of Active Continuous Learning and Uncertainty Self-...
3. [cs.CV] Convolutional Fully-Connected Capsule Network (CFC-CapsNet):...

Query: 'SQL query optimization techniques'
--------------------------------------------------------------------------------
1. [cs.DB] LLM4Hint: Leveraging Large Language Models for Hint Recommen...
2. [cs.DB] Including Bloom Filters in Bottom-up Optimization...
3. [cs.DB] Query Optimization in the Wild: Realities and Trends...

Query: 'testing and debugging software systems'
--------------------------------------------------------------------------------
1. [cs.SE] Enhancing Software Testing Education: Understanding Where St...
2. [cs.SE] Design and Implementation of Data Acquisition and Analysis S...
3. [cs.SE] Identifying Video Game Debugging Bottlenecks: An Industry Pe...

Notice how the search correctly identifies the topic for each query:

  • ML evaluation → Machine Learning and evaluation-related papers
  • CNNs → Computer Vision papers with one ML paper
  • SQL optimization → Database papers
  • Testing → Software Engineering papers

The system understands semantic meaning. Even when queries use natural language phrasing like "how do X work," it finds topically relevant papers. The rankings are what matter - relevant papers consistently appear at the top, even if absolute similarity scores are moderate.

When ChromaDB Is Enough vs When You Need More

We now have a working vector database running on our laptop. But when is ChromaDB sufficient, and when do you need a production database like Pinecone, Qdrant, or Weaviate?

ChromaDB is perfect for:

  • Learning and prototyping: Get immediate feedback without infrastructure setup
  • Local development: No internet required, no API costs
  • Small to medium datasets: Up to 100,000 documents on a standard laptop
  • Single-machine applications: Desktop tools, local RAG systems, personal assistants
  • Rapid experimentation: Test different embedding models or chunking strategies

Move to production databases when you need:

  • Massive scale: Millions of vectors or high query volume (thousands of QPS)
  • Distributed deployment: Multiple machines, load balancing, high availability
  • Advanced features: Hybrid search, multi-tenancy, access control, backup/restore
  • Production SLAs: Guaranteed uptime, support, monitoring
  • Team collaboration: Multiple developers working with shared data

We'll explore production databases in a later tutorial. For now, ChromaDB gives us everything we need to learn the core concepts and build impressive projects.

Practical Exercise: Exploring Your Own Queries

Before we wrap up, try experimenting with different queries:

# Helper function to make querying easier
def search_papers(query_text, n_results=5):
    """Search papers using semantic similarity"""
    # Embed the query
    response = co.embed(
        texts=[query_text],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])

    # Search
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=n_results
    )

    return results

# Your turn: try these queries and examine the results

# 1. Find papers about a specific topic
results = search_papers("reinforcement learning and robotics")

# 2. Try a different domain
results_cv = search_papers("image segmentation techniques")

# 3. Test with a broad query
results_broad = search_papers("deep learning applications")

# Examine the results for each query
# What patterns do you notice?
# Do the results make sense for each query?

Some things to explore:

  • Query phrasing: Does "neural networks" return different results than "deep learning" or "artificial neural networks"?
  • Specificity: How do very specific queries ("BERT model fine-tuning") compare to broad queries ("natural language processing")?
  • Cross-category topics: What happens when you search for topics that span multiple categories, like "machine learning for databases"?
  • Result quality: Look at the categories and distances - do the most similar papers make sense for each query?

This hands-on exploration will deepen your intuition about how vector search works and what to expect in real applications.

What You've Learned

We've built a complete vector database from scratch and understand the fundamentals:

Core Concepts:

  • Vector databases use ANN indexes (like HNSW) to search large collections efficiently
  • ChromaDB provides a simple, local database perfect for learning and prototyping
  • Collections store embeddings, metadata, and documents together
  • Batch insertion is required due to size limits (around 5,461 embeddings per call)

Performance Characteristics:

  • ChromaDB achieves 37x speedup over NumPy at 5,000 papers
  • Query time stays constant regardless of dataset size (around 3ms)
  • Break-even point is around 1,000-2,000 papers
  • Memory usage is manageable (about 90MB for 5,000 papers)

Practical Skills:

  • Loading pre-generated embeddings and metadata
  • Creating and querying ChromaDB collections
  • Running pure vector similarity searches
  • Comparing approximate vs exact search quality
  • Understanding when to use ChromaDB vs production databases

Critical Insights:

  • HNSW trades perfect accuracy for massive speed gains
  • Default settings achieve excellent recall for typical workloads
  • In-memory storage makes ChromaDB fast but limits persistence
  • Batching is not optional, it's a required pattern
  • Modern multi-domain datasets show moderate similarity scores due to vocabulary overlap
  • Query embeddings and document embeddings must use the same model

What's Next

We now have a vector database running locally with 5,000 papers. Next, we'll tackle a critical challenge: document chunking strategies.

Right now, we're searching entire paper abstracts as single units. But what if we want to search through full papers, documentation, or long articles? We need to break them into chunks, and how we chunk dramatically affects search quality.

The next tutorial will teach you:

  • Why chunking matters even with long-context LLMs in 2025
  • Different chunking strategies (sentence-based, token windows, structure-aware)
  • How to evaluate chunking quality using Recall@k
  • The trade-offs between chunk size, overlap, and search performance
  • Practical implementations you can use in production

Before moving on, make sure you understand these core concepts:

  • How vector similarity search works
  • What HNSW indexing does and why it's fast
  • When ChromaDB provides real advantages over brute-force search
  • How query and document embeddings must match

When you're comfortable with vector search basics, you’re ready to see how to handle real documents that are too long to embed as single units.


Key Takeaways:

  • Vector databases use approximate nearest neighbor algorithms (like HNSW) to search large collections in constant time
  • ChromaDB provides 37x speedup over NumPy brute-force at 5,000 papers, with query times staying flat as datasets grow
  • Batch insertion is mandatory due to the embedding limit per add() call
  • HNSW creates a hierarchical navigation structure that checks only a fraction of embeddings while maintaining high accuracy
  • Default HNSW settings achieve excellent recall for typical datasets
  • Memory usage scales linearly (about 90MB for 5,000 papers with 1536-dimensional embeddings)
  • ChromaDB excels for learning, prototyping, and datasets up to ~100,000 documents on standard hardware
  • The break-even point for vector databases vs brute-force is around 1,000-2,000 documents
  • HNSW indexes grow but never shrink, requiring collection re-creation to reclaim space
  • In-memory storage provides speed but requires persistent client for data that survives script restarts
  • Modern multi-domain datasets show moderate similarity scores (0.3-0.5 cosine) due to vocabulary overlap across fields
  • Query embeddings and document embeddings must use the same model and dimensionality

Automating Amazon Book Data Pipelines with Apache Airflow and MySQL

25 November 2025 at 21:51

In our previous tutorial, we simulated market data pipelines to explore the full ETL lifecycle in Apache Airflow, from extracting and transforming data to loading it into a local database. Along the way, we integrated Git-based DAG management, automated validation through GitHub Actions, and synchronized our Airflow environment using git-sync, creating a workflow that closely mirrored real production setups.

Now, we’re taking things a step further by moving from simulated data to a real-world use case.

Imagine you’ve been given the task of finding the best engineering books on Amazon, extracting their titles, authors, prices, and ratings, and organizing all that information into a clean, structured table for analysis. Since Amazon’s listings change frequently, we need an automated workflow to fetch the latest data on a regular schedule. By orchestrating this extraction with Airflow, our pipeline can run every 24 hours, ensuring the dataset always reflects the most recent updates.

In this tutorial, you’ll take your Airflow skills beyond simulation and build a real-world ETL pipeline that extracts engineering book data from Amazon, transforms it with Python, and loads it into MySQL for structured analysis. You’ll orchestrate the process using Airflow’s TaskFlow API and custom operators for clean, modular design, while integrating GitHub Actions for version-controlled CI/CD deployment. To complete the setup, you’ll implement logging and monitoring so every stage, from extraction to loading, is tracked with full visibility and accuracy.

By the end, you’ll have a production-like pattern, an Airflow workflow that not only automates the extraction of Amazon book data but also demonstrates best practices in reliability, maintainability, and DevOps-driven orchestration.

Setting Up the Environment and Designing the ETL Pipeline

Setting Up the Environment

We have seen in our previous tutorial how running Airflow inside Docker provides a clean, portable, and reproducible setup for development. For this project, we’ll follow the same approach.

We’ve prepared a GitHub repository to help you get your environment up and running quickly. It includes the starter files you’ll need for this tutorial.

Begin by cloning the repository:

git clone git@github.com:dataquestio/tutorials.git

Then navigate to the Airflow tutorial directory:

cd airflow-docker-tutorial

Inside, you’ll find a structure similar to this:

airflow-docker-tutorial/
├── part-one/
├── part-two/
├── amazon-etl/
├── docker-compose.yaml
└── README.md

The part-one/ and part-two/ folders contain the complete reference files for our previous tutorials, while the amazon-etl/ folder is the workspace for this project; it contains all the DAGs, helper scripts, and configuration files we’ll build in this lesson. You don’t need to modify anything in the reference folders; they’re just there for review and comparison.

Your starting point is the docker-compose.yaml file, which defines the Airflow services. We’ve already seen how this file manages the Airflow api-server, scheduler, and supporting components.

Next, Airflow expects certain directories to exist before launching. Create them inside the same directory as your docker-compose.yaml file:

mkdir -p ./dags ./logs ./plugins ./config

Now, add a .env file in the same directory with the following line:

AIRFLOW_UID=50000

This ensures consistent file ownership between your host system and Docker containers.

For Linux users, you can generate this automatically with:

echo -e "AIRFLOW_UID=$(id -u)" > .env

Finally, initialize your Airflow metadata database:

docker compose up airflow-init

Make sure your Docker Desktop is already running before executing the command. Once initialization completes, bring up your Airflow environment:

docker compose up -d

If everything is set up correctly, you’ll see all Airflow containers running, including the webserver, which exposes the Airflow UI at http://localhost:8080. Open it in your browser and confirm that your environment is running smoothly by logging in with airflow as both the username and the password.

Designing the ETL Pipeline

Now that your environment is ready, it’s time to plan the structure of your ETL workflow before writing any code. Good data engineering practice begins with design, not implementation.

The first step is understanding the data flow, specifically, your source and destination.

In our case:

  • The source is Amazon’s public listings for data engineering books.
  • The destination is a MySQL database where we’ll store the extracted and transformed data for easy access and analysis.

The workflow will consist of three main stages:

  1. Extract – Scrape book information (titles, authors, prices, ratings) from Amazon pages.
  2. Transform – Clean and format the raw text into structured, numeric fields using Python and pandas.
  3. Load – Insert the processed data into a MySQL table for further use.

At a high level, our data pipeline will look like this:

Amazon Website (Source)
       ↓
   Extract Task
       ↓
   Transform Task
       ↓
   Load Task
       ↓
   MySQL Database (Destination)

To prepare data for loading into MySQL, we’ll need to convert scraped HTML into a tabular structure. The transformation step will include tasks like normalizing currency symbols, parsing ratings into numeric values, and ensuring all records are unique before loading.
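
To make that concrete, here's a rough sketch of the transformation step in pandas. The column names (title, author, price, rating) are assumptions for illustration; your scraper's output will dictate the real ones:

import pandas as pd

def transform_books(raw_records):
    """Clean raw scraped strings into structured, numeric fields."""
    df = pd.DataFrame(raw_records)

    # "$49.99" -> 49.99 (strip currency symbols and thousands separators)
    df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)

    # "4.5 out of 5 stars" -> 4.5
    df["rating"] = df["rating"].str.extract(r"([\d.]+)", expand=False).astype(float)

    # Drop duplicate listings before loading into MySQL
    df = df.drop_duplicates(subset=["title", "author"])
    return df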

When mapped into Airflow, these steps form a Directed Acyclic Graph (DAG) — a visual and logical representation of our workflow. Each box in the DAG represents a task (extract, transform, or load), and the arrows define their dependencies and execution order.

Here’s a conceptual view of the workflow:

[extract_amazon_books] → [transform_amazon_books] → [load_amazon_books]

Finally, we enhance our setup by adding git-sync for automatic DAG updates and GitHub Actions for CI validation, ensuring every change in GitHub reflects instantly in Airflow while your workflows are continuously checked for issues. By combining git-sync, CI checks, and Airflow’s built-in alerting (email or Slack), the entire Amazon ETL pipeline becomes stable, monitored, and much closer to a production-like orchestration system.

Building an ETL Pipeline with Airflow

Setting Up Your DAG File

Let’s start by creating the foundation of our workflow.

Before making changes, make sure to shut down any running containers to avoid conflicts:

docker compose down

Ensure that you disable the example DAGs and switch to LocalExecutor, as we did in our previous tutorials.

Now, open your airflow-docker-tutorial project folder and, inside the dags/ directory, create a new file named:

amazon_etl_dag.py

Every .py file inside this directory becomes a workflow that Airflow automatically detects and manages, no manual registration required. Airflow continuously scans the folder for new DAGs and dynamically loads them.

At the top of your file, import the core libraries needed for this project:

from airflow.decorators import dag, task
from datetime import datetime, timedelta
import pandas as pd
import random
import os
import time
import requests
from bs4 import BeautifulSoup

Let’s quickly review what each import does:

  • dag and task come from Airflow’s TaskFlow API, allowing us to define Python functions that become managed tasks; Airflow handles execution, dependencies, and retries automatically.
  • datetime and timedelta handle scheduling logic, such as when the DAG should start and how often it should run.
  • pandas, random, BeautifulSoup, requests, and os are the Python libraries we’ll use to fetch, process, and manage our data within the ETL steps.

This minimal setup is all you need to start orchestrating real-world data workflows.

Defining the DAG Structure

With the imports ready, let’s define the core of our workflow, the DAG configuration.

This determines when, how often, and under what conditions your pipeline runs.

default_args = {
    "owner": "Data Engineering Team",
    "retries": 3,
    "retry_delay": timedelta(minutes=2),
}

@dag(
    dag_id="amazon_books_etl_pipeline",
    description="Automated ETL pipeline to fetch and load Amazon Data Engineering book data into MySQL",
    schedule="@daily",
    start_date=datetime(2025, 11, 13),
    catchup=False,
    default_args=default_args,
    tags=["amazon", "etl", "airflow"],
)
def amazon_books_etl():
    ...

dag = amazon_books_etl()

Let’s break down what’s happening here:

  • default_args define reusable settings for all tasks; in this case, Airflow will automatically retry any failed task up to three times, with a two-minute delay between attempts. This is especially useful since our workflow depends on live web requests to Amazon, which can occasionally fail due to rate limits or connectivity issues.
  • The @dag decorator marks this function as an Airflow DAG. Everything inside amazon_books_etl() will form part of one cohesive workflow.
  • schedule="@daily" ensures our DAG runs once every 24 hours, keeping the dataset fresh.
  • start_date defines when Airflow starts scheduling runs.
  • catchup=False prevents Airflow from trying to backfill missed runs.
  • tags categorize the DAG in the Airflow UI for easier filtering.

Finally, the line:

dag = amazon_books_etl()

instantiates the workflow, making it visible and schedulable within Airflow.

Data Extraction with Airflow

With our DAG structure in place, the first step in our ETL pipeline is data extraction — pulling book data directly from Amazon’s live listings.

  • Note that this is for educational purposes: Amazon frequently updates its page structure and uses anti-bot protections, which can break scrapers without warning. In a real production project, we’d rely on official APIs or other stable data sources instead, since they provide consistent data across runs and keep long-term maintenance low.

When you search for something like “data engineering books” on Amazon, the search results page displays listings inside structured HTML containers such as:

<div data-component-type="s-impression-counter">

Each of these containers holds nested elements for the book title, author, price, and rating—information we can reliably parse using BeautifulSoup.

For example, when inspecting any of the listed books, we see a consistent HTML structure that guides how our scraper should behave:

Data Extraction with Airflow (Amazon Book Data Project)

Because Amazon paginates its results, our extraction logic systematically iterates through the first 10 pages, returning approximately 30 to 50 books. We intentionally limit the extraction to this number to keep the workload manageable while still capturing the most relevant items.

This approach ensures we gather the most visible and actively featured books—those trending or recently updated—rather than scraping random or deeply buried results. By looping through these pages, we create a dataset that is both fresh and representative, striking the right balance between completeness and efficiency.

Even though Amazon updates its listings frequently, our Airflow DAG runs every 24 hours, ensuring the extracted data always reflects the latest marketplace activity.

Here’s the Python logic behind the extraction step:

@task
def get_amazon_data_books(num_books=50, max_pages=10, ti=None):
    """
    Extracts Amazon Data Engineering book details such as Title, Author, Price, and Rating. Saves the raw extracted data locally and pushes it to XCom for downstream tasks.
    """
    headers = {
        "Referer": 'https://www.amazon.com/',
        "Sec-Ch-Ua": "Not_A Brand",
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": "macOS",
        'User-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
    }

    base_url = "https://www.amazon.com/s?k=data+engineering+books"
    books, seen_titles = [], set()
    page = 1  # start with page 1

    while page <= max_pages and len(books) < num_books:
        url = f"{base_url}&page={page}"

        try:
            response = requests.get(url, headers=headers, timeout=15)
        except requests.RequestException as e:
            print(f" Request failed: {e}")
            break

        if response.status_code != 200:
            print(f"Failed to retrieve page {page} (status {response.status_code})")
            break

        soup = BeautifulSoup(response.text, "html.parser")
        book_containers = soup.find_all("div", {"data-component-type": "s-impression-counter"})

        for book in book_containers:
            title_tag = book.select_one("h2 span")
            author_tag = book.select_one("a.a-size-base.a-link-normal")
            price_tag = book.select_one("span.a-price > span.a-offscreen")
            rating_tag = book.select_one("span.a-icon-alt")

            if title_tag and price_tag:
                title = title_tag.text.strip()
                if title not in seen_titles:
                    seen_titles.add(title)
                    books.append({
                        "Title": title,
                        "Author": author_tag.text.strip() if author_tag else "N/A",
                        "Price": price_tag.text.strip(),
                        "Rating": rating_tag.text.strip() if rating_tag else "N/A"
                    })
        if len(books) >= num_books:
            break

        page += 1
        time.sleep(random.uniform(1.5, 3.0))

    # Convert to DataFrame
    df = pd.DataFrame(books)
    df.drop_duplicates(subset="Title", inplace=True)

    # Create the directory for raw data.
    # Note: this works here because everything runs in one container.
    # In real deployments, you'd use shared storage (e.g., S3/GCS) instead.
    os.makedirs("/opt/airflow/tmp", exist_ok=True)
    raw_path = "/opt/airflow/tmp/amazon_books_raw.csv"

    # Save the extracted dataset
    df.to_csv(raw_path, index=False)
    print(f"[EXTRACT] Amazon book data successfully saved at {raw_path}")

    # Build a JSON summary of the extracted data and push it to XCom
    import json

    summary = {
        "rows": len(df),
        "columns": list(df.columns),
        "sample": df.head(3).to_dict('records'),
    }

    # Clean up non-breaking spaces and format neatly
    formatted_summary = json.dumps(summary, indent=2, ensure_ascii=False).replace('\xa0', ' ')

    if ti:
        ti.xcom_push(key='df_summary', value=formatted_summary)
        print("[XCOM] Pushed JSON summary to XCom.")

    # Optional preview
    print("\nPreview of Extracted Data:")
    print(df.head(5).to_string(index=False))

    return raw_path

This approach makes your pipeline deterministic and meaningful; it doesn’t rely on arbitrary randomness but on a fixed, observable window of recent and visible listings.

You can run this one task and view the logs.

Data Extraction with Airflow (Amazon Book Data Project) (2)

By calling this function inside our amazon_books_etl() function, we create a task that Airflow will manage as a single unit of work:

def amazon_books_etl():
    ...
    # Task dependencies
    raw_file = get_amazon_data_books()

dag = amazon_books_etl()

You should also notice that we are passing a few pieces of information about the extracted data to XCom: the number of rows in our DataFrame, the column names, and the first three rows as a sample. This will help us understand the transformations (including cleaning) our data needs.
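If you need that summary in a later task, you can pull it back out of XCom. Here’s a minimal sketch (the inspect_summary task is hypothetical and not part of our final pipeline):

@task
def inspect_summary(ti=None):
    """Hypothetical helper: pull the JSON summary pushed by the extract task."""
    import json

    raw_summary = ti.xcom_pull(
        task_ids="get_amazon_data_books",  # TaskFlow uses the function name as the task id
        key="df_summary",
    )
    summary = json.loads(raw_summary)
    print(f"Extracted {summary['rows']} rows with columns: {summary['columns']}")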

Data Transformation with Airflow

Once our raw Amazon book data is extracted and stored, the next step is data transformation — converting the messy, unstructured output into a clean, analysis-ready format.

If we inspect the sample data passed through XCom, it looks something like this:

{
  "rows": 42,
  "sample": [
    {
      "Title": "Data Engineering Foundations: Core Techniques for Data Analysis with Pandas, NumPy, and Scikit-Learn (Advanced Data Analysis Series Book 1)",
      "Author": "Kindle Edition",
      "Price": "$44.90",
      "Rating": "4.2 out of 5 stars"
    }
  ],
  "columns": ["Title", "Author", "Price", "Rating"]
}

We can already notice a few data quality issues:

  • The Price column includes the dollar sign ($) — we’ll remove it and convert prices to numeric values.
  • The Rating column contains text like "4.2 out of 5 stars" — we’ll extract just the numeric part (4.2).
  • The Price column name isn’t very clear — we’ll rename it to Price($) for consistency.

Here’s the updated transformation task:

@task
def transform_amazon_books(raw_file: str):
    """
    Standardizes the extracted Amazon book dataset for analysis.
    - Converts price strings (e.g., '$45.99') into numeric values
    - Extracts numeric ratings (e.g., '4.2' from '4.2 out of 5 stars')
    - Renames 'Price' to 'Price($)'
    - Handles missing or unexpected field formats safely
    - Performs light validation after numeric conversion
    """
    if not os.path.exists(raw_file):
        raise FileNotFoundError(f" Raw file not found: {raw_file}")

    df = pd.read_csv(raw_file)
    print(f"[TRANSFORM] Loaded {len(df)} records from raw dataset.")

    # --- Price cleaning (defensive) ---
    if "Price" in df.columns:
        df["Price($)"] = (
            df["Price"]
            .astype(str)                                   # prevents .str on NaN
            .str.replace("$", "", regex=False)
            .str.replace(",", "", regex=False)
            .str.extract(r"(\d+\.?\d*)")[0]                # safely extract numbers
        )
        df["Price($)"] = pd.to_numeric(df["Price($)"], errors="coerce")
    else:
        print("[TRANSFORM] Missing 'Price' column — filling with None.")
        df["Price($)"] = None

    # --- Rating cleaning (defensive) ---
    if "Rating" in df.columns:
        df["Rating"] = (
            df["Rating"]
            .astype(str)
            .str.extract(r"(\d+\.?\d*)")[0]
        )
        df["Rating"] = pd.to_numeric(df["Rating"], errors="coerce")
    else:
        print("[TRANSFORM] Missing 'Rating' column — filling with None.")
        df["Rating"] = None

    # --- Validation: drop rows where BOTH fields failed (optional) ---
    df.dropna(subset=["Price($)", "Rating"], how="all", inplace=True)

    # --- Drop original Price column (if present) ---
    if "Price" in df.columns:
        df.drop(columns=["Price"], inplace=True)

    # --- Save cleaned dataset ---
    transformed_path = raw_file.replace("raw", "transformed")
    df.to_csv(transformed_path, index=False)

    print(f"[TRANSFORM] Cleaned data saved at {transformed_path}")
    print(f"[TRANSFORM] {len(df)} valid records after standardization.")
    print(f"[TRANSFORM] Sample cleaned data:\n{df.head(5).to_string(index=False)}")

    return transformed_path

This transformation ensures that by the time our data reaches the loading stage (MySQL), it’s tidy, consistent, and ready for querying, for instance, to quickly find the highest-rated or most affordable data engineering books.

Data Transformation with Airflow (Amazon Book Data Project)

  • Note: Although we’re working with real Amazon data, this transformation logic is intentionally simplified for the purposes of the tutorial. Amazon’s page structure can change, and real-world pipelines typically include more robust safeguards, such as retries, stronger validation rules, fallback parsing strategies, and alerting, so that temporary layout changes or missing fields don’t break the entire workflow. The defensive checks added here help keep the DAG stable, but a production deployment would apply additional hardening to handle a broader range of real-world variations.

Data Loading with Airflow

The final step in our ETL pipeline is data loading, moving our transformed dataset into a structured database where it can be queried, analyzed, and visualized.

At this stage, we’ve already extracted live book listings from Amazon and transformed them into clean, numeric-friendly records. Now we’ll store this data in a MySQL database, ensuring that every 24 hours our dataset refreshes with the latest available titles, authors, prices, and ratings.

We’ll use a local MySQL instance for simplicity, but the same logic applies to cloud-hosted databases like Amazon RDS, Google Cloud SQL, or Azure MySQL.

Before proceeding, make sure MySQL is installed and running locally, with a database and user configured as:

CREATE DATABASE airflow_db;
CREATE USER 'airflow'@'%' IDENTIFIED BY 'airflow';
GRANT ALL PRIVILEGES ON airflow_db.* TO 'airflow'@'%';
FLUSH PRIVILEGES;

When Airflow runs in Docker and MySQL runs locally on your Linux host, the containers can’t automatically reach localhost.

To fix this, you need to make your local machine reachable from inside Docker.

Open your docker-compose.yaml file and add the following line under the x-airflow-common service definition:

extra_hosts:
  - "host.docker.internal:host-gateway"

Once configured, we can define our load task:

@task
def load_to_mysql(transformed_file: str):
    """
    Loads the transformed Amazon book dataset into a MySQL table for analysis.
    Uses a truncate-and-load pattern to keep the table idempotent.
    """
    import mysql.connector
    import os
    import numpy as np

    # Note:
    # For production-ready projects, database credentials should never be hard-coded.
    # Airflow provides a built-in Connection system and can also integrate with
    # secret backends (AWS Secrets Manager, Vault, etc.).
    #
    # Example:
    #     hook = MySqlHook(mysql_conn_id="my_mysql_conn")
    #     conn = hook.get_conn()
    #
    # For this demo, we keep a simple local config:
    db_config = {
        "host": "host.docker.internal",
        "user": "airflow",
        "password": "airflow",
        "database": "airflow_db",
        "port": 3306
    }

    df = pd.read_csv(transformed_file)
    table_name = "amazon_books_data"

    # Replace NaN with None (important for MySQL compatibility)
    df = df.replace({np.nan: None})

    conn = mysql.connector.connect(**db_config)
    cursor = conn.cursor()

    # Create table if it does not exist
    cursor.execute(f"""
        CREATE TABLE IF NOT EXISTS {table_name} (
            Title VARCHAR(512),
            Author VARCHAR(255),
            `Price($)` DECIMAL(10,2),
            Rating DECIMAL(4,2)
        );
    """)

    # Truncate table for idempotency
    cursor.execute(f"TRUNCATE TABLE {table_name};")

    # Insert rows
    insert_query = f"""
        INSERT INTO {table_name} (Title, Author, `Price($)`, Rating)
        VALUES (%s, %s, %s, %s)
    """

    for _, row in df.iterrows():
        try:
            cursor.execute(
                insert_query,
                (row["Title"], row["Author"], row["Price($)"], row["Rating"])
            )
        except Exception as e:
            # For demo purposes we simply skip bad rows.
            # In real pipelines, you'd log or send them to a dead-letter table.
            print(f"[LOAD] Skipped corrupted row due to error: {e}")

    conn.commit()
    conn.close()

    print(f"[LOAD] Table '{table_name}' refreshed with {len(df)} rows.")

This task reads the cleaned dataset and inserts it into a table named amazon_books_data inside your airflow_db database.
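As the comments in the task point out, a production pipeline would keep credentials in an Airflow Connection and use the MySQL provider’s hook rather than a hard-coded config dict. A minimal sketch, assuming you’ve installed apache-airflow-providers-mysql and created a connection named my_mysql_conn in the Airflow UI (Admin → Connections):

from airflow.providers.mysql.hooks.mysql import MySqlHook

# Inside load_to_mysql, the db_config dict and mysql.connector.connect(**db_config)
# could be swapped for an Airflow-managed connection:
hook = MySqlHook(mysql_conn_id="my_mysql_conn")  # credentials live in Airflow, not in code
conn = hook.get_conn()
cursor = conn.cursor()
# ...the CREATE TABLE / TRUNCATE / INSERT logic that follows stays exactly the same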

Data Loading with Airflow (Amazon Book Data Project) (3)

You will also notice in the code above that we use a TRUNCATE statement before inserting new rows. This turns the loading step into a full-refresh pattern, making the task idempotent. In other words, running the DAG multiple times produces the same final table instead of accumulating duplicate snapshots from previous days. This is ideal for scraped datasets like Amazon listings, where we want each day’s table to represent only the latest available snapshot.

At this stage, your workflow is fully defined and properly instantiated in Airflow, with each task connected in the correct order to form a complete ETL pipeline. Your DAG structure should now look like this:

def amazon_books_etl():
    # Task dependencies
    raw_file = get_amazon_data_books()
    transformed_file = transform_amazon_books(raw_file)
    load_to_mysql(transformed_file)

dag = amazon_books_etl()

Data Loading with Airflow (Amazon Book Data Project)

After your DAG runs (use docker compose up -d), you can verify the results inside MySQL:

USE airflow_db;
SHOW TABLES;
SELECT * FROM amazon_books_data LIMIT 5;

Your table should now contain the latest snapshot of data engineering books, automatically updated daily through Airflow’s scheduling system.
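From there you can query the table like any other MySQL data. For example, here’s a small sketch (run on your host machine, using the same credentials as the load task) that pulls the five highest-rated books:

import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="airflow", password="airflow", database="airflow_db"
)
cursor = conn.cursor()
cursor.execute("""
    SELECT Title, Author, `Price($)`, Rating
    FROM amazon_books_data
    WHERE Rating IS NOT NULL
    ORDER BY Rating DESC
    LIMIT 5;
""")
for title, author, price, rating in cursor.fetchall():
    print(f"{rating} | {title} ({author}) - ${price}")
conn.close()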


Data Loading with Airflow (Amazon Book Data Project) (2)

Adding Git Sync, CI, and Alerts

At this point, you’ve successfully extracted, transformed, and loaded your data. However, your DAGs are still stored locally on your computer, which makes it difficult for collaborators to contribute and puts your entire workflow at risk if your machine fails or gets corrupted.

In this final section, we’ll introduce a few production-like patterns: version control, automated DAG syncing, basic CI checks, and failure alerts. These don’t make the project fully production-ready, but they represent the core practices most data engineers start with when moving beyond local development. The goal here is to show the essential workflow: storing DAGs in Git, syncing them automatically, validating updates before deployment, and receiving notifications when something breaks.

(For a more detailed explanation of the Git Sync setup shown below, you can read the extended breakdown here.)

To begin, create a public or private repository named airflow_dags (e.g., https://github.com/<your-username>/airflow_dags).

Then, in your project root (airflow-docker-tutorial), initialize Git and push your local dags/ directory:

git init
git remote add origin https://github.com/<your-username>/airflow_dags.git
git add dags/
git commit -m "Add Airflow ETL pipeline DAGs"
git branch -M main
git push -u origin main

Once complete, your DAGs live safely in GitHub, ready for syncing.

1. Automating DAGs with Git Sync

Rather than manually copying DAG files into your Airflow container, we can automate this using git-sync. This lightweight sidecar container continuously clones your GitHub repository into a shared volume.

Each time you push new DAG updates to GitHub, git-sync automatically pulls them into your Airflow environment, no rebuilds, no restarts. This ensures every environment always runs the latest, version-controlled workflow.

As we saw previously, we need to add a new git-sync service to our docker-compose.yaml and create a shared volume called airflow-dags-volume (this can be any name, just make it consistent) that both git-sync and Airflow will use.

services:
  git-sync:
    image: registry.k8s.io/git-sync/git-sync:v4.1.0
    user: "0:0"    # run as root so it can create /dags/git-sync
    restart: always
    environment:
      GITSYNC_REPO: "https://github.com/<your-username>/airflow_dags.git"
      GITSYNC_BRANCH: "main"           # use BRANCH not REF
      GITSYNC_PERIOD: "30s"
      GITSYNC_DEPTH: "1"
      GITSYNC_ROOT: "/dags/git-sync"
      GITSYNC_DEST: "repo"
      GITSYNC_LINK: "current"
      GITSYNC_ONE_TIME: "false"
      GITSYNC_ADD_USER: "true"
      GITSYNC_CHANGE_PERMISSIONS: "1"
      GITSYNC_STALE_WORKTREE_TIMEOUT: "24h"
    volumes:
      - airflow-dags-volume:/dags
    healthcheck:
      test: ["CMD-SHELL", "test -L /dags/git-sync/current && test -d /dags/git-sync/current/dags && [ \"$(ls -A /dags/git-sync/current/dags 2>/dev/null)\" ]"]
      interval: 10s
      timeout: 3s
      retries: 10
      start_period: 10s

volumes:
  airflow-dags-volume:

We then replace the original DAGs mount line in the volumes section (- ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags) with - airflow-dags-volume:/opt/airflow/dags so Airflow reads DAGs directly from the synchronized Git volume.

We also set AIRFLOW__CORE__DAGS_FOLDER to /opt/airflow/dags/git-sync/current/dags so Airflow always points to the latest synced repository.

Finally, each Airflow service (airflow-apiserver, airflow-triggerer, airflow-dag-processor, and airflow-scheduler) is updated with a depends_on condition to ensure they only start after git-sync has successfully cloned the DAGs:

depends_on:
  git-sync:
    condition: service_healthy

2. Adding Continuous Integration (CI) with GitHub Actions

To avoid deploying broken DAGs, we can add a lightweight GitHub Actions pipeline that validates DAG syntax whenever DAG changes are pushed to the main branch.

Create a file in your repository:

.github/workflows/validate-dags.yml

name: Validate Airflow DAGs

on:
  push:
    branches: [ main ]
    paths:
      - 'dags/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      # Install all required packages for your DAG imports
      # (instead of only installing Apache Airflow)
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      # Validate that all DAGs parse correctly.
      # This imports every DAG file; if any import fails, CI fails.
      - name: Validate DAGs
        run: |
          echo "Validating DAG syntax..."
          airflow dags list || exit 1

This workflow automatically runs when new DAGs are pushed, ensuring they parse correctly before reaching Airflow.

3. Setting Up Alerts for Failures

Finally, for real-time visibility, Airflow provides alerting mechanisms that can notify you of any failed tasks via email or Slack.

Add this configuration under your DAG’s default_args:

default_args = {
    "owner": "Data Engineering Team",
    "email": ["[email protected]"],
    "email_on_failure": True,
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

If an extraction fails (for instance, Amazon changes its HTML structure or MySQL goes offline), Airflow automatically sends an alert with the error log and task details.
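If you prefer Slack, one lightweight option is an on_failure_callback that posts to an incoming webhook. A minimal sketch, where the SLACK_WEBHOOK_URL environment variable and the message format are our own choices:

import os
import requests
from datetime import timedelta

def notify_slack_on_failure(context):
    """Post a short failure message to a Slack incoming webhook."""
    webhook_url = os.environ.get("SLACK_WEBHOOK_URL")
    if not webhook_url:
        return  # no webhook configured, skip silently
    message = (
        f":red_circle: Task {context['task_instance'].task_id} failed "
        f"in DAG {context['dag'].dag_id} at {context['ts']}"
    )
    requests.post(webhook_url, json={"text": message}, timeout=10)

default_args = {
    "owner": "Data Engineering Team",
    "on_failure_callback": notify_slack_on_failure,
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}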

Summary and Up Next

In this tutorial, you built a complete, real-world ETL pipeline in Apache Airflow by moving beyond simulated workflows and extracting live book data from Amazon. You designed a production-style workflow that scrapes, cleans, and loads Amazon listings into MySQL, while organizing your code using Airflow’s TaskFlow API for clarity, modularity, and reliability.

You then strengthened your setup by integrating Git-based DAG management with git-sync, adding GitHub Actions CI to automatically validate workflows, and enabling alerting so failures are detected and surfaced immediately.

Together, these improvements transformed your project into a version-controlled, automated orchestration system that mirrors real production environments and prepares you for cloud deployment.

As a next step, you can expand this workflow by exploring other Amazon categories or by applying the same scraping and ETL techniques to entirely different websites, such as OpenWeather for weather insights or Indeed for job listings. This will broaden your data engineering experience with new, real-world data sources. Running Airflow in the cloud is also an important milestone; this tutorial will help you further deepen your understanding of cloud-based Airflow deployments.

PySpark Performance Tuning and Optimization

20 November 2025 at 02:55

PySpark pipelines that work beautifully with test data often crawl (or crash) in production. For example, a pipeline runs great at a small scale, with maybe a few hundred records finishing in under a minute. Then the company grows, and that same pipeline now processes hundreds of thousands of records and takes 45 minutes. Sometimes it doesn't finish at all. Your team lead asks, "Can you make this faster?"

In this tutorial, we’ll see how to systematically diagnose and fix those performance problems. We'll start with a deliberately unoptimized pipeline (47 seconds for just 75 records — yikes!), identify exactly what's slow using Spark's built-in tools, then fix it step-by-step until it runs in under 5 seconds. Same data, same hardware, just better code.

Before we start our optimization, let's set the stage. We've updated the ETL pipeline from the previous tutorial to use production-standard patterns, which means we're running everything in Docker with Spark's native parquet writers, just like you would in a real cluster environment. If you aren’t familiar with the previous tutorial, don’t worry! You can seamlessly jump into this one to learn pipeline optimization.

Getting the Starter Files

Let's get the starter files from our repository.

# Clone the full tutorials repo and navigate to this project
git clone https://github.com/dataquestio/tutorials.git
cd tutorials/pyspark-optimization-tutorial

This tutorial uses a local docker-compose.yml that exposes port 4040 for the Spark UI, so make sure you cloned the full repo, not just the subdirectory.

The starter files include:

  • Baseline ETL pipeline code (in src/etl_pipeline.py and main.py)
  • Sample grocery order data (in data/raw/)
  • Docker configuration for running Spark locally
  • Solution files for reference (in solution/)

Quick note: Spark 4.x enables Adaptive Query Execution (AQE) by default, which automatically handles some optimizations like dynamic partition coalescing. We've disabled it for this tutorial so you can see the raw performance problems clearly. Understanding these issues helps you write better code even when AQE is handling them automatically, which you'll re-enable in production.

Now, let's run this production-ready baseline to see what we’re dealing with.

Running the baseline

Make sure you're in the correct directory and start your Docker container with port access:

cd tutorials/pyspark-optimization-tutorial
docker compose run --rm --service-ports lab

This opens an interactive shell inside the container. The --service-ports flag exposes port 4040, which you'll need to access Spark's web interface. We'll use that in a moment to see exactly what your pipeline is doing.

Inside the container, run:

python main.py
Pipeline completed in 47.15 seconds
WARNING: Total allocation exceeds 95.00% of heap memory
[...25 more memory warnings...]

47 seconds for 75 records. Look at those memory warnings! Spark is struggling because we're creating way too many partition files. We're scanning the data multiple times for simple counts. We're doing expensive aggregations without any caching. Every operation triggers a full pass through the data.

The good news? These are common, fixable mistakes. We'll systematically identify and eliminate them until this runs in under 5 seconds. Let's start by diagnosing exactly what's slow.

Understanding What's Slow: The Spark UI

We can't fix performance problems by guessing. The first rule of optimization: measure, then fix.

Spark gives us a built-in diagnostic tool that shows exactly what our pipeline is doing. It's called the Spark UI, and it runs every time you execute a Spark job. For this tutorial, we've added a pause at the end of our main.py job so you can explore the Spark UI. In production, you'd use Spark's history server to review completed jobs, but for learning, this pause lets you click around and build familiarity with the interface.

Accessing the Spark UI

Remember that port 4040 we exposed with our Docker command? Now we're going to use it. While your pipeline is running (or paused at the end), open a browser and go to http://localhost:4040.

You'll see Spark's web interface showing every job, stage, and task your pipeline executed. This is where you diagnose bottlenecks.

What to Look For

The Spark UI has a lot of tabs and metrics, which can feel overwhelming. For our work today we’ll focus on three things:

The Jobs tab shows every action that triggered execution. Each .count() or .write() creates a job. If you see 20+ jobs for a simple pipeline, you're doing too much work.

Stage durations tell you where time is being spent. Click into a job to see its stages. Spending 30 seconds on a count operation for 75 records? That's a bottleneck.

Number of tasks reveals partitioning problems. See 200 tasks to process 75 records? That's way too much overhead.

Reading the Baseline Pipeline

Open the Jobs tab. You'll see a table that looks messier than you'd expect:

Job Id  Description                                     Duration
0       count at NativeMethodAccessorImpl.java:0        0.5 s
1       count at NativeMethodAccessorImpl.java:0        0.1 s
2       count at NativeMethodAccessorImpl.java:0        0.1 s
...
13      parquet at NativeMethodAccessorImpl.java:0      3 s
15      parquet at NativeMethodAccessorImpl.java:0      2 s
19      collect at etl_pipeline.py:195                  0.3 s
20      collect at etl_pipeline.py:201                  0.3 s

The descriptions aren't particularly helpful, but count them. 22 jobs total. That's a lot of work for 75 records!

Most of these are count operations. Every time we logged a count in our code, Spark had to scan the entire dataset. Jobs 0-12 are all those .count() calls scattered through extraction and transformation. That's inefficiency #1.

Jobs 13 and 15 are our parquet writes. If you click into job 13, you'll see it created around 200 tasks to write 75 records. That's why we got all those memory warnings: too many tiny files means too much overhead.

Jobs 19 and 20 are the collect operations from our summary report (you can see the line numbers: etl_pipeline.py:195 and :201). Each one triggers more computation.

The Spark UI isn't always pretty, but it's showing us exactly what's wrong:

  • 22 jobs for a simple pipeline - way too much work
  • Most jobs are counts - rescanning data repeatedly
  • 200 tasks to write 75 records - partitioning gone wrong
  • Separate collection operations - no reuse of computed data

You don't need to understand every Java stack trace; you just need to count the jobs, spot the repeated operations, and identify where time is being spent. That's enough to know what to fix.

Eliminating Redundant Operations

Look back at those 22 jobs in the Spark UI. Most of them are counts. Every time we wrote df.count() to log how many records we had, Spark scanned the entire dataset. Right now, that's just 75 records, but scale to 75 million, and those scans eat hours of runtime.

Spark is lazy by design, so transformations like .filter() and .withColumn() build a plan but don't actually do anything. Only actions like .count(), .collect(), and .write() trigger execution. Usually, this is good because Spark can optimize the whole plan at once, but when you sprinkle counts everywhere "just to see what's happening," you're forcing Spark to execute repeatedly.
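You can see this laziness for yourself. In the sketch below (the "North" region value is just an example), the transformations return instantly; only the actions at the end force Spark to run the plan:

from pyspark.sql.functions import col

# Transformations: Spark only records a plan, nothing executes yet
north_orders = raw_df.filter(col("region") == "North")

# Inspect the plan without triggering any work
north_orders.explain()

# Actions: each one runs the full plan from the raw data
north_orders.count()                                            # one full scan
north_orders.write.mode("overwrite").parquet("data/tmp/north")  # another full scan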

The Problem: Logging Everything

Open src/etl_pipeline.py and look at the extract_sales_data function:

def extract_sales_data(spark, input_path):
    """Read CSV files with explicit schema"""

    logger.info(f"Reading sales data from {input_path}")

    # ... schema definition ...

    df = spark.read.csv(input_path, header=True, schema=schema)

    # PROBLEM: This forces a full scan just to log the count
    logger.info(f"Loaded {df.count()} records from {input_path}")

    return df

That count seems harmless because we want to know how many records we loaded, right? But we're calling this function three times (once per CSV file), so that's three full scans before we've done any actual work.

Now look at extract_all_data:

def extract_all_data(spark):
    """Combine data from multiple sources"""

    online_orders = extract_sales_data(spark, "data/raw/online_orders.csv")
    store_orders = extract_sales_data(spark, "data/raw/store_orders.csv")
    mobile_orders = extract_sales_data(spark, "data/raw/mobile_orders.csv")

    all_orders = online_orders.unionByName(store_orders).unionByName(mobile_orders)

    # PROBLEM: Another full scan right after combining
    logger.info(f"Combined dataset has {all_orders.count()} orders")

    return all_orders

We just scanned three times to count individual files, then scanned again to count the combined result. That's four scans before we've even started transforming data.

The Fix: Count Only What Matters

Only count when you need the number for business logic, not for logging convenience.

Remove the counts from extract_sales_data:

def extract_sales_data(spark, input_path):
    """Read CSV files with explicit schema"""

    logger.info(f"Reading sales data from {input_path}")

    schema = StructType([
        StructField("order_id", StringType(), True),
        StructField("customer_id", StringType(), True),
        StructField("product_name", StringType(), True),
        StructField("price", StringType(), True),
        StructField("quantity", StringType(), True),
        StructField("order_date", StringType(), True),
        StructField("region", StringType(), True)
    ])

    df = spark.read.csv(input_path, header=True, schema=schema)

    # No count here - just return the DataFrame
    return df

Remove the count from extract_all_data:

def extract_all_data(spark):
    """Combine data from multiple sources"""

    online_orders = extract_sales_data(spark, "data/raw/online_orders.csv")
    store_orders = extract_sales_data(spark, "data/raw/store_orders.csv")
    mobile_orders = extract_sales_data(spark, "data/raw/mobile_orders.csv")

    all_orders = online_orders.unionByName(store_orders).unionByName(mobile_orders)

    logger.info("Combined data from all sources")
    return all_orders

Fixing the Transform Phase

Now look at remove_test_data and handle_duplicates. Both calculate how many records they removed:

def remove_test_data(df):
    """Filter out test records"""
    df_filtered = df.filter(
        ~(upper(col("customer_id")).contains("TEST") |
          upper(col("product_name")).contains("TEST") |
          col("customer_id").isNull() |
          col("order_id").isNull())
    )

    # PROBLEM: Two counts just to log the difference
    removed_count = df.count() - df_filtered.count()
    logger.info(f"Removed {removed_count} test/invalid orders")

    return df_filtered

That's two full scans (one for df.count(), one for df_filtered.count()) just to log a number. Here's the fix:

def remove_test_data(df):
    """Filter out test records"""
    df_filtered = df.filter(
        ~(upper(col("customer_id")).contains("TEST") |
          upper(col("product_name")).contains("TEST") |
          col("customer_id").isNull() |
          col("order_id").isNull())
    )

    logger.info("Removed test and invalid orders")
    return df_filtered

Do the same for handle_duplicates:

def handle_duplicates(df):
    """Remove duplicate orders"""
    df_deduped = df.dropDuplicates(["order_id"])

    logger.info("Removed duplicate orders")
    return df_deduped

Counts make sense during development when we're validating logic. Feel free to add them liberally - check that your test filter actually removed 10 records, verify deduplication worked. But once the pipeline works? Remove them before deploying to production, where you'd use Spark's accumulators or monitoring systems instead.
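One lightweight way to get both is to gate the development-time counts behind an environment flag, so they cost nothing unless you explicitly turn them on. A sketch (the PIPELINE_DEBUG variable is our own convention, and it assumes the same imports and logger as the rest of etl_pipeline.py):

import os

DEBUG_COUNTS = os.getenv("PIPELINE_DEBUG", "false").lower() == "true"

def remove_test_data(df):
    """Filter out test records, counting only when explicitly debugging."""
    df_filtered = df.filter(
        ~(upper(col("customer_id")).contains("TEST") |
          upper(col("product_name")).contains("TEST") |
          col("customer_id").isNull() |
          col("order_id").isNull())
    )
    if DEBUG_COUNTS:
        # Two extra scans, but only when PIPELINE_DEBUG=true
        logger.info(f"Removed {df.count() - df_filtered.count()} test/invalid orders")
    return df_filtered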

Keeping One Strategic Count

Let's keep exactly one count operation in main.py after the full transformation completes. This tells us the final record count without triggering excessive scans during processing:

def main():
    # ... existing code ...

    try:
        spark = create_spark_session()
        logger.info("Spark session created")

        # Extract
        raw_df = extract_all_data(spark)
        logger.info("Extracted raw data from all sources")

        # Transform
        clean_df = transform_orders(raw_df)
        logger.info(f"Transformation complete: {clean_df.count()} clean records")

        # ... rest of pipeline ...

One count at a key checkpoint. That's it.

What About the Summary Report?

Look at what create_summary_report is doing:

total_orders = df.count()
unique_customers = df.select("customer_id").distinct().count()
unique_products = df.select("product_name").distinct().count()
total_revenue = df.agg(sum("total_amount")).collect()[0][0]
# ... more separate operations ...

Each line scans the data independently - six separate scans for six metrics. We'll fix this properly in a later section when we talk about efficient aggregations, but for now, let's just remove the call to create_summary_report from main.py entirely. Comment it out:

        load_to_parquet(clean_df, output_path)
        load_to_parquet(metrics_df, metrics_path)

        # summary = create_summary_report(clean_df)  # Temporarily disabled

        runtime = (datetime.now() - start_time).total_seconds()
        logger.info(f"Pipeline completed in {runtime:.2f} seconds")

Test the Changes

Run your optimized pipeline:

python main.py

Watch the output. You'll see far fewer log messages because we're not counting everything. More importantly, check the completion time:

Pipeline completed in 42.70 seconds

We just shaved off 5 seconds by removing unnecessary counts.

Not a massive win, but we're just getting started. More importantly, check the Spark UI (http://localhost:4040) and you should see fewer jobs now, maybe 12-15 instead of 22.

Not all optimizations give massive speedups, but fixing them systematically adds up. We removed unnecessary work, which is always good practice. Now let's tackle the real problem: partitioning.

Fixing Partitioning: Right-Sizing Your Data

When Spark processes data, it splits it into chunks called partitions. Each partition gets processed independently, potentially on different machines or CPU cores. This is how Spark achieves parallelism. But Spark doesn't automatically know the perfect number of partitions for your data.

By default, Spark often creates 200 partitions for operations like shuffles and writes. That's a reasonable default if you're processing hundreds of gigabytes across a 50-node cluster, but we're processing 75 records on a single machine. Creating 200 partition files means:

  • 200 tiny files written to disk
  • 200 file handles opened simultaneously
  • Memory overhead for managing 200 separate write operations
  • More time spent on coordination than actual work

It's like hiring 200 people to move 75 boxes. Most of them stand around waiting while a few do all the work, and you waste money coordinating everyone.
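Before diving into the UI, you can also check partition counts directly from code. A quick sketch:

# How many partitions does a DataFrame currently have?
print(clean_df.rdd.getNumPartitions())

# The default partition count Spark uses for shuffles and many writes (typically 200)
print(spark.conf.get("spark.sql.shuffle.partitions"))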

Seeing the Problem

Open the Spark UI and look at the Stages tab. Find one of the parquet write stages (you'll see them labeled parquet at NativeMethodAccessorImpl.java:0).

Spark UI Stages Tab

Look at the Tasks: Succeeded/Total column and you'll see 200/200. Spark created 200 separate tasks to write 75 records. Now, look at the Shuffle Write column, which is probably showing single-digit kilobytes. We're using massive parallelism for tiny amounts of data.

Each of those 200 tasks creates overhead: opening a file handle, coordinating with the driver, writing a few bytes, then closing. Most tasks spend more time on coordination than actual work.

The Fix: Coalesce

We need to reduce the number of partitions before writing. Spark gives us two options: repartition() and coalesce().

Repartition does a full shuffle - redistributes all data across the cluster. It's expensive but gives you exact control.

Coalesce is smarter. It combines existing partitions without shuffling. If you have 200 partitions and coalesce to 2, Spark just merges them in place, which is much faster.

For our data size, we want 1 partition - one file per output, clean and simple.

Update load_to_parquet in src/etl_pipeline.py:

def load_to_parquet(df, output_path):
    """Save to parquet with proper partitioning"""

    logger.info(f"Writing data to {output_path}")

    # Coalesce to 1 partition before writing
    # This creates a single output file instead of 200 tiny ones
    df.coalesce(1) \
      .write \
      .mode("overwrite") \
      .parquet(output_path)

    logger.info(f"Successfully wrote data to {output_path}")

When to Use Different Partition Counts

You don’t always want one partition. Here's when to use different counts:

For small datasets (under 1GB): Use 1-4 partitions. Minimize overhead.

For medium datasets (1-10GB): Use 10-50 partitions. Balance parallelism and overhead.

For large datasets (100GB+): Use 100-500 partitions. Maximize parallelism.

General guideline: Aim for 128MB to 1GB per partition. That's the sweet spot where each task has enough work to justify the overhead but not so much that it runs out of memory.

For our 75 records, 1 partition works perfectly. In production environments, AQE typically handles this partition sizing automatically, but understanding these principles helps you write better code even when AQE is doing the heavy lifting.
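If you’d rather derive a starting point from the data than guess, you can estimate a partition count from the input size on disk. A rough sketch using the clean_df and output_path names from main.py, assuming a target of about 256MB per partition (on-disk size is only a proxy for in-memory size, so treat this as a heuristic):

import os

def estimate_partitions(path, target_mb=256):
    """Rough partition count from total file size on disk; never less than 1."""
    total_bytes = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    )
    return max(1, total_bytes // (target_mb * 1024 * 1024))

num_parts = estimate_partitions("data/raw")
# coalesce only merges partitions down; use repartition(num_parts) if you need more
clean_df.coalesce(num_parts).write.mode("overwrite").parquet(output_path)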

Test the Fix

Run the pipeline again:

python main.py

Those memory warnings should be gone. Check the timing:

Pipeline completed in 20.87 seconds

We just saved another 22 seconds by fixing partitioning. That's cutting our runtime in half. From 47 seconds in the original baseline to 20 seconds now, and we've only made two changes.

Check the Spark UI Stages tab. Now your parquet write stages should show 1/1 or 2/2 tasks instead of 200/200.

Look at your output directory:

ls data/processed/orders/

Before, you'd see hundreds of tiny parquet files with names like part-00001.parquet, part-00002.parquet, and so on. Now you'll see one clean file.

Why This Matters at Scale

Wrong partitioning breaks pipelines at scale. With 75 million records, having too many partitions creates coordination overhead. Having too few means you can't parallelize. And if your data is unevenly distributed (some partitions with 1 million records, others with 10,000), you get stragglers, slow tasks that hold up the entire job while everyone else waits.

Get partitioning right and your pipeline uses less memory, produces cleaner files, performs better downstream, and scales without crashing.

We've cut our runtime in half by eliminating redundant work and fixing partitioning. Next up: caching. We're still recomputing DataFrames multiple times, and that's costing us time. Let's fix that.

Strategic Caching: Stop Recomputing the Same Data

Here's a question: how many times do we use clean_df in our pipeline?

Look at main.py:

clean_df = transform_orders(raw_df)
logger.info(f"Transformation complete: {clean_df.count()} clean records")

metrics_df = create_metrics(clean_df)  # Using clean_df here
logger.info(f"Generated {metrics_df.count()} metric records")

load_to_parquet(clean_df, output_path)  # Using clean_df again
load_to_parquet(metrics_df, metrics_path)

We use clean_df three times: once for the count, once to create metrics, and once to write it. Here's the catch: Spark recomputes that entire transformation pipeline every single time.

Remember, transformations are lazy. When you write clean_df = transform_orders(raw_df), Spark doesn't actually clean the data. It just creates a plan: "When someone needs this data, here's how to build it." Every time you use clean_df, Spark goes back to the raw CSVs, reads them, applies all your transformations, and produces the result. Three uses = three complete executions.

That's wasteful.

The Solution: Cache It

Caching tells Spark: "I'm going to use this DataFrame multiple times. Compute it once, keep it in memory, and reuse it."

Update main.py to cache clean_df after transformation:

def main():
    # ... existing code ...

    try:
        spark = create_spark_session()
        logger.info("Spark session created")

        # Extract
        raw_df = extract_all_data(spark)
        logger.info("Extracted raw data from all sources")

        # Transform
        clean_df = transform_orders(raw_df)

        # Cache because we'll use this multiple times
        clean_df.cache()

        logger.info(f"Transformation complete: {clean_df.count()} clean records")

        # Create aggregated metrics
        metrics_df = create_metrics(clean_df)
        logger.info(f"Generated {metrics_df.count()} metric records")

        # Load
        output_path = "data/processed/orders"
        metrics_path = "data/processed/metrics"

        load_to_parquet(clean_df, output_path)
        load_to_parquet(metrics_df, metrics_path)

        # Clean up the cache when done
        clean_df.unpersist()

        runtime = (datetime.now() - start_time).total_seconds()
        logger.info(f"Pipeline completed in {runtime:.2f} seconds")

We added clean_df.cache() right after transformation and clean_df.unpersist() when we're done with it.

What Actually Happens

When you call .cache(), nothing happens immediately. Spark just marks that DataFrame as "cacheable." The first time you actually use it (the count operation), Spark computes the result and stores it in memory. Every subsequent use pulls from memory instead of recomputing.

The .unpersist() at the end frees up that memory. Not strictly necessary (Spark will eventually evict cached data when it needs space), but it's good practice to be explicit.
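For DataFrames, .cache() is shorthand for persist() at the default storage level (memory first, spilling to disk when needed). If you want to control that explicitly, or check whether something is cached, here’s a short sketch:

from pyspark import StorageLevel

# Equivalent to clean_df.cache() for DataFrames, spelled out explicitly
clean_df.persist(StorageLevel.MEMORY_AND_DISK)

# Or keep it memory-only if you'd rather recompute than spill to disk
# clean_df.persist(StorageLevel.MEMORY_ONLY)

print(clean_df.is_cached)  # True once the DataFrame is marked for caching

clean_df.unpersist()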

When to Cache (and When Not To)

Not every DataFrame needs caching. Use these guidelines:

Cache when:

  • You use a DataFrame 2+ times
  • The DataFrame is expensive to compute
  • It's reasonably sized (Spark will spill to disk if it doesn't fit in memory, but excessive spilling hurts performance)

Don't cache when:

  • You only use it once (wastes memory for no benefit)
  • Computing it is cheaper than the cache overhead (very simple operations)
  • You're caching so much data that Spark is constantly evicting and re-caching (check the Storage tab in Spark UI to see if this is happening)

For our pipeline, caching clean_df makes sense because we use it three times and it's small. Caching the raw data wouldn't help because we only transform it once.

Test the Changes

Run the pipeline:

python main.py

Check the timing:

Pipeline completed in 18.11 seconds

We saved about 3 seconds with caching. Want to see what actually got cached? Check the Storage tab in the Spark UI to see how much data is in memory versus what has been spilled to disk. For our small dataset, everything should be in memory.

From 47 seconds at baseline to 18 seconds now — that's a 62% improvement from three optimizations: removing redundant counts, fixing partitioning, and caching strategically.

The performance gain from caching is smaller than partitioning because our dataset is tiny. With larger data, caching makes a much bigger difference.

Too Much Caching

Caching isn't free. It uses memory, and if you cache too much, Spark starts evicting data to make room for new caches. This causes thrashing, constant cache evictions and recomputations, which makes everything slower.

Cache strategically. If you'll use it more than once, cache it. If not, don't.

We've now eliminated redundant operations, fixed partitioning, and added strategic caching. Next up: filtering early. We're still cleaning all the data before removing test records, which is backwards. Let's fix that.

Filter Early: Don't Clean Data You'll Throw Away

Look at our transformation pipeline in transform_orders:

def transform_orders(df):
    """Apply all transformations in sequence"""

    logger.info("Starting data transformation...")

    df = clean_customer_id(df)
    df = clean_price_column(df)
    df = standardize_dates(df)
    df = remove_test_data(df)       # ← This should be first!
    df = handle_duplicates(df)

See the problem? We're cleaning customer IDs, parsing prices, and standardizing dates for all the records. Then, at the end, we remove test data and duplicates. We just wasted time cleaning data we're about to throw away.

It's like washing dirty dishes before checking which ones are broken. Why scrub something you're going to toss?

Why This Matters

Test data removal and deduplication are cheap operations because they're just filters. Price cleaning and date parsing are expensive because they involve regex operations and type conversions on every row.

Right now we're doing expensive work on 85 records, then filtering down to 75. We should filter to 75 first, then do expensive work on just those records.

With our tiny dataset, this won't save much time. But imagine production: you load 10 million records, 5% are test data and duplicates. That's 500,000 records you're cleaning for no reason. Early filtering means you only clean 9.5 million records instead of 10 million. That's real time saved.

The Fix: Reorder Transformations

Move the filters to the front. Change transform_orders to:

def transform_orders(df):
    """Apply all transformations in sequence"""

    logger.info("Starting data transformation...")

    # Filter first - remove data we won't use
    df = remove_test_data(df)
    df = handle_duplicates(df)

    # Then do expensive transformations on clean data only
    df = clean_customer_id(df)
    df = clean_price_column(df)
    df = standardize_dates(df)

    # Cast quantity and add calculated fields
    df = df.withColumn(
        "quantity",
        when(col("quantity").isNotNull(), col("quantity").cast(IntegerType()))
        .otherwise(1)
    )

    df = df.withColumn("total_amount", col("unit_price") * col("quantity")) \
           .withColumn("processing_date", current_date()) \
           .withColumn("year", year(col("order_date"))) \
           .withColumn("month", month(col("order_date")))

    logger.info("Transformation complete")

    return df

The Principle: Push Down Filters

This is called predicate pushdown in database terminology, but the concept is simple: do your filtering as early as possible in the pipeline. Reduce your data size before doing expensive operations.

This applies beyond just test data:

  • If you're only analyzing orders from 2024, filter by date right after reading
  • If you only care about specific regions, filter by region immediately
  • If you're joining two datasets and one is huge, filter both before joining

The general pattern: filter early, transform less.
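As a concrete example of the first two patterns above, here’s what filtering right after the read might look like. This is a sketch with an illustrative region value and date cutoff, and it assumes order_date already arrives in yyyy-mm-dd form (otherwise, filter on dates after standardizing them):

from pyspark.sql.functions import col

raw_df = extract_all_data(spark)

# Cut the data down before any expensive cleaning
recent_north = raw_df \
    .filter(col("region") == "North") \
    .filter(col("order_date") >= "2024-01-01")

clean_df = transform_orders(recent_north)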

Test the Changes

Run the pipeline:

python main.py

Check the timing:

Pipeline completed in 17.71 seconds

We saved about a second. In production with millions of records, early filtering can be the difference between a 10-minute job and a 2-minute job.

When Order Matters

Not all transformations can be reordered. Some have dependencies:

  • You can't calculate total_amount before you've cleaned unit_price
  • You can't extract the year from order_date before standardizing the date format
  • You can't deduplicate before you've standardized customer IDs (or you might miss duplicates)

But filters that don't depend on transformations? Move those to the front.

Let's tackle one final improvement: making the summary report efficient.

Efficient Aggregations: Compute Everything in One Pass

Remember that summary report we commented out earlier? Let's bring it back and fix it properly.

Here's what create_summary_report currently does:

def create_summary_report(df):
    """Generate summary statistics"""

    logger.info("Generating summary report...")

    total_orders = df.count()
    unique_customers = df.select("customer_id").distinct().count()
    unique_products = df.select("product_name").distinct().count()
    total_revenue = df.agg(sum("total_amount")).collect()[0][0]

    date_stats = df.agg(
        min("order_date").alias("earliest"),
        max("order_date").alias("latest")
    ).collect()[0]

    region_count = df.groupBy("region").count().count()

    # ... log everything ...

Count how many times we scan the data. Six separate operations: count the orders, count distinct customers, count distinct products, sum revenue, get date range, and count regions. Each one reads through the entire DataFrame independently.

The Fix: Single Aggregation Pass

Spark lets you compute multiple aggregations in one pass. Here's how:

Replace create_summary_report with this optimized version:

def create_summary_report(df):
    """Generate summary statistics efficiently"""
    # Needs count and countDistinct alongside the aggregation functions already imported:
    # from pyspark.sql.functions import count, countDistinct, sum, min, max

    logger.info("Generating summary report...")

    # Compute everything in a single aggregation
    stats = df.agg(
        count("*").alias("total_orders"),
        countDistinct("customer_id").alias("unique_customers"),
        countDistinct("product_name").alias("unique_products"),
        sum("total_amount").alias("total_revenue"),
        min("order_date").alias("earliest_date"),
        max("order_date").alias("latest_date"),
        countDistinct("region").alias("regions")
    ).collect()[0]

    summary = {
        "total_orders": stats["total_orders"],
        "unique_customers": stats["unique_customers"],
        "unique_products": stats["unique_products"],
        "total_revenue": stats["total_revenue"],
        "date_range": f"{stats['earliest_date']} to {stats['latest_date']}",
        "regions": stats["regions"]
    }

    logger.info("\n=== ETL Summary Report ===")
    for key, value in summary.items():
        logger.info(f"{key}: {value}")
    logger.info("========================\n")

    return summary

One .agg() call, seven metrics computed. Spark scans the data once and calculates everything simultaneously.

Now uncomment the summary report call in main.py:

        load_to_parquet(clean_df, output_path)
        load_to_parquet(metrics_df, metrics_path)

        # Generate summary report
        summary = create_summary_report(clean_df)

        # Clean up the cache when done
        clean_df.unpersist()

How This Works

When you chain multiple aggregations in .agg(), Spark sees them all at once and creates a single execution plan. Everything gets calculated together in one pass through the data, not sequentially.

This is the difference between:

  • Six passes: Read → count → Read → distinct customers → Read → distinct products...
  • One pass: Read → count + distinct customers + distinct products + sum + min + max + regions

Same results, fraction of the work.

Test the Final Pipeline

Run it:

python main.py

Check the output:

=== ETL Summary Report ===
2025-11-07 23:14:22,285 - INFO - total_orders: 75
2025-11-07 23:14:22,285 - INFO - unique_customers: 53
2025-11-07 23:14:22,285 - INFO - unique_products: 74
2025-11-07 23:14:22,285 - INFO - total_revenue: 667.8700000000003
2025-11-07 23:14:22,285 - INFO - date_range: 2024-10-15 to 2024-11-10
2025-11-07 23:14:22,285 - INFO - regions: 4
2025-11-07 23:14:22,286 - INFO - ========================

2025-11-07 23:14:22,297 - INFO - Pipeline completed in 18.28 seconds

We're at 18 seconds. Adding the efficient summary back added less than a second because it computes everything in one pass. The old version with six separate scans would probably take 24+ seconds.

Before and After: The Complete Picture

Let's recap what we've done:

Baseline (47 seconds):

  • Multiple counts scattered everywhere
  • 200 partitions for 75 records
  • No caching, constant recomputation
  • Test data removed after expensive transformations
  • Summary report with six separate scans

Optimized (18 seconds):

  • Strategic counting only where needed
  • 1 partition per output file
  • Cached frequently-used DataFrames
  • Filters first, transformations second
  • Summary report in a single aggregation

Result: 61% faster with the same hardware, same data, just better code.

And here's the thing: these optimizations scale. With 75,000 records instead of 75, the improvements would be even more dramatic. With 75 million records, proper optimization is the difference between a job that completes and one that crashes.

You've now seen the core optimization techniques that solve most real-world performance problems. Let's wrap up with what you've learned and when to apply these patterns.

What You've Accomplished

You can diagnose performance problems. The Spark UI showed you exactly where time was being spent. Too many jobs meant redundant operations. Too many tasks meant partitioning problems. Now you know how to read those signals.

You know when and how to optimize. Remove unnecessary work first. Fix obvious problems, such as incorrect partition counts. Cache strategically when data gets reused. Filter early to reduce data volume. Combine aggregations into single passes. Each technique targets a specific bottleneck.

You understand the tradeoffs. Caching uses memory. Coalescing reduces parallelism. Different situations need different techniques. Pick the right ones for your problem.

You learned the process: measure, identify bottlenecks, fix them, measure again. That process works whether you're optimizing a 75-record tutorial or a production pipeline processing terabytes.

When to Apply These Techniques

Don't optimize prematurely. If your pipeline runs in 30 seconds and runs once a day, it's fine. Spend your time building features instead of shaving seconds.

Optimize when:

  • Your pipeline can't finish before the next run starts
  • You're hitting memory limits or crashes
  • Jobs are taking hours when they should take minutes
  • You're paying significant cloud costs for compute time

Then follow the process you just learned: measure what's slow, fix the biggest bottleneck, measure again. Repeat until it's fast enough.

What's Next

You've optimized a single-machine pipeline. The next tutorial covers integrating PySpark with the broader big data ecosystem: connecting to data warehouses, working with cloud storage, orchestrating with Airflow, and understanding when to scale to a real cluster.

Before you move on to distributed clusters and ecosystem integration, take what you've learned and apply it. Find a slow Spark job - at work, in a project, or just a dataset you're curious about. Profile it with the Spark UI. Apply these optimizations. See the improvement for yourself.

In production, enable Adaptive Query Execution (AQE) by setting spark.sql.adaptive.enabled = true and let Spark's automatic optimizations work alongside your manual tuning.
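
One common way to set that is on the SparkSession builder. This is a sketch; the app name and exact builder chain are placeholders for however your pipeline creates its session:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("etl_pipeline")  # placeholder name
    .config("spark.sql.adaptive.enabled", "true")                     # let AQE re-optimize plans at runtime
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions automatically
    .getOrCreate()
)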

One Last Thing

Performance optimization feels intimidating when you're starting out. It seems like it requires deep expertise in Spark internals, JVM tuning, and cluster configuration.

But as you just saw, most performance problems come from a handful of common mistakes. Unnecessary operations. Wrong partitioning. Missing caches. Poor filtering strategies. Fix those, and you've solved 80% of real-world performance issues.

You don't need to be a Spark expert to write fast code. You need to understand what Spark is doing, identify the waste, and eliminate it. That's exactly what you did today.

Now go make something fast.

Measuring Similarity and Distance between Embeddings

11 November 2025 at 03:02

In the previous tutorial, you learned how to collect 500 research papers from arXiv and generate embeddings using both local models and API services. You now have a dataset of papers with embeddings that capture their semantic meaning. Those embeddings are vectors, which means we can perform mathematical operations on them.

But here's the thing: having embeddings isn't enough to build a search system. You need to know how to measure similarity between vectors. When a user searches for "neural networks for computer vision," which papers in your dataset are most relevant? The answer lies in measuring the distance between the query embedding and each paper embedding.

This tutorial teaches you how to implement similarity calculations and build a functional semantic search engine. You'll implement three different distance metrics, understand when to use each one, and create a search function that returns ranked results based on semantic similarity. By the end, you'll have built a complete search system that finds relevant papers based on meaning rather than keywords.

Setting Up Your Environment

We'll continue using the same libraries from the previous tutorials. If you've been following along, you should already have these installed. If not, here's the installation command for you to run from your terminal:

# Developed with: Python 3.12.12
# scikit-learn==1.6.1
# matplotlib==3.10.0
# numpy==2.0.2
# pandas==2.2.2
# cohere==5.20.0
# python-dotenv==1.1.1

pip install scikit-learn matplotlib numpy pandas cohere python-dotenv

Loading Your Saved Embeddings

Previously, we saved our embeddings and metadata to disk. Let's load them back into memory so we can work with them. We'll use the Cohere embeddings (embeddings_cohere.npy) because they provide consistent results across different hardware setups:

import numpy as np
import pandas as pd

# Load the metadata
df = pd.read_csv('arxiv_papers_metadata.csv')
print(f"Loaded {len(df)} papers")

# Load the Cohere embeddings
embeddings = np.load('embeddings_cohere.npy')
print(f"Loaded embeddings with shape: {embeddings.shape}")
print(f"Each paper is represented by a {embeddings.shape[1]}-dimensional vector")

# Verify the data loaded correctly
print(f"\nFirst paper title: {df['title'].iloc[0]}")
print(f"First embedding (first 5 values): {embeddings[0][:5]}")
Loaded 500 papers
Loaded embeddings with shape: (500, 1536)
Each paper is represented by a 1536-dimensional vector

First paper title: Dark Energy Survey Year 3 results: Simulation-based $w$CDM inference from weak lensing and galaxy clustering maps with deep learning. I. Analysis design
First embedding (first 5 values): [-7.7144260e-03  1.9527141e-02 -4.2141182e-05 -2.8627755e-03 -2.5192423e-02]

Perfect! We have our 500 papers and their corresponding 1536-dimensional embedding vectors. Each vector is a point in high-dimensional space, and papers with similar content will have vectors that are close together. Now we need to define what "close together" actually means.

Understanding Distance in Vector Space

Before we write any code, let's build intuition about measuring similarity between vectors. Imagine you have two papers about software compliance. Their embeddings might look like this:

Paper A: [0.8, 0.6, 0.1, ...]  (1536 numbers total)
Paper B: [0.7, 0.5, 0.2, ...]  (1536 numbers total)

To calculate the distance between embedding vectors, we need a distance metric. There are three commonly used metrics for measuring similarity between embeddings:

  1. Euclidean Distance: Measures the straight-line distance between vectors in space. A shorter distance means higher similarity. You can think of it as measuring the physical distance between two points.
  2. Dot Product: Multiplies corresponding elements and sums them up. Considers both direction and magnitude of the vectors. Works well when embeddings are normalized to unit length.
  3. Cosine Similarity: Measures the angle between vectors. If two vectors point in the same direction, they're similar, regardless of their length. This is the most common metric for text embeddings.

We'll implement each metric in order from most intuitive to most commonly used. Let's start with Euclidean distance because it's the easiest to understand.

Implementing Euclidean Distance

Euclidean distance measures the straight-line distance between two points in space. This is the most intuitive metric because we all understand physical distance. If you have two points on a map, the Euclidean distance is literally how far apart they are.

Unlike the other metrics we'll learn (where higher is better), Euclidean distance works in reverse: lower distance means higher similarity. Papers that are close together in space have low distance and are semantically similar.

Note that Euclidean distance is sensitive to vector magnitude. If your embeddings aren't normalized (meaning vectors can have different lengths), two vectors pointing in similar directions but with different magnitudes will show larger distance than expected. This is why cosine similarity (which we'll learn next) is often preferred for text embeddings. It ignores magnitude and focuses purely on direction.

The formula is:

$$\text{Euclidean distance} = |\mathbf{A} - \mathbf{B}| = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$$

This is essentially the Pythagorean theorem extended to high-dimensional space. We subtract corresponding values, square them, sum everything up, and take the square root. Let's implement it:

def euclidean_distance_manual(vec1, vec2):
    """
    Calculate Euclidean distance between two vectors.

    Parameters:
    -----------
    vec1, vec2 : numpy arrays
        The vectors to compare

    Returns:
    --------
    float
        Euclidean distance (lower means more similar)
    """
    # np.linalg.norm computes the square root of sum of squared differences
    # This implements the Euclidean distance formula directly
    return np.linalg.norm(vec1 - vec2)

# Let's test it by comparing two similar papers
paper_idx_1 = 492  # Android compliance detection paper
paper_idx_2 = 493  # GDPR benchmarking paper

distance = euclidean_distance_manual(embeddings[paper_idx_1], embeddings[paper_idx_2])

print(f"Comparing two papers:")
print(f"Paper 1: {df['title'].iloc[paper_idx_1][:50]}...")
print(f"Paper 2: {df['title'].iloc[paper_idx_2][:50]}...")
print(f"\nEuclidean distance: {distance:.4f}")
Comparing two papers:
Paper 1: Can Large Language Models Detect Real-World Androi...
Paper 2: GDPR-Bench-Android: A Benchmark for Evaluating Aut...

Euclidean distance: 0.8431

A distance of 0.84 is quite low, which means these papers are very similar! Both papers discuss Android compliance and benchmarking, so this makes perfect sense. Now let's compare this to a paper from a completely different category:

# Compare a software engineering paper to a database paper
paper_idx_3 = 300  # A database paper about natural language queries

distance_related = euclidean_distance_manual(embeddings[paper_idx_1],
                                             embeddings[paper_idx_2])
distance_unrelated = euclidean_distance_manual(embeddings[paper_idx_1],
                                               embeddings[paper_idx_3])

print(f"Software Engineering paper 1 vs Software Engineering paper 2:")
print(f"  Distance: {distance_related:.4f}")
print(f"\nSoftware Engineering paper vs Database paper:")
print(f"  Distance: {distance_unrelated:.4f}")
print(f"\nThe related SE papers are {distance_unrelated/distance_related:.2f}x closer")
Software Engineering paper 1 vs Software Engineering paper 2:
  Distance: 0.8431

Software Engineering paper vs Database paper:
  Distance: 1.2538

The related SE papers are 1.49x closer

The distance correctly identifies that papers from the same category are closer to each other than papers from different categories. The related papers have a much lower distance.

To calculate the distance from one paper to all the others, we can use scikit-learn:

from sklearn.metrics.pairwise import euclidean_distances

# Calculate distance from one paper to all others
query_embedding = embeddings[paper_idx_1].reshape(1, -1)
all_distances = euclidean_distances(query_embedding, embeddings)

# Get top 10 (lowest distances = most similar)
top_indices = np.argsort(all_distances[0])[1:11]

print(f"Query paper: {df['title'].iloc[paper_idx_1]}\n")
print("Top 10 papers by Euclidean distance (lowest = most similar):")
for rank, idx in enumerate(top_indices, 1):
    print(f"{rank}. [{all_distances[0][idx]:.4f}] {df['title'].iloc[idx][:50]}...")
Query paper: Can Large Language Models Detect Real-World Android Software Compliance Violations?

Top 10 papers by Euclidean distance (lowest = most similar):
1. [0.8431] GDPR-Bench-Android: A Benchmark for Evaluating Aut...
2. [1.0168] An Empirical Study of LLM-Based Code Clone Detecti...
3. [1.0218] LLM-as-a-Judge is Bad, Based on AI Attempting the ...
4. [1.0541] BengaliMoralBench: A Benchmark for Auditing Moral ...
5. [1.0677] Exploring the Feasibility of End-to-End Large Lang...
6. [1.0730] Where Do LLMs Still Struggle? An In-Depth Analysis...
7. [1.0730] Where Do LLMs Still Struggle? An In-Depth Analysis...
8. [1.0763] EvoDev: An Iterative Feature-Driven Framework for ...
9. [1.0766] Watermarking Large Language Models in Europe: Inte...
10. [1.0814] One Battle After Another: Probing LLMs' Limits on ...

Euclidean distance is intuitive and works well for many applications. Now let's learn about dot product, which takes a different approach to measuring similarity.

Implementing Dot Product Similarity

The dot product is simpler than Euclidean distance because it doesn't involve taking square roots or differences. Instead, we multiply corresponding elements and sum them up. The formula is:

$$\text{dot product} = \mathbf{A} \cdot \mathbf{B} = \sum_{i=1}^{n} A_i B_i$$

The dot product considers both the angle between vectors and their magnitudes. When vectors point in similar directions, the products of corresponding elements tend to be positive and large, resulting in a high dot product. When vectors point in different directions, some products are positive and some negative, and they tend to cancel out, resulting in a lower dot product. Higher scores mean higher similarity.

The dot product works particularly well when embeddings have been normalized to similar lengths. Many embedding APIs like Cohere and OpenAI produce normalized embeddings by default. However, some open-source frameworks (like sentence-transformers or instructor) require you to explicitly set normalization parameters. Always check your embedding model's documentation to understand whether normalization is applied automatically or needs to be configured.

Let's implement it:

def dot_product_similarity_manual(vec1, vec2):
    """
    Calculate dot product between two vectors.

    Parameters:
    -----------
    vec1, vec2 : numpy arrays
        The vectors to compare

    Returns:
    --------
    float
        Dot product score (higher means more similar)
    """
    # np.dot multiplies corresponding elements and sums them
    # This directly implements the dot product formula
    return np.dot(vec1, vec2)

# Compare the same papers using dot product
similarity_dot = dot_product_similarity_manual(embeddings[paper_idx_1],
                                               embeddings[paper_idx_2])

print(f"Comparing the same two papers:")
print(f"  Dot product: {similarity_dot:.4f}")
Comparing the same two papers:
  Dot product: 0.6446

Keep this number in mind. When we calculate cosine similarity next, you'll see why dot product works so well for these embeddings.

For search across all papers, we can use NumPy's matrix multiplication:

# Efficient dot product for one query against all papers
query_embedding = embeddings[paper_idx_1]
all_dot_products = np.dot(embeddings, query_embedding)

# Get top 10 results
top_indices = np.argsort(all_dot_products)[::-1][1:11]

print(f"Query paper: {df['title'].iloc[paper_idx_1]}\n")
print("Top 10 papers by dot product similarity:")
for rank, idx in enumerate(top_indices, 1):
    print(f"{rank}. [{all_dot_products[idx]:.4f}] {df['title'].iloc[idx][:50]}...")
Query paper: Can Large Language Models Detect Real-World Android Software Compliance Violations?

Top 10 papers by dot product similarity:
1. [0.6446] GDPR-Bench-Android: A Benchmark for Evaluating Aut...
2. [0.4831] An Empirical Study of LLM-Based Code Clone Detecti...
3. [0.4779] LLM-as-a-Judge is Bad, Based on AI Attempting the ...
4. [0.4445] BengaliMoralBench: A Benchmark for Auditing Moral ...
5. [0.4300] Exploring the Feasibility of End-to-End Large Lang...
6. [0.4243] Where Do LLMs Still Struggle? An In-Depth Analysis...
7. [0.4243] Where Do LLMs Still Struggle? An In-Depth Analysis...
8. [0.4208] EvoDev: An Iterative Feature-Driven Framework for ...
9. [0.4204] Watermarking Large Language Models in Europe: Inte...
10. [0.4153] One Battle After Another: Probing LLMs' Limits on ...

Notice that the rankings are identical to those from Euclidean distance! This happens because both metrics capture similar relationships in the data, just measured differently. This won't always be the case with all embedding models, but it's common when embeddings are well-normalized.

Implementing Cosine Similarity

Cosine similarity is the most commonly used metric for text embeddings. It measures the angle between vectors rather than their distance. If two vectors point in the same direction, they're similar, regardless of how long they are.

The formula looks like this:

$$\text{Cosine Similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| |\mathbf{B}|}=\frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$$

Where $\mathbf{A}$ and $\mathbf{B}$ are our two vectors, $\mathbf{A} \cdot \mathbf{B}$ is the dot product, and $|\mathbf{A}|$ represents the magnitude (or length) of vector $\mathbf{A}$.

The result ranges from -1 to 1:

  • 1 means the vectors point in exactly the same direction (identical meaning)
  • 0 means the vectors are perpendicular (unrelated)
  • -1 means the vectors point in opposite directions (opposite meaning)

For text embeddings, you'll typically see values between 0 and 1 because embeddings rarely point in completely opposite directions.

Let's implement this using NumPy:

def cosine_similarity_manual(vec1, vec2):
    """
    Calculate cosine similarity between two vectors.

    Parameters:
    -----------
    vec1, vec2 : numpy arrays
        The vectors to compare

    Returns:
    --------
    float
        Cosine similarity score between -1 and 1
    """
    # Calculate dot product (numerator)
    dot_product = np.dot(vec1, vec2)

    # Calculate magnitudes using np.linalg.norm (denominator)
    # np.linalg.norm computes sqrt(sum of squared values)
    magnitude1 = np.linalg.norm(vec1)
    magnitude2 = np.linalg.norm(vec2)

    # Divide dot product by product of magnitudes
    similarity = dot_product / (magnitude1 * magnitude2)

    return similarity

# Test with our software engineering papers
similarity = cosine_similarity_manual(embeddings[paper_idx_1],
                                     embeddings[paper_idx_2])

print(f"Comparing two papers:")
print(f"Paper 1: {df['title'].iloc[paper_idx_1][:50]}...")
print(f"Paper 2: {df['title'].iloc[paper_idx_2][:50]}...")
print(f"\nCosine similarity: {similarity:.4f}")
Comparing two papers:
Paper 1: Can Large Language Models Detect Real-World Androi...
Paper 2: GDPR-Bench-Android: A Benchmark for Evaluating Aut...

Cosine similarity: 0.6446

The cosine similarity (0.6446) is identical to the dot product we calculated earlier. This isn't a coincidence. Cohere's embeddings are normalized to unit length, which means the dot product and cosine similarity are mathematically equivalent for these vectors. When embeddings are normalized, the denominator in the cosine formula (the product of the vector magnitudes) always equals 1, leaving just the dot product. This is why many vector databases prefer dot product for normalized embeddings. It's computationally cheaper and produces identical results to cosine.
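
You can check this on your own embeddings. If they're unit-normalized, every row norm should come out at, or very close to, 1.0:

# Verify normalization: compute the length of each embedding vector
norms = np.linalg.norm(embeddings, axis=1)
print(f"Min norm: {norms.min():.4f}, Max norm: {norms.max():.4f}")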

Now let's compare this to a paper from a completely different category:

# Compare a software engineering paper to a database paper
similarity_related = cosine_similarity_manual(embeddings[paper_idx_1],
                                             embeddings[paper_idx_2])
similarity_unrelated = cosine_similarity_manual(embeddings[paper_idx_1],
                                               embeddings[paper_idx_3])

print(f"Software Engineering paper 1 vs Software Engineering paper 2:")
print(f"  Similarity: {similarity_related:.4f}")
print(f"\nSoftware Engineering paper vs Database paper:")
print(f"  Similarity: {similarity_unrelated:.4f}")
print(f"\nThe SE papers are {similarity_related/similarity_unrelated:.2f}x more similar")
Software Engineering paper 1 vs Software Engineering paper 2:
  Similarity: 0.6446

Software Engineering paper vs Database paper:
  Similarity: 0.2140

The SE papers are 3.01x more similar

Great! The similarity score correctly identifies that papers from the same category are much more similar to each other than papers from different categories.

Now, calculating similarity one pair at a time is fine for understanding, but it's not practical for search. We need to compare a query against all 500 papers efficiently. Let's use scikit-learn's optimized implementation:

from sklearn.metrics.pairwise import cosine_similarity

# Calculate similarity between one paper and all other papers
query_embedding = embeddings[paper_idx_1].reshape(1, -1)
all_similarities = cosine_similarity(query_embedding, embeddings)

# Get the top 10 most similar papers (excluding the query itself)
top_indices = np.argsort(all_similarities[0])[::-1][1:11]

print(f"Query paper: {df['title'].iloc[paper_idx_1]}\n")
print("Top 10 most similar papers:")
for rank, idx in enumerate(top_indices, 1):
    print(f"{rank}. [{all_similarities[0][idx]:.4f}] {df['title'].iloc[idx][:50]}...")
Query paper: Can Large Language Models Detect Real-World Android Software Compliance Violations?

Top 10 most similar papers:
1. [0.6446] GDPR-Bench-Android: A Benchmark for Evaluating Aut...
2. [0.4831] An Empirical Study of LLM-Based Code Clone Detecti...
3. [0.4779] LLM-as-a-Judge is Bad, Based on AI Attempting the ...
4. [0.4445] BengaliMoralBench: A Benchmark for Auditing Moral ...
5. [0.4300] Exploring the Feasibility of End-to-End Large Lang...
6. [0.4243] Where Do LLMs Still Struggle? An In-Depth Analysis...
7. [0.4243] Where Do LLMs Still Struggle? An In-Depth Analysis...
8. [0.4208] EvoDev: An Iterative Feature-Driven Framework for ...
9. [0.4204] Watermarking Large Language Models in Europe: Inte...
10. [0.4153] One Battle After Another: Probing LLMs' Limits on ...

Notice how scikit-learn's cosine_similarity function is much cleaner. It handles the reshaping and broadcasts the calculation efficiently across all papers. This is what you'll use in production code, but understanding the manual implementation helps you see what's happening under the hood.

You might notice papers 6 and 7 appear to be duplicates with identical scores. This happens because the same paper was cross-listed in multiple arXiv categories. In a production system, you'd typically de-duplicate results using a stable identifier like the arXiv ID, showing each unique paper only once while perhaps noting all its categories.
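
Here's a minimal de-duplication sketch. It assumes your metadata has a stable identifier column; the 'arxiv_id' name is hypothetical, so swap in whatever your CSV actually contains (falling back to the title otherwise):

# Walk results in ranked order and keep only the first occurrence of each paper
id_column = 'arxiv_id' if 'arxiv_id' in df.columns else 'title'

seen, deduped = set(), []
for idx in np.argsort(all_similarities[0])[::-1][1:]:  # skip the query paper itself
    paper_id = df[id_column].iloc[idx]
    if paper_id in seen:
        continue
    seen.add(paper_id)
    deduped.append(idx)
    if len(deduped) == 10:
        break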

Choosing the Right Metric for Your Use Case

Now that we've implemented all three metrics, let's understand when to use each one. Here's a practical comparison:

  • Euclidean Distance. When to use: when the absolute position in vector space matters, or for scientific computing applications. Advantages: intuitive geometric interpretation; common in general machine learning tasks beyond NLP. Considerations: lower scores mean higher similarity (inverse relationship); can be sensitive to vector magnitude.
  • Dot Product. When to use: when embeddings are already normalized to unit length; common in vector databases. Advantages: fastest computation; identical rankings to cosine for normalized vectors; many vector DBs optimize for this. Considerations: only equivalent to cosine when vectors are normalized; check your embedding model's documentation.
  • Cosine Similarity. When to use: the default choice for text embeddings, whenever you care about semantic similarity regardless of document length. Advantages: most common in NLP; normalized by default (outputs 0 to 1); works well with sentence-transformers and most embedding APIs. Considerations: requires a normalization calculation; slightly more computationally expensive than dot product.

Going forward, we'll use cosine similarity because it's the standard for text embeddings and produces interpretable scores between 0 and 1.

Let's verify that our embeddings produce consistent rankings across metrics:

# Compare rankings from all three metrics for a single query
query_embedding = embeddings[paper_idx_1].reshape(1, -1)

# Calculate similarities/distances
cosine_scores = cosine_similarity(query_embedding, embeddings)[0]
dot_scores = np.dot(embeddings, embeddings[paper_idx_1])
euclidean_scores = euclidean_distances(query_embedding, embeddings)[0]

# Get top 10 indices for each metric
top_cosine = set(np.argsort(cosine_scores)[::-1][1:11])
top_dot = set(np.argsort(dot_scores)[::-1][1:11])
top_euclidean = set(np.argsort(euclidean_scores)[1:11])

# Calculate overlap
cosine_dot_overlap = len(top_cosine & top_dot)
cosine_euclidean_overlap = len(top_cosine & top_euclidean)
all_three_overlap = len(top_cosine & top_dot & top_euclidean)

print(f"Top 10 papers overlap between metrics:")
print(f"  Cosine & Dot Product: {cosine_dot_overlap}/10 papers match")
print(f"  Cosine & Euclidean: {cosine_euclidean_overlap}/10 papers match")
print(f"  All three metrics: {all_three_overlap}/10 papers match")
Top 10 papers overlap between metrics:
  Cosine & Dot Product: 10/10 papers match
  Cosine & Euclidean: 10/10 papers match
  All three metrics: 10/10 papers match

For our Cohere embeddings with these 500 papers, all three metrics produce identical top-10 rankings. This happens when embeddings are well-normalized, but isn't guaranteed across all embedding models or datasets. What matters more than perfect metric agreement is understanding what each metric measures and when to use it.

Building Your Search Function

Now let's build a complete semantic search function that ties everything together. This function will take a natural language query, convert it into an embedding, and return the most relevant papers.

Before building our search function, ensure your Cohere API key is configured. As we did in the previous tutorial, you should have a .env file in your project directory with your API key:

COHERE_API_KEY=your_key_here

Now let's build the search function:

from cohere import ClientV2
from dotenv import load_dotenv
import os

# Load Cohere API key
load_dotenv()
cohere_api_key = os.getenv('COHERE_API_KEY')
co = ClientV2(api_key=cohere_api_key)

def semantic_search(query, embeddings, df, top_k=5, metric='cosine'):
    """
    Search for papers semantically similar to a query.

    Parameters:
    -----------
    query : str
        Natural language search query
    embeddings : numpy array
        Pre-computed embeddings for all papers
    df : pandas DataFrame
        DataFrame containing paper metadata
    top_k : int
        Number of results to return
    metric : str
        Similarity metric to use ('cosine', 'dot', or 'euclidean')

    Returns:
    --------
    pandas DataFrame
        Top results with similarity scores
    """
    # Generate embedding for the query
    response = co.embed(
        texts=[query],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0]).reshape(1, -1)

    # Calculate similarities based on chosen metric
    if metric == 'cosine':
        scores = cosine_similarity(query_embedding, embeddings)[0]
        top_indices = np.argsort(scores)[::-1][:top_k]
    elif metric == 'dot':
        scores = np.dot(embeddings, query_embedding.flatten())
        top_indices = np.argsort(scores)[::-1][:top_k]
    elif metric == 'euclidean':
        scores = euclidean_distances(query_embedding, embeddings)[0]
        top_indices = np.argsort(scores)[:top_k]
        scores = 1 / (1 + scores)
    else:
        raise ValueError(f"Unknown metric: {metric}")

    # Create results DataFrame
    results = df.iloc[top_indices].copy()
    results['similarity_score'] = scores[top_indices]
    results = results[['title', 'category', 'similarity_score', 'abstract']]

    return results

# Test the search function
query = "query optimization algorithms"
results = semantic_search(query, embeddings, df, top_k=5)

separator = "=" * 80
print(f"Query: '{query}'\n")
print(f"Top 5 most relevant papers:\n{separator}")
for idx, row in results.iterrows():
    print(f"\n{row['title']}")
    print(f"Category: {row['category']} | Similarity: {row['similarity_score']:.4f}")
    print(f"Abstract: {row['abstract'][:150]}...")
Query: 'query optimization algorithms'

Top 5 most relevant papers:
================================================================================

Query Optimization in the Wild: Realities and Trends
Category: cs.DB | Similarity: 0.4206
Abstract: For nearly half a century, the core design of query optimizers in industrial database systems has remained remarkably stable, relying on foundational
...

Hybrid Mixed Integer Linear Programming for Large-Scale Join Order Optimisation
Category: cs.DB | Similarity: 0.3795
Abstract: Finding optimal join orders is among the most crucial steps to be performed by query optimisers. Though extensively studied in data management researc...

One Join Order Does Not Fit All: Reducing Intermediate Results with Per-Split Query Plans
Category: cs.DB | Similarity: 0.3682
Abstract: Minimizing intermediate results is critical for efficient multi-join query processing. Although the seminal Yannakakis algorithm offers strong guarant...

PathFinder: Efficiently Supporting Conjunctions and Disjunctions for Filtered Approximate Nearest Neighbor Search
Category: cs.DB | Similarity: 0.3673
Abstract: Filtered approximate nearest neighbor search (ANNS) restricts the search to data objects whose attributes satisfy a given filter and retrieves the top...

Fine-Grained Dichotomies for Conjunctive Queries with Minimum or Maximum
Category: cs.DB | Similarity: 0.3666
Abstract: We investigate the fine-grained complexity of direct access to Conjunctive Query (CQ) answers according to their position, ordered by the minimum (or
...

Excellent! Our search function found highly relevant papers about query optimization. Notice how all the top results are from the cs.DB (Databases) category and have strong similarity scores.

Before we move on, let's talk about what these similarity scores mean. Notice our top score is around 0.42 rather than 0.85 or higher. This is completely normal for multi-domain datasets. We're working with 500 papers spanning five distinct computer science fields (Machine Learning, Computer Vision, NLP, Databases, Software Engineering). When your dataset covers diverse topics, even genuinely relevant papers show moderate absolute scores because the overall vocabulary space is broad.

If we had a specialized dataset focused narrowly on one topic, say only database query optimization papers, we'd see higher absolute scores. What matters most is relative ranking. The top results are still the most relevant papers, and the ranking accurately reflects semantic similarity. Pay attention to the score differences between results rather than treating specific thresholds as universal truths.

Let's test it with a few more diverse queries to see how well it works across different topics:

# Test multiple queries
test_queries = [
    "language model pretraining",
    "reinforcement learning algorithms",
    "code quality analysis"
]

for query in test_queries:
    print(f"\nQuery: '{query}'\n{separator}")
    results = semantic_search(query, embeddings, df, top_k=3)

    for idx, row in results.iterrows():
        print(f"  [{row['similarity_score']:.4f}] {row['title'][:50]}...")
        print(f"           Category: {row['category']}")
Query: 'language model pretraining'
================================================================================
  [0.4240] Reusing Pre-Training Data at Test Time is a Comput...
           Category: cs.CL
  [0.4102] Evo-1: Lightweight Vision-Language-Action Model wi...
           Category: cs.CV
  [0.3910] PixCLIP: Achieving Fine-grained Visual Language Un...
           Category: cs.CV

Query: 'reinforcement learning algorithms'
================================================================================
  [0.3477] Exchange Policy Optimization Algorithm for Semi-In...
           Category: cs.LG
  [0.3429] Fitting Reinforcement Learning Model to Behavioral...
           Category: cs.LG
  [0.3091] Online Algorithms for Repeated Optimal Stopping: A...
           Category: cs.LG

Query: 'code quality analysis'
================================================================================
  [0.3762] From Code Changes to Quality Gains: An Empirical S...
           Category: cs.SE
  [0.3662] Speed at the Cost of Quality? The Impact of LLM Ag...
           Category: cs.SE
  [0.3502] A Systematic Literature Review of Code Hallucinati...
           Category: cs.SE

The search function correctly identifies relevant papers for each query. The language model query returns papers about training language models. The reinforcement learning query returns papers about RL algorithms. The code quality query returns papers about testing and technical debt.

Notice how the semantic search understands the meaning behind the queries, not just keyword matching. Even though our queries use natural language, the system finds papers that match the intent.

Visualizing Search Results in Embedding Space

We've seen the search function work, but let's visualize what's actually happening in embedding space. This will help you understand why certain papers are retrieved for a given query. We'll use PCA to reduce our embeddings to 2D and show how the query relates spatially to its results:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def visualize_search_results(query, embeddings, df, top_k=10):
    """
    Visualize search results in 2D embedding space.
    """
    # Get search results
    response = co.embed(
        texts=[query],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])

    # Calculate similarities
    similarities = cosine_similarity(query_embedding.reshape(1, -1), embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:top_k]

    # Combine query embedding with all paper embeddings for PCA
    all_embeddings_with_query = np.vstack([query_embedding, embeddings])

    # Reduce to 2D
    pca = PCA(n_components=2)
    embeddings_2d = pca.fit_transform(all_embeddings_with_query)

    # Split back into query and papers
    query_2d = embeddings_2d[0]
    papers_2d = embeddings_2d[1:]

    # Create visualization
    plt.figure(figsize=(8, 6))

    # Define colors for categories
    colors = ['#C8102E', '#003DA5', '#00843D', '#FF8200', '#6A1B9A']
    category_codes = ['cs.LG', 'cs.CV', 'cs.CL', 'cs.DB', 'cs.SE']
    category_names = ['Machine Learning', 'Computer Vision', 'Comp. Linguistics',
                     'Databases', 'Software Eng.']

    # Plot all papers with subtle colors
    for i, (cat_code, cat_name, color) in enumerate(zip(category_codes,
                                                         category_names, colors)):
        mask = df['category'] == cat_code
        cat_embeddings = papers_2d[mask]
        plt.scatter(cat_embeddings[:, 0], cat_embeddings[:, 1],
                   c=color, label=cat_name, s=30, alpha=0.3, edgecolors='none')

    # Highlight top results
    top_embeddings = papers_2d[top_indices]
    plt.scatter(top_embeddings[:, 0], top_embeddings[:, 1],
               c='black', s=150, alpha=0.6, edgecolors='yellow', linewidth=2,
               marker='o', label=f'Top {top_k} Results', zorder=5)

    # Plot query point
    plt.scatter(query_2d[0], query_2d[1],
               c='red', s=400, alpha=0.9, edgecolors='black', linewidth=2,
               marker='*', label='Query', zorder=10)

    # Draw lines from query to top results
    for idx in top_indices:
        plt.plot([query_2d[0], papers_2d[idx, 0]],
                [query_2d[1], papers_2d[idx, 1]],
                'k--', alpha=0.2, linewidth=1, zorder=1)

    plt.xlabel('First Principal Component', fontsize=12)
    plt.ylabel('Second Principal Component', fontsize=12)
    plt.title(f'Search Results for: "{query}"\n' +
             '(Query shown as red star, top results highlighted)',
             fontsize=14, fontweight='bold', pad=20)
    plt.legend(loc='best', fontsize=10)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

    # Print the top results
    print(f"\nTop {top_k} results for query: '{query}'\n{separator}")
    for rank, idx in enumerate(top_indices, 1):
        print(f"{rank}. [{similarities[idx]:.4f}] {df['title'].iloc[idx][:50]}...")
        print(f"   Category: {df['category'].iloc[idx]}")

# Visualize search results for a query
query = "language model pretraining"
visualize_search_results(query, embeddings, df, top_k=10)

[Figure: 2D PCA plot of all 500 paper embeddings, with the query "language model pretraining" shown as a red star and its top 10 results highlighted]

Top 10 results for query: 'language model pretraining'
================================================================================
1. [0.4240] Reusing Pre-Training Data at Test Time is a Comput...
   Category: cs.CL
2. [0.4102] Evo-1: Lightweight Vision-Language-Action Model wi...
   Category: cs.CV
3. [0.3910] PixCLIP: Achieving Fine-grained Visual Language Un...
   Category: cs.CV
4. [0.3713] PLLuM: A Family of Polish Large Language Models...
   Category: cs.CL
5. [0.3712] SCALE: Upscaled Continual Learning of Large Langua...
   Category: cs.CL
6. [0.3528] Q3R: Quadratic Reweighted Rank Regularizer for Eff...
   Category: cs.LG
7. [0.3334] LLMs and Cultural Values: the Impact of Prompt Lan...
   Category: cs.CL
8. [0.3297] TwIST: Rigging the Lottery in Transformers with In...
   Category: cs.LG
9. [0.3278] IndicSuperTokenizer: An Optimized Tokenizer for In...
   Category: cs.CL
10. [0.3157] Bearing Syntactic Fruit with Stack-Augmented Neura...
   Category: cs.CL

This visualization reveals exactly what's happening during semantic search. The red star represents your query embedding. The black circles highlighted in yellow are the top 10 results. The dotted lines connect the query to its top 10 matches, showing the spatial relationships.

Notice how most of the top results cluster near the query in embedding space. The majority are from the Computational Linguistics category (the green cluster), which makes perfect sense for a query about language model pretraining. Papers from other categories sit farther away in the visualization, corresponding to their lower similarity scores.

You might notice some papers that appear visually closer to the red query star aren't in our top 10 results. This happens because PCA compresses 1536 dimensions down to just 2 for visualization. This lossy compression can't perfectly preserve all distance relationships from the original high-dimensional space. The similarity scores we display are calculated in the full 1536-dimensional embedding space before PCA, which is why they're more accurate than visual proximity in this 2D plot. Think of the visualization as showing general clustering patterns rather than exact rankings.

This spatial representation makes the abstract concept of similarity concrete. High similarity scores mean points that are close together in the original high-dimensional embedding space. When we say two papers are semantically similar, we're saying their embeddings point in similar directions.

Let's try another visualization with a different query:

# Try a more specific query
query = "reinforcement learning algorithms"
visualize_search_results(query, embeddings, df, top_k=10)

[Figure: 2D PCA plot for the query "reinforcement learning algorithms", with the query star landing in the Machine Learning cluster and its top 10 results highlighted]

Top 10 results for query: 'reinforcement learning algorithms'
================================================================================
1. [0.3477] Exchange Policy Optimization Algorithm for Semi-In...
   Category: cs.LG
2. [0.3429] Fitting Reinforcement Learning Model to Behavioral...
   Category: cs.LG
3. [0.3091] Online Algorithms for Repeated Optimal Stopping: A...
   Category: cs.LG
4. [0.3062] DeepPAAC: A New Deep Galerkin Method for Principal...
   Category: cs.LG
5. [0.2970] Environment Agnostic Goal-Conditioning, A Study of...
   Category: cs.LG
6. [0.2925] Forgetting is Everywhere...
   Category: cs.LG
7. [0.2865] RLHF: A comprehensive Survey for Cultural, Multimo...
   Category: cs.CL
8. [0.2857] RLHF: A comprehensive Survey for Cultural, Multimo...
   Category: cs.LG
9. [0.2827] End-to-End Reinforcement Learning of Koopman Model...
   Category: cs.LG
10. [0.2813] GrowthHacker: Automated Off-Policy Evaluation Opti...
   Category: cs.SE

This visualization shows clear clustering around the Machine Learning region (red cluster), and most top results are ML papers about reinforcement learning. The query star lands right in the middle of where we'd expect for an RL-focused query, and the top results fan out from there in the embedding space.

Use these visualizations to spot broad trends (like whether your query lands in the right category cluster), not to validate exact rankings. The rankings come from measuring distances in all 1536 dimensions, while the visualization shows only 2.

Evaluating Search Quality

How do we know if our search system is working well? In production systems, you'd use quantitative metrics like Precision@K, Recall@K, or Mean Average Precision (MAP). These metrics require labeled relevance judgments where humans mark which papers are relevant for specific queries.
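
For reference, Precision@K itself is easy to compute once you have those judgments. Here's a minimal sketch; the relevance labels in the usage comment are made up for illustration:

def precision_at_k(retrieved_indices, relevant_ids, k=10):
    """Fraction of the top-k retrieved papers that a human judged relevant."""
    top_k = list(retrieved_indices)[:k]
    hits = sum(1 for idx in top_k if idx in relevant_ids)
    return hits / k

# Hypothetical usage: suppose papers 12, 87, and 301 were judged relevant for a query
# ranked = np.argsort(cosine_similarity(query_embedding, embeddings)[0])[::-1]
# print(precision_at_k(ranked, relevant_ids={12, 87, 301}, k=10))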

For this tutorial, we'll use qualitative evaluation. Let's examine results for a query and assess whether they make sense:

# Detailed evaluation of a single query
query = "anomaly detection techniques"
results = semantic_search(query, embeddings, df, top_k=10)

print(f"Query: '{query}'\n")
print(f"Detailed Results:\n{separator}")

for rank, (idx, row) in enumerate(results.iterrows(), 1):
    print(f"\nRank {rank} | Similarity: {row['similarity_score']:.4f}")
    print(f"Title: {row['title']}")
    print(f"Category: {row['category']}")
    print(f"Abstract: {row['abstract'][:200]}...")
    print("-" * 80)
Query: 'anomaly detection techniques'

Detailed Results:
================================================================================

Rank 1 | Similarity: 0.3895
Title: An Encode-then-Decompose Approach to Unsupervised Time Series Anomaly Detection on Contaminated Training Data--Extended Version
Category: cs.DB
Abstract: Time series anomaly detection is important in modern large-scale systems and is applied in a variety of domains to analyze and monitor the operation of
diverse systems. Unsupervised approaches have re...
--------------------------------------------------------------------------------

Rank 2 | Similarity: 0.3268
Title: DeNoise: Learning Robust Graph Representations for Unsupervised Graph-Level Anomaly Detection
Category: cs.LG
Abstract: With the rapid growth of graph-structured data in critical domains,
unsupervised graph-level anomaly detection (UGAD) has become a pivotal task.
UGAD seeks to identify entire graphs that deviate from ...
--------------------------------------------------------------------------------

Rank 3 | Similarity: 0.3218
Title: IEC3D-AD: A 3D Dataset of Industrial Equipment Components for Unsupervised Point Cloud Anomaly Detection
Category: cs.CV
Abstract: 3D anomaly detection (3D-AD) plays a critical role in industrial
manufacturing, particularly in ensuring the reliability and safety of core
equipment components. Although existing 3D datasets like Rea...
--------------------------------------------------------------------------------

Rank 4 | Similarity: 0.3085
Title: Conditional Score Learning for Quickest Change Detection in Markov Transition Kernels
Category: cs.LG
Abstract: We address the problem of quickest change detection in Markov processes with unknown transition kernels. The key idea is to learn the conditional score
$nabla_{\mathbf{y}} \log p(\mathbf{y}|\mathbf{x...
--------------------------------------------------------------------------------

Rank 5 | Similarity: 0.3053
Title: Multiscale Astrocyte Network Calcium Dynamics for Biologically Plausible Intelligence in Anomaly Detection
Category: cs.LG
Abstract: Network anomaly detection systems encounter several challenges with
traditional detectors trained offline. They become susceptible to concept drift
and new threats such as zero-day or polymorphic atta...
--------------------------------------------------------------------------------

Rank 6 | Similarity: 0.2907
Title: I Detect What I Don't Know: Incremental Anomaly Learning with Stochastic Weight Averaging-Gaussian for Oracle-Free Medical Imaging
Category: cs.CV
Abstract: Unknown anomaly detection in medical imaging remains a fundamental challenge due to the scarcity of labeled anomalies and the high cost of expert
supervision. We introduce an unsupervised, oracle-free...
--------------------------------------------------------------------------------

Rank 7 | Similarity: 0.2901
Title: Adaptive Detection of Software Aging under Workload Shift
Category: cs.SE
Abstract: Software aging is a phenomenon that affects long-running systems, leading to progressive performance degradation and increasing the risk of failures. To
mitigate this problem, this work proposes an ad...
--------------------------------------------------------------------------------

Rank 8 | Similarity: 0.2763
Title: The Impact of Data Compression in Real-Time and Historical Data Acquisition Systems on the Accuracy of Analytical Solutions
Category: cs.DB
Abstract: In industrial and IoT environments, massive amounts of real-time and
historical process data are continuously generated and archived. With sensors
and devices capturing every operational detail, the v...
--------------------------------------------------------------------------------

Rank 9 | Similarity: 0.2570
Title: A Large Scale Study of AI-based Binary Function Similarity Detection Techniques for Security Researchers and Practitioners
Category: cs.SE
Abstract: Binary Function Similarity Detection (BFSD) is a foundational technique in software security, underpinning a wide range of applications including
vulnerability detection, malware analysis. Recent adva...
--------------------------------------------------------------------------------

Rank 10 | Similarity: 0.2418
Title: Fraud-Proof Revenue Division on Subscription Platforms
Category: cs.LG
Abstract: We study a model of subscription-based platforms where users pay a fixed fee for unlimited access to content, and creators receive a share of the revenue. Existing approaches to detecting fraud predom...
--------------------------------------------------------------------------------

Looking at these results, we can assess quality by asking:

  • Are the results relevant to the query? Yes! All papers discuss anomaly detection techniques and methods.
  • Are similarity scores meaningful? Yes! Higher-ranked papers are more directly relevant to the query.
  • Does the ranking make sense? Yes! The top result is specifically about time series anomaly detection, which directly matches our query.

Let's look at what similarity score thresholds might indicate:

# Analyze the distribution of similarity scores
query = "query optimization algorithms"
results = semantic_search(query, embeddings, df, top_k=50)

print(f"Query: '{query}'")
print(f"\nSimilarity score distribution for top 50 results:")
print(f"  Highest score: {results['similarity_score'].max():.4f}")
print(f"  Median score: {results['similarity_score'].median():.4f}")
print(f"  Lowest score: {results['similarity_score'].min():.4f}")

# Show how scores change with rank
print(f"\nScore decay by rank:")
for rank in [1, 5, 10, 20, 30, 40, 50]:
    score = results['similarity_score'].iloc[rank-1]
    print(f"  Rank {rank:2d}: {score:.4f}")
Query: 'query optimization algorithms'

Similarity score distribution for top 50 results:
  Highest score: 0.4206
  Median score: 0.2765
  Lowest score: 0.2402

Score decay by rank:
  Rank  1: 0.4206
  Rank  5: 0.3666
  Rank 10: 0.3144
  Rank 20: 0.2910
  Rank 30: 0.2737
  Rank 40: 0.2598
  Rank 50: 0.2402

Similarity score interpretation depends heavily on your dataset characteristics. Here are general heuristics, but they require adjustment based on your specific data:

For broad, multi-domain datasets (like ours with 5 distinct categories):

  • 0.40+: Highly relevant
  • 0.30-0.40: Very relevant
  • 0.25-0.30: Moderately relevant
  • Below 0.25: Questionable relevance

For narrow, specialized datasets (single domain):

  • 0.70+: Highly relevant
  • 0.60-0.70: Very relevant
  • 0.50-0.60: Moderately relevant
  • Below 0.50: Questionable relevance

The key is understanding relative rankings within your dataset rather than treating these as universal thresholds. Our multi-domain dataset naturally produces lower absolute scores than a specialized single-topic dataset would. What matters is that the top results are genuinely more relevant than lower-ranked results.
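
If you want to experiment with cutoffs on our dataset, here's a minimal sketch using the semantic_search function from earlier (the 0.30 value is just an example, not a universal rule):

# Apply a minimum-similarity cutoff to a larger result set
results = semantic_search("query optimization algorithms", embeddings, df, top_k=50)
filtered = results[results['similarity_score'] >= 0.30]
print(f"{len(filtered)} of {len(results)} results pass the 0.30 cutoff")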

Testing Edge Cases

A good search system should handle different types of queries gracefully. Let's test some edge cases:

# Test 1: Very specific technical query
print(f"Test 1: Highly Specific Query\n{separator}")
query = "graph neural networks for molecular property prediction"
results = semantic_search(query, embeddings, df, top_k=3)
print(f"Query: '{query}'\n")
for idx, row in results.iterrows():
    print(f"  [{row['similarity_score']:.4f}] {row['title'][:50]}...")

# Test 2: Broad general query
print(f"\n\nTest 2: Broad General Query\n{separator}")
query = "artificial intelligence"
results = semantic_search(query, embeddings, df, top_k=3)
print(f"Query: '{query}'\n")
for idx, row in results.iterrows():
    print(f"  [{row['similarity_score']:.4f}] {row['title'][:50]}...")

# Test 3: Query with common words
print(f"\n\nTest 3: Common Words Query\n{separator}")
query = "learning from data"
results = semantic_search(query, embeddings, df, top_k=3)
print(f"Query: '{query}'\n")
for idx, row in results.iterrows():
    print(f"  [{row['similarity_score']:.4f}] {row['title'][:50]}...")
Test 1: Highly Specific Query
================================================================================
Query: 'graph neural networks for molecular property prediction'

  [0.3602] ScaleDL: Towards Scalable and Efficient Runtime Pr...
  [0.3072] RELATE: A Schema-Agnostic Perceiver Encoder for Mu...
  [0.3032] Dark Energy Survey Year 3 results: Simulation-base...

Test 2: Broad General Query
================================================================================
Query: 'artificial intelligence'

  [0.3202] Lessons Learned from the Use of Generative AI in E...
  [0.3137] AI for Distributed Systems Design: Scalable Cloud ...
  [0.3096] SmartMLOps Studio: Design of an LLM-Integrated IDE...

Test 3: Common Words Query
================================================================================
Query: 'learning from data'

  [0.2912] PrivacyCD: Hierarchical Unlearning for Protecting ...
  [0.2879] Learned Static Function Data Structures...
  [0.2732] REMIND: Input Loss Landscapes Reveal Residual Memo...

Notice what happens:

  • Specific queries return focused, relevant results with higher similarity scores
  • Broad queries return more general papers about AI with moderate scores
  • Common word queries still find relevant content because embeddings understand context

This demonstrates the power of semantic search over keyword matching. A keyword search for "learning from data" would match almost everything, but semantic search understands the intent and returns papers about data-driven learning and optimization.

Understanding Retrieval Quality

Let's create a function to help us understand why certain papers are retrieved for a query:

def explain_search_result(query, paper_idx, embeddings, df):
    """
    Explain why a particular paper was retrieved for a query.
    """
    # Get query embedding
    response = co.embed(
        texts=[query],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])

    # Calculate similarity
    paper_embedding = embeddings[paper_idx]
    similarity = cosine_similarity(
        query_embedding.reshape(1, -1),
        paper_embedding.reshape(1, -1)
    )[0][0]

    # Show the result
    print(f"Query: '{query}'")
    print(f"\nPaper: {df['title'].iloc[paper_idx]}")
    print(f"Category: {df['category'].iloc[paper_idx]}")
    print(f"Similarity Score: {similarity:.4f}")
    print(f"\nAbstract:")
    print(df['abstract'].iloc[paper_idx][:300] + "...")

    # Show how this compares to all papers
    all_similarities = cosine_similarity(
        query_embedding.reshape(1, -1),
        embeddings
    )[0]
    rank = (all_similarities > similarity).sum() + 1

    print(f"\nRanking: {rank}/{len(df)} papers")
    percentage = (len(df) - rank)/len(df)*100
    print(f"This paper is more relevant than {percentage:.1f}% of papers")

# Explain why a specific paper was retrieved
query = "database query optimization"
paper_idx = 322
explain_search_result(query, paper_idx, embeddings, df)
Query: 'database query optimization'

Paper: L2T-Tune:LLM-Guided Hybrid Database Tuning with LHS and TD3
Category: cs.DB
Similarity Score: 0.3374

Abstract:
Configuration tuning is critical for database performance. Although recent
advancements in database tuning have shown promising results in throughput and
latency improvement, challenges remain. First, the vast knob space makes direct
optimization unstable and slow to converge. Second, reinforcement ...

Ranking: 9/500 papers
This paper is more relevant than 98.2% of papers

This explanation shows exactly why the paper was retrieved: it has a solid similarity score (0.3374) and ranks in the top 2% of all papers for this query. The abstract clearly discusses database configuration tuning and optimization, which matches the query intent perfectly.

Applying These Skills to Your Own Projects

You now have a complete semantic search system. The skills you've learned here transfer directly to any domain where you need to find relevant documents based on meaning.

The pattern is always the same:

  1. Collect documents (APIs, databases, file systems)
  2. Generate embeddings (local models or API services)
  3. Store embeddings efficiently (files or vector databases)
  4. Implement similarity calculations (cosine, dot product, or Euclidean)
  5. Build a search function that returns ranked results
  6. Evaluate results to ensure quality

This exact workflow applies whether you're building:

  • A research paper search engine (what we just built)
  • A code search system for documentation
  • A customer support knowledge base
  • A product recommendation system
  • A legal document retrieval system

The only difference is the data source. Everything else remains the same.

Optimizing for Production

Before we wrap up, let's discuss a few optimizations for production systems. We won't implement these now, but knowing they exist will help you scale your search system when needed.

1. Caching Query Embeddings

If users frequently repeat the same searches, caching the query embeddings can save significant API calls and computation time. Store the query text and its embedding in memory or a database. When a user searches, check whether you've already generated an embedding for that exact query. This simple optimization can reduce costs and improve response times, especially for popular searches.
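
A minimal in-memory sketch, reusing the co client from earlier (no eviction or persistence, just the idea):

_query_cache = {}

def embed_query_cached(query):
    """Return a cached embedding if we've already embedded this exact query."""
    if query not in _query_cache:
        response = co.embed(
            texts=[query],
            model='embed-v4.0',
            input_type='search_query',
            embedding_types=['float']
        )
        _query_cache[query] = np.array(response.embeddings.float_[0])
    return _query_cache[query]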

2. Approximate Nearest Neighbors

For datasets with millions of embeddings, exact similarity calculations become slow. Libraries like FAISS, Annoy, or ScaNN provide approximate nearest neighbor search that's much faster. These specialized libraries use clever indexing techniques to quickly find embeddings that are close to your query without calculating the exact distance to every single vector in your database. While we didn't implement this in our tutorial series, it's worth knowing that these tools exist for production systems handling large-scale search.
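
As an illustration only (we don't use it elsewhere in this series), here's roughly what an HNSW index looks like with FAISS, assuming you've installed faiss-cpu and have the embeddings array from earlier:

import faiss
import numpy as np

dim = embeddings.shape[1]
index = faiss.IndexHNSWFlat(dim, 32)        # HNSW graph index with 32 links per node
index.add(embeddings.astype(np.float32))    # FAISS expects float32

query_vec = embeddings[0:1].astype(np.float32)
distances, neighbors = index.search(query_vec, 10)  # approximate top-10 by L2 distance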

3. Batch Query Processing

When processing multiple queries, batch them together for efficiency. Instead of generating embeddings one query at a time, send multiple queries to your embedding API in a single request. Most embedding APIs support batch processing, which reduces network overhead and can be significantly faster than sequential processing. This approach is particularly valuable when you need to process many queries at once, such as during system evaluation or when handling multiple concurrent users.
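
A quick sketch of the idea, assuming a sentence-transformers model (most embedding APIs similarly accept a list of inputs per request):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

queries = [
    "graph neural networks for recommendation",
    "database query optimization",
    "low-resource machine translation",
]

# One call embeds every query together instead of looping one request at a time
query_embeddings = model.encode(queries, batch_size=32)
print(query_embeddings.shape)   # (3, 384) for this model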

4. Vector Database Storage

For production systems, use vector databases (Pinecone, Weaviate, Chroma) rather than files. Vector databases handle indexing, similarity search optimization, and storage efficiency automatically. They also typically store vectors at float32 precision for memory efficiency, something you'd otherwise need to handle manually with file-based storage (converting from float64 to float32 can halve storage requirements with minimal impact on search quality).
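
For a sense of the storage difference mentioned above, a quick NumPy check:

import numpy as np

embeddings_f64 = np.random.default_rng(0).random((10_000, 384))   # float64 by default
embeddings_f32 = embeddings_f64.astype(np.float32)                # half the memory footprint

print(round(embeddings_f64.nbytes / 1e6, 1), "MB")   # 30.7 MB
print(round(embeddings_f32.nbytes / 1e6, 1), "MB")   # 15.4 MB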

5. Document Chunking for Long Content

In our tutorial, we embedded entire paper abstracts as single units. This works fine for abstracts (which are naturally concise), but production systems often process longer documents like full papers, documentation, or articles. Industry best practice is to chunk these into coherent sections (typically 200-1000 tokens per chunk) for optimal semantic fidelity. This ensures each embedding captures a focused concept rather than trying to represent an entire document's diverse topics in a single vector. Modern models with high token limits (8k+ tokens) make this less critical than before, but chunking still improves retrieval quality for longer content.
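
As a rough illustration, here's a simple word-based chunker with overlap; production systems often split by tokens or by document structure (sections, paragraphs) instead:

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks

long_document = "word " * 650
print(len(chunk_text(long_document)))   # 4 overlapping chunks of up to 200 words each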

These optimizations become critical as your system scales, but the core concepts remain the same. Start with the straightforward implementation we've built, then add these optimizations when performance or cost becomes a concern.

What You've Accomplished

You've built a complete semantic search system from the ground up! Let's review what you've learned:

  1. You understand three distance metrics (Euclidean distance, dot product, cosine similarity) and when to use each one.
  2. You can implement similarity calculations both manually (to understand the math) and efficiently (using scikit-learn).
  3. You've built a search function that converts queries to embeddings and returns ranked results.
  4. You can visualize search results in embedding space to understand spatial relationships between queries and documents.
  5. You can evaluate search quality qualitatively by examining results and similarity scores.
  6. You understand how to optimize search systems for production with caching, approximate search, batching, vector databases, and document chunking.

Most importantly, you now have the complete skillset to build your own search engine. You know how to:

  • Collect data from APIs
  • Generate embeddings
  • Calculate semantic similarity
  • Build and evaluate search systems

Next Steps

Before moving on, try experimenting with your search system:

Test different query styles:

  • Try very short queries ("neural nets") vs detailed queries ("applying deep neural networks to computer vision tasks")
  • See how the system handles questions vs keywords
  • Test queries that combine multiple topics

Explore the similarity threshold:

  • Set a minimum similarity threshold (e.g., 0.30) and see how many results pass
  • Test what happens with a very strict threshold (0.40+)
  • Find the sweet spot for your use case

Analyze failure cases:

  • Find queries where the results aren't great
  • Understand why (too broad? too specific? wrong domain?)
  • Think about how you'd improve the system

Compare categories:

  • Search for "deep learning" and see which categories dominate results
  • Try category-specific searches and verify papers match
  • Look for interesting cross-category papers

Visualize different queries:

  • Create visualizations for queries from different domains
  • Observe how the query point moves in embedding space
  • Notice which categories cluster near different types of queries

This experimentation will sharpen your intuition about how semantic search works and prepare you to debug issues in your own projects.


Key Takeaways:

  • Euclidean distance measures straight-line distance between vectors and is the most intuitive metric
  • Dot product multiplies corresponding elements and is computationally efficient
  • Cosine similarity measures the angle between vectors and is the standard for text embeddings
  • For well-normalized embeddings, all three metrics typically produce similar rankings
  • Similarity scores depend on dataset characteristics and should be interpreted relative to your specific data
  • Multi-domain datasets naturally produce lower absolute scores than specialized single-topic datasets
  • Visualizing search results in 2D embedding space helps understand clustering patterns, though exact rankings come from the full high-dimensional space
  • The spatial proximity of embeddings directly corresponds to semantic similarity scores
  • Production search systems benefit from query caching, approximate nearest neighbors, batch processing, vector databases, and document chunking
  • The semantic search pattern (collect, embed, calculate similarity, rank) applies universally across domains
  • Qualitative evaluation through manual inspection is crucial for understanding search quality
  • Edge cases like very broad or very specific queries test the robustness of your search system
  • These skills transfer directly to building search systems in any domain with any type of content

Running and Managing Apache Airflow with Docker (Part II)

8 November 2025 at 02:17

In the previous tutorial, we set up Apache Airflow inside Docker, explored its architecture, and built our first real DAG using the TaskFlow API. We simulated an ETL process with two stages — Extract and Transform, demonstrating how Airflow manages dependencies, task retries, and dynamic parallel execution through Dynamic Task Mapping. By the end, we had a functional, scalable workflow capable of processing multiple datasets in parallel, a key building block for modern data pipelines.

In this tutorial, we’ll build on what you created earlier and take a significant step toward production-style orchestration. You’ll complete the ETL lifecycle by adding the Load stage and connecting Airflow to a local MySQL database. This will allow you to load transformed data directly from your pipeline and manage database connections securely using Airflow’s Connections and Environment Variables.

Beyond data loading, you’ll integrate Git and Git Sync into your Airflow environment to enable version control, collaboration, and continuous deployment of DAGs. These practices mirror how data engineering teams manage Airflow projects in real-world settings, promoting consistency, reliability, and scalability, while still keeping the focus on learning and experimentation.

By the end of this part, your Airflow setup will move beyond a simple sandbox and start resembling a production-aligned environment. You’ll have a complete ETL pipeline, from extraction and transformation to loading and automation, and a clear understanding of how professional teams structure and manage their workflows.

Working with MySQL in Airflow

In the previous section, we built a fully functional Airflow pipeline that dynamically extracted and transformed market data from multiple regions: us, europe, asia, and africa. Each branch of our DAG handled its own extract and transform tasks independently, creating separate CSV files for each region under /opt/airflow/tmp. This setup mimics a real-world data engineering workflow where regional datasets are processed in parallel before being stored or analyzed further.

Now that our transformed datasets are generated, the next logical step is to load them into a database, a critical phase in any ETL pipeline. This not only centralizes your processed data but also allows for downstream analysis, reporting, and integration with BI tools like Power BI or Looker.

While production pipelines often write to cloud-managed databases such as Amazon RDS, Google Cloud SQL, or Azure Database for MySQL, we’ll keep things local and simple by using a MySQL instance on your machine. This approach allows you to test and validate your Airflow workflows without relying on external cloud resources or credentials. The same logic, however, can later be applied seamlessly to remote or cloud-hosted databases.

Prerequisite: Install and Set Up MySQL Locally

Before adding the Load step to our DAG, ensure that MySQL is installed and running on your machine.

Install MySQL

  • Windows/macOS: Download and install MySQL Community Server.

  • Linux (Ubuntu):

    sudo apt update
    sudo apt install mysql-server -y
    sudo systemctl start mysql
    
  • Verify installation by running:

mysql -u root -p

Create a Database and User for Airflow

Inside your MySQL terminal or MySQL Workbench, run the following commands:

CREATE DATABASE IF NOT EXISTS airflow_db;
CREATE USER IF NOT EXISTS 'airflow'@'%' IDENTIFIED BY 'airflow';
GRANT ALL PRIVILEGES ON airflow_db.* TO 'airflow'@'%';
FLUSH PRIVILEGES;

This creates a simple local database called airflow_db and a user airflow with full access, perfect for development and testing.


Network Configuration for Linux Users

When running Airflow in Docker and MySQL locally on Linux, Docker containers can’t automatically access localhost.

To fix this, you need to make your local machine reachable from inside Docker.

Open your docker-compose.yaml file and add the following line under the x-airflow-common service definition:

extra_hosts:
  - "host.docker.internal:host-gateway"

This line creates a bridge that allows Airflow containers to communicate with your local MySQL instance using the hostname host.docker.internal.

Switching to LocalExecutor

In part one of this tutorial, we worked with CeleryExecutor to run our Airflow. By default, the Docker Compose file uses CeleryExecutor, which requires additional components such as Redis, Celery workers, and the Flower dashboard for distributed task execution.

Since we’re running everything on a single machine for this tutorial, we can simplify things by using LocalExecutor, which runs tasks in parallel on one machine and eliminates the need for an external queue or worker system.

Find this line in your docker-compose.yaml:

AIRFLOW__CORE__EXECUTOR: CeleryExecutor 

Change it to:

AIRFLOW__CORE__EXECUTOR: LocalExecutor

Removing Unnecessary Services

Because we’re no longer using Celery, we can safely remove related components from the configuration. These include Redis, airflow-worker, and Flower.

You can search for the following sections and delete them:

  • The entire redis service block.
  • The airflow-worker service (Celery’s worker).
  • The flower service (Celery monitoring dashboard).
  • Any AIRFLOW__CELERY__... lines inside environment blocks.

Extending the DAG with a Load Step

Now let’s extend our existing DAG to include the Load phase of the ETL process. We already created extract_market_data() and transform_market_data() in the first part of this tutorial. This new task will read each transformed CSV file and insert its data into a MySQL table.

Here’s our updated daily_etl_pipeline_airflow3 DAG with the new load_to_mysql() task.
You can also find the complete version of this DAG in the cloned repository (git@github.com:dataquestio/tutorials.git), inside the part-two/ folder under airflow-docker-tutorial.

def daily_etl_pipeline():

    @task
    def extract_market_data(market: str):
        ...

    @task
    def transform_market_data(raw_file: str):
        ...

    @task
    def load_to_mysql(transformed_file: str):
        """Load the transformed CSV data into a MySQL table."""
        import mysql.connector
        import os

        db_config = {
            "host": "host.docker.internal",  # enables Docker-to-local communication
            "user": "airflow",
            "password": "airflow",
            "database": "airflow_db",
            "port": 3306
        }

        df = pd.read_csv(transformed_file)

        # Derive the table name dynamically based on region
        table_name = f"transformed_market_data_{os.path.basename(transformed_file).split('_')[-1].replace('.csv', '')}"

        conn = mysql.connector.connect(**db_config)
        cursor = conn.cursor()

        # Create table if it doesn’t exist
        cursor.execute(f"""
            CREATE TABLE IF NOT EXISTS {table_name} (
                timestamp VARCHAR(50),
                market VARCHAR(50),
                company VARCHAR(255),
                price_usd DECIMAL(10, 2),
                daily_change_percent DECIMAL(10, 2)
            );
        """)

        # Insert records
        for _, row in df.iterrows():
            cursor.execute(
                f"""
                INSERT INTO {table_name} (timestamp, market, company, price_usd, daily_change_percent)
                VALUES (%s, %s, %s, %s, %s)
                """,
                tuple(row)
            )

        conn.commit()
        conn.close()
        print(f"[LOAD] Data successfully loaded into MySQL table: {table_name}")

    # Define markets to process dynamically
    markets = ["us", "europe", "asia", "africa"]

    # Dynamically create and link tasks
    raw_files = extract_market_data.expand(market=markets)
    transformed_files = transform_market_data.expand(raw_file=raw_files)
    load_to_mysql.expand(transformed_file=transformed_files)

dag = daily_etl_pipeline()

When you trigger this DAG, Airflow will automatically create three sequential tasks for each defined region (us, europe, asia, africa): first extracting market data, then transforming it, and finally loading it into a region-specific MySQL table.


Each branch runs independently, so by the end of a successful run, your local MySQL database (airflow_db) will contain four separate tables, one for each region:

transformed_market_data_us
transformed_market_data_europe
transformed_market_data_asia
transformed_market_data_africa

Each table contains the cleaned and sorted dataset for its region, including company names, prices, and daily percentage changes.

Once your containers are running, open MySQL (via terminal or MySQL Workbench) and run:

SHOW TABLES;


You should see all four tables listed. Then, to inspect one of them, for example us, run:

SELECT * FROM transformed_market_data_us;


The query output shows the dataset that Airflow extracted, transformed, and loaded for the U.S. market, confirming that your pipeline has now completed all three stages of ETL: Extract → Transform → Load.

This integration demonstrates Airflow’s ability to manage data flow across multiple sources and databases seamlessly, a key capability in modern data engineering pipelines.


Previewing the Loaded Data in Airflow

By now, you’ve confirmed that your transformed datasets are successfully loaded into MySQL; you can view them directly in MySQL Workbench or through a SQL client. But Airflow also provides a convenient way to query and preview this data right from the UI, using Connections and the SQLExecuteQueryOperator.

Connections in Airflow store the credentials and parameters needed to connect to external systems such as databases, APIs, or cloud services. Instead of hardcoding passwords or host details in your DAGs, you define a connection once in the Web UI and reference it securely using its conn_id.

To set this up:

  1. Open the Airflow Web UI
  2. Navigate to Admin → Connections → + Add a new record
  3. Fill in the following details:
Field       Value
Conn Id     local_mysql
Conn Type   MySQL
Host        host.docker.internal
Schema      airflow_db
Login       airflow
Password    airflow
Port        3306

Note: These values must match the credentials you defined earlier when setting up your local MySQL instance.

Specifically, the database airflow_db, user airflow, and password airflow should already exist in your MySQL setup.

The host.docker.internal value ensures that your Airflow containers can communicate with MySQL running on your local machine.

  • Also note that when you use docker compose down -v, all volumes, including your Airflow connections, will be deleted. Always remember to re-add the connection afterward.

If your changes are not volume-related, you can safely shut down the containers using docker compose down (without -v), which preserves your existing connections and data.

Click Save to register the connection.

Now, Airflow knows how to connect to your MySQL database whenever a task specifies conn_id="local_mysql".

Let’s create a simple SQL query task to preview the data we just loaded.


    @task
    def extract_market_data(market: str):
        ...

    @task
    def transform_market_data(raw_file: str):
        ...

    @task
    def load_to_mysql(transformed_file: str):
        ...

    from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

    # Defined at the DAG level (not inside load_to_mysql) so it can be linked to other tasks
    preview_mysql = SQLExecuteQueryOperator(
        task_id="preview_mysql_table",
        conn_id="local_mysql",
        sql="SELECT * FROM transformed_market_data_us LIMIT 5;",
        do_xcom_push=True,  # makes query results viewable in Airflow’s XCom tab
    )

    # Dynamically create and link tasks
    raw_files = extract_market_data.expand(market=markets)
    transformed_files = transform_market_data.expand(raw_file=raw_files)
    load_to_mysql.expand(transformed_file=transformed_files)

dag = daily_etl_pipeline()

Next, link this task to your DAG so that it runs after the loading process. Update the line load_to_mysql.expand(transformed_file=transformed_files) to this:

    load_to_mysql.expand(transformed_file=transformed_files) >> preview_mysql

When you trigger the DAG again (remember to shut down the containers with docker compose down before editing your DAGs, then bring them back up with docker compose up -d once your changes are saved), Airflow will:

  1. Connect to your MySQL database using the stored connection credentials.
  2. Run the SQL query on the specified table.
  3. Display the first few rows of your data as a JSON result in the XCom view.

To see it:

  • Go to Grid View
  • Click on the preview_mysql_table task
  • Choose XCom from the top menu


You’ll see your data represented in JSON format, confirming that the integration works: Airflow not only orchestrates the workflow but can also interactively query and preview your results without leaving the platform.

This makes it easy to verify that your ETL pipeline is functioning correctly end-to-end: extraction, transformation, loading, and now validation, all visible and traceable inside Airflow.

Git-Based DAG Management and CI/CD for Deployment (with git-sync)

At this stage, your local Airflow environment is complete, you’ve built a fully functional ETL pipeline that extracts, transforms, and loads regional market data into MySQL, and even validated results directly from the Airflow UI.

Now it’s time to take the final step toward production readiness: managing your DAGs the way data engineering teams do in real-world systems, through Git-based deployment and continuous integration.

We’ll push our DAGs to a shared GitHub repository called airflow_dags, and connect Airflow to it using the git-sync container, which automatically keeps your DAGs in sync. This allows every team member (or environment) to pull from the same source, the Git repo, without manually copying files into containers.

Why Manage DAGs with Git

Every DAG is just a Python file, and like all code, it deserves version control. Storing DAGs in a Git repository brings the same advantages that software engineers rely on:

  • Versioning: track every change and roll back safely.
  • Collaboration: multiple developers can work on different workflows without conflict.
  • Reproducibility: every environment can pull identical DAGs from a single source.
  • Automation: changes sync automatically, eliminating manual uploads.

This structure makes Airflow easier to maintain and scales naturally as your pipelines grow in number and complexity.

Pushing Your DAGs to GitHub

To begin, create a public or private repository named airflow_dags (e.g., https://github.com/<your-username>/airflow_dags).

Then, in your project root (airflow-docker), initialize Git and push your local dags/ directory:

git init
git remote add origin https://github.com/<your-username>/airflow_dags.git
git add dags/
git commit -m "Add Airflow ETL pipeline DAGs"
git branch -M main
git push -u origin main

Once complete, your DAGs live safely in GitHub, ready for syncing.

How git-sync Works

git-sync is a lightweight sidecar container that continuously clones and updates a Git repository into a shared volume.

Once running, it:

  • Clones your repository (e.g., https://github.com/<your-username>/airflow_dags.git),
  • Pulls updates every 30 seconds by default,
  • Exposes the latest DAGs to Airflow automatically, no rebuilds or restarts required.

This is how Airflow stays up to date with your Git repo in real time.

Setting Up git-sync in Docker Compose

In your existing docker-compose.yaml, you already have a list of services that define your Airflow environment, like the api-server, scheduler, triggerer, and dag-processor. Each of these runs in its own container but works together as part of the same orchestration system.

The git-sync container will become another service in this list, just like those, but with a very specific purpose:

  • to keep your /dags folder continuously synchronized with your remote GitHub repository.

Instead of copying Python DAG files manually or rebuilding containers every time you make a change, the git-sync service will automatically pull updates from your GitHub repo (in our case, airflow_dags) into a shared volume that all Airflow services can read from.

This ensures that your environment always runs the latest DAGs from GitHub ,without downtime, restarts, or manual synchronization.

Remember that our docker-compose.yaml file already defines a services: section for the Airflow components and a volumes: section for persistent storage.

Now, we’ll extend that structure by introducing git-sync as an additional service within the same services: section, and by adding a new entry to the volumes: section (alongside postgres-db-volume, we also add airflow-dags-volume so the synced DAGs are shared consistently across all containers).

Below is a configuration that works seamlessly with Docker on any OS:

services:
  git-sync:
    image: registry.k8s.io/git-sync/git-sync:v4.1.0
    user: "0:0"    # run as root so it can create /dags/git-sync
    restart: always
    environment:
      GITSYNC_REPO: "https://github.com/<your-username>/airflow_dags.git"
      GITSYNC_BRANCH: "main"           # use BRANCH not REF
      GITSYNC_PERIOD: "30s"
      GITSYNC_DEPTH: "1"
      GITSYNC_ROOT: "/dags/git-sync"
      GITSYNC_DEST: "repo"
      GITSYNC_LINK: "current"
      GITSYNC_ONE_TIME: "false"
      GITSYNC_ADD_USER: "true"
      GITSYNC_CHANGE_PERMISSIONS: "1"
      GITSYNC_STALE_WORKTREE_TIMEOUT: "24h"
    volumes:
      - airflow-dags-volume:/dags
    healthcheck:
      test: ["CMD-SHELL", "test -L /dags/git-sync/current && test -d /dags/git-sync/current/dags && [ \"$(ls -A /dags/git-sync/current/dags 2>/dev/null)\" ]"]
      interval: 10s
      timeout: 3s
      retries: 10
      start_period: 10s

volumes:
  airflow-dags-volume:

In this setup, the git-sync service runs as a lightweight companion container that keeps your Airflow DAGs in sync with your GitHub repository.

The GITSYNC_REPO variable tells it where to pull code from, in this case, your DAG repository (airflow_dags). Make sure you replace <your-username> with your exact GitHub username. The GITSYNC_BRANCH specifies which branch to track, usually main, while GITSYNC_PERIOD defines how often to check for updates. Here, it’s set to every 30 seconds, meaning Airflow will always be within half a minute of your latest Git push.

The synchronization happens inside the directory defined by GITSYNC_ROOT, which becomes /dags/git-sync inside the container. Inside that root, GITSYNC_DEST defines where the repo is cloned (as repo), and GITSYNC_LINK creates a symbolic link called current pointing to the active clone.

This design allows Airflow to always reference a stable, predictable path (/dags/git-sync/current/dags) even as the repository updates in the background, no path changes, no reloads.

A few environment flags ensure stability and portability across systems. For instance, GITSYNC_ADD_USER and GITSYNC_CHANGE_PERMISSIONS make sure the synced files are accessible to Airflow even when permissions differ across Docker environments.

GITSYNC_DEPTH limits the clone to just the latest commit (keeping it lightweight), while GITSYNC_STALE_WORKTREE_TIMEOUT helps clean up old syncs if something goes wrong.

The shared volume, airflow-dags-volume, acts as the bridge between git-sync and Airflow, storing all synced DAGs in one central location accessible by both containers.

Finally, the healthcheck section ensures that Airflow doesn’t start until git-sync has successfully cloned your repository. It runs a small shell command that checks three things, whether the symbolic link /dags/git-sync/current exists, whether the dags directory is present inside it, and whether that directory actually contains files. Only when all these conditions pass does Docker mark the git-sync service as healthy. The interval and retry parameters control how often and how long these checks run, ensuring that Airflow’s scheduler, webserver, and other components wait patiently until the DAGs are fully available. This simple step prevents race conditions and guarantees a smooth startup every time.

Together, these settings ensure that your Airflow instance always runs the latest DAGs from GitHub, automatically, securely, and without manual file transfers.

Generally, this configuration does the following:

  • Creates a shared volume (airflow-dags-volume) where the DAGs are cloned.
  • Mounts it into both git-sync and Airflow services.
  • Runs git-sync as root to fix permission issues on Windows.
  • Keeps DAGs up to date every 30 seconds.

Adjusting the Airflow Configuration

We’ve now added git-sync as part of our Airflow services, sitting right alongside the api-server, scheduler, triggerer, and dag-processor.

This new service continuously pulls our DAGs from GitHub and stores them inside a shared volume (airflow-dags-volume) that both git-sync and Airflow can access.

However, our Airflow setup still expects to find DAGs through local directory mounts defined under each service (via x-airflow-common), not global named volumes. The default configuration maps these paths as follows:

volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
    - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
    - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
    - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins

This setup points Airflow to the local dags/ folder in your host machine, but now that we have git-sync, our DAGs will live inside a synchronized Git directory instead.

So we need to update the DAG volume mapping to pull from the new shared Git volume instead of the local one.

Replace the first line (- ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags) under the volumes: section with: - airflow-dags-volume:/opt/airflow/dags

This tells Docker to mount the shared airflow-dags-volume (created by git-sync) into Airflow’s /opt/airflow/dags directory.

That way, any DAGs pulled by git-sync from your GitHub repository will immediately appear inside Airflow’s working environment, without needing to rebuild or copy files.

We also need to explicitly tell Airflow where the synced DAGs live.

In the environment section of your x-airflow-common block, add the following:

AIRFLOW__CORE__DAGS_FOLDER: /opt/airflow/dags/git-sync/current/dags

This line links Airflow directly to the directory created by the git-sync container.

Here’s how it connects:

  • Inside the git-sync configuration, we defined:

    GITSYNC_ROOT: "/dags/git-sync"
    GITSYNC_LINK: "current"

    Together, these ensure that the most recent repository clone is always available under /dags/git-sync/current.

  • When we mount airflow-dags-volume:/opt/airflow/dags, this path becomes accessible inside the Airflow containers as

    /opt/airflow/dags/git-sync/current/dags.

By setting AIRFLOW__CORE__DAGS_FOLDER to that exact path, Airflow automatically watches the live Git-synced DAG directory for changes, meaning every new commit to your GitHub repo will reflect instantly in the Airflow UI.

Finally, ensure that Airflow waits for git-sync to finish cloning before starting up.

In the depends_on section of each Airflow service (airflow-scheduler, airflow-apiserver, dag-processor, and triggerer), add:

depends_on:
  git-sync:
    condition: service_healthy

This guarantees that Airflow only starts once the git-sync container has successfully pulled your repository, preventing race conditions during startup.

Once complete, Airflow will read its DAGs directly from the synchronized Git directory, /opt/airflow/dags/git-sync/current/dags, instead of your local project folder.

This change transforms your setup into a live, Git-driven workflow, where Airflow continuously tracks and loads the latest DAGs from GitHub automatically.

Automating Validation with GitHub Actions

Our Git integration wouldn’t be truly powerful without CI/CD (Continuous Integration and Continuous Deployment).

While git-sync ensures that any change pushed to GitHub automatically reflects in Airflow, that alone can be risky: not every change should make it to production immediately.

Imagine pushing a DAG with a missing import, a syntax error, or a bad dependency.

Airflow might fail to parse it, causing your scheduler or api-server to crash or restart repeatedly. That’s why we need a safety net, a way to automatically check that every DAG in our repository is valid before it ever reaches Airflow.

This is exactly where GitHub Actions comes in.

We can set up a lightweight CI pipeline that validates all DAGs whenever someone pushes to the main branch. If a broken DAG is detected, the pipeline fails, preventing the merge and protecting your Airflow environment from unverified code.

GitHub also provides notifications directly in your repository interface, showing failed workflows and highlighting the cause of the issue.

Inside your airflow_dags repository, create a GitHub Actions workflow file at:

.github/workflows/validate-dags.yml

name: Validate Airflow DAGs

on:
  push:
    branches: [ main ]
    paths:
      - 'dags/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install Airflow
        run: pip install apache-airflow==3.1.1

      - name: Validate DAGs
        run: |
          echo "Validating DAG syntax..."
          airflow dags list || exit 1

This simple workflow automatically runs every time you push a new commit to the main branch (or modify anything in the dags/ directory).

It installs Apache Airflow in a lightweight test environment, loads all your DAGs, and checks that they parse successfully, no missing imports, syntax issues, or circular dependencies.

If even one DAG fails to load, the validation job will exit with an error, causing the GitHub Actions pipeline to fail.

GitHub then immediately notifies you (and any collaborators) through the repository’s Actions tab, issue alerts, and optional email notifications.

By doing this, you’re adding a crucial layer of protection to your workflow:

  • Pre-deployment safety: invalid DAGs never reach your running Airflow instance.
  • Automatic feedback: failed DAGs trigger GitHub notifications, allowing you to fix errors early.
  • Confidence in deployment: once the pipeline passes, you know every DAG is production-ready.

Together, this CI validation and your git-sync setup create a self-updating, automated Airflow environment that mirrors production deployment practices.

With this final step, your Airflow environment becomes a versioned, automated, and production-ready orchestration system, capable of handling real data pipelines the way modern engineering teams do.

You’ve now completed a full transformation:

from local DAG development to automated, Git-driven deployment, all within Docker, all powered by Apache Airflow.

  • Note that both the git-sync service and the Airflow UI depend on your Docker containers running. As long as your containers are up, git-sync remains active, continuously checking for updates in your GitHub repository and syncing any new DAGs to your Airflow environment.

    Once you stop or shut down the containers (docker compose down), this synchronization pauses. You also won’t be able to access the Airflow Web UI or trigger DAGs until the containers are started again.

    When you restart with docker compose up -d, everything, including git-sync, resumes automatically, picking up the latest changes from GitHub and restoring your full Airflow setup just as you left it.

Summary and Up Next

In this tutorial, you completed the ETL lifecycle in Apache Airflow by adding the Load phase to your workflow and connecting it to a local MySQL database. You learned how Airflow securely manages external connections, dynamically handles multiple data regions, and enables in-UI data previews through XCom and Connections.

You also took your setup a step closer to production by integrating Git-based DAG management with git-sync, and implementing GitHub Actions CI to validate DAGs automatically before deployment.

Together, these changes transformed your environment into a version-controlled, automated orchestration system that mirrors the structure of production-grade setups, a final step before deploying to the cloud.

In the next tutorial, you’ll move beyond simulated data and build a real-world data pipeline, extracting data from an API, transforming it with Python, and loading it into MySQL. You’ll also add retries, alerts, and monitoring, and deploy the full workflow through CI/CD, achieving a truly end-to-end, production-grade Airflow setup.

Running and Managing Apache Airflow with Docker (Part I)

7 November 2025 at 22:50

In the last tutorial, we explored what workflow orchestration is, why it matters, and how Apache Airflow structures, automates, and monitors complex data pipelines through DAGs, tasks, and the scheduler. We examined how orchestration transforms scattered scripts into a coordinated system that ensures reliability, observability, and scalability across modern data workflows.

In this two-part hands-on tutorial, we move from theory to practice. You’ll run Apache Airflow inside Docker, the most efficient and production-like way to deploy Airflow for development and testing. This containerized approach mirrors how Airflow operates in real-world environments, from on-premises teams to cloud services like AWS ECS and Cloud Composer.

In Part One, our focus goes beyond setup. You’ll learn how to work effectively with DAGs inside a Dockerized Airflow environment, writing, testing, visualizing, and managing them through the Web UI. You’ll use the TaskFlow API to build clean, Pythonic workflows and implement dynamic task mapping to run multiple processes in parallel. By the end of this part, you’ll have a fully functional Airflow environment running in Docker and a working DAG that extracts and transforms data automatically, the foundation of modern data engineering pipelines.

In Part Two, we’ll extend that foundation to handle data management and automation workflows. You’ll connect Airflow to a local MySQL database for data loading, manage credentials securely through the Admin panel and environment variables, and integrate Git with Git Sync to enable version control and continuous deployment. You’ll also see how CI/CD pipelines can automate DAG validation and deployment, ensuring your Airflow environment remains consistent and collaborative across development teams.

By the end of the series, you’ll not only understand how Airflow runs inside Docker but also how to design, orchestrate, and manage production-grade data pipelines the way data engineers do in real-world systems.

Why Use Docker for Airflow

While Airflow can be installed locally with pip install apache-airflow, this approach often leads to dependency conflicts, version mismatches, and complicated setups. Airflow depends on multiple services, an API server, scheduler, triggerer, metadata database, and dag-processors, all of which must communicate correctly. Installing and maintaining these manually on your local machine can be tedious and error-prone.

Docker eliminates these issues by packaging everything into lightweight, isolated containers. Each container runs a single Airflow component, but all work together seamlessly through Docker Compose. The result is a clean, reproducible environment that behaves consistently across operating systems.

In short:

  • Local installation: works for testing but often breaks due to dependency conflicts or version mismatches.
  • Cloud-managed services (like AWS ECS or Cloud Composer): excellent for production but less flexible for learning or prototyping.
  • Docker setup: combines realism with simplicity, providing the same multi-service environment used in production without the overhead of manual configuration.

Docker setup is ideal for learning and development and closely mirrors production environments, but additional configuration is needed for a full production deployment.

Prerequisites

Before you begin, ensure the following are installed and ready on your system:

  1. Docker Desktop – Required to build and run Airflow containers.
  2. A code editor – Visual Studio Code or similar, for writing and editing DAGs.
  3. Python 3.10 or higher – Used for authoring Airflow DAGs and helper scripts.

Running Airflow Using Docker

Now that your environment is ready (Docker is open and running), let’s get Airflow running using Docker Compose.

This tool orchestrates all Airflow services (api-server, scheduler, triggerer, database, and workers) so they start and communicate properly.

Clone the Tutorial Repository

We’ve already prepared the starter files you’ll need for this tutorial on GitHub.

Begin by cloning the repository:

git clone git@github.com:dataquestio/tutorials.git

Then navigate to the Airflow tutorial folder:

cd tutorials/airflow-docker-tutorial

This is the directory where you’ll be working throughout the tutorial.

Inside, you’ll notice a structure similar to this:

airflow-docker-tutorial/
├── part-one/  
├── part-two/
├── docker-compose.yaml
└── README.md
  • The part-one/ and part-two/ folders contain the complete reference files for both tutorials (Part One and Part Two).

    You don’t need to modify anything there, it’s only for comparison or review.

  • The docker-compose.yaml file is your starting point and will evolve as the tutorial progresses.

Explore the Docker Compose File

Open the docker-compose.yaml file in your code editor.

This file defines all the Airflow components and how they interact inside Docker.

It includes:

  • api-server – Airflow’s web user interface
  • Scheduler – Parses and triggers DAGs
  • Triggerer – Manages deferrable tasks efficiently
  • Metadata database – Tracks DAG runs and task states
  • Executors – Execute tasks

Each of these services runs in its own container, but together they form a single working Airflow environment.

You’ll be updating this file as you move through the tutorial to configure, extend, and manage your Airflow setup.

Create Required Folders

Airflow expects certain directories to exist before launching.

Create them inside the same directory as your docker-compose.yaml file:

mkdir -p ./dags ./logs ./plugins ./config
  • dags/ – your workflow scripts
  • logs/ – task execution logs
  • plugins/ – custom hooks and operators
  • config/ – optional configuration overrides (this will be auto-populated later when initializing the database)

Configure User Permissions

If you’re using Linux, set a user ID to prevent permission issues when Docker writes files locally:

echo -e "AIRFLOW_UID=$(id -u)" > .env

If you’re using macOS or Windows, manually create a .env file in the same directory with the following content:

AIRFLOW_UID=50000

This ensures consistent file ownership between your host system and the Docker containers.

Initialize the Metadata Database

Airflow keeps track of DAG runs, task states, and configurations in a metadata database.

Initialize it by running:

docker compose up airflow-init

Once initialization completes, you’ll see a message confirming that an admin user has been created with default credentials:

  • Username: airflow
  • Password: airflow

Start Airflow

Now start all Airflow services in the background:

docker compose up -d

Docker Compose will spin up the scheduler, API server, triggerer, database, and worker containers.


Make sure the triggerer, dag-processor, scheduler, and api-server containers are shown as started. If that is not the case, rebuild the containers, since the build process might have been interrupted. Otherwise, navigate to http://localhost:8080 to access the Airflow UI exposed by the api-server.

You can also check the running containers in the Docker Desktop app by navigating to Containers.


Log in using the credentials above to access the Airflow Web UI.

  • If the UI fails to load or some containers keep restarting, increase Docker’s memory allocation to at least 4 GB (8 GB recommended) in Docker Desktop → Settings → Resources.

Configuring the Airflow Project

Once Airflow is running and you visit http://localhost:8080, you’ll see the Airflow Web UI.


This is the command center for your workflows, where you can visualize DAGs, monitor task runs, and manage system configurations. When you navigate to Dags, you’ll see a dashboard that lists several example DAGs provided by the Airflow team. These are sample workflows meant to demonstrate different operators, sensors, and features.

However, for this tutorial, we’ll build our own clean environment, so we’ll remove these example DAGs and customize our setup to suit our project.

Before doing that, though, it’s important to understand the docker-compose.yaml file, since this is where your Airflow environment is actually defined.

Understanding the docker-compose.yaml File

The docker-compose.yaml file tells Docker how to build, connect, and run all the Airflow components as containers.

If you open it, you’ll see several top-level sections: x-airflow-common, services, and volumes.


Let’s break this down briefly:

  • x-airflow-common – This is the shared configuration block that all Airflow containers inherit from. It defines the base Docker image (apache/airflow:3.1.0), key environment variables, and mounted volumes for DAGs, logs, and plugins. It also specifies user permissions to ensure that files created inside the containers are accessible from your host machine. The depends_on lists dependencies such as the PostgreSQL database used to store Airflow metadata. In short, this section sets up the common foundation for every container in your environment.
  • services – This section defines the actual Airflow components that make up your environment. Each service, such as the api-server, scheduler, triggerer, dag-processor , and metadata database, runs as a separate container but uses the shared configuration from x-airflow-common. Together, they form a complete Airflow deployment where each container plays a specific role.
  • volumes – This section sets up persistent storage for containers. Airflow uses it by default for the Postgres database, keeping your DAGs, logs, and configurations saved across runs. In Part Two, we’ll expand it to include Git integration.

Each of these sections works together to create a unified Airflow environment that’s easy to configure, extend, or simplify as needed.

Understanding these parts now will make the next steps - cleaning, customizing, and managing your Airflow setup - much clearer.

Resetting the Environment Before Making Changes

Before editing anything inside the docker-compose.yaml, it’s crucial to shut down your containers cleanly to avoid conflicts.

Run: docker compose down -v

Here’s what this does:

  • docker compose down stops and removes all containers.
  • The -v flag removes volumes, which clears stored metadata, logs, and configurations.

    This ensures that you start with a completely fresh environment the next time you launch Airflow — which can be helpful when your environment becomes misconfigured or broken. However, you shouldn’t do this routinely after every DAG or configuration change, as it will also remove your saved Connections, Variables, and other stateful data. In most cases, you can simply run docker compose down instead to stop the containers without wiping the environment.

Disabling Example DAGs

By default, Airflow loads several example DAGs to help new users explore its features. For our purposes, we want a clean workspace that only shows our own DAGs.

  1. Open the docker-compose.yaml file in your code editor.
  2. Locate the environment section under x-airflow-common and find this line: AIRFLOW__CORE__LOAD_EXAMPLES: 'true'. Change 'true' to 'false': AIRFLOW__CORE__LOAD_EXAMPLES: 'false'

This setting tells Airflow not to load any of the example workflows when it starts.

Once you’ve made the changes:

  1. Save your docker-compose.yaml file.
  2. Rebuild and start your Airflow environment again: docker compose up -d
  3. Wait a few moments, then visit http://localhost:8080 again.

This time, when you log in, you’ll notice the example DAGs are gone, leaving you with a clean workspace ready for your own workflows.


Let’s now build our first DAG.

Working with DAGs in Airflow

Now that your Airflow environment is clean and running, it’s time to create our first real workflow.

This is where you begin writing DAGs (Directed Acyclic Graphs), which sit at the very heart of how Airflow operates.

A DAG is more than just a piece of code, it’s a visual and logical representation of your workflow, showing how tasks connect, when they run, and in what order.

Each task in a DAG represents a distinct step in your process, such as pulling data, cleaning it, transforming it, or loading it into a database. In this tutorial we will create tasks that extract and transform data. We will see the loading process in Part Two, along with how Airflow integrates with Git.

Airflow ensures these tasks execute in the correct order without looping back on themselves (that’s what acyclic means).

Setting Up Your DAG File

Let’s start by creating the foundation of our workflow (make sure to shut down any running containers first using docker compose down -v).

Open your airflow-docker project folder and, inside the dags/ directory, create a new file named:

our_first_dag.py

Every .py file you place in this folder becomes a workflow that Airflow can recognize and manage automatically.

You don’t need to manually register anything, Airflow continuously scans this directory and loads any valid DAGs it finds.

At the top of our file, let’s import the core libraries we need for our project:

from airflow.decorators import dag, task
from datetime import datetime, timedelta
import pandas as pd
import random
import os

Let’s pause to understand what each of these imports does and why they matter:

  • dag and task come from Airflow’s TaskFlow API.

    These decorators turn plain Python functions into Airflow-managed tasks, giving you cleaner, more intuitive code while Airflow handles orchestration behind the scenes.

  • datetime and timedelta handle scheduling logic.

    They help define when your DAG starts and how frequently it runs.

  • pandas, random, and os are standard Python libraries we’ll use to simulate a simple ETL process, generating, transforming, and saving data locally.

This setup might seem minimal, but it’s everything you need to start orchestrating real tasks.

Defining the DAG Structure

With our imports ready, the next step is to define the skeleton of our DAG, its blueprint.

Think of this as defining when and how your workflow runs.

default_args = {
    "owner": "Your name",
    "retries": 3,
    "retry_delay": timedelta(minutes=1),
}

@dag(
    dag_id="daily_etl_pipeline_airflow3",
    description="ETL workflow demonstrating dynamic task mapping and assets",
    schedule="@daily",
    start_date=datetime(2025, 10, 29),
    catchup=False,
    default_args=default_args,
    tags=["airflow3", "etl"],
)
def daily_etl_pipeline():
    ...

dag = daily_etl_pipeline()

Let’s break this down carefully:

  • default_args

    This dictionary defines shared settings for all tasks in your DAG.

    Here, each task will automatically retry up to three times with a one-minute delay between attempts, a good practice when your tasks depend on external systems like APIs or databases that can occasionally fail.

  • The @dag decorator

    This tells Airflow that everything inside the daily_etl_pipeline() function (we can give it any name we like) belongs to one cohesive workflow.

    It defines:

    • schedule="@daily" → when the DAG should run.
    • start_date → the first execution date.
    • catchup=False → prevents Airflow from running past-due DAGs automatically.
    • tags → helps you categorize DAGs in the UI.
  • The daily_etl_pipeline() function

    This is the container for your workflow logic, it’s where you’ll later define your tasks and how they depend on one another.

    Think of it as the “script” that describes what happens in each run of your DAG.

  • dag = daily_etl_pipeline()

    This single line instantiates the DAG. It’s what makes your workflow visible and schedulable inside Airflow.

This structure acts as the foundation for everything that follows.

If we think of a DAG as a movie script, this section defines the production schedule and stage setup before the actors (tasks) appear.

Creating Tasks with the TaskFlow API

Now it’s time to define the stars of our workflow, the tasks.

Tasks are the actual units of work that Airflow runs. Each one performs a specific action, and together they form your complete data pipeline.

Airflow’s TaskFlow API makes this remarkably easy: you simply decorate ordinary Python functions with @task, and Airflow takes care of converting them into fully managed, trackable workflow steps.

We’ll start with two tasks:

  • Extract → simulates pulling or generating data.
  • Transform → processes and cleans the extracted data.

(We’ll add the Load step in the next part of this tutorial.)

Extract Task — Generating Fake Data

@task
def extract_market_data():
    """
    Simulate extracting market data for popular companies.
    This task mimics pulling live stock prices or API data.
    """
    companies = ["Apple", "Amazon", "Google", "Microsoft", "Tesla", "Netflix", "NVIDIA", "Meta"]

    # Simulate today's timestamped price data
    records = []
    for company in companies:
        price = round(random.uniform(100, 1500), 2)
        change = round(random.uniform(-5, 5), 2)
        records.append({
            "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "company": company,
            "price_usd": price,
            "daily_change_percent": change,
        })

    df = pd.DataFrame(records)
    os.makedirs("/opt/airflow/tmp", exist_ok=True)
    raw_path = "/opt/airflow/tmp/market_data.csv"
    df.to_csv(raw_path, index=False)

    print(f"[EXTRACT] Market data successfully generated at {raw_path}")
    return raw_path

Let’s unpack what’s happening here:

  • The function simulates the extraction phase of an ETL pipeline by generating a small, timestamped dataset of popular companies and their simulated market prices.
  • Each record includes a company name, current price in USD, and a randomly generated daily percentage change, mimicking what you’d expect from a real API response or financial data feed.
  • The data is stored in a CSV file inside /opt/airflow/tmp, a shared directory accessible from within your Docker container, this mimics saving raw extracted data before it’s cleaned or transformed.
  • Finally, the function returns the path to that CSV file. This return value becomes crucial because Airflow automatically treats it as the output of this task. Any downstream task that depends on it, for example, a transformation step, can receive it as an input automatically.

In simpler terms, Airflow handles the data flow for you. You focus on defining what each task does, and Airflow takes care of passing outputs to inputs behind the scenes, ensuring your pipeline runs smoothly and predictably.

Transform Task — Cleaning and Analyzing Market Data

@task
def transform_market_data(raw_file: str):
    """
    Clean and analyze extracted market data.
    This task simulates transforming raw stock data
    to identify the top gainers and losers of the day.
    """
    df = pd.read_csv(raw_file)

    # Clean: ensure numeric fields are valid
    df["price_usd"] = pd.to_numeric(df["price_usd"], errors="coerce")
    df["daily_change_percent"] = pd.to_numeric(df["daily_change_percent"], errors="coerce")

    # Sort companies by daily change (descending = top gainers)
    df_sorted = df.sort_values(by="daily_change_percent", ascending=False)

    # Select top 3 gainers and bottom 3 losers
    top_gainers = df_sorted.head(3)
    top_losers = df_sorted.tail(3)

    # Save transformed files
    os.makedirs("/opt/airflow/tmp", exist_ok=True)
    gainers_path = "/opt/airflow/tmp/top_gainers.csv"
    losers_path = "/opt/airflow/tmp/top_losers.csv"

    top_gainers.to_csv(gainers_path, index=False)
    top_losers.to_csv(losers_path, index=False)

    print(f"[TRANSFORM] Top gainers saved to {gainers_path}")
    print(f"[TRANSFORM] Top losers saved to {losers_path}")

    return {"gainers": gainers_path, "losers": losers_path}

Let’s unpack what this transformation does and why it’s important:

  • The function begins by reading the extracted CSV file produced by the previous task (extract_market_data). This is our “raw” dataset.
  • Next, it cleans the data, converting prices and percentage changes into numeric formats, a vital first step before analysis, since raw data often arrives as text.
  • It then sorts companies by their daily percentage change, allowing us to quickly identify which ones gained or lost the most value during the day.
  • Two smaller datasets are then created: one for the top gainers and one for the top losers, each saved as separate CSV files in the same temporary directory.
  • Finally, the task returns both file paths as a dictionary, allowing any downstream task (for example, a visualization or database load step) to easily access both datasets.

This transformation demonstrates how Airflow tasks can move beyond simple sorting; they can perform real business logic, generate multiple outputs, and return structured data to other steps in the workflow.

At this point, your DAG has two working tasks:

  • Extract — to simulate data collection
  • Transform — to clean and analyze that data

When Airflow runs this workflow, it will execute them in order:

Extract → Transform

Now that both the Extract and Transform tasks are defined inside your DAG, let’s see how Airflow links them together when you call them in sequence.

Inside your daily_etl_pipeline() function, add these two lines to establish the task order:

raw = extract_market_data()
transformed = transform_market_data(raw)

When Airflow parses the DAG, it doesn't see these as ordinary Python calls; it reads them as task relationships.

The TaskFlow API automatically builds a dependency chain, so Airflow knows that extract_market_data must complete before transform_market_data begins.

Notice that we’ve assigned extract_market_data() to a variable called raw. This variable represents the output of the first task, in our case, the path to the extracted data file. The next line, transform_market_data(raw), then takes that output and uses it as input for the transformation step.

This pattern makes the workflow clear and logical: data is extracted, then transformed, with Airflow managing the sequence automatically behind the scenes.

This is how Airflow builds the workflow graph internally: by reading the relationships you define through function calls.

Visualizing the Workflow in the Airflow UI

Once you've saved your DAG file with both tasks (Extract and Transform), it's time to bring it to life. Start your Airflow environment using:

docker compose up -d

Then open your browser and navigate to: http://localhost:8080

You'll see the Airflow home page, this time listing the DAG we just created: daily_etl_pipeline_airflow3.

Visualizing the Workflow in the Airflow UI

Click on it to open the DAG details, then trigger a manual run using the Play button.

The task currently running will turn blue, and once it completes successfully, it will turn green.

Visualizing the Workflow in the Airflow UI (2)

In the Graph view, you will also see the two tasks, extract_market_data and transform_market_data, connected in sequence, each showing a successful run.

Visualizing the Workflow in the Airflow UI (3)

If a task encounters an issue, Airflow will automatically retry it up to three times (as defined in default_args). If it continues to fail after all retries, it will appear red, indicating that the task, and therefore the DAG run, has failed.
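As a reminder, that retry behavior comes from the default_args dictionary passed to the DAG. A minimal sketch, assuming values similar to the ones used earlier in this pipeline (the retry_delay shown here is an assumption; adjust it to match your own DAG):

from datetime import timedelta

# Assumed default_args; retries=3 matches the behavior described above
default_args = {
    "retries": 3,                         # retry a failed task up to 3 times
    "retry_delay": timedelta(minutes=1),  # wait between attempts (assumed value)
}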

Inspecting Task Logs

Click on any task box (for example, transform_market_data), then click Task Instances.

Inspecting Task Logs

All DAG runs for the selected task will be listed here. Click on the latest run. This will open a detailed log of the task’s execution, an invaluable feature for debugging and understanding what’s happening under the hood.

In your log, you’ll see:

  • The [EXTRACT] or [TRANSFORM] tags you printed in the code.
  • Confirmation messages showing where your files were saved, e.g.:

    Inspecting Task Logs (2)

    Inspecting Task Logs (3)

These messages prove that your tasks executed correctly and help you trace your data through each stage of the pipeline.

Dynamic Task Mapping

As data engineers, we rarely process just one dataset; we usually work with many sources at once.

For example, instead of analyzing one market, you might process stock data from multiple exchanges or regions simultaneously.

In our current DAG, the extraction and transformation handle only a single dataset.

But what if we wanted to repeat that same process for several markets, say, us, europe, asia, and africa, all in parallel?

Writing a separate task for each region would make our DAG repetitive and hard to maintain.

That’s where Dynamic Task Mapping comes in.

It allows Airflow to create parallel tasks automatically at runtime based on input data such as lists, dictionaries, or query results.

Before editing the DAG, stop any running containers to ensure Airflow picks up your changes cleanly:

docker compose down -v

Now, extend your existing daily_etl_pipeline_airflow3 to handle multiple markets dynamically:

def daily_etl_pipeline():

    @task
    def extract_market_data(market: str):
        ...

    @task
    def transform_market_data(raw_file: str):
        ...

    # Define markets to process dynamically
    markets = ["us", "europe", "asia", "africa"]

    # Dynamically create parallel tasks
    raw_files = extract_market_data.expand(market=markets)
    transformed_files = transform_market_data.expand(raw_file=raw_files)

dag = daily_etl_pipeline()

By using .expand(), Airflow automatically generates multiple parallel task instances from a single function. You’ll notice the argument market passed into the extract_market_data() function. For that to work effectively, here’s the updated version of the extract_market_data() function:

@task
def extract_market_data(market: str):
    """Simulate extracting market data for a given region or market."""
    companies = ["Apple", "Amazon", "Google", "Microsoft", "Tesla", "Netflix"]
    records = []
    for company in companies:
        price = round(random.uniform(100, 1500), 2)
        change = round(random.uniform(-5, 5), 2)
        records.append({
            "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "market": market,
            "company": company,
            "price_usd": price,
            "daily_change_percent": change,
        })

    df = pd.DataFrame(records)
    os.makedirs("/opt/airflow/tmp", exist_ok=True)
    raw_path = f"/opt/airflow/tmp/market_data_{market}.csv"
    df.to_csv(raw_path, index=False)
    print(f"[EXTRACT] Market data for {market} saved at {raw_path}")
    return raw_path

We also updated our transform_market_data() task to align with this dynamic setup:

@task
def transform_market_data(raw_file: str):
    """Clean and analyze each regional dataset."""
    df = pd.read_csv(raw_file)
    df["price_usd"] = pd.to_numeric(df["price_usd"], errors="coerce")
    df["daily_change_percent"] = pd.to_numeric(df["daily_change_percent"], errors="coerce")
    df_sorted = df.sort_values(by="daily_change_percent", ascending=False)

    top_gainers = df_sorted.head(3)
    top_losers = df_sorted.tail(3)

    transformed_path = raw_file.replace("market_data_", "transformed_")
    top_gainers.to_csv(transformed_path, index=False)
    print(f"[TRANSFORM] Transformed data saved at {transformed_path}")
    return transformed_path

Both extract_market_data() and transform_market_data() now work together dynamically:

  • extract_market_data() generates a unique dataset per region (e.g., market_data_us.csv, market_data_europe.csv).
  • transform_market_data() then processes each of those files individually and saves transformed versions (e.g., transformed_us.csv).

Generally:

  • One extract task is created for each market (us, europe, asia, africa).
  • Each extract’s output file becomes the input for its corresponding transform task.
  • Airflow handles all the mapping logic automatically, no loops or manual duplication needed.
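One related feature worth knowing: if a mapped task also needs an argument that stays the same for every instance, you can combine .partial() with .expand(). The sketch below is illustrative only; the top_n parameter is hypothetical and not part of our current DAG:

@task
def transform_market_data(raw_file: str, top_n: int):
    ...

# top_n is fixed for every mapped instance; raw_file still varies per market
transformed_files = transform_market_data.partial(top_n=3).expand(raw_file=raw_files)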

Let's redeploy our containers by running docker compose up -d.

You’ll see this clearly in the Graph View, where the DAG fans out into several parallel branches, one per market.

Dynamic Task Mapping

Each branch runs independently, and Airflow retries or logs failures per task as defined in default_args. You’ll notice that there are four task instances, which clearly correspond to the four market regions we processed.

Dynamic Task Mapping (2)

When you click any of the tasks, for example, extract_market_data, and open the logs, you'll notice that the data for the corresponding market region was extracted and saved independently.

Dynamic Task Mapping (3) through (6): per-market extraction logs

Summary and What’s Next

We have built a complete foundation for working with Apache Airflow inside Docker. You learned how to deploy a fully functional Airflow environment using Docker Compose, understand its architecture, and configure it for clean, local development. We explored the Airflow Web UI, and used the TaskFlow API to create our first real workflow, a simple yet powerful ETL pipeline that extracts and transforms data automatically.

By extending it with Dynamic Task Mapping, we saw how Airflow can scale horizontally by processing multiple datasets in parallel, creating independent task instances for each region without duplicating code.

In Part Two, we'll build on this foundation and introduce the Load phase of our ETL pipeline. You'll connect Airflow to a local MySQL database and learn how to configure Connections through the Admin panel and environment variables. We'll also integrate Git and Git Sync to automate DAG deployment and introduce CI/CD pipelines for version-controlled, collaborative Airflow workflows.

By the end of the next part, your environment will evolve from a development sandbox into a production-ready data orchestration system, capable of automating data ingestion, transformation, and loading with full observability, reliability, and continuous integration support.

Generating Embeddings with APIs and Open Models

4 November 2025 at 21:39

In the previous tutorial, you learned that embeddings convert text into numerical vectors that capture semantic meaning. You saw how papers about machine learning, data engineering, and data visualization naturally clustered into distinct groups when we visualized their embeddings. That was the foundation.

But we only worked with 12 handwritten paper abstracts that we typed directly into our code. That approach works great for understanding core concepts, but it doesn't prepare you for real projects. Real applications require processing hundreds or thousands of documents, and you need to make strategic decisions about how to generate those embeddings efficiently.

This tutorial teaches you how to collect documents programmatically and generate embeddings using different approaches. You'll use the arXiv API to gather 500 research papers, then generate embeddings using both local models and cloud services. By comparing these approaches hands-on, you'll understand the tradeoffs and be able to make informed decisions for your own projects.

These techniques form the foundation for production systems, but we're focusing on core concepts with a learning-sized dataset. A real system handling millions of documents would require batching strategies, streaming pipelines, and specialized vector databases. We'll touch on those considerations, but our goal here is to build your intuition about the embedding generation process itself.

Setting Up Your Environment

Before we start collecting data, let's install the libraries we'll need. We'll use the arxiv library to access research papers programmatically, pandas for data manipulation, and the same embedding libraries from the previous tutorial.

This tutorial was developed using Python 3.12.12 with the following library versions. You can use these exact versions for guaranteed compatibility, or install the latest versions (which should work just fine):

# Developed with: Python 3.12.12
# sentence-transformers==5.1.2
# scikit-learn==1.6.1
# matplotlib==3.10.0
# numpy==2.0.2
# arxiv==2.2.0
# pandas==2.2.2
# cohere==5.20.0
# python-dotenv==1.1.1

pip install sentence-transformers scikit-learn matplotlib numpy arxiv pandas cohere python-dotenv

This tutorial works in any Python environment: Jupyter notebooks, Python scripts, VS Code, or your preferred IDE. Run the pip command above in your terminal before starting, then use the Python code blocks throughout this tutorial.

Collecting Research Papers with the arXiv API

arXiv is a repository of over 2 million scholarly papers in physics, mathematics, computer science, and more. Researchers publish cutting-edge work here before it appears in journals, making it a valuable resource for staying current with AI and machine learning research. Best of all, arXiv provides a free API for programmatic access. While they do monitor usage and have some rate limits to prevent abuse, these limits are generous for learning and research purposes. Check their Terms of Use for current guidelines.

We'll use the arXiv API to collect 500 papers from five different computer science categories. This diversity will give us clear semantic clusters when we visualize or search our embeddings. The categories we'll use are:

  • cs.LG (Machine Learning): Core ML algorithms, training methods, and theoretical foundations
  • cs.CV (Computer Vision): Image processing, object detection, and visual recognition
  • cs.CL (Computational Linguistics/NLP): Natural language processing and understanding
  • cs.DB (Databases): Data storage, query optimization, and database systems
  • cs.SE (Software Engineering): Development practices, testing, and software architecture

These categories use distinct vocabularies and will create well-separated clusters in our embedding space. Let's write a function to collect papers from specific arXiv categories:

import arxiv

# Create the arXiv client once and reuse it
# This is recommended by the arxiv package to respect rate limits
client = arxiv.Client()

def collect_arxiv_papers(category, max_results=100):
    """
    Collect papers from arXiv by category.

    Parameters:
    -----------
    category : str
        arXiv category code (e.g., 'cs.LG', 'cs.CV')
    max_results : int
        Maximum number of papers to retrieve

    Returns:
    --------
    list of dict
        List of paper dictionaries containing title, abstract, authors, etc.
    """
    # Construct search query for the category
    search = arxiv.Search(
        query=f"cat:{category}",
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate
    )

    papers = []
    for result in client.results(search):
        paper = {
            'title': result.title,
            'abstract': result.summary,
            'authors': [author.name for author in result.authors],
            'published': result.published,
            'category': category,
            'arxiv_id': result.entry_id.split('/')[-1]
        }
        papers.append(paper)

    return papers

# Define the categories we want to collect from
categories = [
    ('cs.LG', 'Machine Learning'),
    ('cs.CV', 'Computer Vision'),
    ('cs.CL', 'Computational Linguistics'),
    ('cs.DB', 'Databases'),
    ('cs.SE', 'Software Engineering')
]

# Collect 100 papers from each category
all_papers = []
for category_code, category_name in categories:
    print(f"Collecting papers from {category_name} ({category_code})...")
    papers = collect_arxiv_papers(category_code, max_results=100)
    all_papers.extend(papers)
    print(f"  Collected {len(papers)} papers")

print(f"\nTotal papers collected: {len(all_papers)}")

# Let's examine the first paper from each category
separator = "=" * 80
print(f"\n{separator}", "SAMPLE PAPERS (one from each category)", f"{separator}", sep="\n")
for i, (_, category_name) in enumerate(categories):
    paper = all_papers[i * 100]
    print(f"\n{category_name}:")
    print(f"  Title: {paper['title']}")
    print(f"  Abstract (first 150 chars): {paper['abstract'][:150]}...")
    
Collecting papers from Machine Learning (cs.LG)...
  Collected 100 papers
Collecting papers from Computer Vision (cs.CV)...
  Collected 100 papers
Collecting papers from Computational Linguistics (cs.CL)...
  Collected 100 papers
Collecting papers from Databases (cs.DB)...
  Collected 100 papers
Collecting papers from Software Engineering (cs.SE)...
  Collected 100 papers

Total papers collected: 500

================================================================================
SAMPLE PAPERS (one from each category)
================================================================================

Machine Learning:
  Title: Dark Energy Survey Year 3 results: Simulation-based $w$CDM inference from weak lensing and galaxy clustering maps with deep learning. I. Analysis design
  Abstract (first 150 chars): Data-driven approaches using deep learning are emerging as powerful techniques to extract non-Gaussian information from cosmological large-scale struc...

Computer Vision:
  Title: Carousel: A High-Resolution Dataset for Multi-Target Automatic Image Cropping
  Abstract (first 150 chars): Automatic image cropping is a method for maximizing the human-perceived quality of cropped regions in photographs. Although several works have propose...

Computational Linguistics:
  Title: VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks
  Abstract (first 150 chars): LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct an...

Databases:
  Title: Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis
  Abstract (first 150 chars): Natural language interfaces to tabular data must handle ambiguities inherent to queries. Instead of treating ambiguity as a deficiency, we reframe it ...

Software Engineering:
  Title: evomap: A Toolbox for Dynamic Mapping in Python
  Abstract (first 150 chars): This paper presents evomap, a Python package for dynamic mapping. Mapping methods are widely used across disciplines to visualize relationships among ...
  

The code above demonstrates how easy it is to collect papers programmatically. In just a few lines, we've gathered 500 recent research papers from five distinct computer science domains.

Take a look at your output when you run this code. You might notice something interesting: sometimes the same paper title appears under multiple categories. This happens because researchers often cross-list their papers in multiple relevant categories on arXiv. A paper about deep learning for natural language processing could legitimately appear in both Machine Learning (cs.LG) and Computational Linguistics (cs.CL). A paper about neural networks for image generation might be listed in both Machine Learning (cs.LG) and Computer Vision (cs.CV).

While our five categories are conceptually separate, there's naturally some overlap, especially between closely related fields. This real-world messiness is exactly what makes working with actual data more interesting than handcrafted examples. Your specific results will look different from ours because arXiv returns the most recently submitted papers, which change as new research is published.

Preparing Your Dataset

Before generating embeddings, we need to clean and structure our data. Real-world datasets always have imperfections. Some papers might have missing abstracts, others might have abstracts that are too short to be meaningful, and we need to organize everything into a format that's easy to work with.

Let's use pandas to create a DataFrame and handle these data quality issues:

import pandas as pd

# Convert to DataFrame for easier manipulation
df = pd.DataFrame(all_papers)

print("Dataset before cleaning:")
print(f"Total papers: {len(df)}")
print(f"Papers with abstracts: {df['abstract'].notna().sum()}")

# Check for missing abstracts
missing_abstracts = df['abstract'].isna().sum()
if missing_abstracts > 0:
    print(f"\nWarning: {missing_abstracts} papers have missing abstracts")
    df = df.dropna(subset=['abstract'])

# Filter out papers with very short abstracts (less than 100 characters)
# These are often just placeholders or incomplete entries
df['abstract_length'] = df['abstract'].str.len()
df = df[df['abstract_length'] >= 100].copy()

print(f"\nDataset after cleaning:")
print(f"Total papers: {len(df)}")
print(f"Average abstract length: {df['abstract_length'].mean():.0f} characters")

# Show the distribution across categories
print("\nPapers per category:")
print(df['category'].value_counts().sort_index())

# Display the first few entries
separator = "=" * 80
print(f"\n{separator}", "FIRST 3 PAPERS IN CLEANED DATASET", f"{separator}", sep="\n")
for idx, row in df.head(3).iterrows():
    print(f"\n{idx+1}. {row['title']}")
    print(f"   Category: {row['category']}")
    print(f"   Abstract length: {row['abstract_length']} characters")
    
Dataset before cleaning:
Total papers: 500
Papers with abstracts: 500

Dataset after cleaning:
Total papers: 500
Average abstract length: 1337 characters

Papers per category:
category
cs.CL    100
cs.CV    100
cs.DB    100
cs.LG    100
cs.SE    100
Name: count, dtype: int64

================================================================================
FIRST 3 PAPERS IN CLEANED DATASET
================================================================================

1. Dark Energy Survey Year 3 results: Simulation-based $w$CDM inference from weak lensing and galaxy clustering maps with deep learning. I. Analysis design
   Category: cs.LG
   Abstract length: 1783 characters

2. Multi-Method Analysis of Mathematics Placement Assessments: Classical, Machine Learning, and Clustering Approaches
   Category: cs.LG
   Abstract length: 1519 characters

3. Forgetting is Everywhere
   Category: cs.LG
   Abstract length: 1150 characters
   

Data preparation matters because poor quality input leads to poor quality embeddings. By filtering out papers with missing or very short abstracts, we ensure that our embeddings will capture meaningful semantic content. In production systems, you'd likely implement more sophisticated quality checks, but this basic approach handles the most common issues.
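As one example of a more sophisticated check, the sketch below removes duplicates caused by cross-listed papers (the same arXiv ID collected under two categories). We don't run it in this tutorial, since later steps assume 100 papers per category:

# Optional: drop cross-listed duplicates so each arXiv ID appears only once
before = len(df)
df_deduped = df.drop_duplicates(subset="arxiv_id", keep="first").copy()
print(f"Removed {before - len(df_deduped)} cross-listed duplicates")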

Strategy One: Local Open-Source Models

Now we're ready to generate embeddings. Let's start with local models using sentence-transformers, the same approach we used in the previous tutorial. The key advantage of local models is that everything runs on your own machine. There are no API costs, no data leaves your computer, and you have complete control over the embedding process.

We'll use all-MiniLM-L6-v2 again for consistency, and we'll also demonstrate a larger model called all-mpnet-base-v2 to show how different models produce different results:

from sentence_transformers import SentenceTransformer
import numpy as np
import time

# Load the same model from the previous tutorial
print("Loading all-MiniLM-L6-v2 model...")
model_small = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for all abstracts
abstracts = df['abstract'].tolist()

print(f"Generating embeddings for {len(abstracts)} papers...")
start_time = time.time()

# The encode() method handles batching automatically
embeddings_small = model_small.encode(
    abstracts,
    show_progress_bar=True,
    batch_size=32  # Process 32 abstracts at a time
)

elapsed_time = time.time() - start_time

print(f"\nCompleted in {elapsed_time:.2f} seconds")
print(f"Embedding shape: {embeddings_small.shape}")
print(f"Each abstract is now a {embeddings_small.shape[1]}-dimensional vector")
print(f"Average time per abstract: {elapsed_time/len(abstracts):.3f} seconds")

# Add embeddings to our DataFrame
df['embedding_minilm'] = list(embeddings_small)
Loading all-MiniLM-L6-v2 model...
Generating embeddings for 500 papers...
Batches: 100%|██████████| 16/16 [01:05<00:00,  4.09s/it]

Completed in 65.45 seconds
Embedding shape: (500, 384)
Each abstract is now a 384-dimensional vector
Average time per abstract: 0.131 seconds

That was fast! On a typical laptop, we generated embeddings for 500 abstracts in about 65 seconds. Now let's try a larger, more powerful model to see the difference.

Spoiler alert: this will take several more minutes than the last one, so you may want to freshen up your coffee while it's running:

# Load a larger (more dimensions) model
print("\nLoading all-mpnet-base-v2 model...")
model_large = SentenceTransformer('all-mpnet-base-v2')

print("Generating embeddings with larger model...")
start_time = time.time()

embeddings_large = model_large.encode(
    abstracts,
    show_progress_bar=True,
    batch_size=32
)

elapsed_time = time.time() - start_time

print(f"\nCompleted in {elapsed_time:.2f} seconds")
print(f"Embedding shape: {embeddings_large.shape}")
print(f"Each abstract is now a {embeddings_large.shape[1]}-dimensional vector")
print(f"Average time per abstract: {elapsed_time/len(abstracts):.3f} seconds")

# Add these embeddings to our DataFrame too
df['embedding_mpnet'] = list(embeddings_large)
Loading all-mpnet-base-v2 model...
Generating embeddings with larger model...
Batches: 100%|██████████| 16/16 [11:20<00:00, 30.16s/it]

Completed in 680.47 seconds
Embedding shape: (500, 768)
Each abstract is now a 768-dimensional vector
Average time per abstract: 1.361 seconds

Notice the differences between these two models:

  • Dimensionality: The smaller model produces 384-dimensional embeddings, while the larger model produces 768-dimensional embeddings. More dimensions can capture more nuanced semantic information.
  • Speed: The smaller model is about 10 times faster. For 500 papers, that's a difference of about 10 minutes. For thousands of documents, this difference becomes significant.
  • Quality: Larger models generally produce higher-quality embeddings that better capture subtle semantic relationships. However, the smaller model is often good enough for many applications.

The key insight here is that local models give you flexibility. You can choose models that balance quality, speed, and computational resources based on your specific needs. For rapid prototyping, use smaller models. For production systems where quality matters most, use larger models.
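If you want a quick numerical comparison of the two models, you can check how similarly they score the same pair of abstracts. This is just a spot check using the arrays we already created; the choice of the first two abstracts is arbitrary:

from sklearn.metrics.pairwise import cosine_similarity

# Compare how each model scores the similarity of the first two abstracts
sim_small = cosine_similarity([embeddings_small[0]], [embeddings_small[1]])[0][0]
sim_large = cosine_similarity([embeddings_large[0]], [embeddings_large[1]])[0][0]
print(f"MiniLM similarity: {sim_small:.3f}")
print(f"mpnet similarity:  {sim_large:.3f}")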

Visualizing Real-World Embeddings

In our previous tutorial, we saw beautifully separated clusters using handcrafted paper abstracts. Let's see what happens when we visualize embeddings from real arXiv papers. We'll use the same PCA approach to reduce our 384-dimensional embeddings down to 2D:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Reduce embeddings from 384 dimensions to 2 dimensions
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings_small)

print(f"Original embedding dimensions: {embeddings_small.shape[1]}")
print(f"Reduced embedding dimensions: {embeddings_2d.shape[1]}")
Original embedding dimensions: 384
Reduced embedding dimensions: 2

Now let's create a visualization showing how our 500 papers cluster by category:

# Create the visualization
plt.figure(figsize=(12, 8))

# Define colors for different categories
colors = ['#C8102E', '#003DA5', '#00843D', '#FF8200', '#6A1B9A']
category_names = ['Machine Learning', 'Computer Vision', 'Comp. Linguistics', 'Databases', 'Software Eng.']
category_codes = ['cs.LG', 'cs.CV', 'cs.CL', 'cs.DB', 'cs.SE']

# Plot each category
for i, (cat_code, cat_name, color) in enumerate(zip(category_codes, category_names, colors)):
    # Get papers from this category
    mask = df['category'] == cat_code
    cat_embeddings = embeddings_2d[mask]

    plt.scatter(cat_embeddings[:, 0], cat_embeddings[:, 1],
                c=color, label=cat_name, s=50, alpha=0.6, edgecolors='black', linewidth=0.5)

plt.xlabel('First Principal Component', fontsize=12)
plt.ylabel('Second Principal Component', fontsize=12)
plt.title('500 arXiv Papers Across Five Computer Science Categories\n(Real-world embeddings show overlapping clusters)',
          fontsize=14, fontweight='bold', pad=20)
plt.legend(loc='best', fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Visualization of embeddings for 500 arXiv Papers Across Five Computer Science Categories

The visualization above reveals an important aspect of real-world data. Unlike our handcrafted examples in the previous tutorial, where clusters were perfectly separated, these real arXiv papers show more overlap. You can see clear groupings, as well as papers that bridge multiple topics. For example, a paper about "deep learning for database query optimization" uses vocabulary from both machine learning and databases, so it might appear between those clusters.

This is exactly what you'll encounter in production systems. Real data is messy, topics overlap, and semantic boundaries are often fuzzy rather than sharp. The embeddings are still capturing meaningful relationships, but the visualization shows the complexity of actual research papers rather than the idealized examples we used for learning.

Strategy Two: API-Based Embedding Services

Local models work great, but they require computational resources and you're responsible for managing them. API-based embedding services offer an alternative approach. You send your text to a cloud provider, they generate embeddings using their infrastructure, and they send the embeddings back to you.

We'll use Cohere's API for our main example because they offer a generous free trial tier that doesn't require payment information. This makes it perfect for learning and experimentation.

Setting Up Cohere Securely

First, you'll need to create a free Cohere account and get an API key:

  1. Visit Cohere's registration page
  2. Sign up for a free account (no credit card required)
  3. Navigate to the API Keys section in your dashboard
  4. Copy your Trial API key

Important security practice: Never hardcode API keys directly in your notebooks or scripts. Store them in a .env file instead. This prevents accidentally sharing sensitive credentials when you share your code.

Create a file named .env in your project directory with the following entry:

COHERE_API_KEY=your_key_here

Important: Add .env to your .gitignore file to prevent committing it to version control.

Now load your API key securely:

from dotenv import load_dotenv
import os
from cohere import ClientV2
import time

# Load environment variables from .env file
load_dotenv()

# Access your API key
cohere_api_key = os.getenv('COHERE_API_KEY')

if not cohere_api_key:
    raise ValueError(
        "COHERE_API_KEY not found. Please create a .env file with your API key.\n"
        "See https://dashboard.cohere.com for instructions on getting your key."
    )

# Initialize the Cohere client using the V2 API
co = ClientV2(api_key=cohere_api_key)
print("API key loaded successfully from environment")
API key loaded successfully from environment

Now let's generate embeddings using the Cohere API. Here's something we discovered through trial and error: when we first ran this code without delays, we hit Cohere's rate limit and got a 429 TooManyRequestsError with this message: "trial token rate limit exceeded, limit is 100000 tokens per minute."

This exposes an important lesson about working with APIs. Rate limits aren't always clearly documented upfront. Sometimes you discover them by running into them, then you have to dig through the error responses in the documentation to understand what happened. In this case, we found the details in Cohere's error responses documentation. You can also check their rate limits page for current limits, though specifics for free tier accounts aren't always listed there.

With 500 papers averaging around 1,337 characters each, we can easily exceed 100,000 tokens per minute if we send batches too quickly. So we've built in two safeguards: a 12-second delay between batches to stay under the limit, and retry logic in case we do hit it. This makes the process take about 60-70 seconds instead of the 6-8 seconds the API actually needs, but it's reliable and won't throw errors mid-process.

Think of it as the tradeoff for using a free tier: we get access to powerful models without paying, but we work within some constraints. Let's see it in action:

print("Generating embeddings using Cohere API...")
print(f"Processing {len(abstracts)} abstracts...")

start_time = time.time()
actual_api_time = 0  # Track time spent on actual API calls

# Cohere recommends processing in batches for efficiency
# Their API accepts up to 96 texts per request
batch_size = 90
all_embeddings = []

for i in range(0, len(abstracts), batch_size):
    batch = abstracts[i:i+batch_size]
    batch_num = i//batch_size + 1
    total_batches = (len(abstracts) + batch_size - 1) // batch_size
    print(f"Processing batch {batch_num}/{total_batches} ({len(batch)} abstracts)...")

    # Add retry logic for rate limits
    max_retries = 3
    retry_delay = 60  # Wait 60 seconds if we hit rate limit

    for attempt in range(max_retries):
        try:
            # Track actual API call time
            api_start = time.time()

            # Generate embeddings for this batch using V2 API
            response = co.embed(
                texts=batch,
                model='embed-v4.0',
                input_type='search_document',
                embedding_types=['float']
            )

            actual_api_time += time.time() - api_start
            # V2 API returns embeddings in a different structure
            all_embeddings.extend(response.embeddings.float_)
            break  # Success, move to next batch

        except Exception as e:
            if "rate limit" in str(e).lower() and attempt < max_retries - 1:
                print(f"  Rate limit hit. Waiting {retry_delay} seconds before retry...")
                time.sleep(retry_delay)
            else:
                raise  # Re-raise if it's not a rate limit error or we're out of retries

    # Add a delay between batches to avoid hitting rate limits
    # Wait 12 seconds between batches (spreads 500 papers over ~1 minute)
    if i + batch_size < len(abstracts):  # Don't wait after the last batch
        time.sleep(12)

# Convert to numpy array for consistency with local models
embeddings_cohere = np.array(all_embeddings)
elapsed_time = time.time() - start_time

print(f"\nCompleted in {elapsed_time:.2f} seconds (includes rate limit delays)")
print(f"Actual API processing time: {actual_api_time:.2f} seconds")
print(f"Time spent waiting for rate limits: {elapsed_time - actual_api_time:.2f} seconds")
print(f"Embedding shape: {embeddings_cohere.shape}")
print(f"Each abstract is now a {embeddings_cohere.shape[1]}-dimensional vector")
print(f"Average time per abstract (API only): {actual_api_time/len(abstracts):.3f} seconds")

# Add to DataFrame
df['embedding_cohere'] = list(embeddings_cohere)
Generating embeddings using Cohere API...
Processing 500 abstracts...
Processing batch 1/6 (90 abstracts)...
Processing batch 2/6 (90 abstracts)...
Processing batch 3/6 (90 abstracts)...
Processing batch 4/6 (90 abstracts)...
Processing batch 5/6 (90 abstracts)...
Processing batch 6/6 (50 abstracts)...

Completed in 87.23 seconds (includes rate limit delays)
Actual API processing time: 27.18 seconds
Time spent waiting for rate limits: 60.05 seconds
Embedding shape: (500, 1536)
Each abstract is now a 1536-dimensional vector
Average time per abstract (API only): 0.054 seconds

Notice the timing breakdown? The actual API processing was quite fast (around 27 seconds), but we spent most of our time waiting between batches to respect rate limits (around 60 seconds). This is the reality of free-tier accounts: they're fantastic for learning and prototyping, but come with constraints. Paid tiers would give us much higher limits and let us process at full speed.

Something else worth noting: Cohere's embeddings are 1536-dimensional, which is 4x larger than our small local model (384 dimensions) and 2x larger than our large local model (768 dimensions). Yet the API processing was still faster than our small local model. This demonstrates the power of specialized infrastructure. Cohere runs optimized hardware designed specifically for embedding generation at scale, while our local models run on general-purpose computers. Higher dimensions don't automatically mean slower processing when you have the right infrastructure behind them.

For this tutorial, Cohere’s free tier works perfectly. We're focusing on understanding the concepts and comparing approaches, not optimizing for production speed. The key differences from local models:

  • No local computation: All processing happens on Cohere's servers, so it works equally well on any hardware.
  • Internet dependency: Requires an active internet connection to work.
  • Rate limits: Free tier accounts have token-per-minute limits, which is why we added delays between batches.

Other API Options

While we're using Cohere for this tutorial, you should know about other popular embedding APIs:

OpenAI offers excellent embedding models, but requires payment information upfront. If you have an OpenAI account, their text-embedding-3-small model is very affordable at $0.02 per 1M tokens. You can find setup instructions in their embeddings documentation.

Together AI provides access to many open-source models through their API. They offer models like BAAI/bge-large-en-v1.5 and detailed documentation in their embeddings guide. Note that their rate limit tiers are subject to change, so be sure to check their rate limit documentation to determine the tier you'll need based on your needs.

The choice between these services depends on your priorities. OpenAI has excellent quality but requires payment setup. Together AI offers many model choices and different paid tiers. Cohere has a truly free tier for learning and prototyping.

Comparing Your Options

Now that we've generated embeddings using both local models and an API service, let's think about how to choose between these approaches for real projects. The decision isn't about one being universally better than the other. It's about matching the approach to your specific constraints and requirements.

To clarify terminology: "self-hosted models" means running models on infrastructure you control, whether that's your laptop for learning or your company's cloud servers for production. "API services" means using third-party providers like Cohere or OpenAI where you send data to their servers for processing.

Cost

  • Self-hosted models: Zero ongoing costs after initial setup. Best for high-volume applications where you'll generate embeddings frequently.
  • API services: Pay-per-use pricing per 1M tokens. Cohere: $0.12 per 1M tokens. OpenAI: $0.13 per 1M tokens. Best for low to moderate volume, or when you want predictable costs without infrastructure.

Performance

  • Self-hosted models: Speed depends on your hardware. Our results: 0.131 seconds per abstract (small model), 1.361 seconds per abstract (large model). Best for batch processing or when you control the infrastructure.
  • API services: Speed depends on internet connection and API server load. Our results: 0.054 seconds per abstract (Cohere). Includes network latency and third-party infrastructure considerations. Best when you don't have powerful local hardware or need access to the latest models.

Privacy

  • Self-hosted models: All data stays on your infrastructure. Complete control over data handling. No data sent to third parties. Best for sensitive data, healthcare, financial services, or when compliance requires data locality.
  • API services: Data is sent to third-party servers for processing. Subject to the provider's data handling policies. Cohere states that API data isn't used for training (verify current policy). Best for non-sensitive data, or when provider policies meet your requirements.

Customization

  • Self-hosted models: Can fine-tune models on your specific domain. Full control over model selection and updates. Can modify inference parameters. Best for specialized domains, custom requirements, or when you need reproducibility.
  • API services: Limited to provider's available models. Model updates happen on provider's schedule. Less control over inference details. Best for general-purpose applications, or when using the latest models matters more than control.

Infrastructure

  • Self-hosted models: Requires managing infrastructure. Whether running on your laptop or company cloud servers, you handle model updates, dependencies, and scaling. Best for organizations with existing ML infrastructure or when infrastructure control is important.
  • API services: No infrastructure management needed. Automatic scaling to handle load. Provider manages updates and availability. Best for smaller teams, rapid prototyping, or when you want to focus on application logic rather than infrastructure.

When to Use Each Approach

Here's a practical decision guide to help you choose the right approach for your project:

Choose Self-Hosted Models when you:

  • Process large volumes of text regularly
  • Work with sensitive or regulated data
  • Need offline capability
  • Have existing ML infrastructure (whether local or cloud-based)
  • Want to fine-tune models for your domain
  • Need complete control over the deployment

Choose API Services when you:

  • Are just getting started or prototyping
  • Have unpredictable or variable workload
  • Want to avoid infrastructure management
  • Need automatic scaling
  • Prefer the latest models without maintenance
  • Value simplicity over infrastructure control

For our tutorial series, we've used both approaches to give you hands-on experience with each. In our next tutorial, we'll use the Cohere embeddings for our semantic search implementation. We're choosing Cohere because they offer a generous free tier for learning (no payment required), their models are well-suited for semantic search tasks, and they work consistently across different hardware setups.

In practice, you'd evaluate embedding quality by testing on your specific use case: generate embeddings with different models, run similarity searches on sample queries, and measure which model returns the most relevant results for your domain.
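A minimal sketch of that kind of evaluation, using the MiniLM embeddings we already generated (the query text and top_k value are arbitrary choices for illustration):

from sklearn.metrics.pairwise import cosine_similarity

def top_matches(query, model, doc_embeddings, top_k=5):
    """Return the top_k papers most similar to a free-text query."""
    query_embedding = model.encode([query])
    scores = cosine_similarity(query_embedding, doc_embeddings)[0]
    best = scores.argsort()[::-1][:top_k]
    return df.iloc[best][["title", "category"]].assign(score=scores[best])

# Spot check: does the model surface database papers for a database-flavored query?
print(top_matches("query optimization for relational databases", model_small, embeddings_small))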

Storing Your Embeddings

We've generated embeddings using multiple methods, and now we need to save them for future use. Storing embeddings properly is important because generating them can be time-consuming and potentially costly. You don't want to regenerate embeddings every time you run your code.

Let's explore two storage approaches:

Option 1: CSV with Numpy Arrays

This approach works well for learning and small-scale prototyping:

# Save the metadata to CSV (without embeddings, which are large arrays)
df_metadata = df[['title', 'abstract', 'authors', 'published', 'category', 'arxiv_id', 'abstract_length']]
df_metadata.to_csv('arxiv_papers_metadata.csv', index=False)
print("Saved metadata to 'arxiv_papers_metadata.csv'")

# Save embeddings as numpy arrays
np.save('embeddings_minilm.npy', embeddings_small)
np.save('embeddings_mpnet.npy', embeddings_large)
np.save('embeddings_cohere.npy', embeddings_cohere)
print("Saved embeddings to .npy files")

# Later, you can load them back like this:
# df_loaded = pd.read_csv('arxiv_papers_metadata.csv')
# embeddings_loaded = np.load('embeddings_cohere.npy')
Saved metadata to 'arxiv_papers_metadata.csv'
Saved embeddings to .npy files

This approach is simple and transparent, making it perfect for learning and experimentation. However, it has significant limitations for larger datasets:

  • Loading all embeddings into memory doesn't scale beyond a few thousand documents
  • No indexing for fast similarity search
  • Manual coordination between CSV metadata and numpy arrays

For production systems with thousands or millions of embeddings, you'll want specialized vector databases (Option 2) that handle indexing, similarity search, and efficient storage automatically.

Option 2: Preparing for Vector Databases

In production systems, you'll likely store embeddings in a specialized vector database like Pinecone, Weaviate, or Chroma. These databases are optimized for similarity search. While we'll cover vector databases in detail in another tutorial series, here's how you'd structure your data for them:

# Prepare data in a format suitable for vector databases
# Most vector databases want: ID, embedding vector, and metadata

vector_db_data = []
for idx, row in df.iterrows():
    vector_db_data.append({
        'id': row['arxiv_id'],
        'embedding': row['embedding_cohere'].tolist(),  # Convert numpy array to list
        'metadata': {
            'title': row['title'],
            'abstract': row['abstract'][:500],  # Many DBs limit metadata size
            'authors': ', '.join(row['authors'][:3]),  # First 3 authors
            'category': row['category'],
            'published': str(row['published'])
        }
    })

# Save in JSON format for easy loading into vector databases
import json
with open('arxiv_papers_vector_db_format.json', 'w') as f:
    json.dump(vector_db_data, f, indent=2)
print("Saved data in vector database format to 'arxiv_papers_vector_db_format.json'")

print(f"\nTotal storage sizes:")
print(f"  Metadata CSV: ~{os.path.getsize('arxiv_papers_metadata.csv')/1024:.1f} KB")
print(f"  JSON for vector DB: ~{os.path.getsize('arxiv_papers_vector_db_format.json')/1024:.1f} KB")
Saved data in vector database format to 'arxiv_papers_vector_db_format.json'

Total storage sizes:
  Metadata CSV: ~764.6 KB
  JSON for vector DB: ~15051.0 KB
  

Each storage method has its purpose:

  • CSV + numpy: Best for learning and small-scale experimentation
  • JSON for vector databases: Best for production systems that need efficient similarity search

Preparing for Semantic Search

You now have 500 research papers from five distinct computer science domains with embeddings that capture their semantic meaning. These embeddings are vectors, which means we can measure how similar or different they are using mathematical distance calculations.

In the next tutorial, you'll use these embeddings to build a search system that finds relevant papers based on meaning rather than keywords. You'll implement similarity calculations, rank results, and see firsthand how semantic search outperforms traditional keyword matching.

Save your embeddings now, especially the Cohere embeddings, since we'll use those in the next tutorial to build our search system. We chose Cohere because its embeddings don't depend on your local hardware, giving everyone the same baseline for implementing similarity calculations.

Next Steps

Before we move on, try these experiments to deepen your understanding:

Experiment with different arXiv categories:

  • Try collecting papers from categories like stat.ML (Statistics Machine Learning) or math.OC (Optimization and Control)
  • Use the PCA visualization code to see how these new categories cluster with your existing five
  • Do some categories overlap more than others?

Compare embedding models visually:

  • Generate embeddings for your dataset using all-mpnet-base-v2
  • Create side-by-side PCA visualizations comparing the small model and large model
  • Do the clusters look tighter or more separated with the larger model?

Test different dataset sizes:

  • Collect just 50 papers per category (250 total) and visualize the results
  • Then try 200 papers per category (1000 total)
  • How does dataset size affect the clarity of the clusters?
  • At what point does collection or processing time become noticeable?

Explore model differences:

  • You now have three sets of embeddings for the same papers: MiniLM (384 dimensions), mpnet (768 dimensions), and Cohere (1536 dimensions)
  • Pick a few sample queries, compute similarities against each set, and compare which papers each model surfaces as most similar
  • Do the three models agree on the closest matches, or do they rank papers differently?

Ready to implement similarity search and build a working semantic search engine? The next tutorial will show you how to turn these embeddings into a powerful research discovery tool.


Key Takeaways:

  • Programmatic data collection through APIs like arXiv enables working with real-world datasets
  • Collecting papers from diverse categories (cs.LG, cs.CV, cs.CL, cs.DB, cs.SE) creates semantic clusters for effective search
  • Papers can be cross-listed in multiple arXiv categories, creating natural overlap between related fields
  • Self-hosted embedding models provide zero-cost, private embedding generation with full control over the process
  • API-based embedding services offer high-quality embeddings without infrastructure management
  • Secure credential handling using .env files protects sensitive API keys and tokens
  • Rate limits aren't always clearly documented and are sometimes discovered through trial and error
  • The choice between self-hosted and API approaches depends on cost, privacy, scale, and infrastructure considerations
  • Free tier APIs provide powerful embedding generation for learning, but require handling rate limits and delays that paid tiers avoid
  • Real-world embeddings show more overlap than handcrafted examples, reflecting the complexity of actual data
  • Proper storage of embeddings prevents costly regeneration and enables efficient reuse across projects

Ubuntu 25.04 Users Can Now Upgrade to Ubuntu 25.10, Here’s How

29 October 2025 at 22:09

Ubuntu 25.10 Upgrade

A step-by-step and easy-to-follow tutorial (with screenshots) on how to upgrade your Ubuntu 25.04 installations to Ubuntu 25.10.


Setting Up Your Data Engineering Lab with Docker

28 October 2025 at 21:57

This guide helps you set up a clean, isolated environment for running Dataquest tutorials. While many tutorials work fine directly on your computer, some (particularly those involving data processing tools like PySpark) can run into issues depending on your operating system or existing software setup. The lab environment we'll create ensures everything runs consistently, with the right versions of Python and other tools, without affecting your main system.

What's a Lab Environment?

You can think of this "lab" as a separate workspace just for your Dataquest tutorials. It's a controlled space where you can experiment and test code without affecting your main computer. Just like scientists use labs for experiments, we'll use this development lab to work through tutorials safely.

Benefits for everyone:

  • Windows/Mac users: Avoid errors from system differences. No more "command not found" or PySpark failing to find files
  • Linux users: Get the exact versions of Python and Java needed for tutorials, without conflicting with your system's packages
  • Everyone: Keep your tutorial work separate from personal projects. Your code and files are saved normally, but any packages you install or system changes you make stay contained in the lab

We'll use a tool called Docker to create this isolated workspace. Think of it as having a dedicated computer just for tutorials inside your regular computer. Your files and code are saved just like normal (you can edit them with your favorite editor), but the tutorial environment itself stays clean and separate from everything else on your system.

The lab command you'll use creates this environment, and it mirrors real data engineering workflows (most companies use isolated environments like this to ensure consistency across their teams).

Installing Docker

Docker creates isolated Linux environments on any operating system. This means you'll get a consistent Linux workspace whether you're on Windows, Mac, or even Linux itself. We're using it as a simple tool, so no container orchestration or cloud deployment knowledge is needed.

On Windows:
Download Docker Desktop from docker.com/products/docker-desktop. Run the installer, restart your computer when prompted, and open Docker Desktop. You'll see a whale icon in your system tray when it's running.

Note: Docker Desktop will automatically enable required Windows features like WSL 2. If you see an error about virtualization, you may need to enable it in your computer's BIOS settings. Search online for your computer model + "enable virtualization" for specific steps.

On Mac:
Download Docker Desktop for your chip type (Intel or Apple Silicon) from the same link. Drag Docker to your Applications folder and launch it. You'll see the whale in your menu bar.

On Linux:
You probably already have Docker, but if not, run this command in your terminal:

curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh

Verify it works:
Open your terminal (PowerShell, Terminal, or bash) and run:

docker --version
docker compose version

You should see version numbers for both commands. If you see "command not found," restart your terminal or computer and try again.

Getting the Lab Environment

The lab is already configured in the Dataquest tutorials repository. Clone or download it:

git clone https://github.com/dataquestio/tutorials.git
cd tutorials

If you don't have git, download the repository as a ZIP file from GitHub and extract it.

The repository includes everything you need:

  • Dockerfile - Configures the Linux environment with Python 3.11 and Java (for Spark)
  • docker-compose.yml - Defines the lab setup
  • Tutorial folders with all the code and data

Starting Your Lab

In your IDE’s terminal, ensure you’re in the tutorials folder and start the lab:

docker compose run --rm lab

Note that the first time you run this command, the setup may take 2-5 minutes.

You're now in Linux! Your prompt will change to something like root@abc123:/tutorials#, which is your Linux command line where everything will work as expected.

The --rm flag means the lab cleans itself up when you exit, keeping your system tidy.

Using Your Lab

Once you’re in the lab environment, here's your typical workflow:

1. Navigate to the tutorial you're working on

# See all available tutorials
ls

# Enter a specific tutorial
cd pyspark-etl

2. Install packages as needed
Each tutorial might need different packages:

# For PySpark tutorials
pip install pyspark

# For data manipulation tutorials
pip install pandas numpy

# For database connections
pip install sqlalchemy psycopg2-binary

3. Run the tutorial code

python <script-name>.py

Because the code will run within a standardized Linux environment, you shouldn’t run into setup errors.

4. Edit files normally
The beauty of this setup: you can still use your favorite editor! The tutorials folder on your computer is synchronized with the lab. Edit files in VS Code, PyCharm, or any editor, and the lab sees changes immediately.

5. Exit when done
Type exit or press Ctrl+D to leave the lab. The environment cleans itself up automatically.

Common Workflow Examples

Running a PySpark tutorial:

docker compose run --rm lab
cd pyspark-etl
pip install pyspark pandas
python main.py

Working with Jupyter notebooks:

docker compose run --rm -p 8888:8888 lab
pip install jupyterlab
jupyter lab --ip=0.0.0.0 --allow-root --no-browser
# Open the URL it shows in your browser

Keeping packages installed between sessions:
If you're tired of reinstalling packages, create a requirements file:

# After installing packages, save them
pip freeze > requirements.txt

# Next session, restore them
pip install -r requirements.txt

Quick Reference

The one command you need:

# From the tutorials folder
docker compose run --rm lab

Exit the lab:

exit # Or press Ctrl+D

Where things are:

  • Tutorial code: Each folder in /tutorials
  • Your edits: Automatically synchronized
  • Data files: In each tutorial's data/ folder
  • Output files: Save to the tutorial folder to see them on your computer

Adding services (databases, etc.):
For tutorials needing PostgreSQL, MongoDB, or other services, we can extend the docker-compose.yml. For now, the base setup handles all Python and PySpark tutorials.

Troubleshooting

  • "Cannot connect to Docker daemon"

    • Docker Desktop needs to be running. Start it from your applications.
  • "docker compose" not recognized

    • Older Docker versions use docker-compose (with a hyphen). Try:

      docker-compose run --rm lab
  • Slow performance on Windows

    • Docker on Windows can be slow with large datasets. For better performance, store data files in the container rather than the mounted folder.
  • "Permission denied" on Linux

    • Add your user to the docker group:

      sudo usermod -aG docker $USER

      Then log out and back in.

You're Ready

You now have a Linux lab environment that matches production systems. Happy experimenting!

Understanding, Generating, and Visualizing Embeddings

27 October 2025 at 23:01

Imagine you're searching through a massive library of data science papers looking for content about "cleaning messy datasets." A traditional keyword search returns papers that literally contain those exact words. But it completely misses an excellent paper about "handling missing values and duplicates" and another about "data validation techniques." Even though these papers teach exactly what you're looking for, you'll never see them because they're using different words.

This is the fundamental problem with keyword-based searches: they match words, not meaning. When you search for "neural network training," it won't connect you to papers about "optimizing deep learning models" or "improving model convergence," despite these being essentially the same topic.

Embeddings solve this by teaching machines to understand meaning instead of just matching text. And if you're serious about building AI systems, generating embeddings is a fundamental concept you need to master.

What Are Embeddings?

Embeddings are numerical representations that capture semantic meaning. Instead of treating text as a collection of words to match, embeddings convert text into vectors (a list of numbers) where similar meanings produce similar vectors. Think of it like translating human language into a mathematical language that computers can understand and compare.

When we convert two pieces of text that mean similar things into embeddings, those embedding vectors will be mathematically close to each other in the embedding space. Think of the embedding space as a multi-dimensional map where meaning determines location. Papers about machine learning will cluster together. Papers about data cleaning will form their own group. And papers about data visualization? They'll gather in a completely different region. In a moment, we'll create a visualization that clearly demonstrates this.

Setting Up Your Environment

Before we start working directly with embeddings, let's install the libraries we'll need. We'll use sentence-transformers from Hugging Face to generate embeddings, sklearn for dimensionality reduction, matplotlib for visualization, and numpy to handle the numerical arrays we'll be working with.

This tutorial was developed using Python 3.12.12 with the following library versions. You can use these exact versions for guaranteed compatibility, or install the latest versions (which should work just fine):

# Developed with: Python 3.12.12
# sentence-transformers==5.1.1
# scikit-learn==1.6.1
# matplotlib==3.10.0
# numpy==2.0.2

pip install sentence-transformers scikit-learn matplotlib numpy

Run the command above in your terminal to install all required libraries. This will work whether you're using a Python script, Jupyter notebook, or any other development environment.

For this tutorial series, we'll work with research paper abstracts from arXiv.org, a repository where researchers publish cutting-edge AI and machine learning papers. If you're building AI systems, arXiv is a great resource to know: it's where you'll find the latest research on new architectures, techniques, and approaches that you can apply in your own projects.

arXiv is pronounced "archive" because the X represents the Greek letter chi ⟨χ⟩.

For this tutorial, we've manually created 12 abstracts for papers spanning machine learning, data engineering, and data visualization. These abstracts are stored directly in our code as Python strings, keeping things simple for now. We'll work with APIs and larger datasets in the next tutorial to automate this process.

# Abstracts from three data science domains
papers = [
    # Machine Learning Papers
    {
        'title': 'Building Your First Neural Network with PyTorch',
        'abstract': '''Learn how to construct and train a neural network from scratch using PyTorch. This paper covers the fundamentals of defining layers, activation functions, and forward propagation. You'll build a multi-layer perceptron for classification tasks, understand backpropagation, and implement gradient descent optimization. By the end, you'll have a working model that achieves over 90% accuracy on the MNIST dataset.'''
    },
    {
        'title': 'Preventing Overfitting: Regularization Techniques Explained',
        'abstract': '''Overfitting is one of the most common challenges in machine learning. This guide explores practical regularization methods including L1 and L2 regularization, dropout layers, and early stopping. You'll learn how to detect overfitting by monitoring training and validation loss, implement regularization in both scikit-learn and TensorFlow, and tune regularization hyperparameters to improve model generalization on unseen data.'''
    },
    {
        'title': 'Hyperparameter Tuning with Grid Search and Random Search',
        'abstract': '''Selecting optimal hyperparameters can dramatically improve model performance. This paper demonstrates systematic approaches to hyperparameter optimization using grid search and random search. You'll learn how to define hyperparameter spaces, implement cross-validation during tuning, and use scikit-learn's GridSearchCV and RandomizedSearchCV. We'll compare both methods and discuss when to use each approach for efficient model optimization.'''
    },
    {
        'title': 'Transfer Learning: Using Pre-trained Models for Image Classification',
        'abstract': '''Transfer learning lets you leverage pre-trained models to solve new problems with limited data. This paper shows how to use pre-trained convolutional neural networks like ResNet and VGG for custom image classification tasks. You'll learn how to freeze layers, fine-tune network weights, and adapt pre-trained models to your specific domain. We'll build a classifier that achieves high accuracy with just a few hundred training images.'''
    },

    # Data Engineering/ETL Papers
    {
        'title': 'Handling Missing Data: Strategies and Best Practices',
        'abstract': '''Missing data can derail your analysis if not handled properly. This comprehensive guide covers detection methods for missing values, statistical techniques for understanding missingness patterns, and practical imputation strategies. You'll learn when to use mean imputation, forward fill, and more sophisticated approaches like KNN imputation. We'll work through real datasets with missing values and implement robust solutions using pandas.'''
    },
    {
        'title': 'Data Validation Techniques for ETL Pipelines',
        'abstract': '''Building reliable data pipelines requires thorough validation at every stage. This paper teaches you how to implement data quality checks, define validation rules, and catch errors before they propagate downstream. You'll learn schema validation, outlier detection, and referential integrity checks. We'll build a validation framework using Great Expectations and integrate it into an automated ETL workflow for production data systems.'''
    },
    {
        'title': 'Cleaning Messy CSV Files: A Practical Guide',
        'abstract': '''Real-world CSV files are rarely clean and analysis-ready. This hands-on paper walks through common data quality issues: inconsistent formatting, duplicate records, invalid entries, and encoding problems. You'll master pandas techniques for standardizing column names, removing duplicates, handling date parsing errors, and dealing with mixed data types. We'll transform a messy CSV with multiple quality issues into a clean dataset ready for analysis.'''
    },
    {
        'title': 'Building Scalable ETL Workflows with Apache Airflow',
        'abstract': '''Apache Airflow helps you build, schedule, and monitor complex data pipelines. This paper introduces Airflow's core concepts including DAGs, operators, and task dependencies. You'll learn how to define pipeline workflows, implement retry logic and error handling, and schedule jobs for automated execution. We'll build a complete ETL pipeline that extracts data from APIs, transforms it, and loads it into a data warehouse on a daily schedule.'''
    },

    # Data Visualization Papers
    {
        'title': 'Creating Interactive Dashboards with Plotly Dash',
        'abstract': '''Interactive dashboards make data exploration intuitive and engaging. This paper teaches you how to build web-based dashboards using Plotly Dash. You'll learn to create interactive charts with dropdowns, sliders, and date pickers, implement callbacks for dynamic updates, and design responsive layouts. We'll build a complete dashboard for exploring sales data with filters, multiple chart types, and real-time updates.'''
    },
    {
        'title': 'Matplotlib Best Practices: Making Publication-Quality Plots',
        'abstract': '''Creating clear, professional visualizations requires attention to design principles. This guide covers matplotlib best practices for publication-quality plots. You'll learn about color palette selection, font sizing and typography, axis formatting, and legend placement. We'll explore techniques for reducing chart clutter, choosing appropriate chart types, and creating consistent styling across multiple plots for research papers and presentations.'''
    },
    {
        'title': 'Data Storytelling: Designing Effective Visualizations',
        'abstract': '''Good visualizations tell a story and guide viewers to insights. This paper focuses on the principles of visual storytelling and effective chart design. You'll learn how to choose the right visualization for your data, apply pre-attentive attributes to highlight key information, and structure narratives through sequential visualizations. We'll analyze both effective and ineffective visualizations, discussing what makes certain design choices successful.'''
    },
    {
        'title': 'Building Real-Time Visualization Streams with Bokeh',
        'abstract': '''Visualizing streaming data requires specialized techniques and tools. This paper demonstrates how to create real-time updating visualizations using Bokeh. You'll learn to implement streaming data sources, update plots dynamically as new data arrives, and optimize performance for continuous updates. We'll build a live monitoring dashboard that displays streaming sensor data with smoothly updating line charts and real-time statistics.'''
    }
]

print(f"Loaded {len(papers)} paper abstracts")
print(f"Topics covered: Machine Learning, Data Engineering, and Data Visualization")
Loaded 12 paper abstracts
Topics covered: Machine Learning, Data Engineering, and Data Visualization

Generating Your First Embeddings

Now let's transform these paper abstracts into embeddings. We'll use a pre-trained model from the sentence-transformers library called all-MiniLM-L6-v2. We're using this model because it's fast and efficient for learning purposes, perfect for understanding the core concepts. In our next tutorial, we'll explore more recent production-grade models used in real-world applications.

The model will convert each abstract into an n-dimensional vector, where the value of n depends on the specific model architecture. Different embedding models produce vectors of different sizes. Some models create compact 128-dimensional embeddings, while others produce larger 768 or even 1024-dimensional vectors. Generally, larger embeddings can capture more nuanced semantic information, but they also require more computational resources and storage space.

Think of each dimension in the vector as capturing some aspect of the text's meaning. Maybe one dimension responds strongly to "machine learning" concepts, another to "data cleaning" terminology, and another to "visualization" language. The model learned these representations automatically during training.

Let's see what dimensionality our specific model produces.

from sentence_transformers import SentenceTransformer

# Load the pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract just the abstracts for embedding
abstracts = [paper['abstract'] for paper in papers]

# Generate embeddings for all abstracts
embeddings = model.encode(abstracts)

# Let's examine what we've created
print(f"Shape of embeddings: {embeddings.shape}")
print(f"Each abstract is represented by a vector of {embeddings.shape[1]} numbers")
print(f"\nFirst few values of the first embedding:")
print(embeddings[0][:10])
Shape of embeddings: (12, 384)
Each abstract is represented by a vector of 384 numbers

First few values of the first embedding:
[-0.06071806 -0.13064863  0.00328695 -0.04209436 -0.03220841  0.02034248
  0.0042156  -0.01300791 -0.1026612  -0.04565621]

Perfect! We now have 12 embeddings, one for each paper abstract. Each embedding is a 384-dimensional vector, represented as a NumPy array of floating-point numbers.

These numbers might look random at first, but they encode meaningful information about the semantic content of each abstract. When we want to find similar documents, we measure the cosine similarity between their embedding vectors. Cosine similarity looks at the angle between vectors. Vectors pointing in similar directions (representing similar meanings) have high cosine similarity, even if their magnitudes differ. In a later tutorial, we'll compute vector similarity using cosine, Euclidean, and dot-product methods to compare different approaches.
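
To see this in code, here's a minimal sketch that compares a few of our embeddings directly. It assumes the embeddings array and papers list from the code above; if the embeddings behave as described, the same-topic pair should score noticeably higher than the cross-topic pair.

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two machine learning papers (indices 0 and 3) vs. an ML paper and a visualization paper (index 9)
ml_vs_ml = cosine_similarity(embeddings[0], embeddings[3])
ml_vs_viz = cosine_similarity(embeddings[0], embeddings[9])

print(f"ML vs. ML similarity:  {ml_vs_ml:.3f}")
print(f"ML vs. viz similarity: {ml_vs_viz:.3f}")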

Before we move on, let's verify we can retrieve the original text:

# Let's look at one paper and its embedding
print("Paper title:", papers[0]['title'])
print("\nAbstract:", papers[0]['abstract'][:100] + "...")
print("\nEmbedding shape:", embeddings[0].shape)
print("Embedding type:", type(embeddings[0]))
Paper title: Building Your First Neural Network with PyTorch

Abstract: Learn how to construct and train a neural network from scratch using PyTorch. This paper covers the ...

Embedding shape: (384,)
Embedding type: <class 'numpy.ndarray'>

Great! We can still access the original paper text alongside its embedding. Throughout this tutorial, we'll work with these embeddings while keeping track of which paper each one represents.

Making Sense of High-Dimensional Spaces

We now have 12 vectors, each with 384 dimensions. But here's the issue: humans can't visualize 384-dimensional space. We struggle to imagine even four dimensions! To understand what our embeddings have learned, we need to reduce them to two dimensions so that we can plot them on a graph.

This is where dimensionality reduction comes in. We'll use Principal Component Analysis (PCA), a technique that finds the two most important dimensions (the ones that capture the most variation in our data). It's like taking a 3D object and finding the best angle to photograph it in 2D while preserving as much information as possible.

While we're definitely going to lose some detail during this compression, our original 384-dimensional embeddings capture rich, nuanced information about semantic meaning. When we squeeze them down to 2D, some subtleties are bound to get lost. But the major patterns (which papers belong to which topic) will still be clearly visible.

from sklearn.decomposition import PCA

# Reduce embeddings from 384 dimensions to 2 dimensions
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

print(f"Original embedding dimensions: {embeddings.shape[1]}")
print(f"Reduced embedding dimensions: {embeddings_2d.shape[1]}")
print(f"\nVariance explained by these 2 dimensions: {pca.explained_variance_ratio_.sum():.2%}")
Original embedding dimensions: 384
Reduced embedding dimensions: 2

Variance explained by these 2 dimensions: 41.20%

The variance explained tells us how much of the variation in the original data is preserved in these 2 dimensions. Think of it this way: if all our papers were identical, they'd have zero variance. The more different they are, the more variance. We've kept about 41% of that variation, which is plenty to see the major patterns. The clustering itself depends on whether papers use similar vocabulary, not on how much variance we've retained. So even though 41% might seem relatively low, the major patterns separating different topics will still be very clear in our embedding visualization.

Understanding Our Tutorial Topics

Before we create our embeddings visualization, let's see how the 12 papers are organized by topic. This will help us understand the patterns we're about to see in the embeddings:

# Print papers grouped by topic
print("=" * 80)
print("PAPER REFERENCE GUIDE")
print("=" * 80)

topics = [
    ("Machine Learning", list(range(0, 4))),
    ("Data Engineering/ETL", list(range(4, 8))),
    ("Data Visualization", list(range(8, 12)))
]

for topic_name, indices in topics:
    print(f"\n{topic_name}:")
    print("-" * 80)
    for idx in indices:
        print(f"  Paper {idx+1}: {papers[idx]['title']}")
================================================================================
PAPER REFERENCE GUIDE
================================================================================

Machine Learning:
--------------------------------------------------------------------------------
  Paper 1: Building Your First Neural Network with PyTorch
  Paper 2: Preventing Overfitting: Regularization Techniques Explained
  Paper 3: Hyperparameter Tuning with Grid Search and Random Search
  Paper 4: Transfer Learning: Using Pre-trained Models for Image Classification

Data Engineering/ETL:
--------------------------------------------------------------------------------
  Paper 5: Handling Missing Data: Strategies and Best Practices
  Paper 6: Data Validation Techniques for ETL Pipelines
  Paper 7: Cleaning Messy CSV Files: A Practical Guide
  Paper 8: Building Scalable ETL Workflows with Apache Airflow

Data Visualization:
--------------------------------------------------------------------------------
  Paper 9: Creating Interactive Dashboards with Plotly Dash
  Paper 10: Matplotlib Best Practices: Making Publication-Quality Plots
  Paper 11: Data Storytelling: Designing Effective Visualizations
  Paper 12: Building Real-Time Visualization Streams with Bokeh

Now that we know which tutorials belong to which topic, let's visualize their embeddings.

Visualizing Embeddings to Reveal Relationships

We're going to create a scatter plot where each point represents one paper abstract. We'll color-code them by topic so we can see how the embeddings naturally group similar content together.

import matplotlib.pyplot as plt
import numpy as np

# Create the visualization
plt.figure(figsize=(8, 6))

# Define colors for different topics
colors = ['#0066CC', '#CC0099', '#FF6600']
categories = ['Machine Learning', 'Data Engineering/ETL', 'Data Visualization']

# Create color mapping for each paper
color_map = []
for i in range(12):
    if i < 4:
        color_map.append(colors[0])  # Machine Learning
    elif i < 8:
        color_map.append(colors[1])  # Data Engineering
    else:
        color_map.append(colors[2])  # Data Visualization

# Plot each paper
for i, (x, y) in enumerate(embeddings_2d):
    plt.scatter(x, y, c=color_map[i], s=275, alpha=0.7, edgecolors='black', linewidth=1)
    # Add paper numbers as labels
    plt.annotate(str(i+1), (x, y), fontsize=10, fontweight='bold',
                ha='center', va='center')

plt.xlabel('First Principal Component', fontsize=14)
plt.ylabel('Second Principal Component', fontsize=14)
plt.title('Paper Embeddings from Three Data Science Topics\n(Papers close together have similar semantic meaning)',
          fontsize=15, fontweight='bold', pad=20)

# Add a legend showing which colors represent which topics
legend_elements = [plt.Line2D([0], [0], marker='o', color='w',
                              markerfacecolor=colors[i], markersize=12,
                              label=categories[i]) for i in range(len(categories))]
plt.legend(handles=legend_elements, loc='best', fontsize=12)

plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

What the Visualization Reveals About Semantic Similarity

Take a look at the visualization below that was generated using the code above. As you can see, the results are pretty striking! The embeddings have naturally organized themselves into three distinct regions based purely on semantic content.

Keep in mind that we deliberately chose papers from very distinct topics to make the clustering crystal clear. This is perfect for learning, but real-world datasets are messier. When you're working with papers that bridge multiple topics or have overlapping vocabulary, you'll see more gradual transitions between clusters rather than these sharp separations. We'll encounter that reality in the next tutorial when we work with hundreds of real arXiv papers.

Paper Embeddings from Data Science Topics

  • The Machine Learning cluster (blue, papers 1-4) dominates the lower-left side of the plot. These four points sit close together because they all discuss neural networks, training, and model optimization. Look at papers 1 and 4. They're positioned very near each other despite one focusing on building networks from scratch and the other on transfer learning. The embedding model recognizes that they both use the core language of deep learning: layers, weights, training, and model architectures.
  • The Data Engineering/ETL cluster (magenta, papers 5-8) occupies the upper portion of the plot. These papers share vocabulary around data quality, pipelines, and validation. Notice how papers 5, 6, and 7 form a tight grouping. They all discuss data quality issues using terms like "missing values," "validation," and "cleaning." Paper 8 (about Airflow) sits slightly apart from the others, which makes sense: while it's definitely about data engineering, it focuses more on workflow orchestration than data quality, giving it a slightly different semantic fingerprint.
  • The Data Visualization cluster (orange, papers 9-12) is gathered on the lower-right side. These four papers are packed close together because they all use visualization-specific vocabulary: "charts," "dashboards," "colors," and "interactive elements." The tight clustering here shows just how distinct visualization terminology is from both ML and data engineering language.

What's remarkable is the clear separation between all three clusters. The distance between the ML papers on the left and the visualization papers on the right tells us that these topics use fundamentally different vocabularies. There's minimal semantic overlap between "neural networks" and "dashboards," so they end up far apart in the embedding space.

How the Model Learned to Understand Meaning

The all-MiniLM-L6-v2 embedding model was trained on millions of text pairs, learning which words tend to appear together. When it sees a tutorial full of words like "layers," "training," and "optimization," it produces an embedding vector that's mathematically similar to other texts with that same vocabulary pattern. The clustering emerges naturally from those learned associations.

Why This Matters for Your Work as an AI Engineer

Embeddings are foundational to the modern AI systems you'll build as an AI Engineer. Let's look at how embeddings enable the core technologies you'll work with:

  1. Building Intelligent Search Systems

    Traditional keyword search has a fundamental limitation: it can only find exact matches. If a user searches for "handling null values," they won't find documents about "missing data strategies" or "imputation techniques," even though these are exactly what they need. Embeddings solve this by understanding semantic similarity. When you embed both the search query and your documents, you can find relevant content based on meaning rather than word matching. The result is a search system that actually understands what you're looking for.

  2. Working with Vector Databases

    Vector databases are specialized databases that are built to store and query embeddings efficiently. Instead of SQL queries that match exact values, vector databases let you ask "find me all documents similar to this one" and get results ranked by semantic similarity. They're optimized for the mathematical operations that embeddings require, like calculating distances between high-dimensional vectors, which makes them essential infrastructure for AI applications. Modern systems often use hybrid search approaches that combine semantic similarity with traditional keyword matching to get the best of both worlds.

  3. Implementing Retrieval-Augmented Generation (RAG)

    RAG systems are one of the most powerful patterns in modern AI engineering. Here's how they work: you embed a large collection of documents (like company documentation, research papers, or knowledge bases). When a user asks a question, you embed their question and use that embedding to find the most relevant documents from your collection. Then you pass those documents to a language model, which generates an informed answer grounded in your specific data. Embeddings make the retrieval step possible because they're how the system knows which documents are relevant to the question.

  4. Creating AI Agents with Long-Term Memory

    AI agents that can remember past interactions and learn from experience need a way to store and retrieve relevant memories. Embeddings enable this. When an agent has a conversation or completes a task, you can embed the key information and store it in a vector database. Later, when the agent encounters a similar situation, it can retrieve relevant past experiences by finding embeddings close to the current context. This gives agents the ability to learn from history and make better decisions over time. In practice, long-term agent memory often uses similarity thresholds and time-weighted retrieval to prevent irrelevant or outdated information from being recalled.

These four applications (search, vector databases, RAG, and AI agents) are foundational tools for any aspiring AI Engineer's toolkit. Each builds on embeddings as a core technology. Understanding how embeddings capture semantic meaning is the first step toward building production-ready AI systems.
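
To make the intelligent search application above concrete, here's a minimal sketch that embeds a query and ranks our 12 papers by cosine similarity. It reuses the model, papers, and embeddings objects defined earlier; the query string is just an illustration.

import numpy as np

query = "cleaning messy datasets"

# Embed the query with the same model used for the abstracts
query_embedding = model.encode([query])[0]

# Cosine similarity between the query and every abstract
similarities = embeddings @ query_embedding / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Show the three most similar papers
for idx in np.argsort(similarities)[::-1][:3]:
    print(f"{similarities[idx]:.3f}  {papers[idx]['title']}")

Notice that nothing here matches keywords: the ranking comes purely from how close each abstract's embedding is to the query's embedding.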

Advanced Topics to Explore

As you continue learning about embeddings, you'll encounter several advanced techniques that are widely used in production systems:

  • Multimodal Embeddings allow you to embed different types of content (text, images, audio) into the same embedding space. This enables powerful cross-modal search capabilities, like finding images based on text descriptions or vice versa. Models like CLIP demonstrate how effective this approach can be.
  • Instruction-Tuned Embeddings are models fine-tuned to better understand specific types of queries or instructions. These specialized models often outperform general-purpose embeddings for domain-specific tasks like legal document search or medical literature retrieval.
  • Quantization reduces the precision of embedding values (from 32-bit floats to 8-bit integers, for example), which can dramatically reduce storage requirements and speed up similarity calculations with minimal impact on search quality. This becomes crucial when working with millions of embeddings.
  • Dimension Truncation takes advantage of the fact that the most important information in embeddings is often concentrated in the first dimensions. By keeping only the first 256 dimensions of a 768-dimensional embedding, you can achieve significant efficiency gains while preserving most of the semantic information.
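
As a rough illustration of dimension truncation, the sketch below keeps only the first 128 of our 384 dimensions and re-normalizes the result. Keep in mind that all-MiniLM-L6-v2 wasn't trained with truncation in mind, so this only shows the mechanics; models designed for truncation preserve far more quality.

import numpy as np

# Keep only the first 128 of the 384 dimensions (assumes `embeddings` from earlier)
truncated = embeddings[:, :128]

# Re-normalize so cosine similarity still behaves sensibly
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

print(f"Original shape:  {embeddings.shape}")
print(f"Truncated shape: {truncated.shape}")
print(f"Storage reduced by roughly {1 - 128/384:.0%}")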

These techniques become increasingly important as you scale from prototype to production systems handling real-world data volumes.

Building Toward Production Systems

You've now learned the following core foundational embedding concepts:

  • Embeddings convert text into numerical vectors that capture meaning
  • Similar content produces similar vectors
  • These relationships can be visualized to understand how the model organizes information

But we've only worked with 12 handwritten paper abstracts. This is perfect for getting the core concept, but real applications need to handle hundreds or thousands of documents.

In the next tutorial, we'll scale up dramatically. You'll learn how to collect documents programmatically using APIs, generate embeddings at scale, and make strategic decisions about different embedding approaches.

You'll also face the practical challenges that come with real data: rate limits on APIs, processing time for large datasets, the tradeoff between embedding quality and speed, and how to handle edge cases like empty documents or very long texts. These considerations separate a learning exercise from a production system.

By the end of the next tutorial, you'll be equipped to build an embedding system that handles real-world data at scale. That foundation will prepare you for our final embeddings tutorial, where we'll implement similarity search and build a complete semantic search engine.

Next Steps

For now, experiment with the code above:

  • Try replacing one of the paper abstracts with content from your own learning.
    • Where does it appear on the visualization?
    • Does it cluster with one of our three topics, or does it land somewhere in between?
  • Add a paper abstract that bridges two topics, like "Using Machine Learning to Optimize ETL Pipelines."
    • Does it position itself between the ML and data engineering clusters?
    • What does this tell you about how embeddings handle multi-topic content?
  • Try changing the embedding model to see how it affects the visualization.
    • Models like all-mpnet-base-v2 produce different dimensional embeddings.
    • Do the clusters become tighter or more spread out?
  • Experiment with adding a completely unrelated abstract, like a cooking recipe or news article.
    • Where does it land relative to our three data science clusters?
    • How far away is it from the nearest cluster?

This hands-on exploration and experimentation will deepen your intuition about how embeddings work.
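
If you'd like a starting point for the bridging-abstract experiment, here's a sketch that appends a made-up abstract, regenerates the embeddings, and reruns PCA. The title and abstract text are invented for illustration.

from sklearn.decomposition import PCA

# Hypothetical abstract that bridges machine learning and data engineering
new_paper = {
    'title': 'Using Machine Learning to Optimize ETL Pipelines',
    'abstract': '''This paper applies machine learning models to predict pipeline failures, optimize task scheduling, and detect data quality issues in ETL workflows built with tools like Apache Airflow.'''
}

papers.append(new_paper)
abstracts.append(new_paper['abstract'])

# Re-encode all abstracts and re-project so the new paper appears in the 2D space
embeddings = model.encode(abstracts)
embeddings_2d = PCA(n_components=2).fit_transform(embeddings)

print(f"Now working with {len(papers)} papers; paper {len(papers)} is the bridging abstract")

Note that the plotting code above assumes exactly 12 papers, so you'll also need to extend the color_map logic (and choose a color) for the new point before re-plotting.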

Ready to scale things up? In the next tutorial, we'll work with real arXiv data and build an embedding system that can handle thousands of papers. See you there!


Key Takeaways:

  • Embeddings convert text into numerical vectors that capture semantic meaning
  • Similar meanings produce similar vectors, enabling mathematical comparison of concepts
  • Papers from different topics cluster separately because they use distinct vocabulary
  • Dimensionality reduction (like PCA) helps visualize high-dimensional embeddings in 2D
  • Embeddings power modern AI systems, including semantic search, vector databases, RAG, and AI agents

Getting Started with Claude Code for Data Scientists

16 October 2025 at 23:39

If you've spent hours debugging a pandas KeyError, or writing the same data validation code for the hundredth time, or refactoring a messy analysis script, you know the frustration of tedious coding work. Real data science work involves analytical thinking and creative problem-solving, but it also requires a lot of mechanical coding: boilerplate writing, test generation, and documentation creation.

What if you could delegate the mechanical parts to an AI assistant that understands your codebase and handles implementation details while you focus on the analytical decisions?

That's what Claude Code does for data scientists.

What Is Claude Code?

Claude Code is Anthropic's terminal-based AI coding assistant that helps you write, refactor, debug, and document code through natural language conversations. Unlike autocomplete tools that suggest individual lines as you type, Claude Code understands project context, makes coordinated multi-file edits, and can execute workflows autonomously.

Claude Code excels at generating boilerplate code for data loading and validation, refactoring messy scripts into clean modules, debugging obscure errors in pandas or numpy operations, implementing standard patterns like preprocessing pipelines, and creating tests and documentation. However, it doesn't replace your analytical judgment, make methodological decisions about statistical approaches, or fix poorly conceived analysis strategies.

In this tutorial, you'll learn how to install Claude Code, understand its capabilities and limitations, and start using it productively for data science work. You'll see the core commands, discover tips that improve efficiency, and see concrete examples of how Claude Code handles common data science tasks.

Key Benefits for Data Scientists

Before we get into installation, let's establish what Claude Code actually does for data scientists:

  1. Eliminate boilerplate code writing for repetitive patterns that consume time without requiring creative thought. File loading with error handling, data validation checks that verify column existence and types, preprocessing pipelines with standard transformations—Claude Code generates these in seconds rather than requiring manual implementation of logic you've written dozens of times before.
  2. Generate test suites for data processing functions covering normal operation, edge cases with malformed or missing data, and validation of output characteristics. Testing data pipelines becomes straightforward rather than work you postpone.
  3. Accelerate documentation creation for data analysis workflows by generating detailed docstrings, README files explaining project setup, and inline comments that explain complex transformations.
  4. Debug obscure errors more efficiently in pandas operations, numpy array manipulations, or scikit-learn pipeline configurations. Claude Code interprets cryptic error messages, suggests likely causes based on common patterns, and proposes fixes you can evaluate immediately.
  5. Refactor exploratory code into production-quality modules with proper structure, error handling, and maintainability standards. The transition from research notebook to deployable pipeline becomes faster and less painful.

These benefits translate directly to time savings on mechanical tasks, allowing you to focus on analysis, modeling decisions, and generating insights rather than wrestling with implementation details.

Installation and Setup

Let's get Claude Code installed and configured. The process takes about 10-15 minutes, including account creation and verification.

Step 1: Obtain Your Anthropic API Key

Navigate to console.anthropic.com and create an account if you don't have one. Once logged in, access the API keys section from the navigation menu on the left, and generate a new API key by clicking on + Create Key.

Claude_Code_API_Key.png

While you can generate a new key anytime from the console, you won't be able to view an existing key again after it has been created. For this reason, copy your API key immediately and store it somewhere safe; you'll need it for authentication.

Always keep your API keys secure. Treat them like passwords and never commit them to version control or share them publicly.

Step 2: Install Claude Code

Claude Code installs via npm (Node Package Manager). If you don't have Node.js installed on your system, download it from nodejs.org before proceeding.

Once Node.js is installed, open your terminal and run:

npm install -g @anthropic-ai/claude-code

The -g flag installs Claude Code globally, making it available from any directory on your system.

Common installation issues:

  • "npm: command not found": You need to install Node.js first. Download it from nodejs.org and restart your terminal after installation.
  • Permission errors on Mac/Linux: Try sudo npm install -g @anthropic-ai/claude-code to install with administrator privileges.
  • PATH issues: If Claude Code installs successfully but the claude command isn't recognized, you may need to add npm's global directory to your system PATH. Run npm config get prefix to find the location, then add [that-location]/bin to your PATH environment variable.

Step 3: Configure Authentication

Set your API key as an environment variable so Claude Code can authenticate with Anthropic's servers:

export ANTHROPIC_API_KEY=your_key_here

Replace your_key_here with the actual API key you copied earlier from the Anthropic console.

To make this permanent (so you don't need to set your API key every time you open a terminal), add the export line above to your shell configuration file:

  • For bash: Add to ~/.bashrc or ~/.bash_profile
  • For zsh: Add to ~/.zshrc
  • For fish: Add to ~/.config/fish/config.fish

You can edit your shell configuration file using nano config_file_name. After adding the line, reload your configuration by running source ~/.bashrc (or whichever file you edited), or simply open a new terminal window.

Step 4: Verify Installation

Confirm that Claude Code is properly installed and authenticated:

claude --version

You should see version information displayed. If you get an error, review the installation steps above.

Try running Claude Code for the first time:

claude

This launches the Claude Code interface. You should see a welcome message and a prompt asking you to select the text style that looks best with your terminal:

Claude_Code_Welcome_Screen.png

Use the arrow keys on your keyboard to select a text style and press Enter to continue.

Next, you’ll be asked to select a login method:

If you have an eligible subscription, select option 1. Otherwise, select option 2. For this tutorial, we will use option 2 (API usage billing).

Claude_Code_Select_Login.png

Once your account setup is complete, you’ll see a welcome message showing the email address for your account:

Claude_Code_Setup_Complete.png

To exit Claude Code at any point during setup, press Control+C twice.

Security Note

Claude Code can read files you explicitly include and generate code that loads data from files or databases. However, it doesn't automatically access your data without your instruction. You maintain full control over what files and information Claude Code can see. When working with sensitive data, be mindful of what files you include in conversation context and review all generated code before execution, especially code that connects to databases or external systems. For more details, see Anthropic’s Security Documentation.

Understanding the Costs

Claude Code itself is free software, but using it requires an Anthropic API key that operates on usage-based pricing:

  • Free tier: Limited testing suitable for evaluation
  • Pro plan ($20/month): Reasonable usage for individual data scientists conducting moderate development work
  • Pay-as-you-go: For heavy users working intensively on multiple projects, typically $6-12 daily for active development

Most practitioners doing regular but not continuous development work find the $20 Pro plan provides a good balance between cost and capability. Start with the free tier to evaluate effectiveness on your actual work, then upgrade based on demonstrated value.

Your First Commands

Now that Claude Code is installed and configured, let's walk through basic usage with hands-on examples.

Starting a Claude Code Session

Navigate to a project directory in your terminal:

cd ~/projects/customer_analysis

Launch Claude Code:

claude

You'll see the Claude Code interface with a prompt where you can type natural language instructions.

Understanding Your Project

Before asking Claude Code to make changes, it needs to understand your project context. Try starting with this exploratory command:

Explain the structure of this project and identify the key files.

Claude Code will read through your directory, examine files, and provide a summary of what it found. This shows that Claude Code actively explores and comprehends codebases before acting.

Your First Refactoring Task

Let's demonstrate Claude Code's practical value with a realistic example. Create a simple file called load_data.py with some intentionally messy code:

import pandas as pd

# Load customer data
data = pd.read_csv('/Users/yourname/Desktop/customers.csv')
print(data.head())

This works but has obvious problems: hardcoded absolute path, no error handling, poor variable naming, and no documentation.

Now ask Claude Code to improve it:

Refactor load_data.py to use best practices: configurable paths, error handling, descriptive variable names, and complete docstrings.

Claude Code will analyze the file and propose improvements. Instead of the hardcoded path, you'll get configurable file paths through command-line arguments. The error handling expands to catch missing files, empty files, and CSV parsing errors. Variable names become descriptive (customer_df or customer_data instead of generic data). A complete docstring appears documenting parameters, return values, and potential exceptions. The function adds proper logging to track what's happening during execution.

Claude Code asks your permission before making these changes. Always review its proposal; if it looks good, approve it. If something seems off, ask for modifications or reject the changes entirely. This permission step ensures you stay in control while delegating the mechanical work.

What Just Happened

This demonstrates Claude Code's workflow:

  1. You describe what you want in natural language
  2. Claude Code analyzes the relevant files and context
  3. Claude Code proposes specific changes with explanations
  4. You review and approve or request modifications
  5. Claude Code applies approved changes

The entire refactoring took 90 seconds instead of 20-30 minutes of manual work. More importantly, Claude Code caught details you might have forgotten, such as adding logging, proper type hints, and handling multiple error cases. The permission-based approach ensures you maintain control while delegating implementation work.

Core Commands and Patterns

Claude Code provides several slash (/) commands that control its behavior and help you work more efficiently.

Important Slash Commands

@filename: Reference files directly in your prompts using the @ symbol. Example: @src/preprocessing.py or Explain the logic in @data_loader.py. Claude Code automatically includes the file's content in context. Use tab completion after typing @ to quickly navigate and select files.

/clear: Reset conversation context entirely, removing all history and file references. Use this when switching between different analyses, datasets, or project areas. Accumulated conversation history consumes tokens and can cause Claude Code to inappropriately reference outdated context. Think of /clear as starting a fresh conversation when you switch tasks.

/help: Display available commands and usage information. Useful when you forget command syntax or want to discover capabilities.

Context Management for Data Science Projects

Claude Code has token limits determining how much code it can consider simultaneously. For small projects with a few files, this rarely matters. For larger data science projects with dozens of notebooks and scripts, strategic context management becomes important.

Reference only files relevant to your current task using @filename syntax. If you're working on data validation, reference the validation script and related utilities (like @validation.py and @utils/data_checks.py) but exclude modeling and visualization code that won't influence the current work.

Effective Prompting Patterns

Claude Code responds best to clear, specific instructions. Compare these approaches:

  • Vague: "Make this code better"
    Specific: "Refactor this preprocessing function to handle missing values using median imputation for numerical columns and mode for categorical columns, add error handling for unexpected data types, and include detailed docstrings"
  • Vague: "Add tests"
    Specific: "Create pytest tests for the data_loader function covering successful loading, missing file errors, empty file handling, and malformed CSV detection"
  • Vague: "Fix the pandas error"
    Specific: "Debug the KeyError in line 47 of data_pipeline.py and suggest why it's failing on the 'customer_id' column"

Specific prompts produce focused, useful results. Vague prompts generate generic suggestions that may not address your actual needs.

Iteration and Refinement

Treat Claude Code's initial output as a starting point rather than expecting perfection on the first attempt. Review what it generates, identify improvements needed, and make follow-up requests:

"The validation function you created is good, but it should also check that dates are within reasonable ranges. Add validation that start_date is after 2000-01-01 and end_date is not in the future."

This iterative approach produces better results than attempting to specify every requirement in a single massive prompt.

Advanced Features

Beyond basic commands, several features improve your Claude Code experience for complex work.

  1. Activate plan mode: Press Shift+Tab before sending your prompt to enable plan mode, which creates an explicit execution plan before implementing changes. Use this for workflows with three or more distinct steps—like loading data, preprocessing, and generating outputs. The planning phase helps Claude maintain focus on the overall objective.

  2. Run commands with bash mode: Prefix prompts with an exclamation mark to execute shell commands and inject their output into Claude Code's context:

    ! python analyze_sales.py

    This runs your analysis script and adds complete output to Claude Code's context. You can then ask questions about the output or request interpretations of the results. This creates a tight feedback loop for iterative data exploration.

  3. Use extended thinking for complex problems: Include "think", "think harder", or "ultrathink" in prompts for thorough analysis:

    think harder: why does my linear regression show high R-squared but poor prediction on validation data?

    Extended thinking produces more careful analysis but takes longer (ultrathink can take several minutes). Apply this when debugging subtle statistical issues or planning sophisticated transformations.

  4. Resume previous sessions: Launch Claude Code with claude --resume to continue your most recent session with complete context preserved, including conversation history, file references, and established conventions all intact. This proves valuable for ongoing analyses where you want to pick up where you left off without re-explaining your entire analytical approach.

Optional Power User Setting

For personal projects where you trust all operations, launch with claude --dangerously-skip-permissions to bypass constant approval prompts. This carries risk if Claude Code attempts destructive operations, so use it only on projects where you maintain version control and can recover from mistakes. Never use this on production systems or shared codebases.

Configuring Claude Code for Data Science Projects

The CLAUDE.md file provides project-specific context that improves Claude Code's suggestions by explaining your conventions, requirements, and domain specifics.

Quick Setup with /init

The easiest way to create your CLAUDE.md file is using Claude Code's built-in /init command. From your project directory, launch Claude Code and run:

/init

Claude Code will analyze your project structure and ask you questions about your setup: what kind of project you're working on, your coding conventions, important files and directories, and domain-specific context. It then generates a CLAUDE.md file tailored to your project.

This interactive approach is faster than writing from scratch and ensures you don't miss important details. You can always edit the generated file later to refine it.

Understanding Your CLAUDE.md

Whether you used /init or prefer to create it manually, here's what a typical CLAUDE.md file looks like for a data science project on customer churn. The file lives in your project root directory, is written in markdown, and describes your project:

# Customer Churn Analysis Project

## Project Overview
Predict customer churn for a telecommunications company using historical
customer data and behavior patterns. The goal is identifying at-risk
customers for proactive retention efforts.

## Data Sources
- **Customer demographics**: data/raw/customer_info.csv
- **Usage patterns**: data/raw/usage_data.csv
- **Churn labels**: data/raw/churn_labels.csv

Expected columns documented in data/schemas/column_descriptions.md

## Directory Structure
- `data/raw/`: Original unmodified data files
- `data/processed/`: Cleaned and preprocessed data ready for modeling
- `notebooks/`: Exploratory analysis and experimentation
- `src/`: Production code for data processing and modeling
- `tests/`: Pytest tests for all src/ modules
- `outputs/`: Generated reports, visualizations, and model artifacts

## Coding Conventions
- Use pandas for data manipulation, scikit-learn for modeling
- All scripts should accept command-line arguments for file paths
- Include error handling for data quality issues
- Follow PEP 8 style guidelines
- Write pytest tests for all data processing functions

## Domain Notes
Churn is defined as customer canceling service within 30 days. We care
more about catching churners (recall) than minimizing false positives
because retention outreach is relatively low-cost.

This upfront investment takes 10-15 minutes but improves every subsequent interaction by giving Claude Code context about your project structure, conventions, and requirements.

Hierarchical Configuration for Complex Projects

CLAUDE.md files can be hierarchical. You might maintain a root-level CLAUDE.md describing overall project structure, plus subdirectory-specific files for different analysis areas.

For example, a project analyzing both customer behavior and financial performance might have:

  • Root CLAUDE.md: General project description, directory structure, and shared conventions
  • customer_analysis/CLAUDE.md: Specific details about customer data sources, relevant metrics like lifetime value and engagement scores, and analytical approaches for behavioral patterns
  • financial_analysis/CLAUDE.md: Financial data sources, accounting principles used, and approaches for revenue and cost analysis

Claude Code prioritizes the most specific configuration, so subdirectory files take precedence when working within those areas.

Custom Slash Commands

For frequently used patterns specific to your workflow, you can create custom slash commands. Create a .claude/commands directory in your project and add markdown files named for each slash command you want to define.

For example, .claude/commands/test.md:

Create pytest tests for: $ARGUMENTS

Requirements:
- Test normal operation with valid data
- Test edge cases: empty inputs, missing values, invalid types
- Test expected exceptions are raised appropriately
- Include docstrings explaining what each test validates
- Use descriptive test names that explain the scenario

Then /test my_preprocessing_function generates tests following your specified patterns.

These custom commands represent optional advanced customization. Start with basic CLAUDE.md configuration, and consider custom commands only after you've identified repetitive patterns in your prompting.

Practical Data Science Applications

Let's see Claude Code in action across some common data science tasks.

1. Data Loading and Validation

Generate robust data loading code with error handling:

Create a data loading function for customer_data.csv that:
- Accepts configurable file paths
- Validates expected columns exist with correct types
- Detects and logs missing value patterns
- Handles common errors like missing files or malformed CSV
- Returns the dataframe with a summary of loaded records

Claude Code generates a function that handles all these requirements. The code uses pathlib for cross-platform file paths, includes try-except blocks for multiple error scenarios, validates that required columns exist in the dataframe, logs detailed information about data quality issues like missing values, and provides clear exception messages when problems occur. This handles edge cases you might forget: missing files, parsing errors, column validation, and missing value detection with logging.
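
Claude Code's exact output will vary from run to run, but a function produced by this prompt tends to look roughly like the sketch below. The column names, log messages, and function name are illustrative assumptions, not a fixed template.

import logging
from pathlib import Path

import pandas as pd

logger = logging.getLogger(__name__)

# Illustrative schema; your real columns will differ
REQUIRED_COLUMNS = {"customer_id", "plan_type", "tenure"}


def load_customer_data(file_path: str) -> pd.DataFrame:
    """Load and validate the customer CSV, returning a dataframe."""
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"Customer data file not found: {path}")

    try:
        customer_df = pd.read_csv(path)
    except pd.errors.EmptyDataError:
        raise ValueError(f"Customer data file is empty: {path}")
    except pd.errors.ParserError as exc:
        raise ValueError(f"Could not parse CSV {path}: {exc}")

    missing_columns = REQUIRED_COLUMNS - set(customer_df.columns)
    if missing_columns:
        raise ValueError(f"Missing expected columns: {sorted(missing_columns)}")

    # Log missing-value patterns instead of failing outright
    null_counts = customer_df.isna().sum()
    for column, count in null_counts[null_counts > 0].items():
        logger.warning("Column %s has %d missing values", column, count)

    logger.info("Loaded %d records from %s", len(customer_df), path)
    return customer_df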

2. Exploratory Data Analysis Assistance

Generate EDA code:

Create an EDA script for the customer dataset that generates:
- Distribution plots for numerical features (age, income, tenure)
- Count plots for categorical features (plan_type, region)
- Correlation heatmap for numerical variables
- Summary statistics table
Save all visualizations to outputs/eda/

Claude Code produces a complete analysis script with proper plot styling, figure organization, and file saving—saving 30-45 minutes of matplotlib configuration work.

3. Data Preprocessing Pipeline

Build a preprocessing module:

Create preprocessing.py with functions to:
- Handle missing values: median for numerical, mode for categorical
- Encode categorical variables using one-hot encoding
- Scale numerical features using StandardScaler
- Include type hints, docstrings, and error handling

The generated code includes proper sklearn patterns and documentation, and it handles edge cases like unseen categories during transform.
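
The details differ between runs, but the core of the generated module often resembles this sketch (the function name and feature-list parameters are illustrative):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numerical_features: list[str],
                       categorical_features: list[str]) -> ColumnTransformer:
    """Build a preprocessing transformer for the given feature lists.

    Numerical columns get median imputation followed by standard scaling;
    categorical columns get mode imputation followed by one-hot encoding
    that ignores categories unseen during fit.
    """
    numerical_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Dense output keeps downstream code (and tests) simple
        ("encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ])
    return ColumnTransformer([
        ("numerical", numerical_pipeline, numerical_features),
        ("categorical", categorical_pipeline, categorical_features),
    ])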

4. Test Generation

Generate pytest tests:

Create tests for the preprocessing functions covering:
- Successful preprocessing with valid data
- Handling of various missing value patterns
- Error cases like all-missing columns
- Verification that output shapes match expectations

Claude Code generates thorough test coverage including fixtures, parametrized tests, and clear assertions—work that often gets postponed due to tedium.
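
A slice of the generated test file might look like the sketch below. It assumes the build_preprocessor sketch from the previous section lives in preprocessing.py; the fixture data and test names are illustrative.

import numpy as np
import pandas as pd
import pytest

from preprocessing import build_preprocessor  # hypothetical module from the sketch above


@pytest.fixture
def sample_data() -> pd.DataFrame:
    return pd.DataFrame({
        "age": [34, 51, np.nan, 29],
        "income": [52000.0, np.nan, 61000.0, 48000.0],
        "plan_type": ["basic", "premium", np.nan, "basic"],
    })


def test_preprocessing_valid_data(sample_data):
    """Valid data yields one output row per input row with no NaNs remaining."""
    preprocessor = build_preprocessor(["age", "income"], ["plan_type"])
    result = preprocessor.fit_transform(sample_data)
    assert result.shape[0] == len(sample_data)
    assert not np.isnan(result).any()


def test_preprocessing_all_missing_column(sample_data):
    """An entirely missing numerical column shouldn't crash the pipeline."""
    sample_data["age"] = np.nan
    preprocessor = build_preprocessor(["age", "income"], ["plan_type"])
    result = preprocessor.fit_transform(sample_data)
    assert result.shape[0] == len(sample_data)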

5. Documentation Generation

Add docstrings and project documentation:

Add docstrings to all functions in data_pipeline.py following NumPy style
Create a README.md explaining:
- Project purpose and business context
- Setup instructions for the development environment
- How to run the preprocessing and modeling pipeline
- Description of output artifacts and their interpretation

Generated documentation captures technical details while remaining readable for collaborators.

6. Maintaining Analysis Documentation

For complex analyses, use Claude Code to maintain living documentation:

Create analysis_log.md and document our approach to handling missing income data, including:
- The statistical justification for using median imputation rather than deletion
- Why we chose median over mean given the right-skewed distribution we observed
- Validation checks we performed to ensure imputation didn't bias results

This documentation serves dual purposes. First, it provides context for Claude Code in future sessions when you resume work on this analysis, as it explains the preprocessing you applied and why those specific choices were methodologically appropriate. Second, it creates stakeholder-ready explanations communicating both technical implementation and analytical reasoning.

As your analysis progresses, continue documenting key decisions:

Add to analysis_log.md: Explain why we chose random forest over logistic regression after observing significant feature interactions in the correlation analysis, and document the cross-validation approach we used given temporal dependencies in our customer data.

This living documentation approach transforms implicit analytical reasoning into explicit written rationale, increasing both reproducibility and transparency of your data science work.

Common Pitfalls and How to Avoid Them

  • Insufficient context leads to generic suggestions that miss project-specific requirements. Claude Code doesn't automatically know your data schema, project conventions, or domain constraints. Maintain a detailed CLAUDE.md file and reference relevant files using @filename syntax in your prompts.
  • Accepting generated code without review risks introducing bugs or inappropriate patterns. Claude Code produces good starting points but isn't perfect. Treat all output as first drafts requiring validation through testing and inspection, especially for statistical computations or data transformations.
  • Attempting overly complex requests in single prompts produces confused or incomplete results. When you ask Claude Code to "build the entire analysis pipeline from scratch," it gets overwhelmed. Break large tasks into focused steps—first create data loading, then validation, then preprocessing—building incrementally toward the desired outcome.
  • Ignoring error messages when Claude Code encounters problems prevents identifying root causes. Read errors carefully and ask Claude Code for specific debugging assistance: "The preprocessing function failed with KeyError on 'customer_id'. What might cause this and how should I fix it?"

Understanding Claude Code's Limitations

Setting realistic expectations about what Claude Code cannot do well builds trust through transparency.

Domain-specific understanding requires your input. Claude Code generates code based on patterns and best practices but cannot validate whether analytical approaches are appropriate for your research questions or business problems. You must provide domain expertise and methodological judgment.

Subtle bugs can slip through. Generated code for advanced statistical methods, custom loss functions, or intricate data transformations requires careful validation. Always test generated code thoroughly against known-good examples.

Large project understanding is limited. Claude Code works best on focused tasks within individual files rather than system-wide refactoring across complex architectures with dozens of interconnected files.

Edge cases may not be handled. Preprocessing code might handle clean training data perfectly but break on production data with unexpected null patterns or outlier distributions that weren't present during development.

Expertise is not replaceable. Claude Code accelerates implementation but does not replace fundamental understanding of data science principles, statistical methods, or domain knowledge.

Security Considerations

When Claude Code accesses external data sources, malicious actors could potentially embed instructions in data that Claude Code interprets as commands. This concern is known as prompt injection.

Maintain skepticism about Claude Code suggestions when working with untrusted external sources. Never grant Claude Code access to production databases, sensitive customer information, or critical systems without careful review of proposed operations.

For most data scientists working with internal datasets and trusted sources, this risk remains theoretical, but awareness becomes important as you expand usage into more automated workflows.

Frequently Asked Questions

How much does Claude Code cost for typical data science usage?

Claude Code itself is free to install, but it requires an Anthropic API key with usage-based pricing. The free tier allows limited testing suitable for evaluation. The Pro plan at $20/month handles moderate daily development—generating preprocessing code, debugging errors, refactoring functions. Heavy users working intensively on multiple projects may prefer pay-as-you-go pricing, typically $6-12 daily for active development. Start with the free tier to evaluate effectiveness, then upgrade based on value.

Does Claude Code work with Jupyter notebooks?

Claude Code operates as a command-line tool and works best with Python scripts and modules. For Jupyter notebooks, use Claude Code to build utility modules that your notebooks import, creating cleaner separation between exploratory analysis and reusable logic. You can also copy code cells into Python files, improve them with Claude Code, then bring the enhanced code back to the notebook.

Can Claude Code access my data files or databases?

Claude Code reads files you explicitly include through context and generates code that loads data from files or databases. It doesn't automatically access your data without instruction. You maintain full control over what files and information Claude Code can see. When you ask Claude Code to analyze data patterns, it reads the data through code execution, not by directly accessing databases or files independently.

How does Claude Code compare to GitHub Copilot?

GitHub Copilot provides inline code suggestions as you type within an IDE, excelling at completing individual lines or functions. Claude Code offers more substantial assistance with entire file transformations, debugging sessions, and refactoring through conversational interaction. Many practitioners use both—Copilot for writing code interactively, Claude Code for larger refactoring and debugging work. They complement each other rather than compete.

Next Steps

You now have Claude Code installed, understand its capabilities and limitations, and have seen concrete examples of how it handles data science tasks.

Start by using Claude Code for low-risk tasks where mistakes are easily corrected: generating documentation for existing functions, creating test cases for well-understood code, or refactoring non-critical utility scripts. This builds confidence without risking important work. Gradually increase complexity as you become comfortable.

Maintain a personal collection of effective prompts for data science tasks you perform regularly. When you discover a prompt pattern that produces excellent results, save it for reuse. This accelerates work on similar future tasks.

For technical details and advanced features, explore Anthropic's Claude Code documentation. The official docs cover advanced topics like Model Context Protocol servers, custom hooks, and integration patterns.

To systematically learn generative AI across your entire practice, check out our Generative AI Fundamentals in Python skill path. For deeper understanding of effective prompt design, our Prompting Large Language Models in Python course teaches frameworks for crafting prompts that consistently produce useful results.

Getting Started

AI-assisted development requires practice and iteration. You'll experience some awkwardness as you learn to communicate effectively with Claude Code, but this learning curve is brief. Most practitioners feel productive within their first week of regular use.

Install Claude Code, work through the examples in this tutorial with your own projects, and discover how AI assistance fits into your workflow.


Have questions or want to share your Claude Code experience? Join the discussion in the Dataquest Community where thousands of data scientists are exploring AI-assisted development together.

Build Your First ETL Pipeline with PySpark

15 October 2025 at 23:57

You've learned PySpark basics: RDDs, DataFrames, maybe some SQL queries. You can transform data and run aggregations in notebooks. But here's the thing: data engineering is about building pipelines that run reliably every single day, handling the messy reality of production data.

Today, we're building a complete ETL pipeline from scratch. This pipeline will handle the chaos you'll actually encounter at work: inconsistent date formats, prices with dollar signs, test data that somehow made it to production, and customer IDs that follow different naming conventions.

Here's the scenario: You just started as a junior data engineer at an online grocery delivery service. Your team lead drops by your desk with a problem. "Hey, we need help. Our daily sales report is a mess. The data comes in as CSVs from three different systems, nothing matches up, and the analyst team is doing everything manually in Excel. Can you build us an ETL pipeline?"

She shows you what she's dealing with:

  • Order files that need standardized date formatting
  • Product prices stored as "$12.99" in some files, "12.99" in others
  • Customer IDs that are sometimes numbers, sometimes start with "CUST_"
  • Random blank rows and test data mixed in ("TEST ORDER - PLEASE IGNORE")

"Just get it into clean CSV files," she says. "We'll worry about performance and parquet later. We just need something that works."

Your mission? Build an ETL pipeline that takes this mess and turns it into clean, reliable data the analytics team can actually use. No fancy optimizations needed, just something that runs every morning without breaking.

Setting Up Your First ETL Project

Let's start with structure. One of the biggest mistakes new data engineers make is jumping straight into writing transformation code without thinking about organization. You end up with a single massive Python file that's impossible to debug, test, or explain to your team.

We're going to build this the way professionals do it, but keep it simple enough that you won't get lost in abstractions.

Project Structure That Makes Sense

Here's what we're creating:

grocery_etl/
├── data/
│   ├── raw/         # Your messy input CSVs
│   ├── processed/   # Clean output files
├── src/
│   └── etl_pipeline.py
├── main.py
└── requirements.txt

Why this structure? Three reasons:

First, it separates concerns. Your main.py handles orchestration: starting Spark, calling functions, handling errors. Your src/etl_pipeline.py contains all the actual ETL logic. When something breaks, you'll know exactly where to look.

Second, it mirrors the organizational pattern you'll use in production pipelines (even though the specifics will differ). Whether you're deploying to Databricks, AWS EMR, or anywhere else, you'll separate concerns the same way: orchestration code (main.py), ETL logic (src/etl_pipeline.py), and clear data boundaries. The actual file paths will change (e.g., production uses distributed filesystems like s3://data-lake/raw/ or /mnt/efs/raw/ instead of local folders), but the structure scales.

Third, it keeps your local development organized. Raw data stays raw. Processed data goes to a different folder. This makes debugging easier and mirrors the input/output separation you'll have in production, just on your local machine.

Ready to start? Get the sample CSV files and project skeleton from our starter repository. You can either:

# Clone the full tutorials repo and navigate to this project
git clone https://github.com/dataquestio/tutorials.git
cd tutorials/pyspark-etl-tutorial

Or download just the pyspark-etl-tutorial folder as a ZIP from the GitHub page.

Getting Started

We'll build this project in two files:

  • src/etl_pipeline.py: All our ETL functions (extract, transform, load)
  • main.py: Orchestration logic that calls those functions

Let's set up the basics. You'll need Python 3.9+ and Java 11 or 17 installed (required for Spark 4.0). Note: In production, you'd match your PySpark version to whatever your cluster is running (Databricks, EMR, etc.).

# requirements.txt
pyspark==4.0.1
# main.py - Just the skeleton for now
from pyspark.sql import SparkSession
import logging
import sys

def main():
    # We'll complete this orchestration logic later
    pass

if __name__ == "__main__":
    main()

That's it for setup. Notice we're not installing dozens of dependencies or configuring complex build tools. We're keeping it minimal because the goal is to understand ETL patterns, not fight with tooling.

Optional: Interactive Data Exploration

Before we dive into writing pipeline code, you might want to poke around the data interactively. This is completely optional. If you prefer to jump straight into building, skip to the next section, but if you want to see what you're up against, fire up the PySpark shell:

pyspark

Now you can explore interactively from the command line:

df = spark.read.csv("data/raw/online_orders.csv", header=True)

# See the data firsthand
df.show(5, truncate=False)
df.printSchema()
df.describe().show()

# Count how many weird values we have
df.filter(df.price.contains("$")).count()
df.filter(df.customer_id.contains("TEST")).count()

This exploration helps you understand what cleaning you'll need to build into your pipeline. Real data engineers do this all the time: you load a sample, poke around, discover the problems, then write code to fix them systematically.

But interactive exploration is for understanding the data. The actual pipeline needs to be scripted, testable, and able to run without you babysitting it. That's what we're building next.

Extract: Getting Data Flexibly

The Extract phase is where most beginner ETL pipelines break. You write code that works perfectly with your test file, then the next day's data arrives with a slightly different format, and everything crashes.

We're going to read CSVs the defensive way: assume everything will go wrong, capture the problems, and keep the pipeline running.

Reading Messy CSV Files

Let's start building src/etl_pipeline.py. We'll begin with imports and a function to create our Spark session:

# src/etl_pipeline.py

from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
import logging

# Set up logger for this module
logger = logging.getLogger(__name__)

def create_spark_session():
    """Create a Spark session for our ETL job"""
    return SparkSession.builder \
        .appName("Grocery_Daily_ETL") \
        .config("spark.sql.adaptive.enabled", "true") \
        .getOrCreate()

This is a basic local configuration. Real production pipelines need more: time zone handling, memory allocation tuned to your cluster, and explicit policies for parsing dates. We'll cover those in a future tutorial on production deployment. For now, we're focusing on the pattern.
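
For reference, here's a rough sketch of what a more production-minded session builder might look like. The specific settings and values are illustrative placeholders to tune for your own cluster, not recommendations:

def create_spark_session_production():
    """A more production-leaning session builder (values are placeholders)."""
    return (
        SparkSession.builder
        .appName("Grocery_Daily_ETL")
        .config("spark.sql.adaptive.enabled", "true")
        # Keep timestamps consistent no matter where the job runs
        .config("spark.sql.session.timeZone", "UTC")
        # Driver memory must be set before the JVM starts (or via spark-submit)
        .config("spark.driver.memory", "4g")
        # Make date-parsing behavior explicit instead of relying on defaults
        .config("spark.sql.legacy.timeParserPolicy", "CORRECTED")
        .getOrCreate()
    )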

If you're new to the logging module, logger.info() writes to log files with timestamps and severity levels. When something breaks, you can check the logs to see exactly what happened. It's a small habit that saves debugging time.

Now let's read the data:

def extract_sales_data(spark, input_path):
    """Read sales CSVs with all their real-world messiness"""

    logger.info(f"Reading sales data from {input_path}")

    expected_schema = StructType([
        StructField("order_id", StringType(), True),
        StructField("customer_id", StringType(), True),
        StructField("product_name", StringType(), True),
        StructField("price", StringType(), True),
        StructField("quantity", StringType(), True),
        StructField("order_date", StringType(), True),
        StructField("region", StringType(), True)
    ])

StructType and StructField let you define exactly what columns you expect and what data types they should have. The True at the end means the field can be null. You could let Spark infer the schema automatically, but explicit schemas catch problems earlier. If someone adds a surprise column next week, you'll know immediately instead of discovering it three steps downstream.

Notice everything is StringType(). You might think "wait, customer_id has numbers, shouldn't that be IntegerType?" Here's the thing: some customer IDs are "12345" and some are "CUST_12345". If we used IntegerType(), Spark would convert "CUST_12345" to null and we'd lose data.

The strategy is simple: prevent data loss by preserving everything as strings in the Extract phase, then clean and convert in the Transform phase, where we have control over error handling.

Now let's read the file defensively:

    df = spark.read.csv(
        input_path,
        header=True,
        schema=expected_schema,
        mode="PERMISSIVE"
    )

    total_records = df.count()
    logger.info(f"Found {total_records} total records")

    return df

The PERMISSIVE mode tells Spark to be lenient with malformed data. When it encounters rows that don't match the schema, it sets unparseable fields to null instead of crashing the entire job. This keeps production pipelines running even when data quality takes a hit. We'll validate and handle data quality issues in the Transform phase, where we have better control.
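
If you want to see exactly which rows failed parsing, PERMISSIVE mode can also route them into a dedicated column. Here's a minimal sketch that reuses the spark, input_path, expected_schema, and logger names from extract_sales_data above; it's optional and not part of the tutorial pipeline:

from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import col

# Extend the expected schema with a column that Spark fills for unparseable rows
schema_with_corrupt = StructType(
    expected_schema.fields + [StructField("_corrupt_record", StringType(), True)]
)

df = spark.read.csv(
    input_path,
    header=True,
    schema=schema_with_corrupt,
    mode="PERMISSIVE",
    columnNameOfCorruptRecord="_corrupt_record",
)

# Spark expects you to cache (or save) the parsed result before querying only the corrupt column
df.cache()
bad_rows = df.filter(col("_corrupt_record").isNotNull())
logger.warning(f"Found {bad_rows.count()} rows that did not match the schema")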

Dealing with Multiple Files

Real data comes from multiple systems. Let's combine them:

def extract_all_data(spark):
    """Combine data from multiple sources"""

    # Each system exports differently
    online_orders = extract_sales_data(spark, "data/raw/online_orders.csv")
    store_orders = extract_sales_data(spark, "data/raw/store_orders.csv")
    mobile_orders = extract_sales_data(spark, "data/raw/mobile_orders.csv")

    # Union them all together
    all_orders = online_orders.unionByName(store_orders).unionByName(mobile_orders)

    logger.info(f"Combined dataset has {all_orders.count()} orders")
    return all_orders

In production, you'd often use wildcards like "data/raw/online_orders*.csv" to process multiple files at once (like daily exports). Spark reads them all and combines them automatically. We're keeping it simple here with one file per source.
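
That might look something like this (the file pattern is hypothetical):

# Read every daily export for one source in a single pass
online_orders = extract_sales_data(spark, "data/raw/online_orders*.csv")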

The .unionByName() method stacks DataFrames vertically, matching columns by name rather than position. This prevents silent data corruption if schemas don't match perfectly, which is a common issue when combining data from different systems. Since we defined the same schema for all three sources, this works cleanly.

You've now built the Extract phase: reading data defensively and combining multiple sources. The data isn't clean yet, but at least we didn't lose any of it. That's what matters in Extract.

Transform: Fixing the Data Issues

This is where the real work happens. You've got all your data loaded, good and bad separated. Now we need to turn those messy strings into clean, usable data types.

The Transform phase is where you fix all the problems you discovered during extraction. Each transformation function handles one specific issue, making the code easier to test and debug.

Standardizing Customer IDs

Remember how customer IDs come in two formats? Some are just numbers, some have the "CUST_" prefix. Let's standardize them:

# src/etl_pipeline.py (continuing in same file)

def clean_customer_id(df):
    """Standardize customer IDs (some are numbers, some are CUST_123 format)"""

    df_cleaned = df.withColumn(
        "customer_id_cleaned",
        when(col("customer_id").startswith("CUST_"), col("customer_id"))
        .when(col("customer_id").rlike("^[0-9]+$"), concat(lit("CUST_"), col("customer_id")))
        .otherwise(col("customer_id"))
    )

    return df_cleaned.drop("customer_id").withColumnRenamed("customer_id_cleaned", "customer_id")

The logic here: if it already starts with "CUST_", keep it. If it's just numbers (rlike("^[0-9]+$") checks for that), add the "CUST_" prefix. Everything else stays as-is for now. This gives us a consistent format to work with downstream.

Cleaning Price Data

Prices are often messy. Dollar signs, commas, who knows what else:

# src/etl_pipeline.py (continuing in same file)

def clean_price_column(df):
    """Fix the price column"""

    # Remove dollar signs, commas, etc. (keep digits, decimals, and negatives)
    df_cleaned = df.withColumn(
        "price_cleaned",
        regexp_replace(col("price"), r"[^0-9.\-]", "")
    )

    # Convert to decimal, default to 0 if it fails
    df_final = df_cleaned.withColumn(
        "price_decimal",
        when(col("price_cleaned").isNotNull(),
             col("price_cleaned").cast(DoubleType()))
        .otherwise(0.0)
    )

    # Flag suspicious values for review
    df_flagged = df_final.withColumn(
        "price_quality_flag",
        when(col("price_decimal") == 0.0, "CHECK_ZERO_PRICE")
        .when(col("price_decimal") > 1000, "CHECK_HIGH_PRICE")
        .when(col("price_decimal") < 0, "CHECK_NEGATIVE_PRICE")
        .otherwise("OK")
    )

    bad_price_count = df_flagged.filter(col("price_quality_flag") != "OK").count()
    logger.warning(f"Found {bad_price_count} orders with suspicious prices")

    return df_flagged.drop("price", "price_cleaned")

The regexp_replace function strips out everything that isn't a digit or decimal point. Then we convert to a proper decimal type. The quality flag column helps us track suspicious values without throwing them out. This is important: we're not perfect at cleaning, so we flag problems for humans to review later.

Note that we're assuming US price format here (periods as decimal separators). European formats with commas would need different logic, but for this tutorial, we're keeping it focused on the ETL pattern rather than international number handling.

Standardizing Dates

Date parsing is one of those things that looks simple but gets complicated fast. Different systems export dates in different formats: some use MM/dd/yyyy, others use dd-MM-yyyy, and ISO standard is yyyy-MM-dd.

def standardize_dates(df):
    """Parse dates in multiple common formats"""

    # Try each format - coalesce returns the first non-null result
    fmt1 = to_date(col("order_date"), "yyyy-MM-dd")
    fmt2 = to_date(col("order_date"), "MM/dd/yyyy")
    fmt3 = to_date(col("order_date"), "dd-MM-yyyy")

    df_parsed = df.withColumn(
        "order_date_parsed",
        coalesce(fmt1, fmt2, fmt3)
    )

    # Check how many we couldn't parse
    unparsed = df_parsed.filter(col("order_date_parsed").isNull()).count()
    if unparsed > 0:
        logger.warning(f"Could not parse {unparsed} dates")

    return df_parsed.drop("order_date")

We use coalesce() to try each format in order, taking the first one that successfully parses. This handles the most common date format variations you'll encounter.

Note: This approach works for simple date strings but doesn't handle datetime strings with times or timezones. For production systems dealing with international data or precise timestamps, you'd need more sophisticated parsing logic. For now, we're focusing on the core pattern.
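
If your source systems ever start sending full timestamps, the same coalesce pattern extends naturally. A minimal sketch, where the column name and formats are assumptions rather than part of our dataset:

from pyspark.sql.functions import to_timestamp, coalesce, col

def standardize_timestamps(df):
    """Parse datetime strings in a couple of common formats (illustrative only)."""
    ts1 = to_timestamp(col("order_timestamp"), "yyyy-MM-dd HH:mm:ss")
    ts2 = to_timestamp(col("order_timestamp"), "MM/dd/yyyy HH:mm")
    # Time zone handling would still need attention (e.g., spark.sql.session.timeZone)
    return df.withColumn("order_timestamp_parsed", coalesce(ts1, ts2))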

Removing Test Data

Test data in production is inevitable. Let's filter it out:

# src/etl_pipeline.py (continuing in same file)

def remove_test_data(df):
    """Remove test orders that somehow made it to production"""

    df_filtered = df.filter(
        ~(upper(col("customer_id")).contains("TEST") |
          upper(col("product_name")).contains("TEST") |
          col("customer_id").isNull() |
          col("order_id").isNull())
    )

    removed_count = df.count() - df_filtered.count()
    logger.info(f"Removed {removed_count} test/invalid orders")

    return df_filtered

We're checking for "TEST" in customer IDs and product names, plus filtering out any rows with null order IDs or customer IDs. That tilde (~) means "not", so we're keeping everything that doesn't match these patterns.

Handling Duplicates

Sometimes the same order appears multiple times, usually from system retries:

# src/etl_pipeline.py (continuing in same file)

def handle_duplicates(df):
    """Remove duplicate orders (usually from retries)"""

    df_deduped = df.dropDuplicates(["order_id"])

    duplicate_count = df.count() - df_deduped.count()
    if duplicate_count > 0:
        logger.info(f"Removed {duplicate_count} duplicate orders")

    return df_deduped

We keep the first occurrence of each order_id and drop the rest. Simple and effective.

Bringing It All Together

Now we chain all these transformations in sequence:

# src/etl_pipeline.py (continuing in same file)

def transform_orders(df):
    """Apply all transformations in sequence"""

    logger.info("Starting data transformation...")

    # Clean each aspect of the data
    df = clean_customer_id(df)
    df = clean_price_column(df)
    df = standardize_dates(df)
    df = remove_test_data(df)
    df = handle_duplicates(df)

    # Cast quantity to integer
    df = df.withColumn(
        "quantity",
        when(col("quantity").isNotNull(), col("quantity").cast(IntegerType()))
        .otherwise(1)
    )

    # Add some useful calculated fields
    df = df.withColumn("total_amount", col("price_decimal") * col("quantity")) \
           .withColumn("processing_date", current_date()) \
           .withColumn("year", year(col("order_date_parsed"))) \
           .withColumn("month", month(col("order_date_parsed")))

    # Rename for clarity
    df = df.withColumnRenamed("order_date_parsed", "order_date") \
           .withColumnRenamed("price_decimal", "unit_price")

    logger.info(f"Transformation complete. Final record count: {df.count()}")

    return df

Each transformation returns a new DataFrame (remember, PySpark DataFrames are immutable), so we reassign the result back to df each time. The order matters here: we clean customer IDs before removing test data because the test removal logic checks for "TEST" in customer IDs. We standardize dates before extracting year and month because those extraction functions need properly parsed dates to work. If you swap the order around, transformations can fail or produce wrong results.

We also add some calculated fields that will be useful for analysis: total_amount (price times quantity), processing_date (when this ETL ran), and time partitions (year and month) for efficient querying later.

The data is now clean, typed correctly, and enriched with useful fields. Time to save it.

Load: Saving Your Work

The Load phase is when we write the cleaned data somewhere useful. We're using pandas to write the final CSV because it avoids platform-specific issues during local development. In production on a real Spark cluster, you'd use Spark's native writers for parquet format with partitioning for better performance. For now, we're focusing on getting the pipeline working reliably across different development environments. You can always swap the output format to parquet once you deploy to a production cluster.

Writing Clean Files

Let's write our data in a way that makes future queries fast:

# src/etl_pipeline.py (continuing in same file)

def load_to_csv(spark, df, output_path):
    """Save processed data for downstream use"""

    logger.info(f"Writing {df.count()} records to {output_path}")

    # Convert to pandas for local development ONLY (not suitable for large datasets)
    pandas_df = df.toPandas()

    # Create output directory if needed
    import os
    os.makedirs(output_path, exist_ok=True)

    output_file = f"{output_path}/orders.csv"
    pandas_df.to_csv(output_file, index=False)

    logger.info(f"Successfully wrote {len(pandas_df)} records")
    logger.info(f"Output location: {output_file}")

    return len(pandas_df)

Important: The .toPandas() method collects all distributed data into the driver's memory. This is dangerous for real production data! If your dataset is larger than your driver's RAM, your job will crash. We're using this approach only because:

  1. Our tutorial dataset is tiny (85 rows)
  2. It avoids platform-specific Spark/Hadoop setup issues on Windows
  3. The focus is on learning ETL patterns, not deployment

In production, always use Spark's native writers (df.write.parquet(), df.write.csv()) even though they require proper cluster configuration. Never use .toPandas() for datasets larger than a few thousand rows or anything you wouldn't comfortably fit in a single machine's memory.
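
For reference, a production-style version of this Load step would look roughly like the sketch below; the output path is a placeholder:

def load_to_parquet(df, output_path):
    """Production-style writer: distributed, partitioned, no driver-memory bottleneck."""
    (
        df.write
        .mode("overwrite")
        .partitionBy("year", "month")  # folders like year=2025/month=10 speed up later queries
        .parquet(output_path)          # e.g., a distributed path such as s3://data-lake/processed/orders
    )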

Quick Validation with Spark SQL

Before we call it done, let's verify our data makes sense. This is where Spark SQL comes in handy:

# src/etl_pipeline.py (continuing in same file)

def sanity_check_data(spark, output_path):
    """Quick validation using Spark SQL"""

    # Read the CSV file back
    output_file = f"{output_path}/orders.csv"
    df = spark.read.csv(output_file, header=True, inferSchema=True)
    df.createOrReplaceTempView("orders")

    # Run some quick validation queries
    total_count = spark.sql("SELECT COUNT(*) as total FROM orders").collect()[0]['total']
    logger.info(f"Sanity check - Total orders: {total_count}")

    # Check for any suspicious data that slipped through
    zero_price_count = spark.sql("""
        SELECT COUNT(*) as zero_prices
        FROM orders
        WHERE unit_price = 0
    """).collect()[0]['zero_prices']

    if zero_price_count > 0:
        logger.warning(f"Found {zero_price_count} orders with zero price")

    # Verify date ranges make sense
    date_range = spark.sql("""
        SELECT
            MIN(order_date) as earliest,
            MAX(order_date) as latest
        FROM orders
    """).collect()[0]

    logger.info(f"Date range: {date_range['earliest']} to {date_range['latest']}")

    return True

The createOrReplaceTempView() method lets us query the DataFrame using SQL. This is useful for validation because SQL is often clearer for these kinds of checks than chaining DataFrame operations. We're checking the record count, looking for zero prices that might indicate cleaning issues, and verifying the date range looks reasonable.

Creating a Summary Report

Your team lead is going to ask, "How'd the ETL go today?" Let's give her the answer automatically:

# src/etl_pipeline.py (continuing in same file)

def create_summary_report(df):
    """Generate metrics about the ETL run"""

    summary = {
        "total_orders": df.count(),
        "unique_customers": df.select("customer_id").distinct().count(),
        "unique_products": df.select("product_name").distinct().count(),
        "total_revenue": df.agg(sum("total_amount")).collect()[0][0],
        "date_range": f"{df.agg(min('order_date')).collect()[0][0]} to {df.agg(max('order_date')).collect()[0][0]}",
        "regions": df.select("region").distinct().count()
    }

    logger.info("\n=== ETL Summary Report ===")
    for key, value in summary.items():
        logger.info(f"{key}: {value}")
    logger.info("========================\n")

    return summary

This generates a quick summary of what got processed. In a real production system, you might email this summary or post it to Slack so the team knows the pipeline ran successfully.

One note about performance: this summary triggers multiple separate actions on the DataFrame. Each .count() and .distinct().count() scans the data independently, which isn't optimized. We could compute all these metrics in a single pass, but that's a topic for a future tutorial on performance optimization. Right now, we're prioritizing readable code that works.
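
For the curious, here's a sketch of what that single-pass version could look like, using one agg() call over the same columns:

from pyspark.sql.functions import count, countDistinct, sum as sum_, min as min_, max as max_

# Every metric is computed in a single scan of the DataFrame
summary_row = df.agg(
    count("*").alias("total_orders"),
    countDistinct("customer_id").alias("unique_customers"),
    countDistinct("product_name").alias("unique_products"),
    sum_("total_amount").alias("total_revenue"),
    min_("order_date").alias("earliest_order"),
    max_("order_date").alias("latest_order"),
    countDistinct("region").alias("regions"),
).collect()[0]

logger.info(f"Single-pass summary: {summary_row.asDict()}")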

Putting It All Together

We've built all the pieces. Now let's wire them up into a complete pipeline that runs from start to finish.

Remember how we set up main.py as just a skeleton? Time to fill it in. This file orchestrates everything: starting Spark, calling our ETL functions in order, handling errors, and cleaning up when we're done.

The Complete Pipeline

# main.py
from pyspark.sql import SparkSession
import logging
import sys
import traceback
from datetime import datetime
import os

# Import our ETL functions
from src.etl_pipeline import *

def setup_logging():
    """Basic logging setup"""

    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(f'logs/etl_run_{datetime.now().strftime("%Y%m%d")}.log'),
            logging.StreamHandler(sys.stdout)
        ]
    )
    return logging.getLogger(__name__)

def main():
    """Main ETL pipeline"""

    # Create necessary directories
    os.makedirs('logs', exist_ok=True)
    os.makedirs('data/processed/orders', exist_ok=True)

    logger = setup_logging()
    logger.info("Starting Grocery ETL Pipeline")

    # Track runtime
    start_time = datetime.now()

    try:
        # Initialize Spark
        spark = create_spark_session()
        logger.info("Spark session created")

        # Extract
        raw_df = extract_all_data(spark)
        logger.info(f"Extracted {raw_df.count()} raw records")

        # Transform
        clean_df = transform_orders(raw_df)
        logger.info(f"Transformed to {clean_df.count()} clean records")

        # Load
        output_path = "data/processed/orders"
        load_to_csv(spark, clean_df, output_path)

        # Sanity check
        sanity_check_data(spark, output_path)

        # Create summary
        summary = create_summary_report(clean_df)

        # Calculate runtime
        runtime = (datetime.now() - start_time).total_seconds()
        logger.info(f"Pipeline completed successfully in {runtime:.2f} seconds")

    except Exception as e:
        logger.error(f"Pipeline failed: {str(e)}")
        logger.error(traceback.format_exc())
        raise

    finally:
        spark.stop()
        logger.info("Spark session closed")

if __name__ == "__main__":
    main()

Let's walk through what's happening here.

The setup_logging() function configures logging to write to both a file and the console. The log file gets named with today's date, so you'll have a history of every pipeline run. This is invaluable when you're debugging issues that happened last Tuesday.

The main function wraps everything in a try-except-finally block, which is important for production pipelines. The try block runs your ETL logic. If anything fails, the except block logs the error with a full traceback (that traceback.format_exc() is especially helpful when Spark's Java stack traces get messy). The finally block ensures we always close the Spark session, even if something crashed.

Notice we're using relative paths like "data/processed/orders". This is fine for local development but brittle in production. Real pipelines use environment variables or configuration files for paths. We'll cover that in a future tutorial on production deployment.
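A common lightweight approach is to read paths from environment variables with sensible local defaults. A quick sketch, with variable names made up for illustration:

import os

# Fall back to local folders when the environment variables aren't set
RAW_DATA_PATH = os.environ.get("GROCERY_ETL_RAW_PATH", "data/raw")
OUTPUT_PATH = os.environ.get("GROCERY_ETL_OUTPUT_PATH", "data/processed/orders")

raw_df = extract_sales_data(spark, f"{RAW_DATA_PATH}/online_orders.csv")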

Running Your Pipeline

With everything in place, you can run your pipeline with spark-submit:

# Basic run
spark-submit main.py

# With more memory for bigger datasets
spark-submit --driver-memory 4g main.py

# See what's happening with Spark's adaptive execution
spark-submit --conf spark.sql.adaptive.enabled=true main.py

The first time you run this, you'll probably hit some issues, but that's completely normal. Let's talk about the most common ones.

Common Issues You'll Hit

No ETL pipeline works perfectly on the first try. Here are the problems everyone runs into and how to fix them.

Memory Errors

If you see java.lang.OutOfMemoryError, Spark ran out of memory. Since we're using .toPandas() to write our output, this most commonly happens if your cleaned dataset is too large to fit in the driver's memory:

# Option 1: Increase driver memory
spark-submit --driver-memory 4g main.py

# Option 2: Sample the data first to verify the pipeline works
df.sample(0.1).toPandas()  # Process 10% to test

# Option 3: Switch to Spark's native CSV writer for large data
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(output_path)

For local development with reasonably-sized data, increasing driver memory usually solves the problem. For truly massive datasets, you'd switch back to Spark's distributed writers.

Schema Mismatches

If you get "cannot resolve column name" errors, your DataFrame doesn't have the columns you think it does:

# Debug by checking what columns actually exist
df.printSchema()
print(df.columns)

This usually means a transformation dropped or renamed a column, and you forgot to update the downstream code.

Slow Performance

If your pipeline is running but taking forever, don't worry about optimization yet. That's a whole separate topic. For now, just get it working. But if it's really unbearably slow, try caching DataFrames you reference multiple times:

df.cache()  # Keep frequently used data in memory

Just remember to call df.unpersist() when you're done with it to free up memory.

What You've Accomplished

You just built a complete ETL pipeline from scratch. Here's what you learned:

  • You can handle messy real-world data. CSV files with dollar signs in prices, mixed date formats, and test records mixed into production data.
  • You can structure projects professionally. Separate functions for extract, transform, and load. Logging that helps you debug failures. Error handling that keeps the pipeline running when something goes wrong.
  • You know how to run production-style jobs. Code you can deploy with spark-submit that runs on a schedule.
  • You can spot and flag data quality issues. Suspicious prices get flagged. Test data gets filtered. Summary reports tell you what processed.

This is the foundation every data engineer needs. You're ready to build ETL pipelines for real projects.

What's Next

This pipeline works, but it's not optimized. Here's what comes after you’re comfortable with the basics:

  • Performance optimization - Make this pipeline 10x faster by reducing shuffles, tuning partitions, and computing metrics efficiently.
  • Production deployment - Run this on Databricks or EMR. Handle configuration properly, monitor with metrics, and schedule with Airflow.
  • Testing and validation - Write tests for your transformations. Add data quality checks. Build confidence that changes won't break production.

But those are advanced topics. For now, you've built something real. Take a break, then find a messy CSV dataset and build an ETL pipeline for it. The best way to learn is by doing, so here's a concrete exercise to cement what you've learned:

  1. Find any CSV dataset (Kaggle has thousands)
  2. Build an ETL pipeline for it
  3. Add handling for three data quality issues you discover
  4. Output clean parquet files partitioned by a date or category field
  5. Create a summary report showing what you processed

You now have the foundation every data engineer needs. The next time you see messy data at work, you'll know exactly how to turn it into something useful.

To learn more about PySpark, check out the rest of our tutorial series.

How to Upgrade LMDE 6 to LMDE 7

13 October 2025 at 20:58

LMDE 7 Upgrade

A quick and painless tutorial on how to upgrade your Linux Mint Debian Edition (LMDE) 6 installations to Linux Mint Debian Edition (LMDE) 7.

The post How to Upgrade LMDE 6 to LMDE 7 appeared first on 9to5Linux - do not reproduce this article without permission. This RSS feed is intended for readers, not scrapers.

Introduction to Apache Airflow

13 October 2025 at 23:26

Imagine this: you’re a data engineer at a growing company that thrives on data-driven decisions. Every morning, dashboards must refresh with the latest numbers, reports need updating, and machine learning models retrain with new data.

At first, you write a few scripts, one to pull data from an API, another to clean it, and a third to load it into a warehouse. You schedule them with cron or run them manually when needed. It works fine, until it doesn’t.

As data volumes grow, scripts multiply, and dependencies become increasingly tangled. Failures start cascading, jobs run out of order, schedules break, and quick fixes pile up into fragile automation. Before long, you're maintaining a system held together by patchwork scripts and luck. That’s where data orchestration comes in.

Data orchestration coordinates multiple interdependent processes, ensuring each task runs in the correct order, at the right time, and under the right conditions. It’s the invisible conductor that keeps data pipelines flowing smoothly from extraction to transformation to loading, reliably and automatically. And among the most powerful and widely adopted orchestration tools is Apache Airflow.

In this tutorial, we’ll use Airflow as our case study to explore how workflow orchestration works in practice. You’ll learn what orchestration means, why it matters, and how Airflow’s architecture, with its DAGs, tasks, operators, scheduler, and new event-driven features, brings order to complex data systems.

By the end, you’ll understand not just how Airflow orchestrates workflows, but why orchestration itself is the cornerstone of every scalable, reliable, and automated data engineering ecosystem.

What Workflow Orchestration Is and Why It Matters

Modern data pipelines involve multiple interconnected stages: data extraction, transformation, loading, and often downstream analytics or machine learning. Each stage depends on the successful completion of the previous one, forming a chain that must execute in the correct order and at the right time.

Many data engineers start by managing these workflows with scripts or cron jobs. But as systems grow, dependencies multiply, and processes become more complex, this manual approach quickly breaks down:

  • Unreliable execution: Tasks may run out of order, producing incomplete or inconsistent data.
  • Limited visibility: Failures often go unnoticed until reports or dashboards break.
  • Poor scalability: Adding new tasks or environments becomes error-prone and hard to maintain.

Workflow orchestration solves these challenges by automating, coordinating, and monitoring interdependent tasks. It ensures each step runs in the right sequence, at the right time, and under the right conditions, bringing structure, reliability, and transparency to data operations.

With orchestration, a loose collection of scripts becomes a cohesive system that can be observed, retried, and scaled, freeing engineers to focus on building insights rather than fixing failures.

Apache Airflow uses these principles and extends them with modern capabilities such as:

  • Deferrable sensors and the triggerer: Improve efficiency by freeing workers while waiting for external events like file arrivals or API responses.
  • Built-in idempotency and backfills: Safely re-run historical or failed workflows without duplication.
  • Data-aware scheduling: Enable event-driven pipelines that automatically respond when new data arrives.

While Airflow is not a real-time streaming engine, it excels at orchestrating batch and scheduled workflows with reliability, observability, and control. Trusted by organizations like Airbnb, Meta, and NASA, it remains the industry standard for automating and scaling complex data workflows.

Next, we’ll explore Airflow’s core concepts, DAGs, tasks, operators, and the scheduler, to see orchestration in action.

Core Airflow Concepts

To understand how Airflow orchestrates workflows, let’s explore its foundational components: the DAG, tasks, scheduler, executor, triggerer, and metadata database.

Together, these components coordinate how data flows from extraction to transformation, model training, and loading results in a seamless, automated pipeline.

We’ll use a simple ETL (Extract → Transform → Load) data workflow as our running example. Each day, Airflow will:

  1. Collect daily event data,
  2. Transform it into a clean format,
  3. Upload the results to Amazon S3.

This process will help us connect each concept to a real-world orchestration scenario.

i. DAG (Directed Acyclic Graph)

A DAG is the blueprint of your workflow. It defines what tasks exist and in what order they should run.

Think of it as the pipeline skeleton that connects your data extraction, transformation, and loading steps:

collect_data → transform_data → upload_results

DAGs can be triggered by time (e.g., daily schedules) or events, such as when a new dataset or asset becomes available.

from airflow.decorators import dag
from datetime import datetime

@dag(
    dag_id="daily_ml_pipeline",
    schedule="@daily",
    start_date=datetime(2025, 10, 7),
    catchup=False,
)
def pipeline():
    pass

The @dag line is a decorator, a Python feature that lets you add behavior or metadata to functions in a clean, readable way. In this case, it turns the pipeline() function into a fully functional Airflow DAG.

The DAG defines when and in what order your workflow runs, but the individual tasks define how each step actually happens.

If you want to learn more about Python decorators, check out our lesson on Building a Pipeline Class to see them in action.

  • Don’t worry if the code above feels overwhelming. In the next tutorial, we’ll take a closer look at these pieces and see how they work in Airflow. For now, we’ll keep things simple and more conceptual.

ii. Tasks: The Actions Within the Workflow

A task is the smallest unit of work in Airflow, a single, well-defined action, like fetching data, cleaning it, or training a model.

If the DAG defines the structure, tasks define the actions that bring it to life.

Using the TaskFlow API, you can turn any Python function into a task with the @task decorator:

from airflow.decorators import task

@task
def collect_data():
    print("Collecting event data...")
    return "raw_events.csv"

@task
def transform_data(file):
    print(f"Transforming {file}")
    return "clean_data.csv"

@task
def upload_to_s3(file):
    print(f"Uploading {file} to S3...")

Tasks can be linked simply by calling them in sequence:

upload_to_s3(transform_data(collect_data()))

Airflow automatically constructs the DAG relationships, ensuring that each step runs only after its dependency completes successfully.

iii. From Operators to the TaskFlow API

In earlier Airflow versions, you defined each task using explicit operators, for example a PythonOperator or BashOperator, to tell Airflow how to execute the logic.

Airflow simplifies this with the TaskFlow API, eliminating boilerplate while maintaining backward compatibility.

# Old style (Airflow 1 & 2)
from airflow.operators.python import PythonOperator

task_transform = PythonOperator(
    task_id="transform_data",
    python_callable=transform_data
)

With the TaskFlow API, you no longer need to create operators manually. Each @task function automatically becomes an operator-backed task.

# Airflow 3
@task
def transform_data():
    ...

Under the hood, Airflow still uses operators as the execution engine, but you no longer need to create them manually. The result is cleaner, more Pythonic workflows.

iv. Dynamic Task Mapping: Scaling the Transformation

Modern data workflows often need to process multiple files, users, or datasets in parallel.

Dynamic task mapping allows Airflow to create task instances at runtime based on data inputs, perfect for scaling transformations.

@task
def get_files():
    return ["file1.csv", "file2.csv", "file3.csv"]

@task
def transform_file(file):
    print(f"Transforming {file}")

transform_file.expand(file=get_files())

Airflow will automatically create and run a separate transform_file task for each file, enabling efficient, parallel execution.

v. Scheduler and Triggerer

The scheduler decides when tasks run, either on a fixed schedule or in response to updates in data assets.

The triggerer, on the other hand, handles event-based execution behind the scenes, using asynchronous I/O to efficiently wait for external signals like file arrivals or API responses.

from airflow.assets import Asset 
events_asset = Asset("s3://data/events.csv")

@dag(
    dag_id="event_driven_pipeline",
    schedule=[events_asset],  # Triggered automatically when this asset is updated
    start_date=datetime(2025, 10, 7),
    catchup=False,
)
def pipeline():
    ...

In this example, the scheduler monitors the asset and triggers the DAG when new data appears.

If the DAG included deferrable operators or sensors, the triggerer would take over waiting asynchronously, ensuring Airflow handles both time-based and event-driven workflows seamlessly.

vi. Executor and Workers

Once a task is ready to run, the executor assigns it to available workers, the machines or processes that actually execute your code.

For example, your ETL pipeline might look like this:

collect_data → transform_data → upload_results

Airflow decides where each of these tasks runs. It can execute everything on a single machine using the LocalExecutor, or scale horizontally across multiple nodes with the CeleryExecutor or KubernetesExecutor.

Deferrable tasks further improve efficiency by freeing up workers while waiting for long external operations like API responses or file uploads.
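
As a rough sketch, a deferrable sensor might look like the snippet below. It assumes the Amazon provider package is installed and that your provider version supports the deferrable flag; the task ID and S3 key are hypothetical:

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

# Defined inside a DAG context (e.g., within a @dag-decorated function)
wait_for_export = S3KeySensor(
    task_id="wait_for_daily_export",
    bucket_key="s3://data/events.csv",  # full S3 URL, so no separate bucket_name needed
    deferrable=True,                    # hand the wait to the triggerer instead of holding a worker
)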

vii. Metadata Database and API Server: The Memory and Interface

Every action in Airflow, task success, failure, duration, or retry, is stored in the metadata database, Airflow’s internal memory.

This makes workflows reproducible, auditable, and observable.

The API server provides visibility and control:

  • View and trigger DAGs,
  • Inspect logs and task histories,
  • Track datasets and dependencies,
  • Monitor system health (scheduler, triggerer, database).

Together, they give you complete insight into orchestration, from individual task logs to system-wide performance.

Exploring the Airflow UI

Every orchestration platform needs a way to observe, manage, and interact with workflows, and in Apache Airflow, that interface is the Airflow Web UI.

The UI is served by the Airflow API Server, which exposes a rich dashboard for visualizing DAGs, checking system health, and monitoring workflow states. Even before running any tasks, it’s useful to understand the layout and purpose of this interface, since it’s where orchestration becomes visible.

Don’t worry if this section feels too conceptual; you’ll explore the Airflow UI in greater detail during the upcoming tutorial. You can also use our Setting up Apache Airflow with Docker Locally (Part I) guide if you’d like to try it right away.

The Role of the Airflow UI in Orchestration

In an orchestrated system, automation alone isn’t enough, engineers need visibility.

The UI bridges that gap. It provides an interactive window into your pipelines, showing:

  • Which workflows (DAGs) exist,
  • Their current state (active, running, or failed),
  • The status of Airflow’s internal components,
  • Historical task performance and logs.

This visibility is essential for diagnosing failures, verifying dependencies, and ensuring the orchestration system runs smoothly.

i. The Home Page Overview

The Airflow UI opens to a dashboard like the one shown below:

The Home Page Overview

At a glance, you can see:

  • Failed DAGs / Running DAGs / Active DAGs, A quick summary of the system’s operational state.
  • Health Indicators — Status checks for Airflow’s internal components:
    • MetaDatabase: Confirms the metadata database connection is healthy.
    • Scheduler: Verifies that the scheduler is running and monitoring DAGs.
    • Triggerer: Ensures event-driven workflows can be activated.
    • DAG Processor: Confirms DAG files are being parsed correctly.

These checks reflect the orchestration backbone at work, even if no DAGs have been created yet.

ii. DAG Management and Visualization

DAG Management and Visualization

In the left sidebar, the DAGs section lists all workflow definitions known to Airflow.

This doesn’t require you to run anything; it’s simply where Airflow displays every DAG it has parsed from the dags/ directory.

Each DAG entry includes:

  • The DAG name and description,
  • Schedule and next run time,
  • Last execution state
  • Controls to enable, pause, or trigger it manually.

When workflows are defined, you’ll be able to explore their structure visually through:

DAG Management and Visualization (2)

  • Graph View — showing task dependencies
  • Grid View — showing historical run outcomes

These views make orchestration transparent, every dependency, sequence, and outcome is visible at a glance.

iii. Assets and Browse

In the sidebar, the Assets and Browse sections provide tools for exploring the internal components of your orchestration environment.

  • Assets list all registered items, such as datasets, data tables, or connections that Airflow tracks or interacts with during workflow execution. It helps you see the resources your DAGs depend on. (Remember: in Airflow 3.x, “Datasets” were renamed to “Assets.”)

    Assets and Browse

  • Browse allows you to inspect historical data within Airflow, including past DAG runs, task instances, logs, and job details. This section is useful for auditing and debugging since it reveals how workflows behaved over time.

    Assets and Browse (2)

Together, these sections let you explore both data assets and orchestration history, offering transparency into what Airflow manages and how your workflows evolve.

iv. Admin

The Admin section provides the configuration tools that control Airflow’s orchestration environment.

Admin

Here, administrators can manage the system’s internal settings and integrations:

  • Variables – store global key–value pairs that DAGs can access at runtime,
  • Pools – limit the number of concurrent tasks to manage resources efficiently,
  • Providers – list the available integration packages (e.g., AWS, GCP, or Slack providers),
  • Plugins – extend Airflow’s capabilities with custom operators, sensors, or hooks,
  • Connections – define credentials for databases, APIs, and cloud services,
  • Config – view configuration values that determine how Airflow components run,

This section essentially controls how Airflow connects, scales, and extends itself, making it central to managing orchestration behavior in both local and production setups.
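
For instance, a task can pull these settings at runtime. A minimal sketch; the variable key and connection ID below are hypothetical:

from airflow.decorators import task
from airflow.models import Variable
from airflow.hooks.base import BaseHook

@task
def load_settings():
    # Global key-value pair defined under Admin -> Variables
    region = Variable.get("default_region", default_var="us-east-1")
    # Credentials defined under Admin -> Connections
    warehouse = BaseHook.get_connection("analytics_warehouse")
    print(f"Loading data for region {region} into {warehouse.host}")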

v. Security

The Security section governs authentication and authorization within Airflow’s web interface.

Security

It allows administrators to manage users, assign roles, and define permissions that determine who can access or modify specific parts of the system.

Within this menu:

  • Users – manage individual accounts for accessing the UI.
  • Roles – define what actions users can perform (e.g., view-only vs. admin).
  • Actions, Resources, Permissions – provide fine-grained control over what parts of Airflow a user can interact with.

Strong security settings ensure that orchestration remains safe, auditable, and compliant, particularly in shared or enterprise environments.

vii. Documentation

At the bottom of the sidebar, Airflow provides quick links under the Documentation section.

Documentation

This includes direct access to:

  • Official Documentation – the complete Airflow user and developer guide,
  • GitHub Repository – the open-source codebase for Airflow,
  • REST API Reference – detailed API endpoints for programmatic orchestration control,
  • Version Info – the currently running Airflow version,

These links make it easy for users to explore Airflow’s architecture, extend its features, or troubleshoot issues, right from within the interface.

Airflow vs Cron

Airflow vs Cron

Many data engineers start automation with cron, the classic Unix scheduler: simple, reliable, and perfect for a single recurring script.

But as soon as workflows involve multiple dependent steps, data triggers, or retry logic, cron’s simplicity turns into fragility.

Apache Airflow moves beyond time-based scheduling into workflow orchestration, managing dependencies, scaling dynamically, and responding to data-driven events, all through native Python.

i. From Scheduling to Dynamic Orchestration

Cron schedules jobs strictly by time:

# Run a data cleaning script every midnight
0 0 * * * /usr/local/bin/clean_data.sh

That works fine for one job, but it breaks down when you need to coordinate a chain like:

extract → transform → train → upload

Cron can’t ensure that step two waits for step one, or that retries occur automatically if a task fails.

In Airflow, you express this entire logic natively in Python using the TaskFlow API:

from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule="@daily", start_date=datetime(2025,10,7), catchup=False)
def etl_pipeline():
    @task
    def extract(): ...

    @task
    def transform(data): ...

    @task
    def load(data): ...

    load(transform(extract()))

Here, tasks are functions, dependencies are inferred from function calls, and Airflow handles execution, retries, and state tracking automatically.

It’s the difference between telling the system when to run and teaching it how your workflow fits together.

ii. Visibility, Reliability, and Data Awareness

Where cron runs in the background, Airflow makes orchestration observable and intelligent.

Its Web UI and API provide transparency, showing task states, logs, dependencies, and retry attempts in real time.

Failures trigger automatic retries, and missed runs can be easily backfilled to maintain data continuity.
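
Retries, for example, are just task-level settings rather than external wrapper scripts. A minimal sketch:

from datetime import timedelta
from airflow.decorators import task

@task(retries=3, retry_delay=timedelta(minutes=5))
def extract():
    # If this raises, Airflow re-runs it up to three times before marking it failed
    print("Pulling data from a flaky API…")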

Airflow also introduces data-aware scheduling: workflows can now run automatically when a dataset or asset updates, not just on a clock.

from airflow.assets import Asset  
sales_data = Asset("s3://data/sales.csv")

@dag(schedule=[sales_data], start_date=datetime(2025,10,7))
def refresh_dashboard():
    ...

This makes orchestration responsive, pipelines react to new data as it arrives, keeping dashboards and downstream models always fresh.

iii. Why This Matters

Cron is a timer.

Airflow is an orchestrator, coordinating complex, event-driven, and scalable data systems.

It brings structure, visibility, and resilience to automation, ensuring that each task runs in the right order, with the right data, and for the right reason.

That’s the leap from scheduling to orchestration, and why Airflow is much more than cron with an interface.

Common Airflow Use Cases

Workflow orchestration underpins nearly every data-driven system, from nightly ETL jobs to continuous model retraining.

Because Airflow couples time-based scheduling with dataset awareness and dynamic task mapping, it adapts easily to many workloads.

Below are the most common production-grade scenarios, all achievable through the TaskFlow API and Airflow’s modular architecture.

i. ETL / ELT Pipelines

ETL (Extract, Transform, Load) remains Airflow’s core use case.

Airflow lets you express a complete ETL pipeline declaratively, with each step defined as a Python @task.

from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule="@daily", start_date=datetime(2025,10,7), catchup=False)
def daily_sales_etl():

    @task
    def extract_sales():
        print("Pulling daily sales from API…")
        return ["sales_us.csv", "sales_uk.csv"]

    @task
    def transform_file(file):
        print(f"Cleaning and aggregating {file}")
        return f"clean_{file}"

    @task
    def load_to_warehouse(files):
        print(f"Loading {len(files)} cleaned files to BigQuery")

    # Dynamic Task Mapping: one transform per file
    cleaned = transform_file.expand(file=extract_sales())
    load_to_warehouse(cleaned)

daily_sales_etl()

Because each transformation task is created dynamically at runtime, the pipeline scales automatically as data sources grow.

When paired with datasets or assets, ETL DAGs can trigger immediately when new data arrives, ensuring freshness without manual scheduling.

ii. Machine Learning Pipelines

Airflow is ideal for orchestrating end-to-end ML lifecycles, data prep, training, evaluation, and deployment.

@dag(schedule="@weekly", start_date=datetime(2025,10,7))
def ml_training_pipeline():

    @task
    def prepare_data():
        return ["us_dataset.csv", "eu_dataset.csv"]

    @task
    def train_model(dataset):
        print(f"Training model on {dataset}")
        return f"model_{dataset}.pkl"

    @task
    def evaluate_models(models):
        print(f"Evaluating {len(models)} models and pushing metrics")

    # Fan-out training jobs
    models = train_model.expand(dataset=prepare_data())
    evaluate_models(models)

ml_training_pipeline()

Dynamic Task Mapping enables fan-out parallel training across datasets, regions, or hyper-parameters, a common pattern in large-scale ML systems.

Airflow’s deferrable sensors can pause training until external data or signals are ready, conserving compute resources.

iii. Analytics and Reporting

Analytics teams rely on Airflow to refresh dashboards and reports automatically.

Airflow can combine time-based and dataset-triggered scheduling so that dashboards always use the latest processed data.

from airflow import Dataset

summary_dataset = Dataset("s3://data/summary_table.csv")

@dag(schedule=[summary_dataset], start_date=datetime(2025,10,7))
def analytics_refresh():

    @task
    def update_powerbi():
        print("Refreshing Power BI dashboard…")

    @task
    def send_report():
        print("Emailing daily analytics summary")

    update_powerbi() >> send_report()

Whenever the summary dataset updates, this DAG runs immediately; no need to wait for a timed window.

That ensures dashboards remain accurate and auditable.

iv. Data Quality and Validation

Trusting your data is as important as moving it.

Airflow lets you automate quality checks and validations before promoting data downstream.

  • Run dbt tests or Great Expectations validations as tasks.
  • Use deferrable sensors to wait for external confirmations (e.g., API signals or file availability) without blocking workers.
  • Fail fast or trigger alerts when anomalies appear.

@task
def validate_row_counts():
    print("Comparing source and target row counts…")

@task
def check_schema():
    print("Ensuring schema consistency…")

validate_row_counts() >> check_schema()

These validations can be embedded directly into the main ETL DAG, creating self-monitoring pipelines that prevent bad data from spreading.

v. Infrastructure Automation and DevOps

Beyond data, Airflow orchestrates operational workflows such as backups, migrations, or cluster scaling.

With the Task SDK and provider integrations, you can automate infrastructure the same way you orchestrate data:

@dag(schedule="@daily", start_date=datetime(2025,10,7))
def infra_maintenance():

    @task
    def backup_database():
        print("Triggering RDS snapshot…")

    @task
    def cleanup_old_files():
        print("Deleting expired objects from S3…")

    backup_database() >> cleanup_old_files()

Airflow turns these system processes into auditable, repeatable, and observable jobs, blending DevOps automation with data-engineering orchestration.

With Airflow, orchestration goes beyond timing, it becomes data-aware, event-driven, and infinitely scalable, empowering teams to automate everything from raw data ingestion to production-ready analytics.

Summary and Up Next

In this tutorial, you explored the foundations of workflow orchestration and how Apache Airflow modernizes data automation through a modular, Pythonic, and data-aware architecture. You learned how Airflow structures workflows using DAGs and the TaskFlow API, scales effortlessly through Dynamic Task Mapping, and responds intelligently to data and events using deferrable tasks and the triggerer.

You also saw how its scheduler, executor, and web UI work together to ensure observability, resilience, and scalability far beyond what traditional schedulers like cron can offer.

In the next tutorial, you’ll bring these concepts to life by installing and running Airflow with Docker, setting up a complete environment where all core services, the apiserver, scheduler, metadata database, triggerer, and workers, operate in harmony.

From there, you’ll create and monitor your first DAG using the TaskFlow API, define dependencies and schedules, and securely manage connections and secrets.

Further Reading

Explore the official Airflow documentation to deepen your understanding of new features and APIs, and prepare your Docker environment for the next tutorial.

Then, apply what you’ve learned to start orchestrating real-world data workflows efficiently, reliably, and at scale.
