
Dataquest vs DataCamp: Which Data Science Platform Is Right for You?

6 December 2025 at 03:36

You're investing time and money in learning data science, so choosing the right platform matters.

Both Dataquest and DataCamp teach you to code in your browser. Both have exercises and projects. But they differ fundamentally in how they prepare you for actual work.

This comparison will help you understand which approach fits your goals.

Dataquest vs DataCamp

Portfolio Projects: The Thing That Actually Gets You Hired

Hiring managers care about proof you can solve problems. Your portfolio provides that proof. Course completion certificates from either platform just show you finished the material.

When you apply for data jobs, hiring managers want to see what you can actually do. They want GitHub repositories with real projects. They want to see how you handle messy data, how you communicate insights, how you approach problems. A certificate from any platform matters less than three solid portfolio projects.

Most successful career changers have 3 to 5 portfolio projects showcasing different skills. Data cleaning and analysis. Visualization and storytelling. Maybe some machine learning or recommendation systems. Each project becomes a talking point in interviews.

How Dataquest Builds Your Portfolio

Dataquest includes over 30 guided projects using real, messy datasets. Every project simulates a realistic business scenario. You might analyze Kickstarter campaign data to identify what makes projects successful. Or explore Hacker News post patterns to understand user engagement. Or build a recommendation system analyzing thousands of user ratings.

Here's the critical advantage: all datasets are downloadable.

This means you can recreate these projects in your local environment. You can push them to GitHub with proper documentation. You can show employers exactly what you built, not just claim you learned something. When you're in an interview, and someone asks, "Tell me about a time you worked with messy data," you point to your GitHub and walk them through your actual code.

These aren't toy exercises. One Dataquest project has you working with a dataset of 50,000+ app reviews, cleaning inconsistent entries, handling missing values, and extracting insights. That's the kind of work you'll do on day one of a data job.

Your Dataquest projects become your job application materials while you're learning.

How DataCamp Approaches Projects

DataCamp offers 150+ hands-on projects on its platform. You complete these projects within the DataCamp environment, working with data and building analyses.

The limitation: you cannot download the datasets.

This means your projects stay within DataCamp's ecosystem. You can describe what you learned and document your approach, but it's harder to show your actual work to potential employers. You can't easily transfer these to GitHub as standalone portfolio pieces.

DataCamp does offer DataLab, an AI-powered notebook environment where you can build analyses. Some users create impressive work in DataLab, and it connects to real databases like Snowflake and BigQuery. But the work remains platform-dependent.

Our verdict: For career changers who need a portfolio to get interviews, Dataquest has a clear advantage here. DataCamp projects work well as learning tools, but many DataCamp users report needing to build independent projects outside the platform to have something substantial to show employers. If portfolio building is your priority, and it should be, Dataquest gives you a significant head start.

How You Actually Learn

Both platforms have browser-based coding environments. Both provide guidance and support. The real difference is in what you're practicing and why.

Dataquest: Practicing Realistic Work Scenarios

When you open a Dataquest lesson, you see a split screen. The explanation and instructions are on the left. Your code editor is on the right.

Dataquest Live Coding Demo

You read a brief explanation with examples, then write code immediately. But what makes it different is that the exercises simulate realistic scenarios from actual data jobs.

You receive clear instructions on the goal and the general approach. Hints are available if you get stuck. The Chandra AI assistant provides context-aware help without giving away answers. There's a Community forum for additional support. You're never abandoned or thrown to the wolves.

You write the complete solution with full guidance throughout the process. The challenge comes from the problem being real, not from a lack of support.

This learning approach helps you build:

  1. Problem-solving approaches that transfer directly to jobs.
  2. Debugging skills, because your code won't always work on the first try, just like in real work.
  3. Confidence tackling unfamiliar problems.
  4. The ability to break down complex tasks into manageable steps.
  5. Experience working with messy, realistic data that doesn't behave perfectly.

This means you're solving the kinds of problems you'll face on day one of a data job. Every mistake you make while learning saves you from making it in an interview or during your first week at work.

DataCamp: Teaching Syntax Through Structured Exercises

DataCamp takes a different approach. You watch a short video, typically 3 to 4 minutes, where an expert instructor explains a concept with clear examples and visual demonstrations.

Then you complete an exercise that focuses on applying that specific syntax or function. Often, some code is already provided. You add or modify specific sections to complete the task. The instructions clearly outline exactly what to do at each step.

For example: "Use the mean() method on the df['sales'] column to find its average."

You earn XP points for completing exercises. The gamification system rewards progress with streaks and achievements. The structure is optimized for quick wins and steady forward momentum.

This approach genuinely helps beginners overcome intimidation. Video instruction provides visual clarity that many people need. The scaffolding helps you stay on track and avoid getting lost. Quick wins build motivation and confidence.

The trade-off is that exercises can feel more like syntax memorization than problem-solving. There's less emphasis on understanding why you're taking a particular approach. Some users complete exercises without deeply understanding the underlying concepts.

Research across Reddit and review sites consistently surfaces this pattern. One user put it clearly:

The exercises are all fill-in-the-blank. This is not a good teaching method, at least for me. I felt the exercises focused too much on syntax and knowing what functions to fill in, and not enough on explaining why you want to use a function and what kind of trade-offs are there. The career track isn’t super cohesive. Going from one course to the next isn’t smooth and the knowledge you learn from one course doesn’t carry to the next.

DataCamp teaches you what functions do. Dataquest teaches you when and why to use them in realistic contexts. Both are valuable at different stages.

Our verdict: Choose Dataquest if you want realistic problem-solving practice that transfers directly to job work. Choose DataCamp if you prefer structured video instruction and need confidence-building scaffolding.

Content Focus: Career Preparation vs. Broad Exploration

The differences in the course catalog reflect each platform's philosophy.

Dataquest's Focused Career Paths

Dataquest offers 109 courses organized into 7 career paths and 18 skill paths. Every career path is designed around an actual job role:

  1. Data Analyst in Python
  2. Data Analyst in R
  3. Data Scientist in Python
  4. Data Engineer in Python
  5. Business Analyst with Tableau
  6. Business Analyst with Power BI
  7. Junior Data Analyst

The courses build on each other in a logical progression. There's no fluff or tangential topics. Everything connects directly to your end goal.

The career paths aren't just organized courses. They're blueprints for specific jobs. You learn exactly the skills that role requires, in the order that makes sense for building competence.

For professionals who want targeted upskilling, Dataquest skill paths let you focus on exactly what you need. Want to level up your SQL? There's a path for that. Need machine learning fundamentals? Focused path. Statistics and probability? Covered.

What's included: Python, R, SQL for data work. Libraries like pandas, NumPy for manipulation and analysis. Statistics, probability, and machine learning. Data visualization. Power BI and Tableau for business analytics. Command line, Git, APIs, web scraping. For data engineering: PostgreSQL, data pipelines, and ETL processes.

What's not here: dozens of programming languages, every new technology, broad surveys of tools you might never use. This is intentional. The focus is on core skills that transfer across tools and on depth over breadth.

If you know you want a data career, this focused approach eliminates decision paralysis. No wondering what to learn next. No wasting time on tangential topics. Just a clear path from where you are to being job-ready.

DataCamp's Technology Breadth

DataCamp offers over 610 courses spanning a huge range of technologies. Python, R, SQL, plus Java, Scala, Julia. Cloud platforms including AWS, Azure, Snowflake, and Databricks. Business intelligence tools like Power BI, Tableau, and Looker. DevOps tools including Docker, Kubernetes, Git, and Shell. Emerging technologies like ChatGPT, Generative AI, LangChain, and dbt.

The catalog includes 70+ skill tracks covering nearly everything you might encounter in data and adjacent fields.

This breadth is genuinely impressive and serves specific needs well. If you're a professional exploring new tools for work, you can sample many technologies before committing. Corporate training benefits from having so many options in one place. If you want to stay current with emerging trends, DataCamp adds new courses regularly.

The trade-off is that breadth can mean less depth in core fundamentals. More choices create more decision paralysis about what to learn. With 610 courses, some are inevitably stronger than others. You might learn surface-level understanding across many tools rather than deep competence in the essential ones.

Our verdict: If you know you want a data career and need a clear path from start to job-ready, Dataquest's focused curriculum serves you better. If you're exploring whether data science fits you, or you need exposure to many technologies for your current role, DataCamp's breadth makes more sense.

Pricing as an Investment in Your Career

Let's talk about cost, because this matters when you're making a career change or investing in professional development.

Understanding the Real Investment

These aren't just subscriptions you're comparing. They're investments in a career change or significant professional growth. The real question isn't "which costs less per month?" It's "which gets me job-ready fastest and provides a better return on my investment?"

For career changers, the opportunity cost matters more than the subscription price. If one platform gets you hired three months faster, that's three months of higher salary. That value dwarfs a \$200 per year price difference.

Dataquest: Higher Investment, Faster Outcomes

Dataquest costs \$49 per month or \$399 per year, but plans often go on sale for up to 50% off. There's also a lifetime option available, typically \$500 to \$700 when on sale. You get a 14-day money-back guarantee, plus a satisfaction guarantee: complete a career path and receive a refund if you're not satisfied with the outcomes.

The free tier includes the first 2 to 3 courses in each career path, so you can genuinely try before committing.

Yes, Dataquest costs more upfront. But consider what you're getting: every dollar includes portfolio-building projects with downloadable datasets. The focused curriculum means less wasted time on topics that won't help you get hired. The realistic exercises build job-ready skills faster.

Career changers using Dataquest report a median salary increase of \$30,000 after completing their programs. Alumni work at Facebook, Uber, Amazon, Deloitte, and Spotify.

Do the math on opportunity cost. If Dataquest's approach gets you hired even three months faster, the value is easily \$15,000 to \$20,000 in additional salary during those months. One successful career change pays for years of subscription.

DataCamp: Lower Cost, Broader Access

DataCamp costs \$28 per month when billed annually, totaling \$336 per year. Students with a .edu email address get 50% off, bringing annual cost down to around \$149. The free tier gives you the first chapter of every course. You also get a 14-day money-back guarantee.

The lower price is genuinely more accessible for budget-conscious learners. The student pricing is excellent for people still in school. There's a lower barrier to entry if you're not sure about your commitment yet.

DataCamp's lower price may mean a longer learning journey. You'll likely need additional time to build an independent portfolio since the projects don't transfer as easily. But if you're exploring rather than committing, or if budget is a serious constraint, the lower cost makes sense.

The best way to think about it is to calculate your target monthly salary in a data role. Multiply that by the number of months you might save by getting hired with better portfolio projects and realistic practice. Compare that number to the difference in subscription prices.

                       Dataquest                  DataCamp
Monthly                \$49                       \$28 (annual billing)
Annual                 \$399                      \$336
Portfolio projects     Included, downloadable     Limited transferability
Time to job-ready      Potentially faster         Requires supplementation

Our verdict: For serious career changers, Dataquest's portfolio projects and focused curriculum justify the higher cost. For budget-conscious explorers or students, DataCamp's lower price and student discounts provide better accessibility.

Learning Format: Video vs. Text and Where You Study

This consideration comes down to personal preference and lifestyle.

Video Instruction vs. Reading and Doing

DataCamp's video approach genuinely works for many people. Watching a 3 to 4 minute video with expert instructors provides visual demonstrations of concepts. Seeing someone code along helps visual learners understand. You can pause, rewind, and rewatch as needed. Many people retain visual information better than text.

Instructor personality makes learning engaging. For some learners, a video feels less intimidating than dense text explanations and diagrams.

Dataquest uses brief text explanations with examples, then asks you to immediately apply what you read in the code editor. Some learners prefer reading at their own pace. You can skim familiar concepts or deep-read complex ones. It's faster for people who read quickly and don't need video explanations. There’s also a new read-aloud feature on each screen so you can listen instead of reading.

The text format forces active reading/listening and immediate application. Some people find less distraction without video playing.

There's no objectively better format. If you learn better from videos, DataCamp fits your brain. If you prefer reading and immediately doing, Dataquest fits you. Try both free tiers to see what clicks.

Mobile Access vs. Desktop Focus

DataCamp offers full iOS and Android apps. You can access complete courses on your phone, write code during your commute or lunch break, and sync progress across devices. The mobile experience includes an extended keyboard for coding characters.

The gamification system (XP points, streaks, achievements) works particularly well on mobile. DataCamp designed their mobile app specifically for quick learning sessions during commutes, coffee breaks, or any spare moments away from your desk. The bite-sized lessons make it easy to maintain momentum throughout your day.

For busy professionals, this convenience matters. Making use of small pockets of time throughout your day lowers friction for consistent practice.

Dataquest is desktop-only. No mobile app. No offline access.

That said, the desktop focus is intentional, not an oversight. Realistic coding requires a proper workspace. Building portfolio-quality projects needs concentration and screen space. You're practicing the way you'll actually work in a data job.

Professional development deserves a professional setup. A proper keyboard, adequate screen space, the ability to have documentation open alongside your code. Real coding in data jobs happens at desks with multiple monitors, not on phones during commutes.

Our verdict: Video learners who need mobile flexibility should choose DataCamp. Readers who prefer focused desktop sessions should choose Dataquest. Try both free tiers to see which format clicks with you.

AI Assistance: Learning Support vs. Productivity Tool

Both platforms offer AI assistance, but designed for different purposes.

Chandra: Your Learning-Focused Assistant

Dataquest's Chandra AI assistant runs on Code Llama with 13 billion parameters, fine-tuned specifically for teaching. It's context-aware, meaning it knows exactly where you are in the curriculum and what you should already understand.

Click "Explain" on any piece of code for a detailed breakdown. Chat naturally about problems you're solving. Ask for guidance when stuck.

Here's what makes Chandra different: it's intentionally calibrated to guide without giving away answers. Think of it as having a patient teaching assistant available 24/7 who helps you think through problems rather than solving them for you.

Chandra understands the pedagogical context. Its responses connect to what you should know at your current stage. It encourages a problem-solving approach rather than just providing solutions. You never feel stuck or alone, but you're still doing the learning work.

Like all AI, Chandra can occasionally hallucinate and has a training cutoff date. It's best used for guidance and explaining concepts, not as a definitive source of answers.

Dataquest's AI Assistant Chandra

DataLab: The Professional Productivity Tool

DataCamp's DataLab is an OpenAI-powered assistant within a full notebook environment. It writes, updates, fixes, and explains code based on natural language prompts. It connects to real databases including Snowflake and BigQuery. It's a complete data science environment with collaboration features.

DataLab AI Assistant

DataLab is more powerful in raw capability. It can do actual work for you, not just teach you. The database connections are valuable for building real analyses.

The trade-off: when AI is this powerful, it can do your thinking for you. There's a risk of not learning underlying concepts because the tool handles complexity. DataLab is better for productivity than learning.

The free tier is limited to 3 workbooks and 15 to 20 AI prompts. Premium unlimited access costs extra.

Our verdict: For learning fundamentals, Chandra's teaching-focused approach builds stronger understanding without doing the work for you. For experienced users needing productivity tools, DataLab offers more powerful capabilities.

What Serious Learners Say About Each Platform

Let's look at what real users report, organized by their goals.

For Career Changers

Career changers using Dataquest consistently report better skill retention. The realistic exercises build confidence for job interviews. Portfolio projects directly lead to interview conversations.

One user explained it clearly:

I like Dataquest.io better. I love the format of text-only lessons. The screen is split with the lesson on the left with an code interpreter on the right. They make you repeat what you learned in each lesson over and over again so that you remember what you did.

Dataquest success stories include career changers moving into data analyst and data scientist roles at companies like Facebook, Uber, Amazon, and Deloitte. The common thread: they built portfolios using Dataquest's downloadable projects, then supplemented them with additional independent work.

The reality check both communities agree on: you need independent projects to demonstrate your skills. But Dataquest's downloadable projects give you a significant head start on building your portfolio. DataCamp users consistently report needing to build separate portfolio projects after completing courses.

For Professionals Upskilling

Both platforms serve upskilling professionals, just differently. DataCamp's breadth suits exploratory learning when you need exposure to many tools. Dataquest's skill paths allow targeted improvement in specific areas.

DataCamp's mobile access provides clear advantages for busy schedules. Being able to practice during commutes or lunch breaks fits professional life better for some people.

For Beginners Exploring

DataCamp's structure helps beginners overcome initial intimidation. Videos make abstract concepts more approachable. The scaffolding in exercises reduces anxiety about getting stuck. Gamification maintains motivation during the difficult early stages.

Many beginners appreciate DataCamp as an answer to "Is data science for me?" The lower price and gentler learning curve make it easier to explore without major commitment.

What the Ratings Tell Us

On Course Report, an education-focused review platform where people seriously research learning platforms:

Dataquest: 4.79 out of 5 (65 reviews)

DataCamp: 4.38 out of 5 (146 reviews)

Course Report attracts learners evaluating platforms for career outcomes, not casual users. These are people investing in education and carefully considering effectiveness.

Dataquest reviewers emphasize career transitions, skill retention, and portfolio quality. DataCamp reviewers praise its accessibility and breadth of content.

Consider which priorities match your goals. If you're serious about career outcomes, the reviewers rating Dataquest higher are probably a lot like you.

Making Your Decision: A Framework

Here's how to think about choosing between these platforms.

Choose Dataquest if you:

  • Are serious about career change to data analyst, data scientist, or data engineer
  • Need portfolio projects for job applications and interviews
  • Want realistic problem-solving practice that simulates actual work
  • Have dedicated time for focused desktop learning sessions
  • Value depth and job-readiness over broad tool exposure
  • Are upskilling for specific career advancement
  • Want guided learning through realistic scenarios with full support
  • Can invest more upfront for potentially faster career outcomes
  • Prefer reading and immediately applying over watching videos

Choose DataCamp if you:

  • Are exploring whether data science interests you before committing
  • Want exposure to many technologies before specializing
  • Learn significantly better from video instruction
  • Need mobile learning flexibility for your lifestyle
  • Have a limited budget for initial exploration
  • Like gamification, quick wins, and progress rewards
  • Work in an organization already using it for training
  • Want to learn a specific tool quickly for immediate work needs
  • Are supplementing with other learning resources and just need introductions

The Combined Approach

Some learners use both platforms strategically. Start with DataCamp for initial exploration and confidence building. Switch to Dataquest when you're ready for serious career preparation. Use DataCamp for breadth in specialty areas like specific cloud platforms or tools. Use Dataquest for depth in core data skills and portfolio building.

The Reality Check

Success requires independent projects and consistent practice beyond any course. Dataquest's portfolio projects give you a significant head start on what employers want to see. DataCamp requires more supplementation with external portfolio work.

Your persistence matters more than your platform choice. But the right platform for your goals makes persistence easier. Choose the one that matches where you're trying to go.

Your Next Step

We've covered the meaningful differences. Portfolio building and realistic practice versus broad exploration and mobile convenience. Career-focused depth versus technology breadth. Desktop focus versus mobile flexibility.

The real question isn't "which is better?" It's "which matches my goal?"

If you're planning a career change into data science, Dataquest's focus on realistic problems and portfolio building aligns with what you need. If you're exploring whether data science interests you or need broad exposure for your current role, DataCamp's accessibility and breadth make sense.

Both platforms offer free tiers. Try actual lessons on each before deciding with your wallet. Pay attention to which approach keeps you genuinely engaged, not just which feels easier. Ask yourself honestly: "Am I learning or just completing exercises?"

Notice which platform makes you want to come back tomorrow.

Getting started matters more than perfect platform choice. Consistency beats perfection every time. The best platform is the one you'll actually use every week, the one that makes you want to keep learning.

If you're reading detailed comparison articles, you're already serious about this. That determination is your biggest asset. It matters more than features, pricing, or course catalogs.

Pick the platform that matches your goal. Commit to the work. Show up consistently.

Your future data career is waiting on the other side of that consistent practice.

Metadata Filtering and Hybrid Search for Vector Databases

6 December 2025 at 02:43

In the first tutorial, we built a vector database with ChromaDB and ran semantic similarity searches across 5,000 arXiv papers. We discovered that vector search excels at understanding meaning: a query about "neural network training" successfully retrieved papers about optimization algorithms, even when they didn't use those exact words.

But here's what we couldn't do yet: What if we only want papers from the last two years? What if we need to search specifically within the Machine Learning category? What if someone searches for a rare technical term that vector search might miss?

This tutorial teaches you how to enhance vector search with two powerful capabilities: metadata filtering and hybrid search. By the end, you'll understand how to combine semantic similarity with traditional filters, when keyword search adds value, and how to make intelligent trade-offs between different search strategies.

What You'll Learn

By the end of this tutorial, you'll be able to:

  • Design metadata schemas that enable powerful filtering without performance pitfalls
  • Implement filtered vector searches in ChromaDB using metadata constraints
  • Measure and understand the performance overhead of different filter types
  • Build BM25 keyword search alongside your vector search
  • Combine vector similarity and keyword matching using weighted hybrid scoring
  • Evaluate different search strategies systematically using category precision
  • Make informed decisions about when metadata filtering and hybrid search add value

Most importantly, you'll learn to be honest about what works and what doesn't. Our experiments revealed some surprising results that challenge common assumptions about hybrid search.

Dataset and Environment Setup

We'll use the same 5,000 arXiv papers we used previously. If you completed the first tutorial, you already have these files. If you're starting fresh, download them now:

arxiv_papers_5k.csv download (7.7 MB) → Paper metadata
embeddings_cohere_5k.npy download (61.4 MB) → Pre-generated embeddings

The dataset contains 5,000 papers perfectly balanced across five categories:

  • cs.CL (Computational Linguistics): 1,000 papers
  • cs.CV (Computer Vision): 1,000 papers
  • cs.DB (Databases): 1,000 papers
  • cs.LG (Machine Learning): 1,000 papers
  • cs.SE (Software Engineering): 1,000 papers

Environment Setup

You'll need the same packages from previous tutorials, plus one new library for BM25:

# Create virtual environment (if starting fresh)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install required packages
# Python 3.12 with these versions:
# chromadb==1.3.4
# numpy==2.0.2
# pandas==2.2.2
# cohere==5.20.0
# python-dotenv==1.1.1
# rank-bm25==0.2.2  # NEW for keyword search

pip install chromadb numpy pandas cohere python-dotenv rank-bm25

Make sure your .env file contains your Cohere API key:

COHERE_API_KEY=your_key_here

Loading the Dataset

Let's load our data and verify everything is in place:

import numpy as np
import pandas as pd
import chromadb
from cohere import ClientV2
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()
cohere_api_key = os.getenv('COHERE_API_KEY')
if not cohere_api_key:
    raise ValueError("COHERE_API_KEY not found in .env file")

co = ClientV2(api_key=cohere_api_key)

# Load the dataset
df = pd.read_csv('arxiv_papers_5k.csv')
embeddings = np.load('embeddings_cohere_5k.npy')

print(f"Loaded {len(df)} papers")
print(f"Embeddings shape: {embeddings.shape}")
print(f"\nPapers per category:")
print(df['category'].value_counts().sort_index())

# Check what metadata we have
print(f"\nAvailable metadata columns:")
print(df.columns.tolist())
Loaded 5000 papers
Embeddings shape: (5000, 1536)

Papers per category:
category
cs.CL    1000
cs.CV    1000
cs.DB    1000
cs.LG    1000
cs.SE    1000
Name: count, dtype: int64

Available metadata columns:
['arxiv_id', 'title', 'abstract', 'authors', 'published', 'category']

We have rich metadata to work with: paper IDs, titles, abstracts, authors, publication dates, and categories. This metadata will power our filtering and help evaluate our search strategies.

Designing Metadata Schemas

Before we start filtering, we need to think carefully about what metadata to store and how to structure it. Good metadata design makes search powerful and performant. Poor design creates headaches.

What Makes Good Metadata

Good metadata is:

  • Filterable: Choose values that match how users actually search. If users filter by publication year, store year as an integer. If they filter by topic, store normalized category strings.
  • Atomic: Store individual fields separately rather than dumping everything into a single JSON blob. Want to filter by year? Don't make ChromaDB parse "Published: 2024-03-15" from a text field.
  • Indexed: Most vector databases index metadata fields differently than vector embeddings. Keep metadata fields small and specific so indexing works efficiently.
  • Consistent: Use the same data types and formats across all documents. Don't store year as "2024" for one paper and "March 2024" for another.

What Doesn't Belong in Metadata

Avoid storing:

  • Long text in metadata fields: The paper abstract is content, not metadata. Store it as the document text, not in a metadata field.
  • Nested structures: ChromaDB supports nested metadata, but complex JSON trees are hard to filter and often signal confused schema design.
  • Redundant information: If you can derive a field from another (like "decade" from "year"), consider computing it at query time instead of storing it.
  • Frequently changing values: Metadata updates can be expensive. Don't store view counts or frequently updated statistics in metadata.

Preparing Our Metadata

Let's prepare metadata for our 5,000 papers:

def prepare_metadata(df):
    """
    Prepare metadata for ChromaDB from our dataframe.

    Returns list of metadata dictionaries, one per paper.
    """
    metadatas = []

    for _, row in df.iterrows():
        # Extract year from published date (format: YYYY-MM-DD)
        year = int(str(row['published'])[:4])

        # Truncate authors if too long (ChromaDB has reasonable limits)
        authors = row['authors'][:200] if len(row['authors']) <= 200 else row['authors'][:197] + "..."

        metadata = {
            'title': row['title'],
            'category': row['category'],
            'year': year,  # Store as integer for range queries
            'authors': authors
        }
        metadatas.append(metadata)

    return metadatas

# Prepare metadata for all papers
metadatas = prepare_metadata(df)

# Check a sample
print("Sample metadata:")
print(metadatas[0])
Sample metadata:
{'title': 'Optimizing Mixture of Block Attention', 'category': 'cs.LG', 'year': 2025, 'authors': 'Tao He, Liang Ding, Zhenya Huang, Dacheng Tao'}

Notice we're storing:

  • title: The full paper title for display in results
  • category: One of our five CS categories for topic filtering
  • year: Extracted as an integer for range queries like "papers after 2024"
  • authors: Truncated to avoid extremely long strings

This metadata schema supports the filtering patterns users actually want: search within a category, filter by publication date, or display author information in results.

Anti-Patterns to Avoid

Let's look at what NOT to do:

Bad: JSON blob as metadata

# DON'T DO THIS
metadata = {
    'info': json.dumps({
        'title': title,
        'category': category,
        'year': year,
        # ... everything dumped in JSON
    })
}

This makes filtering painful. You can't efficiently filter by year when it's buried in a JSON string.

Bad: Long text as metadata

# DON'T DO THIS
metadata = {
    'abstract': full_abstract_text,  # This belongs in documents, not metadata
    'category': category
}

ChromaDB stores abstracts as document content. Duplicating them in metadata wastes space and doesn't improve search.

Bad: Inconsistent types

# DON'T DO THIS
metadata1 = {'year': 2024}          # Integer
metadata2 = {'year': '2024'}        # String
metadata3 = {'year': 'March 2024'}  # Unparseable

Consistent data types make filtering reliable. Always store years as integers if you want range queries.

Bad: Missing or inconsistent metadata fields

# DON'T DO THIS
paper1_metadata = {'title': 'Paper 1', 'category': 'cs.LG', 'year': 2024}
paper2_metadata = {'title': 'Paper 2', 'category': 'cs.CV'}  # Missing year!
paper3_metadata = {'title': 'Paper 3', 'year': 2023}  # Missing category!

Here's a common source of frustration: if a document is missing a metadata field, ChromaDB's filters won't match it at all. If you filter by {"year": {"$gte": 2024}} and some papers lack a year field, those papers simply won't appear in results. This causes the confusing "where did my document go?" problem.

The fix: Make sure all documents have the same metadata fields. If a value is unknown, store it as None or use a sensible default rather than omitting the field entirely. Consistency prevents documents from mysteriously disappearing when you add filters.
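
As a concrete guard against this, here's a minimal sketch of one way to normalize records before insertion. The helper name, field list, and sentinel defaults are illustrative choices, not ChromaDB requirements; adapt them to your own schema.

# Hypothetical helper: enforce a consistent metadata schema before inserting
# Field names and defaults below are illustrative; adjust them to your dataset
REQUIRED_FIELDS = {
    'title': 'unknown',
    'category': 'unknown',
    'year': 0,        # sentinel integer keeps year filters type-consistent
    'authors': 'unknown'
}

def normalize_metadata(record):
    """Return a metadata dict that always contains every required field."""
    return {field: record.get(field, default) for field, default in REQUIRED_FIELDS.items()}

papers = [
    {'title': 'Paper 1', 'category': 'cs.LG', 'year': 2024},
    {'title': 'Paper 2', 'category': 'cs.CV'},   # missing year
    {'title': 'Paper 3', 'year': 2023}           # missing category
]

normalized = [normalize_metadata(p) for p in papers]
print(normalized[1])  # -> {'title': 'Paper 2', 'category': 'cs.CV', 'year': 0, 'authors': 'unknown'}

A sentinel like year = 0 still won't match a filter such as {"year": {"$gte": 2024}}, but the behavior is now explicit, and you can query for the sentinel to find and fix incomplete records instead of wondering why documents vanished.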

Creating a Collection with Rich Metadata

Now let's create a ChromaDB collection with all our metadata. If you plan to experiment and re-run this code, use the delete-and-recreate pattern from the previous tutorial:

# Initialize ChromaDB client
client = chromadb.Client()

# Delete existing collection if present (useful for experimentation)
try:
    client.delete_collection(name="arxiv_with_metadata")
    print("Deleted existing collection")
except:
    pass  # Collection didn't exist, that's fine

# Create collection with metadata
collection = client.create_collection(
    name="arxiv_with_metadata",
    metadata={
        "description": "5000 arXiv papers with rich metadata for filtering",
        "hnsw:space": "cosine"  # Using cosine similarity
    }
)

print(f"Created collection: {collection.name}")
Created collection: arxiv_with_metadata

Now let's insert our papers with metadata. Remember that ChromaDB has a batch size limit:

# Prepare data for insertion
ids = [f"paper_{i}" for i in range(len(df))]
documents = df['abstract'].tolist()

# Insert with metadata
# Our 5000 papers fit in one batch (limit is ~5,461)
print(f"Inserting {len(df)} papers with metadata...")

collection.add(
    ids=ids,
    embeddings=embeddings.tolist(),
    documents=documents,
    metadatas=metadatas
)

print(f"βœ“ Collection contains {collection.count()} papers with metadata")
Inserting 5000 papers with metadata...
βœ“ Collection contains 5000 papers with metadata

We now have a collection where every paper has both its embedding and rich metadata. This enables powerful combinations of semantic search and traditional filtering.

Metadata Filtering in Practice

Let's start filtering our searches using metadata. ChromaDB uses a where clause syntax similar to database queries.

Basic Filtering by Category

Suppose we want to search only within Machine Learning papers:

# First, let's create a helper function for queries
def search_with_filter(query_text, where_clause=None, n_results=5):
    """
    Search with optional metadata filtering.

    Args:
        query_text: The search query
        where_clause: Optional ChromaDB where clause for filtering
        n_results: Number of results to return

    Returns:
        Search results
    """
    # Embed the query
    response = co.embed(
        texts=[query_text],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])

    # Search with optional filter
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=n_results,
        where=where_clause  # Apply metadata filter here
    )

    return results

# Example: Search for "deep learning optimization" only in ML papers
query = "deep learning optimization techniques"

results_filtered = search_with_filter(
    query,
    where_clause={"category": "cs.LG"}  # Only Machine Learning papers
)

print(f"Query: '{query}'")
print("Filter: category = 'cs.LG'")
print("\nTop 5 results:")
for i in range(len(results_filtered['ids'][0])):
    metadata = results_filtered['metadatas'][0][i]
    distance = results_filtered['distances'][0][i]

    print(f"\n{i+1}. {metadata['title']}")
    print(f"   Category: {metadata['category']} | Year: {metadata['year']}")
    print(f"   Distance: {distance:.4f}")
Query: 'deep learning optimization techniques'
Filter: category = 'cs.LG'

Top 5 results:

1. Adam symmetry theorem: characterization of the convergence of the stochastic Adam optimizer
   Category: cs.LG | Year: 2025
   Distance: 0.6449

2. Non-Euclidean SGD for Structured Optimization: Unified Analysis and Improved Rates
   Category: cs.LG | Year: 2025
   Distance: 0.6571

3. Training Neural Networks at Any Scale
   Category: cs.LG | Year: 2025
   Distance: 0.6674

4. Deep Progressive Training: scaling up depth capacity of zero/one-layer models
   Category: cs.LG | Year: 2025
   Distance: 0.6682

5. DP-AdamW: Investigating Decoupled Weight Decay and Bias Correction in Private Deep Learning
   Category: cs.LG | Year: 2025
   Distance: 0.6732

All five results are from cs.LG, exactly as we requested. The filtering worked correctly. The distances are also tightly clustered between 0.64 and 0.67.

This close grouping tells us we found papers that all match our query equally well. The lower distances (compared to the 1.1+ ranges we saw previously) show that filtering down to a specific category helped us find stronger semantic matches.

Filtering by Year Range

What if we want papers from a specific time period?

# Search for papers from 2024 or later
results_recent = search_with_filter(
    "neural network architectures",
    where_clause={"year": {"$gte": 2024}}  # Greater than or equal to 2024
)

print("Recent papers (2024+) about neural network architectures:")
for i in range(3):  # Show top 3
    metadata = results_recent['metadatas'][0][i]
    print(f"{i+1}. {metadata['title']} ({metadata['year']})")
Recent papers (2024+) about neural network architectures:
1. Bearing Syntactic Fruit with Stack-Augmented Neural Networks (2025)
2. Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis (2025)
3. Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis (2025)

Notice that results #2 and #3 are the same paper. This happens because some arXiv papers get cross-posted to multiple categories. A paper about neural architectures for language models might appear in both cs.LG and cs.CL, so when we filter only by year, it shows up once for each category assignment.

You could deduplicate results by tracking paper IDs and skipping ones you've already seen, but whether you should depends on your use case. Sometimes knowing a paper appears in multiple categories is actually valuable information. For this tutorial, we're keeping duplicates as-is because they reflect how real databases behave and help us understand what filtering does and doesn't handle. If you were building a paper recommendation system, you'd probably deduplicate. If you were analyzing category overlap patterns, you'd want to keep them.
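
If your use case does call for unique papers, a small post-processing step is enough. Here's a minimal sketch that works on ChromaDB's query result format, using the title as the deduplication key (any stable identifier in your metadata would do):

def deduplicate_results(results, key='title'):
    """Drop repeated papers from a ChromaDB query result, keeping the best-ranked copy."""
    seen = set()
    unique = []
    for i, metadata in enumerate(results['metadatas'][0]):
        if metadata[key] in seen:
            continue  # same paper already returned under another category assignment
        seen.add(metadata[key])
        unique.append({
            'title': metadata['title'],
            'category': metadata['category'],
            'year': metadata['year'],
            'distance': results['distances'][0][i]
        })
    return unique

# Apply it to the year-filtered results from above
for rank, paper in enumerate(deduplicate_results(results_recent)[:3], 1):
    print(f"{rank}. {paper['title']} ({paper['year']})")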

Comparison Operators

ChromaDB supports several comparison operators for numeric fields (a few example filters follow the list):

  • $eq: Equal to
  • $ne: Not equal to
  • $gt: Greater than
  • $gte: Greater than or equal to
  • $lt: Less than
  • $lte: Less than or equal to
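
Here are those operators expressed as where clauses you could pass straight to our search_with_filter helper. The query string and cutoff years are just illustrative values:

# Example where clauses using the operators above (illustrative values)
only_2024  = {"year": {"$eq": 2024}}    # exactly 2024
not_2024   = {"year": {"$ne": 2024}}    # any year except 2024
after_2023 = {"year": {"$gt": 2023}}    # strictly later than 2023
up_to_2023 = {"year": {"$lte": 2023}}   # 2023 or earlier

# Each plugs directly into the helper we defined earlier
results_older = search_with_filter(
    "transformer architectures",
    where_clause=up_to_2023,
    n_results=3
)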

Combined Filters

The real power comes from combining multiple filters:

# Find Computer Vision papers from 2025
results_combined = search_with_filter(
    "image recognition and classification",
    where_clause={
        "$and": [
            {"category": "cs.CV"},
            {"year": {"$eq": 2025}}
        ]
    }
)

print("Computer Vision papers from 2025 about image recognition:")
for i in range(3):
    metadata = results_combined['metadatas'][0][i]
    print(f"{i+1}. {metadata['title']}")
    print(f"   {metadata['category']} | {metadata['year']}")
Computer Vision papers from 2025 about image recognition:
1. SWAN -- Enabling Fast and Mobile Histopathology Image Annotation through Swipeable Interfaces
   cs.CV | 2025
2. Covariance Descriptors Meet General Vision Encoders: Riemannian Deep Learning for Medical Image Classification
   cs.CV | 2025
3. UniADC: A Unified Framework for Anomaly Detection and Classification
   cs.CV | 2025

ChromaDB also supports $or for alternatives:

# Papers from either Database or Software Engineering categories
where_db_or_se = {
    "$or": [
        {"category": "cs.DB"},
        {"category": "cs.SE"}
    ]
}

These filtering capabilities let you narrow searches to exactly the subset you need.
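
For example, passing that $or clause to our helper restricts results to those two categories (the query string here is just an illustration):

# Search across Database and Software Engineering papers only
results_db_or_se = search_with_filter(
    "schema migration and data versioning",
    where_clause=where_db_or_se,
    n_results=5
)

for i, metadata in enumerate(results_db_or_se['metadatas'][0], 1):
    print(f"{i}. [{metadata['category']}] {metadata['title']}")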

Measuring Filtering Performance Overhead

Metadata filtering isn't free. Let's measure the actual performance impact of different filter types. We'll run multiple queries to get stable measurements:

import time

def benchmark_filter(where_clause, n_iterations=100, description=""):
    """
    Benchmark query performance with a specific filter.

    Args:
        where_clause: The filter to apply (None for unfiltered)
        n_iterations: Number of times to run the query
        description: Description of what we're testing

    Returns:
        Average query time in milliseconds
    """
    # Use a fixed query embedding to keep comparisons fair
    query_text = "machine learning model training"
    response = co.embed(
        texts=[query_text],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])

    # Warm up (run once to load any caches)
    collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=5,
        where=where_clause
    )

    # Benchmark
    start_time = time.time()
    for _ in range(n_iterations):
        collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=5,
            where=where_clause
        )
    elapsed = time.time() - start_time
    avg_ms = (elapsed / n_iterations) * 1000

    print(f"{description}")
    print(f"  Average query time: {avg_ms:.2f} ms")
    return avg_ms

print("Running filtering performance benchmarks (100 iterations each)...")
print("=" * 70)

# Baseline: No filtering
baseline_ms = benchmark_filter(None, description="Baseline (no filter)")

print()

# Category filter
category_ms = benchmark_filter(
    {"category": "cs.LG"},
    description="Category filter (category = 'cs.LG')"
)
category_overhead = (category_ms / baseline_ms)
print(f"  Overhead: {category_overhead:.1f}x slower ({(category_overhead-1)*100:.0f}%)")

print()

# Year range filter
year_ms = benchmark_filter(
    {"year": {"$gte": 2024}},
    description="Year range filter (year >= 2024)"
)
year_overhead = (year_ms / baseline_ms)
print(f"  Overhead: {year_overhead:.1f}x slower ({(year_overhead-1)*100:.0f}%)")

print()

# Combined filter
combined_ms = benchmark_filter(
    {"$and": [{"category": "cs.LG"}, {"year": {"$gte": 2024}}]},
    description="Combined filter (category AND year)"
)
combined_overhead = (combined_ms / baseline_ms)
print(f"  Overhead: {combined_overhead:.1f}x slower ({(combined_overhead-1)*100:.0f}%)")

print("\n" + "=" * 70)
print("Summary: Filtering adds 3-10x overhead depending on filter type")
Running filtering performance benchmarks (100 iterations each)...
======================================================================
Baseline (no filter)
  Average query time: 4.45 ms

Category filter (category = 'cs.LG')
  Average query time: 14.82 ms
  Overhead: 3.3x slower (233%)

Year range filter (year >= 2024)
  Average query time: 35.67 ms
  Overhead: 8.0x slower (702%)

Combined filter (category AND year)
  Average query time: 22.34 ms
  Overhead: 5.0x slower (402%)

======================================================================
Summary: Filtering adds 3-10x overhead depending on filter type

What these numbers tell us:

  • Unfiltered queries are fast: Our baseline of 4.45ms means ChromaDB's HNSW index works well.
  • Category filtering costs 3.3x overhead: The query still completes in 14.82ms, which is totally usable, but it's noticeably slower than unfiltered search.
  • Numeric range queries are most expensive: Year filtering at 8x overhead (35.67ms) shows that range queries on numeric fields are particularly costly in ChromaDB.
  • Combined filters fall in between: At 5x overhead (22.34ms), combining filters doesn't just multiply the costs. There's some optimization happening.
  • Real-world variability: If you run these benchmarks yourself, you'll see the exact numbers vary between runs. We saw category filtering range from 13.8-16.1ms across multiple benchmark sessions. This variability is normal. What stays consistent is the order: year filters are always most expensive, then combined filters, then category filters.

Understanding the Performance Trade-off

This overhead is significant. A multi-fold slowdown matters when you're processing hundreds of queries per second. But context is important:

When filtering makes sense despite overhead:

  • Users explicitly request filters ("Show me recent papers")
  • The filtered results are substantially better than unfiltered
  • Your query volume is manageable (even 35ms per query handles 28 queries/second)
  • User experience benefits outweigh the performance cost

When to reconsider filtering:

  • Very high query volume with tight latency requirements
  • Filters don't meaningfully improve results for most queries
  • You need sub-10ms response times at scale

Important context: This overhead is how ChromaDB implements filtering at this scale. When we explore production vector databases in the next tutorial, you'll see how systems like Qdrant handle filtering more efficiently. This isn't a fundamental limitation of vector databases; it's a characteristic of how different systems approach the problem.

For now, understand that metadata filtering in ChromaDB works and is usable, but it comes with measurable performance costs. Design your metadata schema carefully and filter only when the value justifies the overhead.

Implementing BM25 Keyword Search

Vector search excels at understanding semantic meaning, but it can struggle with rare keywords, specific technical terms, or exact name matches. BM25 keyword search complements vector search by ranking documents based on term frequency and document length.

Understanding BM25

BM25 (Best Matching 25) is a ranking function that scores documents based on:

  • How many times query terms appear in the document (term frequency)
  • How rare those terms are across all documents (inverse document frequency)
  • Document length normalization (shorter documents aren't penalized)

BM25 treats words as independent tokens. If you search for "SQL query optimization," BM25 looks for documents containing those exact words, giving higher scores to documents where these terms appear frequently.
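
To make the scoring concrete, here's a toy sketch of the classic Okapi BM25 contribution of a single term to a single document. The k1 and b values are the commonly used defaults; the rank-bm25 library we use below manages the corpus statistics and some IDF details internally, so treat this as an illustration of the formula rather than a reimplementation of the library:

import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term, k1=1.5, b=0.75):
    """Okapi BM25 contribution of one query term to one document's score."""
    # Rarer terms get a larger inverse document frequency weight
    idf = math.log((n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)
    # Term frequency saturates (k1) and is normalized by document length (b)
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

# Toy numbers: a term appearing 3 times in a 120-word abstract,
# found in 50 of 5,000 documents, average abstract length 150 words
print(round(bm25_term_score(tf=3, doc_len=120, avg_doc_len=150,
                            n_docs=5000, docs_with_term=50), 2))

A document's full BM25 score is the sum of this contribution over every term in the query.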

Building a BM25 Index

Let's implement BM25 search on our arXiv abstracts:

from rank_bm25 import BM25Okapi
import string

def simple_tokenize(text):
    """
    Basic tokenization for BM25.

    Lowercase text, remove punctuation, split on whitespace.
    """
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

# Tokenize all abstracts
print("Building BM25 index from 5000 abstracts...")
tokenized_corpus = [simple_tokenize(abstract) for abstract in df['abstract']]

# Create BM25 index
bm25 = BM25Okapi(tokenized_corpus)
print("βœ“ BM25 index created")

# Test it with a sample query
query = "SQL query optimization indexing"
tokenized_query = simple_tokenize(query)

# Get BM25 scores for all documents
bm25_scores = bm25.get_scores(tokenized_query)

# Find top 5 papers by BM25 score
top_indices = np.argsort(bm25_scores)[::-1][:5]

print(f"\nQuery: '{query}'")
print("Top 5 by BM25 keyword matching:")
for rank, idx in enumerate(top_indices, 1):
    score = bm25_scores[idx]
    title = df.iloc[idx]['title']
    category = df.iloc[idx]['category']
    print(f"{rank}. [{category}] {title[:60]}...")
    print(f"   BM25 Score: {score:.2f}")
Building BM25 index from 5000 abstracts...
✓ BM25 index created

Query: 'SQL query optimization indexing'
Top 5 by BM25 keyword matching:
1. [cs.DB] Learned Adaptive Indexing...
   BM25 Score: 13.34
2. [cs.DB] LLM4Hint: Leveraging Large Language Models for Hint Recommen...
   BM25 Score: 13.25
3. [cs.LG] Cortex AISQL: A Production SQL Engine for Unstructured Data...
   BM25 Score: 12.83
4. [cs.DB] Cortex AISQL: A Production SQL Engine for Unstructured Data...
   BM25 Score: 12.83
5. [cs.DB] A Functional Data Model and Query Language is All You Need...
   BM25 Score: 11.91

BM25 correctly identified Database papers about query optimization, with 4 out of 5 results from cs.DB. The third result is from Machine Learning but still relevant to SQL processing (Cortex AISQL), showing how keyword matching can surface related papers from adjacent domains. When the query contains specific technical terms, keyword matching works well.

A note about scale: The rank-bm25 library works great for our 5,000 abstracts and similar small datasets. It's perfect for learning BM25 concepts without complexity. For larger datasets or production systems, you'd typically use faster BM25 implementations found in search engines like Elasticsearch, OpenSearch, or Apache Lucene. These are optimized for millions of documents and high query volumes. For now, rank-bm25 gives us everything we need to understand how keyword search complements vector search.

Comparing BM25 to Vector Search

Let's run the same query through vector search:

# Vector search for the same query
results_vector = search_with_filter(query, n_results=5)

print(f"\nSame query: '{query}'")
print("Top 5 by vector similarity:")
for i in range(5):
    metadata = results_vector['metadatas'][0][i]
    distance = results_vector['distances'][0][i]
    print(f"{i+1}. [{metadata['category']}] {metadata['title'][:60]}...")
    print(f"   Distance: {distance:.4f}")
Same query: 'SQL query optimization indexing'
Top 5 by vector similarity:
1. [cs.DB] VIDEX: A Disaggregated and Extensible Virtual Index for the ...
   Distance: 0.5510
2. [cs.DB] AMAZe: A Multi-Agent Zero-shot Index Advisor for Relational ...
   Distance: 0.5586
3. [cs.DB] AutoIndexer: A Reinforcement Learning-Enhanced Index Advisor...
   Distance: 0.5602
4. [cs.DB] LLM4Hint: Leveraging Large Language Models for Hint Recommen...
   Distance: 0.5837
5. [cs.DB] Training-Free Query Optimization via LLM-Based Plan Similari...
   Distance: 0.5856

Interesting! While only one paper (LLM4Hint) appears in both top 5 lists, both approaches successfully identify relevant Database papers. The keywords "SQL" and "query" and "optimization" appear frequently in database papers, and the semantic meaning also points to that domain. The different rankings show how keyword matching and semantic search can prioritize different aspects of relevance, even when both correctly identify the target category.

This convergence of categories (both returning cs.DB papers) is common when queries contain domain-specific terminology that appears naturally in relevant documents.

Hybrid Search: Combining Vector and Keyword Search

Hybrid search combines the strengths of both approaches: vector search for semantic understanding, keyword search for exact term matching. Let's implement weighted hybrid scoring.

Our Implementation

Before we dive into the code, let's be clear about what we're building. This is a simplified implementation designed to teach you the core concepts of hybrid search: score normalization, weighted combination, and balancing semantic versus keyword signals.

Production vector databases often handle hybrid scoring internally or use more sophisticated approaches like rank-based fusion (combining rankings rather than scores) or learned rerankers (neural models that re-score results). We'll explore these production systems in the next tutorial. For now, our implementation focuses on the fundamentals that apply across all hybrid approaches.
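
As a quick preview of the rank-based idea, here's a minimal sketch of reciprocal rank fusion (RRF), which merges ranked lists without normalizing scores at all. The k constant of 60 is the commonly cited default, and the inputs are assumed to be ID lists ordered best-first:

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first ranked lists of IDs into one combined ranking."""
    fused = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            # Each list contributes 1 / (k + rank); higher-ranked docs contribute more
            fused[doc_id] = fused.get(doc_id, 0) + 1 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Toy example with two ranked result lists
vector_ranking = ["paper_12", "paper_7", "paper_3"]
keyword_ranking = ["paper_7", "paper_45", "paper_12"]
print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
# paper_7 and paper_12 rise to the top because both lists rank them highly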

The Challenge: Normalizing Different Score Scales

BM25 scores range from 0 to potentially 20+ (higher is better). ChromaDB distances range from 0 to 2+ (lower is better). We can't just add them together. We need to:

  1. Normalize both score types to the same 0-1 range
  2. Convert ChromaDB distances to similarities (flip the scale)
  3. Apply weights to combine them

Implementation

Here's our complete hybrid search function:

def hybrid_search(query_text, alpha=0.5, n_results=10):
    """
    Combine BM25 keyword search with vector similarity search.

    Args:
        query_text: The search query
        alpha: Weight for BM25 (0 = pure vector, 1 = pure keyword)
        n_results: Number of results to return

    Returns:
        Combined results with hybrid scores
    """
    # Get BM25 scores
    tokenized_query = simple_tokenize(query_text)
    bm25_scores = bm25.get_scores(tokenized_query)

    # Get vector similarities (we'll search more to ensure good coverage)
    response = co.embed(
        texts=[query_text],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])

    vector_results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=100  # Get more candidates for better coverage
    )

    # Extract vector distances and convert to similarities
    # ChromaDB returns cosine distance (0 to 2, lower = more similar)
    # We'll convert to similarity scores where higher = better for easier combination
    vector_distances = {}
    for i, paper_id in enumerate(vector_results['ids'][0]):
        distance = vector_results['distances'][0][i]
        # Convert distance to similarity (simple inversion)
        similarity = 1 / (1 + distance)
        vector_distances[paper_id] = similarity

    # Normalize BM25 scores to 0-1 range
    max_bm25 = max(bm25_scores) if max(bm25_scores) > 0 else 1
    min_bm25 = min(bm25_scores)
    bm25_normalized = {}
    for i, score in enumerate(bm25_scores):
        paper_id = f"paper_{i}"
        normalized = (score - min_bm25) / (max_bm25 - min_bm25) if max_bm25 > min_bm25 else 0
        bm25_normalized[paper_id] = normalized

    # Combine scores using weighted average
    # hybrid_score = alpha * bm25 + (1 - alpha) * vector
    hybrid_scores = {}
    all_paper_ids = set(bm25_normalized.keys()) | set(vector_distances.keys())

    for paper_id in all_paper_ids:
        bm25_score = bm25_normalized.get(paper_id, 0)
        vector_score = vector_distances.get(paper_id, 0)

        hybrid_score = alpha * bm25_score + (1 - alpha) * vector_score
        hybrid_scores[paper_id] = hybrid_score

    # Get top N by hybrid score
    top_paper_ids = sorted(hybrid_scores.items(), key=lambda x: x[1], reverse=True)[:n_results]

    # Format results
    results = []
    for paper_id, score in top_paper_ids:
        paper_idx = int(paper_id.split('_')[1])
        results.append({
            'paper_id': paper_id,
            'title': df.iloc[paper_idx]['title'],
            'category': df.iloc[paper_idx]['category'],
            'abstract': df.iloc[paper_idx]['abstract'][:200] + "...",
            'hybrid_score': score,
            'bm25_score': bm25_normalized.get(paper_id, 0),
            'vector_score': vector_distances.get(paper_id, 0)
        })

    return results

# Test with different alpha values
query = "neural network training optimization"

print(f"Query: '{query}'")
print("=" * 80)

# Pure vector (alpha = 0)
print("\nPure Vector Search (alpha=0.0):")
results = hybrid_search(query, alpha=0.0, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f"   Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")

# Hybrid 30% keyword, 70% vector
print("\nHybrid 30/70 (alpha=0.3):")
results = hybrid_search(query, alpha=0.3, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f"   Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")

# Hybrid 50/50
print("\nHybrid 50/50 (alpha=0.5):")
results = hybrid_search(query, alpha=0.5, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f"   Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")

# Pure keyword (alpha = 1.0)
print("\nPure BM25 Keyword (alpha=1.0):")
results = hybrid_search(query, alpha=1.0, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f"   Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")
Query: 'neural network training optimization'
================================================================================

Pure Vector Search (alpha=0.0):
1. [cs.LG] Training Neural Networks at Any Scale...
   Hybrid: 0.642 (Vector: 0.642, BM25: 0.749)
2. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
   Hybrid: 0.630 (Vector: 0.630, BM25: 1.000)
3. [cs.LG] Adam symmetry theorem: characterization of the convergence o...
   Hybrid: 0.617 (Vector: 0.617, BM25: 0.381)
4. [cs.LG] A Distributed Training Architecture For Combinatorial Optimi...
   Hybrid: 0.617 (Vector: 0.617, BM25: 0.884)
5. [cs.LG] Can Training Dynamics of Scale-Invariant Neural Networks Be ...
   Hybrid: 0.609 (Vector: 0.609, BM25: 0.566)

Hybrid 30/70 (alpha=0.3):
1. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
   Hybrid: 0.741 (Vector: 0.630, BM25: 1.000)
2. [cs.LG] Neuronal Fluctuations: Learning Rates vs Participating Neuro...
   Hybrid: 0.714 (Vector: 0.603, BM25: 0.971)
3. [cs.CV] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
   Hybrid: 0.709 (Vector: 0.601, BM25: 0.960)
4. [cs.LG] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
   Hybrid: 0.708 (Vector: 0.601, BM25: 0.960)
5. [cs.LG] N-ReLU: Zero-Mean Stochastic Extension of ReLU...
   Hybrid: 0.707 (Vector: 0.603, BM25: 0.948)

Hybrid 50/50 (alpha=0.5):
1. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
   Hybrid: 0.815 (Vector: 0.630, BM25: 1.000)
2. [cs.LG] Neuronal Fluctuations: Learning Rates vs Participating Neuro...
   Hybrid: 0.787 (Vector: 0.603, BM25: 0.971)
3. [cs.CV] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
   Hybrid: 0.780 (Vector: 0.601, BM25: 0.960)
4. [cs.LG] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
   Hybrid: 0.780 (Vector: 0.601, BM25: 0.960)
5. [cs.LG] N-ReLU: Zero-Mean Stochastic Extension of ReLU...
   Hybrid: 0.775 (Vector: 0.603, BM25: 0.948)

Pure BM25 Keyword (alpha=1.0):
1. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
   Hybrid: 1.000 (Vector: 0.630, BM25: 1.000)
2. [cs.LG] Neuronal Fluctuations: Learning Rates vs Participating Neuro...
   Hybrid: 0.971 (Vector: 0.603, BM25: 0.971)
3. [cs.LG] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
   Hybrid: 0.960 (Vector: 0.601, BM25: 0.960)
4. [cs.CV] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
   Hybrid: 0.960 (Vector: 0.601, BM25: 0.960)
5. [cs.LG] N-ReLU: Zero-Mean Stochastic Extension of ReLU...
   Hybrid: 0.948 (Vector: 0.603, BM25: 0.948)

The output shows how different alpha values affect which papers surface. With pure vector search (alpha=0), you'll see papers that semantically relate to neural network training. As you increase alpha toward 1, you'll increasingly weight papers that literally contain the words "neural," "network," "training," and "optimization."
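As a quick sanity check on the scoring formula: the top Hybrid 30/70 result above reports BM25 = 1.000 and Vector = 0.630, so its hybrid score is 0.3 * 1.000 + 0.7 * 0.630 = 0.741, exactly what the output shows. The same paper scores 0.5 * 1.000 + 0.5 * 0.630 = 0.815 in the 50/50 run.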

Evaluating Search Strategies Systematically

We've implemented three search approaches: pure vector, pure keyword, and hybrid. But which one actually works better? We need systematic evaluation.

The Evaluation Metric: Category Precision

For our balanced 5k dataset, we can use category precision as our success metric:

Category precision @k: What percentage of the top k results are in the expected category?

If we search for "SQL query optimization," we expect Database papers (cs.DB). If 4 out of 5 top results are from cs.DB, we have 80% precision@5.

This metric works because our dataset is perfectly balanced and we can predict which category should dominate for specific queries.
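Before wiring this metric into a full evaluation, here's a minimal sketch of the calculation on its own. The category list below is made up for illustration; the real version appears in calculate_category_precision later in this section.

def category_precision_at_k(result_categories, expected_category):
    """Fraction of retrieved results whose category matches the expected one."""
    if not result_categories:
        return 0.0
    matches = sum(1 for cat in result_categories if cat == expected_category)
    return matches / len(result_categories)

# Hypothetical top-5 categories for a database-focused query
top5 = ["cs.DB", "cs.DB", "cs.LG", "cs.DB", "cs.DB"]
print(category_precision_at_k(top5, "cs.DB"))  # 0.8, i.e., 80% precision@5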

Creating Test Queries

Let's create 10 diverse queries targeting different categories:

test_queries = [
    {
        "text": "natural language processing transformers",
        "expected_category": "cs.CL",
        "description": "NLP query"
    },
    {
        "text": "image segmentation computer vision",
        "expected_category": "cs.CV",
        "description": "Vision query"
    },
    {
        "text": "database query optimization indexing",
        "expected_category": "cs.DB",
        "description": "Database query"
    },
    {
        "text": "neural network training deep learning",
        "expected_category": "cs.LG",
        "description": "ML query with clear terms"
    },
    {
        "text": "software testing debugging quality assurance",
        "expected_category": "cs.SE",
        "description": "Software engineering query"
    },
    {
        "text": "attention mechanisms sequence models",
        "expected_category": "cs.CL",
        "description": "NLP architecture query"
    },
    {
        "text": "convolutional neural networks image recognition",
        "expected_category": "cs.CV",
        "description": "Vision with technical terms"
    },
    {
        "text": "distributed systems database consistency",
        "expected_category": "cs.DB",
        "description": "Database systems query"
    },
    {
        "text": "reinforcement learning policy gradient",
        "expected_category": "cs.LG",
        "description": "RL query"
    },
    {
        "text": "code review static analysis",
        "expected_category": "cs.SE",
        "description": "SE development query"
    }
]

print(f"Created {len(test_queries)} test queries")
print("Expected category distribution:")
categories = [q['expected_category'] for q in test_queries]
print(pd.Series(categories).value_counts().sort_index())
Created 10 test queries
Expected category distribution:
cs.CL    2
cs.CV    2
cs.DB    2
cs.LG    2
cs.SE    2
Name: count, dtype: int64

Our test set is balanced across categories, ensuring fair evaluation.

Running the Evaluation

Now let's test pure vector, pure keyword, and hybrid approaches:

def calculate_category_precision(query_text, expected_category, search_type="vector", alpha=0.5):
    """
    Calculate what percentage of top 5 results match expected category.

    Args:
        query_text: The search query
        expected_category: Expected category (e.g., 'cs.LG')
        search_type: 'vector', 'bm25', or 'hybrid'
        alpha: Weight for BM25 if using hybrid

    Returns:
        Tuple of (precision from 0.0 to 1.0, list of retrieved categories)
    """
    if search_type == "vector":
        results = search_with_filter(query_text, n_results=5)
        categories = [r['category'] for r in results['metadatas'][0]]

    elif search_type == "bm25":
        tokenized_query = simple_tokenize(query_text)
        bm25_scores = bm25.get_scores(tokenized_query)
        top_indices = np.argsort(bm25_scores)[::-1][:5]
        categories = [df.iloc[idx]['category'] for idx in top_indices]

    elif search_type == "hybrid":
        results = hybrid_search(query_text, alpha=alpha, n_results=5)
        categories = [r['category'] for r in results]

    # Calculate precision
    matches = sum(1 for cat in categories if cat == expected_category)
    precision = matches / len(categories)

    return precision, categories

# Evaluate all strategies
results_summary = {
    'Pure Vector': [],
    'Hybrid 30/70': [],
    'Hybrid 50/50': [],
    'Pure BM25': []
}

print("Evaluating search strategies on 10 test queries...")
print("=" * 80)

for query_info in test_queries:
    query = query_info['text']
    expected = query_info['expected_category']

    print(f"\nQuery: {query}")
    print(f"Expected: {expected}")

    # Pure vector
    precision, _ = calculate_category_precision(query, expected, "vector")
    results_summary['Pure Vector'].append(precision)
    print(f"  Pure Vector: {precision*100:.0f}% precision")

    # Hybrid 30/70
    precision, _ = calculate_category_precision(query, expected, "hybrid", alpha=0.3)
    results_summary['Hybrid 30/70'].append(precision)
    print(f"  Hybrid 30/70: {precision*100:.0f}% precision")

    # Hybrid 50/50
    precision, _ = calculate_category_precision(query, expected, "hybrid", alpha=0.5)
    results_summary['Hybrid 50/50'].append(precision)
    print(f"  Hybrid 50/50: {precision*100:.0f}% precision")

    # Pure BM25
    precision, _ = calculate_category_precision(query, expected, "bm25")
    results_summary['Pure BM25'].append(precision)
    print(f"  Pure BM25: {precision*100:.0f}% precision")

# Calculate average precision for each strategy
print("\n" + "=" * 80)
print("OVERALL RESULTS")
print("=" * 80)
for strategy, precisions in results_summary.items():
    avg_precision = sum(precisions) / len(precisions)
    print(f"{strategy}: {avg_precision*100:.0f}% average category precision")
Evaluating search strategies on 10 test queries...
================================================================================

Query: natural language processing transformers
Expected: cs.CL
  Pure Vector: 80% precision
  Hybrid 30/70: 60% precision
  Hybrid 50/50: 60% precision
  Pure BM25: 60% precision

Query: image segmentation computer vision
Expected: cs.CV
  Pure Vector: 80% precision
  Hybrid 30/70: 80% precision
  Hybrid 50/50: 80% precision
  Pure BM25: 80% precision

[... additional queries ...]

================================================================================
OVERALL RESULTS
================================================================================
Pure Vector: 84% average category precision
Hybrid 30/70: 78% average category precision
Hybrid 50/50: 78% average category precision
Pure BM25: 78% average category precision

Understanding What the Results Tell Us

These results deserve careful interpretation. Let's be honest about what we discovered.

Finding 1: Pure Vector Performed Best on This Dataset

Pure vector search achieved 84% category precision compared to 78% for hybrid and 78% for BM25. This might surprise you if you've read guides claiming hybrid search always outperforms pure approaches.

Why pure vector dominated on academic abstracts:

Academic papers have rich vocabulary and technical terminology. ML papers naturally use words like "training," "optimization," "neural networks." Database papers naturally use words like "query," "index," "transaction." The semantic embeddings capture these domain-specific patterns well.

Adding BM25 keyword matching introduced false positives. Papers that coincidentally used similar words in different contexts got boosted incorrectly. For example, a database paper might mention "model training" when discussing query optimization models, causing it to rank high for "neural network training" queries even though it's not about neural networks.

Finding 2: Hybrid Search Can Still Add Value

Just because pure vector won on this dataset doesn't mean hybrid search is worthless. There are scenarios where keyword matching helps:

When hybrid might outperform pure vector:

  • Searching structured data (product catalogs, API documentation)
  • Queries with rare technical terms that might not embed well
  • Domains where exact keyword presence is meaningful
  • Documents with inconsistent writing quality where semantic meaning is unclear

On our academic abstracts: The rich vocabulary gave vector search everything it needed. Keyword matching added more noise than signal.

Finding 3: The Vocabulary Mismatch Problem

Some queries failed across ALL strategies. For example, we tested "reducing storage requirements for system event data" hoping to find a paper about log compression. None of the approaches found it. Why?

The query used "reducing storage requirements" but the paper said "compression" and "resource savings." These are semantically equivalent, but the vocabulary differs. At 5k scale with multiple papers legitimately matching each query, vocabulary mismatches become visible.

This isn't a failure of vector search or hybrid search. It's the reality of semantic retrieval: users search with general terms, papers use technical jargon. Sometimes the gap is too wide.

Finding 4: Query Quality Matters More Than Strategy

Throughout our evaluation, we noticed that well-crafted queries with clear technical terms performed well across all strategies, while vague queries struggled everywhere.

A query like "neural network training optimization techniques" succeeded because it used the same language papers use. A query like "making models work better" failed because it's too general and uses informal language.

The lesson: Before optimizing your search strategy, make sure your queries match how your documents are written. Understanding your corpus matters more than choosing between vector and keyword search.

Practical Guidance for Real Projects

Let's consolidate what we've learned into actionable advice.

When to Use Metadata Filtering

Use filtering when:

  • Users explicitly request filters ("show me papers from 2024")
  • Filtering meaningfully improves result quality
  • Your query volume is manageable (ChromaDB can handle dozens of filtered queries per second)
  • The performance cost is acceptable for your use case

Design your schema carefully:

  • Store filterable fields as atomic values (integers for years, strings for categories)
  • Avoid nested JSON blobs or long text in metadata
  • Keep metadata consistent across documents
  • Test filtering performance on your actual data before deploying
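To make the schema advice concrete, here's a sketch of a well-shaped metadata record next to one that will cause filtering headaches. The field names echo the arXiv metadata we've been using; the has_code flag and the "bad" record are hypothetical examples, not part of our dataset.

# Good: atomic, consistently typed fields that filters can target directly
good_metadata = {
    "category": "cs.LG",   # string drawn from a known set
    "year": 2025,          # integer, works with $gte / $lte range filters
    "has_code": True       # boolean flag (hypothetical field)
}

# Bad: nested blobs and long text defeat filtering (hypothetical example)
bad_metadata = {
    "info": '{"category": "cs.LG", "year": "2025?"}',  # JSON string, not queryable fields
    "abstract": "A long abstract pasted into metadata..."  # belongs in the document, not metadata
}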

Accept the overhead:

  • Filtered queries run slower than unfiltered ones in ChromaDB
  • This is a characteristic of how ChromaDB approaches the problem
  • Production databases handle filtering with different tradeoffs (we'll see this in the next tutorial)
  • Design for the database you're actually using

When to Consider Hybrid Search

Try hybrid search when:

  • Your documents have structured fields where exact matches matter
  • Queries include rare technical terms that might not embed well
  • Testing shows hybrid outperforms pure vector on your test queries
  • You can afford the implementation and maintenance complexity

Stick with pure vector when:

  • Your documents have rich natural language (like our academic abstracts)
  • Vector search already achieves high precision on test queries
  • Simplicity and maintainability matter
  • Your embedding model captures domain terminology well

The decision framework:

  1. Build pure vector search first
  2. Create representative test queries
  3. Measure precision/recall on pure vector
  4. Only if results are inadequate, implement hybrid
  5. Compare hybrid against pure vector on same test queries
  6. Choose the approach with measurably better results

Don't add complexity without evidence it helps.

Start Simple, Measure, Then Optimize

The pattern that emerged across our experiments:

  1. Start with pure vector search: It's simpler to implement and maintain
  2. Build evaluation framework: Create test queries with expected results
  3. Measure performance: Calculate precision, recall, or domain-specific metrics
  4. Identify gaps: Where does pure vector fail?
  5. Add complexity thoughtfully: Try metadata filtering or hybrid search
  6. Re-evaluate: Does the added complexity improve results?
  7. Choose based on data: Not based on what tutorials claim always works

This approach keeps your system maintainable while ensuring each added feature provides real value.

Looking Ahead to Production Databases

Throughout this tutorial, we've explored filtering and hybrid search using ChromaDB. We've seen that:

  • Filtering adds measurable overhead, but remains usable for moderate query volumes
  • ChromaDB excels at local development and prototyping
  • Production systems optimize these patterns differently

ChromaDB is designed to be lightweight, easy to use, and perfect for learning. We've used it to understand vector database concepts without worrying about infrastructure. The patterns we learned (metadata schema design, hybrid scoring, evaluation frameworks) transfer directly to production systems.

In the next tutorial, we'll explore production vector databases:

  • PostgreSQL with pgvector: See how vector search integrates with SQL and existing infrastructure
  • Pinecone: Experience managed services with auto-scaling
  • Qdrant: Explore Rust-backed performance and efficient filtering

You'll discover how different systems approach filtering, when managed services make sense, and how to choose the right database for your needs. The core concepts remain the same, but production systems offer different tradeoffs in performance, features, and operational complexity.

But you needed to understand these concepts with an accessible tool first. ChromaDB gave us that foundation.

Practical Exercises

Before moving on, try these experiments to deepen your understanding:

Exercise 1: Explore Different Queries

Test pure vector vs hybrid search on queries from your own domain:

my_queries = [
    "your domain-specific query here",
    "another query relevant to your work",
    # Add more
]

for query in my_queries:
    print(f"\n{'='*70}")
    print(f"Query: {query}")

    # Try pure vector
    results_vector = search_with_filter(query, n_results=5)

    # Try hybrid
    results_hybrid = hybrid_search(query, alpha=0.5, n_results=5)

    # Compare the categories returned
    # Which approach surfaces more relevant papers?

Exercise 2: Tune Hybrid Alpha

Find the optimal alpha value for a specific query:

query = "your challenging query here"

for alpha in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
    results = hybrid_search(query, alpha=alpha, n_results=5)
    categories = [r['category'] for r in results]

    print(f"Alpha={alpha}: {categories}")
    # Which alpha gives the best results for this query?

Exercise 3: Analyze Filter Combinations

Test different metadata filter combinations:

# Try various filter patterns
filters_to_test = [
    {"category": "cs.LG"},
    {"year": {"$gte": 2024}},
    {"category": "cs.LG", "year": {"$eq": 2025}},
    {"$or": [{"category": "cs.LG"}, {"category": "cs.CV"}]}
]

query = "deep learning applications"

for where_clause in filters_to_test:
    results = search_with_filter(query, where_clause, n_results=5)
    categories = [r['category'] for r in results['metadatas'][0]]
    print(f"Filter {where_clause}: {categories}")

Exercise 4: Build Your Own Evaluation

Create test queries for a different domain:

# If you have expertise in a specific field,
# create queries where you KNOW which papers should match

domain_specific_queries = [
    {
        "text": "your expert query",
        "expected_category": "cs.XX",
        "notes": "why this query should return this category"
    },
    # Add more
]

# Run evaluation and see which strategy performs best
# on YOUR domain-specific queries

Summary: What You've Learned

We've covered a lot of ground in this tutorial. Here's what you can now do:

Core Skills

Metadata Schema Design:

  • Store filterable fields as atomic, consistent values
  • Avoid anti-patterns like JSON blobs and long text in metadata
  • Ensure all documents have the same metadata fields to prevent filtering issues
  • Understand that good schema design enables powerful filtering

Metadata Filtering in ChromaDB:

  • Implement category filters, numeric range filters, and combinations
  • Measure the performance overhead of filtering
  • Make informed decisions about when filtering justifies the cost

BM25 Keyword Search:

  • Build BM25 indexes from document text
  • Understand term frequency and inverse document frequency
  • Recognize when keyword matching complements vector search
  • Know the scale limitations of different BM25 implementations

Hybrid Search Implementation:

  • Normalize different score scales (BM25 and vector similarity)
  • Combine scores using weighted averages
  • Test different alpha values to balance keyword vs semantic search
  • Understand this is a teaching implementation of fundamental concepts

Systematic Evaluation:

  • Create test queries with ground truth expectations
  • Calculate precision metrics to compare strategies
  • Make data-driven decisions rather than assuming one approach always wins

Key Insights

1. Pure vector search performed best on our academic abstracts (84% category precision vs 78% for hybrid/BM25). This challenges the assumption that hybrid always wins. The rich vocabulary in academic papers gave vector search everything it needed.

2. Filtering overhead is real but manageable for moderate query volumes. ChromaDB's approach to filtering creates measurable costs that production databases handle differently.

3. Vocabulary mismatch is the biggest challenge. Users search with general terms ("reducing storage"), papers use jargon ("compression"). This gap affects all search strategies.

4. Query quality matters more than search strategy. Well-crafted queries using domain terminology succeed across approaches. Vague queries struggle everywhere.

5. Start simple, measure, then optimize. Build pure vector first, evaluate systematically, add complexity only when data shows it helps.

What's Next

We now understand how to enhance vector search with metadata filtering and hybrid approaches. We've seen what works, what doesn't, and how to measure the difference.

In the next tutorial, we'll explore production vector databases:

  • Set up PostgreSQL with pgvector and see how vector search integrates with SQL
  • Create a Pinecone index and experience managed vector database services
  • Run Qdrant locally and compare its filtering performance
  • Learn decision frameworks for choosing the right database for your needs

You'll get hands-on experience with multiple production systems and develop the judgment to choose appropriately for different scenarios.

Before moving on, make sure you understand:

  • How to design metadata schemas that enable effective filtering
  • The performance tradeoffs of metadata filtering
  • When hybrid search adds value vs adding complexity
  • How to evaluate search strategies systematically using precision metrics
  • Why pure vector search can outperform hybrid on certain datasets

When you're comfortable with these concepts, you're ready to explore production vector databases and learn when to move beyond ChromaDB.


Key Takeaways:

  • Metadata schema design matters: store filterable fields as atomic, consistent values and ensure all documents have the same fields
  • Filtering adds overhead in ChromaDB (category cheapest, year range most expensive, combined in between)
  • Pure vector achieved 84% category precision vs 78% for hybrid/BM25 on academic abstracts due to rich vocabulary
  • Hybrid search has value in specific scenarios (structured data, rare keywords) but adds complexity
  • Vocabulary mismatch between queries and documents affects every search strategy
  • Start with pure vector search, measure systematically, add complexity only when data justifies it
  • ChromaDB taught us filtering concepts; production databases optimize differently
  • Evaluation frameworks with test queries matter more than assumptions about "best practices"

Document Chunking Strategies for Vector Databases

6 December 2025 at 02:30

In the previous tutorial, we built a vector database with ChromaDB and ran semantic similarity searches across 5,000 arXiv papers. Our dataset consisted of paper abstracts, each about 200 words long. These abstracts were perfect for embedding as single units: short enough to fit comfortably in an embedding model's context window, yet long enough to capture meaningful semantic content.

But here's the challenge we didn't face yet: What happens when you need to search through full research papers, technical documentation, or long articles? A typical research paper contains 10,000 words. A comprehensive technical guide might have 50,000 words. These documents are far too long to embed as single vectors.

When documents are too long, you need to break them into chunks. This tutorial teaches you how to implement different chunking strategies, evaluate their performance systematically, and understand the tradeoffs between approaches. By the end, you'll know how to make informed decisions about chunking for your own projects.

Why Chunking Still Matters

You might be thinking: "Modern LLMs can handle massive amounts of data. Can't I just embed entire documents?"

There are three reasons why chunking remains essential:

1. Embedding Models Have Context Limits

Many embedding models still have much smaller context limits than modern chat models, and long inputs are also more expensive to embed. Even when a model can technically handle a whole paper, you usually don't want one giant vector: smaller chunks give you better retrieval and lower cost.
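To see how you might check this in practice, here's a minimal sketch that counts tokens with tiktoken's cl100k_base encoding. Both the encoding and the 8,000-token limit are illustrative assumptions; check your embedding model's documentation for its actual tokenizer and maximum input length.

import tiktoken

def exceeds_context_limit(text, limit_tokens=8000):
    """Rough check: does this text exceed an assumed embedding context limit?"""
    encoding = tiktoken.get_encoding("cl100k_base")  # approximation, not model-specific
    return len(encoding.encode(text)) > limit_tokens

# A 10,000-word text comfortably exceeds the assumed 8,000-token limit
sample = "word " * 10000
print(exceeds_context_limit(sample))  # True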

2. Retrieval Quality Depends on Granularity

Imagine someone searches for "robotic manipulation techniques." If you embedded an entire 10,000-word paper as a single vector, that search would match the whole paper, even if only one 400-word section actually discusses robotic manipulation. Chunking lets you retrieve the specific relevant section rather than forcing the user to read an entire paper.

3. Semantic Coherence Matters

A single document might cover multiple distinct topics. A paper about machine learning for healthcare might discuss neural network architectures in one section and patient privacy in another. These topics deserve separate embeddings so each can be retrieved independently when relevant.

The question isn't whether to chunk, but how to chunk intelligently. That's what we're going to figure out together.

What You'll Learn

By the end of this tutorial, you'll be able to:

  • Understand why chunking strategies affect retrieval quality
  • Implement two practical chunking approaches: fixed token windows and sentence-based chunking
  • Generate embeddings for chunks and store them in ChromaDB
  • Build a systematic evaluation framework to compare strategies
  • Interpret real performance data showing when each strategy excels
  • Make informed decisions about chunk size and strategy for your projects
  • Recognize that query quality matters more than chunking strategy

Most importantly, you'll learn how to evaluate your chunking decisions using real measurements rather than guesses.

Dataset and Setup

For this tutorial, we're working with 20 full research papers from the same arXiv dataset we used previously. These papers are balanced across five computer science categories:

  • cs.CL (Computational Linguistics): 4 papers
  • cs.CV (Computer Vision): 4 papers
  • cs.DB (Databases): 4 papers
  • cs.LG (Machine Learning): 4 papers
  • cs.SE (Software Engineering): 4 papers

We extracted the full text from these papers, and here's what makes them perfect for learning about chunking:

  • Total corpus: 196,181 words
  • Average paper length: 9,809 words (compared to 200-word abstracts)
  • Range: 2,735 to 20,763 words per paper
  • Content: Real academic papers with typical formatting artifacts

These papers are long enough to require chunking, diverse enough to test strategies across topics, and messy enough to reflect real-world document processing.

Required Files

Download arxiv_metadata_and_papers.zip and extract it to your working directory. This archive contains:

  • arxiv_20papers_metadata.csv - Metadata including the title, abstract, authors, published date, category, and arXiv ID for each of the 20 selected papers
  • arxiv_fulltext_papers/ - Directory containing the 20 text files (one per paper)

You'll also need the same Python environment from the previous tutorial, plus two additional packages:

# If you're starting fresh, create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install required packages
# Python 3.12 with these versions:
# chromadb==1.3.4
# numpy==2.0.2
# pandas==2.2.2
# cohere==5.20.0
# python-dotenv==1.1.1
# nltk==3.9.1
# tiktoken==0.12.0

pip install chromadb numpy pandas cohere python-dotenv nltk tiktoken

Make sure you have a .env file with your Cohere API key:

COHERE_API_KEY=your_key_here

Loading the Papers

Let's load our papers and examine what we're working with:

import pandas as pd
from pathlib import Path

# Load paper metadata
df = pd.read_csv('arxiv_20papers_metadata.csv')
papers_dir = Path('arxiv_fulltext_papers')

print(f"Loaded {len(df)} papers")
print(f"\nPapers per category:")
print(df['category'].value_counts().sort_index())

# Calculate corpus statistics
total_words = 0
word_counts = []

for arxiv_id in df['arxiv_id']:
    with open(papers_dir / f"{arxiv_id}.txt", 'r', encoding='utf-8') as f:
        text = f.read()
        words = len(text.split())
        word_counts.append(words)
        total_words += words

print(f"\nCorpus statistics:")
print(f"  Total words: {total_words:,}")
print(f"  Average words per paper: {sum(word_counts) / len(word_counts):.0f}")
print(f"  Range: {min(word_counts):,} to {max(word_counts):,} words")

# Show a sample paper
sample_id = df['arxiv_id'].iloc[0]
with open(papers_dir / f"{sample_id}.txt", 'r', encoding='utf-8') as f:
    text = f.read()
    print(f"\nSample paper ({sample_id}):")
    print(f"  Title: {df[df['arxiv_id'] == sample_id]['title'].iloc[0]}")
    print(f"  Category: {df[df['arxiv_id'] == sample_id]['category'].iloc[0]}")
    print(f"  Length: {len(text.split()):,} words")
    print(f"  Preview: {text[:300]}...")
Loaded 20 papers

Papers per category:
category
cs.CL    4
cs.CV    4
cs.DB    4
cs.LG    4
cs.SE    4
Name: count, dtype: int64

Corpus statistics:
  Total words: 196,181
  Average words per paper: 9809
  Range: 2,735 to 20,763 words

Sample paper (2511.09708v1):
  Title: Efficient Hyperdimensional Computing with Modular Composite Representations
  Category: cs.LG
  Length: 11,293 words
  Preview: 1
Efficient Hyperdimensional Computing with Modular Composite Representations
Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy L outfi, Mauro Olivieri, and Denis Kleyko
Abstract β€”The modular composite representation (MCR) is
a computing model that represents information with high-
dimensional...

We have 20 papers averaging nearly 10,000 words each. Compare this to the 200-word abstracts we used previously, and you can see why chunking becomes necessary. A 10,000-word paper cannot be embedded as a single unit without losing the ability to retrieve specific relevant sections.

A Note About Paper Extraction

The papers you're working with were extracted from PDFs using PyPDF2. We've provided the extracted text files so you can focus on chunking strategies rather than PDF processing. The extraction process is straightforward but involves details that aren't central to learning about chunking.

If you're curious about how we downloaded the PDFs and extracted the text, or if you want to extend this work with different papers, you'll find the complete code in the Appendix at the end of this tutorial. For now, just know that we:

  1. Downloaded 20 papers from arXiv (4 from each category)
  2. Extracted text from each PDF using PyPDF2
  3. Saved the extracted text to individual files

The extracted text has minor formatting artifacts like extra spaces or split words, but that's realistic. Real-world document processing always involves some noise. The chunking strategies we'll implement handle this gracefully.
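If you just want a feel for the extraction step without jumping to the Appendix, here's a minimal sketch using PyPDF2's PdfReader. The file path is a placeholder, and the Appendix version adds the downloading and error handling this sketch skips.

from PyPDF2 import PdfReader

def extract_pdf_text(pdf_path):
    """Concatenate the text of every page in a PDF (minimal version, no cleanup)."""
    reader = PdfReader(pdf_path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(pages)

# Placeholder path for illustration
# text = extract_pdf_text("some_arxiv_paper.pdf")
# print(text[:300])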

Strategy 1: Fixed Token Windows with Overlap

Let's start with the most common chunking approach in production systems: sliding a fixed-size window across the document with some overlap.

The Concept

Imagine reading a book through a window that shows exactly 500 words at a time. When you finish one window, you slide it forward by 400 words, creating a 100-word overlap with the previous window. This continues until you reach the end of the book.

Fixed token windows work the same way:

  1. Choose a chunk size (we'll use 512 tokens)
  2. Choose an overlap (we'll use 100 tokens, about 20%)
  3. Slide the window through the document
  4. Each window becomes one chunk

Why overlap? Concepts often span boundaries between chunks. If we chunk without overlap, we might split a crucial sentence or paragraph, losing semantic coherence. The 20% overlap ensures that even if something gets split, it appears complete in at least one chunk.

Implementation

Let's implement this strategy. We'll use tiktoken for accurate token counting:

import tiktoken

def chunk_text_fixed_tokens(text, chunk_size=512, overlap=100):
    """
    Chunk text using fixed token windows with overlap.

    Args:
        text: The document text to chunk
        chunk_size: Number of tokens per chunk (default 512)
        overlap: Number of tokens to overlap between chunks (default 100)

    Returns:
        List of text chunks
    """
    # We'll use tiktoken just to approximate token lengths.
    # In production, you'd usually match the tokenizer to your embedding model.
    encoding = tiktoken.get_encoding("cl100k_base")

    # Tokenize the entire text
    tokens = encoding.encode(text)

    chunks = []
    start_idx = 0

    while start_idx < len(tokens):
        # Get chunk_size tokens starting from start_idx
        end_idx = start_idx + chunk_size
        chunk_tokens = tokens[start_idx:end_idx]

        # Decode tokens back to text
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)

        # Move start_idx forward by (chunk_size - overlap)
        # This creates the overlap between consecutive chunks
        start_idx += (chunk_size - overlap)

        # Stop if we've reached the end
        if end_idx >= len(tokens):
            break

    return chunks

# Test on a sample paper
sample_id = df['arxiv_id'].iloc[0]
with open(papers_dir / f"{sample_id}.txt", 'r', encoding='utf-8') as f:
    sample_text = f.read()

sample_chunks = chunk_text_fixed_tokens(sample_text)
print(f"Sample paper chunks: {len(sample_chunks)}")
print(f"First chunk length: {len(sample_chunks[0].split())} words")
print(f"First chunk preview: {sample_chunks[0][:200]}...")
Sample paper chunks: 51
First chunk length: 323 words
First chunk preview: 1 Efficient Hyperdimensional Computing with Modular Composite Representations
Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy L outfi, Mauro Olivieri, and Denis Kleyko
Abstract β€”The modular co...

Our sample paper produced 51 chunks with the first chunk containing 323 words. The implementation is working as expected.

Processing All Papers

Now let's apply this chunking strategy to all 20 papers:

# Process all papers and collect chunks
all_chunks = []
chunk_metadata = []

for idx, row in df.iterrows():
    arxiv_id = row['arxiv_id']

    # Load paper text
    with open(papers_dir / f"{arxiv_id}.txt", 'r', encoding='utf-8') as f:
        text = f.read()

    # Chunk the paper
    chunks = chunk_text_fixed_tokens(text, chunk_size=512, overlap=100)

    # Store each chunk with metadata
    for chunk_idx, chunk in enumerate(chunks):
        all_chunks.append(chunk)
        chunk_metadata.append({
            'arxiv_id': arxiv_id,
            'title': row['title'],
            'category': row['category'],
            'chunk_index': chunk_idx,
            'total_chunks': len(chunks),
            'chunking_strategy': 'fixed_token_windows'
        })

print(f"Fixed token chunking results:")
print(f"  Total chunks created: {len(all_chunks)}")
print(f"  Average chunks per paper: {len(all_chunks) / len(df):.1f}")
print(f"  Average words per chunk: {sum(len(c.split()) for c in all_chunks) / len(all_chunks):.0f}")

# Check chunk size distribution
chunk_word_counts = [len(chunk.split()) for chunk in all_chunks]
print(f"  Chunk size range: {min(chunk_word_counts)} to {max(chunk_word_counts)} words")
Fixed token chunking results:
  Total chunks created: 914
  Average chunks per paper: 45.7
  Average words per chunk: 266
  Chunk size range: 16 to 438 words

We created 914 chunks from our 20 papers. Each paper produced about 46 chunks, averaging 266 words each. Notice the wide range: 16 to 438 words. This happens because tokens don't map exactly to words, and our stopping condition creates a small final chunk for some papers.

Edge Cases and Real-World Behavior

That 16-word chunk? It's not a bug. It's what happens when the final portion of a paper contains fewer tokens than our chunk size. In production, you might choose to:

  • Merge tiny final chunks with the previous chunk
  • Set a minimum chunk size threshold
  • Accept them as is (they're rare and often don't hurt retrieval)

We're keeping them to show real-world chunking behavior. Perfect uniformity isn't always necessary or beneficial.
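If you wanted the first option, a small post-processing step is enough. Here's a minimal sketch that folds a too-small final chunk into the previous one; the 50-word threshold is an arbitrary illustrative choice, and we don't apply this anywhere else in the tutorial.

def merge_tiny_final_chunk(chunks, min_words=50):
    """Merge the last chunk into the previous one if it falls below min_words."""
    if len(chunks) >= 2 and len(chunks[-1].split()) < min_words:
        return chunks[:-2] + [chunks[-2] + " " + chunks[-1]]
    return chunks

# Example on the sample paper's fixed-token chunks
cleaned_chunks = merge_tiny_final_chunk(sample_chunks)
print(f"Before: {len(sample_chunks)} chunks, after: {len(cleaned_chunks)} chunks")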

Generating Embeddings

Now we need to embed our 914 chunks using Cohere's API. This is where we need to be careful about rate limits:

from cohere import ClientV2
from dotenv import load_dotenv
import os
import time
import numpy as np

# Load API key
load_dotenv()
cohere_api_key = os.getenv('COHERE_API_KEY')
co = ClientV2(api_key=cohere_api_key)

# Configure batching to respect rate limits
# Cohere trial and free keys have strict rate limits.
# We'll use small batches and short pauses so we don't spam the API.
batch_size = 15  # Small batches to stay well under rate limits
wait_time = 15   # Seconds between batches

print("Generating embeddings for fixed token chunks...")
print(f"Total chunks: {len(all_chunks)}")
print(f"Batch size: {batch_size}")

all_embeddings = []
num_batches = (len(all_chunks) + batch_size - 1) // batch_size

for batch_idx in range(num_batches):
    start_idx = batch_idx * batch_size
    end_idx = min(start_idx + batch_size, len(all_chunks))
    batch = all_chunks[start_idx:end_idx]

    print(f"  Processing batch {batch_idx + 1}/{num_batches} (chunks {start_idx} to {end_idx})...")

    try:
        response = co.embed(
            texts=batch,
            model='embed-v4.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings.extend(response.embeddings.float_)

        # Wait between batches to avoid rate limits
        if batch_idx < num_batches - 1:
            time.sleep(wait_time)

    except Exception as e:
        print(f"  ⚠ Hit rate limit or error: {e}")
        print(f"  Waiting 60 seconds before retry...")
        time.sleep(60)

        # Retry the same batch
        response = co.embed(
            texts=batch,
            model='embed-v4.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings.extend(response.embeddings.float_)

        if batch_idx < num_batches - 1:
            time.sleep(wait_time)

print(f"βœ“ Generated {len(all_embeddings)} embeddings")

# Convert to numpy array for storage
embeddings_array = np.array(all_embeddings)
print(f"Embeddings shape: {embeddings_array.shape}")
Generating embeddings for fixed token chunks...
Total chunks: 914
Batch size: 15
  Processing batch 1/61 (chunks 0 to 15)...
  Processing batch 2/61 (chunks 15 to 30)...
  ...
  Processing batch 35/61 (chunks 510 to 525)...
  ⚠ Hit rate limit or error: Rate limit exceeded
  Waiting 60 seconds before retry...
  Processing batch 36/61 (chunks 525 to 540)...
  ...
βœ“ Generated 914 embeddings
Embeddings shape: (914, 1536)

Important note about rate limiting: We hit Cohere's rate limit during embedding generation. This isn't a failure or something to hide. It's a production reality. Our code handled it with a 60-second wait and retry. Good production code always anticipates and handles rate limits gracefully.

Exact limits depend on your plan and may change over time, so always check the provider docs and be ready to handle 429 "rate limit" errors.
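A common refinement is to wrap the embed call in a small retry helper with exponential backoff instead of a single fixed 60-second wait. Here's a minimal sketch of that pattern; it isn't used elsewhere in this tutorial, and the retry count and delays are arbitrary illustrative choices.

import time

def embed_with_backoff(texts, max_retries=5, base_delay=5):
    """Call co.embed with exponential backoff on errors (sketch, not production code)."""
    for attempt in range(max_retries):
        try:
            return co.embed(
                texts=texts,
                model='embed-v4.0',
                input_type='search_document',
                embedding_types=['float']
            )
        except Exception as e:
            delay = base_delay * (2 ** attempt)  # 5s, 10s, 20s, ...
            print(f"  ⚠ Embed call failed ({e}); retrying in {delay}s...")
            time.sleep(delay)
    raise RuntimeError("Embedding failed after repeated retries")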

Storing in ChromaDB

Now let's store our chunks in ChromaDB. Remember that ChromaDB won't let you create a collection that already exists. During development, you'll often regenerate chunks with different parameters, so we'll delete any existing collection first:

import chromadb

# Initialize ChromaDB client
client = chromadb.Client()  # In-memory client

# This in-memory client resets whenever you start a fresh Python session.
# Your collections and data will disappear when the script ends. Later tutorials
# will show you how to persist data across sessions using PersistentClient.

# Delete collection if it exists (useful for experimentation)
try:
    client.delete_collection(name="fixed_token_chunks")
    print("Deleted existing collection")
except Exception:
    pass  # Collection didn't exist, that's fine

# Create fresh collection
collection = client.create_collection(
    name="fixed_token_chunks",
    metadata={
        "description": "20 arXiv papers chunked with fixed token windows",
        "chunking_strategy": "fixed_token_windows",
        "chunk_size": 512,
        "overlap": 100
    }
)

# Prepare data for insertion
ids = [f"chunk_{i}" for i in range(len(all_chunks))]

print(f"Inserting {len(all_chunks)} chunks into ChromaDB...")

collection.add(
    ids=ids,
    embeddings=embeddings_array.tolist(),
    documents=all_chunks,
    metadatas=chunk_metadata
)

print(f"βœ“ Collection contains {collection.count()} chunks")
Deleted existing collection
Inserting 914 chunks into ChromaDB...
βœ“ Collection contains 914 chunks

Why delete and recreate? During development, you'll iterate on chunking strategies. Maybe you'll try different chunk sizes or overlap values. ChromaDB requires unique collection names, so the cleanest pattern is to delete the old version before creating the new one. This is standard practice while experimenting.
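An alternative during experimentation is ChromaDB's get_or_create_collection, which returns the existing collection instead of raising an error when the name is taken. Note that, unlike delete-and-recreate, it keeps whatever chunks are already stored, so it fits reuse rather than fresh starts (exact behavior can vary a little between ChromaDB versions).

# Reuse the collection if it already exists instead of deleting it first
reused = client.get_or_create_collection(name="fixed_token_chunks")
print(f"Collection currently holds {reused.count()} chunks")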

Our fixed token strategy is now complete: 914 chunks embedded and stored in ChromaDB.

Strategy 2: Sentence-Based Chunking

Let's implement our second approach: chunking based on sentence boundaries rather than arbitrary token positions.

The Concept

Instead of sliding a fixed window through tokens, sentence-based chunking respects the natural structure of language:

  1. Split text into sentences
  2. Group sentences together until reaching a target word count
  3. Never split a sentence in the middle
  4. Create a new chunk when adding the next sentence would exceed the target

This approach prioritizes semantic coherence over size consistency. A chunk might be 400 or 600 words, but it will always contain complete sentences that form a coherent thought.

Why sentence boundaries matter: Splitting mid-sentence destroys meaning. The sentence "Neural networks require careful tuning of hyperparameters to achieve optimal performance" loses critical context if split after "hyperparameters." Sentence-based chunking prevents this.

Implementation

We'll use NLTK for sentence tokenization:

import nltk

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

from nltk.tokenize import sent_tokenize

A quick note: Sentence tokenization on PDF-extracted text isn't always perfect, especially for technical papers with equations, citations, or unusual formatting. It works well enough for this tutorial, but if you experiment with your own papers, you might see occasional issues with sentences getting split or merged incorrectly.

def chunk_text_by_sentences(text, target_words=400, min_words=100):
    """
    Chunk text by grouping sentences until reaching target word count.

    Args:
        text: The document text to chunk
        target_words: Target words per chunk (default 400)
        min_words: Minimum words for a valid chunk (default 100)

    Returns:
        List of text chunks
    """
    # Split text into sentences
    sentences = sent_tokenize(text)

    chunks = []
    current_chunk = []
    current_word_count = 0

    for sentence in sentences:
        sentence_words = len(sentence.split())

        # If adding this sentence would exceed target, save current chunk
        if current_word_count > 0 and current_word_count + sentence_words > target_words:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_word_count = sentence_words
        else:
            current_chunk.append(sentence)
            current_word_count += sentence_words

    # Don't forget the last chunk
    if current_chunk and current_word_count >= min_words:
        chunks.append(' '.join(current_chunk))

    return chunks

# Test on the same sample paper
sample_chunks_sent = chunk_text_by_sentences(sample_text, target_words=400)
print(f"Sample paper chunks (sentence-based): {len(sample_chunks_sent)}")
print(f"First chunk length: {len(sample_chunks_sent[0].split())} words")
print(f"First chunk preview: {sample_chunks_sent[0][:200]}...")
Sample paper chunks (sentence-based): 29
First chunk length: 392 words
First chunk preview: 1
Efficient Hyperdimensional Computing with
Modular Composite Representations
Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy L outfi, Mauro Olivieri, and Denis Kleyko
Abstract β€”The modular co...

The same paper that produced 51 fixed-token chunks now produces 29 sentence-based chunks. The first chunk is 392 words, close to our 400-word target but not exact.

Processing All Papers

Let's apply sentence-based chunking to all 20 papers:

# Process all papers with sentence-based chunking
all_chunks_sent = []
chunk_metadata_sent = []

for idx, row in df.iterrows():
    arxiv_id = row['arxiv_id']

    # Load paper text
    with open(papers_dir / f"{arxiv_id}.txt", 'r', encoding='utf-8') as f:
        text = f.read()

    # Chunk by sentences
    chunks = chunk_text_by_sentences(text, target_words=400, min_words=100)

    # Store each chunk with metadata
    for chunk_idx, chunk in enumerate(chunks):
        all_chunks_sent.append(chunk)
        chunk_metadata_sent.append({
            'arxiv_id': arxiv_id,
            'title': row['title'],
            'category': row['category'],
            'chunk_index': chunk_idx,
            'total_chunks': len(chunks),
            'chunking_strategy': 'sentence_based'
        })

print(f"Sentence-based chunking results:")
print(f"  Total chunks created: {len(all_chunks_sent)}")
print(f"  Average chunks per paper: {len(all_chunks_sent) / len(df):.1f}")
print(f"  Average words per chunk: {sum(len(c.split()) for c in all_chunks_sent) / len(all_chunks_sent):.0f}")

# Check chunk size distribution
chunk_word_counts_sent = [len(chunk.split()) for chunk in all_chunks_sent]
print(f"  Chunk size range: {min(chunk_word_counts_sent)} to {max(chunk_word_counts_sent)} words")
Sentence-based chunking results:
  Total chunks created: 513
  Average chunks per paper: 25.6
  Average words per chunk: 382
  Chunk size range: 110 to 548 words

Sentence-based chunking produced 513 chunks compared to fixed token's 914. That's about 44% fewer chunks. Each chunk averages 382 words instead of 266. This isn't better or worse; it's a different tradeoff:

Fixed Token (914 chunks):

  • More chunks, smaller sizes
  • Consistent token counts
  • More embeddings to generate and store
  • Finer-grained retrieval granularity

Sentence-Based (513 chunks):

  • Fewer chunks, larger sizes
  • Variable sizes respecting sentences
  • Less storage and fewer embeddings
  • Preserves semantic coherence

Comparing Strategies Side-by-Side

Let's create a comparison table:

import pandas as pd

comparison_df = pd.DataFrame({
    'Metric': ['Total Chunks', 'Chunks per Paper', 'Avg Words per Chunk',
               'Min Words', 'Max Words'],
    'Fixed Token': [914, 45.7, 266, 16, 438],
    'Sentence-Based': [513, 25.6, 382, 110, 548]
})

print(comparison_df.to_string(index=False))
             Metric  Fixed Token  Sentence-Based
       Total Chunks          914             513
   Chunks per Paper         45.7            25.6
Avg Words per Chunk          266             382
          Min Words           16             110
          Max Words          438             548

Sentence-based chunking creates 44% fewer chunks. This means:

  • Lower costs: 44% fewer embeddings to generate
  • Less storage: 44% less data to store and query
  • Larger context: Each chunk contains more complete thoughts
  • Better coherence: Never splits mid-sentence

But remember, this isn't automatically "better." Smaller chunks can enable more precise retrieval. The choice depends on your use case.

Generating Embeddings for Sentence-Based Chunks

We'll use the same embedding process as before, with the same rate limiting pattern:

print("Generating embeddings for sentence-based chunks...")
print(f"Total chunks: {len(all_chunks_sent)}")

all_embeddings_sent = []
num_batches = (len(all_chunks_sent) + batch_size - 1) // batch_size

for batch_idx in range(num_batches):
    start_idx = batch_idx * batch_size
    end_idx = min(start_idx + batch_size, len(all_chunks_sent))
    batch = all_chunks_sent[start_idx:end_idx]

    print(f"  Processing batch {batch_idx + 1}/{num_batches} (chunks {start_idx} to {end_idx})...")

    try:
        response = co.embed(
            texts=batch,
            model='embed-v4.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings_sent.extend(response.embeddings.float_)

        if batch_idx < num_batches - 1:
            time.sleep(wait_time)

    except Exception as e:
        print(f"  ⚠ Hit rate limit: {e}")
        print(f"  Waiting 60 seconds...")
        time.sleep(60)

        response = co.embed(
            texts=batch,
            model='embed-v4.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings_sent.extend(response.embeddings.float_)

        if batch_idx < num_batches - 1:
            time.sleep(wait_time)

print(f"βœ“ Generated {len(all_embeddings_sent)} embeddings")

embeddings_array_sent = np.array(all_embeddings_sent)
print(f"Embeddings shape: {embeddings_array_sent.shape}")
Generating embeddings for sentence-based chunks...
Total chunks: 513
  Processing batch 1/35 (chunks 0 to 15)...
  ...
βœ“ Generated 513 embeddings
Embeddings shape: (513, 1536)

With 513 chunks instead of 914, embedding generation is faster and costs less. This is a concrete benefit of the sentence-based approach.

Storing Sentence-Based Chunks in ChromaDB

We'll create a separate collection for sentence-based chunks:

# Delete existing collection if present
try:
    client.delete_collection(name="sentence_chunks")
except Exception:
    pass

# Create sentence-based collection
collection_sent = client.create_collection(
    name="sentence_chunks",
    metadata={
        "description": "20 arXiv papers chunked by sentences",
        "chunking_strategy": "sentence_based",
        "target_words": 400,
        "min_words": 100
    }
)

# Prepare and insert data
ids_sent = [f"chunk_{i}" for i in range(len(all_chunks_sent))]

print(f"Inserting {len(all_chunks_sent)} chunks into ChromaDB...")

collection_sent.add(
    ids=ids_sent,
    embeddings=embeddings_array_sent.tolist(),
    documents=all_chunks_sent,
    metadatas=chunk_metadata_sent
)

print(f"βœ“ Collection contains {collection_sent.count()} chunks")
Inserting 513 chunks into ChromaDB...
βœ“ Collection contains 513 chunks

Now we have two collections:

  • fixed_token_chunks with 914 chunks
  • sentence_chunks with 513 chunks

Both contain the same 20 papers, just chunked differently. Now comes the critical question: which strategy actually retrieves relevant content better?

Building an Evaluation Framework

We've created two chunking strategies and embedded all the chunks. But how do we know which one works better? We need a systematic way to measure retrieval quality.

The Evaluation Approach

Our evaluation framework works like this:

  1. Create test queries for specific papers we know should be retrieved
  2. Run each query against both chunking strategies
  3. Check if the expected paper appears in the top results
  4. Compare performance across strategies

The key is having ground truth: knowing which papers should match which queries.

Creating Good Test Queries

Here's something we learned the hard way during development: bad queries make any chunking strategy look bad.

When we first built this evaluation, we tried queries like "reinforcement learning optimization" for a paper that was actually about masked diffusion models. Both chunking strategies "failed" because we gave them an impossible task. The problem wasn't the chunking, it was our poor understanding of the documents.

The fix: Before creating queries, read the paper abstracts. Understand what each paper actually discusses. Then create queries that match real content.

Let's create five test queries based on actual paper content:

# Test queries designed from actual paper content
test_queries = [
    {
        "text": "knowledge editing in language models",
        "expected_paper": "2510.25798v1",  # MemEIC paper (cs.CL)
        "description": "Knowledge editing"
    },
    {
        "text": "masked diffusion models for inference optimization",
        "expected_paper": "2511.04647v2",  # Masked diffusion (cs.LG)
        "description": "Optimal inference schedules"
    },
    {
        "text": "robotic manipulation with spatial representations",
        "expected_paper": "2511.09555v1",  # SpatialActor (cs.CV)
        "description": "Robot manipulation"
    },
    {
        "text": "blockchain ledger technology for database integrity",
        "expected_paper": "2507.13932v1",  # Chain Table (cs.DB)
        "description": "Blockchain database integrity"
    },
    {
        "text": "automated test generation and oracle synthesis",
        "expected_paper": "2510.26423v1",  # Nexus (cs.SE)
        "description": "Multi-agent test oracles"
    }
]

print("Test queries:")
for i, query in enumerate(test_queries, 1):
    print(f"{i}. {query['text']}")
    print(f"   Expected paper: {query['expected_paper']}")
    print()
Test queries:
1. knowledge editing in language models
   Expected paper: 2510.25798v1

2. masked diffusion models for inference optimization
   Expected paper: 2511.04647v2

3. robotic manipulation with spatial representations
   Expected paper: 2511.09555v1

4. blockchain ledger technology for database integrity
   Expected paper: 2507.13932v1

5. automated test generation and oracle synthesis
   Expected paper: 2510.26423v1

These queries are specific enough to target particular papers but general enough to represent realistic search behavior. Each query matches actual content from its expected paper.

Implementing the Evaluation Loop

Now let's run these queries against both chunking strategies and compare results:

def evaluate_chunking_strategy(collection, test_queries, strategy_name):
    """
    Evaluate a chunking strategy using test queries.

    Returns:
        Dictionary with success rate and detailed results
    """
    results = []

    for query_info in test_queries:
        query_text = query_info['text']
        expected_paper = query_info['expected_paper']

        # Embed the query
        response = co.embed(
            texts=[query_text],
            model='embed-v4.0',
            input_type='search_query',
            embedding_types=['float']
        )
        query_embedding = np.array(response.embeddings.float_[0])

        # Search the collection
        search_results = collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=5
        )

        # Extract paper IDs from chunks
        retrieved_papers = []
        for metadata in search_results['metadatas'][0]:
            paper_id = metadata['arxiv_id']
            if paper_id not in retrieved_papers:
                retrieved_papers.append(paper_id)

        # Check if expected paper was found
        found = expected_paper in retrieved_papers
        position = retrieved_papers.index(expected_paper) + 1 if found else None
        best_distance = search_results['distances'][0][0]

        results.append({
            'query': query_text,
            'expected_paper': expected_paper,
            'found': found,
            'position': position,
            'best_distance': best_distance,
            'retrieved_papers': retrieved_papers[:3]  # Top 3 for comparison
        })

    # Calculate success rate
    success_rate = sum(1 for r in results if r['found']) / len(results)

    return {
        'strategy': strategy_name,
        'success_rate': success_rate,
        'results': results
    }

# Evaluate both strategies
print("Evaluating fixed token strategy...")
fixed_token_eval = evaluate_chunking_strategy(
    collection,
    test_queries,
    "Fixed Token Windows"
)

print("Evaluating sentence-based strategy...")
sentence_eval = evaluate_chunking_strategy(
    collection_sent,
    test_queries,
    "Sentence-Based"
)

print("\n" + "="*80)
print("EVALUATION RESULTS")
print("="*80)
Evaluating fixed token strategy...
Evaluating sentence-based strategy...

================================================================================
EVALUATION RESULTS
================================================================================

Comparing Results

Let's examine how each strategy performed:

def print_evaluation_results(eval_results):
    """Print evaluation results in a readable format"""
    strategy = eval_results['strategy']
    success_rate = eval_results['success_rate']
    results = eval_results['results']

    print(f"\n{strategy}")
    print("-" * 80)
    print(f"Success Rate: {len([r for r in results if r['found']])}/{len(results)} queries ({success_rate*100:.0f}%)")
    print()

    for i, result in enumerate(results, 1):
        status = "βœ“" if result['found'] else "βœ—"
        position = f"(position #{result['position']})" if result['found'] else ""

        print(f"{i}. {status} {result['query']}")
        print(f"   Expected: {result['expected_paper']}")
        print(f"   Found: {result['found']} {position}")
        print(f"   Best match distance: {result['best_distance']:.4f}")
        print(f"   Top 3 papers: {', '.join(result['retrieved_papers'][:3])}")
        print()

# Print results for both strategies
print_evaluation_results(fixed_token_eval)
print_evaluation_results(sentence_eval)

# Compare directly
print("\n" + "="*80)
print("DIRECT COMPARISON")
print("="*80)
print(f"{'Query':<60} {'Fixed':<10} {'Sentence':<10}")
print("-" * 80)

for i in range(len(test_queries)):
    query = test_queries[i]['text'][:55]
    fixed_pos = fixed_token_eval['results'][i]['position']
    sent_pos = sentence_eval['results'][i]['position']

    fixed_str = f"#{fixed_pos}" if fixed_pos else "Not found"
    sent_str = f"#{sent_pos}" if sent_pos else "Not found"

    print(f"{query:<60} {fixed_str:<10} {sent_str:<10}")
Fixed Token Windows
--------------------------------------------------------------------------------
Success Rate: 5/5 queries (100%)

1. βœ“ knowledge editing in language models
   Expected: 2510.25798v1
   Found: True (position #1)
   Best match distance: 0.8865
   Top 3 papers: 2510.25798v1

2. βœ“ masked diffusion models for inference optimization
   Expected: 2511.04647v2
   Found: True (position #1)
   Best match distance: 0.9526
   Top 3 papers: 2511.04647v2

3. βœ“ robotic manipulation with spatial representations
   Expected: 2511.09555v1
   Found: True (position #1)
   Best match distance: 0.9209
   Top 3 papers: 2511.09555v1

4. βœ“ blockchain ledger technology for database integrity
   Expected: 2507.13932v1
   Found: True (position #1)
   Best match distance: 0.6678
   Top 3 papers: 2507.13932v1

5. βœ“ automated test generation and oracle synthesis
   Expected: 2510.26423v1
   Found: True (position #1)
   Best match distance: 0.9395
   Top 3 papers: 2510.26423v1

Sentence-Based
--------------------------------------------------------------------------------
Success Rate: 5/5 queries (100%)

1. βœ“ knowledge editing in language models
   Expected: 2510.25798v1
   Found: True (position #1)
   Best match distance: 0.8831
   Top 3 papers: 2510.25798v1

2. βœ“ masked diffusion models for inference optimization
   Expected: 2511.04647v2
   Found: True (position #1)
   Best match distance: 0.9586
   Top 3 papers: 2511.04647v2, 2511.07930v1

3. βœ“ robotic manipulation with spatial representations
   Expected: 2511.09555v1
   Found: True (position #1)
   Best match distance: 0.8863
   Top 3 papers: 2511.09555v1

4. βœ“ blockchain ledger technology for database integrity
   Expected: 2507.13932v1
   Found: True (position #1)
   Best match distance: 0.6746
   Top 3 papers: 2507.13932v1

5. βœ“ automated test generation and oracle synthesis
   Expected: 2510.26423v1
   Found: True (position #1)
   Best match distance: 0.9320
   Top 3 papers: 2510.26423v1

================================================================================
DIRECT COMPARISON
================================================================================
Query                                                        Fixed      Sentence  
--------------------------------------------------------------------------------
knowledge editing in language models                         #1         #1        
masked diffusion models for inference optimization           #1         #1        
robotic manipulation with spatial representations            #1         #1        
blockchain ledger technology for database integrity          #1         #1        
automated test generation and oracle synthesis               #1         #1       

Understanding the Results

Let's break down what these results tell us:

Key Finding 1: Both Strategies Work Well

Both chunking strategies achieved 100% success rate. Every test query successfully retrieved its expected paper at position #1. This is the most important result.

When you have good queries that match actual document content, chunking strategy matters less than you might think. Both approaches work because they both preserve the semantic meaning of the content, just in slightly different ways.

Key Finding 2: Sentence-Based Has Better Distances

Look at the distance values. ChromaDB uses squared Euclidean distance by default, where lower values indicate higher similarity:

Query 1 (knowledge editing):

  • Fixed token: 0.8865
  • Sentence-based: 0.8831 (better)

Query 3 (robotic manipulation):

  • Fixed token: 0.9209
  • Sentence-based: 0.8863 (better)

Query 5 (automated test generation):

  • Fixed token: 0.9395
  • Sentence-based: 0.9320 (better)

In 3 out of 5 queries, sentence-based chunking produced lower distances, meaning the retrieved chunks sat closer to the query in embedding space. This suggests that preserving sentence boundaries helps maintain semantic coherence, producing embeddings that better capture document meaning.
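If you'd rather reason in terms of cosine similarity than squared Euclidean distance, ChromaDB lets you choose the distance function per collection. Here's a minimal sketch, assuming the same ChromaDB client we've been using throughout; the collection name is illustrative:

# Optional: use cosine distance instead of the default squared L2.
# ChromaDB reports cosine distance as 1 - cosine similarity, so lower
# values still mean closer matches.
collection_cosine = client.create_collection(
    name="papers_cosine_experiment",      # hypothetical name
    metadata={"hnsw:space": "cosine"}     # "l2" (default), "cosine", or "ip"
)

Rankings can shift slightly between distance functions unless your embeddings are unit-normalized, so re-run the evaluation if you switch.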

Key Finding 3: Low Agreement in Secondary Results

While both strategies found the right paper at #1, look at the papers in positions #2 and #3. They often differ between strategies:

  • Query 1: Both found the same top 3 papers
  • Query 2: Top paper matches, but #2 and #3 differ
  • Query 5: Only the top paper matches; #2 and #3 are completely different

This happens because chunk size affects which papers surface as similar. Neither is "wrong"; they just have different perspectives on what else might be relevant. The important thing is that both got the most relevant paper right.
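If you want to put a number on that agreement, you can compare the top-3 lists directly from the evaluation dictionaries we already built:

# Count how many of the top papers the two strategies have in common
for i, query in enumerate(test_queries):
    fixed_top = set(fixed_token_eval['results'][i]['retrieved_papers'])
    sent_top = set(sentence_eval['results'][i]['retrieved_papers'])
    shared = len(fixed_top & sent_top)
    print(f"{query['text'][:50]:<50} shared papers: {shared}")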

What This Means for Your Projects

So which chunking strategy should you choose? The answer is: it depends on your constraints and priorities.

Choose Fixed Token Windows when:

  • You need consistent chunk sizes for batch processing or downstream tasks
  • Storage isn't a concern and you want finer-grained retrieval
  • Your documents lack clear sentence structure (logs, code, transcripts)
  • You're working with multilingual content where sentence detection is unreliable

Choose Sentence-Based Chunking when:

  • You want to minimize storage costs (44% fewer chunks)
  • Semantic coherence is more important than size consistency
  • Your documents have clear sentence boundaries (articles, papers, documentation)
  • You want better similarity scores (as our results suggest)

The honest truth: Both strategies work well. If you implement either one properly, you'll get good retrieval results. The choice is less about "which is better" and more about which tradeoffs align with your project constraints.

Beyond These Two Strategies

We've implemented two practical chunking strategies, but there's a third approach worth knowing about: structure-aware chunking.

The Concept

Instead of chunking based on arbitrary token boundaries or sentence groupings, structure-aware chunking respects the logical organization of documents:

  • Academic papers have sections: Introduction, Methods, Results, Discussion
  • Technical documentation has headers, code blocks, and lists
  • Web pages have HTML structure: headings, paragraphs, articles
  • Markdown files have explicit hierarchy markers

Structure-aware chunking says: "Don't just group words or sentences. Recognize that this is an Introduction section, and this is a Methods section, and keep them separate."

Simple Implementation Example

Here's what structure-aware chunking might look like for markdown documents:

def chunk_by_markdown_sections(text, min_words=100):
    """
    Chunk text by markdown section headers.
    Each section header starts a new chunk. Sections shorter than
    min_words are dropped, and long sections are kept whole in this
    simplified version.
    """
    chunks = []
    current_section = []

    for line in text.split('\n'):
        # Detect section headers (lines starting with #)
        if line.startswith('#'):
            # Save previous section if it exists
            if current_section:
                section_text = '\n'.join(current_section)
                if len(section_text.split()) >= min_words:
                    chunks.append(section_text)
            # Start new section
            current_section = [line]
        else:
            current_section.append(line)

    # Don't forget the last section
    if current_section:
        section_text = '\n'.join(current_section)
        if len(section_text.split()) >= min_words:
            chunks.append(section_text)

    return chunks

This is deliberately simple, but it illustrates the concept: identify structure markers and use them to define chunk boundaries.

When Structure-Aware Chunking Helps

Structure-aware chunking excels when:

  • Document structure matches query patterns: If users search for "Methods," they probably want the Methods section, not a random 512-token window that happens to include some methods
  • Context boundaries are important: Code with comments, FAQs with Q&A pairs, API documentation with endpoints
  • Sections have distinct topics: A paper discussing both neural networks and patient privacy should keep those sections separate

Why We Didn't Implement It Fully

The evaluation framework we built works for any chunking strategy. You have all the tools needed to implement and test structure-aware chunking yourself:

  1. Write a chunking function that respects document structure
  2. Generate embeddings for your chunks
  3. Store them in ChromaDB
  4. Use our evaluation framework to compare against the strategies we built

The process is identical. The only difference is how you define chunk boundaries.
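Here's a rough sketch of those four steps applied to the markdown chunker above. It assumes a papers list of dicts with 'arxiv_id' and 'text' keys and reuses the client, co, and evaluation helpers from earlier; with more than a handful of chunks, use the batching and rate-limit handling from earlier sections rather than the single embed call shown here:

# 1. Chunk each document with the custom strategy
structure_chunks = []
for paper in papers:  # assumed shape: [{'arxiv_id': ..., 'text': ...}, ...]
    for chunk in chunk_by_markdown_sections(paper['text']):
        structure_chunks.append({'arxiv_id': paper['arxiv_id'], 'text': chunk})

# 2. Embed the chunks (single call for brevity; batch in practice)
response = co.embed(
    texts=[c['text'] for c in structure_chunks],
    model='embed-v4.0',
    input_type='search_document',
    embedding_types=['float']
)

# 3. Store them in a fresh ChromaDB collection
collection_struct = client.create_collection(name="papers_structure_aware")
collection_struct.add(
    ids=[f"struct_{i}" for i in range(len(structure_chunks))],
    embeddings=[list(e) for e in response.embeddings.float_],
    documents=[c['text'] for c in structure_chunks],
    metadatas=[{'arxiv_id': c['arxiv_id']} for c in structure_chunks]
)

# 4. Evaluate with the same test queries and compare
structure_eval = evaluate_chunking_strategy(
    collection_struct, test_queries, "Structure-Aware"
)
print_evaluation_results(structure_eval)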

Hyperparameter Tuning Guidance

We made specific choices for our chunking parameters:

  • Fixed token: 512 tokens with 100-token overlap (20%)
  • Sentence-based: 400-word target with 100-word minimum

Are these optimal? Maybe, maybe not. They're reasonable defaults that worked well for academic papers. But your documents might benefit from different values.

When to Experiment with Different Parameters

Try smaller chunks (256 tokens or 200 words) when:

  • Queries target specific facts rather than broad concepts
  • Precision matters more than context
  • Storage costs aren't a constraint

Try larger chunks (1024 tokens or 600 words) when:

  • Context matters more than precision
  • Your queries are conceptual rather than factual
  • You want to reduce the total number of embeddings

Adjust overlap when:

  • Concepts frequently span chunk boundaries (increase overlap to 30-40%)
  • Storage costs are critical (reduce overlap to 10%)
  • You notice important information getting split

How to Experiment Systematically

The evaluation framework we built makes experimentation straightforward:

  1. Modify chunking parameters
  2. Generate new chunks and embeddings
  3. Store in a new ChromaDB collection
  4. Run your test queries
  5. Compare results

Don't spend hours tuning parameters before you know whether your baseline retrieval works at all. Start with reasonable defaults (like ours), measure performance, and tune only if needed. Most projects never need aggressive parameter tuning.
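One cheap first step: before re-embedding anything, check how a parameter change affects chunk counts (and therefore embedding cost) locally. A small sketch, assuming the chunk_text_fixed_tokens function from earlier and the same papers list of dicts with a 'text' field:

# Compare total chunk counts across parameter settings with no API calls
for chunk_size, overlap in [(256, 50), (512, 100), (1024, 200)]:
    total = sum(
        len(chunk_text_fixed_tokens(paper['text'], chunk_size=chunk_size, overlap=overlap))
        for paper in papers
    )
    print(f"chunk_size={chunk_size:>4}, overlap={overlap:>3}: {total} chunks")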

Practical Exercise

Now it's your turn to experiment. Here are some variations to try:

Option 1: Modify Fixed Token Strategy

Change the chunk size to 256 or 1024 tokens. How does this affect:

  • Total number of chunks?
  • Success rate on test queries?
  • Average similarity distances?
# Try this
chunks_small = chunk_text_fixed_tokens(sample_text, chunk_size=256, overlap=50)
chunks_large = chunk_text_fixed_tokens(sample_text, chunk_size=1024, overlap=200)

Option 2: Modify Sentence-Based Strategy

Adjust the target word count to 200 or 600 words:

# Try this
chunks_small_sent = chunk_text_by_sentences(sample_text, target_words=200)
chunks_large_sent = chunk_text_by_sentences(sample_text, target_words=600)

Option 3: Implement Structure-Aware Chunking

If your papers have clear section markers, try implementing a structure-aware chunker. Use the evaluation framework to compare it against our two strategies.

Reflection Questions

As you experiment, consider:

  • When would you choose fixed token over sentence-based chunking?
  • How would you chunk code documentation? Chat logs? News articles?
  • What chunk size makes sense for a chatbot knowledge base? For legal documents?
  • How does overlap affect retrieval quality in your tests?

Summary and Next Steps

We've built and evaluated two complete chunking strategies for vector databases. Here's what we accomplished:

Core Skills Gained

Implementation:

  • Fixed token window chunking with overlap (914 chunks from 20 papers)
  • Sentence-based chunking respecting linguistic boundaries (513 chunks)
  • Batch processing with rate limit handling
  • ChromaDB collection management for experimentation

Evaluation:

  • Systematic evaluation framework with ground truth queries
  • Measuring success rate and ranking position
  • Comparing strategies quantitatively using real performance data
  • Understanding that query quality matters more than chunking strategy

Key Takeaways

  • No Universal "Best" Chunking Strategy: Both strategies achieved 100% success when given good queries. The choice depends on your constraints (storage, semantic coherence, document structure) rather than one approach being objectively better.
  • Query Quality Matters Most: Bad queries make any chunking strategy look bad. Before evaluating chunking, understand your documents and create queries that match actual content. This lesson applies to all retrieval systems, not just chunking.
  • Sentence-Based Provides Better Distances: In 3 out of 5 test queries, sentence-based chunking had lower distances (higher similarity). Preserving sentence boundaries helps maintain semantic coherence in embeddings.
  • Tradeoffs Are Real: Fixed token creates 1.8x more chunks than sentence-based (914 vs 513). That means more storage and more embeddings to generate, which gets expensive at scale, but in exchange you get finer retrieval granularity. Neither is wrong; they optimize for different things. Remember that with overlap you pay for every chunk: smaller chunks plus heavier overlap means significantly higher API costs when embedding large document collections (see the quick calculation after this list).
  • Edge Cases Are Normal: That 16-word chunk from fixed token chunking? The 601-word chunk from sentence-based? These are real-world behaviors, not bugs. Production systems handle imperfect inputs gracefully.
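Here's the back-of-the-envelope overlap math mentioned in the tradeoffs takeaway, using a hypothetical 10,000-token document and our fixed-token defaults:

# Rough cost of overlap for fixed-token chunking (document length is hypothetical)
doc_tokens = 10_000
chunk_size, overlap = 512, 100
stride = chunk_size - overlap  # 412 tokens of new content per chunk

chunks_with_overlap = -(-doc_tokens // stride)         # ceiling division -> 25
chunks_without_overlap = -(-doc_tokens // chunk_size)  # ceiling division -> 20
print(chunks_with_overlap, chunks_without_overlap)

# Roughly chunk_size / stride ≈ 1.24x more chunks (and embedded tokens)
# than chunking without overlap, before you even shrink the chunk size.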

Looking Ahead

We now know how to chunk documents and store them in ChromaDB. But what if we want to enhance our searches? What if we need to filter results by publication year? Search only computer vision papers? Combine semantic similarity with traditional keyword matching?

An upcoming tutorial will teach:

  • Designing metadata schemas for effective filtering
  • Combining vector similarity with metadata constraints
  • Implementing hybrid search (BM25 + vector similarity)
  • Understanding performance tradeoffs of different filtering approaches
  • Making metadata work at scale

Before moving on, make sure you understand:

  • How fixed token and sentence-based chunking differ
  • When to choose each strategy based on project needs
  • How to evaluate chunking systematically with test queries
  • Why query quality matters more than chunking strategy
  • How to handle rate limits and ChromaDB collection management

When you're comfortable with these chunking fundamentals, you're ready to enhance your vector search with metadata and hybrid approaches.


Appendix: Dataset Preparation Code

This appendix provides the complete code we used to prepare the dataset for this tutorial. You don't need to run this code to complete the tutorial, but it's here if you want to:

  • Understand how we selected and downloaded papers from arXiv
  • Extract text from your own PDF files
  • Extend the dataset with different papers or categories

Downloading Papers from arXiv

We selected 20 papers (4 from each category) from the 5,000-paper dataset used in the previous tutorial. Here's how we downloaded the PDFs:

import urllib.request
import pandas as pd
from pathlib import Path
import time

def download_arxiv_pdf(arxiv_id, save_dir):
    """
    Download a paper PDF from arXiv.

    Args:
        arxiv_id: The arXiv ID (e.g., '2510.25798v1')
        save_dir: Directory to save the PDF

    Returns:
        Path to downloaded PDF or None if failed
    """
    # Create save directory if it doesn't exist
    save_dir = Path(save_dir)
    save_dir.mkdir(exist_ok=True)

    # Construct arXiv PDF URL
    # arXiv URLs follow pattern: https://arxiv.org/pdf/{id}.pdf
    pdf_url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
    save_path = save_dir / f"{arxiv_id}.pdf"

    try:
        print(f"Downloading {arxiv_id}...")
        urllib.request.urlretrieve(pdf_url, save_path)
        print(f"  βœ“ Saved to {save_path}")
        return save_path
    except Exception as e:
        print(f"  βœ— Failed: {e}")
        return None

# Example: Download papers from our metadata file
df = pd.read_csv('arxiv_20papers_metadata.csv')

pdf_dir = Path('arxiv_pdfs')
for arxiv_id in df['arxiv_id']:
    download_arxiv_pdf(arxiv_id, pdf_dir)
    time.sleep(1)  # Be respectful to arXiv servers

Important: The code above respects arXiv's servers by adding a 1-second delay between downloads. For larger downloads, consider using arXiv's bulk data access or API.

Extracting Text from PDFs

Once we had the PDFs, we extracted text using PyPDF2:

import PyPDF2
from pathlib import Path

def extract_paper_text(pdf_path):
    """
    Extract text from a PDF file.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Extracted text as a string
    """
    try:
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)

            # Extract text from all pages
            text = ""
            for page in reader.pages:
                text += page.extract_text()

            return text
    except Exception as e:
        print(f"Error extracting {pdf_path}: {e}")
        return None

def extract_all_papers(pdf_dir, output_dir):
    """
    Extract text from all PDFs in a directory.

    Args:
        pdf_dir: Directory containing PDF files
        output_dir: Directory to save extracted text files
    """
    pdf_dir = Path(pdf_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    pdf_files = list(pdf_dir.glob('*.pdf'))
    print(f"Found {len(pdf_files)} PDF files")

    success_count = 0
    for pdf_path in pdf_files:
        print(f"Extracting {pdf_path.name}...")

        text = extract_paper_text(pdf_path)

        if text:
            # Save as text file with same name
            output_path = output_dir / f"{pdf_path.stem}.txt"
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(text)

            word_count = len(text.split())
            print(f"  βœ“ Extracted {word_count:,} words")
            success_count += 1
        else:
            print(f"  βœ— Failed to extract")

    print(f"\nSuccessfully extracted {success_count}/{len(pdf_files)} papers")

# Example: Extract all papers
extract_all_papers('arxiv_pdfs', 'arxiv_fulltext_papers')

Paper Selection Process

We selected 20 papers from the 5,000-paper dataset used in the previous tutorial. The selection criteria were:

import pandas as pd
import numpy as np

# Load the original 5k dataset
df_5k = pd.read_csv('arxiv_papers_5k.csv')

# Select 4 papers from each category
categories = ['cs.CL', 'cs.CV', 'cs.DB', 'cs.LG', 'cs.SE']
selected_papers = []

np.random.seed(42)  # For reproducibility

for category in categories:
    # Get papers from this category
    category_papers = df_5k[df_5k['category'] == category]

    # Randomly sample 4 papers
    # In practice, we also checked that abstracts were substantial
    sampled = category_papers.sample(n=4, random_state=42)
    selected_papers.append(sampled)

# Combine all selected papers
df_selected = pd.concat(selected_papers, ignore_index=True)

# Save to new metadata file
df_selected.to_csv('arxiv_20papers_metadata.csv', index=False)
print(f"Selected {len(df_selected)} papers:")
print(df_selected['category'].value_counts().sort_index())

Text Quality Considerations

PDF extraction isn't perfect. Common issues include:

Formatting artifacts:

  • Extra spaces between words
  • Line breaks in unexpected places
  • Mathematical symbols rendered as Unicode
  • Headers/footers appearing in body text

Handling these issues:

def clean_extracted_text(text):
    """
    Basic cleaning for extracted PDF text.
    """
    # Remove excessive whitespace
    text = ' '.join(text.split())

    # Remove common artifacts (customize based on your PDFs)
    text = text.replace('ﬁ', 'fi')   # Common ligature character from PDF fonts
    text = text.replace('’', "'")    # Curly apostrophe encoding issue

    return text

# Apply cleaning when extracting
text = extract_paper_text(pdf_path)
if text:
    text = clean_extracted_text(text)
    # Now save cleaned text

We kept cleaning minimal for this tutorial to show realistic extraction results. In production, you might implement more aggressive cleaning depending on your PDF sources.

Why We Chose These 20 Papers

The 20 papers in this tutorial were selected to provide:

  1. Diversity across topics: 4 papers each from Machine Learning, Computer Vision, Computational Linguistics, Databases, and Software Engineering
  2. Variety in length: Papers range from 2,735 to 20,763 words
  3. Realistic content: Papers published in 2024-2025 with modern topics
  4. Successful extraction: All 20 papers extracted cleanly with readable text

This diversity ensures that chunking strategies are tested across different writing styles, document lengths, and technical domains rather than being optimized for a single type of content.


You now have all the code needed to prepare your own document chunking datasets. The same pattern works for any PDF collection: download, extract, clean, and chunk.

Key Reminders:

  • Both chunking strategies work well (100% success rate) with proper queries
  • Sentence-based requires 44% less storage (513 vs 914 chunks)
  • Sentence-based shows slightly better similarity distances
  • Fixed token provides more consistent sizes and finer granularity
  • Query quality matters more than chunking strategy
  • Rate limiting is normal production behavior, handle it gracefully
  • ChromaDB collection deletion is standard during experimentation
  • Edge cases (tiny chunks, variable sizes) are expected and usually fine
  • Evaluation frameworks transfer to any chunking strategy
  • Choose based on your constraints (storage, coherence, structure) not on "best practice"