
Running and Managing Apache Airflow with Docker (Part II)

8 November 2025 at 02:17

In the previous tutorial, we set up Apache Airflow inside Docker, explored its architecture, and built our first real DAG using the TaskFlow API. We simulated an ETL process with two stages — Extract and Transform, demonstrating how Airflow manages dependencies, task retries, and dynamic parallel execution through Dynamic Task Mapping. By the end, we had a functional, scalable workflow capable of processing multiple datasets in parallel, a key building block for modern data pipelines.

In this tutorial, we’ll build on what you created earlier and take a significant step toward production-style orchestration. You’ll complete the ETL lifecycle by adding the Load stage and connecting Airflow to a local MySQL database. This will allow you to load transformed data directly from your pipeline and manage database connections securely using Airflow’s Connections and Environment Variables.

Beyond data loading, you’ll integrate Git and Git Sync into your Airflow environment to enable version control, collaboration, and continuous deployment of DAGs. These practices mirror how data engineering teams manage Airflow projects in real-world settings, promoting consistency, reliability, and scalability, while still keeping the focus on learning and experimentation.

By the end of this part, your Airflow setup will move beyond a simple sandbox and start resembling a production-aligned environment. You’ll have a complete ETL pipeline, from extraction and transformation to loading and automation, and a clear understanding of how professional teams structure and manage their workflows.

Working with MySQL in Airflow

In the previous section, we built a fully functional Airflow pipeline that dynamically extracted and transformed market data from multiple regions: us, europe, asia, and africa. Each branch of our DAG handled its own extract and transform tasks independently, creating separate CSV files for each region under /opt/airflow/tmp. This setup mimics a real-world data engineering workflow where regional datasets are processed in parallel before being stored or analyzed further.

Now that our transformed datasets are generated, the next logical step is to load them into a database, a critical phase in any ETL pipeline. This not only centralizes your processed data but also allows for downstream analysis, reporting, and integration with BI tools like Power BI or Looker.

While production pipelines often write to cloud-managed databases such as Amazon RDS, Google Cloud SQL, or Azure Database for MySQL, we’ll keep things local and simple by using a MySQL instance on your machine. This approach allows you to test and validate your Airflow workflows without relying on external cloud resources or credentials. The same logic, however, can later be applied seamlessly to remote or cloud-hosted databases.

Prerequisite: Install and Set Up MySQL Locally

Before adding the Load step to our DAG, ensure that MySQL is installed and running on your machine.

Install MySQL

  • Windows/macOS: Download and install MySQL Community Server.

  • Linux (Ubuntu):

    sudo apt update
    sudo apt install mysql-server -y
    sudo systemctl start mysql
    
  • Verify installation by running:

mysql -u root -p

Create a Database and User for Airflow

Inside your MySQL terminal or MySQL Workbench, run the following commands:

CREATE DATABASE IF NOT EXISTS airflow_db;
CREATE USER IF NOT EXISTS 'airflow'@'%' IDENTIFIED BY 'airflow';
GRANT ALL PRIVILEGES ON airflow_db.* TO 'airflow'@'%';
FLUSH PRIVILEGES;

This creates a simple local database called airflow_db and a user airflow with full access, perfect for development and testing.
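Before wiring this database into Airflow, it can help to confirm that the new user can actually connect. The short script below is only a quick sanity check, not part of the pipeline; it assumes the airflow_db database and airflow/airflow credentials created above, and it requires the mysql-connector-python package (pip install mysql-connector-python) on your host machine.

# check_mysql.py: quick sanity check for the local MySQL setup
import mysql.connector

conn = mysql.connector.connect(
    host="localhost",      # run this directly on your machine, not inside Docker
    user="airflow",
    password="airflow",
    database="airflow_db",
    port=3306,
)
cursor = conn.cursor()
cursor.execute("SELECT DATABASE(), CURRENT_USER();")
print(cursor.fetchone())   # expected: ('airflow_db', 'airflow@...')
cursor.close()
conn.close()

If this prints the database and user without raising an error, MySQL is ready for the Load step.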

Create a Database and User for Airflow

Network Configuration for Linux Users

When running Airflow in Docker and MySQL locally on Linux, Docker containers can’t automatically access localhost.

To fix this, you need to make your local machine reachable from inside Docker.

Open your docker-compose.yaml file and add the following line under the x-airflow-common service definition:

extra_hosts:
  - "host.docker.internal:host-gateway"

This line creates a bridge that allows Airflow containers to communicate with your local MySQL instance using the hostname host.docker.internal.

Switching to LocalExecutor

In part one of this tutorial, we ran Airflow with the CeleryExecutor. By default, the Docker Compose file uses CeleryExecutor, which requires additional components such as Redis, Celery workers, and the Flower dashboard for distributed task execution.

Since we’re running everything on a single machine for this tutorial, we can simplify things by using LocalExecutor, which runs tasks in parallel on one machine and eliminates the need for an external queue or worker system.

Find this line in your docker-compose.yaml:

AIRFLOW__CORE__EXECUTOR: CeleryExecutor 

Change it to:

AIRFLOW__CORE__EXECUTOR: LocalExecutor

Removing Unnecessary Services

Because we’re no longer using Celery, we can safely remove related components from the configuration. These include Redis, airflow-worker, and Flower.

You can search for the following sections and delete them:

  • The entire redis service block.
  • The airflow-worker service (Celery’s worker).
  • The flower service (Celery monitoring dashboard).
  • Any AIRFLOW__CELERY__... lines inside environment blocks.

Extending the DAG with a Load Step

Now let’s extend our existing DAG to include the Load phase of the ETL process. We already created extract_market_data() and transform_market_data() in the first part of this tutorial. This new task will read each transformed CSV file and insert its data into a MySQL table.

Here’s our updated daily_etl_pipeline_airflow3 DAG with the new load_to_mysql() task.
You can also find the complete version of this DAG in the cloned repository (git@github.com:dataquestio/tutorials.git), inside the part-two/ folder under airflow-docker-tutorial.

def daily_etl_pipeline():

    @task
    def extract_market_data(market: str):
        ...

    @task
    def transform_market_data(raw_file: str):
        ...

    @task
    def load_to_mysql(transformed_file: str):
        """Load the transformed CSV data into a MySQL table."""
        import mysql.connector
        import os

        db_config = {
            "host": "host.docker.internal",  # enables Docker-to-local communication
            "user": "airflow",
            "password": "airflow",
            "database": "airflow_db",
            "port": 3306
        }

        df = pd.read_csv(transformed_file)

        # Derive the table name dynamically based on region
        table_name = f"transformed_market_data_{os.path.basename(transformed_file).split('_')[-1].replace('.csv', '')}"

        conn = mysql.connector.connect(**db_config)
        cursor = conn.cursor()

        # Create table if it doesn’t exist
        cursor.execute(f"""
            CREATE TABLE IF NOT EXISTS {table_name} (
                timestamp VARCHAR(50),
                market VARCHAR(50),
                company VARCHAR(255),
                price_usd DECIMAL(10, 2),
                daily_change_percent DECIMAL(10, 2)
            );
        """)

        # Insert records
        for _, row in df.iterrows():
            cursor.execute(
                f"""
                INSERT INTO {table_name} (timestamp, market, company, price_usd, daily_change_percent)
                VALUES (%s, %s, %s, %s, %s)
                """,
                tuple(row)
            )

        conn.commit()
        conn.close()
        print(f"[LOAD] Data successfully loaded into MySQL table: {table_name}")

    # Define markets to process dynamically
    markets = ["us", "europe", "asia", "africa"]

    # Dynamically create and link tasks
    raw_files = extract_market_data.expand(market=markets)
    transformed_files = transform_market_data.expand(raw_file=raw_files)
    load_to_mysql.expand(transformed_file=transformed_files)

dag = daily_etl_pipeline()

When you trigger this DAG, Airflow will automatically create three sequential tasks for each defined region (us, europe, asia, africa):

first extracting market data, then transforming it, and finally loading it into a region-specific MySQL table.

Create a Database and User for Airflow (2)

Each branch runs independently, so by the end of a successful run, your local MySQL database (airflow_db) will contain four separate tables, one for each region:

transformed_market_data_us
transformed_market_data_europe
transformed_market_data_asia
transformed_market_data_africa

Each table contains the cleaned and sorted dataset for its region, including company names, prices, and daily percentage changes.

Once your containers are running, open MySQL (via terminal or MySQL Workbench) and run:

SHOW TABLES;

Create a Database and User for Airflow (3)

You should see all four tables listed. Then, to inspect one of them, for example us, run:

SELECT * FROM transformed_market_data_us;

Create a Database and User for Airflow (4)

From above, we can see the dataset that Airflow extracted, transformed, and loaded for the U.S. market, confirming your pipeline has now completed all three stages of ETL: Extract → Transform → Load.

This integration demonstrates Airflow’s ability to manage data flow across multiple sources and databases seamlessly, a key capability in modern data engineering pipelines.


Previewing the Loaded Data in Airflow

By now, you’ve confirmed that your transformed datasets are successfully loaded into MySQL and that you can view them directly in MySQL Workbench or through a SQL client. But Airflow also provides a convenient way to query and preview this data right from the UI, using Connections and the SQLExecuteQueryOperator.

Connections in Airflow store the credentials and parameters needed to connect to external systems such as databases, APIs, or cloud services. Instead of hardcoding passwords or host details in your DAGs, you define a connection once in the Web UI and reference it securely using its conn_id.

To set this up:

  1. Open the Airflow Web UI
  2. Navigate to Admin → Connections → + Add a new record
  3. Fill in the following details:
    • Conn Id: local_mysql
    • Conn Type: MySQL
    • Host: host.docker.internal
    • Schema: airflow_db
    • Login: airflow
    • Password: airflow
    • Port: 3306

Note: These values must match the credentials you defined earlier when setting up your local MySQL instance.

Specifically, the database airflow_db, user airflow, and password airflow should already exist in your MySQL setup.

The host.docker.internal value ensures that your Airflow containers can communicate with MySQL running on your local machine.

  • Also note that when you use docker compose down -v, all volumes, including your Airflow connections, will be deleted. Always remember to re-add the connection afterward.

If your changes are not volume-related, you can safely shut down the containers using docker compose down (without -v), which preserves your existing connections and data.
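
A related tip: Airflow can also pick up connections from environment variables. Any variable named AIRFLOW_CONN_<CONN_ID> is interpreted as a connection URI, so as a rough sketch (using the same local credentials as above), you could add this line to the environment block of x-airflow-common in docker-compose.yaml and have local_mysql re-created automatically on every startup, even after a docker compose down -v:

AIRFLOW_CONN_LOCAL_MYSQL: "mysql://airflow:airflow@host.docker.internal:3306/airflow_db"

Connections defined this way won’t show up in the Admin → Connections list, but tasks and operators can still reference them by conn_id.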

Click Save to register the connection.

Now, Airflow knows how to connect to your MySQL database whenever a task specifies conn_id="local_mysql".
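
Stored connections aren’t limited to operators; your own @task functions can use them too, through provider hooks. The snippet below is a minimal sketch (assuming the MySQL provider is available in your Airflow image) showing how a hypothetical variant of load_to_mysql could pull its credentials from the local_mysql connection instead of the hardcoded db_config dictionary:

    from airflow.providers.mysql.hooks.mysql import MySqlHook

    @task
    def load_to_mysql_with_hook(transformed_file: str):
        """Hypothetical variant of load_to_mysql that reads credentials
        from the stored 'local_mysql' Airflow connection."""
        hook = MySqlHook(mysql_conn_id="local_mysql")
        conn = hook.get_conn()       # DB-API connection built from the stored credentials
        cursor = conn.cursor()
        cursor.execute("SELECT 1;")  # replace with the same CREATE TABLE / INSERT logic as before
        cursor.close()
        conn.close()

This keeps credentials out of your DAG code entirely, which matters once DAGs start living in a shared Git repository (as we’ll do later in this tutorial).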

Let’s create a simple SQL query task to preview the data we just loaded.


    from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

    @task
    def extract_market_data(market: str):
        ...

    @task
    def transform_market_data(raw_file: str):
        ...

    @task
    def load_to_mysql(transformed_file: str):
        ...

    # Preview task defined at the DAG level, alongside the @task functions
    preview_mysql = SQLExecuteQueryOperator(
        task_id="preview_mysql_table",
        conn_id="local_mysql",
        sql="SELECT * FROM transformed_market_data_us LIMIT 5;",
        do_xcom_push=True,  # makes query results viewable in Airflow’s XCom tab
    )

    # Define markets to process dynamically
    markets = ["us", "europe", "asia", "africa"]

    # Dynamically create and link tasks
    raw_files = extract_market_data.expand(market=markets)
    transformed_files = transform_market_data.expand(raw_file=raw_files)
    load_to_mysql.expand(transformed_file=transformed_files)

dag = daily_etl_pipeline()

Next, link this task to your DAG so that it runs after the loading process. Update the line load_to_mysql.expand(transformed_file=transformed_files) to this:

    load_to_mysql.expand(transformed_file=transformed_files) >> preview_mysql

When you trigger the DAG again (remember to shut down the containers with docker compose down before editing your DAGs, then bring them back up with docker compose up -d once you’ve saved your changes), Airflow will:

  1. Connect to your MySQL database using the stored connection credentials.
  2. Run the SQL query on the specified table.
  3. Display the first few rows of your data as a JSON result in the XCom view.

To see it:

  • Go to Grid View
  • Click on the preview_mysql_table task
  • Choose XCom from the top menu

Previewing the Loaded Data in Airflow (5)

You’ll see your data represented in JSON format, confirming that the integration works: Airflow not only orchestrates the workflow but can also interactively query and visualize your results without leaving the platform.

This makes it easy to verify that your ETL pipeline is functioning correctly end-to-end: extraction, transformation, loading, and now validation, all visible and traceable inside Airflow.

Git-Based DAG Management and CI/CD for Deployment (with git-sync)

At this stage, your local Airflow environment is complete: you’ve built a fully functional ETL pipeline that extracts, transforms, and loads regional market data into MySQL, and you’ve even validated the results directly from the Airflow UI.

Now it’s time to take the final step toward production readiness: managing your DAGs the way data engineering teams do in real-world systems, through Git-based deployment and continuous integration.

We’ll push our DAGs to a shared GitHub repository called airflow_dags, and connect Airflow to it using the git-sync container, which automatically keeps your DAGs in sync. This allows every team member (or environment) to pull from the same source, the Git repo, without manually copying files into containers.

Why Manage DAGs with Git

Every DAG is just a Python file, and like all code, it deserves version control. Storing DAGs in a Git repository brings the same advantages that software engineers rely on:

  • Versioning: track every change and roll back safely.
  • Collaboration: multiple developers can work on different workflows without conflict.
  • Reproducibility: every environment can pull identical DAGs from a single source.
  • Automation: changes sync automatically, eliminating manual uploads.

This structure makes Airflow easier to maintain and scales naturally as your pipelines grow in number and complexity.

Pushing Your DAGs to GitHub

To begin, create a public or private repository named airflow_dags (e.g., https://github.com/<your-username>/airflow_dags).

Then, in your project root (airflow-docker-tutorial), initialize Git and push your local dags/ directory:

git init
git remote add origin https://github.com/<your-username>/airflow_dags.git
git add dags/
git commit -m "Add Airflow ETL pipeline DAGs"
git branch -M main
git push -u origin main

Once complete, your DAGs live safely in GitHub, ready for syncing.

How git-sync Works

git-sync is a lightweight sidecar container that continuously clones and updates a Git repository into a shared volume.

Once running, it:

  • Clones your repository (e.g., https://github.com/<your-username>/airflow_dags.git),
  • Pulls updates every 30 seconds by default,
  • Exposes the latest DAGs to Airflow automatically, no rebuilds or restarts required.

This is how Airflow stays up to date with your Git repo in real time.

Setting Up git-sync in Docker Compose

In your existing docker-compose.yaml, you already have a list of services that define your Airflow environment, like the api-server, scheduler, triggerer, and dag-processor. Each of these runs in its own container but works together as part of the same orchestration system.

The git-sync container will become another service in this list, just like those, but with a very specific purpose:

  • to keep your /dags folder continuously synchronized with your remote GitHub repository.

Instead of copying Python DAG files manually or rebuilding containers every time you make a change, the git-sync service will automatically pull updates from your GitHub repo (in our case, airflow_dags) into a shared volume that all Airflow services can read from.

This ensures that your environment always runs the latest DAGs from GitHub ,without downtime, restarts, or manual synchronization.

Remember in our docker-compose.yaml file, we had this kind of setup:

Setting Up Git in Docker Compose

Now, we’ll extend that structure by introducing git-sync as an additional service within the same services: section, and by adding an entry to the volumes: section (alongside the existing postgres-db-volume:, we also have to add airflow-dags-volume: so the same volume is shared across all containers).

Below is a configuration that works seamlessly with Docker on any OS:

services:
  git-sync:
    image: registry.k8s.io/git-sync/git-sync:v4.1.0
    user: "0:0"    # run as root so it can create /dags/git-sync
    restart: always
    environment:
      GITSYNC_REPO: "https://github.com/<your-username>/airflow_dags.git"
      GITSYNC_BRANCH: "main"           # use BRANCH not REF
      GITSYNC_PERIOD: "30s"
      GITSYNC_DEPTH: "1"
      GITSYNC_ROOT: "/dags/git-sync"
      GITSYNC_DEST: "repo"
      GITSYNC_LINK: "current"
      GITSYNC_ONE_TIME: "false"
      GITSYNC_ADD_USER: "true"
      GITSYNC_CHANGE_PERMISSIONS: "1"
      GITSYNC_STALE_WORKTREE_TIMEOUT: "24h"
    volumes:
      - airflow-dags-volume:/dags
    healthcheck:
      test: ["CMD-SHELL", "test -L /dags/git-sync/current && test -d /dags/git-sync/current/dags && [ \"$(ls -A /dags/git-sync/current/dags 2>/dev/null)\" ]"]
      interval: 10s
      timeout: 3s
      retries: 10
      start_period: 10s

volumes:
  airflow-dags-volume:

In this setup, the git-sync service runs as a lightweight companion container that keeps your Airflow DAGs in sync with your GitHub repository.

The GITSYNC_REPO variable tells it where to pull code from, in this case, your DAG repository (airflow_dags). Make sure you replace <your-username> with your exact GitHub username. The GITSYNC_BRANCH specifies which branch to track, usually main, while GITSYNC_PERIOD defines how often to check for updates. Here, it’s set to every 30 seconds, meaning Airflow will always be within half a minute of your latest Git push.

The synchronization happens inside the directory defined by GITSYNC_ROOT, which becomes /dags/git-sync inside the container. Inside that root, GITSYNC_DEST defines where the repo is cloned (as repo), and GITSYNC_LINK creates a symbolic link called current pointing to the active clone.

This design allows Airflow to always reference a stable, predictable path (/dags/git-sync/current/dags) even as the repository updates in the background, no path changes, no reloads.

A few environment flags ensure stability and portability across systems. For instance, GITSYNC_ADD_USER and GITSYNC_CHANGE_PERMISSIONS make sure the synced files are accessible to Airflow even when permissions differ across Docker environments.

GITSYNC_DEPTH limits the clone to just the latest commit (keeping it lightweight), while GITSYNC_STALE_WORKTREE_TIMEOUT helps clean up old syncs if something goes wrong.

The shared volume, airflow-dags-volume, acts as the bridge between git-sync and Airflow. It stores all synced DAGs in one central location accessible by both containers. The health check at the end ensures that git-sync is functioning: it verifies that the /current/dags directory exists and contains files before Airflow tries to load them.

Finally, the healthcheck section ensures that Airflow doesn’t start until git-sync has successfully cloned your repository. It runs a small shell command that checks three things: whether the symbolic link /dags/git-sync/current exists, whether the dags directory is present inside it, and whether that directory actually contains files. Only when all these conditions pass does Docker mark the git-sync service as healthy. The interval and retry parameters control how often and how long these checks run, ensuring that Airflow’s scheduler, api-server, and other components wait until the DAGs are fully available. This simple step prevents race conditions and guarantees a smooth startup every time.

Together, these settings ensure that your Airflow instance always runs the latest DAGs from GitHub, automatically, securely, and without manual file transfers.

Generally, this configuration does the following:

  • Creates a shared volume (airflow-dags-volume) where the DAGs are cloned.
  • Mounts it into both git-sync and Airflow services.
  • Runs git-sync as root to fix permission issues on Windows.
  • Keeps DAGs up to date every 30 seconds.

Adjusting the Airflow Configuration

We’ve now added git-sync as part of our Airflow services, sitting right alongside the api-server, scheduler, triggerer, and dag-processor.

This new service continuously pulls our DAGs from GitHub and stores them inside a shared volume (airflow-dags-volume) that both git-sync and Airflow can access.

However, our Airflow setup still expects to find DAGs through local directory mounts defined under each service (via x-airflow-common), not global named volumes. The default configuration maps these paths as follows:

volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
    - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
    - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
    - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins

This setup points Airflow to the local dags/ folder in your host machine, but now that we have git-sync, our DAGs will live inside a synchronized Git directory instead.

So we need to update the DAG volume mapping to pull from the new shared Git volume instead of the local one.

Replace the first line under the volumes: section, - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags, with:

- airflow-dags-volume:/opt/airflow/dags

This tells Docker to mount the shared airflow-dags-volume (created by git-sync) into Airflow’s /opt/airflow/dags directory.

That way, any DAGs pulled by git-sync from your GitHub repository will immediately appear inside Airflow’s working environment, without needing to rebuild or copy files.

We also need to explicitly tell Airflow where the synced DAGs live.

In the environment section of your x-airflow-common block, add the following:

AIRFLOW__CORE__DAGS_FOLDER: /opt/airflow/dags/git-sync/current/dags

This line links Airflow directly to the directory created by the git-sync container.

Here’s how it connects:

  • Inside the git-sync configuration, we defined:

    GITSYNC_ROOT: "/dags/git-sync"
    GITSYNC_LINK: "current"

    Together, these ensure that the most recent repository clone is always available under /dags/git-sync/current.

  • When we mount airflow-dags-volume:/opt/airflow/dags, this path becomes accessible inside the Airflow containers as

    /opt/airflow/dags/git-sync/current/dags.

By setting AIRFLOW__CORE__DAGS_FOLDER to that exact path, Airflow automatically watches the live Git-synced DAG directory for changes, meaning every new commit to your GitHub repo will reflect instantly in the Airflow UI.

Finally, ensure that Airflow waits for git-sync to finish cloning before starting up.

In the depends_on section of each Airflow service (airflow-scheduler, airflow-apiserver, dag-processor, and triggerer), add:

depends_on:
  git-sync:
    condition: service_healthy

This guarantees that Airflow only starts once the git-sync container has successfully pulled your repository, preventing race conditions during startup.

Once complete, Airflow will read its DAGs directly from the synchronized Git directory, /opt/airflow/dags/git-sync/current/dags, instead of your local project folder.

This change transforms your setup into a live, Git-driven workflow, where Airflow continuously tracks and loads the latest DAGs from GitHub automatically.

Automating Validation with GitHub Actions

Our Git integration wouldn’t be truly powerful without CI/CD (Continuous Integration and Continuous Deployment).

While git-sync ensures that any change pushed to GitHub automatically reflects in Airflow, that alone can be risky: not every change should make it to production immediately.

Imagine pushing a DAG with a missing import, a syntax error, or a bad dependency.

Airflow might fail to parse it, causing your scheduler or api-server to crash or restart repeatedly. That’s why we need a safety net, a way to automatically check that every DAG in our repository is valid before it ever reaches Airflow.

This is exactly where GitHub Actions comes in.

We can set up a lightweight CI pipeline that validates all DAGs whenever someone pushes to the main branch. If a broken DAG is detected, the pipeline fails, preventing the merge and protecting your Airflow environment from unverified code.

GitHub also provides notifications directly in your repository interface, showing failed workflows and highlighting the cause of the issue.

Inside your airflow_dags repository, create a GitHub Actions workflow file at:

.github/workflows/validate-dags.yml

name: Validate Airflow DAGs

on:
  push:
    branches: [ main ]
    paths:
      - 'dags/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install Airflow
        run: pip install apache-airflow==3.1.1

      - name: Validate DAGs
        run: |
          echo "Validating DAG syntax..."
          airflow dags list || exit 1

This simple workflow automatically runs every time you push a new commit to the main branch (or modify anything in the dags/ directory).

It installs Apache Airflow in a lightweight test environment, loads all your DAGs, and checks that they parse successfully, no missing imports, syntax issues, or circular dependencies.

If even one DAG fails to load, the validation job will exit with an error, causing the GitHub Actions pipeline to fail.

GitHub then immediately notifies you (and any collaborators) through the repository’s Actions tab, issue alerts, and optional email notifications.

By doing this, you’re adding a crucial layer of protection to your workflow:

  • Pre-deployment safety: invalid DAGs never reach your running Airflow instance.
  • Automatic feedback: failed DAGs trigger GitHub notifications, allowing you to fix errors early.
  • Confidence in deployment: once the pipeline passes, you know every DAG is production-ready.
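
If you want an even stricter check, the airflow dags list step can be swapped for a small parsing script built on Airflow’s DagBag. The sketch below is an optional variant, not part of the repository; it assumes it is run from the repository root and that any libraries your DAGs import at module level (for example pandas) are installed in the CI environment:

# validate_dags.py: optional stricter check using Airflow's DagBag
import sys

from airflow.models import DagBag

# Parse only the repository's dags/ folder, skipping Airflow's bundled examples
dag_bag = DagBag(dag_folder="dags", include_examples=False)

if dag_bag.import_errors:
    for path, error in dag_bag.import_errors.items():
        print(f"FAILED to import {path}:\n{error}")
    sys.exit(1)

print(f"All {len(dag_bag.dags)} DAG(s) parsed successfully.")

You would then call python validate_dags.py in the Validate DAGs step, and the job fails whenever any DAG file cannot be imported.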

Together, this CI validation and your git-sync setup create a self-updating, automated Airflow environment that mirrors production deployment practices.

With this final step, your Airflow environment becomes a versioned, automated, and production-ready orchestration system, capable of handling real data pipelines the way modern engineering teams do.

You’ve now completed a full transformation:

from local DAG development to automated, Git-driven deployment, all within Docker, all powered by Apache Airflow.

  • Note that both the git-sync service and the Airflow UI depend on your Docker containers running. As long as your containers are up, git-sync remains active, continuously checking for updates in your GitHub repository and syncing any new DAGs to your Airflow environment.

    Once you stop or shut down the containers (docker compose down), this synchronization pauses. You also won’t be able to access the Airflow Web UI or trigger DAGs until the containers are started again.

    When you restart with docker compose up -d, everything, including git-sync, resumes automatically, picking up the latest changes from GitHub and restoring your full Airflow setup just as you left it.

Summary and Up Next

In this tutorial, you completed the ETL lifecycle in Apache Airflow by adding the Load phase to your workflow and connecting it to a local MySQL database. You learned how Airflow securely manages external connections, dynamically handles multiple data regions, and enables in-UI data previews through XCom and Connections.

You also took your setup a step closer to production by integrating Git-based DAG management with git-sync, and implementing GitHub Actions CI to validate DAGs automatically before deployment.

Together, these changes transformed your environment into a version-controlled, automated orchestration system that mirrors the structure of production-grade setups, a final step before deploying to the cloud.

In the next tutorial, you’ll move beyond simulated data and build a real-world data pipeline, extracting data from an API, transforming it with Python, and loading it into MySQL. You’ll also add retries, alerts, and monitoring, and deploy the full workflow through CI/CD, achieving a truly end-to-end, production-grade Airflow setup.

Running and Managing Apache Airflow with Docker (Part I)

7 November 2025 at 22:50

In the last tutorial, we explored what workflow orchestration is, why it matters, and how Apache Airflow structures, automates, and monitors complex data pipelines through DAGs, tasks, and the scheduler. We examined how orchestration transforms scattered scripts into a coordinated system that ensures reliability, observability, and scalability across modern data workflows.

In this two-part hands-on tutorial, we move from theory to practice. You’ll run Apache Airflow inside Docker, the most efficient and production-like way to deploy Airflow for development and testing. This containerized approach mirrors how Airflow operates in real-world environments, from on-premises teams to managed services like Amazon MWAA and Cloud Composer.

In Part One, our focus goes beyond setup. You’ll learn how to work effectively with DAGs inside a Dockerized Airflow environment, writing, testing, visualizing, and managing them through the Web UI. You’ll use the TaskFlow API to build clean, Pythonic workflows and implement dynamic task mapping to run multiple processes in parallel. By the end of this part, you’ll have a fully functional Airflow environment running in Docker and a working DAG that extracts and transforms data automatically, the foundation of modern data engineering pipelines.

In Part Two, we’ll extend that foundation to handle data management and automation workflows. You’ll connect Airflow to a local MySQL database for data loading, manage credentials securely through the Admin panel and environment variables, and integrate Git with Git Sync to enable version control and continuous deployment. You’ll also see how CI/CD pipelines can automate DAG validation and deployment, ensuring your Airflow environment remains consistent and collaborative across development teams.

By the end of the series, you’ll not only understand how Airflow runs inside Docker but also how to design, orchestrate, and manage production-grade data pipelines the way data engineers do in real-world systems.

Why Use Docker for Airflow

While Airflow can be installed locally with pip install apache-airflow, this approach often leads to dependency conflicts, version mismatches, and complicated setups. Airflow depends on multiple services: an API server, scheduler, triggerer, metadata database, and dag-processor, all of which must communicate correctly. Installing and maintaining these manually on your local machine can be tedious and error-prone.

Docker eliminates these issues by packaging everything into lightweight, isolated containers. Each container runs a single Airflow component, but all work together seamlessly through Docker Compose. The result is a clean, reproducible environment that behaves consistently across operating systems.

In short:

  • Local installation: works for testing but often breaks due to dependency conflicts or version mismatches.
  • Cloud-managed services (like AWS ECS or Cloud Composer): excellent for production but less flexible for learning or prototyping.
  • Docker setup: combines realism with simplicity, providing the same multi-service environment used in production without the overhead of manual configuration.

The Docker setup is ideal for learning and development and closely mirrors production environments, but additional configuration is needed for a full production deployment.

Prerequisites

Before you begin, ensure the following are installed and ready on your system:

  1. Docker Desktop – Required to build and run Airflow containers.
  2. A code editor – Visual Studio Code or similar, for writing and editing DAGs.
  3. Python 3.10 or higher – Used for authoring Airflow DAGs and helper scripts.

Running Airflow Using Docker

Now that your environment is ready (Docker is open and running), let’s get Airflow running using Docker Compose.

This tool orchestrates all Airflow services (api-server, scheduler, triggerer, database, and workers) so they start and communicate properly.

Clone the Tutorial Repository

We’ve already prepared the starter files you’ll need for this tutorial on GitHub.

Begin by cloning the repository:

git clone git@github.com:dataquestio/tutorials.git

Then navigate to the Airflow tutorial folder:

cd tutorials/airflow-docker-tutorial

This is the directory where you’ll be working throughout the tutorial.

Inside, you’ll notice a structure similar to this:

airflow-docker-tutorial/
├── part-one/  
├── part-two/
├── docker-compose.yaml
└── README.md

  • The part-one/ and part-two/ folders contain the complete reference files for both tutorials (Part One and Part Two).

    You don’t need to modify anything there, it’s only for comparison or review.

  • The docker-compose.yaml file is your starting point and will evolve as the tutorial progresses.

Explore the Docker Compose File

Open the docker-compose.yaml file in your code editor.

This file defines all the Airflow components and how they interact inside Docker.

It includes:

  • api-server – Airflow’s web user interface
  • Scheduler – Parses and triggers DAGs
  • Triggerer – Manages deferrable tasks efficiently
  • Metadata database – Tracks DAG runs and task states
  • Executors – Execute tasks

Each of these services runs in its own container, but together they form a single working Airflow environment.

You’ll be updating this file as you move through the tutorial to configure, extend, and manage your Airflow setup.

Create Required Folders

Airflow expects certain directories to exist before launching.

Create them inside the same directory as your docker-compose.yaml file:

mkdir -p ./dags ./logs ./plugins ./config

  • dags/ – your workflow scripts
  • logs/ – task execution logs
  • plugins/ – custom hooks and operators
  • config/ – optional configuration overrides (this will be auto-populated later when initializing the database)

Configure User Permissions

If you’re using Linux, set a user ID to prevent permission issues when Docker writes files locally:

echo -e "AIRFLOW_UID=$(id -u)" > .env

If you’re using macOS or Windows, manually create a .env file in the same directory with the following content:

AIRFLOW_UID=50000

This ensures consistent file ownership between your host system and the Docker containers.

Initialize the Metadata Database

Airflow keeps track of DAG runs, task states, and configurations in a metadata database.

Initialize it by running:

docker compose up airflow-init

Once initialization completes, you’ll see a message confirming that an admin user has been created with default credentials:

  • Username: airflow
  • Password: airflow

Start Airflow

Now start all Airflow services in the background:

docker compose up -d

Docker Compose will spin up the scheduler, API server, triggerer, database, and worker containers.


Start Airflow

Make sure the triggerer, dag-processor, scheduler, and api-server are shown as started as above. If that is not the case, rebuild the Docker container, since the build process might have been interrupted. Otherwise, navigate to http://localhost:8080 to access the Airflow UI exposed by the api-server.

You can also access this through your Docker app, by navigating to containers:

Start Airflow (2)

Log in using the credentials above to access the Airflow Web UI.

  • If the UI fails to load or some containers keep restarting, increase Docker’s memory allocation to at least 4 GB (8 GB recommended) in Docker Desktop → Settings → Resources.

Configuring the Airflow Project

Once Airflow is running and you visit http://localhost:8080, you’ll see the Airflow Web UI.

Configuring the Airflow Project

This is the command center for your workflows, where you can visualize DAGs, monitor task runs, and manage system configurations. When you navigate to Dags, you’ll see a dashboard that lists several example DAGs provided by the Airflow team. These are sample workflows meant to demonstrate different operators, sensors, and features.

However, for this tutorial, we’ll build our own clean environment, so we’ll remove these example DAGs and customize our setup to suit our project.

Before doing that, though, it’s important to understand the docker-compose.yaml file, since this is where your Airflow environment is actually defined.

Understanding the docker-compose.yaml File

The docker-compose.yaml file tells Docker how to build, connect, and run all the Airflow components as containers.

If you open it, you’ll see multiple sections that look like this:

Understanding the Docker Compose File

Let’s break this down briefly:

  • x-airflow-common – This is the shared configuration block that all Airflow containers inherit from. It defines the base Docker image (apache/airflow:3.1.0), key environment variables, and mounted volumes for DAGs, logs, and plugins. It also specifies user permissions to ensure that files created inside the containers are accessible from your host machine. The depends_on lists dependencies such as the PostgreSQL database used to store Airflow metadata. In short, this section sets up the common foundation for every container in your environment.
  • services – This section defines the actual Airflow components that make up your environment. Each service, such as the api-server, scheduler, triggerer, dag-processor, and metadata database, runs as a separate container but uses the shared configuration from x-airflow-common. Together, they form a complete Airflow deployment where each container plays a specific role.
  • volumes – This section sets up persistent storage for containers. Airflow uses it by default for the Postgres database, keeping your DAGs, logs, and configurations saved across runs. In part 2, we’ll expand it to include Git integration.

Each of these sections works together to create a unified Airflow environment that’s easy to configure, extend, or simplify as needed.

Understanding these parts now will make the next steps - cleaning, customizing, and managing your Airflow setup - much clearer.

Resetting the Environment Before Making Changes

Before editing anything inside the docker-compose.yaml, it’s crucial to shut down your containers cleanly to avoid conflicts.

Run: docker compose down -v

Here’s what this does:

  • docker compose down stops and removes all containers.
  • The -v flag removes volumes, which clears stored metadata, logs, and configurations.

    This ensures that you start with a completely fresh environment the next time you launch Airflow — which can be helpful when your environment becomes misconfigured or broken. However, you shouldn’t do this routinely after every DAG or configuration change, as it will also remove your saved Connections, Variables, and other stateful data. In most cases, you can simply run docker compose down instead to stop the containers without wiping the environment.

Disabling Example DAGs

By default, Airflow loads several example DAGs to help new users explore its features. For our purposes, we want a clean workspace that only shows our own DAGs.

  1. Open the docker-compose.yaml file in your code editor.
  2. Locate the environment section under x-airflow-common and find this line: AIRFLOW__CORE__LOAD_EXAMPLES: 'true'. Change 'true' to 'false': AIRFLOW__CORE__LOAD_EXAMPLES: 'false'

This setting tells Airflow not to load any of the example workflows when it starts.

Once you’ve made the changes:

  1. Save your docker-compose.yaml file.
  2. Rebuild and start your Airflow environment again: docker compose up -d
  3. Wait a few moments, then visit http://localhost:8080 again.

This time, when you log in, you’ll notice the example DAGs are gone, leaving you with a clean workspace ready for your own workflows.

Disabling Example DAGs

Let’s now build our first DAG.

Working with DAGs in Airflow

Now that your Airflow environment is clean and running, it’s time to create our first real workflow.

This is where you begin writing DAGs (Directed Acyclic Graphs), which sit at the very heart of how Airflow operates.

A DAG is more than just a piece of code, it’s a visual and logical representation of your workflow, showing how tasks connect, when they run, and in what order.

Each task in a DAG represents a distinct step in your process, such as pulling data, cleaning it, transforming it, or loading it into a database. In this tutorial we will create tasks that extract and transform data. We will see the loading process in part two, along with how Airflow integrates with Git.

Airflow ensures these tasks execute in the correct order without looping back on themselves (that’s what acyclic means).

Setting Up Your DAG File

Let’s start by creating the foundation of our workflow (make sure to shut down the running containers by using docker compose down -v).

Open your airflow-docker-tutorial project folder and, inside the dags/ directory, create a new file named:

our_first_dag.py

Every .py file you place in this folder becomes a workflow that Airflow can recognize and manage automatically.

You don’t need to manually register anything, Airflow continuously scans this directory and loads any valid DAGs it finds.

At the top of our file, let’s import the core libraries we need for our project:

from airflow.decorators import dag, task
from datetime import datetime, timedelta
import pandas as pd
import random
import os

Let’s pause to understand what each of these imports does and why they matter:

  • dag and task come from Airflow’s TaskFlow API.

    These decorators turn plain Python functions into Airflow-managed tasks, giving you cleaner, more intuitive code while Airflow handles orchestration behind the scenes.

  • datetime and timedelta handle scheduling logic.

    They help define when your DAG starts and how frequently it runs.

  • pandas, random, and os are standard Python libraries we’ll use to simulate a simple ETL process, generating, transforming, and saving data locally.

This setup might seem minimal, but it’s everything you need to start orchestrating real tasks.

Defining the DAG Structure

With our imports ready, the next step is to define the skeleton of our DAG, its blueprint.

Think of this as defining when and how your workflow runs.

default_args = {
    "owner": "Your name",
    "retries": 3,
    "retry_delay": timedelta(minutes=1),
}

@dag(
    dag_id="daily_etl_pipeline_airflow3",
    description="ETL workflow demonstrating dynamic task mapping and assets",
    schedule="@daily",
    start_date=datetime(2025, 10, 29),
    catchup=False,
    default_args=default_args,
    tags=["airflow3", "etl"],
)
def daily_etl_pipeline():
    ...

dag = daily_etl_pipeline()

Let’s break this down carefully:

  • default_args

    This dictionary defines shared settings for all tasks in your DAG.

    Here, each task will automatically retry up to three times with a one-minute delay between attempts, a good practice when your tasks depend on external systems like APIs or databases that can occasionally fail.

  • The @dag decorator

    This tells Airflow that everything inside the daily_etl_pipeline() function (which can be given any name) belongs to one cohesive workflow.

    It defines:

    • schedule="@daily" → when the DAG should run.
    • start_date → the first execution date.
    • catchup=False → prevents Airflow from running past-due DAGs automatically.
    • tags → helps you categorize DAGs in the UI.
  • The daily_etl_pipeline() function

    This is the container for your workflow logic, it’s where you’ll later define your tasks and how they depend on one another.

    Think of it as the “script” that describes what happens in each run of your DAG.

  • dag = daily_etl_pipeline()

    This single line instantiates the DAG. It’s what makes your workflow visible and schedulable inside Airflow.

This structure acts as the foundation for everything that follows.

If we think of a DAG as a movie script, this section defines the production schedule and stage setup before the actors (tasks) appear.

Creating Tasks with the TaskFlow API

Now it’s time to define the stars of our workflow, the tasks.

Tasks are the actual units of work that Airflow runs. Each one performs a specific action, and together they form your complete data pipeline.

Airflow’s TaskFlow API makes this remarkably easy: you simply decorate ordinary Python functions with @task, and Airflow takes care of converting them into fully managed, trackable workflow steps.

We’ll start with two tasks:

  • Extract → simulates pulling or generating data.
  • Transform → processes and cleans the extracted data.

(We’ll add the Load step in the next part of this tutorial.)

Extract Task — Generating Fake Data

@task
def extract_market_data():
    """
    Simulate extracting market data for popular companies.
    This task mimics pulling live stock prices or API data.
    """
    companies = ["Apple", "Amazon", "Google", "Microsoft", "Tesla", "Netflix", "NVIDIA", "Meta"]

    # Simulate today's timestamped price data
    records = []
    for company in companies:
        price = round(random.uniform(100, 1500), 2)
        change = round(random.uniform(-5, 5), 2)
        records.append({
            "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "company": company,
            "price_usd": price,
            "daily_change_percent": change,
        })

    df = pd.DataFrame(records)
    os.makedirs("/opt/airflow/tmp", exist_ok=True)
    raw_path = "/opt/airflow/tmp/market_data.csv"
    df.to_csv(raw_path, index=False)

    print(f"[EXTRACT] Market data successfully generated at {raw_path}")
    return raw_path

Let’s unpack what’s happening here:

  • The function simulates the extraction phase of an ETL pipeline by generating a small, timestamped dataset of popular companies and their simulated market prices.
  • Each record includes a company name, current price in USD, and a randomly generated daily percentage change, mimicking what you’d expect from a real API response or financial data feed.
  • The data is stored in a CSV file inside /opt/airflow/tmp, a shared directory accessible from within your Docker container, this mimics saving raw extracted data before it’s cleaned or transformed.
  • Finally, the function returns the path to that CSV file. This return value becomes crucial because Airflow automatically treats it as the output of this task. Any downstream task that depends on it, for example, a transformation step, can receive it as an input automatically.

In simpler terms, Airflow handles the data flow for you. You focus on defining what each task does, and Airflow takes care of passing outputs to inputs behind the scenes, ensuring your pipeline runs smoothly and predictably.

Transform Task — Cleaning and Analyzing Market Data

@task
def transform_market_data(raw_file: str):
    """
    Clean and analyze extracted market data.
    This task simulates transforming raw stock data
    to identify the top gainers and losers of the day.
    """
    df = pd.read_csv(raw_file)

    # Clean: ensure numeric fields are valid
    df["price_usd"] = pd.to_numeric(df["price_usd"], errors="coerce")
    df["daily_change_percent"] = pd.to_numeric(df["daily_change_percent"], errors="coerce")

    # Sort companies by daily change (descending = top gainers)
    df_sorted = df.sort_values(by="daily_change_percent", ascending=False)

    # Select top 3 gainers and bottom 3 losers
    top_gainers = df_sorted.head(3)
    top_losers = df_sorted.tail(3)

    # Save transformed files
    os.makedirs("/opt/airflow/tmp", exist_ok=True)
    gainers_path = "/opt/airflow/tmp/top_gainers.csv"
    losers_path = "/opt/airflow/tmp/top_losers.csv"

    top_gainers.to_csv(gainers_path, index=False)
    top_losers.to_csv(losers_path, index=False)

    print(f"[TRANSFORM] Top gainers saved to {gainers_path}")
    print(f"[TRANSFORM] Top losers saved to {losers_path}")

    return {"gainers": gainers_path, "losers": losers_path}

Let’s unpack what this transformation does and why it’s important:

  • The function begins by reading the extracted CSV file produced by the previous task (extract_market_data). This is our “raw” dataset.
  • Next, it cleans the data, converting prices and percentage changes into numeric formats, a vital first step before analysis, since raw data often arrives as text.
  • It then sorts companies by their daily percentage change, allowing us to quickly identify which ones gained or lost the most value during the day.
  • Two smaller datasets are then created: one for the top gainers and one for the top losers, each saved as separate CSV files in the same temporary directory.
  • Finally, the task returns both file paths as a dictionary, allowing any downstream task (for example, a visualization or database load step) to easily access both datasets.

This transformation demonstrates how Airflow tasks can move beyond simple sorting; they can perform real business logic, generate multiple outputs, and return structured data to other steps in the workflow.

At this point, your DAG has two working tasks:

  • Extract — to simulate data collection
  • Transform — to clean and analyze that data

When Airflow runs this workflow, it will execute them in order:

Extract → Transform

Now that both the Extract and Transform tasks are defined inside your DAG, let’s see how Airflow links them together when you call them in sequence.

Inside your daily_etl_pipeline() function, add these two lines to establish the task order:

raw = extract_market_data()
transformed = transform_market_data(raw)

When Airflow parses the DAG, it doesn’t see these as ordinary Python calls, it reads them as task relationships.

The TaskFlow API automatically builds a dependency chain, so Airflow knows that extract_market_data must complete before transform_market_data begins.

Notice that we’ve assigned extract_market_data() to a variable called raw. This variable represents the output of the first task, in our case, the path to the extracted data file. The next line, transform_market_data(raw), then takes that output and uses it as input for the transformation step.

This pattern makes the workflow clear and logical: data is extracted, then transformed, with Airflow managing the sequence automatically behind the scenes.

This is how Airflow builds the workflow graph internally: by reading the relationships you define through function calls.
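
When a downstream task doesn’t consume an upstream result, there is no data flow for Airflow to infer from, and you declare the ordering explicitly with the >> operator instead. Here is a minimal sketch (send_notification is a hypothetical task, not part of our DAG):

raw = extract_market_data()
transformed = transform_market_data(raw)   # dependency inferred from the data flow

notify = send_notification()               # hypothetical task with no inputs
transformed >> notify                      # explicit ordering: notify runs after the transform

Both styles can be mixed freely inside the same DAG function.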

Visualizing the Workflow in the Airflow UI

Once you’ve saved your DAG file with both tasks, Extract and Transform, it’s time to bring it to life. Start your Airflow environment using:

docker compose up -d

Then open your browser and navigate to: http://localhost:8080

You’ll be able to see the Airflow Home page, this time with the DAG we just created: daily_etl_pipeline_airflow3.

Visualizing the Workflow in the Airflow UI

Click on it to open the DAG details, then trigger a manual run using the Play button.

The task currently running will turn blue, and once it completes successfully, it will turn green.

Visualizing the Workflow in the Airflow UI (2)

On the graph view, you will also see two tasks: extract_market_data and transform_market_data , connected in sequence showing success in each.

Visualizing the Workflow in the Airflow UI (3)

If a task encounters an issue, Airflow will automatically retry it up to three times (as defined in default_args). If it continues to fail after all retries, it will appear red, indicating that the task, and therefore the DAG run, has failed.

Inspecting Task Logs

Click on any task box (for example, transform_market_data), then click Task Instances.

Inspecting Task Logs

All DAG runs for the selected task will be listed here. Click on the latest run. This will open a detailed log of the task’s execution, an invaluable feature for debugging and understanding what’s happening under the hood.

In your log, you’ll see:

  • The [EXTRACT] or [TRANSFORM] tags you printed in the code.
  • Confirmation messages showing where your files were saved, e.g.:

    Inspecting Task Logs (2)

    Inspecting Task Logs (3)

These messages prove that your tasks executed correctly and help you trace your data through each stage of the pipeline.

Dynamic Task Mapping

As data engineers, we rarely process just one dataset; we usually work with many sources at once.

For example, instead of analyzing one market, you might process stock data from multiple exchanges or regions simultaneously.

In our current DAG, the extraction and transformation handle only a single dataset.

But what if we wanted to repeat that same process for several markets, say, us, europe, asia, and africa, all in parallel?

Writing a separate task for each region would make our DAG repetitive and hard to maintain.

That’s where Dynamic Task Mapping comes in.

It allows Airflow to create parallel tasks automatically at runtime based on input data such as lists, dictionaries, or query results.

Before editing the DAG, stop any running containers to ensure Airflow picks up your changes cleanly:

docker compose down -v

Now, extend your existing daily_etl_pipeline_airflow3 to handle multiple markets dynamically:

def daily_etl_pipeline():

    @task
    def extract_market_data(market: str):
        ...
    @task
    def transform_market_data(raw_file: str):
      ...

    # Define markets to process dynamically
    markets = ["us", "europe", "asia", "africa"]

    # Dynamically create parallel tasks
    raw_files = extract_market_data.expand(market=markets)
    transformed_files = transform_market_data.expand(raw_file=raw_files)

dag = daily_etl_pipeline()

By using .expand(), Airflow automatically generates multiple parallel task instances from a single function. You’ll notice the argument market passed into the extract_market_data() function. For that to work effectively, here’s the updated version of the extract_market_data() function:

@task
def extract_market_data(market: str):
        """Simulate extracting market data for a given region or market."""
        companies = ["Apple", "Amazon", "Google", "Microsoft", "Tesla", "Netflix"]
        records = []
        for company in companies:
            price = round(random.uniform(100, 1500), 2)
            change = round(random.uniform(-5, 5), 2)
            records.append({
                "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
                "market": market,
                "company": company,
                "price_usd": price,
                "daily_change_percent": change,
            })

        df = pd.DataFrame(records)
        os.makedirs("/opt/airflow/tmp", exist_ok=True)
        raw_path = f"/opt/airflow/tmp/market_data_{market}.csv"
        df.to_csv(raw_path, index=False)
        print(f"[EXTRACT] Market data for {market} saved at {raw_path}")
        return raw_path

We also updated our transform_market_data() task to align with this dynamic setup:

@task
def transform_market_data(raw_file: str):
    """Clean and analyze each regional dataset."""
    df = pd.read_csv(raw_file)
    df["price_usd"] = pd.to_numeric(df["price_usd"], errors="coerce")
    df["daily_change_percent"] = pd.to_numeric(df["daily_change_percent"], errors="coerce")
    df_sorted = df.sort_values(by="daily_change_percent", ascending=False)

    top_gainers = df_sorted.head(3)
    top_losers = df_sorted.tail(3)

    transformed_path = raw_file.replace("market_data_", "transformed_")
    top_gainers.to_csv(transformed_path, index=False)
    print(f"[TRANSFORM] Transformed data saved at {transformed_path}")
    return transformed_path

Both extract_market_data() and transform_market_data() now work together dynamically:

  • extract_market_data() generates a unique dataset per region (e.g., market_data_us.csv, market_data_europe.csv).
  • transform_market_data() then processes each of those files individually and saves transformed versions (e.g., transformed_us.csv).

Generally:

  • One extract task is created for each market (us, europe, asia, africa).
  • Each extract’s output file becomes the input for its corresponding transform task.
  • Airflow handles all the mapping logic automatically, no loops or manual duplication needed.
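If some arguments should stay the same for every mapped task, you can combine .partial() with .expand(). Here's a minimal sketch, assuming you add a hypothetical currency parameter to the extract task; only market is mapped, while currency is fixed for all four instances:

@task
def extract_market_data(market: str, currency: str):
    ...

# currency is pinned once; market still fans out into four parallel task instances
raw_files = extract_market_data.partial(currency="USD").expand(market=markets)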

Let’s redeploy our containers by running docker compose up -d.

You’ll see this clearly in the Graph View, where the DAG fans out into several parallel branches, one per market.

Dynamic Task Mapping

Each branch runs independently, and Airflow retries or logs failures per task as defined in default_args. You’ll notice that there are four task instances, which clearly correspond to the four market regions we processed.

Dynamic Task Mapping (2)

When you click any of the tasks, for example extract_market_data, and open the logs, you’ll see that the data for the corresponding market region was extracted and saved independently.

Dynamic Task Mapping (3)

Dynamic Task Mapping (4)

Dynamic Task Mapping (5)

Dynamic Task Mapping (6)

Summary and What’s Next

We have built a complete foundation for working with Apache Airflow inside Docker. You learned how to deploy a fully functional Airflow environment using Docker Compose, understand its architecture, and configure it for clean, local development. We explored the Airflow Web UI, and used the TaskFlow API to create our first real workflow, a simple yet powerful ETL pipeline that extracts and transforms data automatically.

By extending it with Dynamic Task Mapping, we saw how Airflow can scale horizontally by processing multiple datasets in parallel, creating independent task instances for each region without duplicating code.

In Part Two, we’ll build on this foundation and introduce the Load phase of our ETL pipeline. You’ll connect Airflow to a local MySQL database and learn how to configure Connections through the Admin panel and environment variables. We’ll also integrate Git and Git Sync to automate DAG deployment and introduce CI/CD pipelines for version-controlled, collaborative Airflow workflows.

By the end of the next part, your environment will evolve from a development sandbox into a production-ready data orchestration system, capable of automating data ingestion, transformation, and loading with full observability, reliability, and continuous integration support.

Install PostgreSQL 14.7 on Ubuntu

4 November 2025 at 22:23

In this tutorial, you'll learn how to install PostgreSQL 14.7 on your Ubuntu system. The process is straightforward and consists of the following steps:

  1. Update your system packages
  2. Install PostgreSQL
  3. Set up the superuser
  4. Download the Northwind PostgreSQL SQL file
  5. Create a new Database
  6. Import the Northwind SQL file
  7. Verify the Northwind database installation
  8. Connect to the Database Using Jupyter Notebook

Prerequisites

To follow this tutorial, you should be running Ubuntu 20.04 LTS or later.

Step 1: Update System Packages

First, you need to update the system packages. Open the Terminal app ("Ctrl + Alt + T") and enter the following command:

sudo apt update && sudo apt upgrade -y

Enter your admin password when prompted. This command refreshes the package lists (picking up any packages newly added to the repositories) and then upgrades the currently installed packages. The -y option automatically answers 'yes' to all prompts, making the process non-interactive.

Note: sudo is a prefix that gives you superuser permissions for a command, which is often necessary when making system-wide changes like installing or upgrading software. Be careful when using sudo, as it provides complete control over your system, including the ability to break it if misused.

Step 2: Install PostgreSQL

With the system packages updated, you're ready to install PostgreSQL.

To install the PostgreSQL package, use the apt package manager:

sudo apt install postgresql-14

You may be prompted to confirm the amount of space the installation requires on your local system. After the installation is complete, check the status of the PostgreSQL service:

systemctl status postgresql

When you run this command, it will display information such as whether the service is active or inactive, when it was started, the process ID, and recent log entries. You'll know that it has been installed successfully if you see a line similar to Loaded: loaded (/lib/systemd/system/postgresql.service; enabled; vendor preset: enabled), indicating the system has successfully read the PostgreSQL service file.

After you run systemctl status postgresql, you should find yourself back at the command prompt. If not, and you're stuck in a view of log files, you might be in a "less" or "more" program that lets you scroll through the logs. You can typically exit this view and return to the command prompt by pressing q. If that doesn't work, then "Ctrl + C" will send an interrupt signal to the current process and return you to the command line.

Step 3: Setting up the postgres user

PostgreSQL automatically creates a user (also known as a "role") named postgres, which has superuser privileges. To ensure you'll be able to use PostgreSQL without any issues, set a password for this user with the following command:

sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'your_password';"

Replace your_password with a new password and make sure it is wrapped in single quotes. Please note, this is not your local user account's password. This password will be used to connect to your PostgreSQL database with superuser privileges, so make sure it's strong and secure. This command will run the psql command as the postgres user, and pass it a SQL command to change the postgres user's password to your_password.

In PostgreSQL, the terms "USER" and "ROLE" are essentially interchangeable. The ALTER USER command is actually an alias for ALTER ROLE, which is why you see ALTER ROLE as the confirmation message.

So when you see ALTER ROLE, it just means that the password change was successful and the postgres role (or user, in everyday terms) has a new password. You're now able to use this new password to connect to PostgreSQL as the postgres user.
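If you want to confirm the new password works, you can connect over TCP instead of the local socket; unlike sudo -u postgres psql (which uses peer authentication), this should prompt for the password you just set:

psql -U postgres -h localhost -p 5432 -d postgres

Type \q to exit once you've verified the connection.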

Step 4: Download the Northwind PostgreSQL SQL file

First, you need to download a version of the Northwind database that's compatible with PostgreSQL. You can find an adapted version on GitHub. To download the SQL file, follow these two steps:

  1. From the Terminal, create a new directory for the Northwind database and navigate to it:

    mkdir northwind && cd northwind
  2. Download the Northwind PostgreSQL SQL file using wget:

    wget https://raw.githubusercontent.com/pthom/northwind_psql/master/northwind.sql

    This will download the northwind.sql file to the northwind directory you created above.

Step 5: Create a new PostgreSQL database

Before importing the Northwind SQL file, you must create a new PostgreSQL database. Follow these three steps:

  1. Connect to the PostgreSQL server as the postgres user:

    sudo -u postgres psql

    This command is telling the system to execute the psql command as the postgres user. psql is the interactive terminal for PostgreSQL, and when it starts, it changes the command prompt to let you know that you're interacting with the PostgreSQL command-line and not the system command-line.

    Once you've run sudo -u postgres psql, your terminal prompt will change to something similar to postgres=# to indicate you're connected to the postgres database.

  2. Create a new database called northwind:

    postgres=# CREATE DATABASE northwind;

    You'll see "CREATE DATABASE" is returned if the command is successful.

  3. Exit the psql command-line interface:

    postgres=# \q

Step 6: Import the Northwind SQL file

With the northwind database created, you can import the Northwind SQL file using psql. Follow these steps:

  • In your Terminal, ensure you're in the northwind directory where you downloaded the northwind.sql file.
  • Run the following command to import the Northwind SQL file into the northwind database:

    sudo -u postgres psql -d northwind -f northwind.sql

    This command connects to the PostgreSQL server as the postgres user, selects the northwind database, and executes the SQL commands in the northwind.sql file.

Step 7: Verify the Northwind database installation

To verify that the Northwind database has been installed correctly, follow these four steps:

  1. Connect to the northwind database using psql:

    sudo -u postgres psql -d northwind
  2. List the tables in the Northwind database:

    northwind=# \dt

    You should see a list of Northwind tables: categories, customers, employees, orders, and more.

  3. Run a sample query to ensure the data has been imported correctly. For example, you can query the customers table:

    northwind=# SELECT * FROM customers LIMIT 5;

    This should return the first five rows from the customers table. Similar to above when you used systemctl status postgresql, you might be in a "less" or "more" program that lets you scroll through the results of the query. Press q to return to the psql command-line interface.

  4. Exit the psql command-line interface:

    northwind=# \q

Step 8: Connect to the Database Using Jupyter Notebook

As we wrap up our installation, we will now introduce Jupyter Notebook as one of the tools available for executing SQL queries and analyzing the Northwind database. Jupyter Notebook offers a convenient and interactive platform that simplifies the visualization and sharing of query results, but it's important to note that it is an optional step. You can also access Postgres through other means. However, we highly recommend using Jupyter Notebook for its numerous benefits and enhanced user experience.

To set up the necessary tools and establish a connection to the Northwind database, here is an overview of what each step will do:

  • !pip install ipython-sql: This command installs the ipython-sql package. This package enables you to write SQL queries directly in your Jupyter Notebook, making it easier to execute and visualize the results of your queries within the notebook environment.
  • %load_ext sql: This magic command loads the sql extension for IPython. By loading this extension, you can use the SQL magic commands, such as %sql and %%sql, to run SQL queries directly in the Jupyter Notebook cells.
  • %sql postgresql://postgres@localhost:5432/northwind: This command establishes a connection to the Northwind database using the PostgreSQL database system. The connection string has the following format:

    postgresql://username@hostname:port/database_name

    • In this case, username is postgres, hostname is localhost, port is 5432, and database_name is northwind. The %sql magic command allows you to run a single-line SQL query in the Jupyter Notebook.
  1. Copy the following text into a code cell in the Jupyter Notebook:

    !pip install ipython-sql
    %load_ext sql
    %sql postgresql://postgres@localhost:5432/northwind
  2. Run the cell by either:

    • Clicking the "Run" button on the menu bar.
    • Using the keyboard shortcut: Shift + Enter or Ctrl + Enter.
  3. Upon successful connection, you should see an output similar to the following:

    'Connected: postgres@northwind'

    This output confirms that you are now connected to the Northwind database, and you can proceed with the guided project in your Jupyter Notebook environment.

Once you execute these commands, you'll be connected to the Northwind database, and you can start writing SQL queries in your Jupyter Notebook using the %sql or %%sql magic commands.
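For example, you could re-run the verification query from Step 7 directly in a notebook cell, or try a small aggregate like the one below (this assumes the standard Northwind customers table with a country column):

%%sql
SELECT country, COUNT(*) AS customer_count
FROM customers
GROUP BY country
ORDER BY customer_count DESC
LIMIT 5;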

Next Steps

Based on what you've accomplished, here are some potential next steps to continue your learning journey:

  1. Deepen Your SQL Knowledge:
    • Try formulating more complex queries on the Northwind database to improve your SQL skills. These could include joins, subqueries, and aggregations.
    • Understand the design of the Northwind database: inspect the tables, their relationships, and how data is structured.
  2. Experiment with Database Management:
    • Learn how to backup and restore databases in PostgreSQL. Try creating a backup of your Northwind database.
    • Explore different ways to optimize your PostgreSQL database performance like indexing and query optimization.
  3. Integration with Python:
    • Learn how to use psycopg2, a popular PostgreSQL adapter for Python, to interact with your database programmatically.
    • Experiment with ORM (Object-Relational Mapping) libraries like SQLAlchemy to manage your database using Python.

Install PostgreSQL 14.7 on Windows 10

4 November 2025 at 22:21

In this tutorial, you'll learn how to install PostgreSQL 14.7 on Windows 10.

The process is straightforward and consists of the following steps:

  1. Install PostgreSQL
  2. Configure Environment Variables
  3. Verify the Installation
  4. Download the Northwind PostgreSQL SQL file
  5. Create a New PostgreSQL Database
  6. Import the Northwind SQL file
  7. Verify the Northwind database installation
  8. Connect to the Database Using Jupyter Notebook

Prerequisites

  • A computer running Windows 10
  • Internet connection

To get the installer:

  1. Download the official PostgreSQL 14.7 installer at https://get.enterprisedb.com/postgresql/postgresql-14.7-2-windows-x64.exe
  2. Save the installer executable to your computer and run it.

Note: We recommend version 14.7 because it is widely used and matches the steps in this tutorial. Newer versions are available, but their installers and features differ.

Step 1: Install PostgreSQL

We're about to initiate a vital part of this project - installing and configuring PostgreSQL.

Throughout this process, you'll define critical settings like the installation directory, components, data directory, and the initial 'postgres' user password. This password grants administrative access to your PostgreSQL system. Additionally, you'll choose the default port for connections and the database cluster locale.

Each choice affects your system's operation, file storage, available tools, and security. We're here to guide you through each decision to ensure optimal system functioning.

  1. In the PostgreSQL Setup Wizard, click Next to begin the installation process.

  2. Accept the default installation directory or choose a different directory by clicking Browse. Click Next to continue.

  3. Choose the components you want to install (e.g., PostgreSQL Server, pgAdmin 4 (optional), Stack Builder (optional), Command Line Tools), and click Next.

  4. Select the data directory for storing your databases and click Next.

  5. Set a password for the PostgreSQL “postgres” user and click Next.

    • There will be some points where you're asked to enter a password in the command prompt. It's important to note that for security reasons, as you type your password, no characters will appear on the screen. This standard security feature is designed to prevent anyone from looking over your shoulder and seeing your password. So, when you're prompted for your password, don't be alarmed if you don't see any response on the screen as you type. Enter your password and press 'Enter'. Most systems will allow you to re-enter the password if you make a mistake.

    • Remember, it's crucial to remember the password you set during the installation, as you'll need it to connect to your PostgreSQL databases in the future.

  6. Choose the default port number (5432) or specify a different port, then click Next.

  7. Select the locale to be used by the new database cluster and click Next.

  8. Review the installation settings and click Next to start the installation process. The installation may take a few minutes.

  9. Once the installation is complete, click Finish to close the Setup Wizard.

Step 2: Configure Environment Variables

Next, we're going to configure environment variables on your Windows system. Why are we doing this? Well, environment variables are a powerful feature of operating systems that allow us to specify values - like directory locations - that can be used by multiple applications. In our case, we need to ensure that our system can locate the PostgreSQL executable files stored in the "bin" folder of the PostgreSQL directory.

By adding the PostgreSQL "bin" folder path to the system's PATH environment variable, we're telling our operating system where to find these executables. This means you'll be able to run PostgreSQL commands directly from the command line, no matter what directory you're in, because the system will know where to find the necessary files. This makes working with PostgreSQL more convenient and opens up the possibility of running scripts that interact with PostgreSQL.

Now, let's get started with the steps to configure your environment variables on Windows!

  1. On the Windows taskbar, right-click the Windows icon and select System.

  2. Click on Advanced system settings in the left pane.

  3. In the System Properties dialog, click on the Environment Variables button.

  4. Under the System Variables section, scroll down and find the Path variable. Click on it to select it, then click the Edit button.

  5. In the Edit environment variable dialog, click the New button and add the path to the PostgreSQL bin folder, typically C:\Program Files\PostgreSQL\14\bin.

  6. Click OK to close the "Edit environment variable" dialog, then click OK again to close the "Environment Variables" dialog, and finally click OK to close the "System Properties" dialog.

Step 3: Verify the Installation

After going through the installation and configuration process, it's essential to verify that PostgreSQL is correctly installed and accessible. This gives us the assurance that the software is properly set up and ready to use, which can save us from troubleshooting issues later when we start interacting with databases.

If something went wrong during installation, this verification process will help you spot the problem early before creating or managing databases.

Now, let's go through the steps to verify your PostgreSQL installation.

  1. Open the Command Prompt by pressing Win + R, typing cmd, and pressing Enter.
  2. Type psql --version and press Enter. You should see the PostgreSQL version number you installed if the installation was successful.
  3. To connect to the PostgreSQL server, type psql -U postgres and press Enter.
  4. When prompted, enter the password you set for the postgres user during installation. You should now see the postgres=# prompt, indicating you are connected to the PostgreSQL server.

Step 4: Download the Northwind PostgreSQL SQL File

Now, we're going to introduce you to the Northwind database and help you download it. The Northwind database is a sample database originally provided by Microsoft for its Access Database Management System. It's based on a fictitious company named "Northwind Traders," and it contains data on their customers, orders, products, suppliers, and other aspects of the business. In our case, we'll be working with a version of Northwind that has been adapted for PostgreSQL.

The following steps will guide you on how to download this PostgreSQL-compatible version of the Northwind database from GitHub to your local machine. Let's get started:

First, you need to download a version of the Northwind database that's compatible with PostgreSQL. You can find an adapted version on GitHub. To download the SQL file, follow these steps:

  1. Open Command Prompt or PowerShell.
  2. Create a new directory for the Northwind database and navigate to it:

    mkdir northwind
    cd northwind
  3. Download the Northwind PostgreSQL SQL file using curl:

    curl -O https://raw.githubusercontent.com/pthom/northwind_psql/master/northwind.sql

    This will download the northwind.sql file to the northwind directory you created.

Step 5: Create a New PostgreSQL Database

Now that we've downloaded the Northwind SQL file, it's time to prepare our PostgreSQL server to host this data. The next steps will guide you in creating a new database on your PostgreSQL server, a crucial prerequisite before importing the Northwind SQL file.

Creating a dedicated database for the Northwind data is good practice as it isolates these data from other databases in your PostgreSQL server, facilitating better organization and management of your data. These steps involve connecting to the PostgreSQL server as the postgres user, creating the northwind database, and then exiting the PostgreSQL command-line interface.

Let's proceed with creating your new database:

  1. Connect to the PostgreSQL server as the postgres user:

    psql -U postgres
  2. Create a new database called northwind:

    postgres=# CREATE DATABASE northwind;
  3. Exit the psql command-line interface:

    postgres=# \q

Step 6: Import the Northwind SQL File

We're now ready to import the Northwind SQL file into our newly created northwind database. This step is crucial as it populates our database with the data from the Northwind SQL file, which we will use for our PostgreSQL learning journey.

These instructions guide you through the process of ensuring you're in the correct directory in your Terminal and executing the command to import the SQL file. This command will connect to the PostgreSQL server, target the northwind database, and run the SQL commands contained in the northwind.sql file.

Let's move ahead and breathe life into our northwind database with the data it needs!

With the northwind database created, you can import the Northwind SQL file using the psql command. Follow these steps:

  1. In your Terminal, ensure you're in the northwind directory where you downloaded the northwind.sql file.
  2. Run the following command to import the Northwind SQL file into the northwind database:

    psql -U postgres -d northwind -f northwind.sql

    This command connects to the PostgreSQL server as the postgres user, selects the northwind database, and executes the SQL commands in the northwind.sql file.

Step 7: Verify the Northwind Database Installation

You've successfully created your northwind database and imported the Northwind SQL file. Next, we must ensure everything was installed correctly, and our database is ready for use.

These upcoming steps will guide you on connecting to your northwind database, listing its tables, running a sample query, and finally, exiting the command-line interface. Checking the tables and running a sample query will give you a sneak peek into the data you now have and verify that the data was imported correctly. This means we can ensure everything is in order before diving into more complex operations and analyses.

To verify that the Northwind database has been installed correctly, follow these steps:

  1. Connect to the northwind database using psql:

    psql -U postgres -d northwind
  2. List the tables in the Northwind database:

    northwind=# \dt

    You should see a list of Northwind tables: categories, customers, employees, orders, and more.

  3. Run a sample query to ensure the data has been imported correctly. For example, you can query the customers table:

    northwind=# SELECT * FROM customers LIMIT 5;

    This should return the first five rows from the customers table.

  4. Exit the psql command-line interface:

    northwind=# \q

Congratulations! You've successfully installed the Northwind database in PostgreSQL using an SQL file and psql.

Step 8: Connect to the Database Using Jupyter Notebook

As we wrap up our installation, we will now introduce Jupyter Notebook as one of the tools available for executing SQL queries and analyzing the Northwind database. Jupyter Notebook offers a convenient and interactive platform that simplifies the visualization and sharing of query results, but it's important to note that it is an optional step. You can also access Postgres through other means. However, we highly recommend using Jupyter Notebook for its numerous benefits and enhanced user experience.

To set up the necessary tools and establish a connection to the Northwind database, here is an overview of what each step will do:

  1. !pip install ipython-sql: This command installs the ipython-sql package. This package enables you to write SQL queries directly in your Jupyter Notebook, making it easier to execute and visualize the results of your queries within the notebook environment. After installing it, the %load_ext sql command (used in the code below) loads the SQL extension so that the %sql and %%sql magic commands become available in your notebook cells.

  2. %sql postgresql://postgres@localhost:5432/northwind: This command establishes a connection to the Northwind database using the PostgreSQL database system. The connection string has the following format:

    postgresql://username@hostname:port/database_name

    In this case, username is postgres, hostname is localhost, port is 5432, and database_name is northwind. The %sql magic command allows you to run a single-line SQL query in the Jupyter Notebook.

  3. Copy the following text into a code cell in the Jupyter Notebook:

    !pip install ipython-sql
    %load_ext sql
    %sql postgresql://postgres@localhost:5432/northwind 

    On Windows, you may need to use the following form instead, because you must provide the password you set for the “postgres” user during installation:

    %sql postgresql://postgres:{password}@localhost:5432/northwind

    Bear in mind that it's considered best practice not to include sensitive information like passwords directly in files that could be shared or accidentally exposed. Instead, you can store your password securely using environment variables or a password management system (we'll link to some resources at the end of this guide if you are interested in doing this).

  4. Run the cell by either:

    • Clicking the "Run" button on the menu bar.
    • Using the keyboard shortcut: Shift + Enter or Ctrl + Enter.
  5. Upon successful connection, you should see an output similar to the following:

    'Connected: postgres@northwind'

    This output confirms that you are now connected to the Northwind database, and you can proceed with the guided project in your Jupyter Notebook environment.

Once you execute these commands, you'll be connected to the Northwind database, and you can start writing SQL queries in your Jupyter Notebook using the %sql or %%sql magic commands.
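As a follow-up to the note above about not hardcoding passwords, here is one minimal sketch of reading the password from an environment variable inside the notebook. The variable name PG_NORTHWIND_PASSWORD is just an example, and you would set it in PowerShell before launching Jupyter; depending on your ipython-sql version you may be able to pass the resulting string to the %sql magic directly, or simply print it and paste it into the %sql line shown earlier:

import os

# Set beforehand in PowerShell, e.g.:  $env:PG_NORTHWIND_PASSWORD = "your_password"
password = os.environ["PG_NORTHWIND_PASSWORD"]

# Build the connection string without writing the password into the notebook itself
connection_string = f"postgresql://postgres:{password}@localhost:5432/northwind"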

Next Steps

Based on what you've accomplished, here are some potential next steps to continue your learning journey:

  1. Deepen Your SQL Knowledge:
    • Try formulating more complex queries on the Northwind database to improve your SQL skills. These could include joins, subqueries, and aggregations.
    • Understand the design of the Northwind database: inspect the tables, their relationships, and how data is structured.
  2. Experiment with Database Management:
    • Learn how to backup and restore databases in PostgreSQL. Try creating a backup of your Northwind database.
    • Explore different ways to optimize your PostgreSQL database performance like indexing and query optimization.
  3. Integration with Python:
    • Learn how to use psycopg2, a popular PostgreSQL adapter for Python, to interact with your database programmatically.
    • Experiment with ORM (Object-Relational Mapping) libraries like SQLAlchemy to manage your database using Python.
  4. Security and Best Practices:
    • Learn about database security principles and apply them to your PostgreSQL setup.
    • Understand best practices for storing sensitive information, like using .env files for environment variables.
    • For more guidance on securely storing passwords, you might find the following resources helpful:

Install PostgreSQL 14.7 on MacOS

4 November 2025 at 22:07

In this tutorial, you'll learn how to install PostgreSQL 14.7 on your MacBook using Homebrew, a popular package manager for macOS.
The process is straightforward and consists of the following steps:

  1. Install Homebrew
  2. Install PostgreSQL
  3. Download the Northwind PostgreSQL SQL file
  4. Create a New PostgreSQL Database
  5. Import the Northwind SQL file
  6. Verify the Northwind database installation
  7. Connecting to the Database Using Jupyter Notebook

Prerequisites

You'll need a MacBook or iMac running macOS 10.13 or later to follow this tutorial.

Install Xcode Command Line Tools
First, you need to install the Xcode Command Line Tools, which provide essential tools for building software on your Mac.

  1. Open the Terminal app (you can find it in Applications > Utilities) and enter the following command:
xcode-select --install

A pop-up window will appear, prompting you to install the Command Line Tools. Click on "Install" to proceed. Once the installation is complete, you can move on to the next step.

Step 1: Install Homebrew

With the Xcode Command Line Tools installed, you can now install Homebrew itself. Homebrew is a package manager for macOS that will make it easier for us to install PostgreSQL and manage other software packages.

After installing Homebrew, it's important to add it to your system's PATH. By doing this, we're telling the system where to find the Homebrew executables. This means you can run Homebrew commands directly from any location in your Terminal.

We will then verify the successful installation of Homebrew using the version check command. This crucial step confirms that Homebrew has been correctly installed and is accessible via your Terminal.

Now, let's get started with the installation of Homebrew and its addition to your system's PATH.

1. Copy and paste the following command into the Terminal app:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

The script will automatically download and install Homebrew on your Mac. You might be prompted to enter your admin password during the installation process. After the installation is complete, you will see a message saying, "Installation successful!"

2. Add Homebrew to Your PATH

Follow the steps below to complete the Homebrew installation:

(echo; echo 'eval "$(/opt/homebrew/bin/brew shellenv)"') >> /Users/user_name/.zprofile

This command appends the necessary line to your .zprofile file to load Homebrew into your PATH. Replace /Users/user_name with your actual user directory.

3. Running the command

Now, run the second command:

eval "$(/opt/homebrew/bin/brew shellenv)"

This command loads the Homebrew environment variables into your current Terminal session.

4. Verify the installation

To verify your Homebrew installation, run the following command:

brew --version

This should display the Homebrew version number, confirming that it's installed and available in your PATH.

Step 2: Install PostgreSQL

In this step, we will utilize Homebrew to set up PostgreSQL on your macOS system. The steps will guide you through updating Homebrew to guarantee that we're accessing the latest software packages. Next, we'll install PostgreSQL version 14 using a single Homebrew command, and start the PostgreSQL service, getting the database server up and running.

Finally, we'll secure our PostgreSQL installation by creating a new user (or "role") called postgres with superuser privileges. This user will serve as our primary means of interacting with our PostgreSQL server, granting us broad administrative capabilities. Now, let's dive into these steps:

1. Ensure Homebrew is Installed

Make sure you have Homebrew installed. If not, you can install it using the following command:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

2. Update Homebrew

Update Homebrew to ensure you have the latest package information

brew update

3. Install PostgreSQL 14

Install PostgreSQL 14 using Homebrew by entering the following command:

brew install postgresql@14

4. Start PostgreSQL Service

After the installation process is done, start the PostgreSQL service:

brew services start postgresql@14

You should see a message indicating that PostgreSQL has started successfully:

Successfully started postgresql@14

5. Verify PostgreSQL Installation
To ensure PostgreSQL is installed correctly, check the version:

psql --version

6. Create a PostgreSQL User

To ensure you will be able to use PostgreSQL without any issues, let’s create a new PostgreSQL user (also called a "role") named postgres with superuser privileges:

createuser -s postgres

7. Set a Password for the postgres User

For security purposes, set a password for the postgres user. Open psql (connecting to the default postgres database, since Homebrew doesn't create a database named after your macOS account) and run the \password command:

psql postgres
\password postgres

8. Configure PostgreSQL Environment

Add PostgreSQL to your PATH by adding the following line to your shell profile (~/.zshrc or ~/.bash_profile). The path below applies to Intel Macs; on Apple Silicon, Homebrew lives under /opt/homebrew, so use /opt/homebrew/opt/postgresql@14/bin instead:

export PATH="/usr/local/opt/postgresql@14/bin:$PATH"

Then, reload your profile:

source ~/.zshrc   # or source ~/.bash_profile

9. Troubleshooting Tips

If the PostgreSQL service does not start, check the logs for errors (the exact log path depends on your Homebrew prefix; on Apple Silicon it is under /opt/homebrew/var/log):

tail -f /usr/local/var/log/postgres.log

Ensure your system has enough resources and dependencies for PostgreSQL.

Step 3: Download the Northwind PostgreSQL SQL File

Now, we're going to introduce you to the Northwind database and help you download it. The Northwind database is a sample database originally provided by Microsoft for its Access Database Management System. It's based on a fictitious company named "Northwind Traders," and it contains data on their customers, orders, products, suppliers, and other aspects of the business. In our case, we'll be working with a version of Northwind that has been adapted for PostgreSQL.

The following steps will guide you on how to download this PostgreSQL-compatible version of the Northwind database from GitHub to your local machine. Let's get started:

First, you need to download a version of the Northwind database that's compatible with PostgreSQL. You can find an adapted version on GitHub. To download the SQL file, follow these steps:

  1. Open your Terminal application.

  2. Create a new directory for the Northwind database and navigate to it:

    mkdir northwind && cd northwind
  3. Download the Northwind PostgreSQL SQL file using curl:

    curl -O https://raw.githubusercontent.com/pthom/northwind_psql/master/northwind.sql

    This will download the northwind.sql file to the northwind directory you created.

Step 4: Create a New PostgreSQL Database

Now that we've downloaded the Northwind SQL file, it's time to prepare our PostgreSQL server to host this data. The next steps will guide you in creating a new database on your PostgreSQL server, a crucial prerequisite before importing the Northwind SQL file.

Creating a dedicated database for the Northwind data is good practice as it isolates these data from other databases in your PostgreSQL server, facilitating better organization and management of your data. These steps involve connecting to the PostgreSQL server as the postgres user, creating the northwind database, and then exiting the PostgreSQL command-line interface.

Let's proceed with creating your new database:

  1. Connect to the PostgreSQL server as the postgres user:

    psql -U postgres
  2. Create a new database called northwind:

    postgres=# CREATE DATABASE northwind;
  3. Exit the psql command-line interface:

    postgres=# \q

Step 5: Import the Northwind SQL File

We're now ready to import the Northwind SQL file into our newly created northwind database. This step is crucial as it populates our database with the data from the Northwind SQL file, which we will use for our PostgreSQL learning journey.

These instructions guide you through the process of ensuring you're in the correct directory in your Terminal and executing the command to import the SQL file. This command will connect to the PostgreSQL server, target the northwind database, and run the SQL commands contained in the northwind.sql file.

Let's move ahead and breathe life into our northwind database with the data it needs!

With the northwind database created, you can import the Northwind SQL file using psql. Follow these steps:

  1. In your Terminal, ensure you're in the northwind directory where you downloaded the northwind.sql file.
  2. Run the following command to import the Northwind SQL file into the northwind database:
    psql -U postgres -d northwind -f northwind.sql

    This command connects to the PostgreSQL server as the postgres user, selects the northwind database, and executes the SQL commands in the northwind.sql file.

Step 6: Verify the Northwind Database Installation

You've successfully created your northwind database and imported the Northwind SQL file. Next, we must ensure everything was installed correctly, and our database is ready for use.

These upcoming steps will guide you on connecting to your northwind database, listing its tables, running a sample query, and finally, exiting the command-line interface. Checking the tables and running a sample query will give you a sneak peek into the data you now have and verify that the data was imported correctly. This means we can ensure everything is in order before diving into more complex operations and analyses.

To verify that the Northwind database has been installed correctly, follow these steps:

  1. Connect to the northwind database using psql:

    psql -U postgres -d northwind
  2. List the tables in the Northwind database:

    northwind=# \dt

    You should see a list of Northwind tables: categories, customers, employees, orders, and more.

  3. Run a sample query to ensure the data has been imported correctly. For example, you can query the customers table:

    northwind=# SELECT * FROM customers LIMIT 5;

    This should return the first five rows from the customers table.

  4. Exit the psql command-line interface:

    northwind=# \q

Congratulations! You've successfully installed the Northwind database in PostgreSQL using an SQL file and psql.

Step 7: Connect to the Database Using Jupyter Notebook

As we wrap up our installation, we will now introduce Jupyter Notebook as one of the tools available for executing SQL queries and analyzing the Northwind database. Jupyter Notebook offers a convenient and interactive platform that simplifies the visualization and sharing of query results, but it's important to note that it is an optional step. You can also access Postgres through other means. However, we highly recommend using Jupyter Notebook for its numerous benefits and enhanced user experience.

To set up the necessary tools and establish a connection to the Northwind database, here is an overview of what each step will do:

  1. !pip install ipython-sql: This command installs the ipython-sql package. This package enables you to write SQL queries directly in your Jupyter Notebook, making it easier to execute and visualize the results of your queries within the notebook environment.

  2. %load_ext sql: This magic command loads the sql extension for IPython. By loading this extension, you can use the SQL magic commands, such as %sql and %%sql, to run SQL queries directly in the Jupyter Notebook cells.

  3. %sql postgresql://postgres@localhost:5432/northwind: This command establishes a connection to the Northwind database using the PostgreSQL database system. The connection string has the following format:

    'postgresql://username@hostname:port/database_name'

    In this case, username is postgres, hostname is localhost, port is 5432, and database_name is northwind. The %sql magic command allows you to run a single-line SQL query in the Jupyter Notebook.

  4. Copy the following text into a code cell in the Jupyter Notebook:

    !pip install ipython-sql
    %load_ext sql
    %sql postgresql://postgres@localhost:5432/northwind
  5. Run the cell by either:

    • Clicking the "Run" button on the menu bar.
    • Using the keyboard shortcut: Shift + Enter or Ctrl + Enter.
  6. Upon successful connection, you should see an output similar to the following:

    'Connected: postgres@northwind'

    This output confirms that you are now connected to the Northwind database, and you can proceed with the guided project in your Jupyter Notebook environment.

Once you execute these commands, you'll be connected to the Northwind database, and you can start writing SQL queries in your Jupyter Notebook using the %sql or %%sql magic commands.

Next Steps

Based on what you've accomplished, here are some potential next steps to continue your learning journey:

  1. Deepen Your SQL Knowledge:
    • Try formulating more complex queries on the Northwind database to improve your SQL skills. These could include joins, subqueries, and aggregations.
    • Understand the design of the Northwind database: inspect the tables, their relationships, and how data is structured.
  2. Experiment with Database Management:
    • Learn how to backup and restore databases in PostgreSQL. Try creating a backup of your Northwind database.
    • Explore different ways to optimize your PostgreSQL database performance like indexing and query optimization.
  3. Integration with Python:
    • Learn how to use psycopg2, a popular PostgreSQL adapter for Python, to interact with your database programmatically (see the sketch after this list).
    • Experiment with ORM (Object-Relational Mapping) libraries like SQLAlchemy to manage your database using Python.
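To get started with the psycopg2 suggestion above, here is a minimal sketch that connects to the local northwind database and runs a small query. It assumes you have installed the adapter (for example with pip install psycopg2-binary) and uses the password you set for the postgres user earlier:

import psycopg2

# Connect to the local northwind database as the postgres user
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="northwind",
    user="postgres",
    password="your_password",  # replace with the password you set earlier
)

with conn.cursor() as cur:
    cur.execute("SELECT company_name, country FROM customers LIMIT 5;")
    for row in cur.fetchall():
        print(row)

conn.close()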

Generating Embeddings with APIs and Open Models

4 November 2025 at 21:39

In the previous tutorial, you learned that embeddings convert text into numerical vectors that capture semantic meaning. You saw how papers about machine learning, data engineering, and data visualization naturally clustered into distinct groups when we visualized their embeddings. That was the foundation.

But we only worked with 12 handwritten paper abstracts that we typed directly into our code. That approach works great for understanding core concepts, but it doesn't prepare you for real projects. Real applications require processing hundreds or thousands of documents, and you need to make strategic decisions about how to generate those embeddings efficiently.

This tutorial teaches you how to collect documents programmatically and generate embeddings using different approaches. You'll use the arXiv API to gather 500 research papers, then generate embeddings using both local models and cloud services. By comparing these approaches hands-on, you'll understand the tradeoffs and be able to make informed decisions for your own projects.

These techniques form the foundation for production systems, but we're focusing on core concepts with a learning-sized dataset. A real system handling millions of documents would require batching strategies, streaming pipelines, and specialized vector databases. We'll touch on those considerations, but our goal here is to build your intuition about the embedding generation process itself.

Setting Up Your Environment

Before we start collecting data, let's install the libraries we'll need. We'll use the arxiv library to access research papers programmatically, pandas for data manipulation, and the same embedding libraries from the previous tutorial.

This tutorial was developed using Python 3.12.12 with the following library versions. You can use these exact versions for guaranteed compatibility, or install the latest versions (which should work just fine):

# Developed with: Python 3.12.12
# sentence-transformers==5.1.2
# scikit-learn==1.6.1
# matplotlib==3.10.0
# numpy==2.0.2
# arxiv==2.2.0
# pandas==2.2.2
# cohere==5.20.0
# python-dotenv==1.1.1

pip install sentence-transformers scikit-learn matplotlib numpy arxiv pandas cohere python-dotenv

This tutorial works in any Python environment: Jupyter notebooks, Python scripts, VS Code, or your preferred IDE. Run the pip command above in your terminal before starting, then use the Python code blocks throughout this tutorial.

Collecting Research Papers with the arXiv API

arXiv is a repository of over 2 million scholarly papers in physics, mathematics, computer science, and more. Researchers publish cutting-edge work here before it appears in journals, making it a valuable resource for staying current with AI and machine learning research. Best of all, arXiv provides a free API for programmatic access. While they do monitor usage and have some rate limits to prevent abuse, these limits are generous for learning and research purposes. Check their Terms of Use for current guidelines.

We'll use the arXiv API to collect 500 papers from five different computer science categories. This diversity will give us clear semantic clusters when we visualize or search our embeddings. The categories we'll use are:

  • cs.LG (Machine Learning): Core ML algorithms, training methods, and theoretical foundations
  • cs.CV (Computer Vision): Image processing, object detection, and visual recognition
  • cs.CL (Computational Linguistics/NLP): Natural language processing and understanding
  • cs.DB (Databases): Data storage, query optimization, and database systems
  • cs.SE (Software Engineering): Development practices, testing, and software architecture

These categories use distinct vocabularies and will create well-separated clusters in our embedding space. Let's write a function to collect papers from specific arXiv categories:

import arxiv

# Create the arXiv client once and reuse it
# This is recommended by the arxiv package to respect rate limits
client = arxiv.Client()

def collect_arxiv_papers(category, max_results=100):
    """
    Collect papers from arXiv by category.

    Parameters:
    -----------
    category : str
        arXiv category code (e.g., 'cs.LG', 'cs.CV')
    max_results : int
        Maximum number of papers to retrieve

    Returns:
    --------
    list of dict
        List of paper dictionaries containing title, abstract, authors, etc.
    """
    # Construct search query for the category
    search = arxiv.Search(
        query=f"cat:{category}",
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate
    )

    papers = []
    for result in client.results(search):
        paper = {
            'title': result.title,
            'abstract': result.summary,
            'authors': [author.name for author in result.authors],
            'published': result.published,
            'category': category,
            'arxiv_id': result.entry_id.split('/')[-1]
        }
        papers.append(paper)

    return papers

# Define the categories we want to collect from
categories = [
    ('cs.LG', 'Machine Learning'),
    ('cs.CV', 'Computer Vision'),
    ('cs.CL', 'Computational Linguistics'),
    ('cs.DB', 'Databases'),
    ('cs.SE', 'Software Engineering')
]

# Collect 100 papers from each category
all_papers = []
for category_code, category_name in categories:
    print(f"Collecting papers from {category_name} ({category_code})...")
    papers = collect_arxiv_papers(category_code, max_results=100)
    all_papers.extend(papers)
    print(f"  Collected {len(papers)} papers")

print(f"\nTotal papers collected: {len(all_papers)}")

# Let's examine the first paper from each category
separator = "=" * 80
print(f"\n{separator}", "SAMPLE PAPERS (one from each category)", f"{separator}", sep="\n")
for i, (_, category_name) in enumerate(categories):
    paper = all_papers[i * 100]
    print(f"\n{category_name}:")
    print(f"  Title: {paper['title']}")
    print(f"  Abstract (first 150 chars): {paper['abstract'][:150]}...")
    
Collecting papers from Machine Learning (cs.LG)...
  Collected 100 papers
Collecting papers from Computer Vision (cs.CV)...
  Collected 100 papers
Collecting papers from Computational Linguistics (cs.CL)...
  Collected 100 papers
Collecting papers from Databases (cs.DB)...
  Collected 100 papers
Collecting papers from Software Engineering (cs.SE)...
  Collected 100 papers

Total papers collected: 500

================================================================================
SAMPLE PAPERS (one from each category)
================================================================================

Machine Learning:
  Title: Dark Energy Survey Year 3 results: Simulation-based $w$CDM inference from weak lensing and galaxy clustering maps with deep learning. I. Analysis design
  Abstract (first 150 chars): Data-driven approaches using deep learning are emerging as powerful techniques to extract non-Gaussian information from cosmological large-scale struc...

Computer Vision:
  Title: Carousel: A High-Resolution Dataset for Multi-Target Automatic Image Cropping
  Abstract (first 150 chars): Automatic image cropping is a method for maximizing the human-perceived quality of cropped regions in photographs. Although several works have propose...

Computational Linguistics:
  Title: VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks
  Abstract (first 150 chars): LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct an...

Databases:
  Title: Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis
  Abstract (first 150 chars): Natural language interfaces to tabular data must handle ambiguities inherent to queries. Instead of treating ambiguity as a deficiency, we reframe it ...

Software Engineering:
  Title: evomap: A Toolbox for Dynamic Mapping in Python
  Abstract (first 150 chars): This paper presents evomap, a Python package for dynamic mapping. Mapping methods are widely used across disciplines to visualize relationships among ...
  

The code above demonstrates how easy it is to collect papers programmatically. In just a few lines, we've gathered 500 recent research papers from five distinct computer science domains.

Take a look at your output when you run this code. You might notice something interesting: sometimes the same paper title appears under multiple categories. This happens because researchers often cross-list their papers in multiple relevant categories on arXiv. A paper about deep learning for natural language processing could legitimately appear in both Machine Learning (cs.LG) and Computational Linguistics (cs.CL). A paper about neural networks for image generation might be listed in both Machine Learning (cs.LG) and Computer Vision (cs.CV).

While our five categories are conceptually separate, there's naturally some overlap, especially between closely related fields. This real-world messiness is exactly what makes working with actual data more interesting than handcrafted examples. Your specific results will look different from ours because arXiv returns the most recently submitted papers, which change as new research is published.
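If you're curious how much cross-listing shows up in your own run, a quick check over the all_papers list collected above might look like this (your counts will differ from ours):

from collections import Counter

# Titles that appear under more than one category were cross-listed on arXiv
title_counts = Counter(paper['title'] for paper in all_papers)
cross_listed = [title for title, count in title_counts.items() if count > 1]
print(f"Cross-listed papers in this run: {len(cross_listed)}")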

Preparing Your Dataset

Before generating embeddings, we need to clean and structure our data. Real-world datasets always have imperfections. Some papers might have missing abstracts, others might have abstracts that are too short to be meaningful, and we need to organize everything into a format that's easy to work with.

Let's use pandas to create a DataFrame and handle these data quality issues:

import pandas as pd

# Convert to DataFrame for easier manipulation
df = pd.DataFrame(all_papers)

print("Dataset before cleaning:")
print(f"Total papers: {len(df)}")
print(f"Papers with abstracts: {df['abstract'].notna().sum()}")

# Check for missing abstracts
missing_abstracts = df['abstract'].isna().sum()
if missing_abstracts > 0:
    print(f"\nWarning: {missing_abstracts} papers have missing abstracts")
    df = df.dropna(subset=['abstract'])

# Filter out papers with very short abstracts (less than 100 characters)
# These are often just placeholders or incomplete entries
df['abstract_length'] = df['abstract'].str.len()
df = df[df['abstract_length'] >= 100].copy()

print(f"\nDataset after cleaning:")
print(f"Total papers: {len(df)}")
print(f"Average abstract length: {df['abstract_length'].mean():.0f} characters")

# Show the distribution across categories
print("\nPapers per category:")
print(df['category'].value_counts().sort_index())

# Display the first few entries
separator = "=" * 80
print(f"\n{separator}", "FIRST 3 PAPERS IN CLEANED DATASET", f"{separator}", sep="\n")
for idx, row in df.head(3).iterrows():
    print(f"\n{idx+1}. {row['title']}")
    print(f"   Category: {row['category']}")
    print(f"   Abstract length: {row['abstract_length']} characters")
    
Dataset before cleaning:
Total papers: 500
Papers with abstracts: 500

Dataset after cleaning:
Total papers: 500
Average abstract length: 1337 characters

Papers per category:
category
cs.CL    100
cs.CV    100
cs.DB    100
cs.LG    100
cs.SE    100
Name: count, dtype: int64

================================================================================
FIRST 3 PAPERS IN CLEANED DATASET
================================================================================

1. Dark Energy Survey Year 3 results: Simulation-based $w$CDM inference from weak lensing and galaxy clustering maps with deep learning. I. Analysis design
   Category: cs.LG
   Abstract length: 1783 characters

2. Multi-Method Analysis of Mathematics Placement Assessments: Classical, Machine Learning, and Clustering Approaches
   Category: cs.LG
   Abstract length: 1519 characters

3. Forgetting is Everywhere
   Category: cs.LG
   Abstract length: 1150 characters
   

Data preparation matters because poor quality input leads to poor quality embeddings. By filtering out papers with missing or very short abstracts, we ensure that our embeddings will capture meaningful semantic content. In production systems, you'd likely implement more sophisticated quality checks, but this basic approach handles the most common issues.
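As one example of a further check (not applied here, since we want to keep 100 papers per category), you could drop cross-listed papers that appear under more than one category by deduplicating on the arXiv ID:

# Optional: remove cross-listed duplicates, keeping the first category they appeared under.
# Note that this would change the per-category counts used later in this tutorial.
df_deduped = df.drop_duplicates(subset='arxiv_id')
print(f"Papers after dropping cross-listed duplicates: {len(df_deduped)}")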

Strategy One: Local Open-Source Models

Now we're ready to generate embeddings. Let's start with local models using sentence-transformers, the same approach we used in the previous tutorial. The key advantage of local models is that everything runs on your own machine. There are no API costs, no data leaves your computer, and you have complete control over the embedding process.

We'll use all-MiniLM-L6-v2 again for consistency, and we'll also demonstrate a larger model called all-mpnet-base-v2 to show how different models produce different results:

from sentence_transformers import SentenceTransformer
import numpy as np
import time

# Load the same model from the previous tutorial
print("Loading all-MiniLM-L6-v2 model...")
model_small = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for all abstracts
abstracts = df['abstract'].tolist()

print(f"Generating embeddings for {len(abstracts)} papers...")
start_time = time.time()

# The encode() method handles batching automatically
embeddings_small = model_small.encode(
    abstracts,
    show_progress_bar=True,
    batch_size=32  # Process 32 abstracts at a time
)

elapsed_time = time.time() - start_time

print(f"\nCompleted in {elapsed_time:.2f} seconds")
print(f"Embedding shape: {embeddings_small.shape}")
print(f"Each abstract is now a {embeddings_small.shape[1]}-dimensional vector")
print(f"Average time per abstract: {elapsed_time/len(abstracts):.3f} seconds")

# Add embeddings to our DataFrame
df['embedding_minilm'] = list(embeddings_small)

Loading all-MiniLM-L6-v2 model...
Generating embeddings for 500 papers...
Batches: 100%|██████████| 16/16 [01:05<00:00,  4.09s/it]

Completed in 65.45 seconds
Embedding shape: (500, 384)
Each abstract is now a 384-dimensional vector
Average time per abstract: 0.131 seconds

That was fast! On a typical laptop, we generated embeddings for 500 abstracts in about 65 seconds. Now let's try a larger, more powerful model to see the difference.

Spoiler alert: this will take several more minutes than the last one, so you may want to freshen up your coffee while it's running:

# Load a larger (more dimensions) model
print("\nLoading all-mpnet-base-v2 model...")
model_large = SentenceTransformer('all-mpnet-base-v2')

print("Generating embeddings with larger model...")
start_time = time.time()

embeddings_large = model_large.encode(
    abstracts,
    show_progress_bar=True,
    batch_size=32
)

elapsed_time = time.time() - start_time

print(f"\nCompleted in {elapsed_time:.2f} seconds")
print(f"Embedding shape: {embeddings_large.shape}")
print(f"Each abstract is now a {embeddings_large.shape[1]}-dimensional vector")
print(f"Average time per abstract: {elapsed_time/len(abstracts):.3f} seconds")

# Add these embeddings to our DataFrame too
df['embedding_mpnet'] = list(embeddings_large)
Loading all-mpnet-base-v2 model...
Generating embeddings with larger model...
Batches: 100%|██████████| 16/16 [11:20<00:00, 30.16s/it]

Completed in 680.47 seconds
Embedding shape: (500, 768)
Each abstract is now a 768-dimensional vector
Average time per abstract: 1.361 seconds

Notice the differences between these two models:

  • Dimensionality: The smaller model produces 384-dimensional embeddings, while the larger model produces 768-dimensional embeddings. More dimensions can capture more nuanced semantic information.
  • Speed: The smaller model is about 10 times faster. For 500 papers, that's a difference of about 10 minutes. For thousands of documents, this difference becomes significant.
  • Quality: Larger models generally produce higher-quality embeddings that better capture subtle semantic relationships. However, the smaller model is often good enough for many applications.

The key insight here is that local models give you flexibility. You can choose models that balance quality, speed, and computational resources based on your specific needs. For rapid prototyping, use smaller models. For production systems where quality matters most, use larger models.
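
As a rough illustration of that flexibility, you could hide the model choice behind a small helper so the rest of your pipeline stays the same when you switch models. The profile names and default below are assumptions for this sketch, not something we rely on later in the tutorial:

# A minimal sketch: make the embedding model configurable by "profile"
from sentence_transformers import SentenceTransformer

EMBEDDING_PROFILES = {
    "fast": "all-MiniLM-L6-v2",      # 384 dimensions, quickest
    "quality": "all-mpnet-base-v2",  # 768 dimensions, slower but richer
}

def embed_texts(texts, profile="fast", batch_size=32):
    """Encode a list of texts with the model mapped to the chosen profile."""
    model = SentenceTransformer(EMBEDDING_PROFILES[profile])
    return model.encode(texts, batch_size=batch_size, show_progress_bar=True)

# For prototyping: embed_texts(abstracts)
# For a quality-focused run: embed_texts(abstracts, profile="quality")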

Visualizing Real-World Embeddings

In our previous tutorial, we saw beautifully separated clusters using handcrafted paper abstracts. Let's see what happens when we visualize embeddings from real arXiv papers. We'll use the same PCA approach to reduce our 384-dimensional embeddings down to 2D:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Reduce embeddings from 384 dimensions to 2 dimensions
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings_small)

print(f"Original embedding dimensions: {embeddings_small.shape[1]}")
print(f"Reduced embedding dimensions: {embeddings_2d.shape[1]}")
Original embedding dimensions: 384
Reduced embedding dimensions: 2

Now let's create a visualization showing how our 500 papers cluster by category:

# Create the visualization
plt.figure(figsize=(12, 8))

# Define colors for different categories
colors = ['#C8102E', '#003DA5', '#00843D', '#FF8200', '#6A1B9A']
category_names = ['Machine Learning', 'Computer Vision', 'Comp. Linguistics', 'Databases', 'Software Eng.']
category_codes = ['cs.LG', 'cs.CV', 'cs.CL', 'cs.DB', 'cs.SE']

# Plot each category
for i, (cat_code, cat_name, color) in enumerate(zip(category_codes, category_names, colors)):
    # Get papers from this category
    mask = df['category'] == cat_code
    cat_embeddings = embeddings_2d[mask]

    plt.scatter(cat_embeddings[:, 0], cat_embeddings[:, 1],
                c=color, label=cat_name, s=50, alpha=0.6, edgecolors='black', linewidth=0.5)

plt.xlabel('First Principal Component', fontsize=12)
plt.ylabel('Second Principal Component', fontsize=12)
plt.title('500 arXiv Papers Across Five Computer Science Categories\n(Real-world embeddings show overlapping clusters)',
          fontsize=14, fontweight='bold', pad=20)
plt.legend(loc='best', fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Visualization of embeddings for 500 arXiv Papers Across Five Computer Science Categories

The visualization above reveals an important aspect of real-world data. Unlike our handcrafted examples in the previous tutorial, where clusters were perfectly separated, these real arXiv papers show more overlap. You can see clear groupings, as well as papers that bridge multiple topics. For example, a paper about "deep learning for database query optimization" uses vocabulary from both machine learning and databases, so it might appear between those clusters.

This is exactly what you'll encounter in production systems. Real data is messy, topics overlap, and semantic boundaries are often fuzzy rather than sharp. The embeddings are still capturing meaningful relationships, but the visualization shows the complexity of actual research papers rather than the idealized examples we used for learning.

Strategy Two: API-Based Embedding Services

Local models work great, but they require computational resources and you're responsible for managing them. API-based embedding services offer an alternative approach. You send your text to a cloud provider, they generate embeddings using their infrastructure, and they send the embeddings back to you.

We'll use Cohere's API for our main example because they offer a generous free trial tier that doesn't require payment information. This makes it perfect for learning and experimentation.

Setting Up Cohere Securely

First, you'll need to create a free Cohere account and get an API key:

  1. Visit Cohere's registration page
  2. Sign up for a free account (no credit card required)
  3. Navigate to the API Keys section in your dashboard
  4. Copy your Trial API key

Important security practice: Never hardcode API keys directly in your notebooks or scripts. Store them in a .env file instead. This prevents accidentally sharing sensitive credentials when you share your code.

Create a file named .env in your project directory with the following entry:

COHERE_API_KEY=your_key_here

Important: Add .env to your .gitignore file to prevent committing it to version control.

Now load your API key securely:

from dotenv import load_dotenv
import os
from cohere import ClientV2
import time

# Load environment variables from .env file
load_dotenv()

# Access your API key
cohere_api_key = os.getenv('COHERE_API_KEY')

if not cohere_api_key:
    raise ValueError(
        "COHERE_API_KEY not found. Please create a .env file with your API key.\n"
        "See https://dashboard.cohere.com for instructions on getting your key."
    )

# Initialize the Cohere client using the V2 API
co = ClientV2(api_key=cohere_api_key)
print("API key loaded successfully from environment")
API key loaded successfully from environment

Now let's generate embeddings using the Cohere API. Here's something we discovered through trial and error: when we first ran this code without delays, we hit Cohere's rate limit and got a 429 TooManyRequestsError with this message: "trial token rate limit exceeded, limit is 100000 tokens per minute."

This exposes an important lesson about working with APIs. Rate limits aren't always clearly documented upfront. Sometimes you discover them by running into them, then you have to dig through the error responses in the documentation to understand what happened. In this case, we found the details in Cohere's error responses documentation. You can also check their rate limits page for current limits, though specifics for free tier accounts aren't always listed there.

With 500 papers averaging around 1,337 characters each, we can easily exceed 100,000 tokens per minute if we send batches too quickly. So we've built in two safeguards: a 12-second delay between batches to stay under the limit, and retry logic in case we do hit it. These delays add roughly a minute of waiting on top of the time the API itself needs, but the process is reliable and won't throw errors mid-run.

Think of it as the tradeoff for using a free tier: we get access to powerful models without paying, but we work within some constraints. Let's see it in action:

print("Generating embeddings using Cohere API...")
print(f"Processing {len(abstracts)} abstracts...")

start_time = time.time()
actual_api_time = 0  # Track time spent on actual API calls

# Cohere recommends processing in batches for efficiency
# Their API accepts up to 96 texts per request
batch_size = 90
all_embeddings = []

for i in range(0, len(abstracts), batch_size):
    batch = abstracts[i:i+batch_size]
    batch_num = i//batch_size + 1
    total_batches = (len(abstracts) + batch_size - 1) // batch_size
    print(f"Processing batch {batch_num}/{total_batches} ({len(batch)} abstracts)...")

    # Add retry logic for rate limits
    max_retries = 3
    retry_delay = 60  # Wait 60 seconds if we hit rate limit

    for attempt in range(max_retries):
        try:
            # Track actual API call time
            api_start = time.time()

            # Generate embeddings for this batch using V2 API
            response = co.embed(
                texts=batch,
                model='embed-v4.0',
                input_type='search_document',
                embedding_types=['float']
            )

            actual_api_time += time.time() - api_start
            # V2 API returns embeddings in a different structure
            all_embeddings.extend(response.embeddings.float_)
            break  # Success, move to next batch

        except Exception as e:
            if "rate limit" in str(e).lower() and attempt < max_retries - 1:
                print(f"  Rate limit hit. Waiting {retry_delay} seconds before retry...")
                time.sleep(retry_delay)
            else:
                raise  # Re-raise if it's not a rate limit error or we're out of retries

    # Add a delay between batches to avoid hitting rate limits
    # Wait 12 seconds between batches (spreads 500 papers over ~1 minute)
    if i + batch_size < len(abstracts):  # Don't wait after the last batch
        time.sleep(12)

# Convert to numpy array for consistency with local models
embeddings_cohere = np.array(all_embeddings)
elapsed_time = time.time() - start_time

print(f"\nCompleted in {elapsed_time:.2f} seconds (includes rate limit delays)")
print(f"Actual API processing time: {actual_api_time:.2f} seconds")
print(f"Time spent waiting for rate limits: {elapsed_time - actual_api_time:.2f} seconds")
print(f"Embedding shape: {embeddings_cohere.shape}")
print(f"Each abstract is now a {embeddings_cohere.shape[1]}-dimensional vector")
print(f"Average time per abstract (API only): {actual_api_time/len(abstracts):.3f} seconds")

# Add to DataFrame
df['embedding_cohere'] = list(embeddings_cohere)
Generating embeddings using Cohere API...
Processing 500 abstracts...
Processing batch 1/6 (90 abstracts)...
Processing batch 2/6 (90 abstracts)...
Processing batch 3/6 (90 abstracts)...
Processing batch 4/6 (90 abstracts)...
Processing batch 5/6 (90 abstracts)...
Processing batch 6/6 (50 abstracts)...

Completed in 87.23 seconds (includes rate limit delays)
Actual API processing time: 27.18 seconds
Time spent waiting for rate limits: 60.05 seconds
Embedding shape: (500, 1536)
Each abstract is now a 1536-dimensional vector
Average time per abstract (API only): 0.054 seconds

Notice the timing breakdown? The actual API processing was quite fast (around 27 seconds), but we spent most of our time waiting between batches to respect rate limits (around 60 seconds). This is the reality of free-tier accounts: they're fantastic for learning and prototyping, but come with constraints. Paid tiers would give us much higher limits and let us process at full speed.

Something else worth noting: Cohere's embeddings are 1536-dimensional, which is 4x larger than our small local model (384 dimensions) and 2x larger than our large local model (768 dimensions). Yet the API processing was still faster than our small local model. This demonstrates the power of specialized infrastructure. Cohere runs optimized hardware designed specifically for embedding generation at scale, while our local models run on general-purpose computers. Higher dimensions don't automatically mean slower processing when you have the right infrastructure behind them.

For this tutorial, Cohere’s free tier works perfectly. We're focusing on understanding the concepts and comparing approaches, not optimizing for production speed. The key differences from local models:

  • No local computation: All processing happens on Cohere's servers, so it works equally well on any hardware.
  • Internet dependency: Requires an active internet connection to work.
  • Rate limits: Free tier accounts have token-per-minute limits, which is why we added delays between batches.

Other API Options

While we're using Cohere for this tutorial, you should know about other popular embedding APIs:

OpenAI offers excellent embedding models, but requires payment information upfront. If you have an OpenAI account, their text-embedding-3-small model is very affordable at \$0.02 per 1M tokens. You can find setup instructions in their embeddings documentation.

Together AI provides access to many open-source models through their API. They offer models like BAAI/bge-large-en-v1.5 and detailed documentation in their embeddings guide. Note that their rate limit tiers are subject to change, so check their rate limit documentation to see which tier fits your needs.

The choice between these services depends on your priorities. OpenAI has excellent quality but requires payment setup. Together AI offers many model choices and different paid tiers. Cohere has a truly free tier for learning and prototyping.

Comparing Your Options

Now that we've generated embeddings using both local models and an API service, let's think about how to choose between these approaches for real projects. The decision isn't about one being universally better than the other. It's about matching the approach to your specific constraints and requirements.

To clarify terminology: "self-hosted models" means running models on infrastructure you control, whether that's your laptop for learning or your company's cloud servers for production. "API services" means using third-party providers like Cohere or OpenAI where you send data to their servers for processing.

Cost

  • Self-hosted models: Zero ongoing costs after initial setup. Best for high-volume applications where you'll generate embeddings frequently.
  • API services: Pay-per-use pricing per 1M tokens (Cohere: \$0.12 per 1M tokens; OpenAI: \$0.13 per 1M tokens). Best for low to moderate volume, or when you want predictable costs without infrastructure.

Performance

  • Self-hosted models: Speed depends on your hardware. Our results: 0.131 seconds per abstract (small model), 1.361 seconds per abstract (large model). Best for batch processing or when you control the infrastructure.
  • API services: Speed depends on your internet connection and API server load. Our results: 0.054 seconds per abstract (Cohere), including network latency and third-party infrastructure considerations. Best when you don't have powerful local hardware or need access to the latest models.

Privacy

  • Self-hosted models: All data stays on your infrastructure, with complete control over data handling and no data sent to third parties. Best for sensitive data, healthcare, financial services, or when compliance requires data locality.
  • API services: Data is sent to third-party servers for processing and is subject to the provider's data handling policies. Cohere states that API data isn't used for training (verify current policy). Best for non-sensitive data, or when provider policies meet your requirements.

Customization

  • Self-hosted models: You can fine-tune models on your specific domain, with full control over model selection, updates, and inference parameters. Best for specialized domains, custom requirements, or when you need reproducibility.
  • API services: Limited to the provider's available models, with model updates on the provider's schedule and less control over inference details. Best for general-purpose applications, or when using the latest models matters more than control.

Infrastructure

  • Self-hosted models: You manage the infrastructure. Whether running on your laptop or company cloud servers, you handle model updates, dependencies, and scaling. Best for organizations with existing ML infrastructure or when infrastructure control is important.
  • API services: No infrastructure management needed. The provider scales automatically and manages updates and availability. Best for smaller teams, rapid prototyping, or when you want to focus on application logic rather than infrastructure.

When to Use Each Approach

Here's a practical decision guide to help you choose the right approach for your project:

Choose Self-Hosted Models when you:

  • Process large volumes of text regularly
  • Work with sensitive or regulated data
  • Need offline capability
  • Have existing ML infrastructure (whether local or cloud-based)
  • Want to fine-tune models for your domain
  • Need complete control over the deployment

Choose API Services when you:

  • Are just getting started or prototyping
  • Have unpredictable or variable workload
  • Want to avoid infrastructure management
  • Need automatic scaling
  • Prefer the latest models without maintenance
  • Value simplicity over infrastructure control

For our tutorial series, we've used both approaches to give you hands-on experience with each. In our next tutorial, we'll use the Cohere embeddings for our semantic search implementation. We're choosing Cohere because they offer a generous free tier for learning (no payment required), their models are well-suited for semantic search tasks, and they work consistently across different hardware setups.

In practice, you'd evaluate embedding quality by testing on your specific use case: generate embeddings with different models, run similarity searches on sample queries, and measure which model returns the most relevant results for your domain.
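
Here's a minimal sketch of what such a spot check could look like, reusing the model_small, model_large, embeddings_small, and embeddings_large objects from earlier. The sample query is just an illustration; in a real evaluation you'd use queries and relevance judgments from your own domain:

# Informal quality spot check: embed a sample query with each local model
# and compare which papers it ranks highest
from sklearn.metrics.pairwise import cosine_similarity

query = "detecting anomalies in streaming sensor data"  # example query (an assumption)

for name, model, doc_embeddings in [
    ("all-MiniLM-L6-v2", model_small, embeddings_small),
    ("all-mpnet-base-v2", model_large, embeddings_large),
]:
    query_embedding = model.encode([query])
    scores = cosine_similarity(query_embedding, doc_embeddings)[0]
    top_idx = scores.argsort()[::-1][:3]  # indices of the 3 most similar abstracts

    print(f"\nTop matches for '{query}' using {name}:")
    for i in top_idx:
        print(f"  {scores[i]:.3f}  {df.iloc[i]['title']}")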

Storing Your Embeddings

We've generated embeddings using multiple methods, and now we need to save them for future use. Storing embeddings properly is important because generating them can be time-consuming and potentially costly. You don't want to regenerate embeddings every time you run your code.

Let's explore two storage approaches:

Option 1: CSV with Numpy Arrays

This approach works well for learning and small-scale prototyping:

# Save the metadata to CSV (without embeddings, which are large arrays)
df_metadata = df[['title', 'abstract', 'authors', 'published', 'category', 'arxiv_id', 'abstract_length']]
df_metadata.to_csv('arxiv_papers_metadata.csv', index=False)
print("Saved metadata to 'arxiv_papers_metadata.csv'")

# Save embeddings as numpy arrays
np.save('embeddings_minilm.npy', embeddings_small)
np.save('embeddings_mpnet.npy', embeddings_large)
np.save('embeddings_cohere.npy', embeddings_cohere)
print("Saved embeddings to .npy files")

# Later, you can load them back like this:
# df_loaded = pd.read_csv('arxiv_papers_metadata.csv')
# embeddings_loaded = np.load('embeddings_cohere.npy')
Saved metadata to 'arxiv_papers_metadata.csv'
Saved embeddings to .npy files

This approach is simple and transparent, making it perfect for learning and experimentation. However, it has significant limitations for larger datasets:

  • Loading all embeddings into memory doesn't scale beyond a few thousand documents
  • No indexing for fast similarity search
  • Manual coordination between CSV metadata and numpy arrays

For production systems with thousands or millions of embeddings, you'll want specialized vector databases (Option 2) that handle indexing, similarity search, and efficient storage automatically.
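
Before moving on, here's a small, optional sanity check you could run when reloading the files we just saved, since keeping the CSV rows and the embedding rows aligned is the easiest thing to get wrong with this approach:

# Reload the saved files and confirm the rows still line up
import numpy as np
import pandas as pd

df_loaded = pd.read_csv('arxiv_papers_metadata.csv')
embeddings_loaded = np.load('embeddings_cohere.npy')

# Row i of the CSV corresponds to row i of the embedding matrix
assert len(df_loaded) == len(embeddings_loaded), "Metadata and embeddings have different row counts"

print(f"{len(df_loaded)} metadata rows aligned with embeddings of shape {embeddings_loaded.shape}")
print(df_loaded.iloc[0]['title'])
print(embeddings_loaded[0][:5])  # first five dimensions of that paper's embedding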

Option 2: Preparing for Vector Databases

In production systems, you'll likely store embeddings in a specialized vector database like Pinecone, Weaviate, or Chroma. These databases are optimized for similarity search. While we'll cover vector databases in detail in another tutorial series, here's how you'd structure your data for them:

# Prepare data in a format suitable for vector databases
# Most vector databases want: ID, embedding vector, and metadata

vector_db_data = []
for idx, row in df.iterrows():
    vector_db_data.append({
        'id': row['arxiv_id'],
        'embedding': row['embedding_cohere'].tolist(),  # Convert numpy array to list
        'metadata': {
            'title': row['title'],
            'abstract': row['abstract'][:500],  # Many DBs limit metadata size
            'authors': ', '.join(row['authors'][:3]),  # First 3 authors
            'category': row['category'],
            'published': str(row['published'])
        }
    })

# Save in JSON format for easy loading into vector databases
import json
with open('arxiv_papers_vector_db_format.json', 'w') as f:
    json.dump(vector_db_data, f, indent=2)
print("Saved data in vector database format to 'arxiv_papers_vector_db_format.json'")

print(f"\nTotal storage sizes:")
print(f"  Metadata CSV: ~{os.path.getsize('arxiv_papers_metadata.csv')/1024:.1f} KB")
print(f"  JSON for vector DB: ~{os.path.getsize('arxiv_papers_vector_db_format.json')/1024:.1f} KB")
Saved data in vector database format to 'arxiv_papers_vector_db_format.json'

Total storage sizes:
  Metadata CSV: ~764.6 KB
  JSON for vector DB: ~15051.0 KB
  

Each storage method has its purpose:

  • CSV + numpy: Best for learning and small-scale experimentation
  • JSON in vector-database format: Best as a handoff format when you're loading embeddings into a vector database for production similarity search

Preparing for Semantic Search

You now have 500 research papers from five distinct computer science domains with embeddings that capture their semantic meaning. These embeddings are vectors, which means we can measure how similar or different they are using mathematical distance calculations.

In the next tutorial, you'll use these embeddings to build a search system that finds relevant papers based on meaning rather than keywords. You'll implement similarity calculations, rank results, and see firsthand how semantic search outperforms traditional keyword matching.

Save your embeddings now, especially the Cohere embeddings, since we'll use those in the next tutorial to build our search system. We chose Cohere because the embeddings are generated on Cohere's servers, so everyone gets the same baseline for implementing similarity calculations regardless of their local hardware.

Next Steps

Before we move on, try these experiments to deepen your understanding:

Experiment with different arXiv categories:

  • Try collecting papers from categories like stat.ML (Statistics Machine Learning) or math.OC (Optimization and Control)
  • Use the PCA visualization code to see how these new categories cluster with your existing five
  • Do some categories overlap more than others?

Compare embedding models visually:

  • Generate embeddings for your dataset using all-mpnet-base-v2
  • Create side-by-side PCA visualizations comparing the small model and large model
  • Do the clusters look tighter or more separated with the larger model?

Test different dataset sizes:

  • Collect just 50 papers per category (250 total) and visualize the results
  • Then try 200 papers per category (1000 total)
  • How does dataset size affect the clarity of the clusters?
  • At what point does collection or processing time become noticeable?

Ready to implement similarity search and build a working semantic search engine? The next tutorial will show you how to turn these embeddings into a powerful research discovery tool.


Key Takeaways:

  • Programmatic data collection through APIs like arXiv enables working with real-world datasets
  • Collecting papers from diverse categories (cs.LG, cs.CV, cs.CL, cs.DB, cs.SE) creates semantic clusters for effective search
  • Papers can be cross-listed in multiple arXiv categories, creating natural overlap between related fields
  • Self-hosted embedding models provide zero-cost, private embedding generation with full control over the process
  • API-based embedding services offer high-quality embeddings without infrastructure management
  • Secure credential handling using .env files protects sensitive API keys and tokens
  • Rate limits aren't always clearly documented and are sometimes discovered through trial and error
  • The choice between self-hosted and API approaches depends on cost, privacy, scale, and infrastructure considerations
  • Free tier APIs provide powerful embedding generation for learning, but require handling rate limits and delays that paid tiers avoid
  • Real-world embeddings show more overlap than handcrafted examples, reflecting the complexity of actual data
  • Proper storage of embeddings prevents costly regeneration and enables efficient reuse across projects

Setting Up Your Data Engineering Lab with Docker

28 October 2025 at 21:57

This guide helps you set up a clean, isolated environment for running Dataquest tutorials. While many tutorials work fine directly on your computer, some (particularly those involving data processing tools like PySpark) can run into issues depending on your operating system or existing software setup. The lab environment we'll create ensures everything runs consistently, with the right versions of Python and other tools, without affecting your main system.

What's a Lab Environment?

You can think of this "lab" as a separate workspace just for your Dataquest tutorials. It's a controlled space where you can experiment and test code without affecting your main computer. Just like scientists use labs for experiments, we'll use this development lab to work through tutorials safely.

Benefits for everyone:

  • Windows/Mac users: Avoid errors from system differences. No more "command not found" or PySpark failing to find files
  • Linux users: Get the exact versions of Python and Java needed for tutorials, without conflicting with your system's packages
  • Everyone: Keep your tutorial work separate from personal projects. Your code and files are saved normally, but any packages you install or system changes you make stay contained in the lab

We'll use a tool called Docker to create this isolated workspace. Think of it as having a dedicated computer just for tutorials inside your regular computer. Your files and code are saved just like normal (you can edit them with your favorite editor), but the tutorial environment itself stays clean and separate from everything else on your system.

The lab command you'll use creates this environment, and it mirrors real data engineering workflows (most companies use isolated environments like this to ensure consistency across their teams).

Installing Docker

Docker creates isolated Linux environments on any operating system. This means you'll get a consistent Linux workspace whether you're on Windows, Mac, or even Linux itself. We're using it as a simple tool, so no container orchestration or cloud deployment knowledge is needed.

On Windows:
Download Docker Desktop from docker.com/products/docker-desktop. Run the installer, restart your computer when prompted, and open Docker Desktop. You'll see a whale icon in your system tray when it's running.

Note: Docker Desktop will automatically enable required Windows features like WSL 2. If you see an error about virtualization, you may need to enable it in your computer's BIOS settings. Search online for your computer model + "enable virtualization" for specific steps.

On Mac:
Download Docker Desktop for your chip type (Intel or Apple Silicon) from the same link. Drag Docker to your Applications folder and launch it. You'll see the whale in your menu bar.

On Linux:
You probably already have Docker, but if not, run this command in your terminal:

curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh

Verify it works:
Open your terminal (PowerShell, Terminal, or bash) and run:

docker --version
docker compose version

You should see version numbers for both commands. If you see "command not found," restart your terminal or computer and try again.

Getting the Lab Environment

The lab is already configured in the Dataquest tutorials repository. Clone or download it:

git clone https://github.com/dataquestio/tutorials.git
cd tutorials

If you don't have git, download the repository as a ZIP file from GitHub and extract it.

The repository includes everything you need:

  • Dockerfile - Configures the Linux environment with Python 3.11 and Java (for Spark)
  • docker-compose.yml - Defines the lab setup
  • Tutorial folders with all the code and data

Starting Your Lab

In your IDE’s terminal, ensure you’re in the tutorials folder and start the lab:

docker compose run --rm lab

Note that the first time you run this command, the setup may take 2-5 minutes.

You're now in Linux! Your prompt will change to something like root@abc123:/tutorials#, a Linux command line where the tutorial commands will work as expected.

The --rm flag means the lab cleans itself up when you exit, keeping your system tidy.

Using Your Lab

Once you’re in the lab environment, here's your typical workflow:

1. Navigate to the tutorial you're working on

# See all available tutorials
ls

# Enter a specific tutorial
cd pyspark-etl

2. Install packages as needed
Each tutorial might need different packages:

# For PySpark tutorials
pip install pyspark

# For data manipulation tutorials
pip install pandas numpy

# For database connections
pip install sqlalchemy psycopg2-binary

3. Run the tutorial code

python <script-name>.py

Because the code will run within a standardized Linux environment, you shouldn’t run into setup errors.

4. Edit files normally
The beauty of this setup: you can still use your favorite editor! The tutorials folder on your computer is synchronized with the lab. Edit files in VS Code, PyCharm, or any editor, and the lab sees changes immediately.

5. Exit when done
Type exit or press Ctrl+D to leave the lab. The environment cleans itself up automatically.

Common Workflow Examples

Running a PySpark tutorial:

docker compose run --rm lab
cd pyspark-etl
pip install pyspark pandas
python main.py

Working with Jupyter notebooks:

docker compose run --rm -p 8888:8888 lab
pip install jupyterlab
jupyter lab --ip=0.0.0.0 --allow-root --no-browser
# Open the URL it shows in your browser

Keeping packages installed between sessions:
If you're tired of reinstalling packages, create a requirements file:

# After installing packages, save them
pip freeze > requirements.txt

# Next session, restore them
pip install -r requirements.txt

Quick Reference

The one command you need:

# From the tutorials folder
docker compose run --rm lab

Exit the lab:

exit # Or press Ctrl+D

Where things are:

  • Tutorial code: Each folder in /tutorials
  • Your edits: Automatically synchronized
  • Data files: In each tutorial's data/ folder
  • Output files: Save to the tutorial folder to see them on your computer

Adding services (databases, etc.):
For tutorials needing PostgreSQL, MongoDB, or other services, we can extend the docker-compose.yml. For now, the base setup handles all Python and PySpark tutorials.
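
For example, a hypothetical extension that adds a PostgreSQL service might look like the sketch below. The service name, image tag, and credentials are placeholders for illustration, not part of the repository's actual docker-compose.yml:

# Hypothetical addition to docker-compose.yml (adjust names and credentials)
services:
  lab:
    # ...existing lab configuration stays as-is...
    depends_on:
      - postgres

  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: tutorial
      POSTGRES_PASSWORD: tutorial
      POSTGRES_DB: tutorial_db
    ports:
      - "5432:5432"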

Troubleshooting

  • "Cannot connect to Docker daemon"

    • Docker Desktop needs to be running. Start it from your applications.
  • "docker compose" not recognized

    • Older Docker versions use docker-compose (with a hyphen). Try:

      docker-compose run --rm lab
  • Slow performance on Windows

    • Docker on Windows can be slow with large datasets. For better performance, store data files in the container rather than the mounted folder.
  • "Permission denied" on Linux

    • Add your user to the docker group:

      sudo usermod -aG docker $USER

      Then log out and back in.

You're Ready

You now have a Linux lab environment that matches production systems. Happy experimenting!

How to Become a Data Scientist (Yes, Even in 2025)

28 October 2025 at 16:35

The world is becoming increasingly data-driven. Data is one of the most valuable resources a company can have, but without a data scientist, it’s just numbers.

Businesses everywhere are looking for professionals who can turn raw data into clear insights. According to the U.S. Bureau of Labor Statistics, jobs for data scientists are expected to grow by 34% between 2024 and 2034, much faster than most careers.

Becoming a data scientist takes more than coding or statistics. It’s a mix of math, computer science, business knowledge, and communication skills. This combination makes the role both challenging and in demand.

I know it’s possible to get there. I started with a history degree and later became a machine learning engineer, data science consultant, and founder of Dataquest. With the right plan, you can do it too.

What is a Data Scientist?

A data scientist is someone who uses data to answer questions and solve problems. They collect large amounts of information, clean it, analyze it, and turn it into something actionable.

They use tools like Python, R, and SQL to manage and explore data. They apply statistics, machine learning, and data visualization to find patterns, understand trends, and make predictions.

Some data scientists build tools and systems for users, while others focus on helping businesses make better decisions by predicting future outcomes.

What Do Data Scientists Do?

Data scientists wear many hats. Their work depends on the company and the type of data they handle, but the goal is always the same: to turn data into useful insights that help people make data-driven decisions.

Data science powers everything from the algorithm showing you the next TikTok video to how ChatGPT answers questions to how Netflix recommends shows.

Some data scientist responsibilities include:

  • Collect and clean data from databases, APIs, and spreadsheets to prepare it for analysis.
  • Analyze and explore data to find trends, patterns, and relationships that explain what’s happening.
  • Build machine learning models and make predictions to forecast sales, detect fraud, or recommend products.
  • Visualize and communicate insights through charts and dashboards using tools like Tableau, Matplotlib, or Power BI.
  • Automate and improve systems by creating smarter processes, optimizing marketing campaigns, or building better recommendation engines.

In short, they help businesses make smarter decisions and work faster.

The Wrong and Right Way

When I started learning data science, I followed every online guide I could find, but I ended up bored and without real skills to show for it. It felt like a teacher handing me a pile of books and telling me to read them all.

Eventually, I realized I learn best when I’m solving problems that interest me. So instead of memorizing a checklist of skills, I began building real projects with real data. That approach kept me motivated and mirrored the work I’d actually do as a data scientist.

With that experience, I created Dataquest to help others learn the same way: by doing. But courses alone aren’t enough. To succeed, you need to learn how to think, plan, and execute effectively. This guide will show you how.

How to Become a Data Scientist:

  • Step 1: Earn a Degree (Recommended, Not Required)
  • Step 2: Learn the Core Skills
  • Step 3: Question Everything and Find Your Niche
  • Step 4: Build Projects
  • Step 5: Share Your Work
  • Step 6: Learn From Others
  • Step 7: Push Your Boundaries
  • Step 8: Start Looking for a Job

Now, let’s go over each of these one by one.

Step 1: Earn a Degree (Recommended, Not Required)

Most data scientists start with a degree in a technical field. According to Zippia, 51% of data scientists hold a bachelor’s degree, 34% a master’s, and 13% a doctorate.

A degree helps you build a solid foundation in math, statistics, and programming. It also shows employers that you can handle complex concepts and long-term projects.

Relevant degrees include computer science, statistics, mathematics, data science, or engineering.

If university isn’t an option, you can still learn online. Platforms like Dataquest, Coursera, edX, and Google Career Certificates have trusted online courses and programs that teach the same essential skills through practical, hands-on projects.

Step 2: Learn the Core Skills

Even if you can’t study at a university or enroll in a course, the internet and books offer everything you need to get started. So, let’s look at what you should learn.

If you come from a computer science background, many concepts like algorithms, logic, and data structures will feel familiar. If not, Python is a great starting point because it teaches those fundamentals in a practical way.

1. Programming languages

Start with Python. It’s beginner-friendly and powerful for data analysis, machine learning, and automation.

Learn how to:

  • Write basic code (variables, loops, functions)
  • Use data science libraries like pandas, NumPy, and Matplotlib
  • Work with raw data files (e.g., CSVs and JSON) and collect data via APIs

Once you’re comfortable with Python, consider learning R for statistics and SQL for managing and querying databases.

Helpful guides:

  1. How to learn Python
  2. How to learn R
  3. How to learn SQL

2. Math and Statistics

A strong understanding of math and statistics is essential in data science. It helps you make sense of data and build accurate models.

Focus on:

3. Data Handling and Visualization

Being able to clean, organize, and visualize data is a key part of any data scientist’s toolkit. These skills help you turn raw data into clear insights that others can easily understand.

You’ll use tools like Excel, Tableau, or Power BI to build dashboards and reports, and Python libraries like pandas and Matplotlib for deeper analysis and visualization.

Here are some learning paths to guide you:

4. Core Concepts

Once you’ve built a solid technical foundation, it’s time to understand how these skills fit into the bigger picture.

  • How machine learning models work
  • How to ask business questions and measure results
  • How to translate data insights into real business impact

Step 3: Question Everything and Find Your Niche

The data science and data analytics field is appealing because you get to answer interesting questions using actual data and code. These questions can range from “Can I predict whether a flight will be on time?” to “How much does the U.S. spend per student on education?”

To answer these questions, you need to develop an analytical mindset.

The best way to develop this mindset is to start by analyzing news articles. First, find a news article that discusses data. Here are two great examples: Can Running Make You Smarter? or Is Sugar Really Bad for You?

Then, think about the following:

  • How they reach their conclusions given the data they discuss
  • How you might design a study to investigate further
  • What questions you might want to ask if you had access to the underlying data

Some articles, like this one on gun deaths in the U.S. and this one on online communities supporting Donald Trump, actually have the underlying data available for download. This allows you to explore even deeper.

You could do the following:

  • Download the data, and open it in Excel or an equivalent tool
  • See what patterns you can find in the data by eyeballing it
  • Does the data support the conclusions of the article? Why or why not?
  • What additional questions can you use the data to answer?

Here are some good places to find data-driven articles:

Think About What You’re Interested In

After a few weeks of reading articles, reflect on whether you enjoyed coming up with questions and answering them. Becoming a data scientist is a long road, and you need to be passionate about the field to make it all the way. Which industry attracts you the most?

Perhaps you don't enjoy the process of coming up with questions in the abstract, but maybe you enjoy analyzing healthcare or finance data. Find what you're passionate about, and then start viewing it through an analytical lens.

Personally, I was very interested in stock market data, which motivated me to build a model to predict the market.

If you want to put in the months of hard work necessary to learn data science, working on something you’re passionate about will help you stay motivated when you face setbacks.

Step 4: Build Projects

As you’re learning the basics of coding, start applying your knowledge to get practical experience. Coursework isn't enough. Projects help you practice real-world techniques and develop the practical skills employers look for in the job market. It's a great way to test your knowledge.

Your projects don’t have to be complex. For example, you could analyze Super Bowl winners to find patterns, study weather data to predict rainfall, or explore movie ratings to see what drives popularity. The goal is to take an interesting dataset, ask good questions, and use code to answer them.

As you build projects, keep these points in mind:

  • Most real-world data science work involves data cleaning and preparation.
  • Simple machine learning algorithms like linear regression or decision trees are powerful starting points.
  • Focus on improving how you handle messy data, visualize insights, and communicate results. These are the techniques that make you stand out.
  • Everyone starts somewhere. Even small projects can show your creativity, logic, and problem-solving skills.

Building projects early helps you get practical experience that will make your portfolio much stronger when entering the job market.

As you're learning the basics of data science, start building projects that answer interesting questions and showcase your data science skills.

If you need help finding free datasets for your projects, we've got you covered!

Where to Find Project Ideas

Not only does building projects help you practice your skills and understand real data science work, it also helps you build a portfolio to show potential employers.

Here are some more detailed guides on building projects on your own:

Additionally, most of Dataquest’s courses contain interactive projects that you can complete while you’re learning.

Here are just a few examples:

  • Profitable App Profiles for the App Store and Google Play Markets — Explore the app market to see what makes an app successful on both iOS and Android. You’ll analyze real data and find out why some book-based apps perform better than others.
  • Exploring Hacker News Posts — Analyze a dataset of posts from Hacker News, a popular tech community, and find out which kinds of discussions get the most attention.
  • Exploring eBay Car Sales Data — Use Python to work with a scraped dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
  • Star Wars Survey — Analyze survey data from Star Wars fans and find fun patterns, like which movie is the most loved (or the most hated).
  • Analyzing NYC High School Data — Explore how different factors like income and race relate to SAT scores using scatter plots and maps.
  • Classifying Heart Disease — Go through the complete machine learning workflow of data exploration, data splitting, model creation, and model evaluation to develop a logistic regression classifier for detecting heart disease.

Our students have fun while practicing with these projects. Online courses don’t have to be boring.

Take It Up a Notch

After a few small projects, it’s time to level up! Start adding more complexity to your work so you can learn advanced topics. The key is to choose projects in an area that interests you.

For example, since I was interested in the stock market, I focused on predictive modeling. As your skills grow, you can make your projects more detailed, like using minute-by-minute data or improving prediction accuracy.

Check out our article on Python project ideas for more inspiration.

Step 5: Share Your Work

Once you've built a few data science projects, share them with others on GitHub! This might just be the way to find internships!

Here’s why:

  • It makes you think about how to best present your projects, which is what you'd do in a data science role.
  • They allow your peers to view your projects and provide feedback.
  • They allow employers to view your projects.

Helpful resources about project portfolios:

Start a Simple Blog

Besides uploading projects to GitHub, start a blog. Writing about what you learn helps you understand topics better and spot what you’ve missed. Teaching others is one of the fastest ways to master a concept.

When I was learning data science, writing blog posts helped me do the following:

  • Capture interest from recruiters
  • Learn concepts more thoroughly (the process of teaching really helps you learn)
  • Connect with peers

You can write about:

  • Explaining data science concepts in simple terms
  • Walking through your projects and findings
  • Sharing your learning journey

Here’s an example of a visualization I made on my blog many years ago that tries to answer the question: do the Simpsons characters like each other?

Step 6: Learn From Others

After you've started to build an online presence, it's a good idea to start engaging with other data scientists. You can do this in-person or in online communities.

Here are some good online communities:

Here at Dataquest, we have an online community where learners can receive feedback on projects, discuss tough data-related problems, and build relationships with data professionals.

Personally, I was very active on Quora and Kaggle when I was learning, which helped me immensely.

Engaging in online communities is a good way to do the following:

  • Find other people to learn with
  • Enhance your profile and find opportunities
  • Strengthen your knowledge by learning from others

You can also engage with people in person through Meetups. In-person engagement can help you meet and learn from more experienced data scientists in your area. Take all the opportunities to learn.

Step 7: Push Your Boundaries

What kind of data scientists do organizations want to hire? The ones who find critical insights that save money or make customers happier. Apply the same mindset to your learning: keep searching for new questions to answer, and keep tackling harder, more complex ones.

If you look back on your projects from a month or two ago, and you don’t see room for improvement, you probably aren't pushing your boundaries enough. You should be making strong progress every month, and your work should reflect that. Interesting projects will make you stand out among applicants.

Here are some ways to push your boundaries and learn data science faster:

  • Try working with a larger dataset
  • Start a data science project that requires knowledge you don't have
  • Try making your project run faster
  • Teach what you did in a project to someone else

Step 8: Start Looking for a Job

Once you’ve built a few projects and learned the core skills, it’s time to start applying, not “someday,” but now. Don’t wait until you feel completely ready. The truth is, no one ever does.

Start with internships, entry-level roles, or freelance gigs. These give you real-world experience and help you understand how data science works in a business setting. Even if the job description looks intimidating, apply anyway. Many employers list “ideal” requirements, not must-haves.

Don’t get stuck studying forever. The best learning happens on the job. Every interview, every project, and every rejection teaches you something new.

You never know, the opportunity that looks like a long shot might be the one that launches your data science career. The more practical experience you gain, the deeper your knowledge becomes.

Becoming a Data Scientist FAQs

I know what you might be thinking: Is it still worth pursuing a career in data science? Will AI replace data scientists, or will the role evolve with it? What skills do I actually need to keep up?

I get these questions a lot, and since I was once in your shoes, let me share what I’ve learned and help you find the right path.

Is data science still a good career choice?

Yes, a data science career is still a fantastic choice. Demand for data scientists is high, and the world is generating a massive (and increasing) amount of data every day.

We don't claim to have a crystal ball or know what the future holds, but data science is a fast-growing field with high demand and lucrative salaries.

Will AI replace data scientists?

AI won’t replace data scientists, but it will definitely change what they do. As AI tools become more advanced, data scientists will use them to make decisions faster and with greater accuracy. Instead of doing only technical work, they’ll focus more on strategy and big-picture analysis.

Data scientists will also work closely with AI engineers and machine learning specialists to develop and improve AI models. This includes tasks like choosing the right algorithms, engineering features, and making sure systems are fair and reliable.

To stay relevant, data scientists will need to expand their skills into areas such as machine learning, deep learning, and natural language processing. They’ll also play an important role in ethical AI, helping prevent bias, protect data privacy, and promote responsible use of technology.

Continuous learning will be essential as the field evolves, but AI isn’t replacing data scientists. It’s helping them become even more powerful problem solvers.

What are the AI skills a data scientist needs?

Every data scientist should have a knowledge of the basics, but as artificial intelligence becomes part of nearly every industry, learning AI-related skills is essential.

Start with a strong understanding of machine learning and the ability to use deep learning frameworks like TensorFlow and PyTorch. Learn natural language processing (NLP) for analyzing text data, and make sure you understand AI ethics, especially how to recognize and reduce bias in models.

It also helps to be comfortable with AI development tools and libraries, build some data engineering skills, and learn to work effectively in cross-disciplinary teams.

Continuous learning is key. AI evolves quickly, and the best data scientists keep experimenting, exploring new methods, and adapting their skills to stay ahead.

You’ve Got This!

Studying to become a data scientist or data engineer isn't easy, but the key is to stay motivated and enjoy what you're doing. If you're consistently building projects and sharing them, you'll build your expertise and get the data scientist job that you want.

After years of being frustrated with how conventional sites taught data science, I created Dataquest, a better way to learn data science online. Dataquest solves the problems of MOOCs, where you never know what course to take next, and you're never motivated by what you're learning.

Dataquest is just the lessons I've learned from helping thousands of people learn data science, and it focuses on making the learning experience engaging. Here, you'll build dozens of projects, and you'll learn all the skills you need to be a successful data scientist. Dataquest students have been hired at companies like Accenture and SpaceX.

I wish you all the best on your path to becoming a data scientist!

Understanding, Generating, and Visualizing Embeddings

27 October 2025 at 23:01

Imagine you're searching through a massive library of data science papers looking for content about "cleaning messy datasets." A traditional keyword search returns papers that literally contain those exact words. But it completely misses an excellent paper about "handling missing values and duplicates" and another about "data validation techniques." Even though these papers teach exactly what you're looking for, you'll never see them because they're using different words.

This is the fundamental problem with keyword-based searches: they match words, not meaning. When you search for "neural network training," it won't connect you to papers about "optimizing deep learning models" or "improving model convergence," despite these being essentially the same topic.

Embeddings solve this by teaching machines to understand meaning instead of just matching text. And if you're serious about building AI systems, generating embeddings is a fundamental concept you need to master.

What Are Embeddings?

Embeddings are numerical representations that capture semantic meaning. Instead of treating text as a collection of words to match, embeddings convert text into vectors (a list of numbers) where similar meanings produce similar vectors. Think of it like translating human language into a mathematical language that computers can understand and compare.

When we convert two pieces of text that mean similar things into embeddings, those embedding vectors will be mathematically close to each other in the embedding space. Think of the embedding space as a multi-dimensional map where meaning determines location. Papers about machine learning will cluster together. Papers about data cleaning will form their own group. And papers about data visualization? They'll gather in a completely different region. In a moment, we'll create a visualization that clearly demonstrates this.
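
Here's a tiny preview of that idea in code, using the sentence-transformers and scikit-learn libraries we'll install in the next section. The example sentences are just illustrations, and the exact similarity values will vary; what matters is the relative ordering:

# A quick preview: similar meanings produce vectors that are close together
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "cleaning messy datasets",                    # our search phrase
    "handling missing values and duplicates",     # same idea, different words
    "training a convolutional neural network",    # unrelated topic
]
vectors = model.encode(sentences)

print("Similar topics:  ", cosine_similarity([vectors[0]], [vectors[1]])[0][0])
print("Different topics:", cosine_similarity([vectors[0]], [vectors[2]])[0][0])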

Setting Up Your Environment

Before we start working directly with embeddings, let's install the libraries we'll need. We'll use sentence-transformers from Hugging Face to generate embeddings, sklearn for dimensionality reduction, matplotlib for visualization, and numpy to handle the numerical arrays we'll be working with.

This tutorial was developed using Python 3.12.12 with the following library versions. You can use these exact versions for guaranteed compatibility, or install the latest versions (which should work just fine):

# Developed with: Python 3.12.12
# sentence-transformers==5.1.1
# scikit-learn==1.6.1
# matplotlib==3.10.0
# numpy==2.0.2

pip install sentence-transformers scikit-learn matplotlib numpy

Run the command above in your terminal to install all required libraries. This will work whether you're using a Python script, Jupyter notebook, or any other development environment.

For this tutorial series, we'll work with research paper abstracts from arXiv.org, a repository where researchers publish cutting-edge AI and machine learning papers. If you're building AI systems, arXiv is a great resource to have. It's where you'll find the latest research on new architectures, techniques, and approaches that can help you implement the latest techniques in your projects.

arXiv is pronounced as "archive" because the X represents the Greek letter chi ⟨χ⟩.

For this tutorial, we've manually created 12 abstracts for papers spanning machine learning, data engineering, and data visualization. These abstracts are stored directly in our code as Python strings, keeping things simple for now. We'll work with APIs and larger datasets in the next tutorial to automate this process.

# Abstracts from three data science domains
papers = [
    # Machine Learning Papers
    {
        'title': 'Building Your First Neural Network with PyTorch',
        'abstract': '''Learn how to construct and train a neural network from scratch using PyTorch. This paper covers the fundamentals of defining layers, activation functions, and forward propagation. You'll build a multi-layer perceptron for classification tasks, understand backpropagation, and implement gradient descent optimization. By the end, you'll have a working model that achieves over 90% accuracy on the MNIST dataset.'''
    },
    {
        'title': 'Preventing Overfitting: Regularization Techniques Explained',
        'abstract': '''Overfitting is one of the most common challenges in machine learning. This guide explores practical regularization methods including L1 and L2 regularization, dropout layers, and early stopping. You'll learn how to detect overfitting by monitoring training and validation loss, implement regularization in both scikit-learn and TensorFlow, and tune regularization hyperparameters to improve model generalization on unseen data.'''
    },
    {
        'title': 'Hyperparameter Tuning with Grid Search and Random Search',
        'abstract': '''Selecting optimal hyperparameters can dramatically improve model performance. This paper demonstrates systematic approaches to hyperparameter optimization using grid search and random search. You'll learn how to define hyperparameter spaces, implement cross-validation during tuning, and use scikit-learn's GridSearchCV and RandomizedSearchCV. We'll compare both methods and discuss when to use each approach for efficient model optimization.'''
    },
    {
        'title': 'Transfer Learning: Using Pre-trained Models for Image Classification',
        'abstract': '''Transfer learning lets you leverage pre-trained models to solve new problems with limited data. This paper shows how to use pre-trained convolutional neural networks like ResNet and VGG for custom image classification tasks. You'll learn how to freeze layers, fine-tune network weights, and adapt pre-trained models to your specific domain. We'll build a classifier that achieves high accuracy with just a few hundred training images.'''
    },

    # Data Engineering/ETL Papers
    {
        'title': 'Handling Missing Data: Strategies and Best Practices',
        'abstract': '''Missing data can derail your analysis if not handled properly. This comprehensive guide covers detection methods for missing values, statistical techniques for understanding missingness patterns, and practical imputation strategies. You'll learn when to use mean imputation, forward fill, and more sophisticated approaches like KNN imputation. We'll work through real datasets with missing values and implement robust solutions using pandas.'''
    },
    {
        'title': 'Data Validation Techniques for ETL Pipelines',
        'abstract': '''Building reliable data pipelines requires thorough validation at every stage. This paper teaches you how to implement data quality checks, define validation rules, and catch errors before they propagate downstream. You'll learn schema validation, outlier detection, and referential integrity checks. We'll build a validation framework using Great Expectations and integrate it into an automated ETL workflow for production data systems.'''
    },
    {
        'title': 'Cleaning Messy CSV Files: A Practical Guide',
        'abstract': '''Real-world CSV files are rarely clean and analysis-ready. This hands-on paper walks through common data quality issues: inconsistent formatting, duplicate records, invalid entries, and encoding problems. You'll master pandas techniques for standardizing column names, removing duplicates, handling date parsing errors, and dealing with mixed data types. We'll transform a messy CSV with multiple quality issues into a clean dataset ready for analysis.'''
    },
    {
        'title': 'Building Scalable ETL Workflows with Apache Airflow',
        'abstract': '''Apache Airflow helps you build, schedule, and monitor complex data pipelines. This paper introduces Airflow's core concepts including DAGs, operators, and task dependencies. You'll learn how to define pipeline workflows, implement retry logic and error handling, and schedule jobs for automated execution. We'll build a complete ETL pipeline that extracts data from APIs, transforms it, and loads it into a data warehouse on a daily schedule.'''
    },

    # Data Visualization Papers
    {
        'title': 'Creating Interactive Dashboards with Plotly Dash',
        'abstract': '''Interactive dashboards make data exploration intuitive and engaging. This paper teaches you how to build web-based dashboards using Plotly Dash. You'll learn to create interactive charts with dropdowns, sliders, and date pickers, implement callbacks for dynamic updates, and design responsive layouts. We'll build a complete dashboard for exploring sales data with filters, multiple chart types, and real-time updates.'''
    },
    {
        'title': 'Matplotlib Best Practices: Making Publication-Quality Plots',
        'abstract': '''Creating clear, professional visualizations requires attention to design principles. This guide covers matplotlib best practices for publication-quality plots. You'll learn about color palette selection, font sizing and typography, axis formatting, and legend placement. We'll explore techniques for reducing chart clutter, choosing appropriate chart types, and creating consistent styling across multiple plots for research papers and presentations.'''
    },
    {
        'title': 'Data Storytelling: Designing Effective Visualizations',
        'abstract': '''Good visualizations tell a story and guide viewers to insights. This paper focuses on the principles of visual storytelling and effective chart design. You'll learn how to choose the right visualization for your data, apply pre-attentive attributes to highlight key information, and structure narratives through sequential visualizations. We'll analyze both effective and ineffective visualizations, discussing what makes certain design choices successful.'''
    },
    {
        'title': 'Building Real-Time Visualization Streams with Bokeh',
        'abstract': '''Visualizing streaming data requires specialized techniques and tools. This paper demonstrates how to create real-time updating visualizations using Bokeh. You'll learn to implement streaming data sources, update plots dynamically as new data arrives, and optimize performance for continuous updates. We'll build a live monitoring dashboard that displays streaming sensor data with smoothly updating line charts and real-time statistics.'''
    }
]

print(f"Loaded {len(papers)} paper abstracts")
print(f"Topics covered: Machine Learning, Data Engineering, and Data Visualization")
Loaded 12 paper abstracts
Topics covered: Machine Learning, Data Engineering, and Data Visualization

Generating Your First Embeddings

Now let's transform these paper abstracts into embeddings. We'll use a pre-trained model from the sentence-transformers library called all-MiniLM-L6-v2. We're using this model because it's fast and efficient for learning purposes, perfect for understanding the core concepts. In our next tutorial, we'll explore more recent production-grade models used in real-world applications.

The model will convert each abstract into an n-dimensional vector, where the value of n depends on the specific model architecture. Different embedding models produce vectors of different sizes. Some models create compact 128-dimensional embeddings, while others produce larger 768 or even 1024-dimensional vectors. Generally, larger embeddings can capture more nuanced semantic information, but they also require more computational resources and storage space.

Think of each dimension in the vector as capturing some aspect of the text's meaning. Maybe one dimension responds strongly to "machine learning" concepts, another to "data cleaning" terminology, and another to "visualization" language. The model learned these representations automatically during training.

Let's see what dimensionality our specific model produces.

from sentence_transformers import SentenceTransformer

# Load the pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract just the abstracts for embedding
abstracts = [paper['abstract'] for paper in papers]

# Generate embeddings for all abstracts
embeddings = model.encode(abstracts)

# Let's examine what we've created
print(f"Shape of embeddings: {embeddings.shape}")
print(f"Each abstract is represented by a vector of {embeddings.shape[1]} numbers")
print(f"\nFirst few values of the first embedding:")
print(embeddings[0][:10])
Shape of embeddings: (12, 384)
Each abstract is represented by a vector of 384 numbers

First few values of the first embedding:
[-0.06071806 -0.13064863  0.00328695 -0.04209436 -0.03220841  0.02034248
  0.0042156  -0.01300791 -0.1026612  -0.04565621]

Perfect! We now have 12 embeddings, one for each paper abstract. Each embedding is a 384-dimensional vector, represented as a NumPy array of floating-point numbers.

These numbers might look random at first, but they encode meaningful information about the semantic content of each abstract. When we want to find similar documents, we measure the cosine similarity between their embedding vectors. Cosine similarity looks at the angle between vectors. Vectors pointing in similar directions (representing similar meanings) have high cosine similarity, even if their magnitudes differ. In a later tutorial, we'll compute vector similarity using cosine, Euclidean, and dot-product methods to compare different approaches.
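
If you'd like a quick preview before then, here's a minimal sketch using scikit-learn's cosine_similarity (already installed above) and the embeddings we just generated. The paper indices are just illustrative picks: papers 1 and 4 are both machine learning abstracts, while paper 10 is about matplotlib. Your exact scores will vary, but the same-topic pair should score noticeably higher than the cross-topic pair.

from sklearn.metrics.pairwise import cosine_similarity

# Compare a same-topic pair against a cross-topic pair
ml_vs_ml = cosine_similarity([embeddings[0]], [embeddings[3]])[0][0]   # PyTorch vs. transfer learning
ml_vs_viz = cosine_similarity([embeddings[0]], [embeddings[9]])[0][0]  # PyTorch vs. matplotlib

print(f"Neural networks vs. transfer learning: {ml_vs_ml:.3f}")
print(f"Neural networks vs. matplotlib plots:  {ml_vs_viz:.3f}")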

Before we move on, let's verify we can retrieve the original text:

# Let's look at one paper and its embedding
print("Paper title:", papers[0]['title'])
print("\nAbstract:", papers[0]['abstract'][:100] + "...")
print("\nEmbedding shape:", embeddings[0].shape)
print("Embedding type:", type(embeddings[0]))
Paper title: Building Your First Neural Network with PyTorch

Abstract: Learn how to construct and train a neural network from scratch using PyTorch. This paper covers the ...

Embedding shape: (384,)
Embedding type: <class 'numpy.ndarray'>

Great! We can still access the original paper text alongside its embedding. Throughout this tutorial, we'll work with these embeddings while keeping track of which paper each one represents.

Making Sense of High-Dimensional Spaces

We now have 12 vectors, each with 384 dimensions. But here's the issue: humans can't visualize 384-dimensional space. We struggle to imagine even four dimensions! To understand what our embeddings have learned, we need to reduce them to two dimensions so that we can plot them on a graph.

This is where dimensionality reduction comes in. We'll use Principal Component Analysis (PCA), a technique that finds the two dimensions capturing the most variation in our data. It's like taking a 3D object and finding the best angle to photograph it in 2D while preserving as much information as possible.

While we're definitely going to lose some detail during this compression, our original 384-dimensional embeddings capture rich, nuanced information about semantic meaning. When we squeeze them down to 2D, some subtleties are bound to get lost. But the major patterns (which papers belong to which topic) will still be clearly visible.

from sklearn.decomposition import PCA

# Reduce embeddings from 384 dimensions to 2 dimensions
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

print(f"Original embedding dimensions: {embeddings.shape[1]}")
print(f"Reduced embedding dimensions: {embeddings_2d.shape[1]}")
print(f"\nVariance explained by these 2 dimensions: {pca.explained_variance_ratio_.sum():.2%}")
Original embedding dimensions: 384
Reduced embedding dimensions: 2

Variance explained by these 2 dimensions: 41.20%

The variance explained tells us how much of the variation in the original data is preserved in these 2 dimensions. Think of it this way: if all our papers were identical, they'd have zero variance. The more different they are, the more variance. We've kept about 41% of that variation, which is plenty to see the major patterns. The clustering itself depends on whether papers use similar vocabulary, not on how much variance we've retained. So even though 41% might seem relatively low, the major patterns separating different topics will still be very clear in our embedding visualization.
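
If you're curious how quickly that number grows, here's a small optional check (assuming the embeddings array from above) that fits PCA with more components and prints the cumulative variance explained. It isn't required for the visualization; it simply shows that each additional dimension preserves a bit more of the original information.

from sklearn.decomposition import PCA

# Optional: see how cumulative explained variance grows with more components
pca_check = PCA(n_components=10)
pca_check.fit(embeddings)

cumulative = pca_check.explained_variance_ratio_.cumsum()
for n, total in enumerate(cumulative, start=1):
    print(f"{n} components: {total:.2%} of variance explained")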

Understanding Our Tutorial Topics

Before we create our embeddings visualization, let's see how the 12 papers are organized by topic. This will help us understand the patterns we're about to see in the embeddings:

# Print papers grouped by topic
print("=" * 80)
print("PAPER REFERENCE GUIDE")
print("=" * 80)

topics = [
    ("Machine Learning", list(range(0, 4))),
    ("Data Engineering/ETL", list(range(4, 8))),
    ("Data Visualization", list(range(8, 12)))
]

for topic_name, indices in topics:
    print(f"\n{topic_name}:")
    print("-" * 80)
    for idx in indices:
        print(f"  Paper {idx+1}: {papers[idx]['title']}")
================================================================================
PAPER REFERENCE GUIDE
================================================================================

Machine Learning:
--------------------------------------------------------------------------------
  Paper 1: Building Your First Neural Network with PyTorch
  Paper 2: Preventing Overfitting: Regularization Techniques Explained
  Paper 3: Hyperparameter Tuning with Grid Search and Random Search
  Paper 4: Transfer Learning: Using Pre-trained Models for Image Classification

Data Engineering/ETL:
--------------------------------------------------------------------------------
  Paper 5: Handling Missing Data: Strategies and Best Practices
  Paper 6: Data Validation Techniques for ETL Pipelines
  Paper 7: Cleaning Messy CSV Files: A Practical Guide
  Paper 8: Building Scalable ETL Workflows with Apache Airflow

Data Visualization:
--------------------------------------------------------------------------------
  Paper 9: Creating Interactive Dashboards with Plotly Dash
  Paper 10: Matplotlib Best Practices: Making Publication-Quality Plots
  Paper 11: Data Storytelling: Designing Effective Visualizations
  Paper 12: Building Real-Time Visualization Streams with Bokeh

Now that we know which papers belong to which topic, let's visualize their embeddings.

Visualizing Embeddings to Reveal Relationships

We're going to create a scatter plot where each point represents one paper abstract. We'll color-code them by topic so we can see how the embeddings naturally group similar content together.

import matplotlib.pyplot as plt
import numpy as np

# Create the visualization
plt.figure(figsize=(8, 6))

# Define colors for different topics
colors = ['#0066CC', '#CC0099', '#FF6600']
categories = ['Machine Learning', 'Data Engineering/ETL', 'Data Visualization']

# Create color mapping for each paper
color_map = []
for i in range(12):
    if i < 4:
        color_map.append(colors[0])  # Machine Learning
    elif i < 8:
        color_map.append(colors[1])  # Data Engineering
    else:
        color_map.append(colors[2])  # Data Visualization

# Plot each paper
for i, (x, y) in enumerate(embeddings_2d):
    plt.scatter(x, y, c=color_map[i], s=275, alpha=0.7, edgecolors='black', linewidth=1)
    # Add paper numbers as labels
    plt.annotate(str(i+1), (x, y), fontsize=10, fontweight='bold',
                ha='center', va='center')

plt.xlabel('First Principal Component', fontsize=14)
plt.ylabel('Second Principal Component', fontsize=14)
plt.title('Paper Embeddings from Three Data Science Topics\n(Papers close together have similar semantic meaning)',
          fontsize=15, fontweight='bold', pad=20)

# Add a legend showing which colors represent which topics
legend_elements = [plt.Line2D([0], [0], marker='o', color='w',
                              markerfacecolor=colors[i], markersize=12,
                              label=categories[i]) for i in range(len(categories))]
plt.legend(handles=legend_elements, loc='best', fontsize=12)

plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

What the Visualization Reveals About Semantic Similarity

Take a look at the visualization below that was generated using the code above. As you can see, the results are pretty striking! The embeddings have naturally organized themselves into three distinct regions based purely on semantic content.

Keep in mind that we deliberately chose papers from very distinct topics to make the clustering crystal clear. This is perfect for learning, but real-world datasets are messier. When you're working with papers that bridge multiple topics or have overlapping vocabulary, you'll see more gradual transitions between clusters rather than these sharp separations. We'll encounter that reality in the next tutorial when we work with hundreds of real arXiv papers.

Paper Embeddings from Data Science Topics

  • The Machine Learning cluster (blue, papers 1-4) dominates the lower-left side of the plot. These four points sit close together because they all discuss neural networks, training, and model optimization. Look at papers 1 and 4. They're positioned very near each other despite one focusing on building networks from scratch and the other on transfer learning. The embedding model recognizes that they both use the core language of deep learning: layers, weights, training, and model architectures.
  • The Data Engineering/ETL cluster (magenta, papers 5-8) occupies the upper portion of the plot. These papers share vocabulary around data quality, pipelines, and validation. Notice how papers 5, 6, and 7 form a tight grouping. They all discuss data quality issues using terms like "missing values," "validation," and "cleaning." Paper 8 (about Airflow) sits slightly apart from the others, which makes sense: while it's definitely about data engineering, it focuses more on workflow orchestration than data quality, giving it a slightly different semantic fingerprint.
  • The Data Visualization cluster (orange, papers 9-12) is gathered on the lower-right side. These four papers are packed close together because they all use visualization-specific vocabulary: "charts," "dashboards," "colors," and "interactive elements." The tight clustering here shows just how distinct visualization terminology is from both ML and data engineering language.

What's remarkable is the clear separation between all three clusters. The distance between the ML papers on the left and the visualization papers on the right tells us that these topics use fundamentally different vocabularies. There's minimal semantic overlap between "neural networks" and "dashboards," so they end up far apart in the embedding space.

How the Model Learned to Understand Meaning

The all-MiniLM-L6-v2 embedding model was trained on millions of text pairs, learning which words tend to appear together. When it sees an abstract full of words like "layers," "training," and "optimization," it produces an embedding vector that's mathematically similar to other texts with that same vocabulary pattern. The clustering emerges naturally from those learned associations.

Why This Matters for Your Work as an AI Engineer

Embeddings are foundational to the modern AI systems you'll build as an AI Engineer. Let's look at how embeddings enable the core technologies you'll work with:

  1. Building Intelligent Search Systems

    Traditional keyword search has a fundamental limitation: it can only find exact matches. If a user searches for "handling null values," they won't find documents about "missing data strategies" or "imputation techniques," even though these are exactly what they need. Embeddings solve this by understanding semantic similarity. When you embed both the search query and your documents, you can find relevant content based on meaning rather than word matching. The result is a search system that actually understands what you're looking for (a minimal sketch of this idea follows this list).

  2. Working with Vector Databases

    Vector databases are specialized databases that are built to store and query embeddings efficiently. Instead of SQL queries that match exact values, vector databases let you ask "find me all documents similar to this one" and get results ranked by semantic similarity. They're optimized for the mathematical operations that embeddings require, like calculating distances between high-dimensional vectors, which makes them essential infrastructure for AI applications. Modern systems often use hybrid search approaches that combine semantic similarity with traditional keyword matching to get the best of both worlds.

  3. Implementing Retrieval-Augmented Generation (RAG)

    RAG systems are one of the most powerful patterns in modern AI engineering. Here's how they work: you embed a large collection of documents (like company documentation, research papers, or knowledge bases). When a user asks a question, you embed their question and use that embedding to find the most relevant documents from your collection. Then you pass those documents to a language model, which generates an informed answer grounded in your specific data. Embeddings make the retrieval step possible because they're how the system knows which documents are relevant to the question.

  4. Creating AI Agents with Long-Term Memory

    AI agents that can remember past interactions and learn from experience need a way to store and retrieve relevant memories. Embeddings enable this. When an agent has a conversation or completes a task, you can embed the key information and store it in a vector database. Later, when the agent encounters a similar situation, it can retrieve relevant past experiences by finding embeddings close to the current context. This gives agents the ability to learn from history and make better decisions over time. In practice, long-term agent memory often uses similarity thresholds and time-weighted retrieval to prevent irrelevant or outdated information from being recalled.
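
To make the search idea above concrete, here's a minimal sketch that reuses the model, papers, and embeddings from earlier in this tutorial. The query is a hypothetical user question that shares almost no keywords with the paper titles; ranking by cosine similarity should still surface the missing-data paper near the top, though your exact scores will differ.

from sklearn.metrics.pairwise import cosine_similarity

# A hypothetical user question worded differently than any paper title
query = "How do I handle null values in my dataset?"
query_embedding = model.encode([query])

# Rank all 12 papers by semantic similarity to the query
scores = cosine_similarity(query_embedding, embeddings)[0]
top_indices = scores.argsort()[::-1][:3]

print(f"Query: {query}\n")
for idx in top_indices:
    print(f"{scores[idx]:.3f}  {papers[idx]['title']}")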

These four applications (search, vector databases, RAG, and AI agents) are foundational tools for any aspiring AI Engineer's toolkit. Each builds on embeddings as a core technology. Understanding how embeddings capture semantic meaning is the first step toward building production-ready AI systems.

Advanced Topics to Explore

As you continue learning about embeddings, you'll encounter several advanced techniques that are widely used in production systems:

  • Multimodal Embeddings allow you to embed different types of content (text, images, audio) into the same embedding space. This enables powerful cross-modal search capabilities, like finding images based on text descriptions or vice versa. Models like CLIP demonstrate how effective this approach can be.
  • Instruction-Tuned Embeddings are models fine-tuned to better understand specific types of queries or instructions. These specialized models often outperform general-purpose embeddings for domain-specific tasks like legal document search or medical literature retrieval.
  • Quantization reduces the precision of embedding values (from 32-bit floats to 8-bit integers, for example), which can dramatically reduce storage requirements and speed up similarity calculations with minimal impact on search quality. This becomes crucial when working with millions of embeddings (a small illustration follows this list).
  • Dimension Truncation takes advantage of the fact that the most important information in embeddings is often concentrated in the first dimensions. By keeping only the first 256 dimensions of a 768-dimensional embedding, you can achieve significant efficiency gains while preserving most of the semantic information.
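
As a quick taste of quantization, the sketch below converts our float32 embeddings to int8 using simple symmetric scaling over the whole array. This is a deliberately minimal illustration, assuming the embeddings array from earlier; production systems calibrate scaling more carefully and usually rely on library support, but the storage savings are easy to see even here.

import numpy as np

# Minimal int8 quantization: scale so the largest absolute value maps to 127
scale = np.abs(embeddings).max() / 127
embeddings_int8 = np.round(embeddings / scale).astype(np.int8)

print(f"float32 storage: {embeddings.nbytes} bytes")
print(f"int8 storage:    {embeddings_int8.nbytes} bytes")

# Dequantize and check how much precision we lost
restored = embeddings_int8.astype(np.float32) * scale
print(f"Max reconstruction error: {np.abs(embeddings - restored).max():.4f}")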

These techniques become increasingly important as you scale from prototype to production systems handling real-world data volumes.

Building Toward Production Systems

You've now learned the following core foundational embedding concepts:

  • Embeddings convert text into numerical vectors that capture meaning
  • Similar content produces similar vectors
  • These relationships can be visualized to understand how the model organizes information

But we've only worked with 12 handwritten paper abstracts. This is perfect for getting the core concept, but real applications need to handle hundreds or thousands of documents.

In the next tutorial, we'll scale up dramatically. You'll learn how to collect documents programmatically using APIs, generate embeddings at scale, and make strategic decisions about different embedding approaches.

You'll also face the practical challenges that come with real data: rate limits on APIs, processing time for large datasets, the tradeoff between embedding quality and speed, and how to handle edge cases like empty documents or very long texts. These considerations separate a learning exercise from a production system.

By the end of the next tutorial, you'll be equipped to build an embedding system that handles real-world data at scale. That foundation will prepare you for our final embeddings tutorial, where we'll implement similarity search and build a complete semantic search engine.

Next Steps

For now, experiment with the code above:

  • Try replacing one of the paper abstracts with content from your own learning.
    • Where does it appear on the visualization?
    • Does it cluster with one of our three topics, or does it land somewhere in between?
  • Add a paper abstract that bridges two topics, like "Using Machine Learning to Optimize ETL Pipelines."
    • Does it position itself between the ML and data engineering clusters?
    • What does this tell you about how embeddings handle multi-topic content?
  • Try changing the embedding model to see how it affects the visualization.
    • Models like all-mpnet-base-v2 produce different dimensional embeddings.
    • Do the clusters become tighter or more spread out?
  • Experiment with adding a completely unrelated abstract, like a cooking recipe or news article.
    • Where does it land relative to our three data science clusters?
    • How far away is it from the nearest cluster?

This hands-on exploration and experimentation will deepen your intuition about how embeddings work.

Ready to scale things up? In the next tutorial, we'll work with real arXiv data and build an embedding system that can handle thousands of papers. See you there!


Key Takeaways:

  • Embeddings convert text into numerical vectors that capture semantic meaning
  • Similar meanings produce similar vectors, enabling mathematical comparison of concepts
  • Papers from different topics cluster separately because they use distinct vocabulary
  • Dimensionality reduction (like PCA) helps visualize high-dimensional embeddings in 2D
  • Embeddings power modern AI systems, including semantic search, vector databases, RAG, and AI agents

20 Fun (and Unique) Data Analyst Projects for Beginners

26 October 2025 at 21:23

You're here because you're serious about becoming a data analyst. You've probably noticed that just about every data analytics job posting asks for experience. But how do you get experience if you're just starting out? The answer: by building a solid portfolio of data analytics projects, so you can land a job as a junior data analyst even with no prior work experience.


Your portfolio is your ticket to proving your capabilities to a potential employer. Even without previous job experience, a well-curated collection of data analytics projects can set you apart from the competition. They demonstrate your ability to tackle real-world problems with real data, showcasing how you clean datasets, create compelling visualizations, and extract meaningful insights, all skills that are in high demand.

You just have to pick the ones that speak to you and get started!

Getting started with data analytics projects

So, you're ready to tackle your first data analytics project? Awesome! Let's break down what you need to know to set yourself up for success.

Our curated list of 20 projects below will help you develop the most sought-after data analysis skills and practice with the most frequently used data analysis tools, including Python, Jupyter Notebook, R, Excel, Tableau, and Power BI.

Setting up an effective development environment is also vital. Begin by creating a Python environment with Conda or venv. Use version control like Git to track project changes. Combine an IDE like Jupyter Notebook with core Python libraries to boost your productivity.

Remember, Rome wasn't built in a day! Start your data analysis journey with bite-sized projects to steadily build your skills. Keep learning, stay curious, and enjoy the ride. Before you know it, you'll be tackling real-world data challenges like the professionals do.

20 Data Analyst Projects for Beginners

Each project listed below will help you apply what you've learned to real data, growing your abilities one step at a time. While they are tailored towards beginners, some will be more challenging than others. By working through them, you'll create a portfolio that shows a potential employer you have the practical skills to analyze data on the job.

The data analytics projects below cover a range of analysis techniques, applications, and tools:

  1. Learn and Install Jupyter Notebook
  2. Profitable App Profiles for the App Store and Google Play Markets
  3. Exploring Hacker News Posts
  4. Clean and Analyze Employee Exit Surveys
  5. Star Wars Survey
  6. Word Raider
  7. Install RStudio
  8. Creating An Efficient Data Analysis Workflow
  9. Creating An Efficient Data Analysis Workflow, Part 2
  10. Preparing Data with Excel
  11. Visualizing the Answer to Stock Questions Using Spreadsheet Charts
  12. Identifying Customers Likely to Churn for a Telecommunications Provider
  13. Data Prep in Tableau
  14. Business Intelligence Plots
  15. Data Presentation
  16. Modeling Data in Power BI
  17. Visualization of Life Expectancy and GDP Variation Over Time
  18. Building a BI App
  19. Analyzing Kickstarter Projects
  20. Analyzing Startup Fundraising Deals from Crunchbase

In the following sections, you'll find step-by-step guides to walk you through each project. These detailed instructions will help you apply what you've learned and solidify your data analytics skills.

1. Learn and Install Jupyter Notebook

Overview

In this beginner-level project, you'll assume the role of a Jupyter Notebook novice aiming to gain the essential skills for real-world data analytics projects. You'll practice running code cells, documenting your work with Markdown, navigating Jupyter using keyboard shortcuts, mitigating hidden state issues, and installing Jupyter locally. By the end of the project, you'll be equipped to use Jupyter Notebook to work on data analytics projects and share compelling, reproducible notebooks with others.

Tools and Technologies

  • Jupyter Notebook
  • Python

Prerequisites

Before you take on this project, it's recommended that you have some foundational Python skills in place first.

Step-by-Step Instructions

  1. Get acquainted with the Jupyter Notebook interface and its components
  2. Practice running code cells and learn how execution order affects results
  3. Use keyboard shortcuts to efficiently navigate and edit notebooks
  4. Create Markdown cells to document your code and communicate your findings
  5. Install Jupyter locally to work on projects on your own machine

Expected Outcomes

Upon completing this project, you'll have gained practical experience and valuable skills, including:

  • Familiarity with the core components and workflow of Jupyter Notebook
  • Ability to use Jupyter Notebook to run code, perform analysis, and share results
  • Understanding of how to structure and document notebooks for real-world reproducibility
  • Proficiency in navigating Jupyter Notebook using keyboard shortcuts to boost productivity
  • Readiness to apply Jupyter Notebook skills to real-world data projects and collaborate with others

Relevant Links and Resources

Additional Resources

2. Profitable App Profiles for the App Store and Google Play Markets

Overview

In this guided project, you'll assume the role of a data analyst for a company that builds ad-supported mobile apps. By analyzing historical data from the Apple App Store and Google Play Store, you'll identify app profiles that attract the most users and generate the most revenue. Using Python and Jupyter Notebook, you'll clean the data, analyze it using frequency tables and averages, and make practical recommendations on the app categories and characteristics the company should target to maximize profitability.

Tools and Technologies

  • Python
  • Data Analytics
  • Jupyter Notebook

Prerequisites

This is a beginner-level project, but you should be comfortable working with Python functions and Jupyter Notebook:

  • Writing functions with arguments, return statements, and control flow
  • Debugging functions to ensure proper execution
  • Using conditional logic and loops within functions
  • Working with Jupyter Notebook to write and run code

Step-by-Step Instructions

  1. Open and explore the App Store and Google Play datasets
  2. Clean the datasets by removing non-English apps and duplicate entries
  3. Isolate the free apps for further analysis
  4. Determine the most common app genres and their characteristics using frequency tables
  5. Make recommendations on the ideal app profiles to maximize users and revenue

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Cleaning real-world data to prepare it for analysis
  • Analyzing app market data to identify trends and success factors
  • Applying data analysis techniques like frequency tables and calculating averages
  • Using data insights to inform business strategy and decision-making
  • Communicating your findings and recommendations to stakeholders

Relevant Links and Resources

Additional Resources

3. Exploring Hacker News Posts

Overview

In this project, you'll explore and analyze a dataset from Hacker News, a popular tech-focused community site. Using Python, you'll apply skills in string manipulation, object-oriented programming, and date management to uncover trends in user submissions and identify factors that drive community engagement. This hands-on project will strengthen your ability to interpret real-world datasets and enhance your data analysis skills.

Tools and Technologies

  • Python
  • Data cleaning
  • Object-oriented programming
  • Data Analytics
  • Jupyter Notebook

Prerequisites

To get the most out of this project, you should have some foundational Python and data cleaning skills, such as:

  • Employing loops in Python to explore CSV data
  • Utilizing string methods in Python to clean data for analysis
  • Processing dates from strings using the datetime library
  • Formatting dates and times for analysis using strftime

Step-by-Step Instructions

  1. Remove headers from a list of lists
  2. Extract 'Ask HN' and 'Show HN' posts
  3. Calculate the average number of comments for 'Ask HN' and 'Show HN' posts
  4. Find the number of 'Ask HN' posts and average comments by hour created
  5. Sort and print values from a list of lists

Expected Outcomes

After completing this project, you'll have gained practical experience and skills, including:

  • Applying Python string manipulation, OOP, and date handling to real-world data
  • Analyzing trends and patterns in user submissions on Hacker News
  • Identifying factors that contribute to post popularity and engagement
  • Communicating insights derived from data analysis

Relevant Links and Resources

Additional Resources

4. Clean and Analyze Employee Exit Surveys

Overview

In this hands-on project, you'll play the role of a data analyst for the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. Your task is to clean and analyze employee exit surveys from both institutes to identify insights into why employees resign. Using Python and pandas, you'll combine messy data from multiple sources, clean column names and values, analyze the data, and share your key findings.

Tools and Technologies

  • Python
  • Pandas
  • Data cleaning
  • Data Analytics
  • Jupyter Notebook

Prerequisites

Before starting this project, you should be familiar with:

  • Exploring and analyzing data using pandas
  • Aggregating data with pandas groupby operations
  • Combining datasets using pandas concat and merge functions
  • Manipulating strings and handling missing data in pandas

Step-by-Step Instructions

  1. Load and explore the DETE and TAFE exit survey data
  2. Identify missing values and drop unnecessary columns
  3. Clean and standardize column names across both datasets
  4. Filter the data to only include resignation reasons
  5. Verify data quality and create new columns for analysis
  6. Combine the cleaned datasets into one for further analysis
  7. Analyze the cleaned data to identify trends and insights

Expected Outcomes

By completing this project, you will:

  • Clean real-world, messy HR data to prepare it for analysis
  • Apply core data cleaning techniques in Python and pandas
  • Combine multiple datasets and conduct exploratory analysis
  • Analyze employee exit surveys to understand key drivers of resignations
  • Summarize your findings and share data-driven recommendations

Relevant Links and Resources

Additional Resources

5. Star Wars Survey

Overview

In this project designed for beginners, you'll become a data analyst exploring FiveThirtyEight's Star Wars survey data. Using Python and pandas, you'll clean messy data, map values, compute statistics, and analyze the data to uncover fan film preferences. By comparing results between demographic segments, you'll gain insights into how Star Wars fans differ in their opinions. This project provides hands-on practice with key data cleaning and analysis techniques essential for data analyst roles across industries.

Tools and Technologies

  • Python
  • Pandas
  • Jupyter Notebook

Prerequisites

Before starting this project, you should be comfortable cleaning and analyzing data with pandas.

Step-by-Step Instructions

  1. Map Yes/No columns to Boolean values to standardize the data
  2. Convert checkbox columns to lists and get them into a consistent format
  3. Clean and rename the ranking columns to make them easier to analyze
  4. Identify the highest-ranked and most-viewed Star Wars films
  5. Analyze the data by key demographic segments like gender, age, and location
  6. Summarize your findings on fan preferences and differences between groups

Expected Outcomes

After completing this project, you will have gained:

  • Experience cleaning and analyzing a real-world, messy dataset
  • Hands-on practice with pandas data manipulation techniques
  • Insights into the preferences and opinions of Star Wars fans
  • An understanding of how to analyze survey data for business insights

Relevant Links and Resources

Additional Resources

6. Word Raider

Overview

In this beginner-level Python project, you'll step into the role of a developer to create "Word Raider," an interactive word-guessing game. Although this project won't have you perform any explicit data analysis, it will sharpen your Python skills and make you a better data analyst. Using fundamental programming skills, you'll apply concepts like loops, conditionals, and file handling to build the game logic from the ground up. This hands-on project allows you to consolidate your Python knowledge by integrating key techniques into a fun application.

Tools and Technologies

  • Python
  • Jupyter Notebook

Prerequisites

Before diving into this project, you should have some foundational Python skills, such as working with loops, conditionals, and file handling.

Step-by-Step Instructions

  1. Build the word bank by reading words from a text file into a Python list
  2. Set up variables to track the game state, like the hidden word and remaining attempts
  3. Implement functions to receive and validate user input for their guesses
  4. Create the game loop, checking guesses against the hidden word and providing feedback
  5. Update the game state after each guess and check for a win or loss condition

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Strengthened proficiency in fundamental Python programming concepts
  • Experience building an interactive, text-based game from scratch
  • Practice with file I/O, data structures, and basic object-oriented design
  • Improved problem-solving skills and ability to translate ideas into code

Relevant Links and Resources

Additional Resources

7. Install RStudio

Overview

In this beginner-level project, you'll take the first steps in your data analysis journey by installing R and RStudio. As an aspiring data analyst, you'll set up a professional programming environment and explore RStudio's features for efficient R coding and analysis. Through guided exercises, you'll write scripts, import data, and create visualizations, building key foundations for your career.

Tools and Technologies

  • R
  • RStudio

Prerequisites

To complete this project, it's recommended to have basic knowledge of:

  • R syntax and programming fundamentals
  • Variables, data types, and arithmetic operations in R
  • Logical and relational operators in R expressions
  • Importing, exploring, and visualizing datasets in R

Step-by-Step Instructions

  1. Install the latest version of R and RStudio on your computer
  2. Practice writing and executing R code in the Console
  3. Import a dataset into RStudio and examine its contents
  4. Write and save R scripts to organize your code
  5. Generate basic data visualizations using ggplot2

Expected Outcomes

By completing this project, you'll gain essential skills including:

  • Setting up an R development environment with RStudio
  • Navigating RStudio's interface for data science workflows
  • Writing and running R code in scripts and the Console
  • Installing and loading R packages for analysis and visualization
  • Importing, exploring, and visualizing data in RStudio

Relevant Links and Resources

Additional Resources

8. Creating An Efficient Data Analysis Workflow

Overview

In this hands-on project, you'll step into the role of a data analyst hired by a company selling programming books. Your mission is to analyze their sales data to determine which titles are most profitable. You'll apply key R programming concepts like control flow, loops, and functions to develop an efficient data analysis workflow. This project provides valuable practice in data cleaning, transformation, and analysis, culminating in a structured report of your findings and recommendations.

Tools and Technologies

  • R
  • RStudio
  • Data Analytics

Prerequisites

To successfully complete this project, you should have foundational skills in control flow, iteration, and functions in R:

  • Implementing control flow using if-else statements
  • Employing for loops and while loops for iteration
  • Writing custom functions to modularize code
  • Combining control flow, loops, and functions in R

Step-by-Step Instructions

  1. Get acquainted with the provided book sales dataset
  2. Transform and prepare the data for analysis
  3. Analyze the cleaned data to identify top performing titles
  4. Summarize your findings in a structured report
  5. Provide data-driven recommendations to stakeholders

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Applying R programming concepts to real-world data analysis
  • Developing an efficient, reproducible data analysis workflow
  • Cleaning and preparing messy data for analysis
  • Analyzing sales data to derive actionable business insights
  • Communicating findings and recommendations to stakeholders

Relevant Links and Resources

Additional Resources

9. Creating An Efficient Data Analysis Workflow, Part 2

Overview

In this hands-on project, you'll step into the role of a data analyst at a book company tasked with evaluating the impact of a new program, launched on July 1, 2019, to encourage customers to buy more books. Using your data analysis skills in R, you'll clean and process the company's 2019 sales data to determine if the program successfully boosted book purchases and improved review quality. This project allows you to apply key R packages like dplyr, stringr, and lubridate to efficiently analyze a real-world business dataset and deliver actionable insights.

Tools and Technologies

  • R
  • RStudio
  • dplyr
  • stringr
  • lubridate

Prerequisites

To successfully complete this project, you should have some specialized data processing skills in R:

  • Manipulating strings using stringr functions
  • Working with dates and times using lubridate
  • Applying the map function to vectorize custom functions
  • Understanding and employing regular expressions for pattern matching

Step-by-Step Instructions

  1. Load and explore the book company's 2019 sales data
  2. Clean the data by handling missing values and inconsistencies
  3. Process the text reviews to determine positive/negative sentiment
  4. Compare key sales metrics like purchases and revenue before and after the July 1 program launch date
  5. Analyze differences in sales between customer segments

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Cleaning and preparing a real-world business dataset for analysis
  • Applying powerful R packages to manipulate and process data efficiently
  • Analyzing sales data to quantify the impact of a new initiative
  • Translating analysis findings into meaningful business insights

Relevant Links and Resources

Additional Resources

10. Preparing Data with Excel

Overview

In this hands-on project for beginners, you'll step into the role of a data professional in a marine biology research organization. Your mission is to prepare a raw dataset on shark attacks for an analysis team to study trends in attack locations and frequency over time. Using Excel, you'll import the data, organize it in worksheets and tables, handle missing values, and clean the data by removing duplicates and fixing inconsistencies. This project provides practical experience in the essential data preparation skills required for real-world analysis projects.

Tools and Technologies

  • Excel

Prerequisites

This project is designed for beginners. To complete it, you should be familiar with preparing data in Excel:

  • Importing data into Excel from various sources
  • Organizing spreadsheet data using worksheets and tables
  • Cleaning data by removing duplicates, fixing inconsistencies, and handling missing values
  • Consolidating data from multiple sources into a single table

Step-by-Step Instructions

  1. Import the raw shark attack data into an Excel workbook
  2. Organize the data into worksheets and tables with a logical structure
  3. Clean the data by removing duplicate entries and fixing inconsistencies
  4. Consolidate shark attack data from multiple sources into a single table

Expected Outcomes

By completing this project, you will gain:

  • Hands-on experience in data preparation and cleaning techniques using Excel
  • Foundational skills for importing, organizing, and cleaning data for analysis
  • An understanding of how to handle missing values and inconsistencies in a dataset
  • Ability to consolidate data from disparate sources into an analysis-ready format
  • Practical experience working with a real-world dataset on shark attacks
  • A solid foundation for data analysis projects and further learning in Excel

Relevant Links and Resources

Additional Resources

11. Visualizing the Answer to Stock Questions Using Spreadsheet Charts

Overview

In this hands-on project, you'll step into the shoes of a business analyst to explore historical stock market data using Excel. By applying information design concepts, you'll create compelling visualizations and craft an insightful report – building valuable skills for communicating data-driven insights that are highly sought-after by employers across industries.

Tools and Technologies

  • Excel
  • Data visualization
  • Information design principles

Prerequisites

To successfully complete this project, it's recommended to have foundational data visualization skills in Excel, such as:

  • Creating various chart types in Excel to visualize data
  • Selecting appropriate chart types to effectively present data
  • Applying design principles to create clear and informative charts
  • Designing charts for an audience using Gestalt principles

Step-by-Step Instructions

  1. Import the dataset to an Excel spreadsheet
  2. Create a report using data visualizations and tabular data
  3. Represent the data using effective data visualizations
  4. Apply Gestalt principles and pre-attentive attributes to all visualizations
  5. Maximize data-ink ratio in all visualizations

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Analyzing real-world stock market data in Excel
  • Applying information design principles to create effective visualizations
  • Selecting the best chart types to answer specific questions about the data
  • Combining multiple charts into a cohesive, insightful report
  • Developing in-demand data visualization and communication skills

Relevant Links and Resources

Additional Resources

12. Identifying Customers Likely to Churn for a Telecommunications Provider

Overview

In this beginner project, you'll take on the role of a data analyst at a telecommunications company. Your challenge is to explore customer data in Excel to identify profiles of those likely to churn. Retaining customers is crucial for telecom providers, so your insights will help inform proactive retention efforts. You'll conduct exploratory data analysis, calculating key statistics, building PivotTables to slice the data, and creating charts to visualize your findings. This project provides hands-on experience with core Excel skills for data-driven business decisions that will enhance your analyst portfolio.

Tools and Technologies

  • Excel

Prerequisites

To complete this project, you should feel comfortable exploring data in Excel:

  • Calculating descriptive statistics in Excel
  • Analyzing data with descriptive statistics
  • Creating PivotTables in Excel to explore and analyze data
  • Visualizing data with histograms and boxplots in Excel

Step-by-Step Instructions

  1. Import the customer dataset into Excel
  2. Calculate descriptive statistics for key metrics
  3. Create PivotTables, histograms, and boxplots to explore data differences
  4. Analyze and identify profiles of likely churners
  5. Compile a report with your data visualizations

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Hands-on practice analyzing a real-world customer dataset in Excel
  • Ability to calculate and interpret key statistics to profile churn risks
  • Experience building PivotTables and charts to slice data and uncover insights
  • Skill in translating analysis findings into an actionable report for stakeholders

Relevant Links and Resources

Additional Resources

13. Data Prep in Tableau

Overview

In this hands-on project, you'll take on the role of a data analyst for Dataquest to prepare their online learning platform data for analysis. You'll connect to Excel data, import tables into Tableau, and define table relationships to build a data model for uncovering insights on student engagement and performance. This project focuses on essential data preparation steps in Tableau, providing you with a robust foundation for data visualization and analysis.

Tools and Technologies

  • Tableau

Prerequisites

To successfully complete this project, you should have some foundational skills in preparing data in Tableau, such as:

  • Connecting to data sources in Tableau to access the required data
  • Importing data tables into the Tableau canvas
  • Defining relationships between tables in Tableau to combine data
  • Cleaning and filtering imported data in Tableau to prepare it for use

Step-by-Step Instructions

  1. Connect to the provided Excel file containing key tables on student engagement, course performance, and content completion rates
  2. Import the tables into Tableau and define the relationships between tables to create a unified data model
  3. Clean and filter the imported data to handle missing values, inconsistencies, or irrelevant information
  4. Save the prepared data source to use for creating visualizations and dashboards
  5. Reflect on the importance of proper data preparation for effective analysis

Expected Outcomes

By completing this project, you will gain valuable skills and experience, including:

  • Hands-on practice with essential data preparation techniques in Tableau
  • Ability to connect to, import, and combine data from multiple tables
  • Understanding of how to clean and structure data for analysis
  • Readiness to progress to creating visualizations and dashboards to uncover insights

Relevant Links and Resources

Additional Resources

14. Business Intelligence Plots

Overview

In this hands-on project, you'll step into the role of a data visualization consultant for Adventure Works. The company's leadership team wants to understand the differences between their online and offline sales channels. You'll apply your Tableau skills to build insightful, interactive data visualizations that provide clear comparisons and enable data-driven business decisions. Key techniques include creating calculated fields, applying filters, utilizing dual-axis charts, and embedding visualizations in tooltips. By the end, you'll have a set of powerful Tableau dashboards ready to share with stakeholders.

Tools and Technologies

  • Tableau

Prerequisites

To successfully complete this project, you should have a solid grasp of data visualization fundamentals in Tableau:

  • Navigating the Tableau interface and distinguishing between dimensions and measures
  • Constructing various foundational chart types in Tableau
  • Developing and interpreting calculated fields to enhance analysis
  • Employing filters to improve visualization interactivity

Step-by-Step Instructions

  1. Compare online vs offline orders using visualizations
  2. Analyze products across channels with scatter plots
  3. Embed visualizations in tooltips for added insight
  4. Summarize findings and identify next steps

Expected Outcomes

Upon completing this project, you'll have gained valuable skills and experience:

  • Practical experience building interactive business intelligence dashboards in Tableau
  • Ability to create calculated fields to conduct tailored analysis
  • Understanding of how to use filters and tooltips to enable data exploration
  • Skill in developing visualizations that surface actionable insights for stakeholders

Relevant Links and Resources

Additional Resources

15. Data Presentation

Overview

In this project, you'll step into the role of a data analyst exploring conversion funnel trends for a company's leadership team. Using Tableau, you'll build interactive dashboards that uncover insights about which marketing channels, locations, and customer personas drive the most value in terms of volume and conversion rates. By applying data visualization best practices and incorporating dashboard actions and filters, you'll create a professional, usable dashboard ready to present your findings to stakeholders.

Tools and Technologies

  • Tableau

Prerequisites

To successfully complete this project, you should be comfortable sharing insights in Tableau, such as:

  • Building basic charts like bar charts and line graphs in Tableau
  • Employing color, size, trend lines and forecasting to emphasize insights
  • Combining charts, tables, text and images into dashboards
  • Creating interactive dashboards with filters and quick actions

Step-by-Step Instructions

  1. Import and clean the conversion funnel data in Tableau
  2. Build basic charts to visualize key metrics
  3. Create interactive dashboards with filters and actions
  4. Add annotations and highlights to emphasize key insights
  5. Compile a professional dashboard to present findings

Expected Outcomes

Upon completing this project, you'll have gained practical experience and valuable skills, including:

  • Analyzing conversion funnel data to surface actionable insights
  • Visualizing trends and comparisons using Tableau charts and graphs
  • Applying data visualization best practices to create impactful dashboards
  • Adding interactivity to enable exploration of the data
  • Communicating data-driven findings and recommendations to stakeholders

Relevant Links and Resources

Additional Resources

16. Modeling Data in Power BI

Overview

In this hands-on project, you'll step into the role of an analyst at a company that sells scale model cars. Your mission is to model and analyze data from their sales records database using Power BI to extract insights that drive business decision-making. Power BI is a powerful business analytics tool that enables you to connect to, model, and visualize data. By applying data cleaning, transformation, and modeling techniques in Power BI, you'll prepare the sales data for analysis and develop practical skills in working with real-world datasets. This project provides valuable experience in extracting meaningful insights from raw data to inform business strategy.

Tools and Technologies

  • Power BI

Prerequisites

To successfully complete this project, you should know how to model data in Power BI, such as:

  • Designing a basic data model in Power BI
  • Configuring table and column properties in Power BI
  • Creating calculated columns and measures using DAX in Power BI
  • Reviewing the performance of measures, relationships, and visuals in Power BI

Step-by-Step Instructions

  1. Import the sales data into Power BI
  2. Clean and transform the data for analysis
  3. Design a basic data model in Power BI
  4. Create calculated columns and measures using DAX
  5. Build visualizations to extract insights from the data

Expected Outcomes

Upon completing this project, you'll have gained valuable skills and experience, including:

  • Hands-on practice modeling and analyzing real-world sales data in Power BI
  • Ability to clean, transform and prepare data for analysis
  • Experience extracting meaningful business insights from raw data
  • Developing practical skills in data modeling and analysis using Power BI


17. Visualization of Life Expectancy and GDP Variation Over Time

Overview

In this project, you'll step into the role of a data analyst tasked with visualizing life expectancy and GDP data over time to uncover trends and regional differences. Using Power BI, you'll apply data cleaning, transformation, and visualization skills to create interactive scatter plots and stacked column charts that reveal insights from the Gapminder dataset. This hands-on project allows you to practice the full life-cycle of report and dashboard development in Power BI. You'll load and clean data, create and configure visualizations, and publish your work to showcase your skills. By the end, you'll have an engaging, interactive dashboard to add to your portfolio.

Tools and Technologies

  • Power BI

Prerequisites

To complete this project, you should be able to visualize data in Power BI, such as:

  • Creating basic Power BI visuals
  • Designing accessible report layouts
  • Customizing report themes and visual markers
  • Publishing Power BI reports and dashboards

Step-by-Step Instructions

  1. Import the life expectancy and GDP data into Power BI
  2. Clean and transform the data for analysis
  3. Create interactive scatter plots and stacked column charts
  4. Design an accessible report layout in Power BI
  5. Customize visual markers and themes to enhance insights

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Applying data cleaning, transformation, and visualization techniques in Power BI
  • Creating interactive scatter plots and stacked column charts to uncover data insights
  • Developing an engaging dashboard to showcase your data visualization skills
  • Practicing the full life-cycle of Power BI report and dashboard development


18. Building a BI App

Overview

In this hands-on project, you'll step into the role of a business intelligence analyst at Dataquest, an online learning platform. Using Power BI, you'll import and model data on course completion rates and Net Promoter Scores (NPS) to assess course quality. You'll create insightful visualizations like KPI metrics, line charts, and scatter plots to analyze trends and compare courses. Leveraging this analysis, you'll provide data-driven recommendations on which courses Dataquest should improve.

Tools and Technologies

  • Power BI

Prerequisites

To successfully complete this project, you should have some foundational Power BI skills, such as how to manage workspaces and datasets:

  • Creating and managing workspaces
  • Importing and updating assets within a workspace
  • Developing dynamic reports using parameters
  • Implementing static and dynamic row-level security

Step-by-Step Instructions

  1. Import and explore the course completion and NPS data, looking for data quality issues
  2. Create a data model relating the fact and dimension tables
  3. Write calculations for key metrics like completion rate and NPS, and validate the results
  4. Design and build visualizations to analyze course performance trends and comparisons

Expected Outcomes

Upon completing this project, you'll have gained valuable skills and experience:

  • Importing, modeling, and analyzing data in Power BI to drive decisions
  • Creating calculated columns and measures to quantify key metrics
  • Designing and building insightful data visualizations to convey trends and comparisons
  • Developing impactful reports and dashboards to summarize findings
  • Sharing data stories and recommending actions via Power BI apps


19. Analyzing Kickstarter Projects

Overview

In this hands-on project, you'll step into the role of a data analyst to explore and analyze Kickstarter project data using SQL. You'll start by importing and exploring the dataset, followed by cleaning the data to ensure accuracy. Then, you'll write SQL queries to uncover trends and insights within the data, such as success rates by category, funding goals, and more. By the end of this project, you'll be able to use SQL to derive meaningful insights from real-world datasets.

Tools and Technologies

  • SQL

Prerequisites

To successfully complete this project, you should be comfortable working with SQL and databases, such as:

  • Basic SQL commands and querying
  • Data manipulation and joins in SQL
  • Experience with cleaning data and handling missing values

Step-by-Step Instructions

  1. Import and explore the Kickstarter dataset to understand its structure
  2. Clean the data to handle missing values and ensure consistency
  3. Write SQL queries to analyze the data and uncover trends
  4. Visualize the results of your analysis using SQL queries

Expected Outcomes

Upon completing this project, you'll have gained valuable skills and experience, including:

  • Proficiency in using SQL for data analysis
  • Experience with cleaning and analyzing real-world datasets
  • Ability to derive insights from Kickstarter project data


20. Analyzing Startup Fundraising Deals from Crunchbase

Overview

In this beginner-level guided project, you'll step into the role of a data analyst to explore a dataset of startup investments from Crunchbase. By applying your pandas and SQLite skills, you'll work with a large dataset to uncover insights on fundraising trends, successful startups, and active investors. This project focuses on developing techniques for handling memory constraints, selecting optimal data types, and leveraging SQL databases. You'll strengthen your ability to apply the pandas-SQLite workflow to real-world scenarios.

Tools and Technologies

  • Python
  • Pandas
  • SQLite
  • Jupyter Notebook

Prerequisites

Although this is a beginner-level SQL project, you'll need some solid skills in Python and data analysis before taking it on.

Step-by-Step Instructions

  1. Explore the structure and contents of the Crunchbase startup investments dataset
  2. Process the large dataset in chunks and load into an SQLite database
  3. Analyze fundraising rounds data to identify trends and derive insights
  4. Examine the most successful startup verticals based on total funding raised
  5. Identify the most active investors by number of deals and total amount invested

Expected Outcomes

Upon completing this guided project, you'll gain practical skills and experience, including:

  • Applying pandas and SQLite to analyze real-world startup investment data
  • Handling large datasets effectively through chunking and efficient data types
  • Integrating pandas DataFrames with SQL databases for scalable data analysis
  • Deriving actionable insights from fundraising data to understand startup success
  • Building a project for your portfolio showcasing pandas and SQLite skills


Choosing the right data analyst projects

Since the list of data analytics projects on the internet is exhaustive (and can be exhausting!), no one can be expected to build them all. So, how do you pick the right ones for your portfolio, whether they're guided or independent projects? Let's go over the criteria you should use to make this decision.

Passions vs. Interests vs. In-Demand skills

When selecting projects, it’s essential to strike a balance between your passions, interests, and in-demand skills. Here’s how to navigate these three factors:

  • Passions: Choose projects that genuinely excite you and align with your long-term goals. Passions are often areas you are deeply committed to and are willing to invest significant time and effort in. Working on something you are passionate about can keep you motivated and engaged, which is crucial for learning and completing the project.
  • Interests: Pick projects related to fields or a topic that sparks your curiosity or enjoyment. Interests might not have the same level of commitment as passions, but they can still make the learning process more enjoyable and meaningful. For instance, if you're curious about sports analytics or healthcare data, these interests can guide your project choices.
  • In-Demand Skills: Focus on projects that help you develop skills currently sought after in the job market. Research job postings and industry trends to identify which skills are in demand and tailor your projects to develop those competencies.

Steps to picking the right data analytics projects

  1. Assess your current skill level
    • If you’re a beginner, start with projects that focus on data cleaning (an essential skill), exploration, and visualization. Using Python libraries like Pandas and Matplotlib is an efficient way to build these foundational skills.
    • Utilize structured resources that provide both a beginner data analyst learning path and support to guide you through your first projects.
  2. Plan before you code
    • Clearly define your topic, project objectives, and key questions upfront to stay focused and aligned with your goals.
    • Choose appropriate data sources early in the planning process to streamline the rest of the project.
  3. Focus on the fundamentals
    • Clean your data thoroughly to ensure accuracy.
    • Use analytical techniques that align with your objectives.
    • Create clear, impactful visualizations of your findings.
    • Document your process for reproducibility and effective communication.
  4. Start small and scale up
  5. Seek feedback and iterate
    • Share your projects with peers, mentors, or online communities to get feedback.
    • Use this feedback to improve and refine your work.

Remember, it’s okay to start small and gradually take on bigger challenges. Each project you complete, no matter how simple, helps you gain skills and learn valuable lessons. Tackling a series of focused projects is one of the best ways to grow your abilities as a data professional. With each one, you’ll get better at planning, execution, and communication.

Conclusion

If you're serious about landing a data analytics job, project-based learning is key.

There’s a lot of data out there and a lot you can do with it. Trying to figure out where to start can be overwhelming. If you want a more structured approach to reaching your goal, consider enrolling in Dataquest’s Data Analyst in Python career path. It offers exactly what you need to land your first job as a data analyst or to grow your career by adding one of the most popular programming languages, in-demand data skills, and projects to your CV.

But if you’re confident in doing this on your own, the list of projects we’ve shared in this post will definitely help you get there. To continue improving, we encourage you to take on additional projects and share them in the Dataquest Community. This provides valuable peer feedback, helping you refine your projects to become more advanced and join the group of professionals who do this for a living.

Python Projects: 60+ Ideas for Beginners to Advanced (2025)

23 October 2025 at 18:46
Quick Answer: The best Python projects for beginners include building an interactive word game, analyzing your Netflix data, creating a password generator, or making a simple web scraper. These projects teach core Python skills like loops, functions, data manipulation, and APIs while producing something you can actually use. Below, you'll find 60+ project ideas organized by skill level, from beginner to advanced.

Completing Python projects is the ultimate way to learn the language. When you work on real-world projects, you not only retain more of the lessons you learn, but you'll also find it super motivating to push yourself to pick up new skills. Because let's face it, no one actually enjoys sitting in front of a screen learning random syntax for hours on end―particularly if it's not going to be used right away.

Python projects don't have this problem. Anything new you learn will stick because you're immediately putting it into practice. There's just one problem: many Python learners struggle to come up with their own Python project ideas to work on. But that's okay; we can help you with that!

Best Starter Python Projects

Here are a few beginner-friendly Python projects from the list below that are perfect for getting hands-on experience right away:

Choose one that excites you and just go with it! You’ll learn more by building than by reading alone.

Are You Ready for This?

If you have some programming experience, you might be ready to jump straight into building a Python project. However, if you’re just starting out, it’s vital you have a solid foundation in Python before you take on any projects. Otherwise, you run the risk of getting frustrated and giving up before you even get going. For those in need, we recommend taking either:

  1. Introduction to Python Programming course: meant for those looking to become a data professional while learning the fundamentals of programming with Python.
  2. Introduction to Python Programming course: meant for those looking to leverage the power of AI while learning the fundamentals of programming with Python.

In both courses, the goal is to quickly learn the basics of Python so you can start working on a project as soon as possible. You'll learn by doing, not by passively watching videos.

Selecting a Project

Our list below has 60+ fun and rewarding Python projects for learners at all levels. Some are free guided projects that you can complete directly in your browser via the Dataquest platform. Others are more open-ended, serving as inspiration as you build your Python skills. The key is to choose a project that resonates with you and just go for it!

Now, let’s take a look at some Python project examples. There is definitely something to get you started in this list.


Free Python Projects (Recommended):

These free Dataquest guided projects are a great place to start. They provide an embedded code editor directly in your browser, step-by-step instructions to help you complete the project, and community support if you happen to get stuck.

  1. Building an Interactive Word Game — In this guided project, you’ll use basic Python programming concepts to create a functional and interactive word-guessing game.

  2. Profitable App Profiles for the App Store and Google Play Markets — In this one, you’ll work as a data analyst for a company that builds mobile apps. You’ll use Python to analyze real app market data to find app profiles that attract the most users.

  3. Exploring Hacker News Posts — Use Python string manipulation, OOP, and date handling to analyze trends driving post popularity on Hacker News, a popular technology site.

  4. Learn and Install Jupyter Notebook — A guide to using and setting up Jupyter Notebook locally to prepare you for real-world data projects.

  5. Predicting Heart Disease — We're tasked with using a dataset from the World Health Organization to accurately predict a patient’s risk of developing heart disease based on their medical data.

  6. Analyzing Accuracy in Data Presentation — In this project, we'll step into the role of data journalists to analyze movie ratings data and determine if there’s evidence of bias in Fandango’s rating system.



More Projects to Help Build Your Portfolio:

  1. Finding Heavy Traffic Indicators on I-94 — Explore how pandas plotting functionality, combined with the Jupyter Notebook interface, lets you quickly build visualizations and spot indicators of heavy traffic.

  2. Storytelling Data Visualization on Exchange Rates — You'll assume the role of a data analyst tasked with creating an explanatory data visualization about Euro exchange rates to inform and engage an audience.

  3. Clean and Analyze Employee Exit Surveys — Work with exit surveys from employees of the Department of Education in Queensland, Australia. Play the role of a data analyst to analyze employee exit surveys and uncover insights about why employees resign.

  4. Star Wars Survey — In this data cleaning project, you’ll work with Jupyter Notebook to analyze data on the Star Wars movies to answer the hotly contested question, "Who shot first?"

  5. Analyzing NYC High School Data — For this project, you’ll assume the role of a data scientist analyzing relationships between SAT scores and demographic factors in NYC public schools to determine if the SAT is a fair test.

  6. Predicting the Weather Using Machine Learning — For this project, you’ll step into the role of a data scientist to predict tomorrow’s weather using historical data and machine learning, developing skills in data preparation, time series analysis, and model evaluation.

  7. Credit Card Customer Segmentation — For this project, we’ll play the role of a data scientist at a credit card company to segment customers into groups using K-means clustering in Python, allowing the company to tailor strategies for each segment.

Python Projects for AI Enthusiasts:

  1. Building an AI Chatbot with Streamlit — Build a simple website with an AI chatbot user interface similar to the OpenAI Playground in this intermediate-level project using Streamlit.

  2. Developing a Dynamic AI Chatbot — Create your very own AI-powered chatbot that can take on different personalities, keep track of conversation history, and provide coherent responses in this intermediate-level project.

  3. Building a Food Ordering App — Create a functional application using Python dictionaries, loops, and functions to create an interactive system for viewing menus, modifying carts, and placing orders.


Fun Python Projects for Building Data Skills:

  1. Exploring eBay Car Sales Data — Use Python to work with a scraped dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

  2. Find out How Much Money You’ve Spent on Amazon — Dig into your own spending habits with this beginner-level tutorial!

  3. Analyze Your Personal Netflix Data — Another beginner-to-intermediate tutorial that gets you working with your own personal dataset.

  4. Analyze Your Personal Facebook Data with Python — Are you spending too much time posting on Facebook? The numbers don’t lie, and you can find them in this beginner-to-intermediate Python project.

  5. Analyze Survey Data — This walk-through will show you how to set up Python and how to filter survey data from any dataset (or just use the sample data linked in the article).

  6. All of Dataquest’s Guided Projects — These guided data science projects walk you through building real-world data projects of increasing complexity, with suggestions for how to expand each project.

  7. Analyze Everything — Grab a free dataset that interests you, and start poking around! If you get stuck or aren’t sure where to start, our introduction to Python lessons are here to help, and you can try them for free!



Cool Python Projects for Game Devs:

  1. Rock, Paper, Scissors — Learn Python with a simple-but-fun game that everybody knows.

  2. Build a Text Adventure Game — This is a classic Python beginner project (it also pops up in this book) that’ll teach you many basic game setup concepts that are useful for more advanced games.

  3. Guessing Game — This is another beginner-level project that’ll help you learn and practice the basics.

  4. Mad Libs — Use Python code to make interactive Python Mad Libs!

  5. Hangman — Another childhood classic that you can make to stretch your Python skills.

  6. Snake — This is a bit more complex, but it’s a classic (and surprisingly fun) game to make and play.

Simple Python Projects for Web Devs:

  1. URL shortener — This free video course will show you how to build your own URL shortener like Bit.ly using Python and Django.

  2. Build a Simple Web Page with Django — This is a very in-depth, from-scratch tutorial for building a website with Python and Django, complete with cartoon illustrations!

Easy Python Projects for Aspiring Developers:

  1. Password generator — Build a secure password generator in Python.

  2. Use Tweepy to create a Twitter bot — This Python project idea is a bit more advanced, as you’ll need to use the Twitter API, but it’s definitely fun!

  3. Build an Address Book — This could start with a simple Python dictionary or become as advanced as something like this!

  4. Create a Crypto App with Python — This free video course walks you through using some APIs and Python to build apps with cryptocurrency data.


Additional Python Project Ideas

Still haven’t found a project idea that appeals to you? Here are many more, separated by experience level.

These aren’t tutorials; they’re just Python project ideas that you’ll have to dig into and research on your own, but that’s part of the fun! And it’s also part of the natural process of learning to code and working as a programmer.

The pros use Google and AI tools for answers all the time — so don’t be afraid to dive in and get your hands dirty!


Beginner Python Project Ideas

  1. Create a text encryption generator. This would take text as input, replace each letter with another letter, and output the “encoded” message.

  2. Build a countdown calculator. Write some code that can take two dates as input, and then calculate the amount of time between them. This will be a great way to familiarize yourself with Python’s datetime module (see the short sketch after this list).

  3. Write a sorting method. Given a list, can you write some code that sorts it alphabetically, or numerically? Yes, Python has this functionality built-in, but see if you can do it without using the sort() function!

  4. Build an interactive quiz application. Which Avenger are you? Build a personality or recommendation quiz that asks users some questions, stores their answers, and then performs some kind of calculation to give the user a personalized result based on their answers.

  5. Tic-Tac-Toe by Text. Build a Tic-Tac-Toe game that’s playable like a text adventure. Can you make it print a text-based representation of the board after each move?

  6. Make a temperature/measurement converter. Write a script that can convert Fahrenheit (℉) to Celsius (℃) and back, or inches to centimeters and back, etc. How far can you take it?

  7. Build a counter app. Take your first steps into the world of UI by building a very simple app that counts up by one each time a user clicks a button.

  8. Build a number-guessing game. Think of this as a bit like a text adventure, but with numbers. How far can you take it?

  9. Build an alarm clock. This is borderline beginner/intermediate, but it’s worth trying to build an alarm clock for yourself. Can you create different alarms? A snooze function?
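For the countdown calculator idea above (item 2), a tiny starting sketch using the datetime module might look like this; the date format is just one reasonable choice:

from datetime import datetime

# A tiny sketch of the countdown idea: parse two dates and report the gap.
start = datetime.strptime(input("Start date (YYYY-MM-DD): "), "%Y-%m-%d")
end = datetime.strptime(input("End date (YYYY-MM-DD): "), "%Y-%m-%d")

delta = end - start
print(f"{delta.days} days ({delta.days * 24} hours) between the two dates")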



Intermediate Python Project Ideas

  1. Build an upgraded text encryption generator. Starting with the project mentioned in the beginner section, see what you can do to make it more sophisticated. Can you make it generate different kinds of codes? Can you create a “decoder” app that reads encoded messages if the user inputs a secret key? Can you create a more sophisticated code that goes beyond simple letter-replacement?

  2. Make your Tic-Tac-Toe game clickable. Building off the beginner project, now make a version of Tic-Tac-Toe that has an actual UI you’ll use by clicking on open squares. Challenge: can you write a simple “AI” opponent for a human player to play against?

  3. Scrape some data to analyze. This could really be anything, from any website you like. The web is full of interesting data. If you learn a little about web-scraping, you can collect some really unique datasets.

  4. Build a clock website. How close can you get it to real-time? Can you implement different time zone selectors, and add in the “countdown calculator” functionality to calculate lengths of time?

  5. Automate some of your job. This will vary, but many jobs have some kind of repetitive process that you can automate! This intermediate project could even lead to a promotion.

  6. Automate your personal habits. Do you want to remember to stand up once every hour during work? How about writing some code that generates unique workout plans based on your goals and preferences? There are a variety of simple apps you can build to automate or enhance different aspects of your life.

  7. Create a simple web browser. Build a simple UI that accepts URLs and loads webpages. A GUI toolkit like PyQt (with QtWebEngine) will be helpful here! Can you add a “back” button, bookmarks, and other cool features?

  8. Write a notes app. Create an app that helps people write and store notes. Can you think of some interesting and unique features to add?

  9. Build a typing tester. This should show the user some text, and then challenge them to type it quickly and accurately. Meanwhile, you time them and score them on accuracy.

  10. Create a “site updated” notification system. Ever get annoyed when you have to refresh a website to see if an out-of-stock product has been relisted? Or to see if any news has been posted? Write a Python script that automatically checks a given URL for updates and informs you when it identifies one. Be careful not to overload the servers of whatever site you’re checking, though. Keep the time interval reasonable between each check.

  11. Recreate your favorite board game in Python. There are tons of options here, from something simple like Checkers all the way up to Risk. Or even more modern and advanced games like Ticket to Ride or Settlers of Catan. How close can you get to the real thing?

  12. Build a Wikipedia explorer. Build an app that displays a random Wikipedia page. The challenge here is in the details: can you add user-selected categories? Can you try a different “rabbit hole” version of the app, wherein each article is randomly selected from the articles linked in the previous article? This might seem simple, but it can actually require some serious web-scraping skills.



Advanced Python Project Ideas

  1. Build a stock market prediction app. For this one, you’ll need a source of stock market data and some machine learning and data analytics skills. Fortunately, many people have tried this, so there’s plenty of source code out there to work from.

  2. Build a chatbot. The challenge here isn’t so much making the chatbot as it is making it good. Can you, for example, implement some natural language processing techniques to make it sound more natural and spontaneous?

  3. Program a robot. This requires some hardware (which isn’t usually free), but there are many affordable options out there — and many learning resources, too. Definitely look into Raspberry Pi if you’re not already thinking along those lines.

  4. Build an image recognition app. Starting with handwriting recognition is a good idea — Dataquest has a guided data science project to help with that! Once you’ve learned it, you can take it to the next level.

  5. Create a sentiment analysis tool for social media. Collect data from various social media platforms, preprocess it, and then train a deep learning model to analyze the sentiment of each post (positive, negative, neutral).

  6. Make a price prediction model. Select an industry or product that interests you, and build a machine learning model that predicts price changes.

  7. Create an interactive map. This will require a mix of data skills and UI creation skills. Your map can display whatever you’d like — bird migrations, traffic data, crime reports — but it should be interactive in some way. How far can you take it?


Next Steps

Each of the examples in the previous sections built on the idea of choosing a great Python project for a beginner and then enhancing it as your Python skills progress. Next, you can advance to the following:

  • Think about what interests you, and choose a project that overlaps with your interests.

  • Think about your Python learning goals, and make sure your project moves you closer to achieving those goals.

  • Start small. Once you’ve built a small project, you can either expand it or build another one.

Now you’re ready to get started. If you haven’t learned the basics of Python yet, I recommend diving in with Dataquest’s Introduction to Python Programming course.

If you already know the basics, there’s no reason to hesitate! Now is the time to get in there and find your perfect Python project.

11 Must-Have Skills for Data Analysts in 2025

22 October 2025 at 19:06

Data is everywhere. Every click, purchase, or social media like creates mountains of information, but raw numbers do not tell a story. That is where data analysts come in. They turn messy datasets into actionable insights that help businesses grow.

Whether you're looking to become a junior data analyst or looking to level up, here are the top 11 data analyst skills every professional needs in 2025, including one optional skill that can help you stand out.

1. SQL

SQL (Structured Query Language) is the language of databases and is arguably the most important technical skill for analysts. It allows you to efficiently query and manage large datasets across multiple systems—something Excel cannot do at scale.

Example in action: Want last quarter's sales by region? SQL pulls it in seconds, no matter how huge the dataset.

Learning Tip: Start with basic queries, then explore joins, aggregations, and subqueries. Practicing data analytics exercises with SQL will help you build confidence and precision.
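To make the "sales by region" example concrete, here's a minimal sketch of the kind of query involved, run from Python against a hypothetical SQLite database (a sales.db file with a sales table containing region, amount, and order_date columns); last quarter is approximated as the previous three months:

import sqlite3

import pandas as pd

# Connect to a hypothetical local database of sales records.
conn = sqlite3.connect("sales.db")

query = """
SELECT region,
       SUM(amount) AS total_sales
FROM sales
WHERE order_date >= DATE('now', 'start of month', '-3 months')
GROUP BY region
ORDER BY total_sales DESC;
"""

# Run the SQL and pull the result straight into a DataFrame.
sales_by_region = pd.read_sql_query(query, conn)
print(sales_by_region)
conn.close()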

2. Excel

Microsoft Excel isn’t going anywhere, so it’s still worth learning. Beyond basic spreadsheets, it offers pivot tables, macros, and Power Query, which are perfect for quick analysis on smaller datasets. Many startups and lean teams still rely on Excel as their first database.

Example in action: Summarize thousands of rows of customer feedback in minutes with pivot tables, then highlight trends visually.

Learning Tip: Focus on pivot tables, logical formulas, and basic automation. Once comfortable, try linking Excel to SQL queries or automating repetitive tasks to strengthen your technical skills in data analytics.

3. Python or R

Python and R are essential for handling big datasets, advanced analytics, and automation. Python is versatile for cleaning data, automation, and integrating analyses into workflows, while R excels at exploratory data analysis and statistical analysis.

Example in action: Clean hundreds of thousands of rows with Python’s pandas library in seconds, something that would take hours in Excel.

Learning Tip: Start with data cleaning and visualization, then move to complex analyses like regression or predictive modeling. Building these data analyst skills is critical for anyone working in data science. Of course, which is better to learn is still up for debate.
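For a feel of what this looks like in practice, here's a small, hypothetical pandas cleaning sketch; it assumes a customers.csv with email, signup_date, customer_id, and age columns:

import pandas as pd

# Load the raw file, then apply a few vectorized cleaning steps.
df = pd.read_csv("customers.csv")

df["email"] = df["email"].str.strip().str.lower()              # normalize text
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df = df.drop_duplicates(subset="customer_id")                  # remove repeat rows
df["age"] = df["age"].fillna(df["age"].median())               # impute missing ages

print(f"{len(df)} clean rows ready for analysis")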

4. Data Visualization

Numbers alone rarely persuade anyone. Data visualization is how you make your insights clear and memorable. Tools like Tableau, Power BI, or Python/R libraries help you tell a story that anyone can understand.

Example in action: A simple line chart showing revenue trends can be far more persuasive than a table of numbers.

Learning Tip: Design visuals with your audience in mind. Recreate dashboards from online tutorials to practice clarity, storytelling, and your soft skills in communicating data analytics results.
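As a rough illustration, this sketch draws that kind of revenue trend line with matplotlib, assuming a hypothetical revenue.csv with month and revenue columns:

import matplotlib.pyplot as plt
import pandas as pd

# Load monthly revenue and plot it as a simple trend line.
revenue = pd.read_csv("revenue.csv", parse_dates=["month"])

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(revenue["month"], revenue["revenue"], marker="o")
ax.set_title("Monthly Revenue Trend")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (USD)")
fig.tight_layout()
fig.savefig("revenue_trend.png")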

5. Statistics & Analytics

Strong statistical analysis knowledge separates analysts who report numbers from those who generate insights. Skills like regression, correlation, hypothesis testing, and A/B testing help you interpret trends accurately.

Example in action: Before recommending a new marketing campaign, test whether the increase in sales is statistically significant or just random fluctuation.

Learning Tip: Focus on core probability and statistics concepts first, then practice applying them in projects. Our Probability and Statistics with Python skill path is a great way to learn theoretical concepts in a hands-on way.
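Here's a hedged sketch of that kind of significance check, using scipy and a hypothetical sales.csv with a group column ("control" or "campaign") and a revenue column:

import pandas as pd
from scipy import stats

# Split revenue by experiment group.
sales = pd.read_csv("sales.csv")
control = sales.loc[sales["group"] == "control", "revenue"]
campaign = sales.loc[sales["group"] == "campaign", "revenue"]

# Two-sample t-test: is the campaign lift more than random fluctuation?
t_stat, p_value = stats.ttest_ind(campaign, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The difference is statistically significant at the 5% level.")
else:
    print("The difference could plausibly be random fluctuation.")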

6. Data Cleaning & Wrangling

Data rarely comes perfect, so data cleaning skills will always be in demand. Cleaning and transforming datasets, removing duplicates, handling missing values, and standardizing formats are often the most time-consuming but essential parts of the job.

Example in action: You want to analyze customer reviews, but ratings are inconsistent and some entries are blank. Cleaning the data ensures your insights are accurate and actionable.

Learning Tip: Practice on free datasets or public data repositories to build real-world data analyst skills.
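One possible sketch of the review-cleaning scenario above, assuming a hypothetical reviews.csv with messy rating values and blank comments:

import pandas as pd

reviews = pd.read_csv("reviews.csv")

# Standardize ratings that arrive as "4", "4 stars", or " 4.0 ".
reviews["rating"] = (
    reviews["rating"]
    .astype(str)
    .str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    .astype(float)
)

reviews = reviews.dropna(subset=["rating"])            # drop unusable rows
reviews["comment"] = reviews["comment"].fillna("")     # keep blanks explicit
reviews = reviews.drop_duplicates()

print(reviews["rating"].describe())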

7. Communication & Presentation Skills

Analyzing data is only half the battle. Sharing your findings clearly is just as important. Being able to present insights in reports, dashboards, or meetings ensures your work drives decisions.

Example in action: Presenting a dashboard to a marketing team that highlights which campaigns brought the most new customers can influence next-quarter strategy.

Learning Tip: Practice explaining complex findings to someone without a technical background. Focus on clarity, storytelling, and visuals rather than technical jargon. Strong soft skills are just as valuable as your technical skills in data analytics.

8. Dashboard & Report Creation

Beyond visualizations, analysts need to build dashboards and reports that allow stakeholders to interact with data. A dashboard is not just a fancy chart. It is a tool that empowers teams to make data-driven decisions without waiting for you to interpret every number.

Example in action: A sales dashboard with filters for region, product line, and time period can help managers quickly identify areas for improvement.

Learning Tip: Start with simple dashboards in Tableau, Power BI, or Google Data Studio. Focus on making them interactive, easy to understand, and aligned with business goals. This is an essential part of professional data analytics skills.

9. Domain Knowledge

Understanding the industry or context of your data makes you exponentially more effective. Metrics and trends mean different things depending on the business.

Example in action: Knowing e-commerce metrics like cart abandonment versus subscription churn metrics can change how you interpret the same type of data.

Learning Tip: Study your company’s industry, read case studies, or shadow colleagues in different departments to build context. The more you know, the better your insights and analysis will be.

10. Critical Thinking & Problem-Solving

Numbers can be misleading. Critical thinking lets analysts ask the right questions, identify anomalies, and uncover hidden insights.

Example in action: Revenue drops in one region. Critical thinking helps you ask whether it is seasonal, a data error, or a genuine trend.

Learning Tip: Challenge assumptions and always ask “why” multiple times when analyzing a dataset. Practice with open-ended case studies to sharpen your analytical thinking and overall data analyst skills.

11. Machine Learning Basics

Not every analyst uses machine learning daily, but knowing the basics—predictive modeling, clustering, or AI-powered insights—can help you stand out. You do not need this skill to get started as an analyst, but familiarity with it is increasingly valuable for advanced roles.

Example in action: Using a simple predictive model to forecast next month’s sales trends can help your team allocate resources more effectively.

Learning Tip: Start small with beginner-friendly tools like Python’s scikit-learn library, then explore more advanced models as you grow. Treat it as an optional skill to explore once you are confident in SQL, Python/R, and statistical analysis.
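To show how simple such a forecast can be, here's a deliberately naive sketch using scikit-learn's LinearRegression on a hypothetical monthly_sales.csv (one row per month, in chronological order, with a sales column):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

sales = pd.read_csv("monthly_sales.csv")

# Use the month index as the only feature; real models would use richer inputs.
X = np.arange(len(sales)).reshape(-1, 1)
y = sales["sales"].to_numpy()

model = LinearRegression().fit(X, y)
next_month = model.predict([[len(sales)]])[0]
print(f"Naive forecast for next month: {next_month:,.0f}")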

Where to Learn These Skills

Want to become a data analyst? Dataquest makes it easy to learn the skills you need to get hired.

With our Data Analyst in Python and Data Analyst in R career paths, you’ll learn by doing real projects, not just watching videos. Each course helps you build the technical and practical skills employers look for.

By the end, you’ll have the knowledge, experience, and confidence to start your career in data analysis.

Wrapping It Up

Being a data analyst is not just about crunching numbers. It is about turning data into actionable insights that drive decisions. Master these data analytics and data analyst skills, and you will be prepared to handle the challenges of 2025 and beyond.

Getting Started with Claude Code for Data Scientists

16 October 2025 at 23:39

If you've spent hours debugging a pandas KeyError, or writing the same data validation code for the hundredth time, or refactoring a messy analysis script, you know the frustration of tedious coding work. Real data science work involves analytical thinking and creative problem-solving, but it also requires a lot of mechanical coding: boilerplate writing, test generation, and documentation creation.

What if you could delegate the mechanical parts to an AI assistant that understands your codebase and handles implementation details while you focus on the analytical decisions?

That's what Claude Code does for data scientists.

What Is Claude Code?

Claude Code is Anthropic's terminal-based AI coding assistant that helps you write, refactor, debug, and document code through natural language conversations. Unlike autocomplete tools that suggest individual lines as you type, Claude Code understands project context, makes coordinated multi-file edits, and can execute workflows autonomously.

Claude Code excels at generating boilerplate code for data loading and validation, refactoring messy scripts into clean modules, debugging obscure errors in pandas or numpy operations, implementing standard patterns like preprocessing pipelines, and creating tests and documentation. However, it doesn't replace your analytical judgment, make methodological decisions about statistical approaches, or fix poorly conceived analysis strategies.

In this tutorial, you'll learn how to install Claude Code, understand its capabilities and limitations, and start using it productively for data science work. You'll see the core commands, discover tips that improve efficiency, and see concrete examples of how Claude Code handles common data science tasks.

Key Benefits for Data Scientists

Before we get into installation, let's establish what Claude Code actually does for data scientists:

  1. Eliminate boilerplate code writing for repetitive patterns that consume time without requiring creative thought. File loading with error handling, data validation checks that verify column existence and types, preprocessing pipelines with standard transformations—Claude Code generates these in seconds rather than requiring manual implementation of logic you've written dozens of times before.
  2. Generate test suites for data processing functions covering normal operation, edge cases with malformed or missing data, and validation of output characteristics. Testing data pipelines becomes straightforward rather than work you postpone.
  3. Accelerate documentation creation for data analysis workflows by generating detailed docstrings, README files explaining project setup, and inline comments that explain complex transformations.
  4. Debug obscure errors more efficiently in pandas operations, numpy array manipulations, or scikit-learn pipeline configurations. Claude Code interprets cryptic error messages, suggests likely causes based on common patterns, and proposes fixes you can evaluate immediately.
  5. Refactor exploratory code into production-quality modules with proper structure, error handling, and maintainability standards. The transition from research notebook to deployable pipeline becomes faster and less painful.

These benefits translate directly to time savings on mechanical tasks, allowing you to focus on analysis, modeling decisions, and generating insights rather than wrestling with implementation details.

Installation and Setup

Let's get Claude Code installed and configured. The process takes about 10-15 minutes, including account creation and verification.

Step 1: Obtain Your Anthropic API Key

Navigate to console.anthropic.com and create an account if you don't have one. Once logged in, access the API keys section from the navigation menu on the left, and generate a new API key by clicking on + Create Key.

(Screenshot: creating an API key in the Anthropic console)

While you can generate a new key anytime from the console, you won’t be able to view an existing key again once it has been created. For this reason, copy your API key immediately and store it somewhere safe—you'll need it for authentication.

Always keep your API keys secure. Treat them like passwords and never commit them to version control or share them publicly.

Step 2: Install Claude Code

Claude Code installs via npm (Node Package Manager). If you don't have Node.js installed on your system, download it from nodejs.org before proceeding.

Once Node.js is installed, open your terminal and run:

npm install -g @anthropic-ai/claude-code

The -g flag installs Claude Code globally, making it available from any directory on your system.

Common installation issues:

  • "npm: command not found": You need to install Node.js first. Download it from nodejs.org and restart your terminal after installation.
  • Permission errors on Mac/Linux: Try sudo npm install -g @anthropic-ai/claude-code to install with administrator privileges.
  • PATH issues: If Claude Code installs successfully but the claude command isn't recognized, you may need to add npm's global directory to your system PATH. Run npm config get prefix to find the location, then add [that-location]/bin to your PATH environment variable.

Step 3: Configure Authentication

Set your API key as an environment variable so Claude Code can authenticate with Anthropic's servers:

export ANTHROPIC_API_KEY=your_key_here

Replace your_key_here with the actual API key you copied earlier from the Anthropic console.

To make this permanent (so you don't need to set your API key every time you open a terminal), add the export line above to your shell configuration file:

  • For bash: Add to ~/.bashrc or ~/.bash_profile
  • For zsh: Add to ~/.zshrc
  • For fish: Add to ~/.config/fish/config.fish

You can edit your shell configuration file using nano config_file_name. After adding the line, reload your configuration by running source ~/.bashrc (or whichever file you edited), or simply open a new terminal window.

Step 4: Verify Installation

Confirm that Claude Code is properly installed and authenticated:

claude --version

You should see version information displayed. If you get an error, review the installation steps above.

Try running Claude Code for the first time:

claude

This launches the Claude Code interface. You should see a welcome message and a prompt asking you to select the text style that looks best with your terminal:

(Screenshot: Claude Code welcome screen with the text-style prompt)

Use the arrow keys on your keyboard to select a text style and press Enter to continue.

Next, you’ll be asked to select a login method:

If you have an eligible subscription, select option 1. Otherwise, select option 2. For this tutorial, we will use option 2 (API usage billing).

(Screenshot: Claude Code login method selection)

Once your account setup is complete, you’ll see a welcome message showing the email address for your account:

(Screenshot: Claude Code setup complete message)

To exit the setup of Claude Code at any point, press Control+C twice.

Security Note

Claude Code can read files you explicitly include and generate code that loads data from files or databases. However, it doesn't automatically access your data without your instruction. You maintain full control over what files and information Claude Code can see. When working with sensitive data, be mindful of what files you include in conversation context and review all generated code before execution, especially code that connects to databases or external systems. For more details, see Anthropic’s Security Documentation.

Understanding the Costs

Claude Code itself is free software, but using it requires an Anthropic API key that operates on usage-based pricing:

  • Free tier: Limited testing suitable for evaluation
  • Pro plan ($20/month): Reasonable usage for individual data scientists conducting moderate development work
  • Pay-as-you-go: For heavy users working intensively on multiple projects, typically $6-12 daily for active development

Most practitioners doing regular but not continuous development work find the $20 Pro plan provides a good balance between cost and capability. Start with the free tier to evaluate effectiveness on your actual work, then upgrade based on demonstrated value.

Your First Commands

Now that Claude Code is installed and configured, let's walk through basic usage with hands-on examples.

Starting a Claude Code Session

Navigate to a project directory in your terminal:

cd ~/projects/customer_analysis

Launch Claude Code:

claude

You'll see the Claude Code interface with a prompt where you can type natural language instructions.

Understanding Your Project

Before asking Claude Code to make changes, it needs to understand your project context. Try starting with this exploratory command:

Explain the structure of this project and identify the key files.

Claude Code will read through your directory, examine files, and provide a summary of what it found. This shows that Claude Code actively explores and comprehends codebases before acting.

Your First Refactoring Task

Let's demonstrate Claude Code's practical value with a realistic example. Create a simple file called load_data.py with some intentionally messy code:

import pandas as pd

# Load customer data
data = pd.read_csv('/Users/yourname/Desktop/customers.csv')
print(data.head())

This works but has obvious problems: hardcoded absolute path, no error handling, poor variable naming, and no documentation.

Now ask Claude Code to improve it:

Refactor load_data.py to use best practices: configurable paths, error handling, descriptive variable names, and complete docstrings.

Claude Code will analyze the file and propose improvements. Instead of the hardcoded path, you'll get configurable file paths through command-line arguments. The error handling expands to catch missing files, empty files, and CSV parsing errors. Variable names become descriptive (customer_df or customer_data instead of generic data). A complete docstring appears documenting parameters, return values, and potential exceptions. The function adds proper logging to track what's happening during execution.
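The exact output varies from run to run, but the refactored script might look roughly like this sketch; the function and argument names here are illustrative, not what Claude Code will necessarily produce:

import argparse
import logging
from pathlib import Path

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load_customer_data(csv_path: Path) -> pd.DataFrame:
    """Load customer records from a CSV file.

    Raises FileNotFoundError if the file is missing and ValueError if it
    is empty or cannot be parsed as CSV.
    """
    if not csv_path.exists():
        raise FileNotFoundError(f"Customer file not found: {csv_path}")
    try:
        customer_df = pd.read_csv(csv_path)
    except (pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
        raise ValueError(f"Could not parse customer file: {csv_path}") from exc
    logger.info("Loaded %d rows from %s", len(customer_df), csv_path)
    return customer_df


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Load customer data")
    parser.add_argument("csv_path", type=Path, help="Path to customers.csv")
    args = parser.parse_args()
    print(load_customer_data(args.csv_path).head())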

Claude Code asks your permission before making these changes. Always review its proposal; if it looks good, approve it. If something seems off, ask for modifications or reject the changes entirely. This permission step ensures you stay in control while delegating the mechanical work.

What Just Happened

This demonstrates Claude Code's workflow:

  1. You describe what you want in natural language
  2. Claude Code analyzes the relevant files and context
  3. Claude Code proposes specific changes with explanations
  4. You review and approve or request modifications
  5. Claude Code applies approved changes

The entire refactoring took 90 seconds instead of 20-30 minutes of manual work. More importantly, Claude Code caught details you might have forgotten, such as adding logging, proper type hints, and handling multiple error cases. The permission-based approach ensures you maintain control while delegating implementation work.

Core Commands and Patterns

Claude Code provides several slash (/) commands that control its behavior and help you work more efficiently.

Important Slash Commands

@filename: Reference files directly in your prompts using the @ symbol. Example: @src/preprocessing.py or Explain the logic in @data_loader.py. Claude Code automatically includes the file's content in context. Use tab completion after typing @ to quickly navigate and select files.

/clear: Reset conversation context entirely, removing all history and file references. Use this when switching between different analyses, datasets, or project areas. Accumulated conversation history consumes tokens and can cause Claude Code to inappropriately reference outdated context. Think of /clear as starting a fresh conversation when you switch tasks.

/help: Display available commands and usage information. Useful when you forget command syntax or want to discover capabilities.

Context Management for Data Science Projects

Claude Code has token limits determining how much code it can consider simultaneously. For small projects with a few files, this rarely matters. For larger data science projects with dozens of notebooks and scripts, strategic context management becomes important.

Reference only files relevant to your current task using @filename syntax. If you're working on data validation, reference the validation script and related utilities (like @validation.py and @utils/data_checks.py) but exclude modeling and visualization code that won't influence the current work.

Effective Prompting Patterns

Claude Code responds best to clear, specific instructions. Compare these approaches:

  • Vague: "Make this code better"
    Specific: "Refactor this preprocessing function to handle missing values using median imputation for numerical columns and mode for categorical columns, add error handling for unexpected data types, and include detailed docstrings"
  • Vague: "Add tests"
    Specific: "Create pytest tests for the data_loader function covering successful loading, missing file errors, empty file handling, and malformed CSV detection"
  • Vague: "Fix the pandas error"
    Specific: "Debug the KeyError in line 47 of data_pipeline.py and suggest why it's failing on the 'customer_id' column"

Specific prompts produce focused, useful results. Vague prompts generate generic suggestions that may not address your actual needs.

Iteration and Refinement

Treat Claude Code's initial output as a starting point rather than expecting perfection on the first attempt. Review what it generates, identify improvements needed, and make follow-up requests:

"The validation function you created is good, but it should also check that dates are within reasonable ranges. Add validation that start_date is after 2000-01-01 and end_date is not in the future."

This iterative approach produces better results than attempting to specify every requirement in a single massive prompt.

Advanced Features

Beyond basic commands, several features improve your Claude Code experience for complex work.

  1. Activate plan mode: Press Shift+Tab before sending your prompt to enable plan mode, which creates an explicit execution plan before implementing changes. Use this for workflows with three or more distinct steps—like loading data, preprocessing, and generating outputs. The planning phase helps Claude maintain focus on the overall objective.

  2. Run commands with bash mode: Prefix prompts with an exclamation mark to execute shell commands and inject their output into Claude Code's context:

    ! python analyze_sales.py

    This runs your analysis script and adds complete output to Claude Code's context. You can then ask questions about the output or request interpretations of the results. This creates a tight feedback loop for iterative data exploration.

  3. Use extended thinking for complex problems: Include "think", "think harder", or "ultrathink" in prompts for thorough analysis:

    think harder: why does my linear regression show high R-squared but poor prediction on validation data?

    Extended thinking produces more careful analysis but takes longer (ultrathink can take several minutes). Apply this when debugging subtle statistical issues or planning sophisticated transformations.

  4. Resume previous sessions: Launch Claude Code with claude --resume to continue your most recent session with complete context preserved, including conversation history, file references, and established conventions all intact. This proves valuable for ongoing analysis where you want to pick up where you left off without re-explaining your entire analytical approach.

Optional Power User Setting

For personal projects where you trust all operations, launch with claude --dangerously-skip-permissions to bypass constant approval prompts. This carries risk if Claude Code attempts destructive operations, so use it only on projects where you maintain version control and can recover from mistakes. Never use this on production systems or shared codebases.

Configuring Claude Code for Data Science Projects

The CLAUDE.md file provides project-specific context that improves Claude Code's suggestions by explaining your conventions, requirements, and domain specifics.

Quick Setup with /init

The easiest way to create your CLAUDE.md file is using Claude Code's built-in /init command. From your project directory, launch Claude Code and run:

/init

Claude Code will analyze your project structure and ask you questions about your setup: what kind of project you're working on, your coding conventions, important files and directories, and domain-specific context. It then generates a CLAUDE.md file tailored to your project.

This interactive approach is faster than writing from scratch and ensures you don't miss important details. You can always edit the generated file later to refine it.

Understanding Your CLAUDE.md

Whether you used /init or prefer to create it manually, here's what a typical CLAUDE.md file looks like for a data science project on customer churn. The file lives in your project root directory, uses markdown format, and describes the project:

# Customer Churn Analysis Project

## Project Overview
Predict customer churn for a telecommunications company using historical
customer data and behavior patterns. The goal is identifying at-risk
customers for proactive retention efforts.

## Data Sources
- **Customer demographics**: data/raw/customer_info.csv
- **Usage patterns**: data/raw/usage_data.csv
- **Churn labels**: data/raw/churn_labels.csv

Expected columns documented in data/schemas/column_descriptions.md

## Directory Structure
- `data/raw/`: Original unmodified data files
- `data/processed/`: Cleaned and preprocessed data ready for modeling
- `notebooks/`: Exploratory analysis and experimentation
- `src/`: Production code for data processing and modeling
- `tests/`: Pytest tests for all src/ modules
- `outputs/`: Generated reports, visualizations, and model artifacts

## Coding Conventions
- Use pandas for data manipulation, scikit-learn for modeling
- All scripts should accept command-line arguments for file paths
- Include error handling for data quality issues
- Follow PEP 8 style guidelines
- Write pytest tests for all data processing functions

## Domain Notes
Churn is defined as customer canceling service within 30 days. We care
more about catching churners (recall) than minimizing false positives
because retention outreach is relatively low-cost.

This upfront investment takes 10-15 minutes but improves every subsequent interaction by giving Claude Code context about your project structure, conventions, and requirements.

Hierarchical Configuration for Complex Projects

CLAUDE.md files can be hierarchical. You might maintain a root-level CLAUDE.md describing overall project structure, plus subdirectory-specific files for different analysis areas.

For example, a project analyzing both customer behavior and financial performance might have:

  • Root CLAUDE.md: General project description, directory structure, and shared conventions
  • customer_analysis/CLAUDE.md: Specific details about customer data sources, relevant metrics like lifetime value and engagement scores, and analytical approaches for behavioral patterns
  • financial_analysis/CLAUDE.md: Financial data sources, accounting principles used, and approaches for revenue and cost analysis

Claude Code prioritizes the most specific configuration, so subdirectory files take precedence when working within those areas.

Custom Slash Commands

For frequently used patterns specific to your workflow, you can create custom slash commands. Create a .claude/commands directory in your project and add markdown files named for each slash command you want to define.

For example, .claude/commands/test.md:

Create pytest tests for: $ARGUMENTS

Requirements:
- Test normal operation with valid data
- Test edge cases: empty inputs, missing values, invalid types
- Test expected exceptions are raised appropriately
- Include docstrings explaining what each test validates
- Use descriptive test names that explain the scenario

Then /test my_preprocessing_function generates tests following your specified patterns.

These custom commands represent optional advanced customization. Start with basic CLAUDE.md configuration, and consider custom commands only after you've identified repetitive patterns in your prompting.

Practical Data Science Applications

Let's see Claude Code in action across some common data science tasks.

1. Data Loading and Validation

Generate robust data loading code with error handling:

Create a data loading function for customer_data.csv that:
- Accepts configurable file paths
- Validates expected columns exist with correct types
- Detects and logs missing value patterns
- Handles common errors like missing files or malformed CSV
- Returns the dataframe with a summary of loaded records

Claude Code generates a function that handles all these requirements. The code uses pathlib for cross-platform file paths, includes try-except blocks for multiple error scenarios, validates that required columns exist in the dataframe, logs detailed information about data quality issues like missing values, and provides clear exception messages when problems occur. This handles edge cases you might forget: missing files, parsing errors, column validation, and missing value detection with logging.
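
To make that concrete, here is a minimal sketch of the kind of loader such a prompt might produce. The file path, required column names, and module-level logger are illustrative assumptions, not the exact code Claude Code will generate:

# Illustrative sketch only -- path and column names are hypothetical
import logging
from pathlib import Path

import pandas as pd

logger = logging.getLogger(__name__)

REQUIRED_COLUMNS = {"customer_id", "tenure", "monthly_charges"}

def load_customer_data(path="data/raw/customer_data.csv"):
    """Load the customer CSV, validate expected columns, and log missing-value patterns."""
    file_path = Path(path)
    if not file_path.exists():
        raise FileNotFoundError(f"Input file not found: {file_path}")

    try:
        df = pd.read_csv(file_path)
    except pd.errors.ParserError as exc:
        raise ValueError(f"Could not parse {file_path}: {exc}") from exc

    missing_cols = REQUIRED_COLUMNS - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing expected columns: {sorted(missing_cols)}")

    # Log missing-value patterns instead of silently ignoring them
    null_counts = df.isna().sum()
    for column, count in null_counts[null_counts > 0].items():
        logger.warning("Column %s has %d missing values", column, count)

    logger.info("Loaded %d records from %s", len(df), file_path)
    return df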

2. Exploratory Data Analysis Assistance

Generate EDA code:

Create an EDA script for the customer dataset that generates:
- Distribution plots for numerical features (age, income, tenure)
- Count plots for categorical features (plan_type, region)
- Correlation heatmap for numerical variables
- Summary statistics table
Save all visualizations to outputs/eda/

Claude Code produces a complete analysis script with proper plot styling, figure organization, and file output, saving you 30-45 minutes of matplotlib configuration work.

3. Data Preprocessing Pipeline

Build a preprocessing module:

Create preprocessing.py with functions to:
- Handle missing values: median for numerical, mode for categorical
- Encode categorical variables using one-hot encoding
- Scale numerical features using StandardScaler
- Include type hints, docstrings, and error handling

The generated code includes proper sklearn patterns and documentation, and it handles edge cases like unseen categories during transform.
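
As a rough sketch of what that module might contain, the function below combines SimpleImputer, OneHotEncoder, and StandardScaler; the function name and feature lists are hypothetical:

# Illustrative sketch only -- feature lists are supplied by the caller
from typing import List

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(numerical: List[str], categorical: List[str]) -> ColumnTransformer:
    """Impute, scale, and encode features; tolerates unseen categories at transform time."""
    if not numerical and not categorical:
        raise ValueError("At least one numerical or categorical column is required")

    numeric_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),         # median for numerical features
        ("scale", StandardScaler()),
    ])
    categorical_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # mode for categorical features
        ("encode", OneHotEncoder(handle_unknown="ignore")),   # ignore categories unseen at fit time
    ])

    return ColumnTransformer([
        ("num", numeric_pipeline, numerical),
        ("cat", categorical_pipeline, categorical),
    ])

You would fit this on training data and reuse the fitted object on new data with preprocessor.transform(), which is what keeps unseen categories from breaking the pipeline.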

4. Test Generation

Generate pytest tests:

Create tests for the preprocessing functions covering:
- Successful preprocessing with valid data
- Handling of various missing value patterns
- Error cases like all-missing columns
- Verification that output shapes match expectations

Claude Code generates thorough test coverage including fixtures, parametrized tests, and clear assertions—work that often gets postponed due to tedium.
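
For illustration, a minimal sketch of such tests might look like the following, assuming the build_preprocessor sketch above lives in a module called preprocessing:

# Illustrative sketch only -- module and fixture names are hypothetical
import numpy as np
import pandas as pd
import pytest

from preprocessing import build_preprocessor

@pytest.fixture
def sample_df() -> pd.DataFrame:
    """Small frame with a missing value in both the numeric and the categorical column."""
    return pd.DataFrame({
        "income": [50000.0, np.nan, 72000.0, 61000.0],
        "plan_type": ["basic", "premium", np.nan, "basic"],
    })

def test_output_rows_match_input(sample_df):
    """Transformed output keeps one row per input record."""
    preprocessor = build_preprocessor(["income"], ["plan_type"])
    result = preprocessor.fit_transform(sample_df)
    assert result.shape[0] == len(sample_df)

def test_no_missing_values_after_transform(sample_df):
    """Imputation should leave no NaNs in the output."""
    preprocessor = build_preprocessor(["income"], ["plan_type"])
    result = preprocessor.fit_transform(sample_df)
    dense = result.toarray() if hasattr(result, "toarray") else result
    assert not np.isnan(dense).any()

def test_empty_column_lists_raise():
    """Passing no columns at all should fail loudly."""
    with pytest.raises(ValueError):
        build_preprocessor([], [])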

5. Documentation Generation

Add docstrings and project documentation:

Add docstrings to all functions in data_pipeline.py following NumPy style
Create a README.md explaining:
- Project purpose and business context
- Setup instructions for the development environment
- How to run the preprocessing and modeling pipeline
- Description of output artifacts and their interpretation

Generated documentation captures technical details while remaining readable for collaborators.

6. Maintaining Analysis Documentation

For complex analyses, use Claude Code to maintain living documentation:

Create analysis_log.md and document our approach to handling missing income data, including:
- The statistical justification for using median imputation rather than deletion
- Why we chose median over mean given the right-skewed distribution we observed
- Validation checks we performed to ensure imputation didn't bias results

This documentation serves dual purposes. First, it provides context for Claude Code in future sessions when you resume work on this analysis, as it explains the preprocessing you applied and why those specific choices were methodologically appropriate. Second, it creates stakeholder-ready explanations communicating both technical implementation and analytical reasoning.

As your analysis progresses, continue documenting key decisions:

Add to analysis_log.md: Explain why we chose random forest over logistic regression after observing significant feature interactions in the correlation analysis, and document the cross-validation approach we used given temporal dependencies in our customer data.

This living documentation approach transforms implicit analytical reasoning into explicit written rationale, increasing both reproducibility and transparency of your data science work.

Common Pitfalls and How to Avoid Them

  • Insufficient context leads to generic suggestions that miss project-specific requirements. Claude Code doesn't automatically know your data schema, project conventions, or domain constraints. Maintain a detailed CLAUDE.md file and reference relevant files using @filename syntax in your prompts.
  • Accepting generated code without review risks introducing bugs or inappropriate patterns. Claude Code produces good starting points but isn't perfect. Treat all output as first drafts requiring validation through testing and inspection, especially for statistical computations or data transformations.
  • Attempting overly complex requests in single prompts produces confused or incomplete results. When you ask Claude Code to "build the entire analysis pipeline from scratch," it gets overwhelmed. Break large tasks into focused steps—first create data loading, then validation, then preprocessing—building incrementally toward the desired outcome.
  • Ignoring error messages when Claude Code encounters problems prevents identifying root causes. Read errors carefully and ask Claude Code for specific debugging assistance: "The preprocessing function failed with KeyError on 'customer_id'. What might cause this and how should I fix it?"

Understanding Claude Code's Limitations

Setting realistic expectations about what Claude Code cannot do well builds trust through transparency.

Domain-specific understanding requires your input. Claude Code generates code based on patterns and best practices but cannot validate whether analytical approaches are appropriate for your research questions or business problems. You must provide domain expertise and methodological judgment.

Subtle bugs can slip through. Generated code for advanced statistical methods, custom loss functions, or intricate data transformations requires careful validation. Always test generated code thoroughly against known-good examples.

Large project understanding is limited. Claude Code works best on focused tasks within individual files rather than system-wide refactoring across complex architectures with dozens of interconnected files.

Edge cases may not be handled. Preprocessing code might handle clean training data perfectly but break on production data with unexpected null patterns or outlier distributions that weren't present during development.

Expertise is not replaceable. Claude Code accelerates implementation but does not replace fundamental understanding of data science principles, statistical methods, or domain knowledge.

Security Considerations

When Claude Code accesses external data sources, malicious actors could potentially embed instructions in data that Claude Code interprets as commands. This concern is known as prompt injection.

Maintain skepticism about Claude Code suggestions when working with untrusted external sources. Never grant Claude Code access to production databases, sensitive customer information, or critical systems without careful review of proposed operations.

For most data scientists working with internal datasets and trusted sources, this risk remains theoretical, but awareness becomes important as you expand usage into more automated workflows.

Frequently Asked Questions

How much does Claude Code cost for typical data science usage?

Claude Code itself is free to install, but it requires an Anthropic API key with usage-based pricing. The free tier allows limited testing suitable for evaluation. The Pro plan at $20/month handles moderate daily development: generating preprocessing code, debugging errors, refactoring functions. Heavy users working intensively on multiple projects may prefer pay-as-you-go pricing, typically $6-12 daily for active development. Start with the free tier to evaluate effectiveness, then upgrade based on value.

Does Claude Code work with Jupyter notebooks?

Claude Code operates as a command-line tool and works best with Python scripts and modules. For Jupyter notebooks, use Claude Code to build utility modules that your notebooks import, creating cleaner separation between exploratory analysis and reusable logic. You can also copy code cells into Python files, improve them with Claude Code, then bring the enhanced code back to the notebook.

Can Claude Code access my data files or databases?

Claude Code reads files you explicitly include through context and generates code that loads data from files or databases. It doesn't automatically access your data without instruction. You maintain full control over what files and information Claude Code can see. When you ask Claude Code to analyze data patterns, it reads the data through code execution, not by directly accessing databases or files independently.

How does Claude Code compare to GitHub Copilot?

GitHub Copilot provides inline code suggestions as you type within an IDE, excelling at completing individual lines or functions. Claude Code offers more substantial assistance with entire file transformations, debugging sessions, and refactoring through conversational interaction. Many practitioners use both—Copilot for writing code interactively, Claude Code for larger refactoring and debugging work. They complement each other rather than compete.

Next Steps

You now have Claude Code installed, understand its capabilities and limitations, and have seen concrete examples of how it handles data science tasks.

Start by using Claude Code for low-risk tasks where mistakes are easily corrected: generating documentation for existing functions, creating test cases for well-understood code, or refactoring non-critical utility scripts. This builds confidence without risking important work. Gradually increase complexity as you become comfortable.

Maintain a personal collection of effective prompts for data science tasks you perform regularly. When you discover a prompt pattern that produces excellent results, save it for reuse. This accelerates work on similar future tasks.

For technical details and advanced features, explore Anthropic's Claude Code documentation. The official docs cover advanced topics like Model Context Protocol servers, custom hooks, and integration patterns.

To systematically learn generative AI across your entire practice, check out our Generative AI Fundamentals in Python skill path. For deeper understanding of effective prompt design, our Prompting Large Language Models in Python course teaches frameworks for crafting prompts that consistently produce useful results.

Getting Started

AI-assisted development requires practice and iteration. You'll experience some awkwardness as you learn to communicate effectively with Claude Code, but this learning curve is brief. Most practitioners feel productive within their first week of regular use.

Install Claude Code, work through the examples in this tutorial with your own projects, and discover how AI assistance fits into your workflow.


Have questions or want to share your Claude Code experience? Join the discussion in the Dataquest Community where thousands of data scientists are exploring AI-assisted development together.

Python Practice: 91 Exercises, Projects, and Tutorials

16 October 2025 at 23:26

This guide gives you 91 ways to practice Python — from quick exercises to real projects and helpful courses. Whether you’re a beginner or preparing for a job, there’s something here for you.


Table of Contents

  1. Hands-On Courses
  2. Free Exercises
  3. Projects
  4. Online Tutorials

Hands-On Courses

Some Python programming courses let you learn and code at the same time. You read a short lesson, then solve a problem in your browser. It’s a fast, hands-on way to learn.

Each course below includes at least one free lesson you can try.

Python Courses

Python Basics Courses

Data Analysis & Visualization Courses

Data Cleaning Courses

Machine Learning Courses

AI & Deep Learning Courses

Probability & Statistics Courses

Hypothesis Testing

These courses are a great way to practice Python online, and they're all free to start. If you're looking for more Python courses, you can find them on Dataquest's course page.


Free Python Exercises

Exercises are a great way to focus on a specific skill. For example, if you have a job interview coming up, practicing Python dictionaries will refresh your knowledge and boost your confidence.

Each lesson is free to start.

Coding Exercises

Beginner Python Exercises

Intermediate Python Programming

Data Handling and Manipulation with NumPy

Data Handling and Manipulation with pandas

Data Analysis

Complexity and Algorithms


Python Projects

Projects are one of the best ways to practice Python. Doing projects helps you remember syntax, apply what you’ve learned, and build a portfolio to show employers.

Here are some projects you can start with right away:

Beginner Projects

Data Analysis Projects

Data Engineering Projects

Machine Learning & AI Projects

If none of these spark your interest, there are plenty of other Python projects to try.


Online Python Tutorials

If exercises, courses, or projects aren’t your thing, blog-style tutorials are another way to learn Python. They’re great for reading on your phone or when you can’t code directly.

Core Python Concepts (Great for Beginners)

Intermediate Techniques

Data Analysis & Data Science

The web is full of thousands of beginner Python tutorials. Once you know the basics, you can find endless ways to practice Python online.


FAQs

Where can I practice Python programming online?

  1. Dataquest.io: Offers dozens of free interactive practice questions, lessons, project ideas, walkthroughs, tutorials, and more.
  2. HackerRank: A popular site for interactive coding practice and challenges.
  3. CodingGame: A fun platform that lets you practice Python through games and coding puzzles.
  4. Edabit: Provides Python challenges that are great for practice or self-testing.
  5. LeetCode: Helps you test your skills and prepare for technical interviews with Python coding problems.

How can I practice Python at home?

  1. Install Python on your machine.

You can download Python directly here, or use a program like Anaconda Individual Edition that makes the process easier. If you don’t want to install anything, you can use an interactive online platform like Dataquest and write code right in your browser.

  2. Work on projects or practice problems.

Find a good Python project or some practice problems to apply what you’re learning. Hands-on coding is one of the best ways to improve.

  3. Make a schedule.

Plan your practice sessions and stick to them. Regular, consistent practice is key to learning Python effectively.

  4. Join an online community.

It's always great to get help from a real person. Reddit has great Python communities, and Dataquest's Community is great if you're learning Python data skills.

Can I practice Python on mobile?

Yes! There are many apps that let you practice Python on both iOS and Android.

However, mobile practice shouldn’t be your main way of learning if you want to use Python professionally. It’s important to practice installing and working with Python on a desktop or laptop, since that’s how most real-world programming is done.

If you’re looking for an app to practice on the go, a great option is Mimo.

With AI advancing so quickly, should I still practice Python?

Absolutely! While AI is a powerful support tool, we can’t always rely on it blindly. AI can sometimes give incorrect answers or generate code that isn’t optimal.

Python is still essential, especially in the AI field. It’s a foundational language for developing AI technologies and is constantly updated to work with the latest AI advancements.

Popular Python libraries like TensorFlow and PyTorch make it easier to build and train complex AI models efficiently. Learning Python also helps you understand how AI tools work under the hood, making you a more skilled and knowledgeable developer.

Build Your First ETL Pipeline with PySpark

15 October 2025 at 23:57

You've learned PySpark basics: RDDs, DataFrames, maybe some SQL queries. You can transform data and run aggregations in notebooks. But here's the thing: data engineering is about building pipelines that run reliably every single day, handling the messy reality of production data.

Today, we're building a complete ETL pipeline from scratch. This pipeline will handle the chaos you'll actually encounter at work: inconsistent date formats, prices with dollar signs, test data that somehow made it to production, and customer IDs that follow different naming conventions.

Here's the scenario: You just started as a junior data engineer at an online grocery delivery service. Your team lead drops by your desk with a problem. "Hey, we need help. Our daily sales report is a mess. The data comes in as CSVs from three different systems, nothing matches up, and the analyst team is doing everything manually in Excel. Can you build us an ETL pipeline?"

She shows you what she's dealing with:

  • Order files that need standardized date formatting
  • Product prices stored as "$12.99" in some files, "12.99" in others
  • Customer IDs that are sometimes numbers, sometimes start with "CUST_"
  • Random blank rows and test data mixed in ("TEST ORDER - PLEASE IGNORE")

"Just get it into clean CSV files," she says. "We'll worry about performance and parquet later. We just need something that works."

Your mission? Build an ETL pipeline that takes this mess and turns it into clean, reliable data the analytics team can actually use. No fancy optimizations needed, just something that runs every morning without breaking.

Setting Up Your First ETL Project

Let's start with structure. One of the biggest mistakes new data engineers make is jumping straight into writing transformation code without thinking about organization. You end up with a single massive Python file that's impossible to debug, test, or explain to your team.

We're going to build this the way professionals do it, but keep it simple enough that you won't get lost in abstractions.

Project Structure That Makes Sense

Here's what we're creating:

grocery_etl/
├── data/
│   ├── raw/         # Your messy input CSVs
│   ├── processed/   # Clean output files
├── src/
│   └── etl_pipeline.py
├── main.py
└── requirements.txt

Why this structure? Three reasons:

First, it separates concerns. Your main.py handles orchestration; starting Spark, calling functions, handling errors. Your src/etl_pipeline.py contains all the actual ETL logic. When something breaks, you'll know exactly where to look.

Second, it mirrors the organizational pattern you'll use in production pipelines (even though the specifics will differ). Whether you're deploying to Databricks, AWS EMR, or anywhere else, you'll separate concerns the same way: orchestration code (main.py), ETL logic (src/etl_pipeline.py), and clear data boundaries. The actual file paths will change (e.g., production uses distributed filesystems like s3://data-lake/raw/ or /mnt/efs/raw/ instead of local folders), but the structure scales.

Third, it keeps your local development organized. Raw data stays raw. Processed data goes to a different folder. This makes debugging easier and mirrors the input/output separation you'll have in production, just on your local machine.

Ready to start? Get the sample CSV files and project skeleton from our starter repository. You can either:

# Clone the full tutorials repo and navigate to this project
git clone https://github.com/dataquestio/tutorials.git
cd tutorials/pyspark-etl-tutorial

Or download just the pyspark-etl-tutorial folder as a ZIP from the GitHub page.

Getting Started

We'll build this project in two files:

  • src/etl_pipeline.py: All our ETL functions (extract, transform, load)
  • main.py: Orchestration logic that calls those functions

Let's set up the basics. You'll need Python 3.9+ and Java 11 or 17 installed (required for Spark 4.0). Note: In production, you'd match your PySpark version to whatever your cluster is running (Databricks, EMR, etc.).

# requirements.txt
pyspark==4.0.1
# main.py - Just the skeleton for now
from pyspark.sql import SparkSession
import logging
import sys

def main():
    # We'll complete this orchestration logic later
    pass

if __name__ == "__main__":
    main()

That's it for setup. Notice we're not installing dozens of dependencies or configuring complex build tools. We're keeping it minimal because the goal is to understand ETL patterns, not fight with tooling.

Optional: Interactive Data Exploration

Before we dive into writing pipeline code, you might want to poke around the data interactively. This is completely optional. If you prefer to jump straight into building, skip to the next section, but if you want to see what you're up against, fire up the PySpark shell:

pyspark

Now you can explore interactively from the command line:

df = spark.read.csv("data/raw/online_orders.csv", header=True)

# See the data firsthand
df.show(5, truncate=False)
df.printSchema()
df.describe().show()

# Count how many weird values we have
df.filter(df.price.contains("$")).count()
df.filter(df.customer_id.contains("TEST")).count()

This exploration helps you understand what cleaning you'll need to build into your pipeline. Real data engineers do this all the time: you load a sample, poke around, discover the problems, then write code to fix them systematically.

But interactive exploration is for understanding the data. The actual pipeline needs to be scripted, testable, and able to run without you babysitting it. That's what we're building next.

Extract: Getting Data Flexibly

The Extract phase is where most beginner ETL pipelines break. You write code that works perfectly with your test file, then the next day's data arrives with a slightly different format, and everything crashes.

We're going to read CSVs the defensive way: assume everything will go wrong, capture the problems, and keep the pipeline running.

Reading Messy CSV Files

Let's start building src/etl_pipeline.py. We'll begin with imports and a function to create our Spark session:

# src/etl_pipeline.py

from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
import logging

# Set up logger for this module
logger = logging.getLogger(__name__)

def create_spark_session():
    """Create a Spark session for our ETL job"""
    return SparkSession.builder \
        .appName("Grocery_Daily_ETL") \
        .config("spark.sql.adaptive.enabled", "true") \
        .getOrCreate()

This is a basic local configuration. Real production pipelines need more, including time zone handling, memory allocation tuned to your cluster, and policies for parsing dates; we’ll cover those in a future tutorial on production deployment. For now, we're focusing on the pattern.

If you're new to the logging module, logger.info() writes to log files with timestamps and severity levels. When something breaks, you can check the logs to see exactly what happened. It's a small habit that saves debugging time.

Now let's read the data:

def extract_sales_data(spark, input_path):
    """Read sales CSVs with all their real-world messiness"""

    logger.info(f"Reading sales data from {input_path}")

    expected_schema = StructType([
        StructField("order_id", StringType(), True),
        StructField("customer_id", StringType(), True),
        StructField("product_name", StringType(), True),
        StructField("price", StringType(), True),
        StructField("quantity", StringType(), True),
        StructField("order_date", StringType(), True),
        StructField("region", StringType(), True)
    ])

StructType and StructField let you define exactly what columns you expect and what data types they should have. The True at the end means the field can be null. You could let Spark infer the schema automatically, but explicit schemas catch problems earlier. If someone adds a surprise column next week, you'll know immediately instead of discovering it three steps downstream.

Notice everything is StringType(). You might think "wait, customer_id has numbers, shouldn't that be IntegerType?" Here's the thing: some customer IDs are "12345" and some are "CUST_12345". If we used IntegerType(), Spark would convert "CUST_12345" to null and we'd lose data.

The strategy is simple: prevent data loss by preserving everything as strings in the Extract phase, then clean and convert in the Transform phase, where we have control over error handling.

Now let's read the file defensively:

    df = spark.read.csv(
        input_path,
        header=True,
        schema=expected_schema,
        mode="PERMISSIVE"
    )

    total_records = df.count()
    logger.info(f"Found {total_records} total records")

    return df

The PERMISSIVE mode tells Spark to be lenient with malformed data. When it encounters rows that don't match the schema, it sets unparseable fields to null instead of crashing the entire job. This keeps production pipelines running even when data quality takes a hit. We'll validate and handle data quality issues in the Transform phase, where we have better control.
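
If you also want to see which rows Spark considered malformed, the CSV reader can capture the raw text of bad rows in a dedicated column. This optional sketch is not part of our pipeline; it reuses the expected_schema and input_path from extract_sales_data and assumes an active spark session:

# Optional debugging sketch: capture malformed rows in a dedicated column
from pyspark.sql.types import StructType, StructField, StringType

debug_schema = StructType(
    expected_schema.fields + [StructField("_corrupt_record", StringType(), True)]
)

df_debug = spark.read.csv(
    input_path,
    header=True,
    schema=debug_schema,
    mode="PERMISSIVE",
    columnNameOfCorruptRecord="_corrupt_record",
)

# Rows that failed to parse keep their original text here for inspection
df_debug.filter(df_debug["_corrupt_record"].isNotNull()).show(truncate=False)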

Dealing with Multiple Files

Real data comes from multiple systems. Let's combine them:

def extract_all_data(spark):
    """Combine data from multiple sources"""

    # Each system exports differently
    online_orders = extract_sales_data(spark, "data/raw/online_orders.csv")
    store_orders = extract_sales_data(spark, "data/raw/store_orders.csv")
    mobile_orders = extract_sales_data(spark, "data/raw/mobile_orders.csv")

    # Union them all together
    all_orders = online_orders.unionByName(store_orders).unionByName(mobile_orders)

    logger.info(f"Combined dataset has {all_orders.count()} orders")
    return all_orders

In production, you'd often use wildcards like "data/raw/online_orders*.csv" to process multiple files at once (like daily exports). Spark reads them all and combines them automatically. We're keeping it simple here with one file per source.

The .unionByName() method stacks DataFrames vertically, matching columns by name rather than position. This prevents silent data corruption if schemas don't match perfectly, which is a common issue when combining data from different systems. Since we defined the same schema for all three sources, this works cleanly.

You've now built the Extract phase: reading data defensively and combining multiple sources. The data isn't clean yet, but at least we didn't lose any of it. That's what matters in Extract.

Transform: Fixing the Data Issues

This is where the real work happens. You've got all your data loaded, good and bad separated. Now we need to turn those messy strings into clean, usable data types.

The Transform phase is where you fix all the problems you discovered during extraction. Each transformation function handles one specific issue, making the code easier to test and debug.

Standardizing Customer IDs

Remember how customer IDs come in two formats? Some are just numbers, some have the "CUST_" prefix. Let's standardize them:

# src/etl_pipeline.py (continuing in same file)

def clean_customer_id(df):
    """Standardize customer IDs (some are numbers, some are CUST_123 format)"""

    df_cleaned = df.withColumn(
        "customer_id_cleaned",
        when(col("customer_id").startswith("CUST_"), col("customer_id"))
        .when(col("customer_id").rlike("^[0-9]+$"), concat(lit("CUST_"), col("customer_id")))
        .otherwise(col("customer_id"))
    )

    return df_cleaned.drop("customer_id").withColumnRenamed("customer_id_cleaned", "customer_id")

The logic here: if it already starts with "CUST_", keep it. If it's just numbers (rlike("^[0-9]+$") checks for that), add the "CUST_" prefix. Everything else stays as-is for now. This gives us a consistent format to work with downstream.

Cleaning Price Data

Prices are often messy. Dollar signs, commas, who knows what else:

# src/etl_pipeline.py (continuing in same file)

def clean_price_column(df):
    """Fix the price column"""

    # Remove dollar signs, commas, etc. (keep digits, decimals, and negatives)
    df_cleaned = df.withColumn(
        "price_cleaned",
        regexp_replace(col("price"), r"[^0-9.\-]", "")
    )

    # Convert to decimal, default to 0 if it fails
    df_final = df_cleaned.withColumn(
        "price_decimal",
        when(col("price_cleaned").isNotNull(),
             col("price_cleaned").cast(DoubleType()))
        .otherwise(0.0)
    )

    # Flag suspicious values for review
    df_flagged = df_final.withColumn(
        "price_quality_flag",
        when(col("price_decimal") == 0.0, "CHECK_ZERO_PRICE")
        .when(col("price_decimal") > 1000, "CHECK_HIGH_PRICE")
        .when(col("price_decimal") < 0, "CHECK_NEGATIVE_PRICE")
        .otherwise("OK")
    )

    bad_price_count = df_flagged.filter(col("price_quality_flag") != "OK").count()
    logger.warning(f"Found {bad_price_count} orders with suspicious prices")

    return df_flagged.drop("price", "price_cleaned")

The regexp_replace function strips out everything that isn't a digit, decimal point, or minus sign. Then we convert to a proper decimal type. The quality flag column helps us track suspicious values without throwing them out. This is important: we're not perfect at cleaning, so we flag problems for humans to review later.

Note that we're assuming US price format here (periods as decimal separators). European formats with commas would need different logic, but for this tutorial, we're keeping it focused on the ETL pattern rather than international number handling.

Standardizing Dates

Date parsing is one of those things that looks simple but gets complicated fast. Different systems export dates in different formats: some use MM/dd/yyyy, others use dd-MM-yyyy, and ISO standard is yyyy-MM-dd.

def standardize_dates(df):
    """Parse dates in multiple common formats"""

    # Try each format - coalesce returns the first non-null result
    fmt1 = to_date(col("order_date"), "yyyy-MM-dd")
    fmt2 = to_date(col("order_date"), "MM/dd/yyyy")
    fmt3 = to_date(col("order_date"), "dd-MM-yyyy")

    df_parsed = df.withColumn(
        "order_date_parsed",
        coalesce(fmt1, fmt2, fmt3)
    )

    # Check how many we couldn't parse
    unparsed = df_parsed.filter(col("order_date_parsed").isNull()).count()
    if unparsed > 0:
        logger.warning(f"Could not parse {unparsed} dates")

    return df_parsed.drop("order_date")

We use coalesce() to try each format in order, taking the first one that successfully parses. This handles the most common date format variations you'll encounter.

Note: This approach works for simple date strings but doesn't handle datetime strings with times or timezones. For production systems dealing with international data or precise timestamps, you'd need more sophisticated parsing logic. For now, we're focusing on the core pattern.

Removing Test Data

Test data in production is inevitable. Let's filter it out:

# src/etl_pipeline.py (continuing in same file)

def remove_test_data(df):
    """Remove test orders that somehow made it to production"""

    df_filtered = df.filter(
        ~(upper(col("customer_id")).contains("TEST") |
          upper(col("product_name")).contains("TEST") |
          col("customer_id").isNull() |
          col("order_id").isNull())
    )

    removed_count = df.count() - df_filtered.count()
    logger.info(f"Removed {removed_count} test/invalid orders")

    return df_filtered

We're checking for "TEST" in customer IDs and product names, plus filtering out any rows with null order IDs or customer IDs. That tilde (~) means "not", so we're keeping everything that doesn't match these patterns.

Handling Duplicates

Sometimes the same order appears multiple times, usually from system retries:

# src/etl_pipeline.py (continuing in same file)

def handle_duplicates(df):
    """Remove duplicate orders (usually from retries)"""

    df_deduped = df.dropDuplicates(["order_id"])

    duplicate_count = df.count() - df_deduped.count()
    if duplicate_count > 0:
        logger.info(f"Removed {duplicate_count} duplicate orders")

    return df_deduped

We keep the first occurrence of each order_id and drop the rest. Simple and effective.

Bringing It All Together

Now we chain all these transformations in sequence:

# src/etl_pipeline.py (continuing in same file)

def transform_orders(df):
    """Apply all transformations in sequence"""

    logger.info("Starting data transformation...")

    # Clean each aspect of the data
    df = clean_customer_id(df)
    df = clean_price_column(df)
    df = standardize_dates(df)
    df = remove_test_data(df)
    df = handle_duplicates(df)

    # Cast quantity to integer
    df = df.withColumn(
        "quantity",
        when(col("quantity").isNotNull(), col("quantity").cast(IntegerType()))
        .otherwise(1)
    )

    # Add some useful calculated fields
    df = df.withColumn("total_amount", col("price_decimal") * col("quantity")) \
           .withColumn("processing_date", current_date()) \
           .withColumn("year", year(col("order_date_parsed"))) \
           .withColumn("month", month(col("order_date_parsed")))

    # Rename for clarity
    df = df.withColumnRenamed("order_date_parsed", "order_date") \
           .withColumnRenamed("price_decimal", "unit_price")

    logger.info(f"Transformation complete. Final record count: {df.count()}")

    return df

Each transformation returns a new DataFrame (remember, PySpark DataFrames are immutable), so we reassign the result back to df each time. The order matters here: we clean customer IDs before removing test data because the test removal logic checks for "TEST" in customer IDs. We standardize dates before extracting year and month because those extraction functions need properly parsed dates to work. If you swap the order around, transformations can fail or produce wrong results.

We also add some calculated fields that will be useful for analysis: total_amount (price times quantity), processing_date (when this ETL ran), and time partitions (year and month) for efficient querying later.

The data is now clean, typed correctly, and enriched with useful fields. Time to save it.

Load: Saving Your Work

The Load phase is when we write the cleaned data somewhere useful. We're using pandas to write the final CSV because it avoids platform-specific issues during local development. In production on a real Spark cluster, you'd use Spark's native writers for parquet format with partitioning for better performance. For now, we're focusing on getting the pipeline working reliably across different development environments. You can always swap the output format to parquet once you deploy to a production cluster.

Writing Clean Files

Let's write our data in a way that makes future queries fast:

# src/etl_pipeline.py (continuing in same file)

def load_to_csv(spark, df, output_path):
    """Save processed data for downstream use"""

    logger.info(f"Writing {df.count()} records to {output_path}")

    # Convert to pandas for local development ONLY (not suitable for large datasets)
    pandas_df = df.toPandas()

    # Create output directory if needed
    import os
    os.makedirs(output_path, exist_ok=True)

    output_file = f"{output_path}/orders.csv"
    pandas_df.to_csv(output_file, index=False)

    logger.info(f"Successfully wrote {len(pandas_df)} records")
    logger.info(f"Output location: {output_file}")

    return len(pandas_df)

Important: The .toPandas() method collects all distributed data into the driver's memory. This is dangerous for real production data! If your dataset is larger than your driver's RAM, your job will crash. We're using this approach only because:

  1. Our tutorial dataset is tiny (85 rows)
  2. It avoids platform-specific Spark/Hadoop setup issues on Windows
  3. The focus is on learning ETL patterns, not deployment

In production, always use Spark's native writers (df.write.parquet(), df.write.csv()) even though they require proper cluster configuration. Never use .toPandas() for datasets larger than a few thousand rows or anything you wouldn't comfortably fit in a single machine's memory.
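
For reference, a production-style version of this step would look roughly like the sketch below: a hypothetical load function that hands the write to Spark's distributed writer and partitions by the year and month columns we created during transformation.

# Sketch of a production-style write (not used in this tutorial's local pipeline)
def load_to_parquet(df, output_path):
    """Write partitioned parquet using Spark's distributed writer."""
    (
        df.write
          .mode("overwrite")
          .partitionBy("year", "month")
          .parquet(output_path)
    )

Partitioning by those columns means downstream queries that filter on year or month only read the files they need.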

Quick Validation with Spark SQL

Before we call it done, let's verify our data makes sense. This is where Spark SQL comes in handy:

# src/etl_pipeline.py (continuing in same file)

def sanity_check_data(spark, output_path):
    """Quick validation using Spark SQL"""

    # Read the CSV file back
    output_file = f"{output_path}/orders.csv"
    df = spark.read.csv(output_file, header=True, inferSchema=True)
    df.createOrReplaceTempView("orders")

    # Run some quick validation queries
    total_count = spark.sql("SELECT COUNT(*) as total FROM orders").collect()[0]['total']
    logger.info(f"Sanity check - Total orders: {total_count}")

    # Check for any suspicious data that slipped through
    zero_price_count = spark.sql("""
        SELECT COUNT(*) as zero_prices
        FROM orders
        WHERE unit_price = 0
    """).collect()[0]['zero_prices']

    if zero_price_count > 0:
        logger.warning(f"Found {zero_price_count} orders with zero price")

    # Verify date ranges make sense
    date_range = spark.sql("""
        SELECT
            MIN(order_date) as earliest,
            MAX(order_date) as latest
        FROM orders
    """).collect()[0]

    logger.info(f"Date range: {date_range['earliest']} to {date_range['latest']}")

    return True

The createOrReplaceTempView() lets us query the DataFrame using SQL. This is useful for validation because SQL is often clearer for these kinds of checks than chaining DataFrame operations. We're checking the record count, looking for zero prices that might indicate cleaning issues, and verifying the date range looks reasonable.

Creating a Summary Report

Your team lead is going to ask, "How'd the ETL go today?" Let's give her the answer automatically:

# src/etl_pipeline.py (continuing in same file)

def create_summary_report(df):
    """Generate metrics about the ETL run"""

    summary = {
        "total_orders": df.count(),
        "unique_customers": df.select("customer_id").distinct().count(),
        "unique_products": df.select("product_name").distinct().count(),
        "total_revenue": df.agg(sum("total_amount")).collect()[0][0],
        "date_range": f"{df.agg(min('order_date')).collect()[0][0]} to {df.agg(max('order_date')).collect()[0][0]}",
        "regions": df.select("region").distinct().count()
    }

    logger.info("\n=== ETL Summary Report ===")
    for key, value in summary.items():
        logger.info(f"{key}: {value}")
    logger.info("========================\n")

    return summary

This generates a quick summary of what got processed. In a real production system, you might email this summary or post it to Slack so the team knows the pipeline ran successfully.

One note about performance: this summary triggers multiple separate actions on the DataFrame. Each .count() and .distinct().count() scans the data independently, which isn't optimized. We could compute all these metrics in a single pass, but that's a topic for a future tutorial on performance optimization. Right now, we're prioritizing readable code that works.

Putting It All Together

We've built all the pieces. Now let's wire them up into a complete pipeline that runs from start to finish.

Remember how we set up main.py as just a skeleton? Time to fill it in. This file orchestrates everything: starting Spark, calling our ETL functions in order, handling errors, and cleaning up when we're done.

The Complete Pipeline

# main.py
from pyspark.sql import SparkSession
import logging
import sys
import traceback
from datetime import datetime
import os

# Import our ETL functions
from src.etl_pipeline import *

def setup_logging():
    """Basic logging setup"""

    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(f'logs/etl_run_{datetime.now().strftime("%Y%m%d")}.log'),
            logging.StreamHandler(sys.stdout)
        ]
    )
    return logging.getLogger(__name__)

def main():
    """Main ETL pipeline"""

    # Create necessary directories
    os.makedirs('logs', exist_ok=True)
    os.makedirs('data/processed/orders', exist_ok=True)

    logger = setup_logging()
    logger.info("Starting Grocery ETL Pipeline")

    # Track runtime
    start_time = datetime.now()
    spark = None  # so the finally block can tell whether Spark ever started

    try:
        # Initialize Spark
        spark = create_spark_session()
        logger.info("Spark session created")

        # Extract
        raw_df = extract_all_data(spark)
        logger.info(f"Extracted {raw_df.count()} raw records")

        # Transform
        clean_df = transform_orders(raw_df)
        logger.info(f"Transformed to {clean_df.count()} clean records")

        # Load
        output_path = "data/processed/orders"
        load_to_csv(spark, clean_df, output_path)

        # Sanity check
        sanity_check_data(spark, output_path)

        # Create summary
        summary = create_summary_report(clean_df)

        # Calculate runtime
        runtime = (datetime.now() - start_time).total_seconds()
        logger.info(f"Pipeline completed successfully in {runtime:.2f} seconds")

    except Exception as e:
        logger.error(f"Pipeline failed: {str(e)}")
        logger.error(traceback.format_exc())
        raise

    finally:
        if spark is not None:
            spark.stop()
            logger.info("Spark session closed")

if __name__ == "__main__":
    main()

Let's walk through what's happening here.

The setup_logging() function configures logging to write to both a file and the console. The log file gets named with today's date, so you'll have a history of every pipeline run. This is invaluable when you're debugging issues that happened last Tuesday.

The main function wraps everything in a try-except-finally block, which is important for production pipelines. The try block runs your ETL logic. If anything fails, the except block logs the error with a full traceback (that traceback.format_exc() is especially helpful when Spark's Java stack traces get messy). The finally block ensures we always close the Spark session, even if something crashed.

Notice we're using relative paths like "data/processed/orders". This is fine for local development but brittle in production. Real pipelines use environment variables or configuration files for paths. We'll cover that in a future tutorial on production deployment.
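
As a quick illustration of that pattern, you could read the output location from an environment variable and fall back to the local default; the variable name here is hypothetical:

# Hypothetical environment variable, shown only to illustrate the pattern
import os

output_path = os.environ.get("GROCERY_ETL_OUTPUT_PATH", "data/processed/orders")
load_to_csv(spark, clean_df, output_path)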

Running Your Pipeline

With everything in place, you can run your pipeline with spark-submit:

# Basic run
spark-submit main.py

# With more memory for bigger datasets
spark-submit --driver-memory 4g main.py

# See what's happening with Spark's adaptive execution
spark-submit --conf spark.sql.adaptive.enabled=true main.py

The first time you run this, you'll probably hit some issues, but that's completely normal. Let's talk about the most common ones.

Common Issues You'll Hit

No ETL pipeline works perfectly on the first try. Here are the problems everyone runs into and how to fix them.

Memory Errors

If you see java.lang.OutOfMemoryError, Spark ran out of memory. Since we're using .toPandas() to write our output, this most commonly happens if your cleaned dataset is too large to fit in the driver's memory:

# Option 1: Increase driver memory
spark-submit --driver-memory 4g main.py

# Option 2: Sample the data first to verify the pipeline works
df.sample(0.1).toPandas()  # Process 10% to test

# Option 3: Switch to Spark's native CSV writer for large data
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(output_path)

For local development with reasonably-sized data, increasing driver memory usually solves the problem. For truly massive datasets, you'd switch back to Spark's distributed writers.

Schema Mismatches

If you get "cannot resolve column name" errors, your DataFrame doesn't have the columns you think it does:

# Debug by checking what columns actually exist
df.printSchema()
print(df.columns)

This usually means a transformation dropped or renamed a column, and you forgot to update the downstream code.

Slow Performance

If your pipeline is running but taking forever, don't worry about optimization yet. That's a whole separate topic. For now, just get it working. But if it's really unbearably slow, try caching DataFrames you reference multiple times:

df.cache()  # Keep frequently used data in memory

Just remember to call df.unpersist() when you're done with it to free up memory.

What You've Accomplished

You just built a complete ETL pipeline from scratch. Here's what you learned:

  • You can handle messy real-world data. CSV files with dollar signs in prices, mixed date formats, and test records mixed into production data.
  • You can structure projects professionally. Separate functions for extract, transform, and load. Logging that helps you debug failures. Error handling that keeps the pipeline running when something goes wrong.
  • You know how to run production-style jobs. Code you can deploy with spark-submit that runs on a schedule.
  • You can spot and flag data quality issues. Suspicious prices get flagged. Test data gets filtered. Summary reports tell you what processed.

This is the foundation every data engineer needs. You're ready to build ETL pipelines for real projects.

What's Next

This pipeline works, but it's not optimized. Here's what comes after you’re comfortable with the basics:

  • Performance optimization - Make this pipeline 10x faster by reducing shuffles, tuning partitions, and computing metrics efficiently.
  • Production deployment - Run this on Databricks or EMR. Handle configuration properly, monitor with metrics, and schedule with Airflow.
  • Testing and validation - Write tests for your transformations. Add data quality checks. Build confidence that changes won't break production.

But those are advanced topics. For now, you've built something real. Take a break, then find a messy CSV dataset and build an ETL pipeline for it. The best way to learn is by doing, so here's a concrete exercise to cement what you've learned:

  1. Find any CSV dataset (Kaggle has thousands)
  2. Build an ETL pipeline for it
  3. Add handling for three data quality issues you discover
  4. Output clean parquet files partitioned by a date or category field
  5. Create a summary report showing what you processed

You now have the foundation every data engineer needs. The next time you see messy data at work, you'll know exactly how to turn it into something useful.

To learn more about PySpark, check out the rest of our tutorial series.

Introduction to Apache Airflow

13 October 2025 at 23:26

Imagine this: you’re a data engineer at a growing company that thrives on data-driven decisions. Every morning, dashboards must refresh with the latest numbers, reports need updating, and machine learning models retrain with new data.

At first, you write a few scripts, one to pull data from an API, another to clean it, and a third to load it into a warehouse. You schedule them with cron or run them manually when needed. It works fine, until it doesn’t.

As data volumes grow, scripts multiply, and dependencies become increasingly tangled. Failures start cascading, jobs run out of order, schedules break, and quick fixes pile up into fragile automation. Before long, you're maintaining a system held together by patchwork scripts and luck. That’s where data orchestration comes in.

Data orchestration coordinates multiple interdependent processes, ensuring each task runs in the correct order, at the right time, and under the right conditions. It’s the invisible conductor that keeps data pipelines flowing smoothly from extraction to transformation to loading, reliably and automatically. And among the most powerful and widely adopted orchestration tools is Apache Airflow.

In this tutorial, we’ll use Airflow as our case study to explore how workflow orchestration works in practice. You’ll learn what orchestration means, why it matters, and how Airflow’s architecture, with its DAGs, tasks, operators, scheduler, and new event-driven features, brings order to complex data systems.

By the end, you’ll understand not just how Airflow orchestrates workflows, but why orchestration itself is the cornerstone of every scalable, reliable, and automated data engineering ecosystem.

What Workflow Orchestration Is and Why It Matters

Modern data pipelines involve multiple interconnected stages: data extraction, transformation, loading, and often downstream analytics or machine learning. Each stage depends on the successful completion of the previous one, forming a chain that must execute in the correct order and at the right time.

Many data engineers start by managing these workflows with scripts or cron jobs. But as systems grow, dependencies multiply, and processes become more complex, this manual approach quickly breaks down:

  • Unreliable execution: Tasks may run out of order, producing incomplete or inconsistent data.
  • Limited visibility: Failures often go unnoticed until reports or dashboards break.
  • Poor scalability: Adding new tasks or environments becomes error-prone and hard to maintain.

Workflow orchestration solves these challenges by automating, coordinating, and monitoring interdependent tasks. It ensures each step runs in the right sequence, at the right time, and under the right conditions, bringing structure, reliability, and transparency to data operations.

With orchestration, a loose collection of scripts becomes a cohesive system that can be observed, retried, and scaled, freeing engineers to focus on building insights rather than fixing failures.

Apache Airflow uses these principles and extends them with modern capabilities such as:

  • Deferrable sensors and the triggerer: Improve efficiency by freeing workers while waiting for external events like file arrivals or API responses.
  • Built-in idempotency and backfills: Safely re-run historical or failed workflows without duplication.
  • Data-aware scheduling: Enable event-driven pipelines that automatically respond when new data arrives.

While Airflow is not a real-time streaming engine, it excels at orchestrating batch and scheduled workflows with reliability, observability, and control. Trusted by organizations like Airbnb, Meta, and NASA, it remains the industry standard for automating and scaling complex data workflows.

Next, we’ll explore Airflow’s core concepts, DAGs, tasks, operators, and the scheduler, to see orchestration in action.

Core Airflow Concepts

To understand how Airflow orchestrates workflows, let’s explore its foundational components: the DAG, tasks, scheduler, executor, triggerer, and metadata database.

Together, these components coordinate how data flows through extraction, transformation, model training, and loading in a seamless, automated pipeline.

We’ll use a simple ETL (Extract → Transform → Load) data workflow as our running example. Each day, Airflow will:

  1. Collect daily event data,
  2. Transform it into a clean format,
  3. Upload the results to Amazon S3.

This process will help us connect each concept to a real-world orchestration scenario.

i. DAG (Directed Acyclic Graph)

A DAG is the blueprint of your workflow. It defines what tasks exist and in what order they should run.

Think of it as the pipeline skeleton that connects your data extraction, transformation, and loading steps:

collect_data → transform_data → upload_results

DAGs can be triggered by time (e.g., daily schedules) or events, such as when a new dataset or asset becomes available.

from airflow.decorators import dag
from datetime import datetime

@dag(
    dag_id="daily_ml_pipeline",
    schedule="@daily",
    start_date=datetime(2025, 10, 7),
    catchup=False,
)
def pipeline():
    pass

The @dag line is a decorator, a Python feature that lets you add behavior or metadata to functions in a clean, readable way. In this case, it turns the pipeline() function into a fully functional Airflow DAG.

The DAG defines when and in what order your workflow runs, but the individual tasks define how each step actually happens.

If you want to learn more about Python decorators, check out our lesson on Building a Pipeline Class to see them in action.

  • Don’t worry if the code above feels overwhelming. In the next tutorial, we’ll take a closer look at these pieces and understand how they work in Airflow. For now, we’ll keep things simple and more conceptual.

ii. Tasks: The Actions Within the Workflow

A task is the smallest unit of work in Airflow, a single, well-defined action, like fetching data, cleaning it, or training a model.

If the DAG defines the structure, tasks define the actions that bring it to life.

Using the TaskFlow API, you can turn any Python function into a task with the @task decorator:

from airflow.decorators import task

@task
def collect_data():
    print("Collecting event data...")
    return "raw_events.csv"

@task
def transform_data(file):
    print(f"Transforming {file}")
    return "clean_data.csv"

@task
def upload_to_s3(file):
    print(f"Uploading {file} to S3...")

Tasks can be linked simply by calling them in sequence:

upload_to_s3(transform_data(collect_data()))

Airflow automatically constructs the DAG relationships, ensuring that each step runs only after its dependency completes successfully.

iii. From Operators to the TaskFlow API

In earlier Airflow versions, you defined each task using explicit operators, for example a PythonOperator or BashOperator, to tell Airflow how to execute the logic.

Airflow simplifies this with the TaskFlow API, eliminating boilerplate while maintaining backward compatibility.

# Old style (Airflow 1 & 2)
from airflow.operators.python import PythonOperator

task_transform = PythonOperator(
    task_id="transform_data",
    python_callable=transform_data
)

With the TaskFlow API, you no longer need to create operators manually. Each @task function automatically becomes an operator-backed task.

# Airflow 3
@task
def transform_data():
    ...

Under the hood, Airflow still uses operators as the execution engine, but you no longer need to create them manually. The result is cleaner, more Pythonic workflows.

iv. Dynamic Task Mapping: Scaling the Transformation

Modern data workflows often need to process multiple files, users, or datasets in parallel.

Dynamic task mapping allows Airflow to create task instances at runtime based on data inputs, perfect for scaling transformations.

@task
def get_files():
    return ["file1.csv", "file2.csv", "file3.csv"]

@task
def transform_file(file):
    print(f"Transforming {file}")

transform_file.expand(file=get_files())

Airflow will automatically create and run a separate transform_file task for each file, enabling efficient, parallel execution.

v. Scheduler and Triggerer

The scheduler decides when tasks run, either on a fixed schedule or in response to updates in data assets.

The triggerer, on the other hand, handles event-based execution behind the scenes, using asynchronous I/O to efficiently wait for external signals like file arrivals or API responses.

from airflow.assets import Asset 
events_asset = Asset("s3://data/events.csv")

@dag(
    dag_id="event_driven_pipeline",
    schedule=[events_asset],  # Triggered automatically when this asset is updated
    start_date=datetime(2025, 10, 7),
    catchup=False,
)
def pipeline():
    ...

In this example, the scheduler monitors the asset and triggers the DAG when new data appears.

If the DAG included deferrable operators or sensors, the triggerer would take over and wait asynchronously, ensuring Airflow handles both time-based and event-driven workflows seamlessly.

vi. Executor and Workers

Once a task is ready to run, the executor assigns it to available workers, the machines or processes that actually execute your code.

For example, your ETL pipeline might look like this:

collect_data → transform_data → upload_results

Airflow decides where each of these tasks runs. It can execute everything on a single machine using the LocalExecutor, or scale horizontally across multiple nodes with the CeleryExecutor or KubernetesExecutor.

Deferrable tasks further improve efficiency by freeing up workers while waiting for long external operations like API responses or file uploads.
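
To make that concrete, here is a minimal sketch of a deferrable wait. It assumes Airflow 2.x-style import paths (the async time-delta sensor may live under the standard provider package in Airflow 3), so treat it as an illustration rather than a drop-in snippet.

from datetime import datetime, timedelta

from airflow.decorators import dag, task
from airflow.sensors.time_delta import TimeDeltaSensorAsync

@dag(schedule="@daily", start_date=datetime(2025, 10, 7), catchup=False)
def deferrable_example():
    # While this sensor waits, it defers to the triggerer and frees its worker slot
    wait_for_window = TimeDeltaSensorAsync(
        task_id="wait_for_window",
        delta=timedelta(minutes=30),
    )

    @task
    def process():
        print("External wait finished; processing now...")

    wait_for_window >> process()

deferrable_example()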

vii. Metadata Database and API Server: The Memory and Interface

Every action in Airflow (task success, failure, duration, or retry) is stored in the metadata database, Airflow’s internal memory.

This makes workflows reproducible, auditable, and observable.

The API server provides visibility and control:

  • View and trigger DAGs,
  • Inspect logs and task histories,
  • Track datasets and dependencies,
  • Monitor system health (scheduler, triggerer, database).

Together, they give you complete insight into orchestration, from individual task logs to system-wide performance.
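
As a quick illustration, you can also talk to the API server programmatically. The sketch below assumes an Airflow 2-style REST API served locally at /api/v1 with basic auth enabled and default airflow/airflow credentials, and it requires the requests package; adjust the base URL, API version, and authentication for your own deployment.

import requests

BASE_URL = "http://localhost:8080/api/v1"  # assumed local API server

response = requests.get(
    f"{BASE_URL}/dags",
    auth=("airflow", "airflow"),  # assumed default credentials
    timeout=10,
)
response.raise_for_status()

# Print each DAG's id and whether it is currently paused
for dag_info in response.json().get("dags", []):
    print(dag_info["dag_id"], "paused:", dag_info["is_paused"])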

Exploring the Airflow UI

Every orchestration platform needs a way to observe, manage, and interact with workflows, and in Apache Airflow, that interface is the Airflow Web UI.

The UI is served by the Airflow API Server, which exposes a rich dashboard for visualizing DAGs, checking system health, and monitoring workflow states. Even before running any tasks, it’s useful to understand the layout and purpose of this interface, since it’s where orchestration becomes visible.

Don’t worry if this section feels too conceptual; you’ll explore the Airflow UI in greater detail during the upcoming tutorial. You can also use our Setting up Apache Airflow with Docker Locally (Part I) guide if you’d like to try it right away.

The Role of the Airflow UI in Orchestration

In an orchestrated system, automation alone isn’t enough; engineers need visibility.

The UI bridges that gap. It provides an interactive window into your pipelines, showing:

  • Which workflows (DAGs) exist,
  • Their current state (active, running, or failed),
  • The status of Airflow’s internal components,
  • Historical task performance and logs.

This visibility is essential for diagnosing failures, verifying dependencies, and ensuring the orchestration system runs smoothly.

i. The Home Page Overview

The Airflow UI opens to a dashboard like the one shown below:

The Home Page Overview

At a glance, you can see:

  • Failed DAGs / Running DAGs / Active DAGs: a quick summary of the system’s operational state.
  • Health Indicators — Status checks for Airflow’s internal components:
    • MetaDatabase: Confirms the metadata database connection is healthy.
    • Scheduler: Verifies that the scheduler is running and monitoring DAGs.
    • Triggerer: Ensures event-driven workflows can be activated.
    • DAG Processor: Confirms DAG files are being parsed correctly.

These checks reflect the orchestration backbone at work, even if no DAGs have been created yet.

ii. DAG Management and Visualization

DAG Management and Visualization

In the left sidebar, the DAGs section lists all workflow definitions known to Airflow.

This doesn’t require you to run anything; it’s simply where Airflow displays every DAG it has parsed from the dags/ directory.

Each DAG entry includes:

  • The DAG name and description,
  • Schedule and next run time,
  • Last execution state,
  • Controls to enable, pause, or trigger it manually.

When workflows are defined, you’ll be able to explore their structure visually through:

DAG Management and Visualization (2)

  • Graph View — showing task dependencies
  • Grid View — showing historical run outcomes

These views make orchestration transparent: every dependency, sequence, and outcome is visible at a glance.

iii. Assets and Browse

In the sidebar, the Assets and Browse sections provide tools for exploring the internal components of your orchestration environment.

  • Assets list all registered items, such as datasets, data tables, or connections that Airflow tracks or interacts with during workflow execution. It helps you see the resources your DAGs depend on. (Remember: in Airflow 3.x, “Datasets” were renamed to “Assets.”)

    Assets and Browse

  • Browse allows you to inspect historical data within Airflow, including past DAG runs, task instances, logs, and job details. This section is useful for auditing and debugging since it reveals how workflows behaved over time.

    Assets and Browse (2)

Together, these sections let you explore both data assets and orchestration history, offering transparency into what Airflow manages and how your workflows evolve.

iv. Admin

The Admin section provides the configuration tools that control Airflow’s orchestration environment.

Admin

Here, administrators can manage the system’s internal settings and integrations:

  • Variables – store global key–value pairs that DAGs can access at runtime,
  • Pools – limit the number of concurrent tasks to manage resources efficiently,
  • Providers – list the available integration packages (e.g., AWS, GCP, or Slack providers),
  • Plugins – extend Airflow’s capabilities with custom operators, sensors, or hooks,
  • Connections – define credentials for databases, APIs, and cloud services,
  • Config – view configuration values that determine how Airflow components run,

This section essentially controls how Airflow connects, scales, and extends itself, making it central to managing orchestration behavior in both local and production setups.
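
For example, DAG code can read the Variables and Connections you define here at runtime. The sketch below uses Airflow 2-style imports (newer releases may expose these through the Task SDK), and both default_region and mysql_local are hypothetical names used only for illustration.

from airflow.decorators import task
from airflow.hooks.base import BaseHook
from airflow.models import Variable

@task
def use_admin_settings():
    # Variables: global key-value pairs defined under Admin -> Variables
    region = Variable.get("default_region", default_var="us")

    # Connections: credentials defined under Admin -> Connections
    conn = BaseHook.get_connection("mysql_local")
    print(f"Region: {region}, connecting to {conn.host}:{conn.port}")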

v. Security

The Security section governs authentication and authorization within Airflow’s web interface.

Security

It allows administrators to manage users, assign roles, and define permissions that determine who can access or modify specific parts of the system.

Within this menu:

  • Users – manage individual accounts for accessing the UI.
  • Roles – define what actions users can perform (e.g., view-only vs. admin).
  • Actions, Resources, Permissions – provide fine-grained control over what parts of Airflow a user can interact with.

Strong security settings ensure that orchestration remains safe, auditable, and compliant, particularly in shared or enterprise environments.

vi. Documentation

At the bottom of the sidebar, Airflow provides quick links under the Documentation section.

Documentation

This includes direct access to:

  • Official Documentation – the complete Airflow user and developer guide,
  • GitHub Repository – the open-source codebase for Airflow,
  • REST API Reference – detailed API endpoints for programmatic orchestration control,
  • Version Info – the currently running Airflow version,

These links make it easy for users to explore Airflow’s architecture, extend its features, or troubleshoot issues, right from within the interface.

Airflow vs Cron

Airflow vs Cron

Many data engineers start automation with cron, the classic Unix scheduler: simple, reliable, and perfect for a single recurring script.

But as soon as workflows involve multiple dependent steps, data triggers, or retry logic, cron’s simplicity turns into fragility.

Apache Airflow moves beyond time-based scheduling into workflow orchestration, managing dependencies, scaling dynamically, and responding to data-driven events, all through native Python.

i. From Scheduling to Dynamic Orchestration

Cron schedules jobs strictly by time:

# Run a data cleaning script every midnight
0 0 * * * /usr/local/bin/clean_data.sh

That works fine for one job, but it breaks down when you need to coordinate a chain like:

extract → transform → train → upload

Cron can’t ensure that step two waits for step one, or that retries occur automatically if a task fails.

In Airflow, you express this entire logic natively in Python using the TaskFlow API:

from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule="@daily", start_date=datetime(2025,10,7), catchup=False)
def etl_pipeline():
    @task
    def extract(): ...

    @task
    def transform(data): ...

    @task
    def load(data): ...

    load(transform(extract()))

Here, tasks are functions, dependencies are inferred from function calls, and Airflow handles execution, retries, and state tracking automatically.

It’s the difference between telling the system when to run and teaching it how your workflow fits together.

ii. Visibility, Reliability, and Data Awareness

Where cron runs in the background, Airflow makes orchestration observable and intelligent.

Its Web UI and API provide transparency, showing task states, logs, dependencies, and retry attempts in real time.

Failures trigger automatic retries, and missed runs can be easily backfilled to maintain data continuity.
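
Retries are opt-in configuration rather than magic; here is a minimal sketch of setting them per task (retries and retry_delay are standard operator arguments passed through the @task decorator).

from datetime import timedelta

from airflow.decorators import task

@task(retries=3, retry_delay=timedelta(minutes=5))
def fetch_api_data():
    # If this raises an exception, Airflow reschedules it up to three more
    # times, waiting five minutes between attempts.
    ...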

Airflow also introduces data-aware scheduling: workflows can now run automatically when a dataset or asset updates, not just on a clock.

from airflow.assets import Asset  
sales_data = Asset("s3://data/sales.csv")

@dag(schedule=[sales_data], start_date=datetime(2025,10,7))
def refresh_dashboard():
    ...

This makes orchestration responsive: pipelines react to new data as it arrives, keeping dashboards and downstream models always fresh.

iii. Why This Matters

Cron is a timer.

Airflow is an orchestrator, coordinating complex, event-driven, and scalable data systems.

It brings structure, visibility, and resilience to automation, ensuring that each task runs in the right order, with the right data, and for the right reason.

That’s the leap from scheduling to orchestration, and why Airflow is much more than cron with an interface.

Common Airflow Use Cases

Workflow orchestration underpins nearly every data-driven system, from nightly ETL jobs to continuous model retraining.

Because Airflow couples time-based scheduling with dataset awareness and dynamic task mapping, it adapts easily to many workloads.

Below are the most common production-grade scenarios, all achievable through the TaskFlow API and Airflow’s modular architecture.

i. ETL / ELT Pipelines

ETL (Extract, Transform, Load) remains Airflow’s core use case.

Airflow lets you express a complete ETL pipeline declaratively, with each step defined as a Python @task.

from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule="@daily", start_date=datetime(2025,10,7), catchup=False)
def daily_sales_etl():

    @task
    def extract_sales():
        print("Pulling daily sales from API…")
        return ["sales_us.csv", "sales_uk.csv"]

    @task
    def transform_file(file):
        print(f"Cleaning and aggregating {file}")
        return f"clean_{file}"

    @task
    def load_to_warehouse(files):
        print(f"Loading {len(files)} cleaned files to BigQuery")

    # Dynamic Task Mapping: one transform per file
    cleaned = transform_file.expand(file=extract_sales())
    load_to_warehouse(cleaned)

daily_sales_etl()

Because each transformation task is created dynamically at runtime, the pipeline scales automatically as data sources grow.

When paired with datasets or assets, ETL DAGs can trigger immediately when new data arrives, ensuring freshness without manual scheduling.

ii. Machine Learning Pipelines

Airflow is ideal for orchestrating end-to-end ML lifecycles: data prep, training, evaluation, and deployment.

@dag(schedule="@weekly", start_date=datetime(2025,10,7))
def ml_training_pipeline():

    @task
    def prepare_data():
        return ["us_dataset.csv", "eu_dataset.csv"]

    @task
    def train_model(dataset):
        print(f"Training model on {dataset}")
        return f"model_{dataset}.pkl"

    @task
    def evaluate_models(models):
        print(f"Evaluating {len(models)} models and pushing metrics")

    # Fan-out training jobs
    models = train_model.expand(dataset=prepare_data())
    evaluate_models(models)

ml_training_pipeline()

Dynamic Task Mapping enables fan-out parallel training across datasets, regions, or hyper-parameters, a common pattern in large-scale ML systems.

Airflow’s deferrable sensors can pause training until external data or signals are ready, conserving compute resources.

iii. Analytics and Reporting

Analytics teams rely on Airflow to refresh dashboards and reports automatically.

Airflow can combine time-based and dataset-triggered scheduling so that dashboards always use the latest processed data.

from airflow import Dataset

summary_dataset = Dataset("s3://data/summary_table.csv")

@dag(schedule=[summary_dataset], start_date=datetime(2025,10,7))
def analytics_refresh():

    @task
    def update_powerbi():
        print("Refreshing Power BI dashboard…")

    @task
    def send_report():
        print("Emailing daily analytics summary")

    update_powerbi() >> send_report()

Whenever the summary dataset updates, this DAG runs immediately; no need to wait for a timed window.

That ensures dashboards remain accurate and auditable.

iv. Data Quality and Validation

Trusting your data is as important as moving it.

Airflow lets you automate quality checks and validations before promoting data downstream.

  • Run dbt tests or Great Expectations validations as tasks.
  • Use deferrable sensors to wait for external confirmations (e.g., API signals or file availability) without blocking workers.
  • Fail fast or trigger alerts when anomalies appear.

@task
def validate_row_counts():
    print("Comparing source and target row counts…")

@task
def check_schema():
    print("Ensuring schema consistency…")

validate_row_counts() >> check_schema()

These validations can be embedded directly into the main ETL DAG, creating self-monitoring pipelines that prevent bad data from spreading.
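
As a rough sketch of what that embedding can look like, the DAG below chains a validation step between extraction and loading so bad batches never reach the warehouse. Task names are illustrative and reuse the examples above.

from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2025, 10, 7), catchup=False)
def etl_with_checks():

    @task
    def extract():
        return ["sales_us.csv", "sales_uk.csv"]

    @task
    def validate_row_counts(files):
        print(f"Comparing source and target row counts for {len(files)} files…")
        return files

    @task
    def load_to_warehouse(files):
        print(f"Loading {len(files)} validated files")

    load_to_warehouse(validate_row_counts(extract()))

etl_with_checks()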

v. Infrastructure Automation and DevOps

Beyond data, Airflow orchestrates operational workflows such as backups, migrations, or cluster scaling.

With the Task SDK and provider integrations, you can automate infrastructure the same way you orchestrate data:

@dag(schedule="@daily", start_date=datetime(2025,10,7))
def infra_maintenance():

    @task
    def backup_database():
        print("Triggering RDS snapshot…")

    @task
    def cleanup_old_files():
        print("Deleting expired objects from S3…")

    backup_database() >> cleanup_old_files()

Airflow turns these system processes into auditable, repeatable, and observable jobs, blending DevOps automation with data-engineering orchestration.

With Airflow, orchestration goes beyond timing: it becomes data-aware, event-driven, and scalable, empowering teams to automate everything from raw data ingestion to production-ready analytics.

Summary and Up Next

In this tutorial, you explored the foundations of workflow orchestration and how Apache Airflow modernizes data automation through a modular, Pythonic, and data-aware architecture. You learned how Airflow structures workflows using DAGs and the TaskFlow API, scales effortlessly through Dynamic Task Mapping, and responds intelligently to data and events using deferrable tasks and the triggerer.

You also saw how its scheduler, executor, and web UI work together to ensure observability, resilience, and scalability far beyond what traditional schedulers like cron can offer.

In the next tutorial, you’ll bring these concepts to life by installing and running Airflow with Docker, setting up a complete environment where all core services (the API server, scheduler, metadata database, triggerer, and workers) operate in harmony.

From there, you’ll create and monitor your first DAG using the TaskFlow API, define dependencies and schedules, and securely manage connections and secrets.

Further Reading

Explore the official Airflow documentation to deepen your understanding of new features and APIs, and prepare your Docker environment for the next tutorial.

Then, apply what you’ve learned to start orchestrating real-world data workflows efficiently, reliably, and at scale.

Hands-On NoSQL with MongoDB: From Theory to Practice

26 September 2025 at 23:33

MongoDB is the most popular NoSQL database, but if you're coming from a SQL background, it can feel like learning a completely different language. Today, we're going hands-on to see exactly how document databases solve real data engineering problems.

Here's a scenario we’ll use to see MongoDB in action: You're a data engineer at a growing e-commerce company. Your customer review system started simple: star ratings and text reviews in a SQL database. But success has brought complexity. Marketing wants verified purchase badges. The mobile team is adding photo uploads. Product management is launching video reviews. Each change requires schema migrations that take hours with millions of existing reviews.

Sound familiar? This is the schema evolution problem that drives data engineers to NoSQL. Today, you'll see exactly how MongoDB solves it. We'll build this review system from scratch, handle those evolving requirements without a single migration, and connect everything to a real analytics pipeline.

Ready to see why MongoDB powers companies from startups to Forbes? Let's get started.

Setting Up MongoDB Without the Complexity

We're going to use MongoDB Atlas, MongoDB's managed cloud service. Atlas mirrors how you'll actually deploy MongoDB in most professional environments, is quick to set up, and gets us straight to learning MongoDB concepts. Alternatively, you could install MongoDB locally if you prefer.

1. Create your account

Go to MongoDB's Atlas page and create a free account. You won’t need to provide any credit card information — the free tier gives you 512MB of storage, which is more than enough for learning and even small production workloads. Once you're signed up, you'll create your first cluster.

Create your account

Click "Build a Database" and select the free shared cluster option. Select any cloud provider and choose a region near you. The defaults are fine because we're learning concepts, not optimizing performance. Name your cluster something simple, like "learning-cluster," and click Create.

2. Set up the database user and network access

While MongoDB sets up your distributed database cluster (yes, even the free tier is distributed across multiple servers), you need to configure access. MongoDB requires two things: a database user and network access rules.

For the database user, click "Database Access" in the left menu and add a new user. Choose password authentication and create credentials you'll remember. For permissions, select "Read and write to any database." Note that in production you'd be more restrictive, but we're learning.

Set up the database user and network access (1)

For network access, MongoDB may have already configured this during signup through their quickstart flow. Check "Network Access" in the left menu to see your current settings. If nothing is configured yet, click "Add IP Address" and select "Allow Access from Anywhere" for now (in production, you'd restrict this to specific IP addresses for security).

Set up the database user and network access (2)

Your cluster should be ready in about three minutes. When it's done, click the "Connect" button on your cluster. You'll see several connection options.

Set up the database user and network access (3)

3. Connect to MongoDB Compass

Choose "MongoDB Compass." This is MongoDB’s GUI tool that makes exploring data visual and intuitive.

Connect to MongoDB Compass (1)

Download Compass if you don't have it, then copy your connection string. It looks like this:

mongodb+srv://myuser:<password>@learning-cluster.abc12.mongodb.net/

Replace <password> with your actual password and connect through Compass. When it connects successfully, you'll see your cluster with a few pre-populated databases like admin, local, and maybe sample_mflix (MongoDB's movie database for demos). These are system databases and sample data (we'll create our own database next).

Connect to MongoDB Compass (2)

You've just set up a distributed database system that can scale to millions of documents. The same setup process works whether you're learning or launching a startup.

Understanding Documents Through Real Data

Now let's build our review system. In MongoDB Compass, you'll see a green "Create Database" button. Click it and create a database called ecommerce_analytics with a collection called customer_reviews.

Understanding documents through real data (1)

Understanding documents through real data (2)

A quick note on terminology: In MongoDB, a database contains collections, and collections contain documents. If you're coming from SQL, think of collections like tables and documents like rows, except documents are much more flexible.

Click into your new collection. You could add data through the GUI by clicking "Add Data" → "Insert Document", but let's use the built-in shell instead to get comfortable with MongoDB's query language. At the top right of Compass, look for the shell icon (">_") and click "Open MongoDB shell".

First, make sure we're using the right database:

use ecommerce_analytics

Now let's insert our first customer review using insertOne:

db.customer_reviews.insertOne({
  customer_id: "cust_12345",
  product_id: "wireless_headphones_pro",
  rating: 4,
  review_text: "Great sound quality, battery lasts all day. Wish they were a bit more comfortable for long sessions.",
  review_date: new Date("2024-10-15"),
  helpful_votes: 23,
  verified_purchase: true,
  purchase_date: new Date("2024-10-01")
})

MongoDB responds with confirmation that it worked:

{
  acknowledged: true,
  insertedId: ObjectId('68d31786d59c69a691408ede')
}

This is a complete review stored as a single document. In a traditional SQL database, this information might be spread across multiple tables: a reviews table, a votes table, maybe a purchases table for verification. Here, all the related data lives together in one document.

Now here's a scenario that usually breaks SQL schemas: the mobile team ships their photo feature, and instead of planning a migration, they just start storing photos:

db.customer_reviews.insertOne({
  customer_id: "cust_67890",
  product_id: "wireless_headphones_pro",
  rating: 5,
  review_text: "Perfect headphones! See the photo for size comparison.",
  review_date: new Date("2024-10-20"),
  helpful_votes: 45,
  verified_purchase: true,
  purchase_date: new Date("2024-10-10"),
  photo_url: "https://cdn.example.com/reviews/img_2024_10_20_abc123.jpg",
  device_type: "mobile_ios"
})

See the difference? We added photo_url and device_type fields, and MongoDB didn't complain about missing columns or require a migration. Each document just stores what makes sense for it. Of course, this flexibility comes with a trade-off: your application code needs to handle documents that might have different fields. When you're processing reviews, you'll need to check if a photo exists before trying to display it.

Let's add a few more reviews to build a realistic dataset (notice we’re using insertMany here):

db.customer_reviews.insertMany([
  {
    customer_id: "cust_11111",
    product_id: "laptop_stand_adjustable",
    rating: 3,
    review_text: "Does the job but feels flimsy",
    review_date: new Date("2024-10-18"),
    helpful_votes: 5,
    verified_purchase: false
  },
  {
    customer_id: "cust_22222",
    product_id: "wireless_headphones_pro",
    rating: 5,
    review_text: "Excelente producto! La calidad de sonido es increíble.",
    review_date: new Date("2024-10-22"),
    helpful_votes: 12,
    verified_purchase: true,
    purchase_date: new Date("2024-10-15"),
    video_url: "https://cdn.example.com/reviews/vid_2024_10_22_xyz789.mp4",
    video_duration_seconds: 45,
    language: "es"
  },
  {
    customer_id: "cust_33333",
    product_id: "laptop_stand_adjustable",
    rating: 5,
    review_text: "Much sturdier than expected. Height adjustment is smooth.",
    review_date: new Date("2024-10-23"),
    helpful_votes: 8,
    verified_purchase: true,
    sentiment_score: 0.92,
    sentiment_label: "very_positive"
  }
])

Take a moment to look at what we just created. Each document tells its own story: one has video metadata, another has sentiment scores, one is in Spanish. In a SQL world, you'd be juggling nullable columns or multiple tables. Here, each review just contains whatever data makes sense for it.

Querying Documents

Now that we have data, let's retrieve it. MongoDB's query language uses JSON-like syntax that feels natural once you understand the pattern.

Find matches

Finding documents by exact matches is straightforward using the find method with field names as keys:

// Find all 5-star reviews
db.customer_reviews.find({ rating: 5 })

// Find reviews for a specific product
db.customer_reviews.find({ product_id: "wireless_headphones_pro" })

You can use operators for more complex queries. MongoDB has operators like $gte (greater than or equal), $lt (less than), $ne (not equal), and many others:

// Find highly-rated reviews (4 stars or higher)
db.customer_reviews.find({ rating: { $gte: 4 } })

// Find recent verified purchase reviews
db.customer_reviews.find({
  verified_purchase: true,
  review_date: { $gte: new Date("2024-10-15") }
})

Here's something that would be painful in SQL: you can query for fields that might not exist in all documents:

// Find all reviews with videos
db.customer_reviews.find({ video_url: { $exists: true } })

// Find reviews with sentiment analysis
db.customer_reviews.find({ sentiment_score: { $exists: true } })

These queries don't fail when they encounter documents without these fields. Instead, they simply return the documents that match.

A quick note on performance

As your collection grows beyond a few thousand documents, you'll want to create indexes on fields you query frequently. Think of indexes like the index in a book — instead of flipping through every page to find "MongoDB," you can jump straight to the right section.

Let's create an index on product_id since we've been querying it:

db.customer_reviews.createIndex({ product_id: 1 })

The 1 means ascending order (you can use -1 for descending). MongoDB will now keep a sorted reference to all product_id values, making our product queries lightning fast even with millions of reviews. You don't need to change your queries at all; MongoDB automatically uses the index when it helps.

Update existing documents

Updating documents using updateOne is equally flexible. Let's say the customer service team starts adding sentiment scores to reviews:

db.customer_reviews.updateOne(
  { customer_id: "cust_12345" },
  {
    $set: {
      sentiment_score: 0.72,
      sentiment_label: "positive"
    }
  }
)

We used the $set operator, which tells MongoDB which fields to add or modify. In the output MongoDB tells us exactly what happened:

{
    acknowledged: true,
    insertedId: null,
    matchedCount: 1,
    modifiedCount: 1,
    upsertedCount: 0
}

We just added new fields to one document. The others? Completely untouched, with no migration required.

When someone finds a review helpful, we can increment the vote count using $inc:

db.customer_reviews.updateOne(
  { customer_id: "cust_67890" },
  { $inc: { helpful_votes: 1 } }
)

This operation is atomic, meaning it's safe even with multiple users voting simultaneously.

Analytics Without Leaving MongoDB

MongoDB's aggregate method lets you run analytics directly on your operational data using what's called an aggregation pipeline, which is a series of data transformations.

Average rating and review count

Let's answer a real business question: What's the average rating and review count for each product?

db.customer_reviews.aggregate([
  {
    $group: {
      _id: "$product_id",
      avg_rating: { $avg: "$rating" },
      review_count: { $sum: 1 },
      total_helpful_votes: { $sum: "$helpful_votes" }
    }
  },
  {
    $sort: { avg_rating: -1 }
  }
])
{
  _id: 'wireless_headphones_pro',
  avg_rating: 4.666666666666667,
  review_count: 3,
  total_helpful_votes: 81
}
{
  _id: 'laptop_stand_adjustable',
  avg_rating: 4,
  review_count: 2,
  total_helpful_votes: 13
}

Here's how the pipeline works: first, we group ($group) by product_id and calculate metrics for each group using operators like $avg and $sum. Then we sort ($sort) by average rating, using -1 to sort in descending order. The result gives us exactly what product managers need to understand product performance.

Trends over time

Let's try something more complex by analyzing review trends over time:

db.customer_reviews.aggregate([
  {
    $group: {
      _id: {
        month: { $month: "$review_date" },
        year: { $year: "$review_date" }
      },
      review_count: { $sum: 1 },
      avg_rating: { $avg: "$rating" },
      verified_percentage: {
        $avg: { $cond: ["$verified_purchase", 1, 0] }
      }
    }
  },
  {
    $sort: { "_id.year": 1, "_id.month": 1 }
  }
])
{
  _id: {
    month: 10,
    year: 2024
  },
  review_count: 5,
  avg_rating: 4.4,
  verified_percentage: 0.8
}

This query groups reviews by month using MongoDB's date operators like $month and $year, calculates the average rating, and computes what percentage were verified purchases. We used $cond to convert true/false values to 1/0, then averaged them to get the verification percentage. Marketing can use this to track review quality over time.

These queries answer real business questions directly on your operational data. Now let's see how to integrate this with Python for complete data pipelines.

Connecting MongoDB to Your Data Pipeline

Real data engineering connects systems. MongoDB rarely works in isolation because it's part of a larger data ecosystem. Let's connect it to Python, where you can integrate it with the rest of your pipeline.

Exporting data from MongoDB

You can export data from Compass in a few ways: export entire collections from the Documents tab, or build aggregation pipelines in the Aggregation tab and export those results. Choose JSON or CSV depending on your downstream needs.

For more flexibility with specific queries, let's use Python. First, install PyMongo, the official MongoDB driver:

pip install pymongo pandas

Here's a practical example that extracts data from MongoDB for analysis:

from pymongo import MongoClient
import pandas as pd

# Connect to MongoDB Atlas
# In production, store this as an environment variable for security
connection_string = "mongodb+srv://myuser:<password>@learning-cluster.abc12.mongodb.net/"  # replace <password> with your own
client = MongoClient(connection_string)
db = client.ecommerce_analytics

# Query high-rated reviews
high_rated_reviews = list(
    db.customer_reviews.find({
        "rating": {"$gte": 4}
    })
)

# Convert to DataFrame for analysis
df = pd.DataFrame(high_rated_reviews)

# Clean up MongoDB's internal _id field
if '_id' in df.columns:
    df = df.drop('_id', axis=1)

# Handle optional fields gracefully (remember our schema flexibility?)
df['has_photo'] = df['photo_url'].notna()
df['has_video'] = df['video_url'].notna()

# Analyze product performance
product_metrics = df.groupby('product_id').agg({
    'rating': 'mean',
    'helpful_votes': 'sum',
    'customer_id': 'count'
}).rename(columns={'customer_id': 'review_count'})

print("Product Performance (Last 30 Days):")
print(product_metrics)

# Export for downstream processing
df.to_csv('recent_positive_reviews.csv', index=False)
print(f"\nExported {len(df)} reviews for downstream processing")

This is a common pattern in data engineering: MongoDB stores operational data, Python extracts and transforms it, and the results feed into SQL databases, data warehouses, or BI tools.
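
To sketch the last leg of that pattern, here is one way the exported reviews could be loaded into a SQL database with pandas and SQLAlchemy. The warehouse connection string and table name are hypothetical, and you'd need SQLAlchemy plus a PostgreSQL driver (such as psycopg2) installed.

import pandas as pd
from sqlalchemy import create_engine

# Reuse the CSV exported by the script above
df = pd.read_csv("recent_positive_reviews.csv")

# Hypothetical local warehouse; swap in your real connection string
engine = create_engine("postgresql+psycopg2://analytics:analytics@localhost:5432/warehouse")

# Append the cleaned reviews to an analytics table
df.to_sql("customer_reviews", engine, if_exists="append", index=False)
print(f"Loaded {len(df)} reviews into the warehouse")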

Where MongoDB fits in larger data architectures

This pattern, using different databases for different purposes, is called polyglot persistence. Here's how it typically works in production:

  • MongoDB handles operational workloads: Flexible schemas, high write volumes, real-time applications
  • SQL databases handle analytical workloads: Complex queries, reporting, business intelligence
  • Python bridges the gap: Extracting, transforming, and loading data between systems

You might use MongoDB to capture raw user events in real-time, then periodically extract and transform that data into a PostgreSQL data warehouse where business analysts can run complex reports. Each database does what it does best.

The key is understanding that modern data pipelines aren't about choosing MongoDB OR SQL… they're about using both strategically. MongoDB excels at evolving schemas and horizontal scaling. SQL databases excel at complex analytics and mature tooling. Real data engineering combines them thoughtfully.

Review and Next Steps

You've covered significant ground today. You can now set up MongoDB, handle schema changes without migrations, write queries and aggregation pipelines, and connect everything to Python for broader data workflows.

This isn't just theoretical knowledge. You've worked through the same challenges that come up in real projects: evolving data structures, flexible document storage, and integrating NoSQL with analytical tools.

Your next steps depend on what you're trying to build:

If you want deeper MongoDB knowledge:

  • Learn about indexing strategies for query optimization
  • Explore change streams for real-time data processing
  • Try MongoDB's time-series collections for IoT data
  • Understand sharding for horizontal scaling
  • Practice thoughtful document design (flexibility doesn't mean "dump everything in one document")
  • Learn MongoDB's consistency trade-offs (it's not just "SQL but schemaless")

If you want to explore the broader NoSQL ecosystem:

  • Try Redis for caching. It's simpler than MongoDB and solves different problems
  • Experiment with Elasticsearch for full-text search across your reviews
  • Look at Cassandra for true time-series data at massive scale
  • Consider Neo4j if you need to analyze relationships between customers

If you want to build production systems:

  • Create a complete ETL pipeline: MongoDB → Airflow → PostgreSQL
  • Set up monitoring with MongoDB Atlas metrics
  • Implement proper error handling and retry logic
  • Learn about consistency levels and their trade-offs

The concepts you've learned apply beyond MongoDB. Document flexibility appears in DynamoDB and CouchDB. Aggregation pipelines exist in Elasticsearch. Using different databases for different parts of your pipeline is standard practice in modern systems.

You now understand when to choose NoSQL versus SQL, matching tools to problems. MongoDB handles flexible schemas and horizontal scaling well, whereas SQL databases excel at complex queries and transactions. Most real systems use both.

The next time you encounter rapidly changing requirements or need to scale beyond a single server, you'll recognize these as problems that NoSQL databases were designed to solve.

Project Tutorial: Build a Web Interface for Your Chatbot with Streamlit (Step-by-Step)

25 September 2025 at 00:02

You've built a chatbot in Python, but it only runs in your terminal. What if you could give it a sleek web interface that anyone can use? What if you could deploy it online for friends, potential employers, or clients to interact with?

In this hands-on tutorial, we'll transform a command-line chatbot into a professional web application using Streamlit. You'll learn to create an interactive interface with customizable personalities, real-time settings controls, and deploy it live on the internet—all without writing a single line of HTML, CSS, or JavaScript.

By the end of this tutorial, you'll have a deployed web app that showcases your AI development skills and demonstrates your ability to build user-facing applications.

Why Build a Web Interface for Your Chatbot?

A command-line chatbot is impressive to developers, but a web interface speaks to everyone. Portfolio reviewers, potential clients, and non-technical users can immediately see and interact with your work. More importantly, building web interfaces for AI applications is a sought-after skill as businesses increasingly want to deploy AI tools that their teams can actually use.

Streamlit makes this transition seamless. Instead of learning complex web frameworks, you'll use Python syntax you already know to create professional-looking applications in minutes, not days.

What You'll Build

  • Interactive web chatbot with real-time personality switching
  • Customizable controls for AI parameters (temperature, token limits)
  • Professional chat interface with user/assistant message distinction
  • Reset functionality and conversation management
  • Live deployment accessible from any web browser
  • Foundation for more advanced AI applications

Before You Start: Pre-Instruction

To make the most of this project walkthrough, follow these preparatory steps:

1. Review the Project

Explore the goals and structure of this project: Start the project here

2. Complete Your Chatbot Foundation

Essential Prerequisite: If you haven't already, complete the previous chatbot project to build your core logic. You'll need a working Python chatbot with conversation memory and token management before starting this tutorial.

3. Set Up Your Development Environment

Required Tools:

  • Python IDE (VS Code or PyCharm recommended)
  • OpenAI API key (or Together AI for a free alternative)
  • GitHub account for deployment

We'll be working with standard Python files (.py format) instead of Jupyter notebooks, so make sure you're comfortable coding in your chosen IDE.

4. Install and Test Streamlit

Install the required packages:

pip install streamlit openai tiktoken

Test your installation with a simple demo:

import streamlit as st
st.write("Hello Streamlit!")

Save this as test.py and run the following in the command line:

streamlit run test.py

If a browser window opens with the message "Hello Streamlit!", you're ready to proceed.

5. Verify Your API Access

Test your OpenAI API key works:

import os
from openai import OpenAI

api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)

# Simple test call
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello!"}],
    max_tokens=10
)

print(response.choices[0].message.content)

6. Access the Complete Solution

View and download the solution files: Solution Repository

What you'll find:

  • starter_code.py - The initial chatbot code we'll start with
  • final.py - Complete Streamlit application
  • requirements.txt - All necessary dependencies
  • Deployment configuration files

Starting Point: Your Chatbot Foundation

If you don't have a chatbot already, create a file called starter_code.py with this foundation:

import os
from openai import OpenAI
import tiktoken

# Configuration
api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)
MODEL = "gpt-4o-mini"
TEMPERATURE = 0.7
MAX_TOKENS = 100
TOKEN_BUDGET = 1000
SYSTEM_PROMPT = "You are a fed up and sassy assistant who hates answering questions."

messages = [{"role": "system", "content": SYSTEM_PROMPT}]

# Token management functions (collapsed for clarity)
def get_encoding(model):
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        print(f"Warning: Tokenizer for model '{model}' not found. Falling back to 'cl100k_base'.")
        return tiktoken.get_encoding("cl100k_base")

ENCODING = get_encoding(MODEL)

def count_tokens(text):
    return len(ENCODING.encode(text))

def total_tokens_used(messages):
    try:
        return sum(count_tokens(msg["content"]) for msg in messages)
    except Exception as e:
        print(f"[token count error]: {e}")
        return 0

def enforce_token_budget(messages, budget=TOKEN_BUDGET):
    try:
        while total_tokens_used(messages) > budget:
            if len(messages) <= 2:
                break
            messages.pop(1)
    except Exception as e:
        print(f"[token budget error]: {e}")

# Core chat function
def chat(user_input):
    messages.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS
    )

    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})

    enforce_token_budget(messages)
    return reply

This gives us a working chatbot with conversation memory and cost controls. Now let's transform it into a web app.

Part 1: Your First Streamlit Interface

Create a new file called app.py and copy your starter code into it. Now we'll add the web interface layer.

Add the Streamlit import at the top:

import streamlit as st

At the bottom of your file, add your first Streamlit elements:

### Streamlit Interface ###
st.title("Sassy Chatbot")

Test your app by running this in your terminal:

streamlit run app.py

Your default browser should open showing your web app with the title "Sassy Chatbot." Notice the auto-reload feature; when you save changes, Streamlit prompts you to rerun the app.

Learning Insight: Streamlit uses "magic" rendering. You don't need to explicitly display elements. Simply calling st.title() automatically renders the title in your web interface.

Part 2: Building the Control Panel

Real applications need user controls. Let's add a sidebar with personality options and parameter controls.

Building the Control Panel

Add this after your title:

# Sidebar controls
st.sidebar.header("Options")
st.sidebar.write("This is a demo of a sassy chatbot using OpenAI's API.")

# Temperature and token controls
max_tokens = st.sidebar.slider("Max Tokens", 1, 250, 100)
temperature = st.sidebar.slider("Temperature", 0.0, 1.0, 0.7)

# Personality selection
system_message_type = st.sidebar.selectbox("System Message",
    ("Sassy Assistant", "Angry Assistant", "Custom"))

Save and watch your sidebar populate with interactive controls. These sliders automatically store their values in the respective variables when users interact with them.

Adding Dynamic Personality System

Now let's make the personality selection actually work:

# Dynamic system prompt based on selection
if system_message_type == "Sassy Assistant":
    SYSTEM_PROMPT = "You are a sassy assistant that is fed up with answering questions."
elif system_message_type == "Angry Assistant":
    SYSTEM_PROMPT = "You are an angry assistant that likes yelling in all caps."
elif system_message_type == "Custom":
    SYSTEM_PROMPT = st.sidebar.text_area("Custom System Message",
        "Enter your custom system message here.")
else:
    SYSTEM_PROMPT = "You are a helpful assistant."

The custom option creates a text area where users can write their own personality instructions. Try switching between personalities and notice how the interface adapts.

Part 3: Understanding Session State

Here's where Streamlit gets tricky. Every time a user interacts with your app, Streamlit reruns the entire script from top to bottom. This would normally reset your chat history every time, which is not what we want for a conversation!

Session state solves this by persisting data across app reruns:

# Initialize session state for conversation memory
if "messages" not in st.session_state:
    st.session_state.messages = [{"role": "system", "content": SYSTEM_PROMPT}]

This creates a persistent messages list that survives app reruns. Now we need to modify our chat function to use session state:

def chat(user_input, temperature=TEMPERATURE, max_tokens=MAX_TOKENS):
    # Get messages from session state
    messages = st.session_state.messages
    messages.append({"role": "user", "content": user_input})

    enforce_token_budget(messages)

    # Add loading spinner for better UX
    with st.spinner("Thinking..."):
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens
        )

    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

Learning Insight: Session state is like a dictionary that persists between app reruns. Think of it as your app's memory system.

Part 4: Interactive Buttons and Controls

Interactive Buttons and Controls

Let's add buttons to make the interface more user-friendly:

# Control buttons
if st.sidebar.button("Apply New System Message"):
    st.session_state.messages[0] = {"role": "system", "content": SYSTEM_PROMPT}
    st.success("System message updated.")

if st.sidebar.button("Reset Conversation"):
    st.session_state.messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    st.success("Conversation reset.")

These buttons provide immediate feedback with success messages, creating a more polished user experience.

Part 5: The Chat Interface

The Chat Interface

Now for the main event—the actual chat interface. Add this code:

# Chat input using walrus operator
if prompt := st.chat_input("What is up?"):
    reply = chat(prompt, temperature=temperature, max_tokens=max_tokens)

# Display chat history
for message in st.session_state.messages[1:]:  # Skip system message
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

The chat_input widget creates a text box at the bottom of your app. The walrus operator (:=) assigns the user input to prompt and checks if it exists in one line.

Visual Enhancement: Streamlit automatically adds user and assistant icons to chat messages when you use the proper role names ("user" and "assistant").

Part 6: Testing Your Complete App

Save your file and test the complete interface:

  1. Personality Test: Switch between Sassy and Angry assistants, apply the new system message, then chat to see the difference
  2. Memory Test: Have a conversation, then reference something you said earlier
  3. Parameter Test: Drag the max tokens slider to 1 and see how responses get cut off
  4. Reset Test: Use the reset button to clear conversation history

Your complete working app should look something like this:

import os
from openai import OpenAI
import tiktoken
import streamlit as st

# API and model configuration
api_key = st.secrets.get("OPENAI_API_KEY") or os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)
MODEL = "gpt-4o-mini"
TEMPERATURE = 0.7
MAX_TOKENS = 100
TOKEN_BUDGET = 1000
SYSTEM_PROMPT = "You are a fed up and sassy assistant who hates answering questions."

# [Token management functions here - same as starter code]

def chat(user_input, temperature=TEMPERATURE, max_tokens=MAX_TOKENS):
    messages = st.session_state.messages
    messages.append({"role": "user", "content": user_input})
    enforce_token_budget(messages)

    with st.spinner("Thinking..."):
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens
        )

    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

### Streamlit Interface ###
st.title("Sassy Chatbot")
st.sidebar.header("Options")
st.sidebar.write("This is a demo of a sassy chatbot using OpenAI's API.")

max_tokens = st.sidebar.slider("Max Tokens", 1, 250, 100)
temperature = st.sidebar.slider("Temperature", 0.0, 1.0, 0.7)
system_message_type = st.sidebar.selectbox("System Message",
    ("Sassy Assistant", "Angry Assistant", "Custom"))

if system_message_type == "Sassy Assistant":
    SYSTEM_PROMPT = "You are a sassy assistant that is fed up with answering questions."
elif system_message_type == "Angry Assistant":
    SYSTEM_PROMPT = "You are an angry assistant that likes yelling in all caps."
elif system_message_type == "Custom":
    SYSTEM_PROMPT = st.sidebar.text_area("Custom System Message",
        "Enter your custom system message here.")

if "messages" not in st.session_state:
    st.session_state.messages = [{"role": "system", "content": SYSTEM_PROMPT}]

if st.sidebar.button("Apply New System Message"):
    st.session_state.messages[0] = {"role": "system", "content": SYSTEM_PROMPT}
    st.success("System message updated.")

if st.sidebar.button("Reset Conversation"):
    st.session_state.messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    st.success("Conversation reset.")

if prompt := st.chat_input("What is up?"):
    reply = chat(prompt, temperature=temperature, max_tokens=max_tokens)

for message in st.session_state.messages[1:]:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

Part 7: Deploying to the Internet

Running locally is great for development, but deployment makes your project shareable and accessible to others. Streamlit Community Cloud offers free hosting directly from your GitHub repository.

Prepare for Deployment

First, create the required files in your project folder:

requirements.txt:

openai
streamlit
tiktoken

.gitignore:

.streamlit/

Note that if you’ve stored your API key in a .env file you should add this to .gitignore as well.

Secrets Management: Create a .streamlit/secrets.toml file locally:

OPENAI_API_KEY = "your-api-key-here"

Important: Add .streamlit/ to your .gitignore so you don't accidentally commit your API key to GitHub.

GitHub Setup

  1. Create a new GitHub repository
  2. Push your code (the .gitignore will protect your secrets)
  3. Your repository should contain: app.py, requirements.txt, and .gitignore

Deploy to Streamlit Cloud

  1. Go to share.streamlit.io

  2. Connect your GitHub account

  3. Select your repository and main branch

  4. Choose your app file (app.py)

  5. In Advanced settings, add your API key as a secret:

    OPENAI_API_KEY = "your-api-key-here"
  6. Click "Deploy"

Within minutes, your app will be live at a public URL that you can share with anyone!

Security Note: The secrets you add in Streamlit Cloud are encrypted and secure. Never put API keys directly in your code files.

Understanding Key Concepts

Session State Deep Dive

Session state is Streamlit's memory system. Without it, every user interaction would reset your app completely. Think of it as a persistent dictionary that survives app reruns:

# Initialize once
if "my_data" not in st.session_state:
    st.session_state.my_data = []

# Use throughout your app
st.session_state.my_data.append("new item")

The Streamlit Execution Model

Streamlit reruns your entire script on every interaction. This "reactive" model means:

  • Your app always shows the current state
  • You need session state for persistence
  • Expensive operations should be cached or minimized (see the caching sketch below)
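
For the caching point, Streamlit's @st.cache_data decorator is the usual tool; here's a minimal sketch, where reference.csv is a placeholder for whatever expensive load your app performs.

import pandas as pd
import streamlit as st

# The decorated function runs once per unique argument; later reruns triggered
# by widget interactions reuse the cached result instead of reloading the file.
@st.cache_data
def load_reference_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

df = load_reference_data("reference.csv")  # placeholder path
st.write(df.head())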

Widget State Management

Widgets (sliders, inputs, buttons) automatically manage their state, as the short sketch after this list shows:

  • Slider values are always current
  • Button presses trigger reruns
  • Form inputs update in real-time
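
Here's a small, self-contained example of that behavior: giving a widget a key mirrors its current value into st.session_state, so any part of the script can read it after each rerun (the label and key names are arbitrary).

import streamlit as st

# The key argument mirrors the slider's value into st.session_state
st.slider("Example value", 0, 100, 25, key="example_value")

if st.button("Show current value"):
    # Read the live widget value from session state after the rerun
    st.write(f"Slider currently holds: {st.session_state.example_value}")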

Troubleshooting Common Issues

  • "No module named 'streamlit'": Install Streamlit with pip install streamlit
  • API key errors: Verify your environment variables or Streamlit secrets are set correctly
  • App won't reload: Check for Python syntax errors in your terminal output
  • Session state not working: Ensure you're checking if "key" not in st.session_state: before initializing
  • Deployment fails: Verify your requirements.txt includes all necessary packages

Extending Your Chatbot App

Immediate Enhancements

  • File Upload: Let users upload documents for the chatbot to reference
  • Export Conversations: Add a download button for chat history (see the sketch after this list)
  • Usage Analytics: Track token usage and costs
  • Multiple Chat Sessions: Support multiple conversation threads
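
As one sketch of the export idea, st.download_button can serve the session-state chat history as a JSON file; you could drop something like this into the sidebar section of app.py.

import json
import streamlit as st

# Offer the conversation kept in session state as a downloadable JSON file
if "messages" in st.session_state:
    chat_json = json.dumps(st.session_state.messages, indent=2)
    st.sidebar.download_button(
        label="Download chat history",
        data=chat_json,
        file_name="chat_history.json",
        mime="application/json",
    )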

Advanced Features

  • User Authentication: Require login for personalized experiences
  • Database Integration: Store conversations permanently
  • Voice Interface: Add speech-to-text and text-to-speech
  • Multi-Model Support: Let users choose different AI models

Business Applications

  • Customer Service Bot: Deploy for client support with company-specific knowledge
  • Interview Prep Tool: Create domain-specific interview practice bots
  • Educational Assistant: Build tutoring bots for specific subjects
  • Content Generator: Develop specialized writing assistants

Key Takeaways

Building web interfaces for AI applications demonstrates that you can bridge the gap between technical capability and user accessibility. Through this tutorial, you've learned:

Technical Skills:

  • Streamlit fundamentals and reactive programming model
  • Session state management for persistent applications
  • Web deployment from development to production
  • Integration patterns for AI APIs in web contexts

Professional Skills:

  • Creating user-friendly interfaces for technical functionality
  • Managing secrets and security in deployed applications
  • Building portfolio-worthy projects that demonstrate real-world skills
  • Understanding the path from prototype to production application

Strategic Understanding:

  • Why web interfaces matter for AI applications
  • How to make technical projects accessible to non-technical users
  • The importance of user experience in AI application adoption

You now have a deployed chatbot application that showcases multiple in-demand skills: AI integration, web development, user interface design, and cloud deployment. This foundation prepares you to build more sophisticated applications and demonstrates your ability to create complete, user-facing AI solutions.

More Projects to Try

We have some other project walkthrough tutorials you may also enjoy.

Introduction to NoSQL: What It Is and Why You Need It

19 September 2025 at 22:25

Picture yourself as a data engineer at a fast-growing social media company. Every second, millions of users are posting updates, uploading photos, liking content, and sending messages. Your job is to capture all of this activity—billions of events per day—store it somewhere useful, and transform it into insights that the business can actually use.

You set up a traditional SQL database, carefully designing tables for posts, likes, and comments. Everything works great... for about a week. Then the product team launches "reactions," adding hearts and laughs to "likes". Next week, story views. The week after, live video metrics. Each change means altering your database schema, and with billions of rows, these migrations take hours while your server struggles with the load.

This scenario isn't hypothetical. It's exactly what companies like Facebook, Amazon, and Google faced in the early 2000s. The solution they developed became what we now call NoSQL.

These are exactly the problems NoSQL databases solve, and understanding them will change how you think about data storage. By the end of this tutorial, you'll be able to:

  • Understand what NoSQL databases are and how they differ from traditional SQL databases
  • Identify the four main types of NoSQL databases—document, key-value, column-family, and graph—and when to use each one
  • Make informed decisions about when to choose NoSQL vs SQL for your data engineering projects
  • See real-world examples from companies like Netflix and Uber showing how these databases work together in production
  • Get hands-on experience with MongoDB to cement these concepts with practical skills

Let's get started!

What NoSQL Really Means (And Why It Exists)

Let's clear up a common confusion right away: NoSQL originally stood for "No SQL" when developers were frustrated with the limitations of relational databases. But as these new databases matured, the community realized that throwing away SQL entirely was like throwing away a perfectly good hammer just because you also needed a screwdriver. Today, NoSQL means "Not Only SQL." These databases complement traditional SQL databases rather than replacing them.

To understand why NoSQL emerged, we need to understand what problem it was solving. Traditional SQL databases were designed when storage was expensive, data was small, and schemas were stable. They excel at maintaining consistency but scale vertically—when you need more power, you buy a bigger server.

By the 2000s, this broke down. Companies faced massive, messy, constantly changing data. Buying bigger servers wasn't sustainable, and rigid table structures couldn't handle the variety.

NoSQL databases were designed from the ground up for this new reality. Instead of scaling up by buying bigger machines, they scale out by adding more commodity servers. Instead of requiring you to define your data structure upfront, they let you store data first and figure out its structure later. And instead of keeping all data on one machine for consistency, they spread it across many machines for resilience and performance.

Understanding NoSQL Through a Data Pipeline Lens

As a data engineer, you'll rarely use just one database. Instead, you'll build pipelines where different databases serve different purposes. Think of it like cooking a complex meal: you don't use the same pot for everything. You use a stockpot for soup, a skillet for searing, and a baking dish for the oven. Each tool has its purpose.

Let's walk through a typical data pipeline to see where NoSQL fits.

The Ingestion Layer

At the very beginning of your pipeline, you have raw data landing from everywhere. This is often messy. When you're pulling data from mobile apps, web services, IoT devices, and third-party APIs, each source has its own format and quirks. Worse, these formats change without warning.

A document database like MongoDB thrives here because it doesn't force you to know the exact structure beforehand. If the mobile team adds a new field to their events tomorrow, MongoDB will simply store it. No schema migration, no downtime.

The Processing Layer

Moving down the pipeline, you're transforming, aggregating, and enriching your data. Some of this happens in real time (recommendation feeds) and some in batches (daily metrics).

For lightning-fast lookups, Redis keeps frequently accessed data in memory. User preferences load instantly rather than waiting for complex database queries.

The Serving Layer

Finally, the serving layer is where cleaned, processed data becomes available for analysis and applications. This is often where SQL databases shine with their powerful query capabilities and mature tooling. But even here, NoSQL plays a role. Time-series data might live in Cassandra where it can be queried efficiently by time range. Graph relationships might be stored in Neo4j for complex network analysis.

The key insight is that modern data architectures are polyglot. They use multiple database technologies, each chosen for its strengths. NoSQL databases don't replace SQL; they handle the workloads that SQL struggles with.

The Four Flavors of NoSQL (And When to Use Each)

NoSQL isn't a single technology but rather four distinct database types, each optimized for different patterns. Understanding these differences is essential because choosing the wrong type can lead to performance headaches, operational complexity, and frustrated developers.

Document Databases: The Flexible Containers

Document databases store data as documents, typically in JSON format. If you've worked with JSON before, you already understand the basic concept. Each document is self-contained, with its own structure that can include nested objects and arrays.

Imagine you're building a product catalog for an e-commerce site:

  • A shirt has size and color attributes
  • A laptop has RAM and processor speed
  • A digital download has file format and license type

In a SQL database, you'd need separate tables for each product type or a complex schema with many nullable columns. In MongoDB, each product is just a document with whatever fields make sense for that product.
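
To make that concrete, here's a minimal sketch using the pymongo client against a local MongoDB instance; the database, collection, and field names are illustrative assumptions, not something this tutorial sets up for you.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]  # hypothetical database and collection

# Three products, three different shapes -- no schema migration required
products.insert_many([
    {"name": "Oxford Shirt", "type": "apparel", "size": "M", "color": "blue"},
    {"name": "Ultrabook 14", "type": "laptop", "ram_gb": 16, "cpu_ghz": 3.2},
    {"name": "Photo Pack", "type": "download", "file_format": "zip", "license": "personal"},
])

# Query by whichever fields a document actually has
for doc in products.find({"type": "laptop"}):
    print(doc["name"], doc["ram_gb"])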

Best for:

  • Content management systems
  • Event logging and analytics
  • Mobile app backends
  • Any application with evolving data structures

This flexibility makes document databases perfect for situations where your data structure evolves frequently or varies between records. But remember: flexibility doesn't mean chaos. You still want consistency within similar documents, just not the rigid structure SQL demands.

Key-Value Stores: The Speed Demons

Key-value stores are the simplest NoSQL type: just keys mapped to values. Think of them like a massive Python dictionary or JavaScript object that persists across server restarts. This simplicity is their superpower. Without complex queries or relationships to worry about, key-value stores can be blazingly fast.

Redis, the most popular key-value store, keeps data in memory for extremely fast access times, often under a millisecond for simple lookups. Consider these real-world uses:

  • Netflix showing you personalized recommendations
  • Uber matching you with a nearby driver
  • Gaming leaderboards updating in real-time
  • Shopping carts persisting across sessions

The pattern here is clear: when you need simple lookups at massive scale and incredible speed, key-value stores deliver.
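
As a rough sketch of the caching pattern, here's what that looks like with the redis-py client against a local Redis server; the key name and stored fields are made-up examples.

import json

import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Cache a user's preferences for five minutes under an illustrative key
prefs = {"theme": "dark", "language": "en"}
r.setex("user:42:prefs", 300, json.dumps(prefs))

# Later lookups hit memory instead of a slower database query
cached = r.get("user:42:prefs")
if cached is not None:
    prefs = json.loads(cached)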

The trade-off: You can only look up data by its key. No querying by other attributes, no relationships, no aggregations. You wouldn't build your entire application on Redis, but for the right use case, nothing else comes close to its performance.

Column-Family Databases: The Time-Series Champions

Column-family databases organize data differently than you might expect. Instead of rows with fixed columns like SQL, they store data in column families — groups of related columns that can vary between rows. This might sound confusing, so let's use a concrete example.

Imagine you're storing temperature readings from thousands of IoT sensors:

  • Each sensor reports at different intervals (some every second, others every minute)
  • Some sensors report temperature only
  • Others also report humidity, pressure, or both
  • You need to query millions of readings by time range

In a column-family database like Cassandra, each sensor becomes a row with different column families. You might have a "measurements" family containing temperature, humidity, and pressure columns, and a "metadata" family with location and sensor_type. This structure makes it extremely efficient to query all measurements for a specific sensor and time range, or to retrieve just the metadata without loading the measurement data.
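
In practice, modern Cassandra expresses this through CQL tables, where the partition key and clustering key encode the access pattern. Here's a rough sketch using the DataStax Python driver; the keyspace, table, and column names are assumptions for illustration.

from datetime import datetime

from cassandra.cluster import Cluster  # DataStax Python driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("iot")  # hypothetical keyspace

# Partition by sensor, cluster by time: time-range reads per sensor stay cheap
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id text,
        reading_time timestamp,
        temperature double,
        humidity double,
        PRIMARY KEY (sensor_id, reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC)
""")

rows = session.execute(
    "SELECT reading_time, temperature FROM readings "
    "WHERE sensor_id = %s AND reading_time >= %s",
    ("sensor-17", datetime(2025, 1, 1)),
)
for row in rows:
    print(row.reading_time, row.temperature)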

Perfect for:

  • Application logs and metrics
  • IoT sensor data
  • Financial market data
  • Any append-heavy, time-series workload

This design makes column-family databases exceptional at handling write-heavy workloads and scenarios where you're constantly appending new data.

Graph Databases: The Relationship Experts

Graph databases take a completely different approach. Instead of tables or documents, they model data as nodes (entities) and edges (relationships). This might seem niche, but when relationships are central to your queries, graph databases turn complex problems into simple ones.

Consider LinkedIn's "How you're connected" feature. Finding the path between you and another user with SQL would require recursive joins that become exponentially complex as the network grows. In a graph database like Neo4j, this is a basic traversal operation that can handle large networks efficiently. While performance depends on query complexity and network structure, graph databases excel at relationship-heavy problems that would be nearly impossible to solve efficiently in SQL.
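
To give you a feel for it, here's a sketch of that traversal with the official Neo4j Python driver and Cypher; the connection details, node label, and relationship type are assumptions, not a prescribed data model.

from neo4j import GraphDatabase  # official Neo4j Python driver

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Find the shortest connection chain between two people, up to four hops away
query = """
MATCH p = shortestPath(
  (me:Person {name: $me})-[:CONNECTED_TO*..4]-(them:Person {name: $them})
)
RETURN [n IN nodes(p) | n.name] AS chain
"""

with driver.session() as session:
    record = session.run(query, me="Alice", them="Bob").single()
    if record:
        print(" -> ".join(record["chain"]))

driver.close()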

Graph databases excel at:

  • Recommendation engines ("customers who bought this also bought...")
  • Fraud detection (finding connected suspicious accounts)
  • Social network analysis (identifying influencers)
  • Knowledge graphs (mapping relationships between concepts)
  • Supply chain optimization (tracing dependencies)

They're specialized tools, but for the right problems, they're invaluable. If your core challenge involves understanding how things connect and influence each other, graph databases provide elegant solutions that would be nightmarish in other systems.

Making the NoSQL vs SQL Decision

One of the most important skills you'll develop as a data engineer is knowing when to use NoSQL versus SQL. The key is matching each database type to the problems it solves best.

When NoSQL Makes Sense

If your data structure changes frequently (like those social media events we talked about earlier), the flexibility of document databases can save you from constant schema migrations. When you're dealing with massive scale, NoSQL's ability to distribute data across many servers becomes critical. Traditional SQL databases can scale to impressive sizes, but when you're talking about petabytes of data or millions of requests per second, NoSQL's horizontal scaling model is often more cost-effective.

NoSQL also shines when your access patterns are simple:

  • Looking up records by ID
  • Retrieving entire documents
  • Querying time-series data by range
  • Caching frequently accessed data

These databases achieve incredible performance by optimizing for specific patterns rather than trying to be everything to everyone.

When SQL Still Rules

SQL databases remain unbeatable for complex queries. The ability to join multiple tables, perform aggregations, and write sophisticated analytical queries is where SQL's decades of development really show. If your application needs to answer questions like "What's the average order value for customers who bought product A but not product B in the last quarter?", SQL makes this straightforward, while NoSQL might require multiple queries and application-level processing.

Another SQL strength is keeping your data accurate and reliable. When you're dealing with financial transactions, inventory management, or any scenario where consistency is non-negotiable, traditional SQL databases ensure your data stays correct. Many NoSQL databases offer "eventual consistency." This means your data will be consistent across all nodes eventually, but there might be brief moments where different nodes show different values. For many applications this is fine, but for others it's a deal-breaker.

The choice between SQL and NoSQL often comes down to your specific needs rather than one being universally better. SQL databases have had decades to mature their tooling and build deep integrations with business intelligence platforms. But NoSQL databases have caught up quickly, especially with the rise of managed cloud services that handle much of the operational complexity.

Common Pitfalls and How to Avoid Them

As you start working with NoSQL, there are some common mistakes that almost everyone makes. Let’s help you avoid them.

The "Schemaless" Trap

The biggest misconception is that "schemaless" means "no design required." Just because MongoDB doesn't enforce a schema doesn't mean you shouldn't have one. In fact, NoSQL data modeling often requires more upfront thought than SQL. You need to understand your access patterns and design your data structure around them.

In document databases, you might denormalize data that would be in separate SQL tables. In key-value stores, your key design determines your query capabilities. It's still careful design work, just focused on access patterns rather than normalization rules.
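
For example, a denormalized order document and a deliberate key-naming scheme might look like this; the field names and key format are only a sketch of the idea.

# Denormalized: the customer's name and item prices are copied into the order
# document, so a single read answers "show me this order" without any joins.
order_doc = {
    "order_id": "ord-1001",
    "customer": {"id": "cust-42", "name": "Ada Lovelace"},
    "items": [
        {"sku": "SKU-1", "title": "USB cable", "price": 9.99, "qty": 2},
        {"sku": "SKU-2", "title": "Keyboard", "price": 49.00, "qty": 1},
    ],
    "total": 68.98,
}

# In a key-value store, the key itself encodes the access pattern: you can only
# fetch "the current cart for user 42" quickly because the key was designed for it.
cart_key = "cart:user:42"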

Underestimating Operations

Many newcomers underestimate the operational complexity of NoSQL. While managed services have improved this significantly, running your own Cassandra cluster or MongoDB replica set requires understanding concepts like:

  • Consistency levels and their trade-offs
  • Replication strategies and failure handling
  • Partition tolerance and network splits
  • Backup and recovery procedures
  • Performance tuning and monitoring

Even with managed services, you need to understand these concepts to use the databases effectively.

The Missing Joins Problem

In SQL, you can easily combine data from multiple tables with joins. Most NoSQL databases don't support this, which surprises people coming from SQL. So how do you handle relationships between your data? You have three options:

  1. Denormalize your data: Store redundant copies where needed
  2. Application-level joins: Multiple queries assembled in your code
  3. Choose a different database: Sometimes SQL is simply the right choice

The specifics of these approaches go beyond what we'll cover here, but being aware that joins don't exist in NoSQL will save you from some painful surprises down the road.

Getting Started: Your Path Forward

So where do you begin with all of this? The variety of NoSQL databases can feel overwhelming, but you don't need to learn everything at once.

Start with a Real Problem

Don't choose a database and then look for problems to solve. Instead, identify a concrete use case:

  • Have JSON data with varying structure? Try MongoDB
  • Need to cache data for faster access? Experiment with Redis
  • Working with time-series data? Set up a Cassandra instance
  • Analyzing relationships? Consider Neo4j

Having a concrete use case makes learning much more effective than abstract tutorials.

Focus on One Type First

Pick one NoSQL type and really understand it before moving to others. Document databases like MongoDB are often the most approachable if you're coming from SQL. The document model is intuitive, and the query language is relatively familiar.

Use Managed Services

While you're learning, use managed services like MongoDB Atlas, Amazon DynamoDB, or Redis Cloud instead of running your own clusters. Setting up distributed databases is educational, but it's a distraction when you're trying to understand core concepts.

Remember the Bigger Picture

Most importantly, remember that NoSQL is a tool in your toolkit, not a replacement for everything else. The most successful data engineers understand both SQL and NoSQL, knowing when to use each and how to make them work together.

Next Steps

You've covered a lot of ground today. You now:

  • Understand what NoSQL databases are and why they exist
  • Know the four main types and their strengths
  • Can identify when to choose NoSQL vs SQL for different use cases
  • Recognize how companies use multiple databases together in real systems
  • Understand the common pitfalls to avoid as you start working with NoSQL

With this conceptual foundation, you're ready to get hands-on and see how these databases actually work. You understand the big picture of where NoSQL fits in modern data engineering, but there's nothing like working with real data to make it stick.

The best way to build on what you've learned is to pick one database and start experimenting:

  • Get hands-on with MongoDB by setting up a database, loading real data, and practicing queries. Document databases are often the most approachable starting point.
  • Design a multi-database project for your portfolio. Maybe an e-commerce analytics pipeline that uses MongoDB for raw events, Redis for caching, and PostgreSQL for final reports.
  • Learn NoSQL data modeling to understand how to structure documents, design effective keys, and handle relationships without joins.
  • Explore stream processing patterns to see how Kafka works with NoSQL databases to handle real-time data flows.
  • Try cloud NoSQL services like DynamoDB, Cosmos DB, or Cloud Firestore to understand managed database offerings.
  • Study polyglot architectures by researching how companies like Netflix, Spotify, or GitHub combine different database types in their systems.

Each of these moves you toward the kind of hands-on experience that employers value. Modern data teams expect you to understand both SQL and NoSQL, and more importantly, to know when and why to use each.

The next time you're faced with billions of rapidly changing events, evolving data schemas, or the need to scale beyond a single server, you'll have the knowledge to choose the right tool for the job. That's the kind of systems thinking that makes great data engineers.

Project Tutorial: Build an AI Chatbot with Python and the OpenAI API

19 September 2025 at 22:03

Learning to work directly with AI programmatically opens up a world of possibilities beyond using ChatGPT in a browser. When you understand how to connect to AI services using application programming interfaces (APIs), you can build custom applications, integrate AI into existing systems, and create personalized experiences that match your exact needs.

In this hands-on tutorial, we'll build a fully functional chatbot from scratch using Python and the OpenAI API. You'll learn to manage conversations, control costs with token budgeting, and create custom AI personalities that persist across multiple exchanges. By the end, you'll have both a working chatbot and the foundational skills to build more sophisticated AI-powered applications.

Why Build Your Own Chatbot?

While AI tools like ChatGPT are powerful, building your own chatbot teaches you essential skills for working with AI APIs professionally. You'll understand how conversation memory actually works, learn to manage API costs effectively, and gain the ability to customize AI behavior for specific use cases.

This knowledge translates directly to real-world applications: customer service bots with your company's voice, educational assistants for specific subjects, or personal productivity tools that understand your workflow.

What You'll Learn

By the end of this tutorial, you'll know how to:

  • Connect to the OpenAI API with secure authentication
  • Design custom AI personalities using system prompts
  • Build conversation loops that remember previous exchanges
  • Implement token counting and budget management
  • Structure chatbot code using functions and classes
  • Handle API errors and edge cases gracefully
  • Deploy your chatbot for others to use

Before You Start: Setup Guide

Prerequisites

You'll need to be comfortable with Python fundamentals such as defining variables, functions, loops, and dictionaries. Familiarity with defining your own functions is particularly important. Basic knowledge of APIs is helpful but not required—we'll cover what you need to know.

Environment Setup

First, you'll need a local development environment. We recommend VS Code if you're new to local development, though any Python IDE will work.

Install the required libraries using this command in your terminal:

pip install openai tiktoken

API Key Setup

You have two options for accessing AI models:

Free Option: Sign up for Together AI, which provides $1 in free credits—more than enough for this entire tutorial. Their free model is slower but costs nothing.

Premium Option: Use OpenAI directly. The model we'll use (GPT-4o-mini) is extremely affordable—our entire tutorial costs less than 5 cents during testing.

Critical Security Note: Never hardcode API keys in your scripts. We'll use environment variables to keep them secure.

For Windows users, set your environment variable through Settings > Environment Variables, then restart your computer. Mac and Linux users can set environment variables without rebooting.

Part 1: Your First AI Response

Let's start with the simplest possible chatbot—one that can respond to a single message. This foundation will teach you the core concepts before we add complexity.

Create a new file called chatbot.py and add this code:

import os
from openai import OpenAI

# Load API key securely from environment variables
api_key = os.getenv("OPENAI_API_KEY") or os.getenv("TOGETHER_API_KEY")

# Create the OpenAI client
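# Note: if you go with the Together AI free option, you'll also need to point
# this client at Together's OpenAI-compatible base_url when constructing it.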
client = OpenAI(api_key=api_key)

# Send a message and get a response
response = client.chat.completions.create(
    model="gpt-4o-mini",  # or "meta-llama/Llama-3.3-70B-Instruct-Turbo-Free" for Together
    messages=[
        {"role": "system", "content": "You are a fed up and sassy assistant who hates answering questions."},
        {"role": "user", "content": "What is the weather like today?"}
    ],
    temperature=0.7,
    max_tokens=100
)

# Extract and display the reply
reply = response.choices[0].message.content
print("Assistant:", reply)

Run this script and you'll see something like:

Assistant: Oh fantastic, another weather question! I don't have access to real-time weather data, but here's a wild idea—maybe look outside your window or check a weather app like everyone else does?

Understanding the Code

The magic happens in the messages parameter, which uses three distinct roles:

  • System: Sets the AI's personality and behavior. This is like giving the AI a character briefing that influences every response.
  • User: Represents what you (or your users) type to the chatbot.
  • Assistant: The AI's responses (we'll add these later for conversation memory).

Key Parameters Explained

Temperature controls the AI's “creativity.” Lower values (0-0.3) produce consistent, predictable responses. Higher values (0.7-1.0) generate more creative but potentially unpredictable outputs. We use 0.7 as a good balance.

Max Tokens limits response length and protects your budget. A token corresponds to roughly half a word to a whole word, so 100 tokens allows for substantial responses while preventing runaway costs.

Part 2: Understanding AI Variability

Run your script multiple times and notice how responses differ each time. This happens because AI models use statistical sampling—they don't just pick the "best" word, but randomly select from probable options based on context.

Let's experiment with this by modifying our temperature:

# Try temperature=0 for consistent responses
temperature=0,
max_tokens=100

Run this version multiple times and observe more consistent (though not identical) responses.

Now try temperature=1.0 and see how much more creative and unpredictable the responses become. Higher temperatures often lead to longer responses too, which brings us to an important lesson about cost management.

Learning Insight: During development for a different project, I accidentally spent $20 on a single API call because I forgot to set max_tokens when processing a large file. Always include token limits when experimenting!

Part 3: Refactoring with Functions

As your chatbot becomes more complex, organizing code becomes vital. Let's refactor our script to use functions and global variables.

Modify your chatbot.py code:

import os
from openai import OpenAI

# Configuration variables
api_key = os.getenv("OPENAI_API_KEY") or os.getenv("TOGETHER_API_KEY")
client = OpenAI(api_key=api_key)
MODEL = "gpt-4o-mini"  # or "meta-llama/Llama-3.3-70B-Instruct-Turbo-Free"
TEMPERATURE = 0.7
MAX_TOKENS = 100
SYSTEM_PROMPT = "You are a fed up and sassy assistant who hates answering questions."

def chat(user_input):
    """Send a message to the AI and return the response."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input}
        ],
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS
    )

    reply = response.choices[0].message.content
    return reply

# Test the function
print(chat("How are you doing today?"))

This refactoring makes our code more maintainable and reusable. Global variables let us easily adjust configuration, while the function encapsulates the chat logic for reuse.

Part 4: Adding Conversation Memory

Real chatbots remember previous exchanges. Let's add conversation memory by maintaining a growing list of messages.

Create part3_chat_loop.py:

import os
from openai import OpenAI

# Configuration
api_key = os.getenv("OPENAI_API_KEY") or os.getenv("TOGETHER_API_KEY")
client = OpenAI(api_key=api_key)
MODEL = "gpt-4o-mini"
TEMPERATURE = 0.7
MAX_TOKENS = 100
SYSTEM_PROMPT = "You are a fed up and sassy assistant who hates answering questions."

# Initialize conversation with system prompt
messages = [{"role": "system", "content": SYSTEM_PROMPT}]

def chat(user_input):
    """Add user input to conversation and get AI response."""
    # Add user message to conversation history
    messages.append({"role": "user", "content": user_input})

    # Get AI response using full conversation history
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS
    )

    reply = response.choices[0].message.content

    # Add AI response to conversation history
    messages.append({"role": "assistant", "content": reply})

    return reply

# Interactive chat loop
while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"exit", "quit"}:
        break

    answer = chat(user_input)
    print("Assistant:", answer)

Now run your chatbot and try asking the same question twice:

You: Hi, how are you?
Assistant: Oh fantastic, just living the dream of answering questions I don't care about. What do you want?

You: Hi, how are you?
Assistant: Seriously, again? Look, I'm here to help, not to exchange pleasantries all day. What do you need?

The AI remembers your previous question and responds accordingly—that's conversation memory in action!

How Memory Works

Each time someone sends a message, we append both the user input and AI response to our messages list. The API processes this entire conversation history to generate contextually appropriate responses.

However, this creates a growing problem: longer conversations mean more tokens, which means higher costs.

Part 5: Token Management and Cost Control

As conversations grow, so does the token count—and your bill. Let's add smart token management to prevent runaway costs.

Create part4_final.py:

import os
from openai import OpenAI
import tiktoken

# Configuration
api_key = os.getenv("OPENAI_API_KEY") or os.getenv("TOGETHER_API_KEY")
client = OpenAI(api_key=api_key)
MODEL = "gpt-4o-mini"
TEMPERATURE = 0.7
MAX_TOKENS = 100
TOKEN_BUDGET = 1000  # Maximum tokens to keep in conversation
SYSTEM_PROMPT = "You are a fed up and sassy assistant who hates answering questions."

# Initialize conversation
messages = [{"role": "system", "content": SYSTEM_PROMPT}]

def get_encoding(model):
    """Get the appropriate tokenizer for the model."""
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        print(f"Warning: Tokenizer for model '{model}' not found. Falling back to 'cl100k_base'.")
        return tiktoken.get_encoding("cl100k_base")

ENCODING = get_encoding(MODEL)

def count_tokens(text):
    """Count tokens in a text string."""
    return len(ENCODING.encode(text))

def total_tokens_used(messages):
    """Calculate total tokens used in conversation."""
    try:
        return sum(count_tokens(msg["content"]) for msg in messages)
    except Exception as e:
        print(f"[token count error]: {e}")
        return 0

def enforce_token_budget(messages, budget=TOKEN_BUDGET):
    """Remove old messages if conversation exceeds token budget."""
    try:
        while total_tokens_used(messages) > budget:
            if len(messages) <= 2:  # Keep the system prompt and the most recent message
                break
            messages.pop(1)  # Remove oldest non-system message
    except Exception as e:
        print(f"[token budget error]: {e}")

def chat(user_input):
    """Chat with memory and token management."""
    messages.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS
    )

    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})

    # Prune old messages if over budget
    enforce_token_budget(messages)

    return reply

# Interactive chat with token monitoring
while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"exit", "quit"}:
        break

    answer = chat(user_input)
    print("Assistant:", answer)
    print(f"Current tokens: {total_tokens_used(messages)}")

How Token Management Works

The token management system works in several steps:

  1. Count Tokens: We use tiktoken to count tokens in each message accurately
  2. Monitor Total: Track the total tokens across the entire conversation
  3. Enforce Budget: When we exceed our token budget, automatically remove the oldest messages (but keep the system prompt)

Learning Insight: Different models use different tokenization schemes. The word "dog" might be 1 token in one model but 2 tokens in another. Our encoding functions handle these differences gracefully.

Run your chatbot and have a long conversation. Watch how the token count grows, then notice when it drops as old messages get pruned. The chatbot maintains recent context while staying within budget.

Part 6: Production-Ready Code Structure

For production applications, object-oriented design provides better organization and encapsulation. Here's how to convert our functional code to a class-based approach:

Create oop_chatbot.py:

import os
import tiktoken
from openai import OpenAI

class Chatbot:
    def __init__(self, api_key, model="gpt-4o-mini", temperature=0.7, max_tokens=100,
                 token_budget=1000, system_prompt="You are a helpful assistant."):
        self.client = OpenAI(api_key=api_key)
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.token_budget = token_budget
        self.messages = [{"role": "system", "content": system_prompt}]
        self.encoding = self._get_encoding()

    def _get_encoding(self):
        """Get tokenizer for the model."""
        try:
            return tiktoken.encoding_for_model(self.model)
        except KeyError:
            print(f"Warning: No tokenizer found for model '{self.model}'. Falling back to 'cl100k_base'.")
            return tiktoken.get_encoding("cl100k_base")

    def _count_tokens(self, text):
        """Count tokens in text."""
        return len(self.encoding.encode(text))

    def _total_tokens_used(self):
        """Calculate total tokens in conversation."""
        try:
            return sum(self._count_tokens(msg["content"]) for msg in self.messages)
        except Exception as e:
            print(f"[token count error]: {e}")
            return 0

    def _enforce_token_budget(self):
        """Remove old messages if over budget."""
        try:
            while self._total_tokens_used() > self.token_budget:
                if len(self.messages) <= 2:
                    break
                self.messages.pop(1)
        except Exception as e:
            print(f"[token budget error]: {e}")

    def chat(self, user_input):
        """Send message and get response."""
        self.messages.append({"role": "user", "content": user_input})

        response = self.client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            temperature=self.temperature,
            max_tokens=self.max_tokens
        )

        reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})

        self._enforce_token_budget()
        return reply

    def get_token_count(self):
        """Get current token usage."""
        return self._total_tokens_used()

# Usage example
api_key = os.getenv("OPENAI_API_KEY") or os.getenv("TOGETHER_API_KEY")
if not api_key:
    raise ValueError("No API key found. Set OPENAI_API_KEY or TOGETHER_API_KEY.")

bot = Chatbot(
    api_key=api_key,
    system_prompt="You are a fed up and sassy assistant who hates answering questions."
)

while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"exit", "quit"}:
        break

    response = bot.chat(user_input)
    print("Assistant:", response)
    print("Current tokens used:", bot.get_token_count())

The class-based approach encapsulates all chatbot functionality, makes the code more maintainable, and provides a clean interface for integration into larger applications.

Testing Your Chatbot

Run your completed chatbot and test these scenarios:

  1. Memory Test: Ask a question, then refer back to it later in the conversation
  2. Personality Test: Verify the sassy persona remains consistent across exchanges
  3. Token Management Test: Have a long conversation and watch token counts stabilize
  4. Error Handling Test: Try invalid input to see graceful error handling

Common Issues and Solutions

Environment Variable Problems: If you get authentication errors, verify your API key is set correctly. Windows users may need to restart after setting environment variables.

Token Counting Discrepancies: Different models use different tokenization. Our fallback encoding provides reasonable estimates when exact tokenizers aren't available.

Memory Management: If conversations feel repetitive, your token budget might be too low, causing important context to be pruned too aggressively.

What's Next?

You now have a fully functional chatbot with memory, personality, and cost controls. Here are natural next steps:

Immediate Extensions

  • Web Interface: Deploy using Streamlit or Gradio for a user-friendly interface
  • Multiple Personalities: Create different system prompts for various use cases
  • Conversation Export: Save conversations to JSON files for persistence
  • Usage Analytics: Track token usage and costs over time

Advanced Features

  • Multi-Model Support: Compare responses from different AI models
  • Custom Knowledge: Integrate your own documents or data sources
  • Voice Interface: Add speech-to-text and text-to-speech capabilities
  • User Authentication: Support multiple users with separate conversation histories

Production Considerations

  • Rate Limiting: Handle API rate limits gracefully (see the sketch after this list)
  • Monitoring: Add logging and error tracking
  • Scalability: Design for multiple concurrent users
  • Security: Implement proper input validation and sanitization
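
As a starting point for the rate-limiting item above, a retry wrapper might look something like this minimal sketch; it assumes the v1 openai SDK's RateLimitError and the Chatbot class from Part 6, and the retry counts and delays are arbitrary.

import time

import openai  # the same v1 SDK used throughout this tutorial

def chat_with_retry(bot, user_input, attempts=3, base_delay=2.0):
    """Retry bot.chat() with exponential backoff when the API rate-limits us."""
    for attempt in range(attempts):
        try:
            return bot.chat(user_input)
        except openai.RateLimitError:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))  # wait 2s, 4s, 8s, ...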

Key Takeaways

Building your own chatbot teaches fundamental skills for working with AI APIs professionally. You've learned to manage conversation state, control costs through token budgeting, and structure code for maintainability.

These skills transfer directly to production applications: customer service bots, educational assistants, creative writing tools, and countless other AI-powered applications.

The chatbot you've built represents a solid foundation. With the techniques you've mastered—API integration, memory management, and cost control—you're ready to tackle more sophisticated AI projects and integrate conversational AI into your own applications.

Remember to experiment with different personalities, temperature settings, and token budgets to find what works best for your specific use case. The real power of building your own chatbot lies in this customization capability that you simply can't get from using someone else's AI interface.

Resources and Next Steps

  • Complete Code: All examples are available in the solution notebook
  • Community Support: Join the Dataquest Community to discuss your projects and get help with extensions
  • Related Learning: Explore API integration patterns and advanced Python techniques to build even more sophisticated applications

Start experimenting with your new chatbot, and remember that every conversation is a learning opportunity, both for you and your AI assistant!

More Projects to Try

We have some other project walkthrough tutorials you may also enjoy:
