
Running and Managing Apache Airflow with Docker (Part II)

8 November 2025 at 02:17

In the previous tutorial, we set up Apache Airflow inside Docker, explored its architecture, and built our first real DAG using the TaskFlow API. We simulated an ETL process with two stages — Extract and Transform, demonstrating how Airflow manages dependencies, task retries, and dynamic parallel execution through Dynamic Task Mapping. By the end, we had a functional, scalable workflow capable of processing multiple datasets in parallel, a key building block for modern data pipelines.

In this tutorial, we’ll build on what you created earlier and take a significant step toward production-style orchestration. You’ll complete the ETL lifecycle by adding the Load stage and connecting Airflow to a local MySQL database. This will allow you to load transformed data directly from your pipeline and manage database connections securely using Airflow’s Connections and Environment Variables.

Beyond data loading, you’ll integrate Git and Git Sync into your Airflow environment to enable version control, collaboration, and continuous deployment of DAGs. These practices mirror how data engineering teams manage Airflow projects in real-world settings, promoting consistency, reliability, and scalability, while still keeping the focus on learning and experimentation.

By the end of this part, your Airflow setup will move beyond a simple sandbox and start resembling a production-aligned environment. You’ll have a complete ETL pipeline, from extraction and transformation to loading and automation, and a clear understanding of how professional teams structure and manage their workflows.

Working with MySQL in Airflow

In the previous section, we built a fully functional Airflow pipeline that dynamically extracted and transformed market data from multiple regions: us, europe, asia, and africa. Each branch of our DAG handled its own extract and transform tasks independently, creating separate CSV files for each region under /opt/airflow/tmp. This setup mimics a real-world data engineering workflow where regional datasets are processed in parallel before being stored or analyzed further.

Now that our transformed datasets are generated, the next logical step is to load them into a database, a critical phase in any ETL pipeline. This not only centralizes your processed data but also allows for downstream analysis, reporting, and integration with BI tools like Power BI or Looker.

While production pipelines often write to cloud-managed databases such as Amazon RDS, Google Cloud SQL, or Azure Database for MySQL, we’ll keep things local and simple by using a MySQL instance on your machine. This approach allows you to test and validate your Airflow workflows without relying on external cloud resources or credentials. The same logic, however, can later be applied seamlessly to remote or cloud-hosted databases.

Prerequisite: Install and Set Up MySQL Locally

Before adding the Load step to our DAG, ensure that MySQL is installed and running on your machine.

Install MySQL

  • Windows/macOS: Download and install MySQL Community Server.

  • Linux (Ubuntu):

    sudo apt update
    sudo apt install mysql-server -y
    sudo systemctl start mysql
    
  • Verify installation by running:

mysql -u root -p

Create a Database and User for Airflow

Inside your MySQL terminal or MySQL Workbench, run the following commands:

CREATE DATABASE IF NOT EXISTS airflow_db;
CREATE USER IF NOT EXISTS 'airflow'@'%' IDENTIFIED BY 'airflow';
GRANT ALL PRIVILEGES ON airflow_db.* TO 'airflow'@'%';
FLUSH PRIVILEGES;

This creates a simple local database called airflow_db and a user airflow with full access, perfect for development and testing.
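Before wiring this database into Airflow, it can help to confirm that the new user can actually connect. The short script below is only a quick sanity check, not part of the pipeline; it assumes the airflow_db database and airflow/airflow credentials created above, and it requires the mysql-connector-python package (pip install mysql-connector-python) on your host machine.

# check_mysql.py: quick sanity check for the local MySQL setup
import mysql.connector

conn = mysql.connector.connect(
    host="localhost",      # run this directly on your machine, not inside Docker
    user="airflow",
    password="airflow",
    database="airflow_db",
    port=3306,
)
cursor = conn.cursor()
cursor.execute("SELECT DATABASE(), CURRENT_USER();")
print(cursor.fetchone())   # expected: ('airflow_db', 'airflow@...')
cursor.close()
conn.close()

If this prints the database and user without raising an error, MySQL is ready for the Load step.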

Create a Database and User for Airflow

Network Configuration for Linux Users

When running Airflow in Docker and MySQL locally on Linux, Docker containers can’t automatically access localhost.

To fix this, you need to make your local machine reachable from inside Docker.

Open your docker-compose.yaml file and add the following line under the x-airflow-common service definition:

extra_hosts:
  - "host.docker.internal:host-gateway"

This line creates a bridge that allows Airflow containers to communicate with your local MySQL instance using the hostname host.docker.internal.

Switching to LocalExecutor

In part one of this tutorial, we ran Airflow with the CeleryExecutor. By default, the Docker Compose file uses CeleryExecutor, which requires additional components such as Redis, Celery workers, and the Flower dashboard for distributed task execution.

Since we’re running everything on a single machine for this tutorial, we can simplify things by using LocalExecutor, which runs tasks in parallel on one machine and eliminates the need for an external queue or worker system.

Find this line in your docker-compose.yaml:

AIRFLOW__CORE__EXECUTOR: CeleryExecutor 

Change it to:

AIRFLOW__CORE__EXECUTOR: LocalExecutor

Removing Unnecessary Services

Because we’re no longer using Celery, we can safely remove related components from the configuration. These include Redis, airflow-worker, and Flower.

You can search for the following sections and delete them:

  • The entire redis service block.
  • The airflow-worker service (Celery’s worker).
  • The flower service (Celery monitoring dashboard).
  • Any AIRFLOW__CELERY__... lines inside environment blocks.

Extending the DAG with a Load Step

Now let’s extend our existing DAG to include the Load phase of the ETL process. We already created extract_market_data() and transform_market_data() in the first part of this tutorial. This new task will read each transformed CSV file and insert its data into a MySQL table.

Here’s our updated daily_etl_pipeline_airflow3 DAG with the new load_to_mysql() task.
You can also find the complete version of this DAG in the cloned repository (git@github.com:dataquestio/tutorials.git), inside the part-two/ folder under airflow-docker-tutorial.

def daily_etl_pipeline():

    @task
    def extract_market_data(market: str):
        ...

    @task
    def transform_market_data(raw_file: str):
        ...

    @task
    def load_to_mysql(transformed_file: str):
        """Load the transformed CSV data into a MySQL table."""
        import mysql.connector
        import os

        db_config = {
            "host": "host.docker.internal",  # enables Docker-to-local communication
            "user": "airflow",
            "password": "airflow",
            "database": "airflow_db",
            "port": 3306
        }

        df = pd.read_csv(transformed_file)

        # Derive the table name dynamically based on region
        table_name = f"transformed_market_data_{os.path.basename(transformed_file).split('_')[-1].replace('.csv', '')}"

        conn = mysql.connector.connect(**db_config)
        cursor = conn.cursor()

        # Create table if it doesn’t exist
        cursor.execute(f"""
            CREATE TABLE IF NOT EXISTS {table_name} (
                timestamp VARCHAR(50),
                market VARCHAR(50),
                company VARCHAR(255),
                price_usd DECIMAL(10, 2),
                daily_change_percent DECIMAL(10, 2)
            );
        """)

        # Insert records
        for _, row in df.iterrows():
            cursor.execute(
                f"""
                INSERT INTO {table_name} (timestamp, market, company, price_usd, daily_change_percent)
                VALUES (%s, %s, %s, %s, %s)
                """,
                tuple(row)
            )

        conn.commit()
        conn.close()
        print(f"[LOAD] Data successfully loaded into MySQL table: {table_name}")

    # Define markets to process dynamically
    markets = ["us", "europe", "asia", "africa"]

    # Dynamically create and link tasks
    raw_files = extract_market_data.expand(market=markets)
    transformed_files = transform_market_data.expand(raw_file=raw_files)
    load_to_mysql.expand(transformed_file=transformed_files)

dag = daily_etl_pipeline()

When you trigger this DAG, Airflow will automatically create three sequential tasks for each defined region (us, europe, asia, africa):

first extracting market data, then transforming it, and finally loading it into a region-specific MySQL table.

Create a Database and User for Airflow (2)

Each branch runs independently, so by the end of a successful run, your local MySQL database (airflow_db) will contain four separate tables, one for each region:

transformed_market_data_us
transformed_market_data_europe
transformed_market_data_asia
transformed_market_data_africa

Each table contains the cleaned and sorted dataset for its region, including company names, prices, and daily percentage changes.

Once your containers are running, open MySQL (via terminal or MySQL Workbench) and run:

SHOW TABLES;

Create a Database and User for Airflow (3)

You should see all four tables listed. Then, to inspect one of them, for example us, run:

SELECT * FROM transformed_market_data_us;

Create a Database and User for Airflow (4)

From above, we can see the dataset that Airflow extracted, transformed, and loaded for the U.S. market, confirming your pipeline has now completed all three stages of ETL: Extract → Transform → Load.

This integration demonstrates Airflow’s ability to manage data flow across multiple sources and databases seamlessly, a key capability in modern data engineering pipelines.


Previewing the Loaded Data in Airflow

By now, you’ve confirmed that your transformed datasets are successfully loaded into MySQL and that you can view them directly in MySQL Workbench or through a SQL client. But Airflow also provides a convenient way to query and preview this data right from the UI, using Connections and the SQLExecuteQueryOperator.

Connections in Airflow store the credentials and parameters needed to connect to external systems such as databases, APIs, or cloud services. Instead of hardcoding passwords or host details in your DAGs, you define a connection once in the Web UI and reference it securely using its conn_id.

To set this up:

  1. Open the Airflow Web UI
  2. Navigate to Admin → Connections → + Add a new record
  3. Fill in the following details:
    • Conn Id: local_mysql
    • Conn Type: MySQL
    • Host: host.docker.internal
    • Schema: airflow_db
    • Login: airflow
    • Password: airflow
    • Port: 3306

Note: These values must match the credentials you defined earlier when setting up your local MySQL instance.

Specifically, the database airflow_db, user airflow, and password airflow should already exist in your MySQL setup.

The host.docker.internal value ensures that your Airflow containers can communicate with MySQL running on your local machine.

  • Also note that when you use docker compose down -v, all volumes, including your Airflow connections, will be deleted. Always remember to re-add the connection afterward.

If your changes are not volume-related, you can safely shut down the containers using docker compose down (without -v), which preserves your existing connections and data.
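
A related tip: Airflow can also pick up connections from environment variables. Any variable named AIRFLOW_CONN_<CONN_ID> is interpreted as a connection URI, so as a rough sketch (using the same local credentials as above), you could add this line to the environment block of x-airflow-common in docker-compose.yaml and have local_mysql re-created automatically on every startup, even after a docker compose down -v:

AIRFLOW_CONN_LOCAL_MYSQL: "mysql://airflow:airflow@host.docker.internal:3306/airflow_db"

Connections defined this way won’t show up in the Admin → Connections list, but tasks and operators can still reference them by conn_id.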

Click Save to register the connection.

Now, Airflow knows how to connect to your MySQL database whenever a task specifies conn_id="local_mysql".
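
Stored connections aren’t limited to operators; your own @task functions can use them too, through provider hooks. The snippet below is a minimal sketch (assuming the MySQL provider is available in your Airflow image) showing how a hypothetical variant of load_to_mysql could pull its credentials from the local_mysql connection instead of the hardcoded db_config dictionary:

    from airflow.providers.mysql.hooks.mysql import MySqlHook

    @task
    def load_to_mysql_with_hook(transformed_file: str):
        """Hypothetical variant of load_to_mysql that reads credentials
        from the stored 'local_mysql' Airflow connection."""
        hook = MySqlHook(mysql_conn_id="local_mysql")
        conn = hook.get_conn()       # DB-API connection built from the stored credentials
        cursor = conn.cursor()
        cursor.execute("SELECT 1;")  # replace with the same CREATE TABLE / INSERT logic as before
        cursor.close()
        conn.close()

This keeps credentials out of your DAG code entirely, which matters once DAGs start living in a shared Git repository (as we’ll do later in this tutorial).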

Let’s create a simple SQL query task to preview the data we just loaded.


    from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

    @task
    def extract_market_data(market: str):
        ...

    @task
    def transform_market_data(raw_file: str):
        ...

    @task
    def load_to_mysql(transformed_file: str):
        ...

    # Preview task defined at the DAG level, alongside the @task functions
    preview_mysql = SQLExecuteQueryOperator(
        task_id="preview_mysql_table",
        conn_id="local_mysql",
        sql="SELECT * FROM transformed_market_data_us LIMIT 5;",
        do_xcom_push=True,  # makes query results viewable in Airflow’s XCom tab
    )

    # Define markets to process dynamically
    markets = ["us", "europe", "asia", "africa"]

    # Dynamically create and link tasks
    raw_files = extract_market_data.expand(market=markets)
    transformed_files = transform_market_data.expand(raw_file=raw_files)
    load_to_mysql.expand(transformed_file=transformed_files)

dag = daily_etl_pipeline()

Next, link this task to your DAG so that it runs after the loading process. Update the line load_to_mysql.expand(transformed_file=transformed_files) to this:

    load_to_mysql.expand(transformed_file=transformed_files) >> preview_mysql

When you trigger the DAG again (remember to shut down the containers with docker compose down before editing your DAGs, then bring them back up with docker compose up -d once you’ve saved your changes), Airflow will:

  1. Connect to your MySQL database using the stored connection credentials.
  2. Run the SQL query on the specified table.
  3. Display the first few rows of your data as a JSON result in the XCom view.

To see it:

  • Go to Grid View
  • Click on the preview_mysql_table task
  • Choose XCom from the top menu

Previewing the Loaded Data in Airflow (5)

You’ll see your data represented in JSON format, confirming that the integration works: Airflow not only orchestrates the workflow but can also interactively query and visualize your results without leaving the platform.

This makes it easy to verify that your ETL pipeline is functioning correctly end-to-end: extraction, transformation, loading, and now validation, all visible and traceable inside Airflow.

Git-Based DAG Management and CI/CD for Deployment (with git-sync)

At this stage, your local Airflow environment is complete: you’ve built a fully functional ETL pipeline that extracts, transforms, and loads regional market data into MySQL, and you’ve even validated the results directly from the Airflow UI.

Now it’s time to take the final step toward production readiness: managing your DAGs the way data engineering teams do in real-world systems, through Git-based deployment and continuous integration.

We’ll push our DAGs to a shared GitHub repository called airflow_dags, and connect Airflow to it using the git-sync container, which automatically keeps your DAGs in sync. This allows every team member (or environment) to pull from the same source, the Git repo, without manually copying files into containers.

Why Manage DAGs with Git

Every DAG is just a Python file, and like all code, it deserves version control. Storing DAGs in a Git repository brings the same advantages that software engineers rely on:

  • Versioning: track every change and roll back safely.
  • Collaboration: multiple developers can work on different workflows without conflict.
  • Reproducibility: every environment can pull identical DAGs from a single source.
  • Automation: changes sync automatically, eliminating manual uploads.

This structure makes Airflow easier to maintain and scales naturally as your pipelines grow in number and complexity.

Pushing Your DAGs to GitHub

To begin, create a public or private repository named airflow_dags (e.g., https://github.com/<your-username>/airflow_dags).

Then, in your project root (airflow-docker-tutorial), initialize Git and push your local dags/ directory:

git init
git remote add origin https://github.com/<your-username>/airflow_dags.git
git add dags/
git commit -m "Add Airflow ETL pipeline DAGs"
git branch -M main
git push -u origin main

Once complete, your DAGs live safely in GitHub, ready for syncing.

How git-sync Works

git-sync is a lightweight sidecar container that continuously clones and updates a Git repository into a shared volume.

Once running, it:

  • Clones your repository (e.g., https://github.com/<your-username>/airflow_dags.git),
  • Pulls updates every 30 seconds by default,
  • Exposes the latest DAGs to Airflow automatically, no rebuilds or restarts required.

This is how Airflow stays up to date with your Git repo in real time.

Setting Up git-sync in Docker Compose

In your existing docker-compose.yaml, you already have a list of services that define your Airflow environment, like the api-server, scheduler, triggerer, and dag-processor. Each of these runs in its own container but works together as part of the same orchestration system.

The git-sync container will become another service in this list, just like those, but with a very specific purpose:

  • to keep your /dags folder continuously synchronized with your remote GitHub repository.

Instead of copying Python DAG files manually or rebuilding containers every time you make a change, the git-sync service will automatically pull updates from your GitHub repo (in our case, airflow_dags) into a shared volume that all Airflow services can read from.

This ensures that your environment always runs the latest DAGs from GitHub ,without downtime, restarts, or manual synchronization.

Remember in our docker-compose.yaml file, we had this kind of setup:

Setting Up Git in Docker Compose

Now, we’ll extend that structure by introducing git-sync as an additional service within the same services: section, and by adding an entry to the volumes: section (alongside the existing postgres-db-volume:, we also have to add airflow-dags-volume: so the same volume is shared across all containers).

Below is a configuration that works seamlessly with Docker on any OS:

services:
  git-sync:
    image: registry.k8s.io/git-sync/git-sync:v4.1.0
    user: "0:0"    # run as root so it can create /dags/git-sync
    restart: always
    environment:
      GITSYNC_REPO: "https://github.com/<your-username>/airflow_dags.git"
      GITSYNC_BRANCH: "main"           # use BRANCH not REF
      GITSYNC_PERIOD: "30s"
      GITSYNC_DEPTH: "1"
      GITSYNC_ROOT: "/dags/git-sync"
      GITSYNC_DEST: "repo"
      GITSYNC_LINK: "current"
      GITSYNC_ONE_TIME: "false"
      GITSYNC_ADD_USER: "true"
      GITSYNC_CHANGE_PERMISSIONS: "1"
      GITSYNC_STALE_WORKTREE_TIMEOUT: "24h"
    volumes:
      - airflow-dags-volume:/dags
    healthcheck:
      test: ["CMD-SHELL", "test -L /dags/git-sync/current && test -d /dags/git-sync/current/dags && [ \"$(ls -A /dags/git-sync/current/dags 2>/dev/null)\" ]"]
      interval: 10s
      timeout: 3s
      retries: 10
      start_period: 10s

volumes:
  airflow-dags-volume:

In this setup, the git-sync service runs as a lightweight companion container that keeps your Airflow DAGs in sync with your GitHub repository.

The GITSYNC_REPO variable tells it where to pull code from, in this case, your DAG repository (airflow_dags). Make sure you replace <your-username> with your exact GitHub username. The GITSYNC_BRANCH specifies which branch to track, usually main, while GITSYNC_PERIOD defines how often to check for updates. Here, it’s set to every 30 seconds, meaning Airflow will always be within half a minute of your latest Git push.

The synchronization happens inside the directory defined by GITSYNC_ROOT, which becomes /dags/git-sync inside the container. Inside that root, GITSYNC_DEST defines where the repo is cloned (as repo), and GITSYNC_LINK creates a symbolic link called current pointing to the active clone.

This design allows Airflow to always reference a stable, predictable path (/dags/git-sync/current/dags) even as the repository updates in the background, no path changes, no reloads.

A few environment flags ensure stability and portability across systems. For instance, GITSYNC_ADD_USER and GITSYNC_CHANGE_PERMISSIONS make sure the synced files are accessible to Airflow even when permissions differ across Docker environments.

GITSYNC_DEPTH limits the clone to just the latest commit (keeping it lightweight), while GITSYNC_STALE_WORKTREE_TIMEOUT helps clean up old syncs if something goes wrong.

The shared volume, airflow-dags-volume, acts as the bridge between git-sync and Airflow. It stores all synced DAGs in one central location accessible by both containers. The health check at the end ensures that git-sync is functioning: it verifies that the /current/dags directory exists and contains files before Airflow tries to load them.

Finally, the healthcheck section ensures that Airflow doesn’t start until git-sync has successfully cloned your repository. It runs a small shell command that checks three things: whether the symbolic link /dags/git-sync/current exists, whether the dags directory is present inside it, and whether that directory actually contains files. Only when all these conditions pass does Docker mark the git-sync service as healthy. The interval and retry parameters control how often and how long these checks run, ensuring that Airflow’s scheduler, api-server, and other components wait until the DAGs are fully available. This simple step prevents race conditions and guarantees a smooth startup every time.

Together, these settings ensure that your Airflow instance always runs the latest DAGs from GitHub, automatically, securely, and without manual file transfers.

Generally, this configuration does the following:

  • Creates a shared volume (airflow-dags-volume) where the DAGs are cloned.
  • Mounts it into both git-sync and Airflow services.
  • Runs git-sync as root to fix permission issues on Windows.
  • Keeps DAGs up to date every 30 seconds.

Adjusting the Airflow Configuration

We’ve now added git-sync as part of our Airflow services, sitting right alongside the api-server, scheduler, triggerer, and dag-processor.

This new service continuously pulls our DAGs from GitHub and stores them inside a shared volume (airflow-dags-volume) that both git-sync and Airflow can access.

However, our Airflow setup still expects to find DAGs through local directory mounts defined under each service (via x-airflow-common), not global named volumes. The default configuration maps these paths as follows:

volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
    - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
    - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
    - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins

This setup points Airflow to the local dags/ folder in your host machine, but now that we have git-sync, our DAGs will live inside a synchronized Git directory instead.

So we need to update the DAG volume mapping to pull from the new shared Git volume instead of the local one.

Replace the first line under the volumes: section, - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags, with:

- airflow-dags-volume:/opt/airflow/dags

This tells Docker to mount the shared airflow-dags-volume (created by git-sync) into Airflow’s /opt/airflow/dags directory.

That way, any DAGs pulled by git-sync from your GitHub repository will immediately appear inside Airflow’s working environment, without needing to rebuild or copy files.

We also need to explicitly tell Airflow where the synced DAGs live.

In the environment section of your x-airflow-common block, add the following:

AIRFLOW__CORE__DAGS_FOLDER: /opt/airflow/dags/git-sync/current/dags

This line links Airflow directly to the directory created by the git-sync container.

Here’s how it connects:

  • Inside the git-sync configuration, we defined:

    GITSYNC_ROOT: "/dags/git-sync"
    GITSYNC_LINK: "current"

    Together, these ensure that the most recent repository clone is always available under /dags/git-sync/current.

  • When we mount airflow-dags-volume:/opt/airflow/dags, this path becomes accessible inside the Airflow containers as

    /opt/airflow/dags/git-sync/current/dags.

By setting AIRFLOW__CORE__DAGS_FOLDER to that exact path, Airflow automatically watches the live Git-synced DAG directory for changes, meaning every new commit to your GitHub repo will reflect instantly in the Airflow UI.

Finally, ensure that Airflow waits for git-sync to finish cloning before starting up.

In the depends_on section of each Airflow service (airflow-scheduler, airflow-apiserver, dag-processor, and triggerer), add:

depends_on:
  git-sync:
    condition: service_healthy

This guarantees that Airflow only starts once the git-sync container has successfully pulled your repository, preventing race conditions during startup.

Once complete, Airflow will read its DAGs directly from the synchronized Git directory, /opt/airflow/dags/git-sync/current/dags, instead of your local project folder.

This change transforms your setup into a live, Git-driven workflow, where Airflow continuously tracks and loads the latest DAGs from GitHub automatically.

Automating Validation with GitHub Actions

Our Git integration wouldn’t be truly powerful without CI/CD (Continuous Integration and Continuous Deployment).

While git-sync ensures that any change pushed to GitHub automatically reflects in Airflow, that alone can be risky: not every change should make it to production immediately.

Imagine pushing a DAG with a missing import, a syntax error, or a bad dependency.

Airflow might fail to parse it, causing your scheduler or api-server to crash or restart repeatedly. That’s why we need a safety net, a way to automatically check that every DAG in our repository is valid before it ever reaches Airflow.

This is exactly where GitHub Actions comes in.

We can set up a lightweight CI pipeline that validates all DAGs whenever someone pushes to the main branch. If a broken DAG is detected, the pipeline fails, preventing the merge and protecting your Airflow environment from unverified code.

GitHub also provides notifications directly in your repository interface, showing failed workflows and highlighting the cause of the issue.

Inside your airflow_dags repository, create a GitHub Actions workflow file at:

.github/workflows/validate-dags.yml

name: Validate Airflow DAGs

on:
  push:
    branches: [ main ]
    paths:
      - 'dags/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install Airflow
        run: pip install apache-airflow==3.1.1

      - name: Validate DAGs
        run: |
          echo "Validating DAG syntax..."
          airflow dags list || exit 1

This simple workflow automatically runs every time you push a new commit to the main branch (or modify anything in the dags/ directory).

It installs Apache Airflow in a lightweight test environment, loads all your DAGs, and checks that they parse successfully, no missing imports, syntax issues, or circular dependencies.

If even one DAG fails to load, the validation job will exit with an error, causing the GitHub Actions pipeline to fail.

GitHub then immediately notifies you (and any collaborators) through the repository’s Actions tab, issue alerts, and optional email notifications.

By doing this, you’re adding a crucial layer of protection to your workflow:

  • Pre-deployment safety: invalid DAGs never reach your running Airflow instance.
  • Automatic feedback: failed DAGs trigger GitHub notifications, allowing you to fix errors early.
  • Confidence in deployment: once the pipeline passes, you know every DAG is production-ready.
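
If you want an even stricter check, the airflow dags list step can be swapped for a small parsing script built on Airflow’s DagBag. The sketch below is an optional variant, not part of the repository; it assumes it is run from the repository root and that any libraries your DAGs import at module level (for example pandas) are installed in the CI environment:

# validate_dags.py: optional stricter check using Airflow's DagBag
import sys

from airflow.models import DagBag

# Parse only the repository's dags/ folder, skipping Airflow's bundled examples
dag_bag = DagBag(dag_folder="dags", include_examples=False)

if dag_bag.import_errors:
    for path, error in dag_bag.import_errors.items():
        print(f"FAILED to import {path}:\n{error}")
    sys.exit(1)

print(f"All {len(dag_bag.dags)} DAG(s) parsed successfully.")

You would then call python validate_dags.py in the Validate DAGs step, and the job fails whenever any DAG file cannot be imported.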

Together, this CI validation and your git-sync setup create a self-updating, automated Airflow environment that mirrors production deployment practices.

With this final step, your Airflow environment becomes a versioned, automated, and production-ready orchestration system, capable of handling real data pipelines the way modern engineering teams do.

You’ve now completed a full transformation:

from local DAG development to automated, Git-driven deployment, all within Docker, all powered by Apache Airflow.

  • Note that both the git-sync service and the Airflow UI depend on your Docker containers running. As long as your containers are up, git-sync remains active, continuously checking for updates in your GitHub repository and syncing any new DAGs to your Airflow environment.

    Once you stop or shut down the containers (docker compose down), this synchronization pauses. You also won’t be able to access the Airflow Web UI or trigger DAGs until the containers are started again.

    When you restart with docker compose up -d, everything, including git-sync, resumes automatically, picking up the latest changes from GitHub and restoring your full Airflow setup just as you left it.

Summary and Up Next

In this tutorial, you completed the ETL lifecycle in Apache Airflow by adding the Load phase to your workflow and connecting it to a local MySQL database. You learned how Airflow securely manages external connections, dynamically handles multiple data regions, and enables in-UI data previews through XCom and Connections.

You also took your setup a step closer to production by integrating Git-based DAG management with git-sync, and implementing GitHub Actions CI to validate DAGs automatically before deployment.

Together, these changes transformed your environment into a version-controlled, automated orchestration system that mirrors the structure of production-grade setups, a final step before deploying to the cloud.

In the next tutorial, you’ll move beyond simulated data and build a real-world data pipeline, extracting data from an API, transforming it with Python, and loading it into MySQL. You’ll also add retries, alerts, and monitoring, and deploy the full workflow through CI/CD, achieving a truly end-to-end, production-grade Airflow setup.

Running and Managing Apache Airflow with Docker (Part I)

7 November 2025 at 22:50

In the last tutorial, we explored what workflow orchestration is, why it matters, and how Apache Airflow structures, automates, and monitors complex data pipelines through DAGs, tasks, and the scheduler. We examined how orchestration transforms scattered scripts into a coordinated system that ensures reliability, observability, and scalability across modern data workflows.

In this two-part hands-on tutorial, we move from theory to practice. You’ll run Apache Airflow inside Docker, the most efficient and production-like way to deploy Airflow for development and testing. This containerized approach mirrors how Airflow operates in real-world environments, from on-premises teams to managed services like Amazon MWAA and Cloud Composer.

In Part One, our focus goes beyond setup. You’ll learn how to work effectively with DAGs inside a Dockerized Airflow environment, writing, testing, visualizing, and managing them through the Web UI. You’ll use the TaskFlow API to build clean, Pythonic workflows and implement dynamic task mapping to run multiple processes in parallel. By the end of this part, you’ll have a fully functional Airflow environment running in Docker and a working DAG that extracts and transforms data automatically, the foundation of modern data engineering pipelines.

In Part Two, we’ll extend that foundation to handle data management and automation workflows. You’ll connect Airflow to a local MySQL database for data loading, manage credentials securely through the Admin panel and environment variables, and integrate Git with Git Sync to enable version control and continuous deployment. You’ll also see how CI/CD pipelines can automate DAG validation and deployment, ensuring your Airflow environment remains consistent and collaborative across development teams.

By the end of the series, you’ll not only understand how Airflow runs inside Docker but also how to design, orchestrate, and manage production-grade data pipelines the way data engineers do in real-world systems.

Why Use Docker for Airflow

While Airflow can be installed locally with pip install apache-airflow, this approach often leads to dependency conflicts, version mismatches, and complicated setups. Airflow depends on multiple services: an API server, scheduler, triggerer, metadata database, and dag-processor, all of which must communicate correctly. Installing and maintaining these manually on your local machine can be tedious and error-prone.

Docker eliminates these issues by packaging everything into lightweight, isolated containers. Each container runs a single Airflow component, but all work together seamlessly through Docker Compose. The result is a clean, reproducible environment that behaves consistently across operating systems.

In short:

  • Local installation: works for testing but often breaks due to dependency conflicts or version mismatches.
  • Cloud-managed services (like AWS ECS or Cloud Composer): excellent for production but less flexible for learning or prototyping.
  • Docker setup: combines realism with simplicity, providing the same multi-service environment used in production without the overhead of manual configuration.

The Docker setup is ideal for learning and development and closely mirrors production environments, but additional configuration is needed for a full production deployment.

Prerequisites

Before you begin, ensure the following are installed and ready on your system:

  1. Docker Desktop – Required to build and run Airflow containers.
  2. A code editor – Visual Studio Code or similar, for writing and editing DAGs.
  3. Python 3.10 or higher – Used for authoring Airflow DAGs and helper scripts.

Running Airflow Using Docker

Now that your environment is ready (Docker is open and running), let’s get Airflow running using Docker Compose.

This tool orchestrates all Airflow services (api-server, scheduler, triggerer, database, and workers) so they start and communicate properly.

Clone the Tutorial Repository

We’ve already prepared the starter files you’ll need for this tutorial on GitHub.

Begin by cloning the repository:

git clone git@github.com:dataquestio/tutorials.git

Then navigate to the Airflow tutorial folder:

cd tutorials/airflow-docker-tutorial

This is the directory where you’ll be working throughout the tutorial.

Inside, you’ll notice a structure similar to this:

airflow-docker-tutorial/
├── part-one/  
├── part-two/
├── docker-compose.yaml
└── README.md

  • The part-one/ and part-two/ folders contain the complete reference files for both tutorials (Part One and Part Two).

    You don’t need to modify anything there, it’s only for comparison or review.

  • The docker-compose.yaml file is your starting point and will evolve as the tutorial progresses.

Explore the Docker Compose File

Open the docker-compose.yaml file in your code editor.

This file defines all the Airflow components and how they interact inside Docker.

It includes:

  • api-server – Airflow’s web user interface
  • Scheduler – Parses and triggers DAGs
  • Triggerer – Manages deferrable tasks efficiently
  • Metadata database – Tracks DAG runs and task states
  • Executors – Execute tasks

Each of these services runs in its own container, but together they form a single working Airflow environment.

You’ll be updating this file as you move through the tutorial to configure, extend, and manage your Airflow setup.

Create Required Folders

Airflow expects certain directories to exist before launching.

Create them inside the same directory as your docker-compose.yaml file:

mkdir -p ./dags ./logs ./plugins ./config

  • dags/ – your workflow scripts
  • logs/ – task execution logs
  • plugins/ – custom hooks and operators
  • config/ – optional configuration overrides (this will be auto-populated later when initializing the database)

Configure User Permissions

If you’re using Linux, set a user ID to prevent permission issues when Docker writes files locally:

echo -e "AIRFLOW_UID=$(id -u)" > .env

If you’re using macOS or Windows, manually create a .env file in the same directory with the following content:

AIRFLOW_UID=50000

This ensures consistent file ownership between your host system and the Docker containers.

Initialize the Metadata Database

Airflow keeps track of DAG runs, task states, and configurations in a metadata database.

Initialize it by running:

docker compose up airflow-init

Once initialization completes, you’ll see a message confirming that an admin user has been created with default credentials:

  • Username: airflow
  • Password: airflow

Start Airflow

Now start all Airflow services in the background:

docker compose up -d

Docker Compose will spin up the scheduler, API server, triggerer, database, and worker containers.


Start Airflow

Make sure the triggerer, dag-processor, scheduler, and api-server are shown as started as above. If that is not the case, rebuild the Docker container, since the build process might have been interrupted. Otherwise, navigate to http://localhost:8080 to access the Airflow UI exposed by the api-server.

You can also access this through your Docker app, by navigating to containers:

Start Airflow (2)

Log in using the credentials above to access the Airflow Web UI.

  • If the UI fails to load or some containers keep restarting, increase Docker’s memory allocation to at least 4 GB (8 GB recommended) in Docker Desktop → Settings → Resources.

Configuring the Airflow Project

Once Airflow is running and you visit http://localhost:8080, you’ll see the Airflow Web UI.

Configuring the Airflow Project

This is the command center for your workflows, where you can visualize DAGs, monitor task runs, and manage system configurations. When you navigate to Dags, you’ll see a dashboard that lists several example DAGs provided by the Airflow team. These are sample workflows meant to demonstrate different operators, sensors, and features.

However, for this tutorial, we’ll build our own clean environment, so we’ll remove these example DAGs and customize our setup to suit our project.

Before doing that, though, it’s important to understand the docker-compose.yaml file, since this is where your Airflow environment is actually defined.

Understanding the docker-compose.yaml File

The docker-compose.yaml file tells Docker how to build, connect, and run all the Airflow components as containers.

If you open it, you’ll see multiple sections that look like this:

Understanding the Docker Compose File

Let’s break this down briefly:

  • x-airflow-common – This is the shared configuration block that all Airflow containers inherit from. It defines the base Docker image (apache/airflow:3.1.0), key environment variables, and mounted volumes for DAGs, logs, and plugins. It also specifies user permissions to ensure that files created inside the containers are accessible from your host machine. The depends_on lists dependencies such as the PostgreSQL database used to store Airflow metadata. In short, this section sets up the common foundation for every container in your environment.
  • services – This section defines the actual Airflow components that make up your environment. Each service, such as the api-server, scheduler, triggerer, dag-processor, and metadata database, runs as a separate container but uses the shared configuration from x-airflow-common. Together, they form a complete Airflow deployment where each container plays a specific role.
  • volumes – This section sets up persistent storage for containers. Airflow uses it by default for the Postgres database, keeping your DAGs, logs, and configurations saved across runs. In part 2, we’ll expand it to include Git integration.

Each of these sections works together to create a unified Airflow environment that’s easy to configure, extend, or simplify as needed.

Understanding these parts now will make the next steps - cleaning, customizing, and managing your Airflow setup - much clearer.

Resetting the Environment Before Making Changes

Before editing anything inside the docker-compose.yaml, it’s crucial to shut down your containers cleanly to avoid conflicts.

Run: docker compose down -v

Here’s what this does:

  • docker compose down stops and removes all containers.
  • The -v flag removes volumes, which clears stored metadata, logs, and configurations.

    This ensures that you start with a completely fresh environment the next time you launch Airflow — which can be helpful when your environment becomes misconfigured or broken. However, you shouldn’t do this routinely after every DAG or configuration change, as it will also remove your saved Connections, Variables, and other stateful data. In most cases, you can simply run docker compose down instead to stop the containers without wiping the environment.

Disabling Example DAGs

By default, Airflow loads several example DAGs to help new users explore its features. For our purposes, we want a clean workspace that only shows our own DAGs.

  1. Open the docker-compose.yaml file in your code editor.
  2. Locate the environment section under x-airflow-common and find this line: AIRFLOW__CORE__LOAD_EXAMPLES: 'true'. Change 'true' to 'false': AIRFLOW__CORE__LOAD_EXAMPLES: 'false'

This setting tells Airflow not to load any of the example workflows when it starts.

Once you’ve made the changes:

  1. Save your docker-compose.yaml file.
  2. Rebuild and start your Airflow environment again: docker compose up -d
  3. Wait a few moments, then visit http://localhost:8080 again.

This time, when you log in, you’ll notice the example DAGs are gone, leaving you with a clean workspace ready for your own workflows.

Disabling Example DAGs

Let’s now build our first DAG.

Working with DAGs in Airflow

Now that your Airflow environment is clean and running, it’s time to create our first real workflow.

This is where you begin writing DAGs (Directed Acyclic Graphs), which sit at the very heart of how Airflow operates.

A DAG is more than just a piece of code, it’s a visual and logical representation of your workflow, showing how tasks connect, when they run, and in what order.

Each task in a DAG represents a distinct step in your process, such as pulling data, cleaning it, transforming it, or loading it into a database. In this tutorial we will create tasks that extract and transform data. We will see the loading process in part two, along with how Airflow integrates with Git.

Airflow ensures these tasks execute in the correct order without looping back on themselves (that’s what acyclic means).

Setting Up Your DAG File

Let’s start by creating the foundation of our workflow (make sure to shut down the running containers by using docker compose down -v).

Open your airflow-docker-tutorial project folder and, inside the dags/ directory, create a new file named:

our_first_dag.py

Every .py file you place in this folder becomes a workflow that Airflow can recognize and manage automatically.

You don’t need to manually register anything, Airflow continuously scans this directory and loads any valid DAGs it finds.

At the top of our file, let’s import the core libraries we need for our project:

from airflow.decorators import dag, task
from datetime import datetime, timedelta
import pandas as pd
import random
import os

Let’s pause to understand what each of these imports does and why they matter:

  • dag and task come from Airflow’s TaskFlow API.

    These decorators turn plain Python functions into Airflow-managed tasks, giving you cleaner, more intuitive code while Airflow handles orchestration behind the scenes.

  • datetime and timedelta handle scheduling logic.

    They help define when your DAG starts and how frequently it runs.

  • pandas, random, and os are standard Python libraries we’ll use to simulate a simple ETL process, generating, transforming, and saving data locally.

This setup might seem minimal, but it’s everything you need to start orchestrating real tasks.

Defining the DAG Structure

With our imports ready, the next step is to define the skeleton of our DAG, its blueprint.

Think of this as defining when and how your workflow runs.

default_args = {
    "owner": "Your name",
    "retries": 3,
    "retry_delay": timedelta(minutes=1),
}

@dag(
    dag_id="daily_etl_pipeline_airflow3",
    description="ETL workflow demonstrating dynamic task mapping and assets",
    schedule="@daily",
    start_date=datetime(2025, 10, 29),
    catchup=False,
    default_args=default_args,
    tags=["airflow3", "etl"],
)
def daily_etl_pipeline():
    ...

dag = daily_etl_pipeline()

Let’s break this down carefully:

  • default_args

    This dictionary defines shared settings for all tasks in your DAG.

    Here, each task will automatically retry up to three times with a one-minute delay between attempts, a good practice when your tasks depend on external systems like APIs or databases that can occasionally fail.

  • The @dag decorator

    This tells Airflow that everything inside the daily_etl_pipeline() function (which can be given any name) belongs to one cohesive workflow.

    It defines:

    • schedule="@daily" → when the DAG should run.
    • start_date → the first execution date.
    • catchup=False → prevents Airflow from running past-due DAGs automatically.
    • tags → helps you categorize DAGs in the UI.
  • The daily_etl_pipeline() function

    This is the container for your workflow logic, it’s where you’ll later define your tasks and how they depend on one another.

    Think of it as the “script” that describes what happens in each run of your DAG.

  • dag = daily_etl_pipeline()

    This single line instantiates the DAG. It’s what makes your workflow visible and schedulable inside Airflow.

This structure acts as the foundation for everything that follows.

If we think of a DAG as a movie script, this section defines the production schedule and stage setup before the actors (tasks) appear.

Creating Tasks with the TaskFlow API

Now it’s time to define the stars of our workflow, the tasks.

Tasks are the actual units of work that Airflow runs. Each one performs a specific action, and together they form your complete data pipeline.

Airflow’s TaskFlow API makes this remarkably easy: you simply decorate ordinary Python functions with @task, and Airflow takes care of converting them into fully managed, trackable workflow steps.

We’ll start with two tasks:

  • Extract → simulates pulling or generating data.
  • Transform → processes and cleans the extracted data.

(We’ll add the Load step in the next part of this tutorial.)

Extract Task — Generating Fake Data

@task
def extract_market_data():
    """
    Simulate extracting market data for popular companies.
    This task mimics pulling live stock prices or API data.
    """
    companies = ["Apple", "Amazon", "Google", "Microsoft", "Tesla", "Netflix", "NVIDIA", "Meta"]

    # Simulate today's timestamped price data
    records = []
    for company in companies:
        price = round(random.uniform(100, 1500), 2)
        change = round(random.uniform(-5, 5), 2)
        records.append({
            "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "company": company,
            "price_usd": price,
            "daily_change_percent": change,
        })

    df = pd.DataFrame(records)
    os.makedirs("/opt/airflow/tmp", exist_ok=True)
    raw_path = "/opt/airflow/tmp/market_data.csv"
    df.to_csv(raw_path, index=False)

    print(f"[EXTRACT] Market data successfully generated at {raw_path}")
    return raw_path

Let’s unpack what’s happening here:

  • The function simulates the extraction phase of an ETL pipeline by generating a small, timestamped dataset of popular companies and their simulated market prices.
  • Each record includes a company name, current price in USD, and a randomly generated daily percentage change, mimicking what you’d expect from a real API response or financial data feed.
  • The data is stored in a CSV file inside /opt/airflow/tmp, a shared directory accessible from within your Docker container, this mimics saving raw extracted data before it’s cleaned or transformed.
  • Finally, the function returns the path to that CSV file. This return value becomes crucial because Airflow automatically treats it as the output of this task. Any downstream task that depends on it, for example, a transformation step, can receive it as an input automatically.

In simpler terms, Airflow handles the data flow for you. You focus on defining what each task does, and Airflow takes care of passing outputs to inputs behind the scenes, ensuring your pipeline runs smoothly and predictably.

Transform Task — Cleaning and Analyzing Market Data

@task
def transform_market_data(raw_file: str):
    """
    Clean and analyze extracted market data.
    This task simulates transforming raw stock data
    to identify the top gainers and losers of the day.
    """
    df = pd.read_csv(raw_file)

    # Clean: ensure numeric fields are valid
    df["price_usd"] = pd.to_numeric(df["price_usd"], errors="coerce")
    df["daily_change_percent"] = pd.to_numeric(df["daily_change_percent"], errors="coerce")

    # Sort companies by daily change (descending = top gainers)
    df_sorted = df.sort_values(by="daily_change_percent", ascending=False)

    # Select top 3 gainers and bottom 3 losers
    top_gainers = df_sorted.head(3)
    top_losers = df_sorted.tail(3)

    # Save transformed files
    os.makedirs("/opt/airflow/tmp", exist_ok=True)
    gainers_path = "/opt/airflow/tmp/top_gainers.csv"
    losers_path = "/opt/airflow/tmp/top_losers.csv"

    top_gainers.to_csv(gainers_path, index=False)
    top_losers.to_csv(losers_path, index=False)

    print(f"[TRANSFORM] Top gainers saved to {gainers_path}")
    print(f"[TRANSFORM] Top losers saved to {losers_path}")

    return {"gainers": gainers_path, "losers": losers_path}

Let’s unpack what this transformation does and why it’s important:

  • The function begins by reading the extracted CSV file produced by the previous task (extract_market_data). This is our “raw” dataset.
  • Next, it cleans the data, converting prices and percentage changes into numeric formats, a vital first step before analysis, since raw data often arrives as text.
  • It then sorts companies by their daily percentage change, allowing us to quickly identify which ones gained or lost the most value during the day.
  • Two smaller datasets are then created: one for the top gainers and one for the top losers, each saved as separate CSV files in the same temporary directory.
  • Finally, the task returns both file paths as a dictionary, allowing any downstream task (for example, a visualization or database load step) to easily access both datasets.

This transformation demonstrates how Airflow tasks can move beyond simple sorting; they can perform real business logic, generate multiple outputs, and return structured data to other steps in the workflow.

At this point, your DAG has two working tasks:

  • Extract — to simulate data collection
  • Transform — to clean and analyze that data

When Airflow runs this workflow, it will execute them in order:

Extract → Transform

Now that both the Extract and Transform tasks are defined inside your DAG, let’s see how Airflow links them together when you call them in sequence.

Inside your daily_etl_pipeline() function, add these two lines to establish the task order:

raw = extract_market_data()
transformed = transform_market_data(raw)

When Airflow parses the DAG, it doesn’t see these as ordinary Python calls, it reads them as task relationships.

The TaskFlow API automatically builds a dependency chain, so Airflow knows that extract_market_data must complete before transform_market_data begins.

Notice that we’ve assigned extract_market_data() to a variable called raw. This variable represents the output of the first task, in our case, the path to the extracted data file. The next line, transform_market_data(raw), then takes that output and uses it as input for the transformation step.

This pattern makes the workflow clear and logical: data is extracted, then transformed, with Airflow managing the sequence automatically behind the scenes.

This is how Airflow builds the workflow graph internally: by reading the relationships you define through function calls.
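
When a downstream task doesn’t consume an upstream result, there is no data flow for Airflow to infer from, and you declare the ordering explicitly with the >> operator instead. Here is a minimal sketch (send_notification is a hypothetical task, not part of our DAG):

raw = extract_market_data()
transformed = transform_market_data(raw)   # dependency inferred from the data flow

notify = send_notification()               # hypothetical task with no inputs
transformed >> notify                      # explicit ordering: notify runs after the transform

Both styles can be mixed freely inside the same DAG function.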

Visualizing the Workflow in the Airflow UI

Once you’ve saved your DAG file with both tasks, Extract and Transform, it’s time to bring it to life. Start your Airflow environment using:

docker compose up -d

Then open your browser and navigate to: http://localhost:8080

You’ll be able to see the Airflow Home page, this time with the DAG we just created: daily_etl_pipeline_airflow3.

Visualizing the Workflow in the Airflow UI

Click on it to open the DAG details, then trigger a manual run using the Play button.

The task currently running will turn blue, and once it completes successfully, it will turn green.

Visualizing the Workflow in the Airflow UI (2)

On the graph view, you will also see two tasks: extract_market_data and transform_market_data , connected in sequence showing success in each.

Visualizing the Workflow in the Airflow UI (3)

If a task encounters an issue, Airflow will automatically retry it up to three times (as defined in default_args). If it continues to fail after all retries, it will appear red, indicating that the task, and therefore the DAG run, has failed.

Inspecting Task Logs

Click on any task box (for example, transform_market_data), then click Task Instances.

Inspecting Task Logs

All DAG runs for the selected task will be listed here. Click on the latest run. This will open a detailed log of the task’s execution, an invaluable feature for debugging and understanding what’s happening under the hood.

In your log, you’ll see:

  • The [EXTRACT] or [TRANSFORM] tags you printed in the code.
  • Confirmation messages showing where your files were saved, e.g.:

    Inspecting Task Logs (2)

    Inspecting Task Logs (3)

These messages prove that your tasks executed correctly and help you trace your data through each stage of the pipeline.

Dynamic Task Mapping

As data engineers, we rarely process just one dataset; we usually work with many sources at once.

For example, instead of analyzing one market, you might process stock data from multiple exchanges or regions simultaneously.

In our current DAG, the extraction and transformation handle only a single dataset.

But what if we wanted to repeat that same process for several markets, say, us, europe, asia, and africa, all in parallel?

Writing a separate task for each region would make our DAG repetitive and hard to maintain.

That’s where Dynamic Task Mapping comes in.

It allows Airflow to create parallel tasks automatically at runtime based on input data such as lists, dictionaries, or query results.

Before editing the DAG, stop any running containers to ensure Airflow picks up your changes cleanly:

docker compose down -v

Now, extend your existing daily_etl_pipeline_airflow3 to handle multiple markets dynamically:

def daily_etl_pipeline():

    @task
    def extract_market_data(market: str):
        ...
    @task
    def transform_market_data(raw_file: str):
      ...

    # Define markets to process dynamically
    markets = ["us", "europe", "asia", "africa"]

    # Dynamically create parallel tasks
    raw_files = extract_market_data.expand(market=markets)
    transformed_files = transform_market_data.expand(raw_file=raw_files)

dag = daily_etl_pipeline()

By using .expand(), Airflow automatically generates multiple parallel task instances from a single function. You’ll notice the argument market passed into the extract_market_data() function. For that to work effectively, here’s the updated version of the extract_market_data() function:

@task
def extract_market_data(market: str):
        """Simulate extracting market data for a given region or market."""
        companies = ["Apple", "Amazon", "Google", "Microsoft", "Tesla", "Netflix"]
        records = []
        for company in companies:
            price = round(random.uniform(100, 1500), 2)
            change = round(random.uniform(-5, 5), 2)
            records.append({
                "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
                "market": market,
                "company": company,
                "price_usd": price,
                "daily_change_percent": change,
            })

        df = pd.DataFrame(records)
        os.makedirs("/opt/airflow/tmp", exist_ok=True)
        raw_path = f"/opt/airflow/tmp/market_data_{market}.csv"
        df.to_csv(raw_path, index=False)
        print(f"[EXTRACT] Market data for {market} saved at {raw_path}")
        return raw_path

We also updated our transform_market_data() task to align with this dynamic setup:

@task
def transform_market_data(raw_file: str):
    """Clean and analyze each regional dataset."""
    df = pd.read_csv(raw_file)
    df["price_usd"] = pd.to_numeric(df["price_usd"], errors="coerce")
    df["daily_change_percent"] = pd.to_numeric(df["daily_change_percent"], errors="coerce")
    df_sorted = df.sort_values(by="daily_change_percent", ascending=False)

    top_gainers = df_sorted.head(3)
    top_losers = df_sorted.tail(3)

    transformed_path = raw_file.replace("market_data_", "transformed_")
    top_gainers.to_csv(transformed_path, index=False)
    print(f"[TRANSFORM] Transformed data saved at {transformed_path}")
    return transformed_path

Both extract_market_data() and transform_market_data() now work together dynamically:

  • extract_market_data() generates a unique dataset per region (e.g., market_data_us.csv, market_data_europe.csv).
  • transform_market_data() then processes each of those files individually and saves transformed versions (e.g., transformed_us.csv).

Generally:

  • One extract task is created for each market (us, europe, asia, africa).
  • Each extract’s output file becomes the input for its corresponding transform task.
  • Airflow handles all the mapping logic automatically, no loops or manual duplication needed.
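If some arguments should stay the same for every mapped task, you can combine .partial() with .expand(). Here's a minimal sketch, assuming you add a hypothetical currency parameter to the extract task; only market is mapped, while currency is fixed for all four instances:

@task
def extract_market_data(market: str, currency: str):
    ...

# currency is pinned once; market still fans out into four parallel task instances
raw_files = extract_market_data.partial(currency="USD").expand(market=markets)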

Let’s redeploy our containers by running docker compose up -d.

You’ll see this clearly in the Graph View, where the DAG fans out into several parallel branches, one per market.

Dynamic Task Mapping

Each branch runs independently, and Airflow retries or logs failures per task as defined in default_args. You’ll notice that there are four task instances, which clearly correspond to the four market regions we processed.

Dynamic Task Mapping (2)

When you click any of the tasks, for example extract_market_data, and open the logs, you’ll see that the data for the corresponding market region was extracted and saved independently.

Dynamic Task Mapping (3)

Dynamic Task Mapping (4)

Dynamic Task Mapping (5)

Dynamic Task Mapping (6)

Summary and What’s Next

We have built a complete foundation for working with Apache Airflow inside Docker. You learned how to deploy a fully functional Airflow environment using Docker Compose, understand its architecture, and configure it for clean, local development. We explored the Airflow Web UI, and used the TaskFlow API to create our first real workflow, a simple yet powerful ETL pipeline that extracts and transforms data automatically.

By extending it with Dynamic Task Mapping, we saw how Airflow can scale horizontally by processing multiple datasets in parallel, creating independent task instances for each region without duplicating code.

In Part Two, we’ll build on this foundation and introduce the Load phase of our ETL pipeline. You’ll connect Airflow to a local MySQL database and learn how to configure Connections through the Admin panel and environment variables. We’ll also integrate Git and Git Sync to automate DAG deployment and introduce CI/CD pipelines for version-controlled, collaborative Airflow workflows.

By the end of the next part, your environment will evolve from a development sandbox into a production-ready data orchestration system, capable of automating data ingestion, transformation, and loading with full observability, reliability, and continuous integration support.

Install PostgreSQL 14.7 on Ubuntu

4 November 2025 at 22:23

In this tutorial, you'll learn how to install PostgreSQL 14.7 on your Ubuntu system. The process is straightforward and consists of the following steps:

  1. Update your system packages
  2. Install PostgreSQL
  3. Set up the superuser
  4. Download the Northwind PostgreSQL SQL file
  5. Create a new Database
  6. Import the Northwind SQL file
  7. Verify the Northwind database installation
  8. Connect to the Database Using Jupyter Notebook

Prerequisites

To follow this tutorial, you should be running Ubuntu 20.04 LTS or later.

Step 1: Update System Packages

First, you need to update the system packages. Open the Terminal app ("Ctrl + Alt + T") and enter the following command:

sudo apt update && sudo apt upgrade -y

Enter your admin password when prompted. This command refreshes the package lists (picking up any packages newly added to the repositories) and then upgrades the currently installed packages. The -y option automatically answers 'yes' to all prompts, making the process non-interactive.

Note: sudo is a prefix that gives you superuser permissions for a command, which is often necessary when making system-wide changes like installing or upgrading software. Be careful when using sudo, as it provides complete control over your system, including the ability to break it if misused.

Step 2: Install PostgreSQL

With the system packages updated, you're ready to install PostgreSQL.

To install the PostgreSQL package, use the apt package manager:

sudo apt install postgresql-14

You may be prompted to confirm the amount of space the installation requires on your local system. After the installation is complete, check the status of the PostgreSQL service:

systemctl status postgresql

When you run this command, it will display information such as whether the service is active or inactive, when it was started, the process ID, and recent log entries. You'll know that it has been installed successfully if you see a line similar to Loaded: loaded (/lib/systemd/system/postgresql.service; enabled; vendor preset: enabled), indicating the system has successfully read the PostgreSQL service file.

After you run systemctl status postgresql, you should find yourself back at the command prompt. If not, and you're stuck in a view of log files, you might be in a "less" or "more" program that lets you scroll through the logs. You can typically exit this view and return to the command prompt by pressing q. If that doesn't work, then "Ctrl + C" will send an interrupt signal to the current process and return you to the command line.

Step 3: Setting up the postgres user

PostgreSQL automatically creates a user (also known as a "role") named postgres, which has superuser privileges. To ensure you'll be able to use PostgreSQL without any issues, set a password for this user with the following command:

sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'your_password';"

Replace your_password with a new password and make sure it is wrapped in single quotes. Please note, this is not your local user account's password. This password will be used to connect to your PostgreSQL database with superuser privileges, so make sure it's strong and secure. This command will run the psql command as the postgres user, and pass it a SQL command to change the postgres user's password to your_password.

In PostgreSQL, the terms "USER" and "ROLE" are essentially interchangeable. The ALTER USER command is actually an alias for ALTER ROLE, which is why you see ALTER ROLE as the confirmation message.

So when you see ALTER ROLE, it just means that the password change was successful and the postgres role (or user, in everyday terms) has a new password. You're now able to use this new password to connect to PostgreSQL as the postgres user.
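If you want to confirm the new password works, you can connect over TCP instead of the local socket; unlike sudo -u postgres psql (which uses peer authentication), this should prompt for the password you just set:

psql -U postgres -h localhost -p 5432 -d postgres

Type \q to exit once you've verified the connection.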

Step 4: Download the Northwind PostgreSQL SQL file

First, you need to download a version of the Northwind database that's compatible with PostgreSQL. You can find an adapted version on GitHub. To download the SQL file, follow these two steps:

  1. From the Terminal, create a new directory for the Northwind database and navigate to it:

    mkdir northwind && cd northwind
  2. Download the Northwind PostgreSQL SQL file using wget:

    wget https://raw.githubusercontent.com/pthom/northwind_psql/master/northwind.sql

    This will download the northwind.sql file to the northwind directory you created above.

Step 5: Create a new PostgreSQL database

Before importing the Northwind SQL file, you must create a new PostgreSQL database. Follow these three steps:

  1. Connect to the PostgreSQL server as the postgres user:

    sudo -u postgres psql

    This command is telling the system to execute the psql command as the postgres user. psql is the interactive terminal for PostgreSQL, and when it starts, it changes the command prompt to let you know that you're interacting with the PostgreSQL command-line and not the system command-line.

    Once you've run sudo -u postgres psql, your terminal prompt will change to something similar to postgres=# to indicate you're connected to the postgres database.

  2. Create a new database called northwind:

    postgres=# CREATE DATABASE northwind;

    You'll see "CREATE DATABASE" is returned if the command is successful.

  3. Exit the psql command-line interface:

    postgres=# \q

Step 6: Import the Northwind SQL file

With the northwind database created, you can import the Northwind SQL file using psql. Follow these steps:

  • In your Terminal, ensure you're in the northwind directory where you downloaded the northwind.sql file.
  • Run the following command to import the Northwind SQL file into the northwind database:

    sudo -u postgres psql -d northwind -f northwind.sql

    This command connects to the PostgreSQL server as the postgres user, selects the northwind database, and executes the SQL commands in the northwind.sql file.

Step 7: Verify the Northwind database installation

To verify that the Northwind database has been installed correctly, follow these four steps:

  1. Connect to the northwind database using psql:

    sudo -u postgres psql -d northwind
  2. List the tables in the Northwind database:

    northwind=# \dt

    You should see a list of Northwind tables: categories, customers, employees, orders, and more.

  3. Run a sample query to ensure the data has been imported correctly. For example, you can query the customers table:

    northwind=# SELECT * FROM customers LIMIT 5;

    This should return the first five rows from the customers table. Similar to above when you used systemctl status postgresql, you might be in a "less" or "more" program that lets you scroll through the results of the query. Press q to return to the psql command-line interface.

  4. Exit the psql command-line interface:

    northwind=# \q

Step 8: Connect to the Database Using Jupyter Notebook

As we wrap up our installation, we will now introduce Jupyter Notebook as one of the tools available for executing SQL queries and analyzing the Northwind database. Jupyter Notebook offers a convenient and interactive platform that simplifies the visualization and sharing of query results, but it's important to note that it is an optional step. You can also access Postgres through other means. However, we highly recommend using Jupyter Notebook for its numerous benefits and enhanced user experience.

To set up the necessary tools and establish a connection to the Northwind database, here is an overview of what each step will do:

  • !pip install ipython-sql: This command installs the ipython-sql package. This package enables you to write SQL queries directly in your Jupyter Notebook, making it easier to execute and visualize the results of your queries within the notebook environment.
  • %load_ext sql: This magic command loads the sql extension for IPython. By loading this extension, you can use the SQL magic commands, such as %sql and %%sql, to run SQL queries directly in the Jupyter Notebook cells.
  • %sql postgresql://postgres@localhost:5432/northwind: This command establishes a connection to the Northwind database using the PostgreSQL database system. The connection string has the following format:

    postgresql://username@hostname:port/database_name

    • In this case, username is postgres, hostname is localhost, port is 5432, and database_name is northwind. The %sql magic command allows you to run a single-line SQL query in the Jupyter Notebook.
  1. Copy the following text into a code cell in the Jupyter Notebook:

    !pip install ipython-sql
    %load_ext sql
    %sql postgresql://postgres@localhost:5432/northwind
  2. Run the cell by either:

    • Clicking the "Run" button on the menu bar.
    • Using the keyboard shortcut: Shift + Enter or Ctrl + Enter.
  3. Upon successful connection, you should see an output similar to the following:

    'Connected: postgres@northwind'

    This output confirms that you are now connected to the Northwind database, and you can proceed with the guided project in your Jupyter Notebook environment.

Once you execute these commands, you'll be connected to the Northwind database, and you can start writing SQL queries in your Jupyter Notebook using the %sql or %%sql magic commands.
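For example, you could re-run the verification query from Step 7 directly in a notebook cell, or try a small aggregate like the one below (this assumes the standard Northwind customers table with a country column):

%%sql
SELECT country, COUNT(*) AS customer_count
FROM customers
GROUP BY country
ORDER BY customer_count DESC
LIMIT 5;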

Next Steps

Based on what you've accomplished, here are some potential next steps to continue your learning journey:

  1. Deepen Your SQL Knowledge:
    • Try formulating more complex queries on the Northwind database to improve your SQL skills. These could include joins, subqueries, and aggregations.
    • Understand the design of the Northwind database: inspect the tables, their relationships, and how data is structured.
  2. Experiment with Database Management:
    • Learn how to backup and restore databases in PostgreSQL. Try creating a backup of your Northwind database.
    • Explore different ways to optimize your PostgreSQL database performance like indexing and query optimization.
  3. Integration with Python:
    • Learn how to use psycopg2, a popular PostgreSQL adapter for Python, to interact with your database programmatically.
    • Experiment with ORM (Object-Relational Mapping) libraries like SQLAlchemy to manage your database using Python.

Install PostgreSQL 14.7 on Windows 10

4 November 2025 at 22:21

In this tutorial, you'll learn how to install PostgreSQL 14.7 on Windows 10.

The process is straightforward and consists of the following steps:

  1. Install PostgreSQL
  2. Configure Environment Variables
  3. Verify the Installation
  4. Download the Northwind PostgreSQL SQL file
  5. Create a New PostgreSQL Database
  6. Import the Northwind SQL file
  7. Verify the Northwind database installation
  8. Connect to the Database Using Jupyter Notebook

Prerequisites

  • A computer running Windows 10
  • Internet connection

To get the installer:

  1. Download the official PostgreSQL 14.7 installer at https://get.enterprisedb.com/postgresql/postgresql-14.7-2-windows-x64.exe
  2. Save the installer executable to your computer and run it.

Note: We recommend version 14.7 because it is widely used and matches the steps in this tutorial. Newer versions are available, but their installers and features differ.

Step 1: Install PostgreSQL

We're about to initiate a vital part of this project - installing and configuring PostgreSQL.

Throughout this process, you'll define critical settings like the installation directory, components, data directory, and the initial 'postgres' user password. This password grants administrative access to your PostgreSQL system. Additionally, you'll choose the default port for connections and the database cluster locale.

Each choice affects your system's operation, file storage, available tools, and security. We're here to guide you through each decision to ensure optimal system functioning.

  1. In the PostgreSQL Setup Wizard, click Next to begin the installation process.

  2. Accept the default installation directory or choose a different directory by clicking Browse. Click Next to continue.

  3. Choose the components you want to install (e.g., PostgreSQL Server, pgAdmin 4 (optional), Stack Builder (optional), Command Line Tools), and click Next.

  4. Select the data directory for storing your databases and click Next.

  5. Set a password for the PostgreSQL “postgres” user and click Next.

    • There will be some points where you're asked to enter a password in the command prompt. It's important to note that for security reasons, as you type your password, no characters will appear on the screen. This standard security feature is designed to prevent anyone from looking over your shoulder and seeing your password. So, when you're prompted for your password, don't be alarmed if you don't see any response on the screen as you type. Enter your password and press 'Enter'. Most systems will allow you to re-enter the password if you make a mistake.

    • Remember, it's crucial to remember the password you set during the installation, as you'll need it to connect to your PostgreSQL databases in the future.

  6. Choose the default port number (5432) or specify a different port, then click Next.

  7. Select the locale to be used by the new database cluster and click Next.

  8. Review the installation settings and click Next to start the installation process. The installation may take a few minutes.

  9. Once the installation is complete, click Finish to close the Setup Wizard.

Step 2: Configure Environment Variables

Next, we're going to configure environment variables on your Windows system. Why are we doing this? Well, environment variables are a powerful feature of operating systems that allow us to specify values - like directory locations - that can be used by multiple applications. In our case, we need to ensure that our system can locate the PostgreSQL executable files stored in the "bin" folder of the PostgreSQL directory.

By adding the PostgreSQL "bin" folder path to the system's PATH environment variable, we're telling our operating system where to find these executables. This means you'll be able to run PostgreSQL commands directly from the command line, no matter what directory you're in, because the system will know where to find the necessary files. This makes working with PostgreSQL more convenient and opens up the possibility of running scripts that interact with PostgreSQL.

Now, let's get started with the steps to configure your environment variables on Windows!

  1. On the Windows taskbar, right-click the Windows icon and select System.

  2. Click on Advanced system settings in the left pane.

  3. In the System Properties dialog, click on the Environment Variables button.

  4. Under the System Variables section, scroll down and find the Path variable. Click on it to select it, then click the Edit button.

  5. In the Edit environment variable dialog, click the New button and add the path to the PostgreSQL bin folder, typically C:\Program Files\PostgreSQL\14\bin.

  6. Click OK to close the "Edit environment variable" dialog, then click OK again to close the "Environment Variables" dialog, and finally click OK to close the "System Properties" dialog.

Step 3: Verify the Installation

After going through the installation and configuration process, it's essential to verify that PostgreSQL is correctly installed and accessible. This gives us the assurance that the software is properly set up and ready to use, which can save us from troubleshooting issues later when we start interacting with databases.

If something went wrong during installation, this verification process will help you spot the problem early before creating or managing databases.

Now, let's go through the steps to verify your PostgreSQL installation.

  1. Open the Command Prompt by pressing Win + R, typing cmd, and pressing Enter.
  2. Type psql --version and press Enter. You should see the PostgreSQL version number you installed if the installation was successful.
  3. To connect to the PostgreSQL server, type psql -U postgres and press Enter.
  4. When prompted, enter the password you set for the postgres user during installation. You should now see the postgres=# prompt, indicating you are connected to the PostgreSQL server.

Step 4: Download the Northwind PostgreSQL SQL File

Now, we're going to introduce you to the Northwind database and help you download it. The Northwind database is a sample database originally provided by Microsoft for its Access Database Management System. It's based on a fictitious company named "Northwind Traders," and it contains data on their customers, orders, products, suppliers, and other aspects of the business. In our case, we'll be working with a version of Northwind that has been adapted for PostgreSQL.

The following steps will guide you on how to download this PostgreSQL-compatible version of the Northwind database from GitHub to your local machine. Let's get started:

First, you need to download a version of the Northwind database that's compatible with PostgreSQL. You can find an adapted version on GitHub. To download the SQL file, follow these steps:

  1. Open Command Prompt or PowerShell.
  2. Create a new directory for the Northwind database and navigate to it:

    mkdir northwind
    cd northwind
  3. Download the Northwind PostgreSQL SQL file using curl:

    curl -O https://raw.githubusercontent.com/pthom/northwind_psql/master/northwind.sql

    This will download the northwind.sql file to the northwind directory you created.

Step 5: Create a New PostgreSQL Database

Now that we've downloaded the Northwind SQL file, it's time to prepare our PostgreSQL server to host this data. The next steps will guide you in creating a new database on your PostgreSQL server, a crucial prerequisite before importing the Northwind SQL file.

Creating a dedicated database for the Northwind data is good practice as it isolates these data from other databases in your PostgreSQL server, facilitating better organization and management of your data. These steps involve connecting to the PostgreSQL server as the postgres user, creating the northwind database, and then exiting the PostgreSQL command-line interface.

Let's proceed with creating your new database:

  1. Connect to the PostgreSQL server as the postgres user:

    psql -U postgres
  2. Create a new database called northwind:

    postgres=# CREATE DATABASE northwind;
  3. Exit the psql command-line interface:

    postgres=# \q

Step 6: Import the Northwind SQL File

We're now ready to import the Northwind SQL file into our newly created northwind database. This step is crucial as it populates our database with the data from the Northwind SQL file, which we will use for our PostgreSQL learning journey.

These instructions guide you through the process of ensuring you're in the correct directory in your Terminal and executing the command to import the SQL file. This command will connect to the PostgreSQL server, target the northwind database, and run the SQL commands contained in the northwind.sql file.

Let's move ahead and breathe life into our northwind database with the data it needs!

With the northwind database created, you can import the Northwind SQL file using the psql command. Follow these steps:

  1. In your Terminal, ensure you're in the northwind directory where you downloaded the northwind.sql file.
  2. Run the following command to import the Northwind SQL file into the northwind database:

    psql -U postgres -d northwind -f northwind.sql

    This command connects to the PostgreSQL server as the postgres user, selects the northwind database, and executes the SQL commands in the northwind.sql file.

Step 7: Verify the Northwind Database Installation

You've successfully created your northwind database and imported the Northwind SQL file. Next, we must ensure everything was installed correctly, and our database is ready for use.

These upcoming steps will guide you on connecting to your northwind database, listing its tables, running a sample query, and finally, exiting the command-line interface. Checking the tables and running a sample query will give you a sneak peek into the data you now have and verify that the data was imported correctly. This means we can ensure everything is in order before diving into more complex operations and analyses.

To verify that the Northwind database has been installed correctly, follow these steps:

  1. Connect to the northwind database using psql:

    psql -U postgres -d northwind
  2. List the tables in the Northwind database:

    northwind=# \dt

    You should see a list of Northwind tables: categories, customers, employees, orders, and more.

  3. Run a sample query to ensure the data has been imported correctly. For example, you can query the customers table:

    northwind=# SELECT * FROM customers LIMIT 5;

    This should return the first five rows from the customers table.

  4. Exit the psql command-line interface:

    northwind=# \q

Congratulations! You've successfully installed the Northwind database in PostgreSQL using an SQL file and psql.

Step 8: Connect to the Database Using Jupyter Notebook

As we wrap up our installation, we will now introduce Jupyter Notebook as one of the tools available for executing SQL queries and analyzing the Northwind database. Jupyter Notebook offers a convenient and interactive platform that simplifies the visualization and sharing of query results, but it's important to note that it is an optional step. You can also access Postgres through other means. However, we highly recommend using Jupyter Notebook for its numerous benefits and enhanced user experience.

To set up the necessary tools and establish a connection to the Northwind database, here is an overview of what each step will do:

  1. !pip install ipython-sql: This command installs the ipython-sql package. This package enables you to write SQL queries directly in your Jupyter Notebook, making it easier to execute and visualize the results of your queries within the notebook environment. After installing it, the %load_ext sql command (used in the code below) loads the SQL extension so that the %sql and %%sql magic commands become available in your notebook cells.

  2. %sql postgresql://postgres@localhost:5432/northwind: This command establishes a connection to the Northwind database using the PostgreSQL database system. The connection string has the following format:

    postgresql://username@hostname:port/database_name

    In this case, username is postgres, hostname is localhost, port is 5432, and database_name is northwind. The %sql magic command allows you to run a single-line SQL query in the Jupyter Notebook.

  3. Copy the following text into a code cell in the Jupyter Notebook:

    !pip install ipython-sql
    %load_ext sql
    %sql postgresql://postgres@localhost:5432/northwind 

    On Windows, you may need to use the following form instead, because you must provide the password you set for the “postgres” user during installation:

    %sql postgresql://postgres:{password}@localhost:5432/northwind

    Bear in mind that it's considered best practice not to include sensitive information like passwords directly in files that could be shared or accidentally exposed. Instead, you can store your password securely using environment variables or a password management system (we'll link to some resources at the end of this guide if you are interested in doing this).

  4. Run the cell by either:

    • Clicking the "Run" button on the menu bar.
    • Using the keyboard shortcut: Shift + Enter or Ctrl + Enter.
  5. Upon successful connection, you should see an output similar to the following:

    'Connected: postgres@northwind'

    This output confirms that you are now connected to the Northwind database, and you can proceed with the guided project in your Jupyter Notebook environment.

Once you execute these commands, you'll be connected to the Northwind database, and you can start writing SQL queries in your Jupyter Notebook using the %sql or %%sql magic commands.
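As a follow-up to the note above about not hardcoding passwords, here is one minimal sketch of reading the password from an environment variable inside the notebook. The variable name PG_NORTHWIND_PASSWORD is just an example, and you would set it in PowerShell before launching Jupyter; depending on your ipython-sql version you may be able to pass the resulting string to the %sql magic directly, or simply print it and paste it into the %sql line shown earlier:

import os

# Set beforehand in PowerShell, e.g.:  $env:PG_NORTHWIND_PASSWORD = "your_password"
password = os.environ["PG_NORTHWIND_PASSWORD"]

# Build the connection string without writing the password into the notebook itself
connection_string = f"postgresql://postgres:{password}@localhost:5432/northwind"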

Next Steps

Based on what you've accomplished, here are some potential next steps to continue your learning journey:

  1. Deepen Your SQL Knowledge:
    • Try formulating more complex queries on the Northwind database to improve your SQL skills. These could include joins, subqueries, and aggregations.
    • Understand the design of the Northwind database: inspect the tables, their relationships, and how data is structured.
  2. Experiment with Database Management:
    • Learn how to backup and restore databases in PostgreSQL. Try creating a backup of your Northwind database.
    • Explore different ways to optimize your PostgreSQL database performance like indexing and query optimization.
  3. Integration with Python:
    • Learn how to use psycopg2, a popular PostgreSQL adapter for Python, to interact with your database programmatically.
    • Experiment with ORM (Object-Relational Mapping) libraries like SQLAlchemy to manage your database using Python.
  4. Security and Best Practices:
    • Learn about database security principles and apply them to your PostgreSQL setup.
    • Understand best practices for storing sensitive information, like using .env files for environment variables.
    • For more guidance on securely storing passwords, you might find the following resources helpful:

Install PostgreSQL 14.7 on MacOS

4 November 2025 at 22:07

In this tutorial, you'll learn how to install PostgreSQL 14.7 on your MacBook using Homebrew, a popular package manager for macOS.
The process is straightforward and consists of the following steps:

  1. Install Homebrew
  2. Install PostgreSQL
  3. Download the Northwind PostgreSQL SQL file
  4. Create a New PostgreSQL Database
  5. Import the Northwind SQL file
  6. Verify the Northwind database installation
  7. Connecting to the Database Using Jupyter Notebook

Prerequisites

You'll need a MacBook or iMac running macOS 10.13 or later to follow this tutorial.

Install Xcode Command Line Tools
First, you need to install the Xcode Command Line Tools, which provide essential tools for building software on your Mac.

  1. Open the Terminal app (you can find it in Applications > Utilities) and enter the following command:
xcode-select --install

A pop-up window will appear, prompting you to install the Command Line Tools. Click on "Install" to proceed. Once the installation is complete, you can move on to the next step.

Step 1: Install Homebrew

With the Xcode Command Line Tools installed, you can now install Homebrew itself. Homebrew is a package manager for macOS that will make it easier for us to install PostgreSQL and manage other software packages.

After installing Homebrew, it's important to add it to your system's PATH. By doing this, we're telling the system where to find the Homebrew executables. This means you can run Homebrew commands directly from any location in your Terminal.

We will then verify the successful installation of Homebrew using the version check command. This crucial step confirms that Homebrew has been correctly installed and is accessible via your Terminal.

Now, let's get started with the installation of Homebrew and its addition to your system's PATH.

1. Copy and paste the following command into the Terminal app:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

The script will automatically download and install Homebrew on your Mac. You might be prompted to enter your admin password during the installation process. After the installation is complete, you will see a message saying, "Installation successful!"

2. Add Homebrew to Your PATH

Follow the steps below to complete the Homebrew installation:

(echo; echo 'eval "$(/opt/homebrew/bin/brew shellenv)"') >> /Users/user_name/.zprofile

This command appends the necessary line to your .zprofile file to load Homebrew into your PATH. Replace /Users/user_name with your actual user directory.

3. Running the command

Now, run the second command:

eval "$(/opt/homebrew/bin/brew shellenv)"

This command loads the Homebrew environment variables into your current Terminal session.

4. Verify the installation

To verify your Homebrew installation, run the following command:

brew --version

This should display the Homebrew version number, confirming that it's installed and available in your PATH.

Step 2: Install PostgreSQL

In this step, we will utilize Homebrew to set up PostgreSQL on your macOS system. The steps will guide you through updating Homebrew to guarantee that we're accessing the latest software packages. Next, we'll install PostgreSQL version 14 using a single Homebrew command, and start the PostgreSQL service, getting the database server up and running.

Finally, we'll secure our PostgreSQL installation by creating a new user (or "role") called postgres with superuser privileges. This user will serve as our primary means of interacting with our PostgreSQL server, granting us broad administrative capabilities. Now, let's dive into these steps:

1. Ensure Homebrew is Installed

Make sure you have Homebrew installed. If not, you can install it using the following command:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

2. Update Homebrew

Update Homebrew to ensure you have the latest package information

brew update

3. Install PostgreSQL 14

Install PostgreSQL 14 using Homebrew by entering the following command:

brew install postgresql@14

4. Start PostgreSQL Service

After the installation process is done, start the PostgreSQL service:

brew services start postgresql@14

You should see a message indicating that PostgreSQL has started successfully:

Successfully started postgresql@14

5. Verify PostgreSQL Installation
To ensure PostgreSQL is installed correctly, check the version:

psql --version

6. Create a PostgreSQL User

To ensure you will be able to use PostgreSQL without any issues, let’s create a new PostgreSQL user (also called a "role") named postgres with superuser privileges:

createuser -s postgres

7. Set a Password for the postgres User

For security purposes, set a password for the postgres user. Open psql (connecting to the default postgres database, since Homebrew doesn't create a database named after your macOS account) and run the \password command:

psql postgres
\password postgres

8. Configure PostgreSQL Environment

Add PostgreSQL to your PATH by adding the following line to your shell profile (~/.zshrc or ~/.bash_profile). The path below applies to Intel Macs; on Apple Silicon, Homebrew lives under /opt/homebrew, so use /opt/homebrew/opt/postgresql@14/bin instead:

export PATH="/usr/local/opt/postgresql@14/bin:$PATH"

Then, reload your profile:

source ~/.zshrc   # or source ~/.bash_profile

9. Troubleshooting Tips

If the PostgreSQL service does not start, check the logs for errors (the exact log path depends on your Homebrew prefix; on Apple Silicon it is under /opt/homebrew/var/log):

tail -f /usr/local/var/log/postgres.log

Ensure your system has enough resources and dependencies for PostgreSQL.

Step 3: Download the Northwind PostgreSQL SQL File

Now, we're going to introduce you to the Northwind database and help you download it. The Northwind database is a sample database originally provided by Microsoft for its Access Database Management System. It's based on a fictitious company named "Northwind Traders," and it contains data on their customers, orders, products, suppliers, and other aspects of the business. In our case, we'll be working with a version of Northwind that has been adapted for PostgreSQL.

The following steps will guide you on how to download this PostgreSQL-compatible version of the Northwind database from GitHub to your local machine. Let's get started:

First, you need to download a version of the Northwind database that's compatible with PostgreSQL. You can find an adapted version on GitHub. To download the SQL file, follow these steps:

  1. Open your Terminal application.

  2. Create a new directory for the Northwind database and navigate to it:

    mkdir northwind && cd northwind
  3. Download the Northwind PostgreSQL SQL file using curl:

    curl -O https://raw.githubusercontent.com/pthom/northwind_psql/master/northwind.sql

    This will download the northwind.sql file to the northwind directory you created.

Step 4: Create a New PostgreSQL Database

Now that we've downloaded the Northwind SQL file, it's time to prepare our PostgreSQL server to host this data. The next steps will guide you in creating a new database on your PostgreSQL server, a crucial prerequisite before importing the Northwind SQL file.

Creating a dedicated database for the Northwind data is good practice as it isolates these data from other databases in your PostgreSQL server, facilitating better organization and management of your data. These steps involve connecting to the PostgreSQL server as the postgres user, creating the northwind database, and then exiting the PostgreSQL command-line interface.

Let's proceed with creating your new database:

  1. Connect to the PostgreSQL server as the postgres user:

    psql -U postgres
  2. Create a new database called northwind:

    postgres=# CREATE DATABASE northwind;
  3. Exit the psql command-line interface:

    postgres=# \q

Step 5: Import the Northwind SQL File

We're now ready to import the Northwind SQL file into our newly created northwind database. This step is crucial as it populates our database with the data from the Northwind SQL file, which we will use for our PostgreSQL learning journey.

These instructions guide you through the process of ensuring you're in the correct directory in your Terminal and executing the command to import the SQL file. This command will connect to the PostgreSQL server, target the northwind database, and run the SQL commands contained in the northwind.sql file.

Let's move ahead and breathe life into our northwind database with the data it needs!

With the northwind database created, you can import the Northwind SQL file using psql. Follow these steps:

  1. In your Terminal, ensure you're in the northwind directory where you downloaded the northwind.sql file.
  2. Run the following command to import the Northwind SQL file into the northwind database:
    psql -U postgres -d northwind -f northwind.sql

    This command connects to the PostgreSQL server as the postgres user, selects the northwind database, and executes the SQL commands in the northwind.sql file.

Step 6: Verify the Northwind Database Installation

You've successfully created your northwind database and imported the Northwind SQL file. Next, we must ensure everything was installed correctly, and our database is ready for use.

These upcoming steps will guide you on connecting to your northwind database, listing its tables, running a sample query, and finally, exiting the command-line interface. Checking the tables and running a sample query will give you a sneak peek into the data you now have and verify that the data was imported correctly. This means we can ensure everything is in order before diving into more complex operations and analyses.

To verify that the Northwind database has been installed correctly, follow these steps:

  1. Connect to the northwind database using psql:

    psql -U postgres -d northwind
  2. List the tables in the Northwind database:

    northwind=# \dt

    You should see a list of Northwind tables: categories, customers, employees, orders, and more.

  3. Run a sample query to ensure the data has been imported correctly. For example, you can query the customers table:

    northwind=# SELECT * FROM customers LIMIT 5;

    This should return the first five rows from the customers table.

  4. Exit the psql command-line interface:

    northwind=# \q

Congratulations! You've successfully installed the Northwind database in PostgreSQL using an SQL file and psql.

Step 7: Connect to the Database Using Jupyter Notebook

As we wrap up our installation, we will now introduce Jupyter Notebook as one of the tools available for executing SQL queries and analyzing the Northwind database. Jupyter Notebook offers a convenient and interactive platform that simplifies the visualization and sharing of query results, but it's important to note that it is an optional step. You can also access Postgres through other means. However, we highly recommend using Jupyter Notebook for its numerous benefits and enhanced user experience.

To set up the necessary tools and establish a connection to the Northwind database, here is an overview of what each step will do:

  1. !pip install ipython-sql: This command installs the ipython-sql package. This package enables you to write SQL queries directly in your Jupyter Notebook, making it easier to execute and visualize the results of your queries within the notebook environment.

  2. %load_ext sql: This magic command loads the sql extension for IPython. By loading this extension, you can use the SQL magic commands, such as %sql and %%sql, to run SQL queries directly in the Jupyter Notebook cells.

  3. %sql postgresql://postgres@localhost:5432/northwind: This command establishes a connection to the Northwind database using the PostgreSQL database system. The connection string has the following format:

    'postgresql://username@hostname:port/database_name'

    In this case, username is postgres, hostname is localhost, port is 5432, and database_name is northwind. The %sql magic command allows you to run a single-line SQL query in the Jupyter Notebook.

  4. Copy the following text into a code cell in the Jupyter Notebook:

    !pip install ipython-sql
    %load_ext sql
    %sql postgresql://postgres@localhost:5432/northwind
  5. Run the cell by either:

    • Clicking the "Run" button on the menu bar.
    • Using the keyboard shortcut: Shift + Enter or Ctrl + Enter.
  6. Upon successful connection, you should see an output similar to the following:

    'Connected: postgres@northwind'

    This output confirms that you are now connected to the Northwind database, and you can proceed with the guided project in your Jupyter Notebook environment.

Once you execute these commands, you'll be connected to the Northwind database, and you can start writing SQL queries in your Jupyter Notebook using the %sql or %%sql magic commands.

Next Steps

Based on what you've accomplished, here are some potential next steps to continue your learning journey:

  1. Deepen Your SQL Knowledge:
    • Try formulating more complex queries on the Northwind database to improve your SQL skills. These could include joins, subqueries, and aggregations.
    • Understand the design of the Northwind database: inspect the tables, their relationships, and how data is structured.
  2. Experiment with Database Management:
    • Learn how to backup and restore databases in PostgreSQL. Try creating a backup of your Northwind database.
    • Explore different ways to optimize your PostgreSQL database performance like indexing and query optimization.
  3. Integration with Python:
    • Learn how to use psycopg2, a popular PostgreSQL adapter for Python, to interact with your database programmatically (see the sketch after this list).
    • Experiment with ORM (Object-Relational Mapping) libraries like SQLAlchemy to manage your database using Python.
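To get started with the psycopg2 suggestion above, here is a minimal sketch that connects to the local northwind database and runs a small query. It assumes you have installed the adapter (for example with pip install psycopg2-binary) and uses the password you set for the postgres user earlier:

import psycopg2

# Connect to the local northwind database as the postgres user
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="northwind",
    user="postgres",
    password="your_password",  # replace with the password you set earlier
)

with conn.cursor() as cur:
    cur.execute("SELECT company_name, country FROM customers LIMIT 5;")
    for row in cur.fetchall():
        print(row)

conn.close()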

Generating Embeddings with APIs and Open Models

4 November 2025 at 21:39

In the previous tutorial, you learned that embeddings convert text into numerical vectors that capture semantic meaning. You saw how papers about machine learning, data engineering, and data visualization naturally clustered into distinct groups when we visualized their embeddings. That was the foundation.

But we only worked with 12 handwritten paper abstracts that we typed directly into our code. That approach works great for understanding core concepts, but it doesn't prepare you for real projects. Real applications require processing hundreds or thousands of documents, and you need to make strategic decisions about how to generate those embeddings efficiently.

This tutorial teaches you how to collect documents programmatically and generate embeddings using different approaches. You'll use the arXiv API to gather 500 research papers, then generate embeddings using both local models and cloud services. By comparing these approaches hands-on, you'll understand the tradeoffs and be able to make informed decisions for your own projects.

These techniques form the foundation for production systems, but we're focusing on core concepts with a learning-sized dataset. A real system handling millions of documents would require batching strategies, streaming pipelines, and specialized vector databases. We'll touch on those considerations, but our goal here is to build your intuition about the embedding generation process itself.

Setting Up Your Environment

Before we start collecting data, let's install the libraries we'll need. We'll use the arxiv library to access research papers programmatically, pandas for data manipulation, and the same embedding libraries from the previous tutorial.

This tutorial was developed using Python 3.12.12 with the following library versions. You can use these exact versions for guaranteed compatibility, or install the latest versions (which should work just fine):

# Developed with: Python 3.12.12
# sentence-transformers==5.1.2
# scikit-learn==1.6.1
# matplotlib==3.10.0
# numpy==2.0.2
# arxiv==2.2.0
# pandas==2.2.2
# cohere==5.20.0
# python-dotenv==1.1.1

pip install sentence-transformers scikit-learn matplotlib numpy arxiv pandas cohere python-dotenv

This tutorial works in any Python environment: Jupyter notebooks, Python scripts, VS Code, or your preferred IDE. Run the pip command above in your terminal before starting, then use the Python code blocks throughout this tutorial.

Collecting Research Papers with the arXiv API

arXiv is a repository of over 2 million scholarly papers in physics, mathematics, computer science, and more. Researchers publish cutting-edge work here before it appears in journals, making it a valuable resource for staying current with AI and machine learning research. Best of all, arXiv provides a free API for programmatic access. While they do monitor usage and have some rate limits to prevent abuse, these limits are generous for learning and research purposes. Check their Terms of Use for current guidelines.

We'll use the arXiv API to collect 500 papers from five different computer science categories. This diversity will give us clear semantic clusters when we visualize or search our embeddings. The categories we'll use are:

  • cs.LG (Machine Learning): Core ML algorithms, training methods, and theoretical foundations
  • cs.CV (Computer Vision): Image processing, object detection, and visual recognition
  • cs.CL (Computational Linguistics/NLP): Natural language processing and understanding
  • cs.DB (Databases): Data storage, query optimization, and database systems
  • cs.SE (Software Engineering): Development practices, testing, and software architecture

These categories use distinct vocabularies and will create well-separated clusters in our embedding space. Let's write a function to collect papers from specific arXiv categories:

import arxiv

# Create the arXiv client once and reuse it
# This is recommended by the arxiv package to respect rate limits
client = arxiv.Client()

def collect_arxiv_papers(category, max_results=100):
    """
    Collect papers from arXiv by category.

    Parameters:
    -----------
    category : str
        arXiv category code (e.g., 'cs.LG', 'cs.CV')
    max_results : int
        Maximum number of papers to retrieve

    Returns:
    --------
    list of dict
        List of paper dictionaries containing title, abstract, authors, etc.
    """
    # Construct search query for the category
    search = arxiv.Search(
        query=f"cat:{category}",
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate
    )

    papers = []
    for result in client.results(search):
        paper = {
            'title': result.title,
            'abstract': result.summary,
            'authors': [author.name for author in result.authors],
            'published': result.published,
            'category': category,
            'arxiv_id': result.entry_id.split('/')[-1]
        }
        papers.append(paper)

    return papers

# Define the categories we want to collect from
categories = [
    ('cs.LG', 'Machine Learning'),
    ('cs.CV', 'Computer Vision'),
    ('cs.CL', 'Computational Linguistics'),
    ('cs.DB', 'Databases'),
    ('cs.SE', 'Software Engineering')
]

# Collect 100 papers from each category
all_papers = []
for category_code, category_name in categories:
    print(f"Collecting papers from {category_name} ({category_code})...")
    papers = collect_arxiv_papers(category_code, max_results=100)
    all_papers.extend(papers)
    print(f"  Collected {len(papers)} papers")

print(f"\nTotal papers collected: {len(all_papers)}")

# Let's examine the first paper from each category
separator = "=" * 80
print(f"\n{separator}", "SAMPLE PAPERS (one from each category)", f"{separator}", sep="\n")
for i, (_, category_name) in enumerate(categories):
    paper = all_papers[i * 100]
    print(f"\n{category_name}:")
    print(f"  Title: {paper['title']}")
    print(f"  Abstract (first 150 chars): {paper['abstract'][:150]}...")
    
Collecting papers from Machine Learning (cs.LG)...
  Collected 100 papers
Collecting papers from Computer Vision (cs.CV)...
  Collected 100 papers
Collecting papers from Computational Linguistics (cs.CL)...
  Collected 100 papers
Collecting papers from Databases (cs.DB)...
  Collected 100 papers
Collecting papers from Software Engineering (cs.SE)...
  Collected 100 papers

Total papers collected: 500

================================================================================
SAMPLE PAPERS (one from each category)
================================================================================

Machine Learning:
  Title: Dark Energy Survey Year 3 results: Simulation-based $w$CDM inference from weak lensing and galaxy clustering maps with deep learning. I. Analysis design
  Abstract (first 150 chars): Data-driven approaches using deep learning are emerging as powerful techniques to extract non-Gaussian information from cosmological large-scale struc...

Computer Vision:
  Title: Carousel: A High-Resolution Dataset for Multi-Target Automatic Image Cropping
  Abstract (first 150 chars): Automatic image cropping is a method for maximizing the human-perceived quality of cropped regions in photographs. Although several works have propose...

Computational Linguistics:
  Title: VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks
  Abstract (first 150 chars): LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct an...

Databases:
  Title: Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis
  Abstract (first 150 chars): Natural language interfaces to tabular data must handle ambiguities inherent to queries. Instead of treating ambiguity as a deficiency, we reframe it ...

Software Engineering:
  Title: evomap: A Toolbox for Dynamic Mapping in Python
  Abstract (first 150 chars): This paper presents evomap, a Python package for dynamic mapping. Mapping methods are widely used across disciplines to visualize relationships among ...
  

The code above demonstrates how easy it is to collect papers programmatically. In just a few lines, we've gathered 500 recent research papers from five distinct computer science domains.

Take a look at your output when you run this code. You might notice something interesting: sometimes the same paper title appears under multiple categories. This happens because researchers often cross-list their papers in multiple relevant categories on arXiv. A paper about deep learning for natural language processing could legitimately appear in both Machine Learning (cs.LG) and Computational Linguistics (cs.CL). A paper about neural networks for image generation might be listed in both Machine Learning (cs.LG) and Computer Vision (cs.CV).

While our five categories are conceptually separate, there's naturally some overlap, especially between closely related fields. This real-world messiness is exactly what makes working with actual data more interesting than handcrafted examples. Your specific results will look different from ours because arXiv returns the most recently submitted papers, which change as new research is published.
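If you're curious how much cross-listing shows up in your own run, a quick check over the all_papers list collected above might look like this (your counts will differ from ours):

from collections import Counter

# Titles that appear under more than one category were cross-listed on arXiv
title_counts = Counter(paper['title'] for paper in all_papers)
cross_listed = [title for title, count in title_counts.items() if count > 1]
print(f"Cross-listed papers in this run: {len(cross_listed)}")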

Preparing Your Dataset

Before generating embeddings, we need to clean and structure our data. Real-world datasets always have imperfections. Some papers might have missing abstracts, others might have abstracts that are too short to be meaningful, and we need to organize everything into a format that's easy to work with.

Let's use pandas to create a DataFrame and handle these data quality issues:

import pandas as pd

# Convert to DataFrame for easier manipulation
df = pd.DataFrame(all_papers)

print("Dataset before cleaning:")
print(f"Total papers: {len(df)}")
print(f"Papers with abstracts: {df['abstract'].notna().sum()}")

# Check for missing abstracts
missing_abstracts = df['abstract'].isna().sum()
if missing_abstracts > 0:
    print(f"\nWarning: {missing_abstracts} papers have missing abstracts")
    df = df.dropna(subset=['abstract'])

# Filter out papers with very short abstracts (less than 100 characters)
# These are often just placeholders or incomplete entries
df['abstract_length'] = df['abstract'].str.len()
df = df[df['abstract_length'] >= 100].copy()

print(f"\nDataset after cleaning:")
print(f"Total papers: {len(df)}")
print(f"Average abstract length: {df['abstract_length'].mean():.0f} characters")

# Show the distribution across categories
print("\nPapers per category:")
print(df['category'].value_counts().sort_index())

# Display the first few entries
separator = "=" * 80
print(f"\n{separator}", "FIRST 3 PAPERS IN CLEANED DATASET", f"{separator}", sep="\n")
for idx, row in df.head(3).iterrows():
    print(f"\n{idx+1}. {row['title']}")
    print(f"   Category: {row['category']}")
    print(f"   Abstract length: {row['abstract_length']} characters")
    
Dataset before cleaning:
Total papers: 500
Papers with abstracts: 500

Dataset after cleaning:
Total papers: 500
Average abstract length: 1337 characters

Papers per category:
category
cs.CL    100
cs.CV    100
cs.DB    100
cs.LG    100
cs.SE    100
Name: count, dtype: int64

================================================================================
FIRST 3 PAPERS IN CLEANED DATASET
================================================================================

1. Dark Energy Survey Year 3 results: Simulation-based $w$CDM inference from weak lensing and galaxy clustering maps with deep learning. I. Analysis design
   Category: cs.LG
   Abstract length: 1783 characters

2. Multi-Method Analysis of Mathematics Placement Assessments: Classical, Machine Learning, and Clustering Approaches
   Category: cs.LG
   Abstract length: 1519 characters

3. Forgetting is Everywhere
   Category: cs.LG
   Abstract length: 1150 characters
   

Data preparation matters because poor quality input leads to poor quality embeddings. By filtering out papers with missing or very short abstracts, we ensure that our embeddings will capture meaningful semantic content. In production systems, you'd likely implement more sophisticated quality checks, but this basic approach handles the most common issues.
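As one example of a further check (not applied here, since we want to keep 100 papers per category), you could drop cross-listed papers that appear under more than one category by deduplicating on the arXiv ID:

# Optional: remove cross-listed duplicates, keeping the first category they appeared under.
# Note that this would change the per-category counts used later in this tutorial.
df_deduped = df.drop_duplicates(subset='arxiv_id')
print(f"Papers after dropping cross-listed duplicates: {len(df_deduped)}")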

Strategy One: Local Open-Source Models

Now we're ready to generate embeddings. Let's start with local models using sentence-transformers, the same approach we used in the previous tutorial. The key advantage of local models is that everything runs on your own machine. There are no API costs, no data leaves your computer, and you have complete control over the embedding process.

We'll use all-MiniLM-L6-v2 again for consistency, and we'll also demonstrate a larger model called all-mpnet-base-v2 to show how different models produce different results:

from sentence_transformers import SentenceTransformer
import numpy as np
import time

# Load the same model from the previous tutorial
print("Loading all-MiniLM-L6-v2 model...")
model_small = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for all abstracts
abstracts = df['abstract'].tolist()

print(f"Generating embeddings for {len(abstracts)} papers...")
start_time = time.time()

# The encode() method handles batching automatically
embeddings_small = model_small.encode(
    abstracts,
    show_progress_bar=True,
    batch_size=32  # Process 32 abstracts at a time
)

elapsed_time = time.time() - start_time

print(f"\nCompleted in {elapsed_time:.2f} seconds")
print(f"Embedding shape: {embeddings_small.shape}")
print(f"Each abstract is now a {embeddings_small.shape[1]}-dimensional vector")
print(f"Average time per abstract: {elapsed_time/len(abstracts):.3f} seconds")

# Add embeddings to our DataFrame
df['embedding_minilm'] = list(embeddings_small)

Loading all-MiniLM-L6-v2 model...
Generating embeddings for 500 papers...
Batches: 100%|██████████| 16/16 [01:05<00:00,  4.09s/it]

Completed in 65.45 seconds
Embedding shape: (500, 384)
Each abstract is now a 384-dimensional vector
Average time per abstract: 0.131 seconds

That was fast! On a typical laptop, we generated embeddings for 500 abstracts in about 65 seconds. Now let's try a larger, more powerful model to see the difference.

Spoiler alert: this will take several more minutes than the last one, so you may want to freshen up your coffee while it's running:

# Load a larger (more dimensions) model
print("\nLoading all-mpnet-base-v2 model...")
model_large = SentenceTransformer('all-mpnet-base-v2')

print("Generating embeddings with larger model...")
start_time = time.time()

embeddings_large = model_large.encode(
    abstracts,
    show_progress_bar=True,
    batch_size=32
)

elapsed_time = time.time() - start_time

print(f"\nCompleted in {elapsed_time:.2f} seconds")
print(f"Embedding shape: {embeddings_large.shape}")
print(f"Each abstract is now a {embeddings_large.shape[1]}-dimensional vector")
print(f"Average time per abstract: {elapsed_time/len(abstracts):.3f} seconds")

# Add these embeddings to our DataFrame too
df['embedding_mpnet'] = list(embeddings_large)
Loading all-mpnet-base-v2 model...
Generating embeddings with larger model...
Batches: 100%|██████████| 16/16 [11:20<00:00, 30.16s/it]

Completed in 680.47 seconds
Embedding shape: (500, 768)
Each abstract is now a 768-dimensional vector
Average time per abstract: 1.361 seconds

Notice the differences between these two models:

  • Dimensionality: The smaller model produces 384-dimensional embeddings, while the larger model produces 768-dimensional embeddings. More dimensions can capture more nuanced semantic information.
  • Speed: The smaller model is about 10 times faster. For 500 papers, that's a difference of about 10 minutes. For thousands of documents, this difference becomes significant.
  • Quality: Larger models generally produce higher-quality embeddings that better capture subtle semantic relationships. However, the smaller model is often good enough for many applications.

The key insight here is that local models give you flexibility. You can choose models that balance quality, speed, and computational resources based on your specific needs. For rapid prototyping, use smaller models. For production systems where quality matters most, use larger models.
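
As a rough illustration of that flexibility, you could hide the model choice behind a small helper so the rest of your pipeline stays the same when you switch models. The profile names and default below are assumptions for this sketch, not something we rely on later in the tutorial:

# A minimal sketch: make the embedding model configurable by "profile"
from sentence_transformers import SentenceTransformer

EMBEDDING_PROFILES = {
    "fast": "all-MiniLM-L6-v2",      # 384 dimensions, quickest
    "quality": "all-mpnet-base-v2",  # 768 dimensions, slower but richer
}

def embed_texts(texts, profile="fast", batch_size=32):
    """Encode a list of texts with the model mapped to the chosen profile."""
    model = SentenceTransformer(EMBEDDING_PROFILES[profile])
    return model.encode(texts, batch_size=batch_size, show_progress_bar=True)

# For prototyping: embed_texts(abstracts)
# For a quality-focused run: embed_texts(abstracts, profile="quality")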

Visualizing Real-World Embeddings

In our previous tutorial, we saw beautifully separated clusters using handcrafted paper abstracts. Let's see what happens when we visualize embeddings from real arXiv papers. We'll use the same PCA approach to reduce our 384-dimensional embeddings down to 2D:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Reduce embeddings from 384 dimensions to 2 dimensions
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings_small)

print(f"Original embedding dimensions: {embeddings_small.shape[1]}")
print(f"Reduced embedding dimensions: {embeddings_2d.shape[1]}")
Original embedding dimensions: 384
Reduced embedding dimensions: 2

Now let's create a visualization showing how our 500 papers cluster by category:

# Create the visualization
plt.figure(figsize=(12, 8))

# Define colors for different categories
colors = ['#C8102E', '#003DA5', '#00843D', '#FF8200', '#6A1B9A']
category_names = ['Machine Learning', 'Computer Vision', 'Comp. Linguistics', 'Databases', 'Software Eng.']
category_codes = ['cs.LG', 'cs.CV', 'cs.CL', 'cs.DB', 'cs.SE']

# Plot each category
for i, (cat_code, cat_name, color) in enumerate(zip(category_codes, category_names, colors)):
    # Get papers from this category
    mask = df['category'] == cat_code
    cat_embeddings = embeddings_2d[mask]

    plt.scatter(cat_embeddings[:, 0], cat_embeddings[:, 1],
                c=color, label=cat_name, s=50, alpha=0.6, edgecolors='black', linewidth=0.5)

plt.xlabel('First Principal Component', fontsize=12)
plt.ylabel('Second Principal Component', fontsize=12)
plt.title('500 arXiv Papers Across Five Computer Science Categories\n(Real-world embeddings show overlapping clusters)',
          fontsize=14, fontweight='bold', pad=20)
plt.legend(loc='best', fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Visualization of embeddings for 500 arXiv Papers Across Five Computer Science Categories

The visualization above reveals an important aspect of real-world data. Unlike our handcrafted examples in the previous tutorial, where clusters were perfectly separated, these real arXiv papers show more overlap. You can see clear groupings, as well as papers that bridge multiple topics. For example, a paper about "deep learning for database query optimization" uses vocabulary from both machine learning and databases, so it might appear between those clusters.

This is exactly what you'll encounter in production systems. Real data is messy, topics overlap, and semantic boundaries are often fuzzy rather than sharp. The embeddings are still capturing meaningful relationships, but the visualization shows the complexity of actual research papers rather than the idealized examples we used for learning.

Strategy Two: API-Based Embedding Services

Local models work great, but they require computational resources and you're responsible for managing them. API-based embedding services offer an alternative approach. You send your text to a cloud provider, they generate embeddings using their infrastructure, and they send the embeddings back to you.

We'll use Cohere's API for our main example because they offer a generous free trial tier that doesn't require payment information. This makes it perfect for learning and experimentation.

Setting Up Cohere Securely

First, you'll need to create a free Cohere account and get an API key:

  1. Visit Cohere's registration page
  2. Sign up for a free account (no credit card required)
  3. Navigate to the API Keys section in your dashboard
  4. Copy your Trial API key

Important security practice: Never hardcode API keys directly in your notebooks or scripts. Store them in a .env file instead. This prevents accidentally sharing sensitive credentials when you share your code.

Create a file named .env in your project directory with the following entry:

COHERE_API_KEY=your_key_here

Important: Add .env to your .gitignore file to prevent committing it to version control.

Now load your API key securely:

from dotenv import load_dotenv
import os
from cohere import ClientV2
import time

# Load environment variables from .env file
load_dotenv()

# Access your API key
cohere_api_key = os.getenv('COHERE_API_KEY')

if not cohere_api_key:
    raise ValueError(
        "COHERE_API_KEY not found. Please create a .env file with your API key.\n"
        "See https://dashboard.cohere.com for instructions on getting your key."
    )

# Initialize the Cohere client using the V2 API
co = ClientV2(api_key=cohere_api_key)
print("API key loaded successfully from environment")
API key loaded successfully from environment

Now let's generate embeddings using the Cohere API. Here's something we discovered through trial and error: when we first ran this code without delays, we hit Cohere's rate limit and got a 429 TooManyRequestsError with this message: "trial token rate limit exceeded, limit is 100000 tokens per minute."

This exposes an important lesson about working with APIs. Rate limits aren't always clearly documented upfront. Sometimes you discover them by running into them, then you have to dig through the error responses in the documentation to understand what happened. In this case, we found the details in Cohere's error responses documentation. You can also check their rate limits page for current limits, though specifics for free tier accounts aren't always listed there.

With 500 papers averaging around 1,337 characters each, we can easily exceed 100,000 tokens per minute if we send batches too quickly. So we've built in two safeguards: a 12-second delay between batches to stay under the limit, and retry logic in case we do hit it. These delays add roughly a minute of waiting on top of the time the API itself needs, but the process is reliable and won't throw errors mid-run.

Think of it as the tradeoff for using a free tier: we get access to powerful models without paying, but we work within some constraints. Let's see it in action:

print("Generating embeddings using Cohere API...")
print(f"Processing {len(abstracts)} abstracts...")

start_time = time.time()
actual_api_time = 0  # Track time spent on actual API calls

# Cohere recommends processing in batches for efficiency
# Their API accepts up to 96 texts per request
batch_size = 90
all_embeddings = []

for i in range(0, len(abstracts), batch_size):
    batch = abstracts[i:i+batch_size]
    batch_num = i//batch_size + 1
    total_batches = (len(abstracts) + batch_size - 1) // batch_size
    print(f"Processing batch {batch_num}/{total_batches} ({len(batch)} abstracts)...")

    # Add retry logic for rate limits
    max_retries = 3
    retry_delay = 60  # Wait 60 seconds if we hit rate limit

    for attempt in range(max_retries):
        try:
            # Track actual API call time
            api_start = time.time()

            # Generate embeddings for this batch using V2 API
            response = co.embed(
                texts=batch,
                model='embed-v4.0',
                input_type='search_document',
                embedding_types=['float']
            )

            actual_api_time += time.time() - api_start
            # V2 API returns embeddings in a different structure
            all_embeddings.extend(response.embeddings.float_)
            break  # Success, move to next batch

        except Exception as e:
            if "rate limit" in str(e).lower() and attempt < max_retries - 1:
                print(f"  Rate limit hit. Waiting {retry_delay} seconds before retry...")
                time.sleep(retry_delay)
            else:
                raise  # Re-raise if it's not a rate limit error or we're out of retries

    # Add a delay between batches to avoid hitting rate limits
    # Wait 12 seconds between batches (spreads 500 papers over ~1 minute)
    if i + batch_size < len(abstracts):  # Don't wait after the last batch
        time.sleep(12)

# Convert to numpy array for consistency with local models
embeddings_cohere = np.array(all_embeddings)
elapsed_time = time.time() - start_time

print(f"\nCompleted in {elapsed_time:.2f} seconds (includes rate limit delays)")
print(f"Actual API processing time: {actual_api_time:.2f} seconds")
print(f"Time spent waiting for rate limits: {elapsed_time - actual_api_time:.2f} seconds")
print(f"Embedding shape: {embeddings_cohere.shape}")
print(f"Each abstract is now a {embeddings_cohere.shape[1]}-dimensional vector")
print(f"Average time per abstract (API only): {actual_api_time/len(abstracts):.3f} seconds")

# Add to DataFrame
df['embedding_cohere'] = list(embeddings_cohere)
Generating embeddings using Cohere API...
Processing 500 abstracts...
Processing batch 1/6 (90 abstracts)...
Processing batch 2/6 (90 abstracts)...
Processing batch 3/6 (90 abstracts)...
Processing batch 4/6 (90 abstracts)...
Processing batch 5/6 (90 abstracts)...
Processing batch 6/6 (50 abstracts)...

Completed in 87.23 seconds (includes rate limit delays)
Actual API processing time: 27.18 seconds
Time spent waiting for rate limits: 60.05 seconds
Embedding shape: (500, 1536)
Each abstract is now a 1536-dimensional vector
Average time per abstract (API only): 0.054 seconds

Notice the timing breakdown? The actual API processing was quite fast (around 27 seconds), but we spent most of our time waiting between batches to respect rate limits (around 60 seconds). This is the reality of free-tier accounts: they're fantastic for learning and prototyping, but come with constraints. Paid tiers would give us much higher limits and let us process at full speed.

Something else worth noting: Cohere's embeddings are 1536-dimensional, which is 4x larger than our small local model (384 dimensions) and 2x larger than our large local model (768 dimensions). Yet the API processing was still faster than our small local model. This demonstrates the power of specialized infrastructure. Cohere runs optimized hardware designed specifically for embedding generation at scale, while our local models run on general-purpose computers. Higher dimensions don't automatically mean slower processing when you have the right infrastructure behind them.

For this tutorial, Cohere’s free tier works perfectly. We're focusing on understanding the concepts and comparing approaches, not optimizing for production speed. The key differences from local models:

  • No local computation: All processing happens on Cohere's servers, so it works equally well on any hardware.
  • Internet dependency: Requires an active internet connection to work.
  • Rate limits: Free tier accounts have token-per-minute limits, which is why we added delays between batches.

Other API Options

While we're using Cohere for this tutorial, you should know about other popular embedding APIs:

OpenAI offers excellent embedding models, but requires payment information upfront. If you have an OpenAI account, their text-embedding-3-small model is very affordable at \$0.02 per 1M tokens. You can find setup instructions in their embeddings documentation.

Together AI provides access to many open-source models through their API. They offer models like BAAI/bge-large-en-v1.5 and detailed documentation in their embeddings guide. Note that their rate limit tiers are subject to change, so check their rate limit documentation to see which tier fits your needs.

The choice between these services depends on your priorities. OpenAI has excellent quality but requires payment setup. Together AI offers many model choices and different paid tiers. Cohere has a truly free tier for learning and prototyping.

Comparing Your Options

Now that we've generated embeddings using both local models and an API service, let's think about how to choose between these approaches for real projects. The decision isn't about one being universally better than the other. It's about matching the approach to your specific constraints and requirements.

To clarify terminology: "self-hosted models" means running models on infrastructure you control, whether that's your laptop for learning or your company's cloud servers for production. "API services" means using third-party providers like Cohere or OpenAI where you send data to their servers for processing.

Cost

  • Self-hosted models: Zero ongoing costs after initial setup. Best for high-volume applications where you'll generate embeddings frequently.
  • API services: Pay-per-use pricing per 1M tokens (Cohere: \$0.12 per 1M tokens; OpenAI: \$0.13 per 1M tokens). Best for low to moderate volume, or when you want predictable costs without infrastructure.

Performance

  • Self-hosted models: Speed depends on your hardware. Our results: 0.131 seconds per abstract (small model), 1.361 seconds per abstract (large model). Best for batch processing or when you control the infrastructure.
  • API services: Speed depends on your internet connection and API server load. Our results: 0.054 seconds per abstract (Cohere), including network latency and third-party infrastructure considerations. Best when you don't have powerful local hardware or need access to the latest models.

Privacy

  • Self-hosted models: All data stays on your infrastructure, with complete control over data handling and no data sent to third parties. Best for sensitive data, healthcare, financial services, or when compliance requires data locality.
  • API services: Data is sent to third-party servers for processing and is subject to the provider's data handling policies. Cohere states that API data isn't used for training (verify current policy). Best for non-sensitive data, or when provider policies meet your requirements.

Customization

  • Self-hosted models: You can fine-tune models on your specific domain, with full control over model selection, updates, and inference parameters. Best for specialized domains, custom requirements, or when you need reproducibility.
  • API services: Limited to the provider's available models, with model updates on the provider's schedule and less control over inference details. Best for general-purpose applications, or when using the latest models matters more than control.

Infrastructure

  • Self-hosted models: You manage the infrastructure. Whether running on your laptop or company cloud servers, you handle model updates, dependencies, and scaling. Best for organizations with existing ML infrastructure or when infrastructure control is important.
  • API services: No infrastructure management needed. The provider scales automatically and manages updates and availability. Best for smaller teams, rapid prototyping, or when you want to focus on application logic rather than infrastructure.

When to Use Each Approach

Here's a practical decision guide to help you choose the right approach for your project:

Choose Self-Hosted Models when you:

  • Process large volumes of text regularly
  • Work with sensitive or regulated data
  • Need offline capability
  • Have existing ML infrastructure (whether local or cloud-based)
  • Want to fine-tune models for your domain
  • Need complete control over the deployment

Choose API Services when you:

  • Are just getting started or prototyping
  • Have unpredictable or variable workload
  • Want to avoid infrastructure management
  • Need automatic scaling
  • Prefer the latest models without maintenance
  • Value simplicity over infrastructure control

For our tutorial series, we've used both approaches to give you hands-on experience with each. In our next tutorial, we'll use the Cohere embeddings for our semantic search implementation. We're choosing Cohere because they offer a generous free tier for learning (no payment required), their models are well-suited for semantic search tasks, and they work consistently across different hardware setups.

In practice, you'd evaluate embedding quality by testing on your specific use case: generate embeddings with different models, run similarity searches on sample queries, and measure which model returns the most relevant results for your domain.
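
Here's a minimal sketch of what such a spot check could look like, reusing the model_small, model_large, embeddings_small, and embeddings_large objects from earlier. The sample query is just an illustration; in a real evaluation you'd use queries and relevance judgments from your own domain:

# Informal quality spot check: embed a sample query with each local model
# and compare which papers it ranks highest
from sklearn.metrics.pairwise import cosine_similarity

query = "detecting anomalies in streaming sensor data"  # example query (an assumption)

for name, model, doc_embeddings in [
    ("all-MiniLM-L6-v2", model_small, embeddings_small),
    ("all-mpnet-base-v2", model_large, embeddings_large),
]:
    query_embedding = model.encode([query])
    scores = cosine_similarity(query_embedding, doc_embeddings)[0]
    top_idx = scores.argsort()[::-1][:3]  # indices of the 3 most similar abstracts

    print(f"\nTop matches for '{query}' using {name}:")
    for i in top_idx:
        print(f"  {scores[i]:.3f}  {df.iloc[i]['title']}")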

Storing Your Embeddings

We've generated embeddings using multiple methods, and now we need to save them for future use. Storing embeddings properly is important because generating them can be time-consuming and potentially costly. You don't want to regenerate embeddings every time you run your code.

Let's explore two storage approaches:

Option 1: CSV with Numpy Arrays

This approach works well for learning and small-scale prototyping:

# Save the metadata to CSV (without embeddings, which are large arrays)
df_metadata = df[['title', 'abstract', 'authors', 'published', 'category', 'arxiv_id', 'abstract_length']]
df_metadata.to_csv('arxiv_papers_metadata.csv', index=False)
print("Saved metadata to 'arxiv_papers_metadata.csv'")

# Save embeddings as numpy arrays
np.save('embeddings_minilm.npy', embeddings_small)
np.save('embeddings_mpnet.npy', embeddings_large)
np.save('embeddings_cohere.npy', embeddings_cohere)
print("Saved embeddings to .npy files")

# Later, you can load them back like this:
# df_loaded = pd.read_csv('arxiv_papers_metadata.csv')
# embeddings_loaded = np.load('embeddings_cohere.npy')
Saved metadata to 'arxiv_papers_metadata.csv'
Saved embeddings to .npy files

This approach is simple and transparent, making it perfect for learning and experimentation. However, it has significant limitations for larger datasets:

  • Loading all embeddings into memory doesn't scale beyond a few thousand documents
  • No indexing for fast similarity search
  • Manual coordination between CSV metadata and numpy arrays

For production systems with thousands or millions of embeddings, you'll want specialized vector databases (Option 2) that handle indexing, similarity search, and efficient storage automatically.
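
Before moving on, here's a small, optional sanity check you could run when reloading the files we just saved, since keeping the CSV rows and the embedding rows aligned is the easiest thing to get wrong with this approach:

# Reload the saved files and confirm the rows still line up
import numpy as np
import pandas as pd

df_loaded = pd.read_csv('arxiv_papers_metadata.csv')
embeddings_loaded = np.load('embeddings_cohere.npy')

# Row i of the CSV corresponds to row i of the embedding matrix
assert len(df_loaded) == len(embeddings_loaded), "Metadata and embeddings have different row counts"

print(f"{len(df_loaded)} metadata rows aligned with embeddings of shape {embeddings_loaded.shape}")
print(df_loaded.iloc[0]['title'])
print(embeddings_loaded[0][:5])  # first five dimensions of that paper's embedding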

Option 2: Preparing for Vector Databases

In production systems, you'll likely store embeddings in a specialized vector database like Pinecone, Weaviate, or Chroma. These databases are optimized for similarity search. While we'll cover vector databases in detail in another tutorial series, here's how you'd structure your data for them:

# Prepare data in a format suitable for vector databases
# Most vector databases want: ID, embedding vector, and metadata

vector_db_data = []
for idx, row in df.iterrows():
    vector_db_data.append({
        'id': row['arxiv_id'],
        'embedding': row['embedding_cohere'].tolist(),  # Convert numpy array to list
        'metadata': {
            'title': row['title'],
            'abstract': row['abstract'][:500],  # Many DBs limit metadata size
            'authors': ', '.join(row['authors'][:3]),  # First 3 authors
            'category': row['category'],
            'published': str(row['published'])
        }
    })

# Save in JSON format for easy loading into vector databases
import json
with open('arxiv_papers_vector_db_format.json', 'w') as f:
    json.dump(vector_db_data, f, indent=2)
print("Saved data in vector database format to 'arxiv_papers_vector_db_format.json'")

print(f"\nTotal storage sizes:")
print(f"  Metadata CSV: ~{os.path.getsize('arxiv_papers_metadata.csv')/1024:.1f} KB")
print(f"  JSON for vector DB: ~{os.path.getsize('arxiv_papers_vector_db_format.json')/1024:.1f} KB")
Saved data in vector database format to 'arxiv_papers_vector_db_format.json'

Total storage sizes:
  Metadata CSV: ~764.6 KB
  JSON for vector DB: ~15051.0 KB
  

Each storage method has its purpose:

  • CSV + numpy: Best for learning and small-scale experimentation
  • JSON in vector-database format: Best as a handoff format when you're loading embeddings into a vector database for production similarity search

Preparing for Semantic Search

You now have 500 research papers from five distinct computer science domains with embeddings that capture their semantic meaning. These embeddings are vectors, which means we can measure how similar or different they are using mathematical distance calculations.

In the next tutorial, you'll use these embeddings to build a search system that finds relevant papers based on meaning rather than keywords. You'll implement similarity calculations, rank results, and see firsthand how semantic search outperforms traditional keyword matching.

Save your embeddings now, especially the Cohere embeddings, since we'll use those in the next tutorial to build our search system. We chose Cohere because the embeddings are generated on Cohere's servers, so everyone gets the same baseline for implementing similarity calculations regardless of their local hardware.

Next Steps

Before we move on, try these experiments to deepen your understanding:

Experiment with different arXiv categories:

  • Try collecting papers from categories like stat.ML (Statistics Machine Learning) or math.OC (Optimization and Control)
  • Use the PCA visualization code to see how these new categories cluster with your existing five
  • Do some categories overlap more than others?

Compare embedding models visually:

  • Generate embeddings for your dataset using all-mpnet-base-v2
  • Create side-by-side PCA visualizations comparing the small model and large model
  • Do the clusters look tighter or more separated with the larger model?

Test different dataset sizes:

  • Collect just 50 papers per category (250 total) and visualize the results
  • Then try 200 papers per category (1000 total)
  • How does dataset size affect the clarity of the clusters?
  • At what point does collection or processing time become noticeable?

Ready to implement similarity search and build a working semantic search engine? The next tutorial will show you how to turn these embeddings into a powerful research discovery tool.


Key Takeaways:

  • Programmatic data collection through APIs like arXiv enables working with real-world datasets
  • Collecting papers from diverse categories (cs.LG, cs.CV, cs.CL, cs.DB, cs.SE) creates semantic clusters for effective search
  • Papers can be cross-listed in multiple arXiv categories, creating natural overlap between related fields
  • Self-hosted embedding models provide zero-cost, private embedding generation with full control over the process
  • API-based embedding services offer high-quality embeddings without infrastructure management
  • Secure credential handling using .env files protects sensitive API keys and tokens
  • Rate limits aren't always clearly documented and are sometimes discovered through trial and error
  • The choice between self-hosted and API approaches depends on cost, privacy, scale, and infrastructure considerations
  • Free tier APIs provide powerful embedding generation for learning, but require handling rate limits and delays that paid tiers avoid
  • Real-world embeddings show more overlap than handcrafted examples, reflecting the complexity of actual data
  • Proper storage of embeddings prevents costly regeneration and enables efficient reuse across projects

Setting Up Your Data Engineering Lab with Docker

28 October 2025 at 21:57

This guide helps you set up a clean, isolated environment for running Dataquest tutorials. While many tutorials work fine directly on your computer, some (particularly those involving data processing tools like PySpark) can run into issues depending on your operating system or existing software setup. The lab environment we'll create ensures everything runs consistently, with the right versions of Python and other tools, without affecting your main system.

What's a Lab Environment?

You can think of this "lab" as a separate workspace just for your Dataquest tutorials. It's a controlled space where you can experiment and test code without affecting your main computer. Just like scientists use labs for experiments, we'll use this development lab to work through tutorials safely.

Benefits for everyone:

  • Windows/Mac users: Avoid errors from system differences. No more "command not found" or PySpark failing to find files
  • Linux users: Get the exact versions of Python and Java needed for tutorials, without conflicting with your system's packages
  • Everyone: Keep your tutorial work separate from personal projects. Your code and files are saved normally, but any packages you install or system changes you make stay contained in the lab

We'll use a tool called Docker to create this isolated workspace. Think of it as having a dedicated computer just for tutorials inside your regular computer. Your files and code are saved just like normal (you can edit them with your favorite editor), but the tutorial environment itself stays clean and separate from everything else on your system.

The lab command you'll use creates this environment, and it mirrors real data engineering workflows (most companies use isolated environments like this to ensure consistency across their teams).

Installing Docker

Docker creates isolated Linux environments on any operating system. This means you'll get a consistent Linux workspace whether you're on Windows, Mac, or even Linux itself. We're using it as a simple tool, so no container orchestration or cloud deployment knowledge is needed.

On Windows:
Download Docker Desktop from docker.com/products/docker-desktop. Run the installer, restart your computer when prompted, and open Docker Desktop. You'll see a whale icon in your system tray when it's running.

Note: Docker Desktop will automatically enable required Windows features like WSL 2. If you see an error about virtualization, you may need to enable it in your computer's BIOS settings. Search online for your computer model + "enable virtualization" for specific steps.

On Mac:
Download Docker Desktop for your chip type (Intel or Apple Silicon) from the same link. Drag Docker to your Applications folder and launch it. You'll see the whale in your menu bar.

On Linux:
You probably already have Docker, but if not, run this command in your terminal:

curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh

Verify it works:
Open your terminal (PowerShell, Terminal, or bash) and run:

docker --version
docker compose version

You should see version numbers for both commands. If you see "command not found," restart your terminal or computer and try again.

Getting the Lab Environment

The lab is already configured in the Dataquest tutorials repository. Clone or download it:

git clone https://github.com/dataquestio/tutorials.git
cd tutorials

If you don't have git, download the repository as a ZIP file from GitHub and extract it.

The repository includes everything you need:

  • Dockerfile - Configures the Linux environment with Python 3.11 and Java (for Spark)
  • docker-compose.yml - Defines the lab setup
  • Tutorial folders with all the code and data

Starting Your Lab

In your IDE’s terminal, ensure you’re in the tutorials folder and start the lab:

docker compose run --rm lab

Note that the first time you run this command, the setup may take 2-5 minutes.

You're now in Linux! Your prompt will change to something like root@abc123:/tutorials#, a Linux command line where the tutorial commands will work as expected.

The --rm flag means the lab cleans itself up when you exit, keeping your system tidy.

Using Your Lab

Once you’re in the lab environment, here's your typical workflow:

1. Navigate to the tutorial you're working on

# See all available tutorials
ls

# Enter a specific tutorial
cd pyspark-etl

2. Install packages as needed
Each tutorial might need different packages:

# For PySpark tutorials
pip install pyspark

# For data manipulation tutorials
pip install pandas numpy

# For database connections
pip install sqlalchemy psycopg2-binary

3. Run the tutorial code

python <script-name>.py

Because the code will run within a standardized Linux environment, you shouldn’t run into setup errors.

4. Edit files normally
The beauty of this setup: you can still use your favorite editor! The tutorials folder on your computer is synchronized with the lab. Edit files in VS Code, PyCharm, or any editor, and the lab sees changes immediately.

5. Exit when done
Type exit or press Ctrl+D to leave the lab. The environment cleans itself up automatically.

Common Workflow Examples

Running a PySpark tutorial:

docker compose run --rm lab
cd pyspark-etl
pip install pyspark pandas
python main.py

Working with Jupyter notebooks:

docker compose run --rm -p 8888:8888 lab
pip install jupyterlab
jupyter lab --ip=0.0.0.0 --allow-root --no-browser
# Open the URL it shows in your browser

Keeping packages installed between sessions:
If you're tired of reinstalling packages, create a requirements file:

# After installing packages, save them
pip freeze > requirements.txt

# Next session, restore them
pip install -r requirements.txt

Quick Reference

The one command you need:

# From the tutorials folder
docker compose run --rm lab

Exit the lab:

exit # Or press Ctrl+D

Where things are:

  • Tutorial code: Each folder in /tutorials
  • Your edits: Automatically synchronized
  • Data files: In each tutorial's data/ folder
  • Output files: Save to the tutorial folder to see them on your computer

Adding services (databases, etc.):
For tutorials needing PostgreSQL, MongoDB, or other services, we can extend the docker-compose.yml. For now, the base setup handles all Python and PySpark tutorials.
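
For example, a hypothetical extension that adds a PostgreSQL service might look like the sketch below. The service name, image tag, and credentials are placeholders for illustration, not part of the repository's actual docker-compose.yml:

# Hypothetical addition to docker-compose.yml (adjust names and credentials)
services:
  lab:
    # ...existing lab configuration stays as-is...
    depends_on:
      - postgres

  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: tutorial
      POSTGRES_PASSWORD: tutorial
      POSTGRES_DB: tutorial_db
    ports:
      - "5432:5432"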

Troubleshooting

  • "Cannot connect to Docker daemon"

    • Docker Desktop needs to be running. Start it from your applications.
  • "docker compose" not recognized

    • Older Docker versions use docker-compose (with a hyphen). Try:

      docker-compose run --rm lab
  • Slow performance on Windows

    • Docker on Windows can be slow with large datasets. For better performance, store data files in the container rather than the mounted folder.
  • "Permission denied" on Linux

    • Add your user to the docker group:

      sudo usermod -aG docker $USER

      Then log out and back in.

You're Ready

You now have a Linux lab environment that matches production systems. Happy experimenting!

How to Become a Data Scientist (Yes, Even in 2025)

28 October 2025 at 16:35

The world is becoming increasingly data-driven. Data is one of the most valuable resources a company can have, but without a data scientist, it’s just numbers.

Businesses everywhere are looking for professionals who can turn raw data into clear insights. According to the U.S. Bureau of Labor Statistics, jobs for data scientists are expected to grow by 34% between 2024 and 2034, much faster than most careers.

Becoming a data scientist takes more than coding or statistics. It’s a mix of math, computer science, business knowledge, and communication skills. This combination makes the role both challenging and in demand.

I know it’s possible to get there. I started with a history degree and later became a machine learning engineer, data science consultant, and founder of Dataquest. With the right plan, you can do it too.

What is a Data Scientist?

A data scientist is someone who uses data to answer questions and solve problems. They collect large amounts of information, clean it, analyze it, and turn it into something actionable.

They use tools like Python, R, and SQL to manage and explore data. They apply statistics, machine learning, and data visualization to find patterns, understand trends, and make predictions.

Some data scientists build tools and systems for users, while others focus on helping businesses make better decisions by predicting future outcomes.

What Do Data Scientists Do?

Data scientists wear many hats. Their work depends on the company and the type of data they handle, but the goal is always the same: to turn data into useful insights that help people make data-driven decisions.

Data science powers everything from the algorithm showing you the next TikTok video to how ChatGPT answers questions to how Netflix recommends shows.

Some data scientist responsibilities include:

  • Collect and clean data from databases, APIs, and spreadsheets to prepare it for analysis.
  • Analyze and explore data to find trends, patterns, and relationships that explain what’s happening.
  • Build machine learning models and make predictions to forecast sales, detect fraud, or recommend products.
  • Visualize and communicate insights through charts and dashboards using tools like Tableau, Matplotlib, or Power BI.
  • Automate and improve systems by creating smarter processes, optimizing marketing campaigns, or building better recommendation engines.

In short, they help businesses make smarter decisions and work faster.

The Wrong and Right Way

When I started learning data science, I followed every online guide I could find, but I ended up bored and without real skills to show for it. It felt like a teacher handing me a pile of books and telling me to read them all.

Eventually, I realized I learn best when I’m solving problems that interest me. So instead of memorizing a checklist of skills, I began building real projects with real data. That approach kept me motivated and mirrored the work I’d actually do as a data scientist.

With that experience, I created Dataquest to help others learn the same way: by doing. But courses alone aren’t enough. To succeed, you need to learn how to think, plan, and execute effectively. This guide will show you how.

How to Become a Data Scientist:

  • Step 1: Earn a Degree (Recommended, Not Required)
  • Step 2: Learn the Core Skills
  • Step 3: Question Everything and Find Your Niche
  • Step 4: Build Projects
  • Step 5: Share Your Work
  • Step 6: Learn From Others
  • Step 7: Push Your Boundaries
  • Step 8: Start Looking for a Job

Now, let’s go over each of these one by one.

Step 1: Earn a Degree (Recommended, Not Required)

Most data scientists start with a degree in a technical field. According to Zippia, 51% of data scientists hold a bachelor’s degree, 34% a master’s, and 13% a doctorate.

A degree helps you build a solid foundation in math, statistics, and programming. It also shows employers that you can handle complex concepts and long-term projects.

Relevant degrees include computer science, statistics, mathematics, data science, or engineering.

If university isn’t an option, you can still learn online. Platforms like Dataquest, Coursera, edX, and Google Career Certificates have trusted online courses and programs that teach the same essential skills through practical, hands-on projects.

Step 2: Learn the Core Skills

Even if you can’t study at a university or enroll in a course, the internet and books offer everything you need to get started. So, let’s look at what you should learn.

If you come from a computer science background, many concepts like algorithms, logic, and data structures will feel familiar. If not, Python is a great starting point because it teaches those fundamentals in a practical way.

1. Programming languages

Start with Python. It’s beginner-friendly and powerful for data analysis, machine learning, and automation.

Learn how to:

  • Write basic code (variables, loops, functions)
  • Use data science libraries like pandas, NumPy, and Matplotlib
  • Work with raw data files (e.g., CSVs and JSON) and collect data via APIs

Once you’re comfortable with Python, consider learning R for statistics and SQL for managing and querying databases.

Helpful guides:

  1. How to learn Python
  2. How to learn R
  3. How to learn SQL

2. Math and Statistics

A strong understanding of math and statistics is essential in data science. It helps you make sense of data and build accurate models.

Focus on:

3. Data Handling and Visualization

Being able to clean, organize, and visualize data is a key part of any data scientist’s toolkit. These skills help you turn raw data into clear insights that others can easily understand.

You’ll use tools like Excel, Tableau, or Power BI to build dashboards and reports, and Python libraries like pandas and Matplotlib for deeper analysis and visualization.

Here are some learning paths to guide you:

4. Core Concepts

Once you’ve built a solid technical foundation, it’s time to understand how these skills fit into the bigger picture.

  • How machine learning models work
  • How to ask business questions and measure results
  • How to translate data insights into real business impact

Step 3: Question Everything and Find Your Niche

The data science and data analytics field is appealing because you get to answer interesting questions using actual data and code. These questions can range from “Can I predict whether a flight will be on time?” to “How much does the U.S. spend per student on education?”

To answer these questions, you need to develop an analytical mindset.

The best way to develop this mindset is to start by analyzing news articles. First, find a news article that discusses data. Here are two great examples: Can Running Make You Smarter? or Is Sugar Really Bad for You?

Then, think about the following:

  • How they reach their conclusions given the data they discuss
  • How you might design a study to investigate further
  • What questions you might want to ask if you had access to the underlying data

Some articles, like this one on gun deaths in the U.S. and this one on online communities supporting Donald Trump, actually have the underlying data available for download. This allows you to explore even deeper.

You could do the following:

  • Download the data, and open it in Excel or an equivalent tool
  • See what patterns you can find in the data by eyeballing it
  • Does the data support the conclusions of the article? Why or why not?
  • What additional questions can you use the data to answer?

Here are some good places to find data-driven articles:

Think About What You’re Interested In

After a few weeks of reading articles, reflect on whether you enjoyed coming up with questions and answering them. Becoming a data scientist is a long road, and you need to be passionate about the field to make it all the way. Which industry attracts you the most?

Perhaps you don't enjoy the process of coming up with questions in the abstract, but maybe you enjoy analyzing healthcare or finance data. Find what you're passionate about, and then start viewing it through an analytical lens.

Personally, I was very interested in stock market data, which motivated me to build a model to predict the market.

If you want to put in the months of hard work necessary to learn data science, working on something you’re passionate about will help you stay motivated when you face setbacks.

Step 4: Build Projects

As you’re learning the basics of coding, start applying your knowledge to get practical experience. Coursework isn't enough. Projects help you practice real-world techniques and develop the practical skills employers look for in the job market. It's a great way to test your knowledge.

Your projects don’t have to be complex. For example, you could analyze Super Bowl winners to find patterns, study weather data to predict rainfall, or explore movie ratings to see what drives popularity. The goal is to take an interesting dataset, ask good questions, and use code to answer them.

As you build projects, keep these points in mind:

  • Most real-world data science work involves data cleaning and preparation.
  • Simple machine learning algorithms like linear regression or decision trees are powerful starting points.
  • Focus on improving how you handle messy data, visualize insights, and communicate results. These are the techniques that make you stand out.
  • Everyone starts somewhere. Even small projects can show your creativity, logic, and problem-solving skills.

Building projects early helps you get practical experience that will make your portfolio much stronger when entering the job market.

As you're learning the basics of data science, start building projects that answer interesting questions and showcase your data science skills.

If you need help finding free datasets for your projects, we've got you covered!

Where to Find Project Ideas

Not only does building projects help you practice your skills and understand real data science work, it also helps you build a portfolio to show potential employers.

Here are some more detailed guides on building projects on your own:

Additionally, most of Dataquest’s courses contain interactive projects that you can complete while you’re learning.

Here are just a few examples:

  • Profitable App Profiles for the App Store and Google Play Markets — Explore the app market to see what makes an app successful on both iOS and Android. You’ll analyze real data and find out why some book-based apps perform better than others.
  • Exploring Hacker News Posts — Analyze a dataset of posts from Hacker News, a popular tech community, and find out which kinds of discussions get the most attention.
  • Exploring eBay Car Sales Data — Use Python to work with a scraped dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
  • Star Wars Survey — Analyze survey data from Star Wars fans and find fun patterns, like which movie is the most loved (or the most hated).
  • Analyzing NYC High School Data — Explore how different factors like income and race relate to SAT scores using scatter plots and maps.
  • Classifying Heart Disease — Go through the complete machine learning workflow of data exploration, data splitting, model creation, and model evaluation to develop a logistic regression classifier for detecting heart disease.

Our students have fun while practicing with these projects. Online courses don’t have to be boring.

Take It Up a Notch

After a few small projects, it’s time to level up! Start adding more complexity to your work so you can learn advanced topics. The key is to choose projects in an area that interests you.

For example, since I was interested in the stock market, I focused on predictive modeling. As your skills grow, you can make your projects more detailed, like using minute-by-minute data or improving prediction accuracy.

Check out our article on Python project ideas for more inspiration.

Step 5: Share Your Work

Once you've built a few data science projects, share them with others on GitHub! This might just be the way to find internships!

Here’s why:

  • It makes you think about how to best present your projects, which is what you'd do in a data science role.
  • They allow your peers to view your projects and provide feedback.
  • They allow employers to view your projects.

Helpful resources about project portfolios:

Start a Simple Blog

Besides uploading projects to GitHub, start a blog. Writing about what you learn helps you understand topics better and spot what you’ve missed. Teaching others is one of the fastest ways to master a concept.

When I was learning data science, writing blog posts helped me do the following:

  • Capture interest from recruiters
  • Learn concepts more thoroughly (the process of teaching really helps you learn)
  • Connect with peers

You can write about:

  • Explaining data science concepts in simple terms
  • Walking through your projects and findings
  • Sharing your learning journey

Here’s an example of a visualization I made on my blog many years ago that tries to answer the question: do the Simpsons characters like each other?

Step 6: Learn From Others

After you've started to build an online presence, it's a good idea to start engaging with other data scientists. You can do this in-person or in online communities.

Here are some good online communities:

Here at Dataquest, we have an online community where learners can receive feedback on projects, discuss tough data-related problems, and build relationships with data professionals.

Personally, I was very active on Quora and Kaggle when I was learning, which helped me immensely.

Engaging in online communities is a good way to do the following:

  • Find other people to learn with
  • Enhance your profile and find opportunities
  • Strengthen your knowledge by learning from others

You can also engage with people in person through Meetups. In-person engagement can help you meet and learn from more experienced data scientists in your area. Take all the opportunities to learn.

Step 7: Push Your Boundaries

What kind of data scientists do organizations want to hire? The ones who find critical insights that save money or make customers happier. Apply the same mindset to your learning: keep searching for new questions to answer, and keep tackling harder, more complex ones.

If you look back on your projects from a month or two ago, and you don’t see room for improvement, you probably aren't pushing your boundaries enough. You should be making strong progress every month, and your work should reflect that. Interesting projects will make you stand out among applicants.

Here are some ways to push your boundaries and learn data science faster:

  • Try working with a larger dataset
  • Start a data science project that requires knowledge you don't have
  • Try making your project run faster
  • Teach what you did in a project to someone else

Step 8: Start Looking for a Job

Once you’ve built a few projects and learned the core skills, it’s time to start applying, not “someday,” but now. Don’t wait until you feel completely ready. The truth is, no one ever does.

Start with internships, entry-level roles, or freelance gigs. These give you real-world experience and help you understand how data science works in a business setting. Even if the job description looks intimidating, apply anyway. Many employers list “ideal” requirements, not must-haves.

Don’t get stuck studying forever. The best learning happens on the job. Every interview, every project, and every rejection teaches you something new.

You never know, the opportunity that looks like a long shot might be the one that launches your data science career. The more practical experience you gain, the deeper your knowledge becomes.

Becoming a Data Scientist FAQs

I know what you might be thinking: Is it still worth pursuing a career in data science? Will AI replace data scientists, or will the role evolve with it? What skills do I actually need to keep up?

I get these questions a lot, and since I was once in your shoes, let me share what I’ve learned and help you find the right path.

Is data science still a good career choice?

Yes, a data science career is still a fantastic choice. Demand for data scientists is high, and the world is generating a massive (and increasing) amount of data every day.

We don't claim to have a crystal ball or know what the future holds, but data science is a fast-growing field with high demand and lucrative salaries.

Will AI replace data scientists?

AI won’t replace data scientists, but it will definitely change what they do. As AI tools become more advanced, data scientists will use them to make decisions faster and with greater accuracy. Instead of doing only technical work, they’ll focus more on strategy and big-picture analysis.

Data scientists will also work closely with AI engineers and machine learning specialists to develop and improve AI models. This includes tasks like choosing the right algorithms, engineering features, and making sure systems are fair and reliable.

To stay relevant, data scientists will need to expand their skills into areas such as machine learning, deep learning, and natural language processing. They’ll also play an important role in ethical AI, helping prevent bias, protect data privacy, and promote responsible use of technology.

Continuous learning will be essential as the field evolves, but AI isn’t replacing data scientists. It’s helping them become even more powerful problem solvers.

What are the AI skills a data scientist needs?

Every data scientist should have a knowledge of the basics, but as artificial intelligence becomes part of nearly every industry, learning AI-related skills is essential.

Start with a strong understanding of machine learning and the ability to use deep learning frameworks like TensorFlow and PyTorch. Learn natural language processing (NLP) for analyzing text data, and make sure you understand AI ethics, especially how to recognize and reduce bias in models.

It also helps to be comfortable with AI development tools and libraries, build some data engineering skills, and learn to work effectively in cross-disciplinary teams.

Continuous learning is key. AI evolves quickly, and the best data scientists keep experimenting, exploring new methods, and adapting their skills to stay ahead.

You’ve Got This!

Studying to become a data scientist or data engineer isn't easy, but the key is to stay motivated and enjoy what you're doing. If you're consistently building projects and sharing them, you'll build your expertise and get the data scientist job that you want.

After years of being frustrated with how conventional sites taught data science, I created Dataquest, a better way to learn data science online. Dataquest solves the problems of MOOCs, where you never know what course to take next, and you're never motivated by what you're learning.

Dataquest is just the lessons I've learned from helping thousands of people learn data science, and it focuses on making the learning experience engaging. Here, you'll build dozens of projects, and you'll learn all the skills you need to be a successful data scientist. Dataquest students have been hired at companies like Accenture and SpaceX.

I wish you all the best on your path to becoming a data scientist!

Understanding, Generating, and Visualizing Embeddings

27 October 2025 at 23:01

Imagine you're searching through a massive library of data science papers looking for content about "cleaning messy datasets." A traditional keyword search returns papers that literally contain those exact words. But it completely misses an excellent paper about "handling missing values and duplicates" and another about "data validation techniques." Even though these papers teach exactly what you're looking for, you'll never see them because they're using different words.

This is the fundamental problem with keyword-based searches: they match words, not meaning. When you search for "neural network training," it won't connect you to papers about "optimizing deep learning models" or "improving model convergence," despite these being essentially the same topic.

Embeddings solve this by teaching machines to understand meaning instead of just matching text. And if you're serious about building AI systems, generating embeddings is a fundamental concept you need to master.

What Are Embeddings?

Embeddings are numerical representations that capture semantic meaning. Instead of treating text as a collection of words to match, embeddings convert text into vectors (a list of numbers) where similar meanings produce similar vectors. Think of it like translating human language into a mathematical language that computers can understand and compare.

When we convert two pieces of text that mean similar things into embeddings, those embedding vectors will be mathematically close to each other in the embedding space. Think of the embedding space as a multi-dimensional map where meaning determines location. Papers about machine learning will cluster together. Papers about data cleaning will form their own group. And papers about data visualization? They'll gather in a completely different region. In a moment, we'll create a visualization that clearly demonstrates this.
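
Here's a tiny preview of that idea in code, using the sentence-transformers and scikit-learn libraries we'll install in the next section. The example sentences are just illustrations, and the exact similarity values will vary; what matters is the relative ordering:

# A quick preview: similar meanings produce vectors that are close together
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "cleaning messy datasets",                    # our search phrase
    "handling missing values and duplicates",     # same idea, different words
    "training a convolutional neural network",    # unrelated topic
]
vectors = model.encode(sentences)

print("Similar topics:  ", cosine_similarity([vectors[0]], [vectors[1]])[0][0])
print("Different topics:", cosine_similarity([vectors[0]], [vectors[2]])[0][0])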

Setting Up Your Environment

Before we start working directly with embeddings, let's install the libraries we'll need. We'll use sentence-transformers from Hugging Face to generate embeddings, sklearn for dimensionality reduction, matplotlib for visualization, and numpy to handle the numerical arrays we'll be working with.

This tutorial was developed using Python 3.12.12 with the following library versions. You can use these exact versions for guaranteed compatibility, or install the latest versions (which should work just fine):

# Developed with: Python 3.12.12
# sentence-transformers==5.1.1
# scikit-learn==1.6.1
# matplotlib==3.10.0
# numpy==2.0.2

pip install sentence-transformers scikit-learn matplotlib numpy

Run the command above in your terminal to install all required libraries. This will work whether you're using a Python script, Jupyter notebook, or any other development environment.

For this tutorial series, we'll work with research paper abstracts from arXiv.org, a repository where researchers publish cutting-edge AI and machine learning papers. If you're building AI systems, arXiv is a great resource to have. It's where you'll find the latest research on new architectures, techniques, and approaches that can help you implement the latest techniques in your projects.

arXiv is pronounced as "archive" because the X represents the Greek letter chi ⟨χ⟩.

For this tutorial, we've manually created 12 abstracts for papers spanning machine learning, data engineering, and data visualization. These abstracts are stored directly in our code as Python strings, keeping things simple for now. We'll work with APIs and larger datasets in the next tutorial to automate this process.

# Abstracts from three data science domains
papers = [
    # Machine Learning Papers
    {
        'title': 'Building Your First Neural Network with PyTorch',
        'abstract': '''Learn how to construct and train a neural network from scratch using PyTorch. This paper covers the fundamentals of defining layers, activation functions, and forward propagation. You'll build a multi-layer perceptron for classification tasks, understand backpropagation, and implement gradient descent optimization. By the end, you'll have a working model that achieves over 90% accuracy on the MNIST dataset.'''
    },
    {
        'title': 'Preventing Overfitting: Regularization Techniques Explained',
        'abstract': '''Overfitting is one of the most common challenges in machine learning. This guide explores practical regularization methods including L1 and L2 regularization, dropout layers, and early stopping. You'll learn how to detect overfitting by monitoring training and validation loss, implement regularization in both scikit-learn and TensorFlow, and tune regularization hyperparameters to improve model generalization on unseen data.'''
    },
    {
        'title': 'Hyperparameter Tuning with Grid Search and Random Search',
        'abstract': '''Selecting optimal hyperparameters can dramatically improve model performance. This paper demonstrates systematic approaches to hyperparameter optimization using grid search and random search. You'll learn how to define hyperparameter spaces, implement cross-validation during tuning, and use scikit-learn's GridSearchCV and RandomizedSearchCV. We'll compare both methods and discuss when to use each approach for efficient model optimization.'''
    },
    {
        'title': 'Transfer Learning: Using Pre-trained Models for Image Classification',
        'abstract': '''Transfer learning lets you leverage pre-trained models to solve new problems with limited data. This paper shows how to use pre-trained convolutional neural networks like ResNet and VGG for custom image classification tasks. You'll learn how to freeze layers, fine-tune network weights, and adapt pre-trained models to your specific domain. We'll build a classifier that achieves high accuracy with just a few hundred training images.'''
    },

    # Data Engineering/ETL Papers
    {
        'title': 'Handling Missing Data: Strategies and Best Practices',
        'abstract': '''Missing data can derail your analysis if not handled properly. This comprehensive guide covers detection methods for missing values, statistical techniques for understanding missingness patterns, and practical imputation strategies. You'll learn when to use mean imputation, forward fill, and more sophisticated approaches like KNN imputation. We'll work through real datasets with missing values and implement robust solutions using pandas.'''
    },
    {
        'title': 'Data Validation Techniques for ETL Pipelines',
        'abstract': '''Building reliable data pipelines requires thorough validation at every stage. This paper teaches you how to implement data quality checks, define validation rules, and catch errors before they propagate downstream. You'll learn schema validation, outlier detection, and referential integrity checks. We'll build a validation framework using Great Expectations and integrate it into an automated ETL workflow for production data systems.'''
    },
    {
        'title': 'Cleaning Messy CSV Files: A Practical Guide',
        'abstract': '''Real-world CSV files are rarely clean and analysis-ready. This hands-on paper walks through common data quality issues: inconsistent formatting, duplicate records, invalid entries, and encoding problems. You'll master pandas techniques for standardizing column names, removing duplicates, handling date parsing errors, and dealing with mixed data types. We'll transform a messy CSV with multiple quality issues into a clean dataset ready for analysis.'''
    },
    {
        'title': 'Building Scalable ETL Workflows with Apache Airflow',
        'abstract': '''Apache Airflow helps you build, schedule, and monitor complex data pipelines. This paper introduces Airflow's core concepts including DAGs, operators, and task dependencies. You'll learn how to define pipeline workflows, implement retry logic and error handling, and schedule jobs for automated execution. We'll build a complete ETL pipeline that extracts data from APIs, transforms it, and loads it into a data warehouse on a daily schedule.'''
    },

    # Data Visualization Papers
    {
        'title': 'Creating Interactive Dashboards with Plotly Dash',
        'abstract': '''Interactive dashboards make data exploration intuitive and engaging. This paper teaches you how to build web-based dashboards using Plotly Dash. You'll learn to create interactive charts with dropdowns, sliders, and date pickers, implement callbacks for dynamic updates, and design responsive layouts. We'll build a complete dashboard for exploring sales data with filters, multiple chart types, and real-time updates.'''
    },
    {
        'title': 'Matplotlib Best Practices: Making Publication-Quality Plots',
        'abstract': '''Creating clear, professional visualizations requires attention to design principles. This guide covers matplotlib best practices for publication-quality plots. You'll learn about color palette selection, font sizing and typography, axis formatting, and legend placement. We'll explore techniques for reducing chart clutter, choosing appropriate chart types, and creating consistent styling across multiple plots for research papers and presentations.'''
    },
    {
        'title': 'Data Storytelling: Designing Effective Visualizations',
        'abstract': '''Good visualizations tell a story and guide viewers to insights. This paper focuses on the principles of visual storytelling and effective chart design. You'll learn how to choose the right visualization for your data, apply pre-attentive attributes to highlight key information, and structure narratives through sequential visualizations. We'll analyze both effective and ineffective visualizations, discussing what makes certain design choices successful.'''
    },
    {
        'title': 'Building Real-Time Visualization Streams with Bokeh',
        'abstract': '''Visualizing streaming data requires specialized techniques and tools. This paper demonstrates how to create real-time updating visualizations using Bokeh. You'll learn to implement streaming data sources, update plots dynamically as new data arrives, and optimize performance for continuous updates. We'll build a live monitoring dashboard that displays streaming sensor data with smoothly updating line charts and real-time statistics.'''
    }
]

print(f"Loaded {len(papers)} paper abstracts")
print(f"Topics covered: Machine Learning, Data Engineering, and Data Visualization")
Loaded 12 paper abstracts
Topics covered: Machine Learning, Data Engineering, and Data Visualization

Generating Your First Embeddings

Now let's transform these paper abstracts into embeddings. We'll use a pre-trained model from the sentence-transformers library called all-MiniLM-L6-v2. We're using this model because it's fast and efficient for learning purposes, perfect for understanding the core concepts. In our next tutorial, we'll explore more recent production-grade models used in real-world applications.

The model will convert each abstract into an n-dimensional vector, where the value of n depends on the specific model architecture. Different embedding models produce vectors of different sizes. Some models create compact 128-dimensional embeddings, while others produce larger 768 or even 1024-dimensional vectors. Generally, larger embeddings can capture more nuanced semantic information, but they also require more computational resources and storage space.

Think of each dimension in the vector as capturing some aspect of the text's meaning. Maybe one dimension responds strongly to "machine learning" concepts, another to "data cleaning" terminology, and another to "visualization" language. The model learned these representations automatically during training.

Let's see what dimensionality our specific model produces.

from sentence_transformers import SentenceTransformer

# Load the pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract just the abstracts for embedding
abstracts = [paper['abstract'] for paper in papers]

# Generate embeddings for all abstracts
embeddings = model.encode(abstracts)

# Let's examine what we've created
print(f"Shape of embeddings: {embeddings.shape}")
print(f"Each abstract is represented by a vector of {embeddings.shape[1]} numbers")
print(f"\nFirst few values of the first embedding:")
print(embeddings[0][:10])
Shape of embeddings: (12, 384)
Each abstract is represented by a vector of 384 numbers

First few values of the first embedding:
[-0.06071806 -0.13064863  0.00328695 -0.04209436 -0.03220841  0.02034248
  0.0042156  -0.01300791 -0.1026612  -0.04565621]

Perfect! We now have 12 embeddings, one for each paper abstract. Each embedding is a 384-dimensional vector, represented as a NumPy array of floating-point numbers.

These numbers might look random at first, but they encode meaningful information about the semantic content of each abstract. When we want to find similar documents, we measure the cosine similarity between their embedding vectors. Cosine similarity looks at the angle between vectors. Vectors pointing in similar directions (representing similar meanings) have high cosine similarity, even if their magnitudes differ. In a later tutorial, we'll compute vector similarity using cosine, Euclidean, and dot-product methods to compare different approaches.
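
If you'd like a quick preview before then, here's a minimal sketch using scikit-learn's cosine_similarity (already installed above) and the embeddings we just generated. The paper indices are just illustrative picks: papers 1 and 4 are both machine learning abstracts, while paper 10 is about matplotlib. Your exact scores will vary, but the same-topic pair should score noticeably higher than the cross-topic pair.

from sklearn.metrics.pairwise import cosine_similarity

# Compare a same-topic pair against a cross-topic pair
ml_vs_ml = cosine_similarity([embeddings[0]], [embeddings[3]])[0][0]   # PyTorch vs. transfer learning
ml_vs_viz = cosine_similarity([embeddings[0]], [embeddings[9]])[0][0]  # PyTorch vs. matplotlib

print(f"Neural networks vs. transfer learning: {ml_vs_ml:.3f}")
print(f"Neural networks vs. matplotlib plots:  {ml_vs_viz:.3f}")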

Before we move on, let's verify we can retrieve the original text:

# Let's look at one paper and its embedding
print("Paper title:", papers[0]['title'])
print("\nAbstract:", papers[0]['abstract'][:100] + "...")
print("\nEmbedding shape:", embeddings[0].shape)
print("Embedding type:", type(embeddings[0]))
Paper title: Building Your First Neural Network with PyTorch

Abstract: Learn how to construct and train a neural network from scratch using PyTorch. This paper covers the ...

Embedding shape: (384,)
Embedding type: <class 'numpy.ndarray'>

Great! We can still access the original paper text alongside its embedding. Throughout this tutorial, we'll work with these embeddings while keeping track of which paper each one represents.

Making Sense of High-Dimensional Spaces

We now have 12 vectors, each with 384 dimensions. But here's the issue: humans can't visualize 384-dimensional space. We struggle to imagine even four dimensions! To understand what our embeddings have learned, we need to reduce them to two dimensions so that we can plot them on a graph.

This is where dimensionality reduction comes in. We'll use Principal Component Analysis (PCA), a technique that finds the two dimensions capturing the most variation in our data. It's like taking a 3D object and finding the best angle to photograph it in 2D while preserving as much information as possible.

While we're definitely going to lose some detail during this compression, our original 384-dimensional embeddings capture rich, nuanced information about semantic meaning. When we squeeze them down to 2D, some subtleties are bound to get lost. But the major patterns (which papers belong to which topic) will still be clearly visible.

from sklearn.decomposition import PCA

# Reduce embeddings from 384 dimensions to 2 dimensions
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

print(f"Original embedding dimensions: {embeddings.shape[1]}")
print(f"Reduced embedding dimensions: {embeddings_2d.shape[1]}")
print(f"\nVariance explained by these 2 dimensions: {pca.explained_variance_ratio_.sum():.2%}")
Original embedding dimensions: 384
Reduced embedding dimensions: 2

Variance explained by these 2 dimensions: 41.20%

The variance explained tells us how much of the variation in the original data is preserved in these 2 dimensions. Think of it this way: if all our papers were identical, they'd have zero variance. The more different they are, the more variance. We've kept about 41% of that variation, which is plenty to see the major patterns. The clustering itself depends on whether papers use similar vocabulary, not on how much variance we've retained. So even though 41% might seem relatively low, the major patterns separating different topics will still be very clear in our embedding visualization.
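
If you're curious how quickly that number grows, here's a small optional check (assuming the embeddings array from above) that fits PCA with more components and prints the cumulative variance explained. It isn't required for the visualization; it simply shows that each additional dimension preserves a bit more of the original information.

from sklearn.decomposition import PCA

# Optional: see how cumulative explained variance grows with more components
pca_check = PCA(n_components=10)
pca_check.fit(embeddings)

cumulative = pca_check.explained_variance_ratio_.cumsum()
for n, total in enumerate(cumulative, start=1):
    print(f"{n} components: {total:.2%} of variance explained")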

Understanding Our Tutorial Topics

Before we create our embeddings visualization, let's see how the 12 papers are organized by topic. This will help us understand the patterns we're about to see in the embeddings:

# Print papers grouped by topic
print("=" * 80)
print("PAPER REFERENCE GUIDE")
print("=" * 80)

topics = [
    ("Machine Learning", list(range(0, 4))),
    ("Data Engineering/ETL", list(range(4, 8))),
    ("Data Visualization", list(range(8, 12)))
]

for topic_name, indices in topics:
    print(f"\n{topic_name}:")
    print("-" * 80)
    for idx in indices:
        print(f"  Paper {idx+1}: {papers[idx]['title']}")
================================================================================
PAPER REFERENCE GUIDE
================================================================================

Machine Learning:
--------------------------------------------------------------------------------
  Paper 1: Building Your First Neural Network with PyTorch
  Paper 2: Preventing Overfitting: Regularization Techniques Explained
  Paper 3: Hyperparameter Tuning with Grid Search and Random Search
  Paper 4: Transfer Learning: Using Pre-trained Models for Image Classification

Data Engineering/ETL:
--------------------------------------------------------------------------------
  Paper 5: Handling Missing Data: Strategies and Best Practices
  Paper 6: Data Validation Techniques for ETL Pipelines
  Paper 7: Cleaning Messy CSV Files: A Practical Guide
  Paper 8: Building Scalable ETL Workflows with Apache Airflow

Data Visualization:
--------------------------------------------------------------------------------
  Paper 9: Creating Interactive Dashboards with Plotly Dash
  Paper 10: Matplotlib Best Practices: Making Publication-Quality Plots
  Paper 11: Data Storytelling: Designing Effective Visualizations
  Paper 12: Building Real-Time Visualization Streams with Bokeh

Now that we know which papers belong to which topic, let's visualize their embeddings.

Visualizing Embeddings to Reveal Relationships

We're going to create a scatter plot where each point represents one paper abstract. We'll color-code them by topic so we can see how the embeddings naturally group similar content together.

import matplotlib.pyplot as plt
import numpy as np

# Create the visualization
plt.figure(figsize=(8, 6))

# Define colors for different topics
colors = ['#0066CC', '#CC0099', '#FF6600']
categories = ['Machine Learning', 'Data Engineering/ETL', 'Data Visualization']

# Create color mapping for each paper
color_map = []
for i in range(12):
    if i < 4:
        color_map.append(colors[0])  # Machine Learning
    elif i < 8:
        color_map.append(colors[1])  # Data Engineering
    else:
        color_map.append(colors[2])  # Data Visualization

# Plot each paper
for i, (x, y) in enumerate(embeddings_2d):
    plt.scatter(x, y, c=color_map[i], s=275, alpha=0.7, edgecolors='black', linewidth=1)
    # Add paper numbers as labels
    plt.annotate(str(i+1), (x, y), fontsize=10, fontweight='bold',
                ha='center', va='center')

plt.xlabel('First Principal Component', fontsize=14)
plt.ylabel('Second Principal Component', fontsize=14)
plt.title('Paper Embeddings from Three Data Science Topics\n(Papers close together have similar semantic meaning)',
          fontsize=15, fontweight='bold', pad=20)

# Add a legend showing which colors represent which topics
legend_elements = [plt.Line2D([0], [0], marker='o', color='w',
                              markerfacecolor=colors[i], markersize=12,
                              label=categories[i]) for i in range(len(categories))]
plt.legend(handles=legend_elements, loc='best', fontsize=12)

plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

What the Visualization Reveals About Semantic Similarity

Take a look at the visualization below that was generated using the code above. As you can see, the results are pretty striking! The embeddings have naturally organized themselves into three distinct regions based purely on semantic content.

Keep in mind that we deliberately chose papers from very distinct topics to make the clustering crystal clear. This is perfect for learning, but real-world datasets are messier. When you're working with papers that bridge multiple topics or have overlapping vocabulary, you'll see more gradual transitions between clusters rather than these sharp separations. We'll encounter that reality in the next tutorial when we work with hundreds of real arXiv papers.

Paper Embeddings from Data Science Topics

  • The Machine Learning cluster (blue, papers 1-4) dominates the lower-left side of the plot. These four points sit close together because they all discuss neural networks, training, and model optimization. Look at papers 1 and 4. They're positioned very near each other despite one focusing on building networks from scratch and the other on transfer learning. The embedding model recognizes that they both use the core language of deep learning: layers, weights, training, and model architectures.
  • The Data Engineering/ETL cluster (magenta, papers 5-8) occupies the upper portion of the plot. These papers share vocabulary around data quality, pipelines, and validation. Notice how papers 5, 6, and 7 form a tight grouping. They all discuss data quality issues using terms like "missing values," "validation," and "cleaning." Paper 8 (about Airflow) sits slightly apart from the others, which makes sense: while it's definitely about data engineering, it focuses more on workflow orchestration than data quality, giving it a slightly different semantic fingerprint.
  • The Data Visualization cluster (orange, papers 9-12) is gathered on the lower-right side. These four papers are packed close together because they all use visualization-specific vocabulary: "charts," "dashboards," "colors," and "interactive elements." The tight clustering here shows just how distinct visualization terminology is from both ML and data engineering language.

What's remarkable is the clear separation between all three clusters. The distance between the ML papers on the left and the visualization papers on the right tells us that these topics use fundamentally different vocabularies. There's minimal semantic overlap between "neural networks" and "dashboards," so they end up far apart in the embedding space.

How the Model Learned to Understand Meaning

The all-MiniLM-L6-v2 embedding model was trained on millions of text pairs, learning which words tend to appear together. When it sees an abstract full of words like "layers," "training," and "optimization," it produces an embedding vector that's mathematically similar to other texts with that same vocabulary pattern. The clustering emerges naturally from those learned associations.

Why This Matters for Your Work as an AI Engineer

Embeddings are foundational to the modern AI systems you'll build as an AI Engineer. Let's look at how embeddings enable the core technologies you'll work with:

  1. Building Intelligent Search Systems

    Traditional keyword search has a fundamental limitation: it can only find exact matches. If a user searches for "handling null values," they won't find documents about "missing data strategies" or "imputation techniques," even though these are exactly what they need. Embeddings solve this by understanding semantic similarity. When you embed both the search query and your documents, you can find relevant content based on meaning rather than word matching. The result is a search system that actually understands what you're looking for (a minimal sketch of this idea follows this list).

  2. Working with Vector Databases

    Vector databases are specialized databases that are built to store and query embeddings efficiently. Instead of SQL queries that match exact values, vector databases let you ask "find me all documents similar to this one" and get results ranked by semantic similarity. They're optimized for the mathematical operations that embeddings require, like calculating distances between high-dimensional vectors, which makes them essential infrastructure for AI applications. Modern systems often use hybrid search approaches that combine semantic similarity with traditional keyword matching to get the best of both worlds.

  3. Implementing Retrieval-Augmented Generation (RAG)

    RAG systems are one of the most powerful patterns in modern AI engineering. Here's how they work: you embed a large collection of documents (like company documentation, research papers, or knowledge bases). When a user asks a question, you embed their question and use that embedding to find the most relevant documents from your collection. Then you pass those documents to a language model, which generates an informed answer grounded in your specific data. Embeddings make the retrieval step possible because they're how the system knows which documents are relevant to the question.

  4. Creating AI Agents with Long-Term Memory

    AI agents that can remember past interactions and learn from experience need a way to store and retrieve relevant memories. Embeddings enable this. When an agent has a conversation or completes a task, you can embed the key information and store it in a vector database. Later, when the agent encounters a similar situation, it can retrieve relevant past experiences by finding embeddings close to the current context. This gives agents the ability to learn from history and make better decisions over time. In practice, long-term agent memory often uses similarity thresholds and time-weighted retrieval to prevent irrelevant or outdated information from being recalled.
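
To make the search idea above concrete, here's a minimal sketch that reuses the model, papers, and embeddings from earlier in this tutorial. The query is a hypothetical user question that shares almost no keywords with the paper titles; ranking by cosine similarity should still surface the missing-data paper near the top, though your exact scores will differ.

from sklearn.metrics.pairwise import cosine_similarity

# A hypothetical user question worded differently than any paper title
query = "How do I handle null values in my dataset?"
query_embedding = model.encode([query])

# Rank all 12 papers by semantic similarity to the query
scores = cosine_similarity(query_embedding, embeddings)[0]
top_indices = scores.argsort()[::-1][:3]

print(f"Query: {query}\n")
for idx in top_indices:
    print(f"{scores[idx]:.3f}  {papers[idx]['title']}")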

These four applications (search, vector databases, RAG, and AI agents) are foundational tools for any aspiring AI Engineer's toolkit. Each builds on embeddings as a core technology. Understanding how embeddings capture semantic meaning is the first step toward building production-ready AI systems.

Advanced Topics to Explore

As you continue learning about embeddings, you'll encounter several advanced techniques that are widely used in production systems:

  • Multimodal Embeddings allow you to embed different types of content (text, images, audio) into the same embedding space. This enables powerful cross-modal search capabilities, like finding images based on text descriptions or vice versa. Models like CLIP demonstrate how effective this approach can be.
  • Instruction-Tuned Embeddings are models fine-tuned to better understand specific types of queries or instructions. These specialized models often outperform general-purpose embeddings for domain-specific tasks like legal document search or medical literature retrieval.
  • Quantization reduces the precision of embedding values (from 32-bit floats to 8-bit integers, for example), which can dramatically reduce storage requirements and speed up similarity calculations with minimal impact on search quality. This becomes crucial when working with millions of embeddings (a small illustration follows this list).
  • Dimension Truncation takes advantage of the fact that the most important information in embeddings is often concentrated in the first dimensions. By keeping only the first 256 dimensions of a 768-dimensional embedding, you can achieve significant efficiency gains while preserving most of the semantic information.
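
As a quick taste of quantization, the sketch below converts our float32 embeddings to int8 using simple symmetric scaling over the whole array. This is a deliberately minimal illustration, assuming the embeddings array from earlier; production systems calibrate scaling more carefully and usually rely on library support, but the storage savings are easy to see even here.

import numpy as np

# Minimal int8 quantization: scale so the largest absolute value maps to 127
scale = np.abs(embeddings).max() / 127
embeddings_int8 = np.round(embeddings / scale).astype(np.int8)

print(f"float32 storage: {embeddings.nbytes} bytes")
print(f"int8 storage:    {embeddings_int8.nbytes} bytes")

# Dequantize and check how much precision we lost
restored = embeddings_int8.astype(np.float32) * scale
print(f"Max reconstruction error: {np.abs(embeddings - restored).max():.4f}")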

These techniques become increasingly important as you scale from prototype to production systems handling real-world data volumes.

Building Toward Production Systems

You've now learned the following core foundational embedding concepts:

  • Embeddings convert text into numerical vectors that capture meaning
  • Similar content produces similar vectors
  • These relationships can be visualized to understand how the model organizes information

But we've only worked with 12 handwritten paper abstracts. This is perfect for getting the core concept, but real applications need to handle hundreds or thousands of documents.

In the next tutorial, we'll scale up dramatically. You'll learn how to collect documents programmatically using APIs, generate embeddings at scale, and make strategic decisions about different embedding approaches.

You'll also face the practical challenges that come with real data: rate limits on APIs, processing time for large datasets, the tradeoff between embedding quality and speed, and how to handle edge cases like empty documents or very long texts. These considerations separate a learning exercise from a production system.

By the end of the next tutorial, you'll be equipped to build an embedding system that handles real-world data at scale. That foundation will prepare you for our final embeddings tutorial, where we'll implement similarity search and build a complete semantic search engine.

Next Steps

For now, experiment with the code above:

  • Try replacing one of the paper abstracts with content from your own learning.
    • Where does it appear on the visualization?
    • Does it cluster with one of our three topics, or does it land somewhere in between?
  • Add a paper abstract that bridges two topics, like "Using Machine Learning to Optimize ETL Pipelines."
    • Does it position itself between the ML and data engineering clusters?
    • What does this tell you about how embeddings handle multi-topic content?
  • Try changing the embedding model to see how it affects the visualization.
    • Models like all-mpnet-base-v2 produce different dimensional embeddings.
    • Do the clusters become tighter or more spread out?
  • Experiment with adding a completely unrelated abstract, like a cooking recipe or news article.
    • Where does it land relative to our three data science clusters?
    • How far away is it from the nearest cluster?

This hands-on exploration and experimentation will deepen your intuition about how embeddings work.

Ready to scale things up? In the next tutorial, we'll work with real arXiv data and build an embedding system that can handle thousands of papers. See you there!


Key Takeaways:

  • Embeddings convert text into numerical vectors that capture semantic meaning
  • Similar meanings produce similar vectors, enabling mathematical comparison of concepts
  • Papers from different topics cluster separately because they use distinct vocabulary
  • Dimensionality reduction (like PCA) helps visualize high-dimensional embeddings in 2D
  • Embeddings power modern AI systems, including semantic search, vector databases, RAG, and AI agents

20 Fun (and Unique) Data Analyst Projects for Beginners

26 October 2025 at 21:23

You're here because you're serious about becoming a data analyst. You've probably noticed that just about every data analytics job posting asks for experience. But how do you get experience if you're just starting out? The answer: by building a solid portfolio of data analytics projects, so you can land a job as a junior data analyst even with no prior work experience.


Your portfolio is your ticket to proving your capabilities to a potential employer. Even without previous job experience, a well-curated collection of data analytics projects can set you apart from the competition. They demonstrate your ability to tackle real-world problems with real data, showcasing how you clean datasets, create compelling visualizations, and extract meaningful insights, all skills that are in high demand.

You just have to pick the ones that speak to you and get started!

Getting started with data analytics projects

So, you're ready to tackle your first data analytics project? Awesome! Let's break down what you need to know to set yourself up for success.

Our curated list of 20 projects below will help you develop the most sought-after data analysis skills and practice with the most frequently used data analysis tools, including Python, Jupyter Notebook, R, Excel, Tableau, and Power BI.

Setting up an effective development environment is also vital. Begin by creating a Python environment with Conda or venv. Use version control like Git to track project changes. Combine an IDE like Jupyter Notebook with core Python libraries to boost your productivity.

Remember, Rome wasn't built in a day! Start your data analysis journey with bite-sized projects to steadily build your skills. Keep learning, stay curious, and enjoy the ride. Before you know it, you'll be tackling real-world data challenges like the professionals do.

20 Data Analyst Projects for Beginners

Each project listed below will help you apply what you've learned to real data, growing your abilities one step at a time. While they are tailored towards beginners, some will be more challenging than others. By working through them, you'll create a portfolio that shows a potential employer you have the practical skills to analyze data on the job.

The data analytics projects below cover a range of analysis techniques, applications, and tools:

  1. Learn and Install Jupyter Notebook
  2. Profitable App Profiles for the App Store and Google Play Markets
  3. Exploring Hacker News Posts
  4. Clean and Analyze Employee Exit Surveys
  5. Star Wars Survey
  6. Word Raider
  7. Install RStudio
  8. Creating An Efficient Data Analysis Workflow
  9. Creating An Efficient Data Analysis Workflow, Part 2
  10. Preparing Data with Excel
  11. Visualizing the Answer to Stock Questions Using Spreadsheet Charts
  12. Identifying Customers Likely to Churn for a Telecommunications Provider
  13. Data Prep in Tableau
  14. Business Intelligence Plots
  15. Data Presentation
  16. Modeling Data in Power BI
  17. Visualization of Life Expectancy and GDP Variation Over Time
  18. Building a BI App
  19. Analyzing Kickstarter Projects
  20. Analyzing Startup Fundraising Deals from Crunchbase

In the following sections, you'll find step-by-step guides to walk you through each project. These detailed instructions will help you apply what you've learned and solidify your data analytics skills.

1. Learn and Install Jupyter Notebook

Overview

In this beginner-level project, you'll assume the role of a Jupyter Notebook novice aiming to gain the essential skills for real-world data analytics projects. You'll practice running code cells, documenting your work with Markdown, navigating Jupyter using keyboard shortcuts, mitigating hidden state issues, and installing Jupyter locally. By the end of the project, you'll be equipped to use Jupyter Notebook to work on data analytics projects and share compelling, reproducible notebooks with others.

Tools and Technologies

  • Jupyter Notebook
  • Python

Prerequisites

Before you take on this project, it's recommended that you have some foundational Python skills in place first.

Step-by-Step Instructions

  1. Get acquainted with the Jupyter Notebook interface and its components
  2. Practice running code cells and learn how execution order affects results
  3. Use keyboard shortcuts to efficiently navigate and edit notebooks
  4. Create Markdown cells to document your code and communicate your findings
  5. Install Jupyter locally to work on projects on your own machine

Expected Outcomes

Upon completing this project, you'll have gained practical experience and valuable skills, including:

  • Familiarity with the core components and workflow of Jupyter Notebook
  • Ability to use Jupyter Notebook to run code, perform analysis, and share results
  • Understanding of how to structure and document notebooks for real-world reproducibility
  • Proficiency in navigating Jupyter Notebook using keyboard shortcuts to boost productivity
  • Readiness to apply Jupyter Notebook skills to real-world data projects and collaborate with others

Relevant Links and Resources

Additional Resources

2. Profitable App Profiles for the App Store and Google Play Markets

Overview

In this guided project, you'll assume the role of a data analyst for a company that builds ad-supported mobile apps. By analyzing historical data from the Apple App Store and Google Play Store, you'll identify app profiles that attract the most users and generate the most revenue. Using Python and Jupyter Notebook, you'll clean the data, analyze it using frequency tables and averages, and make practical recommendations on the app categories and characteristics the company should target to maximize profitability.

Tools and Technologies

  • Python
  • Data Analytics
  • Jupyter Notebook

Prerequisites

This is a beginner-level project, but you should be comfortable working with Python functions and Jupyter Notebook:

  • Writing functions with arguments, return statements, and control flow
  • Debugging functions to ensure proper execution
  • Using conditional logic and loops within functions
  • Working with Jupyter Notebook to write and run code

Step-by-Step Instructions

  1. Open and explore the App Store and Google Play datasets
  2. Clean the datasets by removing non-English apps and duplicate entries
  3. Isolate the free apps for further analysis
  4. Determine the most common app genres and their characteristics using frequency tables
  5. Make recommendations on the ideal app profiles to maximize users and revenue

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Cleaning real-world data to prepare it for analysis
  • Analyzing app market data to identify trends and success factors
  • Applying data analysis techniques like frequency tables and calculating averages
  • Using data insights to inform business strategy and decision-making
  • Communicating your findings and recommendations to stakeholders

Relevant Links and Resources

Additional Resources

3. Exploring Hacker News Posts

Overview

In this project, you'll explore and analyze a dataset from Hacker News, a popular tech-focused community site. Using Python, you'll apply skills in string manipulation, object-oriented programming, and date management to uncover trends in user submissions and identify factors that drive community engagement. This hands-on project will strengthen your ability to interpret real-world datasets and enhance your data analysis skills.

Tools and Technologies

  • Python
  • Data cleaning
  • Object-oriented programming
  • Data Analytics
  • Jupyter Notebook

Prerequisites

To get the most out of this project, you should have some foundational Python and data cleaning skills, such as:

  • Employing loops in Python to explore CSV data
  • Utilizing string methods in Python to clean data for analysis
  • Processing dates from strings using the datetime library
  • Formatting dates and times for analysis using strftime

Step-by-Step Instructions

  1. Remove headers from a list of lists
  2. Extract 'Ask HN' and 'Show HN' posts
  3. Calculate the average number of comments for 'Ask HN' and 'Show HN' posts
  4. Find the number of 'Ask HN' posts and average comments by hour created
  5. Sort and print values from a list of lists

Expected Outcomes

After completing this project, you'll have gained practical experience and skills, including:

  • Applying Python string manipulation, OOP, and date handling to real-world data
  • Analyzing trends and patterns in user submissions on Hacker News
  • Identifying factors that contribute to post popularity and engagement
  • Communicating insights derived from data analysis

Relevant Links and Resources

Additional Resources

4. Clean and Analyze Employee Exit Surveys

Overview

In this hands-on project, you'll play the role of a data analyst for the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. Your task is to clean and analyze employee exit surveys from both institutes to identify insights into why employees resign. Using Python and pandas, you'll combine messy data from multiple sources, clean column names and values, analyze the data, and share your key findings.

Tools and Technologies

  • Python
  • Pandas
  • Data cleaning
  • Data Analytics
  • Jupyter Notebook

Prerequisites

Before starting this project, you should be familiar with:

  • Exploring and analyzing data using pandas
  • Aggregating data with pandas groupby operations
  • Combining datasets using pandas concat and merge functions
  • Manipulating strings and handling missing data in pandas

Step-by-Step Instructions

  1. Load and explore the DETE and TAFE exit survey data
  2. Identify missing values and drop unnecessary columns
  3. Clean and standardize column names across both datasets
  4. Filter the data to only include resignation reasons
  5. Verify data quality and create new columns for analysis
  6. Combine the cleaned datasets into one for further analysis
  7. Analyze the cleaned data to identify trends and insights

Expected Outcomes

By completing this project, you will:

  • Clean real-world, messy HR data to prepare it for analysis
  • Apply core data cleaning techniques in Python and pandas
  • Combine multiple datasets and conduct exploratory analysis
  • Analyze employee exit surveys to understand key drivers of resignations
  • Summarize your findings and share data-driven recommendations

Relevant Links and Resources

Additional Resources

5. Star Wars Survey

Overview

In this project designed for beginners, you'll become a data analyst exploring FiveThirtyEight's Star Wars survey data. Using Python and pandas, you'll clean messy data, map values, compute statistics, and analyze the data to uncover fan film preferences. By comparing results between demographic segments, you'll gain insights into how Star Wars fans differ in their opinions. This project provides hands-on practice with key data cleaning and analysis techniques essential for data analyst roles across industries.

Tools and Technologies

  • Python
  • Pandas
  • Jupyter Notebook

Prerequisites

Before starting this project, you should be comfortable cleaning and analyzing data with pandas.

Step-by-Step Instructions

  1. Map Yes/No columns to Boolean values to standardize the data
  2. Convert checkbox columns to lists and get them into a consistent format
  3. Clean and rename the ranking columns to make them easier to analyze
  4. Identify the highest-ranked and most-viewed Star Wars films
  5. Analyze the data by key demographic segments like gender, age, and location
  6. Summarize your findings on fan preferences and differences between groups

Expected Outcomes

After completing this project, you will have gained:

  • Experience cleaning and analyzing a real-world, messy dataset
  • Hands-on practice with pandas data manipulation techniques
  • Insights into the preferences and opinions of Star Wars fans
  • An understanding of how to analyze survey data for business insights

Relevant Links and Resources

Additional Resources

6. Word Raider

Overview

In this beginner-level Python project, you'll step into the role of a developer to create "Word Raider," an interactive word-guessing game. Although this project won't have you perform any explicit data analysis, it will sharpen your Python skills and make you a better data analyst. Using fundamental programming skills, you'll apply concepts like loops, conditionals, and file handling to build the game logic from the ground up. This hands-on project allows you to consolidate your Python knowledge by integrating key techniques into a fun application.

Tools and Technologies

  • Python
  • Jupyter Notebook

Prerequisites

Before diving into this project, you should have some foundational Python skills, such as working with loops, conditionals, and file handling.

Step-by-Step Instructions

  1. Build the word bank by reading words from a text file into a Python list
  2. Set up variables to track the game state, like the hidden word and remaining attempts
  3. Implement functions to receive and validate user input for their guesses
  4. Create the game loop, checking guesses against the hidden word and providing feedback
  5. Update the game state after each guess and check for a win or loss condition

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Strengthened proficiency in fundamental Python programming concepts
  • Experience building an interactive, text-based game from scratch
  • Practice with file I/O, data structures, and basic object-oriented design
  • Improved problem-solving skills and ability to translate ideas into code

Relevant Links and Resources

Additional Resources

7. Install RStudio

Overview

In this beginner-level project, you'll take the first steps in your data analysis journey by installing R and RStudio. As an aspiring data analyst, you'll set up a professional programming environment and explore RStudio's features for efficient R coding and analysis. Through guided exercises, you'll write scripts, import data, and create visualizations, building key foundations for your career.

Tools and Technologies

  • R
  • RStudio

Prerequisites

To complete this project, it's recommended to have basic knowledge of:

  • R syntax and programming fundamentals
  • Variables, data types, and arithmetic operations in R
  • Logical and relational operators in R expressions
  • Importing, exploring, and visualizing datasets in R

Step-by-Step Instructions

  1. Install the latest version of R and RStudio on your computer
  2. Practice writing and executing R code in the Console
  3. Import a dataset into RStudio and examine its contents
  4. Write and save R scripts to organize your code
  5. Generate basic data visualizations using ggplot2

Expected Outcomes

By completing this project, you'll gain essential skills including:

  • Setting up an R development environment with RStudio
  • Navigating RStudio's interface for data science workflows
  • Writing and running R code in scripts and the Console
  • Installing and loading R packages for analysis and visualization
  • Importing, exploring, and visualizing data in RStudio

Relevant Links and Resources

Additional Resources

8. Creating An Efficient Data Analysis Workflow

Overview

In this hands-on project, you'll step into the role of a data analyst hired by a company selling programming books. Your mission is to analyze their sales data to determine which titles are most profitable. You'll apply key R programming concepts like control flow, loops, and functions to develop an efficient data analysis workflow. This project provides valuable practice in data cleaning, transformation, and analysis, culminating in a structured report of your findings and recommendations.

Tools and Technologies

  • R
  • RStudio
  • Data Analytics

Prerequisites

To successfully complete this project, you should have foundational skills in control flow, iteration, and functions in R:

  • Implementing control flow using if-else statements
  • Employing for loops and while loops for iteration
  • Writing custom functions to modularize code
  • Combining control flow, loops, and functions in R

Step-by-Step Instructions

  1. Get acquainted with the provided book sales dataset
  2. Transform and prepare the data for analysis
  3. Analyze the cleaned data to identify top performing titles
  4. Summarize your findings in a structured report
  5. Provide data-driven recommendations to stakeholders

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Applying R programming concepts to real-world data analysis
  • Developing an efficient, reproducible data analysis workflow
  • Cleaning and preparing messy data for analysis
  • Analyzing sales data to derive actionable business insights
  • Communicating findings and recommendations to stakeholders

Relevant Links and Resources

Additional Resources

9. Creating An Efficient Data Analysis Workflow, Part 2

Overview

In this hands-on project, you'll step into the role of a data analyst at a book company tasked with evaluating the impact of a new program, launched on July 1, 2019, to encourage customers to buy more books. Using your data analysis skills in R, you'll clean and process the company's 2019 sales data to determine if the program successfully boosted book purchases and improved review quality. This project allows you to apply key R packages like dplyr, stringr, and lubridate to efficiently analyze a real-world business dataset and deliver actionable insights.

Tools and Technologies

  • R
  • RStudio
  • dplyr
  • stringr
  • lubridate

Prerequisites

To successfully complete this project, you should have some specialized data processing skills in R:

  • Manipulating strings using stringr functions
  • Working with dates and times using lubridate
  • Applying the map function to vectorize custom functions
  • Understanding and employing regular expressions for pattern matching

Step-by-Step Instructions

  1. Load and explore the book company's 2019 sales data
  2. Clean the data by handling missing values and inconsistencies
  3. Process the text reviews to determine positive/negative sentiment
  4. Compare key sales metrics like purchases and revenue before and after the July 1 program launch date
  5. Analyze differences in sales between customer segments

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Cleaning and preparing a real-world business dataset for analysis
  • Applying powerful R packages to manipulate and process data efficiently
  • Analyzing sales data to quantify the impact of a new initiative
  • Translating analysis findings into meaningful business insights

Relevant Links and Resources

Additional Resources

10. Preparing Data with Excel

Overview

In this hands-on project for beginners, you'll step into the role of a data professional in a marine biology research organization. Your mission is to prepare a raw dataset on shark attacks for an analysis team to study trends in attack locations and frequency over time. Using Excel, you'll import the data, organize it in worksheets and tables, handle missing values, and clean the data by removing duplicates and fixing inconsistencies. This project provides practical experience in the essential data preparation skills required for real-world analysis projects.

Tools and Technologies

  • Excel

Prerequisites

This project is designed for beginners. To complete it, you should be familiar with preparing data in Excel:

  • Importing data into Excel from various sources
  • Organizing spreadsheet data using worksheets and tables
  • Cleaning data by removing duplicates, fixing inconsistencies, and handling missing values
  • Consolidating data from multiple sources into a single table

Step-by-Step Instructions

  1. Import the raw shark attack data into an Excel workbook
  2. Organize the data into worksheets and tables with a logical structure
  3. Clean the data by removing duplicate entries and fixing inconsistencies
  4. Consolidate shark attack data from multiple sources into a single table

Expected Outcomes

By completing this project, you will gain:

  • Hands-on experience in data preparation and cleaning techniques using Excel
  • Foundational skills for importing, organizing, and cleaning data for analysis
  • An understanding of how to handle missing values and inconsistencies in a dataset
  • Ability to consolidate data from disparate sources into an analysis-ready format
  • Practical experience working with a real-world dataset on shark attacks
  • A solid foundation for data analysis projects and further learning in Excel

Relevant Links and Resources

Additional Resources

11. Visualizing the Answer to Stock Questions Using Spreadsheet Charts

Overview

In this hands-on project, you'll step into the shoes of a business analyst to explore historical stock market data using Excel. By applying information design concepts, you'll create compelling visualizations and craft an insightful report – building valuable skills for communicating data-driven insights that are highly sought-after by employers across industries.

Tools and Technologies

  • Excel
  • Data visualization
  • Information design principles

Prerequisites

To successfully complete this project, it's recommended to have foundational data visualization skills in Excel, such as:

  • Creating various chart types in Excel to visualize data
  • Selecting appropriate chart types to effectively present data
  • Applying design principles to create clear and informative charts
  • Designing charts for an audience using Gestalt principles

Step-by-Step Instructions

  1. Import the dataset to an Excel spreadsheet
  2. Create a report using data visualizations and tabular data
  3. Represent the data using effective data visualizations
  4. Apply Gestalt principles and pre-attentive attributes to all visualizations
  5. Maximize data-ink ratio in all visualizations

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Analyzing real-world stock market data in Excel
  • Applying information design principles to create effective visualizations
  • Selecting the best chart types to answer specific questions about the data
  • Combining multiple charts into a cohesive, insightful report
  • Developing in-demand data visualization and communication skills

Relevant Links and Resources

Additional Resources

12. Identifying Customers Likely to Churn for a Telecommunications Provider

Overview

In this beginner project, you'll take on the role of a data analyst at a telecommunications company. Your challenge is to explore customer data in Excel to identify profiles of those likely to churn. Retaining customers is crucial for telecom providers, so your insights will help inform proactive retention efforts. You'll conduct exploratory data analysis, calculating key statistics, building PivotTables to slice the data, and creating charts to visualize your findings. This project provides hands-on experience with core Excel skills for data-driven business decisions that will enhance your analyst portfolio.

Tools and Technologies

  • Excel

Prerequisites

To complete this project, you should feel comfortable exploring data in Excel:

  • Calculating descriptive statistics in Excel
  • Analyzing data with descriptive statistics
  • Creating PivotTables in Excel to explore and analyze data
  • Visualizing data with histograms and boxplots in Excel

Step-by-Step Instructions

  1. Import the customer dataset into Excel
  2. Calculate descriptive statistics for key metrics
  3. Create PivotTables, histograms, and boxplots to explore data differences
  4. Analyze and identify profiles of likely churners
  5. Compile a report with your data visualizations

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Hands-on practice analyzing a real-world customer dataset in Excel
  • Ability to calculate and interpret key statistics to profile churn risks
  • Experience building PivotTables and charts to slice data and uncover insights
  • Skill in translating analysis findings into an actionable report for stakeholders

Relevant Links and Resources

Additional Resources

13. Data Prep in Tableau

Overview

In this hands-on project, you'll take on the role of a data analyst for Dataquest to prepare their online learning platform data for analysis. You'll connect to Excel data, import tables into Tableau, and define table relationships to build a data model for uncovering insights on student engagement and performance. This project focuses on essential data preparation steps in Tableau, providing you with a robust foundation for data visualization and analysis.

Tools and Technologies

  • Tableau

Prerequisites

To successfully complete this project, you should have some foundational skills in preparing data in Tableau, such as:

  • Connecting to data sources in Tableau to access the required data
  • Importing data tables into the Tableau canvas
  • Defining relationships between tables in Tableau to combine data
  • Cleaning and filtering imported data in Tableau to prepare it for use

Step-by-Step Instructions

  1. Connect to the provided Excel file containing key tables on student engagement, course performance, and content completion rates
  2. Import the tables into Tableau and define the relationships between tables to create a unified data model
  3. Clean and filter the imported data to handle missing values, inconsistencies, or irrelevant information
  4. Save the prepared data source to use for creating visualizations and dashboards
  5. Reflect on the importance of proper data preparation for effective analysis

Expected Outcomes

By completing this project, you will gain valuable skills and experience, including:

  • Hands-on practice with essential data preparation techniques in Tableau
  • Ability to connect to, import, and combine data from multiple tables
  • Understanding of how to clean and structure data for analysis
  • Readiness to progress to creating visualizations and dashboards to uncover insights

Relevant Links and Resources

Additional Resources

14. Business Intelligence Plots

Overview

In this hands-on project, you'll step into the role of a data visualization consultant for Adventure Works. The company's leadership team wants to understand the differences between their online and offline sales channels. You'll apply your Tableau skills to build insightful, interactive data visualizations that provide clear comparisons and enable data-driven business decisions. Key techniques include creating calculated fields, applying filters, utilizing dual-axis charts, and embedding visualizations in tooltips. By the end, you'll have a set of powerful Tableau dashboards ready to share with stakeholders.

Tools and Technologies

  • Tableau

Prerequisites

To successfully complete this project, you should have a solid grasp of data visualization fundamentals in Tableau:

  • Navigating the Tableau interface and distinguishing between dimensions and measures
  • Constructing various foundational chart types in Tableau
  • Developing and interpreting calculated fields to enhance analysis
  • Employing filters to improve visualization interactivity

Step-by-Step Instructions

  1. Compare online vs offline orders using visualizations
  2. Analyze products across channels with scatter plots
  3. Embed visualizations in tooltips for added insight
  4. Summarize findings and identify next steps

Expected Outcomes

Upon completing this project, you'll have gained valuable skills and experience:

  • Practical experience building interactive business intelligence dashboards in Tableau
  • Ability to create calculated fields to conduct tailored analysis
  • Understanding of how to use filters and tooltips to enable data exploration
  • Skill in developing visualizations that surface actionable insights for stakeholders

Relevant Links and Resources

Additional Resources

15. Data Presentation

Overview

In this project, you'll step into the role of a data analyst exploring conversion funnel trends for a company's leadership team. Using Tableau, you'll build interactive dashboards that uncover insights about which marketing channels, locations, and customer personas drive the most value in terms of volume and conversion rates. By applying data visualization best practices and incorporating dashboard actions and filters, you'll create a professional, usable dashboard ready to present your findings to stakeholders.

Tools and Technologies

  • Tableau

Prerequisites

To successfully complete this project, you should be comfortable sharing insights in Tableau, such as:

  • Building basic charts like bar charts and line graphs in Tableau
  • Employing color, size, trend lines and forecasting to emphasize insights
  • Combining charts, tables, text and images into dashboards
  • Creating interactive dashboards with filters and quick actions

Step-by-Step Instructions

  1. Import and clean the conversion funnel data in Tableau
  2. Build basic charts to visualize key metrics
  3. Create interactive dashboards with filters and actions
  4. Add annotations and highlights to emphasize key insights
  5. Compile a professional dashboard to present findings

Expected Outcomes

Upon completing this project, you'll have gained practical experience and valuable skills, including:

  • Analyzing conversion funnel data to surface actionable insights
  • Visualizing trends and comparisons using Tableau charts and graphs
  • Applying data visualization best practices to create impactful dashboards
  • Adding interactivity to enable exploration of the data
  • Communicating data-driven findings and recommendations to stakeholders

Relevant Links and Resources

Additional Resources

16. Modeling Data in Power BI

Overview

In this hands-on project, you'll step into the role of an analyst at a company that sells scale model cars. Your mission is to model and analyze data from their sales records database using Power BI to extract insights that drive business decision-making. Power BI is a powerful business analytics tool that enables you to connect to, model, and visualize data. By applying data cleaning, transformation, and modeling techniques in Power BI, you'll prepare the sales data for analysis and develop practical skills in working with real-world datasets. This project provides valuable experience in extracting meaningful insights from raw data to inform business strategy.

Tools and Technologies

  • Power BI

Prerequisites

To successfully complete this project, you should know how to model data in Power BI, such as:

  • Designing a basic data model in Power BI
  • Configuring table and column properties in Power BI
  • Creating calculated columns and measures using DAX in Power BI
  • Reviewing the performance of measures, relationships, and visuals in Power BI

Step-by-Step Instructions

  1. Import the sales data into Power BI
  2. Clean and transform the data for analysis
  3. Design a basic data model in Power BI
  4. Create calculated columns and measures using DAX
  5. Build visualizations to extract insights from the data

Expected Outcomes

Upon completing this project, you'll have gained valuable skills and experience, including:

  • Hands-on practice modeling and analyzing real-world sales data in Power BI
  • Ability to clean, transform and prepare data for analysis
  • Experience extracting meaningful business insights from raw data
  • Developing practical skills in data modeling and analysis using Power BI


17. Visualization of Life Expectancy and GDP Variation Over Time

Overview

In this project, you'll step into the role of a data analyst tasked with visualizing life expectancy and GDP data over time to uncover trends and regional differences. Using Power BI, you'll apply data cleaning, transformation, and visualization skills to create interactive scatter plots and stacked column charts that reveal insights from the Gapminder dataset. This hands-on project allows you to practice the full life-cycle of report and dashboard development in Power BI. You'll load and clean data, create and configure visualizations, and publish your work to showcase your skills. By the end, you'll have an engaging, interactive dashboard to add to your portfolio.

Tools and Technologies

  • Power BI

Prerequisites

To complete this project, you should be able to visualize data in Power BI, such as:

  • Creating basic Power BI visuals
  • Designing accessible report layouts
  • Customizing report themes and visual markers
  • Publishing Power BI reports and dashboards

Step-by-Step Instructions

  1. Import the life expectancy and GDP data into Power BI
  2. Clean and transform the data for analysis
  3. Create interactive scatter plots and stacked column charts
  4. Design an accessible report layout in Power BI
  5. Customize visual markers and themes to enhance insights

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Applying data cleaning, transformation, and visualization techniques in Power BI
  • Creating interactive scatter plots and stacked column charts to uncover data insights
  • Developing an engaging dashboard to showcase your data visualization skills
  • Practicing the full life-cycle of Power BI report and dashboard development


18. Building a BI App

Overview

In this hands-on project, you'll step into the role of a business intelligence analyst at Dataquest, an online learning platform. Using Power BI, you'll import and model data on course completion rates and Net Promoter Scores (NPS) to assess course quality. You'll create insightful visualizations like KPI metrics, line charts, and scatter plots to analyze trends and compare courses. Leveraging this analysis, you'll provide data-driven recommendations on which courses Dataquest should improve.

Tools and Technologies

  • Power BI

Prerequisites

To successfully complete this project, you should have some foundational Power BI skills, such as how to manage workspaces and datasets:

  • Creating and managing workspaces
  • Importing and updating assets within a workspace
  • Developing dynamic reports using parameters
  • Implementing static and dynamic row-level security

Step-by-Step Instructions

  1. Import and explore the course completion and NPS data, looking for data quality issues
  2. Create a data model relating the fact and dimension tables
  3. Write calculations for key metrics like completion rate and NPS, and validate the results
  4. Design and build visualizations to analyze course performance trends and comparisons

Expected Outcomes

Upon completing this project, you'll have gained valuable skills and experience:

  • Importing, modeling, and analyzing data in Power BI to drive decisions
  • Creating calculated columns and measures to quantify key metrics
  • Designing and building insightful data visualizations to convey trends and comparisons
  • Developing impactful reports and dashboards to summarize findings
  • Sharing data stories and recommending actions via Power BI apps


19. Analyzing Kickstarter Projects

Overview

In this hands-on project, you'll step into the role of a data analyst to explore and analyze Kickstarter project data using SQL. You'll start by importing and exploring the dataset, followed by cleaning the data to ensure accuracy. Then, you'll write SQL queries to uncover trends and insights within the data, such as success rates by category, funding goals, and more. By the end of this project, you'll be able to use SQL to derive meaningful insights from real-world datasets.

Tools and Technologies

  • SQL

Prerequisites

To successfully complete this project, you should be comfortable working with SQL and databases, such as:

  • Basic SQL commands and querying
  • Data manipulation and joins in SQL
  • Experience with cleaning data and handling missing values

Step-by-Step Instructions

  1. Import and explore the Kickstarter dataset to understand its structure
  2. Clean the data to handle missing values and ensure consistency
  3. Write SQL queries to analyze the data and uncover trends
  4. Visualize the results of your analysis using SQL queries

Expected Outcomes

Upon completing this project, you'll have gained valuable skills and experience, including:

  • Proficiency in using SQL for data analysis
  • Experience with cleaning and analyzing real-world datasets
  • Ability to derive insights from Kickstarter project data


20. Analyzing Startup Fundraising Deals from Crunchbase

Overview

In this beginner-level guided project, you'll step into the role of a data analyst to explore a dataset of startup investments from Crunchbase. By applying your pandas and SQLite skills, you'll work with a large dataset to uncover insights on fundraising trends, successful startups, and active investors. This project focuses on developing techniques for handling memory constraints, selecting optimal data types, and leveraging SQL databases. You'll strengthen your ability to apply the pandas-SQLite workflow to real-world scenarios.

Tools and Technologies

  • Python
  • Pandas
  • SQLite
  • Jupyter Notebook

Prerequisites

Although this is a beginner-level SQL project, you'll need some solid skills in Python and data analysis before taking it on.

Step-by-Step Instructions

  1. Explore the structure and contents of the Crunchbase startup investments dataset
  2. Process the large dataset in chunks and load into an SQLite database
  3. Analyze fundraising rounds data to identify trends and derive insights
  4. Examine the most successful startup verticals based on total funding raised
  5. Identify the most active investors by number of deals and total amount invested

Expected Outcomes

Upon completing this guided project, you'll gain practical skills and experience, including:

  • Applying pandas and SQLite to analyze real-world startup investment data
  • Handling large datasets effectively through chunking and efficient data types
  • Integrating pandas DataFrames with SQL databases for scalable data analysis
  • Deriving actionable insights from fundraising data to understand startup success
  • Building a project for your portfolio showcasing pandas and SQLite skills


Choosing the right data analyst projects

Since the list of data analytics projects on the internet is exhaustive (and can be exhausting!), no one can be expected to build them all. So, how do you pick the right ones for your portfolio, whether they're guided or independent projects? Let's go over the criteria you should use to make this decision.

Passions vs. Interests vs. In-Demand skills

When selecting projects, it’s essential to strike a balance between your passions, interests, and in-demand skills. Here’s how to navigate these three factors:

  • Passions: Choose projects that genuinely excite you and align with your long-term goals. Passions are often areas you are deeply committed to and are willing to invest significant time and effort in. Working on something you are passionate about can keep you motivated and engaged, which is crucial for learning and completing the project.
  • Interests: Pick projects related to fields or a topic that sparks your curiosity or enjoyment. Interests might not have the same level of commitment as passions, but they can still make the learning process more enjoyable and meaningful. For instance, if you're curious about sports analytics or healthcare data, these interests can guide your project choices.
  • In-Demand Skills: Focus on projects that help you develop skills currently sought after in the job market. Research job postings and industry trends to identify which skills are in demand and tailor your projects to develop those competencies.

Steps to picking the right data analytics projects

  1. Assess your current skill level
    • If you’re a beginner, start with projects that focus on data cleaning (an essential skill), exploration, and visualization. Using Python libraries like Pandas and Matplotlib is an efficient way to build these foundational skills.
    • Utilize structured resources that provide both a beginner data analyst learning path and support to guide you through your first projects.
  2. Plan before you code
    • Clearly define your topic, project objectives, and key questions upfront to stay focused and aligned with your goals.
    • Choose appropriate data sources early in the planning process to streamline the rest of the project.
  3. Focus on the fundamentals
    • Clean your data thoroughly to ensure accuracy.
    • Use analytical techniques that align with your objectives.
    • Create clear, impactful visualizations of your findings.
    • Document your process for reproducibility and effective communication.
  4. Start small and scale up
  5. Seek feedback and iterate
    • Share your projects with peers, mentors, or online communities to get feedback.
    • Use this feedback to improve and refine your work.

Remember, it’s okay to start small and gradually take on bigger challenges. Each project you complete, no matter how simple, helps you gain skills and learn valuable lessons. Tackling a series of focused projects is one of the best ways to grow your abilities as a data professional. With each one, you’ll get better at planning, execution, and communication.

Conclusion

If you're serious about landing a data analytics job, project-based learning is key.

There’s a lot of data out there and a lot you can do with it. Trying to figure out where to start can be overwhelming. If you want a more structured approach to reaching your goal, consider enrolling in Dataquest’s Data Analyst in Python career path. It offers exactly what you need to land your first job as a data analyst or to grow your career by adding one of the most popular programming languages, in-demand data skills, and projects to your CV.

But if you’re confident in doing this on your own, the list of projects we’ve shared in this post will definitely help you get there. To continue improving, we encourage you to take on additional projects and share them in the Dataquest Community. This provides valuable peer feedback, helping you refine your projects to become more advanced and join the group of professionals who do this for a living.

Python Projects: 60+ Ideas for Beginners to Advanced (2025)

23 October 2025 at 18:46
Quick Answer: The best Python projects for beginners include building an interactive word game, analyzing your Netflix data, creating a password generator, or making a simple web scraper. These projects teach core Python skills like loops, functions, data manipulation, and APIs while producing something you can actually use. Below, you'll find 60+ project ideas organized by skill level, from beginner to advanced.

Completing Python projects is the ultimate way to learn the language. When you work on real-world projects, you not only retain more of the lessons you learn, but you'll also find it super motivating to push yourself to pick up new skills. Because let's face it, no one actually enjoys sitting in front of a screen learning random syntax for hours on end―particularly if it's not going to be used right away.

Python projects don't have this problem. Anything new you learn will stick because you're immediately putting it into practice. There's just one problem: many Python learners struggle to come up with their own Python project ideas to work on. But that's okay; we can help you with that!

Best Starter Python Projects

Here are a few beginner-friendly Python projects from the list below that are perfect for getting hands-on experience right away:

Choose one that excites you and just go with it! You’ll learn more by building than by reading alone.

Are You Ready for This?

If you have some programming experience, you might be ready to jump straight into building a Python project. However, if you’re just starting out, it’s vital you have a solid foundation in Python before you take on any projects. Otherwise, you run the risk of getting frustrated and giving up before you even get going. For those in need, we recommend taking either:

  1. Introduction to Python Programming course: meant for those looking to become a data professional while learning the fundamentals of programming with Python.
  2. Introduction to Python Programming course: meant for those looking to leverage the power of AI while learning the fundamentals of programming with Python.

In both courses, the goal is to quickly learn the basics of Python so you can start working on a project as soon as possible. You'll learn by doing, not by passively watching videos.

Selecting a Project

Our list below has 60+ fun and rewarding Python projects for learners at all levels. Some are free guided projects that you can complete directly in your browser via the Dataquest platform. Others are more open-ended, serving as inspiration as you build your Python skills. The key is to choose a project that resonates with you and just go for it!

Now, let’s take a look at some Python project examples. There is definitely something to get you started in this list.


Free Python Projects (Recommended):

These free Dataquest guided projects are a great place to start. They provide an embedded code editor directly in your browser, step-by-step instructions to help you complete the project, and community support if you happen to get stuck.

  1. Building an Interactive Word Game — In this guided project, you’ll use basic Python programming concepts to create a functional and interactive word-guessing game.

  2. Profitable App Profiles for the App Store and Google Play Markets — In this one, you’ll work as a data analyst for a company that builds mobile apps. You’ll use Python to analyze real app market data to find app profiles that attract the most users.

  3. Exploring Hacker News Posts — Use Python string manipulation, OOP, and date handling to analyze trends driving post popularity on Hacker News, a popular technology site.

  4. Learn and Install Jupyter Notebook — A guide to using and setting up Jupyter Notebook locally to prepare you for real-world data projects.

  5. Predicting Heart Disease — We're tasked with using a dataset from the World Health Organization to accurately predict a patient’s risk of developing heart disease based on their medical data.

  6. Analyzing Accuracy in Data Presentation — In this project, we'll step into the role of data journalists to analyze movie ratings data and determine if there’s evidence of bias in Fandango’s rating system.



More Projects to Help Build Your Portfolio:

  1. Finding Heavy Traffic Indicators on I-94 — Explore how pandas plotting functionality, combined with the Jupyter Notebook interface, lets you quickly build visualizations and spot indicators of heavy traffic.

  2. Storytelling Data Visualization on Exchange Rates — You'll assume the role of a data analyst tasked with creating an explanatory data visualization about Euro exchange rates to inform and engage an audience.

  3. Clean and Analyze Employee Exit Surveys — Work with exit surveys from employees of the Department of Education in Queensland, Australia. Play the role of a data analyst to analyze employee exit surveys and uncover insights about why employees resign.

  4. Star Wars Survey — In this data cleaning project, you’ll work with Jupyter Notebook to analyze data on the Star Wars movies to answer the hotly contested question, "Who shot first?"

  5. Analyzing NYC High School Data — For this project, you’ll assume the role of a data scientist analyzing relationships between SAT scores and demographic factors in NYC public schools to determine if the SAT is a fair test.

  6. Predicting the Weather Using Machine Learning — For this project, you’ll step into the role of a data scientist to predict tomorrow’s weather using historical data and machine learning, developing skills in data preparation, time series analysis, and model evaluation.

  7. Credit Card Customer Segmentation — For this project, we’ll play the role of a data scientist at a credit card company to segment customers into groups using K-means clustering in Python, allowing the company to tailor strategies for each segment.

Python Projects for AI Enthusiasts:

  1. Building an AI Chatbot with Streamlit — Build a simple website with an AI chatbot user interface similar to the OpenAI Playground in this intermediate-level project using Streamlit.

  2. Developing a Dynamic AI Chatbot — Create your very own AI-powered chatbot that can take on different personalities, keep track of conversation history, and provide coherent responses in this intermediate-level project.

  3. Building a Food Ordering App — Create a functional application using Python dictionaries, loops, and functions to create an interactive system for viewing menus, modifying carts, and placing orders.


Fun Python Projects for Building Data Skills:

  1. Exploring eBay Car Sales Data — Use Python to work with a scraped dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

  2. Find out How Much Money You’ve Spent on Amazon — Dig into your own spending habits with this beginner-level tutorial!

  3. Analyze Your Personal Netflix Data — Another beginner-to-intermediate tutorial that gets you working with your own personal dataset.

  4. Analyze Your Personal Facebook Data with Python — Are you spending too much time posting on Facebook? The numbers don’t lie, and you can find them in this beginner-to-intermediate Python project.

  5. Analyze Survey Data — This walk-through will show you how to set up Python and how to filter survey data from any dataset (or just use the sample data linked in the article).

  6. All of Dataquest’s Guided Projects — These guided data science projects walk you through building real-world data projects of increasing complexity, with suggestions for how to expand each project.

  7. Analyze Everything — Grab a free dataset that interests you, and start poking around! If you get stuck or aren’t sure where to start, our introduction to Python lessons are here to help, and you can try them for free!



Cool Python Projects for Game Devs:

  1. Rock, Paper, Scissors — Learn Python with a simple-but-fun game that everybody knows.

  2. Build a Text Adventure Game — This is a classic Python beginner project (it also pops up in this book) that’ll teach you many basic game setup concepts that are useful for more advanced games.

  3. Guessing Game — This is another beginner-level project that’ll help you learn and practice the basics.

  4. Mad Libs — Use Python code to make interactive Python Mad Libs!

  5. Hangman — Another childhood classic that you can make to stretch your Python skills.

  6. Snake — This is a bit more complex, but it’s a classic (and surprisingly fun) game to make and play.

Simple Python Projects for Web Devs:

  1. URL shortener — This free video course will show you how to build your own URL shortener like Bit.ly using Python and Django.

  2. Build a Simple Web Page with Django — This is a very in-depth, from-scratch tutorial for building a website with Python and Django, complete with cartoon illustrations!

Easy Python Projects for Aspiring Developers:

  1. Password generator — Build a secure password generator in Python.

  2. Use Tweepy to create a Twitter bot — This Python project idea is a bit more advanced, as you’ll need to use the Twitter API, but it’s definitely fun!

  3. Build an Address Book — This could start with a simple Python dictionary or become as advanced as something like this!

  4. Create a Crypto App with Python — This free video course walks you through using some APIs and Python to build apps with cryptocurrency data.


Additional Python Project Ideas

Still haven’t found a project idea that appeals to you? Here are many more, separated by experience level.

These aren’t tutorials; they’re just Python project ideas that you’ll have to dig into and research on your own, but that’s part of the fun! And it’s also part of the natural process of learning to code and working as a programmer.

The pros use Google and AI tools for answers all the time — so don’t be afraid to dive in and get your hands dirty!


Beginner Python Project Ideas

  1. Create a text encryption generator. This would take text as input, replace each letter with another letter, and output the “encoded” message.

  2. Build a countdown calculator. Write some code that can take two dates as input, and then calculate the amount of time between them. This will be a great way to familiarize yourself with Python’s datetime module (see the short sketch after this list).

  3. Write a sorting method. Given a list, can you write some code that sorts it alphabetically, or numerically? Yes, Python has this functionality built-in, but see if you can do it without using the sort() function!

  4. Build an interactive quiz application. Which Avenger are you? Build a personality or recommendation quiz that asks users some questions, stores their answers, and then performs some kind of calculation to give the user a personalized result based on their answers.

  5. Tic-Tac-Toe by Text. Build a Tic-Tac-Toe game that’s playable like a text adventure. Can you make it print a text-based representation of the board after each move?

  6. Make a temperature/measurement converter. Write a script that can convert Fahrenheit (℉) to Celsius (℃) and back, or inches to centimeters and back, etc. How far can you take it?

  7. Build a counter app. Take your first steps into the world of UI by building a very simple app that counts up by one each time a user clicks a button.

  8. Build a number-guessing game. Think of this as a bit like a text adventure, but with numbers. How far can you take it?

  9. Build an alarm clock. This is borderline beginner/intermediate, but it’s worth trying to build an alarm clock for yourself. Can you create different alarms? A snooze function?
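For the countdown calculator idea above (item 2), a tiny starting sketch using the datetime module might look like this; the date format is just one reasonable choice:

from datetime import datetime

# A tiny sketch of the countdown idea: parse two dates and report the gap.
start = datetime.strptime(input("Start date (YYYY-MM-DD): "), "%Y-%m-%d")
end = datetime.strptime(input("End date (YYYY-MM-DD): "), "%Y-%m-%d")

delta = end - start
print(f"{delta.days} days ({delta.days * 24} hours) between the two dates")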



Intermediate Python Project Ideas

  1. Build an upgraded text encryption generator. Starting with the project mentioned in the beginner section, see what you can do to make it more sophisticated. Can you make it generate different kinds of codes? Can you create a “decoder” app that reads encoded messages if the user inputs a secret key? Can you create a more sophisticated code that goes beyond simple letter-replacement?

  2. Make your Tic-Tac-Toe game clickable. Building off the beginner project, now make a version of Tic-Tac-Toe that has an actual UI you’ll use by clicking on open squares. Challenge: can you write a simple “AI” opponent for a human player to play against?

  3. Scrape some data to analyze. This could really be anything, from any website you like. The web is full of interesting data. If you learn a little about web-scraping, you can collect some really unique datasets.

  4. Build a clock website. How close can you get it to real-time? Can you implement different time zone selectors, and add in the “countdown calculator” functionality to calculate lengths of time?

  5. Automate some of your job. This will vary, but many jobs have some kind of repetitive process that you can automate! This intermediate project could even lead to a promotion.

  6. Automate your personal habits. Do you want to remember to stand up once every hour during work? How about writing some code that generates unique workout plans based on your goals and preferences? There are a variety of simple apps you can build to automate or enhance different aspects of your life.

  7. Create a simple web browser. Build a simple UI that accepts URLs and loads webpages. A GUI toolkit like PyQt (with QtWebEngine) will be helpful here! Can you add a “back” button, bookmarks, and other cool features?

  8. Write a notes app. Create an app that helps people write and store notes. Can you think of some interesting and unique features to add?

  9. Build a typing tester. This should show the user some text, and then challenge them to type it quickly and accurately. Meanwhile, you time them and score them on accuracy.

  10. Create a “site updated” notification system. Ever get annoyed when you have to refresh a website to see if an out-of-stock product has been relisted? Or to see if any news has been posted? Write a Python script that automatically checks a given URL for updates and informs you when it identifies one. Be careful not to overload the servers of whatever site you’re checking, though. Keep the time interval reasonable between each check.

  11. Recreate your favorite board game in Python. There are tons of options here, from something simple like Checkers all the way up to Risk. Or even more modern and advanced games like Ticket to Ride or Settlers of Catan. How close can you get to the real thing?

  12. Build a Wikipedia explorer. Build an app that displays a random Wikipedia page. The challenge here is in the details: can you add user-selected categories? Can you try a different “rabbit hole” version of the app, wherein each article is randomly selected from the articles linked in the previous article? This might seem simple, but it can actually require some serious web-scraping skills.



Advanced Python Project Ideas

  1. Build a stock market prediction app. For this one, you’ll need a source of stock market data and some machine learning and data analytics skills. Fortunately, many people have tried this, so there’s plenty of source code out there to work from.

  2. Build a chatbot. The challenge here isn’t so much making the chatbot as it is making it good. Can you, for example, implement some natural language processing techniques to make it sound more natural and spontaneous?

  3. Program a robot. This requires some hardware (which isn’t usually free), but there are many affordable options out there — and many learning resources, too. Definitely look into Raspberry Pi if you’re not already thinking along those lines.

  4. Build an image recognition app. Starting with handwriting recognition is a good idea — Dataquest has a guided data science project to help with that! Once you’ve learned it, you can take it to the next level.

  5. Create a sentiment analysis tool for social media. Collect data from various social media platforms, preprocess it, and then train a deep learning model to analyze the sentiment of each post (positive, negative, neutral).

  6. Make a price prediction model. Select an industry or product that interests you, and build a machine learning model that predicts price changes.

  7. Create an interactive map. This will require a mix of data skills and UI creation skills. Your map can display whatever you’d like — bird migrations, traffic data, crime reports — but it should be interactive in some way. How far can you take it?


Next Steps

Each of the examples in the previous sections built on the idea of choosing a great Python project for a beginner and then enhancing it as your Python skills progress. Next, you can advance to the following:

  • Think about what interests you, and choose a project that overlaps with your interests.

  • Think about your Python learning goals, and make sure your project moves you closer to achieving those goals.

  • Start small. Once you’ve built a small project, you can either expand it or build another one.

Now you’re ready to get started. If you haven’t learned the basics of Python yet, I recommend diving in with Dataquest’s Introduction to Python Programming course.

If you already know the basics, there’s no reason to hesitate! Now is the time to get in there and find your perfect Python project.

11 Must-Have Skills for Data Analysts in 2025

22 October 2025 at 19:06

Data is everywhere. Every click, purchase, or social media like creates mountains of information, but raw numbers do not tell a story. That is where data analysts come in. They turn messy datasets into actionable insights that help businesses grow.

Whether you're looking to become a junior data analyst or looking to level up, here are the top 11 data analyst skills every professional needs in 2025, including one optional skill that can help you stand out.

1. SQL

SQL (Structured Query Language) is the language of databases and is arguably the most important technical skill for analysts. It allows you to efficiently query and manage large datasets across multiple systems—something Excel cannot do at scale.

Example in action: Want last quarter's sales by region? SQL pulls it in seconds, no matter how huge the dataset.

Learning Tip: Start with basic queries, then explore joins, aggregations, and subqueries. Practicing data analytics exercises with SQL will help you build confidence and precision.
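To make the "sales by region" example concrete, here's a minimal sketch of the kind of query involved, run from Python against a hypothetical SQLite database (a sales.db file with a sales table containing region, amount, and order_date columns); last quarter is approximated as the previous three months:

import sqlite3

import pandas as pd

# Connect to a hypothetical local database of sales records.
conn = sqlite3.connect("sales.db")

query = """
SELECT region,
       SUM(amount) AS total_sales
FROM sales
WHERE order_date >= DATE('now', 'start of month', '-3 months')
GROUP BY region
ORDER BY total_sales DESC;
"""

# Run the SQL and pull the result straight into a DataFrame.
sales_by_region = pd.read_sql_query(query, conn)
print(sales_by_region)
conn.close()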

2. Excel

Microsoft Excel isn’t going anywhere, so it’s still worth learning. Beyond basic spreadsheets, it offers pivot tables, macros, and Power Query, which are perfect for quick analysis on smaller datasets. Many startups and lean teams still rely on Excel as their first database.

Example in action: Summarize thousands of rows of customer feedback in minutes with pivot tables, then highlight trends visually.

Learning Tip: Focus on pivot tables, logical formulas, and basic automation. Once comfortable, try linking Excel to SQL queries or automating repetitive tasks to strengthen your technical skills in data analytics.

3. Python or R

Python and R are essential for handling big datasets, advanced analytics, and automation. Python is versatile for cleaning data, automation, and integrating analyses into workflows, while R excels at exploratory data analysis and statistical analysis.

Example in action: Clean hundreds of thousands of rows with Python’s pandas library in seconds, something that would take hours in Excel.

Learning Tip: Start with data cleaning and visualization, then move to complex analyses like regression or predictive modeling. Building these data analyst skills is critical for anyone working in data science. Of course, which is better to learn is still up for debate.
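For a feel of what this looks like in practice, here's a small, hypothetical pandas cleaning sketch; it assumes a customers.csv with email, signup_date, customer_id, and age columns:

import pandas as pd

# Load the raw file, then apply a few vectorized cleaning steps.
df = pd.read_csv("customers.csv")

df["email"] = df["email"].str.strip().str.lower()              # normalize text
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df = df.drop_duplicates(subset="customer_id")                  # remove repeat rows
df["age"] = df["age"].fillna(df["age"].median())               # impute missing ages

print(f"{len(df)} clean rows ready for analysis")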

4. Data Visualization

Numbers alone rarely persuade anyone. Data visualization is how you make your insights clear and memorable. Tools like Tableau, Power BI, or Python/R libraries help you tell a story that anyone can understand.

Example in action: A simple line chart showing revenue trends can be far more persuasive than a table of numbers.

Learning Tip: Design visuals with your audience in mind. Recreate dashboards from online tutorials to practice clarity, storytelling, and your soft skills in communicating data analytics results.
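As a rough illustration, this sketch draws that kind of revenue trend line with matplotlib, assuming a hypothetical revenue.csv with month and revenue columns:

import matplotlib.pyplot as plt
import pandas as pd

# Load monthly revenue and plot it as a simple trend line.
revenue = pd.read_csv("revenue.csv", parse_dates=["month"])

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(revenue["month"], revenue["revenue"], marker="o")
ax.set_title("Monthly Revenue Trend")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (USD)")
fig.tight_layout()
fig.savefig("revenue_trend.png")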

5. Statistics & Analytics

Strong statistical analysis knowledge separates analysts who report numbers from those who generate insights. Skills like regression, correlation, hypothesis testing, and A/B testing help you interpret trends accurately.

Example in action: Before recommending a new marketing campaign, test whether the increase in sales is statistically significant or just random fluctuation.

Learning Tip: Focus on core probability and statistics concepts first, then practice applying them in projects. Our Probability and Statistics with Python skill path is a great way to learn theoretical concepts in a hands-on way.
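Here's a hedged sketch of that kind of significance check, using scipy and a hypothetical sales.csv with a group column ("control" or "campaign") and a revenue column:

import pandas as pd
from scipy import stats

# Split revenue by experiment group.
sales = pd.read_csv("sales.csv")
control = sales.loc[sales["group"] == "control", "revenue"]
campaign = sales.loc[sales["group"] == "campaign", "revenue"]

# Two-sample t-test: is the campaign lift more than random fluctuation?
t_stat, p_value = stats.ttest_ind(campaign, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The difference is statistically significant at the 5% level.")
else:
    print("The difference could plausibly be random fluctuation.")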

6. Data Cleaning & Wrangling

Data rarely comes perfect, so data cleaning skills will always be in demand. Cleaning and transforming datasets, removing duplicates, handling missing values, and standardizing formats are often the most time-consuming but essential parts of the job.

Example in action: You want to analyze customer reviews, but ratings are inconsistent and some entries are blank. Cleaning the data ensures your insights are accurate and actionable.

Learning Tip: Practice on free datasets or public data repositories to build real-world data analyst skills.
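One possible sketch of the review-cleaning scenario above, assuming a hypothetical reviews.csv with messy rating values and blank comments:

import pandas as pd

reviews = pd.read_csv("reviews.csv")

# Standardize ratings that arrive as "4", "4 stars", or " 4.0 ".
reviews["rating"] = (
    reviews["rating"]
    .astype(str)
    .str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    .astype(float)
)

reviews = reviews.dropna(subset=["rating"])            # drop unusable rows
reviews["comment"] = reviews["comment"].fillna("")     # keep blanks explicit
reviews = reviews.drop_duplicates()

print(reviews["rating"].describe())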

7. Communication & Presentation Skills

Analyzing data is only half the battle. Sharing your findings clearly is just as important. Being able to present insights in reports, dashboards, or meetings ensures your work drives decisions.

Example in action: Presenting a dashboard to a marketing team that highlights which campaigns brought the most new customers can influence next-quarter strategy.

Learning Tip: Practice explaining complex findings to someone without a technical background. Focus on clarity, storytelling, and visuals rather than technical jargon. Strong soft skills are just as valuable as your technical skills in data analytics.

8. Dashboard & Report Creation

Beyond visualizations, analysts need to build dashboards and reports that allow stakeholders to interact with data. A dashboard is not just a fancy chart. It is a tool that empowers teams to make data-driven decisions without waiting for you to interpret every number.

Example in action: A sales dashboard with filters for region, product line, and time period can help managers quickly identify areas for improvement.

Learning Tip: Start with simple dashboards in Tableau, Power BI, or Google Data Studio. Focus on making them interactive, easy to understand, and aligned with business goals. This is an essential part of professional data analytics skills.

9. Domain Knowledge

Understanding the industry or context of your data makes you exponentially more effective. Metrics and trends mean different things depending on the business.

Example in action: Knowing e-commerce metrics like cart abandonment versus subscription churn metrics can change how you interpret the same type of data.

Learning Tip: Study your company’s industry, read case studies, or shadow colleagues in different departments to build context. The more you know, the better your insights and analysis will be.

10. Critical Thinking & Problem-Solving

Numbers can be misleading. Critical thinking lets analysts ask the right questions, identify anomalies, and uncover hidden insights.

Example in action: Revenue drops in one region. Critical thinking helps you ask whether it is seasonal, a data error, or a genuine trend.

Learning Tip: Challenge assumptions and always ask “why” multiple times when analyzing a dataset. Practice with open-ended case studies to sharpen your analytical thinking and overall data analyst skills.

11. Machine Learning Basics

Not every analyst uses machine learning daily, but knowing the basics—predictive modeling, clustering, or AI-powered insights—can help you stand out. You do not need this skill to get started as an analyst, but familiarity with it is increasingly valuable for advanced roles.

Example in action: Using a simple predictive model to forecast next month’s sales trends can help your team allocate resources more effectively.

Learning Tip: Start small with beginner-friendly tools like Python’s scikit-learn library, then explore more advanced models as you grow. Treat it as an optional skill to explore once you are confident in SQL, Python/R, and statistical analysis.
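To show how simple such a forecast can be, here's a deliberately naive sketch using scikit-learn's LinearRegression on a hypothetical monthly_sales.csv (one row per month, in chronological order, with a sales column):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

sales = pd.read_csv("monthly_sales.csv")

# Use the month index as the only feature; real models would use richer inputs.
X = np.arange(len(sales)).reshape(-1, 1)
y = sales["sales"].to_numpy()

model = LinearRegression().fit(X, y)
next_month = model.predict([[len(sales)]])[0]
print(f"Naive forecast for next month: {next_month:,.0f}")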

Where to Learn These Skills

Want to become a data analyst? Dataquest makes it easy to learn the skills you need to get hired.

With our Data Analyst in Python and Data Analyst in R career paths, you’ll learn by doing real projects, not just watching videos. Each course helps you build the technical and practical skills employers look for.

By the end, you’ll have the knowledge, experience, and confidence to start your career in data analysis.

Wrapping It Up

Being a data analyst is not just about crunching numbers. It is about turning data into actionable insights that drive decisions. Master these data analytics and data analyst skills, and you will be prepared to handle the challenges of 2025 and beyond.

Getting Started with Claude Code for Data Scientists

16 October 2025 at 23:39

If you've spent hours debugging a pandas KeyError, or writing the same data validation code for the hundredth time, or refactoring a messy analysis script, you know the frustration of tedious coding work. Real data science work involves analytical thinking and creative problem-solving, but it also requires a lot of mechanical coding: boilerplate writing, test generation, and documentation creation.

What if you could delegate the mechanical parts to an AI assistant that understands your codebase and handles implementation details while you focus on the analytical decisions?

That's what Claude Code does for data scientists.

What Is Claude Code?

Claude Code is Anthropic's terminal-based AI coding assistant that helps you write, refactor, debug, and document code through natural language conversations. Unlike autocomplete tools that suggest individual lines as you type, Claude Code understands project context, makes coordinated multi-file edits, and can execute workflows autonomously.

Claude Code excels at generating boilerplate code for data loading and validation, refactoring messy scripts into clean modules, debugging obscure errors in pandas or numpy operations, implementing standard patterns like preprocessing pipelines, and creating tests and documentation. However, it doesn't replace your analytical judgment, make methodological decisions about statistical approaches, or fix poorly conceived analysis strategies.

In this tutorial, you'll learn how to install Claude Code, understand its capabilities and limitations, and start using it productively for data science work. You'll see the core commands, discover tips that improve efficiency, and see concrete examples of how Claude Code handles common data science tasks.

Key Benefits for Data Scientists

Before we get into installation, let's establish what Claude Code actually does for data scientists:

  1. Eliminate boilerplate code writing for repetitive patterns that consume time without requiring creative thought. File loading with error handling, data validation checks that verify column existence and types, preprocessing pipelines with standard transformations—Claude Code generates these in seconds rather than requiring manual implementation of logic you've written dozens of times before.
  2. Generate test suites for data processing functions covering normal operation, edge cases with malformed or missing data, and validation of output characteristics. Testing data pipelines becomes straightforward rather than work you postpone.
  3. Accelerate documentation creation for data analysis workflows by generating detailed docstrings, README files explaining project setup, and inline comments that explain complex transformations.
  4. Debug obscure errors more efficiently in pandas operations, numpy array manipulations, or scikit-learn pipeline configurations. Claude Code interprets cryptic error messages, suggests likely causes based on common patterns, and proposes fixes you can evaluate immediately.
  5. Refactor exploratory code into production-quality modules with proper structure, error handling, and maintainability standards. The transition from research notebook to deployable pipeline becomes faster and less painful.

These benefits translate directly to time savings on mechanical tasks, allowing you to focus on analysis, modeling decisions, and generating insights rather than wrestling with implementation details.

Installation and Setup

Let's get Claude Code installed and configured. The process takes about 10-15 minutes, including account creation and verification.

Step 1: Obtain Your Anthropic API Key

Navigate to console.anthropic.com and create an account if you don't have one. Once logged in, access the API keys section from the navigation menu on the left, and generate a new API key by clicking on + Create Key.

(Screenshot: creating an API key in the Anthropic console)

While you can generate a new key anytime from the console, you won’t be able to view an existing key again once it has been created. For this reason, copy your API key immediately and store it somewhere safe—you'll need it for authentication.

Always keep your API keys secure. Treat them like passwords and never commit them to version control or share them publicly.

Step 2: Install Claude Code

Claude Code installs via npm (Node Package Manager). If you don't have Node.js installed on your system, download it from nodejs.org before proceeding.

Once Node.js is installed, open your terminal and run:

npm install -g @anthropic-ai/claude-code

The -g flag installs Claude Code globally, making it available from any directory on your system.

Common installation issues:

  • "npm: command not found": You need to install Node.js first. Download it from nodejs.org and restart your terminal after installation.
  • Permission errors on Mac/Linux: Try sudo npm install -g @anthropic-ai/claude-code to install with administrator privileges.
  • PATH issues: If Claude Code installs successfully but the claude command isn't recognized, you may need to add npm's global directory to your system PATH. Run npm config get prefix to find the location, then add [that-location]/bin to your PATH environment variable.

Step 3: Configure Authentication

Set your API key as an environment variable so Claude Code can authenticate with Anthropic's servers:

export ANTHROPIC_API_KEY=your_key_here

Replace your_key_here with the actual API key you copied earlier from the Anthropic console.

To make this permanent (so you don't need to set your API key every time you open a terminal), add the export line above to your shell configuration file:

  • For bash: Add to ~/.bashrc or ~/.bash_profile
  • For zsh: Add to ~/.zshrc
  • For fish: Add to ~/.config/fish/config.fish

You can edit your shell configuration file using nano config_file_name. After adding the line, reload your configuration by running source ~/.bashrc (or whichever file you edited), or simply open a new terminal window.

Step 4: Verify Installation

Confirm that Claude Code is properly installed and authenticated:

claude --version

You should see version information displayed. If you get an error, review the installation steps above.

Try running Claude Code for the first time:

claude

This launches the Claude Code interface. You should see a welcome message and a prompt asking you to select the text style that looks best with your terminal:

(Screenshot: Claude Code welcome screen with the text-style prompt)

Use the arrow keys on your keyboard to select a text style and press Enter to continue.

Next, you’ll be asked to select a login method:

If you have an eligible subscription, select option 1. Otherwise, select option 2. For this tutorial, we will use option 2 (API usage billing).

(Screenshot: Claude Code login method selection)

Once your account setup is complete, you’ll see a welcome message showing the email address for your account:

(Screenshot: Claude Code setup complete message)

To exit the setup of Claude Code at any point, press Control+C twice.

Security Note

Claude Code can read files you explicitly include and generate code that loads data from files or databases. However, it doesn't automatically access your data without your instruction. You maintain full control over what files and information Claude Code can see. When working with sensitive data, be mindful of what files you include in conversation context and review all generated code before execution, especially code that connects to databases or external systems. For more details, see Anthropic’s Security Documentation.

Understanding the Costs

Claude Code itself is free software, but using it requires an Anthropic API key that operates on usage-based pricing:

  • Free tier: Limited testing suitable for evaluation
  • Pro plan ($20/month): Reasonable usage for individual data scientists conducting moderate development work
  • Pay-as-you-go: For heavy users working intensively on multiple projects, typically $6-12 daily for active development

Most practitioners doing regular but not continuous development work find the $20 Pro plan provides a good balance between cost and capability. Start with the free tier to evaluate effectiveness on your actual work, then upgrade based on demonstrated value.

Your First Commands

Now that Claude Code is installed and configured, let's walk through basic usage with hands-on examples.

Starting a Claude Code Session

Navigate to a project directory in your terminal:

cd ~/projects/customer_analysis

Launch Claude Code:

claude

You'll see the Claude Code interface with a prompt where you can type natural language instructions.

Understanding Your Project

Before asking Claude Code to make changes, it needs to understand your project context. Try starting with this exploratory command:

Explain the structure of this project and identify the key files.

Claude Code will read through your directory, examine files, and provide a summary of what it found. This shows that Claude Code actively explores and comprehends codebases before acting.

Your First Refactoring Task

Let's demonstrate Claude Code's practical value with a realistic example. Create a simple file called load_data.py with some intentionally messy code:

import pandas as pd

# Load customer data
data = pd.read_csv('/Users/yourname/Desktop/customers.csv')
print(data.head())

This works but has obvious problems: hardcoded absolute path, no error handling, poor variable naming, and no documentation.

Now ask Claude Code to improve it:

Refactor load_data.py to use best practices: configurable paths, error handling, descriptive variable names, and complete docstrings.

Claude Code will analyze the file and propose improvements. Instead of the hardcoded path, you'll get configurable file paths through command-line arguments. The error handling expands to catch missing files, empty files, and CSV parsing errors. Variable names become descriptive (customer_df or customer_data instead of generic data). A complete docstring appears documenting parameters, return values, and potential exceptions. The function adds proper logging to track what's happening during execution.
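The exact output varies from run to run, but the refactored script might look roughly like this sketch; the function and argument names here are illustrative, not what Claude Code will necessarily produce:

import argparse
import logging
from pathlib import Path

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load_customer_data(csv_path: Path) -> pd.DataFrame:
    """Load customer records from a CSV file.

    Raises FileNotFoundError if the file is missing and ValueError if it
    is empty or cannot be parsed as CSV.
    """
    if not csv_path.exists():
        raise FileNotFoundError(f"Customer file not found: {csv_path}")
    try:
        customer_df = pd.read_csv(csv_path)
    except (pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
        raise ValueError(f"Could not parse customer file: {csv_path}") from exc
    logger.info("Loaded %d rows from %s", len(customer_df), csv_path)
    return customer_df


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Load customer data")
    parser.add_argument("csv_path", type=Path, help="Path to customers.csv")
    args = parser.parse_args()
    print(load_customer_data(args.csv_path).head())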

Claude Code asks your permission before making these changes. Always review its proposal; if it looks good, approve it. If something seems off, ask for modifications or reject the changes entirely. This permission step ensures you stay in control while delegating the mechanical work.

What Just Happened

This demonstrates Claude Code's workflow:

  1. You describe what you want in natural language
  2. Claude Code analyzes the relevant files and context
  3. Claude Code proposes specific changes with explanations
  4. You review and approve or request modifications
  5. Claude Code applies approved changes

The entire refactoring took 90 seconds instead of 20-30 minutes of manual work. More importantly, Claude Code caught details you might have forgotten, such as adding logging, proper type hints, and handling multiple error cases. The permission-based approach ensures you maintain control while delegating implementation work.

Core Commands and Patterns

Claude Code provides several slash (/) commands that control its behavior and help you work more efficiently.

Important Slash Commands

@filename: Reference files directly in your prompts using the @ symbol. Example: @src/preprocessing.py or Explain the logic in @data_loader.py. Claude Code automatically includes the file's content in context. Use tab completion after typing @ to quickly navigate and select files.

/clear: Reset conversation context entirely, removing all history and file references. Use this when switching between different analyses, datasets, or project areas. Accumulated conversation history consumes tokens and can cause Claude Code to inappropriately reference outdated context. Think of /clear as starting a fresh conversation when you switch tasks.

/help: Display available commands and usage information. Useful when you forget command syntax or want to discover capabilities.

Context Management for Data Science Projects

Claude Code has token limits determining how much code it can consider simultaneously. For small projects with a few files, this rarely matters. For larger data science projects with dozens of notebooks and scripts, strategic context management becomes important.

Reference only files relevant to your current task using @filename syntax. If you're working on data validation, reference the validation script and related utilities (like @validation.py and @utils/data_checks.py) but exclude modeling and visualization code that won't influence the current work.

Effective Prompting Patterns

Claude Code responds best to clear, specific instructions. Compare these approaches:

  • Vague: "Make this code better"
    Specific: "Refactor this preprocessing function to handle missing values using median imputation for numerical columns and mode for categorical columns, add error handling for unexpected data types, and include detailed docstrings"
  • Vague: "Add tests"
    Specific: "Create pytest tests for the data_loader function covering successful loading, missing file errors, empty file handling, and malformed CSV detection"
  • Vague: "Fix the pandas error"
    Specific: "Debug the KeyError in line 47 of data_pipeline.py and suggest why it's failing on the 'customer_id' column"

Specific prompts produce focused, useful results. Vague prompts generate generic suggestions that may not address your actual needs.

Iteration and Refinement

Treat Claude Code's initial output as a starting point rather than expecting perfection on the first attempt. Review what it generates, identify improvements needed, and make follow-up requests:

"The validation function you created is good, but it should also check that dates are within reasonable ranges. Add validation that start_date is after 2000-01-01 and end_date is not in the future."

This iterative approach produces better results than attempting to specify every requirement in a single massive prompt.

Advanced Features

Beyond basic commands, several features improve your Claude Code experience for complex work.

  1. Activate plan mode: Press Shift+Tab before sending your prompt to enable plan mode, which creates an explicit execution plan before implementing changes. Use this for workflows with three or more distinct steps—like loading data, preprocessing, and generating outputs. The planning phase helps Claude maintain focus on the overall objective.

  2. Run commands with bash mode: Prefix prompts with an exclamation mark to execute shell commands and inject their output into Claude Code's context:

    ! python analyze_sales.py

    This runs your analysis script and adds complete output to Claude Code's context. You can then ask questions about the output or request interpretations of the results. This creates a tight feedback loop for iterative data exploration.

  3. Use extended thinking for complex problems: Include "think", "think harder", or "ultrathink" in prompts for thorough analysis:

    think harder: why does my linear regression show high R-squared but poor prediction on validation data?

    Extended thinking produces more careful analysis but takes longer (ultrathink can take several minutes). Apply this when debugging subtle statistical issues or planning sophisticated transformations.

  4. Resume previous sessions: Launch Claude Code with claude --resume to continue your most recent session with complete context preserved, including conversation history, file references, and established conventions all intact. This proves valuable for ongoing analysis where you want to pick up where you left off without re-explaining your entire analytical approach.

Optional Power User Setting

For personal projects where you trust all operations, launch with claude --dangerously-skip-permissions to bypass constant approval prompts. This carries risk if Claude Code attempts destructive operations, so use it only on projects where you maintain version control and can recover from mistakes. Never use this on production systems or shared codebases.

Configuring Claude Code for Data Science Projects

The CLAUDE.md file provides project-specific context that improves Claude Code's suggestions by explaining your conventions, requirements, and domain specifics.

Quick Setup with /init

The easiest way to create your CLAUDE.md file is using Claude Code's built-in /init command. From your project directory, launch Claude Code and run:

/init

Claude Code will analyze your project structure and ask you questions about your setup: what kind of project you're working on, your coding conventions, important files and directories, and domain-specific context. It then generates a CLAUDE.md file tailored to your project.

This interactive approach is faster than writing from scratch and ensures you don't miss important details. You can always edit the generated file later to refine it.

Understanding Your CLAUDE.md

Whether you used /init or prefer to create it manually, here's what a typical CLAUDE.md file looks like for a data science project on customer churn. The file lives in your project root directory, uses markdown format, and describes the project:

# Customer Churn Analysis Project

## Project Overview
Predict customer churn for a telecommunications company using historical
customer data and behavior patterns. The goal is identifying at-risk
customers for proactive retention efforts.

## Data Sources
- **Customer demographics**: data/raw/customer_info.csv
- **Usage patterns**: data/raw/usage_data.csv
- **Churn labels**: data/raw/churn_labels.csv

Expected columns documented in data/schemas/column_descriptions.md

## Directory Structure
- `data/raw/`: Original unmodified data files
- `data/processed/`: Cleaned and preprocessed data ready for modeling
- `notebooks/`: Exploratory analysis and experimentation
- `src/`: Production code for data processing and modeling
- `tests/`: Pytest tests for all src/ modules
- `outputs/`: Generated reports, visualizations, and model artifacts

## Coding Conventions
- Use pandas for data manipulation, scikit-learn for modeling
- All scripts should accept command-line arguments for file paths
- Include error handling for data quality issues
- Follow PEP 8 style guidelines
- Write pytest tests for all data processing functions

## Domain Notes
Churn is defined as customer canceling service within 30 days. We care
more about catching churners (recall) than minimizing false positives
because retention outreach is relatively low-cost.

This upfront investment takes 10-15 minutes but improves every subsequent interaction by giving Claude Code context about your project structure, conventions, and requirements.

Hierarchical Configuration for Complex Projects

CLAUDE.md files can be hierarchical. You might maintain a root-level CLAUDE.md describing overall project structure, plus subdirectory-specific files for different analysis areas.

For example, a project analyzing both customer behavior and financial performance might have:

  • Root CLAUDE.md: General project description, directory structure, and shared conventions
  • customer_analysis/CLAUDE.md: Specific details about customer data sources, relevant metrics like lifetime value and engagement scores, and analytical approaches for behavioral patterns
  • financial_analysis/CLAUDE.md: Financial data sources, accounting principles used, and approaches for revenue and cost analysis

Claude Code prioritizes the most specific configuration, so subdirectory files take precedence when working within those areas.

Custom Slash Commands

For frequently used patterns specific to your workflow, you can create custom slash commands. Create a .claude/commands directory in your project and add markdown files named for each slash command you want to define.

For example, .claude/commands/test.md:

Create pytest tests for: $ARGUMENTS

Requirements:
- Test normal operation with valid data
- Test edge cases: empty inputs, missing values, invalid types
- Test expected exceptions are raised appropriately
- Include docstrings explaining what each test validates
- Use descriptive test names that explain the scenario

Then /test my_preprocessing_function generates tests following your specified patterns.

These custom commands represent optional advanced customization. Start with basic CLAUDE.md configuration, and consider custom commands only after you've identified repetitive patterns in your prompting.

Practical Data Science Applications

Let's see Claude Code in action across some common data science tasks.

1. Data Loading and Validation

Generate robust data loading code with error handling:

Create a data loading function for customer_data.csv that:
- Accepts configurable file paths
- Validates expected columns exist with correct types
- Detects and logs missing value patterns
- Handles common errors like missing files or malformed CSV
- Returns the dataframe with a summary of loaded records

Claude Code generates a function that handles all these requirements. The code uses pathlib for cross-platform file paths, includes try-except blocks for multiple error scenarios, validates that required columns exist in the dataframe, logs detailed information about data quality issues like missing values, and provides clear exception messages when problems occur. This handles edge cases you might forget: missing files, parsing errors, column validation, and missing value detection with logging.
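
To make that concrete, here is a minimal sketch of the kind of loader such a prompt might produce. The file path, required column names, and module-level logger are illustrative assumptions, not the exact code Claude Code will generate:

# Illustrative sketch only -- path and column names are hypothetical
import logging
from pathlib import Path

import pandas as pd

logger = logging.getLogger(__name__)

REQUIRED_COLUMNS = {"customer_id", "tenure", "monthly_charges"}

def load_customer_data(path="data/raw/customer_data.csv"):
    """Load the customer CSV, validate expected columns, and log missing-value patterns."""
    file_path = Path(path)
    if not file_path.exists():
        raise FileNotFoundError(f"Input file not found: {file_path}")

    try:
        df = pd.read_csv(file_path)
    except pd.errors.ParserError as exc:
        raise ValueError(f"Could not parse {file_path}: {exc}") from exc

    missing_cols = REQUIRED_COLUMNS - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing expected columns: {sorted(missing_cols)}")

    # Log missing-value patterns instead of silently ignoring them
    null_counts = df.isna().sum()
    for column, count in null_counts[null_counts > 0].items():
        logger.warning("Column %s has %d missing values", column, count)

    logger.info("Loaded %d records from %s", len(df), file_path)
    return df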

2. Exploratory Data Analysis Assistance

Generate EDA code:

Create an EDA script for the customer dataset that generates:
- Distribution plots for numerical features (age, income, tenure)
- Count plots for categorical features (plan_type, region)
- Correlation heatmap for numerical variables
- Summary statistics table
Save all visualizations to outputs/eda/

Claude Code produces a complete analysis script with proper plot styling, figure organization, and file output, saving you 30-45 minutes of matplotlib configuration work.

3. Data Preprocessing Pipeline

Build a preprocessing module:

Create preprocessing.py with functions to:
- Handle missing values: median for numerical, mode for categorical
- Encode categorical variables using one-hot encoding
- Scale numerical features using StandardScaler
- Include type hints, docstrings, and error handling

The generated code includes proper sklearn patterns and documentation, and it handles edge cases like unseen categories during transform.
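
As a rough sketch of what that module might contain, the function below combines SimpleImputer, OneHotEncoder, and StandardScaler; the function name and feature lists are hypothetical:

# Illustrative sketch only -- feature lists are supplied by the caller
from typing import List

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(numerical: List[str], categorical: List[str]) -> ColumnTransformer:
    """Impute, scale, and encode features; tolerates unseen categories at transform time."""
    if not numerical and not categorical:
        raise ValueError("At least one numerical or categorical column is required")

    numeric_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),         # median for numerical features
        ("scale", StandardScaler()),
    ])
    categorical_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # mode for categorical features
        ("encode", OneHotEncoder(handle_unknown="ignore")),   # ignore categories unseen at fit time
    ])

    return ColumnTransformer([
        ("num", numeric_pipeline, numerical),
        ("cat", categorical_pipeline, categorical),
    ])

You would fit this on training data and reuse the fitted object on new data with preprocessor.transform(), which is what keeps unseen categories from breaking the pipeline.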

4. Test Generation

Generate pytest tests:

Create tests for the preprocessing functions covering:
- Successful preprocessing with valid data
- Handling of various missing value patterns
- Error cases like all-missing columns
- Verification that output shapes match expectations

Claude Code generates thorough test coverage including fixtures, parametrized tests, and clear assertions—work that often gets postponed due to tedium.
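
For illustration, a minimal sketch of such tests might look like the following, assuming the build_preprocessor sketch above lives in a module called preprocessing:

# Illustrative sketch only -- module and fixture names are hypothetical
import numpy as np
import pandas as pd
import pytest

from preprocessing import build_preprocessor

@pytest.fixture
def sample_df() -> pd.DataFrame:
    """Small frame with a missing value in both the numeric and the categorical column."""
    return pd.DataFrame({
        "income": [50000.0, np.nan, 72000.0, 61000.0],
        "plan_type": ["basic", "premium", np.nan, "basic"],
    })

def test_output_rows_match_input(sample_df):
    """Transformed output keeps one row per input record."""
    preprocessor = build_preprocessor(["income"], ["plan_type"])
    result = preprocessor.fit_transform(sample_df)
    assert result.shape[0] == len(sample_df)

def test_no_missing_values_after_transform(sample_df):
    """Imputation should leave no NaNs in the output."""
    preprocessor = build_preprocessor(["income"], ["plan_type"])
    result = preprocessor.fit_transform(sample_df)
    dense = result.toarray() if hasattr(result, "toarray") else result
    assert not np.isnan(dense).any()

def test_empty_column_lists_raise():
    """Passing no columns at all should fail loudly."""
    with pytest.raises(ValueError):
        build_preprocessor([], [])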

5. Documentation Generation

Add docstrings and project documentation:

Add docstrings to all functions in data_pipeline.py following NumPy style
Create a README.md explaining:
- Project purpose and business context
- Setup instructions for the development environment
- How to run the preprocessing and modeling pipeline
- Description of output artifacts and their interpretation

Generated documentation captures technical details while remaining readable for collaborators.

6. Maintaining Analysis Documentation

For complex analyses, use Claude Code to maintain living documentation:

Create analysis_log.md and document our approach to handling missing income data, including:
- The statistical justification for using median imputation rather than deletion
- Why we chose median over mean given the right-skewed distribution we observed
- Validation checks we performed to ensure imputation didn't bias results

This documentation serves dual purposes. First, it provides context for Claude Code in future sessions when you resume work on this analysis, as it explains the preprocessing you applied and why those specific choices were methodologically appropriate. Second, it creates stakeholder-ready explanations communicating both technical implementation and analytical reasoning.

As your analysis progresses, continue documenting key decisions:

Add to analysis_log.md: Explain why we chose random forest over logistic regression after observing significant feature interactions in the correlation analysis, and document the cross-validation approach we used given temporal dependencies in our customer data.

This living documentation approach transforms implicit analytical reasoning into explicit written rationale, increasing both reproducibility and transparency of your data science work.

Common Pitfalls and How to Avoid Them

  • Insufficient context leads to generic suggestions that miss project-specific requirements. Claude Code doesn't automatically know your data schema, project conventions, or domain constraints. Maintain a detailed CLAUDE.md file and reference relevant files using @filename syntax in your prompts.
  • Accepting generated code without review risks introducing bugs or inappropriate patterns. Claude Code produces good starting points but isn't perfect. Treat all output as first drafts requiring validation through testing and inspection, especially for statistical computations or data transformations.
  • Attempting overly complex requests in single prompts produces confused or incomplete results. When you ask Claude Code to "build the entire analysis pipeline from scratch," it gets overwhelmed. Break large tasks into focused steps—first create data loading, then validation, then preprocessing—building incrementally toward the desired outcome.
  • Ignoring error messages when Claude Code encounters problems prevents identifying root causes. Read errors carefully and ask Claude Code for specific debugging assistance: "The preprocessing function failed with KeyError on 'customer_id'. What might cause this and how should I fix it?"

Understanding Claude Code's Limitations

Setting realistic expectations about what Claude Code cannot do well builds trust through transparency.

Domain-specific understanding requires your input. Claude Code generates code based on patterns and best practices but cannot validate whether analytical approaches are appropriate for your research questions or business problems. You must provide domain expertise and methodological judgment.

Subtle bugs can slip through. Generated code for advanced statistical methods, custom loss functions, or intricate data transformations requires careful validation. Always test generated code thoroughly against known-good examples.

Large project understanding is limited. Claude Code works best on focused tasks within individual files rather than system-wide refactoring across complex architectures with dozens of interconnected files.

Edge cases may not be handled. Preprocessing code might handle clean training data perfectly but break on production data with unexpected null patterns or outlier distributions that weren't present during development.

Expertise is not replaceable. Claude Code accelerates implementation but does not replace fundamental understanding of data science principles, statistical methods, or domain knowledge.

Security Considerations

When Claude Code accesses external data sources, malicious actors could potentially embed instructions in data that Claude Code interprets as commands. This concern is known as prompt injection.

Maintain skepticism about Claude Code suggestions when working with untrusted external sources. Never grant Claude Code access to production databases, sensitive customer information, or critical systems without careful review of proposed operations.

For most data scientists working with internal datasets and trusted sources, this risk remains theoretical, but awareness becomes important as you expand usage into more automated workflows.

Frequently Asked Questions

How much does Claude Code cost for typical data science usage?

Claude Code itself is free to install, but it requires an Anthropic API key with usage-based pricing. The free tier allows limited testing suitable for evaluation. The Pro plan at $20/month handles moderate daily development: generating preprocessing code, debugging errors, refactoring functions. Heavy users working intensively on multiple projects may prefer pay-as-you-go pricing, typically $6-12 daily for active development. Start with the free tier to evaluate effectiveness, then upgrade based on value.

Does Claude Code work with Jupyter notebooks?

Claude Code operates as a command-line tool and works best with Python scripts and modules. For Jupyter notebooks, use Claude Code to build utility modules that your notebooks import, creating cleaner separation between exploratory analysis and reusable logic. You can also copy code cells into Python files, improve them with Claude Code, then bring the enhanced code back to the notebook.

Can Claude Code access my data files or databases?

Claude Code reads files you explicitly include through context and generates code that loads data from files or databases. It doesn't automatically access your data without instruction. You maintain full control over what files and information Claude Code can see. When you ask Claude Code to analyze data patterns, it reads the data through code execution, not by directly accessing databases or files independently.

How does Claude Code compare to GitHub Copilot?

GitHub Copilot provides inline code suggestions as you type within an IDE, excelling at completing individual lines or functions. Claude Code offers more substantial assistance with entire file transformations, debugging sessions, and refactoring through conversational interaction. Many practitioners use both—Copilot for writing code interactively, Claude Code for larger refactoring and debugging work. They complement each other rather than compete.

Next Steps

You now have Claude Code installed, understand its capabilities and limitations, and have seen concrete examples of how it handles data science tasks.

Start by using Claude Code for low-risk tasks where mistakes are easily corrected: generating documentation for existing functions, creating test cases for well-understood code, or refactoring non-critical utility scripts. This builds confidence without risking important work. Gradually increase complexity as you become comfortable.

Maintain a personal collection of effective prompts for data science tasks you perform regularly. When you discover a prompt pattern that produces excellent results, save it for reuse. This accelerates work on similar future tasks.

For technical details and advanced features, explore Anthropic's Claude Code documentation. The official docs cover advanced topics like Model Context Protocol servers, custom hooks, and integration patterns.

To systematically learn generative AI across your entire practice, check out our Generative AI Fundamentals in Python skill path. For deeper understanding of effective prompt design, our Prompting Large Language Models in Python course teaches frameworks for crafting prompts that consistently produce useful results.

Getting Started

AI-assisted development requires practice and iteration. You'll experience some awkwardness as you learn to communicate effectively with Claude Code, but this learning curve is brief. Most practitioners feel productive within their first week of regular use.

Install Claude Code, work through the examples in this tutorial with your own projects, and discover how AI assistance fits into your workflow.


Have questions or want to share your Claude Code experience? Join the discussion in the Dataquest Community where thousands of data scientists are exploring AI-assisted development together.

Python Practice: 91 Exercises, Projects, and Tutorials

16 October 2025 at 23:26

This guide gives you 91 ways to practice Python — from quick exercises to real projects and helpful courses. Whether you’re a beginner or preparing for a job, there’s something here for you.


Table of Contents

  1. Hands-On Courses
  2. Free Exercises
  3. Projects
  4. Online Tutorials

Hands-On Courses

Some Python programming courses let you learn and code at the same time. You read a short lesson, then solve a problem in your browser. It’s a fast, hands-on way to learn.

Each course below includes at least one free lesson you can try.

Python Courses

Python Basics Courses

Data Analysis & Visualization Courses

Data Cleaning Courses

Machine Learning Courses

AI & Deep Learning Courses

Probability & Statistics Courses

Hypothesis Testing

These courses are a great way to practice Python online, and they're all free to start. If you're looking for more Python courses, you can find them on Dataquest's course page.


Free Python Exercises

Exercises are a great way to focus on a specific skill. For example, if you have a job interview coming up, practicing Python dictionaries will refresh your knowledge and boost your confidence.

Each lesson is free to start.

Coding Exercises

Beginner Python Exercises

Intermediate Python Programming

Data Handling and Manipulation with NumPy

Data Handling and Manipulation with pandas

Data Analysis

Complexity and Algorithms


Python Projects

Projects are one of the best ways to practice Python. Doing projects helps you remember syntax, apply what you’ve learned, and build a portfolio to show employers.

Here are some projects you can start with right away:

Beginner Projects

Data Analysis Projects

Data Engineering Projects

Machine Learning & AI Projects

If none of these spark your interest, there are plenty of other Python projects to try.


Online Python Tutorials

If exercises, courses, or projects aren’t your thing, blog-style tutorials are another way to learn Python. They’re great for reading on your phone or when you can’t code directly.

Core Python Concepts (Great for Beginners)

Intermediate Techniques

Data Analysis & Data Science

The web is full of thousands of beginner Python tutorials. Once you know the basics, you can find endless ways to practice Python online.


FAQs

Where can I practice Python programming online?

  1. Dataquest.io: Offers dozens of free interactive practice questions, lessons, project ideas, walkthroughs, tutorials, and more.
  2. HackerRank: A popular site for interactive coding practice and challenges.
  3. CodingGame: A fun platform that lets you practice Python through games and coding puzzles.
  4. Edabit: Provides Python challenges that are great for practice or self-testing.
  5. LeetCode: Helps you test your skills and prepare for technical interviews with Python coding problems.

How can I practice Python at home?

  1. Install Python on your machine.

You can download Python directly here, or use a program like Anaconda Individual Edition that makes the process easier. If you don’t want to install anything, you can use an interactive online platform like Dataquest and write code right in your browser.

  2. Work on projects or practice problems.

Find a good Python project or some practice problems to apply what you’re learning. Hands-on coding is one of the best ways to improve.

  3. Make a schedule.

Plan your practice sessions and stick to them. Regular, consistent practice is key to learning Python effectively.

  4. Join an online community.

It's always great to get help from a real person. Reddit has great Python communities, and Dataquest's Community is great if you're learning Python data skills.

Can I practice Python on mobile?

Yes! There are many apps that let you practice Python on both iOS and Android.

However, mobile practice shouldn’t be your main way of learning if you want to use Python professionally. It’s important to practice installing and working with Python on a desktop or laptop, since that’s how most real-world programming is done.

If you’re looking for an app to practice on the go, a great option is Mimo.

With AI advancing so quickly, should I still practice Python?

Absolutely! While AI is a powerful support tool, we can’t always rely on it blindly. AI can sometimes give incorrect answers or generate code that isn’t optimal.

Python is still essential, especially in the AI field. It’s a foundational language for developing AI technologies and is constantly updated to work with the latest AI advancements.

Popular Python libraries like TensorFlow and PyTorch make it easier to build and train complex AI models efficiently. Learning Python also helps you understand how AI tools work under the hood, making you a more skilled and knowledgeable developer.

Build Your First ETL Pipeline with PySpark

15 October 2025 at 23:57

You've learned PySpark basics: RDDs, DataFrames, maybe some SQL queries. You can transform data and run aggregations in notebooks. But here's the thing: data engineering is about building pipelines that run reliably every single day, handling the messy reality of production data.

Today, we're building a complete ETL pipeline from scratch. This pipeline will handle the chaos you'll actually encounter at work: inconsistent date formats, prices with dollar signs, test data that somehow made it to production, and customer IDs that follow different naming conventions.

Here's the scenario: You just started as a junior data engineer at an online grocery delivery service. Your team lead drops by your desk with a problem. "Hey, we need help. Our daily sales report is a mess. The data comes in as CSVs from three different systems, nothing matches up, and the analyst team is doing everything manually in Excel. Can you build us an ETL pipeline?"

She shows you what she's dealing with:

  • Order files that need standardized date formatting
  • Product prices stored as "$12.99" in some files, "12.99" in others
  • Customer IDs that are sometimes numbers, sometimes start with "CUST_"
  • Random blank rows and test data mixed in ("TEST ORDER - PLEASE IGNORE")

"Just get it into clean CSV files," she says. "We'll worry about performance and parquet later. We just need something that works."

Your mission? Build an ETL pipeline that takes this mess and turns it into clean, reliable data the analytics team can actually use. No fancy optimizations needed, just something that runs every morning without breaking.

Setting Up Your First ETL Project

Let's start with structure. One of the biggest mistakes new data engineers make is jumping straight into writing transformation code without thinking about organization. You end up with a single massive Python file that's impossible to debug, test, or explain to your team.

We're going to build this the way professionals do it, but keep it simple enough that you won't get lost in abstractions.

Project Structure That Makes Sense

Here's what we're creating:

grocery_etl/
├── data/
│   ├── raw/         # Your messy input CSVs
│   ├── processed/   # Clean output files
├── src/
│   └── etl_pipeline.py
├── main.py
└── requirements.txt

Why this structure? Three reasons:

First, it separates concerns. Your main.py handles orchestration; starting Spark, calling functions, handling errors. Your src/etl_pipeline.py contains all the actual ETL logic. When something breaks, you'll know exactly where to look.

Second, it mirrors the organizational pattern you'll use in production pipelines (even though the specifics will differ). Whether you're deploying to Databricks, AWS EMR, or anywhere else, you'll separate concerns the same way: orchestration code (main.py), ETL logic (src/etl_pipeline.py), and clear data boundaries. The actual file paths will change (e.g., production uses distributed filesystems like s3://data-lake/raw/ or /mnt/efs/raw/ instead of local folders), but the structure scales.

Third, it keeps your local development organized. Raw data stays raw. Processed data goes to a different folder. This makes debugging easier and mirrors the input/output separation you'll have in production, just on your local machine.

Ready to start? Get the sample CSV files and project skeleton from our starter repository. You can either:

# Clone the full tutorials repo and navigate to this project
git clone https://github.com/dataquestio/tutorials.git
cd tutorials/pyspark-etl-tutorial

Or download just the pyspark-etl-tutorial folder as a ZIP from the GitHub page.

Getting Started

We'll build this project in two files:

  • src/etl_pipeline.py: All our ETL functions (extract, transform, load)
  • main.py: Orchestration logic that calls those functions

Let's set up the basics. You'll need Python 3.9+ and Java 11 or 17 installed (required for Spark 4.0). Note: In production, you'd match your PySpark version to whatever your cluster is running (Databricks, EMR, etc.).

# requirements.txt
pyspark==4.0.1
# main.py - Just the skeleton for now
from pyspark.sql import SparkSession
import logging
import sys

def main():
    # We'll complete this orchestration logic later
    pass

if __name__ == "__main__":
    main()

That's it for setup. Notice we're not installing dozens of dependencies or configuring complex build tools. We're keeping it minimal because the goal is to understand ETL patterns, not fight with tooling.

Optional: Interactive Data Exploration

Before we dive into writing pipeline code, you might want to poke around the data interactively. This is completely optional. If you prefer to jump straight into building, skip to the next section, but if you want to see what you're up against, fire up the PySpark shell:

pyspark

Now you can explore interactively from the command line:

df = spark.read.csv("data/raw/online_orders.csv", header=True)

# See the data firsthand
df.show(5, truncate=False)
df.printSchema()
df.describe().show()

# Count how many weird values we have
df.filter(df.price.contains("$")).count()
df.filter(df.customer_id.contains("TEST")).count()

This exploration helps you understand what cleaning you'll need to build into your pipeline. Real data engineers do this all the time: you load a sample, poke around, discover the problems, then write code to fix them systematically.

But interactive exploration is for understanding the data. The actual pipeline needs to be scripted, testable, and able to run without you babysitting it. That's what we're building next.

Extract: Getting Data Flexibly

The Extract phase is where most beginner ETL pipelines break. You write code that works perfectly with your test file, then the next day's data arrives with a slightly different format, and everything crashes.

We're going to read CSVs the defensive way: assume everything will go wrong, capture the problems, and keep the pipeline running.

Reading Messy CSV Files

Let's start building src/etl_pipeline.py. We'll begin with imports and a function to create our Spark session:

# src/etl_pipeline.py

from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
import logging

# Set up logger for this module
logger = logging.getLogger(__name__)

def create_spark_session():
    """Create a Spark session for our ETL job"""
    return SparkSession.builder \
        .appName("Grocery_Daily_ETL") \
        .config("spark.sql.adaptive.enabled", "true") \
        .getOrCreate()

This is a basic local configuration. Real production pipelines need more, including time zone handling, memory allocation tuned to your cluster, and policies for parsing dates; we’ll cover those in a future tutorial on production deployment. For now, we're focusing on the pattern.

If you're new to the logging module, logger.info() writes to log files with timestamps and severity levels. When something breaks, you can check the logs to see exactly what happened. It's a small habit that saves debugging time.

Now let's read the data:

def extract_sales_data(spark, input_path):
    """Read sales CSVs with all their real-world messiness"""

    logger.info(f"Reading sales data from {input_path}")

    expected_schema = StructType([
        StructField("order_id", StringType(), True),
        StructField("customer_id", StringType(), True),
        StructField("product_name", StringType(), True),
        StructField("price", StringType(), True),
        StructField("quantity", StringType(), True),
        StructField("order_date", StringType(), True),
        StructField("region", StringType(), True)
    ])

StructType and StructField let you define exactly what columns you expect and what data types they should have. The True at the end means the field can be null. You could let Spark infer the schema automatically, but explicit schemas catch problems earlier. If someone adds a surprise column next week, you'll know immediately instead of discovering it three steps downstream.

Notice everything is StringType(). You might think "wait, customer_id has numbers, shouldn't that be IntegerType?" Here's the thing: some customer IDs are "12345" and some are "CUST_12345". If we used IntegerType(), Spark would convert "CUST_12345" to null and we'd lose data.

The strategy is simple: prevent data loss by preserving everything as strings in the Extract phase, then clean and convert in the Transform phase, where we have control over error handling.

Now let's read the file defensively:

    df = spark.read.csv(
        input_path,
        header=True,
        schema=expected_schema,
        mode="PERMISSIVE"
    )

    total_records = df.count()
    logger.info(f"Found {total_records} total records")

    return df

The PERMISSIVE mode tells Spark to be lenient with malformed data. When it encounters rows that don't match the schema, it sets unparseable fields to null instead of crashing the entire job. This keeps production pipelines running even when data quality takes a hit. We'll validate and handle data quality issues in the Transform phase, where we have better control.
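
If you also want to see which rows Spark considered malformed, the CSV reader can capture the raw text of bad rows in a dedicated column. This optional sketch is not part of our pipeline; it reuses the expected_schema and input_path from extract_sales_data and assumes an active spark session:

# Optional debugging sketch: capture malformed rows in a dedicated column
from pyspark.sql.types import StructType, StructField, StringType

debug_schema = StructType(
    expected_schema.fields + [StructField("_corrupt_record", StringType(), True)]
)

df_debug = spark.read.csv(
    input_path,
    header=True,
    schema=debug_schema,
    mode="PERMISSIVE",
    columnNameOfCorruptRecord="_corrupt_record",
)

# Rows that failed to parse keep their original text here for inspection
df_debug.filter(df_debug["_corrupt_record"].isNotNull()).show(truncate=False)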

Dealing with Multiple Files

Real data comes from multiple systems. Let's combine them:

def extract_all_data(spark):
    """Combine data from multiple sources"""

    # Each system exports differently
    online_orders = extract_sales_data(spark, "data/raw/online_orders.csv")
    store_orders = extract_sales_data(spark, "data/raw/store_orders.csv")
    mobile_orders = extract_sales_data(spark, "data/raw/mobile_orders.csv")

    # Union them all together
    all_orders = online_orders.unionByName(store_orders).unionByName(mobile_orders)

    logger.info(f"Combined dataset has {all_orders.count()} orders")
    return all_orders

In production, you'd often use wildcards like "data/raw/online_orders*.csv" to process multiple files at once (like daily exports). Spark reads them all and combines them automatically. We're keeping it simple here with one file per source.

The .unionByName() method stacks DataFrames vertically, matching columns by name rather than position. This prevents silent data corruption if schemas don't match perfectly, which is a common issue when combining data from different systems. Since we defined the same schema for all three sources, this works cleanly.

You've now built the Extract phase: reading data defensively and combining multiple sources. The data isn't clean yet, but at least we didn't lose any of it. That's what matters in Extract.

Transform: Fixing the Data Issues

This is where the real work happens. You've got all your data loaded, good and bad separated. Now we need to turn those messy strings into clean, usable data types.

The Transform phase is where you fix all the problems you discovered during extraction. Each transformation function handles one specific issue, making the code easier to test and debug.

Standardizing Customer IDs

Remember how customer IDs come in two formats? Some are just numbers, some have the "CUST_" prefix. Let's standardize them:

# src/etl_pipeline.py (continuing in same file)

def clean_customer_id(df):
    """Standardize customer IDs (some are numbers, some are CUST_123 format)"""

    df_cleaned = df.withColumn(
        "customer_id_cleaned",
        when(col("customer_id").startswith("CUST_"), col("customer_id"))
        .when(col("customer_id").rlike("^[0-9]+$"), concat(lit("CUST_"), col("customer_id")))
        .otherwise(col("customer_id"))
    )

    return df_cleaned.drop("customer_id").withColumnRenamed("customer_id_cleaned", "customer_id")

The logic here: if it already starts with "CUST_", keep it. If it's just numbers (rlike("^[0-9]+$") checks for that), add the "CUST_" prefix. Everything else stays as-is for now. This gives us a consistent format to work with downstream.

Cleaning Price Data

Prices are often messy. Dollar signs, commas, who knows what else:

# src/etl_pipeline.py (continuing in same file)

def clean_price_column(df):
    """Fix the price column"""

    # Remove dollar signs, commas, etc. (keep digits, decimals, and negatives)
    df_cleaned = df.withColumn(
        "price_cleaned",
        regexp_replace(col("price"), r"[^0-9.\-]", "")
    )

    # Convert to decimal, default to 0 if it fails
    df_final = df_cleaned.withColumn(
        "price_decimal",
        when(col("price_cleaned").isNotNull(),
             col("price_cleaned").cast(DoubleType()))
        .otherwise(0.0)
    )

    # Flag suspicious values for review
    df_flagged = df_final.withColumn(
        "price_quality_flag",
        when(col("price_decimal") == 0.0, "CHECK_ZERO_PRICE")
        .when(col("price_decimal") > 1000, "CHECK_HIGH_PRICE")
        .when(col("price_decimal") < 0, "CHECK_NEGATIVE_PRICE")
        .otherwise("OK")
    )

    bad_price_count = df_flagged.filter(col("price_quality_flag") != "OK").count()
    logger.warning(f"Found {bad_price_count} orders with suspicious prices")

    return df_flagged.drop("price", "price_cleaned")

The regexp_replace function strips out everything that isn't a digit, decimal point, or minus sign. Then we convert to a proper decimal type. The quality flag column helps us track suspicious values without throwing them out. This is important: we're not perfect at cleaning, so we flag problems for humans to review later.

Note that we're assuming US price format here (periods as decimal separators). European formats with commas would need different logic, but for this tutorial, we're keeping it focused on the ETL pattern rather than international number handling.

Standardizing Dates

Date parsing is one of those things that looks simple but gets complicated fast. Different systems export dates in different formats: some use MM/dd/yyyy, others use dd-MM-yyyy, and ISO standard is yyyy-MM-dd.

def standardize_dates(df):
    """Parse dates in multiple common formats"""

    # Try each format - coalesce returns the first non-null result
    fmt1 = to_date(col("order_date"), "yyyy-MM-dd")
    fmt2 = to_date(col("order_date"), "MM/dd/yyyy")
    fmt3 = to_date(col("order_date"), "dd-MM-yyyy")

    df_parsed = df.withColumn(
        "order_date_parsed",
        coalesce(fmt1, fmt2, fmt3)
    )

    # Check how many we couldn't parse
    unparsed = df_parsed.filter(col("order_date_parsed").isNull()).count()
    if unparsed > 0:
        logger.warning(f"Could not parse {unparsed} dates")

    return df_parsed.drop("order_date")

We use coalesce() to try each format in order, taking the first one that successfully parses. This handles the most common date format variations you'll encounter.

Note: This approach works for simple date strings but doesn't handle datetime strings with times or timezones. For production systems dealing with international data or precise timestamps, you'd need more sophisticated parsing logic. For now, we're focusing on the core pattern.

Removing Test Data

Test data in production is inevitable. Let's filter it out:

# src/etl_pipeline.py (continuing in same file)

def remove_test_data(df):
    """Remove test orders that somehow made it to production"""

    df_filtered = df.filter(
        ~(upper(col("customer_id")).contains("TEST") |
          upper(col("product_name")).contains("TEST") |
          col("customer_id").isNull() |
          col("order_id").isNull())
    )

    removed_count = df.count() - df_filtered.count()
    logger.info(f"Removed {removed_count} test/invalid orders")

    return df_filtered

We're checking for "TEST" in customer IDs and product names, plus filtering out any rows with null order IDs or customer IDs. That tilde (~) means "not", so we're keeping everything that doesn't match these patterns.

Handling Duplicates

Sometimes the same order appears multiple times, usually from system retries:

# src/etl_pipeline.py (continuing in same file)

def handle_duplicates(df):
    """Remove duplicate orders (usually from retries)"""

    df_deduped = df.dropDuplicates(["order_id"])

    duplicate_count = df.count() - df_deduped.count()
    if duplicate_count > 0:
        logger.info(f"Removed {duplicate_count} duplicate orders")

    return df_deduped

We keep the first occurrence of each order_id and drop the rest. Simple and effective.

Bringing It All Together

Now we chain all these transformations in sequence:

# src/etl_pipeline.py (continuing in same file)

def transform_orders(df):
    """Apply all transformations in sequence"""

    logger.info("Starting data transformation...")

    # Clean each aspect of the data
    df = clean_customer_id(df)
    df = clean_price_column(df)
    df = standardize_dates(df)
    df = remove_test_data(df)
    df = handle_duplicates(df)

    # Cast quantity to integer
    df = df.withColumn(
        "quantity",
        when(col("quantity").isNotNull(), col("quantity").cast(IntegerType()))
        .otherwise(1)
    )

    # Add some useful calculated fields
    df = df.withColumn("total_amount", col("price_decimal") * col("quantity")) \
           .withColumn("processing_date", current_date()) \
           .withColumn("year", year(col("order_date_parsed"))) \
           .withColumn("month", month(col("order_date_parsed")))

    # Rename for clarity
    df = df.withColumnRenamed("order_date_parsed", "order_date") \
           .withColumnRenamed("price_decimal", "unit_price")

    logger.info(f"Transformation complete. Final record count: {df.count()}")

    return df

Each transformation returns a new DataFrame (remember, PySpark DataFrames are immutable), so we reassign the result back to df each time. The order matters here: we clean customer IDs before removing test data because the test removal logic checks for "TEST" in customer IDs. We standardize dates before extracting year and month because those extraction functions need properly parsed dates to work. If you swap the order around, transformations can fail or produce wrong results.

We also add some calculated fields that will be useful for analysis: total_amount (price times quantity), processing_date (when this ETL ran), and time partitions (year and month) for efficient querying later.

The data is now clean, typed correctly, and enriched with useful fields. Time to save it.

Load: Saving Your Work

The Load phase is when we write the cleaned data somewhere useful. We're using pandas to write the final CSV because it avoids platform-specific issues during local development. In production on a real Spark cluster, you'd use Spark's native writers for parquet format with partitioning for better performance. For now, we're focusing on getting the pipeline working reliably across different development environments. You can always swap the output format to parquet once you deploy to a production cluster.

Writing Clean Files

Let's write our data in a way that makes future queries fast:

# src/etl_pipeline.py (continuing in same file)

def load_to_csv(spark, df, output_path):
    """Save processed data for downstream use"""

    logger.info(f"Writing {df.count()} records to {output_path}")

    # Convert to pandas for local development ONLY (not suitable for large datasets)
    pandas_df = df.toPandas()

    # Create output directory if needed
    import os
    os.makedirs(output_path, exist_ok=True)

    output_file = f"{output_path}/orders.csv"
    pandas_df.to_csv(output_file, index=False)

    logger.info(f"Successfully wrote {len(pandas_df)} records")
    logger.info(f"Output location: {output_file}")

    return len(pandas_df)

Important: The .toPandas() method collects all distributed data into the driver's memory. This is dangerous for real production data! If your dataset is larger than your driver's RAM, your job will crash. We're using this approach only because:

  1. Our tutorial dataset is tiny (85 rows)
  2. It avoids platform-specific Spark/Hadoop setup issues on Windows
  3. The focus is on learning ETL patterns, not deployment

In production, always use Spark's native writers (df.write.parquet(), df.write.csv()) even though they require proper cluster configuration. Never use .toPandas() for datasets larger than a few thousand rows or anything you wouldn't comfortably fit in a single machine's memory.
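
For reference, a production-style version of this step would look roughly like the sketch below: a hypothetical load function that hands the write to Spark's distributed writer and partitions by the year and month columns we created during transformation.

# Sketch of a production-style write (not used in this tutorial's local pipeline)
def load_to_parquet(df, output_path):
    """Write partitioned parquet using Spark's distributed writer."""
    (
        df.write
          .mode("overwrite")
          .partitionBy("year", "month")
          .parquet(output_path)
    )

Partitioning by those columns means downstream queries that filter on year or month only read the files they need.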

Quick Validation with Spark SQL

Before we call it done, let's verify our data makes sense. This is where Spark SQL comes in handy:

# src/etl_pipeline.py (continuing in same file)

def sanity_check_data(spark, output_path):
    """Quick validation using Spark SQL"""

    # Read the CSV file back
    output_file = f"{output_path}/orders.csv"
    df = spark.read.csv(output_file, header=True, inferSchema=True)
    df.createOrReplaceTempView("orders")

    # Run some quick validation queries
    total_count = spark.sql("SELECT COUNT(*) as total FROM orders").collect()[0]['total']
    logger.info(f"Sanity check - Total orders: {total_count}")

    # Check for any suspicious data that slipped through
    zero_price_count = spark.sql("""
        SELECT COUNT(*) as zero_prices
        FROM orders
        WHERE unit_price = 0
    """).collect()[0]['zero_prices']

    if zero_price_count > 0:
        logger.warning(f"Found {zero_price_count} orders with zero price")

    # Verify date ranges make sense
    date_range = spark.sql("""
        SELECT
            MIN(order_date) as earliest,
            MAX(order_date) as latest
        FROM orders
    """).collect()[0]

    logger.info(f"Date range: {date_range['earliest']} to {date_range['latest']}")

    return True

The createOrReplaceTempView() lets us query the DataFrame using SQL. This is useful for validation because SQL is often clearer for these kinds of checks than chaining DataFrame operations. We're checking the record count, looking for zero prices that might indicate cleaning issues, and verifying the date range looks reasonable.

Creating a Summary Report

Your team lead is going to ask, "How'd the ETL go today?" Let's give her the answer automatically:

# src/etl_pipeline.py (continuing in same file)

def create_summary_report(df):
    """Generate metrics about the ETL run"""

    summary = {
        "total_orders": df.count(),
        "unique_customers": df.select("customer_id").distinct().count(),
        "unique_products": df.select("product_name").distinct().count(),
        "total_revenue": df.agg(sum("total_amount")).collect()[0][0],
        "date_range": f"{df.agg(min('order_date')).collect()[0][0]} to {df.agg(max('order_date')).collect()[0][0]}",
        "regions": df.select("region").distinct().count()
    }

    logger.info("\n=== ETL Summary Report ===")
    for key, value in summary.items():
        logger.info(f"{key}: {value}")
    logger.info("========================\n")

    return summary

This generates a quick summary of what got processed. In a real production system, you might email this summary or post it to Slack so the team knows the pipeline ran successfully.

One note about performance: this summary triggers multiple separate actions on the DataFrame. Each .count() and .distinct().count() scans the data independently, which isn't optimized. We could compute all these metrics in a single pass, but that's a topic for a future tutorial on performance optimization. Right now, we're prioritizing readable code that works.

Putting It All Together

We've built all the pieces. Now let's wire them up into a complete pipeline that runs from start to finish.

Remember how we set up main.py as just a skeleton? Time to fill it in. This file orchestrates everything: starting Spark, calling our ETL functions in order, handling errors, and cleaning up when we're done.

The Complete Pipeline

# main.py
from pyspark.sql import SparkSession
import logging
import sys
import traceback
from datetime import datetime
import os

# Import our ETL functions
from src.etl_pipeline import *

def setup_logging():
    """Basic logging setup"""

    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(f'logs/etl_run_{datetime.now().strftime("%Y%m%d")}.log'),
            logging.StreamHandler(sys.stdout)
        ]
    )
    return logging.getLogger(__name__)

def main():
    """Main ETL pipeline"""

    # Create necessary directories
    os.makedirs('logs', exist_ok=True)
    os.makedirs('data/processed/orders', exist_ok=True)

    logger = setup_logging()
    logger.info("Starting Grocery ETL Pipeline")

    # Track runtime
    start_time = datetime.now()
    spark = None  # so the finally block can tell whether Spark ever started

    try:
        # Initialize Spark
        spark = create_spark_session()
        logger.info("Spark session created")

        # Extract
        raw_df = extract_all_data(spark)
        logger.info(f"Extracted {raw_df.count()} raw records")

        # Transform
        clean_df = transform_orders(raw_df)
        logger.info(f"Transformed to {clean_df.count()} clean records")

        # Load
        output_path = "data/processed/orders"
        load_to_csv(spark, clean_df, output_path)

        # Sanity check
        sanity_check_data(spark, output_path)

        # Create summary
        summary = create_summary_report(clean_df)

        # Calculate runtime
        runtime = (datetime.now() - start_time).total_seconds()
        logger.info(f"Pipeline completed successfully in {runtime:.2f} seconds")

    except Exception as e:
        logger.error(f"Pipeline failed: {str(e)}")
        logger.error(traceback.format_exc())
        raise

    finally:
        if spark is not None:
            spark.stop()
            logger.info("Spark session closed")

if __name__ == "__main__":
    main()

Let's walk through what's happening here.

The setup_logging() function configures logging to write to both a file and the console. The log file gets named with today's date, so you'll have a history of every pipeline run. This is invaluable when you're debugging issues that happened last Tuesday.

The main function wraps everything in a try-except-finally block, which is important for production pipelines. The try block runs your ETL logic. If anything fails, the except block logs the error with a full traceback (that traceback.format_exc() is especially helpful when Spark's Java stack traces get messy). The finally block ensures we always close the Spark session, even if something crashed.

Notice we're using relative paths like "data/processed/orders". This is fine for local development but brittle in production. Real pipelines use environment variables or configuration files for paths. We'll cover that in a future tutorial on production deployment.
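
As a quick illustration of that pattern, you could read the output location from an environment variable and fall back to the local default; the variable name here is hypothetical:

# Hypothetical environment variable, shown only to illustrate the pattern
import os

output_path = os.environ.get("GROCERY_ETL_OUTPUT_PATH", "data/processed/orders")
load_to_csv(spark, clean_df, output_path)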

Running Your Pipeline

With everything in place, you can run your pipeline with spark-submit:

# Basic run
spark-submit main.py

# With more memory for bigger datasets
spark-submit --driver-memory 4g main.py

# See what's happening with Spark's adaptive execution
spark-submit --conf spark.sql.adaptive.enabled=true main.py

The first time you run this, you'll probably hit some issues, but that's completely normal. Let's talk about the most common ones.

Common Issues You'll Hit

No ETL pipeline works perfectly on the first try. Here are the problems everyone runs into and how to fix them.

Memory Errors

If you see java.lang.OutOfMemoryError, Spark ran out of memory. Since we're using .toPandas() to write our output, this most commonly happens if your cleaned dataset is too large to fit in the driver's memory:

# Option 1: Increase driver memory
spark-submit --driver-memory 4g main.py

# Option 2: Sample the data first to verify the pipeline works
df.sample(0.1).toPandas()  # Process 10% to test

# Option 3: Switch to Spark's native CSV writer for large data
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(output_path)

For local development with reasonably-sized data, increasing driver memory usually solves the problem. For truly massive datasets, you'd switch back to Spark's distributed writers.

Schema Mismatches

If you get "cannot resolve column name" errors, your DataFrame doesn't have the columns you think it does:

# Debug by checking what columns actually exist
df.printSchema()
print(df.columns)

This usually means a transformation dropped or renamed a column, and you forgot to update the downstream code.

Slow Performance

If your pipeline is running but taking forever, don't worry about optimization yet. That's a whole separate topic. For now, just get it working. But if it's really unbearably slow, try caching DataFrames you reference multiple times:

df.cache()  # Keep frequently used data in memory

Just remember to call df.unpersist() when you're done with it to free up memory.

What You've Accomplished

You just built a complete ETL pipeline from scratch. Here's what you learned:

  • You can handle messy real-world data. CSV files with dollar signs in prices, mixed date formats, and test records mixed into production data.
  • You can structure projects professionally. Separate functions for extract, transform, and load. Logging that helps you debug failures. Error handling that keeps the pipeline running when something goes wrong.
  • You know how to run production-style jobs. Code you can deploy with spark-submit that runs on a schedule.
  • You can spot and flag data quality issues. Suspicious prices get flagged. Test data gets filtered. Summary reports tell you what processed.

This is the foundation every data engineer needs. You're ready to build ETL pipelines for real projects.

What's Next

This pipeline works, but it's not optimized. Here's what comes after you’re comfortable with the basics:

  • Performance optimization - Make this pipeline 10x faster by reducing shuffles, tuning partitions, and computing metrics efficiently.
  • Production deployment - Run this on Databricks or EMR. Handle configuration properly, monitor with metrics, and schedule with Airflow.
  • Testing and validation - Write tests for your transformations. Add data quality checks. Build confidence that changes won't break production.

But those are advanced topics. For now, you've built something real. Take a break, then find a messy CSV dataset and build an ETL pipeline for it. The best way to learn is by doing, so here's a concrete exercise to cement what you've learned:

  1. Find any CSV dataset (Kaggle has thousands)
  2. Build an ETL pipeline for it
  3. Add handling for three data quality issues you discover
  4. Output clean parquet files partitioned by a date or category field
  5. Create a summary report showing what you processed

You now have the foundation every data engineer needs. The next time you see messy data at work, you'll know exactly how to turn it into something useful.

To learn more about PySpark, check out the rest of our tutorial series.

Introduction to Apache Airflow

13 October 2025 at 23:26

Imagine this: you’re a data engineer at a growing company that thrives on data-driven decisions. Every morning, dashboards must refresh with the latest numbers, reports need updating, and machine learning models retrain with new data.

At first, you write a few scripts, one to pull data from an API, another to clean it, and a third to load it into a warehouse. You schedule them with cron or run them manually when needed. It works fine, until it doesn’t.

As data volumes grow, scripts multiply, and dependencies become increasingly tangled. Failures start cascading, jobs run out of order, schedules break, and quick fixes pile up into fragile automation. Before long, you're maintaining a system held together by patchwork scripts and luck. That’s where data orchestration comes in.

Data orchestration coordinates multiple interdependent processes, ensuring each task runs in the correct order, at the right time, and under the right conditions. It’s the invisible conductor that keeps data pipelines flowing smoothly from extraction to transformation to loading, reliably and automatically. And among the most powerful and widely adopted orchestration tools is Apache Airflow.

In this tutorial, we’ll use Airflow as our case study to explore how workflow orchestration works in practice. You’ll learn what orchestration means, why it matters, and how Airflow’s architecture, with its DAGs, tasks, operators, scheduler, and new event-driven features, brings order to complex data systems.

By the end, you’ll understand not just how Airflow orchestrates workflows, but why orchestration itself is the cornerstone of every scalable, reliable, and automated data engineering ecosystem.

What Workflow Orchestration Is and Why It Matters

Modern data pipelines involve multiple interconnected stages: data extraction, transformation, loading, and often downstream analytics or machine learning. Each stage depends on the successful completion of the previous one, forming a chain that must execute in the correct order and at the right time.

Many data engineers start by managing these workflows with scripts or cron jobs. But as systems grow, dependencies multiply, and processes become more complex, this manual approach quickly breaks down:

  • Unreliable execution: Tasks may run out of order, producing incomplete or inconsistent data.
  • Limited visibility: Failures often go unnoticed until reports or dashboards break.
  • Poor scalability: Adding new tasks or environments becomes error-prone and hard to maintain.

Workflow orchestration solves these challenges by automating, coordinating, and monitoring interdependent tasks. It ensures each step runs in the right sequence, at the right time, and under the right conditions, bringing structure, reliability, and transparency to data operations.

With orchestration, a loose collection of scripts becomes a cohesive system that can be observed, retried, and scaled, freeing engineers to focus on building insights rather than fixing failures.

Apache Airflow uses these principles and extends them with modern capabilities such as:

  • Deferrable sensors and the triggerer: Improve efficiency by freeing workers while waiting for external events like file arrivals or API responses.
  • Built-in idempotency and backfills: Safely re-run historical or failed workflows without duplication.
  • Data-aware scheduling: Enable event-driven pipelines that automatically respond when new data arrives.

While Airflow is not a real-time streaming engine, it excels at orchestrating batch and scheduled workflows with reliability, observability, and control. Trusted by organizations like Airbnb, Meta, and NASA, it remains the industry standard for automating and scaling complex data workflows.

Next, we’ll explore Airflow’s core concepts, DAGs, tasks, operators, and the scheduler, to see orchestration in action.

Core Airflow Concepts

To understand how Airflow orchestrates workflows, let’s explore its foundational components: the DAG, tasks, scheduler, executor, triggerer, and metadata database.

Together, these components coordinate how data flows through extraction, transformation, model training, and loading in a seamless, automated pipeline.

We’ll use a simple ETL (Extract → Transform → Load) data workflow as our running example. Each day, Airflow will:

  1. Collect daily event data,
  2. Transform it into a clean format,
  3. Upload the results to Amazon S3.

This process will help us connect each concept to a real-world orchestration scenario.

i. DAG (Directed Acyclic Graph)

A DAG is the blueprint of your workflow. It defines what tasks exist and in what order they should run.

Think of it as the pipeline skeleton that connects your data extraction, transformation, and loading steps:

collect_data → transform_data → upload_results

DAGs can be triggered by time (e.g., daily schedules) or events, such as when a new dataset or asset becomes available.

from airflow.decorators import dag
from datetime import datetime

@dag(
    dag_id="daily_ml_pipeline",
    schedule="@daily",
    start_date=datetime(2025, 10, 7),
    catchup=False,
)
def pipeline():
    pass

The @dag line is a decorator, a Python feature that lets you add behavior or metadata to functions in a clean, readable way. In this case, it turns the pipeline() function into a fully functional Airflow DAG.

The DAG defines when and in what order your workflow runs, but the individual tasks define how each step actually happens.

If you want to learn more about Python decorators, check out our lesson on Building a Pipeline Class to see them in action.

  • Don’t worry if the code above feels overwhelming. In the next tutorial, we’ll take a closer look at these pieces and understand how they work in Airflow. For now, we’ll keep things simple and more conceptual.

ii. Tasks: The Actions Within the Workflow

A task is the smallest unit of work in Airflow, a single, well-defined action, like fetching data, cleaning it, or training a model.

If the DAG defines the structure, tasks define the actions that bring it to life.

Using the TaskFlow API, you can turn any Python function into a task with the @task decorator:

from airflow.decorators import task

@task
def collect_data():
    print("Collecting event data...")
    return "raw_events.csv"

@task
def transform_data(file):
    print(f"Transforming {file}")
    return "clean_data.csv"

@task
def upload_to_s3(file):
    print(f"Uploading {file} to S3...")

Tasks can be linked simply by calling them in sequence:

upload_to_s3(transform_data(collect_data()))

Airflow automatically constructs the DAG relationships, ensuring that each step runs only after its dependency completes successfully.

iii. From Operators to the TaskFlow API

In earlier Airflow versions, you defined each task using explicit operators, for example a PythonOperator or BashOperator, to tell Airflow how to execute the logic.

Airflow simplifies this with the TaskFlow API, eliminating boilerplate while maintaining backward compatibility.

# Old style (Airflow 1 & 2)
from airflow.operators.python import PythonOperator

task_transform = PythonOperator(
    task_id="transform_data",
    python_callable=transform_data
)

With the TaskFlow API, you no longer need to create operators manually. Each @task function automatically becomes an operator-backed task.

# Airflow 3
@task
def transform_data():
    ...

Under the hood, Airflow still uses operators as the execution engine, but you no longer need to create them manually. The result is cleaner, more Pythonic workflows.

iv. Dynamic Task Mapping: Scaling the Transformation

Modern data workflows often need to process multiple files, users, or datasets in parallel.

Dynamic task mapping allows Airflow to create task instances at runtime based on data inputs, perfect for scaling transformations.

@task
def get_files():
    return ["file1.csv", "file2.csv", "file3.csv"]

@task
def transform_file(file):
    print(f"Transforming {file}")

transform_file.expand(file=get_files())

Airflow will automatically create and run a separate transform_file task for each file, enabling efficient, parallel execution.

v. Scheduler and Triggerer

The scheduler decides when tasks run, either on a fixed schedule or in response to updates in data assets.

The triggerer, on the other hand, handles event-based execution behind the scenes, using asynchronous I/O to efficiently wait for external signals like file arrivals or API responses.

from airflow.assets import Asset 
events_asset = Asset("s3://data/events.csv")

@dag(
    dag_id="event_driven_pipeline",
    schedule=[events_asset],  # Triggered automatically when this asset is updated
    start_date=datetime(2025, 10, 7),
    catchup=False,
)
def pipeline():
    ...

In this example, the scheduler monitors the asset and triggers the DAG when new data appears.

If the DAG included deferrable operators or sensors, the triggerer would take over and wait asynchronously, ensuring Airflow handles both time-based and event-driven workflows seamlessly.

vi. Executor and Workers

Once a task is ready to run, the executor assigns it to available workers, the machines or processes that actually execute your code.

For example, your ETL pipeline might look like this:

collect_data → transform_data → upload_results

Airflow decides where each of these tasks runs. It can execute everything on a single machine using the LocalExecutor, or scale horizontally across multiple nodes with the CeleryExecutor or KubernetesExecutor.

Deferrable tasks further improve efficiency by freeing up workers while waiting for long external operations like API responses or file uploads.
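
To make that concrete, here is a minimal sketch of a deferrable wait. It assumes Airflow 2.x-style import paths (the async time-delta sensor may live under the standard provider package in Airflow 3), so treat it as an illustration rather than a drop-in snippet.

from datetime import datetime, timedelta

from airflow.decorators import dag, task
from airflow.sensors.time_delta import TimeDeltaSensorAsync

@dag(schedule="@daily", start_date=datetime(2025, 10, 7), catchup=False)
def deferrable_example():
    # While this sensor waits, it defers to the triggerer and frees its worker slot
    wait_for_window = TimeDeltaSensorAsync(
        task_id="wait_for_window",
        delta=timedelta(minutes=30),
    )

    @task
    def process():
        print("External wait finished; processing now...")

    wait_for_window >> process()

deferrable_example()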

vii. Metadata Database and API Server: The Memory and Interface

Every action in Airflow (task success, failure, duration, or retry) is stored in the metadata database, Airflow’s internal memory.

This makes workflows reproducible, auditable, and observable.

The API server provides visibility and control:

  • View and trigger DAGs,
  • Inspect logs and task histories,
  • Track datasets and dependencies,
  • Monitor system health (scheduler, triggerer, database).

Together, they give you complete insight into orchestration, from individual task logs to system-wide performance.
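
As a quick illustration, you can also talk to the API server programmatically. The sketch below assumes an Airflow 2-style REST API served locally at /api/v1 with basic auth enabled and default airflow/airflow credentials, and it requires the requests package; adjust the base URL, API version, and authentication for your own deployment.

import requests

BASE_URL = "http://localhost:8080/api/v1"  # assumed local API server

response = requests.get(
    f"{BASE_URL}/dags",
    auth=("airflow", "airflow"),  # assumed default credentials
    timeout=10,
)
response.raise_for_status()

# Print each DAG's id and whether it is currently paused
for dag_info in response.json().get("dags", []):
    print(dag_info["dag_id"], "paused:", dag_info["is_paused"])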

Exploring the Airflow UI

Every orchestration platform needs a way to observe, manage, and interact with workflows, and in Apache Airflow, that interface is the Airflow Web UI.

The UI is served by the Airflow API Server, which exposes a rich dashboard for visualizing DAGs, checking system health, and monitoring workflow states. Even before running any tasks, it’s useful to understand the layout and purpose of this interface, since it’s where orchestration becomes visible.

Don’t worry if this section feels too conceptual; you’ll explore the Airflow UI in greater detail during the upcoming tutorial. You can also use our Setting up Apache Airflow with Docker Locally (Part I) guide if you’d like to try it right away.

The Role of the Airflow UI in Orchestration

In an orchestrated system, automation alone isn’t enough; engineers need visibility.

The UI bridges that gap. It provides an interactive window into your pipelines, showing:

  • Which workflows (DAGs) exist,
  • Their current state (active, running, or failed),
  • The status of Airflow’s internal components,
  • Historical task performance and logs.

This visibility is essential for diagnosing failures, verifying dependencies, and ensuring the orchestration system runs smoothly.

i. The Home Page Overview

The Airflow UI opens to a dashboard like the one shown below:

The Home Page Overview

At a glance, you can see:

  • Failed DAGs / Running DAGs / Active DAGs: a quick summary of the system’s operational state.
  • Health Indicators — Status checks for Airflow’s internal components:
    • MetaDatabase: Confirms the metadata database connection is healthy.
    • Scheduler: Verifies that the scheduler is running and monitoring DAGs.
    • Triggerer: Ensures event-driven workflows can be activated.
    • DAG Processor: Confirms DAG files are being parsed correctly.

These checks reflect the orchestration backbone at work, even if no DAGs have been created yet.

ii. DAG Management and Visualization

DAG Management and Visualization

In the left sidebar, the DAGs section lists all workflow definitions known to Airflow.

This doesn’t require you to run anything; it’s simply where Airflow displays every DAG it has parsed from the dags/ directory.

Each DAG entry includes:

  • The DAG name and description,
  • Schedule and next run time,
  • Last execution state,
  • Controls to enable, pause, or trigger it manually.

When workflows are defined, you’ll be able to explore their structure visually through:

DAG Management and Visualization (2)

  • Graph View — showing task dependencies
  • Grid View — showing historical run outcomes

These views make orchestration transparent: every dependency, sequence, and outcome is visible at a glance.

iii. Assets and Browse

In the sidebar, the Assets and Browse sections provide tools for exploring the internal components of your orchestration environment.

  • Assets list all registered items, such as datasets, data tables, or connections that Airflow tracks or interacts with during workflow execution. It helps you see the resources your DAGs depend on. (Remember: in Airflow 3.x, “Datasets” were renamed to “Assets.”)

    Assets and Browse

  • Browse allows you to inspect historical data within Airflow, including past DAG runs, task instances, logs, and job details. This section is useful for auditing and debugging since it reveals how workflows behaved over time.

    Assets and Browse (2)

Together, these sections let you explore both data assets and orchestration history, offering transparency into what Airflow manages and how your workflows evolve.

iv. Admin

The Admin section provides the configuration tools that control Airflow’s orchestration environment.

Admin

Here, administrators can manage the system’s internal settings and integrations:

  • Variables – store global key–value pairs that DAGs can access at runtime,
  • Pools – limit the number of concurrent tasks to manage resources efficiently,
  • Providers – list the available integration packages (e.g., AWS, GCP, or Slack providers),
  • Plugins – extend Airflow’s capabilities with custom operators, sensors, or hooks,
  • Connections – define credentials for databases, APIs, and cloud services,
  • Config – view configuration values that determine how Airflow components run,

This section essentially controls how Airflow connects, scales, and extends itself, making it central to managing orchestration behavior in both local and production setups.
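
For example, DAG code can read the Variables and Connections you define here at runtime. The sketch below uses Airflow 2-style imports (newer releases may expose these through the Task SDK), and both default_region and mysql_local are hypothetical names used only for illustration.

from airflow.decorators import task
from airflow.hooks.base import BaseHook
from airflow.models import Variable

@task
def use_admin_settings():
    # Variables: global key-value pairs defined under Admin -> Variables
    region = Variable.get("default_region", default_var="us")

    # Connections: credentials defined under Admin -> Connections
    conn = BaseHook.get_connection("mysql_local")
    print(f"Region: {region}, connecting to {conn.host}:{conn.port}")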

v. Security

The Security section governs authentication and authorization within Airflow’s web interface.

Security

It allows administrators to manage users, assign roles, and define permissions that determine who can access or modify specific parts of the system.

Within this menu:

  • Users – manage individual accounts for accessing the UI.
  • Roles – define what actions users can perform (e.g., view-only vs. admin).
  • Actions, Resources, Permissions – provide fine-grained control over what parts of Airflow a user can interact with.

Strong security settings ensure that orchestration remains safe, auditable, and compliant, particularly in shared or enterprise environments.

vi. Documentation

At the bottom of the sidebar, Airflow provides quick links under the Documentation section.

Documentation

This includes direct access to:

  • Official Documentation – the complete Airflow user and developer guide,
  • GitHub Repository – the open-source codebase for Airflow,
  • REST API Reference – detailed API endpoints for programmatic orchestration control,
  • Version Info – the currently running Airflow version,

These links make it easy for users to explore Airflow’s architecture, extend its features, or troubleshoot issues, right from within the interface.

Airflow vs Cron

Airflow vs Cron

Many data engineers start automation with cron, the classic Unix scheduler: simple, reliable, and perfect for a single recurring script.

But as soon as workflows involve multiple dependent steps, data triggers, or retry logic, cron’s simplicity turns into fragility.

Apache Airflow moves beyond time-based scheduling into workflow orchestration, managing dependencies, scaling dynamically, and responding to data-driven events, all through native Python.

i. From Scheduling to Dynamic Orchestration

Cron schedules jobs strictly by time:

# Run a data cleaning script every midnight
0 0 * * * /usr/local/bin/clean_data.sh

That works fine for one job, but it breaks down when you need to coordinate a chain like:

extract → transform → train → upload

Cron can’t ensure that step two waits for step one, or that retries occur automatically if a task fails.

In Airflow, you express this entire logic natively in Python using the TaskFlow API:

from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule="@daily", start_date=datetime(2025,10,7), catchup=False)
def etl_pipeline():
    @task
    def extract(): ...

    @task
    def transform(data): ...

    @task
    def load(data): ...

    load(transform(extract()))

Here, tasks are functions, dependencies are inferred from function calls, and Airflow handles execution, retries, and state tracking automatically.

It’s the difference between telling the system when to run and teaching it how your workflow fits together.

ii. Visibility, Reliability, and Data Awareness

Where cron runs in the background, Airflow makes orchestration observable and intelligent.

Its Web UI and API provide transparency, showing task states, logs, dependencies, and retry attempts in real time.

Failures trigger automatic retries, and missed runs can be easily backfilled to maintain data continuity.
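
Retries are opt-in configuration rather than magic; here is a minimal sketch of setting them per task (retries and retry_delay are standard operator arguments passed through the @task decorator).

from datetime import timedelta

from airflow.decorators import task

@task(retries=3, retry_delay=timedelta(minutes=5))
def fetch_api_data():
    # If this raises an exception, Airflow reschedules it up to three more
    # times, waiting five minutes between attempts.
    ...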

Airflow also introduces data-aware scheduling: workflows can now run automatically when a dataset or asset updates, not just on a clock.

from airflow.assets import Asset  
sales_data = Asset("s3://data/sales.csv")

@dag(schedule=[sales_data], start_date=datetime(2025,10,7))
def refresh_dashboard():
    ...

This makes orchestration responsive: pipelines react to new data as it arrives, keeping dashboards and downstream models always fresh.

iii. Why This Matters

Cron is a timer.

Airflow is an orchestrator, coordinating complex, event-driven, and scalable data systems.

It brings structure, visibility, and resilience to automation, ensuring that each task runs in the right order, with the right data, and for the right reason.

That’s the leap from scheduling to orchestration, and why Airflow is much more than cron with an interface.

Common Airflow Use Cases

Workflow orchestration underpins nearly every data-driven system, from nightly ETL jobs to continuous model retraining.

Because Airflow couples time-based scheduling with dataset awareness and dynamic task mapping, it adapts easily to many workloads.

Below are the most common production-grade scenarios, all achievable through the TaskFlow API and Airflow’s modular architecture.

i. ETL / ELT Pipelines

ETL (Extract, Transform, Load) remains Airflow’s core use case.

Airflow lets you express a complete ETL pipeline declaratively, with each step defined as a Python @task.

from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule="@daily", start_date=datetime(2025,10,7), catchup=False)
def daily_sales_etl():

    @task
    def extract_sales():
        print("Pulling daily sales from API…")
        return ["sales_us.csv", "sales_uk.csv"]

    @task
    def transform_file(file):
        print(f"Cleaning and aggregating {file}")
        return f"clean_{file}"

    @task
    def load_to_warehouse(files):
        print(f"Loading {len(files)} cleaned files to BigQuery")

    # Dynamic Task Mapping: one transform per file
    cleaned = transform_file.expand(file=extract_sales())
    load_to_warehouse(cleaned)

daily_sales_etl()

Because each transformation task is created dynamically at runtime, the pipeline scales automatically as data sources grow.

When paired with datasets or assets, ETL DAGs can trigger immediately when new data arrives, ensuring freshness without manual scheduling.

ii. Machine Learning Pipelines

Airflow is ideal for orchestrating end-to-end ML lifecycles: data prep, training, evaluation, and deployment.

@dag(schedule="@weekly", start_date=datetime(2025,10,7))
def ml_training_pipeline():

    @task
    def prepare_data():
        return ["us_dataset.csv", "eu_dataset.csv"]

    @task
    def train_model(dataset):
        print(f"Training model on {dataset}")
        return f"model_{dataset}.pkl"

    @task
    def evaluate_models(models):
        print(f"Evaluating {len(models)} models and pushing metrics")

    # Fan-out training jobs
    models = train_model.expand(dataset=prepare_data())
    evaluate_models(models)

ml_training_pipeline()

Dynamic Task Mapping enables fan-out parallel training across datasets, regions, or hyper-parameters, a common pattern in large-scale ML systems.

Airflow’s deferrable sensors can pause training until external data or signals are ready, conserving compute resources.

iii. Analytics and Reporting

Analytics teams rely on Airflow to refresh dashboards and reports automatically.

Airflow can combine time-based and dataset-triggered scheduling so that dashboards always use the latest processed data.

from airflow import Dataset

summary_dataset = Dataset("s3://data/summary_table.csv")

@dag(schedule=[summary_dataset], start_date=datetime(2025,10,7))
def analytics_refresh():

    @task
    def update_powerbi():
        print("Refreshing Power BI dashboard…")

    @task
    def send_report():
        print("Emailing daily analytics summary")

    update_powerbi() >> send_report()

Whenever the summary dataset updates, this DAG runs immediately; no need to wait for a timed window.

That ensures dashboards remain accurate and auditable.

iv. Data Quality and Validation

Trusting your data is as important as moving it.

Airflow lets you automate quality checks and validations before promoting data downstream.

  • Run dbt tests or Great Expectations validations as tasks.
  • Use deferrable sensors to wait for external confirmations (e.g., API signals or file availability) without blocking workers.
  • Fail fast or trigger alerts when anomalies appear.

@task
def validate_row_counts():
    print("Comparing source and target row counts…")

@task
def check_schema():
    print("Ensuring schema consistency…")

validate_row_counts() >> check_schema()

These validations can be embedded directly into the main ETL DAG, creating self-monitoring pipelines that prevent bad data from spreading.
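
As a rough sketch of what that embedding can look like, the DAG below chains a validation step between extraction and loading so bad batches never reach the warehouse. Task names are illustrative and reuse the examples above.

from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2025, 10, 7), catchup=False)
def etl_with_checks():

    @task
    def extract():
        return ["sales_us.csv", "sales_uk.csv"]

    @task
    def validate_row_counts(files):
        print(f"Comparing source and target row counts for {len(files)} files…")
        return files

    @task
    def load_to_warehouse(files):
        print(f"Loading {len(files)} validated files")

    load_to_warehouse(validate_row_counts(extract()))

etl_with_checks()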

v. Infrastructure Automation and DevOps

Beyond data, Airflow orchestrates operational workflows such as backups, migrations, or cluster scaling.

With the Task SDK and provider integrations, you can automate infrastructure the same way you orchestrate data:

@dag(schedule="@daily", start_date=datetime(2025,10,7))
def infra_maintenance():

    @task
    def backup_database():
        print("Triggering RDS snapshot…")

    @task
    def cleanup_old_files():
        print("Deleting expired objects from S3…")

    backup_database() >> cleanup_old_files()

Airflow turns these system processes into auditable, repeatable, and observable jobs, blending DevOps automation with data-engineering orchestration.

With Airflow, orchestration goes beyond timing: it becomes data-aware, event-driven, and scalable, empowering teams to automate everything from raw data ingestion to production-ready analytics.

Summary and Up Next

In this tutorial, you explored the foundations of workflow orchestration and how Apache Airflow modernizes data automation through a modular, Pythonic, and data-aware architecture. You learned how Airflow structures workflows using DAGs and the TaskFlow API, scales effortlessly through Dynamic Task Mapping, and responds intelligently to data and events using deferrable tasks and the triggerer.

You also saw how its scheduler, executor, and web UI work together to ensure observability, resilience, and scalability far beyond what traditional schedulers like cron can offer.

In the next tutorial, you’ll bring these concepts to life by installing and running Airflow with Docker, setting up a complete environment where all core services (the API server, scheduler, metadata database, triggerer, and workers) operate in harmony.

From there, you’ll create and monitor your first DAG using the TaskFlow API, define dependencies and schedules, and securely manage connections and secrets.

Further Reading

Explore the official Airflow documentation to deepen your understanding of new features and APIs, and prepare your Docker environment for the next tutorial.

Then, apply what you’ve learned to start orchestrating real-world data workflows efficiently, reliably, and at scale.

Hands-On NoSQL with MongoDB: From Theory to Practice

26 September 2025 at 23:33

MongoDB is the most popular NoSQL database, but if you're coming from a SQL background, it can feel like learning a completely different language. Today, we're going hands-on to see exactly how document databases solve real data engineering problems.

Here's a scenario we’ll use to see MongoDB in action: You're a data engineer at a growing e-commerce company. Your customer review system started simple: star ratings and text reviews in a SQL database. But success has brought complexity. Marketing wants verified purchase badges. The mobile team is adding photo uploads. Product management is launching video reviews. Each change requires schema migrations that take hours with millions of existing reviews.

Sound familiar? This is the schema evolution problem that drives data engineers to NoSQL. Today, you'll see exactly how MongoDB solves it. We'll build this review system from scratch, handle those evolving requirements without a single migration, and connect everything to a real analytics pipeline.

Ready to see why MongoDB powers companies from startups to Forbes? Let's get started.

Setting Up MongoDB Without the Complexity

We're going to use MongoDB Atlas, MongoDB's managed cloud service. Atlas mirrors how you'll actually deploy MongoDB in most professional environments, is quick to set up, and gets us straight to learning MongoDB concepts. Alternatively, you could install MongoDB locally if you prefer.

1. Create your account

Go to MongoDB's Atlas page and create a free account. You won’t need to provide any credit card information — the free tier gives you 512MB of storage, which is more than enough for learning and even small production workloads. Once you're signed up, you'll create your first cluster.

Create your account

Click "Build a Database" and select the free shared cluster option. Select any cloud provider and choose a region near you. The defaults are fine because we're learning concepts, not optimizing performance. Name your cluster something simple, like "learning-cluster," and click Create.

2. Set up the database user and network access

While MongoDB sets up your distributed database cluster (yes, even the free tier is distributed across multiple servers), you need to configure access. MongoDB requires two things: a database user and network access rules.

For the database user, click "Database Access" in the left menu and add a new user. Choose password authentication and create credentials you'll remember. For permissions, select "Read and write to any database." Note that in production you'd be more restrictive, but we're learning.

Set up the database user and network access (1)

For network access, MongoDB may have already configured this during signup through their quickstart flow. Check "Network Access" in the left menu to see your current settings. If nothing is configured yet, click "Add IP Address" and select "Allow Access from Anywhere" for now (in production, you'd restrict this to specific IP addresses for security).

Set up the database user and network access (2)

Your cluster should be ready in about three minutes. When it's done, click the "Connect" button on your cluster. You'll see several connection options.

Set up the database user and network access (3)

3. Connect to MongoDB Compass

Choose "MongoDB Compass." This is MongoDB’s GUI tool that makes exploring data visual and intuitive.

Connect to MongoDB Compass (1)

Download Compass if you don't have it, then copy your connection string. It looks like this:

mongodb+srv://myuser:<password>@learning-cluster.abc12.mongodb.net/

Replace <password> with your actual password and connect through Compass. When it connects successfully, you'll see your cluster with a few pre-populated databases like admin, local, and maybe sample_mflix (MongoDB's movie database for demos). These are system databases and sample data (we'll create our own database next).

Connect to MongoDB Compass (2)

You've just set up a distributed database system that can scale to millions of documents. The same setup process works whether you're learning or launching a startup.

Understanding Documents Through Real Data

Now let's build our review system. In MongoDB Compass, you'll see a green "Create Database" button. Click it and create a database called ecommerce_analytics with a collection called customer_reviews.

Understanding documents through real data (1)

Understanding documents through real data (2)

A quick note on terminology: In MongoDB, a database contains collections, and collections contain documents. If you're coming from SQL, think of collections like tables and documents like rows, except documents are much more flexible.

Click into your new collection. You could add data through the GUI by clicking "Add Data" → "Insert Document", but let's use the built-in shell instead to get comfortable with MongoDB's query language. At the top right of Compass, look for the shell icon (">_") and click "Open MongoDB shell".

First, make sure we're using the right database:

use ecommerce_analytics

Now let's insert our first customer review using insertOne:

db.customer_reviews.insertOne({
  customer_id: "cust_12345",
  product_id: "wireless_headphones_pro",
  rating: 4,
  review_text: "Great sound quality, battery lasts all day. Wish they were a bit more comfortable for long sessions.",
  review_date: new Date("2024-10-15"),
  helpful_votes: 23,
  verified_purchase: true,
  purchase_date: new Date("2024-10-01")
})

MongoDB responds with confirmation that it worked:

{
  acknowledged: true,
  insertedId: ObjectId('68d31786d59c69a691408ede')
}

This is a complete review stored as a single document. In a traditional SQL database, this information might be spread across multiple tables: a reviews table, a votes table, maybe a purchases table for verification. Here, all the related data lives together in one document.

Now here's a scenario that usually breaks SQL schemas: the mobile team ships their photo feature, and instead of planning a migration, they just start storing photos:

db.customer_reviews.insertOne({
  customer_id: "cust_67890",
  product_id: "wireless_headphones_pro",
  rating: 5,
  review_text: "Perfect headphones! See the photo for size comparison.",
  review_date: new Date("2024-10-20"),
  helpful_votes: 45,
  verified_purchase: true,
  purchase_date: new Date("2024-10-10"),
  photo_url: "https://cdn.example.com/reviews/img_2024_10_20_abc123.jpg",
  device_type: "mobile_ios"
})

See the difference? We added photo_url and device_type fields, and MongoDB didn't complain about missing columns or require a migration. Each document just stores what makes sense for it. Of course, this flexibility comes with a trade-off: your application code needs to handle documents that might have different fields. When you're processing reviews, you'll need to check if a photo exists before trying to display it.

Let's add a few more reviews to build a realistic dataset (notice we’re using insertMany here):

db.customer_reviews.insertMany([
  {
    customer_id: "cust_11111",
    product_id: "laptop_stand_adjustable",
    rating: 3,
    review_text: "Does the job but feels flimsy",
    review_date: new Date("2024-10-18"),
    helpful_votes: 5,
    verified_purchase: false
  },
  {
    customer_id: "cust_22222",
    product_id: "wireless_headphones_pro",
    rating: 5,
    review_text: "Excelente producto! La calidad de sonido es increíble.",
    review_date: new Date("2024-10-22"),
    helpful_votes: 12,
    verified_purchase: true,
    purchase_date: new Date("2024-10-15"),
    video_url: "https://cdn.example.com/reviews/vid_2024_10_22_xyz789.mp4",
    video_duration_seconds: 45,
    language: "es"
  },
  {
    customer_id: "cust_33333",
    product_id: "laptop_stand_adjustable",
    rating: 5,
    review_text: "Much sturdier than expected. Height adjustment is smooth.",
    review_date: new Date("2024-10-23"),
    helpful_votes: 8,
    verified_purchase: true,
    sentiment_score: 0.92,
    sentiment_label: "very_positive"
  }
])

Take a moment to look at what we just created. Each document tells its own story: one has video metadata, another has sentiment scores, one is in Spanish. In a SQL world, you'd be juggling nullable columns or multiple tables. Here, each review just contains whatever data makes sense for it.

Querying Documents

Now that we have data, let's retrieve it. MongoDB's query language uses JSON-like syntax that feels natural once you understand the pattern.

Find matches

Finding documents by exact matches is straightforward using the find method with field names as keys:

// Find all 5-star reviews
db.customer_reviews.find({ rating: 5 })

// Find reviews for a specific product
db.customer_reviews.find({ product_id: "wireless_headphones_pro" })

You can use operators for more complex queries. MongoDB has operators like $gte (greater than or equal), $lt (less than), $ne (not equal), and many others:

// Find highly-rated reviews (4 stars or higher)
db.customer_reviews.find({ rating: { $gte: 4 } })

// Find recent verified purchase reviews
db.customer_reviews.find({
  verified_purchase: true,
  review_date: { $gte: new Date("2024-10-15") }
})

Here's something that would be painful in SQL: you can query for fields that might not exist in all documents:

// Find all reviews with videos
db.customer_reviews.find({ video_url: { $exists: true } })

// Find reviews with sentiment analysis
db.customer_reviews.find({ sentiment_score: { $exists: true } })

These queries don't fail when they encounter documents without these fields. Instead, they simply return the documents that match.

A quick note on performance

As your collection grows beyond a few thousand documents, you'll want to create indexes on fields you query frequently. Think of indexes like the index in a book — instead of flipping through every page to find "MongoDB," you can jump straight to the right section.

Let's create an index on product_id since we've been querying it:

db.customer_reviews.createIndex({ product_id: 1 })

The 1 means ascending order (you can use -1 for descending). MongoDB will now keep a sorted reference to all product_id values, making our product queries lightning fast even with millions of reviews. You don't need to change your queries at all; MongoDB automatically uses the index when it helps.

Update existing documents

Updating documents using updateOne is equally flexible. Let's say the customer service team starts adding sentiment scores to reviews:

db.customer_reviews.updateOne(
  { customer_id: "cust_12345" },
  {
    $set: {
      sentiment_score: 0.72,
      sentiment_label: "positive"
    }
  }
)

We used the $set operator, which tells MongoDB which fields to add or modify. In the output MongoDB tells us exactly what happened:

{
    acknowledged: true,
    insertedId: null,
    matchedCount: 1,
    modifiedCount: 1,
    upsertedCount: 0
}

We just added new fields to one document. The others? Completely untouched, with no migration required.

When someone finds a review helpful, we can increment the vote count using $inc:

db.customer_reviews.updateOne(
  { customer_id: "cust_67890" },
  { $inc: { helpful_votes: 1 } }
)

This operation is atomic, meaning it's safe even with multiple users voting simultaneously.

Analytics Without Leaving MongoDB

MongoDB's aggregate method lets you run analytics directly on your operational data using what's called an aggregation pipeline, which is a series of data transformations.

Average rating and review count

Let's answer a real business question: What's the average rating and review count for each product?

db.customer_reviews.aggregate([
  {
    $group: {
      _id: "$product_id",
      avg_rating: { $avg: "$rating" },
      review_count: { $sum: 1 },
      total_helpful_votes: { $sum: "$helpful_votes" }
    }
  },
  {
    $sort: { avg_rating: -1 }
  }
])
{
  _id: 'wireless_headphones_pro',
  avg_rating: 4.666666666666667,
  review_count: 3,
  total_helpful_votes: 81
}
{
  _id: 'laptop_stand_adjustable',
  avg_rating: 4,
  review_count: 2,
  total_helpful_votes: 13
}

Here's how the pipeline works: first, we group ($group) by product_id and calculate metrics for each group using operators like $avg and $sum. Then we sort ($sort) by average rating, using -1 to sort in descending order. The result gives us exactly what product managers need to understand product performance.

Trends over time

Let's try something more complex by analyzing review trends over time:

db.customer_reviews.aggregate([
  {
    $group: {
      _id: {
        month: { $month: "$review_date" },
        year: { $year: "$review_date" }
      },
      review_count: { $sum: 1 },
      avg_rating: { $avg: "$rating" },
      verified_percentage: {
        $avg: { $cond: ["$verified_purchase", 1, 0] }
      }
    }
  },
  {
    $sort: { "_id.year": 1, "_id.month": 1 }
  }
])
{
  _id: {
    month: 10,
    year: 2024
  },
  review_count: 5,
  avg_rating: 4.4,
  verified_percentage: 0.8
}

This query groups reviews by month using MongoDB's date operators like $month and $year, calculates the average rating, and computes what percentage were verified purchases. We used $cond to convert true/false values to 1/0, then averaged them to get the verification percentage. Marketing can use this to track review quality over time.

These queries answer real business questions directly on your operational data. Now let's see how to integrate this with Python for complete data pipelines.

Connecting MongoDB to Your Data Pipeline

Real data engineering connects systems. MongoDB rarely works in isolation because it's part of a larger data ecosystem. Let's connect it to Python, where you can integrate it with the rest of your pipeline.

Exporting data from MongoDB

You can export data from Compass in a few ways: export entire collections from the Documents tab, or build aggregation pipelines in the Aggregation tab and export those results. Choose JSON or CSV depending on your downstream needs.

For more flexibility with specific queries, let's use Python. First, install PyMongo, the official MongoDB driver:

pip install pymongo pandas

Here's a practical example that extracts data from MongoDB for analysis:

from pymongo import MongoClient
import pandas as pd

# Connect to MongoDB Atlas
# In production, store this as an environment variable for security
connection_string = "mongodb+srv://myuser:<password>@learning-cluster.abc12.mongodb.net/"  # replace <password> with your own
client = MongoClient(connection_string)
db = client.ecommerce_analytics

# Query high-rated reviews
high_rated_reviews = list(
    db.customer_reviews.find({
        "rating": {"$gte": 4}
    })
)

# Convert to DataFrame for analysis
df = pd.DataFrame(high_rated_reviews)

# Clean up MongoDB's internal _id field
if '_id' in df.columns:
    df = df.drop('_id', axis=1)

# Handle optional fields gracefully (remember our schema flexibility?)
df['has_photo'] = df['photo_url'].notna()
df['has_video'] = df['video_url'].notna()

# Analyze product performance
product_metrics = df.groupby('product_id').agg({
    'rating': 'mean',
    'helpful_votes': 'sum',
    'customer_id': 'count'
}).rename(columns={'customer_id': 'review_count'})

print("Product Performance (Last 30 Days):")
print(product_metrics)

# Export for downstream processing
df.to_csv('recent_positive_reviews.csv', index=False)
print(f"\nExported {len(df)} reviews for downstream processing")

This is a common pattern in data engineering: MongoDB stores operational data, Python extracts and transforms it, and the results feed into SQL databases, data warehouses, or BI tools.
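
To sketch the last leg of that pattern, here is one way the exported reviews could be loaded into a SQL database with pandas and SQLAlchemy. The warehouse connection string and table name are hypothetical, and you'd need SQLAlchemy plus a PostgreSQL driver (such as psycopg2) installed.

import pandas as pd
from sqlalchemy import create_engine

# Reuse the CSV exported by the script above
df = pd.read_csv("recent_positive_reviews.csv")

# Hypothetical local warehouse; swap in your real connection string
engine = create_engine("postgresql+psycopg2://analytics:analytics@localhost:5432/warehouse")

# Append the cleaned reviews to an analytics table
df.to_sql("customer_reviews", engine, if_exists="append", index=False)
print(f"Loaded {len(df)} reviews into the warehouse")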

Where MongoDB fits in larger data architectures

This pattern, using different databases for different purposes, is called polyglot persistence. Here's how it typically works in production:

  • MongoDB handles operational workloads: Flexible schemas, high write volumes, real-time applications
  • SQL databases handle analytical workloads: Complex queries, reporting, business intelligence
  • Python bridges the gap: Extracting, transforming, and loading data between systems

You might use MongoDB to capture raw user events in real-time, then periodically extract and transform that data into a PostgreSQL data warehouse where business analysts can run complex reports. Each database does what it does best.

The key is understanding that modern data pipelines aren't about choosing MongoDB OR SQL… they're about using both strategically. MongoDB excels at evolving schemas and horizontal scaling. SQL databases excel at complex analytics and mature tooling. Real data engineering combines them thoughtfully.

Review and Next Steps

You've covered significant ground today. You can now set up MongoDB, handle schema changes without migrations, write queries and aggregation pipelines, and connect everything to Python for broader data workflows.

This isn't just theoretical knowledge. You've worked through the same challenges that come up in real projects: evolving data structures, flexible document storage, and integrating NoSQL with analytical tools.

Your next steps depend on what you're trying to build:

If you want deeper MongoDB knowledge:

  • Learn about indexing strategies for query optimization
  • Explore change streams for real-time data processing
  • Try MongoDB's time-series collections for IoT data
  • Understand sharding for horizontal scaling
  • Practice thoughtful document design (flexibility doesn't mean "dump everything in one document")
  • Learn MongoDB's consistency trade-offs (it's not just "SQL but schemaless")

If you want to explore the broader NoSQL ecosystem:

  • Try Redis for caching. It's simpler than MongoDB and solves different problems
  • Experiment with Elasticsearch for full-text search across your reviews
  • Look at Cassandra for true time-series data at massive scale
  • Consider Neo4j if you need to analyze relationships between customers

If you want to build production systems:

  • Create a complete ETL pipeline: MongoDB → Airflow → PostgreSQL
  • Set up monitoring with MongoDB Atlas metrics
  • Implement proper error handling and retry logic
  • Learn about consistency levels and their trade-offs

The concepts you've learned apply beyond MongoDB. Document flexibility appears in DynamoDB and CouchDB. Aggregation pipelines exist in Elasticsearch. Using different databases for different parts of your pipeline is standard practice in modern systems.

You now understand when to choose NoSQL versus SQL, matching tools to problems. MongoDB handles flexible schemas and horizontal scaling well, whereas SQL databases excel at complex queries and transactions. Most real systems use both.

The next time you encounter rapidly changing requirements or need to scale beyond a single server, you'll recognize these as problems that NoSQL databases were designed to solve.

Project Tutorial: Build a Web Interface for Your Chatbot with Streamlit (Step-by-Step)

25 September 2025 at 00:02

You've built a chatbot in Python, but it only runs in your terminal. What if you could give it a sleek web interface that anyone can use? What if you could deploy it online for friends, potential employers, or clients to interact with?

In this hands-on tutorial, we'll transform a command-line chatbot into a professional web application using Streamlit. You'll learn to create an interactive interface with customizable personalities, real-time settings controls, and deploy it live on the internet—all without writing a single line of HTML, CSS, or JavaScript.

By the end of this tutorial, you'll have a deployed web app that showcases your AI development skills and demonstrates your ability to build user-facing applications.

Why Build a Web Interface for Your Chatbot?

A command-line chatbot is impressive to developers, but a web interface speaks to everyone. Portfolio reviewers, potential clients, and non-technical users can immediately see and interact with your work. More importantly, building web interfaces for AI applications is a sought-after skill as businesses increasingly want to deploy AI tools that their teams can actually use.

Streamlit makes this transition seamless. Instead of learning complex web frameworks, you'll use Python syntax you already know to create professional-looking applications in minutes, not days.

What You'll Build

  • Interactive web chatbot with real-time personality switching
  • Customizable controls for AI parameters (temperature, token limits)
  • Professional chat interface with user/assistant message distinction
  • Reset functionality and conversation management
  • Live deployment accessible from any web browser
  • Foundation for more advanced AI applications

Before You Start: Pre-Instruction

To make the most of this project walkthrough, follow these preparatory steps:

1. Review the Project

Explore the goals and structure of this project: Start the project here

2. Complete Your Chatbot Foundation

Essential Prerequisite: If you haven't already, complete the previous chatbot project to build your core logic. You'll need a working Python chatbot with conversation memory and token management before starting this tutorial.

3. Set Up Your Development Environment

Required Tools:

  • Python IDE (VS Code or PyCharm recommended)
  • OpenAI API key (or Together AI for a free alternative)
  • GitHub account for deployment

We'll be working with standard Python files (.py format) instead of Jupyter notebooks, so make sure you're comfortable coding in your chosen IDE.

4. Install and Test Streamlit

Install the required packages:

pip install streamlit openai tiktoken

Test your installation with a simple demo:

import streamlit as st
st.write("Hello Streamlit!")

Save this as test.py and run the following in the command line:

streamlit run test.py

If a browser window opens with the message "Hello Streamlit!", you're ready to proceed.

5. Verify Your API Access

Test your OpenAI API key works:

import os
from openai import OpenAI

api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)

# Simple test call
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello!"}],
    max_tokens=10
)

print(response.choices[0].message.content)

6. Access the Complete Solution

View and download the solution files: Solution Repository

What you'll find:

  • starter_code.py - The initial chatbot code we'll start with
  • final.py - Complete Streamlit application
  • requirements.txt - All necessary dependencies
  • Deployment configuration files

Starting Point: Your Chatbot Foundation

If you don't have a chatbot already, create a file called starter_code.py with this foundation:

import os
from openai import OpenAI
import tiktoken

# Configuration
api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)
MODEL = "gpt-4o-mini"
TEMPERATURE = 0.7
MAX_TOKENS = 100
TOKEN_BUDGET = 1000
SYSTEM_PROMPT = "You are a fed up and sassy assistant who hates answering questions."

messages = [{"role": "system", "content": SYSTEM_PROMPT}]

# Token management functions (collapsed for clarity)
def get_encoding(model):
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        print(f"Warning: Tokenizer for model '{model}' not found. Falling back to 'cl100k_base'.")
        return tiktoken.get_encoding("cl100k_base")

ENCODING = get_encoding(MODEL)

def count_tokens(text):
    return len(ENCODING.encode(text))

def total_tokens_used(messages):
    try:
        return sum(count_tokens(msg["content"]) for msg in messages)
    except Exception as e:
        print(f"[token count error]: {e}")
        return 0

def enforce_token_budget(messages, budget=TOKEN_BUDGET):
    try:
        while total_tokens_used(messages) > budget:
            if len(messages) <= 2:
                break
            messages.pop(1)
    except Exception as e:
        print(f"[token budget error]: {e}")

# Core chat function
def chat(user_input):
    messages.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS
    )

    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})

    enforce_token_budget(messages)
    return reply

This gives us a working chatbot with conversation memory and cost controls. Now let's transform it into a web app.

Part 1: Your First Streamlit Interface

Create a new file called app.py and copy your starter code into it. Now we'll add the web interface layer.

Add the Streamlit import at the top:

import streamlit as st

At the bottom of your file, add your first Streamlit elements:

### Streamlit Interface ###
st.title("Sassy Chatbot")

Test your app by running this in your terminal:

streamlit run app.py

Your default browser should open showing your web app with the title "Sassy Chatbot." Notice the auto-reload feature; when you save changes, Streamlit prompts you to rerun the app.

Learning Insight: Streamlit uses "magic" rendering. You don't need to explicitly display elements. Simply calling st.title() automatically renders the title in your web interface.

Part 2: Building the Control Panel

Real applications need user controls. Let's add a sidebar with personality options and parameter controls.

Building the Control Panel

Add this after your title:

# Sidebar controls
st.sidebar.header("Options")
st.sidebar.write("This is a demo of a sassy chatbot using OpenAI's API.")

# Temperature and token controls
max_tokens = st.sidebar.slider("Max Tokens", 1, 250, 100)
temperature = st.sidebar.slider("Temperature", 0.0, 1.0, 0.7)

# Personality selection
system_message_type = st.sidebar.selectbox("System Message",
    ("Sassy Assistant", "Angry Assistant", "Custom"))

Save and watch your sidebar populate with interactive controls. These sliders automatically store their values in the respective variables when users interact with them.

Adding Dynamic Personality System

Now let's make the personality selection actually work:

# Dynamic system prompt based on selection
if system_message_type == "Sassy Assistant":
    SYSTEM_PROMPT = "You are a sassy assistant that is fed up with answering questions."
elif system_message_type == "Angry Assistant":
    SYSTEM_PROMPT = "You are an angry assistant that likes yelling in all caps."
elif system_message_type == "Custom":
    SYSTEM_PROMPT = st.sidebar.text_area("Custom System Message",
        "Enter your custom system message here.")
else:
    SYSTEM_PROMPT = "You are a helpful assistant."

The custom option creates a text area where users can write their own personality instructions. Try switching between personalities and notice how the interface adapts.

Part 3: Understanding Session State

Here's where Streamlit gets tricky. Every time a user interacts with your app, Streamlit reruns the entire script from top to bottom. This would normally reset your chat history every time, which is not what we want for a conversation!

Session state solves this by persisting data across app reruns:

# Initialize session state for conversation memory
if "messages" not in st.session_state:
    st.session_state.messages = [{"role": "system", "content": SYSTEM_PROMPT}]

This creates a persistent messages list that survives app reruns. Now we need to modify our chat function to use session state:

def chat(user_input, temperature=TEMPERATURE, max_tokens=MAX_TOKENS):
    # Get messages from session state
    messages = st.session_state.messages
    messages.append({"role": "user", "content": user_input})

    enforce_token_budget(messages)

    # Add loading spinner for better UX
    with st.spinner("Thinking..."):
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens
        )

    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

Learning Insight: Session state is like a dictionary that persists between app reruns. Think of it as your app's memory system.

Part 4: Interactive Buttons and Controls

Interactive Buttons and Controls

Let's add buttons to make the interface more user-friendly:

# Control buttons
if st.sidebar.button("Apply New System Message"):
    st.session_state.messages[0] = {"role": "system", "content": SYSTEM_PROMPT}
    st.success("System message updated.")

if st.sidebar.button("Reset Conversation"):
    st.session_state.messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    st.success("Conversation reset.")

These buttons provide immediate feedback with success messages, creating a more polished user experience.

Part 5: The Chat Interface

The Chat Interface

Now for the main event—the actual chat interface. Add this code:

# Chat input using walrus operator
if prompt := st.chat_input("What is up?"):
    reply = chat(prompt, temperature=temperature, max_tokens=max_tokens)

# Display chat history
for message in st.session_state.messages[1:]:  # Skip system message
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

The chat_input widget creates a text box at the bottom of your app. The walrus operator (:=) assigns the user input to prompt and checks if it exists in one line.

Visual Enhancement: Streamlit automatically adds user and assistant icons to chat messages when you use the proper role names ("user" and "assistant").

Part 6: Testing Your Complete App

Save your file and test the complete interface:

  1. Personality Test: Switch between Sassy and Angry assistants, apply the new system message, then chat to see the difference
  2. Memory Test: Have a conversation, then reference something you said earlier
  3. Parameter Test: Drag the max tokens slider to 1 and see how responses get cut off
  4. Reset Test: Use the reset button to clear conversation history

Your complete working app should look something like this:

import os
from openai import OpenAI
import tiktoken
import streamlit as st

# API and model configuration
api_key = st.secrets.get("OPENAI_API_KEY") or os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)
MODEL = "gpt-4o-mini"
TEMPERATURE = 0.7
MAX_TOKENS = 100
TOKEN_BUDGET = 1000
SYSTEM_PROMPT = "You are a fed up and sassy assistant who hates answering questions."

# [Token management functions here - same as starter code]

def chat(user_input, temperature=TEMPERATURE, max_tokens=MAX_TOKENS):
    messages = st.session_state.messages
    messages.append({"role": "user", "content": user_input})
    enforce_token_budget(messages)

    with st.spinner("Thinking..."):
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens
        )

    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

### Streamlit Interface ###
st.title("Sassy Chatbot")
st.sidebar.header("Options")
st.sidebar.write("This is a demo of a sassy chatbot using OpenAI's API.")

max_tokens = st.sidebar.slider("Max Tokens", 1, 250, 100)
temperature = st.sidebar.slider("Temperature", 0.0, 1.0, 0.7)
system_message_type = st.sidebar.selectbox("System Message",
    ("Sassy Assistant", "Angry Assistant", "Custom"))

if system_message_type == "Sassy Assistant":
    SYSTEM_PROMPT = "You are a sassy assistant that is fed up with answering questions."
elif system_message_type == "Angry Assistant":
    SYSTEM_PROMPT = "You are an angry assistant that likes yelling in all caps."
elif system_message_type == "Custom":
    SYSTEM_PROMPT = st.sidebar.text_area("Custom System Message",
        "Enter your custom system message here.")

if "messages" not in st.session_state:
    st.session_state.messages = [{"role": "system", "content": SYSTEM_PROMPT}]

if st.sidebar.button("Apply New System Message"):
    st.session_state.messages[0] = {"role": "system", "content": SYSTEM_PROMPT}
    st.success("System message updated.")

if st.sidebar.button("Reset Conversation"):
    st.session_state.messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    st.success("Conversation reset.")

if prompt := st.chat_input("What is up?"):
    reply = chat(prompt, temperature=temperature, max_tokens=max_tokens)

for message in st.session_state.messages[1:]:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

Part 7: Deploying to the Internet

Running locally is great for development, but deployment makes your project shareable and accessible to others. Streamlit Community Cloud offers free hosting directly from your GitHub repository.

Prepare for Deployment

First, create the required files in your project folder:

requirements.txt:

openai
streamlit
tiktoken

.gitignore:

.streamlit/

Note that if you’ve stored your API key in a .env file you should add this to .gitignore as well.

Secrets Management: Create a .streamlit/secrets.toml file locally:

OPENAI_API_KEY = "your-api-key-here"

Important: Add .streamlit/ to your .gitignore so you don't accidentally commit your API key to GitHub.

GitHub Setup

  1. Create a new GitHub repository
  2. Push your code (the .gitignore will protect your secrets)
  3. Your repository should contain: app.py, requirements.txt, and .gitignore

Deploy to Streamlit Cloud

  1. Go to share.streamlit.io

  2. Connect your GitHub account

  3. Select your repository and main branch

  4. Choose your app file (app.py)

  5. In Advanced settings, add your API key as a secret:

    OPENAI_API_KEY = "your-api-key-here"
  6. Click "Deploy"

Within minutes, your app will be live at a public URL that you can share with anyone!

Security Note: The secrets you add in Streamlit Cloud are encrypted and secure. Never put API keys directly in your code files.

Understanding Key Concepts

Session State Deep Dive

Session state is Streamlit's memory system. Without it, every user interaction would reset your app completely. Think of it as a persistent dictionary that survives app reruns:

# Initialize once
if "my_data" not in st.session_state:
    st.session_state.my_data = []

# Use throughout your app
st.session_state.my_data.append("new item")

The Streamlit Execution Model

Streamlit reruns your entire script on every interaction. This "reactive" model means:

  • Your app always shows the current state
  • You need session state for persistence
  • Expensive operations should be cached or minimized (see the caching sketch below)
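
For the caching point, Streamlit's @st.cache_data decorator is the usual tool; here's a minimal sketch, where reference.csv is a placeholder for whatever expensive load your app performs.

import pandas as pd
import streamlit as st

# The decorated function runs once per unique argument; later reruns triggered
# by widget interactions reuse the cached result instead of reloading the file.
@st.cache_data
def load_reference_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

df = load_reference_data("reference.csv")  # placeholder path
st.write(df.head())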

Widget State Management

Widgets (sliders, inputs, buttons) automatically manage their state, as the short sketch after this list shows:

  • Slider values are always current
  • Button presses trigger reruns
  • Form inputs update in real-time
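
Here's a small, self-contained example of that behavior: giving a widget a key mirrors its current value into st.session_state, so any part of the script can read it after each rerun (the label and key names are arbitrary).

import streamlit as st

# The key argument mirrors the slider's value into st.session_state
st.slider("Example value", 0, 100, 25, key="example_value")

if st.button("Show current value"):
    # Read the live widget value from session state after the rerun
    st.write(f"Slider currently holds: {st.session_state.example_value}")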

Troubleshooting Common Issues

  • "No module named 'streamlit'": Install Streamlit with pip install streamlit
  • API key errors: Verify your environment variables or Streamlit secrets are set correctly
  • App won't reload: Check for Python syntax errors in your terminal output
  • Session state not working: Ensure you're checking if "key" not in st.session_state: before initializing
  • Deployment fails: Verify your requirements.txt includes all necessary packages

Extending Your Chatbot App

Immediate Enhancements

  • File Upload: Let users upload documents for the chatbot to reference
  • Export Conversations: Add a download button for chat history (see the sketch after this list)
  • Usage Analytics: Track token usage and costs
  • Multiple Chat Sessions: Support multiple conversation threads
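
As one sketch of the export idea, st.download_button can serve the session-state chat history as a JSON file; you could drop something like this into the sidebar section of app.py.

import json
import streamlit as st

# Offer the conversation kept in session state as a downloadable JSON file
if "messages" in st.session_state:
    chat_json = json.dumps(st.session_state.messages, indent=2)
    st.sidebar.download_button(
        label="Download chat history",
        data=chat_json,
        file_name="chat_history.json",
        mime="application/json",
    )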

Advanced Features

  • User Authentication: Require login for personalized experiences
  • Database Integration: Store conversations permanently
  • Voice Interface: Add speech-to-text and text-to-speech
  • Multi-Model Support: Let users choose different AI models

Business Applications

  • Customer Service Bot: Deploy for client support with company-specific knowledge
  • Interview Prep Tool: Create domain-specific interview practice bots
  • Educational Assistant: Build tutoring bots for specific subjects
  • Content Generator: Develop specialized writing assistants

Key Takeaways

Building web interfaces for AI applications demonstrates that you can bridge the gap between technical capability and user accessibility. Through this tutorial, you've learned:

Technical Skills:

  • Streamlit fundamentals and reactive programming model
  • Session state management for persistent applications
  • Web deployment from development to production
  • Integration patterns for AI APIs in web contexts

Professional Skills:

  • Creating user-friendly interfaces for technical functionality
  • Managing secrets and security in deployed applications
  • Building portfolio-worthy projects that demonstrate real-world skills
  • Understanding the path from prototype to production application

Strategic Understanding:

  • Why web interfaces matter for AI applications
  • How to make technical projects accessible to non-technical users
  • The importance of user experience in AI application adoption

You now have a deployed chatbot application that showcases multiple in-demand skills: AI integration, web development, user interface design, and cloud deployment. This foundation prepares you to build more sophisticated applications and demonstrates your ability to create complete, user-facing AI solutions.

More Projects to Try

We have some other project walkthrough tutorials you may also enjoy.

Introduction to NoSQL: What It Is and Why You Need It

19 September 2025 at 22:25

Picture yourself as a data engineer at a fast-growing social media company. Every second, millions of users are posting updates, uploading photos, liking content, and sending messages. Your job is to capture all of this activity—billions of events per day—store it somewhere useful, and transform it into insights that the business can actually use.

You set up a traditional SQL database, carefully designing tables for posts, likes, and comments. Everything works great... for about a week. Then the product team launches "reactions," adding hearts and laughs to "likes". Next week, story views. The week after, live video metrics. Each change means altering your database schema, and with billions of rows, these migrations take hours while your server struggles with the load.

This scenario isn't hypothetical. It's exactly what companies like Facebook, Amazon, and Google faced in the early 2000s. The solution they developed became what we now call NoSQL.

These are exactly the problems NoSQL databases solve, and understanding them will change how you think about data storage. By the end of this tutorial, you'll be able to:

  • Understand what NoSQL databases are and how they differ from traditional SQL databases
  • Identify the four main types of NoSQL databases—document, key-value, column-family, and graph—and when to use each one
  • Make informed decisions about when to choose NoSQL vs SQL for your data engineering projects
  • See real-world examples from companies like Netflix and Uber showing how these databases work together in production
  • Get hands-on experience with MongoDB to cement these concepts with practical skills

Let's get started!

What NoSQL Really Means (And Why It Exists)

Let's clear up a common confusion right away: NoSQL originally stood for "No SQL" when developers were frustrated with the limitations of relational databases. But as these new databases matured, the community realized that throwing away SQL entirely was like throwing away a perfectly good hammer just because you also needed a screwdriver. Today, NoSQL means "Not Only SQL." These databases complement traditional SQL databases rather than replacing them.

To understand why NoSQL emerged, we need to understand what problem it was solving. Traditional SQL databases were designed when storage was expensive, data was small, and schemas were stable. They excel at maintaining consistency but scale vertically—when you need more power, you buy a bigger server.

By the 2000s, this broke down. Companies faced massive, messy, constantly changing data. Buying bigger servers wasn't sustainable, and rigid table structures couldn't handle the variety.

NoSQL databases were designed from the ground up for this new reality. Instead of scaling up by buying bigger machines, they scale out by adding more commodity servers. Instead of requiring you to define your data structure upfront, they let you store data first and figure out its structure later. And instead of keeping all data on one machine for consistency, they spread it across many machines for resilience and performance.

Understanding NoSQL Through a Data Pipeline Lens

As a data engineer, you'll rarely use just one database. Instead, you'll build pipelines where different databases serve different purposes. Think of it like cooking a complex meal: you don't use the same pot for everything. You use a stockpot for soup, a skillet for searing, and a baking dish for the oven. Each tool has its purpose.

Let's walk through a typical data pipeline to see where NoSQL fits.

The Ingestion Layer

At the very beginning of your pipeline, you have raw data landing from everywhere. This is often messy. When you're pulling data from mobile apps, web services, IoT devices, and third-party APIs, each source has its own format and quirks. Worse, these formats change without warning.

A document database like MongoDB thrives here because it doesn't force you to know the exact structure beforehand. If the mobile team adds a new field to their events tomorrow, MongoDB will simply store it. No schema migration, no downtime.

The Processing Layer

Moving down the pipeline, you're transforming, aggregating, and enriching your data. Some of this happens in real time (recommendation feeds) and some in batches (daily metrics).

For lightning-fast lookups, Redis keeps frequently accessed data in memory. User preferences load instantly rather than waiting for complex database queries.

The Serving Layer

Finally, the serving layer is where cleaned, processed data becomes available for analysis and applications. This is often where SQL databases shine with their powerful query capabilities and mature tooling. But even here, NoSQL plays a role. Time-series data might live in Cassandra where it can be queried efficiently by time range. Graph relationships might be stored in Neo4j for complex network analysis.

The key insight is that modern data architectures are polyglot. They use multiple database technologies, each chosen for its strengths. NoSQL databases don't replace SQL; they handle the workloads that SQL struggles with.

The Four Flavors of NoSQL (And When to Use Each)

NoSQL isn't a single technology but rather four distinct database types, each optimized for different patterns. Understanding these differences is essential because choosing the wrong type can lead to performance headaches, operational complexity, and frustrated developers.

Document Databases: The Flexible Containers

Document databases store data as documents, typically in JSON format. If you've worked with JSON before, you already understand the basic concept. Each document is self-contained, with its own structure that can include nested objects and arrays.

Imagine you're building a product catalog for an e-commerce site:

  • A shirt has size and color attributes
  • A laptop has RAM and processor speed
  • A digital download has file format and license type

In a SQL database, you'd need separate tables for each product type or a complex schema with many nullable columns. In MongoDB, each product is just a document with whatever fields make sense for that product.
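
To make that concrete, here's a minimal sketch using the pymongo client against a local MongoDB instance; the database, collection, and field names are illustrative assumptions, not something this tutorial sets up for you.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]  # hypothetical database and collection

# Three products, three different shapes -- no schema migration required
products.insert_many([
    {"name": "Oxford Shirt", "type": "apparel", "size": "M", "color": "blue"},
    {"name": "Ultrabook 14", "type": "laptop", "ram_gb": 16, "cpu_ghz": 3.2},
    {"name": "Photo Pack", "type": "download", "file_format": "zip", "license": "personal"},
])

# Query by whichever fields a document actually has
for doc in products.find({"type": "laptop"}):
    print(doc["name"], doc["ram_gb"])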

Best for:

  • Content management systems
  • Event logging and analytics
  • Mobile app backends
  • Any application with evolving data structures

This flexibility makes document databases perfect for situations where your data structure evolves frequently or varies between records. But remember: flexibility doesn't mean chaos. You still want consistency within similar documents, just not the rigid structure SQL demands.

Key-Value Stores: The Speed Demons

Key-value stores are the simplest NoSQL type: just keys mapped to values. Think of them like a massive Python dictionary or JavaScript object that persists across server restarts. This simplicity is their superpower. Without complex queries or relationships to worry about, key-value stores can be blazingly fast.

Redis, the most popular key-value store, keeps data in memory for extremely fast access times, often under a millisecond for simple lookups. Consider these real-world uses:

  • Netflix showing you personalized recommendations
  • Uber matching you with a nearby driver
  • Gaming leaderboards updating in real-time
  • Shopping carts persisting across sessions

The pattern here is clear: when you need simple lookups at massive scale and incredible speed, key-value stores deliver.
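
As a rough sketch of the caching pattern, here's what that looks like with the redis-py client against a local Redis server; the key name and stored fields are made-up examples.

import json

import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Cache a user's preferences for five minutes under an illustrative key
prefs = {"theme": "dark", "language": "en"}
r.setex("user:42:prefs", 300, json.dumps(prefs))

# Later lookups hit memory instead of a slower database query
cached = r.get("user:42:prefs")
if cached is not None:
    prefs = json.loads(cached)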

The trade-off: You can only look up data by its key. No querying by other attributes, no relationships, no aggregations. You wouldn't build your entire application on Redis, but for the right use case, nothing else comes close to its performance.

Column-Family Databases: The Time-Series Champions

Column-family databases organize data differently than you might expect. Instead of rows with fixed columns like SQL, they store data in column families — groups of related columns that can vary between rows. This might sound confusing, so let's use a concrete example.

Imagine you're storing temperature readings from thousands of IoT sensors:

  • Each sensor reports at different intervals (some every second, others every minute)
  • Some sensors report temperature only
  • Others also report humidity, pressure, or both
  • You need to query millions of readings by time range

In a column-family database like Cassandra, each sensor becomes a row with different column families. You might have a "measurements" family containing temperature, humidity, and pressure columns, and a "metadata" family with location and sensor_type. This structure makes it extremely efficient to query all measurements for a specific sensor and time range, or to retrieve just the metadata without loading the measurement data.
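
In practice, modern Cassandra expresses this through CQL tables, where the partition key and clustering key encode the access pattern. Here's a rough sketch using the DataStax Python driver; the keyspace, table, and column names are assumptions for illustration.

from datetime import datetime

from cassandra.cluster import Cluster  # DataStax Python driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("iot")  # hypothetical keyspace

# Partition by sensor, cluster by time: time-range reads per sensor stay cheap
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id text,
        reading_time timestamp,
        temperature double,
        humidity double,
        PRIMARY KEY (sensor_id, reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC)
""")

rows = session.execute(
    "SELECT reading_time, temperature FROM readings "
    "WHERE sensor_id = %s AND reading_time >= %s",
    ("sensor-17", datetime(2025, 1, 1)),
)
for row in rows:
    print(row.reading_time, row.temperature)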

Perfect for:

  • Application logs and metrics
  • IoT sensor data
  • Financial market data
  • Any append-heavy, time-series workload

This design makes column-family databases exceptional at handling write-heavy workloads and scenarios where you're constantly appending new data.

Graph Databases: The Relationship Experts

Graph databases take a completely different approach. Instead of tables or documents, they model data as nodes (entities) and edges (relationships). This might seem niche, but when relationships are central to your queries, graph databases turn complex problems into simple ones.

Consider LinkedIn's "How you're connected" feature. Finding the path between you and another user with SQL would require recursive joins that become exponentially complex as the network grows. In a graph database like Neo4j, this is a basic traversal operation that can handle large networks efficiently. While performance depends on query complexity and network structure, graph databases excel at relationship-heavy problems that would be nearly impossible to solve efficiently in SQL.
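
To give you a feel for it, here's a sketch of that traversal with the official Neo4j Python driver and Cypher; the connection details, node label, and relationship type are assumptions, not a prescribed data model.

from neo4j import GraphDatabase  # official Neo4j Python driver

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Find the shortest connection chain between two people, up to four hops away
query = """
MATCH p = shortestPath(
  (me:Person {name: $me})-[:CONNECTED_TO*..4]-(them:Person {name: $them})
)
RETURN [n IN nodes(p) | n.name] AS chain
"""

with driver.session() as session:
    record = session.run(query, me="Alice", them="Bob").single()
    if record:
        print(" -> ".join(record["chain"]))

driver.close()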

Graph databases excel at:

  • Recommendation engines ("customers who bought this also bought...")
  • Fraud detection (finding connected suspicious accounts)
  • Social network analysis (identifying influencers)
  • Knowledge graphs (mapping relationships between concepts)
  • Supply chain optimization (tracing dependencies)

They're specialized tools, but for the right problems, they're invaluable. If your core challenge involves understanding how things connect and influence each other, graph databases provide elegant solutions that would be nightmarish in other systems.

Making the NoSQL vs SQL Decision

One of the most important skills you'll develop as a data engineer is knowing when to use NoSQL versus SQL. The key is matching each database type to the problems it solves best.

When NoSQL Makes Sense

If your data structure changes frequently (like those social media events we talked about earlier), the flexibility of document databases can save you from constant schema migrations. When you're dealing with massive scale, NoSQL's ability to distribute data across many servers becomes critical. Traditional SQL databases can scale to impressive sizes, but when you're talking about petabytes of data or millions of requests per second, NoSQL's horizontal scaling model is often more cost-effective.

NoSQL also shines when your access patterns are simple:

  • Looking up records by ID
  • Retrieving entire documents
  • Querying time-series data by range
  • Caching frequently accessed data

These databases achieve incredible performance by optimizing for specific patterns rather than trying to be everything to everyone.

When SQL Still Rules

SQL databases remain unbeatable for complex queries. The ability to join multiple tables, perform aggregations, and write sophisticated analytical queries is where SQL's decades of development really show. If your application needs to answer questions like "What's the average order value for customers who bought product A but not product B in the last quarter?", SQL makes this straightforward, while NoSQL might require multiple queries and application-level processing.

Another SQL strength is keeping your data accurate and reliable. When you're dealing with financial transactions, inventory management, or any scenario where consistency is non-negotiable, traditional SQL databases ensure your data stays correct. Many NoSQL databases offer "eventual consistency." This means your data will be consistent across all nodes eventually, but there might be brief moments where different nodes show different values. For many applications this is fine, but for others it's a deal-breaker.

The choice between SQL and NoSQL often comes down to your specific needs rather than one being universally better. SQL databases have had decades to mature their tooling and build deep integrations with business intelligence platforms. But NoSQL databases have caught up quickly, especially with the rise of managed cloud services that handle much of the operational complexity.

Common Pitfalls and How to Avoid Them

As you start working with NoSQL, there are some common mistakes that almost everyone makes. Let’s help you avoid them.

The "Schemaless" Trap

The biggest misconception is that "schemaless" means "no design required." Just because MongoDB doesn't enforce a schema doesn't mean you shouldn't have one. In fact, NoSQL data modeling often requires more upfront thought than SQL. You need to understand your access patterns and design your data structure around them.

In document databases, you might denormalize data that would be in separate SQL tables. In key-value stores, your key design determines your query capabilities. It's still careful design work, just focused on access patterns rather than normalization rules.
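
For example, a denormalized order document and a deliberate key-naming scheme might look like this; the field names and key format are only a sketch of the idea.

# Denormalized: the customer's name and item prices are copied into the order
# document, so a single read answers "show me this order" without any joins.
order_doc = {
    "order_id": "ord-1001",
    "customer": {"id": "cust-42", "name": "Ada Lovelace"},
    "items": [
        {"sku": "SKU-1", "title": "USB cable", "price": 9.99, "qty": 2},
        {"sku": "SKU-2", "title": "Keyboard", "price": 49.00, "qty": 1},
    ],
    "total": 68.98,
}

# In a key-value store, the key itself encodes the access pattern: you can only
# fetch "the current cart for user 42" quickly because the key was designed for it.
cart_key = "cart:user:42"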

Underestimating Operations

Many newcomers underestimate the operational complexity of NoSQL. While managed services have improved this significantly, running your own Cassandra cluster or MongoDB replica set requires understanding concepts like:

  • Consistency levels and their trade-offs
  • Replication strategies and failure handling
  • Partition tolerance and network splits
  • Backup and recovery procedures
  • Performance tuning and monitoring

Even with managed services, you need to understand these concepts to use the databases effectively.

The Missing Joins Problem

In SQL, you can easily combine data from multiple tables with joins. Most NoSQL databases don't support this, which surprises people coming from SQL. So how do you handle relationships between your data? You have three options:

  1. Denormalize your data: Store redundant copies where needed
  2. Application-level joins: Multiple queries assembled in your code
  3. Choose a different database: Sometimes SQL is simply the right choice

The specifics of these approaches go beyond what we'll cover here, but being aware that joins don't exist in NoSQL will save you from some painful surprises down the road.

Getting Started: Your Path Forward

So where do you begin with all of this? The variety of NoSQL databases can feel overwhelming, but you don't need to learn everything at once.

Start with a Real Problem

Don't choose a database and then look for problems to solve. Instead, identify a concrete use case:

  • Have JSON data with varying structure? Try MongoDB
  • Need to cache data for faster access? Experiment with Redis
  • Working with time-series data? Set up a Cassandra instance
  • Analyzing relationships? Consider Neo4j

Having a concrete use case makes learning much more effective than abstract tutorials.

Focus on One Type First

Pick one NoSQL type and really understand it before moving to others. Document databases like MongoDB are often the most approachable if you're coming from SQL. The document model is intuitive, and the query language is relatively familiar.

Use Managed Services

While you're learning, use managed services like MongoDB Atlas, Amazon DynamoDB, or Redis Cloud instead of running your own clusters. Setting up distributed databases is educational, but it's a distraction when you're trying to understand core concepts.

Remember the Bigger Picture

Most importantly, remember that NoSQL is a tool in your toolkit, not a replacement for everything else. The most successful data engineers understand both SQL and NoSQL, knowing when to use each and how to make them work together.

Next Steps

You've covered a lot of ground today. You now:

  • Understand what NoSQL databases are and why they exist
  • Know the four main types and their strengths
  • Can identify when to choose NoSQL vs SQL for different use cases
  • Recognize how companies use multiple databases together in real systems
  • Understand the common pitfalls to avoid as you start working with NoSQL

With this conceptual foundation, you're ready to get hands-on and see how these databases actually work. You understand the big picture of where NoSQL fits in modern data engineering, but there's nothing like working with real data to make it stick.

The best way to build on what you've learned is to pick one database and start experimenting:

  • Get hands-on with MongoDB by setting up a database, loading real data, and practicing queries. Document databases are often the most approachable starting point.
  • Design a multi-database project for your portfolio. Maybe an e-commerce analytics pipeline that uses MongoDB for raw events, Redis for caching, and PostgreSQL for final reports.
  • Learn NoSQL data modeling to understand how to structure documents, design effective keys, and handle relationships without joins.
  • Explore stream processing patterns to see how Kafka works with NoSQL databases to handle real-time data flows.
  • Try cloud NoSQL services like DynamoDB, Cosmos DB, or Cloud Firestore to understand managed database offerings.
  • Study polyglot architectures by researching how companies like Netflix, Spotify, or GitHub combine different database types in their systems.

Each of these moves you toward the kind of hands-on experience that employers value. Modern data teams expect you to understand both SQL and NoSQL, and more importantly, to know when and why to use each.

The next time you're faced with billions of rapidly changing events, evolving data schemas, or the need to scale beyond a single server, you'll have the knowledge to choose the right tool for the job. That's the kind of systems thinking that makes great data engineers.

Project Tutorial: Build an AI Chatbot with Python and the OpenAI API

19 September 2025 at 22:03

Learning to work directly with AI programmatically opens up a world of possibilities beyond using ChatGPT in a browser. When you understand how to connect to AI services using application programming interfaces (APIs), you can build custom applications, integrate AI into existing systems, and create personalized experiences that match your exact needs.

In this hands-on tutorial, we'll build a fully functional chatbot from scratch using Python and the OpenAI API. You'll learn to manage conversations, control costs with token budgeting, and create custom AI personalities that persist across multiple exchanges. By the end, you'll have both a working chatbot and the foundational skills to build more sophisticated AI-powered applications.

Why Build Your Own Chatbot?

While AI tools like ChatGPT are powerful, building your own chatbot teaches you essential skills for working with AI APIs professionally. You'll understand how conversation memory actually works, learn to manage API costs effectively, and gain the ability to customize AI behavior for specific use cases.

This knowledge translates directly to real-world applications: customer service bots with your company's voice, educational assistants for specific subjects, or personal productivity tools that understand your workflow.

What You'll Learn

By the end of this tutorial, you'll know how to:

  • Connect to the OpenAI API with secure authentication
  • Design custom AI personalities using system prompts
  • Build conversation loops that remember previous exchanges
  • Implement token counting and budget management
  • Structure chatbot code using functions and classes
  • Handle API errors and edge cases gracefully
  • Deploy your chatbot for others to use

Before You Start: Setup Guide

Prerequisites

You'll need to be comfortable with Python fundamentals such as defining variables, functions, loops, and dictionaries. Familiarity with defining your own functions is particularly important. Basic knowledge of APIs is helpful but not required—we'll cover what you need to know.

Environment Setup

First, you'll need a local development environment. We recommend VS Code if you're new to local development, though any Python IDE will work.

Install the required libraries using this command in your terminal:

pip install openai tiktoken

API Key Setup

You have two options for accessing AI models:

Free Option: Sign up for Together AI, which provides $1 in free credits—more than enough for this entire tutorial. Their free model is slower but costs nothing.

Premium Option: Use OpenAI directly. The model we'll use (GPT-4o-mini) is extremely affordable—our entire tutorial costs less than 5 cents during testing.

Critical Security Note: Never hardcode API keys in your scripts. We'll use environment variables to keep them secure.

For Windows users, set your environment variable through Settings > Environment Variables, then restart your computer. Mac and Linux users can set environment variables without rebooting.

Part 1: Your First AI Response

Let's start with the simplest possible chatbot—one that can respond to a single message. This foundation will teach you the core concepts before we add complexity.

Create a new file called chatbot.py and add this code:

import os
from openai import OpenAI

# Load API key securely from environment variables
api_key = os.getenv("OPENAI_API_KEY") or os.getenv("TOGETHER_API_KEY")

# Create the OpenAI client
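# Note: if you go with the Together AI free option, you'll also need to point
# this client at Together's OpenAI-compatible base_url when constructing it.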
client = OpenAI(api_key=api_key)

# Send a message and get a response
response = client.chat.completions.create(
    model="gpt-4o-mini",  # or "meta-llama/Llama-3.3-70B-Instruct-Turbo-Free" for Together
    messages=[
        {"role": "system", "content": "You are a fed up and sassy assistant who hates answering questions."},
        {"role": "user", "content": "What is the weather like today?"}
    ],
    temperature=0.7,
    max_tokens=100
)

# Extract and display the reply
reply = response.choices[0].message.content
print("Assistant:", reply)

Run this script and you'll see something like:

Assistant: Oh fantastic, another weather question! I don't have access to real-time weather data, but here's a wild idea—maybe look outside your window or check a weather app like everyone else does?

Understanding the Code

The magic happens in the messages parameter, which uses three distinct roles:

  • System: Sets the AI's personality and behavior. This is like giving the AI a character briefing that influences every response.
  • User: Represents what you (or your users) type to the chatbot.
  • Assistant: The AI's responses (we'll add these later for conversation memory).

Key Parameters Explained

Temperature controls the AI's “creativity.” Lower values (0-0.3) produce consistent, predictable responses. Higher values (0.7-1.0) generate more creative but potentially unpredictable outputs. We use 0.7 as a good balance.

Max Tokens limits response length and protects your budget. A token corresponds to roughly half a word to a whole word, so 100 tokens allows for substantial responses while preventing runaway costs.

Part 2: Understanding AI Variability

Run your script multiple times and notice how responses differ each time. This happens because AI models use statistical sampling—they don't just pick the "best" word, but randomly select from probable options based on context.

Let's experiment with this by modifying our temperature:

# Try temperature=0 for consistent responses
temperature=0,
max_tokens=100

Run this version multiple times and observe more consistent (though not identical) responses.

Now try temperature=1.0 and see how much more creative and unpredictable the responses become. Higher temperatures often lead to longer responses too, which brings us to an important lesson about cost management.

Learning Insight: During development for a different project, I accidentally spent $20 on a single API call because I forgot to set max_tokens when processing a large file. Always include token limits when experimenting!

Part 3: Refactoring with Functions

As your chatbot becomes more complex, organizing code becomes vital. Let's refactor our script to use functions and global variables.

Modify your chatbot.py code:

import os
from openai import OpenAI

# Configuration variables
api_key = os.getenv("OPENAI_API_KEY") or os.getenv("TOGETHER_API_KEY")
client = OpenAI(api_key=api_key)
MODEL = "gpt-4o-mini"  # or "meta-llama/Llama-3.3-70B-Instruct-Turbo-Free"
TEMPERATURE = 0.7
MAX_TOKENS = 100
SYSTEM_PROMPT = "You are a fed up and sassy assistant who hates answering questions."

def chat(user_input):
    """Send a message to the AI and return the response."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input}
        ],
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS
    )

    reply = response.choices[0].message.content
    return reply

# Test the function
print(chat("How are you doing today?"))

This refactoring makes our code more maintainable and reusable. Global variables let us easily adjust configuration, while the function encapsulates the chat logic for reuse.

Part 4: Adding Conversation Memory

Real chatbots remember previous exchanges. Let's add conversation memory by maintaining a growing list of messages.

Create part3_chat_loop.py:

import os
from openai import OpenAI

# Configuration
api_key = os.getenv("OPENAI_API_KEY") or os.getenv("TOGETHER_API_KEY")
client = OpenAI(api_key=api_key)
MODEL = "gpt-4o-mini"
TEMPERATURE = 0.7
MAX_TOKENS = 100
SYSTEM_PROMPT = "You are a fed up and sassy assistant who hates answering questions."

# Initialize conversation with system prompt
messages = [{"role": "system", "content": SYSTEM_PROMPT}]

def chat(user_input):
    """Add user input to conversation and get AI response."""
    # Add user message to conversation history
    messages.append({"role": "user", "content": user_input})

    # Get AI response using full conversation history
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS
    )

    reply = response.choices[0].message.content

    # Add AI response to conversation history
    messages.append({"role": "assistant", "content": reply})

    return reply

# Interactive chat loop
while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"exit", "quit"}:
        break

    answer = chat(user_input)
    print("Assistant:", answer)

Now run your chatbot and try asking the same question twice:

You: Hi, how are you?
Assistant: Oh fantastic, just living the dream of answering questions I don't care about. What do you want?

You: Hi, how are you?
Assistant: Seriously, again? Look, I'm here to help, not to exchange pleasantries all day. What do you need?

The AI remembers your previous question and responds accordingly—that's conversation memory in action!

How Memory Works

Each time someone sends a message, we append both the user input and AI response to our messages list. The API processes this entire conversation history to generate contextually appropriate responses.

However, this creates a growing problem: longer conversations mean more tokens, which means higher costs.

Part 5: Token Management and Cost Control

As conversations grow, so does the token count—and your bill. Let's add smart token management to prevent runaway costs.

Create part4_final.py:

import os
from openai import OpenAI
import tiktoken

# Configuration
api_key = os.getenv("OPENAI_API_KEY") or os.getenv("TOGETHER_API_KEY")
client = OpenAI(api_key=api_key)
MODEL = "gpt-4o-mini"
TEMPERATURE = 0.7
MAX_TOKENS = 100
TOKEN_BUDGET = 1000  # Maximum tokens to keep in conversation
SYSTEM_PROMPT = "You are a fed up and sassy assistant who hates answering questions."

# Initialize conversation
messages = [{"role": "system", "content": SYSTEM_PROMPT}]

def get_encoding(model):
    """Get the appropriate tokenizer for the model."""
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        print(f"Warning: Tokenizer for model '{model}' not found. Falling back to 'cl100k_base'.")
        return tiktoken.get_encoding("cl100k_base")

ENCODING = get_encoding(MODEL)

def count_tokens(text):
    """Count tokens in a text string."""
    return len(ENCODING.encode(text))

def total_tokens_used(messages):
    """Calculate total tokens used in conversation."""
    try:
        return sum(count_tokens(msg["content"]) for msg in messages)
    except Exception as e:
        print(f"[token count error]: {e}")
        return 0

def enforce_token_budget(messages, budget=TOKEN_BUDGET):
    """Remove old messages if conversation exceeds token budget."""
    try:
        while total_tokens_used(messages) > budget:
            if len(messages) <= 2:  # Keep the system prompt and the most recent message
                break
            messages.pop(1)  # Remove oldest non-system message
    except Exception as e:
        print(f"[token budget error]: {e}")

def chat(user_input):
    """Chat with memory and token management."""
    messages.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS
    )

    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})

    # Prune old messages if over budget
    enforce_token_budget(messages)

    return reply

# Interactive chat with token monitoring
while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"exit", "quit"}:
        break

    answer = chat(user_input)
    print("Assistant:", answer)
    print(f"Current tokens: {total_tokens_used(messages)}")

How Token Management Works

The token management system works in several steps:

  1. Count Tokens: We use tiktoken to count tokens in each message accurately
  2. Monitor Total: Track the total tokens across the entire conversation
  3. Enforce Budget: When we exceed our token budget, automatically remove the oldest messages (but keep the system prompt)

Learning Insight: Different models use different tokenization schemes. The word "dog" might be 1 token in one model but 2 tokens in another. Our encoding functions handle these differences gracefully.

Run your chatbot and have a long conversation. Watch how the token count grows, then notice when it drops as old messages get pruned. The chatbot maintains recent context while staying within budget.

Part 6: Production-Ready Code Structure

For production applications, object-oriented design provides better organization and encapsulation. Here's how to convert our functional code to a class-based approach:

Create oop_chatbot.py:

import os
import tiktoken
from openai import OpenAI

class Chatbot:
    def __init__(self, api_key, model="gpt-4o-mini", temperature=0.7, max_tokens=100,
                 token_budget=1000, system_prompt="You are a helpful assistant."):
        self.client = OpenAI(api_key=api_key)
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.token_budget = token_budget
        self.messages = [{"role": "system", "content": system_prompt}]
        self.encoding = self._get_encoding()

    def _get_encoding(self):
        """Get tokenizer for the model."""
        try:
            return tiktoken.encoding_for_model(self.model)
        except KeyError:
            print(f"Warning: No tokenizer found for model '{self.model}'. Falling back to 'cl100k_base'.")
            return tiktoken.get_encoding("cl100k_base")

    def _count_tokens(self, text):
        """Count tokens in text."""
        return len(self.encoding.encode(text))

    def _total_tokens_used(self):
        """Calculate total tokens in conversation."""
        try:
            return sum(self._count_tokens(msg["content"]) for msg in self.messages)
        except Exception as e:
            print(f"[token count error]: {e}")
            return 0

    def _enforce_token_budget(self):
        """Remove old messages if over budget."""
        try:
            while self._total_tokens_used() > self.token_budget:
                if len(self.messages) <= 2:
                    break
                self.messages.pop(1)
        except Exception as e:
            print(f"[token budget error]: {e}")

    def chat(self, user_input):
        """Send message and get response."""
        self.messages.append({"role": "user", "content": user_input})

        response = self.client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            temperature=self.temperature,
            max_tokens=self.max_tokens
        )

        reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})

        self._enforce_token_budget()
        return reply

    def get_token_count(self):
        """Get current token usage."""
        return self._total_tokens_used()

# Usage example
api_key = os.getenv("OPENAI_API_KEY") or os.getenv("TOGETHER_API_KEY")
if not api_key:
    raise ValueError("No API key found. Set OPENAI_API_KEY or TOGETHER_API_KEY.")

bot = Chatbot(
    api_key=api_key,
    system_prompt="You are a fed up and sassy assistant who hates answering questions."
)

while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"exit", "quit"}:
        break

    response = bot.chat(user_input)
    print("Assistant:", response)
    print("Current tokens used:", bot.get_token_count())

The class-based approach encapsulates all chatbot functionality, makes the code more maintainable, and provides a clean interface for integration into larger applications.

Testing Your Chatbot

Run your completed chatbot and test these scenarios:

  1. Memory Test: Ask a question, then refer back to it later in the conversation
  2. Personality Test: Verify the sassy persona remains consistent across exchanges
  3. Token Management Test: Have a long conversation and watch token counts stabilize
  4. Error Handling Test: Try invalid input to see graceful error handling

Common Issues and Solutions

Environment Variable Problems: If you get authentication errors, verify your API key is set correctly. Windows users may need to restart after setting environment variables.

Token Counting Discrepancies: Different models use different tokenization. Our fallback encoding provides reasonable estimates when exact tokenizers aren't available.

Memory Management: If conversations feel repetitive, your token budget might be too low, causing important context to be pruned too aggressively.

What's Next?

You now have a fully functional chatbot with memory, personality, and cost controls. Here are natural next steps:

Immediate Extensions

  • Web Interface: Deploy using Streamlit or Gradio for a user-friendly interface
  • Multiple Personalities: Create different system prompts for various use cases
  • Conversation Export: Save conversations to JSON files for persistence
  • Usage Analytics: Track token usage and costs over time

Advanced Features

  • Multi-Model Support: Compare responses from different AI models
  • Custom Knowledge: Integrate your own documents or data sources
  • Voice Interface: Add speech-to-text and text-to-speech capabilities
  • User Authentication: Support multiple users with separate conversation histories

Production Considerations

  • Rate Limiting: Handle API rate limits gracefully (see the sketch after this list)
  • Monitoring: Add logging and error tracking
  • Scalability: Design for multiple concurrent users
  • Security: Implement proper input validation and sanitization
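
As a starting point for the rate-limiting item above, a retry wrapper might look something like this minimal sketch; it assumes the v1 openai SDK's RateLimitError and the Chatbot class from Part 6, and the retry counts and delays are arbitrary.

import time

import openai  # the same v1 SDK used throughout this tutorial

def chat_with_retry(bot, user_input, attempts=3, base_delay=2.0):
    """Retry bot.chat() with exponential backoff when the API rate-limits us."""
    for attempt in range(attempts):
        try:
            return bot.chat(user_input)
        except openai.RateLimitError:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))  # wait 2s, 4s, 8s, ...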

Key Takeaways

Building your own chatbot teaches fundamental skills for working with AI APIs professionally. You've learned to manage conversation state, control costs through token budgeting, and structure code for maintainability.

These skills transfer directly to production applications: customer service bots, educational assistants, creative writing tools, and countless other AI-powered applications.

The chatbot you've built represents a solid foundation. With the techniques you've mastered—API integration, memory management, and cost control—you're ready to tackle more sophisticated AI projects and integrate conversational AI into your own applications.

Remember to experiment with different personalities, temperature settings, and token budgets to find what works best for your specific use case. The real power of building your own chatbot lies in this customization capability that you simply can't get from using someone else's AI interface.

Resources and Next Steps

  • Complete Code: All examples are available in the solution notebook
  • Community Support: Join the Dataquest Community to discuss your projects and get help with extensions
  • Related Learning: Explore API integration patterns and advanced Python techniques to build even more sophisticated applications

Start experimenting with your new chatbot, and remember that every conversation is a learning opportunity, both for you and your AI assistant!

More Projects to Try

We have some other project walkthrough tutorials you may also enjoy:
