
Running and Managing Apache Airflow with Docker (Part II)

8 November 2025 at 02:17

In the previous tutorial, we set up Apache Airflow inside Docker, explored its architecture, and built our first real DAG using the TaskFlow API. We simulated an ETL process with two stages — Extract and Transform, demonstrating how Airflow manages dependencies, task retries, and dynamic parallel execution through Dynamic Task Mapping. By the end, we had a functional, scalable workflow capable of processing multiple datasets in parallel, a key building block for modern data pipelines.

In this tutorial, we’ll build on what you created earlier and take a significant step toward production-style orchestration. You’ll complete the ETL lifecycle by adding the Load stage and connecting Airflow to a local MySQL database. This will allow you to load transformed data directly from your pipeline and manage database connections securely using Airflow’s Connections and Environment Variables.

Beyond data loading, you’ll integrate Git and Git Sync into your Airflow environment to enable version control, collaboration, and continuous deployment of DAGs. These practices mirror how data engineering teams manage Airflow projects in real-world settings, promoting consistency, reliability, and scalability, while still keeping the focus on learning and experimentation.

By the end of this part, your Airflow setup will move beyond a simple sandbox and start resembling a production-aligned environment. You’ll have a complete ETL pipeline, from extraction and transformation to loading and automation, and a clear understanding of how professional teams structure and manage their workflows.

Working with MySQL in Airflow

In the previous section, we built a fully functional Airflow pipeline that dynamically extracted and transformed market data from multiple regions: us, europe, asia, and africa. Each branch of our DAG handled its own extract and transform tasks independently, creating separate CSV files for each region under /opt/airflow/tmp. This setup mimics a real-world data engineering workflow where regional datasets are processed in parallel before being stored or analyzed further.

Now that our transformed datasets are generated, the next logical step is to load them into a database, a critical phase in any ETL pipeline. This not only centralizes your processed data but also allows for downstream analysis, reporting, and integration with BI tools like Power BI or Looker.

While production pipelines often write to cloud-managed databases such as Amazon RDS, Google Cloud SQL, or Azure Database for MySQL, we’ll keep things local and simple by using a MySQL instance on your machine. This approach allows you to test and validate your Airflow workflows without relying on external cloud resources or credentials. The same logic, however, can later be applied seamlessly to remote or cloud-hosted databases.

Prerequisite: Install and Set Up MySQL Locally

Before adding the Load step to our DAG, ensure that MySQL is installed and running on your machine.

Install MySQL

  • Windows/macOS: Download and install MySQL Community Server.

  • Linux (Ubuntu):

    sudo apt update
    sudo apt install mysql-server -y
    sudo systemctl start mysql
    
  • Verify installation by running:

mysql -u root -p

Create a Database and User for Airflow

Inside your MySQL terminal or MySQL Workbench, run the following commands:

CREATE DATABASE IF NOT EXISTS airflow_db;
CREATE USER IF NOT EXISTS 'airflow'@'%' IDENTIFIED BY 'airflow';
GRANT ALL PRIVILEGES ON airflow_db.* TO 'airflow'@'%';
FLUSH PRIVILEGES;

This creates a simple local database called airflow_db and a user airflow with full access, perfect for development and testing.
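If you want to confirm that the grant took effect, you can check it from the same MySQL session:

SHOW GRANTS FOR 'airflow'@'%';

You should see a line granting ALL PRIVILEGES on airflow_db.* to the airflow user.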


Network Configuration for Linux Users

When running Airflow in Docker and MySQL locally on Linux, the Airflow containers can’t reach services on your host machine through localhost.

To fix this, you need to make your local machine reachable from inside Docker.

Open your docker-compose.yaml file and add the following under the x-airflow-common definition:

extra_hosts:
  - "host.docker.internal:host-gateway"

This line creates a bridge that allows Airflow containers to communicate with your local MySQL instance using the hostname host.docker.internal.
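If you want to verify the bridge before wiring it into a DAG, a minimal sketch like the one below can be run from inside any Airflow container once your stack is back up (for example, save it in dags/ and run docker compose exec airflow-scheduler python /opt/airflow/dags/check_mysql.py). It assumes the mysql-connector-python package used later in this tutorial is available in the image, for example via _PIP_ADDITIONAL_REQUIREMENTS; the file name check_mysql.py is just a suggestion.

# check_mysql.py: quick connectivity check from inside an Airflow container
import mysql.connector  # requires mysql-connector-python in the Airflow image

conn = mysql.connector.connect(
    host="host.docker.internal",  # resolves to your local machine via extra_hosts
    user="airflow",
    password="airflow",
    database="airflow_db",
    port=3306,
)
print("Connected to MySQL:", conn.is_connected())
conn.close()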

Switching to LocalExecutor

In part one of this tutorial, we ran Airflow with the CeleryExecutor. By default, the Docker Compose file uses CeleryExecutor, which requires additional components such as Redis, Celery workers, and the Flower dashboard for distributed task execution.

Since everything in this tutorial runs on a single machine, we can simplify things by using LocalExecutor, which runs tasks in parallel on that machine and eliminates the need for an external queue or worker system.

Find this line in your docker-compose.yaml:

AIRFLOW__CORE__EXECUTOR: CeleryExecutor 

Change it to:

AIRFLOW__CORE__EXECUTOR: LocalExecutor

Removing Unnecessary Services

Because we’re no longer using Celery, we can safely remove related components from the configuration. These include Redis, airflow-worker, and Flower.

You can search for the following sections and delete them:

  • The entire redis service block.
  • The airflow-worker service (Celery’s worker).
  • The flower service (Celery monitoring dashboard).
  • Any AIRFLOW__CELERY__... lines inside environment blocks.
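After deleting those blocks, a quick way to double-check that nothing Celery-related remains is to search the file:

grep -n -i -E "celery|redis|flower" docker-compose.yaml

If this prints nothing, the cleanup is complete.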

Extending the DAG with a Load Step

Now let’s extend our existing DAG to include the Load phase of the ETL process. We already created extract_market_data() and transform_market_data() in the first part of this tutorial. The new task will read each transformed CSV file and insert its data into a MySQL table.

Here’s our updated daily_etl_pipeline_airflow3 DAG with the new load_to_mysql() task. You can also find the complete version of this DAG in the cloned repository (git@github.com:dataquestio/tutorials.git), inside the part-two/ folder under airflow-docker-tutorial.

def daily_etl_pipeline():

    @task
    def extract_market_data(market: str):
        ...

    @task
    def transform_market_data(raw_file: str):
        ...

    @task
    def load_to_mysql(transformed_file: str):
        """Load the transformed CSV data into a MySQL table."""
        import mysql.connector
        import os

        db_config = {
            "host": "host.docker.internal",  # enables Docker-to-local communication
            "user": "airflow",
            "password": "airflow",
            "database": "airflow_db",
            "port": 3306
        }

        df = pd.read_csv(transformed_file)

        # Derive the table name dynamically based on region
        table_name = f"transformed_market_data_{os.path.basename(transformed_file).split('_')[-1].replace('.csv', '')}"

        conn = mysql.connector.connect(**db_config)
        cursor = conn.cursor()

        # Create table if it doesn’t exist
        cursor.execute(f"""
            CREATE TABLE IF NOT EXISTS {table_name} (
                timestamp VARCHAR(50),
                market VARCHAR(50),
                company VARCHAR(255),
                price_usd DECIMAL(10, 2),             -- precision matters: a bare DECIMAL would round to whole numbers
                daily_change_percent DECIMAL(6, 2)
            );
        """)

        # Insert records
        for _, row in df.iterrows():
            cursor.execute(
                f"""
                INSERT INTO {table_name} (timestamp, market, company, price_usd, daily_change_percent)
                VALUES (%s, %s, %s, %s, %s)
                """,
                tuple(row)
            )

        conn.commit()
        conn.close()
        print(f"[LOAD] Data successfully loaded into MySQL table: {table_name}")

    # Define markets to process dynamically
    markets = ["us", "europe", "asia", "africa"]

    # Dynamically create and link tasks
    raw_files = extract_market_data.expand(market=markets)
    transformed_files = transform_market_data.expand(raw_file=raw_files)
    load_to_mysql.expand(transformed_file=transformed_files)

dag = daily_etl_pipeline()

When you trigger this DAG, Airflow will automatically create three sequential tasks for each defined region (us, europe, asia, africa): first extracting market data, then transforming it, and finally loading it into a region-specific MySQL table.

Create a Database and User for Airflow (2)

Each branch runs independently, so by the end of a successful run, your local MySQL database (airflow_db) will contain four separate tables, one for each region:

transformed_market_data_us
transformed_market_data_europe
transformed_market_data_asia
transformed_market_data_africa

Each table contains the cleaned and sorted dataset for its region, including company names, prices, and daily percentage changes.

Once your containers are running, open MySQL (via terminal or MySQL Workbench) and run:

SHOW TABLES;

Create a Database and User for Airflow (3)

You should see all four tables listed. Then, to inspect one of them, for example us, run:

SELECT * FROM transformed_market_data_us;

Create a Database and User for Airflow (4)

From above, we can see the dataset that Airflow extracted, transformed, and loaded for the U.S. market, confirming your pipeline has now completed all three stages of ETL: Extract → Transform → Load.

This integration demonstrates Airflow’s ability to manage data flow across multiple sources and databases seamlessly, a key capability in modern data engineering pipelines.


Previewing the Loaded Data in Airflow

By now, you’ve confirmed that your transformed datasets are successfully loaded into MySQL; you can view them directly in MySQL Workbench or through a SQL client. But Airflow also provides a convenient way to query and preview this data right from the UI, using Connections and the SQLExecuteQueryOperator.

Connections in Airflow store the credentials and parameters needed to connect to external systems such as databases, APIs, or cloud services. Instead of hardcoding passwords or host details in your DAGs, you define a connection once in the Web UI and reference it securely using its conn_id.

To set this up:

  1. Open the Airflow Web UI
  2. Navigate to Admin → Connections → + Add a new record
  3. Fill in the following details:
     • Conn Id: local_mysql
     • Conn Type: MySQL
     • Host: host.docker.internal
     • Schema: airflow_db
     • Login: airflow
     • Password: airflow
     • Port: 3306

Note: These values must match the credentials you defined earlier when setting up your local MySQL instance.

Specifically, the database airflow_db, user airflow, and password airflow should already exist in your MySQL setup.

The host.docker.internal value ensures that your Airflow containers can communicate with MySQL running on your local machine.

  • Also note that when you use docker compose down -v, all volumes, including your Airflow connections, will be deleted. Always remember to re-add the connection afterward.

If your changes are not volume-related, you can safely shut down the containers using docker compose down (without -v), which preserves your existing connections and data.

Click Save to register the connection.

Now, Airflow knows how to connect to your MySQL database whenever a task specifies conn_id="local_mysql".
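If you’d rather not re-create this connection by hand after a docker compose down -v, Airflow can also read connections from environment variables named AIRFLOW_CONN_<CONN_ID>, written as a connection URI. A minimal sketch, added under the environment: block of x-airflow-common in docker-compose.yaml (the URI simply mirrors the values above):

AIRFLOW_CONN_LOCAL_MYSQL: 'mysql://airflow:airflow@host.docker.internal:3306/airflow_db'

Connections defined this way are resolved at runtime and won’t show up in the Admin → Connections list, but any task that references conn_id="local_mysql" will pick them up just the same.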

Let’s create a simple SQL query task to preview the data we just loaded.


    @task
    def extract_market_data(market: str):
        ...

    @task
    def transform_market_data(raw_file: str):
        ...

    @task
    def load_to_mysql(transformed_file: str):
        ...

    from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

    # A classic operator task that previews the loaded data via the stored connection
    preview_mysql = SQLExecuteQueryOperator(
        task_id="preview_mysql_table",
        conn_id="local_mysql",
        sql="SELECT * FROM transformed_market_data_us LIMIT 5;",
        do_xcom_push=True,  # makes query results viewable in Airflow’s XCom tab
    )

    # Define markets to process dynamically
    markets = ["us", "europe", "asia", "africa"]

    # Dynamically create and link tasks
    raw_files = extract_market_data.expand(market=markets)
    transformed_files = transform_market_data.expand(raw_file=raw_files)
    load_to_mysql.expand(transformed_file=transformed_files)

dag = daily_etl_pipeline()

Next, link this task to your DAG so that it runs after the loading process by updating the line load_to_mysql.expand(transformed_file=transformed_files) to this:

    load_to_mysql.expand(transformed_file=transformed_files) >> preview_mysql

When you trigger the DAG again (remember to shut down the containers with docker compose down before making changes to your DAGs, and then, once saved, bring them back up with docker compose up -d), Airflow will:

  1. Connect to your MySQL database using the stored connection credentials.
  2. Run the SQL query on the specified table.
  3. Display the first few rows of your data as a JSON result in the XCom view.

To see it:

  • Go to Grid View
  • Click on the preview_mysql_table task
  • Choose XCom from the top menu

Previewing the Loaded Data in Airflow (5)

You’ll see your data represented in JSON format, confirming that the integration works: Airflow not only orchestrates the workflow but can also interactively query and visualize your results without leaving the platform.

This makes it easy to verify that your ETL pipeline is functioning correctly end-to-end: extraction, transformation, loading, and now validation, all visible and traceable inside Airflow.

Git-Based DAG Management and CI/CD for Deployment (with git-sync)

At this stage, your local Airflow environment is complete: you’ve built a fully functional ETL pipeline that extracts, transforms, and loads regional market data into MySQL, and even validated results directly from the Airflow UI.

Now it’s time to take the final step toward production readiness: managing your DAGs the way data engineering teams do in real-world systems, through Git-based deployment and continuous integration.

We’ll push our DAGs to a shared GitHub repository called airflow_dags, and connect Airflow to it using the git-sync container, which automatically keeps your DAGs in sync. This allows every team member (or environment) to pull from the same source, the Git repo, without manually copying files into containers.

Why Manage DAGs with Git

Every DAG is just a Python file, and like all code, it deserves version control. Storing DAGs in a Git repository brings the same advantages that software engineers rely on:

  • Versioning: track every change and roll back safely.
  • Collaboration: multiple developers can work on different workflows without conflict.
  • Reproducibility: every environment can pull identical DAGs from a single source.
  • Automation: changes sync automatically, eliminating manual uploads.

This structure makes Airflow easier to maintain and scales naturally as your pipelines grow in number and complexity.

Pushing Your DAGs to GitHub

To begin, create a public or private repository named airflow_dags (e.g., https://github.com/<your-username>/airflow_dags).

Then, in your project root (airflow-docker), initialize Git and push your local dags/ directory:

git init
git remote add origin https://github.com/<your-username>/airflow_dags.git
git add dags/
git commit -m "Add Airflow ETL pipeline DAGs"
git branch -M main
git push -u origin main

Once complete, your DAGs live safely in GitHub, ready for syncing.

How git-sync Works

git-sync is a lightweight sidecar container that continuously clones and updates a Git repository into a shared volume.

Once running, it:

  • Clones your repository (e.g., https://github.com/<your-username>/airflow_dags.git),
  • Pulls updates every 30 seconds by default,
  • Exposes the latest DAGs to Airflow automatically, no rebuilds or restarts required.

This is how Airflow stays up to date with your Git repo in real time.

Setting Up git-sync in Docker Compose

In your existing docker-compose.yaml, you already have a list of services that define your Airflow environment, like the api-server, scheduler, triggerer, and dag-processor. Each of these runs in its own container but works together as part of the same orchestration system.

The git-sync container will become another service in this list, just like those, but with a very specific purpose:

  • to keep your /dags folder continuously synchronized with your remote GitHub repository.

Instead of copying Python DAG files manually or rebuilding containers every time you make a change, the git-sync service will automatically pull updates from your GitHub repo (in our case, airflow_dags) into a shared volume that all Airflow services can read from.

This ensures that your environment always runs the latest DAGs from GitHub ,without downtime, restarts, or manual synchronization.

Remember in our docker-compose.yaml file, we had this kind of setup:

Setting Up Git in Docker Compose

Now, we’ll extend that structure by introducing git-sync as an additional service within the same services: section, and by adding a new entry to the volumes: section (alongside postgres-db-volume:, we also add airflow-dags-volume: so the same DAG storage is shared across all containers).

Below is a configuration that works seamlessly with Docker on any OS:

services:
  git-sync:
    image: registry.k8s.io/git-sync/git-sync:v4.1.0
    user: "0:0"    # run as root so it can create /dags/git-sync
    restart: always
    environment:
      GITSYNC_REPO: "https://github.com/<your-username>/airflow_dags.git"
      GITSYNC_BRANCH: "main"           # use BRANCH not REF
      GITSYNC_PERIOD: "30s"
      GITSYNC_DEPTH: "1"
      GITSYNC_ROOT: "/dags/git-sync"
      GITSYNC_DEST: "repo"
      GITSYNC_LINK: "current"
      GITSYNC_ONE_TIME: "false"
      GITSYNC_ADD_USER: "true"
      GITSYNC_CHANGE_PERMISSIONS: "1"
      GITSYNC_STALE_WORKTREE_TIMEOUT: "24h"
    volumes:
      - airflow-dags-volume:/dags
    healthcheck:
      test: ["CMD-SHELL", "test -L /dags/git-sync/current && test -d /dags/git-sync/current/dags && [ \"$(ls -A /dags/git-sync/current/dags 2>/dev/null)\" ]"]
      interval: 10s
      timeout: 3s
      retries: 10
      start_period: 10s

volumes:
  airflow-dags-volume:

In this setup, the git-sync service runs as a lightweight companion container that keeps your Airflow DAGs in sync with your GitHub repository.

The GITSYNC_REPO variable tells it where to pull code from, in this case, your DAG repository (airflow_dags). Make sure you replace <your-username> with your exact GitHub username. The GITSYNC_BRANCH specifies which branch to track, usually main, while GITSYNC_PERIOD defines how often to check for updates. Here, it’s set to every 30 seconds, meaning Airflow will always be within half a minute of your latest Git push.

The synchronization happens inside the directory defined by GITSYNC_ROOT, which becomes /dags/git-sync inside the container. Inside that root, GITSYNC_DEST defines where the repo is cloned (as repo), and GITSYNC_LINK creates a symbolic link called current pointing to the active clone.

This design allows Airflow to always reference a stable, predictable path (/dags/git-sync/current/dags) even as the repository updates in the background, no path changes, no reloads.

A few environment flags ensure stability and portability across systems. For instance, GITSYNC_ADD_USER and GITSYNC_CHANGE_PERMISSIONS make sure the synced files are accessible to Airflow even when permissions differ across Docker environments.

GITSYNC_DEPTH limits the clone to just the latest commit (keeping it lightweight), while GITSYNC_STALE_WORKTREE_TIMEOUT helps clean up old syncs if something goes wrong.

The shared volume, airflow-dags-volume, acts as the bridge between git-sync and Airflow, storing all synced DAGs in one central location accessible by both containers.

Finally, the healthcheck section ensures that Airflow doesn’t start until git-sync has successfully cloned your repository. It runs a small shell command that checks three things: whether the symbolic link /dags/git-sync/current exists, whether the dags directory is present inside it, and whether that directory actually contains files. Only when all these conditions pass does Docker mark the git-sync service as healthy. The interval and retry parameters control how often and how long these checks run, ensuring that Airflow’s scheduler, webserver, and other components wait patiently until the DAGs are fully available. This simple step prevents race conditions and guarantees a smooth startup every time.

Together, these settings ensure that your Airflow instance always runs the latest DAGs from GitHub, automatically, securely, and without manual file transfers.

Generally, this configuration does the following:

  • Creates a shared volume (airflow-dags-volume) where the DAGs are cloned.
  • Mounts it into both git-sync and Airflow services.
  • Runs git-sync as root to fix permission issues on Windows.
  • Keeps DAGs up to date every 30 seconds.

Adjusting the Airflow Configuration

We’ve now added git-sync as part of our Airflow services, sitting right alongside the api-server, scheduler, triggerer, and dag-processor.

This new service continuously pulls our DAGs from GitHub and stores them inside a shared volume (airflow-dags-volume) that both git-sync and Airflow can access.

However, our Airflow setup still expects to find DAGs through local directory mounts defined under each service (via x-airflow-common), not global named volumes. The default configuration maps these paths as follows:

volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
    - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
    - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
    - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins

This setup points Airflow to the local dags/ folder in your host machine, but now that we have git-sync, our DAGs will live inside a synchronized Git directory instead.

So we need to update the DAG volume mapping to pull from the new shared Git volume instead of the local one.

Replace the first line (- ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags) under the volumes: section with - airflow-dags-volume:/opt/airflow/dags.

This tells Docker to mount the shared airflow-dags-volume (created by git-sync) into Airflow’s /opt/airflow/dags directory.

That way, any DAGs pulled by git-sync from your GitHub repository will immediately appear inside Airflow’s working environment, without needing to rebuild or copy files.
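After that swap, the volumes list under x-airflow-common should look roughly like this (the remaining bind mounts stay unchanged):

volumes:
    - airflow-dags-volume:/opt/airflow/dags
    - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
    - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
    - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins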

We also need to explicitly tell Airflow where the synced DAGs live.

In the environment section of your x-airflow-common block, add the following:

AIRFLOW__CORE__DAGS_FOLDER: /opt/airflow/dags/git-sync/current/dags

This line links Airflow directly to the directory created by the git-sync container.

Here’s how it connects:

  • Inside the git-sync configuration, we defined:

    GITSYNC_ROOT: "/dags/git-sync"
    GITSYNC_LINK: "current"

    Together, these ensure that the most recent repository clone is always available under /dags/git-sync/current.

  • When we mount airflow-dags-volume:/opt/airflow/dags, this path becomes accessible inside the Airflow containers as

    /opt/airflow/dags/git-sync/current/dags.

By setting AIRFLOW__CORE__DAGS_FOLDER to that exact path, Airflow automatically watches the live Git-synced DAG directory for changes, meaning every new commit to your GitHub repo will reflect instantly in the Airflow UI.
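Put together, the Airflow containers end up seeing roughly this layout (the DAG file name from Part One is shown purely for illustration):

/opt/airflow/dags/              <- airflow-dags-volume mounted into the Airflow services
└── git-sync/
    └── current/                <- symlink maintained by git-sync, always the latest sync
        └── dags/               <- the dags/ folder from your airflow_dags repo
            └── our_first_dag.py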

Finally, ensure that Airflow waits for git-sync to finish cloning before starting up.

In the depends_on section of each Airflow service (airflow-scheduler, airflow-apiserver, dag-processor, and triggerer), add:

depends_on:
  git-sync:
    condition: service_healthy

This guarantees that Airflow only starts once the git-sync container has successfully pulled your repository, preventing race conditions during startup.

Once complete, Airflow will read its DAGs directly from the synchronized Git directory, /opt/airflow/dags/git-sync/current/dags, instead of your local project folder.

This change transforms your setup into a live, Git-driven workflow, where Airflow continuously tracks and loads the latest DAGs from GitHub automatically.
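From here on, the day-to-day loop for shipping a DAG change looks something like this (the file name and commit message are just examples):

# edit dags/our_first_dag.py locally, then:
git add dags/our_first_dag.py
git commit -m "Adjust retry settings for the ETL DAG"
git push origin main
# within about 30 seconds (GITSYNC_PERIOD) the updated DAG appears in the Airflow UI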

Automating Validation with GitHub Actions

Our Git integration wouldn’t be truly powerful without CI/CD (Continuous Integration and Continuous Deployment).

While git-sync ensures that any change pushed to GitHub automatically reflects in Airflow, that alone can be risky: not every change should make it to production immediately.

Imagine pushing a DAG with a missing import, a syntax error, or a bad dependency.

Airflow might fail to parse it, causing your scheduler or api-server to crash or restart repeatedly. That’s why we need a safety net, a way to automatically check that every DAG in our repository is valid before it ever reaches Airflow.

This is exactly where GitHub Actions comes in.

We can set up a lightweight CI pipeline that validates all DAGs whenever someone pushes to the main branch. If a broken DAG is detected, the pipeline fails, preventing the merge and protecting your Airflow environment from unverified code.

GitHub also provides notifications directly in your repository interface, showing failed workflows and highlighting the cause of the issue.

Inside your airflow_dags repository, create a GitHub Actions workflow file at:

.github/workflows/validate-dags.yml

name: Validate Airflow DAGs

on:
  push:
    branches: [ main ]
    paths:
      - 'dags/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install Airflow
        run: pip install apache-airflow==3.1.1

      - name: Validate DAGs
        run: |
          echo "Validating DAG syntax..."
          airflow dags list || exit 1

This simple workflow runs automatically every time you push a commit to the main branch that modifies anything in the dags/ directory.

It installs Apache Airflow in a lightweight test environment, loads all your DAGs, and checks that they parse successfully: no missing imports, syntax issues, or circular dependencies.

If even one DAG fails to load, the validation job will exit with an error, causing the GitHub Actions pipeline to fail.

GitHub then immediately notifies you (and any collaborators) through the repository’s Actions tab, issue alerts, and optional email notifications.
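If you want a stricter check than airflow dags list, one option is a small helper script that uses Airflow’s DagBag to surface per-file import errors. A minimal sketch (the file name validate_dags.py is just a suggestion, and any libraries your DAGs import at module level, such as pandas, must also be installed in the CI environment):

# validate_dags.py: fail the build if any DAG file cannot be imported
import sys

from airflow.models import DagBag

# Parse everything under dags/ without loading Airflow's bundled examples
bag = DagBag(dag_folder="dags", include_examples=False)

if bag.import_errors:
    for path, error in bag.import_errors.items():
        print(f"FAILED to import {path}:\n{error}")
    sys.exit(1)

print(f"OK: {len(bag.dags)} DAG(s) parsed successfully")

You could then call python validate_dags.py in the Validate DAGs step in place of (or alongside) the airflow dags list command.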

By doing this, you’re adding a crucial layer of protection to your workflow:

  • Pre-deployment safety: invalid DAGs never reach your running Airflow instance.
  • Automatic feedback: failed DAGs trigger GitHub notifications, allowing you to fix errors early.
  • Confidence in deployment: once the pipeline passes, you know every DAG is production-ready.

Together, this CI validation and your git-sync setup create a self-updating, automated Airflow environment that mirrors production deployment practices.

With this final step, your Airflow environment becomes a versioned, automated, and production-ready orchestration system, capable of handling real data pipelines the way modern engineering teams do.

You’ve now completed a full transformation:

from local DAG development to automated, Git-driven deployment, all within Docker, all powered by Apache Airflow.

  • Note that, both the git-sync service and the Airflow UI depend on your Docker containers running. As long as your containers are up, git-sync remains active, continuously checking for updates in your GitHub repository and syncing any new DAGs to your Airflow environment.

    Once you stop or shut down the containers (docker compose down), this synchronization pauses. You also won’t be able to access the Airflow Web UI or trigger DAGs until the containers are started again.

    When you restart with docker compose up -d, everything, including git-sync, resumes automatically, picking up the latest changes from GitHub and restoring your full Airflow setup just as you left it.

Summary and Up Next

In this tutorial, you completed the ETL lifecycle in Apache Airflow by adding the Load phase to your workflow and connecting it to a local MySQL database. You learned how Airflow securely manages external connections, dynamically handles multiple data regions, and enables in-UI data previews through XCom and Connections.

You also took your setup a step closer to production by integrating Git-based DAG management with git-sync, and implementing GitHub Actions CI to validate DAGs automatically before deployment.

Together, these changes transformed your environment into a version-controlled, automated orchestration system that mirrors the structure of production-grade setups, a final step before deploying to the cloud.

In the next tutorial, you’ll move beyond simulated data and build a real-world data pipeline, extracting data from an API, transforming it with Python, and loading it into MySQL. You’ll also add retries, alerts, and monitoring, and deploy the full workflow through CI/CD, achieving a truly end-to-end, production-grade Airflow setup.

Running and Managing Apache Airflow with Docker (Part I)

7 November 2025 at 22:50

In the last tutorial, we explored what workflow orchestration is, why it matters, and how Apache Airflow structures, automates, and monitors complex data pipelines through DAGs, tasks, and the scheduler. We examined how orchestration transforms scattered scripts into a coordinated system that ensures reliability, observability, and scalability across modern data workflows.

In this two-part hands-on tutorial, we move from theory to practice. You’ll run Apache Airflow inside Docker, the most efficient and production-like way to deploy Airflow for development and testing. This containerized approach mirrors how Airflow operates in real-world environments, from on-premises teams to managed services like AWS ECS and Cloud Composer.

In Part One, our focus goes beyond setup. You’ll learn how to work effectively with DAGs inside a Dockerized Airflow environment, writing, testing, visualizing, and managing them through the Web UI. You’ll use the TaskFlow API to build clean, Pythonic workflows and implement dynamic task mapping to run multiple processes in parallel. By the end of this part, you’ll have a fully functional Airflow environment running in Docker and a working DAG that extracts and transforms data automatically, the foundation of modern data engineering pipelines.

In Part Two, we’ll extend that foundation to handle data management and automation workflows. You’ll connect Airflow to a local MySQL database for data loading, manage credentials securely through the Admin panel and environment variables, and integrate Git with Git Sync to enable version control and continuous deployment. You’ll also see how CI/CD pipelines can automate DAG validation and deployment, ensuring your Airflow environment remains consistent and collaborative across development teams.

By the end of the series, you’ll not only understand how Airflow runs inside Docker but also how to design, orchestrate, and manage production-grade data pipelines the way data engineers do in real-world systems.

Why Use Docker for Airflow

While Airflow can be installed locally with pip install apache-airflow, this approach often leads to dependency conflicts, version mismatches, and complicated setups. Airflow depends on multiple services, an API server, scheduler, triggerer, metadata database, and dag-processors, all of which must communicate correctly. Installing and maintaining these manually on your local machine can be tedious and error-prone.

Docker eliminates these issues by packaging everything into lightweight, isolated containers. Each container runs a single Airflow component, but all work together seamlessly through Docker Compose. The result is a clean, reproducible environment that behaves consistently across operating systems.

In short:

  • Local installation: works for testing but often breaks due to dependency conflicts or version mismatches.
  • Cloud-managed services (like AWS ECS or Cloud Composer): excellent for production but less flexible for learning or prototyping.
  • Docker setup: combines realism with simplicity, providing the same multi-service environment used in production without the overhead of manual configuration.

Docker setup is ideal for learning and development and closely mirrors production environments, but additional configuration is needed for a full production deployment.

Prerequisites

Before you begin, ensure the following are installed and ready on your system:

  1. Docker Desktop – Required to build and run Airflow containers.
  2. A code editor – Visual Studio Code or similar, for writing and editing DAGs.
  3. Python 3.10 or higher – Used for authoring Airflow DAGs and helper scripts.

Running Airflow Using Docker

Now that your environment is ready (Docker is open and running), let’s get Airflow running using Docker Compose.

This tool orchestrates all Airflow services (api-server, scheduler, triggerer, database, and workers) so they start and communicate properly.

Clone the Tutorial Repository

We’ve already prepared the starter files you’ll need for this tutorial on GitHub.

Begin by cloning the repository:

git clone git@github.com:dataquestio/tutorials.git

Then navigate to the Airflow tutorial folder:

cd tutorials/airflow-docker-tutorial

This is the directory where you’ll be working throughout the tutorial.

Inside, you’ll notice a structure similar to this:

airflow-docker-tutorial/
├── part-one/  
├── part-two/
├── docker-compose.yaml
└── README.md
  • The part-one/ and part-two/ folders contain the complete reference files for both tutorials (Part One and Part Two).

    You don’t need to modify anything there; it’s only for comparison or review.

  • The docker-compose.yaml file is your starting point and will evolve as the tutorial progresses.

Explore the Docker Compose File

Open the docker-compose.yaml file in your code editor.

This file defines all the Airflow components and how they interact inside Docker.

It includes:

  • api-server – Airflow’s web user interface
  • Scheduler – Parses and triggers DAGs
  • Triggerer – Manages deferrable tasks efficiently
  • Metadata database – Tracks DAG runs and task states
  • Executors – Execute tasks

Each of these services runs in its own container, but together they form a single working Airflow environment.

You’ll be updating this file as you move through the tutorial to configure, extend, and manage your Airflow setup.

Create Required Folders

Airflow expects certain directories to exist before launching.

Create them inside the same directory as your docker-compose.yaml file:

mkdir -p ./dags ./logs ./plugins ./config
  • dags/ – your workflow scripts
  • logs/ – task execution logs
  • plugins/ – custom hooks and operators
  • config/ – optional configuration overrides (this will be auto-populated later when initializing the database)

Configure User Permissions

If you’re using Linux, set a user ID to prevent permission issues when Docker writes files locally:

echo -e "AIRFLOW_UID=$(id -u)" > .env

If you’re using macOS or Windows, manually create a .env file in the same directory with the following content:

AIRFLOW_UID=50000

This ensures consistent file ownership between your host system and the Docker containers.

Initialize the Metadata Database

Airflow keeps track of DAG runs, task states, and configurations in a metadata database.

Initialize it by running:

docker compose up airflow-init

Once initialization completes, you’ll see a message confirming that an admin user has been created with default credentials:

  • Username: airflow
  • Password: airflow

Start Airflow

Now start all Airflow services in the background:

docker compose up -d

Docker Compose will spin up the scheduler, API server, triggerer, database, and worker containers.


Start Airflow

Make sure the triggerer, dag-processor, scheduler, and api-server are shown as started as above. If that is not the case, rebuild the Docker container, since the build process might have been interrupted. Otherwise, navigate to http://localhost:8080 to access the Airflow UI exposed by the api-server.

You can also access this through your Docker app, by navigating to containers:

Start Airflow (2)

Log in using the credentials above to access the Airflow Web UI.

  • If the UI fails to load or some containers keep restarting, increase Docker’s memory allocation to at least 4 GB (8 GB recommended) in Docker Desktop → Settings → Resources.

Configuring the Airflow Project

Once Airflow is running and you visit http://localhost:8080, you’ll see the Airflow Web UI.

Configuring the Airflow Project

This is the command center for your workflows, where you can visualize DAGs, monitor task runs, and manage system configurations. When you navigate to Dags, you’ll see a dashboard that lists several example DAGs provided by the Airflow team. These are sample workflows meant to demonstrate different operators, sensors, and features.

However, for this tutorial, we’ll build our own clean environment, so we’ll remove these example DAGs and customize our setup to suit our project.

Before doing that, though, it’s important to understand the docker-compose.yaml file, since this is where your Airflow environment is actually defined.

Understanding the docker-compose.yaml File

The docker-compose.yaml file tells Docker how to build, connect, and run all the Airflow components as containers.

If you open it, you’ll see multiple sections that look like this:

Understanding the Docker Compose File

Let’s break this down briefly:

  • x-airflow-common – This is the shared configuration block that all Airflow containers inherit from. It defines the base Docker image (apache/airflow:3.1.0), key environment variables, and mounted volumes for DAGs, logs, and plugins. It also specifies user permissions to ensure that files created inside the containers are accessible from your host machine. The depends_on lists dependencies such as the PostgreSQL database used to store Airflow metadata. In short, this section sets up the common foundation for every container in your environment.
  • services – This section defines the actual Airflow components that make up your environment. Each service, such as the api-server, scheduler, triggerer, dag-processor, and metadata database, runs as a separate container but uses the shared configuration from x-airflow-common. Together, they form a complete Airflow deployment where each container plays a specific role.
  • volumes – This section sets up persistent storage for containers. Airflow uses it by default for the Postgres database, keeping your metadata (DAG runs, task states, and configurations) saved across runs. In part 2, we’ll expand it to include Git integration.

Each of these sections works together to create a unified Airflow environment that’s easy to configure, extend, or simplify as needed.

Understanding these parts now will make the next steps - cleaning, customizing, and managing your Airflow setup - much clearer.

Resetting the Environment Before Making Changes

Before editing anything inside the docker-compose.yaml, it’s crucial to shut down your containers cleanly to avoid conflicts.

Run: docker compose down -v

Here’s what this does:

  • docker compose down stops and removes all containers.
  • The -v flag removes volumes, which clears stored metadata, logs, and configurations.

    This ensures that you start with a completely fresh environment the next time you launch Airflow — which can be helpful when your environment becomes misconfigured or broken. However, you shouldn’t do this routinely after every DAG or configuration change, as it will also remove your saved Connections, Variables, and other stateful data. In most cases, you can simply run docker compose down instead to stop the containers without wiping the environment.

Disabling Example DAGs

By default, Airflow loads several example DAGs to help new users explore its features. For our purposes, we want a clean workspace that only shows our own DAGs.

  1. Open the docker-compose.yaml file in your code editor.
  2. Locate the environment section under x-airflow-common and find this line: AIRFLOW__CORE__LOAD_EXAMPLES: 'true'. Change 'true' to 'false': AIRFLOW__CORE__LOAD_EXAMPLES: 'false'

This setting tells Airflow not to load any of the example workflows when it starts.

Once you’ve made the changes:

  1. Save your docker-compose.yaml file.
  2. Rebuild and start your Airflow environment again: docker compose up -d
  3. Wait a few moments, then visit http://localhost:8080 again.

This time, when you log in, you’ll notice the example DAGs are gone, leaving you with a clean workspace ready for your own workflows.

Disabling Example DAGs

Let’s now build our first DAG.

Working with DAGs in Airflow

Now that your Airflow environment is clean and running, it’s time to create our first real workflow.

This is where you begin writing DAGs (Directed Acyclic Graphs), which sit at the very heart of how Airflow operates.

A DAG is more than just a piece of code; it’s a visual and logical representation of your workflow, showing how tasks connect, when they run, and in what order.

Each task in a DAG represents a distinct step in your process, such as pulling data, cleaning it, transforming it, or loading it into a database. In this tutorial, we will create tasks that extract and transform data. We will see the loading process in part two, along with how Airflow integrates with Git.

Airflow ensures these tasks execute in the correct order without looping back on themselves (that’s what acyclic means).

Setting Up Your DAG File

Let’s start by creating the foundation of our workflow (make sure to shut down any running containers first using docker compose down -v).

Open your airflow-docker project folder and, inside the dags/ directory, create a new file named:

our_first_dag.py

Every .py file you place in this folder becomes a workflow that Airflow can recognize and manage automatically.

You don’t need to manually register anything; Airflow continuously scans this directory and loads any valid DAGs it finds.

At the top of our file, let’s import the core libraries we need for our project:

from airflow.decorators import dag, task
from datetime import datetime, timedelta
import pandas as pd
import random
import os

Let’s pause to understand what each of these imports does and why they matter:

  • dag and task come from Airflow’s TaskFlow API.

    These decorators turn plain Python functions into Airflow-managed tasks, giving you cleaner, more intuitive code while Airflow handles orchestration behind the scenes.

  • datetime and timedelta handle scheduling logic.

    They help define when your DAG starts and how frequently it runs.

  • pandas, random, and os are standard Python libraries we’ll use to simulate a simple ETL process, generating, transforming, and saving data locally.

This setup might seem minimal, but it’s everything you need to start orchestrating real tasks.

Defining the DAG Structure

With our imports ready, the next step is to define the skeleton of our DAG, its blueprint.

Think of this as defining when and how your workflow runs.

default_args = {
    "owner": "Your name",
    "retries": 3,
    "retry_delay": timedelta(minutes=1),
}

@dag(
    dag_id="daily_etl_pipeline_airflow3",
    description="ETL workflow demonstrating dynamic task mapping and assets",
    schedule="@daily",
    start_date=datetime(2025, 10, 29),
    catchup=False,
    default_args=default_args,
    tags=["airflow3", "etl"],
)
def daily_etl_pipeline():
    ...

dag = daily_etl_pipeline()

Let’s break this down carefully:

  • default_args

    This dictionary defines shared settings for all tasks in your DAG.

    Here, each task will automatically retry up to three times with a one-minute delay between attempts, a good practice when your tasks depend on external systems like APIs or databases that can occasionally fail.

  • The @dag decorator

    This tells Airflow that everything inside the daily_etl_pipeline() function (you can name it anything you like) belongs to one cohesive workflow.

    It defines:

    • schedule="@daily" → when the DAG should run.
    • start_date → the first execution date.
    • catchup=False → prevents Airflow from running past-due DAGs automatically.
    • tags → helps you categorize DAGs in the UI.
  • The daily_etl_pipeline() function

    This is the container for your workflow logic, it’s where you’ll later define your tasks and how they depend on one another.

    Think of it as the “script” that describes what happens in each run of your DAG.

  • dag = daily_etl_pipeline()

    This single line instantiates the DAG. It’s what makes your workflow visible and schedulable inside Airflow.

This structure acts as the foundation for everything that follows.

If we think of a DAG as a movie script, this section defines the production schedule and stage setup before the actors (tasks) appear.

Creating Tasks with the TaskFlow API

Now it’s time to define the stars of our workflow, the tasks.

Tasks are the actual units of work that Airflow runs. Each one performs a specific action, and together they form your complete data pipeline.

Airflow’s TaskFlow API makes this remarkably easy: you simply decorate ordinary Python functions with @task, and Airflow takes care of converting them into fully managed, trackable workflow steps.

We’ll start with two tasks:

  • Extract → simulates pulling or generating data.
  • Transform → processes and cleans the extracted data.

(We’ll add the Load step in the next part of this tutorial.)

Extract Task — Generating Fake Data

@task
def extract_market_data():
    """
    Simulate extracting market data for popular companies.
    This task mimics pulling live stock prices or API data.
    """
    companies = ["Apple", "Amazon", "Google", "Microsoft", "Tesla", "Netflix", "NVIDIA", "Meta"]

    # Simulate today's timestamped price data
    records = []
    for company in companies:
        price = round(random.uniform(100, 1500), 2)
        change = round(random.uniform(-5, 5), 2)
        records.append({
            "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "company": company,
            "price_usd": price,
            "daily_change_percent": change,
        })

    df = pd.DataFrame(records)
    os.makedirs("/opt/airflow/tmp", exist_ok=True)
    raw_path = "/opt/airflow/tmp/market_data.csv"
    df.to_csv(raw_path, index=False)

    print(f"[EXTRACT] Market data successfully generated at {raw_path}")
    return raw_path

Let’s unpack what’s happening here:

  • The function simulates the extraction phase of an ETL pipeline by generating a small, timestamped dataset of popular companies and their simulated market prices.
  • Each record includes a company name, current price in USD, and a randomly generated daily percentage change, mimicking what you’d expect from a real API response or financial data feed.
  • The data is stored in a CSV file inside /opt/airflow/tmp, a shared directory accessible from within your Docker container, this mimics saving raw extracted data before it’s cleaned or transformed.
  • Finally, the function returns the path to that CSV file. This return value becomes crucial because Airflow automatically treats it as the output of this task. Any downstream task that depends on it, for example, a transformation step, can receive it as an input automatically.

In simpler terms, Airflow handles the data flow for you. You focus on defining what each task does, and Airflow takes care of passing outputs to inputs behind the scenes, ensuring your pipeline runs smoothly and predictably.

Transform Task — Cleaning and Analyzing Market Data

@task
def transform_market_data(raw_file: str):
    """
    Clean and analyze extracted market data.
    This task simulates transforming raw stock data
    to identify the top gainers and losers of the day.
    """
    df = pd.read_csv(raw_file)

    # Clean: ensure numeric fields are valid
    df["price_usd"] = pd.to_numeric(df["price_usd"], errors="coerce")
    df["daily_change_percent"] = pd.to_numeric(df["daily_change_percent"], errors="coerce")

    # Sort companies by daily change (descending = top gainers)
    df_sorted = df.sort_values(by="daily_change_percent", ascending=False)

    # Select top 3 gainers and bottom 3 losers
    top_gainers = df_sorted.head(3)
    top_losers = df_sorted.tail(3)

    # Save transformed files
    os.makedirs("/opt/airflow/tmp", exist_ok=True)
    gainers_path = "/opt/airflow/tmp/top_gainers.csv"
    losers_path = "/opt/airflow/tmp/top_losers.csv"

    top_gainers.to_csv(gainers_path, index=False)
    top_losers.to_csv(losers_path, index=False)

    print(f"[TRANSFORM] Top gainers saved to {gainers_path}")
    print(f"[TRANSFORM] Top losers saved to {losers_path}")

    return {"gainers": gainers_path, "losers": losers_path}

Let’s unpack what this transformation does and why it’s important:

  • The function begins by reading the extracted CSV file produced by the previous task (extract_market_data). This is our “raw” dataset.
  • Next, it cleans the data, converting prices and percentage changes into numeric formats, a vital first step before analysis, since raw data often arrives as text.
  • It then sorts companies by their daily percentage change, allowing us to quickly identify which ones gained or lost the most value during the day.
  • Two smaller datasets are then created: one for the top gainers and one for the top losers, each saved as separate CSV files in the same temporary directory.
  • Finally, the task returns both file paths as a dictionary, allowing any downstream task (for example, a visualization or database load step) to easily access both datasets.

This transformation demonstrates how Airflow tasks can move beyond simple sorting; they can perform real business logic, generate multiple outputs, and return structured data to other steps in the workflow.

At this point, your DAG has two working tasks:

  • Extract — to simulate data collection
  • Transform — to clean and analyze that data

When Airflow runs this workflow, it will execute them in order:

Extract → Transform

Now that both the Extract and Transform tasks are defined inside your DAG, let’s see how Airflow links them together when you call them in sequence.

Inside your daily_etl_pipeline() function, add these two lines to establish the task order:

raw = extract_market_data()
transformed = transform_market_data(raw)

When Airflow parses the DAG, it doesn’t see these as ordinary Python calls, it reads them as task relationships.

The TaskFlow API automatically builds a dependency chain, so Airflow knows that extract_market_data must complete before transform_market_data begins.

Notice that we’ve assigned extract_market_data() to a variable called raw. This variable represents the output of the first task, in our case, the path to the extracted data file. The next line, transform_market_data(raw), then takes that output and uses it as input for the transformation step.

This pattern makes the workflow clear and logical: data is extracted, then transformed, with Airflow managing the sequence automatically behind the scenes.

This is how Airflow builds the workflow graph internally: by reading the relationships you define through function calls.

Visualizing the Workflow in the Airflow UI

Once you’ve saved your DAG file with both tasks, Extract and Transform, it’s time to bring it to life. Start your Airflow environment using:

docker compose up -d

Then open your browser and navigate to: http://localhost:8080

You’ll be able to see the Airflow Home page, this time with the DAG we just created: daily_etl_pipeline_airflow3.

Visualizing the Workflow in the Airflow UI

Click on it to open the DAG details, then trigger a manual run using the Play button.

The task currently running will turn blue, and once it completes successfully, it will turn green.

Visualizing the Workflow in the Airflow UI (2)

On the graph view, you will also see two tasks, extract_market_data and transform_market_data, connected in sequence, each showing success.

Visualizing the Workflow in the Airflow UI (3)

If a task encounters an issue, Airflow will automatically retry it up to three times (as defined in default_args). If it continues to fail after all retries, it will appear red, indicating that the task, and therefore the DAG run, has failed.

Inspecting Task Logs

Click on any task box (for example, transform_market_data), then click Task Instances.

Inspecting Task Logs

All DAG runs for the selected task will be listed here. Click on the latest run. This will open a detailed log of the task’s execution, an invaluable feature for debugging and understanding what’s happening under the hood.

In your log, you’ll see:

  • The [EXTRACT] or [TRANSFORM] tags you printed in the code.
  • Confirmation messages showing where your files were saved, e.g.:

    Inspecting Task Logs (2)

    Inspecting Task Logs (3)

These messages prove that your tasks executed correctly and help you trace your data through each stage of the pipeline.
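If you want to look at the generated CSVs themselves, remember they live inside the container that ran the task (with the default CeleryExecutor, that’s the worker container), not on your host. Something like this should work:

docker compose exec airflow-worker ls /opt/airflow/tmp
docker compose exec airflow-worker head /opt/airflow/tmp/top_gainers.csv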

Dynamic Task Mapping

As data engineers, we rarely process just one dataset; we usually work with many sources at once.

For example, instead of analyzing one market, you might process stock data from multiple exchanges or regions simultaneously.

In our current DAG, the extraction and transformation handle only a single dataset.

But what if we wanted to repeat that same process for several markets, say, us, europe, asia, and africa, all in parallel?

Writing a separate task for each region would make our DAG repetitive and hard to maintain.

That’s where Dynamic Task Mapping comes in.

It allows Airflow to create parallel tasks automatically at runtime based on input data such as lists, dictionaries, or query results.

Before editing the DAG, stop any running containers to ensure Airflow picks up your changes cleanly:

docker compose down -v

Now, extend your existing daily_etl_pipeline_airflow3 to handle multiple markets dynamically:

def daily_etl_pipeline():

    @task
    def extract_market_data(market: str):
        ...
    @task
    def transform_market_data(raw_file: str):
        ...

    # Define markets to process dynamically
    markets = ["us", "europe", "asia", "africa"]

    # Dynamically create parallel tasks
    raw_files = extract_market_data.expand(market=markets)
    transformed_files = transform_market_data.expand(raw_file=raw_files)

dag = daily_etl_pipeline()

By using .expand(), Airflow automatically generates multiple parallel task instances from a single function. You’ll notice the argument market passed into the extract_market_data() function. For that to work effectively, here’s the updated version of the extract_market_data() function:

@task
def extract_market_data(market: str):
    """Simulate extracting market data for a given region or market."""
    companies = ["Apple", "Amazon", "Google", "Microsoft", "Tesla", "Netflix"]
    records = []
    for company in companies:
        price = round(random.uniform(100, 1500), 2)
        change = round(random.uniform(-5, 5), 2)
        records.append({
            "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "market": market,
            "company": company,
            "price_usd": price,
            "daily_change_percent": change,
        })

    df = pd.DataFrame(records)
    os.makedirs("/opt/airflow/tmp", exist_ok=True)
    raw_path = f"/opt/airflow/tmp/market_data_{market}.csv"
    df.to_csv(raw_path, index=False)
    print(f"[EXTRACT] Market data for {market} saved at {raw_path}")
    return raw_path

We also updated our transform_market_data() task to align with this dynamic setup:

@task
def transform_market_data(raw_file: str):
    """Clean and analyze each regional dataset."""
    df = pd.read_csv(raw_file)
    df["price_usd"] = pd.to_numeric(df["price_usd"], errors="coerce")
    df["daily_change_percent"] = pd.to_numeric(df["daily_change_percent"], errors="coerce")
    df_sorted = df.sort_values(by="daily_change_percent", ascending=False)

    top_gainers = df_sorted.head(3)
    top_losers = df_sorted.tail(3)

    transformed_path = raw_file.replace("market_data_", "transformed_")
    top_gainers.to_csv(transformed_path, index=False)
    print(f"[TRANSFORM] Transformed data saved at {transformed_path}")
    return transformed_path

Both extract_market_data() and transform_market_data() now work together dynamically:

  • extract_market_data() generates a unique dataset per region (e.g., market_data_us.csv, market_data_europe.csv).
  • transform_market_data() then processes each of those files individually and saves transformed versions (e.g., transformed_us.csv).

Generally:

  • One extract task is created for each market (us, europe, asia, africa).
  • Each extract’s output file becomes the input for its corresponding transform task.
  • Airflow handles all the mapping logic automatically, no loops or manual duplication needed.

Let’s redeploy our containers by running docker compose up -d.

You’ll see this clearly in the Graph View, where the DAG fans out into several parallel branches, one per market.

Dynamic Task Mapping

Each branch runs independently, and Airflow retries or logs failures per task as defined in default_args. You’ll notice that there are four task instances, which clearly correspond to the four market regions we processed.

Dynamic Task Mapping (2)

When you click any of the tasks, for example, extract_market_data, and open the logs, you’ll notice that the data for the corresponding market regions was extracted and saved independently.

Dynamic Task Mapping (3)

Dynamic Task Mapping (4)

Dynamic Task Mapping (5)

Dynamic Task Mapping (6)

Summary and What’s Next

We have built a complete foundation for working with Apache Airflow inside Docker. You learned how to deploy a fully functional Airflow environment using Docker Compose, understand its architecture, and configure it for clean, local development. We explored the Airflow Web UI, and used the TaskFlow API to create our first real workflow, a simple yet powerful ETL pipeline that extracts and transforms data automatically.

By extending it with Dynamic Task Mapping, we saw how Airflow can scale horizontally by processing multiple datasets in parallel, creating independent task instances for each region without duplicating code.

In Part Two, we’ll build on this foundation and introduce the Load phase of our ETL pipeline. You’ll connect Airflow to a local MySQL database and learn how to configure Connections through the Admin panel and environment variables. We’ll also integrate Git and Git Sync to automate DAG deployment and introduce CI/CD pipelines for version-controlled, collaborative Airflow workflows.

By the end of the next part, your environment will evolve from a development sandbox into a production-ready data orchestration system, capable of automating data ingestion, transformation, and loading with full observability, reliability, and continuous integration support.
