Kubernetes Services, Rolling Updates, and Namespaces

22 August 2025 at 23:45

In our previous lesson, you saw Kubernetes automatically replace a crashed Pod. That's powerful, but it reveals a fundamental challenge: if Pods come and go with new IP addresses each time, how do other parts of your application find them reliably?

Today we'll solve this networking puzzle and tackle a related production challenge: how do you deploy updates without breaking your users? We'll work with a realistic data pipeline scenario where a PostgreSQL database needs to stay accessible while an ETL application gets updated.

By the end of this tutorial, you'll be able to:

  • Explain why Services exist and how they provide stable networking for changing Pods
  • Perform zero-downtime deployments using rolling updates
  • Use Namespaces to separate different environments
  • Understand when your applications need these production-grade features

The Moving Target Problem

Let's extend what you built in the previous tutorial to see why we need more than just Pods and Deployments. You deployed a PostgreSQL database and connected to it directly using kubectl exec. Now imagine you want to add a Python ETL script that connects to that database automatically every hour.

Here's the challenge: your ETL script needs to connect to PostgreSQL, but it doesn't know the database Pod's IP address. Even worse, that IP address changes every time Kubernetes restarts the database Pod.

You could try to hardcode the current Pod IP into your ETL script, but this breaks the moment Kubernetes replaces the Pod. You'd be back to manually updating configuration every time something restarts, which defeats the purpose of container orchestration.

This is where Services come in. A Service acts like a stable phone number for your application. Other Pods can always reach your database using the same address, even when the actual database Pod gets replaced.

How Services Work

Think of a Service as a reliable middleman. When your ETL script wants to talk to PostgreSQL, it doesn't need to hunt down the current Pod's IP address. Instead, it just asks for "postgres" and the Service handles finding and connecting to whichever PostgreSQL Pod is currently running. When you create a Service for your PostgreSQL Deployment:

  1. Kubernetes assigns a stable IP address that never changes
  2. DNS gets configured so other Pods can use a friendly name instead of remembering IP addresses
  3. The Service tracks which Pods are healthy and ready to receive traffic
  4. When Pods change, the Service automatically updates its routing without any manual intervention

Your ETL script can connect to postgres:5432 (a DNS name) instead of an IP address. Kubernetes handles all the complexity of routing that request to whichever PostgreSQL Pod is currently running.

Building a Realistic Pipeline

Let's set up that data pipeline and see Services in action. We'll create both the database and the ETL application, then demonstrate how they communicate reliably even when Pods restart.

Start Your Environment

First, make sure you have a Kubernetes cluster running. A cluster is your pool of computing resources - in Minikube's case, it's a single-node cluster running on your local machine.

If you followed the previous tutorial, you can reuse that environment. If not, install Minikube by following the installation guide.

Start your cluster:

minikube start

Notice in the startup logs how Minikube mentions components like 'kubelet' and 'apiserver' - these are the cluster components working together to create your computing pool.

Set up kubectl access using an alias (this mimics how you'll work with production clusters):

alias kubectl="minikube kubectl --"

Verify your cluster is working:

kubectl get nodes

Deploy PostgreSQL with a Service

Let's start by cleaning up any leftover resources from the previous tutorial and creating our database with proper Service networking:

kubectl delete deployment hello-postgres --ignore-not-found=true

Now create the PostgreSQL deployment:

kubectl create deployment postgres --image=postgres:13
kubectl set env deployment/postgres POSTGRES_DB=pipeline POSTGRES_USER=etl POSTGRES_PASSWORD=mysecretpassword

The key step is creating a Service that other applications can use to reach PostgreSQL:

kubectl expose deployment postgres --port=5432 --target-port=5432 --name=postgres

This creates a ClusterIP Service. ClusterIP is the default type of Service that provides internal networking within your cluster - other Pods can reach it, but nothing outside the cluster can access it directly. The --port=5432 means other applications connect on port 5432, and --target-port=5432 means traffic gets forwarded to port 5432 inside the PostgreSQL Pod.
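
Under the hood, kubectl expose generates a Service object for you. A roughly equivalent manifest looks like this (a sketch; the selector relies on the app=postgres label that kubectl create deployment adds by default):

apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  type: ClusterIP        # the default; reachable only from inside the cluster
  selector:
    app: postgres        # route traffic to Pods carrying this label
  ports:
    - port: 5432         # port other applications connect to
      targetPort: 5432   # port inside the PostgreSQL container

You could save this as a file and create it with kubectl apply -f, which becomes the norm once your setup grows beyond a few commands.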

Verify Service Networking

Let's verify that the Service is working. First, check what Kubernetes created:

kubectl get services

You'll see output like:

NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP    1h
postgres     ClusterIP   10.96.123.45    <none>        5432/TCP   30s

The postgres Service has its own stable IP address (10.96.123.45 in this example). This IP never changes, even when the underlying PostgreSQL Pod restarts.

The Service is now ready for other applications to use. Any Pod in your cluster can reach PostgreSQL using the hostname postgres, regardless of which specific Pod is running the database. We'll see this in action when we create the ETL application.

Create the ETL Application

Now let's create an ETL application that connects to our database. We'll use a modified version of the ETL script from our Docker Compose tutorials - it's the same database connection logic, but adapted to run continuously in Kubernetes.

First, clone the tutorial repository and navigate to the ETL application:

git clone https://github.com/dataquestio/tutorials.git
cd tutorials/kubernetes-services-starter

This folder contains two important files:

  • app.py: the ETL script that connects to PostgreSQL
  • Dockerfile: instructions for packaging the script in a container

Build the ETL image in Minikube's Docker environment so Kubernetes can run it directly:

# Point your Docker CLI to Minikube's Docker daemon
eval $(minikube -p minikube docker-env)

# Build the image
docker build -t etl-app:v1 .

Using a version tag (v1) instead of latest makes it easier to demonstrate rolling updates later.

Now, create the Deployment and set environment variables so the ETL app can connect to the postgres Service:

kubectl create deployment etl-app --image=etl-app:v1
kubectl set env deployment/etl-app \
  DB_HOST=postgres \
  DB_PORT=5432 \
  DB_USER=etl \
  DB_PASSWORD=mysecretpassword \
  DB_NAME=pipeline

Scale the deployment to 2 replicas:

kubectl scale deployment etl-app --replicas=2
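
For reference, the declarative equivalent of the commands above is a Deployment manifest along these lines (a sketch; the label and container name match the defaults kubectl create deployment uses):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: etl-app
spec:
  replicas: 2                    # same effect as the kubectl scale command
  selector:
    matchLabels:
      app: etl-app
  template:
    metadata:
      labels:
        app: etl-app
    spec:
      containers:
        - name: etl-app
          image: etl-app:v1
          env:                   # same values as the kubectl set env command
            - name: DB_HOST
              value: postgres
            - name: DB_PORT
              value: "5432"
            - name: DB_USER
              value: etl
            - name: DB_PASSWORD
              value: mysecretpassword
            - name: DB_NAME
              value: pipeline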

Check that everything is running:

kubectl get pods

You should see the PostgreSQL Pod and two ETL application Pods all in "Running" status.

Verify the Service Connection

Let's quickly verify that our ETL application can reach the database using the Service name by running the ETL script manually:

kubectl exec deployment/etl-app -- python3 app.py

You should see output showing the ETL script successfully connecting to PostgreSQL using postgres as the hostname. This demonstrates the Service providing stable networking - the ETL Pod found the database without needing to know its specific IP address.

Zero-Downtime Updates with Rolling Updates

Here's where Kubernetes really shines in production environments. Let's say you need to deploy a new version of your ETL application. In traditional deployment approaches, you might need to stop all instances, update them, and restart everything. This creates downtime.

Kubernetes rolling updates solve this by gradually replacing old Pods with new ones, ensuring some instances are always running to handle requests.

Watch a Rolling Update in Action

First, let's set up a way to monitor what's happening. Open a second terminal and run:

# Make sure you have the kubectl alias in this terminal too
alias kubectl="minikube kubectl --"

# Watch the logs from all ETL Pods
kubectl logs -f -l app=etl-app --all-containers --tail=50

Leave this running. Back in your main terminal, rebuild a new version and tell Kubernetes to use it:

# Ensure your Docker CLI is still pointed at Minikube
eval $(minikube -p minikube docker-env)

# Build v2 of the image
docker build -t etl-app:v2 .

# Trigger the rolling update to v2
kubectl set image deployment/etl-app etl-app=etl-app:v2

Watch what happens in both terminals:

  • In the logs terminal: You'll see some Pods stopping and new ones starting with the updated image
  • In the main terminal: Run kubectl get pods -w to watch Pods being created and terminated in real-time

The -w flag keeps the command running and shows changes as they happen. You'll see something like:

NAME                       READY   STATUS    RESTARTS   AGE
etl-app-5d8c7b4f6d-abc123  1/1     Running   0          2m
etl-app-5d8c7b4f6d-def456  1/1     Running   0          2m
etl-app-7f9a8c5e2b-ghi789  1/1     Running   0          10s    # New Pod
etl-app-5d8c7b4f6d-abc123  1/1     Terminating  0       2m     # Old Pod stopping

Press Ctrl+C to stop watching when the update completes.

What Just Happened?

Kubernetes performed a rolling update with these steps:

  1. Created new Pods with the updated image tag (v2)
  2. Waited for new Pods to be ready and healthy
  3. Terminated old Pods one at a time
  4. Repeated until all Pods were updated

At no point were all your application instances offline. If this were a web service behind a Service, users would never notice the deployment happening.
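
By default, Deployments use the RollingUpdate strategy, and you can tune how aggressive the rollout is in the Deployment spec. The values below are the Kubernetes defaults, shown as a sketch of the relevant fragment:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%   # at most a quarter of the desired Pods may be down during the update
      maxSurge: 25%         # at most a quarter extra Pods may be created above the desired count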

You can check the rollout status and history:

kubectl rollout status deployment/etl-app
kubectl rollout history deployment/etl-app

The history shows your deployments over time, which is useful for tracking what changed and when.

Environment Separation with Namespaces

So far, everything we've created lives in Kubernetes' "default" namespace. In real projects, you typically want to separate different environments (development, staging, production, CI/CD) or different teams' work. Namespaces provide this isolation.

Think of Namespaces as separate workspaces within the same cluster. Resources in different Namespaces can't directly see each other, which prevents accidental conflicts and makes permissions easier to manage.

This solves real problems you encounter as applications grow. Imagine you're developing a new feature for your ETL pipeline - you want to test it without risking your production data or accidentally breaking the version that's currently processing real business data. With Namespaces, you can run a complete copy of your entire pipeline (database, ETL scripts, everything) in a "staging" environment that's completely isolated from production. You can experiment freely, knowing that crashes or bad data in staging won't affect the production system that your users depend on.

Create a Staging Environment

Let's create a completely separate staging environment for our pipeline:

kubectl create namespace staging
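
The declarative equivalent is a tiny manifest you could create with kubectl apply -f:

apiVersion: v1
kind: Namespace
metadata:
  name: staging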

Now deploy the same applications into the staging namespace by adding -n staging to your commands:

# Deploy PostgreSQL in staging
kubectl create deployment postgres --image=postgres:13 -n staging
kubectl set env deployment/postgres \
  POSTGRES_DB=pipeline POSTGRES_USER=etl POSTGRES_PASSWORD=stagingpassword -n staging
kubectl expose deployment postgres --port=5432 --target-port=5432 --name=postgres -n staging

# Deploy ETL app in staging (use the image you built earlier)
kubectl create deployment etl-app --image=etl-app:v1 -n staging
kubectl set env deployment/etl-app \
  DB_HOST=postgres DB_PORT=5432 DB_USER=etl DB_PASSWORD=stagingpassword DB_NAME=pipeline -n staging
kubectl scale deployment etl-app --replicas=2 -n staging

See the Separation in Action

Now you have two complete environments. Compare them:

# Production environment (default namespace)
kubectl get pods

# Staging environment
kubectl get pods -n staging

# All resources in staging
kubectl get all -n staging

# See all Pods across all namespaces at once
kubectl get pods --all-namespaces

Notice that each environment has its own set of Pods, Services, and Deployments. They're completely isolated from each other.

Cross-Namespace DNS

Within the staging namespace, applications still connect to postgres:5432 just like in production. But if you needed an application in staging to connect to a Service in production, you'd use the full DNS name: postgres.default.svc.cluster.local.

The pattern is: <service-name>.<namespace>.svc.<cluster-domain>

Here, svc is a fixed keyword that stands for "service", and cluster.local is the default cluster domain. This reveals an important concept: even though you're running Minikube locally, you're working with a real Kubernetes cluster - it just happens to be a single-node cluster running on your machine. In production, you'd have multiple nodes, but the DNS structure works exactly the same way.

This means:

  • postgres reaches the postgres Service in the current namespace
  • postgres.staging.svc reaches the postgres Service in the staging namespace from anywhere
  • postgres.default.svc reaches the postgres Service in the default namespace from anywhere
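
For example, if an application in staging ever needed to read from the production database (purely hypothetical for this tutorial), its container spec could point at the fully qualified name:

env:
  - name: DB_HOST
    value: postgres.default.svc.cluster.local   # the postgres Service in the default namespace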

Understanding Clusters and Scheduling

Before we wrap up, let's briefly discuss some concepts that are important to understand conceptually, even though you won't work with them directly in local development.

Clusters and Node Pools

As a quick refresher, a Kubernetes cluster is a set of physical or virtual machines that work together to run containerized applications. It’s made up of a control plane that manages the cluster and worker nodes that run the workloads. In production Kubernetes environments (like Google GKE or Amazon EKS), these nodes are often grouped into node pools with different characteristics:

  • Standard pool: General-purpose nodes for most applications
  • High-memory pool: Nodes with lots of RAM for data processing jobs
  • GPU pool: Nodes with graphics cards for machine learning workloads
  • Spot/preemptible pool: Cheaper nodes that can be interrupted, good for fault-tolerant batch jobs

Pod Scheduling

Kubernetes automatically decides which node should run each Pod based on:

  • Resource requirements: CPU and memory requests/limits
  • Node capacity: Available resources on each node
  • Affinity rules: Preferences about which nodes to use or avoid
  • Constraints: Requirements like "only run on SSD-equipped nodes"

You rarely need to think about this in local development with Minikube (which only has one node), but it becomes important when running production workloads across multiple machines.
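
When you do need to influence scheduling, you express these requirements in the Pod spec. Here's a sketch of what that might look like (the resource numbers and the disktype label are illustrative):

spec:
  containers:
    - name: etl-app
      image: etl-app:v1
      resources:
        requests:            # what the scheduler reserves on a node for this Pod
          cpu: 250m
          memory: 256Mi
        limits:              # hard caps enforced at runtime
          cpu: 500m
          memory: 512Mi
  nodeSelector:
    disktype: ssd            # only schedule onto nodes labeled disktype=ssd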

Optional: See Scheduling in Action

If you're curious, you can see a simple example of how scheduling works even in your single-node Minikube cluster:

# "Cordon" your node, marking it as unschedulable for new Pods
kubectl cordon node/minikube

# Try to create a new Pod
kubectl run test-scheduling --image=nginx

# Check if it's stuck in Pending status
kubectl get pods test-scheduling

You should see the Pod stuck in "Pending" status because there are no available nodes to schedule it on.

# "Uncordon" the node to make it schedulable again
kubectl uncordon node/minikube

# The Pod should now get scheduled and start running
kubectl get pods test-scheduling

Clean up the test Pod:

kubectl delete pod test-scheduling

This demonstrates Kubernetes' scheduling system, though you'll mostly encounter this when working with multi-node production clusters.

Cleaning Up

When you're done experimenting:

# Clean up default namespace
kubectl delete deployment postgres etl-app
kubectl delete service postgres

# Clean up staging namespace
kubectl delete namespace staging

# Or stop Minikube entirely
minikube stop

Key Takeaways

You've now experienced three fundamental production capabilities:

Services solve the moving target problem. When Pods restart and get new IP addresses, Services provide stable networking that applications can depend on. Your ETL script connects to postgres:5432 regardless of which specific Pod is running the database.

Rolling updates enable zero-downtime deployments. Instead of stopping everything to deploy updates, Kubernetes gradually replaces old Pods with new ones. This keeps your applications available during deployments.

Namespaces provide environment separation. You can run multiple copies of your entire stack (development, staging, production) in the same cluster while keeping them completely isolated.

These patterns scale from simple applications to complex microservices architectures. A web application with a database uses the same Service networking concepts, just with more components. A data pipeline with multiple processing stages uses the same rolling update strategy for each component.

Next, you'll learn about configuration management with ConfigMaps and Secrets, persistent storage for stateful applications, and resource management to ensure your applications get the CPU and memory they need.

Introduction to Kubernetes

18 August 2025 at 23:29

Up until now you’ve learned about Docker containers and how they solve the "works on my machine" problem. But once your projects involve multiple containers running 24/7, new challenges appear, ones Docker alone doesn't solve.

In this tutorial, you'll discover why Kubernetes exists and get hands-on experience with its core concepts. We'll start by understanding a common problem that developers face, then see how Kubernetes solves it.

By the end of this tutorial, you'll be able to:

  • Explain what problems Kubernetes solves and why it exists
  • Understand the core components: clusters, nodes, pods, and deployments
  • Set up a local Kubernetes environment
  • Deploy a simple application and see self-healing in action
  • Know when you might choose Kubernetes over Docker alone

Why Does Kubernetes Exist?

Let's imagine a realistic scenario that shows why you might need more than just Docker.

You're building a data pipeline with two main components:

  1. A PostgreSQL database that stores your processed data
  2. A Python ETL script that runs every hour to process new data

Using Docker, you've containerized both components and they work perfectly on your laptop. But now you need to deploy this to a production server where it needs to run reliably 24/7.

Here's where things get tricky:

What happens if your ETL container crashes? With Docker alone, it just stays crashed until someone manually restarts it. You could configure VM-level monitoring and auto-restart scripts, but now you're building container management infrastructure yourself.

What if the server fails? You'd need to recreate everything on a new server. Again, you could write scripts to automate this, but you're essentially rebuilding what container orchestration platforms already provide.

The core issue is that you end up writing custom infrastructure code to handle container failures, scaling, and deployments across multiple machines.

Docker alone works fine for simple deployments, but things become complex when you need:

  • Application-level health checks and recovery
  • Coordinated deployments across multiple services
  • Dynamic scaling based on actual workload metrics

How Kubernetes Helps

Before we get into how Kubernetes helps, it’s important to understand that it doesn’t replace Docker. You still use Docker to build container images. What Kubernetes adds is a way to run, manage, and scale those containers automatically in production.

Kubernetes acts like an intelligent supervisor for your containers. Instead of telling Docker exactly what to do ("run this container"), you tell Kubernetes what you want the end result to look like ("always keep my ETL pipeline running"), and it figures out how to make that happen.

If your ETL container crashes, Kubernetes automatically starts a new one. If the entire server fails, Kubernetes can move your containers to a different server. If you need to handle more data, Kubernetes can run multiple copies of your ETL script in parallel.

The key difference is that Kubernetes shifts you from manual container management to automated container management.

The tradeoff? Kubernetes adds complexity, so for single-machine projects Docker Compose is often simpler. But for systems that need to run reliably over time and scale, the complexity is worth it.

How Kubernetes Thinks

To use Kubernetes effectively, you need to understand how it approaches container management differently than Docker.

When you use Docker directly, you think in imperative terms, meaning that you give specific commands about exactly what should happen:

docker run -d --name my-database postgres:13
docker run -d --name my-etl-script python:3.9 my-script.py

You're telling Docker exactly which containers to start, where to start them, and what to call them.

Kubernetes, on the other hand, uses a declarative approach. This means you describe what you want the final state to look like, and Kubernetes figures out how to achieve and maintain that state. For example: "I want a PostgreSQL database to always be running" or "I want my ETL script to run reliably."

This shift from "do this specific thing" to "maintain this desired state" is fundamental to how Kubernetes operates.

Here's how Kubernetes maintains your desired state:

  1. You declare what you want using configuration files or commands
  2. Kubernetes stores your desired state in its database
  3. Controllers continuously monitor the actual state vs. desired state
  4. When they differ, Kubernetes takes action to fix the discrepancy
  5. This process repeats every few seconds, forever

This means that if something breaks your containers, Kubernetes will automatically detect the problem and fix it without you having to intervene.

Core Building Blocks

Kubernetes organizes everything using several key concepts. We’ll discuss the foundational building blocks here, and address more nuanced and complex concepts in a later tutorial.

Cluster

A cluster is a group of machines that work together as a single system. Think of it as your pool of computing resources that Kubernetes can use to run your applications. The important thing to understand is that you don't usually care which specific machine runs your application. Kubernetes handles the placement automatically based on available resources.

Nodes

Nodes are the individual machines (physical or virtual) in your cluster where your containers actually run. You'll mostly interact with the cluster as a whole rather than individual nodes, but it's helpful to understand that your containers are ultimately running on these machines.

Note: We'll cover the details of how nodes work in a later tutorial. For now, just think of them as the computing resources that make up your cluster.

Pods: Kubernetes' Atomic Unit

Here's where Kubernetes differs significantly from Docker. While Docker thinks in terms of individual containers, Kubernetes' smallest deployable unit is called a Pod.

A Pod typically contains:

  • At least one container
  • Shared networking so containers in the Pod can communicate using localhost
  • Shared storage volumes that all containers in the Pod can access

Most of the time, you'll have one container per Pod, but the Pod abstraction gives Kubernetes a consistent way to manage containers along with their networking and storage needs.

Pods are ephemeral, meaning they come and go. When a Pod fails or gets updated, Kubernetes replaces it with a new one. This is why you rarely work with individual Pods directly in production (we'll cover how applications communicate with each other in a future tutorial).
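
For reference, here is a minimal sketch of a Pod definition: one PostgreSQL container and nothing else. In practice you rarely create Pods directly like this; you let a Deployment manage them, as described next.

apiVersion: v1
kind: Pod
metadata:
  name: hello-postgres
spec:
  containers:
    - name: postgres
      image: postgres:13
      env:
        - name: POSTGRES_PASSWORD
          value: mysecretpassword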

Deployments: Managing Pod Lifecycles

Since Pods are ephemeral, you need a way to ensure your application keeps running even when individual Pods fail. This is where Deployments come in.

A Deployment is like a blueprint that tells Kubernetes:

  • What container image to use for your application
  • How many copies (replicas) you want running
  • How to handle updates when you deploy new versions

When you create a Deployment, Kubernetes automatically creates the specified number of Pods. If a Pod crashes or gets deleted, the Deployment immediately creates a replacement. If you want to update your application, the Deployment can perform a rolling update, replacing old Pods one at a time with new versions. This is the key to Kubernetes' self-healing behavior: Deployments continuously monitor the actual number of running Pods and work to match your desired number.
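
In manifest form, that blueprint is just a handful of fields. Here is a minimal sketch using the PostgreSQL example you'll deploy below; notice that the template section is essentially a Pod definition like the one above, wrapped in instructions about how many copies to keep running:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-postgres
spec:
  replicas: 1                  # how many copies to keep running
  selector:
    matchLabels:
      app: hello-postgres
  template:                    # the Pod blueprint
    metadata:
      labels:
        app: hello-postgres
    spec:
      containers:
        - name: postgres
          image: postgres:13   # the container image to use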

Setting Up Your First Cluster

To understand how these concepts work in practice, you'll need a Kubernetes cluster to experiment with. Let's set up a local environment and deploy a simple application.

Prerequisites

Before we start, make sure you have Docker Desktop installed and running. Minikube uses Docker as its default driver to create the virtual environment for your Kubernetes cluster.

If you don't have Docker Desktop yet, download it from docker.com and make sure it's running before proceeding.

Install Minikube

Minikube creates a local Kubernetes cluster perfect for learning and development. Install it by following the official installation guide for your operating system.

You can verify the installation worked by checking the version:

minikube version

Start Your Cluster

Now you're ready to start your local Kubernetes cluster:

minikube start

This command downloads a base image (the first time you run it), uses the Docker driver to start a container that acts as your node, and configures a Kubernetes cluster inside it. The process usually takes a few minutes.

You'll see output like:

😄  minikube v1.33.1 on Darwin 14.1.2
✨  Using the docker driver based on existing profile
👍  Starting control plane node minikube in cluster minikube
🚜  Pulling base image ...
🔄  Restarting existing docker container for "minikube" ...
🐳  Preparing Kubernetes v1.28.3 on Docker 24.0.7 ...
🔎  Verifying Kubernetes components...
🌟  Enabled addons: storage-provisioner, default-storageclass
🏄  Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default

Set Up kubectl Access

Now that your cluster is running, you can use kubectl to interact with it. We'll use the version that comes with Minikube rather than installing it separately to ensure compatibility:

minikube kubectl -- version

You should see version information for both the client and server.

While you could type minikube kubectl -- before every command, the standard practice is to create an alias. This mimics how you'll work with kubectl in cloud environments where you just type kubectl:

alias kubectl="minikube kubectl --"

Why use an alias? In production environments (AWS EKS, Google GKE, etc.), you'll install kubectl separately and use it directly. By practicing with the kubectl command now, you're building the right muscle memory. The alias lets you use standard kubectl syntax while ensuring you're talking to your local Minikube cluster.

Add this alias to your shell profile (.bashrc, .zshrc, etc.) if you want it to persist across terminal sessions.

Verify Your Cluster

Let's make sure everything is working:

kubectl cluster-info

You should see something like:

Kubernetes control plane is running at https://192.168.49.2:8443
CoreDNS is running at https://192.168.49.2:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

Now check what's running in your cluster:

kubectl get nodes

You should see one node (your Minikube VM):

NAME       STATUS   ROLES           AGE   VERSION
minikube   Ready    control-plane   2m    v1.28.3

Perfect! You now have a working Kubernetes cluster.

Deploy Your First Application

Let's deploy a PostgreSQL database to see Kubernetes in action. We'll create a Deployment that runs a postgres container. We'll use PostgreSQL because it's a common component in data projects, but the steps are the same for any container.

Create the deployment:

kubectl create deployment hello-postgres --image=postgres:13
kubectl set env deployment/hello-postgres POSTGRES_PASSWORD=mysecretpassword

Check what Kubernetes created for you:

kubectl get deployments

You should see:

NAME             READY   UP-TO-DATE   AVAILABLE   AGE
hello-postgres   1/1     1            1           30s

Note: If you see 0/1 in the READY column, that's normal! PostgreSQL needs the POSTGRES_PASSWORD environment variable to start properly. Setting it triggers the Deployment to roll out a fresh Pod, and you should see the status change to 1/1 within a minute.

Now look at the Pods:

kubectl get pods

You'll see something like:

NAME                              READY   STATUS    RESTARTS   AGE
hello-postgres-7d8757c6d4-xyz123  1/1     Running   0          45s

Notice how Kubernetes automatically created a Pod with a generated name. The Deployment is managing this Pod for you.

Connect to Your Application

Your PostgreSQL database is running inside the cluster. There are two common ways to interact with it:

Option 1: Using kubectl exec (direct container access)

kubectl exec -it deployment/hello-postgres -- psql -U postgres

This connects you directly to a PostgreSQL session inside the container. The -it flags give you an interactive terminal. You can run SQL commands directly:

postgres=# SELECT version();
postgres=# \q

Option 2: Using port forwarding (local connection)

kubectl port-forward deployment/hello-postgres 5432:5432

Leave this running and open a new terminal. Now you can connect using any PostgreSQL client on your local machine as if the database were running locally on port 5432. Press Ctrl+C to stop the port forwarding when you're done.

Both approaches work well. kubectl exec is faster for quick database tasks, while port forwarding lets you use familiar local tools. Choose whichever feels more natural to you.

Let's break down what you just accomplished:

  1. You created a Deployment - This told Kubernetes "I want PostgreSQL running"
  2. Kubernetes created a Pod - The actual container running postgres
  3. The Pod got scheduled to your Minikube node (the single machine in your cluster)
  4. You connected to the database - Either directly with kubectl exec or through port forwarding

You didn't have to worry about which node to use, how to start the container, or how to configure networking. Kubernetes handled all of that based on your simple deployment command.

Next, we'll see the real magic: what happens when things go wrong.

The Magic Moment: Self-Healing

You've deployed your first application, but you haven't seen Kubernetes' most powerful feature yet. Let's break something on purpose and watch Kubernetes automatically fix it.

Break Something on Purpose

First, let's see what's currently running:

kubectl get pods

You should see your PostgreSQL Pod running:

NAME                              READY   STATUS    RESTARTS   AGE
hello-postgres-7d8757c6d4-xyz123  1/1     Running   0          5m

Now, let's "accidentally" delete this Pod. In a traditional Docker setup, this would mean your database is gone until someone manually restarts it:

kubectl delete pod hello-postgres-7d8757c6d4-xyz123

Replace hello-postgres-7d8757c6d4-xyz123 with your actual Pod name from the previous command.

You'll see:

pod "hello-postgres-7d8757c6d4-xyz123" deleted

Watch the Magic Happen

Immediately check your Pods again:

kubectl get pods

You'll likely see something like this:

NAME                              READY   STATUS    RESTARTS   AGE
hello-postgres-7d8757c6d4-abc789  1/1     Running   0          10s

Notice what happened:

  • The Pod name changed - Kubernetes created a completely new Pod
  • It's already running - The replacement happened automatically
  • It happened immediately - No human intervention required

If you're quick enough, you might catch the Pod in ContainerCreating status as Kubernetes spins up the replacement.

What Just Happened?

This is Kubernetes' self-healing behavior in action. Here's the step-by-step process:

  1. You deleted the Pod - The container stopped running
  2. The Deployment noticed - It continuously monitors the actual vs desired state
  3. State mismatch detected - Desired: 1 Pod running, Actual: 0 Pods running
  4. Deployment took action - It immediately created a new Pod to match the desired state
  5. Balance restored - Back to 1 Pod running, as specified in the Deployment

This entire process took seconds and required no human intervention.

Test It Again

Let's verify the database is working in the new Pod:

kubectl exec deployment/hello-postgres -- psql -U postgres -c "SELECT version();"

Perfect! The database is running normally. The new Pod automatically started with the same configuration (PostgreSQL 13, same password) because the Deployment specification didn't change.

What This Means

This demonstrates Kubernetes' core value: turning manual, error-prone operations into automated, reliable systems. In production, if a server fails at 3 AM, Kubernetes automatically restarts your application on a healthy server within seconds, much faster than alternatives that require VM startup time and manual recovery steps.

You experienced the fundamental shift from imperative to declarative management. You didn't tell Kubernetes HOW to fix the problem - you only specified WHAT you wanted ("keep 1 PostgreSQL Pod running"), and Kubernetes figured out the rest.

Next, we'll wrap up with essential tools and guidance for your continued Kubernetes journey.

Cleaning Up

When you're finished experimenting, you can clean up the resources you created:

# Delete the PostgreSQL deployment
kubectl delete deployment hello-postgres

# Stop your Minikube cluster (optional - saves system resources)
minikube stop

# If you want to completely remove the cluster (optional)
minikube delete

The minikube stop command preserves your cluster for future use while freeing up system resources. Use minikube delete only if you want to start completely fresh next time.

Wrap Up and Next Steps

You've successfully set up a Kubernetes cluster, deployed an application, and witnessed self-healing in action. You now understand why Kubernetes exists and how it transforms container management from manual tasks into automated systems.

Now you're ready to explore:

  • Services - How applications communicate within clusters
  • ConfigMaps and Secrets - Managing configuration and sensitive data
  • Persistent Volumes - Handling data that survives Pod restarts
  • Advanced cluster management - Multi-node clusters, node pools, and workload scheduling strategies
  • Security and access control - Understanding RBAC and IAM concepts

The official Kubernetes documentation is a great resource for diving deeper.

Remember the complexity trade-off: Kubernetes is powerful but adds operational overhead. Choose it when you need high availability, automatic scaling, or multi-server deployments. For simple applications running on a single machine, Docker Compose is often the better choice. Many teams start with Docker Compose and migrate to Kubernetes as their reliability and scaling requirements grow.

Now you have the foundation to make informed decisions about when and how to use Kubernetes in your data projects.

Project Tutorial: Star Wars Survey Analysis Using Python and Pandas

11 August 2025 at 23:17

In this project walkthrough, we'll explore how to clean and analyze real survey data using Python and pandas, while diving into the fascinating world of Star Wars fandom. By working with survey results from FiveThirtyEight, we'll uncover insights about viewer preferences, film rankings, and demographic trends that go beyond the obvious.

Survey data analysis is a critical skill for any data analyst. Unlike clean, structured datasets, survey responses come with unique challenges: inconsistent formatting, mixed data types, checkbox responses that need strategic handling, and missing values that tell their own story. This project tackles these real-world challenges head-on, preparing you for the messy datasets you'll encounter in your career.

Throughout this tutorial, we'll build professional-quality visualizations that tell a compelling story about Star Wars fandom, demonstrating how proper data cleaning and thoughtful visualization design can transform raw survey data into stakeholder-ready insights.

Why This Project Matters

Survey analysis represents a core data science skill applicable across industries. Whether you're analyzing customer satisfaction surveys, employee engagement data, or market research, the techniques demonstrated here form the foundation of professional data analysis:

  • Data cleaning proficiency for handling messy, real-world datasets
  • Boolean conversion techniques for survey checkbox responses
  • Demographic segmentation analysis for uncovering group differences
  • Professional visualization design for stakeholder presentations
  • Insight synthesis for translating data findings into business intelligence

The Star Wars theme makes learning enjoyable, but these skills transfer directly to business contexts. Master these techniques, and you'll be prepared to extract meaningful insights from any survey dataset that crosses your desk.

By the end of this tutorial, you'll know how to:

  • Clean messy survey data by mapping yes/no columns and converting checkbox responses
  • Handle unnamed columns and create meaningful column names for analysis
  • Use boolean mapping techniques to avoid data corruption when re-running Jupyter cells
  • Calculate summary statistics and rankings from survey responses
  • Create professional-looking horizontal bar charts with custom styling
  • Build side-by-side comparative visualizations for demographic analysis
  • Apply object-oriented Matplotlib for precise control over chart appearance
  • Present clear, actionable insights to stakeholders

Before You Start: Pre-Instruction

To make the most of this project walkthrough, follow these preparatory steps:

Review the Project

Access the project and familiarize yourself with the goals and structure: Star Wars Survey Project

Access the Solution Notebook

You can view and download it here to see what we'll be covering: Solution Notebook

Prepare Your Environment

  • If you're using the Dataquest platform, everything is already set up for you
  • If working locally, ensure you have Python with pandas, matplotlib, and numpy installed
  • Download the dataset from the FiveThirtyEight GitHub repository

Prerequisites

  • Comfortable with Python basics and pandas DataFrames
  • Familiarity with dictionaries, loops, and methods in Python
  • Basic understanding of Matplotlib (we'll use intermediate techniques)
  • Understanding of survey data structure is helpful, but not required

New to Markdown? We recommend learning the basics to format headers and add context to your Jupyter notebook: Markdown Guide.

Setting Up Your Environment

Let's begin by importing the necessary libraries and loading our dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

The %matplotlib inline command is Jupyter magic that ensures our plots render directly in the notebook. This is essential for an interactive data exploration workflow.

star_wars = pd.read_csv("star_wars.csv")
star_wars.head()

[Image: Setting Up Environment for Star Wars Data Project]

Our dataset contains survey responses from over 1,100 respondents about their Star Wars viewing habits and preferences.

Learning Insight: Notice the unnamed columns (Unnamed: 4, Unnamed: 5, etc.) and extremely long column names? This is typical of survey data exported from platforms like SurveyMonkey. The unnamed columns actually represent different movies in the franchise, and cleaning these will be our first major task.

The Data Challenge: Survey Structure Explained

Survey data presents unique structural challenges. Consider this typical survey question:

"Which of the following Star Wars films have you seen? Please select all that apply."

This checkbox-style question gets exported as multiple columns where:

  • Column 1 contains "Star Wars: Episode I The Phantom Menace" if selected, NaN if not
  • Column 2 contains "Star Wars: Episode II Attack of the Clones" if selected, NaN if not
  • And so on for all six films...

This structure makes analysis difficult, so we'll transform it into clean boolean columns.

Data Cleaning Process

Step 1: Converting Yes/No Responses to Booleans

Survey responses often come as text ("Yes"/"No") but boolean values (True/False) are much easier to work with programmatically:

yes_no = {"Yes": True, "No": False, True: True, False: False}

for col in [
    "Have you seen any of the 6 films in the Star Wars franchise?",
    "Do you consider yourself to be a fan of the Star Wars film franchise?",
    "Are you familiar with the Expanded Universe?",
    "Do you consider yourself to be a fan of the Star Trek franchise?"
]:
    star_wars[col] = star_wars[col].map(yes_no, na_action='ignore')

Learning Insight: Why the seemingly redundant True: True, False: False entries? This prevents overwriting data when re-running Jupyter cells. Without these entries, if you accidentally run the cell twice, all your True values would become NaN because the mapping dictionary no longer contains True as a key. This is a common Jupyter pitfall that can silently destroy your analysis!

Step 2: Transforming Movie Viewing Data

The trickiest part involves converting the checkbox movie data. Each unnamed column represents whether someone has seen a specific Star Wars episode:

movie_mapping = {
    "Star Wars: Episode I  The Phantom Menace": True,
    np.nan: False,
    "Star Wars: Episode II  Attack of the Clones": True,
    "Star Wars: Episode III  Revenge of the Sith": True,
    "Star Wars: Episode IV  A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True,
    True: True,
    False: False
}

for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(movie_mapping)

Step 3: Strategic Column Renaming

Long, unwieldy column names make analysis difficult. We'll rename them to something manageable:

star_wars = star_wars.rename(columns={
    "Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
    "Unnamed: 4": "seen_2",
    "Unnamed: 5": "seen_3",
    "Unnamed: 6": "seen_4",
    "Unnamed: 7": "seen_5",
    "Unnamed: 8": "seen_6"
})

We'll also clean up the ranking columns:

star_wars = star_wars.rename(columns={
    "Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.": "ranking_ep1",
    "Unnamed: 10": "ranking_ep2",
    "Unnamed: 11": "ranking_ep3",
    "Unnamed: 12": "ranking_ep4",
    "Unnamed: 13": "ranking_ep5",
    "Unnamed: 14": "ranking_ep6"
})

Analysis: Uncovering the Data Story

Which Movie Reigns Supreme?

Let's calculate the average ranking for each movie. Remember, in ranking questions, lower numbers indicate higher preference:

mean_ranking = star_wars[star_wars.columns[9:15]].mean().sort_values()
print(mean_ranking)
ranking_ep5    2.513158
ranking_ep6    3.047847
ranking_ep4    3.272727
ranking_ep1    3.732934
ranking_ep2    4.087321
ranking_ep3    4.341317

The results are decisive: Episode V (The Empire Strikes Back) emerges as the clear fan favorite with an average ranking of 2.51. The original trilogy (Episodes IV-VI) significantly outperforms the prequel trilogy (Episodes I-III).

Movie Viewership Patterns

Which movies have people actually seen?

total_seen = star_wars[star_wars.columns[3:9]].sum()
print(total_seen)
seen_1    673
seen_2    571
seen_3    550
seen_4    607
seen_5    758
seen_6    738

Episodes V and VI lead in viewership, while the prequels show notably lower viewing numbers. Episode III has the fewest viewers at 550 respondents.

Professional Visualization: From Basic to Stakeholder-Ready

Creating Our First Chart

Let's start with a basic visualization and progressively enhance it:

plt.bar(range(6), star_wars[star_wars.columns[3:9]].sum())

This creates a functional chart, but it's not ready for stakeholders. Let's upgrade to object-oriented Matplotlib for precise control:

fig, ax = plt.subplots(figsize=(6,3))
rankings = ax.barh(mean_ranking.index, mean_ranking, color='#fe9b00')

ax.set_facecolor('#fff4d6')
ax.set_title('Average Ranking of Each Movie')

for spine in ['top', 'right', 'bottom', 'left']:
    ax.spines[spine].set_visible(False)

ax.invert_yaxis()
ax.text(2.6, 0.35, '*Lowest rank is the most\n liked', fontstyle='italic')

plt.show()

[Image: Star Wars Average Ranking for Each Movie]

Learning Insight: Think of fig as your canvas and ax as a panel or chart area on that canvas. Object-oriented Matplotlib might seem intimidating initially, but it provides precise control over every visual element. The fig object handles overall figure properties while ax controls individual chart elements.

Advanced Visualization: Gender Comparison

Our most sophisticated visualization compares rankings and viewership by gender using side-by-side bars:

# Create gender-based dataframes
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]

# Calculate statistics for each gender
male_ranking_avgs = males[males.columns[9:15]].mean()
female_ranking_avgs = females[females.columns[9:15]].mean()
male_tot_seen = males[males.columns[3:9]].sum()
female_tot_seen = females[females.columns[3:9]].sum()

# Create side-by-side comparison
ind = np.arange(6)
height = 0.35
offset = ind + height

fig, ax = plt.subplots(1, 2, figsize=(8,4))

# Rankings comparison
malebar = ax[0].barh(ind, male_ranking_avgs, color='#fe9b00', height=height)
femalebar = ax[0].barh(offset, female_ranking_avgs, color='#c94402', height=height)
ax[0].set_title('Movie Rankings by Gender')
ax[0].set_yticks(ind + height / 2)
ax[0].set_yticklabels(['Episode 1', 'Episode 2', 'Episode 3', 'Episode 4', 'Episode 5', 'Episode 6'])
ax[0].legend(['Men', 'Women'])

# Viewership comparison
male2bar = ax[1].barh(ind, male_tot_seen, color='#ff1947', height=height)
female2bar = ax[1].barh(offset, female_tot_seen, color='#9b052d', height=height)
ax[1].set_title('# of Respondents by Gender')
ax[1].set_xlabel('Number of Respondents')
ax[1].legend(['Men', 'Women'])

plt.show()

[Image: Star Wars Movie Rankings by Gender]

Learning Insight: The offset technique (ind + height) is the key to creating side-by-side bars. It shifts the female bars away from the male bars within each group, creating the comparative effect. Keeping the same category positions and bar order for both genders ensures a fair visual comparison between the panels.

Key Findings and Insights

Through our systematic analysis, we've discovered:

Movie Preferences:

  • Episode V (Empire Strikes Back) emerges as the definitive fan favorite across all demographics
  • The original trilogy significantly outperforms the prequels in both ratings and viewership
  • Episode III receives the lowest ratings and has the fewest viewers

Gender Analysis:

  • Both men and women rank Episode V as their clear favorite
  • Gender differences in preferences are minimal but consistently favor male engagement
  • Men tended to rank Episode IV slightly higher than women
  • More men have seen each of the six films than women, but the patterns remain consistent

Demographic Insights:

  • The ranking differences between genders are negligible across most films
  • Episodes V and VI represent the franchise's most universally appealing content
  • The stereotype about gender preferences in sci-fi shows some support in engagement levels, but taste preferences remain remarkably similar

The Stakeholder Summary

Every analysis should conclude with clear, actionable insights. Here's what stakeholders need to know:

  • Episode V (Empire Strikes Back) is the definitive fan favorite with the lowest average ranking across all demographics
  • Gender differences in movie preferences are minimal, challenging common stereotypes about sci-fi preferences
  • The original trilogy significantly outperforms the prequels in both critical reception and audience reach
  • Male respondents show higher overall engagement with the franchise, having seen more films on average

Beyond This Analysis: Next Steps

This dataset contains rich additional dimensions worth exploring:

  • Character Analysis: Which characters are universally loved, hated, or controversial across the fanbase?
  • The "Han Shot First" Debate: Analyze this infamous Star Wars controversy and what it reveals about fandom
  • Cross-Franchise Preferences: Explore correlations between Star Wars and Star Trek fandom
  • Education and Age Correlations: Do viewing patterns vary by demographic factors beyond gender?

This project perfectly balances technical skill development with engaging subject matter. You'll emerge with a polished portfolio piece demonstrating data cleaning proficiency, advanced visualization capabilities, and the ability to transform messy survey data into actionable business insights.

Whether you're team Jedi or Sith, the data tells a compelling story. And now you have the skills to tell it beautifully.

If you give this project a go, please share your findings in the Dataquest community and tag me (@Anna_Strahl). I'd love to see what patterns you discover!

More Projects to Try

We have some other project walkthrough tutorials you may also enjoy.

Advanced Concepts in Docker Compose

5 August 2025 at 19:16

If you completed the previous Intro to Docker Compose tutorial, you’ve probably got a working multi-container pipeline running through Docker Compose. You can start your services with a single command, connect a Python ETL script to a Postgres database, and even persist your data across runs. For local development, that might feel like more than enough.

But when it's time to hand your setup off to a DevOps team or prepare it for staging, new requirements start to appear. Your containers need to be more reliable, your configuration more portable, and your build process more maintainable. These are the kinds of improvements that don’t necessarily change what your pipeline does, but they make a big difference in how safely and consistently it runs—especially in environments you don’t control.

In this tutorial, you'll take your existing Compose-based pipeline and learn how to harden it for production use. That includes adding health checks to prevent race conditions, using multi-stage Docker builds to reduce image size and complexity, running as a non-root user to improve security, and externalizing secrets with environment files.

Each improvement will address a common pitfall in container-based workflows. By the end, your project will be something your team can safely share, deploy, and scale.

Getting Started

Before we begin, let’s clarify one thing: if you’ve completed the earlier tutorials, you should already have a working Docker Compose setup with a Python ETL script and a Postgres database. That’s what we’ll build on in this tutorial.

But if you’re jumping in fresh (or your setup doesn’t work anymore) you can still follow along. You’ll just need a few essentials in place:

  • A simple app.py Python script that connects to Postgres (we won’t be changing the logic much).
  • A Dockerfile that installs Python and runs the script.
  • A docker-compose.yaml with two services: one for the app, one for Postgres.

You can write these from scratch, but to save time, we’ve provided a starter repo with minimal boilerplate.

Once you’ve got that working, you’re ready to start hardening your containerized pipeline.

Add a Health Check to the Database

At this point, your project includes two main services defined in docker-compose.yaml: a Postgres database and a Python container that runs your ETL script. The services start together, and your script connects to the database over the shared Compose network.

That setup works, but it has a hidden risk. When you run docker compose up, Docker starts each container, but it doesn’t check whether those services are actually ready. If Postgres takes a few seconds to initialize, your app might try to connect too early and either fail or hang without a clear explanation.

To fix that, you can define a health check that monitors the readiness of the Postgres container. This gives Docker an explicit test to run, rather than relying on the assumption that "started" means "ready."

Postgres includes a built-in command called pg_isready that makes this easy to implement. You can use it inside your Compose file like this:

services:
  db:
    image: postgres:15
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "postgres"]
      interval: 5s
      timeout: 2s
      retries: 5

This setup checks whether Postgres is accepting connections. Docker will retry up to five times, once every five seconds, before giving up. If the service responds successfully, Docker will mark the container as “healthy.”

To coordinate your services more reliably, you can also add a depends_on condition to your app service. This ensures your ETL script won’t even try to start until the database is ready:

  app:
    build: .
    depends_on:
      db:
        condition: service_healthy

Once you've added both of these settings, try restarting your stack with docker compose up. You can check the health status with docker compose ps, and you should see the Postgres container marked as healthy before the app container starts running.

This one change can prevent a whole category of race conditions that show up only intermittently—exactly the kind of problem that makes pipelines brittle in production environments. Health checks help make your containers functional and dependable.

Optimize Your Dockerfile with Multi-Stage Builds

As your project evolves, your Docker image can quietly grow bloated with unnecessary files like build tools, test dependencies, and leftover cache. It’s not always obvious, especially when the image still “works.” But over time, that bulk slows things down and adds maintenance risk.

That’s why many teams use multi-stage builds: they offer a cleaner, more controlled way to produce smaller, production-ready containers. This technique lets you separate the build environment (where you install and compile everything) from the runtime environment (the lean final image that actually runs your app). Instead of trying to remove unnecessary files or shrink things manually, you define two separate stages and let Docker handle the rest.

Let’s take a quick look at what that means in practice. Here’s a simplified example of what your current Dockerfile might resemble:

FROM python:3.10-slim

WORKDIR /app
COPY app.py .
RUN pip install psycopg2-binary

CMD ["python", "app.py"]

Now here’s a version using multi-stage builds:

# Build stage
FROM python:3.10-slim AS builder

WORKDIR /app
COPY app.py .
RUN pip install --target=/tmp/deps psycopg2-binary

# Final stage
FROM python:3.10-slim

WORKDIR /app
COPY --from=builder /app/app.py .
COPY --from=builder /tmp/deps /usr/local/lib/python3.10/site-packages/

CMD ["python", "app.py"]

The first stage installs your dependencies into a temporary location. The second stage then starts from a fresh image and copies in only what’s needed to run the app. This ensures the final image is small, clean, and free of anything related to development or testing.

Why You Might See a Warning Here

You might see a yellow warning in your IDE about vulnerabilities in the python:3.10-slim image. These come from known issues in upstream packages. In production, you’d typically pin to a specific patch version or scan images as part of your CI pipeline.

For now, you can continue with the tutorial. But it’s helpful to know what these warnings mean and how they fit into professional container workflows. We'll talk more about container security in later steps.

To try this out, rebuild your image using a version tag so it doesn’t overwrite your original:

docker build -t etl-app:v2 .

If you want Docker Compose to use this tagged image, update your Compose file to use image: instead of build::

app:
  image: etl-app:v2

This tells Compose to use the existing etl-app:v2 image instead of building a new one.

On the other hand, if you're still actively developing and want Compose to rebuild the image each time, keep using:

app:
  build: .

In that case, you don’t need to tag anything, just run:

docker compose up --build

That will rebuild the image from your local Dockerfile automatically.

Both approaches work. During development, using build: is often more convenient because you can tweak your Dockerfile and rebuild on the fly. When you're preparing something reproducible for handoff, though, switching to image: makes sense because it locks in a specific version of the container.

This tradeoff is one reason many teams use multiple Compose files:

  • A base docker-compose.yml for production (using image:)
  • A docker-compose.dev.yml for local development (with build:)
  • And sometimes even a docker-compose.test.yml to replicate CI testing environments

This setup keeps your core configuration consistent while letting each environment handle containers in the way that fits best.
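
If you go that route, Compose can layer the files for you at run time. As a rough sketch (these file names are the common conventions, not files this tutorial has created):

# Use the base file plus a development override; later files win on conflicts
docker compose -f docker-compose.yml -f docker-compose.dev.yml up --build

Because later files override or extend earlier ones, the dev file only needs to contain the differences, such as switching image: back to build:.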

You can check the difference in size using:

docker images

Even if your current app is tiny, getting used to multi-stage builds now sets you up for smoother production work later. It separates concerns more clearly, reduces the chance of leaking dev tools into production, and gives you tighter control over what goes into your images.

Some teams even use this structure to compile code in one language and run it in another base image entirely. Others use it to enforce security guidelines by ensuring only tested, minimal files end up in deployable containers.

Whether or not the image size changes much in this case, the structure itself is the win. It gives you portability, predictability, and a cleaner build process without needing to micromanage what’s included.

A single-stage Dockerfile can be tidy on paper, but everything you install or download, even temporarily, ends up in the final image unless you carefully clean it up. Multi-stage builds give you a cleaner separation of concerns by design, which means fewer surprises, fewer manual steps, and less risk of shipping something you didn’t mean to.

Run Your App as a Non-Root User

By default, most containers, including the ones you’ve built so far, run as the root user inside the container. That’s convenient for development, but it’s risky in production. Even if an attacker can’t break out of the container, root access still gives them elevated privileges inside it. That can be enough to install software, run background processes, or exploit your infrastructure for malicious purposes, like launching DDoS attacks or mining cryptocurrency. In shared environments like Kubernetes, this kind of access is especially dangerous.

The good news is that you can fix this with just a few lines in your Dockerfile. Instead of running as root, you’ll create a dedicated user and switch to it before the container runs. In fact, some platforms require non-root users to work properly. Making the switch early can prevent frustrating errors later on, while also improving your security posture.

In the final stage of your Dockerfile, you can add:

RUN useradd -m etluser
USER etluser

This creates a dedicated user (the -m flag gives it a home directory) and tells Docker to use that account when the container runs. If you’ve already refactored your Dockerfile using multi-stage builds, this change goes in the final stage, after dependencies are copied in and right before the CMD.
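
For orientation, here's roughly where those lines land in the multi-stage Dockerfile from earlier (a sketch; the only new lines are the useradd and USER instructions):

# Final stage
FROM python:3.10-slim

WORKDIR /app
COPY --from=builder /app/app.py .
COPY --from=builder /tmp/deps /usr/local/lib/python3.10/site-packages/

# Create a dedicated user and drop root privileges before the app runs
RUN useradd -m etluser
USER etluser

CMD ["python", "app.py"]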

To confirm the change, you can run a one-off container that prints the current user:

docker compose run app whoami

You should see:

etluser

This confirms that your container is no longer running as root. Since this command runs in a new container and exits right after, it works even if your main app script finishes quickly.

One thing to keep in mind is file permissions. If your app writes to mounted volumes or tries to access system paths, switching away from root can lead to permission errors. You likely won’t run into that in this project, but it’s worth knowing where to look if something suddenly breaks after this change.
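
If you do run into permission errors, a common fix is to give the new user ownership of the files it needs at build time. Docker's COPY instruction supports a --chown flag; a minimal sketch (note that the user must already exist at that point in the Dockerfile, so the useradd line has to come first):

# Copy the script in as etluser so the app can read (and, if needed, write) it
COPY --chown=etluser:etluser app.py .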

This small step has a big impact. Many modern platforms—including Kubernetes and container registries like Docker Hub—warn you if your images run as root. Some environments even block them entirely. Running as a non-root user improves your pipeline’s security posture and helps future-proof it for deployment.

Externalize Configuration with .env Files

In earlier steps, you may have hardcoded your Postgres credentials and database name directly into your docker-compose.yaml. That works for quick local tests, but in a real project, it’s a security risk.

Storing secrets like usernames and passwords directly in version-controlled files is never safe. Even in private repos, those credentials can easily leak or be accidentally reused. That’s why one of the first steps toward securing your pipeline is externalizing sensitive values into environment variables.

Docker Compose makes this easy by automatically reading from a .env file in your project directory. This is where you store sensitive environment variables like database passwords, without exposing them in your versioned YAML.

Here’s what a simple .env file might look like:

POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_DB=products
DB_HOST=db

Then, in your docker-compose.yaml, you reference those variables just like before:

environment:
  POSTGRES_USER: ${POSTGRES_USER}
  POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
  POSTGRES_DB: ${POSTGRES_DB}
  DB_HOST: ${DB_HOST}

This change doesn’t require any new flags or commands. As long as your .env file lives in the same directory where you run docker compose up, Compose will pick it up automatically.
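
To double-check that Compose is actually picking up your values, you can print the fully resolved configuration; docker compose config substitutes the variables from .env so you can see exactly what will run:

docker compose config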

But your .env file should never be committed to version control. Instead, add it to your .gitignore file to keep it private. To make your project safe and shareable, create a .env.example file with the same variable names but placeholder values:

POSTGRES_USER=your_username
POSTGRES_PASSWORD=your_password
POSTGRES_DB=your_database

Anyone cloning your project can copy that file, rename it to .env, and customize it for their own use, without risking real secrets or overwriting someone else’s setup.

Externalizing secrets this way is one of the simplest and most important steps toward writing secure, production-ready Docker projects. It also lays the foundation for more advanced workflows down the line, like secret injection from CI/CD pipelines or cloud platforms. The more cleanly you separate config and secrets from your code, the easier your project will be to scale, deploy, and share safely.

Optional Concepts: Going Even Further

The features you’ve added so far (health checks, multi-stage builds, non-root users, and .env files) go a long way toward making your pipeline production-ready. But there are a few more Docker and Docker Compose capabilities that are worth knowing, even if you don’t need to implement them right now.

Resource Constraints

One of those is resource constraints. In shared environments, or when testing pipelines in CI, you might want to restrict how much memory or CPU a container can use. Docker Compose supports this through optional fields like mem_limit and cpu_shares, which you can add to any service:

app:
  build: .
  mem_limit: 512m
  cpu_shares: 256

These aren’t enforced strictly in all environments (and don’t work on Docker Desktop without extra configuration), but they become important as you scale up or move into Kubernetes.

Logging

Another area to consider is logging. By default, Docker Compose captures all stdout and stderr output from each container. For most pipelines, that’s enough: you can view logs using docker compose logs or see them live in your terminal. In production, though, logs are often forwarded to a centralized service, written to a mounted volume, or parsed automatically for errors. Keeping your logs structured and focused (especially if you use Python’s logging module) makes that transition easier later on.
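
If you want to move in that direction, even a little structure goes a long way. Here's a minimal sketch using Python's standard logging module; the logger name and format string are just examples:

import logging

# Emit timestamped, leveled log lines to stdout so Docker can capture them
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("etl")

logger.info("ETL started")
logger.info("Inserted %d row(s) into vegetables", 1)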

Kubernetes

Many of the improvements you’ve made in this tutorial map directly to concepts in Kubernetes:

  • Health checks become readiness and liveness probes
  • Non-root users align with container securityContext settings
  • Environment variables and .env files lay the groundwork for using Secrets and ConfigMaps

Even if you’re not deploying to Kubernetes yet, you’re already building the right habits. These are the same tools and patterns that production-grade pipelines depend on.
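
For a rough idea of what that mapping looks like in practice, here's a hedged sketch of a Kubernetes Pod spec. The names, image tag, secret, and probe command are illustrative assumptions, not something this tutorial deploys:

apiVersion: v1
kind: Pod
metadata:
  name: etl-app
spec:
  containers:
    - name: etl-app
      image: etl-app:v2                 # the image you tagged earlier
      securityContext:
        runAsNonRoot: true              # mirrors the non-root USER in your Dockerfile
      envFrom:
        - secretRef:
            name: etl-secrets           # plays the role of your .env file
      livenessProbe:                    # the Kubernetes counterpart to a health check
        exec:
          command: ["python", "-c", "print('alive')"]
        initialDelaySeconds: 5
        periodSeconds: 10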

You don’t need to learn everything at once, but when you're ready to make that leap, you'll already understand the foundations.

Wrap-Up

You started this tutorial with a Docker Compose stack that worked fine for local development. By now, you've made it significantly more robust without changing what your pipeline actually does. Instead, you focused on how it runs, how it’s configured, and how ready it is for the environments where it might eventually live.

To review, we:

  • Added a health check to make sure services only start when they’re truly ready.
  • Rewrote your Dockerfile using a multi-stage build, slimming down your image and separating build concerns from runtime needs.
  • Hardened your container by running it as a non-root user and moved configuration into a .env file to make it safer and more shareable.

These are the kinds of improvements developers make every day when preparing pipelines for staging, production, or CI. Whether you’re working in Docker, Kubernetes, or a cloud platform, these patterns are part of the job.

If you’ve made it this far, you’ve done more than just containerize a data workflow: you’ve taken your first steps toward running it with confidence, consistency, and professionalism. In the next project, you’ll put all of this into practice by building a fully productionized ETL stack from scratch.

Project Tutorial: Finding Heavy Traffic Indicators on I-94

22 July 2025 at 22:12

In this project walkthrough, we'll explore how to use data visualization techniques to uncover traffic patterns on Interstate 94, one of America's busiest highways. By analyzing real-world traffic volume data along with weather conditions and time-based factors, we'll identify key indicators of heavy traffic that could help commuters plan their travel times more effectively.

Traffic congestion is a daily challenge for millions of commuters. Understanding when and why heavy traffic occurs can help drivers make informed decisions about their travel times, and help city planners optimize traffic flow. Through this hands-on analysis, we'll discover surprising patterns that go beyond the obvious rush-hour expectations.

Throughout this tutorial, we'll build multiple visualizations that tell a comprehensive story about traffic patterns, demonstrating how exploratory data visualization can reveal insights that summary statistics alone might miss.

What You'll Learn

By the end of this tutorial, you'll know how to:

  • Create and interpret histograms to understand traffic volume distributions
  • Use time series visualizations to identify daily, weekly, and monthly patterns
  • Build side-by-side plots for effective comparisons
  • Analyze correlations between weather conditions and traffic volume
  • Apply grouping and aggregation techniques for time-based analysis
  • Combine multiple visualization types to tell a complete data story

Before You Start: Pre-Instruction

To make the most of this project walkthrough, follow these preparatory steps:

  1. Review the Project

    Access the project and familiarize yourself with the goals and structure: Finding Heavy Traffic Indicators Project.

  2. Access the Solution Notebook

    You can view and download it here to see what we'll be covering: Solution Notebook

  3. Prepare Your Environment

    • If you're using the Dataquest platform, everything is already set up for you
    • If working locally, ensure you have Python with pandas, matplotlib, and seaborn installed
    • Download the dataset from the UCI Machine Learning Repository
  4. Prerequisites

    New to Markdown? We recommend learning the basics to format headers and add context to your Jupyter notebook: Markdown Guide.

Setting Up Your Environment

Let's begin by importing the necessary libraries and loading our dataset:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The %matplotlib inline command is Jupyter magic that ensures our plots render directly in the notebook. This is essential for an interactive data exploration workflow.

traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
traffic.head()
   holiday   temp  rain_1h  snow_1h  clouds_all weather_main  \
0      NaN  288.28      0.0      0.0          40       Clouds
1      NaN  289.36      0.0      0.0          75       Clouds
2      NaN  289.58      0.0      0.0          90       Clouds
3      NaN  290.13      0.0      0.0          90       Clouds
4      NaN  291.14      0.0      0.0          75       Clouds

      weather_description            date_time  traffic_volume
0      scattered clouds  2012-10-02 09:00:00            5545
1        broken clouds  2012-10-02 10:00:00            4516
2      overcast clouds  2012-10-02 11:00:00            4767
3      overcast clouds  2012-10-02 12:00:00            5026
4        broken clouds  2012-10-02 13:00:00            4918

Our dataset contains hourly traffic volume measurements from a station between Minneapolis and St. Paul on westbound I-94, along with weather conditions for each hour. Key columns include:

  • holiday: Name of holiday (if applicable)
  • temp: Temperature in Kelvin
  • rain_1h: Rainfall in mm for the hour
  • snow_1h: Snowfall in mm for the hour
  • clouds_all: Percentage of cloud cover
  • weather_main: General weather category
  • weather_description: Detailed weather description
  • date_time: Timestamp of the measurement
  • traffic_volume: Number of vehicles (our target variable)

Learning Insight: Notice the temperatures are in Kelvin (around 288K = 15°C = 59°F). This is unusual for everyday use but common in scientific datasets. When presenting findings to stakeholders, you might want to convert these to Fahrenheit or Celsius for better interpretability.
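
For example, a quick conversion (adding two assumed helper columns rather than overwriting temp) might look like:

# Kelvin to Celsius and Fahrenheit for easier interpretation
traffic['temp_c'] = traffic['temp'] - 273.15
traffic['temp_f'] = traffic['temp_c'] * 9 / 5 + 32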

Initial Data Exploration

Before diving into visualizations, let's understand our dataset structure:

traffic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              61 non-null     object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB

We have nearly 50,000 hourly observations spanning several years. Notice that the holiday column has only 61 non-null values out of 48,204 rows. Let's investigate:

traffic['holiday'].value_counts()
holiday
Labor Day                    7
Christmas Day                6
Thanksgiving Day             6
Martin Luther King Jr Day    6
New Years Day                6
Veterans Day                 5
Columbus Day                 5
Memorial Day                 5
Washingtons Birthday         5
State Fair                   5
Independence Day             5
Name: count, dtype: int64

Learning Insight: At first glance, you might think the holiday column is nearly useless with so few values. But actually, holidays are only marked at midnight on the holiday itself. This is a great example of how understanding your data's structure can make a big difference: what looks like missing data might actually be a deliberate design choice. For a complete analysis, you'd want to expand these holiday markers to cover all 24 hours of each holiday.
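
One hedged way to do that expansion (the helper column name here is an assumption) is to broadcast each date's first non-null holiday value to every hour of that date:

# Propagate the midnight holiday marker to all 24 hours of the same date
traffic['date_time'] = pd.to_datetime(traffic['date_time'])
traffic['holiday_filled'] = (
    traffic.groupby(traffic['date_time'].dt.date)['holiday'].transform('first')
)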

Let's examine our numeric variables:

traffic.describe()
              temp       rain_1h       snow_1h    clouds_all  traffic_volume
count  48204.000000  48204.000000  48204.000000  48204.000000    48204.000000
mean     281.205870      0.334264      0.000222     49.362231     3259.818355
std       13.338232     44.789133      0.008168     39.015750     1986.860670
min        0.000000      0.000000      0.000000      0.000000        0.000000
25%      272.160000      0.000000      0.000000      1.000000     1193.000000
50%      282.450000      0.000000      0.000000     64.000000     3380.000000
75%      291.806000      0.000000      0.000000     90.000000     4933.000000
max      310.070000   9831.300000      0.510000    100.000000     7280.000000

Key observations:

  • Temperature ranges from 0K to 310K (that 0K is suspicious and likely a data quality issue)
  • Most hours have no precipitation (75th percentile for both rain and snow is 0)
  • Traffic volume ranges from 0 to 7,280 vehicles per hour
  • The mean (3,260) and median (3,380) traffic volumes are similar, suggesting relatively symmetric distribution

Visualizing Traffic Volume Distribution

Let's create our first visualization to understand traffic patterns:

plt.hist(traffic["traffic_volume"])
plt.xlabel("Traffic Volume")
plt.title("Traffic Volume Distribution")
plt.show()

Traffic Distribution

Learning Insight: Always label your axes and add titles! Your audience shouldn't have to guess what they're looking at. A graph without context is just pretty colors.

The histogram reveals a striking bimodal distribution with two distinct peaks:

  • One peak near 0-1,000 vehicles (low traffic)
  • Another peak around 4,000-5,000 vehicles (high traffic)

This suggests two distinct traffic regimes. My immediate hypothesis: these correspond to day and night traffic patterns.

Day vs. Night Analysis

Let's test our hypothesis by splitting the data into day and night periods:

# Convert date_time to datetime format
traffic['date_time'] = pd.to_datetime(traffic['date_time'])

# Create day and night dataframes
day = traffic.copy()[(traffic['date_time'].dt.hour >= 7) &
                     (traffic['date_time'].dt.hour < 19)]

night = traffic.copy()[(traffic['date_time'].dt.hour >= 19) |
                       (traffic['date_time'].dt.hour < 7)]

Learning Insight: I chose 7 AM to 7 PM as "day" hours, which gives us equal 12-hour periods. This boundary is somewhat arbitrary, and you might draw it differently. I encourage you to experiment with other definitions, like 6 AM to 6 PM, and see how it affects your results. Just keep the periods balanced to avoid skewing your analysis.

Now let's visualize both distributions side by side:

plt.figure(figsize=(11,3.5))

plt.subplot(1, 2, 1)
plt.hist(day['traffic_volume'])
plt.xlim(-100, 7500)
plt.ylim(0, 8000)
plt.title('Traffic Volume: Day')
plt.ylabel('Frequency')
plt.xlabel('Traffic Volume')

plt.subplot(1, 2, 2)
plt.hist(night['traffic_volume'])
plt.xlim(-100, 7500)
plt.ylim(0, 8000)
plt.title('Traffic Volume: Night')
plt.ylabel('Frequency')
plt.xlabel('Traffic Volume')

plt.show()

Traffic by Day and Night

Perfect! Our hypothesis is confirmed. The low-traffic peak corresponds entirely to nighttime hours, while the high-traffic peak occurs during daytime. Notice how I set the same axis limits for both plots—this ensures fair visual comparison.

Let's quantify this difference:

print(f"Day traffic mean: {day['traffic_volume'].mean():.0f} vehicles/hour")
print(f"Night traffic mean: {night['traffic_volume'].mean():.0f} vehicles/hour")
Day traffic mean: 4762 vehicles/hour
Night traffic mean: 1785 vehicles/hour

Day traffic is nearly 3x higher than night traffic on average!

Monthly Traffic Patterns

Now let's explore seasonal patterns by examining traffic by month:

day['month'] = day['date_time'].dt.month
by_month = day.groupby('month').mean(numeric_only=True)

plt.plot(by_month['traffic_volume'], marker='o')
plt.title('Traffic volume by month')
plt.xlabel('Month')
plt.show()

Traffic by Month

The plot reveals:

  • Winter months (Jan, Feb, Nov, Dec) have notably lower traffic
  • A dramatic dip in July that seems anomalous

Let's investigate that July anomaly:

day['year'] = day['date_time'].dt.year
only_july = day[day['month'] == 7]

plt.plot(only_july.groupby('year').mean(numeric_only=True)['traffic_volume'])
plt.title('July Traffic by Year')
plt.show()

Traffic by Year

Learning Insight: This is a perfect example of why exploratory visualization is so valuable. That July dip? It turns out I-94 was completely shut down for several days in July 2016. Those zero-traffic days pulled down the monthly average dramatically. This is a reminder that outliers can significantly impact means, so always investigate unusual patterns in your data!

Day of Week Patterns

Let's examine weekly patterns:

day['dayofweek'] = day['date_time'].dt.dayofweek
by_dayofweek = day.groupby('dayofweek').mean(numeric_only=True)

plt.plot(by_dayofweek['traffic_volume'])

# Add day labels for readability
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
plt.xticks(range(len(days)), days)
plt.xlabel('Day of Week')
plt.ylabel('Traffic Volume')
plt.title('Traffic by Day of Week')
plt.show()

Traffic by Day of Week

Clear pattern: weekday traffic is significantly higher than weekend traffic. This aligns with commuting patterns because most people drive to work Monday through Friday.

Hourly Patterns: Weekday vs. Weekend

Let's dig deeper into hourly patterns, comparing business days to weekends:

day['hour'] = day['date_time'].dt.hour
business_days = day.copy()[day['dayofweek'] <= 4]  # Monday-Friday
weekend = day.copy()[day['dayofweek'] >= 5]        # Saturday-Sunday

by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)

plt.figure(figsize=(11,3.5))

plt.subplot(1, 2, 1)
plt.plot(by_hour_business['traffic_volume'])
plt.xlim(6,20)
plt.ylim(1500,6500)
plt.title('Traffic Volume By Hour: Monday–Friday')

plt.subplot(1, 2, 2)
plt.plot(by_hour_weekend['traffic_volume'])
plt.xlim(6,20)
plt.ylim(1500,6500)
plt.title('Traffic Volume By Hour: Weekend')

plt.show()

Traffic by Hour

The patterns are strikingly different:

  • Weekdays: Clear morning (7 AM) and evening (4-5 PM) rush hour peaks
  • Weekends: Gradual increase through the day with no distinct peaks
  • Best time to travel on weekdays: 10 AM (between rush hours)

Weather Impact Analysis

Now let's explore whether weather conditions affect traffic:

weather_cols = ['clouds_all', 'snow_1h', 'rain_1h', 'temp', 'traffic_volume']
correlations = day[weather_cols].corr()['traffic_volume'].sort_values()
print(correlations)
clouds_all       -0.032932
snow_1h           0.001265
rain_1h           0.003697
temp              0.128317
traffic_volume    1.000000
Name: traffic_volume, dtype: float64

Surprisingly weak correlations! Weather doesn't seem to significantly impact traffic volume. Temperature shows the strongest correlation at just 13%.

Let's visualize this with a scatter plot:

plt.figure(figsize=(10,6))
sns.scatterplot(x='traffic_volume', y='temp', hue='dayofweek', data=day)
plt.ylim(230, 320)
plt.show()

Traffic Analysis

Learning Insight: When I first created this scatter plot, I got excited seeing distinct clusters. Then I realized the colors just correspond to our earlier finding—weekends (darker colors) have lower traffic. This is a reminder to always think critically about what patterns actually mean, not just that they exist!

Let's examine specific weather conditions:

by_weather_main = day.groupby('weather_main').mean(numeric_only=True).sort_values('traffic_volume')

plt.barh(by_weather_main.index, by_weather_main['traffic_volume'])
plt.axvline(x=5000, linestyle="--", color="k")
plt.show()

Traffic Analysis and Weather Impact Analysis

Learning Insight: This is a critical lesson in data analysis: always check your sample sizes! Those weather conditions with seemingly high traffic volumes? They only have 1-4 data points each. You can't draw reliable conclusions from such small samples. The most common weather conditions (clear skies, scattered clouds) have thousands of data points and show average traffic levels.
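
A quick way to check this yourself is to count how many rows sit behind each category before trusting the bars:

# How many daytime observations fall under each weather_main category?
print(day['weather_main'].value_counts())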

Key Findings and Conclusions

Through our exploratory visualization, we've discovered:

Time-Based Indicators of Heavy Traffic:

  1. Day vs. Night: Daytime (7 AM - 7 PM) has 3x more traffic than nighttime
  2. Day of Week: Weekdays have significantly more traffic than weekends
  3. Rush Hours: 7-8 AM and 4-5 PM on weekdays show highest volumes
  4. Seasonal: Winter months (Jan, Feb, Nov, Dec) have lower traffic volumes

Weather Impact:

  • Surprisingly minimal correlation between weather and traffic volume
  • Temperature shows weak positive correlation (13%)
  • Rain and snow show almost no correlation
  • This suggests commuters drive regardless of weather conditions

Best Times to Travel:

  • Avoid: Weekday rush hours (7-8 AM, 4-5 PM)
  • Optimal: Weekends, nights, or mid-day on weekdays (around 10 AM)

Next Steps

To extend this analysis, consider:

  1. Holiday Analysis: Expand holiday markers to cover all 24 hours and analyze holiday traffic patterns
  2. Weather Persistence: Does consecutive hours of rain/snow affect traffic differently?
  3. Outlier Investigation: Deep dive into the July 2016 shutdown and other anomalies
  4. Predictive Modeling: Build a model to forecast traffic volume based on time and weather
  5. Directional Analysis: Compare eastbound vs. westbound traffic patterns

This project perfectly demonstrates the power of exploratory visualization. We started with a simple question (what causes heavy traffic?) and, through systematic visualization, uncovered clear patterns. The weather findings surprised me; I expected rain and snow to significantly impact traffic. This reminds us to let data challenge our assumptions!

More Projects to Try

We have some other project walkthrough tutorials you may also enjoy:

Pretty graphs are nice, but they're not the point. The real value of exploratory data analysis comes when you dig deep enough to actually understand what's happening in your data so you can make smart decisions based on what you find. Whether you're a commuter planning your route or a city planner optimizing traffic flow, these insights provide actionable intelligence.

If you give this project a go, please share your findings in the Dataquest community and tag me (@Anna_Strahl). I'd love to see what patterns you discover!

Happy analyzing!

Intro to Docker Compose

17 July 2025 at 00:09

As your data projects grow, they often involve more than one piece, like a database and a script. Running everything by hand can get tedious and error-prone. One service needs to start before another. A missed environment variable can break the whole flow.

Docker Compose makes this easier. It lets you define your full setup in one file and run everything with a single command.

In this tutorial, you’ll build a simple ETL (Extract, Transform, Load) workflow using Docker Compose. It includes two services:

  1. a PostgreSQL container that stores product data, and
  2. a Python container that loads and processes that data.

You’ll learn how to define multi-container apps, connect services, and test your full stack locally, all with a single Compose command.

If you completed the previous Docker tutorial, you’ll recognize some parts of this setup, but you don’t need that tutorial to succeed here.

What is Docker Compose?

By default, Docker runs one container at a time using docker run commands, which can get long and repetitive. That works for quick tests, but as soon as you need multiple services, or just want to avoid copy/paste errors, it becomes fragile.

Docker Compose simplifies this by letting you define your setup in a single file: docker-compose.yaml. That file describes each service in your app, how they connect, and how to configure them. Once that’s in place, Compose handles the rest: it builds images, starts containers in the correct order, and connects everything over a shared network, all in one step.

Compose is just as useful for small setups, like a script and a database: defining everything in one file means fewer commands to retype by hand and fewer chances for error.

To see how that works in practice, we’ll start by launching a Postgres database with Docker Compose. From there, we’ll add a second container that runs a Python script and connects to the database.

Run Postgres with Docker Compose (Single Service)

Say your team is working with product data from a new vendor. You want to spin up a local PostgreSQL database so you can start writing and testing your ETL logic before deploying it elsewhere. In this early phase, it’s common to start with minimal data, sometimes even a single test row, just to confirm your pipeline works end to end before wiring up real data sources.

In this section, we’ll spin up a Postgres database using Docker Compose. This sets up a local environment we can reuse as we build out the rest of the pipeline.

Before adding the Python ETL script, we’ll start with just the database service. This “single service” setup gives us a clean, isolated container that persists data using a Docker volume and can be connected to using either the terminal or a GUI.

Step 1: Create a project folder

In your terminal, make a new folder for this project and move into it:

mkdir compose-demo
cd compose-demo

You’ll keep all your Docker Compose files and scripts here.

Step 2: Write the Docker Compose file

Inside the folder, create a new file called docker-compose.yaml and add the following content:

services:
  db:
    image: postgres:15
    container_name: local_pg
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: products
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

This defines a service named db that runs the official postgres:15 image, sets some environment variables, exposes port 5432, and uses a named volume for persistent storage.

Tip: If you already have PostgreSQL running locally, port 5432 might be in use. You can avoid conflicts by changing the host port. For example:

ports:
  - "5433:5432"

This maps port 5433 on your machine to port 5432 inside the container.
You’ll then need to connect to localhost:5433 instead of localhost:5432.

If you did the “Intro to Docker” tutorial, this configuration should look familiar. Here’s how the two approaches compare:

docker run command                    docker-compose.yaml equivalent
--name local_pg                       container_name: local_pg
-e POSTGRES_USER=postgres             environment: section
-p 5432:5432                          ports: section
-v pgdata:/var/lib/postgresql/data    volumes: section
postgres:15                           image: postgres:15

With this Compose file in place, we’ve turned a long command into something easier to maintain, and we’re one step away from launching our database.

Step 3: Start the container

From the same folder, run:

docker compose up

Docker will read the file, pull the Postgres image if needed, create the volume, and start the container. You should see logs in your terminal showing the database initializing. If you see a port conflict error, scroll back to Step 2 for how to change the host port.

You can now connect to the database just like before, either by using:

  • docker compose exec db bash to get inside the container, or
  • connecting to localhost:5432 using a GUI like DBeaver or pgAdmin.

From there, you can run psql -U postgres -d products to interact with the database.

Step 4: Shut it down

When you’re done, press Ctrl+C to stop the container. This sends a signal to gracefully shut it down while keeping everything else in place, including the container and volume.

If you want to clean things up completely, run:

docker compose down

This stops and removes the container and network, but leaves the volume intact. The next time you run docker compose up, your data will still be there.
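
If you ever want a completely fresh start, data included, Compose can remove the named volume as well:

docker compose down -v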

We’ve now launched a production-grade database using a single command! Next, we’ll write a Python script to connect to this database and run a simple data operation.

Write a Python ETL Script

In the earlier Docker tutorial, we loaded a CSV file into Postgres using the command line. That works well when the file is clean and the schema is known, but sometimes we need to inspect, validate, or transform the data before loading it.

This is where Python becomes useful.

In this step, we’ll write a small ETL script that connects to the Postgres container and inserts a new row. It simulates the kind of insert logic you'd run on a schedule, and keeps the focus on how Compose helps coordinate it.

We’ll start by writing and testing the script locally, then containerize it and add it to our Compose setup.

Step 1: Install Python dependencies

To connect to a PostgreSQL database from Python, we’ll use a library called psycopg2. It’s a reliable, widely-used driver that lets our script execute SQL queries, manage transactions, and handle database errors.

We’ll be using the psycopg2-binary version, which includes all necessary build dependencies and is easier to install.

From your terminal, run:

pip install psycopg2-binary

This installs the package locally so you can run and test your script before containerizing it. Later, you’ll include the same package inside your Docker image.

Step 2: Start building the script

Create a new file in the same folder called app.py. You’ll build your script step by step.

Start by importing the required libraries and setting up your connection settings:

import psycopg2
import os

Note: We’re importing psycopg2 even though we installed psycopg2-binary. What’s going on here?
The psycopg2-binary package installs the same core psycopg2 library, just bundled with precompiled dependencies so it’s easier to install. You still import it as psycopg2 in your code because that’s the actual library name. The -binary part just refers to how it’s packaged, not how you use it.

Next, in the same app.py file, define the database connection settings. These will be read from environment variables that Docker Compose supplies when the script runs in a container.

If you’re testing locally, you can override them by setting the variables inline when running the script (we’ll see an example shortly).

Add the following lines:

db_host = os.getenv("DB_HOST", "db")
db_port = os.getenv("DB_PORT", "5432")
db_name = os.getenv("POSTGRES_DB", "products")
db_user = os.getenv("POSTGRES_USER", "postgres")
db_pass = os.getenv("POSTGRES_PASSWORD", "postgres")

Tip: If you changed the host port in your Compose file (for example, to 5433:5432), be sure to set DB_PORT=5433 when testing locally, or the connection may fail.

To override the host when testing locally:

DB_HOST=localhost python app.py

To override both the host and port:

DB_HOST=localhost DB_PORT=5433 python app.py

We use "db" as the default hostname because that’s the name of the Postgres service in your Compose file. When the pipeline runs inside Docker, Compose connects both containers to the same private network, and the db hostname will automatically resolve to the correct container.

Step 3: Insert a new row

Rather than loading a dataset from CSV or SQL, you’ll write a simple ETL operation that inserts a single new row into the vegetables table. This simulates a small “load” job like you might run on a schedule to append new data to a growing table.

Add the following code to app.py:

new_vegetable = ("Parsnips", "Fresh", 2.42, 2.19)

This tuple matches the schema of the table you’ll create in the next step.

Step 4: Connect to Postgres and insert the row

Now add the logic to connect to the database and run the insert:

try:
    conn = psycopg2.connect(
        host=db_host,
        port=int(db_port), # Cast to int since env vars are strings
        dbname=db_name,
        user=db_user,
        password=db_pass
    )
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS vegetables (
            id SERIAL PRIMARY KEY,
            name TEXT,
            form TEXT,
            retail_price NUMERIC,
            cup_equivalent_price NUMERIC
        );
    """)

    cur.execute(
        """
        INSERT INTO vegetables (name, form, retail_price, cup_equivalent_price)
        VALUES (%s, %s, %s, %s);
        """,
        new_vegetable
    )

    conn.commit()
    cur.close()
    conn.close()
    print("ETL complete. 1 row inserted.")

except Exception as e:
    print("Error during ETL:", e)

This code connects to the database using your earlier environment variable settings.
It then creates the vegetables table (if it doesn’t exist) and inserts the sample row you defined earlier.

If the table already exists, Postgres will leave it alone thanks to CREATE TABLE IF NOT EXISTS. This makes the script safe to run more than once without breaking.

Note: This script will insert a new row every time it runs, even if the row is identical. That’s expected in this example, since we’re focusing on how Compose coordinates services, not on deduplication logic. In a real ETL pipeline, you’d typically add logic to avoid duplicates using techniques like:

  • checking for existing data before insert,
  • using ON CONFLICT clauses,
  • or cleaning the table first with TRUNCATE.

We’ll cover those patterns in a future tutorial.
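
As a preview of the first option, a check-before-insert version of the load step might look like this; it's a sketch only, reusing the cursor and tuple from the script above:

# Only insert the row if an identical (name, form) pair isn't already there
cur.execute(
    "SELECT 1 FROM vegetables WHERE name = %s AND form = %s;",
    (new_vegetable[0], new_vegetable[1])
)
if cur.fetchone() is None:
    cur.execute(
        """
        INSERT INTO vegetables (name, form, retail_price, cup_equivalent_price)
        VALUES (%s, %s, %s, %s);
        """,
        new_vegetable
    )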

Step 5: Run the script

If you shut down your Postgres container in the previous step, you’ll need to start it again before running the script. From your project folder, run:

docker compose up -d

The -d flag stands for “detached.” It tells Docker to start the container and return control to your terminal so you can run other commands, like testing your Python script.

Once the database is running, test your script by running:

python app.py

If everything is working, you should see output like:

ETL complete. 1 row inserted.

If you get an error like:

could not translate host name "db" to address: No such host is known

That means the script can’t find the database. Scroll back to Step 2 for how to override the hostname when testing locally.

You can verify the results by connecting to the database service and running a quick SQL query. If your Compose setup is still running in the background, run:

docker compose exec db psql -U postgres -d products

This opens a psql session inside the running container. Then try:

SELECT * FROM vegetables ORDER BY id DESC LIMIT 5;

You should see the most recent row, Parsnips, in the results. To exit the session, type \q.

In the next step, you’ll containerize this Python script, add it to your Compose setup, and run the whole ETL pipeline with a single command.

Build a Custom Docker Image for the ETL App

So far, you’ve written a Python script that runs locally and connects to a containerized Postgres database. Now you’ll containerize the script itself, so it can run anywhere, even as part of a larger pipeline.

Before we build it, let’s quickly refresh the difference between a Docker image and a Docker container. A Docker image is a blueprint for a container. It defines everything the container needs: the base operating system, installed packages, environment variables, files, and the command to run. When you run an image, Docker creates a live, isolated environment called a container.

You’ve already used prebuilt images like postgres:15. Now you’ll build your own.

Step 1: Create a Dockerfile

Inside your compose-demo folder, create a new file called Dockerfile (no file extension). Then add the following:

FROM python:3.10-slim

WORKDIR /app

COPY app.py .

RUN pip install psycopg2-binary

CMD ["python", "app.py"]

Let’s walk through what this file does:

  • FROM python:3.10-slim starts with a minimal Debian-based image that includes Python.
  • WORKDIR /app creates a working directory where your code will live.
  • COPY app.py . copies your script into that directory inside the container.
  • RUN pip install psycopg2-binary installs the same Postgres driver you used locally.
  • CMD [...] sets the default command that will run when the container starts.

Step 2: Build the image

To build the image, run this from the same folder as your Dockerfile:

docker build -t etl-app .

This command:

  • Uses the current folder (.) as the build context
  • Looks for a file called Dockerfile
  • Tags the resulting image with the name etl-app

Once the build completes, check that it worked:

docker images

You should see etl-app listed in the output.

Step 3: Try running the container

Now try running your new container:

docker run etl-app

This will start the container and run the script, but unless your Postgres container is still running, it will likely fail with a connection error.

That’s expected.

Right now, the Python container doesn’t know how to find the database because there’s no shared network, no environment variables, and no Compose setup. You’ll fix that in the next step by adding both services to a single Compose file.

Update the docker-compose.yaml

Earlier in the tutorial, we used Docker Compose to define and run a single service: a Postgres database. Now that our ETL app is containerized, we’ll update our existing docker-compose.yaml file to run both services — the database and the app — in a single, connected setup.

Docker Compose will handle building the app, starting both containers, connecting them over a shared network, and passing the right environment variables, all in one command. This setup makes it easy to swap out the app or run different versions just by updating the docker-compose.yaml file.

Step 1: Add the app service to your Compose file

Open docker-compose.yaml and add the following under the existing services: section:

  app:
    build: .
    depends_on:
      - db
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: products
      DB_HOST: db

This tells Docker to:

  • Build the app using the Dockerfile in your current folder
  • Wait for the database to start before running
  • Pass in environment variables so the app can connect to the Postgres container

You don’t need to modify the db service or the volumes: section — leave those as they are.

Step 2: Run and verify the full stack

With both services defined, we can now start the full pipeline with a single command:

docker compose up --build -d

This will rebuild our app image (if needed), launch both containers in the background, and connect them over a shared network.

Once the containers are up, check the logs from your app container to verify that it ran successfully:

docker compose logs app

Look for this line:

ETL complete. 1 row inserted.

That means the app container was able to connect to the database and run its logic successfully.

If you get a database connection error, try running the command again. Compose’s depends_on ensures the database starts first, but doesn’t wait for it to be ready. In production, you’d use retry logic or a wait-for-it script to handle this more gracefully.
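
One Compose-native way to close that gap is to give the db service a health check and have the app wait for it. A hedged sketch (pg_isready ships with the official Postgres image):

services:
  db:
    # ...existing settings...
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d products"]
      interval: 5s
      timeout: 5s
      retries: 5

  app:
    # ...existing settings...
    depends_on:
      db:
        condition: service_healthy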

To confirm the row was actually inserted into the database, open a psql session inside the running container:

docker compose exec db psql -U postgres -d products

Then run a quick SQL query:

SELECT * FROM vegetables ORDER BY id DESC LIMIT 5;

You should see your most recent row (Parsnips) in the output. Type \q to exit.

Step 3: Shut it down

When you're done testing, stop and remove the containers with:

docker compose down

This tears down both containers but leaves your named volume (pgdata) intact so your data will still be there next time you start things up.

Clean Up and Reuse

To run your pipeline again, just restart the services:

docker compose up

Because your Compose setup uses a named volume (pgdata), your database will retain its data between runs, even after shutting everything down.

Each time you restart the pipeline, the app container will re-run the script and insert the same row unless you update the script logic. In a real pipeline, you'd typically prevent that with checks, truncation, or ON CONFLICT clauses.

You can now test, tweak, and reuse this setup as many times as needed.

Push Your App Image to Docker Hub (optional)

So far, our ETL app runs locally. But what if we want to run it on another machine, share it with a teammate, or deploy it to the cloud?

Docker makes that easy through container registries, which are places where we can store and share Docker images. The most common registry is Docker Hub, which offers free accounts and public repositories. Note that this step is optional and mostly useful if you want to experiment with sharing your image or using it on another computer.

Step 1: Create a Docker Hub account

If you don’t have one yet, go to hub.docker.com and sign up for a free account. Once you’re in, you can create a new repository (for example, etl-app).

Step 2: Tag your image

Docker images need to be tagged with your username and repository name before you can push them. For example, if your username is myname, run:

docker tag etl-app myname/etl-app:latest

This gives your local image a new name that points to your Docker Hub account.

Step 3: Push the image

Log in from your terminal:

docker login

Then push the image:

docker push myname/etl-app:latest

Once it’s uploaded, you (or anyone else) can pull and run the image from anywhere:

docker pull myname/etl-app:latest

This is especially useful if you want to:

  • Share your ETL container with collaborators
  • Use it in cloud deployments or CI pipelines
  • Back up your work in a versioned registry

If you're not ready to create an account, you can skip this step and your image will still work locally as part of your Compose setup.

Wrap-Up and Next Steps

You’ve built and containerized a complete data pipeline using Docker Compose.

Along the way, you learned how to:

  • Build and run custom Docker images
  • Define multi-service environments with a Compose file
  • Pass environment variables and connect services
  • Use volumes for persistent storage
  • Run, inspect, and reuse your full stack with one command

This setup mirrors how real-world data pipelines are often prototyped and tested because Compose gives you a reliable, repeatable way to build and share these workflows.

Where to go next

Here are a few ideas for expanding your project:

  • Schedule your pipeline: Use something like Airflow to run the job on a schedule.
  • Add logging or alerts: Log ETL status to a file or send notifications if something fails.
  • Transform data or add validations: Add more steps to your script to clean, enrich, or validate incoming data.
  • Write tests: Validate that your script does what you expect, especially as it grows.
  • Connect to real-world data sources: Pull from APIs or cloud storage buckets and load the results into Postgres.

Once you’re comfortable with Docker Compose, you’ll be able to spin up production-like environments in seconds, which is a huge win for testing, onboarding, and deployment.

If you're hungry to learn even more, check out our next tutorial: Advanced Concepts in Docker Compose.
