Understanding, Generating, and Visualizing Embeddings

27 October 2025 at 23:01

Imagine you're searching through a massive library of data science papers looking for content about "cleaning messy datasets." A traditional keyword search returns papers that literally contain those exact words. But it completely misses an excellent paper about "handling missing values and duplicates" and another about "data validation techniques." Even though these papers teach exactly what you're looking for, you'll never see them because they're using different words.

This is the fundamental problem with keyword-based searches: they match words, not meaning. When you search for "neural network training," it won't connect you to papers about "optimizing deep learning models" or "improving model convergence," despite these being essentially the same topic.

Embeddings solve this by teaching machines to understand meaning instead of just matching text. And if you're serious about building AI systems, generating embeddings is a fundamental concept you need to master.

What Are Embeddings?

Embeddings are numerical representations that capture semantic meaning. Instead of treating text as a collection of words to match, embeddings convert text into vectors (a list of numbers) where similar meanings produce similar vectors. Think of it like translating human language into a mathematical language that computers can understand and compare.

When we convert two pieces of text that mean similar things into embeddings, those embedding vectors will be mathematically close to each other in the embedding space. Think of the embedding space as a multi-dimensional map where meaning determines location. Papers about machine learning will cluster together. Papers about data cleaning will form their own group. And papers about data visualization? They'll gather in a completely different region. In a moment, we'll create a visualization that clearly demonstrates this.

Setting Up Your Environment

Before we start working directly with embeddings, let's install the libraries we'll need. We'll use sentence-transformers from Hugging Face to generate embeddings, scikit-learn for dimensionality reduction, matplotlib for visualization, and numpy to handle the numerical arrays we'll be working with.

This tutorial was developed using Python 3.12.12 with the following library versions. You can use these exact versions for guaranteed compatibility, or install the latest versions (which should work just fine):

# Developed with: Python 3.12.12
# sentence-transformers==5.1.1
# scikit-learn==1.6.1
# matplotlib==3.10.0
# numpy==2.0.2

pip install sentence-transformers scikit-learn matplotlib numpy

Run the command above in your terminal to install all required libraries. This will work whether you're using a Python script, Jupyter notebook, or any other development environment.

For this tutorial series, we'll work with research paper abstracts from arXiv.org, a repository where researchers publish cutting-edge AI and machine learning papers. If you're building AI systems, arXiv is a great resource: it's where you'll find the latest research on new architectures, techniques, and approaches that you can apply in your own projects.

arXiv is pronounced "archive" because the X represents the Greek letter chi ⟨χ⟩.

For this tutorial, we've manually created 12 abstracts for papers spanning machine learning, data engineering, and data visualization. These abstracts are stored directly in our code as Python strings, keeping things simple for now. We'll work with APIs and larger datasets in the next tutorial to automate this process.

# Abstracts from three data science domains
papers = [
    # Machine Learning Papers
    {
        'title': 'Building Your First Neural Network with PyTorch',
        'abstract': '''Learn how to construct and train a neural network from scratch using PyTorch. This paper covers the fundamentals of defining layers, activation functions, and forward propagation. You'll build a multi-layer perceptron for classification tasks, understand backpropagation, and implement gradient descent optimization. By the end, you'll have a working model that achieves over 90% accuracy on the MNIST dataset.'''
    },
    {
        'title': 'Preventing Overfitting: Regularization Techniques Explained',
        'abstract': '''Overfitting is one of the most common challenges in machine learning. This guide explores practical regularization methods including L1 and L2 regularization, dropout layers, and early stopping. You'll learn how to detect overfitting by monitoring training and validation loss, implement regularization in both scikit-learn and TensorFlow, and tune regularization hyperparameters to improve model generalization on unseen data.'''
    },
    {
        'title': 'Hyperparameter Tuning with Grid Search and Random Search',
        'abstract': '''Selecting optimal hyperparameters can dramatically improve model performance. This paper demonstrates systematic approaches to hyperparameter optimization using grid search and random search. You'll learn how to define hyperparameter spaces, implement cross-validation during tuning, and use scikit-learn's GridSearchCV and RandomizedSearchCV. We'll compare both methods and discuss when to use each approach for efficient model optimization.'''
    },
    {
        'title': 'Transfer Learning: Using Pre-trained Models for Image Classification',
        'abstract': '''Transfer learning lets you leverage pre-trained models to solve new problems with limited data. This paper shows how to use pre-trained convolutional neural networks like ResNet and VGG for custom image classification tasks. You'll learn how to freeze layers, fine-tune network weights, and adapt pre-trained models to your specific domain. We'll build a classifier that achieves high accuracy with just a few hundred training images.'''
    },

    # Data Engineering/ETL Papers
    {
        'title': 'Handling Missing Data: Strategies and Best Practices',
        'abstract': '''Missing data can derail your analysis if not handled properly. This comprehensive guide covers detection methods for missing values, statistical techniques for understanding missingness patterns, and practical imputation strategies. You'll learn when to use mean imputation, forward fill, and more sophisticated approaches like KNN imputation. We'll work through real datasets with missing values and implement robust solutions using pandas.'''
    },
    {
        'title': 'Data Validation Techniques for ETL Pipelines',
        'abstract': '''Building reliable data pipelines requires thorough validation at every stage. This paper teaches you how to implement data quality checks, define validation rules, and catch errors before they propagate downstream. You'll learn schema validation, outlier detection, and referential integrity checks. We'll build a validation framework using Great Expectations and integrate it into an automated ETL workflow for production data systems.'''
    },
    {
        'title': 'Cleaning Messy CSV Files: A Practical Guide',
        'abstract': '''Real-world CSV files are rarely clean and analysis-ready. This hands-on paper walks through common data quality issues: inconsistent formatting, duplicate records, invalid entries, and encoding problems. You'll master pandas techniques for standardizing column names, removing duplicates, handling date parsing errors, and dealing with mixed data types. We'll transform a messy CSV with multiple quality issues into a clean dataset ready for analysis.'''
    },
    {
        'title': 'Building Scalable ETL Workflows with Apache Airflow',
        'abstract': '''Apache Airflow helps you build, schedule, and monitor complex data pipelines. This paper introduces Airflow's core concepts including DAGs, operators, and task dependencies. You'll learn how to define pipeline workflows, implement retry logic and error handling, and schedule jobs for automated execution. We'll build a complete ETL pipeline that extracts data from APIs, transforms it, and loads it into a data warehouse on a daily schedule.'''
    },

    # Data Visualization Papers
    {
        'title': 'Creating Interactive Dashboards with Plotly Dash',
        'abstract': '''Interactive dashboards make data exploration intuitive and engaging. This paper teaches you how to build web-based dashboards using Plotly Dash. You'll learn to create interactive charts with dropdowns, sliders, and date pickers, implement callbacks for dynamic updates, and design responsive layouts. We'll build a complete dashboard for exploring sales data with filters, multiple chart types, and real-time updates.'''
    },
    {
        'title': 'Matplotlib Best Practices: Making Publication-Quality Plots',
        'abstract': '''Creating clear, professional visualizations requires attention to design principles. This guide covers matplotlib best practices for publication-quality plots. You'll learn about color palette selection, font sizing and typography, axis formatting, and legend placement. We'll explore techniques for reducing chart clutter, choosing appropriate chart types, and creating consistent styling across multiple plots for research papers and presentations.'''
    },
    {
        'title': 'Data Storytelling: Designing Effective Visualizations',
        'abstract': '''Good visualizations tell a story and guide viewers to insights. This paper focuses on the principles of visual storytelling and effective chart design. You'll learn how to choose the right visualization for your data, apply pre-attentive attributes to highlight key information, and structure narratives through sequential visualizations. We'll analyze both effective and ineffective visualizations, discussing what makes certain design choices successful.'''
    },
    {
        'title': 'Building Real-Time Visualization Streams with Bokeh',
        'abstract': '''Visualizing streaming data requires specialized techniques and tools. This paper demonstrates how to create real-time updating visualizations using Bokeh. You'll learn to implement streaming data sources, update plots dynamically as new data arrives, and optimize performance for continuous updates. We'll build a live monitoring dashboard that displays streaming sensor data with smoothly updating line charts and real-time statistics.'''
    }
]

print(f"Loaded {len(papers)} paper abstracts")
print(f"Topics covered: Machine Learning, Data Engineering, and Data Visualization")
Loaded 12 paper abstracts
Topics covered: Machine Learning, Data Engineering, and Data Visualization

Generating Your First Embeddings

Now let's transform these paper abstracts into embeddings. We'll use a pre-trained model from the sentence-transformers library called all-MiniLM-L6-v2. We're using this model because it's fast and efficient for learning purposes, perfect for understanding the core concepts. In our next tutorial, we'll explore more recent production-grade models used in real-world applications.

The model will convert each abstract into an n-dimensional vector, where the value of n depends on the specific model architecture. Different embedding models produce vectors of different sizes. Some models create compact 128-dimensional embeddings, while others produce larger 768 or even 1024-dimensional vectors. Generally, larger embeddings can capture more nuanced semantic information, but they also require more computational resources and storage space.

Think of each dimension in the vector as capturing some aspect of the text's meaning. Maybe one dimension responds strongly to "machine learning" concepts, another to "data cleaning" terminology, and another to "visualization" language. The model learned these representations automatically during training.

Let's see what dimensionality our specific model produces.

from sentence_transformers import SentenceTransformer

# Load the pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract just the abstracts for embedding
abstracts = [paper['abstract'] for paper in papers]

# Generate embeddings for all abstracts
embeddings = model.encode(abstracts)

# Let's examine what we've created
print(f"Shape of embeddings: {embeddings.shape}")
print(f"Each abstract is represented by a vector of {embeddings.shape[1]} numbers")
print(f"\nFirst few values of the first embedding:")
print(embeddings[0][:10])
Shape of embeddings: (12, 384)
Each abstract is represented by a vector of 384 numbers

First few values of the first embedding:
[-0.06071806 -0.13064863  0.00328695 -0.04209436 -0.03220841  0.02034248
  0.0042156  -0.01300791 -0.1026612  -0.04565621]

Perfect! We now have 12 embeddings, one for each paper abstract. Each embedding is a 384-dimensional vector, represented as a NumPy array of floating-point numbers.

These numbers might look random at first, but they encode meaningful information about the semantic content of each abstract. When we want to find similar documents, we measure the cosine similarity between their embedding vectors. Cosine similarity looks at the angle between vectors: vectors pointing in similar directions (representing similar meanings) have high cosine similarity, even if their magnitudes differ. In a later tutorial, we'll compute vector similarity using cosine, Euclidean, and dot-product methods to compare different approaches.
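If you'd like a quick preview before then, here's a minimal sketch using scikit-learn's cosine_similarity helper to compare some of the embeddings we just generated: paper 1 against paper 2 (both machine learning) and paper 1 against paper 9 (a data visualization paper). The specific index choices are just for illustration.

from sklearn.metrics.pairwise import cosine_similarity

# Compare two ML papers, then an ML paper against a visualization paper
ml_vs_ml = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
ml_vs_viz = cosine_similarity([embeddings[0]], [embeddings[8]])[0][0]

print(f"ML vs. ML similarity: {ml_vs_ml:.3f}")
print(f"ML vs. visualization similarity: {ml_vs_viz:.3f}")

If the embeddings are doing their job, the first score should come out noticeably higher than the second.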

Before we move on, let's verify we can retrieve the original text:

# Let's look at one paper and its embedding
print("Paper title:", papers[0]['title'])
print("\nAbstract:", papers[0]['abstract'][:100] + "...")
print("\nEmbedding shape:", embeddings[0].shape)
print("Embedding type:", type(embeddings[0]))
Paper title: Building Your First Neural Network with PyTorch

Abstract: Learn how to construct and train a neural network from scratch using PyTorch. This paper covers the ...

Embedding shape: (384,)
Embedding type: <class 'numpy.ndarray'>

Great! We can still access the original paper text alongside its embedding. Throughout this tutorial, we'll work with these embeddings while keeping track of which paper each one represents.

Making Sense of High-Dimensional Spaces

We now have 12 vectors, each with 384 dimensions. But here's the issue: humans can't visualize 384-dimensional space. We struggle to imagine even four dimensions! To understand what our embeddings have learned, we need to reduce them to two dimensions so that we can plot them on a graph.

This is where dimensionality reduction comes in. We'll use Principal Component Analysis (PCA), a technique that finds the two most important dimensions (the ones that capture the most variation in our data). It's like taking a 3D object and finding the best angle to photograph it in 2D while preserving as much information as possible.

Our original 384-dimensional embeddings capture rich, nuanced information about semantic meaning, so when we squeeze them down to 2D, some subtleties are bound to get lost. But the major patterns (which papers belong to which topic) will still be clearly visible.

from sklearn.decomposition import PCA

# Reduce embeddings from 384 dimensions to 2 dimensions
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

print(f"Original embedding dimensions: {embeddings.shape[1]}")
print(f"Reduced embedding dimensions: {embeddings_2d.shape[1]}")
print(f"\nVariance explained by these 2 dimensions: {pca.explained_variance_ratio_.sum():.2%}")
Original embedding dimensions: 384
Reduced embedding dimensions: 2

Variance explained by these 2 dimensions: 41.20%

The variance explained tells us how much of the variation in the original data is preserved in these 2 dimensions. Think of it this way: if all our papers were identical, they'd have zero variance. The more different they are, the more variance. We've kept about 41% of that variation, which is plenty to see the major patterns. The clustering itself depends on whether papers use similar vocabulary, not on how much variance we've retained. So even though 41% might seem relatively low, the major patterns separating different topics will still be very clear in our embedding visualization.

Understanding Our Tutorial Topics

Before we create our embeddings visualization, let's see how the 12 papers are organized by topic. This will help us understand the patterns we're about to see in the embeddings:

# Print papers grouped by topic
print("=" * 80)
print("PAPER REFERENCE GUIDE")
print("=" * 80)

topics = [
    ("Machine Learning", list(range(0, 4))),
    ("Data Engineering/ETL", list(range(4, 8))),
    ("Data Visualization", list(range(8, 12)))
]

for topic_name, indices in topics:
    print(f"\n{topic_name}:")
    print("-" * 80)
    for idx in indices:
        print(f"  Paper {idx+1}: {papers[idx]['title']}")
================================================================================
PAPER REFERENCE GUIDE
================================================================================

Machine Learning:
--------------------------------------------------------------------------------
  Paper 1: Building Your First Neural Network with PyTorch
  Paper 2: Preventing Overfitting: Regularization Techniques Explained
  Paper 3: Hyperparameter Tuning with Grid Search and Random Search
  Paper 4: Transfer Learning: Using Pre-trained Models for Image Classification

Data Engineering/ETL:
--------------------------------------------------------------------------------
  Paper 5: Handling Missing Data: Strategies and Best Practices
  Paper 6: Data Validation Techniques for ETL Pipelines
  Paper 7: Cleaning Messy CSV Files: A Practical Guide
  Paper 8: Building Scalable ETL Workflows with Apache Airflow

Data Visualization:
--------------------------------------------------------------------------------
  Paper 9: Creating Interactive Dashboards with Plotly Dash
  Paper 10: Matplotlib Best Practices: Making Publication-Quality Plots
  Paper 11: Data Storytelling: Designing Effective Visualizations
  Paper 12: Building Real-Time Visualization Streams with Bokeh

Now that we know which tutorials belong to which topic, let's visualize their embeddings.

Visualizing Embeddings to Reveal Relationships

We're going to create a scatter plot where each point represents one paper abstract. We'll color-code them by topic so we can see how the embeddings naturally group similar content together.

import matplotlib.pyplot as plt
import numpy as np

# Create the visualization
plt.figure(figsize=(8, 6))

# Define colors for different topics
colors = ['#0066CC', '#CC0099', '#FF6600']
categories = ['Machine Learning', 'Data Engineering/ETL', 'Data Visualization']

# Create color mapping for each paper
color_map = []
for i in range(12):
    if i < 4:
        color_map.append(colors[0])  # Machine Learning
    elif i < 8:
        color_map.append(colors[1])  # Data Engineering
    else:
        color_map.append(colors[2])  # Data Visualization

# Plot each paper
for i, (x, y) in enumerate(embeddings_2d):
    plt.scatter(x, y, c=color_map[i], s=275, alpha=0.7, edgecolors='black', linewidth=1)
    # Add paper numbers as labels
    plt.annotate(str(i+1), (x, y), fontsize=10, fontweight='bold',
                ha='center', va='center')

plt.xlabel('First Principal Component', fontsize=14)
plt.ylabel('Second Principal Component', fontsize=14)
plt.title('Paper Embeddings from Three Data Science Topics\n(Papers close together have similar semantic meaning)',
          fontsize=15, fontweight='bold', pad=20)

# Add a legend showing which colors represent which topics
legend_elements = [plt.Line2D([0], [0], marker='o', color='w',
                              markerfacecolor=colors[i], markersize=12,
                              label=categories[i]) for i in range(len(categories))]
plt.legend(handles=legend_elements, loc='best', fontsize=12)

plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

What the Visualization Reveals About Semantic Similarity

Take a look at the visualization below that was generated using the code above. As you can see, the results are pretty striking! The embeddings have naturally organized themselves into three distinct regions based purely on semantic content.

Keep in mind that we deliberately chose papers from very distinct topics to make the clustering crystal clear. This is perfect for learning, but real-world datasets are messier. When you're working with papers that bridge multiple topics or have overlapping vocabulary, you'll see more gradual transitions between clusters rather than these sharp separations. We'll encounter that reality in the next tutorial when we work with hundreds of real arXiv papers.

Paper Embeddings from Three Data Science Topics (scatter plot generated by the code above)

  • The Machine Learning cluster (blue, papers 1-4) dominates the lower-left side of the plot. These four points sit close together because they all discuss neural networks, training, and model optimization. Look at papers 1 and 4. They're positioned very near each other despite one focusing on building networks from scratch and the other on transfer learning. The embedding model recognizes that they both use the core language of deep learning: layers, weights, training, and model architectures.
  • The Data Engineering/ETL cluster (magenta, papers 5-8) occupies the upper portion of the plot. These papers share vocabulary around data quality, pipelines, and validation. Notice how papers 5, 6, and 7 form a tight grouping. They all discuss data quality issues using terms like "missing values," "validation," and "cleaning." Paper 8 (about Airflow) sits slightly apart from the others, which makes sense: while it's definitely about data engineering, it focuses more on workflow orchestration than data quality, giving it a slightly different semantic fingerprint.
  • The Data Visualization cluster (orange, papers 9-12) is gathered on the lower-right side. These four papers are packed close together because they all use visualization-specific vocabulary: "charts," "dashboards," "colors," and "interactive elements." The tight clustering here shows just how distinct visualization terminology is from both ML and data engineering language.

What's remarkable is the clear separation between all three clusters. The distance between the ML papers on the left and the visualization papers on the right tells us that these topics use fundamentally different vocabularies. There's minimal semantic overlap between "neural networks" and "dashboards," so they end up far apart in the embedding space.

How the Model Learned to Understand Meaning

The all-MiniLM-L6-v2 embedding model was trained on millions of text pairs, learning which words tend to appear together. When it sees an abstract full of words like "layers," "training," and "optimization," it produces an embedding vector that's mathematically similar to other texts with that same vocabulary pattern. The clustering emerges naturally from those learned associations.

Why This Matters for Your Work as an AI Engineer

Embeddings are foundational to the modern AI systems you'll build as an AI Engineer. Let's look at how embeddings enable the core technologies you'll work with:

  1. Building Intelligent Search Systems

    Traditional keyword search has a fundamental limitation: it can only find exact matches. If a user searches for "handling null values," they won't find documents about "missing data strategies" or "imputation techniques," even though these are exactly what they need. Embeddings solve this by understanding semantic similarity. When you embed both the search query and your documents, you can find relevant content based on meaning rather than word matching. The result is a search system that actually understands what you're looking for (see the sketch after this list).

  2. Working with Vector Databases

    Vector databases are specialized databases that are built to store and query embeddings efficiently. Instead of SQL queries that match exact values, vector databases let you ask "find me all documents similar to this one" and get results ranked by semantic similarity. They're optimized for the mathematical operations that embeddings require, like calculating distances between high-dimensional vectors, which makes them essential infrastructure for AI applications. Modern systems often use hybrid search approaches that combine semantic similarity with traditional keyword matching to get the best of both worlds.

  3. Implementing Retrieval-Augmented Generation (RAG)

    RAG systems are one of the most powerful patterns in modern AI engineering. Here's how they work: you embed a large collection of documents (like company documentation, research papers, or knowledge bases). When a user asks a question, you embed their question and use that embedding to find the most relevant documents from your collection. Then you pass those documents to a language model, which generates an informed answer grounded in your specific data. Embeddings make the retrieval step possible because they're how the system knows which documents are relevant to the question.

  4. Creating AI Agents with Long-Term Memory

    AI agents that can remember past interactions and learn from experience need a way to store and retrieve relevant memories. Embeddings enable this. When an agent has a conversation or completes a task, you can embed the key information and store it in a vector database. Later, when the agent encounters a similar situation, it can retrieve relevant past experiences by finding embeddings close to the current context. This gives agents the ability to learn from history and make better decisions over time. In practice, long-term agent memory often uses similarity thresholds and time-weighted retrieval to prevent irrelevant or outdated information from being recalled.
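To make the search idea in point 1 concrete, here's a minimal sketch that reuses the model, embeddings, and papers objects from earlier in this tutorial. The query string is just an example (it echoes the search from the introduction), and the top-3 cutoff is arbitrary.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Example query -- not one of our abstracts
query = "cleaning messy datasets"

# Embed the query with the same model we used for the abstracts
query_embedding = model.encode([query])

# Rank all 12 papers by cosine similarity to the query
similarities = cosine_similarity(query_embedding, embeddings)[0]
top_indices = np.argsort(similarities)[::-1][:3]

for idx in top_indices:
    print(f"{similarities[idx]:.3f}  {papers[idx]['title']}")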

These four applications (search, vector databases, RAG, and AI agents) are foundational tools for any aspiring AI Engineer's toolkit. Each builds on embeddings as a core technology. Understanding how embeddings capture semantic meaning is the first step toward building production-ready AI systems.

Advanced Topics to Explore

As you continue learning about embeddings, you'll encounter several advanced techniques that are widely used in production systems:

  • Multimodal Embeddings allow you to embed different types of content (text, images, audio) into the same embedding space. This enables powerful cross-modal search capabilities, like finding images based on text descriptions or vice versa. Models like CLIP demonstrate how effective this approach can be.
  • Instruction-Tuned Embeddings are models fine-tuned to better understand specific types of queries or instructions. These specialized models often outperform general-purpose embeddings for domain-specific tasks like legal document search or medical literature retrieval.
  • Quantization reduces the precision of embedding values (from 32-bit floats to 8-bit integers, for example), which can dramatically reduce storage requirements and speed up similarity calculations with minimal impact on search quality. This becomes crucial when working with millions of embeddings.
  • Dimension Truncation takes advantage of the fact that the most important information in embeddings is often concentrated in the first dimensions. By keeping only the first 256 dimensions of a 768-dimensional embedding, you can achieve significant efficiency gains while preserving most of the semantic information.
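To show the mechanics of dimension truncation, here's a minimal sketch using the embeddings array from earlier. Keep in mind this is purely illustrative: truncation works best with models trained to concentrate the most important information in their leading dimensions, which isn't guaranteed for all-MiniLM-L6-v2.

import numpy as np

# Keep only the first 128 of our 384 dimensions (an illustrative split)
truncated = embeddings[:, :128]

# Re-normalize so cosine similarity behaves sensibly on the shorter vectors
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

print(f"Original shape:  {embeddings.shape}")
print(f"Truncated shape: {truncated.shape}")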

These techniques become increasingly important as you scale from prototype to production systems handling real-world data volumes.

Building Toward Production Systems

You've now learned the following core embedding concepts:

  • Embeddings convert text into numerical vectors that capture meaning
  • Similar content produces similar vectors
  • These relationships can be visualized to understand how the model organizes information

But we've only worked with 12 handwritten paper abstracts. This is perfect for getting the core concept, but real applications need to handle hundreds or thousands of documents.

In the next tutorial, we'll scale up dramatically. You'll learn how to collect documents programmatically using APIs, generate embeddings at scale, and make strategic decisions about different embedding approaches.

You'll also face the practical challenges that come with real data: rate limits on APIs, processing time for large datasets, the tradeoff between embedding quality and speed, and how to handle edge cases like empty documents or very long texts. These considerations separate a learning exercise from a production system.

By the end of the next tutorial, you'll be equipped to build an embedding system that handles real-world data at scale. That foundation will prepare you for our final embeddings tutorial, where we'll implement similarity search and build a complete semantic search engine.

Next Steps

For now, experiment with the code above:

  • Try replacing one of the paper abstracts with content from your own learning.
    • Where does it appear on the visualization?
    • Does it cluster with one of our three topics, or does it land somewhere in between?
  • Add a paper abstract that bridges two topics, like "Using Machine Learning to Optimize ETL Pipelines" (see the sketch after this list).
    • Does it position itself between the ML and data engineering clusters?
    • What does this tell you about how embeddings handle multi-topic content?
  • Try changing the embedding model to see how it affects the visualization.
    • Models like all-mpnet-base-v2 produce embeddings with a different dimensionality (768 dimensions instead of 384).
    • Do the clusters become tighter or more spread out?
  • Experiment with adding a completely unrelated abstract, like a cooking recipe or news article.
    • Where does it land relative to our three data science clusters?
    • How far away is it from the nearest cluster?
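For the bridging-topics experiment, here's a minimal sketch of how you might project a new abstract onto the existing plot. It reuses the model and the fitted pca object from earlier; the abstract text itself is made up for illustration.

# A hypothetical abstract that bridges machine learning and data engineering
new_abstract = """Using machine learning to optimize ETL pipelines. This paper shows how to
train models that predict pipeline failures, automatically tune batch sizes, and prioritize
data validation checks based on historical data quality issues."""

# Embed it with the same model, then project it with the PCA we already fit
new_embedding = model.encode([new_abstract])
new_point_2d = pca.transform(new_embedding)

print(f"New abstract lands at: {new_point_2d[0]}")

From there, an extra plt.scatter call with new_point_2d will show where the abstract falls relative to the three clusters.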

This hands-on exploration and experimentation will deepen your intuition about how embeddings work.

Ready to scale things up? In the next tutorial, we'll work with real arXiv data and build an embedding system that can handle thousands of papers. See you there!


Key Takeaways:

  • Embeddings convert text into numerical vectors that capture semantic meaning
  • Similar meanings produce similar vectors, enabling mathematical comparison of concepts
  • Papers from different topics cluster separately because they use distinct vocabulary
  • Dimensionality reduction (like PCA) helps visualize high-dimensional embeddings in 2D
  • Embeddings power modern AI systems, including semantic search, vector databases, RAG, and AI agents

20 Fun (and Unique) Data Analyst Projects for Beginners

26 October 2025 at 21:23

You're here because you're serious about becoming a data analyst. You've probably noticed that just about every data analytics job posting asks for experience. But how do you get experience if you're just starting out? The answer: build a solid portfolio of data analytics projects so that you can land a job as a junior data analyst, even without prior work experience.

Your portfolio is your ticket to proving your capabilities to a potential employer. Even without previous job experience, a well-curated collection of data analytics projects can set you apart from the competition. They demonstrate that you can tackle real-world problems with real data, showcasing your ability to clean datasets, create compelling visualizations, and extract meaningful insights: skills that are in high demand.

You just have to pick the ones that speak to you and get started!

Getting started with data analytics projects

So, you're ready to tackle your first data analytics project? Awesome! Let's break down what you need to know to set yourself up for success.

Our curated list of 20 projects below will help you develop the most sought-after data analysis skills and practice with the most frequently used tools: Python with Jupyter Notebook, R with RStudio, Excel, Tableau, and Power BI.

Setting up an effective development environment is also vital. Begin by creating a Python environment with Conda or venv, and use version control like Git to track project changes. Combining an interactive environment like Jupyter Notebook with core Python libraries will boost your productivity.

Remember, Rome wasn't built in a day! Start your data analysis journey with bite-sized projects to steadily build your skills. Keep learning, stay curious, and enjoy the ride. Before you know it, you'll be tackling real-world data challenges like the professionals do.

20 Data Analyst Projects for Beginners

Each project listed below will help you apply what you've learned to real data, growing your abilities one step at a time. While they are tailored towards beginners, some will be more challenging than others. By working through them, you'll create a portfolio that shows a potential employer you have the practical skills to analyze data on the job.

The data analytics projects below cover a range of analysis techniques, applications, and tools:

  1. Learn and Install Jupyter Notebook
  2. Profitable App Profiles for the App Store and Google Play Markets
  3. Exploring Hacker News Posts
  4. Clean and Analyze Employee Exit Surveys
  5. Star Wars Survey
  6. Word Raider
  7. Install RStudio
  8. Creating An Efficient Data Analysis Workflow
  9. Creating An Efficient Data Analysis Workflow, Part 2
  10. Preparing Data with Excel
  11. Visualizing the Answer to Stock Questions Using Spreadsheet Charts
  12. Identifying Customers Likely to Churn for a Telecommunications Provider
  13. Data Prep in Tableau
  14. Business Intelligence Plots
  15. Data Presentation
  16. Modeling Data in Power BI
  17. Visualization of Life Expectancy and GDP Variation Over Time
  18. Building a BI App
  19. Analyzing Kickstarter Projects
  20. Analyzing Startup Fundraising Deals from Crunchbase

In the following sections, you'll find step-by-step guides to walk you through each project. These detailed instructions will help you apply what you've learned and solidify your data analytics skills.

1. Learn and Install Jupyter Notebook

Overview

In this beginner-level project, you'll assume the role of a Jupyter Notebook novice aiming to gain the essential skills for real-world data analytics projects. You'll practice running code cells, documenting your work with Markdown, navigating Jupyter using keyboard shortcuts, mitigating hidden state issues, and installing Jupyter locally. By the end of the project, you'll be equipped to use Jupyter Notebook to work on data analytics projects and share compelling, reproducible notebooks with others.

Tools and Technologies

  • Jupyter Notebook
  • Python

Prerequisites

Before you take on this project, it's recommended that you have some foundational Python skills in place first.

Step-by-Step Instructions

  1. Get acquainted with the Jupyter Notebook interface and its components
  2. Practice running code cells and learn how execution order affects results
  3. Use keyboard shortcuts to efficiently navigate and edit notebooks
  4. Create Markdown cells to document your code and communicate your findings
  5. Install Jupyter locally to work on projects on your own machine

Expected Outcomes

Upon completing this project, you'll have gained practical experience and valuable skills, including:

  • Familiarity with the core components and workflow of Jupyter Notebook
  • Ability to use Jupyter Notebook to run code, perform analysis, and share results
  • Understanding of how to structure and document notebooks for real-world reproducibility
  • Proficiency in navigating Jupyter Notebook using keyboard shortcuts to boost productivity
  • Readiness to apply Jupyter Notebook skills to real-world data projects and collaborate with others


2. Profitable App Profiles for the App Store and Google Play Markets

Overview

In this guided project, you'll assume the role of a data analyst for a company that builds ad-supported mobile apps. By analyzing historical data from the Apple App Store and Google Play Store, you'll identify app profiles that attract the most users and generate the most revenue. Using Python and Jupyter Notebook, you'll clean the data, analyze it using frequency tables and averages, and make practical recommendations on the app categories and characteristics the company should target to maximize profitability.

Tools and Technologies

  • Python
  • Data Analytics
  • Jupyter Notebook

Prerequisites

This is a beginner-level project, but you should be comfortable working with Python functions and Jupyter Notebook:

  • Writing functions with arguments, return statements, and control flow
  • Debugging functions to ensure proper execution
  • Using conditional logic and loops within functions
  • Working with Jupyter Notebook to write and run code

Step-by-Step Instructions

  1. Open and explore the App Store and Google Play datasets
  2. Clean the datasets by removing non-English apps and duplicate entries
  3. Isolate the free apps for further analysis
  4. Determine the most common app genres and their characteristics using frequency tables
  5. Make recommendations on the ideal app profiles to maximize users and revenue

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Cleaning real-world data to prepare it for analysis
  • Analyzing app market data to identify trends and success factors
  • Applying data analysis techniques like frequency tables and calculating averages
  • Using data insights to inform business strategy and decision-making
  • Communicating your findings and recommendations to stakeholders


3. Exploring Hacker News Posts

Overview

In this project, you'll explore and analyze a dataset from Hacker News, a popular tech-focused community site. Using Python, you'll apply skills in string manipulation, object-oriented programming, and date management to uncover trends in user submissions and identify factors that drive community engagement. This hands-on project will strengthen your ability to interpret real-world datasets and enhance your data analysis skills.

Tools and Technologies

  • Python
  • Data cleaning
  • Object-oriented programming
  • Data Analytics
  • Jupyter Notebook

Prerequisites

To get the most out of this project, you should have some foundational Python and data cleaning skills, such as:

  • Employing loops in Python to explore CSV data
  • Utilizing string methods in Python to clean data for analysis
  • Processing dates from strings using the datetime library
  • Formatting dates and times for analysis using strftime

Step-by-Step Instructions

  1. Remove headers from a list of lists
  2. Extract 'Ask HN' and 'Show HN' posts
  3. Calculate the average number of comments for 'Ask HN' and 'Show HN' posts
  4. Find the number of 'Ask HN' posts and average comments by hour created
  5. Sort and print values from a list of lists

Expected Outcomes

After completing this project, you'll have gained practical experience and skills, including:

  • Applying Python string manipulation, OOP, and date handling to real-world data
  • Analyzing trends and patterns in user submissions on Hacker News
  • Identifying factors that contribute to post popularity and engagement
  • Communicating insights derived from data analysis


4. Clean and Analyze Employee Exit Surveys

Overview

In this hands-on project, you'll play the role of a data analyst for the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. Your task is to clean and analyze employee exit surveys from both institutes to identify insights into why employees resign. Using Python and pandas, you'll combine messy data from multiple sources, clean column names and values, analyze the data, and share your key findings.

Tools and Technologies

  • Python
  • Pandas
  • Data cleaning
  • Data Analytics
  • Jupyter Notebook

Prerequisites

Before starting this project, you should be familiar with:

  • Exploring and analyzing data using pandas
  • Aggregating data with pandas groupby operations
  • Combining datasets using pandas concat and merge functions
  • Manipulating strings and handling missing data in pandas

Step-by-Step Instructions

  1. Load and explore the DETE and TAFE exit survey data
  2. Identify missing values and drop unnecessary columns
  3. Clean and standardize column names across both datasets
  4. Filter the data to only include resignation reasons
  5. Verify data quality and create new columns for analysis
  6. Combine the cleaned datasets into one for further analysis
  7. Analyze the cleaned data to identify trends and insights

Expected Outcomes

By completing this project, you will:

  • Clean real-world, messy HR data to prepare it for analysis
  • Apply core data cleaning techniques in Python and pandas
  • Combine multiple datasets and conduct exploratory analysis
  • Analyze employee exit surveys to understand key drivers of resignations
  • Summarize your findings and share data-driven recommendations


5. Star Wars Survey

Overview

In this project designed for beginners, you'll become a data analyst exploring FiveThirtyEight's Star Wars survey data. Using Python and pandas, you'll clean messy data, map values, compute statistics, and analyze the data to uncover fan film preferences. By comparing results between demographic segments, you'll gain insights into how Star Wars fans differ in their opinions. This project provides hands-on practice with key data cleaning and analysis techniques essential for data analyst roles across industries.

Tools and Technologies

  • Python
  • Pandas
  • Jupyter Notebook

Prerequisites

Before starting this project, you should be comfortable cleaning and analyzing data with pandas.

Step-by-Step Instructions

  1. Map Yes/No columns to Boolean values to standardize the data
  2. Convert checkbox columns to lists and get them into a consistent format
  3. Clean and rename the ranking columns to make them easier to analyze
  4. Identify the highest-ranked and most-viewed Star Wars films
  5. Analyze the data by key demographic segments like gender, age, and location
  6. Summarize your findings on fan preferences and differences between groups

Expected Outcomes

After completing this project, you will have gained:

  • Experience cleaning and analyzing a real-world, messy dataset
  • Hands-on practice with pandas data manipulation techniques
  • Insights into the preferences and opinions of Star Wars fans
  • An understanding of how to analyze survey data for business insights


6. Word Raider

Overview

In this beginner-level Python project, you'll step into the role of a developer to create "Word Raider," an interactive word-guessing game. Although this project won't have you perform any explicit data analysis, it will sharpen your Python skills and make you a better data analyst. Using fundamental programming skills, you'll apply concepts like loops, conditionals, and file handling to build the game logic from the ground up. This hands-on project allows you to consolidate your Python knowledge by integrating key techniques into a fun application.

Tools and Technologies

  • Python
  • Jupyter Notebook

Prerequisites

Before diving into this project, you should have some foundational Python skills.

Step-by-Step Instructions

  1. Build the word bank by reading words from a text file into a Python list
  2. Set up variables to track the game state, like the hidden word and remaining attempts
  3. Implement functions to receive and validate user input for their guesses
  4. Create the game loop, checking guesses against the hidden word and providing feedback
  5. Update the game state after each guess and check for a win or loss condition

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Strengthened proficiency in fundamental Python programming concepts
  • Experience building an interactive, text-based game from scratch
  • Practice with file I/O, data structures, and basic object-oriented design
  • Improved problem-solving skills and ability to translate ideas into code


7. Install RStudio

Overview

In this beginner-level project, you'll take the first steps in your data analysis journey by installing R and RStudio. As an aspiring data analyst, you'll set up a professional programming environment and explore RStudio's features for efficient R coding and analysis. Through guided exercises, you'll write scripts, import data, and create visualizations, building key foundations for your career.

Tools and Technologies

  • R
  • RStudio

Prerequisites

To complete this project, it's recommended to have basic knowledge of:

  • R syntax and programming fundamentals
  • Variables, data types, and arithmetic operations in R
  • Logical and relational operators in R expressions
  • Importing, exploring, and visualizing datasets in R

Step-by-Step Instructions

  1. Install the latest version of R and RStudio on your computer
  2. Practice writing and executing R code in the Console
  3. Import a dataset into RStudio and examine its contents
  4. Write and save R scripts to organize your code
  5. Generate basic data visualizations using ggplot2

Expected Outcomes

By completing this project, you'll gain essential skills including:

  • Setting up an R development environment with RStudio
  • Navigating RStudio's interface for data science workflows
  • Writing and running R code in scripts and the Console
  • Installing and loading R packages for analysis and visualization
  • Importing, exploring, and visualizing data in RStudio


8. Creating An Efficient Data Analysis Workflow

Overview

In this hands-on project, you'll step into the role of a data analyst hired by a company selling programming books. Your mission is to analyze their sales data to determine which titles are most profitable. You'll apply key R programming concepts like control flow, loops, and functions to develop an efficient data analysis workflow. This project provides valuable practice in data cleaning, transformation, and analysis, culminating in a structured report of your findings and recommendations.

Tools and Technologies

  • R
  • RStudio
  • Data Analytics

Prerequisites

To successfully complete this project, you should have foundational skills in control flow, iteration, and functions in R:

  • Implementing control flow using if-else statements
  • Employing for loops and while loops for iteration
  • Writing custom functions to modularize code
  • Combining control flow, loops, and functions in R

Step-by-Step Instructions

  1. Get acquainted with the provided book sales dataset
  2. Transform and prepare the data for analysis
  3. Analyze the cleaned data to identify top performing titles
  4. Summarize your findings in a structured report
  5. Provide data-driven recommendations to stakeholders

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Applying R programming concepts to real-world data analysis
  • Developing an efficient, reproducible data analysis workflow
  • Cleaning and preparing messy data for analysis
  • Analyzing sales data to derive actionable business insights
  • Communicating findings and recommendations to stakeholders


9. Creating An Efficient Data Analysis Workflow, Part 2

Overview

In this hands-on project, you'll step into the role of a data analyst at a book company tasked with evaluating the impact of a new program, launched on July 1, 2019, to encourage customers to buy more books. Using your data analysis skills in R, you'll clean and process the company's 2019 sales data to determine if the program successfully boosted book purchases and improved review quality. This project allows you to apply key R packages like dplyr, stringr, and lubridate to efficiently analyze a real-world business dataset and deliver actionable insights.

Tools and Technologies

  • R
  • RStudio
  • dplyr
  • stringr
  • lubridate

Prerequisites

To successfully complete this project, you should have some specialized data-processing skills in R:

  • Manipulating strings using stringr functions
  • Working with dates and times using lubridate
  • Applying the map function to vectorize custom functions
  • Understanding and employing regular expressions for pattern matching

Step-by-Step Instructions

  1. Load and explore the book company's 2019 sales data
  2. Clean the data by handling missing values and inconsistencies
  3. Process the text reviews to determine positive/negative sentiment
  4. Compare key sales metrics like purchases and revenue before and after the July 1 program launch date
  5. Analyze differences in sales between customer segments

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Cleaning and preparing a real-world business dataset for analysis
  • Applying powerful R packages to manipulate and process data efficiently
  • Analyzing sales data to quantify the impact of a new initiative
  • Translating analysis findings into meaningful business insights


10. Preparing Data with Excel

Overview

In this hands-on project for beginners, you'll step into the role of a data professional in a marine biology research organization. Your mission is to prepare a raw dataset on shark attacks for an analysis team to study trends in attack locations and frequency over time. Using Excel, you'll import the data, organize it in worksheets and tables, handle missing values, and clean the data by removing duplicates and fixing inconsistencies. This project provides practical experience in the essential data preparation skills required for real-world analysis projects.

Tools and Technologies

  • Excel

Prerequisites

This project is designed for beginners. To complete it, you should be familiar with preparing data in Excel:

  • Importing data into Excel from various sources
  • Organizing spreadsheet data using worksheets and tables
  • Cleaning data by removing duplicates, fixing inconsistencies, and handling missing values
  • Consolidating data from multiple sources into a single table

Step-by-Step Instructions

  1. Import the raw shark attack data into an Excel workbook
  2. Organize the data into worksheets and tables with a logical structure
  3. Clean the data by removing duplicate entries and fixing inconsistencies
  4. Consolidate shark attack data from multiple sources into a single table

Expected Outcomes

By completing this project, you will gain:

  • Hands-on experience in data preparation and cleaning techniques using Excel
  • Foundational skills for importing, organizing, and cleaning data for analysis
  • An understanding of how to handle missing values and inconsistencies in a dataset
  • Ability to consolidate data from disparate sources into an analysis-ready format
  • Practical experience working with a real-world dataset on shark attacks
  • A solid foundation for data analysis projects and further learning in Excel


11. Visualizing the Answer to Stock Questions Using Spreadsheet Charts

Overview

In this hands-on project, you'll step into the shoes of a business analyst to explore historical stock market data using Excel. By applying information design concepts, you'll create compelling visualizations and craft an insightful report – building valuable skills for communicating data-driven insights that are highly sought-after by employers across industries.

Tools and Technologies

  • Excel
  • Data visualization
  • Information design principles

Prerequisites

To successfully complete this project, it's recommended that you have foundational skills in visualizing data in Excel, such as:

  • Creating various chart types in Excel to visualize data
  • Selecting appropriate chart types to effectively present data
  • Applying design principles to create clear and informative charts
  • Designing charts for an audience using Gestalt principles

Step-by-Step Instructions

  1. Import the dataset to an Excel spreadsheet
  2. Create a report using data visualizations and tabular data
  3. Represent the data using effective data visualizations
  4. Apply Gestalt principles and pre-attentive attributes to all visualizations
  5. Maximize data-ink ratio in all visualizations

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Analyzing real-world stock market data in Excel
  • Applying information design principles to create effective visualizations
  • Selecting the best chart types to answer specific questions about the data
  • Combining multiple charts into a cohesive, insightful report
  • Developing in-demand data visualization and communication skills


12. Identifying Customers Likely to Churn for a Telecommunications Provider

Overview

In this beginner project, you'll take on the role of a data analyst at a telecommunications company. Your challenge is to explore customer data in Excel to identify profiles of those likely to churn. Retaining customers is crucial for telecom providers, so your insights will help inform proactive retention efforts. You'll conduct exploratory data analysis, calculating key statistics, building PivotTables to slice the data, and creating charts to visualize your findings. This project provides hands-on experience with core Excel skills for data-driven business decisions that will enhance your analyst portfolio.

Tools and Technologies

  • Excel

Prerequisites

To complete this project, you should feel comfortable exploring data in Excel:

  • Calculating descriptive statistics in Excel
  • Analyzing data with descriptive statistics
  • Creating PivotTables in Excel to explore and analyze data
  • Visualizing data with histograms and boxplots in Excel

Step-by-Step Instructions

  1. Import the customer dataset into Excel
  2. Calculate descriptive statistics for key metrics
  3. Create PivotTables, histograms, and boxplots to explore data differences
  4. Analyze and identify profiles of likely churners
  5. Compile a report with your data visualizations

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Hands-on practice analyzing a real-world customer dataset in Excel
  • Ability to calculate and interpret key statistics to profile churn risks
  • Experience building PivotTables and charts to slice data and uncover insights
  • Skill in translating analysis findings into an actionable report for stakeholders


13. Data Prep in Tableau

Overview

In this hands-on project, you'll take on the role of a data analyst for Dataquest to prepare their online learning platform data for analysis. You'll connect to Excel data, import tables into Tableau, and define table relationships to build a data model for uncovering insights on student engagement and performance. This project focuses on essential data preparation steps in Tableau, providing you with a robust foundation for data visualization and analysis.

Tools and Technologies

  • Tableau

Prerequisites

To successfully complete this project, you should have some foundational skills in preparing data in Tableau, such as:

  • Connecting to data sources in Tableau to access the required data
  • Importing data tables into the Tableau canvas
  • Defining relationships between tables in Tableau to combine data
  • Cleaning and filtering imported data in Tableau to prepare it for use

Step-by-Step Instructions

  1. Connect to the provided Excel file containing key tables on student engagement, course performance, and content completion rates
  2. Import the tables into Tableau and define the relationships between tables to create a unified data model
  3. Clean and filter the imported data to handle missing values, inconsistencies, or irrelevant information
  4. Save the prepared data source to use for creating visualizations and dashboards
  5. Reflect on the importance of proper data preparation for effective analysis

Expected Outcomes

By completing this project, you will gain valuable skills and experience, including:

  • Hands-on practice with essential data preparation techniques in Tableau
  • Ability to connect to, import, and combine data from multiple tables
  • Understanding of how to clean and structure data for analysis
  • Readiness to progress to creating visualizations and dashboards to uncover insights


14. Business Intelligence Plots

Overview

In this hands-on project, you'll step into the role of a data visualization consultant for Adventure Works. The company's leadership team wants to understand the differences between their online and offline sales channels. You'll apply your Tableau skills to build insightful, interactive data visualizations that provide clear comparisons and enable data-driven business decisions. Key techniques include creating calculated fields, applying filters, utilizing dual-axis charts, and embedding visualizations in tooltips. By the end, you'll have a set of powerful Tableau dashboards ready to share with stakeholders.

Tools and Technologies

  • Tableau

Prerequisites

To successfully complete this project, you should have a solid grasp of data visualization fundamentals in Tableau:

  • Navigating the Tableau interface and distinguishing between dimensions and measures
  • Constructing various foundational chart types in Tableau
  • Developing and interpreting calculated fields to enhance analysis
  • Employing filters to improve visualization interactivity

Step-by-Step Instructions

  1. Compare online vs offline orders using visualizations
  2. Analyze products across channels with scatter plots
  3. Embed visualizations in tooltips for added insight
  4. Summarize findings and identify next steps

Expected Outcomes

Upon completing this project, you'll have gained valuable skills and experience:

  • Practical experience building interactive business intelligence dashboards in Tableau
  • Ability to create calculated fields to conduct tailored analysis
  • Understanding of how to use filters and tooltips to enable data exploration
  • Skill in developing visualizations that surface actionable insights for stakeholders


15. Data Presentation

Overview

In this project, you'll step into the role of a data analyst exploring conversion funnel trends for a company's leadership team. Using Tableau, you'll build interactive dashboards that uncover insights about which marketing channels, locations, and customer personas drive the most value in terms of volume and conversion rates. By applying data visualization best practices and incorporating dashboard actions and filters, you'll create a professional, usable dashboard ready to present your findings to stakeholders.

Tools and Technologies

  • Tableau

Prerequisites

To successfully complete this project, you should be comfortable sharing insights in Tableau, such as:

  • Building basic charts like bar charts and line graphs in Tableau
  • Employing color, size, trend lines and forecasting to emphasize insights
  • Combining charts, tables, text and images into dashboards
  • Creating interactive dashboards with filters and quick actions

Step-by-Step Instructions

  1. Import and clean the conversion funnel data in Tableau
  2. Build basic charts to visualize key metrics
  3. Create interactive dashboards with filters and actions
  4. Add annotations and highlights to emphasize key insights
  5. Compile a professional dashboard to present findings

Expected Outcomes

Upon completing this project, you'll have gained practical experience and valuable skills, including:

  • Analyzing conversion funnel data to surface actionable insights
  • Visualizing trends and comparisons using Tableau charts and graphs
  • Applying data visualization best practices to create impactful dashboards
  • Adding interactivity to enable exploration of the data
  • Communicating data-driven findings and recommendations to stakeholders


16. Modeling Data in Power BI

Overview

In this hands-on project, you'll step into the role of an analyst at a company that sells scale model cars. Your mission is to model and analyze data from their sales records database using Power BI to extract insights that drive business decision-making. Power BI is a powerful business analytics tool that enables you to connect to, model, and visualize data. By applying data cleaning, transformation, and modeling techniques in Power BI, you'll prepare the sales data for analysis and develop practical skills in working with real-world datasets. This project provides valuable experience in extracting meaningful insights from raw data to inform business strategy.

Tools and Technologies

  • Power BI

Prerequisites

To successfully complete this project, you should know how to model data in Power BI, such as:

  • Designing a basic data model in Power BI
  • Configuring table and column properties in Power BI
  • Creating calculated columns and measures using DAX in Power BI
  • Reviewing the performance of measures, relationships, and visuals in Power BI

Step-by-Step Instructions

  1. Import the sales data into Power BI
  2. Clean and transform the data for analysis
  3. Design a basic data model in Power BI
  4. Create calculated columns and measures using DAX
  5. Build visualizations to extract insights from the data

Expected Outcomes

Upon completing this project, you'll have gained valuable skills and experience, including:

  • Hands-on practice modeling and analyzing real-world sales data in Power BI
  • Ability to clean, transform and prepare data for analysis
  • Experience extracting meaningful business insights from raw data
  • Developing practical skills in data modeling and analysis using Power BI


17. Visualization of Life Expectancy and GDP Variation Over Time

Overview

In this project, you'll step into the role of a data analyst tasked with visualizing life expectancy and GDP data over time to uncover trends and regional differences. Using Power BI, you'll apply data cleaning, transformation, and visualization skills to create interactive scatter plots and stacked column charts that reveal insights from the Gapminder dataset. This hands-on project allows you to practice the full life-cycle of report and dashboard development in Power BI. You'll load and clean data, create and configure visualizations, and publish your work to showcase your skills. By the end, you'll have an engaging, interactive dashboard to add to your portfolio.

Tools and Technologies

  • Power BI

Prerequisites

To complete this project, you should be able to visualize data in Power BI, such as:

  • Creating basic Power BI visuals
  • Designing accessible report layouts
  • Customizing report themes and visual markers
  • Publishing Power BI reports and dashboards

Step-by-Step Instructions

  1. Import the life expectancy and GDP data into Power BI
  2. Clean and transform the data for analysis
  3. Create interactive scatter plots and stacked column charts
  4. Design an accessible report layout in Power BI
  5. Customize visual markers and themes to enhance insights

Expected Outcomes

By completing this project, you'll gain practical experience and valuable skills, including:

  • Applying data cleaning, transformation, and visualization techniques in Power BI
  • Creating interactive scatter plots and stacked column charts to uncover data insights
  • Developing an engaging dashboard to showcase your data visualization skills
  • Practicing the full life-cycle of Power BI report and dashboard development


18. Building a BI App

Overview

In this hands-on project, you'll step into the role of a business intelligence analyst at Dataquest, an online learning platform. Using Power BI, you'll import and model data on course completion rates and Net Promoter Scores (NPS) to assess course quality. You'll create insightful visualizations like KPI metrics, line charts, and scatter plots to analyze trends and compare courses. Leveraging this analysis, you'll provide data-driven recommendations on which courses Dataquest should improve.

Tools and Technologies

  • Power BI

Prerequisites

To successfully complete this project, you should have some foundational Power BI skills, such as how to manage workspaces and datasets:

  • Creating and managing workspaces
  • Importing and updating assets within a workspace
  • Developing dynamic reports using parameters
  • Implementing static and dynamic row-level security

Step-by-Step Instructions

  1. Import and explore the course completion and NPS data, looking for data quality issues
  2. Create a data model relating the fact and dimension tables
  3. Write calculations for key metrics like completion rate and NPS, and validate the results
  4. Design and build visualizations to analyze course performance trends and comparisons

Expected Outcomes

Upon completing this project, you'll have gained valuable skills and experience:

  • Importing, modeling, and analyzing data in Power BI to drive decisions
  • Creating calculated columns and measures to quantify key metrics
  • Designing and building insightful data visualizations to convey trends and comparisons
  • Developing impactful reports and dashboards to summarize findings
  • Sharing data stories and recommending actions via Power BI apps


19. Analyzing Kickstarter Projects

Overview

In this hands-on project, you'll step into the role of a data analyst to explore and analyze Kickstarter project data using SQL. You'll start by importing and exploring the dataset, followed by cleaning the data to ensure accuracy. Then, you'll write SQL queries to uncover trends and insights within the data, such as success rates by category, funding goals, and more. By the end of this project, you'll be able to use SQL to derive meaningful insights from real-world datasets.
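For a taste of the kind of query you'll write, the sketch below calculates success rates by category. It assumes the data has already been imported into a SQLite database with a table named kickstarter that has category and state columns; your actual file, table, and column names may differ.

import sqlite3

import pandas as pd

# Connect to the database holding the imported Kickstarter data
# (the file name and schema here are assumptions for illustration)
conn = sqlite3.connect("kickstarter.db")

query = """
    SELECT
        category,
        COUNT(*) AS total_projects,
        AVG(CASE WHEN state = 'successful' THEN 1.0 ELSE 0 END) AS success_rate
    FROM kickstarter
    GROUP BY category
    ORDER BY success_rate DESC;
"""

# Load the query results into a DataFrame for easy inspection
success_by_category = pd.read_sql(query, conn)
print(success_by_category.head(10))

conn.close()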

Tools and Technologies

  • SQL

Prerequisites

To successfully complete this project, you should be comfortable working with SQL and databases, such as:

  • Basic SQL commands and querying
  • Data manipulation and joins in SQL
  • Experience with cleaning data and handling missing values

Step-by-Step Instructions

  1. Import and explore the Kickstarter dataset to understand its structure
  2. Clean the data to handle missing values and ensure consistency
  3. Write SQL queries to analyze the data and uncover trends
  4. Summarize and present the results of your analysis using SQL queries

Expected Outcomes

Upon completing this project, you'll have gained valuable skills and experience, including:

  • Proficiency in using SQL for data analysis
  • Experience with cleaning and analyzing real-world datasets
  • Ability to derive insights from Kickstarter project data


20. Analyzing Startup Fundraising Deals from Crunchbase

Overview

In this beginner-level guided project, you'll step into the role of a data analyst to explore a dataset of startup investments from Crunchbase. By applying your pandas and SQLite skills, you'll work with a large dataset to uncover insights on fundraising trends, successful startups, and active investors. This project focuses on developing techniques for handling memory constraints, selecting optimal data types, and leveraging SQL databases. You'll strengthen your ability to apply the pandas-SQLite workflow to real-world scenarios.
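To give you a feel for the chunking workflow, here's a minimal sketch that reads the investments CSV in pieces and appends each piece to a SQLite table. The file name, encoding, column selection, and chunk size are assumptions; adjust them to match the actual dataset you download.

import sqlite3

import pandas as pd

conn = sqlite3.connect("crunchbase.db")

# Read the large CSV in 50,000-row chunks instead of loading it all at once
chunks = pd.read_csv(
    "crunchbase_investments.csv",
    encoding="latin-1",  # assumption: the file isn't UTF-8 encoded
    usecols=["company_name", "investor_name", "raised_amount_usd"],
    chunksize=50_000,
)

for chunk in chunks:
    # Downcast the numeric column to a smaller dtype to save memory
    chunk["raised_amount_usd"] = pd.to_numeric(
        chunk["raised_amount_usd"], downcast="float"
    )
    chunk.to_sql("investments", conn, if_exists="append", index=False)

# Quick sanity check: how many rows made it into the database?
print(pd.read_sql("SELECT COUNT(*) AS row_count FROM investments;", conn))
conn.close()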

Tools and Technologies

  • Python
  • Pandas
  • SQLite
  • Jupyter Notebook

Prerequisites

Although this is a beginner-level SQL project, you'll need some solid skills in Python and data analysis before taking it on.

Step-by-Step Instructions

  1. Explore the structure and contents of the Crunchbase startup investments dataset
  2. Process the large dataset in chunks and load into an SQLite database
  3. Analyze fundraising rounds data to identify trends and derive insights
  4. Examine the most successful startup verticals based on total funding raised
  5. Identify the most active investors by number of deals and total amount invested

Expected Outcomes

Upon completing this guided project, you'll gain practical skills and experience, including:

  • Applying pandas and SQLite to analyze real-world startup investment data
  • Handling large datasets effectively through chunking and efficient data types
  • Integrating pandas DataFrames with SQL databases for scalable data analysis
  • Deriving actionable insights from fundraising data to understand startup success
  • Building a project for your portfolio showcasing pandas and SQLite skills


Choosing the right data analyst projects

Since the list of data analytics projects on the internet is exhaustive (and can be exhausting!), no one can be expected to build them all. So, how do you pick the right ones for your portfolio, whether they're guided or independent projects? Let's go over the criteria you should use to make this decision.

Passions vs. Interests vs. In-Demand skills

When selecting projects, it’s essential to strike a balance between your passions, interests, and in-demand skills. Here’s how to navigate these three factors:

  • Passions: Choose projects that genuinely excite you and align with your long-term goals. Passions are often areas you are deeply committed to and are willing to invest significant time and effort in. Working on something you are passionate about can keep you motivated and engaged, which is crucial for learning and completing the project.
  • Interests: Pick projects related to fields or a topic that sparks your curiosity or enjoyment. Interests might not have the same level of commitment as passions, but they can still make the learning process more enjoyable and meaningful. For instance, if you're curious about sports analytics or healthcare data, these interests can guide your project choices.
  • In-Demand Skills: Focus on projects that help you develop skills currently sought after in the job market. Research job postings and industry trends to identify which skills are in demand and tailor your projects to develop those competencies.

Steps to picking the right data analytics projects

  1. Assess your current skill level
    • If you’re a beginner, start with projects that focus on data cleaning (an essential skill), exploration, and visualization. Using Python libraries like Pandas and Matplotlib is an efficient way to build these foundational skills.
    • Utilize structured resources that provide both a beginner data analyst learning path and support to guide you through your first projects.
  2. Plan before you code
    • Clearly define your topic, project objectives, and key questions upfront to stay focused and aligned with your goals.
    • Choose appropriate data sources early in the planning process to streamline the rest of the project.
  3. Focus on the fundamentals
    • Clean your data thoroughly to ensure accuracy.
    • Use analytical techniques that align with your objectives.
    • Create clear, impactful visualizations of your findings.
    • Document your process for reproducibility and effective communication.
  4. Start small and scale up
  5. Seek feedback and iterate
    • Share your projects with peers, mentors, or online communities to get feedback.
    • Use this feedback to improve and refine your work.

Remember, it’s okay to start small and gradually take on bigger challenges. Each project you complete, no matter how simple, helps you gain skills and learn valuable lessons. Tackling a series of focused projects is one of the best ways to grow your abilities as a data professional. With each one, you’ll get better at planning, execution, and communication.

Conclusion

If you're serious about landing a data analytics job, project-based learning is key.

There’s a lot of data out there and a lot you can do with it. Trying to figure out where to start can be overwhelming. If you want a more structured approach to reaching your goal, consider enrolling in Dataquest’s Data Analyst in Python career path. It offers exactly what you need to land your first job as a data analyst or to grow your career by adding one of the most popular programming languages, in-demand data skills, and projects to your CV.

But if you’re confident in doing this on your own, the list of projects we’ve shared in this post will definitely help you get there. To continue improving, we encourage you to take on additional projects and share them in the Dataquest Community. This provides valuable peer feedback, helping you refine your projects to become more advanced and join the group of professionals who do this for a living.

Python Projects: 60+ Ideas for Beginners to Advanced (2025)

23 October 2025 at 18:46
Quick Answer: The best Python projects for beginners include building an interactive word game, analyzing your Netflix data, creating a password generator, or making a simple web scraper. These projects teach core Python skills like loops, functions, data manipulation, and APIs while producing something you can actually use. Below, you'll find 60+ project ideas organized by skill level, from beginner to advanced.

Completing Python projects is the ultimate way to learn the language. When you work on real-world projects, you not only retain more of the lessons you learn, but you'll also find it super motivating to push yourself to pick up new skills. Because let's face it, no one actually enjoys sitting in front of a screen learning random syntax for hours on end―particularly if it's not going to be used right away.

Python projects don't have this problem. Anything new you learn will stick because you're immediately putting it into practice. There's just one problem: many Python learners struggle to come up with their own Python project ideas to work on. But that's okay, we can help you with that!

Best Starter Python Projects

A few of the beginner-friendly Python projects from the list below are perfect for getting hands-on experience right away.

Choose one that excites you and just go with it! You’ll learn more by building than by reading alone.

Are You Ready for This?

If you have some programming experience, you might be ready to jump straight into building a Python project. However, if you’re just starting out, it’s vital you have a solid foundation in Python before you take on any projects. Otherwise, you run the risk of getting frustrated and giving up before you even get going. For those in need, we recommend taking either:

  1. Introduction to Python Programming course: meant for those looking to become a data professional while learning the fundamentals of programming with Python.
  2. Introduction to Python Programming course: meant for those looking to leverage the power of AI while learning the fundamentals of programming with Python.

In both courses, the goal is to quickly learn the basics of Python so you can start working on a project as soon as possible. You'll learn by doing, not by passively watching videos.

Selecting a Project

Our list below has 60+ fun and rewarding Python projects for learners at all levels. Some are free guided projects that you can complete directly in your browser via the Dataquest platform. Others are more open-ended, serving as inspiration as you build your Python skills. The key is to choose a project that resonates with you and just go for it!

Now, let’s take a look at some Python project examples. There is definitely something to get you started in this list.


Free Python Projects (Recommended):

These free Dataquest guided projects are a great place to start. They provide an embedded code editor directly in your browser, step-by-step instructions to help you complete the project, and community support if you happen to get stuck.

  1. Building an Interactive Word Game — In this guided project, you’ll use basic Python programming concepts to create a functional and interactive word-guessing game.

  2. Profitable App Profiles for the App Store and Google Play Markets — In this one, you’ll work as a data analyst for a company that builds mobile apps. You’ll use Python to analyze real app market data to find app profiles that attract the most users.

  3. Exploring Hacker News Posts — Use Python string manipulation, OOP, and date handling to analyze trends driving post popularity on Hacker News, a popular technology site.

  4. Learn and Install Jupyter Notebook — A guide to using and setting up Jupyter Notebook locally to prepare you for real-world data projects.

  5. Predicting Heart Disease — We're tasked with using a dataset from the World Health Organization to accurately predict a patient’s risk of developing heart disease based on their medical data.

  6. Analyzing Accuracy in Data Presentation — In this project, we'll step into the role of data journalists to analyze movie ratings data and determine if there’s evidence of bias in Fandango’s rating system.



More Projects to Help Build Your Portfolio:

  1. Finding Heavy Traffic Indicators on I-94 — Explore how using the pandas plotting functionality along with the Jupyter Notebook interface allows us to analyze data quickly using visualizations to determine indicators of heavy traffic.

  2. Storytelling Data Visualization on Exchange Rates — You'll assume the role of a data analyst tasked with creating an explanatory data visualization about Euro exchange rates to inform and engage an audience.

  3. Clean and Analyze Employee Exit Surveys — Work with exit surveys from employees of the Department of Education in Queensland, Australia. Play the role of a data analyst to analyze employee exit surveys and uncover insights about why employees resign.

  4. Star Wars Survey — In this data cleaning project, you’ll work with Jupyter Notebook to analyze data on the Star Wars movies to answer the hotly contested question, "Who shot first?"

  5. Analyzing NYC High School Data — For this project, you’ll assume the role of a data scientist analyzing relationships between SAT scores and demographic factors in NYC public schools to determine if the SAT is a fair test.

  6. Predicting the Weather Using Machine Learning — For this project, you’ll step into the role of a data scientist to predict tomorrow’s weather using historical data and machine learning, developing skills in data preparation, time series analysis, and model evaluation.

  7. Credit Card Customer Segmentation — For this project, we’ll play the role of a data scientist at a credit card company to segment customers into groups using K-means clustering in Python, allowing the company to tailor strategies for each segment.

Python Projects for AI Enthusiasts:

  1. Building an AI Chatbot with Streamlit — Build a simple website with an AI chatbot user interface similar to the OpenAI Playground in this intermediate-level project using Streamlit.

  2. Developing a Dynamic AI Chatbot — Create your very own AI-powered chatbot that can take on different personalities, keep track of conversation history, and provide coherent responses in this intermediate-level project.

  3. Building a Food Ordering App — Create a functional application using Python dictionaries, loops, and functions to create an interactive system for viewing menus, modifying carts, and placing orders.


Fun Python Projects for Building Data Skills:

  1. Exploring eBay Car Sales Data — Use Python to work with a scraped dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

  2. Find out How Much Money You’ve Spent on Amazon — Dig into your own spending habits with this beginner-level tutorial!

  3. Analyze Your Personal Netflix Data — Another beginner-to-intermediate tutorial that gets you working with your own personal dataset.

  4. Analyze Your Personal Facebook Data with Python — Are you spending too much time posting on Facebook? The numbers don’t lie, and you can find them in this beginner-to-intermediate Python project.

  5. Analyze Survey Data — This walk-through will show you how to set up Python and how to filter survey data from any dataset (or just use the sample data linked in the article).

  6. All of Dataquest’s Guided Projects — These guided data science projects walk you through building real-world data projects of increasing complexity, with suggestions for how to expand each project.

  7. Analyze Everything — Grab a free dataset that interests you, and start poking around! If you get stuck or aren’t sure where to start, our introduction to Python lessons are here to help, and you can try them for free!



Cool Python Projects for Game Devs:

  1. Rock, Paper, Scissors — Learn Python with a simple-but-fun game that everybody knows.

  2. Build a Text Adventure Game — This is a classic Python beginner project (it also pops up in this book) that’ll teach you many basic game setup concepts that are useful for more advanced games.

  3. Guessing Game — This is another beginner-level project that’ll help you learn and practice the basics.

  4. Mad Libs — Use Python code to make interactive Python Mad Libs!

  5. Hangman — Another childhood classic that you can make to stretch your Python skills.

  6. Snake — This is a bit more complex, but it’s a classic (and surprisingly fun) game to make and play.

Simple Python Projects for Web Devs:

  1. URL shortener — This free video course will show you how to build your own URL shortener like Bit.ly using Python and Django.

  2. Build a Simple Web Page with Django — This is a very in-depth, from-scratch tutorial for building a website with Python and Django, complete with cartoon illustrations!

Easy Python Projects for Aspiring Developers:

  1. Password generator — Build a secure password generator in Python.

  2. Use Tweepy to create a Twitter bot — This Python project idea is a bit more advanced, as you’ll need to use the Twitter API, but it’s definitely fun!

  3. Build an Address Book — This could start with a simple Python dictionary or become as advanced as something like this!

  4. Create a Crypto App with Python — This free video course walks you through using some APIs and Python to build apps with cryptocurrency data.


Additional Python Project Ideas

Still haven’t found a project idea that appeals to you? Here are many more, separated by experience level.

These aren’t tutorials; they’re just Python project ideas that you’ll have to dig into and research on your own, but that’s part of the fun! And it’s also part of the natural process of learning to code and working as a programmer.

The pros use Google and AI tools for answers all the time — so don’t be afraid to dive in and get your hands dirty!


Beginner Python Project Ideas

  1. Create a text encryption generator. This would take text as input, replace each letter with another letter, and output the “encoded” message.

  2. Build a countdown calculator. Write some code that can take two dates as input, and then calculate the amount of time between them. This will be a great way to familiarize yourself with Python’s datetime module (see the sketch after this list).

  3. Write a sorting method. Given a list, can you write some code that sorts it alphabetically, or numerically? Yes, Python has this functionality built-in, but see if you can do it without using the sort() function!

  4. Build an interactive quiz application. Which Avenger are you? Build a personality or recommendation quiz that asks users some questions, stores their answers, and then performs some kind of calculation to give the user a personalized result based on their answers.

  5. Tic-Tac-Toe by Text. Build a Tic-Tac-Toe game that’s playable like a text adventure. Can you make it print a text-based representation of the board after each move?

  6. Make a temperature/measurement converter. Write a script that can convert Fahrenheit (℉) to Celsius (℃) and back, or inches to centimeters and back, etc. How far can you take it?

  7. Build a counter app. Take your first steps into the world of UI by building a very simple app that counts up by one each time a user clicks a button.

  8. Build a number-guessing game. Think of this as a bit like a text adventure, but with numbers. How far can you take it?

  9. Build an alarm clock. This is borderline beginner/intermediate, but it’s worth trying to build an alarm clock for yourself. Can you create different alarms? A snooze function?
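As a starting point for the countdown calculator idea above (item 2), here's a minimal sketch using Python's datetime module; the date format and prompts are just one possible choice, so feel free to design your own.

from datetime import datetime

# Ask for two dates in YYYY-MM-DD format (one possible convention)
first = input("Enter the first date (YYYY-MM-DD): ")
second = input("Enter the second date (YYYY-MM-DD): ")

start = datetime.strptime(first, "%Y-%m-%d")
end = datetime.strptime(second, "%Y-%m-%d")

# Subtracting two datetime objects gives a timedelta
difference = abs(end - start)
print(f"There are {difference.days} days between those dates.")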



Intermediate Python Project Ideas

  1. Build an upgraded text encryption generator. Starting with the project mentioned in the beginner section, see what you can do to make it more sophisticated. Can you make it generate different kinds of codes? Can you create a “decoder” app that reads encoded messages if the user inputs a secret key? Can you create a more sophisticated code that goes beyond simple letter-replacement?

  2. Make your Tic-Tac-Toe game clickable. Building off the beginner project, now make a version of Tic-Tac-Toe that has an actual UI you’ll use by clicking on open squares. Challenge: can you write a simple “AI” opponent for a human player to play against?

  3. Scrape some data to analyze. This could really be anything, from any website you like. The web is full of interesting data. If you learn a little about web-scraping, you can collect some really unique datasets.

  4. Build a clock website. How close can you get it to real-time? Can you implement different time zone selectors, and add in the “countdown calculator” functionality to calculate lengths of time?

  5. Automate some of your job. This will vary, but many jobs have some kind of repetitive process that you can automate! This intermediate project could even lead to a promotion.

  6. Automate your personal habits. Do you want to remember to stand up once every hour during work? How about writing some code that generates unique workout plans based on your goals and preferences? There are a variety of simple apps you can build to automate or enhance different aspects of your life.

  7. Create a simple web browser. Build a simple UI that accepts URLs and loads webpages. PyQt will be helpful here! Can you add a “back” button, bookmarks, and other cool features?

  8. Write a notes app. Create an app that helps people write and store notes. Can you think of some interesting and unique features to add?

  9. Build a typing tester. This should show the user some text, and then challenge them to type it quickly and accurately. Meanwhile, you time them and score them on accuracy.

  10. Create a “site updated” notification system. Ever get annoyed when you have to refresh a website to see if an out-of-stock product has been relisted? Or to see if any news has been posted? Write a Python script that automatically checks a given URL for updates and informs you when it identifies one. Be careful not to overload the servers of whatever site you’re checking, though. Keep the time interval reasonable between each check.

  11. Recreate your favorite board game in Python. There are tons of options here, from something simple like Checkers all the way up to Risk. Or even more modern and advanced games like Ticket to Ride or Settlers of Catan. How close can you get to the real thing?

  12. Build a Wikipedia explorer. Build an app that displays a random Wikipedia page. The challenge here is in the details: can you add user-selected categories? Can you try a different “rabbit hole” version of the app, wherein each article is randomly selected from the articles linked in the previous article? This might seem simple, but it can actually require some serious web-scraping skills.



Advanced Python Project Ideas

  1. Build a stock market prediction app. For this one, you’ll need a source of stock market data and some machine learning and data analytics skills. Fortunately, many people have tried this, so there’s plenty of source code out there to work from.

  2. Build a chatbot. The challenge here isn’t so much making the chatbot as it is making it good. Can you, for example, implement some natural language processing techniques to make it sound more natural and spontaneous?

  3. Program a robot. This requires some hardware (which isn’t usually free), but there are many affordable options out there — and many learning resources, too. Definitely look into Raspberry Pi if you’re not already thinking along those lines.

  4. Build an image recognition app. Starting with handwriting recognition is a good idea — Dataquest has a guided data science project to help with that! Once you’ve learned it, you can take it to the next level.

  5. Create a sentiment analysis tool for social media. Collect data from various social media platforms, preprocess it, and then train a deep learning model to analyze the sentiment of each post (positive, negative, neutral).

  6. Make a price prediction model. Select an industry or product that interests you, and build a machine learning model that predicts price changes.

  7. Create an interactive map. This will require a mix of data skills and UI creation skills. Your map can display whatever you’d like — bird migrations, traffic data, crime reports — but it should be interactive in some way. How far can you take it?


Next Steps

Each of the examples in the previous sections built on the idea of choosing a great Python project for a beginner and then enhancing it as your Python skills progress. Next, you can advance to the following:

  • Think about what interests you, and choose a project that overlaps with your interests.

  • Think about your Python learning goals, and make sure your project moves you closer to achieving those goals.

  • Start small. Once you’ve built a small project, you can either expand it or build another one.

Now you’re ready to get started. If you haven’t learned the basics of Python yet, I recommend diving in with Dataquest’s Introduction to Python Programming course.

If you already know the basics, there’s no reason to hesitate! Now is the time to get in there and find your perfect Python project.

11 Must-Have Skills for Data Analysts in 2025

22 October 2025 at 19:06

Data is everywhere. Every click, purchase, or social media like creates mountains of information, but raw numbers do not tell a story. That is where data analysts come in. They turn messy datasets into actionable insights that help businesses grow.

Whether you're looking to become a junior data analyst or looking to level up, here are the top 11 data analyst skills every professional needs in 2025, including one optional skill that can help you stand out.

1. SQL

SQL (Structured Query Language) is the language of databases and is arguably the most important technical skill for analysts. It allows you to efficiently query and manage large datasets across multiple systems—something Excel cannot do at scale.

Example in action: Want last quarter's sales by region? SQL pulls it in seconds, no matter how huge the dataset.
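As a rough sketch of that example, the snippet below runs such a query from Python against a SQLite database; the sales table, its columns, and the date range are assumptions, and in practice you'd point this at your company's actual database.

import sqlite3

import pandas as pd

# Assumed: a database with a `sales` table containing
# order_date (ISO date strings), region, and amount columns
conn = sqlite3.connect("company.db")

query = """
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    WHERE order_date BETWEEN '2025-07-01' AND '2025-09-30'
    GROUP BY region
    ORDER BY total_sales DESC;
"""

last_quarter_sales = pd.read_sql(query, conn)
print(last_quarter_sales)

conn.close()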

Learning Tip: Start with basic queries, then explore joins, aggregations, and subqueries. Practicing data analytics exercises with SQL will help you build confidence and precision.

2. Excel

Microsoft Excel isn’t going anywhere, so it’s still worth learning. Beyond spreadsheets, it offers pivot tables, macros, and Power Query, which are perfect for quick analysis of smaller datasets. Many startups and lean teams still rely on Excel as their first database.

Example in action: Summarize thousands of rows of customer feedback in minutes with pivot tables, then highlight trends visually.

Learning Tip: Focus on pivot tables, logical formulas, and basic automation. Once comfortable, try linking Excel to SQL queries or automating repetitive tasks to strengthen your technical skills in data analytics.

3. Python or R

Python and R are essential for handling big datasets, advanced analytics, and automation. Python is versatile for cleaning data, automation, and integrating analyses into workflows, while R excels at exploratory data analysis and statistical analysis.

Example in action: Clean hundreds of thousands of rows with Python’s pandas library in seconds, something that would take hours in Excel.
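Here's a small sketch of the kind of automation Python makes easy: combining a folder of monthly CSV exports, cleaning them, and summarizing the result in a few lines. The file pattern and column names are illustrative.

from glob import glob

import pandas as pd

# Combine every monthly export in the folder into one DataFrame
# (the file pattern and columns are assumptions for illustration)
files = glob("exports/sales_2025_*.csv")
sales = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Basic cleaning: drop exact duplicates and rows missing an order amount
sales = sales.drop_duplicates()
sales = sales.dropna(subset=["amount"])

# A summary that would be tedious to rebuild by hand every month
print(sales.groupby("region")["amount"].agg(["count", "sum", "mean"]))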

Learning Tip: Start with data cleaning and visualization, then move to complex analyses like regression or predictive modeling. Building these data analyst skills is critical for anyone working in data science. Of course, which is better to learn is still up for debate.

4. Data Visualization

Numbers alone rarely persuade anyone. Data visualization is how you make your insights clear and memorable. Tools like Tableau, Power BI, or Python/R libraries help you tell a story that anyone can understand.

Example in action: A simple line chart showing revenue trends can be far more persuasive than a table of numbers.
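For instance, a revenue trend line like that takes only a few lines of matplotlib; the figures below are made up purely for illustration.

import matplotlib.pyplot as plt

# Illustrative monthly revenue figures (replace with your own data)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [120_000, 135_000, 128_000, 150_000, 163_000, 171_000]

plt.figure(figsize=(8, 4))
plt.plot(months, revenue, marker="o")
plt.title("Monthly Revenue Trend")
plt.ylabel("Revenue (USD)")
plt.tight_layout()
plt.show()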

Learning Tip: Design visuals with your audience in mind. Recreate dashboards from online tutorials to practice clarity, storytelling, and your soft skills in communicating data analytics results.

5. Statistics & Analytics

Strong statistical analysis knowledge separates analysts who report numbers from those who generate insights. Skills like regression, correlation, hypothesis testing, and A/B testing help you interpret trends accurately.

Example in action: Before recommending a new marketing campaign, test whether the increase in sales is statistically significant or just random fluctuation.
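One common way to run that check is a two-sample t-test. The sketch below uses SciPy with placeholder daily sales figures; real data and a carefully designed experiment matter far more than the code itself.

from scipy import stats

# Placeholder daily sales figures before and after the campaign
before = [510, 495, 520, 505, 498, 512, 507, 515]
after = [530, 541, 528, 550, 536, 544, 539, 547]

# Welch's t-test, which doesn't assume equal variances between groups
t_stat, p_value = stats.ttest_ind(after, before, equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The increase is statistically significant at the 5% level.")
else:
    print("The difference could plausibly be random fluctuation.")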

Learning Tip: Focus on core probability and statistics concepts first, then practice applying them in projects. Our Probability and Statistics with Python skill path is a great way to learn theoretical concepts in a hands-on way.

6. Data Cleaning & Wrangling

Data rarely comes perfect, so data cleaning skills will always be in demand. Cleaning and transforming datasets, removing duplicates, handling missing values, and standardizing formats are often the most time-consuming but essential parts of the job.

Example in action: You want to analyze customer reviews, but ratings are inconsistent and some entries are blank. Cleaning the data ensures your insights are accurate and actionable.
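A minimal pandas sketch of those steps might look like this, assuming a reviews.csv file with rating and review_date columns (adjust the names to your own data):

import pandas as pd

# Assumed input: customer reviews with inconsistent ratings and blanks
reviews = pd.read_csv("reviews.csv")

# Remove exact duplicate rows
reviews = reviews.drop_duplicates()

# Coerce ratings to numbers and fill blanks with the median rating
reviews["rating"] = pd.to_numeric(reviews["rating"], errors="coerce")
reviews["rating"] = reviews["rating"].fillna(reviews["rating"].median())

# Standardize date formats so reviews can be compared and grouped over time
reviews["review_date"] = pd.to_datetime(reviews["review_date"], errors="coerce")

print(reviews.info())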

Learning Tip: Practice on free datasets or public data repositories to build real-world data analyst skills.

7. Communication & Presentation Skills

Analyzing data is only half the battle. Sharing your findings clearly is just as important. Being able to present insights in reports, dashboards, or meetings ensures your work drives decisions.

Example in action: Presenting a dashboard to a marketing team that highlights which campaigns brought the most new customers can influence next-quarter strategy.

Learning Tip: Practice explaining complex findings to someone without a technical background. Focus on clarity, storytelling, and visuals rather than technical jargon. Strong soft skills are just as valuable as your technical skills in data analytics.

8. Dashboard & Report Creation

Beyond visualizations, analysts need to build dashboards and reports that allow stakeholders to interact with data. A dashboard is not just a fancy chart. It is a tool that empowers teams to make data-driven decisions without waiting for you to interpret every number.

Example in action: A sales dashboard with filters for region, product line, and time period can help managers quickly identify areas for improvement.

Learning Tip: Start with simple dashboards in Tableau, Power BI, or Google Data Studio. Focus on making them interactive, easy to understand, and aligned with business goals. This is an essential part of professional data analytics skills.

9. Domain Knowledge

Understanding the industry or context of your data makes you exponentially more effective. Metrics and trends mean different things depending on the business.

Example in action: Knowing e-commerce metrics like cart abandonment versus subscription churn metrics can change how you interpret the same type of data.

Learning Tip: Study your company’s industry, read case studies, or shadow colleagues in different departments to build context. The more you know, the better your insights and analysis will be.

10. Critical Thinking & Problem-Solving

Numbers can be misleading. Critical thinking lets analysts ask the right questions, identify anomalies, and uncover hidden insights.

Example in action: Revenue drops in one region. Critical thinking helps you ask whether it is seasonal, a data error, or a genuine trend.

Learning Tip: Challenge assumptions and always ask “why” multiple times when analyzing a dataset. Practice with open-ended case studies to sharpen your analytical thinking and overall data analyst skills.

11. Machine Learning Basics

Not every analyst uses machine learning daily, but knowing the basics—predictive modeling, clustering, or AI-powered insights—can help you stand out. You do not need this skill to get started as an analyst, but familiarity with it is increasingly valuable for advanced roles.

Example in action: Using a simple predictive model to forecast next month’s sales trends can help your team allocate resources more effectively.
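Here's a bare-bones sketch of that idea with scikit-learn, using made-up monthly sales figures; a real forecast would need far more data, features, and validation.

import numpy as np

from sklearn.linear_model import LinearRegression

# Made-up unit sales for the past 12 months
sales = np.array([200, 210, 205, 220, 235, 240, 238, 250, 262, 270, 268, 280])
months = np.arange(len(sales)).reshape(-1, 1)  # month index 0-11 as the feature

# Fit a simple trend line and project it one month ahead
model = LinearRegression()
model.fit(months, sales)
forecast = model.predict([[12]])

print(f"Forecast for next month: {forecast[0]:.0f} units")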

Learning Tip: Start small with beginner-friendly tools like Python’s scikit-learn library, then explore more advanced models as you grow. Treat it as an optional skill to explore once you are confident in SQL, Python/R, and statistical analysis.

Where to Learn These Skills

Want to become a data analyst? Dataquest makes it easy to learn the skills you need to get hired.

With our Data Analyst in Python and Data Analyst in R career paths, you’ll learn by doing real projects, not just watching videos. Each course helps you build the technical and practical skills employers look for.

By the end, you’ll have the knowledge, experience, and confidence to start your career in data analysis.

Wrapping It Up

Being a data analyst is not just about crunching numbers. It is about turning data into actionable insights that drive decisions. Master these data analytics and data analyst skills, and you will be prepared to handle the challenges of 2025 and beyond.

Getting Started with Claude Code for Data Scientists

16 October 2025 at 23:39

If you've spent hours debugging a pandas KeyError, or writing the same data validation code for the hundredth time, or refactoring a messy analysis script, you know the frustration of tedious coding work. Real data science work involves analytical thinking and creative problem-solving, but it also requires a lot of mechanical coding: boilerplate writing, test generation, and documentation creation.

What if you could delegate the mechanical parts to an AI assistant that understands your codebase and handles implementation details while you focus on the analytical decisions?

That's what Claude Code does for data scientists.

What Is Claude Code?

Claude Code is Anthropic's terminal-based AI coding assistant that helps you write, refactor, debug, and document code through natural language conversations. Unlike autocomplete tools that suggest individual lines as you type, Claude Code understands project context, makes coordinated multi-file edits, and can execute workflows autonomously.

Claude Code excels at generating boilerplate code for data loading and validation, refactoring messy scripts into clean modules, debugging obscure errors in pandas or numpy operations, implementing standard patterns like preprocessing pipelines, and creating tests and documentation. However, it doesn't replace your analytical judgment, make methodological decisions about statistical approaches, or fix poorly conceived analysis strategies.

In this tutorial, you'll learn how to install Claude Code, understand its capabilities and limitations, and start using it productively for data science work. You'll see the core commands, discover tips that improve efficiency, and see concrete examples of how Claude Code handles common data science tasks.

Key Benefits for Data Scientists

Before we get into installation, let's establish what Claude Code actually does for data scientists:

  1. Eliminate boilerplate code writing for repetitive patterns that consume time without requiring creative thought. File loading with error handling, data validation checks that verify column existence and types, preprocessing pipelines with standard transformations—Claude Code generates these in seconds rather than requiring manual implementation of logic you've written dozens of times before.
  2. Generate test suites for data processing functions covering normal operation, edge cases with malformed or missing data, and validation of output characteristics. Testing data pipelines becomes straightforward rather than work you postpone.
  3. Accelerate documentation creation for data analysis workflows by generating detailed docstrings, README files explaining project setup, and inline comments that explain complex transformations.
  4. Debug obscure errors more efficiently in pandas operations, numpy array manipulations, or scikit-learn pipeline configurations. Claude Code interprets cryptic error messages, suggests likely causes based on common patterns, and proposes fixes you can evaluate immediately.
  5. Refactor exploratory code into production-quality modules with proper structure, error handling, and maintainability standards. The transition from research notebook to deployable pipeline becomes faster and less painful.

These benefits translate directly to time savings on mechanical tasks, allowing you to focus on analysis, modeling decisions, and generating insights rather than wrestling with implementation details.

Installation and Setup

Let's get Claude Code installed and configured. The process takes about 10-15 minutes, including account creation and verification.

Step 1: Obtain Your Anthropic API Key

Navigate to console.anthropic.com and create an account if you don't have one. Once logged in, access the API keys section from the navigation menu on the left, and generate a new API key by clicking on + Create Key.

[Screenshot: creating an API key in the Anthropic console]

While you can generate a new key from the console at any time, you won’t be able to view an existing key again once it has been created. For this reason, copy your API key immediately and store it somewhere safe; you’ll need it for authentication.

Always keep your API keys secure. Treat them like passwords and never commit them to version control or share them publicly.

Step 2: Install Claude Code

Claude Code installs via npm (Node Package Manager). If you don't have Node.js installed on your system, download it from nodejs.org before proceeding.

Once Node.js is installed, open your terminal and run:

npm install -g @anthropic-ai/claude-code

The -g flag installs Claude Code globally, making it available from any directory on your system.

Common installation issues:

  • "npm: command not found": You need to install Node.js first. Download it from nodejs.org and restart your terminal after installation.
  • Permission errors on Mac/Linux: Try sudo npm install -g @anthropic-ai/claude-code to install with administrator privileges.
  • PATH issues: If Claude Code installs successfully but the claude command isn't recognized, you may need to add npm's global directory to your system PATH. Run npm config get prefix to find the location, then add [that-location]/bin to your PATH environment variable.

Step 3: Configure Authentication

Set your API key as an environment variable so Claude Code can authenticate with Anthropic's servers:

export ANTHROPIC_API_KEY=your_key_here

Replace your_key_here with the actual API key you copied earlier from the Anthropic console.

To make this permanent (so you don't need to set your API key every time you open a terminal), add the export line above to your shell configuration file:

  • For bash: Add to ~/.bashrc or ~/.bash_profile
  • For zsh: Add to ~/.zshrc
  • For fish: Add to ~/.config/fish/config.fish

You can edit your shell configuration file using nano config_file_name. After adding the line, reload your configuration by running source ~/.bashrc (or whichever file you edited), or simply open a new terminal window.

Step 4: Verify Installation

Confirm that Claude Code is properly installed and authenticated:

claude --version

You should see version information displayed. If you get an error, review the installation steps above.

Try running Claude Code for the first time:

claude

This launches the Claude Code interface. You should see a welcome message and a prompt asking you to select the text style that looks best with your terminal:

[Screenshot: Claude Code welcome screen with text style selection]

Use the arrow keys on your keyboard to select a text style and press Enter to continue.

Next, you’ll be asked to select a login method:

If you have an eligible subscription, select option 1. Otherwise, select option 2. For this tutorial, we will use option 2 (API usage billing).

[Screenshot: Claude Code login method selection]

Once your account setup is complete, you’ll see a welcome message showing the email address for your account:

[Screenshot: Claude Code setup complete message]

To exit the setup of Claude Code at any point, press Control+C twice.

Security Note

Claude Code can read files you explicitly include and generate code that loads data from files or databases. However, it doesn't automatically access your data without your instruction. You maintain full control over what files and information Claude Code can see. When working with sensitive data, be mindful of what files you include in conversation context and review all generated code before execution, especially code that connects to databases or external systems. For more details, see Anthropic’s Security Documentation.

Understanding the Costs

Claude Code itself is free software, but using it requires an Anthropic API key that operates on usage-based pricing:

  • Free tier: Limited testing suitable for evaluation
  • Pro plan ($20/month): Reasonable usage for individual data scientists conducting moderate development work
  • Pay-as-you-go: For heavy users working intensively on multiple projects, typically $6-12 daily for active development

Most practitioners doing regular but not continuous development work find the $20 Pro plan provides a good balance between cost and capability. Start with the free tier to evaluate effectiveness on your actual work, then upgrade based on demonstrated value.

Your First Commands

Now that Claude Code is installed and configured, let's walk through basic usage with hands-on examples.

Starting a Claude Code Session

Navigate to a project directory in your terminal:

cd ~/projects/customer_analysis

Launch Claude Code:

claude

You'll see the Claude Code interface with a prompt where you can type natural language instructions.

Understanding Your Project

Before asking Claude Code to make changes, it needs to understand your project context. Try starting with this exploratory command:

Explain the structure of this project and identify the key files.

Claude Code will read through your directory, examine files, and provide a summary of what it found. This shows that Claude Code actively explores and comprehends codebases before acting.

Your First Refactoring Task

Let's demonstrate Claude Code's practical value with a realistic example. Create a simple file called load_data.py with some intentionally messy code:

import pandas as pd

# Load customer data
data = pd.read_csv('/Users/yourname/Desktop/customers.csv')
print(data.head())

This works but has obvious problems: hardcoded absolute path, no error handling, poor variable naming, and no documentation.

Now ask Claude Code to improve it:

Refactor load_data.py to use best practices: configurable paths, error handling, descriptive variable names, and complete docstrings.

Claude Code will analyze the file and propose improvements. Instead of the hardcoded path, you'll get configurable file paths through command-line arguments. The error handling expands to catch missing files, empty files, and CSV parsing errors. Variable names become descriptive (customer_df or customer_data instead of generic data). A complete docstring appears documenting parameters, return values, and potential exceptions. The function adds proper logging to track what's happening during execution.
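The exact code Claude Code proposes will vary from run to run, but the result might look roughly like the sketch below (the function name, argument handling, and messages are illustrative, not Claude Code's literal output):

import argparse
import logging
import sys

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load_customer_data(csv_path: str) -> pd.DataFrame:
    """Load customer data from a CSV file.

    Args:
        csv_path: Path to the customer CSV file.

    Returns:
        A DataFrame containing the customer records.

    Raises:
        SystemExit: If the file is missing, empty, or cannot be parsed.
    """
    try:
        customer_df = pd.read_csv(csv_path)
    except FileNotFoundError:
        logger.error("File not found: %s", csv_path)
        sys.exit(1)
    except pd.errors.EmptyDataError:
        logger.error("File is empty: %s", csv_path)
        sys.exit(1)
    except pd.errors.ParserError as err:
        logger.error("Could not parse %s: %s", csv_path, err)
        sys.exit(1)

    logger.info("Loaded %d rows from %s", len(customer_df), csv_path)
    return customer_df


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Load customer data from a CSV file")
    parser.add_argument("csv_path", help="Path to the customers CSV file")
    args = parser.parse_args()
    print(load_customer_data(args.csv_path).head())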

Claude Code asks your permission before making these changes. Always review its proposal; if it looks good, approve it. If something seems off, ask for modifications or reject the changes entirely. This permission step ensures you stay in control while delegating the mechanical work.

What Just Happened

This demonstrates Claude Code's workflow:

  1. You describe what you want in natural language
  2. Claude Code analyzes the relevant files and context
  3. Claude Code proposes specific changes with explanations
  4. You review and approve or request modifications
  5. Claude Code applies approved changes

The entire refactoring took 90 seconds instead of 20-30 minutes of manual work. More importantly, Claude Code caught details you might have forgotten, such as adding logging, proper type hints, and handling multiple error cases. The permission-based approach ensures you maintain control while delegating implementation work.

Core Commands and Patterns

Claude Code provides several slash (/) commands that control its behavior and help you work more efficiently.

Important Slash Commands

@filename: Reference files directly in your prompts using the @ symbol. Example: @src/preprocessing.py or Explain the logic in @data_loader.py. Claude Code automatically includes the file's content in context. Use tab completion after typing @ to quickly navigate and select files.

/clear: Reset conversation context entirely, removing all history and file references. Use this when switching between different analyses, datasets, or project areas. Accumulated conversation history consumes tokens and can cause Claude Code to inappropriately reference outdated context. Think of /clear as starting a fresh conversation when you switch tasks.

/help: Display available commands and usage information. Useful when you forget command syntax or want to discover capabilities.

Context Management for Data Science Projects

Claude Code has token limits determining how much code it can consider simultaneously. For small projects with a few files, this rarely matters. For larger data science projects with dozens of notebooks and scripts, strategic context management becomes important.

Reference only files relevant to your current task using @filename syntax. If you're working on data validation, reference the validation script and related utilities (like @validation.py and @utils/data_checks.py) but exclude modeling and visualization code that won't influence the current work.

Effective Prompting Patterns

Claude Code responds best to clear, specific instructions. Compare these approaches:

  • Vague: "Make this code better"
    Specific: "Refactor this preprocessing function to handle missing values using median imputation for numerical columns and mode for categorical columns, add error handling for unexpected data types, and include detailed docstrings"
  • Vague: "Add tests"
    Specific: "Create pytest tests for the data_loader function covering successful loading, missing file errors, empty file handling, and malformed CSV detection"
  • Vague: "Fix the pandas error"
    Specific: "Debug the KeyError in line 47 of data_pipeline.py and suggest why it's failing on the 'customer_id' column"

Specific prompts produce focused, useful results. Vague prompts generate generic suggestions that may not address your actual needs.

Iteration and Refinement

Treat Claude Code's initial output as a starting point rather than expecting perfection on the first attempt. Review what it generates, identify improvements needed, and make follow-up requests:

"The validation function you created is good, but it should also check that dates are within reasonable ranges. Add validation that start_date is after 2000-01-01 and end_date is not in the future."

This iterative approach produces better results than attempting to specify every requirement in a single massive prompt.

Advanced Features

Beyond basic commands, several features improve your Claude Code experience for complex work.

  1. Activate plan mode: Press Shift+Tab before sending your prompt to enable plan mode, which creates an explicit execution plan before implementing changes. Use this for workflows with three or more distinct steps—like loading data, preprocessing, and generating outputs. The planning phase helps Claude maintain focus on the overall objective.

  2. Run commands with bash mode: Prefix prompts with an exclamation mark to execute shell commands and inject their output into Claude Code's context:

    ! python analyze_sales.py

    This runs your analysis script and adds its complete output to Claude Code's context. You can then ask questions about the output or request interpretations of the results. This creates a tight feedback loop for iterative data exploration.

  3. Use extended thinking for complex problems: Include "think", "think harder", or "ultrathink" in prompts for thorough analysis:

    think harder: why does my linear regression show high R-squared but poor prediction on validation data?

    Extended thinking produces more careful analysis but takes longer (ultrathink can take several minutes). Apply this when debugging subtle statistical issues or planning sophisticated transformations.

  4. Resume previous sessions: Launch Claude Code with claude --resume to continue your most recent session with complete context preserved, including conversation history, file references, and established conventions. This is valuable for ongoing analyses where you want to pick up where you left off without re-explaining your entire analytical approach.

Optional Power User Setting

For personal projects where you trust all operations, launch with claude --dangerously-skip-permissions to bypass constant approval prompts. This carries risk if Claude Code attempts destructive operations, so use it only on projects where you maintain version control and can recover from mistakes. Never use this on production systems or shared codebases.

Configuring Claude Code for Data Science Projects

The CLAUDE.md file provides project-specific context that improves Claude Code's suggestions by explaining your conventions, requirements, and domain specifics.

Quick Setup with /init

The easiest way to create your CLAUDE.md file is using Claude Code's built-in /init command. From your project directory, launch Claude Code and run:

/init

Claude Code will analyze your project structure and ask you questions about your setup: what kind of project you're working on, your coding conventions, important files and directories, and domain-specific context. It then generates a CLAUDE.md file tailored to your project.

This interactive approach is faster than writing from scratch and ensures you don't miss important details. You can always edit the generated file later to refine it.

Understanding Your CLAUDE.md

Whether you used /init or prefer to write it by hand, here's what a typical CLAUDE.md looks like for a data science project on customer churn. The file lives in your project root directory, uses markdown format, and describes key project information:

# Customer Churn Analysis Project

## Project Overview
Predict customer churn for a telecommunications company using historical
customer data and behavior patterns. The goal is identifying at-risk
customers for proactive retention efforts.

## Data Sources
- **Customer demographics**: data/raw/customer_info.csv
- **Usage patterns**: data/raw/usage_data.csv
- **Churn labels**: data/raw/churn_labels.csv

Expected columns documented in data/schemas/column_descriptions.md

## Directory Structure
- `data/raw/`: Original unmodified data files
- `data/processed/`: Cleaned and preprocessed data ready for modeling
- `notebooks/`: Exploratory analysis and experimentation
- `src/`: Production code for data processing and modeling
- `tests/`: Pytest tests for all src/ modules
- `outputs/`: Generated reports, visualizations, and model artifacts

## Coding Conventions
- Use pandas for data manipulation, scikit-learn for modeling
- All scripts should accept command-line arguments for file paths
- Include error handling for data quality issues
- Follow PEP 8 style guidelines
- Write pytest tests for all data processing functions

## Domain Notes
Churn is defined as customer canceling service within 30 days. We care
more about catching churners (recall) than minimizing false positives
because retention outreach is relatively low-cost.

This upfront investment takes 10-15 minutes but improves every subsequent interaction by giving Claude Code context about your project structure, conventions, and requirements.

Hierarchical Configuration for Complex Projects

CLAUDE.md files can be hierarchical. You might maintain a root-level CLAUDE.md describing overall project structure, plus subdirectory-specific files for different analysis areas.

For example, a project analyzing both customer behavior and financial performance might have:

  • Root CLAUDE.md: General project description, directory structure, and shared conventions
  • customer_analysis/CLAUDE.md: Specific details about customer data sources, relevant metrics like lifetime value and engagement scores, and analytical approaches for behavioral patterns
  • financial_analysis/CLAUDE.md: Financial data sources, accounting principles used, and approaches for revenue and cost analysis

Claude Code prioritizes the most specific configuration, so subdirectory files take precedence when working within those areas.

Custom Slash Commands

For frequently used patterns specific to your workflow, you can create custom slash commands. Create a .claude/commands directory in your project and add markdown files named for each slash command you want to define.

For example, .claude/commands/test.md:

Create pytest tests for: $ARGUMENTS

Requirements:
- Test normal operation with valid data
- Test edge cases: empty inputs, missing values, invalid types
- Test expected exceptions are raised appropriately
- Include docstrings explaining what each test validates
- Use descriptive test names that explain the scenario

Then /test my_preprocessing_function generates tests following your specified patterns.

These custom commands represent optional advanced customization. Start with basic CLAUDE.md configuration, and consider custom commands only after you've identified repetitive patterns in your prompting.

Practical Data Science Applications

Let's see Claude Code in action across some common data science tasks.

1. Data Loading and Validation

Generate robust data loading code with error handling:

Create a data loading function for customer_data.csv that:
- Accepts configurable file paths
- Validates expected columns exist with correct types
- Detects and logs missing value patterns
- Handles common errors like missing files or malformed CSV
- Returns the dataframe with a summary of loaded records

Claude Code generates a function that satisfies all of these requirements: it uses pathlib for cross-platform file paths, wraps loading in try-except blocks for multiple error scenarios, validates that required columns exist in the dataframe, logs detailed information about data quality issues such as missing values, and raises clear exceptions when problems occur. In short, it covers the edge cases you might otherwise forget.
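
The exact output differs from run to run, but a generated loader tends to look something like the sketch below. This is a minimal illustration, not Claude Code's literal output; the expected schema, column names, and file path are hypothetical placeholders:

# Minimal sketch of a generated loader (hypothetical schema and paths)
import logging
from pathlib import Path

import pandas as pd

logger = logging.getLogger(__name__)

# Hypothetical schema: column name -> expected pandas dtype
EXPECTED_COLUMNS = {"customer_id": "int64", "plan_type": "object", "monthly_spend": "float64"}


def load_customer_data(path: str | Path) -> pd.DataFrame:
    """Load the customer CSV, validate its columns, and log data quality issues."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Data file not found: {path}")

    try:
        df = pd.read_csv(path)
    except pd.errors.ParserError as exc:
        raise ValueError(f"Malformed CSV at {path}: {exc}") from exc

    # Validate that required columns exist and have the expected types
    missing_cols = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing expected columns: {sorted(missing_cols)}")
    for col, expected_dtype in EXPECTED_COLUMNS.items():
        if str(df[col].dtype) != expected_dtype:
            logger.warning("Column %s has dtype %s, expected %s", col, df[col].dtype, expected_dtype)

    # Log missing-value patterns instead of failing outright
    null_counts = df[list(EXPECTED_COLUMNS)].isna().sum()
    for col, count in null_counts[null_counts > 0].items():
        logger.warning("Column %s has %d missing values", col, count)

    logger.info("Loaded %d records from %s", len(df), path)
    return df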

2. Exploratory Data Analysis Assistance

Generate EDA code:

Create an EDA script for the customer dataset that generates:
- Distribution plots for numerical features (age, income, tenure)
- Count plots for categorical features (plan_type, region)
- Correlation heatmap for numerical variables
- Summary statistics table
Save all visualizations to outputs/eda/

Claude Code produces a complete analysis script with proper plot styling, figure organization, and file saving—saving 30-45 minutes of matplotlib configuration work.
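
As a rough sketch of what that script looks like (using only pandas and matplotlib, with hypothetical column names and file paths), a stripped-down version might be:

# Minimal EDA sketch (hypothetical columns and paths)
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

OUTPUT_DIR = Path("outputs/eda")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

df = pd.read_csv("data/processed/customer_data.csv")  # hypothetical path

numerical = ["age", "income", "tenure"]
categorical = ["plan_type", "region"]

# Distribution plots for numerical features
for col in numerical:
    fig, ax = plt.subplots()
    df[col].plot.hist(bins=30, ax=ax, title=f"Distribution of {col}")
    fig.savefig(OUTPUT_DIR / f"dist_{col}.png", bbox_inches="tight")
    plt.close(fig)

# Count plots for categorical features
for col in categorical:
    fig, ax = plt.subplots()
    df[col].value_counts().plot.bar(ax=ax, title=f"Counts of {col}")
    fig.savefig(OUTPUT_DIR / f"counts_{col}.png", bbox_inches="tight")
    plt.close(fig)

# Correlation heatmap for numerical variables
fig, ax = plt.subplots()
corr = df[numerical].corr()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(numerical)), numerical, rotation=45)
ax.set_yticks(range(len(numerical)), numerical)
fig.colorbar(im, ax=ax)
fig.savefig(OUTPUT_DIR / "correlation_heatmap.png", bbox_inches="tight")
plt.close(fig)

# Summary statistics table
df[numerical].describe().to_csv(OUTPUT_DIR / "summary_statistics.csv")

The version Claude Code produces is typically longer, with richer styling and labeling, but the overall structure is the same.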

3. Data Preprocessing Pipeline

Build a preprocessing module:

Create preprocessing.py with functions to:
- Handle missing values: median for numerical, mode for categorical
- Encode categorical variables using one-hot encoding
- Scale numerical features using StandardScaler
- Include type hints, docstrings, and error handling

The generated code includes proper sklearn patterns and documentation, and it handles edge cases like unseen categories during transform.
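
A minimal sketch of that pattern, built on standard scikit-learn pipelines with hypothetical column lists, might look like the following; handle_unknown="ignore" on the encoder is what keeps unseen categories from breaking transform:

# Preprocessing sketch (hypothetical column lists)
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERICAL_COLS = ["age", "income", "tenure"]
CATEGORICAL_COLS = ["plan_type", "region"]


def build_preprocessor() -> ColumnTransformer:
    """Return a ColumnTransformer that imputes, encodes, and scales features."""
    numeric_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # handle_unknown="ignore" zero-encodes categories unseen during fit
        ("encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ])
    return ColumnTransformer([
        ("num", numeric_pipeline, NUMERICAL_COLS),
        ("cat", categorical_pipeline, CATEGORICAL_COLS),
    ])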

4. Test Generation

Generate pytest tests:

Create tests for the preprocessing functions covering:
- Successful preprocessing with valid data
- Handling of various missing value patterns
- Error cases like all-missing columns
- Verification that output shapes match expectations

Claude Code generates thorough test coverage including fixtures, parametrized tests, and clear assertions—work that often gets postponed due to tedium.
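
A condensed sketch of those tests is shown below. It assumes the preprocessing sketch above lives in src/preprocessing.py; the module path and fixture data are hypothetical:

# Test sketch (assumes a hypothetical src/preprocessing.py with build_preprocessor)
import numpy as np
import pandas as pd
import pytest

from src.preprocessing import build_preprocessor


@pytest.fixture
def sample_df() -> pd.DataFrame:
    """Small frame with a missing value in each column type."""
    return pd.DataFrame({
        "age": [25, 40, np.nan],
        "income": [50000.0, np.nan, 72000.0],
        "tenure": [12, 3, 48],
        "plan_type": ["basic", None, "premium"],
        "region": ["north", "south", "south"],
    })


def test_valid_data_produces_no_missing_values(sample_df):
    """Imputation should leave no NaNs in the transformed output."""
    transformed = build_preprocessor().fit_transform(sample_df)
    assert not np.isnan(transformed).any()


def test_output_row_count_matches_input(sample_df):
    """The pipeline should preserve the number of rows."""
    transformed = build_preprocessor().fit_transform(sample_df)
    assert transformed.shape[0] == len(sample_df)


@pytest.mark.parametrize("column", ["age", "income"])
def test_all_missing_numeric_column_does_not_crash(sample_df, column):
    """The pipeline should still run when a numerical column is entirely missing."""
    sample_df[column] = np.nan
    transformed = build_preprocessor().fit_transform(sample_df)
    assert transformed.shape[0] == len(sample_df)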

5. Documentation Generation

Add docstrings and project documentation:

Add docstrings to all functions in data_pipeline.py following NumPy style
Create a README.md explaining:
- Project purpose and business context
- Setup instructions for the development environment
- How to run the preprocessing and modeling pipeline
- Description of output artifacts and their interpretation

Generated documentation captures technical details while remaining readable for collaborators.
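
As a quick reference for the NumPy style mentioned above, a documented function (hypothetical, for illustration only) looks like this:

# Hypothetical example of a NumPy-style docstring
def remove_duplicates(df, subset=None):
    """Remove duplicate rows from a dataframe.

    Parameters
    ----------
    df : pandas.DataFrame
        Input data to deduplicate.
    subset : list of str, optional
        Columns to consider when identifying duplicates. If None, all
        columns are used.

    Returns
    -------
    pandas.DataFrame
        A copy of ``df`` with duplicate rows removed.
    """
    return df.drop_duplicates(subset=subset)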

6. Maintaining Analysis Documentation

For complex analyses, use Claude Code to maintain living documentation:

Create analysis_log.md and document our approach to handling missing income data, including:
- The statistical justification for using median imputation rather than deletion
- Why we chose median over mean given the right-skewed distribution we observed
- Validation checks we performed to ensure imputation didn't bias results

This documentation serves dual purposes. First, it provides context for Claude Code in future sessions when you resume work on this analysis, as it explains the preprocessing you applied and why those specific choices were methodologically appropriate. Second, it creates stakeholder-ready explanations communicating both technical implementation and analytical reasoning.

As your analysis progresses, continue documenting key decisions:

Add to analysis_log.md: Explain why we chose random forest over logistic regression after observing significant feature interactions in the correlation analysis, and document the cross-validation approach we used given temporal dependencies in our customer data.

This living documentation approach transforms implicit analytical reasoning into explicit written rationale, increasing both reproducibility and transparency of your data science work.

Common Pitfalls and How to Avoid Them

  • Insufficient context leads to generic suggestions that miss project-specific requirements. Claude Code doesn't automatically know your data schema, project conventions, or domain constraints. Maintain a detailed CLAUDE.md file and reference relevant files using @filename syntax in your prompts.
  • Accepting generated code without review risks introducing bugs or inappropriate patterns. Claude Code produces good starting points but isn't perfect. Treat all output as first drafts requiring validation through testing and inspection, especially for statistical computations or data transformations.
  • Attempting overly complex requests in single prompts produces confused or incomplete results. When you ask Claude Code to "build the entire analysis pipeline from scratch," it gets overwhelmed. Break large tasks into focused steps—first create data loading, then validation, then preprocessing—building incrementally toward the desired outcome.
  • Ignoring error messages when Claude Code encounters problems prevents identifying root causes. Read errors carefully and ask Claude Code for specific debugging assistance: "The preprocessing function failed with KeyError on 'customer_id'. What might cause this and how should I fix it?"

Understanding Claude Code's Limitations

Setting realistic expectations about what Claude Code cannot do well builds trust through transparency.

Domain-specific understanding requires your input. Claude Code generates code based on patterns and best practices but cannot validate whether analytical approaches are appropriate for your research questions or business problems. You must provide domain expertise and methodological judgment.

Subtle bugs can slip through. Generated code for advanced statistical methods, custom loss functions, or intricate data transformations requires careful validation. Always test generated code thoroughly against known-good examples.

Large project understanding is limited. Claude Code works best on focused tasks within individual files rather than system-wide refactoring across complex architectures with dozens of interconnected files.

Edge cases may not be handled. Preprocessing code might handle clean training data perfectly but break on production data with unexpected null patterns or outlier distributions that weren't present during development.

Expertise is not replaceable. Claude Code accelerates implementation but does not replace fundamental understanding of data science principles, statistical methods, or domain knowledge.

Security Considerations

When Claude Code accesses external data sources, malicious actors could potentially embed instructions in data that Claude Code interprets as commands. This concern is known as prompt injection.

Maintain skepticism about Claude Code suggestions when working with untrusted external sources. Never grant Claude Code access to production databases, sensitive customer information, or critical systems without careful review of proposed operations.

For most data scientists working with internal datasets and trusted sources, this risk remains theoretical, but awareness becomes important as you expand usage into more automated workflows.

Frequently Asked Questions

How much does Claude Code cost for typical data science usage?

Claude Code itself is free to install, but it requires an Anthropic API key with usage-based pricing. The free tier allows limited testing suitable for evaluation. The Pro plan at $20/month handles moderate daily development: generating preprocessing code, debugging errors, refactoring functions. Heavy users working intensively on multiple projects may prefer pay-as-you-go pricing, typically $6-12 daily for active development. Start with the free tier to evaluate effectiveness, then upgrade based on value.

Does Claude Code work with Jupyter notebooks?

Claude Code operates as a command-line tool and works best with Python scripts and modules. For Jupyter notebooks, use Claude Code to build utility modules that your notebooks import, creating cleaner separation between exploratory analysis and reusable logic. You can also copy code cells into Python files, improve them with Claude Code, then bring the enhanced code back to the notebook.

Can Claude Code access my data files or databases?

Claude Code reads files you explicitly include through context and generates code that loads data from files or databases. It doesn't automatically access your data without instruction. You maintain full control over what files and information Claude Code can see. When you ask Claude Code to analyze data patterns, it reads the data through code execution, not by directly accessing databases or files independently.

How does Claude Code compare to GitHub Copilot?

GitHub Copilot provides inline code suggestions as you type within an IDE, excelling at completing individual lines or functions. Claude Code offers more substantial assistance with entire file transformations, debugging sessions, and refactoring through conversational interaction. Many practitioners use both—Copilot for writing code interactively, Claude Code for larger refactoring and debugging work. They complement each other rather than compete.

Next Steps

You now have Claude Code installed, understand its capabilities and limitations, and have seen concrete examples of how it handles data science tasks.

Start by using Claude Code for low-risk tasks where mistakes are easily corrected: generating documentation for existing functions, creating test cases for well-understood code, or refactoring non-critical utility scripts. This builds confidence without risking important work. Gradually increase complexity as you become comfortable.

Maintain a personal collection of effective prompts for data science tasks you perform regularly. When you discover a prompt pattern that produces excellent results, save it for reuse. This accelerates work on similar future tasks.

For technical details and advanced features, explore Anthropic's Claude Code documentation. The official docs cover advanced topics like Model Context Protocol servers, custom hooks, and integration patterns.

To systematically learn generative AI across your entire practice, check out our Generative AI Fundamentals in Python skill path. For deeper understanding of effective prompt design, our Prompting Large Language Models in Python course teaches frameworks for crafting prompts that consistently produce useful results.

Getting Started

AI-assisted development requires practice and iteration. You'll experience some awkwardness as you learn to communicate effectively with Claude Code, but this learning curve is brief. Most practitioners feel productive within their first week of regular use.

Install Claude Code, work through the examples in this tutorial with your own projects, and discover how AI assistance fits into your workflow.


Have questions or want to share your Claude Code experience? Join the discussion in the Dataquest Community where thousands of data scientists are exploring AI-assisted development together.
