8  Pandas for AI Engineers

9 Why Pandas Matters

Pandas is one of the core tools for working with tabular data in Python.

For AI engineers, pandas is rarely the final destination.

It is the layer where raw information becomes usable system input.

That means pandas work is usually about:

  • loading data reliably
  • inspecting shape and quality
  • selecting useful subsets
  • deriving features
  • aggregating behavior
  • joining information from multiple sources
  • handling timestamps and text before downstream modeling

The official pandas getting-started guide organizes the learning path around those exact tasks, and this chapter follows that same progression in a book-friendly form.

Source: pandas getting started

9.1 Mental Model

In pandas, the main abstraction is the DataFrame.

You can think of it as:

  • a table with named columns
  • an index that identifies rows
  • vectorized operations over whole columns
  • a bridge between raw files and analytical or modeling code

The key shift is to stop thinking row by row.

Most good pandas code works column-wise and table-wise.

9.2 1. Creating And Inspecting Tables

Start with small, explicit tables so you can see the abstraction clearly.

import pandas as pd

customers = pd.DataFrame(
    {
        "customer_id": [101, 102, 103, 104],
        "plan": ["free", "pro", "team", "pro"],
        "monthly_spend": [0, 29, 99, 29],
        "active": [True, True, False, True],
    }
)

customers
customer_id plan monthly_spend active
0 101 free 0 True
1 102 pro 29 True
2 103 team 99 False
3 104 pro 29 True

Useful first questions:

  • What are the column names?
  • How many rows and columns are present?
  • What data types did pandas infer?
  • Are there missing values?
customers.shape
(4, 4)
customers.dtypes
customer_id      int64
plan               str
monthly_spend    int64
active            bool
dtype: object
customers.info()
<class 'pandas.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   customer_id    4 non-null      int64
 1   plan           4 non-null      str  
 2   monthly_spend  4 non-null      int64
 3   active         4 non-null      bool 
dtypes: bool(1), int64(2), str(1)
memory usage: 232.0 bytes

9.3 2. Reading And Writing Data

The pandas docs emphasize read_* functions and to_* methods as the main I/O interface.

That is a strong pattern to internalize:

  • input usually starts with pd.read_*
  • output usually ends with .to_*()

Common examples include:

  • pd.read_csv(...)
  • pd.read_json(...)
  • pd.read_parquet(...)
  • df.to_csv(...)

For this repo, the most common beginner workflow will likely be CSV-based.

from io import StringIO

events = pd.DataFrame(
    {
        "event_id": [1, 2, 3],
        "event_type": ["click", "purchase", "click"],
        "value": [0.0, 120.0, 0.0],
    }
)

buffer = StringIO()
events.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)
loaded
event_id event_type value
0 1 click 0.0
1 2 purchase 120.0
2 3 click 0.0

9.4 3. Selecting Subsets

Selection is one of the first places where pandas becomes expressive.

customers["plan"]
0    free
1     pro
2    team
3     pro
Name: plan, dtype: str
customers[["customer_id", "monthly_spend"]]
customer_id monthly_spend
0 101 0
1 102 29
2 103 99
3 104 29
customers.loc[customers["monthly_spend"] > 0, ["customer_id", "plan", "monthly_spend"]]
customer_id plan monthly_spend
1 102 pro 29
2 103 team 99
3 104 pro 29

Use these patterns often:

  • df["col"] for one column
  • df[["a", "b"]] for multiple columns
  • df.loc[row_filter, columns] for label-based selection
  • df.iloc[...] for position-based selection

9.5 4. Creating Derived Columns

The official tutorial highlights derived columns because they are central to feature engineering.

customers["annualized_spend"] = customers["monthly_spend"] * 12
customers["is_paid"] = customers["monthly_spend"] > 0

customers
customer_id plan monthly_spend active annualized_spend is_paid
0 101 free 0 True 0 False
1 102 pro 29 True 348 True
2 103 team 99 False 1188 True
3 104 pro 29 True 348 True

This matters for AI work because features are often transformations, not raw inputs.

9.6 5. Summary Statistics And Grouped Analysis

Before training or evaluating anything, summarize the data.

customers["monthly_spend"].describe()
count     4.000000
mean     39.250000
std      42.113933
min       0.000000
25%      21.750000
50%      29.000000
75%      46.500000
max      99.000000
Name: monthly_spend, dtype: float64
customers.groupby("plan")["monthly_spend"].agg(["count", "mean", "max"])
count mean max
plan
free 1 0.0 0
pro 2 29.0 29
team 1 99.0 99

groupby is one of the most important pandas tools because it turns raw tables into behavioral summaries.

9.7 6. Reshaping Tables

Real systems often receive data in awkward formats.

Pandas gives you tools to reshape tables for analysis or modeling.

metrics = pd.DataFrame(
    {
        "model": ["baseline", "baseline", "tuned", "tuned"],
        "metric": ["accuracy", "f1", "accuracy", "f1"],
        "score": [0.81, 0.77, 0.86, 0.83],
    }
)

metrics
model metric score
0 baseline accuracy 0.81
1 baseline f1 0.77
2 tuned accuracy 0.86
3 tuned f1 0.83
metrics.pivot(index="model", columns="metric", values="score")
metric accuracy f1
model
baseline 0.81 0.77
tuned 0.86 0.83

9.8 7. Combining Tables

Data rarely lives in one clean table.

You often need to join metadata, transactions, labels, or model outputs.

profiles = pd.DataFrame(
    {
        "customer_id": [101, 102, 103, 104],
        "region": ["US", "IN", "US", "DE"],
    }
)

spend = pd.DataFrame(
    {
        "customer_id": [101, 102, 104],
        "monthly_spend": [0, 29, 29],
    }
)

profiles.merge(spend, on="customer_id", how="left")
customer_id region monthly_spend
0 101 US 0.0
1 102 IN 29.0
2 103 US NaN
3 104 DE 29.0

The default mental model:

  • use merge for database-style joins
  • use concat to stack tables by rows or columns

9.9 8. Time Series Basics

Time is everywhere in AI systems:

  • logs
  • events
  • monitoring
  • feature histories
  • forecasting inputs
traffic = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2026-01-01 09:00:00",
                "2026-01-01 10:00:00",
                "2026-01-01 11:00:00",
            ]
        ),
        "requests": [120, 135, 150],
    }
)

traffic["hour"] = traffic["timestamp"].dt.hour
traffic
timestamp requests hour
0 2026-01-01 09:00:00 120 9
1 2026-01-01 10:00:00 135 10
2 2026-01-01 11:00:00 150 11

Once a column is datetime-aware, pandas gives you rich .dt accessors for extracting useful temporal features.

9.10 9. Text Handling Basics

Text cleaning is another common preprocessing step.

reviews = pd.DataFrame(
    {
        "review": [
            "Great product!",
            "delivery delayed",
            "Works well for small teams",
        ]
    }
)

reviews["review_lower"] = reviews["review"].str.lower()
reviews["contains_delivery"] = reviews["review_lower"].str.contains("delivery")
reviews
review review_lower contains_delivery
0 Great product! great product! False
1 delivery delayed delivery delayed True
2 Works well for small teams works well for small teams False

The .str accessor makes string operations feel vectorized, just like numeric column operations.

9.11 10. Plotting For Fast Inspection

The pandas getting-started guide includes plotting early because plots are useful for quick feedback loops.

sales = pd.DataFrame(
    {
        "month": ["Jan", "Feb", "Mar", "Apr"],
        "revenue": [10, 12, 15, 18],
    }
)

sales.plot(x="month", y="revenue", kind="bar", title="Monthly Revenue");

For serious visualization work, teams often move to Matplotlib, Seaborn, or Plotly.

But pandas plotting is perfect for fast inspection.

9.12 Common Beginner Mistakes

  • Writing row-by-row loops when vectorized column operations would be simpler
  • Forgetting to inspect dtypes before debugging results
  • Using chained indexing instead of clear loc selection
  • Merging tables without first checking key uniqueness
  • Treating missing values as a cosmetic issue instead of a data contract issue

9.13 Suggested Learning Path

If you want to learn pandas deeply, practice in this order:

  1. Create and inspect DataFrame objects
  2. Load and save CSV data
  3. Filter rows and choose columns with loc
  4. Add derived columns
  5. Use groupby and aggregations
  6. Reshape with pivot and melt
  7. Join tables with merge
  8. Work with datetime and string columns

9.14 Practice

The following notebooks are included in this repo as hands-on exercises:

9.15 Final Takeaway

Pandas is not just a library for tables.

It is a thinking tool for turning messy inputs into reliable structure.

That makes it foundational for data pipelines, feature work, evaluation slices, and AI system debugging.