Chapter 4: Aggregating the Data Together

In many data science tasks, we want to split our data into groups and calculate summary statistics for each group. This is called the split–apply–combine pattern:

  • Split the data into groups
  • Apply a function to each group
  • Combine the results into a single table

We do this in pandas using the .groupby() method.

Grouping and Aggregating

Let’s start with a basic example: how many players came from each college. This tells us which colleges produced the most NBA players.

nba.groupby("college")["player_name"].count().sort_values(ascending=False).head()
college
Kentucky           128
Duke               120
North Carolina     115
Kansas             100
Arizona             99

Average Points Per Team

What’s the average number of points scored per player on each team?

nba.groupby("team_abbreviation")["pts"].mean().sort_values(ascending=False).head()
team_abbreviation
DAL    9.6
GSW    9.1
LAL    8.7
PHI    8.4
CHI    8.3

Maximum Height Per Team

Let’s look at the tallest player from each team:

nba.groupby("team_abbreviation")["player_height"].max().sort_values(ascending=False).head()
team_abbreviation
WAS    231.14
PHI    231.14
BOS    228.60
GSW    228.60
SAC    228.60

Multiple Aggregations

You can apply multiple functions at once using .agg(). For example, you might want to see the mean, max, and min points per team:

nba.groupby("team_abbreviation")["pts"].agg(["mean", "max", "min"]).head()
                     mean   max  min
team_abbreviation                    
ATL                 6.22  24.7  0.0
BOS                 6.00  22.4  0.0
CHI                 8.33  26.5  0.0
CLE                 5.91  19.7  0.0
DAL                 9.61  27.4  0.0

Grouping by Multiple Columns

You can also group by more than one column. Here’s an example that shows average points per player by college and team:

nba.groupby(["college", "team_abbreviation"])["pts"].mean().head()
college       team_abbreviation
Alabama       LAL                  7.1
              PHI                  6.2
Arizona       CHI                  9.3
              DAL                  7.6
              PHX                  6.4

Resetting the Index

.groupby() returns a DataFrame with grouped values as the index. To convert them back to columns, use .reset_index():

avg_pts = nba.groupby("college")["pts"].mean().reset_index()
avg_pts.head()
         college      pts
0        Alabama      6.2
1        Arizona      7.4
2       Arkansas     6.9
3           BYU       5.3
4      Cincinnati    6.1

Custom Aggregation with Named Columns

You can assign names to the results of multiple aggregations using a dictionary-like syntax:

nba.groupby("college").agg(
    Average_PPG=("pts", "mean"),
    Max_PPG=("pts", "max"),
    Player_Count=("player_name", "count")
).reset_index().head()
        college     Average_PPG   Max_PPG   Player_Count
0       Alabama           6.2       18.1             22
1       Arizona           7.4       24.1             40
2      Arkansas           6.9       21.7             27
3           BYU           5.3       12.8              9
4    Cincinnati           6.1       19.4             18

Filtering After Grouping

Once you’ve aggregated your data, you can filter it just like a regular DataFrame. For example, only show colleges with average PPG above 10:

grouped = nba.groupby("college")["pts"].mean().reset_index()
grouped[grouped["pts"] > 10]
         college      pts
12     Davidson     24.7
18     Duke         10.3
21     Georgetown   10.1

.value_counts() vs .groupby()

.value_counts() is the fastest way to count the number of values in a column:

nba["team_abbreviation"].value_counts().head()
LAL    152
BOS    150
PHI    147
ATL    144
CHI    139

If you need more flexibility or want to summarize something else, use .groupby():

nba.groupby("team_abbreviation")["pts"].mean().head()
team_abbreviation
ATL    6.22
BOS    6.00
CHI    8.33
CLE    5.91
DAL    9.61

Pivot Tables

Pivot tables let you reshape your data quickly. Think of them as a summary grid:

  • Index: rows (e.g. college)
  • Columns: columns (e.g. team)
  • Values: what we want to summarize (e.g. average points)
pivot = nba.pivot_table(
    values="pts",
    index="college",
    columns="team_abbreviation",
    aggfunc="mean"
)
pivot.head()
team_abbreviation   ATL   BOS   CHI   DAL   DEN
college                                        
Alabama              7.0   NaN   6.1   8.2   7.4
Arizona              NaN   7.4   9.3   7.6   6.5
Arkansas             6.4   7.2   NaN   NaN   7.8

This is especially useful when you want to compare two categories at once.