Chapter 4: Aggregating the Data Together
In many data science tasks, we want to split our data into groups and calculate summary statistics for each group. This is called the split–apply–combine pattern:
- Split the data into groups
- Apply a function to each group
- Combine the results into a single table
We do this in pandas using the .groupby()
method.
Grouping and Aggregating
Let’s start with a basic example: how many players came from each college. This tells us which colleges produced the most NBA players.
nba.groupby("college")["player_name"].count().sort_values(ascending=False).head()
college Kentucky 128 Duke 120 North Carolina 115 Kansas 100 Arizona 99
Average Points Per Team
What’s the average number of points scored per player on each team?
nba.groupby("team_abbreviation")["pts"].mean().sort_values(ascending=False).head()
team_abbreviation DAL 9.6 GSW 9.1 LAL 8.7 PHI 8.4 CHI 8.3
Maximum Height Per Team
Let’s look at the tallest player from each team:
nba.groupby("team_abbreviation")["player_height"].max().sort_values(ascending=False).head()
team_abbreviation WAS 231.14 PHI 231.14 BOS 228.60 GSW 228.60 SAC 228.60
Multiple Aggregations
You can apply multiple functions at once using .agg()
. For example, you might want to see the mean, max, and min points per team:
nba.groupby("team_abbreviation")["pts"].agg(["mean", "max", "min"]).head()
mean max min team_abbreviation ATL 6.22 24.7 0.0 BOS 6.00 22.4 0.0 CHI 8.33 26.5 0.0 CLE 5.91 19.7 0.0 DAL 9.61 27.4 0.0
Grouping by Multiple Columns
You can also group by more than one column. Here’s an example that shows average points per player by college and team:
nba.groupby(["college", "team_abbreviation"])["pts"].mean().head()
college team_abbreviation Alabama LAL 7.1 PHI 6.2 Arizona CHI 9.3 DAL 7.6 PHX 6.4
Resetting the Index
.groupby()
returns a DataFrame with grouped values as the index. To convert them back to columns, use .reset_index()
:
avg_pts = nba.groupby("college")["pts"].mean().reset_index() avg_pts.head()
college pts 0 Alabama 6.2 1 Arizona 7.4 2 Arkansas 6.9 3 BYU 5.3 4 Cincinnati 6.1
Custom Aggregation with Named Columns
You can assign names to the results of multiple aggregations using a dictionary-like syntax:
nba.groupby("college").agg( Average_PPG=("pts", "mean"), Max_PPG=("pts", "max"), Player_Count=("player_name", "count") ).reset_index().head()
college Average_PPG Max_PPG Player_Count 0 Alabama 6.2 18.1 22 1 Arizona 7.4 24.1 40 2 Arkansas 6.9 21.7 27 3 BYU 5.3 12.8 9 4 Cincinnati 6.1 19.4 18
Filtering After Grouping
Once you’ve aggregated your data, you can filter it just like a regular DataFrame. For example, only show colleges with average PPG above 10:
grouped = nba.groupby("college")["pts"].mean().reset_index() grouped[grouped["pts"] > 10]
college pts 12 Davidson 24.7 18 Duke 10.3 21 Georgetown 10.1
.value_counts() vs .groupby()
.value_counts()
is the fastest way to count the number of values in a column:
nba["team_abbreviation"].value_counts().head()
LAL 152 BOS 150 PHI 147 ATL 144 CHI 139
If you need more flexibility or want to summarize something else, use .groupby()
:
nba.groupby("team_abbreviation")["pts"].mean().head()
team_abbreviation ATL 6.22 BOS 6.00 CHI 8.33 CLE 5.91 DAL 9.61
Pivot Tables
Pivot tables let you reshape your data quickly. Think of them as a summary grid:
- Index: rows (e.g. college)
- Columns: columns (e.g. team)
- Values: what we want to summarize (e.g. average points)
pivot = nba.pivot_table( values="pts", index="college", columns="team_abbreviation", aggfunc="mean" ) pivot.head()
team_abbreviation ATL BOS CHI DAL DEN college Alabama 7.0 NaN 6.1 8.2 7.4 Arizona NaN 7.4 9.3 7.6 6.5 Arkansas 6.4 7.2 NaN NaN 7.8
This is especially useful when you want to compare two categories at once.