Chapter 3: Exploring Data With Pandas
In this chapter we begin working with and manipulating DataFrames in order to pull useful information out of them.
3.1 Conditional Selection
Conditional selection means filtering your table based on a rule. In the NBA player dataset, you might want to ask:
- Which players are over 6'10"?
- Which players average more than 20 points per game?
- Which players entered the league after 2015?
Load the Dataset
import pandas as pd
nba = pd.read_csv("data/nba.csv")
nba.head()
        player_name team_abbreviation  player_height  player_weight  ...  draft_year   pts
0  Randy Livingston               HOU         193.04      94.800728  ...        1996   3.9
1  Gaylon Nickerson               WAS         190.50      86.182480  ...        1994   3.8
2      George Lynch               VAN         203.20     103.418976  ...        1993   8.3
3    George McCloud               LAL         203.20     102.058200  ...        1989  10.2
4      George Zidek               DEN         213.36     119.748288  ...        1995   2.8
Boolean Arrays
A boolean array is a list of True/False values used to filter rows.
first_5 = nba.head(5)
first_5[[True, False, True, False, True]]
        player_name team_abbreviation  player_height
0  Randy Livingston               HOU         193.04
2      George Lynch               VAN         203.20
4      George Zidek               DEN         213.36
Filter by Condition
nba[nba["player_height"] > 208.28].head()
          player_name team_abbreviation  player_height
4        George Zidek               DEN         213.36
5    Gheorghe Muresan               WAS         231.14
62  Priest Lauderdale               ATL         224.79
84      Shawn Bradley               PHI         226.06
108        Manute Bol               PHI         231.14
3.2 Combining Conditions
Use bitwise operators to combine multiple conditions:
- ~ (NOT)
- & (AND)
- | (OR)
Always wrap each condition in parentheses: & and | bind more tightly than comparisons such as >, so leaving the parentheses out changes how the expression is evaluated. Examples of all three operators follow.
Example: Players drafted after 2010 who scored over 20 points per game
nba[(nba["draft_year"] > 2010) & (nba["pts"] > 20)].head()
          player_name  draft_year   pts
1091      Luka Doncic        2018  27.4
1221       Trae Young        2018  25.3
1232        Ja Morant        2019  25.0
1233  Zion Williamson        2019  22.5
1269  Anthony Edwards        2020  21.3
Example: Using OR
nba[(nba["pts"] > 25) | (nba["draft_year"] < 2000)].head()
        player_name  draft_year   pts
0  Randy Livingston        1996   3.9
1  Gaylon Nickerson        1994   3.8
2      George Lynch        1993   8.3
3    George McCloud        1989  10.2
4      George Zidek        1995   2.8
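Example: Using NOT
The ~ operator was not demonstrated above; as a minimal sketch, it inverts a condition. For instance, to keep players who were not drafted before 2000:
# ~ flips every True/False value in the boolean array
nba[~(nba["draft_year"] < 2000)].head()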
3.3 Advanced Filtering
Using .isin()
players = ["LeBron James", "Kevin Durant", "Luka Doncic"]
nba[nba["player_name"].isin(players)]
       player_name team_abbreviation   pts
0     LeBron James               LAL  27.1
1091   Luka Doncic               DAL  27.4
1174  Kevin Durant               PHX  26.0
Filtering by String
nba[nba["player_name"].str.startswith("J")].head()
       player_name team_abbreviation
6    James Edwards               CHI
10  James Robinson               POR
15  Jamal Mashburn               MIA
17    James Worthy               LAL
20      Jason Kidd               DAL
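The .str accessor supports other string tests as well. As a sketch (not part of the original walkthrough), .str.contains() matches a substring anywhere in the name rather than only at the start:
# Names containing "James" anywhere, not only at the beginning
nba[nba["player_name"].str.contains("James")].head()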
3.4 Column Operations
Add a Column
nba["Name Length"] = nba["player_name"].str.len()
nba["Name Length"].head()
0    16
1    17
2    13
3    16
4    13
Modify and Rename
nba["Name Length"] = nba["Name Length"] - 1
nba = nba.rename(columns={"Name Length": "Name_Length"})
Drop a Column
nba = nba.drop("Name_Length", axis="columns")
Note: methods like rename and drop return a new DataFrame rather than modifying the original, so the change is only kept if you assign the result back to the variable, as above.
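For example, calling a method without assigning its result leaves nba unchanged; a minimal sketch:
nba.rename(columns={"pts": "points"})  # returns a modified copy
"points" in nba.columns                # still False -- nba itself keeps the old name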
3.5 Using NumPy with pandas
You can apply NumPy functions directly on Series:
import numpy as np
ppg = nba[nba["player_name"] == "Stephen Curry"]["pts"]
np.mean(ppg), np.max(ppg)
(24.7, 24.7)
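Equivalently, a pandas Series has its own aggregation methods, so NumPy is optional here; a minimal sketch:
# The same statistics via the Series' built-in methods
ppg.mean(), ppg.max()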
3.6 Quick Info & Stats
Shape and Size
nba.shape
(12844, 22)
nba.size
282568
Descriptive Stats
nba.describe()
                age  player_height  player_weight           pts
count  12844.000000   12844.000000   12844.000000  12844.000000
mean      26.558969     200.691805      98.458500      6.410342
std        5.124638       8.901298      13.541020      5.969174
min       19.000000     160.020000      59.874720      0.000000
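.describe() also works on a single column when only one variable matters; a sketch (output omitted):
# Summary statistics for points per game only
nba["pts"].describe()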
3.7 Sampling & Frequency
Random Samples
nba.sample(3)
         player_name team_abbreviation
5629       Joe Smith               MIN
10284     Chris Paul               HOU
6555   Greg Ostertag               UTA
Value Counts
nba["college"].value_counts().head()
Kentucky          128
Duke              120
North Carolina    115
Kansas            100
Arizona            99
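If shares are more useful than raw counts, value_counts accepts a normalize argument; a minimal sketch:
# Proportion of rows per college instead of raw counts
nba["college"].value_counts(normalize=True).head()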
Unique Values
nba["team_abbreviation"].unique()
['HOU' 'WAS' 'VAN' 'LAL' 'DEN' 'UTA' 'PHI' 'PHX' 'MIA' ... ]
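To count the distinct values rather than list them, use .nunique(); a sketch:
# Number of distinct team abbreviations
nba["team_abbreviation"].nunique()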
Top Scorers
nba.sort_values(by="pts", ascending=False).head()
       player_name   pts
1091   Luka Doncic  27.4
0     LeBron James  27.1
1174  Kevin Durant  26.0
1221    Trae Young  25.3
1232     Ja Morant  25.0
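A shortcut for the same "top n" question is .nlargest(), which returns the n rows with the largest values in a column; a minimal sketch:
# Five highest-scoring rows without sorting the whole frame explicitly
nba.nlargest(5, "pts")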
In this chapter, we practiced filtering rows, updating columns, and applying key pandas functions to explore NBA data. These skills are foundational to finding insights in any dataset.