Chapter 3: Exploring Data With Pandas

In this chapter we start working with DataFrames: filtering rows, modifying columns, and summarizing values to pull useful information out of a dataset.

3.1 Conditional Selection

Conditional selection means filtering your table based on a rule. In the NBA player dataset, you might want to ask:

  • Which players are over 6'10"?
  • Which players average more than 20 points per game?
  • Which players entered the league after 2015?

Load the Dataset

import pandas as pd
nba = pd.read_csv("data/nba.csv")
nba.head()
  player_name      team_abbreviation  player_height  player_weight  ...  draft_year  pts
0  Randy Livingston         HOU            193.04          94.800728  ...     1996     3.9
1  Gaylon Nickerson         WAS            190.50          86.182480  ...     1994     3.8
2      George Lynch         VAN            203.20         103.418976  ...     1993     8.3
3    George McCloud         LAL            203.20         102.058200  ...     1989    10.2
4      George Zidek         DEN            213.36         119.748288  ...     1995     2.8

Boolean Arrays

A boolean array is a list of True/False values, one per row. pandas keeps the rows marked True and drops the rows marked False.

first_5 = nba.head(5)
first_5[[True, False, True, False, True]]
  player_name      team_abbreviation  player_height
0  Randy Livingston         HOU            193.04
2      George Lynch         VAN            203.20
4      George Zidek         DEN            213.36
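In practice the True/False list is rarely typed by hand; a comparison on a column produces it. The sketch below rebuilds the five rows shown above as a toy frame (an assumption made so the example runs without data/nba.csv) and uses an arbitrary cutoff of 200 cm.

```python
import pandas as pd

# Toy frame copied from the nba.head() output; not the full dataset.
first_5 = pd.DataFrame({
    "player_name": ["Randy Livingston", "Gaylon Nickerson", "George Lynch",
                    "George McCloud", "George Zidek"],
    "player_height": [193.04, 190.50, 203.20, 203.20, 213.36],
})

# A comparison on a column yields a boolean Series with one value per row.
mask = first_5["player_height"] > 200
print(mask.tolist())  # [False, False, True, True, True]

# Passing the mask in square brackets keeps only the rows marked True.
tall = first_5[mask]
print(tall["player_name"].tolist())
```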

Filter by Condition

# player_height is stored in centimeters; 6'10" is 82 inches, or 208.28 cm
nba[nba["player_height"] > 208.28].head()
     player_name     team_abbreviation  player_height
4   George Zidek            DEN            213.36
5     Gheorghe Muresan      WAS            231.14
62     Priest Lauderdale   ATL            224.79
84       Shawn Bradley      PHI            226.06
108    Manute Bol           PHI            231.14
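The same filter can be written with the mask stored in a named variable and the cutoff computed rather than typed as a literal. This is a minimal sketch: feet_inches_to_cm is a hypothetical helper, and the three-row frame is a toy stand-in built from rows shown in this chapter.

```python
import pandas as pd

CM_PER_INCH = 2.54

def feet_inches_to_cm(feet, inches):
    """Hypothetical helper: convert a height in feet and inches to centimeters."""
    return (feet * 12 + inches) * CM_PER_INCH

threshold_cm = feet_inches_to_cm(6, 10)  # roughly 208.28

# Toy stand-in for the nba DataFrame.
heights = pd.DataFrame({
    "player_name": ["George McCloud", "George Zidek", "Gheorghe Muresan"],
    "player_height": [203.20, 213.36, 231.14],
})

# Name the mask, then use .loc to filter rows and pick a column in one step.
is_tall = heights["player_height"] > threshold_cm
tall_names = heights.loc[is_tall, "player_name"]
print(tall_names.tolist())
```

Naming the mask makes longer filters easier to read and lets you reuse the same condition later.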

3.2 Combining Conditions

Use bitwise operators to combine multiple conditions:

  • ~ NOT
  • & AND
  • | OR

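Always wrap each condition in parentheses. The operators &, | and ~ bind more tightly than comparisons such as > and ==, so without parentheses Python groups the expression in the wrong order.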
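The parentheses matter because of operator precedence. The sketch below uses a toy frame (three rows from this chapter plus one made-up row, labeled as such) to show the parenthesized form working and the unparenthesized form failing. The exact exception type is not guaranteed here, since it depends on the column dtypes and the pandas version.

```python
import pandas as pd

# Toy frame; "Hypothetical Rookie" is invented for illustration.
toy = pd.DataFrame({
    "player_name": ["Randy Livingston", "Luka Doncic", "Ja Morant",
                    "Hypothetical Rookie"],
    "draft_year": [1996, 2018, 2019, 2015],
    "pts": [3.9, 27.4, 25.0, 5.0],
})

# With parentheses: each comparison is evaluated first, then combined with &.
ok = toy[(toy["draft_year"] > 2010) & (toy["pts"] > 20)]
print(ok["player_name"].tolist())

# Without parentheses Python evaluates 2010 & toy["pts"] first,
# which is not the intended condition and raises an exception.
try:
    toy[toy["draft_year"] > 2010 & toy["pts"] > 20]
    failed = False
except (TypeError, ValueError) as exc:
    failed = True
    print(type(exc).__name__)
```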

Example: Players drafted after 2010 who scored over 20 points per game

nba[(nba["draft_year"] > 2010) & (nba["pts"] > 20)].head()
     player_name     draft_year   pts
1091  Luka Doncic        2018     27.4
1221  Trae Young         2018     25.3
1232  Ja Morant          2019     25.0
1233  Zion Williamson    2019     22.5
1269  Anthony Edwards    2020     21.3

Example: Players who averaged more than 25 points per game or were drafted before 2000

nba[(nba["pts"] > 25) | (nba["draft_year"] < 2000)].head()
     player_name     draft_year   pts
0    Randy Livingston    1996     3.9
1    Gaylon Nickerson    1994     3.8
2    George Lynch        1993     8.3
3    George McCloud      1989    10.2
4    George Zidek        1995     2.8
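The ~ operator from the list above inverts a mask, turning every True into False and vice versa. A minimal sketch on a toy frame built from rows shown in this chapter:

```python
import pandas as pd

# Toy stand-in for the nba DataFrame.
toy = pd.DataFrame({
    "player_name": ["Randy Livingston", "Luka Doncic", "George Lynch", "Trae Young"],
    "pts": [3.9, 27.4, 8.3, 25.3],
})

# ~ flips the mask, so this keeps players who did NOT score over 20.
not_high = toy[~(toy["pts"] > 20)]
print(not_high["player_name"].tolist())

# With no missing values in pts, this is the same as filtering on pts <= 20.
same = toy[toy["pts"] <= 20]
```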

3.3 Advanced Filtering

Using .isin()

players = ["LeBron James", "Kevin Durant", "Luka Doncic"]
nba[nba["player_name"].isin(players)]
     player_name      team_abbreviation  pts
0   LeBron James         LAL            27.1
1091 Luka Doncic         DAL            27.4
1174 Kevin Durant        PHX            26.0
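One way to think about .isin() is as shorthand for a chain of equality tests joined with |. The sketch below checks that equivalence on a toy Series of names taken from this chapter, and shows ~ used to exclude the listed players instead.

```python
import pandas as pd

players = ["LeBron James", "Kevin Durant", "Luka Doncic"]

# Toy Series standing in for nba["player_name"].
names = pd.Series(["LeBron James", "George Lynch", "Luka Doncic",
                   "Kevin Durant", "Jason Kidd"])

via_isin = names.isin(players)
via_or = ((names == "LeBron James")
          | (names == "Kevin Durant")
          | (names == "Luka Doncic"))
print(via_isin.equals(via_or))

# Negating the mask keeps everyone NOT in the list.
others = names[~names.isin(players)]
print(others.tolist())
```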

Filtering by String

nba[nba["player_name"].str.startswith("J")].head()
    player_name       team_abbreviation
6   James Edwards           CHI
10  James Robinson          POR
15  Jamal Mashburn          MIA
17  James Worthy            LAL
20  Jason Kidd              DAL
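Other .str methods filter the same way. The sketch below uses .str.contains() for a case-insensitive substring match and passes na=False so that missing names count as non-matches. The Series is a toy example, and the missing value is included only to show why na=False can be useful.

```python
import pandas as pd

# Toy Series standing in for nba["player_name"], with one missing value.
names = pd.Series(["James Worthy", "Jason Kidd", None, "Manute Bol"])

# na=False treats the missing name as "does not match".
starts_j = names.str.startswith("J", na=False)
print(starts_j.tolist())

# case=False makes the substring match case-insensitive.
has_son = names.str.contains("son", case=False, na=False)
print(names[has_son].tolist())
```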

3.4 Column Operations

Add a Column

nba["Name Length"] = nba["player_name"].str.len()
nba["Name Length"].head()
0    16
1    16
2    12
3    14
4    12

Modify and Rename

nba["Name Length"] = nba["Name Length"] - 1
nba = nba.rename(columns={"Name Length": "Name_Length"})

Drop a Column

nba = nba.drop("Name_Length", axis="columns")

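Note: methods such as rename() and drop() return a new DataFrame and leave the original unchanged, so assign the result back to the variable to keep the change. Column assignment, as in nba["Name Length"] = ..., modifies the DataFrame directly.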
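A small sketch of that behavior, using a two-row toy frame rather than the full dataset:

```python
import pandas as pd

# Toy stand-in for the nba DataFrame.
toy = pd.DataFrame({
    "player_name": ["Randy Livingston", "George Lynch"],
    "Name_Length": [16, 12],
})

# The result of drop() is discarded here, so toy is unchanged.
toy.drop("Name_Length", axis="columns")
still_there = "Name_Length" in toy.columns
print(still_there)

# Assigning the result back is what makes the change stick.
toy = toy.drop("Name_Length", axis="columns")
gone = "Name_Length" not in toy.columns
print(gone)

# Column assignment changes toy directly; no reassignment is needed.
toy["pts"] = [3.9, 8.3]
print(list(toy.columns))
```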

3.5 Using NumPy with pandas

NumPy functions accept a pandas Series directly:

import numpy as np
ppg = nba[nba["player_name"] == "Stephen Curry"]["pts"]
np.mean(ppg), np.max(ppg)
(24.7, 24.7)
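Reductions such as np.mean collapse a Series to a single number, while element-wise functions such as np.sqrt return a new Series with the same index. The sketch below uses the five pts values from the head() output as a toy Series.

```python
import numpy as np
import pandas as pd

# pts values from the first five rows shown earlier.
pts = pd.Series([3.9, 3.8, 8.3, 10.2, 2.8], name="pts")

# Reductions return a single number and agree with the Series methods.
print(np.isclose(np.mean(pts), pts.mean()))
print(np.max(pts) == pts.max())

# Element-wise functions return a Series aligned to the original index.
roots = np.sqrt(pts)
print(type(roots).__name__)
```

For simple reductions the built-in Series methods (pts.mean(), pts.max()) give the same result; NumPy is most useful for functions pandas does not provide as methods.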

3.6 Quick Info & Stats

Shape and Size

nba.shape
(12844, 22)
nba.size
282568
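The two attributes are related: size is the number of rows times the number of columns, which is why 12844 × 22 gives 282568 above. A quick check on a toy frame:

```python
import pandas as pd

# Toy frame with 3 rows and 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

rows, cols = toy.shape       # shape is a (rows, columns) tuple
print(rows, cols)
print(toy.size == rows * cols)
print(len(toy) == rows)      # len() counts rows only
```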

Descriptive Stats

nba.describe()
                age  player_height  player_weight           pts
count  12844.000000   12844.000000   12844.000000  12844.000000
mean      26.558969     200.691805      98.458500      6.410342
std        5.124638       8.901298      13.541020      5.969174
min       19.000000     160.020000      59.874720      0.000000

3.7 Sampling & Frequency

Random Samples

nba.sample(3)
       player_name      team_abbreviation
5629   Joe Smith             MIN
10284  Chris Paul            HOU
6555   Greg Ostertag         UTA
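sample() returns different rows on each call. If you need the same rows every time, for example so that a notebook gives repeatable results, one option is the random_state parameter. A minimal sketch on a toy frame of names from this chapter (the seed value 42 is arbitrary):

```python
import pandas as pd

# Toy stand-in for the nba DataFrame.
toy = pd.DataFrame({
    "player_name": ["Joe Smith", "Chris Paul", "Greg Ostertag",
                    "Jason Kidd", "Manute Bol"],
})

# The same seed produces the same sample on every call.
a = toy.sample(3, random_state=42)
b = toy.sample(3, random_state=42)
print(a.equals(b))
```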

Value Counts

nba["college"].value_counts().head()
Kentucky            128
Duke                120
North Carolina      115
Kansas              100
Arizona              99
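value_counts() can also report proportions instead of raw counts by passing normalize=True. The sketch below uses a made-up six-row Series, not the real college column, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up values for illustration; not taken from the dataset.
college = pd.Series(["Kentucky", "Duke", "Kentucky", "Kansas", "Kentucky", "Duke"])

counts = college.value_counts()                # sorted from most to least frequent
shares = college.value_counts(normalize=True)  # fractions that sum to 1

print(counts.tolist())
print(shares["Kentucky"])
```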

Unique Values

nba["team_abbreviation"].unique()
['HOU' 'WAS' 'VAN' 'LAL' 'DEN' 'UTA' 'PHI' 'PHX' 'MIA' ... ]
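When you only need to know how many distinct values there are, nunique() returns the count directly. A sketch on a toy Series of team abbreviations:

```python
import pandas as pd

# Toy Series standing in for nba["team_abbreviation"].
teams = pd.Series(["HOU", "WAS", "VAN", "LAL", "HOU", "LAL"])

# unique() lists distinct values in order of first appearance.
distinct = list(teams.unique())
print(distinct)

# nunique() returns the number of distinct values.
print(teams.nunique())
```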

Top Scorers

nba.sort_values(by="pts", ascending=False).head()
     player_name      pts
1091 Luka Doncic      27.4
0    LeBron James     27.1
1174 Kevin Durant     26.0
1221 Trae Young       25.3
1232 Ja Morant        25.0
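An alternative to sorting the whole table and taking the head is nlargest(), which returns the top rows for a column in one call. The sketch below compares the two approaches on a toy frame built from the five rows listed above.

```python
import pandas as pd

# Toy frame using the five players listed above, in arbitrary order.
toy = pd.DataFrame({
    "player_name": ["Trae Young", "Luka Doncic", "Kevin Durant",
                    "Ja Morant", "LeBron James"],
    "pts": [25.3, 27.4, 26.0, 25.0, 27.1],
})

top3_sorted = toy.sort_values(by="pts", ascending=False).head(3)
top3 = toy.nlargest(3, "pts")

print(top3["player_name"].tolist())
print(top3_sorted["player_name"].tolist() == top3["player_name"].tolist())
```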

In this chapter, we practiced filtering rows, updating columns, and applying key pandas functions to explore NBA data. These skills are foundational to finding insights in any dataset.