Chapter 5: Exploratory Data Analysis

Before we can analyze or visualize any dataset, we need to clean it. Data cleaning is the process of fixing or removing incorrect, incomplete, or improperly formatted data. Common issues include:

  • Missing values
  • Duplicates
  • Inconsistent formatting (e.g., "GSW" vs. "Golden State Warriors")
  • Out-of-range values (e.g., "Height = 0")

Step 1: Inspect the Data

Let’s start by loading and inspecting the NBA dataset.

import pandas as pd

nba = pd.read_csv("data/nba_players.csv")
nba.info()

The .info() method shows you:

  • Number of entries (rows)
  • Column names and types
  • How many non-null values exist in each column
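As a quick illustration, the sketch below builds a tiny stand-in DataFrame and reads off the same facts that .info() reports, using .shape, .dtypes, and .count(). The rows are made up for illustration and are not taken from the NBA file.

```python
import pandas as pd

# Hypothetical sample rows standing in for data/nba_players.csv
sample = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C"],
    "player_height": [198.1, None, 210.8],
    "college": ["Duke", None, "Kentucky"],
})

n_rows, n_cols = sample.shape   # number of entries (rows) and columns
col_types = sample.dtypes       # data type of each column
non_null = sample.count()       # non-null values in each column

print(n_rows, n_cols)
print(col_types)
print(non_null)
```

A column whose non-null count is lower than the number of rows contains missing values.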

Step 2: Detect Missing Values

To check for missing values in the dataset:

nba.isnull().sum()
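Raw counts are often easier to judge as a share of all rows. A minimal sketch on made-up data (the values below are hypothetical):

```python
import pandas as pd

# Hypothetical rows with a few gaps
sample = pd.DataFrame({
    "player_height": [198.1, None, 210.8, 201.0],
    "college": ["Duke", None, None, "UCLA"],
})

missing_counts = sample.isnull().sum()         # missing values per column
missing_share = sample.isnull().mean() * 100   # percentage missing per column

print(missing_counts)
print(missing_share)
```

Taking the mean of the True/False mask works because True counts as 1 and False as 0.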

Step 3: Drop or Fill Missing Values

You can remove rows with missing data:

# dropna() returns a new DataFrame; nba is unchanged until you reassign it
nba.dropna().head()
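Dropping every row that has any missing value can discard more data than intended. The sketch below shows two narrower options that dropna() supports, the subset and how parameters, on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows; each of the last two has one gap
sample = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C"],
    "player_height": [198.1, None, 210.8],
    "college": ["Duke", "UCLA", None],
})

# Drop rows only when player_height is missing; other gaps are tolerated
has_height = sample.dropna(subset=["player_height"])

# Drop rows only when every column is missing
not_empty = sample.dropna(how="all")

print(len(sample), len(has_height), len(not_empty))
```

Which option fits depends on which columns the later analysis actually needs.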

Or fill missing values with a default:

nba["college"] = nba["college"].fillna("Undrafted")
nba["player_height"] = nba["player_height"].fillna(nba["player_height"].mean())

Use .fillna() carefully, and always ask whether the fill value makes sense. For example, suppose we are looking at weather data (temperature) and the values for all of January are missing. Would it make sense to fill them with the average temperature across all months, or is there a better way?
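One possible answer to the weather question, sketched on made-up numbers: fill each gap with the mean of the same month from other years instead of the overall mean. The column names and values are hypothetical, and this approach only works when other years of data are available.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly temperatures (degrees C) for two years;
# January 2021 is missing
weather = pd.DataFrame({
    "year":  [2020, 2020, 2020, 2021, 2021, 2021],
    "month": [1, 6, 7, 1, 6, 7],
    "temp":  [2.0, 20.0, 24.0, np.nan, 22.0, 26.0],
})

# Naive fill value: the mean across all months
overall_mean = weather["temp"].mean()

# Alternative: the mean of the same month, computed from the years that have it
month_mean = weather.groupby("month")["temp"].transform("mean")
weather["temp_filled"] = weather["temp"].fillna(month_mean)

print(overall_mean)
print(weather)
```

On this sample the overall mean is 18.8, which would put a summer-like value in January, while the same-month fill gives 2.0.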

Step 4: Check for Duplicates

nba.duplicated().sum()

To remove duplicates:

nba = nba.drop_duplicates()
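By default, a row counts as a duplicate only if it matches an earlier row in every column. Sometimes a few key columns should define what "the same record" means. A sketch using the subset and keep parameters, with a hypothetical season column and made-up rows:

```python
import pandas as pd

# Hypothetical rows: the first two are identical in every column
sample = pd.DataFrame({
    "player_name": ["Player A", "Player A", "Player B", "Player B"],
    "season": ["2019-20", "2019-20", "2019-20", "2020-21"],
    "pts": [10.5, 10.5, 8.0, 9.1],
})

# Rows that repeat an earlier row in every column
full_dupes = sample.duplicated().sum()

# Treat rows as duplicates when player and season match, keeping the first
deduped = sample.drop_duplicates(subset=["player_name", "season"], keep="first")

print(full_dupes, len(deduped))
```

Player B appears twice but in different seasons, so both of those rows are kept.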

Data Type Issues

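Sometimes columns are stored as the wrong data type (like numbers stored as strings). Passing errors="coerce" to pd.to_numeric() converts any value that cannot be parsed into NaN instead of raising an error: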

nba["draft_year"] = pd.to_numeric(nba["draft_year"], errors="coerce")
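Because coercion silently turns unparseable values into NaN, it is worth checking which values were lost. A minimal sketch, assuming the column mixes year strings with a text label (the values below are made up):

```python
import pandas as pd

# Hypothetical draft_year values stored as strings
draft_year = pd.Series(["2003", "1996", "Undrafted", "2018"])

converted = pd.to_numeric(draft_year, errors="coerce")

# Values that failed to parse (and were not already missing)
failed = draft_year[converted.isnull() & draft_year.notnull()]

print(converted)
print(failed.tolist())
```

Inspecting the failed values first helps decide whether NaN is an acceptable stand-in for them.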

Or you may want to convert height from centimeters to inches:

nba["height_in"] = nba["player_height"] / 2.54

Standardizing String Data

nba["college"] = nba["college"].str.strip()
nba["team_abbreviation"] = nba["team_abbreviation"].str.upper()

Use .str methods to:

  • Remove extra whitespace
  • Make all letters lowercase/uppercase
  • Replace values
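The three operations can be chained. The sketch below cleans a made-up column of team labels, including the "GSW" vs. "Golden State Warriors" inconsistency mentioned earlier. Note that the last step uses Series.replace(), which swaps whole values, rather than .str.replace(), which swaps substrings.

```python
import pandas as pd

# Hypothetical messy team labels
teams = pd.Series(["  gsw", "GSW ", "Golden State Warriors", "lal"])

cleaned = (
    teams.str.strip()    # remove leading/trailing whitespace
         .str.upper()    # make casing consistent
         .replace({"GOLDEN STATE WARRIORS": "GSW"})  # map full name to abbreviation
)

print(cleaned.tolist())
```

After cleaning, the three spellings of the same team collapse into a single label.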

Outlier Detection

Use .describe() to look at ranges:

nba["pts"].describe()

To visualize outliers, consider drawing a box plot:

import matplotlib.pyplot as plt
nba["pts"].plot.box()
plt.show()

You can also use conditional filtering to spot extremes:

nba[nba["pts"] > 50]
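A fixed cutoff like 50 has to be chosen by hand. One common rule of thumb, which is also what a box plot's whiskers are usually based on, flags values more than 1.5 times the interquartile range (IQR) beyond the quartiles. This is a sketch on made-up values, not a rule the dataset requires:

```python
import pandas as pd

# Hypothetical points values with one extreme entry
pts = pd.Series([8.0, 10.0, 12.0, 14.0, 16.0, 60.0])

q1 = pts.quantile(0.25)   # first quartile
q3 = pts.quantile(0.75)   # third quartile
iqr = q3 - q1             # interquartile range

# Flag values more than 1.5 * IQR outside the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = pts[(pts < lower) | (pts > upper)]

print(lower, upper)
print(outliers.tolist())
```

A flagged value is not necessarily an error. It is a prompt to look at the row and decide whether it is a typo or a genuinely exceptional performance.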

Exploratory Data Analysis (EDA)

Now that the data is clean, we can start exploring it. EDA is about asking and answering open-ended questions, such as:

  • Who are the top scorers?
  • Do taller players score more?
  • Which college produces the most players?

Quick Stats

Averages, maximums, and minimums are important metrics for understanding the scope of your data and the relationships within it:

nba["pts"].mean()
nba["college"].value_counts().head()
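The three example questions above can each be approached with a one-line pandas call. The sketch below uses made-up rows and assumes a player_name column alongside the columns used earlier. The correlation is only a rough first look, not evidence of cause.

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned NBA data
sample = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C", "Player D"],
    "player_height": [190.0, 198.0, 206.0, 213.0],
    "pts": [12.0, 25.0, 18.0, 9.0],
    "college": ["Duke", "Kentucky", "Duke", "UCLA"],
})

# Who are the top scorers?
top_scorers = sample.nlargest(2, "pts")[["player_name", "pts"]]

# Do taller players score more? (Pearson correlation, between -1 and 1)
height_pts_corr = sample["player_height"].corr(sample["pts"])

# Which college produces the most players?
top_college = sample["college"].value_counts().idxmax()

print(top_scorers)
print(height_pts_corr)
print(top_college)
```

A correlation near 0 suggests little linear relationship, while values near 1 or -1 suggest a strong one.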

With these summaries in hand, we understand our data well enough to decide how to use it and what inferences it can support.

Up Next

Now that we understand how to clean, transform, and explore our data, we can move on to visualizations. In the next chapter we will take a deep dive into visualization and explore a new library called Seaborn.