Chapter 2: Pandas Introduction
Data scientists often work with tabular data, that is, data stored in tables of rows and columns. pandas
is a Python package that provides many tools for wrangling tabular data. With pandas you can:
- Store data in table format
- Filter and sort rows
- Do calculations and statistics
- Load data from, and save data to, files such as spreadsheets or CSVs
2.1 Series, DataFrames, and Indices
To use pandas, first import it. By convention it is imported under the alias pd:
import pandas as pd
pandas has three key data structures:
- Series: A single column of labeled data (1D)
- DataFrame: A full table with rows and columns (2D)
- Index: Labels for rows (or sometimes columns)
Series
A Series is like one column in a spreadsheet. It contains a list of values and an index that labels each value.
s = pd.Series(["Inception", "The Lion King", "Interstellar"])
print(s)
0        Inception
1    The Lion King
2     Interstellar
dtype: object
The numbers on the left (0, 1, 2) are the index labels. The movie titles on the right are the values.
You can also inspect parts of the Series:
s.values # array of values
s.index # index labels
Custom Index
You can assign your own index labels:
s = pd.Series([8.8, 8.5, 8.6], index=["Inception", "The Lion King", "Interstellar"])
Now the movie titles are the labels, and the values could be ratings.
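As a quick check, the sketch below rebuilds the ratings Series and inspects its labels and values. The variable names ratings, labels, and values are only illustrative.

```python
import pandas as pd

# Ratings labeled by movie title instead of the default 0, 1, 2
ratings = pd.Series(
    [8.8, 8.5, 8.6],
    index=["Inception", "The Lion King", "Interstellar"],
)

labels = list(ratings.index)  # the custom index labels
values = ratings.to_list()    # the underlying values as a plain Python list

print(labels)
print(values)
```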
2.2 Selecting Values in a Series
There are three main ways to access values in a Series:
A single label:
s["Inception"]
A list of labels:
s[["Inception", "Interstellar"]]
A filter condition:
s[s > 8.6]
This returns only the movies with ratings above 8.6.
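The three access styles can be tried side by side. This is a minimal sketch using the ratings Series from above; the names single, subset, mask, and high are introduced here for illustration. Note that a filter condition is itself a Series of True/False values, which is then used to pick rows.

```python
import pandas as pd

s = pd.Series([8.8, 8.5, 8.6], index=["Inception", "The Lion King", "Interstellar"])

single = s["Inception"]                    # one label -> a single value
subset = s[["Inception", "Interstellar"]]  # list of labels -> a smaller Series
mask = s > 8.6                             # Boolean Series, one True/False per movie
high = s[mask]                             # keeps only the entries where mask is True

print(single)
print(subset)
print(mask)
print(high)
```

Because the comparison is strict, Interstellar (exactly 8.6) is not included in the filtered result.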
2.3 DataFrames
A DataFrame is a full table; it’s like a collection of Series that share the same row index. A DataFrame is what you’ll spend most of your time using in pandas.
Creating a DataFrame
From a CSV file:
movies = pd.read_csv("data/movies.csv")
Suppose movies.csv looks like this:
Title | Year | Genre | Rating |
---|---|---|---|
Inception | 2010 | Sci-Fi | 8.8 |
The Lion King | 1994 | Animation | 8.5 |
Interstellar | 2014 | Sci-Fi | 8.6 |
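To try read_csv without a file on disk, the sketch below feeds the same rows to pandas from an in-memory string. It assumes the file is comma-separated with a header row, which the text does not state; csv_text is a stand-in for the contents of data/movies.csv.

```python
import io

import pandas as pd

# Stand-in for data/movies.csv; the comma-separated layout is an assumption
csv_text = """Title,Year,Genre,Rating
Inception,2010,Sci-Fi,8.8
The Lion King,1994,Animation,8.5
Interstellar,2014,Sci-Fi,8.6
"""

# read_csv accepts a file path or a file-like object such as StringIO
movies = pd.read_csv(io.StringIO(csv_text))

print(movies)
print(movies.dtypes)  # the column types pandas inferred
```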
From a list:
pd.DataFrame([1, 2, 3], columns=["Number"])
From a 2D list:
pd.DataFrame([[1, "One"], [2, "Two"]], columns=["Number", "Word"])
From a dictionary:
pd.DataFrame({
"Title": ["Inception", "Interstellar"],
"Rating": [8.8, 8.6]
})
From Series objects:
s1 = pd.Series(["Inception", "Interstellar"], index=["m1", "m2"])
s2 = pd.Series([8.8, 8.6], index=["m1", "m2"])
pd.DataFrame({"Title": s1, "Rating": s2})
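When a DataFrame is built from Series, the labels the Series share become the row index and the dictionary keys become the column names. A small sketch to confirm this, with df as an illustrative name:

```python
import pandas as pd

s1 = pd.Series(["Inception", "Interstellar"], index=["m1", "m2"])
s2 = pd.Series([8.8, 8.6], index=["m1", "m2"])

df = pd.DataFrame({"Title": s1, "Rating": s2})

print(df)
print(list(df.index))    # the labels shared by s1 and s2
print(list(df.columns))  # the dictionary keys
```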
Index
An index labels each row. It can be numbers (like 0, 1, 2) or something custom (like movie titles).
Set a new index:
movies.set_index("Title", inplace=True)
Reset to the default index:
movies.reset_index(inplace=True)
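Both methods can also be called without inplace=True, in which case they return a new DataFrame and leave the original untouched. The sketch below uses a two-column version of the movies table; by_title, year, and restored are illustrative names.

```python
import pandas as pd

movies = pd.DataFrame({
    "Title": ["Inception", "The Lion King", "Interstellar"],
    "Year": [2010, 1994, 2014],
})

by_title = movies.set_index("Title")      # new DataFrame; movies itself is unchanged
year = by_title.loc["Inception", "Year"]  # rows are now looked up by title
restored = by_title.reset_index()         # Title becomes an ordinary column again

print(by_title)
print(year)
print(restored)
```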
2.4 DataFrame Attributes
These are useful when you want to quickly understand your data:
- df.index: List of row labels
- df.columns: List of column names
- df.shape: Tuple of (rows, columns)
Example:
movies.shape # (3, 4)
movies.columns # Index(['Title', 'Year', 'Genre', 'Rating'], dtype='object')
2.5 Selecting Data from a DataFrame
You’ll often want to look at just part of a table. These are your main tools:
Method | What it does |
---|---|
.head(n) | Returns the first n rows |
.tail(n) | Returns the last n rows |
.loc[] | Select by label |
.iloc[] | Select by position |
[] | Depends on context (usually column selection) |
.head() and .tail()
movies.head(2) # First 2 rows
movies.tail(1) # Last row
.loc[] – Selection by Label
movies.loc[0, "Title"]
movies.loc[0:2, "Year":"Rating"]
.loc slices are inclusive: the ending row or column label is part of the result.
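The inclusive behavior is easy to verify on the three-row movies table. In this sketch the table is built by hand rather than read from the CSV, and block is an illustrative name:

```python
import pandas as pd

movies = pd.DataFrame({
    "Title": ["Inception", "The Lion King", "Interstellar"],
    "Year": [2010, 1994, 2014],
    "Genre": ["Sci-Fi", "Animation", "Sci-Fi"],
    "Rating": [8.8, 8.5, 8.6],
})

title = movies.loc[0, "Title"]            # row labeled 0, column "Title"
block = movies.loc[0:2, "Year":"Rating"]  # labels 0 through 2, Year through Rating

print(title)
print(block)
print(block.shape)  # row label 2 and column "Rating" are both included
```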
.iloc[] – Selection by Position
movies.iloc[0, 1] # First row, second column
movies.iloc[0:2, 0:3] # First 2 rows, first 3 columns
.iloc slices exclude the end position (like normal Python slicing).
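A short sketch of the same idea on a hand-built copy of the movies table. It also runs the row slice 0:2 through .loc for comparison, since the two accessors treat the end of a slice differently; the variable names are illustrative.

```python
import pandas as pd

movies = pd.DataFrame({
    "Title": ["Inception", "The Lion King", "Interstellar"],
    "Year": [2010, 1994, 2014],
    "Genre": ["Sci-Fi", "Animation", "Sci-Fi"],
    "Rating": [8.8, 8.5, 8.6],
})

cell = movies.iloc[0, 1]         # first row, second column (Year)
corner = movies.iloc[0:2, 0:3]   # positions 0-1 for rows, 0-2 for columns

rows_by_position = len(movies.iloc[0:2])  # end position 2 is excluded
rows_by_label = len(movies.loc[0:2])      # end label 2 is included

print(cell)
print(corner)
print(rows_by_position, rows_by_label)
```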
[] – Bracket Notation
movies["Title"] # Single column (returns a Series)
movies[["Title", "Rating"]] # Multiple columns (returns a DataFrame)
movies[0:2] # First 2 rows (a slice selects rows, not columns)
Bracket notation is the most common way to select columns.
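What [] returns depends on what goes inside the brackets. The sketch below checks the result type for each form on a small hand-built table; the variable names are illustrative.

```python
import pandas as pd

movies = pd.DataFrame({
    "Title": ["Inception", "The Lion King", "Interstellar"],
    "Rating": [8.8, 8.5, 8.6],
})

one_column = movies["Title"]      # a column name -> Series
as_table = movies[["Title"]]      # a list of names -> DataFrame, even with one name
first_rows = movies[0:2]          # a slice -> rows, end position excluded

print(type(one_column).__name__)
print(type(as_table).__name__)
print(first_rows)
```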