Chapter 2: Pandas Introduction

Data scientists often work with tabular data, that is, data stored in tables of rows and columns. pandas is a Python package that provides many tools for wrangling tabular data. With pandas you can:

  • Store data in table format
  • Filter and sort rows
  • Do calculations and statistics
  • Load data from, and save data to, files such as spreadsheets or CSVs

2.1 Series, DataFrames, and Indices

First, import pandas into your Python environment. By convention it is imported under the alias pd:

import pandas as pd

pandas has three key data structures:

  • Series: A single column of labeled data (1D)
  • DataFrame: A full table with rows and columns (2D)
  • Index: Labels for rows (or sometimes columns)

Series

A Series is like one column in a spreadsheet. It contains a sequence of values and an index that labels each value.

s = pd.Series(["Inception", "The Lion King", "Interstellar"])
print(s)
0        Inception
1    The Lion King
2     Interstellar
dtype: object

The numbers on the left (0, 1, 2) are the index labels. The movie titles on the right are the values.

You can also inspect parts of the Series:

s.values  # array of values
s.index   # index labels

Custom Index

You can assign your own index labels:

s = pd.Series([8.8, 8.5, 8.6], index=["Inception", "The Lion King", "Interstellar"])

Now the movie titles are the index labels, and the values are their ratings.
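As a small sketch of how labels and values fit together, the following rebuilds the ratings Series and reads both parts back out. The variable names ratings, labels, and values are chosen here for illustration only.

```python
import pandas as pd

# Ratings labeled by movie title (same data as the example above).
ratings = pd.Series(
    [8.8, 8.5, 8.6],
    index=["Inception", "The Lion King", "Interstellar"],
)

labels = list(ratings.index)   # the index labels, as a plain Python list
values = ratings.tolist()      # the values, as a plain Python list

print(labels)   # ['Inception', 'The Lion King', 'Interstellar']
print(values)   # [8.8, 8.5, 8.6]
```

Each label is paired with the value in the same position, which is what makes lookups by title possible.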

2.2 Selecting Values in a Series

There are three main ways to access values in a Series:

A single label:

s["Inception"]

A list of labels:

s[["Inception", "Interstellar"]]

A filter condition:

s[s > 8.6]

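The expression s > 8.6 produces a Series of True/False values, and indexing with it keeps only the entries where the condition is True. Here that is only Inception (8.8), because 8.6 is not greater than 8.6.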
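The three access patterns can be tried side by side. This is a minimal sketch using the ratings Series from section 2.1; the names one, some, mask, and high are illustrative.

```python
import pandas as pd

ratings = pd.Series(
    [8.8, 8.5, 8.6],
    index=["Inception", "The Lion King", "Interstellar"],
)

one = ratings["Inception"]                      # single label -> a single value
some = ratings[["Inception", "Interstellar"]]   # list of labels -> a smaller Series
mask = ratings > 8.6                            # comparison -> boolean Series
high = ratings[mask]                            # keeps entries where mask is True

print(one)                 # 8.8
print(list(some.index))    # ['Inception', 'Interstellar']
print(mask.tolist())       # [True, False, False]
print(list(high.index))    # ['Inception']
```

Storing the condition in mask first makes it easier to see that filtering is just indexing with a Series of booleans.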

2.3 DataFrames

A DataFrame is a full table; it’s like a collection of Series that share the same row index. A DataFrame is what you’ll spend most of your time using in pandas.

Creating a DataFrame

From a CSV file:

movies = pd.read_csv("data/movies.csv")

Suppose movies.csv contains the following data (shown here as a table rather than as raw comma-separated text):

Title           Year    Genre       Rating
Inception       2010    Sci-Fi      8.8
The Lion King   1994    Animation   8.5
Interstellar    2014    Sci-Fi      8.6
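To try this without a file on disk, the sketch below reads the same rows from an in-memory string. It assumes the file is comma-separated with a header row; the variable csv_text is a stand-in for the contents of data/movies.csv.

```python
import io

import pandas as pd

# Assumed file contents: comma-separated values with a header row.
csv_text = """Title,Year,Genre,Rating
Inception,2010,Sci-Fi,8.8
The Lion King,1994,Animation,8.5
Interstellar,2014,Sci-Fi,8.6
"""

# pd.read_csv accepts a file path or any file-like object.
movies = pd.read_csv(io.StringIO(csv_text))
print(movies)
```

The header row becomes the column names, and each remaining line becomes one row with the default index 0, 1, 2.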

From a list:

pd.DataFrame([1, 2, 3], columns=["Number"])

From a 2D list:

pd.DataFrame([[1, "One"], [2, "Two"]], columns=["Number", "Word"])

From a dictionary:

pd.DataFrame({
    "Title": ["Inception", "Interstellar"],
    "Rating": [8.8, 8.6]
})

From Series objects:

s1 = pd.Series(["Inception", "Interstellar"], index=["m1", "m2"])
s2 = pd.Series([8.8, 8.6], index=["m1", "m2"])
pd.DataFrame({"Title": s1, "Rating": s2})
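When a DataFrame is built from Series, pandas matches rows by index label rather than by position. The sketch below illustrates this by deliberately listing the ratings in a different order; the names titles, ratings, and df are illustrative.

```python
import pandas as pd

titles = pd.Series(["Inception", "Interstellar"], index=["m1", "m2"])
# Same labels as titles, but listed in the opposite order.
ratings = pd.Series([8.6, 8.8], index=["m2", "m1"])

df = pd.DataFrame({"Title": titles, "Rating": ratings})

# Rows are aligned by label, so each title still gets its own rating.
print(df.loc["m1", "Title"], df.loc["m1", "Rating"])   # Inception 8.8
print(df.loc["m2", "Title"], df.loc["m2", "Rating"])   # Interstellar 8.6
```

This is what "sharing the same row index" means in practice: the index, not the order of the values, decides which entries belong to the same row.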

Index

An index labels each row. It can be numbers (like 0, 1, 2) or something custom (like movie titles).

Set a new index:

movies.set_index("Title", inplace=True)

Reset to the default index:

movies.reset_index(inplace=True)
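The sketch below shows the round trip on a small table. It uses the form without inplace=True, which returns a new DataFrame and leaves the original unchanged; the names by_title and restored are illustrative.

```python
import pandas as pd

movies = pd.DataFrame({
    "Title": ["Inception", "The Lion King", "Interstellar"],
    "Rating": [8.8, 8.5, 8.6],
})

# Title moves out of the columns and becomes the row labels.
by_title = movies.set_index("Title")
print(by_title.loc["Inception", "Rating"])   # 8.8

# Title becomes an ordinary column again; rows are relabeled 0, 1, 2.
restored = by_title.reset_index()
print(list(restored.columns))                # ['Title', 'Rating']
print(list(restored.index))                  # [0, 1, 2]
```

With the titles as the index, a row can be looked up by movie name instead of by number.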

2.4 DataFrame Attributes

These are useful when you want to quickly understand your data:

  • df.index: The row labels (an Index object)
  • df.columns: The column names (also an Index object)
  • df.shape: A tuple of (number of rows, number of columns)

Example:

movies.shape     # (3, 4)
movies.columns   # Index(['Title', 'Year', 'Genre', 'Rating'], dtype='object')
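For a version that runs on its own, the sketch below builds the three-movie table directly (rather than reading it from a file) and inspects the same attributes.

```python
import pandas as pd

movies = pd.DataFrame({
    "Title": ["Inception", "The Lion King", "Interstellar"],
    "Year": [2010, 1994, 2014],
    "Genre": ["Sci-Fi", "Animation", "Sci-Fi"],
    "Rating": [8.8, 8.5, 8.6],
})

n_rows, n_cols = movies.shape     # shape unpacks like any tuple
print(n_rows, n_cols)             # 3 4
print(list(movies.columns))       # ['Title', 'Year', 'Genre', 'Rating']
print(list(movies.index))         # [0, 1, 2]
```

Wrapping the index and columns in list() is only for display; both are Index objects, not plain lists.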

2.5 Selecting Data from a DataFrame

You’ll often want to look at just part of a table. These are your main tools:

Method      What it does
.head(n)    Returns the first n rows
.tail(n)    Returns the last n rows
.loc[]      Selects by label
.iloc[]     Selects by position
[]          Depends on context (usually column selection)

.head() and .tail()

movies.head(2)   # First 2 rows
movies.tail(1)   # Last row

.loc[] – Selection by Label

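movies.loc[0, "Title"]              # Title in the row labeled 0
movies.loc[0:2, "Year":"Rating"]    # Rows labeled 0 through 2, columns Year through Rating

Unlike normal Python slices, .loc slices include the end label, so 0:2 selects the rows labeled 0, 1, and 2. These examples assume movies has the default integer index.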

.iloc[] – Selection by Position

movies.iloc[0, 1]        # First row, second column
movies.iloc[0:2, 0:3]    # First 2 rows, first 3 columns

.iloc slices exclude the end position (like normal Python slicing).
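The difference between the two slicing rules is easiest to see by running the same kind of slice both ways. This sketch builds the three-movie table directly and assumes the default integer index; the names by_label and by_position are illustrative.

```python
import pandas as pd

movies = pd.DataFrame({
    "Title": ["Inception", "The Lion King", "Interstellar"],
    "Year": [2010, 1994, 2014],
    "Genre": ["Sci-Fi", "Animation", "Sci-Fi"],
    "Rating": [8.8, 8.5, 8.6],
})

# Label-based: both ends of each slice are included.
by_label = movies.loc[0:2, "Year":"Rating"]
print(by_label.shape)               # (3, 3)
print(list(by_label.columns))       # ['Year', 'Genre', 'Rating']

# Position-based: the end of each slice is excluded.
by_position = movies.iloc[0:2, 0:3]
print(by_position.shape)            # (2, 3)
print(list(by_position.columns))    # ['Title', 'Year', 'Genre']
```

The row slice 0:2 yields three rows with .loc but only two with .iloc, even though it is written the same way.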

[] – Bracket Notation

movies["Title"]                 # Single column (returns a Series)
movies[["Title", "Rating"]]     # Multiple columns (returns a DataFrame)
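movies[0:2]                     # First 2 rows (a slice selects rows by position)

Bracket notation is the most common way to select columns. Note that a slice such as 0:2 inside the brackets selects rows, not columns.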
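Because the result of bracket notation depends on what goes inside the brackets, it can help to check the type of each result. The sketch below does this on a small table; the names one_col, two_cols, and first_rows are illustrative.

```python
import pandas as pd

movies = pd.DataFrame({
    "Title": ["Inception", "The Lion King", "Interstellar"],
    "Rating": [8.8, 8.5, 8.6],
})

one_col = movies["Title"]                # a string -> one column, as a Series
two_cols = movies[["Title", "Rating"]]   # a list of strings -> a DataFrame
first_rows = movies[0:2]                 # a slice -> rows, as a DataFrame

print(type(one_col).__name__)      # Series
print(type(two_cols).__name__)     # DataFrame
print(len(first_rows))             # 2
```

A single column name in a list, such as movies[["Title"]], also returns a DataFrame (with one column) rather than a Series.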