Analyzing Pivot Tables with Pandas | Become a Data Analyst

Creating and Analyzing Pivot Tables

Pivot tables are a powerful tool for summarizing and analyzing large datasets. In this lesson, you'll learn how to create, manipulate, and interpret pivot tables using Pandas.

Introduction to Pivot Tables

Pivot tables allow you to reshape data and calculate aggregate statistics quickly. They're especially useful for:

Summarizing data across multiple dimensions
Calculating totals and subtotals
Creating cross-tabulations of data

Creating a Basic Pivot Table

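Let's start with a simple example using a randomly generated sales dataset of 100 daily records. The pivot table below has one row per product and one column per region, and each cell holds the total sales for that combination: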

import pandas as pd
import numpy as np

# Create a sample dataset
np.random.seed(0)
data = {
    'date': pd.date_range(start='2023-01-01', periods=100),
    'product': np.random.choice(['A', 'B', 'C'], 100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'sales': np.random.randint(100, 1000, 100)
}
df = pd.DataFrame(data)

# Create a basic pivot table
pivot = pd.pivot_table(df, values='sales', index='product', columns='region', aggfunc='sum')

print("Basic Pivot Table:")
print(pivot)
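
One way to build intuition for what pd.pivot_table() does: with a single aggregation function, it behaves like a group-by on the index and column fields followed by an unstack. The sketch below uses a small hand-made frame (hypothetical values, not the lesson's random dataset) so the numbers can be checked by eye.

```python
import pandas as pd

# Small hypothetical dataset so the result can be verified by hand
small = pd.DataFrame({
    'product': ['A', 'A', 'B', 'B', 'A'],
    'region':  ['North', 'South', 'North', 'South', 'North'],
    'sales':   [100, 200, 300, 400, 50],
})

# Pivot table: products as rows, regions as columns, summed sales in the cells
via_pivot = pd.pivot_table(small, values='sales', index='product',
                           columns='region', aggfunc='sum')

# Same summary built step by step: group, aggregate, then move region to the columns
via_groupby = small.groupby(['product', 'region'])['sales'].sum().unstack('region')

print(via_pivot)
print((via_pivot == via_groupby).all().all())
```

Product A has two North orders (100 and 50), so its North cell is their sum, 150, in both versions.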

Adding Multiple Aggregation Functions

You can apply multiple aggregation functions to create more comprehensive pivot tables. Passing a list to aggfunc produces one block of region columns per function:

import pandas as pd
import numpy as np

# Using the same dataset as before
np.random.seed(0)
data = {
    'date': pd.date_range(start='2023-01-01', periods=100),
    'product': np.random.choice(['A', 'B', 'C'], 100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'sales': np.random.randint(100, 1000, 100)
}
df = pd.DataFrame(data)

# Create a pivot table with multiple aggregation functions
pivot_multi = pd.pivot_table(df, values='sales', index='product', 
                             columns='region', aggfunc=['sum', 'mean', 'count'])

print("Pivot Table with Multiple Aggregations:")
print(pivot_multi)
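
When aggfunc is a list, the resulting columns are two-level labels of the form (aggregation, region), which changes how you select data from the result. The following is a minimal sketch on a small hypothetical frame (the values are made up for illustration) showing how to pull out one aggregation block or a single cell.

```python
import pandas as pd

# Small hypothetical dataset so the result can be verified by hand
small = pd.DataFrame({
    'product': ['A', 'A', 'B', 'B', 'A'],
    'region':  ['North', 'South', 'North', 'South', 'North'],
    'sales':   [100, 200, 300, 400, 50],
})

multi = pd.pivot_table(small, values='sales', index='product',
                       columns='region', aggfunc=['sum', 'mean'])

# Column labels are (aggregation, region) tuples
print(multi.columns.tolist())

# Select the whole block of columns for one aggregation
sums = multi['sum']
print(sums)

# Select a single cell with a (aggregation, region) tuple as the column key
mean_a_north = multi.loc['A', ('mean', 'North')]
print(mean_a_north)
```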

Hierarchical Indexing in Pivot Tables

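Pivot tables can have multiple levels of indexing for more complex analyses. Passing a list of columns to index creates a hierarchical row index, and margins=True adds an "All" row and column of totals: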

import pandas as pd
import numpy as np

# Create a more complex dataset
np.random.seed(0)
data = {
    'date': pd.date_range(start='2023-01-01', periods=1000),
    'product': np.random.choice(['A', 'B', 'C'], 1000),
    'category': np.random.choice(['Cat1', 'Cat2'], 1000),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 1000),
    'sales': np.random.randint(100, 1000, 1000)
}
df = pd.DataFrame(data)

# Create a pivot table with hierarchical indexing
pivot_hierarchical = pd.pivot_table(df, values='sales', 
                                    index=['category', 'product'], 
                                    columns='region', 
                                    aggfunc='sum', 
                                    margins=True)

print("Pivot Table with Hierarchical Indexing:")
print(pivot_hierarchical)
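
Once the rows are indexed by (category, product), individual cells and whole categories can be selected by label. The sketch below assumes a small hypothetical dataset in which every category/product pair has sales in both regions, so no cells are missing.

```python
import pandas as pd

# Small hypothetical dataset; every (category, product) pair appears in both regions
small = pd.DataFrame({
    'category': ['Cat1', 'Cat1', 'Cat1', 'Cat1', 'Cat2', 'Cat2'],
    'product':  ['A', 'A', 'B', 'B', 'A', 'A'],
    'region':   ['North', 'South', 'North', 'South', 'North', 'South'],
    'sales':    [100, 150, 200, 250, 300, 350],
})

table = pd.pivot_table(small, values='sales', index=['category', 'product'],
                       columns='region', aggfunc='sum', margins=True)
print(table)

# One cell: pass the full (category, product) tuple as the row key
cat1_a_north = table.loc[('Cat1', 'A'), 'North']

# All products in one category: take a cross-section on the 'category' level
cat1 = table.xs('Cat1', level='category')

# The margins row is appended last, so the grand total is the last value of the 'All' column
grand_total = table['All'].iloc[-1]
print(cat1_a_north, grand_total)
```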

Key Pivot Table Concepts

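1. Values (values): The column holding the data to be summarized or aggregated.
2. Index (index): The column or columns whose unique values become the rows of the pivot table.
3. Columns (columns): The column or columns whose unique values become the columns of the pivot table.
4. Aggregation Function (aggfunc): How to combine the data in each cell (e.g., sum, mean, count). The default is the mean.
5. Margins (margins): Option to add an "All" row and column containing totals computed with the same aggregation function.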
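
The aggregation function in item 4 does not have to be a built-in name; pd.pivot_table() also accepts a callable that reduces each group to a single value. Here is a minimal sketch using a hypothetical helper, sales_range, on made-up data.

```python
import pandas as pd

# Small hypothetical dataset so the result can be verified by hand
small = pd.DataFrame({
    'product': ['A', 'A', 'B', 'B', 'A'],
    'region':  ['North', 'South', 'North', 'South', 'North'],
    'sales':   [100, 200, 300, 400, 50],
})

def sales_range(s):
    """Hypothetical helper: spread between the largest and smallest sale in a group."""
    return s.max() - s.min()

# Each cell now holds max - min of the sales for that product/region pair
spread = pd.pivot_table(small, values='sales', index='product',
                        columns='region', aggfunc=sales_range)
print(spread)
```

Product A has two North orders (100 and 50), so its North cell is 50; every other cell contains a single order and therefore a spread of 0.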

Practice Exercises

Now it's time to apply what you've learned about pivot tables!

Exercise 1: Sales Analysis with Pivot Tables

Using the provided sales dataset, create pivot tables to answer the following questions:

1. What are the total sales for each product category by region?
2. What is the average order value for each product, broken down by month?
3. How many orders were placed for each product in each region?

Use pd.pivot_table() to create the required pivot tables.
Choose appropriate values, index, columns, and aggregation functions.
Format the output for readability.

import pandas as pd
import numpy as np

# Create the sales dataset
np.random.seed(0)
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
categories = ['Electronics', 'Clothing', 'Home & Garden']
products = ['Product' + str(i) for i in range(1, 6)]
regions = ['North', 'South', 'East', 'West']

data = {
    'date': np.random.choice(dates, 10000),
    'category': np.random.choice(categories, 10000),
    'product': np.random.choice(products, 10000),
    'region': np.random.choice(regions, 10000),
    'sales': np.random.randint(10, 1000, 10000)
}
df = pd.DataFrame(data)

# Your code here
# 1. Total sales for each product category by region
pivot_1 = ___

# 2. Average order value for each product by month
pivot_2 = ___

# 3. Number of orders for each product in each region
pivot_3 = ___

# Print results
print("1. Total sales for each product category by region:")
print(___)
print("\n2. Average order value for each product by month:")
print(___)
print("\n3. Number of orders for each product in each region:")
print(___)

Solution:

import pandas as pd
import numpy as np

# Create the sales dataset
np.random.seed(0)
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
categories = ['Electronics', 'Clothing', 'Home & Garden']
products = ['Product' + str(i) for i in range(1, 6)]
regions = ['North', 'South', 'East', 'West']

data = {
    'date': np.random.choice(dates, 10000),
    'category': np.random.choice(categories, 10000),
    'product': np.random.choice(products, 10000),
    'region': np.random.choice(regions, 10000),
    'sales': np.random.randint(10, 1000, 10000)
}
df = pd.DataFrame(data)

# 1. Total sales for each product category by region
pivot_1 = pd.pivot_table(df, values='sales', index='category', columns='region', aggfunc='sum')

# 2. Average order value for each product by month
# Convert each order date to a monthly period (e.g. 2023-01) to use as the column key
df['month'] = df['date'].dt.to_period('M')
pivot_2 = pd.pivot_table(df, values='sales', index='product', columns='month', aggfunc='mean')

# 3. Number of orders for each product in each region
pivot_3 = pd.pivot_table(df, values='sales', index='product', columns='region', aggfunc='count')

# Print results
print("1. Total sales for each product category by region:")
print(pivot_1)
print("\n2. Average order value for each product by month:")
print(pivot_2)
print("\n3. Number of orders for each product in each region:")
print(pivot_3)
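
The order count in question 3 is a cross-tabulation, so pd.crosstab() is an alternative to a pivot table with aggfunc='count'. The sketch below compares the two on a small hypothetical set of orders. One difference to keep in mind: the pivot table counts non-null sales values, while crosstab counts rows, and the pivot table needs fill_value=0 to show empty combinations as zero.

```python
import pandas as pd

# Small hypothetical set of orders; Product2 has no orders in the South
orders = pd.DataFrame({
    'product': ['Product1', 'Product1', 'Product2', 'Product2', 'Product1'],
    'region':  ['North', 'South', 'North', 'North', 'North'],
    'sales':   [100, 200, 300, 400, 50],
})

# Count via pivot table; fill_value=0 replaces the missing Product2/South cell
counts_pivot = pd.pivot_table(orders, values='sales', index='product',
                              columns='region', aggfunc='count', fill_value=0)

# Count via crosstab; empty combinations are 0 by default
counts_xtab = pd.crosstab(orders['product'], orders['region'])

print(counts_pivot)
print(counts_xtab)
```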

Exercise 2: Advanced Pivot Table Analysis

Using a two-year version of the sales dataset (2022 through 2023, 20,000 orders), perform these advanced analyses:

1. Create a pivot table showing the percentage of total sales each product represents within its category.
2. Generate a pivot table that displays the top-selling product for each region and category combination.
3. Produce a pivot table showing the year-over-year growth in sales for each product category.

Use advanced pivot table techniques like custom aggregation functions and multi-level indexing.
Implement data transformations before or after creating pivot tables as needed.
Post-process the resulting pivot tables with DataFrame methods such as groupby() or idxmax() where needed.

import pandas as pd
import numpy as np

# Same structure as before, extended to two years (2022-2023) so growth can be computed
np.random.seed(0)
dates = pd.date_range(start='2022-01-01', end='2023-12-31', freq='D')
categories = ['Electronics', 'Clothing', 'Home & Garden']
products = ['Product' + str(i) for i in range(1, 6)]
regions = ['North', 'South', 'East', 'West']

data = {
    'date': np.random.choice(dates, 20000),
    'category': np.random.choice(categories, 20000),
    'product': np.random.choice(products, 20000),
    'region': np.random.choice(regions, 20000),
    'sales': np.random.randint(10, 1000, 20000)
}
df = pd.DataFrame(data)

# Your code here
# 1. Percentage of total sales each product represents within its category
pivot_1 = ___

# 2. Top-selling product for each region and category combination
pivot_2 = ___

# 3. Year-over-year growth in sales for each product category
pivot_3 = ___

# Print results
print("1. Percentage of total sales each product represents within its category:")
print(___)
print("\n2. Top-selling product for each region and category combination:")
print(___)
print("\n3. Year-over-year growth in sales for each product category:")
print(___)

Solution:

import pandas as pd
import numpy as np

# Same structure as before, extended to two years (2022-2023) so growth can be computed
np.random.seed(0)
dates = pd.date_range(start='2022-01-01', end='2023-12-31', freq='D')
categories = ['Electronics', 'Clothing', 'Home & Garden']
products = ['Product' + str(i) for i in range(1, 6)]
regions = ['North', 'South', 'East', 'West']

data = {
    'date': np.random.choice(dates, 20000),
    'category': np.random.choice(categories, 20000),
    'product': np.random.choice(products, 20000),
    'region': np.random.choice(regions, 20000),
    'sales': np.random.randint(10, 1000, 20000)
}
df = pd.DataFrame(data)

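# 1. Percentage of total sales each product represents within its category
pivot_1 = pd.pivot_table(df, values='sales', index=['category', 'product'], aggfunc='sum')
# Broadcast each category's total back to its rows; transform keeps the original index,
# whereas groupby().apply() can add a duplicate 'category' level in recent pandas versions
category_totals = pivot_1.groupby(level='category').transform('sum')
pivot_1 = (pivot_1 / category_totals * 100).round(2)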

# 2. Top-selling product for each region and category combination
pivot_2 = pd.pivot_table(df, values='sales', index=['region', 'category'], columns='product', aggfunc='sum')
pivot_2 = pivot_2.idxmax(axis=1)

# 3. Year-over-year growth in sales for each product category
df['year'] = df['date'].dt.year
pivot_3_2022 = pd.pivot_table(df[df['year'] == 2022], values='sales', index='category', aggfunc='sum')
pivot_3_2023 = pd.pivot_table(df[df['year'] == 2023], values='sales', index='category', aggfunc='sum')
pivot_3 = (pivot_3_2023 - pivot_3_2022) / pivot_3_2022 * 100

# Print results
print("1. Percentage of total sales each product represents within its category:")
print(pivot_1)
print("\n2. Top-selling product for each region and category combination:")
print(pivot_2)
print("\n3. Year-over-year growth in sales for each product category:")
print(pivot_3)
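
As an alternative to building one pivot table per year, the year can be used as the columns of a single pivot table, which keeps both totals and the growth figure side by side. This is a sketch on a small hypothetical dataset (values chosen so the growth rates are easy to check), not the exercise's random data.

```python
import pandas as pd

# Small hypothetical dataset spanning two years
sales = pd.DataFrame({
    'date': pd.to_datetime(['2022-03-01', '2022-07-15', '2023-02-10',
                            '2023-09-30', '2022-05-05', '2023-05-05']),
    'category': ['Electronics', 'Electronics', 'Electronics',
                 'Electronics', 'Clothing', 'Clothing'],
    'sales': [100, 100, 150, 150, 400, 300],
})
sales['year'] = sales['date'].dt.year

# One pivot table with a column of total sales per year
by_year = pd.pivot_table(sales, values='sales', index='category',
                         columns='year', aggfunc='sum')

# Growth from 2022 to 2023 as a percentage of the 2022 total
by_year['yoy_growth_pct'] = (by_year[2023] - by_year[2022]) / by_year[2022] * 100
print(by_year)
```

Electronics grows from 200 to 300 (50%), while Clothing falls from 400 to 300 (-25%).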

Summary

Pivot tables are powerful tools for data analysis, allowing you to quickly summarize and analyze large datasets. By mastering pivot table techniques, you can efficiently reorganize, aggregate, and explore your data from different perspectives. Remember to experiment with various combinations of index, columns, and values to uncover hidden patterns and insights in your datasets.