Data Aggregation and Resampling with Pandas

Data Aggregation and Resampling

5. Advanced Data Manipulation with Python

Data aggregation is a crucial skill in data analysis, allowing you to summarize large datasets and extract meaningful insights. In this lesson, you'll learn various techniques for aggregating data using Pandas.

Introduction to Data Aggregation

Data aggregation involves combining multiple data points into a single summary statistic. This process is essential for understanding trends, patterns, and overall characteristics of your data.

Basic Aggregation Functions

Let's start with some basic aggregation functions using a sample dataset:

import pandas as pd
import numpy as np

# Create a sample dataset
np.random.seed(0)
data = {
    'category': ['A', 'B', 'A', 'C', 'B', 'C'] * 5,
    'value': np.random.randint(1, 100, 30)
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df.head(10))

# Basic aggregations
print("\nBasic Aggregations:")
print(df.groupby('category')['value'].agg(['mean', 'sum', 'count']))

Custom Aggregation Functions

You can also define and use custom aggregation functions:

import pandas as pd
import numpy as np

# Using the same dataset as before
np.random.seed(0)
data = {
    'category': ['A', 'B', 'A', 'C', 'B', 'C'] * 5,
    'value': np.random.randint(1, 100, 30)
}
df = pd.DataFrame(data)

# Custom aggregation function
def range_func(x):
    return x.max() - x.min()

# Apply custom and built-in aggregations
result = df.groupby('category')['value'].agg([
    ('mean', 'mean'),
    ('range', range_func),
    ('percentile_75', lambda x: np.percentile(x, 75))
])

print("Custom Aggregations:")
print(result)

Resampling Time Series Data

Resampling is a special form of aggregation for time series data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.switch_backend('Agg')

# Create a time series dataset
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
np.random.seed(0)
data = np.random.randn(len(dates)).cumsum()
ts = pd.Series(data, index=dates)

# Resample to monthly frequency
monthly = ts.resample('M').mean()

# Plot original and resampled data
plt.figure(figsize=(12, 6))
ts.plot(label='Daily')
monthly.plot(label='Monthly')
plt.title('Original vs Resampled Time Series')
plt.legend()
plt.show()
print("Plot created successfully.")

print("\nFirst 5 rows of monthly data:")
print(monthly.head())

Key Aggregation Concepts

1. Groupby: Splitting data into groups for aggregation.
2. Aggregation Functions: Built-in (mean, sum, count) and custom functions.
3. Multi-level Aggregation: Applying multiple functions to different columns.
4. Resampling: Time-based aggregation for time series data.
5. Transform: Applying aggregation results back to the original DataFrame.

Practice Exercises

Now it's time to apply what you've learned about data aggregation!

Exercise 1: Sales Data Analysis

Using the provided sales dataset, perform the following analyses:

Calculate the total sales and average order value for each product category.
Find the best-selling product (by quantity) in each region.
Compute the monthly sales trend, showing total sales for each month.

Use groupby() and agg() for category-wise calculations.
Utilize idxmax() to find the best-selling product.
Apply resample() for monthly sales aggregation.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.switch_backend('Agg')

# Create a sample sales dataset
np.random.seed(0)
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
data = {
    'date': np.repeat(dates, 5),
    'product': np.tile(['A', 'B', 'C', 'D', 'E'], len(dates)),
    'category': np.tile(['Electronics', 'Clothing', 'Home', 'Electronics', 'Clothing'], len(dates)),
    'region': np.random.choice(['North', 'South', 'East', 'West'], len(dates) * 5),
    'quantity': np.random.randint(1, 10, len(dates) * 5),
    'price': np.random.uniform(10, 100, len(dates) * 5).round(2)
}
df = pd.DataFrame(data)
df['sales'] = df['quantity'] * df['price']

# Your code here
# 1. Total sales and average order value by category
category_summary = ___

# 2. Best-selling product in each region
best_selling = ___

# 3. Monthly sales trend
monthly_sales = ___

# Print results
print("Category Summary:")
print(___)
print("\nBest-selling product in each region:")
print(___)
print("\nMonthly Sales Trend (first 5 months):")
print(___)

# Plot monthly sales trend
plt.figure(figsize=(12, 6))
# Your plotting code here
plt.show()
print("Plot created successfully.")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.switch_backend('Agg')

# Create a sample sales dataset
np.random.seed(0)
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
data = {
    'date': np.repeat(dates, 5),
    'product': np.tile(['A', 'B', 'C', 'D', 'E'], len(dates)),
    'category': np.tile(['Electronics', 'Clothing', 'Home', 'Electronics', 'Clothing'], len(dates)),
    'region': np.random.choice(['North', 'South', 'East', 'West'], len(dates) * 5),
    'quantity': np.random.randint(1, 10, len(dates) * 5),
    'price': np.random.uniform(10, 100, len(dates) * 5).round(2)
}
df = pd.DataFrame(data)
df['sales'] = df['quantity'] * df['price']

# 1. Total sales and average order value by category
category_summary = df.groupby('category').agg({
    'sales': 'sum',
    'price': 'mean'
}).rename(columns={'sales': 'total_sales', 'price': 'avg_order_value'})

# 2. Best-selling product in each region
best_selling = df.loc[df.groupby('region')['quantity'].idxmax()]

# 3. Monthly sales trend
monthly_sales = df.set_index('date')['sales'].resample('M').sum()

# Print results
print("Category Summary:")
print(category_summary)
print("\nBest-selling product in each region:")
print(best_selling[['region', 'product', 'quantity']])
print("\nMonthly Sales Trend (first 5 months):")
print(monthly_sales.head())

# Plot monthly sales trend
plt.figure(figsize=(12, 6))
monthly_sales.plot()
plt.title('Monthly Sales Trend')
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.show()
print("Plot created successfully.")

Exercise 2: Advanced Data Aggregation

Using the same sales dataset, perform these advanced analyses:

Calculate the running total of sales for each product category.
Find the top 3 products by sales in each region, along with their market share.
Compute the week-over-week growth rate in sales for each category.

Use groupby() with cumsum() for running totals.
Combine groupby(), nlargest(), and apply() for top products and market share.
Utilize resample() and pct_change() for growth rate calculation.

import pandas as pd
import numpy as np

# Using the same dataset as before
np.random.seed(0)
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
data = {
    'date': np.repeat(dates, 5),
    'product': np.tile(['A', 'B', 'C', 'D', 'E'], len(dates)),
    'category': np.tile(['Electronics', 'Clothing', 'Home', 'Electronics', 'Clothing'], len(dates)),
    'region': np.random.choice(['North', 'South', 'East', 'West'], len(dates) * 5),
    'quantity': np.random.randint(1, 10, len(dates) * 5),
    'price': np.random.uniform(10, 100, len(dates) * 5).round(2)
}
df = pd.DataFrame(data)
df['sales'] = df['quantity'] * df['price']

# Your code here
# 1. Running total of sales for each category
running_total = ___

# 2. Top 3 products by sales in each region with market share
top_products = ___

# 3. Week-over-week growth rate in sales for each category
weekly_growth = ___

# Print results
print("Running Total of Sales (first 10 rows for each category):")
print(___)
print("\nTop 3 Products by Sales in Each Region:")
print(___)
print("\nWeek-over-Week Growth Rate (first 5 weeks for each category):")
print(___)

import pandas as pd
import pandas as pd
import numpy as np

# Using the same dataset as before
np.random.seed(0)
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
data = {
    'date': np.repeat(dates, 5),
    'product': np.tile(['A', 'B', 'C', 'D', 'E'], len(dates)),
    'category': np.tile(['Electronics', 'Clothing', 'Home', 'Electronics', 'Clothing'], len(dates)),
    'region': np.random.choice(['North', 'South', 'East', 'West'], len(dates) * 5),
    'quantity': np.random.randint(1, 10, len(dates) * 5),
    'price': np.random.uniform(10, 100, len(dates) * 5).round(2)
}
df = pd.DataFrame(data)
df['sales'] = df['quantity'] * df['price']

# 1. Running total of sales for each category
df_sorted = df.sort_values('date')
df_sorted['cumulative_sales'] = df_sorted.groupby('category')['sales'].cumsum()
running_total = df_sorted[['date', 'category', 'cumulative_sales']]

# 2. Top 3 products by sales in each region with market share
def top_3_products(group):
    total_sales = group['sales'].sum()
    top_3 = group.groupby('product')['sales'].sum().nlargest(3).reset_index()
    top_3['market_share'] = top_3['sales'] / total_sales * 100
    return top_3

top_products = df.groupby('region').apply(top_3_products).reset_index()

# 3. Week-over-week growth rate in sales for each category
weekly_sales = df.set_index('date').groupby('category')['sales'].resample('W').sum()
weekly_growth = weekly_sales.groupby('category').pct_change() * 100

# Print results
print("Running Total of Sales (first 10 rows for each category):")
print(running_total.groupby('category').head(10))
print("\nTop 3 Products by Sales in Each Region:")
print(top_products)
print("\nWeek-over-Week Growth Rate (first 5 weeks for each category):")
print(weekly_growth.groupby('category').head())

Summary

Data aggregation is a fundamental skill in data analysis, allowing you to summarize large datasets and extract meaningful insights. By mastering techniques such as groupby operations, custom aggregations, and time-based resampling, you can efficiently analyze complex datasets and uncover valuable patterns and trends. Continue practicing with different datasets and exploring more advanced aggregation techniques to enhance your data analysis capabilities.