Building a Linear Regression Model
Week 7: Introduction to Machine Learning
Linear regression is a fundamental algorithm in machine learning used for predicting a continuous outcome variable based on one or more predictor variables. This lesson covers the theory behind linear regression and its implementation using Python and scikit-learn.
Understanding Linear Regression
Linear regression models the relationship between a dependent variable (y) and one or more independent variables (x) using a linear equation:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Where:
- y is the dependent variable
- x₁, x₂, ..., xₙ are independent variables
- β₀ is the y-intercept
- β₁, β₂, ..., βₙ are the coefficients
- ε is the error term
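To make the equation concrete, here is a minimal sketch that builds y from known coefficients and then recovers them by ordinary least squares using NumPy. The toy data and the values chosen for β₀, β₁, and β₂ are assumptions made up for illustration, and the error term ε is set to zero so the recovery is exact.

```python
import numpy as np

# Hypothetical toy data: 5 observations of 2 independent variables (x1, x2)
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 2.0]])

# Assumed "true" parameters, chosen only for this illustration
beta_0 = 1.5                   # intercept
beta = np.array([2.0, -0.5])   # coefficients for x1 and x2

# y = beta_0 + beta_1 * x1 + beta_2 * x2, with the error term set to 0
y = beta_0 + X @ beta

# Prepend a column of ones so the intercept is estimated alongside the coefficients
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: find the parameters that minimize the squared error
estimates, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(estimates)  # approximately [1.5, 2.0, -0.5]
```

With real data ε is not zero, so the estimates only approximate the underlying parameters.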
Simple Linear Regression Example
Let's implement a simple linear regression model using Python and scikit-learn:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Use the non-interactive Agg backend, which renders figures to files instead of opening a window
plt.switch_backend('Agg')
# Generate sample data
np.random.seed(0)
X = np.random.rand(100, 1) * 10
y = 2 + 3 * X + np.random.randn(100, 1) * 2
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='red', label='Testing data')
plt.plot(X_test, y_pred, color='green', label='Prediction')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.legend()
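# plt.show() displays nothing under the Agg backend, so write the figure to a file instead
plt.savefig('simple_linear_regression.png')  # illustrative filename
print("Plot saved successfully.")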
# Print model coefficients and evaluation metrics
print(f"Intercept: {model.intercept_[0]:.2f}")
print(f"Coefficient: {model.coef_[0][0]:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
Interpreting the Results
- Intercept (β₀): The predicted value of y when x is 0.
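- Coefficient (β₁): The expected change in y for a one-unit increase in x. With several predictors, each coefficient describes the change for its own variable while the others are held constant.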
- Mean Squared Error (MSE): Average squared difference between predicted and actual values. Lower is better.
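- R-squared Score: Proportion of variance in the dependent variable predictable from the independent variable(s). A value of 1 indicates perfect prediction and 0 means the model does no better than always predicting the mean of y. On held-out test data the score can be negative when the model does worse than that baseline.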
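The two metrics above can also be computed by hand, which may help clarify what scikit-learn reports. The following is a small sketch using made-up actual and predicted values (an assumption for illustration only) and comparing the manual results with `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_hat

# MSE: the mean of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE: {mse_manual:.3f} (sklearn: {mean_squared_error(y_true, y_hat):.3f})")
print(f"R-squared: {r2_manual:.3f} (sklearn: {r2_score(y_true, y_hat):.3f})")
```

Note that MSE is expressed in squared units of y, so taking its square root gives an error on the same scale as the target.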
Assumptions of Linear Regression
- Linearity: The relationship between X and y is linear.
- Independence: Observations are independent of each other.
- Homoscedasticity: The variance of the residuals is the same for any value of X.
- Normality: For any fixed value of X, the errors (and therefore y) are normally distributed.
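A few of these assumptions can be checked roughly by inspecting the residuals of a fitted model. The sketch below is one informal approach, not a formal statistical test: the synthetic data, the split of X at 5, and the "within one standard deviation" heuristic are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that satisfies the assumptions by construction (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 + 3 * X[:, 0] + rng.normal(0, 2, size=200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# With an intercept, least-squares residuals average to (numerically) zero on the training data
mean_resid = residuals.mean()

# Rough homoscedasticity check: residual spread in the lower vs. upper half of X
lower = residuals[X[:, 0] < 5]
upper = residuals[X[:, 0] >= 5]
spread_ratio = upper.std() / lower.std()  # near 1 suggests similar variance

# Rough normality check: about 68% of normally distributed residuals fall within one standard deviation
within_one_sd = np.mean(np.abs(residuals) < residuals.std())

print(f"Mean residual: {mean_resid:.2e}")
print(f"Spread ratio (upper / lower): {spread_ratio:.2f}")
print(f"Fraction within one standard deviation: {within_one_sd:.2f}")
```

In practice, a plot of residuals against predicted values is usually a more informative first check than any single number.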
Key Takeaways
- Linear regression models the relationship between variables using a linear equation.
- The model's performance can be evaluated using metrics like MSE and R-squared.
- Interpreting coefficients helps understand the impact of each feature on the prediction.
- It's important to check if the assumptions of linear regression are met for reliable results.
Practice Exercises
Let's apply what we've learned about linear regression!
Exercise: Multiple Linear Regression
Implement a multiple linear regression model to predict house prices based on area, number of bedrooms, and age of the house.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Generate sample data
np.random.seed(0)
n_samples = 100
area = np.random.randint(1000, 5000, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)
price = 100000 + 100 * area + 20000 * bedrooms - 1000 * age + np.random.randn(n_samples) * 50000
# Create a DataFrame
df = pd.DataFrame({
'area': area,
'bedrooms': bedrooms,
'age': age,
'price': price
})
# Your code here
# 1. Split the data into features (X) and target (y)
# 2. Split the data into training and testing sets
# 3. Create and train the model
# 4. Make predictions on the test set
# 5. Evaluate the model (MSE and R-squared)
# 6. Print the coefficients and interpretation
# Print your results and interpretation
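After attempting the exercise, you can compare your work with the sketch below. It is one possible way to work through the six steps, assuming the same synthetic data as above, and is not the only valid solution. The variable names are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Recreate the exercise data so this block runs on its own
np.random.seed(0)
n_samples = 100
area = np.random.randint(1000, 5000, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)
price = 100000 + 100 * area + 20000 * bedrooms - 1000 * age + np.random.randn(n_samples) * 50000
df = pd.DataFrame({'area': area, 'bedrooms': bedrooms, 'age': age, 'price': price})

# 1. Split the data into features (X) and target (y)
X = df[['area', 'bedrooms', 'age']]
y = df['price']

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# 4. Make predictions on the test set
y_pred = model.predict(X_test)

# 5. Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# 6. Print the coefficients; each is the estimated price change per unit of
#    that feature while the other features are held constant
print(f"Intercept: {model.intercept_:.2f}")
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
```

Because the data were generated with known coefficients (100 per unit of area, 20000 per bedroom, -1000 per year of age), you can judge how closely the fitted values track them. The noise is large relative to the bedroom and age effects, so those two estimates may be noticeably off.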
Summary
Linear regression is a powerful tool for predicting continuous variables and understanding the relationships between features and the target variable. We've covered simple and multiple linear regression, including implementation, evaluation, and interpretation of results. Remember to always check the assumptions of linear regression and consider the practical significance of your findings in real-world applications.