Linear Regression in Machine Learning

Building a Linear Regression Model

Week 7: Introduction to Machine Learning

Linear regression is a fundamental algorithm in machine learning used for predicting a continuous outcome variable based on one or more predictor variables. This lesson covers the theory behind linear regression and its implementation using Python and scikit-learn.

Understanding Linear Regression

Linear regression models the relationship between a dependent variable (y) and one or more independent variables (x) using a linear equation:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

Where:

y is the dependent variable
x₁, x₂, ..., xₙ are independent variables
β₀ is the y-intercept
β₁, β₂, ..., βₙ are the coefficients
ε is the error term
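As a minimal sketch of how the equation above is used in practice, the snippet below estimates the coefficients by ordinary least squares with NumPy's np.linalg.lstsq. The data values and the names X_design, beta, and y_hat are made up for illustration. The data is generated without an error term so that the recovered coefficients match the ones used to build y.

```python
import numpy as np

# Hypothetical noise-free data built from y = 1 + 2*x1 + 3*x2
X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [2.0, 1.0],
              [3.0, 5.0],
              [4.0, 2.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient plays the role of the intercept β₀
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: choose beta to minimize the sum of squared residuals
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Predictions come straight from the linear equation
y_hat = X_design @ beta

print("Estimated coefficients (β₀, β₁, β₂):", np.round(beta, 4))
```

With real data the error term ε is not zero, so the estimated coefficients only approximate the underlying relationship.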

Simple Linear Regression Example

Let's implement a simple linear regression model using Python and scikit-learn:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Use a non-interactive backend so the script also runs on machines without a display
plt.switch_backend('Agg')

# Generate sample data
np.random.seed(0)
X = np.random.rand(100, 1) * 10
y = 2 + 3 * X + np.random.randn(100, 1) * 2

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='red', label='Testing data')
plt.plot(X_test, y_pred, color='green', label='Prediction')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.legend()
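# The 'Agg' backend is non-interactive, so save the figure to a file instead of calling plt.show()
plt.savefig('simple_linear_regression.png')  # arbitrary output filename
print("Plot saved to simple_linear_regression.png.")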

# Print model coefficients and evaluation metrics
print(f"Intercept: {model.intercept_[0]:.2f}")
print(f"Coefficient: {model.coef_[0][0]:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")

Interpreting the Results

Intercept (β₀): The predicted value of y when x is 0.
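Coefficient (β₁): The expected change in y for a one-unit increase in x.
Mean Squared Error (MSE): Average squared difference between predicted and actual values, expressed in squared units of y. Lower is better.
R-squared Score: Proportion of variance in the dependent variable that is predictable from the independent variable(s). A value of 1 indicates perfect prediction, and 0 means the model does no better than always predicting the mean of y. On the training data of a model fitted with an intercept it lies between 0 and 1, but on held-out test data (as computed here with r2_score) it can be negative.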
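To make the two metrics concrete, here is a small sketch that computes them by hand with NumPy. The helper names mse and r_squared and the sample values are assumptions for illustration. The second set of predictions is deliberately poor, to show that R-squared computed this way falls below 0 when predictions are worse than always predicting the mean.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical actual values and two sets of predictions
y_true = [3.0, 5.0, 7.0, 9.0]
good_pred = [2.5, 5.5, 6.5, 9.5]   # close to the actual values
bad_pred = [9.0, 7.0, 5.0, 3.0]    # trend reversed

print(f"Good: MSE={mse(y_true, good_pred):.2f}, R^2={r_squared(y_true, good_pred):.2f}")
print(f"Bad:  MSE={mse(y_true, bad_pred):.2f}, R^2={r_squared(y_true, bad_pred):.2f}")
```

The good predictions give an MSE of 0.25 and an R-squared of 0.95, while the reversed predictions give an MSE of 20.00 and an R-squared of -3.00.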

Assumptions of Linear Regression

Linearity: The relationship between X and y is linear.
Independence: Observations are independent of each other.
Homoscedasticity: The variance of the residuals is the same for any value of X.
Normality: For any fixed value of X, the errors (and therefore y) are normally distributed.
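Some of these assumptions can be examined roughly through the residuals of a fitted model. The sketch below uses synthetic data that satisfies the assumptions by construction and runs three informal numeric checks. The variable names, the sample size, and the checks themselves are illustrative choices rather than formal statistical tests, and independence is not checked here because it depends on how the data was collected.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: linear trend plus normal noise with constant variance
x = rng.uniform(0, 10, size=500)
y = 2 + 3 * x + rng.normal(0, 2, size=500)

# Fit a straight line by least squares (np.polyfit returns the slope first)
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Linearity: residuals should show no leftover curvature in x.
# (Their correlation with x itself is zero by construction for a least-squares fit,
# so the centered square of x is used instead.)
curvature = np.corrcoef((x - x.mean()) ** 2, residuals)[0, 1]

# Homoscedasticity: residual spread should be similar for small and large x
order = np.argsort(x)
low, high = residuals[order[:250]], residuals[order[250:]]
spread_ratio = high.std() / low.std()

# Normality: about 68% of normally distributed residuals lie within one standard deviation
within_one_sd = np.mean(np.abs(residuals) < residuals.std())

print(f"Curvature correlation: {curvature:.3f} (near 0 suggests linearity)")
print(f"Spread ratio high/low: {spread_ratio:.3f} (near 1 suggests constant variance)")
print(f"Share within 1 SD:     {within_one_sd:.3f} (near 0.68 suggests normality)")
```

In practice, a plot of residuals against predicted values is a common complement to numeric checks like these.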

Key Takeaways

Linear regression models the relationship between variables using a linear equation.
The model's performance can be evaluated using metrics like MSE and R-squared.
Interpreting coefficients helps understand the impact of each feature on the prediction.
It's important to check if the assumptions of linear regression are met for reliable results.

Practice Exercises

Let's apply what we've learned about linear regression!

Exercise: Multiple Linear Regression

Implement a multiple linear regression model to predict house prices based on area, number of bedrooms, and age of the house.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate sample data
np.random.seed(0)
n_samples = 100
area = np.random.randint(1000, 5000, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)
price = 100000 + 100 * area + 20000 * bedrooms - 1000 * age + np.random.randn(n_samples) * 50000

# Create a DataFrame
df = pd.DataFrame({
    'area': area,
    'bedrooms': bedrooms,
    'age': age,
    'price': price
})

# Your code here
# 1. Split the data into features (X) and target (y)
# 2. Split the data into training and testing sets
# 3. Create and train the model
# 4. Make predictions on the test set
# 5. Evaluate the model (MSE and R-squared)
# 6. Print the coefficients and interpretation

# Print your results and interpretation

Solution

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate sample data
np.random.seed(0)
n_samples = 100
area = np.random.randint(1000, 5000, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)
price = 100000 + 100 * area + 20000 * bedrooms - 1000 * age + np.random.randn(n_samples) * 50000

# Create a DataFrame
df = pd.DataFrame({
    'area': area,
    'bedrooms': bedrooms,
    'age': age,
    'price': price
})

# Split the data into features (X) and target (y)
X = df[['area', 'bedrooms', 'age']]
y = df['price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model Evaluation:")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")

print("\nModel Coefficients:")
print(f"Intercept: {model.intercept_:.2f}")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.2f}")

# Use an f-string so the placeholders below are filled in with the fitted values
interpretation = f"""
Interpretation:
1. The model explains {r2:.2%} of the variance in house prices (R-squared).
2. For every 1 sq ft increase in area, the price increases by ${model.coef_[0]:.2f}, holding other variables constant.
3. Each additional bedroom adds ${model.coef_[1]:.2f} to the price, holding other variables constant.
4. For each year of age, the price decreases by ${-model.coef_[2]:.2f}, holding other variables constant.
5. The baseline price (when all features are 0) is ${model.intercept_:.2f}.

Note: This is a simplified model and real-world housing prices depend on many more factors.
"""
print(interpretation)
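The phrase "holding other variables constant" can be illustrated with plain arithmetic. The sketch below evaluates the linear equation for a few hypothetical houses. It uses the coefficients from the data-generating formula in the exercise rather than fitted values, which would only approximate them, and the function name predict_price is an assumption for illustration.

```python
import numpy as np

# Coefficients taken from the formula used to generate the exercise data
# (not fitted values; a trained model would only approximate these)
intercept = 100_000.0
coefs = np.array([100.0, 20_000.0, -1_000.0])  # area, bedrooms, age

def predict_price(area, bedrooms, age):
    """Evaluate the linear equation for a single house."""
    features = np.array([area, bedrooms, age], dtype=float)
    return intercept + coefs @ features

base = predict_price(area=2000, bedrooms=3, age=10)
one_more_bedroom = predict_price(area=2000, bedrooms=4, age=10)
ten_years_older = predict_price(area=2000, bedrooms=3, age=20)

print(f"Base house:        ${base:,.0f}")
print(f"One more bedroom:  ${one_more_bedroom:,.0f} (difference {one_more_bedroom - base:+,.0f})")
print(f"Ten years older:   ${ten_years_older:,.0f} (difference {ten_years_older - base:+,.0f})")
```

Changing only the number of bedrooms shifts the prediction by exactly the bedrooms coefficient, and changing only the age shifts it by ten times the age coefficient.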

Summary

Linear regression is a powerful tool for predicting continuous variables and understanding the relationships between features and the target variable. We've covered simple and multiple linear regression, including implementation, evaluation, and interpretation of results. Remember to always check the assumptions of linear regression and consider the practical significance of your findings in real-world applications.
