Classification Techniques
Week 7: Introduction to Machine Learning
Classification is a fundamental task in machine learning where the goal is to predict the categorical class labels of new instances based on past observations. This lesson covers key classification algorithms and their implementation using Python and scikit-learn.
Overview of Classification Algorithms
We'll focus on three popular classification algorithms:
- Logistic Regression
- Decision Trees
- Support Vector Machines (SVM)
1. Logistic Regression
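Logistic regression is most often used for binary classification, although it extends to multiclass problems. It models the probability that an instance belongs to a class by passing a linear combination of the features through the logistic (sigmoid) function.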
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
# Use a non-interactive backend so the script also runs without a display
plt.switch_backend('Agg')
# Generate sample data
np.random.seed(0)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train the model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
# Plot decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Logistic Regression Decision Boundary")
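# The Agg backend cannot open a window, so save the figure to a file instead
plt.savefig("logistic_regression_boundary.png")
print("Plot saved to logistic_regression_boundary.png")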
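The probability estimate described at the start of this section can be inspected directly. The sketch below refits the same synthetic data (on all 100 points, for brevity) and recomputes the class 1 probability by hand as the sigmoid of the linear score. The names proba, scores, and manual are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same synthetic data as in the example above
np.random.seed(0)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns one column per class: [P(class 0), P(class 1)]
proba = model.predict_proba(X[:5])

# Recompute P(class 1) by hand: sigmoid of the linear score w.x + b
scores = X[:5] @ model.coef_.ravel() + model.intercept_[0]
manual = 1.0 / (1.0 + np.exp(-scores))

print("predict_proba:", np.round(proba[:, 1], 3))
print("manual sigmoid:", np.round(manual, 3))
```

The two printed rows should agree, which shows that predict() simply thresholds this probability at 0.5.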
2. Decision Trees
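Decision trees are versatile algorithms that can perform both classification and regression. They predict by asking a sequence of yes/no questions about feature values (for example, whether Feature 1 is below some threshold), with each answer leading down a branch until a leaf assigns the class.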
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
# Use a non-interactive backend so the script also runs without a display
plt.switch_backend('Agg')
# Generate sample data
np.random.seed(0)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train the model
tree_model = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_model.fit(X_train, y_train)
# Make predictions
y_pred_tree = tree_model.predict(X_test)
# Evaluate the model
accuracy_tree = accuracy_score(y_test, y_pred_tree)
conf_matrix_tree = confusion_matrix(y_test, y_pred_tree)
print(f"Accuracy: {accuracy_tree:.2f}")
print("Confusion Matrix:")
print(conf_matrix_tree)
# Plot the decision tree
plt.figure(figsize=(20,10))
plot_tree(tree_model, filled=True, feature_names=['Feature 1', 'Feature 2'], class_names=['Class 0', 'Class 1'])
plt.title("Decision Tree Visualization")
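# The Agg backend cannot open a window, so save the figure to a file instead
plt.savefig("decision_tree.png")
print("Plot saved to decision_tree.png")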
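The "series of questions" a tree asks can also be printed as plain text. This is a minimal sketch using scikit-learn's export_text on the same synthetic data (fitted on all 100 points for brevity). The variable name rules is illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Same synthetic data as in the example above
np.random.seed(0)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

tree_model = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_model.fit(X, y)

# Each indented line is one threshold question; leaves show the predicted class
rules = export_text(tree_model, feature_names=["Feature 1", "Feature 2"])
print(rules)
```

Reading the output top to bottom follows one path through the tree, which is what makes shallow trees easy to explain.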
3. Support Vector Machines (SVM)
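An SVM finds the hyperplane that separates the classes with the largest margin, that is, the greatest distance to the nearest training points (the support vectors). With a kernel such as RBF, it can also learn non-linear boundaries by implicitly working in a higher-dimensional feature space.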
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.svm import SVC
# Use a non-interactive backend so the script also runs without a display
plt.switch_backend('Agg')
# Generate sample data
np.random.seed(0)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train the model
svm_model = SVC(kernel='rbf', random_state=42)
svm_model.fit(X_train, y_train)
# Make predictions
y_pred_svm = svm_model.predict(X_test)
# Evaluate the model
accuracy_svm = accuracy_score(y_test, y_pred_svm)
conf_matrix_svm = confusion_matrix(y_test, y_pred_svm)
print(f"Accuracy: {accuracy_svm:.2f}")
print("Confusion Matrix:")
print(conf_matrix_svm)
# Plot decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = svm_model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("SVM Decision Boundary")
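# The Agg backend cannot open a window, so save the figure to a file instead
plt.savefig("svm_boundary.png")
print("Plot saved to svm_boundary.png")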
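To make the separating hyperplane concrete, the sketch below swaps in a linear kernel (an assumption for illustration; the example above uses RBF), because only a linear SVM exposes the weight vector through coef_. It fits on all 100 points for brevity and reports how many of them become support vectors. The names linear_svm, w, b, and margin_width are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Same synthetic data as in the example above
np.random.seed(0)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Linear kernel (unlike the RBF kernel above) so the hyperplane w.x + b = 0 is explicit
linear_svm = SVC(kernel="linear", C=1.0)
linear_svm.fit(X, y)

w = linear_svm.coef_.ravel()
b = linear_svm.intercept_[0]

# Distance between the two margin lines w.x + b = +1 and w.x + b = -1
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors:", len(linear_svm.support_vectors_), "of", len(X))
print(f"Margin width: {margin_width:.3f}")

# decision_function is the signed score w.x + b; its sign gives the predicted class
print(np.round(linear_svm.decision_function(X[:3]), 3))
```

Only the support vectors determine the boundary, so removing any other training point and refitting would leave the hyperplane unchanged.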
Key Takeaways
- Logistic Regression is simple and interpretable, and works best when the classes can be separated by a roughly linear boundary.
- Decision Trees are easy to understand and can handle non-linear relationships, but may overfit.
- SVMs are powerful for high-dimensional spaces and work well when classes have clear margins.
- The choice of algorithm depends on the specific problem, data characteristics, and interpretability requirements.
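The overfitting risk noted for Decision Trees can be seen by comparing training and test accuracy. This is a rough sketch under assumptions that are not part of the lesson: 300 synthetic points with 15% of the labels flipped at random, and an unlimited-depth tree compared against max_depth=3.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed setup: the lesson's decision rule plus 15% random label noise
rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.rand(300) < 0.15
y_noisy = np.where(flip, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_noisy, test_size=0.3, random_state=42
)

results = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.2f}, test={results[depth][1]:.2f}")
```

A large gap between training and test accuracy for the unlimited tree indicates that it has memorized the noisy labels rather than the underlying rule.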
Practice Exercise
Let's apply what we've learned about classification techniques!
Exercise: Comparing Classification Algorithms
Use the Iris dataset to compare the performance of Logistic Regression, Decision Trees, and SVM.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Your code here
# 1. Create and train models for Logistic Regression, Decision Tree, and SVM
# 2. Make predictions using each model
# 3. Calculate and print the accuracy and classification report for each model
# 4. Compare the performance of the models
# Print your results and interpretation
Summary
We've explored three popular classification algorithms: Logistic Regression, Decision Trees, and Support Vector Machines. Each algorithm has its strengths and is suited for different types of problems. Logistic Regression is simple and interpretable, Decision Trees can capture non-linear relationships, and SVMs are powerful for high-dimensional data. The choice of algorithm depends on the specific problem, data characteristics, and interpretability requirements. Always compare multiple algorithms and use appropriate evaluation metrics to select the best model for your classification task.
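One common way to compare several algorithms is k-fold cross-validation, which averages accuracy over multiple train/test splits instead of relying on a single split. The sketch below uses a synthetic dataset from make_classification rather than Iris, so the practice exercise is left open. The dataset parameters and the names models and cv_means are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic binary problem: 5 features, 3 of them informative
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "SVM (RBF)": SVC(kernel="rbf"),
}

cv_means = {}
for name, model in models.items():
    # 5-fold cross-validation: five accuracy scores, one per held-out fold
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Accuracy is only one possible metric. For imbalanced classes, passing a different scoring value such as "f1" may give a more useful comparison.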