Classification Techniques in Machine Learning

Week 7: Introduction to Machine Learning

Classification is a fundamental task in machine learning where the goal is to predict the categorical class labels of new instances based on past observations. This lesson covers key classification algorithms and their implementation using Python and scikit-learn.

Overview of Classification Algorithms

We'll focus on three popular classification algorithms:

Logistic Regression
Decision Trees
Support Vector Machines (SVM)

1. Logistic Regression

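Logistic regression is most often introduced for binary classification, though it also extends to multiclass problems. It passes a weighted sum of the features through the logistic (sigmoid) function to estimate the probability that an instance belongs to a particular class.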


import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
plt.switch_backend('Agg')

# Generate sample data
np.random.seed(0)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)

# Plot decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Logistic Regression Decision Boundary")
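# The non-interactive Agg backend cannot open a window, so save the figure to a file
plt.savefig("logistic_regression_boundary.png")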
print("Plot created successfully.")
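The probability estimate described at the start of this section can be inspected directly. The sketch below is illustrative rather than part of the original lesson: it refits the model on the full synthetic dataset (no train/test split, for brevity), computes the sigmoid of the linear score by hand, and compares it with scikit-learn's `predict_proba`. The names `z`, `manual_proba`, and `sklearn_proba` are assumptions introduced for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same synthetic data as in the example above
np.random.seed(0)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

# Linear score z = w . x + b for every sample
z = X @ model.coef_.ravel() + model.intercept_[0]

# The logistic (sigmoid) function maps the score to a probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-z))

# scikit-learn's estimate; column 1 holds the probability of class 1
sklearn_proba = model.predict_proba(X)[:, 1]

print("Manual and scikit-learn probabilities match:",
      np.allclose(manual_proba, sklearn_proba))
print("First five probabilities of class 1:", np.round(sklearn_proba[:5], 3))
```

For a binary problem, `predict` corresponds to thresholding this probability at 0.5, which is the same as checking whether the linear score is positive.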

2. Decision Trees

Decision trees are versatile algorithms that can perform both classification and regression tasks. They make predictions by asking a series of yes/no questions about the feature values, with each answer leading either to a further question or to a final class label.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
plt.switch_backend('Agg')

# Generate sample data
np.random.seed(0)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


# Create and train the model
tree_model = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_model.fit(X_train, y_train)

# Make predictions
y_pred_tree = tree_model.predict(X_test)

# Evaluate the model
accuracy_tree = accuracy_score(y_test, y_pred_tree)
conf_matrix_tree = confusion_matrix(y_test, y_pred_tree)

print(f"Accuracy: {accuracy_tree:.2f}")
print("Confusion Matrix:")
print(conf_matrix_tree)

# Plot the decision tree
plt.figure(figsize=(20,10))
plot_tree(tree_model, filled=True, feature_names=['Feature 1', 'Feature 2'], class_names=['Class 0', 'Class 1'])
plt.title("Decision Tree Visualization")
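# The non-interactive Agg backend cannot open a window, so save the figure to a file
plt.savefig("decision_tree.png")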
print("Plot created successfully.")
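The questions a decision tree asks can also be read as text. As a minimal sketch, assuming the same synthetic data and `max_depth=3` as above (fitted on the full dataset for brevity), scikit-learn's `export_text` prints each split as a threshold test on one feature. The variable name `rules` is introduced here for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Same synthetic data as in the example above
np.random.seed(0)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

tree_model = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_model.fit(X, y)

# Each indented line is one question of the form "feature <= threshold"
rules = export_text(tree_model, feature_names=["Feature 1", "Feature 2"])
print(rules)

print("Tree depth:", tree_model.get_depth())
print("Number of leaves:", tree_model.get_n_leaves())
```

Reading the output from top to bottom follows the path a sample takes from the root question to a leaf, where the predicted class is reported.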

3. Support Vector Machines (SVM)

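An SVM finds the hyperplane that separates the classes with the widest possible margin. With a kernel such as the RBF kernel used below, it implicitly maps the data into a higher-dimensional space, which lets it learn non-linear decision boundaries in the original features.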

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.svm import SVC
plt.switch_backend('Agg')

# Generate sample data
np.random.seed(0)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the model
svm_model = SVC(kernel='rbf', random_state=42)
svm_model.fit(X_train, y_train)

# Make predictions
y_pred_svm = svm_model.predict(X_test)

# Evaluate the model
accuracy_svm = accuracy_score(y_test, y_pred_svm)
conf_matrix_svm = confusion_matrix(y_test, y_pred_svm)

print(f"Accuracy: {accuracy_svm:.2f}")
print("Confusion Matrix:")
print(conf_matrix_svm)

# Plot decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = svm_model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("SVM Decision Boundary")
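# The non-interactive Agg backend cannot open a window, so save the figure to a file
plt.savefig("svm_boundary.png")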
print("Plot created successfully.")
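To make the idea of a separating hyperplane more concrete, the sketch below swaps the RBF kernel for a linear one, since scikit-learn only exposes the hyperplane coefficients (`coef_`) for linear kernels. It is an illustration under that assumption, not the lesson's own model; `linear_svm`, `w`, `b`, and `margin_width` are names introduced here.

```python
import numpy as np
from sklearn.svm import SVC

# Same synthetic data as in the example above
np.random.seed(0)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Linear kernel (an assumption for this sketch) so the hyperplane is explicit
linear_svm = SVC(kernel="linear", C=1.0)
linear_svm.fit(X, y)

# Hyperplane: w[0] * x1 + w[1] * x2 + b = 0
w = linear_svm.coef_.ravel()
b = linear_svm.intercept_[0]

# Distance between the two margin boundaries is 2 / ||w||
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors per class:", linear_svm.n_support_)
print(f"Hyperplane: {w[0]:.2f} * x1 + {w[1]:.2f} * x2 + {b:.2f} = 0")
print(f"Margin width: {margin_width:.3f}")
```

Only the support vectors, the training points on or inside the margin, determine where the hyperplane lies; the remaining points could be moved without changing the fitted boundary, as long as they stay outside the margin on the correct side.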

Key Takeaways

Logistic Regression is simple and interpretable, best for linearly separable classes.
Decision Trees are easy to understand and can handle non-linear relationships, but may overfit.
SVMs are powerful for high-dimensional spaces and work well when classes have clear margins.
The choice of algorithm depends on the specific problem, data characteristics, and interpretability requirements.
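The overfitting risk mentioned for decision trees can be checked by comparing training and test accuracy as the tree is allowed to grow deeper. The following is a rough sketch on a noisy two-moons dataset from `make_moons`, which is an assumption chosen for illustration and not a dataset used elsewhere in this lesson; the `results` dictionary is likewise introduced here.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy, non-linearly separable data (illustrative choice)
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

results = {}
for depth in [1, 3, 5, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    results[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")
```

A large gap between training and test accuracy at greater depths is the usual symptom of overfitting; limiting `max_depth` is one simple way to reduce it.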

Practice Exercise

Let's apply what we've learned about classification techniques!

Exercise: Comparing Classification Algorithms

Use the Iris dataset to compare the performance of Logistic Regression, Decision Trees, and SVM.


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Your code here
# 1. Create and train models for Logistic Regression, Decision Tree, and SVM
# 2. Make predictions using each model
# 3. Calculate and print the accuracy and classification report for each model
# 4. Compare the performance of the models

# Print your results and interpretation

Solution

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train models
models = {
    # A higher max_iter helps avoid a convergence warning on the unscaled features
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(random_state=42)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    print(f"\n{name} Results:")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred, target_names=iris.target_names))

interpretation = """
Interpretation:
1. All three models usually perform well on the Iris dataset; read the exact accuracies from the output above.
2. The test set holds only 45 samples, so a gap of one or two misclassified flowers is not enough to rank the models.
3. Perfect precision and recall on test data does not indicate overfitting. Overfitting shows up as
   training accuracy that is much higher than test accuracy.
4. Use the per-class precision and recall in each report to check whether a model struggles with a particular class.
5. When the scores are this close, prefer the simpler or more interpretable model.

Remember that the Iris dataset is relatively small and simple. For more complex datasets,
the performance differences between these algorithms might be more pronounced.
"""
print(interpretation)
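A single train/test split on 150 samples can give noisy results, so one way to make the comparison more stable is k-fold cross-validation. The sketch below is an optional extension, not part of the original solution: it assumes 5 stratified folds and adds feature scaling for Logistic Regression and SVM through a pipeline. The `cv` and `cv_scores` names are introduced for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stratified folds keep the class proportions the same in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    # Scaling inside the pipeline is fitted on each training fold only
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean accuracy={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the standard deviation alongside the mean shows how much of any difference between models could simply come from which samples landed in the test fold.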

Summary

We've explored three popular classification algorithms: Logistic Regression, Decision Trees, and Support Vector Machines. Each algorithm has its strengths and is suited for different types of problems. Logistic Regression is simple and interpretable, Decision Trees can capture non-linear relationships, and SVMs are powerful for high-dimensional data. The choice of algorithm depends on the specific problem, data characteristics, and interpretability requirements. Always compare multiple algorithms and use appropriate evaluation metrics to select the best model for your classification task.
