Introduction to Machine Learning

Week 7: Introduction to Machine Learning

Machine Learning is a subset of artificial intelligence that focuses on the development of algorithms and statistical models that enable computer systems to improve their performance on a specific task through experience. This lesson introduces the fundamental concepts of machine learning and its applications in data analysis.

What is Machine Learning?

Machine Learning is the science of getting computers to learn from data, in the form of observations and real-world interactions, and to improve with experience instead of following explicitly programmed rules.

Types of Machine Learning

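Supervised Learning: The algorithm learns from labeled training data, where each example is paired with the correct output.
Unsupervised Learning: The algorithm finds structure, such as groups or patterns, in unlabeled data.
Reinforcement Learning: The algorithm learns through interaction with an environment, using rewards or penalties as feedback.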
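To make the difference between the first two types concrete, here is a minimal sketch that fits a supervised and an unsupervised model to the same points. The toy data (two well-separated groups) and the choice of LogisticRegression and KMeans are assumptions made for illustration; reinforcement learning needs an environment to interact with and is left out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two well-separated groups of 2-D points
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])
y = np.array([0] * 50 + [1] * 50)  # labels, used only by the supervised model

# Supervised: the model sees both the features X and the labels y
classifier = LogisticRegression()
classifier.fit(X, y)
supervised_accuracy = classifier.score(X, y)

# Unsupervised: the model sees only X and has to find the groups itself
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(X)

print(f"Supervised training accuracy: {supervised_accuracy:.2f}")
print(f"Cluster sizes found without labels: {np.bincount(cluster_ids)}")
```

The cluster numbers returned by KMeans are arbitrary: cluster 0 need not correspond to label 0, because the model never saw the labels.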

Common Machine Learning Algorithms

Linear Regression
Logistic Regression
Decision Trees
Random Forests
Support Vector Machines (SVM)
K-Nearest Neighbors (KNN)
K-Means Clustering
Neural Networks
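As a rough orientation, each algorithm in this list has at least one counterpart in scikit-learn. The sketch below picks one estimator class per algorithm; the picks are illustrative assumptions (for instance, decision trees and random forests also come in regressor variants), not the only options.

```python
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One scikit-learn estimator per algorithm in the list above
# (illustrative choices; several algorithms have other variants)
estimators = {
    "Linear Regression": LinearRegression(),
    "Logistic Regression": LogisticRegression(),
    "Decision Trees": DecisionTreeClassifier(),
    "Random Forests": RandomForestClassifier(),
    "Support Vector Machines (SVM)": SVC(),
    "K-Nearest Neighbors (KNN)": KNeighborsClassifier(),
    "K-Means Clustering": KMeans(n_clusters=2, n_init=10),
    "Neural Networks": MLPClassifier(),
}

for name, estimator in estimators.items():
    # All of these share the same basic interface: fit() to train, predict() to apply
    print(f"{name}: {type(estimator).__name__}")
```

Because the estimators share the fit/predict interface, swapping one algorithm for another usually means changing a single line.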

Example: Linear Regression

Let's implement a simple linear regression model using Python and scikit-learn:


import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
plt.switch_backend('Agg')

# Generate sample data
np.random.seed(0)
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.randn(100, 1) * 0.1

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='red', label='Testing data')
plt.plot(X_test, y_pred, color='green', label='Prediction')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Example')
plt.legend()
plt.savefig('linear_regression.png')  # the Agg backend cannot open a window, so save to a file
print("Plot saved to linear_regression.png")

# Print model coefficients and score
print(f"Intercept: {model.intercept_[0]:.2f}")
print(f"Coefficient: {model.coef_[0][0]:.2f}")
print(f"R-squared score: {model.score(X_test, y_test):.2f}")
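To see what fitting the model involves, the sketch below solves the same least-squares problem directly with NumPy and compares the result with scikit-learn. It is an illustration under simplifying assumptions: it regenerates the same synthetic data and fits on all 100 points instead of the training split, and the names X_design, weights, intercept, and slope are introduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic data as the example above: y = 2 + 3x + small noise
np.random.seed(0)
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.randn(100, 1) * 0.1

# Add a column of ones so the intercept is learned as an ordinary weight
X_design = np.hstack([np.ones((X.shape[0], 1)), X])

# Find the weights that minimise the squared error ||X_design @ w - y||^2
weights, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, slope = weights[0, 0], weights[1, 0]

# Fit scikit-learn on the same data for comparison
model = LinearRegression().fit(X, y)

print(f"NumPy least squares: intercept={intercept:.2f}, slope={slope:.2f}")
print(f"scikit-learn:        intercept={model.intercept_[0]:.2f}, slope={model.coef_[0][0]:.2f}")
```

Both approaches minimise the same squared error, so the two lines of output should agree, and both should land close to the true values of 2 and 3 used to generate the data.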

Key Takeaways

Machine Learning allows computers to learn from data and improve performance over time.
There are three main types of machine learning: supervised, unsupervised, and reinforcement learning.
Various algorithms are used in machine learning, each suited to different types of problems.
Linear regression is a simple, interpretable algorithm for predicting numerical values.
Scikit-learn is a popular Python library for implementing machine learning algorithms.

Practice Exercises

Let's apply what we've learned about machine learning!

Exercise 1: K-Nearest Neighbors Classification

Implement a K-Nearest Neighbors classifier to predict whether a customer will purchase a product based on their age and estimated salary.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
plt.switch_backend('Agg')

# Generate sample data
np.random.seed(0)
X = np.random.rand(100, 2) * 100  # Age and Salary
y = (X[:, 0] + X[:, 1] > 100).astype(int)  # Purchase (1) or not (0)

# Your code here
# 1. Split the data into training and testing sets
# 2. Create and train a KNN model
# 3. Make predictions on the test set
# 4. Calculate and print the accuracy score
# 5. Plot the decision boundary

# Print your results and interpretation

Solution


import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
plt.switch_backend('Agg')

# Generate sample data
np.random.seed(0)
X = np.random.rand(100, 2) * 100  # Age and Salary
y = (X[:, 0] + X[:, 1] > 100).astype(int)  # Purchase (1) or not (0)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

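# Calculate accuracy and the confusion matrix
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Confusion matrix (rows: actual, columns: predicted):")
print(confusion_matrix(y_test, y_pred))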

# Plot the decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
plt.xlabel("Age")
plt.ylabel("Salary")
plt.title("KNN Classification: Purchase Decision")
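plt.savefig("knn_decision_boundary.png")  # the Agg backend cannot open a window, so save to a file
print("Plot saved to knn_decision_boundary.png")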

interpretation = f"""
Interpretation:
The KNN model achieved an accuracy of {accuracy:.2f} on the test set.
The decision boundary plot shows how the model classifies different age-salary combinations.
With Matplotlib's default colormap, the darker shaded region marks predicted non-purchases and the lighter region marks predicted purchases.
The boundary roughly follows the line age + salary = 100, the rule used to generate the labels, so customers with a higher combined age and salary are predicted to purchase.
"""
print(interpretation)
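For intuition about what KNeighborsClassifier does when it predicts, here is a minimal from-scratch sketch of the idea: find the k closest training points and take a majority vote. It assumes Euclidean distance and k=3 to match the solution, the helper knn_predict and the two query customers are hypothetical, and the real library is considerably more optimised.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Predict the label of one query point by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbors wins
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Same synthetic data as the exercise
np.random.seed(0)
X = np.random.rand(100, 2) * 100  # Age and Salary
y = (X[:, 0] + X[:, 1] > 100).astype(int)  # Purchase (1) or not (0)

# Two hypothetical customers, far on either side of the age + salary = 100 line
low_customer = knn_predict(X, y, np.array([20.0, 20.0]))
high_customer = knn_predict(X, y, np.array([80.0, 80.0]))
print(f"Prediction for (20, 20): {low_customer}")
print(f"Prediction for (80, 80): {high_customer}")
```

With two classes, an odd k avoids tied votes, which is one reason small odd values such as 3 or 5 are common starting points.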

Summary

This introduction to machine learning has covered the fundamental concepts, types of machine learning, and common algorithms. We've explored a practical example of linear regression and practiced with K-Nearest Neighbors classification. These basics form the foundation for more advanced machine learning techniques and applications in data science.