Saturday, March 11, 2023

K-Nearest Neighbors Algorithm

The K-Nearest Neighbors (KNN) algorithm is a simple yet powerful machine learning technique. It is most often used for classification tasks, but it can also be used for regression. In this post, I focus on the implementation of the KNN algorithm for classification, since this is the most common way to use it.

The main advantage of KNN, which sets it apart from other machine learning algorithms, is its simplicity and the intuitive way it makes predictions. Due to this, it has found a wide range of applications in different fields, from recommendation systems and image recognition to genetic research and anomaly detection.

It belongs to the family of instance-based learning algorithms, which means that it doesn't explicitly learn a model. Instead, it memorizes the training instances, which are subsequently used as "knowledge" to make predictions for new instances. The principle behind KNN is based on the concept of similarity, encapsulated by the saying "Birds of a feather flock together": similar data points are likely to have the same class label (in classification) or a similar output value (in regression).

For classification tasks, KNN predicts the class of a new instance based on the majority class of its nearest neighbors. For regression tasks, it usually takes the mean or median of the nearest neighbors' values. The choice of K and of the distance metric used to compare points (Euclidean, Manhattan, Minkowski, etc.) can greatly affect the algorithm's performance, making KNN a versatile tool that can be finely tuned for specific datasets and problems.
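To make these distance options concrete, here is a small sketch of how they could be computed between two points with NumPy (the two points below are made up purely for illustration; the implementation in this post relies on the Euclidean distance):

# Illustrative example: common distance metrics between two points
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b)**2))        # 5.0
manhattan = np.sum(np.abs(a - b))              # 7.0
p = 3
minkowski = np.sum(np.abs(a - b)**p)**(1 / p)  # Minkowski distance of order p

print(euclidean, manhattan, minkowski)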

The pseudocode is described in the next image, which provides an outline of the basic KNN algorithm for classification.



Implementing the Algorithm in Python

The process for selecting the number of neighbors K may differ from one dataset to another. In this post, K is chosen arbitrarily; in a future post I'll talk about how to correctly select the number K.


Importing Libraries

First, we import the libraries needed for the code:

# Importing libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statistics import mode
from matplotlib.colors import ListedColormap
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import sys

  • numpy: Provides mathematical functions for arrays.
  • sklearn: One of the most popular libraries for machine learning in Python. It features many machine learning algorithms, including the KNN algorithm. It also provides tools for model fitting, data preprocessing, model selection, model evaluation and other utilities.
  • statistics: Provides functions for calculating mathematical statistics of numeric data; in this case we import the mode function to obtain the majority class among the K nearest points.
  • seaborn and matplotlib: These are visualization libraries in Python for creating static, animated, and interactive visualizations, as well as drawing attractive and informative statistical graphs.

Defining Functions

Now that we have imported the necessary libraries, we can proceed to implement the K-Nearest Neighbors (KNN) algorithm. We define two functions: FindNearLabes and Knn.

# This function returns the labels of the k nearest neighbors
def FindNearLabes(X, Dataset, Labels, k):

    # Calculating the squared Euclidean distances from the point X to all points of the dataset
    # (the square root is omitted because it doesn't change the ordering of the distances)
    Difference = X - Dataset
    Distances = np.sum(Difference**2, axis=1)
    # Finding the labels of the k nearest points (smallest distances)
    sorted_ids = np.argsort(Distances)
    nearest_labels = Labels[sorted_ids[0:k]]
        
    return nearest_labels

# This function performs K-nearest neighbors classification algorithm
def Knn(Test_Dataset, Train_Dataset, Train_Labels, k):
    
    Predicted_Labels = []
    # Iterate over each data point and find the nearest neighbors
    for i in range(len(Test_Dataset)):
        # Find the labels of the k nearest neighbors for this point
        Labels = FindNearLabes(Test_Dataset[i], Train_Dataset, Train_Labels, k)  
        try:
            # Choosing the most common class label among the neighbors
            Predicted_Labels.append(mode(Labels))     
        except ValueError:
            # Handle the case where there is no majority class among the neighbors
            sys.exit("Please enter a different number of neighbors")

    return np.array(Predicted_Labels)

The FindNearLabes function calculates the squared Euclidean distances from a given point X to all points in a provided dataset (taking the square root is unnecessary, since it doesn't change the ordering). It then returns the labels of the K nearest points (i.e., the points with the smallest distances).

The Knn function applies the KNN algorithm to a test dataset. It iterates over each data point in the test dataset, finds the labels of the K nearest neighbors using the FindNearLabes function, and assigns the most common label among the neighbors to the data point. If there is no unique majority class among the neighbors, the function exits with an error message asking for a different number of neighbors.
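As a quick sanity check, the two functions can be tried on a tiny hand-made dataset (the points and labels below are invented purely for illustration):

# Toy example: two well-separated clusters with labels 0 and 1
toy_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
toy_labels = np.array([0, 0, 1, 1])
toy_test = np.array([[0.2, 0.1], [5.1, 5.1]])

print(Knn(toy_test, toy_train, toy_labels, k=3))  # Expected output: [0 1]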


Generating and Visualizing Dataset

Before we can apply the KNN algorithm, we need to have a dataset. For this demonstration, we generate a synthetic dataset using the make_moons function from the sklearn library. This function generates a simple dataset for binary classification in which the points are shaped as two interleaving half circles (or moons). We can control the number of data points and the amount of noise in the data.

# Choosing parameters
n_points = 3500
noise = 0.2 # Standard deviation of Gaussian noise added to the data
color_map = ListedColormap(['mediumseagreen','mediumblue']) # Color map for the two classes

# Creating dataset
X,y = datasets.make_moons(n_samples=n_points, noise=noise, random_state=0) 

After generating the dataset, we use the matplotlib library to create a scatter plot of the data points. The two classes are represented by different colors.

# Plotting generated dataset
plt.rcParams['font.size'] = '12'
fig, ax = plt.subplots(figsize=(8,6), facecolor='#F5F5F5')
ax.set_facecolor('#F5F5F5')
# Creating a scatter plot
scatter = ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap=color_map, alpha = 0.4)
# Adding a legend to the plot
ax.legend(handles=scatter.legend_elements()[0], labels=['0', '1'], title = 'Classes')
ax.set_title("Generated Dataset", fontsize=15)    


Splitting the Dataset for Train and Test

After generating the dataset, the next step is to split it into a train set and a test set. The train set is used to train the KNN classifier, while the test set is used to evaluate the classifier's performance on new data. A common practice is to allocate 75% of the dataset to the train set and 25% to the test set, which is what we do in this case.

# Splitting the dataset into a train set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Creating a figure with two subplots
fig, ax = plt.subplots(1,2, figsize=(14,5), facecolor='#F5F5F5')
# Plotting the train set
ax[0].set_facecolor('#F5F5F5')
scatter = ax[0].scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=50, cmap=color_map, alpha = 0.4)
ax[0].legend(handles=scatter.legend_elements()[0], labels=['0', '1'], title = 'Classes')
ax[0].set_title("Train Dataset", fontsize=14) 
# Plotting the test set
ax[1].set_facecolor('#F5F5F5')
scatter = ax[1].scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=50, cmap=color_map, alpha = 0.4)
ax[1].legend(handles=scatter.legend_elements()[0], labels=['0', '1'], title = 'Classes')
ax[1].set_title("Test Dataset", fontsize=14) 

We use the train_test_split function from the sklearn library to perform this split. This function shuffles the dataset and then splits it into train and test sets. Then, we create scatter plots to visualize both sets using the matplotlib library. These plots are shown below:


Applying K-NN Algorithm

Now that we have the train and test sets, we can apply the KNN classifier and evaluate its performance on the test set. 

The Knn function takes the test set, the training set, the training labels, and the number of neighbors k as inputs, and returns the predicted labels for the test set. The accuracy of the predictions is then calculated using the accuracy_score function from sklearn.

# Applying K-NN algorithm
y_pred = Knn(X_test, X_train, y_train, k = 5)
print("The accuracy is:", accuracy_score(y_test, y_pred))

We can also use the KNeighborsClassifier model from the sklearn library. For this, we first create a KNN object with n_neighbors set to 5. Then, we fit the classifier to the train set and finally predict the labels for the test set. The accuracy of these predictions is also calculated using the accuracy_score function.

# Applying K-NN algorithm using model from Sklearn
knn_sklearn = KNeighborsClassifier(n_neighbors=5)
knn_sklearn.fit(X_train, y_train)
y_pred = knn_sklearn.predict(X_test)
print("The accuracy is:", accuracy_score(y_test, y_pred))

For both alternatives, the accuracy of the models is 0.97. 

After making the predictions, we create a confusion matrix using the confusion_matrix function from sklearn. The confusion matrix is a table that is often used to describe the performance of a classification model on a test set for which the true values are known. Then, we visualize the confusion matrix using a heatmap from the seaborn library.

# Creating a confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Visualizing the confusion matrix using a heatmap
fig, ax = plt.subplots(figsize=(7,5), facecolor='#F5F5F5')
ax = sns.heatmap(cm, annot=True, fmt="d", cmap="YlOrBr", annot_kws={"size": 16})
ax.set_xlabel('Predicted labels', fontsize=14)
ax.set_ylabel('True labels', fontsize=14)
ax.set_xticklabels(['Class 0', 'Class 1'], fontsize=13)
ax.set_yticklabels(['Class 0', 'Class 1'], fontsize=13)
plt.show()

The confusion matrix shows that for class 0, 419 points are classified correctly and 14 points are classified incorrectly. For class 1, 432 points are classified correctly and 10 points are classified incorrectly. This indicates that the KNN classifier has a high accuracy. The confusion matrix is shown below:
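These counts are consistent with the accuracy reported above, which can be checked directly from the matrix (correct predictions divided by all predictions):

# Recomputing the accuracy from the confusion matrix
print((cm[0, 0] + cm[1, 1]) / cm.sum())  # (419 + 432) / 875 ≈ 0.97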


Bonus: Visualizing the Predicted Labels

After applying the KNN classifier and evaluating its performance, we plot the test set with the correct labels and the predicted labels. This provides a visual understanding of how well the classifier works.

We create two scatter plots: one for the test set with the correct labels, and one for the test set with the predicted labels.

# Creating a figure and a set of subplots
fig, ax = plt.subplots(1,2, figsize=(14,5), facecolor='#F5F5F5')

# Plotting the test set with the correct labels
ax[0].set_facecolor('#F5F5F5')
scatter = ax[0].scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=50, cmap=color_map, alpha = 0.4)
ax[0].legend(handles=scatter.legend_elements()[0], labels=['0', '1'], title = 'Classes')
ax[0].set_title("Correct Classes", fontsize=14)

# Creating a new color map for the predicted labels
color_map2 = ListedColormap(['mediumseagreen','mediumblue', 'firebrick'])
# Finding the indices of the points that were classified incorrectly
incorrect_indices = np.where(y_pred != y_test)[0]
# The labels of the points that were classified incorrectly are set to 2
y_plot = y_pred.copy()
y_plot[incorrect_indices] = 2

# Plotting the test set with the predicted labels
ax[1].set_facecolor('#F5F5F5')
scatter = ax[1].scatter(X_test[:, 0], X_test[:, 1], c=y_plot, s=50, cmap=color_map2, alpha = 0.4)
ax[1].legend(handles=scatter.legend_elements()[0], labels=['0', '1', 'Incorrect'], title = 'Classes')
ax[1].set_title("Predicted Classes", fontsize=14)

These graphs are as follows:

In the second plot, the red points are those that were incorrectly classified by the KNN model.
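As a quick consistency check, the number of red points should equal the sum of the off-diagonal entries of the confusion matrix (14 + 10 = 24):

# Counting the misclassified test points
print("Misclassified points:", len(incorrect_indices))  # Expected: 24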

In this post, we explored the K-Nearest Neighbors (KNN) algorithm and showed its effectiveness at classifying a synthetic dataset, achieving a high accuracy. The visualizations provided clear insights into the algorithm's performance, highlighting the few instances where misclassification occurred.

However, it's crucial to remember that while KNN is powerful, its success depends on the nature of the dataset and the appropriate choice of parameters. As we continue the journey in data science, we'll encounter diverse datasets and challenges that will require us to try different algorithms and techniques. The key is to understand the strengths and weaknesses of each method in order to apply them in the right cases.
