K-Means Algorithm ~ Data Science For Lifelong Learning

It is one of the simplest and most efficient clustering algorithms proposed in the literature of data clustering. It corresponds to a flat (or partitional) algorithm, which means that the dataset is classified into multiple groups in one shot based on the characteristics and similarity of the data, in Distance-Based Algorithms for Clustering you can read more about the types of clustering algorithms.

The k-Means algorithm starts by choosing k representative points as initial centroids, these centroids are not drawn from the original dataset. Each point is then assigned to the closest centroid based on a specific distance function. When the groups are formed, the centroids of each cluster are updated, replacing them with the mean values of each cluster. The whole process is described in the next image, which provides an outline of the basic k-Means algorithm.

Implementing algorithm in Python

According to each dataset, the selection process for the number of clusters K may be different. In this post the number k will be taken as arbitrary, but in a future post I'll talk about how to correctly select the number K.

Importing Libraries and Generating Data

First, we import the necessary libraries for the code and generate a synthetic dataset using the make_blobs function, this is a convenient tool for creating a dataset with a specified number of samples and clusters.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Choosing parameters
Num_points = 10000
k = 5
np.random.seed(0)

# Creating dataset
X, y_true = make_blobs(n_samples = Num_points, centers = k, cluster_std=0.8, random_state = 3)

The numpy and matplotlib libraries are used for numerical computations and plotting, respectively. In this case, we generate a dataset with 10000 points and 5 clusters. The X variable is an array that contains the points of the dataset. The y_true variable is another array that contains the cluster to which each point belongs

Defining Helper Functions

Here, we define two helper functions: getCluster and K_means

# Assigning one point to closest centroid
def getCluster(point,Centroids):
    Distances = []
    for j in range(len(Centroids)):
        Dif = point - Centroids[j]
        Distances.append(sum(Dif**2))       
    Cluster = Distances.index(min(Distances))     
    return Cluster
  
# Kmeans algorithm
def K_means(X, k):
  # Initializing the k centroids by random
  Centroids = np.array(X[np.random.choice(len(X), k, replace=False)])
  # Repeating K-means algorithm 30 times
  for m in range(30):
    # Assigning each point to the closest centroid
    Clusters = np.array([getCluster(point,Centroids) for point in X])
    # Updating centroids
    for i in range(len(Centroids)):
      Centroids[i] = np.mean(X[Clusters == i], axis = 0)
  return Centroids, Clusters

The getCluster function takes a data point from the dataset and the array of centroids as input. It calculates the Euclidean distance (omitting the square root because it's unnecessary in this case) between the data point and each centroid and returns the number (index) of the closest centroid.

The K_means function performs the K-means algorithm on the given dataset X with k clusters. The algorithm consists of the following steps:

Initialization: The k centroids are randomly chosen from the data points.

Iteration: The algorithm repeats a fixed number of times (30 in this case).

Assignment: Each data point is assigned to the closest centroid using the getCluster function.

Update: The centroids are updated by calculating the mean of the data points assigned to each centroid.

Applying K-means Algorithm

Here, we apply the K-means algorithm to the generated dataset. We can use the the previously programmed function K_means

# Applying K-means algorithm
Centroids, Labels = K_means(X, k)

Another option is to import the KMeans model from the library sklearn

from sklearn.cluster import KMeans
# Applying K-means algorithm
kmeans = KMeans(n_clusters=k, random_state=0).fit(X)
Labels = kmeans.labels_
Centroids = kmeans.cluster_centers_

In both cases, the resulting variables are the centroids and labels (cluster assignments) arrays.

Plotting the Results

In order to have a visual interpretation of the algorithm, we plot the original dataset and the dataset after applying the K-means algorithm

# Defining figure and axes
f, axs = plt.subplots(1,2, figsize=(14,5), facecolor='#F5F5F5')

# Plotting original dataset
axs[0].set_facecolor('#F5F5F5')
axs[0].scatter(X[:, 0], X[:, 1], s=50) 
axs[0].set_title("Original Points")

# Plotting clustered dataset
axs[1].set_facecolor('#F5F5F5')
scatter = axs[1].scatter(X[:, 0], X[:, 1], c=Labels, s=50, cmap='spring' ) 
axs[1].legend(handles=scatter.legend_elements()[0], labels=range(k), title = 'Clusters')
axs[1].scatter(Centroids[:, 0], Centroids[:, 1], c='black', s=200, alpha=0.5);    
axs[1].set_title("Clustered Points")

We create a figure with two subplots using plt.subplots. The first subplot (axs[0]) shows the original dataset, where each point represents a data sample. The second subplot (axs[1]) displays the dataset after applying the K-means algorithm, this time we color the points based on their assigned cluster labels and the centroids are marked with black circles. These graphs are shown below.

Below is a gif that contains the entire clustering process, in other words, it contains the changes of the centroids for each iteration.

As can be seen, the algorithm performed excellent results and found the best centroids to cluster the dataset into 5 groups at iteration 18.

In conclusion, the K-means algorithm is a widely used and efficient clustering technique in data science. It offers simplicity, scalability, and fast convergence, making it suitable for various applications. Despite the challenges of selecting the optimal number of clusters and initializing centroids, K-means remains a valuable tool for uncovering patterns and gaining insights from datasets. Its ease of implementation and ability to handle large datasets make it a good choice for data scientists who need efficient clustering solutions.

Tuesday, December 20, 2022