• K-Means

    Separating data into distinct clusters, organizing diverse information and simplifying complexity with vibrant clarity

  • Principal Components Analysis

    Making high dimensions comprehensible and actionable, it captures the maximum amount of variance in the data with a reduced number of features

  • Random Forest for Regression

    Combining decision trees, it provides predictive accuracy that illuminates the path to regression analysis

  • Support Vector Machines for Regression

    Leveraging mathematical precision, it excels in predicting values by carving precise pathways through data complexities

Wednesday, December 28, 2022

Dimensionality Reduction

Dimensionality reduction is an important task in machine learning which facilities analysis, compression and visualization of high dimensional data. The problem is that most of the times the machine learning algorithms (e.g. methos for clustering, classification, regression, etc.) don't need to use all the features from the dataset to get great results; in fact, some cases they work better with a reduced dataset. It has a long history as a method for data visualization and for extracting low dimensional features.

The primary objective of dimensionality reduction, as the name implies, is to reduce the number of features in the data while retaining as much relevant information as possible. This type of technique is useful in various scenarios:

  • Data and Model Interpretation: Dimensionality reduction methods can help simplify the data and model for users and stakeholders. By reducing the quantity of variables, we can focus on the key factors that influence the outcomes. This makes it easier to understand and interpret the model's behavior, as well as communicate the results to non-technical audiences.

  • Decrease Training Time: Dimensionality reduction can contribute to reducing the training time of machine learning models. When working with high-dimensional datasets, the training process can become computationally expensive and time-consuming. By removing irrelevant or redundant information and features, we reduce the complexity of the data and the model, which improves the speed of the training process. This is particularly beneficial when working real-time applications, where efficiency is crucial.
  • Variance Reduction: Dimensionality reduction techniques can reinforce the generalization of machine learning models by reducing the variance. High-dimensional datasets often contain noise and irrelevant features that can lead to overfitting, this occurs when the model memorizes the data instead of learning patterns. By focusing on the most informative and discriminative features, we improve the model's ability to generalize well to unseen data. In this way, the machine learning model can perform more robust and reliable predictions.
  • Data Compression: Dimensionality reduction methods can be applied to compress the data in situations where storage or computational resources are limited. By reducing the number of features, we also reduce the memory footprint required to store the data and the computational resources needed to process it. This compression is beneficial in scenarios such as big data analysis, where efficient storage and processing are critical for scalability and performance.


There are two different techniques in dimensionality reduction: feature selection and feature extraction

  • Feature Selection: It focuses on selecting a subset of the original features that are most relevant and informative for the given task. Instead of creating new features, feature selection aims to identify the most discriminative features from the original dataset. These methods can be based on various criteria, such as statistical measures like correlation analysis and mutual information scores, or machine learning algorithms.
  • Feature extraction: On the other hand, feature extraction involves transforming the original features into a new set of features with reduced dimensionality. Rather than selecting a subset from the original features, feature extraction techniques create new features that capture the most important information from the original data.

It's important to note that not all dimensionality reduction methods works equal for all the problems, the choice of technique depends on the specific characteristics of the dataset and the objectives of the analysis. Experimentation and evaluation of different dimensionality reduction methods are necessary to find the most suitable approach for a given task.


Share:

Tuesday, December 20, 2022

K-Means Algorithm

It is one of the simplest and most efficient clustering algorithms proposed in the literature of data clustering. It corresponds to a flat (or partitional) algorithm, which means that the dataset is classified  into multiple groups in one shot based on the characteristics and similarity of the data, in Distance-Based Algorithms for Clustering you can read more about the types of clustering algorithms.  

The k-Means algorithm starts by choosing k representative points as initial centroids, these centroids are not drawn from the original dataset. Each point is then assigned to the closest centroid based on a specific distance function. When the groups are formed, the centroids of each cluster are updated, replacing them with the mean values of each cluster. The whole process is described in the next image, which provides an outline of the basic k-Means algorithm.


Implementing algorithm in Python

According to each dataset, the selection process for the number of clusters K may be different. In this post the number k will be taken as arbitrary, but in a future post I'll talk about how to correctly select the number K.

Importing Libraries and Generating Data

First, we import the necessary libraries for the code and generate a synthetic dataset using the make_blobs function, this is a convenient tool for creating a dataset with a specified number of samples and clusters.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Choosing parameters
Num_points = 10000
k = 5
np.random.seed(0)

# Creating dataset
X, y_true = make_blobs(n_samples = Num_points, centers = k, cluster_std=0.8, random_state = 3)

The numpy and matplotlib libraries are used for numerical computations and plotting, respectively. In this case, we generate a dataset with 10000 points and 5 clusters. The X variable is an array that contains the points of the dataset. The y_true variable is another array that contains the cluster to which each point belongs

Defining Helper Functions

Here, we define two helper functions: getCluster and K_means

# Assigning one point to closest centroid
def getCluster(point,Centroids):
    Distances = []
    for j in range(len(Centroids)):
        Dif = point - Centroids[j]
        Distances.append(sum(Dif**2))       
    Cluster = Distances.index(min(Distances))     
    return Cluster
  
# Kmeans algorithm
def K_means(X, k):
  # Initializing the k centroids by random
  Centroids = np.array(X[np.random.choice(len(X), k, replace=False)])
  # Repeating K-means algorithm 30 times
  for m in range(30):
    # Assigning each point to the closest centroid
    Clusters = np.array([getCluster(point,Centroids) for point in X])
    # Updating centroids
    for i in range(len(Centroids)):
      Centroids[i] = np.mean(X[Clusters == i], axis = 0)
  return Centroids, Clusters

The getCluster function takes a data point from the dataset and the array of centroids as input. It calculates the Euclidean distance (omitting the square root because it's unnecessary in this case) between the data point and each centroid and returns the number (index) of the closest centroid.

The K_means function performs the K-means algorithm on the given dataset X with k clusters. The algorithm consists of the following steps:

  • Initialization: The k centroids are randomly chosen from the data points.
  • Iteration: The algorithm repeats a fixed number of times (30 in this case).
  • Assignment: Each data point is assigned to the closest centroid using the getCluster function.
  • Update: The centroids are updated by calculating the mean of the data points assigned to each centroid.

Applying K-means Algorithm

Here, we apply the K-means algorithm to the generated dataset. We can use the the previously programmed function K_means

# Applying K-means algorithm
Centroids, Labels = K_means(X, k)

Another option is to import the KMeans model from the library sklearn

from sklearn.cluster import KMeans
# Applying K-means algorithm
kmeans = KMeans(n_clusters=k, random_state=0).fit(X)
Labels = kmeans.labels_
Centroids = kmeans.cluster_centers_

In both cases, the resulting variables are the centroids and labels (cluster assignments) arrays.

Plotting the Results

In order to have a visual interpretation of the algorithm, we plot the original dataset and the dataset after applying the K-means algorithm

# Defining figure and axes
f, axs = plt.subplots(1,2, figsize=(14,5), facecolor='#F5F5F5')

# Plotting original dataset
axs[0].set_facecolor('#F5F5F5')
axs[0].scatter(X[:, 0], X[:, 1], s=50) 
axs[0].set_title("Original Points")

# Plotting clustered dataset
axs[1].set_facecolor('#F5F5F5')
scatter = axs[1].scatter(X[:, 0], X[:, 1], c=Labels, s=50, cmap='spring' ) 
axs[1].legend(handles=scatter.legend_elements()[0], labels=range(k), title = 'Clusters')
axs[1].scatter(Centroids[:, 0], Centroids[:, 1], c='black', s=200, alpha=0.5);    
axs[1].set_title("Clustered Points")

We create a figure with two subplots using plt.subplots. The first subplot (axs[0]) shows the original dataset, where each point represents a data sample. The second subplot (axs[1]) displays the dataset after applying the K-means algorithm, this time we color the points based on their assigned cluster labels and the centroids are marked with black circles. These graphs are shown below.

Below is a gif that contains the entire clustering process, in other words, it contains the changes of the centroids for each iteration.

As can be seen, the algorithm performed excellent results and found the best centroids to cluster the dataset into 5 groups at iteration 18.

In conclusion, the K-means algorithm is a widely used and efficient clustering technique in data science. It offers simplicity, scalability, and fast convergence, making it suitable for various applications. Despite the challenges of selecting the optimal number of clusters and initializing centroids, K-means remains a valuable tool for uncovering patterns and gaining insights from datasets. Its ease of implementation and ability to handle large datasets make it a good choice for data scientists who need efficient clustering solutions.

Share:

Tuesday, December 6, 2022

Distance-Based Algorithms for Clustering

The differences of clustering algorithms, as explained on the Data Clustering post, lies in how "similarity" is defined. In probabilistic models, the main idea is to model the data from a generative process, where a specific distribution is assumed (e.g. Gaussian), then parameters of the model are estimated using Expectation Maximum algorithm (EM). One time the parameters are found, the generative probabilities (fit probabilities) of the underlying data points are calculated. 

Many times, different kind of generative models can be reduced to distance-based algorithms, because the mixture components often use a distance function within the probability distribution. For example, the Gaussian distribution represents probabilities in terms of the Euclidean distance from the mean of the mixture. Distance-based methods are more desirable than probabilistic models, because of the simplicity and easy implementation, these can be generally divided in two types, flat and hierarchical algorithms.

Flat methods: The data is divided into different clusters in one shot using partitioning representatives, optimizing specific objective functions (generally distance functions) by iteratively improving the quality of partitions. In each iteration, the data points are assigned to their closest partitioning representatives, an then the representative is adjusted according the data points assigned to the cluster, in this way the quality of partitions improves with each iteration.

  • k-Means: These algorithm employs the Euclidean Distance to calculate the distance between the data points and the partitioning representative of each cluster, which correspond to the mean value of each cluster. This method is considered one of the simplest and most classical models for data clustering and is widely used because of its simplicity.
  • k-Medians: The operation of this algorithm is equivalent to k-Means but the partitioning representative is the median along each dimension. This provides several advantages, such as stability in presence of outliers and certain types of noise, because the median is less sensitive than the mean to extreme values in the data.
  • k-Medoids: In this algorithm, contrary to k-Means and k-Medians, the partitioning representative is sampled from the original data. Such techniques are useful in cases where de data points are arbitrary objects, for example, when a set of discrete sequence objects need to be clustered, since it may not make sense to talk about mean or median.

Hierarchical methods: The clusters are represented hierarchically through a dendrogram (binary tree-based data structure) at varying levels of granularity. Hierarchical algorithms were developed to overcome some disadvantages of flat models, because these generally require a predefined parameter K (number of clusters) to obtain a solution and they are often nondeterministic. Hierarchical methods can be categorized into agglomerative and divisive algorithms.

  • Agglomerative: These methos start by taking singleton clusters (clusters with only one data object) at the bottom level, then continue by merging two clusters at time to build a bottom-up hierarchy. There are a variety of options about how these clusters are merged, which provide different levels of quality and efficiency, for example, single-linkage, all-pairs linkage, sampled-linkage and centroid-linkage clustering.
  • Divisive: These works as opposite than agglomerative, start with all the data objects in a huge cluster and split it continuously in two groups producing a top-down hierarchy. Divisive algorithms generally are more efficient compared to agglomerative clustering, specially where there is no need to generate a complete hierarchy all the way down to individual leaves, in other words, it's not necessary to have a perfectly balanced tree in terms of node depths.


Share:

About Me

My photo
I am a Physics Engineer graduated with academic excellence as the first in my generation. I have experience programming in several languages, like C++, Matlab and especially Python, using the last two I have worked on projects in the area of Image and signal processing, as well as machine learning and data analysis projects.

Recent Post

Particle Swarm Optimization

The Concept of "Optimization" Optimization is a fundamental aspect of many scientific and engineering disciplines. It involves fi...

Pages