• K-Means

    Separating data into distinct clusters, organizing diverse information and simplifying complexity with vibrant clarity

  • Principal Components Analysis

    Making high dimensions comprehensible and actionable, it captures the maximum amount of variance in the data with a reduced number of features

  • Random Forest for Regression

    Combining decision trees, it provides predictive accuracy that illuminates the path to regression analysis

  • Support Vector Machines for Regression

    Leveraging mathematical precision, it excels in predicting values by carving precise pathways through data complexities

Monday, January 30, 2023

Principal Components Analysis Algorithm

Principal Component Analysis (PCA) is one of the most famous models used for dimensionality reduction. It uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables. It is a feature extraction technique, which means that instead of selecting a subset of the original features, it transforms the original features into a new set of features with reduced dimensionality. In Dimensionality Reduction and Feature Extraction you can read more about dimensionality reduction and feature extraction techniques.

Given a dataset consisting of a set of points (vectors) in a high-dimensional space, the main idea of PCA is to find the directions, also called principal components, along which the points vary the most, and then project the data onto these components to create a new, reduced dataset that still captures the most relevant information.

The PCA algorithm starts by choosing the number of principal components (the number of dimensions to keep). Then the covariance matrix of the dataset is calculated, along with its eigenvectors and eigenvalues. The eigenvectors are used to perform the transformation, and the eigenvalues to decide which eigenvectors are kept: the N eigenvectors with the highest corresponding eigenvalues (sorted in descending order) are used to build a transformation matrix. Finally, to obtain a low-dimensional dataset with new features, the original dataset is multiplied by the transformation matrix. The whole process is described in the next image, which provides an outline of the basic PCA algorithm.


Implementing the algorithm in Python

Depending on the dataset, the appropriate number of dimensions may vary. In this post we generate a 2-dimensional dataset and reduce it to a 1-dimensional one, but these parameters can be easily modified to apply the PCA algorithm to different datasets.


Importing libraries

First, we import the necessary libraries for the code:

# Importing libraries
import numpy as np
from numpy.linalg import eig
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA as PCA_sklearn
from sklearn.preprocessing import StandardScaler
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt

  • numpy: Provides mathematical functions for arrays. We also import the function eig to calculate eigenvalues and eigenvectors.
  • sklearn: Library that contains methods for data preprocessing and machine learning models. The functions imported in this post are StandardScaler, which standardizes features by removing the mean and scaling to unit variance; normalize, which scales vectors to unit norm; and PCA_sklearn, which provides a ready-made implementation of the PCA algorithm.
  • scipy: Provides algorithms for optimization, integration, statistics and many other classes of problems. Here we import multivariate_normal, which enables the generation of random samples from a multivariate normal distribution.
  • matplotlib: A comprehensive library for creating static, animated, and interactive visualizations.


Defining Functions

We define two functions: createDataset and PCA

# Create bivariate dataset
def createDataset(cov, mean, n_points, seed = 101):
  # Defining dataset
  distr = multivariate_normal(cov = cov, mean = mean, seed = seed)
  X = distr.rvs(size = n_points)
  # Scaling dataset
  sc = StandardScaler()
  X_scaled = sc.fit_transform(X)
  return X_scaled

# PCA algorithm
def PCA(X, n_components):
  # Calculating matrix of covariance
  cov_matrix = np.cov(X.T)
  # Eigenvectors and eigenvalues
  eigenvalues, eigenvectors = eig(cov_matrix)
  sort = np.argsort(eigenvalues)[::-1]
  eigvalues_sort = eigenvalues[sort]
  # Principal Components
  eigvectors_sort = normalize(eigenvectors.T[sort], axis = 0)
  # Matrix of transformation
  transformation_matrix = eigvectors_sort.T
  # Applying transformation to original dataset
  reduced_dataset = X @ transformation_matrix[:,0:n_components]
  return reduced_dataset, eigvectors_sort

The createDataset function generates a bivariate dataset from a multivariate normal distribution, taking the following parameters: the covariance matrix (determines the shape and orientation of the distribution), the mean vector (specifies the center of the distribution), the number of points to generate, and an optional seed for the random number generator. The function also standardizes the dataset with StandardScaler to ensure that it has zero mean and unit variance.

The PCA function takes the input dataset X and the number of components n_components as parameters. It follows the steps explained earlier to perform the Principal Component Analysis algorithm and returns the reduced dataset and the sorted eigenvectors (the principal components).


Generating Dataset

We generate a synthetic dataset X of 350 points using the createDataset function and set the number of components for the PCA algorithm to one.

# Choosing parameters
cov = np.array([[1, 0.9], [0.9, 1]])
mean = np.array([0,0])
n_points = 350
n_components = 1

# Creating dataset
X = createDataset(cov, mean, n_points)


Plotting the original dataset

matplotlib is used to create a scatter plot of the original dataset X:

# Plotting original dataset
plt.rcParams['font.size'] = '10'
f, axs = plt.subplots(1,1, figsize=(7,7), facecolor='#F5F5F5')
axs.set_facecolor('#F5F5F5')
axs.plot(X[:, 0], X[:, 1],'o', c='mediumseagreen',markeredgewidth = 0.5,markeredgecolor = 'black') 
axs.set_title("Original Dataset", fontsize="16")
plt.show()



Applying PCA algorithm

Here we apply the PCA algorithm to the dataset X using the PCA function defined above, retaining only one principal component (n_components = 1). The reduced dataset and the corresponding eigenvectors are stored in the variables reduced_dataset and components, respectively.

# Applying PCA algorithm
reduced_dataset, components = PCA(X, n_components)

An alternative is to use the PCA model from the sklearn library:

# Applying PCA algorithm from sklearn
pca_sklearn = PCA_sklearn(n_components=1)
pca_sklearn.fit(X)
reduced_dataset = pca_sklearn.transform(X)
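
As a quick sanity check, both implementations should produce the same projection up to a sign flip, since the sign of an eigenvector is arbitrary. Below is a minimal, optional sketch of this comparison; reduced_custom and reduced_sklearn are hypothetical variable names used to keep the two results separate, since the code above stores both in reduced_dataset.

# Optional check: custom PCA vs sklearn PCA (hypothetical variable names)
reduced_custom, _ = PCA(X, n_components)        # custom implementation defined above
reduced_sklearn = pca_sklearn.transform(X)      # sklearn implementation (already fitted)

# Eigenvector signs are arbitrary, so compare absolute values
print(np.allclose(np.abs(reduced_custom), np.abs(reduced_sklearn)))

# Fraction of the total variance captured by the retained component
print(pca_sklearn.explained_variance_ratio_)

For the strongly correlated covariance matrix used here ([[1, 0.9], [0.9, 1]]), a single principal component should capture roughly 95% of the total variance.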


Plotting the PCA result (reduced dataset)

We create a scatter plot to visualize the reduced dataset after applying the PCA algorithm. It plots the values from the reduced dataset on the x-axis; all y-values are 0 because the dataset now has only one dimension, which corresponds to the x-axis in this visualization.

# Plotting PCA result (reduced dataset with only one feature)
f, axs = plt.subplots(1,1, figsize=(9,5), facecolor='#F5F5F5')
axs.set_facecolor('#F5F5F5')
axs.scatter(reduced_dataset[:,0],len(reduced_dataset)*[0],s=200, zorder = -1, color="mediumseagreen", alpha = 0.3) 
axs.set_title("Reduced dataset with only one feature", fontsize="16")
plt.show()

The result is:



Bonus: Visualizing the Principal Components

We create a scatter plot to visualize the directions of the principal components. The points of the original dataset are plotted as semi-transparent green circles along with the two principal components drawn as vectors (arrows), using the components variable returned by the PCA function.

# Plotting principal components
origin = np.array([[0, 0],[0, 0]]) # origin point
components_scaled = components * np.array([[1.0], [0.5]]) # scaled principal components
f, axs = plt.subplots(1,1, figsize=(7,7), facecolor='#F5F5F5')
axs.set_facecolor('#F5F5F5')
axs.scatter(X[:, 0], X[:, 1],s=50, zorder = -1, color="mediumseagreen", alpha = 0.3) 
axs.quiver(*origin, components_scaled[:,0], components_scaled[:,1], color=['mediumblue','firebrick'], scale=3, headwidth = 4, width = 0.011)
axs.set_title("Principal Components Visualization", fontsize="16")
plt.show()


By following these steps, the code generates a bivariate dataset, applies PCA, and visualizes the original dataset, principal components, and the reduced dataset.


Friday, January 20, 2023

Feature Extraction

Feature extraction techniques explore the relationships and dependencies between features to create a new dataset of transformed features. Unlike feature selection methods that only select the best features from the existing ones, feature extraction models go further by generating new features that capture the essential information of the original data. This transformed dataset not only has a lower dimensionality but also retains the interesting and intrinsic characteristics of the original data.

The goal of feature extraction is to reduce the complexity of the model by transforming the original features into a new representation that captures the most relevant information. This can lead to improved efficiency, reduced generalization error, and decreased overfitting, as the model can focus on the most informative aspects of the data.

There are different perspectives on what properties or content from the original dataset should be preserved in the transformed dataset, resulting in a wide range of feature extraction techniques. Some of the most popular feature extraction algorithms include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and Non-Negative Matrix Factorization (NMF).


Principal Component Analysis

Principal Component Analysis (PCA) is a widely used feature extraction method that aims to transform the original features into a new set of uncorrelated features called principal components. The key idea behind PCA is to capture the maximum amount of variance in the data with a reduced number of features.

PCA achieves this by finding a set of orthogonal axes, called principal components, that represent the directions of maximum variance in the data. The first principal component captures the most significant variation, followed by the second principal component, and so on.

In the image above, the two principal components of a dataset are shown. We can easily identify that the amount of information along the blue axis is greater than along the red axis (these axes are called eigenvectors or principal components). To reduce the dimensionality of the dataset, we simply project the data onto the first principal components (the exact number of components is chosen by the user).

By selecting a subset of the most important principal components, PCA effectively reduces the dimensionality of the data while preserving the most significant information. PCA is particularly useful when dealing with highly correlated features or when the data is characterized by a large number of dimensions. It helps in visualizing and understanding the underlying structure of the data and can be used as a preprocessing step for various machine learning tasks.
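
As a brief, self-contained sketch (the dataset and the number of components are illustrative choices), scikit-learn's PCA also reports how much variance each retained component captures:

# Illustrative example: PCA as a feature extraction step with scikit-learn
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # 150 samples, 4 features
pca = PCA(n_components=2)                # keep the two directions of maximum variance
X_reduced = pca.fit_transform(X)         # project the data onto the principal components

print(X_reduced.shape)                   # (150, 2)
print(pca.explained_variance_ratio_)     # fraction of variance captured by each component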


Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a feature extraction technique commonly used in classification tasks. Unlike PCA, which focuses on maximizing variance, LDA aims to find a linear combination of features that maximizes the separability between different classes. This is achieved by mapping the data into a lower-dimensional space while maximizing the between-class distances and minimizing the within-class variances. It finds a set of discriminant functions that maximize the ratio of between-class scatter to within-class scatter.

By projecting the data onto the derived discriminant axes, LDA creates a new feature space where the classes are well-separated, making it easier for classification algorithms to discriminate between different classes. LDA is particularly effective when there is a clear class separation in the data and can improve classification performance by reducing the dimensionality of the feature space.
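
A minimal sketch of LDA as a feature extractor with scikit-learn (the dataset and the number of components are illustrative choices); note that, unlike PCA, LDA requires the class labels:

# Illustrative example: LDA as a supervised feature extractor
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)               # 150 samples, 4 features, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)             # uses the labels to maximize class separability

print(X_reduced.shape)                          # (150, 2)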


Non-Negative Matrix Factorization

Non-Negative Matrix Factorization (NMF) is a feature extraction method that decomposes the original data matrix into two low-rank matrices: one representing the basis or feature vectors and the other representing the coefficients. NMF assumes that the original data can be represented as a linear combination of non-negative basis vectors.

NMF is particularly useful for data with non-negative values, such as text data or image data. It can be applied to extract meaningful components or topics from text documents or to represent images in terms of interpretable parts.

NMF iteratively updates the basis vectors and coefficients until it converges to a representation that captures the most important features of the data. The resulting low-dimensional representation can be used for various tasks, including clustering, visualization, or as input for subsequent machine learning algorithms.
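
A minimal sketch with scikit-learn's NMF on a small non-negative matrix (the matrix and the number of components are arbitrary choices for illustration):

# Illustrative example: factorizing a non-negative matrix with NMF
import numpy as np
from sklearn.decomposition import NMF

X = np.random.rand(100, 20)              # 100 samples, 20 non-negative features
nmf = NMF(n_components=5, init='random', random_state=0, max_iter=500)
W = nmf.fit_transform(X)                 # coefficients: low-dimensional representation of the samples
H = nmf.components_                      # non-negative basis vectors

print(W.shape, H.shape)                  # (100, 5) (5, 20)
print(np.abs(X - W @ H).mean())          # average reconstruction error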



Thursday, January 12, 2023

Feature Selection

Feature selection, also known as variable or attribute selection, is the process of selecting a subset of the most relevant or useful features for use in model construction. Instead of creating new features, feature selection aims to identify the most discriminative features in the original dataset and discard the irrelevant or redundant ones.

Feature selection can be seen as the combination of a search technique for proposing new feature subsets with an evaluation measure that scores the different subsets. Feature selection techniques can be based on various criteria, such as statistical measures (e.g. correlation analysis or mutual information scores) or machine learning algorithms. By reducing the feature space, feature selection simplifies the data representation, enhances interpretability, and potentially improves the performance of machine learning models by removing noise or irrelevant information.

According to their relationship with learning methods, feature selection techniques can be classified into three different classes: filter, wrapper and embedded methods.


Filter Methods

Filter models are based on the association between each feature and the variable to predict (the class variable). These methods rank features using statistical measures and are independent of the machine learning model, because they operate on the features themselves without considering the model's internal mechanism. This means that filter methods can be applied with any algorithm or modeling technique.

One of the advantages of filter methods is their efficiency in terms of time consumption. They are computationally inexpensive and can handle large-scale datasets efficiently, making them an excellent choice for the data preprocessing step. They are faster compared to wrapper methods or embedded methods, which involve repeatedly training and evaluating the model.

Filter methods typically assign a score to each feature, reflecting its relevance or importance. These scores are calculated using statistical measures such as the Pearson correlation coefficient, the Chi-squared test and the mutual information score; a short example using these measures is sketched after the list below.

  • Pearson correlation coefficient: It represents the linear relationship between each feature and the class variable. The resulting values are in the range [-1, 1], indicating the strength and direction of the correlation, so variables with high absolute correlation coefficients (close to 1 or -1) are considered more relevant.


  • Mutual Information: It measures the amount of information shared between two variables. In the context of feature selection, it quantifies the amount of information that a feature provides about the class variable. Features with high mutual information scores are considered more informative and are likely to have a strong relationship with the class variable.
  • Chi-Square Test: This method is specifically used for categorical variables. It measures the dependence between each categorical feature and the class variable using the chi-square statistic. Higher chi-square values indicate a stronger association between the feature and the class variable.
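
A minimal sketch of a filter approach with scikit-learn's SelectKBest, using the chi-square test and the mutual information score as scoring functions (the dataset and the value of k are illustrative choices):

# Illustrative example: filter-based feature selection with SelectKBest
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)                         # 150 samples, 4 non-negative features

# Keep the 2 features with the highest chi-square score (requires non-negative features)
X_chi2 = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# Keep the 2 features with the highest mutual information score
X_mi = SelectKBest(score_func=mutual_info_classif, k=2).fit_transform(X, y)

print(X_chi2.shape, X_mi.shape)                           # (150, 2) (150, 2)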


Wrapper Methods

In simple terms, wrapper methods can be seen as "brute force" search methods, because they select subsets of features by evaluating the performance of a machine learning model: they try different subsets of features to train a specific model, and the subset with the best generalization performance is selected. Unlike filter methods, which are independent of the machine learning model, wrapper methods select features based on the specific model's performance.

Wrapper methods have several advantages, such as handling complex interactions between features and selecting features that are important for a specific model rather than for the dataset as a whole. However, they can be computationally expensive and may overfit the training data if the number of features is too large.

The subsets are usually selected from a large feature pool using search algorithms such as forward selection, backward elimination, exhaustive search, recursive feature elimination, etc. These algorithms explore different combinations of features and evaluate their impact on the model's performance; a short forward-selection sketch follows the list below.

  • Forward Selection: This algorithm starts with an empty set of features and iteratively adds one feature at a time. At each step, it evaluates the performance of the model using a specific criterion (e.g. accuracy) and selects the feature that improves the performance the most. 
  • Backward Elimination: It begins with a set of all available features and removes one feature at a time in each iteration. The algorithm evaluates the model's performance after removing each feature and eliminates the one that has the least impact on the model's performance.
  • Exhaustive Search: It evaluates all possible feature subsets, systematically tries every combination of features and measures the model's performance with each subset. This method guarantees finding the optimal subset of features in terms of performance but is computationally expensive, especially for large feature spaces.
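
A minimal sketch of forward selection with scikit-learn's SequentialFeatureSelector wrapped around a logistic regression model (the estimator, dataset and number of features to keep are illustrative choices):

# Illustrative example: wrapper-based forward selection
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start from an empty set and greedily add the feature that most improves cross-validated accuracy
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction='forward',          # use direction='backward' for backward elimination
    cv=5)
selector.fit(X, y)

print(selector.get_support())     # boolean mask of the selected features
X_selected = selector.transform(X)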

Embedded Methods

Embedded models combine the advantages of both filter and wrapper methods. In these, the features are selected during the training process of a specific machine learning model, and the feature selection ends automatically when the training process concludes.

Embedded methods may initially appear similar to wrapper methods, because both select features based on the learning procedure of a machine learning model. However, the biggest difference lies in how they perform feature selection in relation to the training process: wrapper methods select features iteratively based on an evaluation metric, while embedded methods perform feature selection and train the machine learning algorithm in parallel.

Embedded methods typically involve adding a penalty term to the loss function during model training, which encourages the model to select only the most important features. This also means that embedded methods are more continuous and thus don't suffer as much from high variability compared with filter or wrapper models. Some examples of algorithms commonly used as embedded methods are listed below, followed by a short LASSO-based sketch:

  • LASSO: LASSO stands for Least Absolute Shrinkage and Selection Operator. It is a linear regression method that adds a regularization term to the loss function. It applies an L1 penalty, which results in some coefficients being exactly zero. If a coefficient is zero, the corresponding feature is discarded, while the features corresponding to non-zero coefficients are selected as important features. LASSO is widely used for feature selection and can handle both continuous and categorical variables.
  • Ridge Regression: A linear regression method similar to LASSO, but ridge regression applies an L2 regularization penalty instead of L1, which shrinks the coefficients towards zero without eliminating them entirely. Ridge regression helps control multicollinearity and reduces the impact of less important features, promoting more stable and reliable feature selection, but it is not preferable when the data contains a huge number of features of which only a few are important.
  • Elastic Net: It combines the L1 (LASSO) and L2 (Ridge Regression) regularization penalties to overcome their individual limitations. It aims to strike a balance between feature selection and coefficient shrinkage, which is particularly useful when dealing with highly correlated features.
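
A minimal sketch of an embedded approach, using a LASSO model inside scikit-learn's SelectFromModel so that features whose coefficients are driven to exactly zero are discarded (the dataset and the regularization strength alpha are illustrative choices):

# Illustrative example: embedded feature selection with LASSO
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)      # 442 samples, 10 features

# The L1 penalty shrinks some coefficients to exactly zero; those features are dropped
lasso = Lasso(alpha=1.0)
selector = SelectFromModel(lasso).fit(X, y)

print(selector.estimator_.coef_)           # fitted coefficients; zeros mark discarded features
print(selector.get_support())              # boolean mask of the features kept
X_selected = selector.transform(X)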

 

