Thursday, January 12, 2023

Feature Selection

Feature selection, also known as variable or attribute selection, is the process of choosing the subset of the most relevant or useful features for model construction. Instead of creating new features, feature selection aims to identify the most discriminative features in the original dataset and discard the irrelevant or redundant ones.

A feature selection method can be seen as the combination of a search technique that proposes new feature subsets and an evaluation measure that scores them. Feature selection techniques can be based on various criteria, such as statistical measures (e.g. correlation analysis or mutual information scores) or machine learning algorithms. By reducing the feature space, feature selection simplifies the data representation, enhances interpretability, and can improve the performance of machine learning models by removing noise and irrelevant information.

According to their relationship with the learning method, feature selection techniques can be classified into three classes: filter, wrapper and embedded methods.


Filter Methods

Filter methods are based on the association between each feature and the variable to predict (the class variable). They rank features using statistical measures and are independent of the machine learning model, because they operate on the features themselves without considering the model's internal mechanism. This means that filter methods can be combined with any algorithm or modeling technique.

One of the advantages of filter methods is their efficiency in terms of time consumption. They are computationally inexpensive and can handle large-scale datasets efficiently, making them an excellent choice for the data preprocessing step. They are faster compared to wrapper methods or embedded methods, which involve repeatedly training and evaluating the model.

Filter methods typically assign a score to each feature, reflecting its relevance or importance. These scores are calculated with statistical measures such as the Pearson correlation coefficient, the chi-square test and the mutual information score (a short sketch using these measures follows the list below).

  • Pearson correlation coefficient: It measures the linear relationship between each feature and the class variable. The resulting values lie in the range [-1, 1] and indicate the strength and direction of the correlation, so features with a high absolute correlation coefficient (close to 1 or -1) are considered more relevant.
  • Mutual Information: It measures the amount of information shared between two variables. In the context of feature selection, it quantifies the amount of information that a feature provides about the class variable. Features with high mutual information scores are considered more informative and are likely to have a strong relationship with the class variable.
  • Chi-Square Test: This method is specifically used for categorical variables. It measures the dependence between each categorical feature and the class variable using the chi-square statistic. Higher chi-square values indicate a stronger association between the feature and the class variable.
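
Below is a minimal sketch of these three filter scores, assuming scikit-learn is available; the breast cancer dataset and the choice of keeping five features are illustrative, not prescribed by the text.

```python
# Filter-method scoring sketch (illustrative dataset and k).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Pearson correlation of each feature with the class variable
pearson = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
top_by_corr = np.argsort(-np.abs(pearson))[:5]  # highest absolute correlation

# Mutual information between each feature and the class variable
mi_scores = mutual_info_classif(X, y, random_state=0)
top_by_mi = np.argsort(-mi_scores)[:5]

# Chi-square scores (requires non-negative features; this dataset satisfies that)
chi2_selector = SelectKBest(score_func=chi2, k=5).fit(X, y)

print("Top features by |Pearson r|:", top_by_corr)
print("Top features by mutual information:", top_by_mi)
print("Features selected by chi-square:", np.where(chi2_selector.get_support())[0])
```

Note that the three measures need not agree: each one captures a different notion of association between a feature and the class variable, which is why the choice of scoring function matters in filter methods.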


Wrapper Methods

In simple words, wrapper methods can be seen as "brute force" search methods: they select subsets of features by evaluating the performance of a machine learning model. Different subsets of features are used to train a specific model, and the subset with the best generalization performance is selected. Unlike filter methods, which are independent of the machine learning model, wrapper methods select features based on that specific model's performance.

Wrapper methods have several advantages, such as handling complex interactions between features and selecting features that are important for a specific model rather than for the dataset as a whole. However, they can be computationally expensive and may overfit the training data if the number of features is too large.

The subsets are usually selected from a large feature pool using search algorithms such as forward selection, backward elimination, exhaustive search, recursive feature elimination, etc. These algorithms explore different combinations of features and evaluate their impact on the model's performance.

  • Forward Selection: This algorithm starts with an empty set of features and iteratively adds one feature at a time. At each step, it evaluates the performance of the model using a specific criterion (e.g. accuracy) and selects the feature that improves performance the most (see the sketch after this list).
  • Backward Elimination: It begins with a set of all available features and removes one feature at a time in each iteration. The algorithm evaluates the model's performance after removing each feature and eliminates the one that has the least impact on the model's performance.
  • Exhaustive Search: It evaluates all possible feature subsets, systematically tries every combination of features and measures the model's performance with each subset. This method guarantees finding the optimal subset of features in terms of performance but is computationally expensive, especially for large feature spaces.
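
The following is a minimal sketch of wrapper-style forward selection, assuming a recent scikit-learn (which provides SequentialFeatureSelector); the logistic regression estimator, the 5-fold cross-validation and the target of five features are illustrative choices.

```python
# Wrapper-method sketch: greedy forward selection guided by cross-validated accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The wrapped model whose cross-validated score guides the search
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Start from an empty set and add one feature at a time until five are selected
sfs = SequentialFeatureSelector(model, n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("Selected feature indices:", sfs.get_support(indices=True))
```

Switching direction="forward" to direction="backward" turns the same sketch into backward elimination, which starts from the full feature set and removes the least useful feature at each step.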

Embedded Methods

Embedded methods combine the advantages of both filter and wrapper methods. Here, the features are selected during the training process of a specific machine learning model, and the feature selection ends automatically when training concludes.

Embedded methods may initially appear similar to wrapper methods, because both select features based on the learning procedure of a machine learning model. However, the key difference lies in how the feature selection relates to the training process: wrapper methods select features iteratively based on an evaluation metric, while embedded methods perform feature selection and train the machine learning algorithm at the same time.

Embedded methods typically add a penalty term to the loss function during model training, which encourages the model to keep only the most important features. This also makes embedded methods more continuous, so they suffer less from high variability than filter or wrapper methods. Some algorithms commonly used as embedded methods are:

  • LASSO: LASSO stands for Least Absolute Shrinkage and Selection Operator. It is a linear regression method that adds a regularization term to the loss function. It applies an L1 penalty, which drives some coefficients to exactly zero; if a coefficient is zero, the corresponding feature is discarded, while the features with non-zero coefficients are kept as important features. LASSO is widely used for feature selection and can handle both continuous and categorical variables (see the sketch after this list).
  • Ridge Regression: It is a linear regression method similar to LASSO, but ridge regression applies an L2 penalty instead of L1, which shrinks the coefficients towards zero without eliminating them entirely. Ridge regression helps control multicollinearity and reduces the impact of less important features, promoting more stable feature selection, but it is not preferable when the data contains a huge number of features of which only a few are important.
  • Elastic Net: It combines the L1 (LASSO) and L2 (ridge regression) regularization penalties to overcome their individual limitations. It aims to balance feature selection and coefficient shrinkage, which is particularly useful when dealing with highly correlated features.
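
Here is a minimal sketch of embedded selection with LASSO, assuming scikit-learn; the diabetes dataset and the cross-validated choice of the regularization strength are illustrative.

```python
# Embedded-method sketch: LASSO drops features by zeroing their coefficients.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)  # L1 penalties are scale-sensitive

# Fit LASSO with the regularization strength chosen by cross-validation;
# the L1 penalty pushes some coefficients exactly to zero during training
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

selected = np.where(lasso.coef_ != 0)[0]  # features kept by the trained model
dropped = np.where(lasso.coef_ == 0)[0]   # features discarded during training
print("Selected feature indices:", selected)
print("Discarded feature indices:", dropped)
```

The selection happens as a by-product of fitting the model: once training finishes, the zero coefficients tell you which features were discarded, which is exactly what distinguishes embedded methods from the filter and wrapper approaches above.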

 
