• K-Means

    Separating data into distinct clusters, organizing diverse information and simplifying complexity with vibrant clarity

  • Random Forest for Regression

    Combining decision trees, it provides predictive accuracy that illuminates the path to regression analysis

  • Support Vector Machines for Regression

    Leveraging mathematical precision, it excels in predicting values by carving precise pathways through data complexities

Wednesday, June 28, 2023

Random Forest Regression

Random Forest is a powerful machine learning algorithm that has gained significant popularity in both academia and industry due to its excellent performance, robustness, and easy use. It's an ensemble learning method, which means that it combines the predictions of several base estimators with a given learning algorithm to improve generalizability and robustness. In the case of Random Forest, the base estimators are decision trees.

The concept behind Random Forest is simple and effective. It operates by building a multitude of decision trees at training time, and taking as output the mean prediction (for regression) or majority class (for classification) of the individual trees. The model creates is a set of decision trees, usually trained with the "bagging" method, the main idea is to inject randomness into the tree building to ensure each tree is different. Consequently, although some trees may be wrong, many others will be right, so as a group, the trees are able to move in the correct direction.

The strength of Random Forest lies in its ability to overcome the overfitting problem, which is common in decision trees. While a single decision tree can create overly complex models that don't generalize well for new data, Random Forest mitigates this by employing multiple decision trees to build a more robust and generalizable model. It maintains a high level of accuracy and provides a measure of "interpretability" via feature importance estimation, which is not available in many other machine learning algorithms.



Learning Method

Random Forest is a perfect example of ensemble learning, a powerful concept in machine learning where multiple models, known as base learners, are trained to solve the same problem and combined to get better results. The main principle behind ensemble learning is that a group of weak learners can come together to form a strong learner, the models in the ensemble may not be strong predictors individually, but together they can provide a more accurate and stable prediction.

In the case of Random Forest, the base learners are decision trees. The algorithm introduces randomness into the model building process, which results in an ensemble of different models. Two key concepts in ensemble learning applied to Random Forest are Bagging and Feature Randomness.

Bagging: Also known as bootstrap aggregating, it's used to create different subsets of the original data by sampling with replacement. For each subset, a decision tree is built. This means that each tree in the forest is trained on a different set of data. This process decorrelates the trees, which means the errors of individual trees become less correlated with each other. When the predictions of these trees are averaged (in the case of regression), the variance of the final prediction is reduced.

Feature Randomness: In addition to bagging, Random Forest also uses a method called feature randomness. In traditional decision trees, when it is time to split a node, we consider every possible feature and choose the one that produces the most significant improvement in the objective function. Random Forest changes this by selecting a random subset of the features at each split point, which adds an extra layer of randomness to the model building process. This further decorrelates the trees and ultimately results in a more robust model.

Another type of ensemble learning is the Boosting method, which is used by some of the most popular ensemble algorithms such as AdaBoost, Gradient Boosting, XGBoost and LightGBM

Boosting: It consists of a sequential process, where each model is trained to correct the mistakes made by the previous model. In the context of regression, these mistakes are the residuals or differences between the predicted and actual values. The final prediction is made by a weighted sum of the individual predictions of all models, where models that perform better have more weight. Boosting is effective at reducing both bias and variance, making it a powerful method for improving the accuracy of regression models.


Random Forest Hyperparameters

When building a Random Forest model for regression, there are several parameters that can be tuned to optimize the performance of the model. Two of the most important hyperparameters are Number of Trees and Max Depth.

Number of Trees: It  determines the number of estimators in the forest. Each tree is built independently and contributes equally to the final prediction, which is the mean of the predictions of all trees. Increasing the number of trees can make the model more robust and reduce variance, but beyond a certain point, the benefits in prediction performance may be outweighed by the computational cost. There is no definitive rule for choosing the optimal number of trees, as it depends on the specific problem and the trade-off between computational efficiency and model performance.

Max Depth: It determines the maximum depth of each tree. The depth of a tree is the maximum distance between the root node and any leaf node. A tree of depth k can model interactions involving up to k features. If the max depth is too high, the model may overfit the training data and not work well with unseen data. If it's too low, the model may be too simple to capture complex patterns in the data, leading to underfitting.

A visual example of the hyperparameters tuning process is shown in the next video:

Here I built a Random Forest Regression varying the parameters Max Depth and Number of Tree, showing the regression result, as well as the individual Decision Trees built for each case. We can appreciate how the shape of the Random Forest Regression changes. When the model is underfitting, the regression result may appear overly simplified and unable to capture the complexity of the data. As we increase the max depth and number of trees, the model becomes more complex and better able to fit the data. However, if we go too far, the model may start to overfit, capturing noise and outliers in the data.


Python Implementation


Importing Libraries

The first step is importing the necessary libraries. For data manipulation we use pandas and numpy, for data visualization we use seaborn and matplotlib, and several modules from sklearn for preprocessing, model building, and score calculation.

# Importing libraries
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


Data Loading

The next step is to load a dataset and split it into a training set and a test set. We're using the load_diabetes dataset from sklearn, which is a commonly used dataset for regression tasks. It consists of 442 patients with ten baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements. The target variable is a quantitative measure of disease progression one year after baseline. It's important to notice that when we loaded the dataset, the features have been mean centered and scaled by the standard deviation by default.

# Loading the dataset
diabetes = load_diabetes(as_frame = True)
data = diabetes['frame']

# Displaying the first few rows of the dataset
data.head()

We're displaying the first few rows of the dataset using the head method. This gives us a quick overview of the data we're working with:


We separate the features (X) and the target variable (y) from the dataset. The features are all the columns except the last one, which is the target variable. We also split these into a training set and a test set using the train_test_split function, taking 80% of the data for training and 20% for testing.

# Separate the features and the target
X = data.iloc[:,0:10].to_numpy()
y = data["target"].to_numpy()

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)


Random Forest Training and Evaluation

Now that we've loaded and split the diabetes dataset, we can move on to apply the model. We create a Random Forest Regression model with 100 estimators (trees), train it on the training set and make predictions on the testing set.

# Creating the Random Forest Regression instance
RF = RandomForestRegressor(n_estimators=100, random_state=0)
# Training the model on the training data
RF.fit(X_train, y_train)
# Making predictions on the test set
y_pred = RF.predict(X_test)

Then we calculate the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to evaluate the model's performance.

# Computing the Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test, y_pred)
# Computing the Root Mean Squared Error (RMSE)
RMSE = mean_squared_error(y_test, y_pred)**0.5

To better understand the performance of the model, we can visualize the error metrics using a bar chart. This gives us a clear, visual comparison of the two metrics.

# Creating the figure
fig, ax = plt.subplots(figsize=(9,6), facecolor='#F5F5F5')
plt.rcParams['font.size'] = '14'
ax.set_facecolor('#F5F5F5')

# Creating the bar for MAE
bar1 = ax.bar(0.25, MAE, color ='firebrick', width = 0.23)[0]
# Adding the value of the MAE on top of the bar
yval1 = bar1.get_height()
ax.text(bar1.get_x() + bar1.get_width()/2, yval1, round(yval1, 2), ha='center', va='bottom', fontsize = 16)

# Creating the bar for RMSE
bar2 = ax.bar(0.75, RMSE, color ='seagreen', width = 0.23)[0]
# Adding the value of the RMSE on top of the bar
yval2 = bar2.get_height()
ax.text(bar2.get_x() + bar2.get_width()/2, yval2, round(yval2, 2), ha='center', va='bottom', fontsize = 16)

# Adding ticks and title
ax.set_xticks([0.25,0.75], ['MAE', 'RMSE'], fontsize = 16)
ax.set_title('Metrics for Random Forest', fontsize = 18)
ax.set_ylim([0, 70])
ax.set_xlim([0, 1])

# Display the plot
plt.show()

The results are the following:


Conclusions

Through this exploration of the Random Forest algorithm for regression, we've gained valuable insights into this powerful machine learning technique. We've learned how it leverages the concept of ensemble learning, specifically bagging, to build robust and accurate models. We've also understood the importance of key parameters like the number of trees and maximum depth, and how they influence the model's performance. Through our practical implementation in Python, we've seen how to apply these concepts in practice. We've learned how to train a Random Forest model, make predictions, and evaluate the model's performance using metrics like Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

By reading this post, you've not only gained a deeper understanding of Random Forest for regression, but also acquired practical skills that you can apply to your own machine learning projects. Remember, the key to successful machine learning is understanding the data, the algorithm, and the problem at hand.


Share:

Saturday, June 10, 2023

Mathematical Formulation SVR

Support Vector Machines (SVMs), first introduced by Vapnik in 1995, represent one of the most impactful advancements in the realm of machine learning. Since their inception, SVMs have proven highly versatile, with successful implementations across a myriad of domains including text categorization, image analysis, speech recognition, time series forecasting, and even information security.

The core principle of SVMs is rooted in the concepts of statistical learning theory (SLT) and structural risk minimization (SRM). This foundation diverges from other methods that are grounded in empirical risk minimization (ERM). While these methods are focused on minimizing the error on the training dataset, SVMs, on the other hand, aims to minimize an upper bound of the generalization error. This bound is a composite of both the training error and a measure of confidence, resulting in more robust and reliable models.

Initially, SVMs were primarily developed to handle classification problems. However, the power and potential of SVMs extended beyond mere classification. This led to the introduction of Support Vector Machines for Regression (SVR), a significant variation on the model originally proposed by Vapnik. SVRs holds true to the core principles of SVMs while adopting a new approach to apply them to regression problems. 


Mathematical Formulation

Suppose we have data in the form $x = [x_1, x_2,  \ldots , x_n]$ and $y = [y_1, y_2,  \ldots , y_n]$ where $x_i \in \mathbb{R}^d$ are the input variables (features) and $y_i \in \mathbb{R}$ the output variables. The regression function of SVR is defined as:

\begin{equation} f(x) = w  \cdot x + b\end{equation}

Where $w$ is a weight vector and $b$ is a bias term, notice that this equation is the hyperplane equation in the formulation of SVMs. A graphic example of the regression function (in two dimensions) is: 


We can map the input vector $x$ to a high dimensional space with a function $\phi(x)$, introducing this function in Eq. (1) we obtain: 

\begin{equation} f(x) = w  \cdot \phi(x) + b\end{equation}

This form of the regression function allows us to generate non-linear regressions, like:



As mentioned earlier, SVR minimizes bounds of the generalization error. This error term is handled in the constraints, where we set the absolute error less than or equal to a specified margin, called the maximum error, $\varepsilon$. We can tune $\varepsilon$ to gain the desired accuracy of our model. For this case, the objective function and constraints are:

\begin{equation} \begin{aligned} \min_{w} \quad & \frac{1}{2}\|w\|^2 \\ \textrm{s.t.} \quad & |y_i - (w \cdot x_i +b)| \leq \varepsilon \\ \end{aligned}\end{equation}

The first part is read as: Minimize the objective function $\frac{1}{2}\|w\|^2$ based on the variable $w$ subject to the constrain $|y_i - (w \cdot x_i +b) | \leq \varepsilon$

An illustrative example of this is:


Depends the value of $\varepsilon$, this algorithm may doesn’t work for all data points. The algorithm solved the objective function as best as possible but some of the points can fall outside the margins. We need to account for the possibility of errors that are larger than $\varepsilon$. To handle this, we introduce the slack variables.

The slack variable $\xi$ is defined as: for any value (point) that falls outside of $\varepsilon$, we can denote its deviation from the margin as $\xi$



We aim to  to minimize these deviations $\xi$ as much as possible. Thus, we can add these deviations to the objective function, along with an additional hyperparameter, the penalty factor, $C$.

\begin{equation} \begin{aligned} \min_{w,\xi} \quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}{|\xi_{i}|} \\ \\ \textrm{s.t.} \quad & |y_i - w \cdot x_i - b| \leq \varepsilon + |\xi_{i}|\\ \end{aligned}\end{equation}

We can expand the terms that include $|\xi_{i}|$, considering $\xi_{i}$ as the upper deviations and $\xi_{i}^*$ as the lower deviations. Expanding these terms and including the function $\phi(x)$ previously seen we obtain:

\begin{equation} \begin{aligned} \min_{w,\xi^{(*)}} \quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}{\left ( \xi_{i} + \xi_{i}^* \right )} \\ \\ \textrm{s.t.} \quad & y_i - w  \cdot \phi(x_i)  - b \leq \varepsilon + \xi_{i}^*\\ & w  \cdot \phi(x_i) + b - y_i \leq \varepsilon + \xi_{i}\\ & \xi^{(*)} \geq 0 \\ \end{aligned}\end{equation}

Where $\xi^{(*)} = \left(  \xi_1, \xi_{1}^{*},  \ldots ,  \xi_n, \xi_{n}^{*} \right) $ is the slack variable. As $C$ increases, the tolerance for points outside of $\varepsilon$ also increases. To derive dual problem from this optimization (minimization) problems, the Lagrange function is introduced:

\begin{equation} \begin{aligned} L = & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}{\left ( \xi_{i} + \xi_{i}^* \right )} - \sum_{i=1}^{n}{\left ( \eta_{i}\xi_{i} + \eta_{i}^*\xi_{i}^* \right )} \\  & - \sum_{i=1}^{n}{\alpha_{i} \left ( \varepsilon + \xi_{i} + y_{i} - w  \cdot \phi(x_i)  - b \right )} \\ & - \sum_{i=1}^{n}{\alpha_{i}^{*} \left ( \varepsilon + \xi_{i}^{*} - y_{i} + w  \cdot \phi(x_i) + b \right )} \\  \end{aligned}\end{equation}

Where $L$ is the Lagrangian, $\alpha^{(*)} = \left(  \alpha_1, \alpha_{1}^{*},  \ldots ,  \alpha_n, \alpha_{n}^{*} \right) $ and $\eta^{(*)} = \left(  \eta_1, \eta_{1}^{*},  \ldots ,  \eta_n, \eta_{n}^{*} \right) $ are the Lagrange multipliers. The optimal conditions occur when:

\begin{equation} \begin{aligned} & \frac{\partial{L}}{\partial{w}} = 0 \; \Rightarrow \; w =  \sum_{i=1}^{n}{\left ( \alpha_{i}^* - \alpha_{i} \right ) \phi(x_i)  } \\ \\ & \frac{\partial{L}}{\partial{b}} = 0 \;  \Rightarrow \;  \sum_{i=1}^{n}{\left ( \alpha_{i}^* - \alpha_{i} \right )} = 0 \\  \\ & \frac{\partial{L}}{\partial{\xi_i}} = 0 \; \Rightarrow \;  C - \alpha_{i} - \eta_{i} = 0 \\ \\ & \frac{\partial{L}}{\partial{\xi_{i}^{*}}} = 0 \;  \Rightarrow \;  C - \alpha_{i}^{*} - \eta_{i}^{*} = 0 \\ \end{aligned}\end{equation}

Combining Eq. (6) and (7) we get the dual optimization problem: 

\begin{equation} \begin{aligned} \min_{\alpha^{(*)}} \quad & \frac{1}{2} \sum_{i, j =1}^{n}{( \alpha_{i}^* - \alpha_{i}) ( \alpha_{j}^* - \alpha_{j} ) \phi(x_i) \phi(x_j)} \\ & + \varepsilon \sum_{i=1}^{n}{( \alpha_{i}^* + \alpha_{i} )} - \sum_{i=1}^{n}{y_i(\alpha_{i}^* - \alpha_{i})}\\ \\ \textrm{s.t.} \quad & \sum_{i=1}^{n}{( \alpha_{i}^* - \alpha_{i})} = 0 \\ & 0 \leq \alpha^{(*)} \leq C \\ \end{aligned}\end{equation}

Solving this optimization problem, in other words, obtaining the values of the Lagrange Multipliers $\alpha^{(*)}$ that minimize the objective function, the regression function of SVR can be described as:

\begin{equation} \begin{aligned} f(x) & = \; w  \cdot \phi(x) + b \\  & = \;  \sum_{i=1}^{n}{( \alpha_{i}^* - \alpha_{i}) [\phi(x_i)  \cdot \phi(x) ]} + b \\ & = \;  \sum_{i=1}^{n}{( \alpha_{i}^* - \alpha_{i}) K(x_i, x)} + b \\ \end{aligned}\end{equation}

Where $K(x_i, x)$ is known as Kernel Function.


Importance of hyperparameters tuning

We have seen the mathematical meaning of the hyperparameters $C$ and $\varepsilon$, and how they are introduced in the mathematical formulation of SVR as factors for model tuning. These hyperparameters controls the accuracy and the generalizability of the model. For a better understanding, let's look at a visual example of how these parameters affect the model performance.

In this video, we can notice that for obtain a good regression, we need to find the optimal values for both $C$ and $\varepsilon$. We have shown that isn't enough variate one parameter, to get a robust solution we must optimize both. We also need to choose the right Kernel depending on the data distribution, in the video above we selected a Radial Basis Function (RBF) Kernel, which is a popular choice for many problems, as it can capture complex and non-linear relationships.


Share:

Sunday, May 28, 2023

Support Vector Machines for Regression

Support Vector Machines (SVM) is a powerful and versatile machine learning algorithm that is widely used in both Classification and Regression tasks. Originally developed for binary classification, it operates on the principle of finding the hyperplane that best separates the data into two classes. The "best" hyperplane is defined as the one that maximizes the margin, which is the distance between the hyperplane and the nearest data points from each class. This approach helps to ensure that the model generalizes well to unseen data, reducing the risk of overfitting.

Over time, SVM has been extended to handle multi-class classification and regression problems. In these cases, the algorithm seeks to find the hyperplane (or set of hyperplanes) that best separates the data into multiple classes or best fits the data points in the case of regression. This versatility has led to SVM being applied across a wide range of domains, from image recognition to natural language processing, from stock market prediction to bioinformatics.

While SVM is traditionally associated with classification tasks, it can be adapted for regression tasks in a framework known as Support Vector Regression (SVR). In this post, I focus on the use of SVM for regression, exploring how it can be used to model complex, non-linear relationships in data.

SVR offers several advantages. It can handle high-dimensional data, making it suitable for problems where the number of features is large. It also offers flexibility in modeling different types of relationships through the use of different kernel functions. This means that it can capture both linear and non-linear relationships between the input features and the target variable. 

However, SVR also has its limitations. It can be computationally intensive for large datasets, which can limit its scalability. It also requires careful selection of parameters, such as the regularization factor C and the kernel parameters. Incorrect parameter selection can lead to poor model performance.


Regression with SVM

In the context of regression, SVM is adapted to find a function that has at most $\varepsilon$ deviation from the actually obtained targets for all the training data, and at the same time is as flat as possible. This is known as $\varepsilon$-insensitive loss, which essentially means that errors within a certain margin ($\varepsilon$) are ignored. This approach makes SVM robust to outliers and allows it to find a function that generalizes well to unseen data.

A key concept in SVM, both for classification and regression, is the use of kernel functions. Kernel functions are used to transform the input data into a higher-dimensional space where it becomes easier to find a hyperplane that separates the data. This is particularly useful when dealing with non-linear relationships in the data, as it allows SVM to capture these relationships without explicitly computing the coordinates of the data in the high-dimensional space.

There are several types of kernel functions commonly used in SVM, including linear, polynomial, and Radial Basis Function (RBF). The linear kernel is the simplest and is used when the data is linearly separable. The polynomial kernel allows SVM to capture polynomial relationships in the data. The RBF kernel is a popular choice for many problems, as it can capture complex, non-linear relationships.

The regularization parameter C is a crucial component of SVM. It determines the trade-off between allowing the model to increase its complexity (and potentially overfit the data) and keeping it simple (and potentially underfitting the data). A high value of C encourages the model to fit the training data as closely as possible, while a low value encourages a simpler model. Refer to this link for a complete mathematical formulation of SVR.


Python Implementation


Importing Libraries

The first step in implementing Support Vector Regression (SVR) in Python is importing the necessary libraries. For data manipulation and numerical operations we used pandas and numpy respectively, for data visualization we used seaborn and matplotlib, and several modules from sklearn for preprocessing, model building, and score calculation.

# Importing libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error


Data Loading and Preprocessing

In this section, we load the diamonds dataset from the seaborn library and perform the preprocessing steps. This dataset contains information about a large number of diamonds, including their cut, color, clarity, and price.

First, we load the dataset and show a sample of it.

# Loading the diamonds dataset
diamonds = sns.load_dataset("diamonds")
# Displaying a sample of the dataset
diamonds.sample(5, random_state = 6)

This sample is shown below:


The diamonds dataset contains both numerical and categorical variables. The categorical variables are: cut, color and clarity, these must be encoded into numerical values before we can use them to train the SVR model. We identify the category values of these features and list them in order of importance.

# Identifying the categories of each feature
# We reversed the categories to maintain the order of importance
categories_cut = diamonds['cut'].cat.categories[::-1].tolist()
categories_color = diamonds['color'].cat.categories[::-1].tolist()
categories_clarity = diamonds['clarity'].cat.categories[::-1].tolist()

# Array containing all the categories
categories = [categories_cut, categories_color, categories_clarity]

Then, we create an instance of the OrdinalEncoder class, passing the categories to maintain their order of importance. We fit and transform the labels in the 'cut', 'color', and 'clarity' columns.

# Creating an instance of the Ordinal Encoder
# We pass the categories to maintain the order of importance
ordinal_encoder = OrdinalEncoder(categories=categories)

# Creating a copy of the dataset
diamonds_encoded= diamonds.copy()
# Fitting and transforming the labels in the 'cut', 'color', and 'clarity' columns
diamonds_encoded[['cut', 'color', 'clarity']] = ordinal_encoder.fit_transform(diamonds_encoded[['cut', 'color', 'clarity']])

# Displaying a sample of the encoded dataset
diamonds_encoded.sample(5, random_state = 6)

The sample of the encoded dataset is shown below:


As we can see, the features 'cut', 'color' and 'clarity' now only have numeric values instead of alphanumeric ones.


SVR Training and Evaluation

First, we take the target variable (price) from the rest of the dataset. We then split the data into training and testing sets, using 70% of the data for training and 30% for testing.

y = diamonds_encoded['price'].to_numpy()
X = diamonds_encoded.drop(['price'], axis=1).to_numpy()

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

We define three SVR models with different regularization parameters (C = 5.0, 10.0, 15.0) to see how this parameter affects the model's performance, all of these models were defined with a kernel RBF. We use a pipeline to first scale the data using StandardScaler and then apply the SVR model.

# Defining models
pipeline1 = Pipeline([('scaler', StandardScaler()),
                     ('svr', SVR(kernel = 'rbf', C = 5.0))])
pipeline2 = Pipeline([('scaler', StandardScaler()),
                     ('svr', SVR(kernel = 'rbf', C = 10.0))])
pipeline3 = Pipeline([('scaler', StandardScaler()),
                     ('svr', SVR(kernel = 'rbf', C = 15.0))])
# Grouping models
models = [pipeline1, pipeline2, pipeline3]

Finally, we fit these models on the training data and make predictions on the testing data. We evaluate the performance of the models using two metrics: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). RMSE gives the standard deviation of the residuals, while MAE gives the average magnitude of the errors in a set of predictions.

# Applying models and computing regression scores
RMSE = []
MAE = []
for pipeline in models:
  pipeline.fit(X_train,y_train)
  y_predict = pipeline.predict(X_test)
  # Computing the Root Mean Squared Error (RMSE)
  rmse = mean_squared_error(y_test, y_predict)**0.5
  RMSE.append(rmse)
  # Computing the Mean Absolute Error (MAE)
  mae = mean_absolute_error(y_test, y_predict)
  MAE.append(mae)


Model Performance Visualization

In this section, we visualize the performance of the models using bar plots. This allows us to compare the RMSE and MAE for each model.

We create a pandas DataFrame to store the previous calculated metrics (RMSE and MAE).

# Creating a DataFrame to store the calculated metrics
df = pd.DataFrame(np.array([RMSE, MAE]).T, columns=['Root of Mean Squared Error', 'Mean Absolute Error'])

Then, we define a function to add the labels (values of the metrics) above each bar in the bar plot.

def addlabels(bars, ax):
  # Adding the value above each bar
  for bar in bars:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 2), ha='center', va='bottom', fontsize = 12

Finally, we create a bar chart. We use different colors for each model to distinguish them. We also add a legend to indicate which color corresponds to which model.

plt.rcParams['font.size'] = '11'
fig, ax = plt.subplots(figsize=(9,6), facecolor='#F5F5F5')
ax.set_facecolor('#F5F5F5')

# Bar width
barWidth = 0.28
# Position of the bars
br1 = np.arange(2)
br2 = [x + barWidth for x in br1]
br3 = [x + barWidth for x in br2]

# Creating the bars for each model
bars = ax.bar(br1, df.loc[0], color ='firebrick', width = barWidth, label ='C = 5.0')
addlabels(bars, ax)

bars2 = ax.bar(br2, df.loc[1], color ='seagreen', width = barWidth, label ='C = 10.0')
addlabels(bars2, ax)

bars3 = ax.bar(br3, df.loc[2], color ='mediumblue', width = barWidth, label ='C = 15.0')
addlabels(bars3, ax)

# Adding ticks and title
ax.set_xticks([r + barWidth for r in range(len(df.loc[0]))], ['RMSE', 'MAE'])
ax.set_title('Metrics for Support Vector Regression', fontsize = 15)
ax.set_ylim([0, 1950])
ax.legend()
plt.show()

The results are the following:

This visualization provides a clear comparison of the performance of our models. It shows how the choice of the regularization parameter C affects the RMSE and MAE. The model that obtained best results was the one with the parameter C = 15.0, because its RMSE and MAE were the lowest.


Conclusion

In this post, we've understood and implemented Support Vector Machines (SVM) for regression tasks. We've seen how SVM can be a powerful tool for regression, capable of handling high-dimensional spaces and offering the flexibility of different kernel functions. We've also learned the importance of parameter tuning in machine learning, specifically the role of the regularization parameter C in SVM. Through hands-on coding, we've seen how changes in this parameter can significantly affect the performance of our model.

By working with the diamonds dataset, we've gained practical experience in applying SVM for regression in Python, from data preprocessing to model training and evaluation. This approach has provided us with valuable insights into the workings of SVM and its application in real-world problems.

Share:

Saturday, May 13, 2023

Regression in Machine Learning

Regression is a supervised learning technique widely used in Machine Learning and Data Science. These kind of techniques aims to predict a continuous outcome variable (y) based on the values of one or more predictor variables and features (X), which can be continuous and/or discrete. Regression methods generally works by finding the mathematical relationship between the features and the outcome. When dealing with multiple predictors, the regression model is often represented as a multi-dimensional plane (also called hyperplane).

Unlike classification, which predicts a discrete category, regression quantifies the relationship between features, allowing us to predict a continuous outcome. Let's consider a practical example: If we want to predict the price of a house (a continuous variable), we might look at features such as size, location, number of rooms, construction materials, etc. A regression algorithm would find the best mathematical relationship between these features and the price of the house.


Uses of Regression

Regression models have incredibly versatile applications and find utility across a spectrum of fields. Let's delve into some areas where regression models are essentials:

  • Finance: Regression models serve as powerful tools in the field of finance, particularly in risk assessment and financial forecasting. For example, with them analysts can predict a company's future earnings based on various financial indicators or forecast future stock prices based on historical data.

  • Economics: Regression models are crucial in the field of economics. Economists employ them to forecast various economic indicators and measure the impact of changes in certain variables like employment rates or interest rates. For instance, economists could use regression to understand how a change in government policy might influence and affect the overall employment rate.

  • Healthcare: This is one of the fields where regression models shine and have valuable applications. They can predict patient outcomes based on various factors such as age, weight, genetic predispositions, lifestyle, etc. They can also be used to understand the impact of different treatment approaches on patient recovery or even disease progression. For example, regression models could be used to predict the progression of a disease like diabetes based on various patient-specific factors.
  • Marketing: In marketing, regression models are widely used for predictive tasks, such as predicting the success of marketing campaigns, forecasting sales, and understanding customers behavior. For instance, they can predict customer lifetime value based on purchase history, helping companies to optimize their customer retention strategies and approaches in future campaigns.

Regression Algorithms

There are various types of regression algorithms, each with its strengths, weaknesses, and suitability for different types of problems, some of them are:

  • Linear Regression: One of the simplest forms of regression. It assumes a linear relationship between the predictors and the outcome variable.
  • Polynomial Regression: This is an extension of linear regression, where the model is not restricted to a straight line and can fit polynomial curves to the data.
  • Ridge and Lasso Regression: These are regularization techniques that help prevent overfitting in regression models. They do this by adding a penalty term to the cost function, which reduces the magnitude of the model coefficients.
  • Decision Tree Regression: Decision trees can also be used for regression tasks, with the leaf nodes predicting continuous values.
  • Support Vector Regression: This is the regression version of the Support Vector Machines (SVM), a robust and widely used algorithm in both classification and regression tasks.
  • Random Forest and Gradient Boosting Regression: Ensemble techniques that combine predictions from multiple models to give a final prediction.
  • Neural Network Regression: Neural networks can model complex non-linear relationships. They are especially useful for high-dimensional and complex data, where traditional regression techniques might struggle.


Regression models are a powerful tool in the hands of data scientists and analysts. They allow us to understand and quantify relationships between variables, make predictions about future data points, and even extrapolate trends outside the range of our current data. Being able to estimate the potential impact of different variables on an outcome can provide valuable insights for decision-making.

In future posts, I will dive deeper into these regression techniques, exploring how they work and implementing them using Python.


Share:

Monday, May 1, 2023

Data Encoding

In the field of data science, the quality and format of the data we work with are just as important as the algorithms and models we use. One of the crucial aspects of data preprocessing is Data Encoding, a process that transforms categorical data into a format that can be understood by machine learning algorithms.

Categorical data refers to variables that contain label values rather than numeric values. They are often divided into nominal (variables without order or priority) and ordinal (variables with some level of priority) categories. While this data can be insightful and necessary, many machine learning algorithms can only handle numerical inputs. This is where Data Encoding comes in. It converts these categorical variables into numerical values, allowing us to include this valuable information in the models we use.


Types of Data Encoding

There are several common types of Data Encoding, each with its own strengths and weaknesses. The choice of which to use often depends on the problem, the specific characteristics of the data and the requirements of the machine learning algorithm.


Nominal / One-Hot Encoding

The One-Hot Encoding, also called nominal encoding, is a widely used encoding technique for handling categorical variables. In this method, each category of is converted into a new binary column (values 0 or 1). This results in a sparse matrix where each row represents an observation and each column represents a unique category of the feature. The value is 1 if the category is present for that observation and 0 if not.

One-Hot Encoding is particularly useful for nominal data where no ordinal relationship exists. It allows the representation of categorical data in a way that can be understood by machine learning algorithms. However, it can lead to a high memory consumption and computational cost when the cardinality of the categorical variable is high, because it creates a new column for each unique category.

Now, let's see an example of how One-Hot Encoding can be implemented in Python.

First, we import the necessary libraries and create a pandas dataframe with two categorical features: Color and Shape

# Importing libraries
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Creating a DataFrame with example data
data = {'Color': ['Red', 'Blue', 'Green', 'Yellow', 'Red', 'Green', 'Blue'],
        'Shape': ['Circular', 'Square', 'Square', 'Circular', 'Square', 'Circular', 'Square']}

df = pd.DataFrame(data)

# Displaying the DataFrame
print(df)

The created dataframe is:

Then, we define an instance of the OneHotEncoder class from the sklearn library. We apply this encoder to the dataframe using fit_transform, this method works by first learning the unique values of each feature (the fit part), and then transforming these values into a binary one-hot encoding (the transform part).

# Creating an instance of the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
# Fitting and transforming the data
encoded_data = encoder.fit_transform(df)

Finally, we create and display a new dataframe with the encoded data.

# Creating a DataFrame with the encoded data
df_encoded = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(df.columns))
# Displaying the encoded DataFrame
print(df_encoded)

The encoded dataframe resulted as:



Label Encoding

Label Encoding is another popular technique for encoding categorical variables. In this method, each unique category value is assigned an integer value. The assignment of integers is arbitrary and is usually based on the order of the categories in the feature.

Label Encoding is simple and efficient, making it a good choice for ordinal data and for nominal data with only two categories (binary data). However, it can introduce a potential issue for nominal data with more than two categories. The algorithm might misinterpret the numerical values as having an ordinal relationship, which can lead to poor performance or unexpected results.

Let's examine an example of Label Encoding implementation in Python.

We import the libraries and load the Iris dataset using the load_dataset function from the seaborn library. This is a classic dataset in machine learning and statistics. It includes measurements for 150 iris flowers from three different species (setosa, versicolor and virginica). 

# Importing libraries
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

# Loading the iris dataset
iris = sns.load_dataset("iris")
# Displaying a sample of the dataset
iris.sample(5, random_state = 0)

A sample of the Iris dataset is shown below:

Then, we define an instance of the LabelEncoder class from the sklearn library. The fit_transform method is used to apply the encoder to the feature species.

# Creating an instance of the Label Encoder
label_encoder = LabelEncoder()
  
# Fitting and transforming the labels in column 'species'
iris_encoded = iris.copy()
iris_encoded['species']= label_encoder.fit_transform(iris_encoded['species'])

# Displaying a sample of the encoded dataset
iris_encoded.sample(5, random_state = 0)

The same sample of the Iris dataset, but with the feature species encoded, is:


Ordinal Encoding

Ordinal Encoding is a type of encoding that is similar to Label Encoding but with a key difference. In Ordinal Encoding, the assignment of integers to categories is not arbitrary. Instead, the integers are assigned in a way that respects the ordinal nature of the category.

For example, if a feature describes a rating as "low", "medium", or "high", ordinal encoding would assign the values 0, 1, and 2, respectively, preserving the order of the categories. This method is ideal for ordinal data, as it allows the model to understand the inherent order of the categories.

Let's explore an example of implementing Ordinal Encoding in Python.

We import the libraries and load the cut and price attributes from the Diamonds dataset using the load_dataset function from the seaborn library.

# Importing libraries
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Loading the attributes "cut" and "price" from the diamonds dataset
diamonds = sns.load_dataset("diamonds")[["cut", "price"]]
# Displaying a sample of the dataset
diamonds.sample(5, random_state = 6)

Below is a sample of the dataset used:

Then, we create an instance of the OrdinalEncoder class from the sklearn library, by passing the categories of the cut feature as an argument. These categories were 'Fair', 'Good', 'Very Good', 'Premium' and 'Ideal', the ordinal encoder model assigned the numbers 0-4 to these categories in the order of their quality. The fit_transform method apply the encoder to the feature cut.

Below is the same sample of the dataset, but after applying the encoder:


Conclusion

In conclusion, data encoding is a crucial step in the data preprocessing pipeline, especially when dealing with categorical data. It allows us to transform non-numerical data into a format that can be understood and used by the machine learning algorithms.

In this post, we have explored several types of data encoding methods, including One-Hot Encoding, Label Encoding and Ordinal Encoding. Each of these methods has its own use cases and is appropriate in different circumstances. For example, One-Hot Encoding is useful when the categorical variables are nominal, while Label and Ordinal Encoding are beneficial when the categories have an inherent order.

Remember, the choice of encoding method can significantly impact the performance of the machine learning model. Therefore, it's essential to understand these methods and make an informed decision based on the nature of the data and the requirements of the model.


Share:

About Me

My photo
I am an Engineering Physicist, graduated with academic excellence as the top of my class. I have experience programming in several languages, including C++, MATLAB, and especially Python. I have worked on projects in image and signal processing, as well as in machine learning and data analysis.

Recent Post

Quantum Kernels

The quantum version of SVM was initially proposed by  Rebentrost et al. (2014) , but this formulation requires the input data to be provided...

Pages