Wednesday, June 28, 2023

Random Forest Regression

Random Forest is a powerful machine learning algorithm that has gained significant popularity in both academia and industry due to its excellent performance, robustness, and ease of use. It's an ensemble learning method, which means that it combines the predictions of several base estimators built with a given learning algorithm to improve generalizability and robustness. In the case of Random Forest, the base estimators are decision trees.

The concept behind Random Forest is simple and effective. It operates by building a multitude of decision trees at training time and taking as output the mean prediction (for regression) or the majority class (for classification) of the individual trees. The model it creates is a set of decision trees, usually trained with the "bagging" method. The main idea is to inject randomness into the tree building to ensure each tree is different. Consequently, although some trees may be wrong, many others will be right, so as a group the trees are able to move in the correct direction.

The strength of Random Forest lies in its ability to reduce the overfitting problem, which is common in decision trees. While a single decision tree can create overly complex models that don't generalize well to new data, Random Forest mitigates this by averaging multiple decision trees to build a more robust and generalizable model. It maintains a high level of accuracy and provides a measure of "interpretability" via feature importance estimation, which is not available in many other machine learning algorithms.



Learning Method

Random Forest is a perfect example of ensemble learning, a powerful concept in machine learning where multiple models, known as base learners, are trained to solve the same problem and combined to get better results. The main principle behind ensemble learning is that a group of weak learners can come together to form a strong learner. The models in the ensemble may not be strong predictors individually, but together they can provide a more accurate and stable prediction.

In the case of Random Forest, the base learners are decision trees. The algorithm introduces randomness into the model building process, which results in an ensemble of different models. Two key concepts in ensemble learning applied to Random Forest are Bagging and Feature Randomness.

Bagging: Also known as bootstrap aggregating, it's used to create different subsets of the original data by sampling with replacement. For each subset, a decision tree is built. This means that each tree in the forest is trained on a different set of data. This process decorrelates the trees, which means the errors of individual trees become less correlated with each other. When the predictions of these trees are averaged (in the case of regression), the variance of the final prediction is reduced.

Feature Randomness: In addition to bagging, Random Forest also uses a method called feature randomness. In traditional decision trees, when it is time to split a node, we consider every possible feature and choose the one that produces the most significant improvement in the objective function. Random Forest changes this by selecting a random subset of the features at each split point, which adds an extra layer of randomness to the model building process. This further decorrelates the trees and ultimately results in a more robust model.
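
The two ideas above can be sketched by hand with plain decision trees. This is only an illustration under assumptions of my own: the synthetic dataset (make_friedman1), the number of trees, and the choice of 3 candidate features per split are arbitrary, and it is not how sklearn's RandomForestRegressor is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with 10 features (illustrative, not from this post)
X, y = make_friedman1(n_samples=300, n_features=10, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_trees = 50
trees = []
for i in range(n_trees):
    # Bagging: bootstrap sample of the training set (sampling with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Feature randomness: only 3 randomly chosen features are tried at each split
    tree = DecisionTreeRegressor(max_features=3, random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Regression forest prediction: mean of the individual tree predictions
all_preds = np.stack([t.predict(X_test) for t in trees])  # (n_trees, n_test)
forest_pred = all_preds.mean(axis=0)

mean_tree_mse = np.mean((all_preds - y_test) ** 2)  # average error of single trees
forest_mse = np.mean((forest_pred - y_test) ** 2)   # error of the averaged prediction
print(f"Average single-tree MSE: {mean_tree_mse:.2f}")
print(f"Averaged-forest MSE:     {forest_mse:.2f}")
```

The averaged prediction can never have a higher squared error than the average of the individual trees, and the less correlated the trees are, the larger the gap tends to be.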

Another type of ensemble learning is the Boosting method, which is used by some of the most popular ensemble algorithms, such as AdaBoost, Gradient Boosting, XGBoost and LightGBM.

Boosting: It consists of a sequential process, where each model is trained to correct the mistakes made by the previous model. In the context of regression, these mistakes are the residuals, that is, the differences between the actual and predicted values. The final prediction is a weighted sum of the individual predictions of all models: in AdaBoost, models that perform better receive more weight, while gradient boosting methods typically scale every model by a common learning rate. Boosting mainly reduces bias and, with suitable regularization, can also keep variance under control, making it a powerful method for improving the accuracy of regression models.
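
A minimal sketch of the sequential idea, assuming squared error as the loss and a fixed learning rate instead of performance-based weights (so it is closer to gradient boosting than to AdaBoost). The synthetic data and all parameter values are illustrative choices, not part of the original post.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

learning_rate = 0.1
n_rounds = 100

# Start from a constant model: the mean of the targets
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

models = []
for _ in range(n_rounds):
    residuals = y - prediction                  # mistakes of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                      # the next model learns those mistakes
    prediction += learning_rate * tree.predict(X)
    models.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.4f}")
print(f"Training MSE after boosting:  {final_mse:.4f}")
```

Unlike the trees of a Random Forest, these trees cannot be trained in parallel, because each one depends on the predictions of all the previous ones.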


Random Forest Hyperparameters

When building a Random Forest model for regression, there are several parameters that can be tuned to optimize the performance of the model. Two of the most important hyperparameters are Number of Trees and Max Depth.

Number of Trees: It determines the number of estimators in the forest. Each tree is built independently and contributes equally to the final prediction, which is the mean of the predictions of all trees. Increasing the number of trees can make the model more robust and reduce variance, but beyond a certain point, the benefits in prediction performance may be outweighed by the computational cost. There is no definitive rule for choosing the optimal number of trees, as it depends on the specific problem and the trade-off between computational efficiency and model performance.

Max Depth: It determines the maximum depth of each tree. The depth of a tree is the maximum distance between the root node and any leaf node. A tree of depth k can model interactions involving up to k features. If the max depth is too high, the model may overfit the training data and not work well with unseen data. If it's too low, the model may be too simple to capture complex patterns in the data, leading to underfitting.

A visual example of the hyperparameter tuning process is shown in the next video:

Here I built a Random Forest Regression varying the parameters Max Depth and Number of Trees, showing the regression result as well as the individual Decision Trees built for each case. We can appreciate how the shape of the Random Forest Regression changes. When the model is underfitting, the regression result may appear overly simplified and unable to capture the complexity of the data. As we increase the max depth and number of trees, the model becomes more complex and better able to fit the data. However, if we go too far, the model may start to overfit, capturing noise and outliers in the data.
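
The same experiment can be approximated numerically by comparing training and test errors for a few settings. The sketch below is an assumption-laden illustration: it uses the diabetes dataset from sklearn, and the grid of values is arbitrary rather than the one used in the video.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

results = {}
for max_depth in (1, 4, None):       # None: trees grow until no further split is possible
    for n_estimators in (5, 100):
        model = RandomForestRegressor(n_estimators=n_estimators,
                                      max_depth=max_depth, random_state=0)
        model.fit(X_train, y_train)
        train_mae = mean_absolute_error(y_train, model.predict(X_train))
        test_mae = mean_absolute_error(y_test, model.predict(X_test))
        results[(max_depth, n_estimators)] = (train_mae, test_mae)
        print(f"max_depth={max_depth}, n_estimators={n_estimators}: "
              f"train MAE={train_mae:.1f}, test MAE={test_mae:.1f}")
```

A training error that keeps dropping while the test error stalls or grows is the usual sign of overfitting. Similar training and test errors that are both high point to underfitting.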


Python Implementation


Importing Libraries

The first step is importing the necessary libraries. For data manipulation we use pandas and numpy, for data visualization we use matplotlib, and we use several modules from sklearn for data loading and splitting, model building, and score calculation.

# Importing libraries
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


Data Loading

The next step is to load a dataset and split it into a training set and a test set. We're using the load_diabetes dataset from sklearn, which is a commonly used dataset for regression tasks. It consists of 442 patients with ten baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements. The target variable is a quantitative measure of disease progression one year after baseline. It's important to notice that the features returned by load_diabetes are, by default, mean centered and scaled by the standard deviation times the square root of the number of samples, so the sum of squares of each column equals 1.

# Loading the dataset
diabetes = load_diabetes(as_frame = True)
data = diabetes['frame']

# Displaying the first few rows of the dataset
data.head()

We're displaying the first few rows of the dataset using the head method. This gives us a quick overview of the data we're working with:


We separate the features (X) and the target variable (y) from the dataset. The features are all the columns except the last one, which is the target variable. We also split these into a training set and a test set using the train_test_split function, taking 80% of the data for training and 20% for testing.

# Separate the features and the target
X = data.iloc[:,0:10].to_numpy()
y = data["target"].to_numpy()

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
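
As a quick sanity check of the data described above, the following sketch inspects the column means, the column sums of squares and the sizes of the two splits. It reloads the dataset with return_X_y=True only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Preprocessing applied by default: every column is centered
# and has a sum of squares close to 1
col_means = X.mean(axis=0)
col_sum_sq = (X ** 2).sum(axis=0)
print("Largest absolute column mean:", np.abs(col_means).max())
print("Column sums of squares:", np.round(col_sum_sq, 3))

# Same split as above: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
print("Shapes:", X.shape, X_train.shape, X_test.shape)
```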


Random Forest Training and Evaluation

Now that we've loaded and split the diabetes dataset, we can move on to applying the model. We create a Random Forest Regression model with 100 estimators (trees), train it on the training set and make predictions on the test set.

# Creating the Random Forest Regression instance
RF = RandomForestRegressor(n_estimators=100, random_state=0)
# Training the model on the training data
RF.fit(X_train, y_train)
# Making predictions on the test set
y_pred = RF.predict(X_test)

Then we calculate the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to evaluate the model's performance.

# Computing the Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test, y_pred)
# Computing the Root Mean Squared Error (RMSE)
RMSE = mean_squared_error(y_test, y_pred)**0.5
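
Alongside the error metrics, the feature importance estimation mentioned at the beginning of the post can be read from the fitted model through its feature_importances_ attribute. The sketch below refits the same model on a DataFrame so the column names are available. It is an illustrative addition, and impurity-based importances should be read as a rough ranking rather than an exact measure.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

diabetes = load_diabetes(as_frame=True)
X = diabetes.data        # DataFrame with the ten feature columns
y = diabetes.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
RF = RandomForestRegressor(n_estimators=100, random_state=0)
RF.fit(X_train, y_train)

# Impurity-based importances, one value per feature, normalized to sum to 1
importances = sorted(zip(X.columns, RF.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
for name, score in importances:
    print(f"{name:>4}: {score:.3f}")
```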

To better understand the performance of the model, we can visualize the error metrics using a bar chart. This gives us a clear, visual comparison of the two metrics.

# Creating the figure
fig, ax = plt.subplots(figsize=(9,6), facecolor='#F5F5F5')
plt.rcParams['font.size'] = '14'
ax.set_facecolor('#F5F5F5')

# Creating the bar for MAE
bar1 = ax.bar(0.25, MAE, color ='firebrick', width = 0.23)[0]
# Adding the value of the MAE on top of the bar
yval1 = bar1.get_height()
ax.text(bar1.get_x() + bar1.get_width()/2, yval1, round(yval1, 2), ha='center', va='bottom', fontsize = 16)

# Creating the bar for RMSE
bar2 = ax.bar(0.75, RMSE, color ='seagreen', width = 0.23)[0]
# Adding the value of the RMSE on top of the bar
yval2 = bar2.get_height()
ax.text(bar2.get_x() + bar2.get_width()/2, yval2, round(yval2, 2), ha='center', va='bottom', fontsize = 16)

# Adding ticks and title
ax.set_xticks([0.25,0.75], ['MAE', 'RMSE'], fontsize = 16)
ax.set_title('Metrics for Random Forest', fontsize = 18)
ax.set_ylim([0, 70])
ax.set_xlim([0, 1])

# Display the plot
plt.show()

The results are the following:


Conclusions

Through this exploration of the Random Forest algorithm for regression, we've gained valuable insights into this powerful machine learning technique. We've learned how it leverages the concept of ensemble learning, specifically bagging, to build robust and accurate models. We've also understood the importance of key parameters like the number of trees and maximum depth, and how they influence the model's performance. Through our practical implementation in Python, we've seen how to apply these concepts in practice. We've learned how to train a Random Forest model, make predictions, and evaluate the model's performance using metrics like Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

By reading this post, you've not only gained a deeper understanding of Random Forest for regression, but also acquired practical skills that you can apply to your own machine learning projects. Remember, the key to successful machine learning is understanding the data, the algorithm, and the problem at hand.



Saturday, June 10, 2023

Mathematical Formulation SVR

Support Vector Machines (SVMs), first introduced by Vapnik in 1995, represent one of the most impactful advancements in the realm of machine learning. Since their inception, SVMs have proven highly versatile, with successful implementations across a myriad of domains including text categorization, image analysis, speech recognition, time series forecasting, and even information security.

The core principle of SVMs is rooted in the concepts of statistical learning theory (SLT) and structural risk minimization (SRM). This foundation diverges from other methods that are grounded in empirical risk minimization (ERM). While those methods focus on minimizing the error on the training dataset, SVMs aim to minimize an upper bound on the generalization error. This bound combines the training error and a confidence term, resulting in more robust and reliable models.

Initially, SVMs were primarily developed to handle classification problems. However, the power and potential of SVMs extended beyond mere classification. This led to the introduction of Support Vector Machines for Regression (SVR), a significant variation on the model originally proposed by Vapnik. SVR holds true to the core principles of SVMs while adopting a new approach to apply them to regression problems.


Mathematical Formulation

Suppose we have data in the form $x = [x_1, x_2,  \ldots , x_n]$ and $y = [y_1, y_2,  \ldots , y_n]$ where $x_i \in \mathbb{R}^d$ are the input variables (features) and $y_i \in \mathbb{R}$ the output variables. The regression function of SVR is defined as:

\begin{equation} f(x) = w  \cdot x + b\end{equation}

Where $w$ is a weight vector and $b$ is a bias term. Notice that this equation is the hyperplane equation in the formulation of SVMs. A graphic example of the regression function (in two dimensions) is:


We can map the input vector $x$ to a high dimensional space with a function $\phi(x)$. Introducing this function in Eq. (1) we obtain:

\begin{equation} f(x) = w  \cdot \phi(x) + b\end{equation}

This form of the regression function allows us to generate non-linear regressions, like:



As mentioned earlier, SVR minimizes an upper bound on the generalization error. The error term is handled in the constraints, where we require the absolute error to be less than or equal to a specified margin, called the maximum error, $\varepsilon$. We can tune $\varepsilon$ to gain the desired accuracy of our model. For this case, the objective function and constraints are:

\begin{equation} \begin{aligned} \min_{w} \quad & \frac{1}{2}\|w\|^2 \\ \textrm{s.t.} \quad & |y_i - (w \cdot x_i +b)| \leq \varepsilon \\ \end{aligned}\end{equation}

This is read as: minimize the objective function $\frac{1}{2}\|w\|^2$ with respect to the variable $w$, subject to the constraint $|y_i - (w \cdot x_i +b) | \leq \varepsilon$ for every data point.

An illustrative example of this is:


Depending on the value of $\varepsilon$, this formulation may not be feasible for all data points. The algorithm solves the objective function as well as possible, but some of the points can fall outside the margins. We need to account for the possibility of errors that are larger than $\varepsilon$. To handle this, we introduce the slack variables.

The slack variable $\xi$ is defined as follows: for any point that falls outside of the $\varepsilon$ margin, we denote its deviation from the margin as $\xi$.



We aim to minimize these deviations $\xi$ as much as possible. Thus, we can add these deviations to the objective function, along with an additional hyperparameter, the penalty factor, $C$.

\begin{equation} \begin{aligned} \min_{w,\xi} \quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}{|\xi_{i}|} \\ \\ \textrm{s.t.} \quad & |y_i - w \cdot x_i - b| \leq \varepsilon + |\xi_{i}|\\ \end{aligned}\end{equation}
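
To make the role of the slack variables concrete, the following sketch evaluates this objective for a fixed linear model. The values of $w$, $b$, $\varepsilon$, $C$ and the data points are made up for the illustration, and the helper name slack_variables is hypothetical. Nothing is being optimized here.

```python
import numpy as np

def slack_variables(y, f_x, epsilon):
    """Deviation of each point beyond the epsilon margin (0 for points inside it)."""
    return np.maximum(0.0, np.abs(y - f_x) - epsilon)

# Hypothetical 1-D linear model f(x) = w * x + b and a handful of points
w, b = np.array([2.0]), 1.0
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.2, 2.5, 5.9, 7.0])
epsilon, C = 0.5, 10.0

f_x = X @ w + b                       # predictions: 1, 3, 5, 7
xi = slack_variables(y, f_x, epsilon)

# Objective: flatness term plus the penalized sum of deviations
objective = 0.5 * np.dot(w, w) + C * xi.sum()
print("slack per point:", np.round(xi, 3))   # only the third point lies outside the margin
print("objective value:", round(float(objective), 3))
```

Only points outside the margin contribute to the penalty term, so a larger $C$ makes each of those deviations more expensive.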

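We can expand the terms that include $|\xi_{i}|$ by using two non-negative slack variables per point: $\xi_{i}$ for predictions that exceed $y_i$ by more than $\varepsilon$ and $\xi_{i}^*$ for predictions that fall below $y_i$ by more than $\varepsilon$. Expanding these terms and including the function $\phi(x)$ previously seen we obtain: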

\begin{equation} \begin{aligned} \min_{w,\xi^{(*)}} \quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}{\left ( \xi_{i} + \xi_{i}^* \right )} \\ \\ \textrm{s.t.} \quad & y_i - w  \cdot \phi(x_i)  - b \leq \varepsilon + \xi_{i}^*\\ & w  \cdot \phi(x_i) + b - y_i \leq \varepsilon + \xi_{i}\\ & \xi^{(*)} \geq 0 \\ \end{aligned}\end{equation}

Where $\xi^{(*)} = \left(  \xi_1, \xi_{1}^{*},  \ldots ,  \xi_n, \xi_{n}^{*} \right) $ collects the slack variables. As $C$ increases, deviations beyond $\varepsilon$ are penalized more heavily, so the tolerance for points outside of $\varepsilon$ decreases. To derive the dual problem from this optimization (minimization) problem, the Lagrange function is introduced:

\begin{equation} \begin{aligned} L = & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}{\left ( \xi_{i} + \xi_{i}^* \right )} - \sum_{i=1}^{n}{\left ( \eta_{i}\xi_{i} + \eta_{i}^*\xi_{i}^* \right )} \\  & - \sum_{i=1}^{n}{\alpha_{i} \left ( \varepsilon + \xi_{i} + y_{i} - w  \cdot \phi(x_i)  - b \right )} \\ & - \sum_{i=1}^{n}{\alpha_{i}^{*} \left ( \varepsilon + \xi_{i}^{*} - y_{i} + w  \cdot \phi(x_i) + b \right )} \\  \end{aligned}\end{equation}

Where $L$ is the Lagrangian, and $\alpha^{(*)} = \left(  \alpha_1, \alpha_{1}^{*},  \ldots ,  \alpha_n, \alpha_{n}^{*} \right) $ and $\eta^{(*)} = \left(  \eta_1, \eta_{1}^{*},  \ldots ,  \eta_n, \eta_{n}^{*} \right) $ are the non-negative Lagrange multipliers. The optimality conditions are:

\begin{equation} \begin{aligned} & \frac{\partial{L}}{\partial{w}} = 0 \; \Rightarrow \; w =  \sum_{i=1}^{n}{\left ( \alpha_{i}^* - \alpha_{i} \right ) \phi(x_i)  } \\ \\ & \frac{\partial{L}}{\partial{b}} = 0 \;  \Rightarrow \;  \sum_{i=1}^{n}{\left ( \alpha_{i}^* - \alpha_{i} \right )} = 0 \\  \\ & \frac{\partial{L}}{\partial{\xi_i}} = 0 \; \Rightarrow \;  C - \alpha_{i} - \eta_{i} = 0 \\ \\ & \frac{\partial{L}}{\partial{\xi_{i}^{*}}} = 0 \;  \Rightarrow \;  C - \alpha_{i}^{*} - \eta_{i}^{*} = 0 \\ \end{aligned}\end{equation}

Combining Eq. (6) and (7) we get the dual optimization problem: 

\begin{equation} \begin{aligned} \min_{\alpha^{(*)}} \quad & \frac{1}{2} \sum_{i, j =1}^{n}{( \alpha_{i}^* - \alpha_{i}) ( \alpha_{j}^* - \alpha_{j} ) \, \phi(x_i) \cdot \phi(x_j)} \\ & + \varepsilon \sum_{i=1}^{n}{( \alpha_{i}^* + \alpha_{i} )} - \sum_{i=1}^{n}{y_i(\alpha_{i}^* - \alpha_{i})}\\ \\ \textrm{s.t.} \quad & \sum_{i=1}^{n}{( \alpha_{i}^* - \alpha_{i})} = 0 \\ & 0 \leq \alpha^{(*)} \leq C \\ \end{aligned}\end{equation}
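
As a sketch of the intermediate step: the conditions on $\xi_i$ and $\xi_i^*$ in Eq. (7) make every slack term in Eq. (6) cancel, because $(C - \alpha_i - \eta_i)\,\xi_i = 0$ and $(C - \alpha_i^* - \eta_i^*)\,\xi_i^* = 0$, and the condition on $b$ removes the bias term. Substituting the expression for $w$ then leaves a function of the multipliers $\alpha^{(*)}$ only:

\begin{equation*} \begin{aligned} L = & -\frac{1}{2} \sum_{i, j =1}^{n}{( \alpha_{i}^* - \alpha_{i}) ( \alpha_{j}^* - \alpha_{j} ) \, \phi(x_i) \cdot \phi(x_j)} \\ & - \varepsilon \sum_{i=1}^{n}{( \alpha_{i}^* + \alpha_{i} )} + \sum_{i=1}^{n}{y_i(\alpha_{i}^* - \alpha_{i})} \\ \end{aligned}\end{equation*}

The dual problem maximizes this expression over the multipliers, which is the same as minimizing its negative, as written in Eq. (8). The bounds $0 \leq \alpha^{(*)} \leq C$ follow from $\alpha_i = C - \eta_i$ and $\alpha_i^* = C - \eta_i^*$ together with $\eta^{(*)} \geq 0$.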

Solving this optimization problem, in other words, obtaining the values of the Lagrange multipliers $\alpha^{(*)}$ that minimize the objective function, lets us write the regression function of SVR as:

\begin{equation} \begin{aligned} f(x) & = \; w  \cdot \phi(x) + b \\  & = \;  \sum_{i=1}^{n}{( \alpha_{i}^* - \alpha_{i}) [\phi(x_i)  \cdot \phi(x) ]} + b \\ & = \;  \sum_{i=1}^{n}{( \alpha_{i}^* - \alpha_{i}) K(x_i, x)} + b \\ \end{aligned}\end{equation}

Where $K(x_i, x) = \phi(x_i) \cdot \phi(x)$ is known as the kernel function.
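
This last expression can be checked numerically against a fitted model. The sketch below assumes sklearn's SVR with an RBF kernel on synthetic data. Its dual_coef_ attribute stores one signed coefficient per support vector, which plays the role of $(\alpha_i^* - \alpha_i)$ here, and intercept_ plays the role of $b$. The data and hyperparameter values are arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVR

# Synthetic 1-D regression data (illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(80, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

gamma = 0.5
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=gamma).fit(X, y)

# Points where the regression function is evaluated
X_new = np.linspace(0, 6, 5).reshape(-1, 1)

# K(x_i, x) between every support vector x_i and every new point x
K = rbf_kernel(svr.support_vectors_, X_new, gamma=gamma)   # (n_SV, n_new)

# f(x) = sum_i coef_i * K(x_i, x) + b, using only the support vectors
manual = (svr.dual_coef_ @ K).ravel() + svr.intercept_[0]

print("Support vectors:", len(svr.support_vectors_), "of", len(X))
print("Matches svr.predict:", np.allclose(manual, svr.predict(X_new)))
```

Only the support vectors appear in the sum because the points that lie strictly inside the $\varepsilon$ margin end up with a zero coefficient.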


Importance of hyperparameter tuning

We have seen the mathematical meaning of the hyperparameters $C$ and $\varepsilon$, and how they are introduced in the mathematical formulation of SVR as factors for model tuning. These hyperparameters control the accuracy and the generalizability of the model. For a better understanding, let's look at a visual example of how these parameters affect the model performance.

In this video, we can notice that to obtain a good regression we need to find suitable values for both $C$ and $\varepsilon$. Varying only one parameter isn't enough: to get a robust solution we must tune both. We also need to choose the right kernel depending on the data distribution. In the video above we selected a Radial Basis Function (RBF) kernel, which is a popular choice for many problems, as it can capture complex and non-linear relationships.
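
One common way to tune both hyperparameters together is a cross-validated grid search. The sketch below uses GridSearchCV from sklearn on synthetic data. The grid values, the scoring metric and the number of folds are assumptions chosen for the illustration, not the settings used in the video.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic 1-D regression data (illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(120, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=120)

# Every combination of C and epsilon is evaluated with 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10, 100], "epsilon": [0.01, 0.1, 0.5]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated MSE:", -search.best_score_)
```

Because the two parameters interact, the best value of $C$ found for one $\varepsilon$ is not necessarily the best for another, which is why the grid covers them jointly.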

