Wednesday, June 28, 2023

Random Forest Regression

Random Forest is a powerful machine learning algorithm that has gained significant popularity in both academia and industry due to its excellent performance, robustness, and ease of use. It's an ensemble learning method, which means it combines the predictions of several base estimators built with a given learning algorithm to improve generalizability and robustness. In the case of Random Forest, the base estimators are decision trees.

The concept behind Random Forest is simple and effective. It operates by building a multitude of decision trees at training time and taking as output the mean prediction (for regression) or the majority class (for classification) of the individual trees. The resulting model is a set of decision trees, usually trained with the "bagging" method; the main idea is to inject randomness into the tree building so that each tree is different. Consequently, although some trees may be wrong, many others will be right, so as a group the trees are able to move in the correct direction.

The strength of Random Forest lies in its ability to overcome the overfitting problem that is common in decision trees. While a single decision tree can create overly complex models that don't generalize well to new data, Random Forest mitigates this by employing multiple decision trees to build a more robust and generalizable model. It maintains a high level of accuracy and provides a measure of "interpretability" via feature importance estimation, which is not available in many other machine learning algorithms.
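As a quick illustration of that interpretability, a fitted scikit-learn RandomForestRegressor exposes a feature_importances_ attribute. The following is a minimal sketch on synthetic data (the variable names are only illustrative and are not part of the example developed later in this post):

# Minimal sketch: feature importances of a fitted Random Forest (synthetic data)
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))               # three random features
y_demo = 2*X_demo[:, 0] + 0.5*X_demo[:, 1]       # target driven mostly by the first feature

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
for name, importance in zip(["x0", "x1", "x2"], model.feature_importances_):
    print(name, round(importance, 2))            # the first feature should dominate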



Learning Method

Random Forest is a perfect example of ensemble learning, a powerful concept in machine learning where multiple models, known as base learners, are trained to solve the same problem and combined to get better results. The main principle behind ensemble learning is that a group of weak learners can come together to form a strong learner: the models in the ensemble may not be strong predictors individually, but together they can provide a more accurate and stable prediction.

In the case of Random Forest, the base learners are decision trees. The algorithm introduces randomness into the model building process, which results in an ensemble of different models. Two key concepts in ensemble learning applied to Random Forest are Bagging and Feature Randomness.

Bagging: Also known as bootstrap aggregating, it's used to create different subsets of the original data by sampling with replacement. For each subset, a decision tree is built. This means that each tree in the forest is trained on a different set of data. This process decorrelates the trees, which means the errors of individual trees become less correlated with each other. When the predictions of these trees are averaged (in the case of regression), the variance of the final prediction is reduced.
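To make the idea concrete, the following is a minimal sketch of bagging with decision trees on synthetic data (scikit-learn performs these steps internally, so this is only illustrative):

# Minimal sketch of bagging: each tree is fit on a bootstrap sample, predictions are averaged
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))                        # toy feature
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)      # toy target

trees = []
for _ in range(10):                                          # build 10 trees
    idx = rng.integers(0, len(X), len(X))                    # sample row indices with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

y_hat = np.mean([tree.predict(X) for tree in trees], axis=0) # averaged (bagged) prediction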

Feature Randomness: In addition to bagging, Random Forest also uses a method called feature randomness. In traditional decision trees, when it is time to split a node, we consider every possible feature and choose the one that produces the most significant improvement in the objective function. Random Forest changes this by selecting a random subset of the features at each split point, which adds an extra layer of randomness to the model building process. This further decorrelates the trees and ultimately results in a more robust model.
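In scikit-learn, this behaviour is controlled by the max_features parameter of RandomForestRegressor. A minimal sketch (the value 0.33 is only an illustrative choice):

# Feature randomness: each split considers only a random subset of the features,
# here roughly one third of them
from sklearn.ensemble import RandomForestRegressor

RF_subset = RandomForestRegressor(n_estimators=100, max_features=0.33, random_state=0)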

Another type of ensemble learning is the Boosting method, which is used by some of the most popular ensemble algorithms such as AdaBoost, Gradient Boosting, XGBoost, and LightGBM.

Boosting: It consists of a sequential process, where each model is trained to correct the mistakes made by the previous model. In the context of regression, these mistakes are the residuals or differences between the predicted and actual values. The final prediction is made by a weighted sum of the individual predictions of all models, where models that perform better have more weight. Boosting is effective at reducing both bias and variance, making it a powerful method for improving the accuracy of regression models.
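For comparison, a boosted ensemble of trees can be built in scikit-learn with GradientBoostingRegressor. A minimal sketch on synthetic data (not part of the example developed below):

# Minimal sketch of boosting: each new tree is fit on the residuals of the current ensemble
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

GB = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
GB.fit(X, y)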


Random Forest Hyperparameters

When building a Random Forest model for regression, there are several parameters that can be tuned to optimize the performance of the model. Two of the most important hyperparameters are Number of Trees and Max Depth.

Number of Trees: It determines the number of estimators (trees) in the forest. Each tree is built independently and contributes equally to the final prediction, which is the mean of the predictions of all trees. Increasing the number of trees can make the model more robust and reduce variance, but beyond a certain point the gains in prediction performance may be outweighed by the computational cost. There is no definitive rule for choosing the optimal number of trees, as it depends on the specific problem and the trade-off between computational efficiency and model performance.

Max Depth: It determines the maximum depth of each tree. The depth of a tree is the maximum distance between the root node and any leaf node. A tree of depth k can model interactions involving up to k features. If the max depth is too high, the model may overfit the training data and not work well with unseen data. If it's too low, the model may be too simple to capture complex patterns in the data, leading to underfitting.
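A common way to choose values for these two hyperparameters is a cross-validated grid search. The sketch below assumes the diabetes dataset used later in this post, and the grid values are arbitrary choices for illustration:

# Minimal sketch: tuning n_estimators and max_depth with a cross-validated grid search
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [2, 4, 8, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X, y)
print(search.best_params_)   # best combination found for this data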

A visual example of the hyperparameter tuning process is shown in the next video:

Here I built a Random Forest Regression model varying the parameters Max Depth and Number of Trees, showing the regression result as well as the individual Decision Trees built for each case. We can appreciate how the shape of the Random Forest Regression changes. When the model is underfitting, the regression result may appear overly simplified and unable to capture the complexity of the data. As we increase the max depth and the number of trees, the model becomes more complex and better able to fit the data. However, if we go too far, the model may start to overfit, capturing noise and outliers in the data.


Python Implementation


Importing Libraries

The first step is importing the necessary libraries. For data manipulation we use pandas and numpy, for data visualization we use matplotlib, and several modules from sklearn for loading the dataset, splitting the data, building the model, and computing the evaluation metrics.

# Importing libraries
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


Data Loading

The next step is to load a dataset and split it into a training set and a test set. We're using the load_diabetes dataset from sklearn, which is a commonly used dataset for regression tasks. It consists of 442 patients with ten baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements. The target variable is a quantitative measure of disease progression one year after baseline. It's important to note that the features in this dataset come already mean centered and scaled, so no additional standardization is needed.

# Loading the dataset
diabetes = load_diabetes(as_frame = True)
data = diabetes['frame']

# Displaying the first few rows of the dataset
data.head()

We're displaying the first few rows of the dataset using the head method. This gives us a quick overview of the data we're working with:


We separate the features (X) and the target variable (y) from the dataset. The features are all the columns except the last one, which is the target variable. We also split these into a training set and a test set using the train_test_split function, taking 80% of the data for training and 20% for testing.

# Separate the features and the target
X = data.iloc[:,0:10].to_numpy()
y = data["target"].to_numpy()

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)


Random Forest Training and Evaluation

Now that we've loaded and split the diabetes dataset, we can move on to applying the model. We create a Random Forest Regression model with 100 estimators (trees), train it on the training set, and make predictions on the test set.

# Creating the Random Forest Regression instance
RF = RandomForestRegressor(n_estimators=100, random_state=0)
# Training the model on the training data
RF.fit(X_train, y_train)
# Making predictions on the test set
y_pred = RF.predict(X_test)

Then we calculate the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to evaluate the model's performance.

# Computing the Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test, y_pred)
# Computing the Root Mean Squared Error (RMSE)
RMSE = mean_squared_error(y_test, y_pred)**0.5
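
To inspect the numbers before plotting them, the two metrics can simply be printed (a small optional addition):

# Printing the metrics
print("MAE: ", round(MAE, 2))
print("RMSE:", round(RMSE, 2))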

To better understand the performance of the model, we can visualize the error metrics using a bar chart. This gives us a clear, visual comparison of the two metrics.

# Creating the figure
fig, ax = plt.subplots(figsize=(9,6), facecolor='#F5F5F5')
plt.rcParams['font.size'] = '14'
ax.set_facecolor('#F5F5F5')

# Creating the bar for MAE
bar1 = ax.bar(0.25, MAE, color ='firebrick', width = 0.23)[0]
# Adding the value of the MAE on top of the bar
yval1 = bar1.get_height()
ax.text(bar1.get_x() + bar1.get_width()/2, yval1, round(yval1, 2), ha='center', va='bottom', fontsize = 16)

# Creating the bar for RMSE
bar2 = ax.bar(0.75, RMSE, color ='seagreen', width = 0.23)[0]
# Adding the value of the RMSE on top of the bar
yval2 = bar2.get_height()
ax.text(bar2.get_x() + bar2.get_width()/2, yval2, round(yval2, 2), ha='center', va='bottom', fontsize = 16)

# Adding ticks and title
ax.set_xticks([0.25,0.75], ['MAE', 'RMSE'], fontsize = 16)
ax.set_title('Metrics for Random Forest', fontsize = 18)
ax.set_ylim([0, 70])
ax.set_xlim([0, 1])

# Display the plot
plt.show()

The results are the following:


Conclusions

Through this exploration of the Random Forest algorithm for regression, we've gained valuable insights into this powerful machine learning technique. We've learned how it leverages the concept of ensemble learning, specifically bagging, to build robust and accurate models. We've also understood the importance of key hyperparameters like the number of trees and maximum depth, and how they influence the model's performance. Through our Python implementation, we've seen how to apply these concepts in practice: how to train a Random Forest model, make predictions, and evaluate the model's performance using metrics like Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

By reading this post, you've not only gained a deeper understanding of Random Forest for regression, but also acquired practical skills that you can apply to your own machine learning projects. Remember, the key to successful machine learning is understanding the data, the algorithm, and the problem at hand.

