Sunday, May 28, 2023

Support Vector Machines for Regression

Support Vector Machines (SVM) are a powerful and versatile family of machine learning algorithms widely used in both classification and regression tasks. Originally developed for binary classification, the method operates on the principle of finding the hyperplane that best separates the data into two classes. The "best" hyperplane is defined as the one that maximizes the margin, which is the distance between the hyperplane and the nearest data points from each class. This approach helps to ensure that the model generalizes well to unseen data, reducing the risk of overfitting.

Over time, SVM has been extended to handle multi-class classification and regression problems. In these cases, the algorithm seeks to find the hyperplane (or set of hyperplanes) that best separates the data into multiple classes or best fits the data points in the case of regression. This versatility has led to SVM being applied across a wide range of domains, from image recognition to natural language processing, from stock market prediction to bioinformatics.

While SVM is traditionally associated with classification tasks, it can be adapted for regression tasks in a framework known as Support Vector Regression (SVR). In this post, I focus on the use of SVM for regression, exploring how it can be used to model complex, non-linear relationships in data.

SVR offers several advantages. It can handle high-dimensional data, making it suitable for problems where the number of features is large. It also offers flexibility in modeling different types of relationships through the use of different kernel functions. This means that it can capture both linear and non-linear relationships between the input features and the target variable. 

However, SVR also has its limitations. It can be computationally intensive for large datasets, which can limit its scalability. It also requires careful selection of parameters, such as the regularization parameter C and the kernel parameters. Incorrect parameter selection can lead to poor model performance.


Regression with SVM

In the context of regression, SVM is adapted to find a function that has at most $\varepsilon$ deviation from the actually obtained targets for all the training data, and at the same time is as flat as possible. This is known as $\varepsilon$-insensitive loss, which essentially means that errors within a certain margin ($\varepsilon$) are ignored. This approach makes SVM robust to outliers and allows it to find a function that generalizes well to unseen data.
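Concretely, using the standard definition, the $\varepsilon$-insensitive loss for a target $y$ and a prediction $f(x)$ can be written as

$$ L_{\varepsilon}\big(y, f(x)\big) = \max\big(0,\; |y - f(x)| - \varepsilon\big), $$

so predictions that deviate from the target by less than $\varepsilon$ incur no penalty at all.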

A key concept in SVM, both for classification and regression, is the use of kernel functions. Kernel functions are used to transform the input data into a higher-dimensional space where it becomes easier to find a hyperplane that separates the data. This is particularly useful when dealing with non-linear relationships in the data, as it allows SVM to capture these relationships without explicitly computing the coordinates of the data in the high-dimensional space.

There are several types of kernel functions commonly used in SVM, including linear, polynomial, and Radial Basis Function (RBF). The linear kernel is the simplest and is used when the data is linearly separable. The polynomial kernel allows SVM to capture polynomial relationships in the data. The RBF kernel is a popular choice for many problems, as it can capture complex, non-linear relationships.
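For reference, with the usual definitions (where $\gamma$, $r$ and $d$ are kernel hyperparameters), these kernels measure the similarity between two samples $x$ and $x'$ as

$$ K_{\text{linear}}(x, x') = x^{\top}x', \qquad K_{\text{poly}}(x, x') = (\gamma\, x^{\top}x' + r)^{d}, \qquad K_{\text{RBF}}(x, x') = \exp\!\big(-\gamma \lVert x - x' \rVert^{2}\big). $$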

The regularization parameter C is a crucial component of SVM. It determines the trade-off between allowing the model to increase its complexity (and potentially overfit the data) and keeping it simple (and potentially underfitting the data). A high value of C encourages the model to fit the training data as closely as possible, while a low value encourages a simpler model. Refer to this link for a complete mathematical formulation of SVR.
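As a brief sketch of the standard $\varepsilon$-SVR formulation, C multiplies the slack variables $\xi_i$ and $\xi_i^*$ that measure how far each training point lies outside the $\varepsilon$-tube:

$$ \min_{w,\, b,\, \xi,\, \xi^*} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) $$

subject to $y_i - \langle w, \phi(x_i) \rangle - b \le \varepsilon + \xi_i$, $\;\langle w, \phi(x_i) \rangle + b - y_i \le \varepsilon + \xi_i^*$ and $\xi_i, \xi_i^* \ge 0$, where $\phi$ is the feature map induced by the kernel. A larger C makes violations of the $\varepsilon$-tube more expensive, which is why it pushes the model toward fitting the training data more closely.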


Python Implementation


Importing Libraries

The first step in implementing Support Vector Regression (SVR) in Python is to import the necessary libraries. We use pandas and numpy for data manipulation and numerical operations, seaborn and matplotlib for data visualization, and several modules from sklearn for preprocessing, model building, and metric calculation.

# Importing libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error


Data Loading and Preprocessing

In this section, we load the diamonds dataset from the seaborn library and perform the preprocessing steps. This dataset contains information about a large number of diamonds, including their cut, color, clarity, and price.

First, we load the dataset and show a sample of it.

# Loading the diamonds dataset
diamonds = sns.load_dataset("diamonds")
# Displaying a sample of the dataset
diamonds.sample(5, random_state = 6)

This sample is shown below:


The diamonds dataset contains both numerical and categorical variables. The categorical variables are cut, color, and clarity; these must be encoded into numerical values before we can use them to train the SVR model. We identify the category values of these features and list them in order of importance.

# Identifying the categories of each feature
# We reverse the categories to maintain the order of importance
categories_cut = diamonds['cut'].cat.categories[::-1].tolist()
categories_color = diamonds['color'].cat.categories[::-1].tolist()
categories_clarity = diamonds['clarity'].cat.categories[::-1].tolist()

# Array containing all the categories
categories = [categories_cut, categories_color, categories_clarity]
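As an optional sanity check (a small sketch; the exact stored order depends on how seaborn builds the dataset), we can print the stored and reversed category lists for one of the features to confirm that the reversed list runs from least to most valuable:

# Optional check: compare the stored category order with the reversed order
print(diamonds['cut'].cat.categories.tolist())   # order as stored in the dataset
print(categories_cut)                            # reversed order used for encoding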

Then, we create an instance of the OrdinalEncoder class, passing the categories to maintain their order of importance. We fit and transform the labels in the 'cut', 'color', and 'clarity' columns.

# Creating an instance of the Ordinal Encoder
# We pass the categories to maintain the order of importance
ordinal_encoder = OrdinalEncoder(categories=categories)

# Creating a copy of the dataset
diamonds_encoded = diamonds.copy()
# Fitting and transforming the labels in the 'cut', 'color', and 'clarity' columns
diamonds_encoded[['cut', 'color', 'clarity']] = ordinal_encoder.fit_transform(diamonds_encoded[['cut', 'color', 'clarity']])

# Displaying a sample of the encoded dataset
diamonds_encoded.sample(5, random_state = 6)

The sample of the encoded dataset is shown below:


As we can see, the features 'cut', 'color' and 'clarity' now contain numeric values instead of the original text categories.


SVR Training and Evaluation

First, we separate the target variable (price) from the rest of the dataset. We then split the data into training and testing sets, using 70% of the data for training and 30% for testing.

y = diamonds_encoded['price'].to_numpy()
X = diamonds_encoded.drop(['price'], axis=1).to_numpy()

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

We define three SVR models with different regularization parameters (C = 5.0, 10.0, 15.0) to see how this parameter affects performance; all of them use the RBF kernel. We use a pipeline to first scale the data with StandardScaler and then apply the SVR model.

# Defining models
pipeline1 = Pipeline([('scaler', StandardScaler()),
                     ('svr', SVR(kernel = 'rbf', C = 5.0))])
pipeline2 = Pipeline([('scaler', StandardScaler()),
                     ('svr', SVR(kernel = 'rbf', C = 10.0))])
pipeline3 = Pipeline([('scaler', StandardScaler()),
                     ('svr', SVR(kernel = 'rbf', C = 15.0))])
# Grouping models
models = [pipeline1, pipeline2, pipeline3]

Finally, we fit these models on the training data and make predictions on the testing data. We evaluate the performance of the models using two metrics: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). RMSE is the square root of the average squared error, which penalizes large errors more heavily, while MAE gives the average magnitude of the errors in a set of predictions.
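For $n$ test samples with targets $y_i$ and predictions $\hat{y}_i$, these metrics are defined as

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert. $$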

# Applying models and computing regression scores
RMSE = []
MAE = []
for pipeline in models:
  pipeline.fit(X_train,y_train)
  y_predict = pipeline.predict(X_test)
  # Computing the Root Mean Squared Error (RMSE)
  rmse = mean_squared_error(y_test, y_predict)**0.5
  RMSE.append(rmse)
  # Computing the Mean Absolute Error (MAE)
  mae = mean_absolute_error(y_test, y_predict)
  MAE.append(mae)


Model Performance Visualization

In this section, we visualize the performance of the models using bar plots. This allows us to compare the RMSE and MAE for each model.

We create a pandas DataFrame to store the previously calculated metrics (RMSE and MAE).

# Creating a DataFrame to store the calculated metrics
df = pd.DataFrame(np.array([RMSE, MAE]).T, columns=['Root Mean Squared Error', 'Mean Absolute Error'])

Then, we define a function to add the labels (values of the metrics) above each bar in the bar plot.

def addlabels(bars, ax):
  # Adding the value above each bar
  for bar in bars:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 2), ha='center', va='bottom', fontsize = 12)

Finally, we create a bar chart. We use different colors for each model to distinguish them. We also add a legend to indicate which color corresponds to which model.

plt.rcParams['font.size'] = '11'
fig, ax = plt.subplots(figsize=(9,6), facecolor='#F5F5F5')
ax.set_facecolor('#F5F5F5')

# Bar width
barWidth = 0.28
# Position of the bars
br1 = np.arange(2)
br2 = [x + barWidth for x in br1]
br3 = [x + barWidth for x in br2]

# Creating the bars for each model
bars = ax.bar(br1, df.loc[0], color ='firebrick', width = barWidth, label ='C = 5.0')
addlabels(bars, ax)

bars2 = ax.bar(br2, df.loc[1], color ='seagreen', width = barWidth, label ='C = 10.0')
addlabels(bars2, ax)

bars3 = ax.bar(br3, df.loc[2], color ='mediumblue', width = barWidth, label ='C = 15.0')
addlabels(bars3, ax)

# Adding ticks and title
ax.set_xticks([r + barWidth for r in range(len(df.loc[0]))], ['RMSE', 'MAE'])
ax.set_title('Metrics for Support Vector Regression', fontsize = 15)
ax.set_ylim([0, 1950])
ax.legend()
plt.show()

The results are the following:

This visualization provides a clear comparison of the performance of our models. It shows how the choice of the regularization parameter C affects the RMSE and MAE. The model that obtained the best results was the one with C = 15.0, since its RMSE and MAE were the lowest.


Conclusion

In this post, we've explored and implemented Support Vector Machines (SVM) for regression tasks. We've seen how SVM can be a powerful tool for regression, capable of handling high-dimensional spaces and offering the flexibility of different kernel functions. We've also learned the importance of parameter tuning in machine learning, specifically the role of the regularization parameter C in SVM. Through hands-on coding, we've seen how changes in this parameter can significantly affect the performance of our model.

By working with the diamonds dataset, we've gained practical experience in applying SVM for regression in Python, from data preprocessing to model training and evaluation. This approach has provided us with valuable insights into the workings of SVM and its application in real-world problems.
