Sunday, May 28, 2023

Support Vector Machines for Regression

The Support Vector Machine (SVM) is a powerful and versatile machine learning algorithm that is widely used in both classification and regression tasks. Originally developed for binary classification, it operates on the principle of finding the hyperplane that best separates the data into two classes. The "best" hyperplane is defined as the one that maximizes the margin, which is the distance between the hyperplane and the nearest data points from each class. This approach helps to ensure that the model generalizes well to unseen data, reducing the risk of overfitting.

Over time, SVM has been extended to handle multi-class classification and regression problems. In these cases, the algorithm seeks to find the hyperplane (or set of hyperplanes) that best separates the data into multiple classes or best fits the data points in the case of regression. This versatility has led to SVM being applied across a wide range of domains, from image recognition to natural language processing, from stock market prediction to bioinformatics.

While SVM is traditionally associated with classification tasks, it can be adapted for regression tasks in a framework known as Support Vector Regression (SVR). In this post, I focus on the use of SVM for regression, exploring how it can be used to model complex, non-linear relationships in data.

SVR offers several advantages. It can handle high-dimensional data, making it suitable for problems where the number of features is large. It also offers flexibility in modeling different types of relationships through the use of different kernel functions. This means that it can capture both linear and non-linear relationships between the input features and the target variable. 

However, SVR also has its limitations. It can be computationally intensive for large datasets, which limits its scalability. It also requires careful selection of parameters, such as the regularization parameter C and the kernel parameters; incorrect parameter selection can lead to poor model performance.


Regression with SVM

In the context of regression, SVM is adapted to find a function that has at most $\varepsilon$ deviation from the actually obtained targets for all the training data, and at the same time is as flat as possible. This is known as $\varepsilon$-insensitive loss, which essentially means that errors within a certain margin ($\varepsilon$) are ignored. This approach makes SVM robust to outliers and allows it to find a function that generalizes well to unseen data.
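
For reference, the usual optimization problem behind linear $\varepsilon$-SVR is written below; this is the standard textbook formulation (kernelized SVR replaces the inner product $\langle w, x_i \rangle$ with a kernel function):

$$\min_{w,\,b,\,\xi,\,\xi^*} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right)$$

subject to

$$y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i, \qquad \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i,\, \xi_i^* \ge 0,$$

where the slack variables $\xi_i$ and $\xi_i^*$ measure how far each prediction falls outside the $\varepsilon$-tube, and the parameter C controls how strongly such deviations are penalized.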

A key concept in SVM, both for classification and regression, is the use of kernel functions. Kernel functions are used to transform the input data into a higher-dimensional space where it becomes easier to find a hyperplane that separates the data. This is particularly useful when dealing with non-linear relationships in the data, as it allows SVM to capture these relationships without explicitly computing the coordinates of the data in the high-dimensional space.

There are several types of kernel functions commonly used in SVM, including linear, polynomial, and Radial Basis Function (RBF). The linear kernel is the simplest and is used when the data is linearly separable. The polynomial kernel allows SVM to capture polynomial relationships in the data. The RBF kernel is a popular choice for many problems, as it can capture complex, non-linear relationships.
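
As a quick illustration, the kernel is selected through an argument of scikit-learn's SVR estimator. The snippet below is only a sketch, with arbitrary (untuned) parameter values:

# Illustrative only: choosing different kernels in scikit-learn's SVR
from sklearn.svm import SVR

svr_linear = SVR(kernel='linear', C=1.0, epsilon=0.1)           # linear relationships
svr_poly = SVR(kernel='poly', degree=3, C=1.0, epsilon=0.1)     # polynomial relationships
svr_rbf = SVR(kernel='rbf', gamma='scale', C=1.0, epsilon=0.1)  # complex, non-linear relationships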

The regularization parameter C is a crucial component of SVM. It determines the trade-off between allowing the model to increase its complexity (and potentially overfit the data) and keeping it simple (and potentially underfitting the data). A high value of C encourages the model to fit the training data as closely as possible, while a low value encourages a simpler model. Refer to this link for a complete mathematical formulation of SVR.


Python Implementation


Importing Libraries

The first step in implementing Support Vector Regression (SVR) in Python is importing the necessary libraries. We use pandas and numpy for data manipulation and numerical operations, seaborn and matplotlib for data visualization, and several modules from sklearn for preprocessing, model building, and score calculation.

# Importing libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error


Data Loading and Preprocessing

In this section, we load the diamonds dataset from the seaborn library and perform the preprocessing steps. This dataset contains information about a large number of diamonds, including their cut, color, clarity, and price.

First, we load the dataset and show a sample of it.

# Loading the diamonds dataset
diamonds = sns.load_dataset("diamonds")
# Displaying a sample of the dataset
diamonds.sample(5, random_state = 6)

This sample is shown below:


The diamonds dataset contains both numerical and categorical variables. The categorical variables are cut, color, and clarity; these must be encoded into numerical values before we can use them to train the SVR model. We identify the category values of these features and list them in order of importance.

# Identifying the categories of each feature
# We reversed the categories to maintain the order of importance
categories_cut = diamonds['cut'].cat.categories[::-1].tolist()
categories_color = diamonds['color'].cat.categories[::-1].tolist()
categories_clarity = diamonds['clarity'].cat.categories[::-1].tolist()

# Array containing all the categories
categories = [categories_cut, categories_color, categories_clarity]

Then, we create an instance of the OrdinalEncoder class, passing the categories to maintain their order of importance. We fit and transform the labels in the 'cut', 'color', and 'clarity' columns.

# Creating an instance of the Ordinal Encoder
# We pass the categories to maintain the order of importance
ordinal_encoder = OrdinalEncoder(categories=categories)

# Creating a copy of the dataset
diamonds_encoded = diamonds.copy()
# Fitting and transforming the labels in the 'cut', 'color', and 'clarity' columns
diamonds_encoded[['cut', 'color', 'clarity']] = ordinal_encoder.fit_transform(diamonds_encoded[['cut', 'color', 'clarity']])

# Displaying a sample of the encoded dataset
diamonds_encoded.sample(5, random_state = 6)

The sample of the encoded dataset is shown below:


As we can see, the features 'cut', 'color' and 'clarity' now only have numeric values instead of alphanumeric ones.


SVR Training and Evaluation

First, we separate the target variable (price) from the rest of the dataset. We then split the data into training and testing sets, using 70% of the data for training and 30% for testing.

y = diamonds_encoded['price'].to_numpy()
X = diamonds_encoded.drop(['price'], axis=1).to_numpy()

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

We define three SVR models with different regularization parameters (C = 5.0, 10.0, 15.0) to see how this parameter affects the model's performance; all of these models use the RBF kernel. We use a pipeline to first scale the data with StandardScaler and then apply the SVR model.

# Defining models
pipeline1 = Pipeline([('scaler', StandardScaler()),
                     ('svr', SVR(kernel = 'rbf', C = 5.0))])
pipeline2 = Pipeline([('scaler', StandardScaler()),
                     ('svr', SVR(kernel = 'rbf', C = 10.0))])
pipeline3 = Pipeline([('scaler', StandardScaler()),
                     ('svr', SVR(kernel = 'rbf', C = 15.0))])
# Grouping models
models = [pipeline1, pipeline2, pipeline3]

Finally, we fit these models on the training data and make predictions on the testing data. We evaluate the performance of the models using two metrics: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). RMSE is the square root of the average squared error, which penalizes large errors more heavily, while MAE gives the average magnitude of the errors in a set of predictions.
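
Written out explicitly, with $y_i$ the true prices, $\hat{y}_i$ the predicted prices and $n$ the number of test samples, these metrics are:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$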

# Applying models and computing regression scores
RMSE = []
MAE = []
for pipeline in models:
  pipeline.fit(X_train,y_train)
  y_predict = pipeline.predict(X_test)
  # Computing the Root Mean Squared Error (RMSE)
  rmse = mean_squared_error(y_test, y_predict)**0.5
  RMSE.append(rmse)
  # Computing the Mean Absolute Error (MAE)
  mae = mean_absolute_error(y_test, y_predict)
  MAE.append(mae)


Model Performance Visualization

In this section, we visualize the performance of the models using bar plots. This allows us to compare the RMSE and MAE for each model.

We create a pandas DataFrame to store the previously calculated metrics (RMSE and MAE).

# Creating a DataFrame to store the calculated metrics
df = pd.DataFrame(np.array([RMSE, MAE]).T, columns=['Root of Mean Squared Error', 'Mean Absolute Error'])

Then, we define a function to add the labels (values of the metrics) above each bar in the bar plot.

def addlabels(bars, ax):
  # Adding the value above each bar
  for bar in bars:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 2), ha='center', va='bottom', fontsize = 12)

Finally, we create a bar chart. We use different colors for each model to distinguish them. We also add a legend to indicate which color corresponds to which model.

plt.rcParams['font.size'] = '11'
fig, ax = plt.subplots(figsize=(9,6), facecolor='#F5F5F5')
ax.set_facecolor('#F5F5F5')

# Bar width
barWidth = 0.28
# Position of the bars
br1 = np.arange(2)
br2 = [x + barWidth for x in br1]
br3 = [x + barWidth for x in br2]

# Creating the bars for each model
bars = ax.bar(br1, df.loc[0], color ='firebrick', width = barWidth, label ='C = 5.0')
addlabels(bars, ax)

bars2 = ax.bar(br2, df.loc[1], color ='seagreen', width = barWidth, label ='C = 10.0')
addlabels(bars2, ax)

bars3 = ax.bar(br3, df.loc[2], color ='mediumblue', width = barWidth, label ='C = 15.0')
addlabels(bars3, ax)

# Adding ticks and title
ax.set_xticks([r + barWidth for r in range(len(df.loc[0]))], ['RMSE', 'MAE'])
ax.set_title('Metrics for Support Vector Regression', fontsize = 15)
ax.set_ylim([0, 1950])
ax.legend()
plt.show()

The results are the following:

This visualization provides a clear comparison of the performance of our models. It shows how the choice of the regularization parameter C affects the RMSE and MAE. The model that obtained the best results was the one with C = 15.0, since its RMSE and MAE were the lowest.


Conclusion

In this post, we've understood and implemented Support Vector Machines (SVM) for regression tasks. We've seen how SVM can be a powerful tool for regression, capable of handling high-dimensional spaces and offering the flexibility of different kernel functions. We've also learned the importance of parameter tuning in machine learning, specifically the role of the regularization parameter C in SVM. Through hands-on coding, we've seen how changes in this parameter can significantly affect the performance of our model.

By working with the diamonds dataset, we've gained practical experience in applying SVM for regression in Python, from data preprocessing to model training and evaluation. This approach has provided us with valuable insights into the workings of SVM and its application in real-world problems.


Saturday, May 13, 2023

Regression in Machine Learning

Regression is a supervised learning technique widely used in Machine Learning and Data Science. These techniques aim to predict a continuous outcome variable (y) based on the values of one or more predictor variables or features (X), which can be continuous and/or discrete. Regression methods generally work by finding the mathematical relationship between the features and the outcome. When dealing with multiple predictors, the regression model is often represented as a multi-dimensional plane (also called a hyperplane).

Unlike classification, which predicts a discrete category, regression quantifies the relationship between features, allowing us to predict a continuous outcome. Let's consider a practical example: If we want to predict the price of a house (a continuous variable), we might look at features such as size, location, number of rooms, construction materials, etc. A regression algorithm would find the best mathematical relationship between these features and the price of the house.
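
As a toy illustration of this idea, the sketch below fits a linear regression to a handful of made-up house records; the numbers and features are purely hypothetical and only meant to show the mechanics:

# Hypothetical example: predicting house price from size and number of rooms
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: each row is [size in square meters, number of rooms]
X = np.array([[50, 2], [80, 3], [120, 4], [200, 5]])
# Hypothetical prices
y = np.array([150000, 220000, 330000, 520000])

model = LinearRegression()
model.fit(X, y)
# Estimated price for a 100 square-meter house with 3 rooms
print(model.predict([[100, 3]]))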


Uses of Regression

Regression models have incredibly versatile applications and find utility across a spectrum of fields. Let's delve into some areas where regression models are essential:

  • Finance: Regression models serve as powerful tools in the field of finance, particularly in risk assessment and financial forecasting. For example, analysts can use them to predict a company's future earnings based on various financial indicators or to forecast future stock prices from historical data.

  • Economics: Regression models are crucial in the field of economics. Economists employ them to forecast various economic indicators and to measure the impact of changes in certain variables, like employment rates or interest rates. For instance, economists could use regression to understand how a change in government policy might affect the overall employment rate.

  • Healthcare: This is one of the fields where regression models shine and have valuable applications. They can predict patient outcomes based on various factors such as age, weight, genetic predispositions, lifestyle, etc. They can also be used to understand the impact of different treatment approaches on patient recovery or even disease progression. For example, regression models could be used to predict the progression of a disease like diabetes based on various patient-specific factors.

  • Marketing: In marketing, regression models are widely used for predictive tasks, such as predicting the success of marketing campaigns, forecasting sales, and understanding customer behavior. For instance, they can predict customer lifetime value based on purchase history, helping companies optimize their customer retention strategies in future campaigns.

Regression Algorithms

There are various types of regression algorithms, each with its strengths, weaknesses, and suitability for different types of problems. Some of them are listed below (a brief usage sketch follows the list):

  • Linear Regression: One of the simplest forms of regression. It assumes a linear relationship between the predictors and the outcome variable.
  • Polynomial Regression: This is an extension of linear regression, where the model is not restricted to a straight line and can fit polynomial curves to the data.
  • Ridge and Lasso Regression: These are regularization techniques that help prevent overfitting in regression models. They do this by adding a penalty term to the cost function, which reduces the magnitude of the model coefficients.
  • Decision Tree Regression: Decision trees can also be used for regression tasks, with the leaf nodes predicting continuous values.
  • Support Vector Regression: This is the regression version of the Support Vector Machines (SVM), a robust and widely used algorithm in both classification and regression tasks.
  • Random Forest and Gradient Boosting Regression: Ensemble techniques that combine predictions from multiple models to give a final prediction.
  • Neural Network Regression: Neural networks can model complex non-linear relationships. They are especially useful for high-dimensional and complex data, where traditional regression techniques might struggle.
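
As mentioned above, the sketch that follows shows that in scikit-learn several of these algorithms share the same fit/predict interface; the data is synthetic and the models use default hyperparameters, so this is only meant to illustrate the common API:

# Illustrative only: several regression algorithms behind the same fit/predict API
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Synthetic data: a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

for model in [LinearRegression(), Ridge(), DecisionTreeRegressor(), RandomForestRegressor(), SVR()]:
  model.fit(X, y)
  print(type(model).__name__, model.predict([[1.0]]))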


Regression models are a powerful tool in the hands of data scientists and analysts. They allow us to understand and quantify relationships between variables, make predictions about future data points, and even extrapolate trends outside the range of our current data. Being able to estimate the potential impact of different variables on an outcome can provide valuable insights for decision-making.

In future posts, I will dive deeper into these regression techniques, exploring how they work and implementing them using Python.



Monday, May 1, 2023

Data Encoding

In the field of data science, the quality and format of the data we work with are just as important as the algorithms and models we use. One of the crucial aspects of data preprocessing is Data Encoding, a process that transforms categorical data into a format that can be understood by machine learning algorithms.

Categorical data refers to variables that contain label values rather than numeric values. They are often divided into nominal (variables without order or priority) and ordinal (variables with some level of priority) categories. While this data can be insightful and necessary, many machine learning algorithms can only handle numerical inputs. This is where Data Encoding comes in. It converts these categorical variables into numerical values, allowing us to include this valuable information in the models we use.


Types of Data Encoding

There are several common types of Data Encoding, each with its own strengths and weaknesses. The choice of which to use often depends on the problem, the specific characteristics of the data and the requirements of the machine learning algorithm.


Nominal / One-Hot Encoding

One-Hot Encoding, also called nominal encoding, is a widely used technique for handling categorical variables. In this method, each category is converted into a new binary column (values 0 or 1). This results in a sparse matrix where each row represents an observation and each column represents a unique category of the feature. The value is 1 if the category is present for that observation and 0 if not.

One-Hot Encoding is particularly useful for nominal data where no ordinal relationship exists. It allows categorical data to be represented in a way that can be understood by machine learning algorithms. However, it can lead to high memory consumption and computational cost when the cardinality of the categorical variable is high, because it creates a new column for each unique category.

Now, let's see an example of how One-Hot Encoding can be implemented in Python.

First, we import the necessary libraries and create a pandas dataframe with two categorical features: Color and Shape.

# Importing libraries
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Creating a DataFrame with example data
data = {'Color': ['Red', 'Blue', 'Green', 'Yellow', 'Red', 'Green', 'Blue'],
        'Shape': ['Circular', 'Square', 'Square', 'Circular', 'Square', 'Circular', 'Square']}

df = pd.DataFrame(data)

# Displaying the DataFrame
print(df)

The created dataframe is:

Then, we define an instance of the OneHotEncoder class from the sklearn library. We apply this encoder to the dataframe using fit_transform; this method works by first learning the unique values of each feature (the fit part) and then transforming these values into a binary one-hot encoding (the transform part).

# Creating an instance of the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
# Fitting and transforming the data
encoded_data = encoder.fit_transform(df)

Finally, we create and display a new dataframe with the encoded data.

# Creating a DataFrame with the encoded data
df_encoded = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(df.columns))
# Displaying the encoded DataFrame
print(df_encoded)

The resulting encoded dataframe is shown below:



Label Encoding

Label Encoding is another popular technique for encoding categorical variables. In this method, each unique category value is assigned an integer value. These integers carry no intrinsic meaning; in scikit-learn's LabelEncoder, for example, they are assigned according to the sorted (alphabetical) order of the category values.

Label Encoding is simple and efficient, making it a good choice for ordinal data and for nominal data with only two categories (binary data). However, it can introduce a potential issue for nominal data with more than two categories. The algorithm might misinterpret the numerical values as having an ordinal relationship, which can lead to poor performance or unexpected results.

Let's examine an example of Label Encoding implementation in Python.

We import the libraries and load the Iris dataset using the load_dataset function from the seaborn library. This is a classic dataset in machine learning and statistics. It includes measurements for 150 iris flowers from three different species (setosa, versicolor and virginica). 

# Importing libraries
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

# Loading the iris dataset
iris = sns.load_dataset("iris")
# Displaying a sample of the dataset
iris.sample(5, random_state = 0)

A sample of the Iris dataset is shown below:

Then, we define an instance of the LabelEncoder class from the sklearn library. The fit_transform method is used to apply the encoder to the feature species.

# Creating an instance of the Label Encoder
label_encoder = LabelEncoder()
  
# Fitting and transforming the labels in column 'species'
iris_encoded = iris.copy()
iris_encoded['species']= label_encoder.fit_transform(iris_encoded['species'])

# Displaying a sample of the encoded dataset
iris_encoded.sample(5, random_state = 0)

The same sample of the Iris dataset, but with the feature species encoded, is:


Ordinal Encoding

Ordinal Encoding is a type of encoding that is similar to Label Encoding but with a key difference. In Ordinal Encoding, the assignment of integers to categories is not arbitrary. Instead, the integers are assigned in a way that respects the ordinal nature of the category.

For example, if a feature describes a rating as "low", "medium", or "high", ordinal encoding would assign the values 0, 1, and 2, respectively, preserving the order of the categories. This method is ideal for ordinal data, as it allows the model to understand the inherent order of the categories.

Let's explore an example of implementing Ordinal Encoding in Python.

We import the libraries and load the cut and price attributes from the Diamonds dataset using the load_dataset function from the seaborn library.

# Importing libraries
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Loading the attributes "cut" and "price" from the diamonds dataset
diamonds = sns.load_dataset("diamonds")[["cut", "price"]]
# Displaying a sample of the dataset
diamonds.sample(5, random_state = 6)

Below is a sample of the dataset used:

Then, we create an instance of the OrdinalEncoder class from the sklearn library, passing the categories of the cut feature as an argument. These categories are 'Fair', 'Good', 'Very Good', 'Premium' and 'Ideal', and the ordinal encoder assigns the numbers 0-4 to them in order of increasing quality. The fit_transform method applies the encoder to the feature cut.
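
As a minimal sketch of this step (using the OrdinalEncoder imported above and the column names of the diamonds dataset):

# Creating an instance of the Ordinal Encoder with the cut categories in order of quality
categories_cut = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
ordinal_encoder = OrdinalEncoder(categories=[categories_cut])

# Fitting and transforming the labels in the 'cut' column
diamonds_encoded = diamonds.copy()
diamonds_encoded[['cut']] = ordinal_encoder.fit_transform(diamonds_encoded[['cut']])

# Displaying a sample of the encoded dataset
diamonds_encoded.sample(5, random_state = 6)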

Below is the same sample of the dataset, but after applying the encoder:


Conclusion

In conclusion, data encoding is a crucial step in the data preprocessing pipeline, especially when dealing with categorical data. It allows us to transform non-numerical data into a format that can be understood and used by the machine learning algorithms.

In this post, we have explored several types of data encoding methods, including One-Hot Encoding, Label Encoding and Ordinal Encoding. Each of these methods has its own use cases and is appropriate in different circumstances. For example, One-Hot Encoding is useful when the categorical variables are nominal, while Label and Ordinal Encoding are beneficial when the categories have an inherent order.

Remember, the choice of encoding method can significantly impact the performance of the machine learning model. Therefore, it's essential to understand these methods and make an informed decision based on the nature of the data and the requirements of the model.


