
Thursday, November 16, 2023

Representation of a Color Image

To appreciate how color images are represented mathematically, we first need to understand grayscale images. Imagine an image as a function $I(x,y)$: this function takes two inputs, the spatial coordinates $(x,y)$, and returns the intensity or gray level at those coordinates. When we deal with digital images, both the coordinates and the intensity values are finite and discrete. Digital images are made up of pixels (also known as image elements or pels), where each pixel holds a specific location and an intensity value. In essence, the function $I(x,y)$ gives the gray value of the image at each pixel, painting a picture in varying shades of gray.
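As a minimal sketch (assuming NumPy is available), a grayscale digital image can be stored as a two-dimensional array, where indexing the array plays the role of evaluating $I(x,y)$:

# A minimal sketch: a 4x4 grayscale image as a 2D array of intensities (0-255)
import numpy as np

I = np.array([[  0,  64, 128, 255],
              [ 32,  96, 160, 224],
              [ 16,  80, 144, 208],
              [  8,  72, 136, 200]], dtype=np.uint8)

# The gray level stored at row 1, column 2
print(I[1, 2])  # 160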



Delving into Color Spaces

A color space can be visualized as a three-dimensional geometric realm in which every possible color perception fits neatly. This concept is crucial for creating a system that encompasses all possible color combinations. Historically, the set of color perceptions, known as a color solid, was envisioned in simple geometric forms like pyramids or cones. The specific shape of a color solid is determined by how the spatial axes are defined and divided.

In a similar vein to grayscale images, color images use a multi-variable function $I$, but with an important difference: this function maps

\begin{equation*} I\!: \: \mathbb{R}^2 \! \rightarrow \mathbb{R}^3\end{equation*}

Meaning, unlike the grayscale function, the output here is a three-dimensional vector. This is because color spaces themselves are three-dimensional. The function $I$ is defined for the image size, meaning it is confined to the dimensions of the rows and columns of the image. Several color spaces have been developed for different applications, such as printing, computer graphics, or even how we perceive color. Notable examples include RGB, CMY, CMYK, HSI, and HSV.
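Following the same idea, a color image can be sketched as a three-dimensional array of shape (rows, columns, 3), so that every pixel holds a three-component vector. The example below is a hypothetical illustration using NumPy:

import numpy as np

# A 2x2 color image: each pixel stores a 3-component vector (one value per channel)
img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = [255, 0, 0]      # a pure red pixel (in RGB)
img[0, 1] = [0, 255, 0]      # a pure green pixel
img[1, 0] = [0, 0, 255]      # a pure blue pixel
img[1, 1] = [128, 128, 128]  # a gray pixel (equal components)

print(img[0, 0])  # the color vector at pixel (0, 0)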


RGB

This is the most commonly used space in computer technology. This space is based on the mixture of the three primary colors Red, Green, and Blue. These primary colors are the reference colors for most image sensors. In this space, the origin (zero vector) represents the color black, and the standard values of the three colors range from 0 to 255 (8-bit channels).

Experimental evidence has established that the 6-7 million cones (the eye sensors responsible for color vision) can be divided into three main detection categories roughly corresponding to red, green, and blue. About 65% of all cones are sensitive to red light, 33% to green light, and around 2% to blue light; however, the blue cones are the most sensitive.


The image below represents the color solid corresponding to the RGB space. The axes (or dimensions) of this solid are Red, Green, and Blue; with these three dimensions we can define any color. In this solid, the diagonal line connecting Black (the zero vector) and White (the maximum value of each color channel) represents the gray shades, where the values of the three components are equal to each other. This means that the grayscale can be understood as a subspace of the RGB space.


CMY

This color space is usually used in color printing and uses Cyan (C), Magenta (M), and Yellow (Y) as its base. Since these colors are the complements of red, green, and blue, a transformation can be made from a vector in RGB space to one in CMY in the following way:


\begin{equation*} \begin{bmatrix} C \\ M \\ Y \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} - \begin{bmatrix} R \\ G \\ B \end{bmatrix} \end{equation*}


Where all RGB colors are considered normalized, so the value '1' represents the maximum of these colors. This equation demonstrates that light reflected from a surface coated with pure Cyan does not contain Red (that is, $C = 1 - R$ in the equation), similarly, pure Magenta does not reflect Green, and pure Yellow does not reflect Blue.
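As a quick sketch of this transformation (assuming the channels are normalized to $[0, 1]$ and NumPy is available), the conversion is a simple element-wise subtraction:

import numpy as np

def rgb_to_cmy(rgb):
    # rgb is expected to be normalized to [0, 1]
    return 1.0 - np.asarray(rgb, dtype=float)

# Pure red (1, 0, 0) maps to (C, M, Y) = (0, 1, 1)
print(rgb_to_cmy([1.0, 0.0, 0.0]))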


CMYK

As explained above, when color printing is required, the CMY color space is often used. However, Cyan, Magenta, and Yellow inks are rarely pure colors, so combining them to print black results in a brownish color. Since black is widely used in printing, it was decided to add this color (denoted by the letter K) to the CMY color space to form the new CMYK space; black is added only in the proportions necessary to produce a true black color in prints. To transform from the CMY space to the CMYK space, the following formulas are used:


\begin{equation*} K = \min(C^\prime, M^\prime, Y^\prime) \end{equation*}

\begin{equation*} C = \frac{C^\prime - K}{1 - K}; \quad M = \frac{M^\prime - K}{1 - K}; \quad Y = \frac{Y^\prime - K}{1 - K} \end{equation*}


Where $C^\prime$, $M^\prime$ and $Y^\prime$ are the colors in CMY space, while $C$, $M$, $Y$ and $K$ are the colors in CMYK space; all colors are considered normalized.
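A minimal sketch of these formulas in plain Python (assuming normalized values, and guarding the pure-black case where $K = 1$ to avoid division by zero):

def cmy_to_cmyk(c, m, y):
    k = min(c, m, y)
    if k == 1.0:
        # Pure black: the chromatic components are undefined, so set them to zero
        return 0.0, 0.0, 0.0, 1.0
    return (c - k) / (1 - k), (m - k) / (1 - k), (y - k) / (1 - k), k

# Example: an arbitrary CMY triple
print(cmy_to_cmyk(0.2, 0.5, 0.8))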



Final Thoughts

The relevance of these color spaces transcends academic interest, profoundly influencing many aspects of our daily lives. From the vivid displays on our digital devices to the precision of color in print media, our interaction with and perception of digital images is fundamentally shaped by these color spaces. As technological advancements continue and our understanding of color theory deepens, the accuracy in color representation and processing in digital mediums grows increasingly significant. This knowledge is indispensable not only to professionals in fields like graphic design and printing but also enhances personal experiences in photography and visual arts.

In conclusion, the study of color spaces within digital image processing is a remarkable synthesis of science, technology, and artistic expression. It offers a dynamic field of study, rich with opportunities for innovation and creativity, captivating both professionals and enthusiasts alike in the vibrant and ever-progressing world of digital color.



Monday, October 2, 2023

The Five V's of Big Data

In the fascinating field of data analytics, we distinguish between two primary data categories: small data and big data. Small data is akin to a tranquil stream (orderly, manageable, and straightforward). In contrast, big data resembles a vast, turbulent ocean, daunting in its sheer scale and complexity. It is within this context that the Five V's of Big Data emerge as an invaluable framework, guiding us through this extensive data landscape.

In today's digital age, data is often described as the new oil, a resource so immensely valuable that its impact on our lives and businesses is both transformative and expansive. Central to understanding this data-driven revolution is the concept of the 'Five V's of Big Data': Volume, Velocity, Variety, Veracity, and Value. These five dimensions provide a framework for understanding the complexities and potential of big data.


In this post, we shall delve deeply into each of these V's, examining their implications and the manner in which they redefine our engagement with big data in the modern context.


Volume

The concept of Volume in the realm of big data is a testament to the immense and ever-growing quantities of data that our digital era generates. This characteristic is not merely about the substantial size of individual datasets but also encompasses the aggregation of countless smaller data elements amassed over time. To put this in perspective, consider the staggering rate at which data is created every minute: as of 2023, it's estimated that approximately 306 million emails are sent, over 500,000 comments are posted on Facebook, around 4.5 million videos are viewed on YouTube, and over 500 hours of content are uploaded every minute.


But, how much data are we talking about?

For small data, volumes typically span gigabytes or less. In contrast, big data often encompasses terabytes, petabytes, or even more extensive scales of storage.



For comparison, 100 MB can store a couple of encyclopedias, a DVD holds about 5 GB, 1 TB can accommodate around 300 hours of high-quality video, and CERN's Large Hadron Collider produces about 15 petabytes of data annually.


Challenges

The primary challenge with big data is storage; as data volume increases, so does the requisite storage space. Efficient storage and quick retrieval of such vast data volumes are essential for timely processing and obtaining results. This leads to additional challenges, including networking, bandwidth, and the costs associated with data storage.

Additional challenges arise when processing such large data: most existing analytical methods do not scale to these magnitudes in terms of memory and processing capacity, resulting in decreased performance.

As the volume escalates, scalability, performance, and cost become increasingly significant challenges; we need innovative solutions and advanced technologies to manage and utilize big data effectively.


Velocity

Velocity in big data refers to the astounding speed at which data is generated, coupled with the increasing necessity to store and analyze this data swiftly. A key objective in big data analytics is processing data in real-time, keeping pace with its generation rate. This allows for immediate applications, such as personalizing advertisements on web pages based on a user’s recent searches, viewing, and purchase history.

In our interconnected world, data flows at an unprecedented pace. Every click, swipe, and interaction online contributes to this data stream, requiring capture and analysis in real-time.


Why is Velocity important?

The ability of a business to leverage data as it is generated, or to analyze it at the required speed, is crucial. Delayed processing can lead to missed opportunities and potential profit losses. The financial markets are a prime example, where even milliseconds can equate to significant financial gains or losses. Similarly, social media platforms manage millions of updates and transactions every minute, necessitating rapid processing and analytics.

In today's dynamic environment, customer conditions, preferences, and interests change rapidly. Hence, the most recent information becomes a pivotal tool for successful analysis. If the information is outdated, its accuracy becomes less relevant.

Adapting to the velocity of big data and analyzing it as it is produced can also enhance the quality of life. For instance, sensors and smart devices monitoring human health can detect abnormalities in real time, prompting immediate medical response. Similarly, predicting natural disasters through sensor data can provide crucial lead time for evacuation and protective measures.


Real Time Processing vs Batch Processing

The contrast between Real-Time Processing and Batch Processing is significant in the context of data velocity. Real-Time Processing involves the immediate processing of data as it is generated, enabling instantaneous insights and actions. This method is vital in scenarios where timely responses are critical, such as in financial trading, emergency services, or online customer interactions.



On the other hand, Batch Processing involves collecting data over a period, then processing it in large batches at a scheduled time. This method is efficient for tasks that are not time-sensitive and can handle large volumes of data more economically. Batch Processing is common in scenarios like daily sales reports, monthly billing cycles, or data backup routines.



The choice between Real-Time and Batch Processing depends on the specific needs and constraints of the situation. While Real-Time Processing offers immediacy, it requires more resources and sophisticated technology. Batch Processing, though less resource-intensive, may not be suitable for scenarios where instant data analysis and response are essential.
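As a toy illustration of the contrast (plain Python, not tied to any particular streaming or batch framework), a real-time style pipeline updates a running statistic as each record arrives, while a batch style pipeline accumulates the records and processes them all at once:

# Real-time style: update a running average record by record
count, running_avg = 0, 0.0
for value in [12.0, 15.5, 9.8, 11.2]:  # stand-in for an incoming data stream
    count += 1
    running_avg += (value - running_avg) / count
    # an action could be triggered here as soon as each value arrives

# Batch style: collect everything first, then process on a schedule
batch = [12.0, 15.5, 9.8, 11.2]
batch_avg = sum(batch) / len(batch)

print(running_avg, batch_avg)  # same average, obtained at different moments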


Variety

In the world of big data, Variety refers to the diverse range of data types and the sources from which data originates. This diversity includes structured data such as numbers and dates in databases; unstructured data like text, images, and videos; and semi-structured data exemplified by JSON and XML files. A multitude of sources contribute to this variety, ranging from social media feeds and digital sensors to mobile applications and satellite imagery.


The Significance of Data Variety

The true essence of Variety lies in its capacity to provide comprehensive insights that are unattainable with a singular data type. For instance, in healthcare, amalgamating patient records with lifestyle data from wearable devices can lead to enhanced healthcare outcomes. In the retail sector, integrating transactional data with social media interactions can yield deeper understanding of consumer preferences.

Effectively managing data variety involves using advanced data management systems capable of handling different data formats. It also requires robust data integration tools and analytics platforms that can process and analyze diverse data types to extract meaningful insights.


Axes of Data Variety

The heterogeneity of data can be characterized along several dimensions, some of which are: Structural Variety, Media Variety, Semantic Variety, and Availability Variation.

  • Structural Variety refers to the differences in data representation. For example, an electrocardiogram (EKG) signal markedly differs from a newspaper article, and a satellite image of wildfires from NASA is distinct from tweets about the fire.
  • Media Variety pertains to the medium through which data is delivered. For instance, the audio and transcript of a speech present the same information in different media. Some data objects, like a news video, may encompass multiple media such as image sequences, audio, and synchronized captioned text.
  • Semantic Variety is exemplified by real-life examples where data interpretation varies. Age, for instance, might be represented numerically or categorically (infant, juvenile, adult). Different units of measure or varying assumptions about data conditions also contribute to semantic variety. For instance, income surveys from different groups might not be directly comparable without understanding the underlying population demographics.
  • Availability Variation encompasses the different forms in which data is available and accessible. Data may be real-time (like sensor data) or stored (such as patient records), and its accessibility can range from continuous (like traffic cameras) to intermittent (such as satellite data available only when the satellite passes over a specific region). These variations influence the operations and analyses that can be performed on the data, particularly when dealing with large volumes.


Veracity

In the context of big data, veracity pertains to the accuracy, trustworthiness, quality, and integrity of data. Related dimensions are sometimes included as well, such as validity and volatility, the latter referring to the lifetime of the data. Given the voluminous and diverse nature of data, ascertaining its credibility is essential.

The Critical Nature of Veracity

Data that is inaccurate or of inferior quality can lead to erroneous conclusions and poor decision-making. This challenge is particularly pronounced when the data comes from a wide array of unregulated and diverse sources. In critical sectors like finance and healthcare, where decisions based on data have significant implications, maintaining data veracity is not just important, it's essential.

For example, in the financial sector, inaccurate data can result in misguided investment strategies or flawed risk assessments. In the healthcare industry, incorrect patient information can have dire consequences, leading to wrong diagnoses or ineffective treatment plans. Therefore, ensuring high-quality data, verifying the authenticity of data sources, and employing techniques to cleanse and correct data are of paramount importance.



Strategies for Maintaining Data Veracity

To uphold data veracity, organizations must develop and adhere to rigorous data governance policies. This involves implementing comprehensive data validation processes that span the entire data lifecycle, from collection to analysis.

  • Data Cleaning Tools: Utilizing sophisticated data cleaning tools is crucial for identifying and rectifying errors, inconsistencies, outliers and duplications in data sets.
  • Source Authentication: Verifying the legitimacy of data sources is essential, especially when dealing with data from external or less controlled environments.
  • Advanced Analytics and AI: Leveraging advanced analytics and artificial intelligence can aid in automating the process of data verification and cleansing, thereby enhancing the overall quality of the data.
  • Quality Assessments: Regular quality assessments of data help in maintaining its accuracy and reliability. This includes periodic audits and validation checks to ensure the data remains relevant and credible.

By prioritizing and effectively managing data veracity, organizations can make well-informed decisions, mitigate risks, and maintain trust in their data-driven initiatives.


Value

In the context of big data, "value" revolves around deriving pertinent and actionable insights from extensive datasets. It represents the ultimate objective of big data initiatives, transforming data into invaluable information that can guide decision-making and strategic planning.


The Essential Role of Value

The true merit of big data lies not in its immense volume or intricate complexity, but in its practical utility. Data, irrespective of its amount or variety, holds limited value if it cannot be harnessed to enhance informed decision-making. In the business world, for instance, data analytics can uncover market trends, predict customer behavior, and shape key strategic decisions.

In the marketing domain, big data analytics plays a critical role in identifying consumer preferences, leading to customized marketing approaches. In healthcare, analyzing data can bring forth breakthroughs in personalized medicine and improve patient care practices.


Strategies for Unlocking Value from Big Data

Realizing value from big data involves a strategic approach that goes beyond merely employing advanced technology and analytics. It requires:

  • Setting Clear Objectives: Organizations must define what they seek to achieve with their data. Clear objectives guide the analytical process and ensure that the insights gained are relevant and actionable.
  • Comprehensive Data Understanding: It is crucial for organizations to have a thorough understanding of their data. This includes knowledge of the data sources, the quality of data, and the context in which the data was collected.
  • Employing Suitable Analytics Tools: Utilizing the right analytics tools and methodologies is key to effectively processing and analyzing big data. This includes advanced statistical methods, machine learning algorithms, and data visualization techniques.
  • Role of Data Scientists and Analysts: Professionals like data scientists and analysts are integral in the process of translating data into meaningful insights. Their expertise in data interpretation, trend analysis, and predictive modeling is vital for extracting value from big data.
  • Actionable Insights: The ultimate goal is to translate data into actionable insights. This means converting the findings of data analysis into concrete actions or decisions that can positively impact the organization.
  • Continuous Improvement and Adaptation: As the business environment and technologies evolve, so should the approaches to data analytics. Continuous learning and adaptation are necessary to keep deriving value from big data.

By focusing on these key aspects, organizations can ensure that their big data initiatives are not just about collecting and storing data, but about deriving meaningful insights that can lead to tangible benefits and informed decisions.


Concluding Thoughts

In conclusion, the Five V's of Big Data described above are fundamental concepts that collectively provide a vital framework for navigating the complex world of big data. Volume challenges us to effectively manage and process vast quantities of data, while Velocity emphasizes the need for speed in data processing and analysis. Variety enriches our insights by incorporating diverse data types, and Veracity ensures the reliability and accuracy of our data. Finally, Value is the culmination of these efforts, focusing on extracting meaningful and actionable insights that can drive decision-making and innovation.

This exploration into the Five V's demonstrates their integral role in the realm of data analytics. They are not merely theoretical concepts but practical pillars that guide professionals in managing, analyzing, and deriving significant value from big data. As we continue to advance in the digital age, these principles will remain essential, helping us to harness the full potential of big data in a way that is both informed and forward-looking. For anyone involved in data science or analytics, the Five V's offer a roadmap to turning the vast expanse of data into a rich resource for strategic insights and informed decisions.




Saturday, August 12, 2023

Particle Swarm Optimization


The Concept of "Optimization"

Optimization is a fundamental aspect of many scientific and engineering disciplines. It involves finding the best solution from a set of possible solutions, often with the goal of minimizing or maximizing a particular function. Optimization algorithms are used in a wide range of applications, from training machine learning models to solving complex problems in many other fields.  

In logistics, optimization algorithms can be used to find the most efficient routes for delivery trucks, saving time and fuel. In finance, they can be used to optimize investment portfolios, balancing the trade-off between risk and return. In manufacturing, they can be used to optimize production schedules, maximizing efficiency and minimizing downtime. In energy production, they can be used to optimize the operation of power plants, reducing costs and emissions. The list goes on.


What is PSO?

One of the most famous and widely used optimization models is the Particle Swarm Optimization (PSO) algorithm. PSO is a population-based stochastic optimization technique inspired by the social behavior of bird flocking and fish schooling. It was developed by Dr. Eberhart and Dr. Kennedy in 1995, and since then, it has been applied to solve various complex optimization problems.

The PSO algorithm works by initializing a group of random solutions in the search space, known as particles. Each particle represents a potential solution to the optimization problem. These particles then search for the best solution by moving through the space with a velocity that is dynamically adjusted according to their own and their companions' results. A visual example is the following:

Consider a swarm of ants searching for the best place in their anthill to keep digging and expanding it; these ants are scattered throughout the space. At some instant, one ant finds a good place and warns the others, so they approach it and continue searching around that point. Then another ant finds a better place, warns the others, and they move toward it. This is repeated until the swarm finds the best possible place.



The beauty of PSO lies in its simplicity and its ability to efficiently solve complex optimization problems. It doesn't require gradient information, which makes it suitable for non-differentiable and discontinuous functions. Moreover, it's easy to implement and has few parameters to adjust, making it a practical choice for many optimization tasks.


Mathematical Description

The PSO algorithm consists of initializing a swarm of particles; each particle has a position vector $P$ and a velocity vector $V$. The position vector, denoted as $P_i$ (for the i-th particle), represents the current solution, while the velocity vector $V_i$ determines the direction and distance that the particle moves in the respective iteration. In addition to its current position and velocity, each particle remembers the best position it has ever encountered, denoted as $P_{bi}$ (personal best position). The best position among all particles in the swarm is also tracked, denoted as $G_b$ (global best position).

The PSO algorithm updates the position and velocity of each particle at each iteration. The velocity is computed based on the equation:


\begin{equation*}V_i^{t+1} = W \cdot V_i^t + C_1 U_1^t \otimes \left (  P_{bi}^t - P_i^t \right) +  C_2 U_2^t \otimes \left (  G_{b}^t - P_i^t \right) \end{equation*}


Where $W$ is a constant value known as the inertia weight, $C_1$ and $C_2$ are two constant values, and $U_1$ and $U_2$ are two random vectors with values in $[0, 1]$ and the same dimension as the position. It is important to note that the operator $\otimes$ represents a component-by-component (element-wise) multiplication.

The inertia weight  $W$ controls the impact of the previous velocity of a particle on its current velocity, the "personal component" $C_1 U_1^t \otimes \left (  P_{bi}^t - P_i^t \right)$ represents the particle's tendency to return to its personal best position, and the "social component" $C_2 U_2^t \otimes \left (  G_{b}^t - P_i^t \right)$ represents the particle's tendency to move towards the global best position.

To update the position of each particle we just follow the equation:


\begin{equation*}P_i^{t+1} = P_i^{t} + V_i^{t+1} \end{equation*}


We can describe the PSO algorithm as follows:



Python Implementation


Importing Libraries

Before we can implement the Particle Swarm Optimization (PSO) algorithm, we first need to import the necessary libraries. For this implementation, we use numpy for numerical computations, and matplotlib for data visualization.

# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Seed for numpy random number generator
np.random.seed(0)


Defining and Visualizing the Optimization Function

Before we can apply the PSO algorithm, we first need to define the function that we want to optimize. We select the Griewank function, a commonly used test function in optimization. The Griewank function is known for its numerous, regularly distributed local minima, which makes it a challenging optimization problem.

The Griewank function takes a 2-dimensional array as input and returns a scalar output. The function is composed of a sum of squares term and a cosine product term, which together create a complex landscape of local minima.

# Griewank function
def griewank(x):
    sum_sq = x[0]**2/4000 + x[1]**2/4000
    prod_cos = np.cos(x[0]/np.sqrt(1)) * np.cos(x[1]/np.sqrt(2))
    return 1 + sum_sq - prod_cos

To better understand the optimization problem, it's helpful to visualize the function that we're trying to optimize. We can do this by creating a grid of points within a specified range, calculating the value of the function at each point, and then plotting the function as a 3D surface.

# Defining the limits of the x and y axes
xlim = [-11, 11]
ylim = [-11, 11]
# Creating a grid of points within these limits
x = np.linspace(xlim[0], xlim[1], 500)
y = np.linspace(ylim[0], ylim[1], 500)
grid = np.meshgrid(x, y)
# Calculating the value of the Griewank function at each point in the grid
Z = griewank(grid)

# Creating a 3D figure
plt.rcParams['font.size'] = '16'
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.patch.set_alpha(0)

# Plotting the surface
surf = ax.plot_surface(grid[0],grid[1], Z, cmap='viridis', edgecolor='none')

# Configure style
fig.colorbar(surf, shrink=0.5, aspect=5)
ax.set_box_aspect([1, 1, 0.6])
ax.set_zticks([])
ax.grid(False)
plt.tight_layout()

# Add a title
plt.suptitle("Griewank Function", y=0.95, x=0.45, fontsize=24)

# Display the plot
plt.show()

The resulting plot gives us a visual representation of the optimization problem that we're trying to solve with the PSO algorithm.


We can notice that this function has many local minima distributed throughout the space, so it can be a complex challenge to find its minimum value.


Defining the Particle Class

The PSO algorithm operates on a swarm of particles, where each particle represents a potential solution to the optimization problem. To implement this in Python, we can define a Particle class that encapsulates the corresponding properties and behaviors. Here's the code to define the Particle class:

class Particle:
    def __init__(self, lim_range, no_dim):
        self.dim = no_dim
        # Initialize positions and velocities randomly
        self.pos = np.random.uniform(lim_range[0],lim_range[1],no_dim)
        self.vel = np.random.uniform(lim_range[0],lim_range[1],no_dim)
        # Set best position as the current one
        self.best_pos = np.copy(self.pos)
        # Set best value as the current one
        self.best_val = griewank(self.pos)

    def compute_velocity(self, global_best):
        # Update the particle's velocity based on its personal best and the global best
        t1 = 0.7298 * self.vel
        U1 = np.random.uniform(0, 1.49618, self.dim)
        t2 = U1 * (self.best_pos - self.pos)
        U2 = np.random.uniform(0, 1.49618, self.dim)
        t3 = U2 * (global_best - self.pos)
        self.vel = t1 + t2 + t3

    def compute_position(self):
        # Update the particle's position based on its velocity
        self.pos += self.vel

    def update_bests(self):
        new_val = griewank(self.pos)
        if new_val < self.best_val:
          # Update the particle's personal best position and value
          self.best_pos = np.copy(self.pos)
          self.best_val = new_val

We defined a Particle class with several methods:

The __init__ method initializes a particle with random positions and velocities within a specified range. It also sets the particle's best position and best value to its initial position and value.

The compute_velocity method updates the particle's velocity based on its current velocity, the difference between its best previous position and its current position, and the difference between the global best position and its current position. We defined the constants $W=0.7298$ and $C_1 = C_2 = 1.49618$.

The compute_position method updates the particle's position by adding its velocity to its current position.

The update_bests method updates the particle's best position and best value if its current position has a better value.


PSO Implementation

Now that we have defined the Particle class, we can use it to implement the PSO algorithm described earlier.

def PSO(no_part, it, limits, no_dim):
    # Initialization
    # Create a list to hold the swarm
    swarm = []
    # Fill the swarm with particles
    for i in range(no_part):
        swarm.append(Particle(limits, no_dim))
        # Set the first particle as the best in the swarm
        if i == 0:
            # Best values in the swarm
            G_best_position = np.copy(swarm[i].pos)
            G_best_value = swarm[i].best_val
        # Compare with the best found so far
        if i > 0 and swarm[i].best_val < G_best_value:
            # If this particle is better, update the global best
            G_best_value = swarm[i].best_val
            G_best_position = np.copy(swarm[i].pos)

    # Main loop
    for _ in range(it):
        for i in range(no_part):
            # Compute new velocity
            swarm[i].compute_velocity(G_best_position)
            # Compute new position
            swarm[i].compute_position()
            # Update best personal values
            swarm[i].update_bests()
            # Update the best global values of the swarm
            if swarm[i].best_val < G_best_value:
                G_best_position = np.copy(swarm[i].pos)
                G_best_value = swarm[i].best_val

    return G_best_position, G_best_value

The PSO function takes the number of particles, the number of iterations, the limits of the search space, and the number of dimensions as inputs. The function initializes a swarm of particles, then enters a loop where it updates the positions and velocities of the particles, evaluates the objective function at the new positions, and updates the personal best positions and the global best position. The function returns the global best position and its corresponding value at the end of the iterations.


Applying PSO algorithm

Now we can use the PSO function defined earlier to find the minimum of the Griewank function. We run the algorithm with 300 particles for 150 iterations, and we search in the range from -10 to 10 in both dimensions. We store the resulting best position and its corresponding value.

# Applying the PSO algorithm
best_pos, best_val = PSO(no_part = 300, it = 150, limits = [-10, 10], no_dim = 2)

To visualize the result, we plot the level curves of the Griewank function and mark the minimum found by the PSO algorithm as a red point.

# Creating the figure
fig, ax = plt.subplots(1,1, figsize=(8,6), facecolor='#F5F5F5')
fig.subplots_adjust(left=0.1, bottom=0.06, right=0.97, top=0.94)
plt.rcParams['font.size'] = '16'
ax.set_facecolor('#F5F5F5')

# Plotting of the Griewank function's level curves
ax.contour(grid[0], grid[1], Z,  levels=10, linewidths = 1, alpha=0.7)
# Plotting the minimum found by the PSO algorithm
ax.scatter(best_pos[0], best_pos[1], c = 'red', s=100)
ax.set_title("Griewank Level Curves")

# Display the figure
plt.show()

The resulting plot gives us a visual representation of the optimization problem and the solution found by the PSO algorithm.



The PSO algorithm found that the minimum value of the Griewank function is $\mathbf{0}$, located at the position $\mathbf{(0,0)}$. Below is a video showing the process of the PSO algorithm, visualizing all the particles at each iteration.



Conclusions

Through this post, we've gained a deeper understanding of the Particle Swarm Optimization algorithm and its application in optimization problems. We've explored the mathematical background of the algorithm, implemented it in Python, and applied it to find the minimum of the Griewank function, a commonly used benchmark function in optimization. The PSO algorithm has proven to be an effective method for solving the optimization problem, as evidenced by the results we obtained. The visualization of the Griewank function and the minimum found by the PSO algorithm provided a clear illustration of the algorithm's ability to navigate the search space and converge to a solution.

This exploration of the PSO algorithm serves as a foundation for further study and application of swarm intelligence algorithms in optimization. Whether you're a data scientist looking for a new optimization technique, a researcher studying swarm intelligence, or a curious reader interested in machine learning, I hope this post has provided valuable insights and sparked further interest in this fascinating area.



Wednesday, June 28, 2023

Random Forest Regression

Random Forest is a powerful machine learning algorithm that has gained significant popularity in both academia and industry due to its excellent performance, robustness, and ease of use. It's an ensemble learning method, which means that it combines the predictions of several base estimators built with a given learning algorithm to improve generalizability and robustness. In the case of Random Forest, the base estimators are decision trees.

The concept behind Random Forest is simple and effective. It operates by building a multitude of decision trees at training time and taking as output the mean prediction (for regression) or the majority class (for classification) of the individual trees. The model it creates is a set of decision trees, usually trained with the "bagging" method; the main idea is to inject randomness into the tree building so that each tree is different. Consequently, although some trees may be wrong, many others will be right, so as a group the trees are able to move in the correct direction.

The strength of Random Forest lies in its ability to overcome the overfitting problem, which is common in decision trees. While a single decision tree can create overly complex models that don't generalize well for new data, Random Forest mitigates this by employing multiple decision trees to build a more robust and generalizable model. It maintains a high level of accuracy and provides a measure of "interpretability" via feature importance estimation, which is not available in many other machine learning algorithms.



Learning Method

Random Forest is a perfect example of ensemble learning, a powerful concept in machine learning where multiple models, known as base learners, are trained to solve the same problem and combined to get better results. The main principle behind ensemble learning is that a group of weak learners can come together to form a strong learner: the models in the ensemble may not be strong predictors individually, but together they can provide a more accurate and stable prediction.

In the case of Random Forest, the base learners are decision trees. The algorithm introduces randomness into the model building process, which results in an ensemble of different models. Two key concepts in ensemble learning applied to Random Forest are Bagging and Feature Randomness.

Bagging: Also known as bootstrap aggregating, it's used to create different subsets of the original data by sampling with replacement. For each subset, a decision tree is built. This means that each tree in the forest is trained on a different set of data. This process decorrelates the trees, which means the errors of individual trees become less correlated with each other. When the predictions of these trees are averaged (in the case of regression), the variance of the final prediction is reduced.

Feature Randomness: In addition to bagging, Random Forest also uses a method called feature randomness. In traditional decision trees, when it is time to split a node, we consider every possible feature and choose the one that produces the most significant improvement in the objective function. Random Forest changes this by selecting a random subset of the features at each split point, which adds an extra layer of randomness to the model building process. This further decorrelates the trees and ultimately results in a more robust model.
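To make these two sources of randomness concrete, here is a small illustrative sketch (using NumPy) of how one tree's bootstrap sample and one split's candidate features could be drawn; the scikit-learn implementation used later handles all of this internally, so this is only a conceptual example:

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 10

# Bagging: sample row indices with replacement to form one tree's training set
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Feature randomness: at a split, consider only a random subset of the features
max_features = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=max_features, replace=False)

print(bootstrap_idx[:10], candidate_features)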

Another type of ensemble learning is the Boosting method, which is used by some of the most popular ensemble algorithms such as AdaBoost, Gradient Boosting, XGBoost, and LightGBM.

Boosting: It consists of a sequential process, where each model is trained to correct the mistakes made by the previous model. In the context of regression, these mistakes are the residuals or differences between the predicted and actual values. The final prediction is made by a weighted sum of the individual predictions of all models, where models that perform better have more weight. Boosting is effective at reducing both bias and variance, making it a powerful method for improving the accuracy of regression models.
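A toy sketch of this sequential residual-fitting idea for regression is shown below (using scikit-learn's DecisionTreeRegressor as the weak learner and an arbitrary learning rate; this is only an illustration, not a production boosting implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

prediction = np.zeros_like(y)
learning_rate = 0.1
trees = []
for _ in range(50):
    residuals = y - prediction             # mistakes of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # correct part of the mistakes
    trees.append(tree)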


Random Forest Hyperparameters

When building a Random Forest model for regression, there are several parameters that can be tuned to optimize the performance of the model. Two of the most important hyperparameters are Number of Trees and Max Depth.

Number of Trees: It  determines the number of estimators in the forest. Each tree is built independently and contributes equally to the final prediction, which is the mean of the predictions of all trees. Increasing the number of trees can make the model more robust and reduce variance, but beyond a certain point, the benefits in prediction performance may be outweighed by the computational cost. There is no definitive rule for choosing the optimal number of trees, as it depends on the specific problem and the trade-off between computational efficiency and model performance.

Max Depth: It determines the maximum depth of each tree. The depth of a tree is the maximum distance between the root node and any leaf node. A tree of depth k can model interactions involving up to k features. If the max depth is too high, the model may overfit the training data and not work well with unseen data. If it's too low, the model may be too simple to capture complex patterns in the data, leading to underfitting.

A visual example of the hyperparameter tuning process is shown in the following video:

Here I built a Random Forest Regression model varying the parameters Max Depth and Number of Trees, showing the regression result as well as the individual Decision Trees built for each case. We can appreciate how the shape of the Random Forest regression changes. When the model is underfitting, the regression result may appear overly simplified and unable to capture the complexity of the data. As we increase the max depth and the number of trees, the model becomes more complex and better able to fit the data. However, if we go too far, the model may start to overfit, capturing noise and outliers in the data.
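In practice, these two hyperparameters are often tuned together with a cross-validated grid search. The sketch below uses scikit-learn's GridSearchCV on the diabetes dataset used later in this post; the parameter ranges are arbitrary choices for illustration:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

# Candidate values for the two hyperparameters discussed above
param_grid = {"n_estimators": [50, 100, 200],
              "max_depth": [3, 5, 10, None]}

search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X, y)
print(search.best_params_)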


Python Implementation


Importing Libraries

The first step is importing the necessary libraries. For data manipulation we use pandas and numpy, for data visualization we use matplotlib, and we use several modules from sklearn for data loading, model building, and score calculation.

# Importing libraries
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


Data Loading

The next step is to load a dataset and split it into a training set and a test set. We're using the load_diabetes dataset from sklearn, which is a commonly used dataset for regression tasks. It consists of 442 patients with ten baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements. The target variable is a quantitative measure of disease progression one year after baseline. It's important to note that when the dataset is loaded, each feature has by default been mean centered and scaled by the standard deviation times the square root of the number of samples, so each column has unit norm.

# Loading the dataset
diabetes = load_diabetes(as_frame = True)
data = diabetes['frame']

# Displaying the first few rows of the dataset
data.head()

We're displaying the first few rows of the dataset using the head method. This gives us a quick overview of the data we're working with:


We separate the features (X) and the target variable (y) from the dataset. The features are all the columns except the last one, which is the target variable. We also split these into a training set and a test set using the train_test_split function, taking 80% of the data for training and 20% for testing.

# Separate the features and the target
X = data.iloc[:,0:10].to_numpy()
y = data["target"].to_numpy()

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)


Random Forest Training and Evaluation

Now that we've loaded and split the diabetes dataset, we can move on to apply the model. We create a Random Forest Regression model with 100 estimators (trees), train it on the training set and make predictions on the testing set.

# Creating the Random Forest Regression instance
RF = RandomForestRegressor(n_estimators=100, random_state=0)
# Training the model on the training data
RF.fit(X_train, y_train)
# Making predictions on the test set
y_pred = RF.predict(X_test)

Then we calculate the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to evaluate the model's performance.

# Computing the Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test, y_pred)
# Computing the Root Mean Squared Error (RMSE)
RMSE = mean_squared_error(y_test, y_pred)**0.5
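For reference, the two metrics computed above are defined as:

\begin{equation*} \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}{\left | y_i - \hat{y}_i \right |}; \quad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left ( y_i - \hat{y}_i \right )^2}} \end{equation*}

Where $y_i$ are the true target values, $\hat{y}_i$ the predictions, and $n$ the number of test samples.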

To better understand the performance of the model, we can visualize the error metrics using a bar chart. This gives us a clear, visual comparison of the two metrics.

# Creating the figure
fig, ax = plt.subplots(figsize=(9,6), facecolor='#F5F5F5')
plt.rcParams['font.size'] = '14'
ax.set_facecolor('#F5F5F5')

# Creating the bar for MAE
bar1 = ax.bar(0.25, MAE, color ='firebrick', width = 0.23)[0]
# Adding the value of the MAE on top of the bar
yval1 = bar1.get_height()
ax.text(bar1.get_x() + bar1.get_width()/2, yval1, round(yval1, 2), ha='center', va='bottom', fontsize = 16)

# Creating the bar for RMSE
bar2 = ax.bar(0.75, RMSE, color ='seagreen', width = 0.23)[0]
# Adding the value of the RMSE on top of the bar
yval2 = bar2.get_height()
ax.text(bar2.get_x() + bar2.get_width()/2, yval2, round(yval2, 2), ha='center', va='bottom', fontsize = 16)

# Adding ticks and title
ax.set_xticks([0.25,0.75], ['MAE', 'RMSE'], fontsize = 16)
ax.set_title('Metrics for Random Forest', fontsize = 18)
ax.set_ylim([0, 70])
ax.set_xlim([0, 1])

# Display the plot
plt.show()

The results are the following:


Conclusions

Through this exploration of the Random Forest algorithm for regression, we've gained valuable insights into this powerful machine learning technique. We've learned how it leverages the concept of ensemble learning, specifically bagging, to build robust and accurate models. We've also understood the importance of key parameters like the number of trees and maximum depth, and how they influence the model's performance. Through our practical implementation in Python, we've seen how to apply these concepts in practice. We've learned how to train a Random Forest model, make predictions, and evaluate the model's performance using metrics like Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

By reading this post, you've not only gained a deeper understanding of Random Forest for regression, but also acquired practical skills that you can apply to your own machine learning projects. Remember, the key to successful machine learning is understanding the data, the algorithm, and the problem at hand.



Saturday, June 10, 2023

Mathematical Formulation SVR

Support Vector Machines (SVMs), first introduced by Vapnik in 1995, represent one of the most impactful advancements in the realm of machine learning. Since their inception, SVMs have proven highly versatile, with successful implementations across a myriad of domains including text categorization, image analysis, speech recognition, time series forecasting, and even information security.

The core principle of SVMs is rooted in the concepts of statistical learning theory (SLT) and structural risk minimization (SRM). This foundation diverges from methods that are grounded in empirical risk minimization (ERM): while those methods focus on minimizing the error on the training dataset, SVMs aim to minimize an upper bound of the generalization error. This bound is a composite of the training error and a confidence term, resulting in more robust and reliable models.

Initially, SVMs were primarily developed to handle classification problems. However, the power and potential of SVMs extended beyond mere classification. This led to the introduction of Support Vector Machines for Regression (SVR), a significant variation on the model originally proposed by Vapnik. SVR holds true to the core principles of SVMs while adopting a new approach to apply them to regression problems.


Mathematical Formulation

Suppose we have data in the form $x = [x_1, x_2,  \ldots , x_n]$ and $y = [y_1, y_2,  \ldots , y_n]$ where $x_i \in \mathbb{R}^d$ are the input variables (features) and $y_i \in \mathbb{R}$ the output variables. The regression function of SVR is defined as:

\begin{equation} f(x) = w  \cdot x + b\end{equation}

Where $w$ is a weight vector and $b$ is a bias term; notice that this equation is the hyperplane equation in the formulation of SVMs. A graphic example of the regression function (in two dimensions) is:


We can map the input vector $x$ to a high dimensional space with a function $\phi(x)$, introducing this function in Eq. (1) we obtain: 

\begin{equation} f(x) = w  \cdot \phi(x) + b\end{equation}

This form of the regression function allows us to generate non-linear regressions, like:



As mentioned earlier, SVR minimizes bounds of the generalization error. This error term is handled in the constraints, where we set the absolute error less than or equal to a specified margin, called the maximum error, $\varepsilon$. We can tune $\varepsilon$ to gain the desired accuracy of our model. For this case, the objective function and constraints are:

\begin{equation} \begin{aligned} \min_{w} \quad & \frac{1}{2}\|w\|^2 \\ \textrm{s.t.} \quad & |y_i - (w \cdot x_i +b)| \leq \varepsilon \\ \end{aligned}\end{equation}

The first part is read as: minimize the objective function $\frac{1}{2}\|w\|^2$ with respect to the variable $w$, subject to the constraint $|y_i - (w \cdot x_i +b) | \leq \varepsilon$.

An illustrative example of this is:


Depending on the value of $\varepsilon$, this formulation may not work for all data points. The algorithm solves the objective function as well as possible, but some of the points can fall outside the margins. We need to account for the possibility of errors that are larger than $\varepsilon$. To handle this, we introduce the slack variables.

The slack variable $\xi$ is defined as follows: for any value (point) that falls outside of the $\varepsilon$ margin, we denote its deviation from the margin as $\xi$.



We aim to minimize these deviations $\xi$ as much as possible. Thus, we can add them to the objective function, along with an additional hyperparameter, the penalty factor $C$.

\begin{equation} \begin{aligned} \min_{w,\xi} \quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}{|\xi_{i}|} \\ \\ \textrm{s.t.} \quad & |y_i - w \cdot x_i - b| \leq \varepsilon + |\xi_{i}|\\ \end{aligned}\end{equation}

We can expand the terms that include $|\xi_{i}|$, considering $\xi_{i}$ as the upper deviations and $\xi_{i}^*$ as the lower deviations. Expanding these terms and including the function $\phi(x)$ previously seen we obtain:

\begin{equation} \begin{aligned} \min_{w,\xi^{(*)}} \quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}{\left ( \xi_{i} + \xi_{i}^* \right )} \\ \\ \textrm{s.t.} \quad & y_i - w  \cdot \phi(x_i)  - b \leq \varepsilon + \xi_{i}^*\\ & w  \cdot \phi(x_i) + b - y_i \leq \varepsilon + \xi_{i}\\ & \xi^{(*)} \geq 0 \\ \end{aligned}\end{equation}

Where $\xi^{(*)} = \left(  \xi_1, \xi_{1}^{*},  \ldots ,  \xi_n, \xi_{n}^{*} \right) $ is the slack variable. As $C$ increases, the penalty for points falling outside of the $\varepsilon$ margin also increases, so the tolerance for such deviations decreases. To derive the dual problem from this optimization (minimization) problem, the Lagrange function is introduced:

\begin{equation} \begin{aligned} L = & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}{\left ( \xi_{i} + \xi_{i}^* \right )} - \sum_{i=1}^{n}{\left ( \eta_{i}\xi_{i} + \eta_{i}^*\xi_{i}^* \right )} \\  & - \sum_{i=1}^{n}{\alpha_{i} \left ( \varepsilon + \xi_{i} + y_{i} - w  \cdot \phi(x_i)  - b \right )} \\ & - \sum_{i=1}^{n}{\alpha_{i}^{*} \left ( \varepsilon + \xi_{i}^{*} - y_{i} + w  \cdot \phi(x_i) + b \right )} \\  \end{aligned}\end{equation}

Where $L$ is the Lagrangian, $\alpha^{(*)} = \left(  \alpha_1, \alpha_{1}^{*},  \ldots ,  \alpha_n, \alpha_{n}^{*} \right) $ and $\eta^{(*)} = \left(  \eta_1, \eta_{1}^{*},  \ldots ,  \eta_n, \eta_{n}^{*} \right) $ are the Lagrange multipliers. The optimal conditions occur when:

\begin{equation} \begin{aligned} & \frac{\partial{L}}{\partial{w}} = 0 \; \Rightarrow \; w =  \sum_{i=1}^{n}{\left ( \alpha_{i}^* - \alpha_{i} \right ) \phi(x_i)  } \\ \\ & \frac{\partial{L}}{\partial{b}} = 0 \;  \Rightarrow \;  \sum_{i=1}^{n}{\left ( \alpha_{i}^* - \alpha_{i} \right )} = 0 \\  \\ & \frac{\partial{L}}{\partial{\xi_i}} = 0 \; \Rightarrow \;  C - \alpha_{i} - \eta_{i} = 0 \\ \\ & \frac{\partial{L}}{\partial{\xi_{i}^{*}}} = 0 \;  \Rightarrow \;  C - \alpha_{i}^{*} - \eta_{i}^{*} = 0 \\ \end{aligned}\end{equation}

Combining Eq. (6) and (7) we get the dual optimization problem: 

\begin{equation} \begin{aligned} \min_{\alpha^{(*)}} \quad & \frac{1}{2} \sum_{i, j =1}^{n}{( \alpha_{i}^* - \alpha_{i}) ( \alpha_{j}^* - \alpha_{j} ) \phi(x_i) \phi(x_j)} \\ & + \varepsilon \sum_{i=1}^{n}{( \alpha_{i}^* + \alpha_{i} )} - \sum_{i=1}^{n}{y_i(\alpha_{i}^* - \alpha_{i})}\\ \\ \textrm{s.t.} \quad & \sum_{i=1}^{n}{( \alpha_{i}^* - \alpha_{i})} = 0 \\ & 0 \leq \alpha^{(*)} \leq C \\ \end{aligned}\end{equation}

Solving this optimization problem, in other words, obtaining the values of the Lagrange Multipliers $\alpha^{(*)}$ that minimize the objective function, the regression function of SVR can be described as:

\begin{equation} \begin{aligned} f(x) & = \; w  \cdot \phi(x) + b \\  & = \;  \sum_{i=1}^{n}{( \alpha_{i}^* - \alpha_{i}) [\phi(x_i)  \cdot \phi(x) ]} + b \\ & = \;  \sum_{i=1}^{n}{( \alpha_{i}^* - \alpha_{i}) K(x_i, x)} + b \\ \end{aligned}\end{equation}

Where $K(x_i, x)$ is known as Kernel Function.
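For reference, some commonly used kernel functions are the linear, polynomial, and Radial Basis Function (RBF) kernels (here $c$, $d$ and $\gamma$ denote kernel hyperparameters):

\begin{equation*} K(x_i, x) = x_i \cdot x; \qquad K(x_i, x) = \left ( x_i \cdot x + c \right )^d; \qquad K(x_i, x) = e^{-\gamma \left \| x_i - x \right \|^2} \end{equation*}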


Importance of Hyperparameter Tuning

We have seen the mathematical meaning of the hyperparameters $C$ and $\varepsilon$, and how they are introduced in the mathematical formulation of SVR as factors for model tuning. These hyperparameters control the accuracy and the generalizability of the model. For a better understanding, let's look at a visual example of how these parameters affect the model performance.

In this video, we can notice that to obtain a good regression we need to find the optimal values for both $C$ and $\varepsilon$. It isn't enough to vary one parameter; to get a robust solution we must optimize both. We also need to choose the right kernel depending on the data distribution; in the video above we selected a Radial Basis Function (RBF) kernel, which is a popular choice for many problems, as it can capture complex and non-linear relationships.
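As a minimal sketch of how these hyperparameters appear in practice (using scikit-learn's SVR class on synthetic data; the values of $C$, $\varepsilon$ and the kernel below are arbitrary and would normally be chosen by a search such as the one illustrated in the video):

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# C penalizes deviations larger than epsilon, epsilon sets the insensitive tube,
# and the RBF kernel allows a non-linear regression function
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale")
model.fit(X, y)
print(model.predict([[0.5]]))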



Sunday, May 28, 2023

Support Vector Machines for Regression

Support Vector Machines (SVM) is a powerful and versatile machine learning algorithm that is widely used in both Classification and Regression tasks. Originally developed for binary classification, it operates on the principle of finding the hyperplane that best separates the data into two classes. The "best" hyperplane is defined as the one that maximizes the margin, which is the distance between the hyperplane and the nearest data points from each class. This approach helps to ensure that the model generalizes well to unseen data, reducing the risk of overfitting.

Over time, SVM has been extended to handle multi-class classification and regression problems. In these cases, the algorithm seeks to find the hyperplane (or set of hyperplanes) that best separates the data into multiple classes or best fits the data points in the case of regression. This versatility has led to SVM being applied across a wide range of domains, from image recognition to natural language processing, from stock market prediction to bioinformatics.

While SVM is traditionally associated with classification tasks, it can be adapted for regression tasks in a framework known as Support Vector Regression (SVR). In this post, I focus on the use of SVM for regression, exploring how it can be used to model complex, non-linear relationships in data.

SVR offers several advantages. It can handle high-dimensional data, making it suitable for problems where the number of features is large. It also offers flexibility in modeling different types of relationships through the use of different kernel functions. This means that it can capture both linear and non-linear relationships between the input features and the target variable. 

However, SVR also has its limitations. It can be computationally intensive for large datasets, which can limit its scalability. It also requires careful selection of parameters, such as the regularization factor C and the kernel parameters. Incorrect parameter selection can lead to poor model performance.


Regression with SVM

In the context of regression, SVM is adapted to find a function that has at most $\varepsilon$ deviation from the actually obtained targets for all the training data, and at the same time is as flat as possible. This is known as $\varepsilon$-insensitive loss, which essentially means that errors within a certain margin ($\varepsilon$) are ignored. This approach makes SVM robust to outliers and allows it to find a function that generalizes well to unseen data.
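
Written as a formula, this loss only penalizes the portion of the error that exceeds $\varepsilon$:

\begin{equation} L_{\varepsilon}\big(y, f(x)\big) = \max\big(0, \; |y - f(x)| - \varepsilon \big) \end{equation}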

A key concept in SVM, both for classification and regression, is the use of kernel functions. Kernel functions are used to transform the input data into a higher-dimensional space where it becomes easier to find a hyperplane that separates the data. This is particularly useful when dealing with non-linear relationships in the data, as it allows SVM to capture these relationships without explicitly computing the coordinates of the data in the high-dimensional space.
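
To make this idea concrete, the short check below uses illustrative 2-D vectors and an explicit degree-2 feature map (both chosen purely for demonstration) to verify that a polynomial kernel evaluated in the input space equals a dot product in the mapped space, so the mapping itself never needs to be computed.

# Numerical check of the kernel trick with a degree-2 polynomial kernel
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

def phi(v):
    # Explicit degree-2 feature map for a 2-D input (includes the interaction term)
    return np.array([v[0]**2, v[1]**2, np.sqrt(2)*v[0]*v[1],
                     np.sqrt(2)*v[0], np.sqrt(2)*v[1], 1.0])

kernel_value = (x @ z + 1)**2      # computed directly in the input space
explicit_value = phi(x) @ phi(z)   # computed through the explicit mapping

print(np.isclose(kernel_value, explicit_value))  # True: both equal 25.0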

There are several types of kernel functions commonly used in SVM, including linear, polynomial, and Radial Basis Function (RBF). The linear kernel is the simplest and is used when the data is linearly separable. The polynomial kernel allows SVM to capture polynomial relationships in the data. The RBF kernel is a popular choice for many problems, as it can capture complex, non-linear relationships.
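
In a common parameterization (with kernel parameters $\gamma$, $r$ and degree $d$), these kernels can be written as:

\begin{equation} \begin{aligned} K_{\text{linear}}(x, x') & = \; x \cdot x' \\ K_{\text{poly}}(x, x') & = \; ( \gamma \, x \cdot x' + r )^d \\ K_{\text{RBF}}(x, x') & = \; \exp\!\left( -\gamma \, \| x - x' \|^2 \right) \\ \end{aligned}\end{equation}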

The regularization parameter C is a crucial component of SVM. It determines the trade-off between allowing the model to increase its complexity (and potentially overfit the data) and keeping it simple (and potentially underfitting the data). A high value of C encourages the model to fit the training data as closely as possible, while a low value encourages a simpler model. Refer to this link for a complete mathematical formulation of SVR.
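
For context, in the commonly used soft-margin (primal) formulation of SVR, $C$ weights the slack variables $\xi_i, \xi_i^*$ that measure how far each point falls outside the $\varepsilon$-tube:

\begin{equation} \begin{aligned} \min_{w, b, \xi^{(*)}} \quad & \frac{1}{2} \| w \|^2 + C \sum_{i=1}^{n}{( \xi_i + \xi_i^* )} \\ \textrm{s.t.} \quad & y_i - w \cdot \phi(x_i) - b \leq \varepsilon + \xi_i \\ & w \cdot \phi(x_i) + b - y_i \leq \varepsilon + \xi_i^* \\ & \xi_i, \, \xi_i^* \geq 0 \\ \end{aligned}\end{equation}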


Python Implementation


Importing Libraries

The first step in implementing Support Vector Regression (SVR) in Python is importing the necessary libraries. For data manipulation and numerical operations we use pandas and numpy, respectively; for data visualization we use seaborn and matplotlib; and we import several modules from sklearn for preprocessing, model building, and metric calculation.

# Importing libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error


Data Loading and Preprocessing

In this section, we load the diamonds dataset from the seaborn library and perform the preprocessing steps. This dataset contains information about a large number of diamonds, including their cut, color, clarity, and price.

First, we load the dataset and show a sample of it.

# Loading the diamonds dataset
diamonds = sns.load_dataset("diamonds")
# Displaying a sample of the dataset
diamonds.sample(5, random_state = 6)

This sample is shown below:


The diamonds dataset contains both numerical and categorical variables. The categorical variables are cut, color, and clarity; these must be encoded into numerical values before we can use them to train the SVR model. We identify the category values of these features and list them in order of importance.

# Identifying the categories of each feature
# We reversed the categories to maintain the order of importance
categories_cut = diamonds['cut'].cat.categories[::-1].tolist()
categories_color = diamonds['color'].cat.categories[::-1].tolist()
categories_clarity = diamonds['clarity'].cat.categories[::-1].tolist()

# Array containing all the categories
categories = [categories_cut, categories_color, categories_clarity]

Then, we create an instance of the OrdinalEncoder class, passing the categories to maintain their order of importance. We fit and transform the labels in the 'cut', 'color', and 'clarity' columns.

# Creating an instance of the Ordinal Encoder
# We pass the categories to maintain the order of importance
ordinal_encoder = OrdinalEncoder(categories=categories)

# Creating a copy of the dataset
diamonds_encoded = diamonds.copy()
# Fitting and transforming the labels in the 'cut', 'color', and 'clarity' columns
diamonds_encoded[['cut', 'color', 'clarity']] = ordinal_encoder.fit_transform(diamonds_encoded[['cut', 'color', 'clarity']])

# Displaying a sample of the encoded dataset
diamonds_encoded.sample(5, random_state = 6)

The sample of the encoded dataset is shown below:


As we can see, the features 'cut', 'color' and 'clarity' now contain numeric values instead of categorical labels.


SVR Training and Evaluation

First, we separate the target variable (price) from the rest of the dataset. We then split the data into training and testing sets, using 70% of the data for training and 30% for testing.

y = diamonds_encoded['price'].to_numpy()
X = diamonds_encoded.drop(['price'], axis=1).to_numpy()

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

We define three SVR models with different regularization parameters (C = 5.0, 10.0, 15.0) to see how this parameter affects the model's performance; all of these models use an RBF kernel. We use a pipeline to first scale the data with StandardScaler and then apply the SVR model.

# Defining models
pipeline1 = Pipeline([('scaler', StandardScaler()),
                     ('svr', SVR(kernel = 'rbf', C = 5.0))])
pipeline2 = Pipeline([('scaler', StandardScaler()),
                     ('svr', SVR(kernel = 'rbf', C = 10.0))])
pipeline3 = Pipeline([('scaler', StandardScaler()),
                     ('svr', SVR(kernel = 'rbf', C = 15.0))])
# Grouping models
models = [pipeline1, pipeline2, pipeline3]

Finally, we fit these models on the training data and make predictions on the testing data. We evaluate the performance of the models using two metrics: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). RMSE is the square root of the average of the squared prediction errors, which penalizes large errors more heavily, while MAE gives the average magnitude of the errors in a set of predictions.

# Applying models and computing regression scores
RMSE = []
MAE = []
for pipeline in models:
  pipeline.fit(X_train,y_train)
  y_predict = pipeline.predict(X_test)
  # Computing the Root Mean Squared Error (RMSE)
  rmse = mean_squared_error(y_test, y_predict)**0.5
  RMSE.append(rmse)
  # Computing the Mean Absolute Error (MAE)
  mae = mean_absolute_error(y_test, y_predict)
  MAE.append(mae)
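
For reference, with $m$ test samples, true values $y_i$ and predictions $\hat{y}_i$, the two metrics computed above are defined as:

\begin{equation} \mathrm{RMSE} = \sqrt{ \frac{1}{m} \sum_{i=1}^{m}{ ( y_i - \hat{y}_i )^2 } } \, , \qquad \mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m}{ | y_i - \hat{y}_i | } \end{equation}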


Model Performance Visualization

In this section, we visualize the performance of the models using bar plots. This allows us to compare the RMSE and MAE for each model.

We create a pandas DataFrame to store the previous calculated metrics (RMSE and MAE).

# Creating a DataFrame to store the calculated metrics
df = pd.DataFrame(np.array([RMSE, MAE]).T, columns=['Root of Mean Squared Error', 'Mean Absolute Error'])

Then, we define a function to add the labels (values of the metrics) above each bar in the bar plot.

def addlabels(bars, ax):
  # Adding the value above each bar
  for bar in bars:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 2), ha='center', va='bottom', fontsize = 12)

Finally, we create a bar chart. We use different colors for each model to distinguish them. We also add a legend to indicate which color corresponds to which model.

plt.rcParams['font.size'] = '11'
fig, ax = plt.subplots(figsize=(9,6), facecolor='#F5F5F5')
ax.set_facecolor('#F5F5F5')

# Bar width
barWidth = 0.28
# Position of the bars
br1 = np.arange(2)
br2 = [x + barWidth for x in br1]
br3 = [x + barWidth for x in br2]

# Creating the bars for each model
bars = ax.bar(br1, df.loc[0], color ='firebrick', width = barWidth, label ='C = 5.0')
addlabels(bars, ax)

bars2 = ax.bar(br2, df.loc[1], color ='seagreen', width = barWidth, label ='C = 10.0')
addlabels(bars2, ax)

bars3 = ax.bar(br3, df.loc[2], color ='mediumblue', width = barWidth, label ='C = 15.0')
addlabels(bars3, ax)

# Adding ticks and title
ax.set_xticks([r + barWidth for r in range(len(df.loc[0]))], ['RMSE', 'MAE'])
ax.set_title('Metrics for Support Vector Regression', fontsize = 15)
ax.set_ylim([0, 1950])
ax.legend()
plt.show()

The results are the following:

This visualization provides a clear comparison of the performance of our models. It shows how the choice of the regularization parameter C affects the RMSE and MAE. The best-performing model was the one with C = 15.0, since it achieved the lowest RMSE and MAE.


Conclusion

In this post, we've explored and implemented Support Vector Machines (SVM) for regression tasks. We've seen how SVM can be a powerful tool for regression, capable of handling high-dimensional spaces and offering the flexibility of different kernel functions. We've also learned the importance of parameter tuning in machine learning, specifically the role of the regularization parameter C in SVM. Through hands-on coding, we've seen how changes in this parameter can significantly affect the performance of our model.

By working with the diamonds dataset, we've gained practical experience in applying SVM for regression in Python, from data preprocessing to model training and evaluation. This approach has provided us with valuable insights into the workings of SVM and its application in real-world problems.

