Saturday, June 10, 2023

Mathematical Formulation of SVR

Support Vector Machines (SVMs), first introduced by Vapnik in 1995, represent one of the most impactful advancements in the realm of machine learning. Since their inception, SVMs have proven highly versatile, with successful implementations across a myriad of domains including text categorization, image analysis, speech recognition, time series forecasting, and even information security.

The core principle of SVMs is rooted in statistical learning theory (SLT) and structural risk minimization (SRM). This foundation diverges from methods grounded in empirical risk minimization (ERM): while those methods focus on minimizing the error on the training dataset, SVMs aim to minimize an upper bound of the generalization error. This bound combines the training error with a confidence term, resulting in more robust and reliable models.

Initially, SVMs were developed primarily to handle classification problems. However, their power and potential extend beyond classification, which led to the introduction of Support Vector Machines for Regression (SVR), a significant variation of the model originally proposed by Vapnik. SVR holds true to the core principles of SVMs while adapting them to regression problems.


Mathematical Formulation

Suppose we have data in the form $x = [x_1, x_2,  \ldots , x_n]$ and $y = [y_1, y_2,  \ldots , y_n]$ where $x_i \in \mathbb{R}^d$ are the input variables (features) and $y_i \in \mathbb{R}$ the output variables. The regression function of SVR is defined as:

\begin{equation} f(x) = w  \cdot x + b\end{equation}

Where $w$ is a weight vector and $b$ is a bias term. Notice that this equation is the hyperplane equation from the formulation of SVMs. A graphic example of the regression function (in two dimensions) is:

[Figure: a linear regression function $f(x) = w \cdot x + b$ in two dimensions]
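As a minimal numerical sketch of Eq. (1), assuming an arbitrary weight vector and bias (the values below are purely illustrative), the prediction step is a single dot product:

```python
import numpy as np

# Illustrative (assumed) parameters of the regression function f(x) = w.x + b
w = np.array([0.8, -1.2])   # weight vector, one entry per feature
b = 0.5                     # bias term

x = np.array([2.0, 1.0])    # a single input with d = 2 features
f_x = w @ x + b             # f(x) = w . x + b
print(f_x)                  # 0.8*2.0 - 1.2*1.0 + 0.5 = 0.9
```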


We can map the input vector $x$ to a high-dimensional space with a function $\phi(x)$. Introducing this function into Eq. (1), we obtain:

\begin{equation} f(x) = w  \cdot \phi(x) + b\end{equation}

This form of the regression function allows us to generate non-linear regressions, such as:

[Figure: example of a non-linear regression obtained with the mapping $\phi(x)$]
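As a minimal sketch of Eq. (2), assuming a simple hand-written polynomial feature map $\phi$ (an illustrative choice, not the mapping SVR uses in practice), the function stays linear in $\phi(x)$ but becomes non-linear in $x$:

```python
import numpy as np

def phi(x):
    """Illustrative feature map: send a scalar x to [x, x^2, x^3]."""
    return np.array([x, x**2, x**3])

# Assumed weights and bias in the feature space (arbitrary values)
w = np.array([0.5, -0.3, 0.1])
b = 1.0

def f(x):
    return w @ phi(x) + b   # linear in phi(x), non-linear in x

print(f(1.0), f(2.0), f(3.0))   # 1.3, 1.6, 2.5 -> unequal steps, not a straight line
```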



As mentioned earlier, SVR minimizes an upper bound of the generalization error. The error term is handled in the constraints, where we require the absolute error to be less than or equal to a specified margin, called the maximum error $\varepsilon$. We can tune $\varepsilon$ to obtain the desired accuracy of our model. For this case, the objective function and constraints are:

\begin{equation} \begin{aligned} \min_{w} \quad & \frac{1}{2}\|w\|^2 \\ \textrm{s.t.} \quad & |y_i - (w \cdot x_i +b)| \leq \varepsilon \\ \end{aligned}\end{equation}

The first part is read as: minimize the objective function $\frac{1}{2}\|w\|^2$ with respect to the variable $w$, subject to the constraint $|y_i - (w \cdot x_i + b)| \leq \varepsilon$.

An illustrative example of this is:

[Figure: the $\varepsilon$-tube around the regression function, with all points inside the margins]
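As a numerical sketch of Eq. (3), assuming the cvxpy library is available (the toy data and the value of $\varepsilon$ are arbitrary choices), the hard $\varepsilon$-tube can be fitted as follows:

```python
import numpy as np
import cvxpy as cp

# Toy, almost-linear data (illustrative)
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(20, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.05, 20)

eps = 0.2                    # maximum allowed absolute error
w = cp.Variable(1)
b = cp.Variable()

residuals = y - (X @ w + b)
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.abs(residuals) <= eps])
problem.solve()
print(problem.status, w.value, b.value)   # status is "infeasible" if eps is too small
```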


Depending on the value of $\varepsilon$, this formulation may not work for all data points. The optimizer fits the objective function as well as possible, but some points may still fall outside the margins, so we need to account for errors larger than $\varepsilon$. To handle this, we introduce the slack variables.

The slack variable $\xi$ is defined as follows: for any point that falls outside the $\varepsilon$-margin, we denote its deviation from the margin as $\xi$.



We aim to minimize these deviations $\xi$ as much as possible. Thus, we add them to the objective function, weighted by an additional hyperparameter, the penalty factor $C$.

\begin{equation} \begin{aligned} \min_{w,\xi} \quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}{|\xi_{i}|} \\ \\ \textrm{s.t.} \quad & |y_i - w \cdot x_i - b| \leq \varepsilon + |\xi_{i}|\\ \end{aligned}\end{equation}
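As a numerical sketch of Eq. (4), again assuming cvxpy and arbitrary toy data, the problem can be written directly by introducing one non-negative slack value per point (playing the role of $|\xi_i|$):

```python
import numpy as np
import cvxpy as cp

# Noisier toy data (illustrative)
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(30, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.3, 30)

C, eps = 5.0, 0.1
w = cp.Variable(1)
b = cp.Variable()
xi = cp.Variable(30, nonneg=True)    # plays the role of |xi_i| in Eq. (4)

residuals = y - (X @ w + b)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.abs(residuals) <= eps + xi]
cp.Problem(objective, constraints).solve()
print(w.value, b.value, xi.value.sum())   # fitted parameters and total slack used
```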

We can expand the terms that include $|\xi_{i}|$, considering $\xi_{i}^*$ as the deviations above the $\varepsilon$-tube and $\xi_{i}$ as the deviations below it. Expanding these terms and including the mapping $\phi(x)$ introduced earlier, we obtain:

\begin{equation} \begin{aligned} \min_{w,\xi^{(*)}} \quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}{\left ( \xi_{i} + \xi_{i}^* \right )} \\ \\ \textrm{s.t.} \quad & y_i - w  \cdot \phi(x_i)  - b \leq \varepsilon + \xi_{i}^*\\ & w  \cdot \phi(x_i) + b - y_i \leq \varepsilon + \xi_{i}\\ & \xi^{(*)} \geq 0 \\ \end{aligned}\end{equation}

Where $\xi^{(*)} = \left(  \xi_1, \xi_{1}^{*},  \ldots ,  \xi_n, \xi_{n}^{*} \right) $ is the vector of slack variables. As $C$ increases, the penalty for points outside of the $\varepsilon$-tube increases, so the tolerance for such deviations decreases. To derive the dual problem from this (primal) minimization problem, the Lagrange function is introduced:

\begin{equation} \begin{aligned} L = & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}{\left ( \xi_{i} + \xi_{i}^* \right )} - \sum_{i=1}^{n}{\left ( \eta_{i}\xi_{i} + \eta_{i}^*\xi_{i}^* \right )} \\  & - \sum_{i=1}^{n}{\alpha_{i} \left ( \varepsilon + \xi_{i} + y_{i} - w  \cdot \phi(x_i)  - b \right )} \\ & - \sum_{i=1}^{n}{\alpha_{i}^{*} \left ( \varepsilon + \xi_{i}^{*} - y_{i} + w  \cdot \phi(x_i) + b \right )} \\  \end{aligned}\end{equation}

Where $L$ is the Lagrangian, and $\alpha^{(*)} = \left(  \alpha_1, \alpha_{1}^{*},  \ldots ,  \alpha_n, \alpha_{n}^{*} \right) $ and $\eta^{(*)} = \left(  \eta_1, \eta_{1}^{*},  \ldots ,  \eta_n, \eta_{n}^{*} \right) $ are the (non-negative) Lagrange multipliers. The optimality conditions are obtained by setting the partial derivatives of $L$ to zero:

\begin{equation} \begin{aligned} & \frac{\partial{L}}{\partial{w}} = 0 \; \Rightarrow \; w =  \sum_{i=1}^{n}{\left ( \alpha_{i}^* - \alpha_{i} \right ) \phi(x_i)  } \\ \\ & \frac{\partial{L}}{\partial{b}} = 0 \;  \Rightarrow \;  \sum_{i=1}^{n}{\left ( \alpha_{i}^* - \alpha_{i} \right )} = 0 \\  \\ & \frac{\partial{L}}{\partial{\xi_i}} = 0 \; \Rightarrow \;  C - \alpha_{i} - \eta_{i} = 0 \\ \\ & \frac{\partial{L}}{\partial{\xi_{i}^{*}}} = 0 \;  \Rightarrow \;  C - \alpha_{i}^{*} - \eta_{i}^{*} = 0 \\ \end{aligned}\end{equation}
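For completeness, here is the intermediate step (following the standard SVR derivation): substituting $w$ from Eq. (7) into Eq. (6), and using $\eta_i = C - \alpha_i$ and $\eta_i^* = C - \alpha_i^*$, the terms in $\xi^{(*)}$ vanish and the terms in $b$ cancel because $\sum_i{(\alpha_i^* - \alpha_i)} = 0$, leaving:

\begin{equation*} \begin{aligned} L = & -\frac{1}{2} \sum_{i, j =1}^{n}{( \alpha_{i}^* - \alpha_{i}) ( \alpha_{j}^* - \alpha_{j} )\, \phi(x_i) \cdot \phi(x_j)} \\ & - \varepsilon \sum_{i=1}^{n}{( \alpha_{i}^* + \alpha_{i} )} + \sum_{i=1}^{n}{y_i ( \alpha_{i}^* - \alpha_{i} )} \\ \end{aligned}\end{equation*}

Maximizing this expression over $\alpha^{(*)}$ is equivalent to minimizing its negative, and the conditions $\eta^{(*)} \geq 0$ together with Eq. (7) give the bounds $0 \leq \alpha^{(*)} \leq C$.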

Combining Eqs. (6) and (7) in this way, we get the dual optimization problem:

\begin{equation} \begin{aligned} \min_{\alpha^{(*)}} \quad & \frac{1}{2} \sum_{i, j =1}^{n}{( \alpha_{i}^* - \alpha_{i}) ( \alpha_{j}^* - \alpha_{j} ) \phi(x_i) \phi(x_j)} \\ & + \varepsilon \sum_{i=1}^{n}{( \alpha_{i}^* + \alpha_{i} )} - \sum_{i=1}^{n}{y_i(\alpha_{i}^* - \alpha_{i})}\\ \\ \textrm{s.t.} \quad & \sum_{i=1}^{n}{( \alpha_{i}^* - \alpha_{i})} = 0 \\ & 0 \leq \alpha^{(*)} \leq C \\ \end{aligned}\end{equation}
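As a numerical sketch of Eq. (8), assuming cvxpy and a linear kernel $\phi(x) = x$ (so the quadratic term is simply $\|\sum_i (\alpha_i^* - \alpha_i)\, x_i\|^2$), with arbitrary toy data, $C$ and $\varepsilon$:

```python
import numpy as np
import cvxpy as cp

# Toy data (illustrative)
rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(30, 1))
y = 1.5 * X[:, 0] - 0.7 + rng.normal(0, 0.2, 30)

C, eps = 10.0, 0.1
a = cp.Variable(30, nonneg=True)        # alpha_i
a_star = cp.Variable(30, nonneg=True)   # alpha_i^*
d = a_star - a                          # alpha_i^* - alpha_i

objective = cp.Minimize(0.5 * cp.sum_squares(d @ X)   # = 0.5 * ||w||^2 for a linear kernel
                        + eps * cp.sum(a_star + a)
                        - d @ y)
constraints = [cp.sum(d) == 0, a <= C, a_star <= C]
cp.Problem(objective, constraints).solve()

w = d.value @ X                 # w = sum_i (alpha_i^* - alpha_i) x_i
b = np.median(y - X @ w)        # rough bias estimate; in practice b comes from the KKT conditions
print(w, b)
```

In practice this quadratic program is solved by specialized routines (for example, SMO-type solvers inside libraries such as LIBSVM and scikit-learn), so we rarely implement it by hand.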

By solving this optimization problem, in other words, obtaining the values of the Lagrange multipliers $\alpha^{(*)}$ that minimize the objective function, the regression function of SVR can be written as:

\begin{equation} \begin{aligned} f(x) & = \; w  \cdot \phi(x) + b \\  & = \;  \sum_{i=1}^{n}{( \alpha_{i}^* - \alpha_{i}) [\phi(x_i)  \cdot \phi(x) ]} + b \\ & = \;  \sum_{i=1}^{n}{( \alpha_{i}^* - \alpha_{i}) K(x_i, x)} + b \\ \end{aligned}\end{equation}

Where $K(x_i, x) = \phi(x_i) \cdot \phi(x)$ is known as the kernel function.
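As a quick numerical check of Eq. (9), assuming scikit-learn is available, we can fit an SVR with an RBF kernel and reconstruct its predictions by hand from the support vectors, the coefficients playing the role of $(\alpha_i^* - \alpha_i)$, and the bias (the data and hyperparameter values below are arbitrary):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

# Toy non-linear data (illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 100)

gamma = 0.5   # RBF width, fixed explicitly so we can reuse it below
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=gamma).fit(X, y)

X_new = np.linspace(-3, 3, 7).reshape(-1, 1)
# dual_coef_ holds the signed coefficients of the support vectors
# (the role of alpha_i^* - alpha_i here); intercept_ is the bias b.
K = rbf_kernel(X_new, svr.support_vectors_, gamma=gamma)   # K(x_i, x)
manual = K @ svr.dual_coef_.ravel() + svr.intercept_[0]

print(np.allclose(manual, svr.predict(X_new)))   # True: Eq. (9) reproduced
```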


Importance of hyperparameter tuning

We have seen the mathematical meaning of the hyperparameters $C$ and $\varepsilon$, and how they enter the mathematical formulation of SVR as factors for model tuning. These hyperparameters control the trade-off between the accuracy and the generalizability of the model. For a better understanding, let's look at a visual example of how these parameters affect model performance.

[Video: effect of varying $C$ and $\varepsilon$ on an SVR fit with an RBF kernel]

In this video, we can see that to obtain a good regression we need to find the optimal values of both $C$ and $\varepsilon$. Varying only one parameter is not enough; to get a robust solution we must optimize both. We also need to choose the right kernel for the data distribution: in the video above we selected a Radial Basis Function (RBF) kernel, a popular choice for many problems because it can capture complex, non-linear relationships.
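In practice, this search can be automated. As a sketch, assuming scikit-learn is available (the parameter grids below are arbitrary examples), a cross-validated grid search over $C$, $\varepsilon$ and the RBF width $\gamma$ looks like this:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Toy non-linear data (illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

param_grid = {
    "C": [0.1, 1, 10, 100],
    "epsilon": [0.01, 0.1, 0.5],
    "gamma": [0.1, 0.5, 1.0],
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)   # best combination and its CV MSE
```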

