  • K-Means

    Separating data into distinct clusters, organizing diverse information and simplifying complexity with vibrant clarity

  • Principal Components Analysis

    Making high dimensions comprehensible and actionable, it captures the maximum amount of variance in the data with a reduced number of features

  • Random Forest for Regression

    Combining decision trees, it provides predictive accuracy that illuminates the path to regression analysis

  • Support Vector Machines for Regression

    Leveraging mathematical precision, it excels in predicting values by carving precise pathways through data complexities

Thursday, April 20, 2023

Drawing the Line: Understanding Decision Boundaries

In the domain of data science, classification problems are everywhere. From identifying spam emails to diagnosing diseases, classification algorithms have transformed the way we make decisions. Understanding and visualizing decision boundaries offers significant insight into the behavior and performance of these algorithms.

A decision boundary is a significant concept, as it provides a visual interpretation of how different classification algorithms work. Selecting the right algorithm for a specific problem significantly impacts the classification performance, because no model is universally best for all problems. Therefore, it's imperative to compare different classification algorithms to understand which one has the best performance and generalization capacity.

In this blog post, I explore the concept of decision boundaries and compare the performance of three popular classification algorithms: K-Nearest Neighbors (KNN), Gaussian Naive Bayes, and Decision Tree. For more information about these models, refer to the respective posts here.


Understanding Decision Boundaries

A decision boundary, in the context of classification algorithms, is a hypersurface that partitions the underlying feature space into decision regions, each representing a different class. The geometry and complexity of these boundaries depend on the classification algorithm and can range from simple linear separators to convoluted hypersurfaces.

By visualizing these decision boundaries, one can grasp how different algorithms partition the feature space and understand their classification logic. A straight boundary suggests that an algorithm separates the classes with a simple linear rule, while a more convoluted boundary indicates that it captures non-linear structure and interactions between the features. Understanding the nature of these boundaries also helps in assessing the model's robustness, its susceptibility to noise, and its potential for overfitting or underfitting.
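
As a minimal sketch of the general recipe behind any decision-boundary plot (the dataset and classifier here are only illustrative), the idea is to train a classifier and let it classify every point of a dense grid covering the feature space; the color change in the background is the decision boundary.

# Minimal sketch: train a classifier and classify every point of a dense grid
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Grid spanning the feature space
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 300),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)                     # decision regions
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')      # data points
plt.show()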


Classification Algorithms to Compare

K-Nearest Neighbors: It's a simple and powerful classification method that works by comparing an unclassified sample with the K most similar samples in the training set. The unclassified sample is then assigned the most common class among its K neighbors. KNN's decision boundaries can vary greatly with the choice of K and distance metric, providing a flexible way to classify complex datasets. More information on this algorithm can be found here.
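
As a quick, purely illustrative sketch of this flexibility (the synthetic dataset and parameter values are hypothetical), the effect of K and the distance metric can be measured directly with cross-validation:

# Illustrative sketch: how the choice of K and distance metric affects KNN accuracy
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_moons(n_samples=300, noise=0.25, random_state=0)
for k in [1, 5, 25]:
    for metric in ['euclidean', 'manhattan']:
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
        acc = cross_val_score(knn, X_demo, y_demo, cv=10, scoring='accuracy').mean()
        print(f"K={k:2d}, metric={metric}: accuracy={acc:.3f}")

Small values of K tend to produce jagged, noise-sensitive boundaries, while larger values smooth the boundary at the cost of local detail.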

Gaussian Naive Bayes: This classifier is based on applying Bayes' theorem with strong independence assumptions between the features, and it assumes that the data from each class is drawn from a simple Gaussian distribution. Despite its simplicity, Gaussian Naive Bayes can be surprisingly effective and is especially suitable for datasets with many features. It tends to produce decision boundaries that are quadratic. Refer to this link for more details.
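
To make the Gaussian assumption concrete, here is a rough sketch on a hypothetical two-class dataset: the fitted classifier summarizes each class by a per-feature mean (stored in its theta_ attribute), which matches the empirical class means computed by hand.

# Rough sketch of the Gaussian assumption behind Gaussian Naive Bayes
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = make_blobs(n_samples=300, centers=2, random_state=0)
gnb = GaussianNB().fit(X_demo, y_demo)

print("Per-class feature means learned by the model:\n", gnb.theta_)
for c in np.unique(y_demo):
    print(f"Empirical means of class {c}:", X_demo[y_demo == c].mean(axis=0))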

Decision Tree: A decision tree is a flowchart-like structure where each internal node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf node holds a class label. The decision boundaries are typically axis-aligned, partitioning the feature space into cuboids. Decision trees can handle both numerical and categorical data, making them a versatile choice for many classification problems. Learn more about decision trees here.
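
The axis-aligned character of these boundaries can be seen by printing the learned rules, since every split tests a single feature against a threshold. Below is a small sketch on a hypothetical dataset, using export_text from sklearn.tree:

# Small sketch: each rule tests one feature against a threshold, so the resulting
# decision regions are axis-aligned rectangles
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, export_text

X_demo, y_demo = make_moons(n_samples=300, noise=0.25, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)
print(export_text(tree, feature_names=['feature_1', 'feature_2']))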

In the following section, I compare these three algorithms, examining their decision boundaries and computing their performance in terms of accuracy.


Visualizing Decision Boundaries

To visualize the decision boundaries, I implemented the machine learning models in Python using the scikit-learn library. I also developed functions to plot the decision boundaries and compute the accuracy scores of the models.


Importing Necessary Libraries

First, we import the essential Python libraries that will be used throughout this post.

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpatches
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
from sklearn.model_selection import cross_val_score

Here we import pandas and numpy for handling and manipulating data, matplotlib for creating the plots, and scikit-learn for the machine learning algorithms, the cross-validation utility, and the synthetic dataset generators.


Defining Functions

We develop two functions: DecisionBoundaries and Cal_Acc.

The function DecisionBoundaries plots the decision boundaries for different classifiers on various datasets. It creates a grid of subplots where each row represents a specific dataset and each column represents a classifier. The first column in each row presents the datasets with their true labels.

# Function to Plot Decision Boundaries
def DecisionBoundaries(model_names, models, arr_datasets, arr_labels, size):

    h = 0.02  # step size in the mesh
    fig, axs = plt.subplots(len(arr_datasets), len(model_names) +1, figsize = size, facecolor='#F5F5F5')

    # Defining colormaps for visualizing the decision boundaries
    points_colormap = ListedColormap(['#FF0000', '#00FF00'])
    background_colormap = ListedColormap(['#FFAAAA', '#c2f0c2'])
    # Defining patches for the legend
    class0 = mpatches.Patch(color='#FF0000', label='0')
    class1 = mpatches.Patch(color='#00FF00', label='1')

    # Iterate through each dataset
    for i, (dataset, labels) in enumerate(zip(arr_datasets, arr_labels)):
        # Setting limits for the meshgrid
        x_min, x_max = dataset[:, 0].min() - 0.1*abs(dataset[:, 0].min()), dataset[:, 0].max() + 0.1*abs(dataset[:, 0].max())
        y_min, y_max = dataset[:, 1].min() - 0.1*abs(dataset[:, 1].min()), dataset[:, 1].max() + 0.1*abs(dataset[:, 1].max())
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
        # Displaying input data
        axs[i,0].set_facecolor('#F5F5F5')
        scatter = axs[i,0].scatter(dataset[:, 0], dataset[:, 1], c=labels, cmap=points_colormap, edgecolor='k')
        axs[i,0].legend(handles=scatter.legend_elements()[0], labels=['0', '1'], title = 'Classes')
        if i == 0:
          axs[i,0].set_title("Input Data")

        # Iterate through each model
        for j, (name, model) in enumerate(zip(model_names, models)):
            # Applying the model to generate decision regions
            model.fit(dataset, labels)
            decision_boundary = model.predict(np.c_[xx.ravel(), yy.ravel()])

            # Plotting the decision boundaries
            decision_boundary = decision_boundary.reshape(xx.shape)
            axs[i,j+1].set_facecolor('#F5F5F5')
            axs[i,j+1].pcolormesh(xx, yy, decision_boundary, cmap=background_colormap)
            # Plotting the data points
            scatter = axs[i,j+1].scatter(dataset[:, 0], dataset[:, 1], c=labels, cmap=points_colormap, edgecolor='k')
            axs[i,j+1].legend(handles=scatter.legend_elements()[0], labels=['0', '1'], title = 'Classes')

            # Adding title to each subplot
            if i == 0:
              axs[i,j+1].set_title(name)

The Cal_Acc function computes the average accuracy of each model on each dataset using 10-fold cross-validation. It stores the accuracy results in a pandas DataFrame for easier visualization and interpretation.

# Function to Calculate Average Accuracies using K-Fold Cross Validation
def Cal_Acc(model_names, models, dataset_names, arr_datasets, arr_labels):

    Accuracies = np.zeros((len(arr_datasets), len(model_names)))
    # Iterate through each dataset
    for i, (dataset, labels) in enumerate(zip(arr_datasets, arr_labels)):

        # Iterate through each model
        for j, model in enumerate(models):
            # Calculating cross-validation accuracy and storing in an array
            acc = cross_val_score(model, dataset, labels, cv=10, scoring='accuracy').mean()
            Accuracies[i, j] = round(acc, 5)

    # Generating a DataFrame from the accuracy array
    df_accuracies = pd.DataFrame(Accuracies, columns = model_names)
    df_accuracies.insert(0, "Datasets", dataset_names, True)

    return df_accuracies


Dataset Generation

We create three synthetic datasets, each presenting a different type of distribution and classification challenge: concentric circles (make_circles), half-moon shapes (make_moons), and two blobs (make_blobs). Each dataset has a corresponding set of labels.

# Generating Datasets
Dataset1, Labels1 = datasets.make_circles(n_samples=600, noise=0.13, factor = 0.5, random_state=0)
Dataset2, Labels2 = datasets.make_moons(n_samples=600, noise=0.22, random_state=0)
Dataset3, Labels3 = datasets.make_blobs(n_samples=600, centers = 2, cluster_std=1.3, random_state=0)

Then, we group them together in arrays.

# Grouping Datasets
arr_Datasets = [Dataset1, Dataset2, Dataset3]
arr_Labels = [Labels1, Labels2, Labels3]
dataset_names = ["Dataset 1", "Dataset 2", "Dataset 3"]


Defining Machine Learning Models

We define the classification models: K-Nearest Neighbors (KNN), Gaussian Naive Bayes, and Decision Tree. 

# Defining Models
KNN_model = KNeighborsClassifier(n_neighbors=5)
NaiveBayes_model = GaussianNB()
DecisionTree_model =  DecisionTreeClassifier(criterion = 'gini', splitter = 'best', random_state = 0)

We also group the models in an array.

# Grouping Models
models = [KNN_model, NaiveBayes_model, DecisionTree_model]
model_names = ["K-Nearest Neigbohrs", "Gaussian Naive-Bayes", "Decision Tree"]


Results

We use the functions to plot the decision boundaries and calculate the accuracies of the models, applying them to each dataset generated in previous sections.

# Plotting Decision Boundaries
DecisionBoundaries(model_names, models, arr_Datasets, arr_Labels, size = (18,10))

# Calculating Average Accuracies
accuracies = Cal_Acc(model_names, models, dataset_names, arr_Datasets, arr_Labels)

The graph containing the decision boundaries of each model and dataset is shown below:


The results for the accuracy scores of the models applied to each dataset were:


Conclusion

Understanding decision boundaries provides valuable insights into the behavior of classification models. It reveals how each model draws the 'line' between different classes, which can be a crucial factor in certain applications. In this blog post, we explored decision boundaries and implemented popular classification models, namely K-Nearest Neighbors (KNN), Gaussian Naive Bayes, and Decision Trees, to understand their decision-making process.

Through a comparative study, we demonstrated that each model has its strengths and weaknesses, and their performance can significantly vary depending on the complexity and distribution of the data. This underlines the importance of comprehending the underlying mechanisms of these models and applying this understanding in the model selection process.



Saturday, April 8, 2023

Decision Trees for Classification

The Decision Tree algorithm is a powerful and versatile machine learning technique. It belongs to the supervised learning paradigm and can be used for both classification and regression tasks, although it is mainly applied to classification problems. The algorithm works by creating a model that predicts the value of a target variable by learning simple decision rules derived from the data features.

The importance of Decision Trees lies in their simplicity and interpretability. Unlike other machine learning algorithms, Decision Trees require little data preparation and can handle data which hasn't been normalized or encoded, as well as both categorical and numerical data. They also provide a clear indication of which fields are most important, making them a valuable tool for understanding complex datasets.

Decision Trees are fundamental components of Random Forests, which are among the most potent Machine Learning algorithms available today. In the real world, Decision Trees have a wide range of applications. They are used in the medical field for aiding in diagnosis, in finance to assess a potential investment risk, in retail to help predict a customer's likely purchases, and in many other fields.


Structure of Decision Trees

The structure of a Decision Tree is straightforward and intuitive. It consists of a root node, internal nodes, and leaf nodes. Each internal node represents a feature, each leaf node represents a class label, and each branch represents a rule. The topmost node in a Decision Tree is known as the root node. 

The tree starts at the root node and splits the data on the feature that results in the largest information gain (IG). In an iterative process, this splitting procedure is repeated at each child node until the leaves are pure, meaning that all the samples at each leaf belong to the same class, or until a stopping criterion (such as a maximum depth) is reached.



Metrics for Split Quality

There are many metrics used to evaluate the quality of a split in Decision Trees; the most commonly used are Entropy, Information Gain, and the Gini Index.

Entropy is a measure of randomness or uncertainty. The goal of using entropy in a Decision Tree is to choose splits that lead to nodes whose information is most valuable for classification, which is achieved by minimizing the entropy of the resulting nodes; this is precisely the goal of the information gain criterion. The formula for entropy is:

\begin{equation*} \mathbf{E} = -\sum_{i=1}^{n}p_i \log_2 (p_i)  \end{equation*}

where $p_i$ is the proportion of samples in the node that belong to class $i$.

Information Gain is the reduction in entropy or impurity achieved by a split. It measures how well a given attribute separates the training examples according to their target classification, i.e., how much information a feature provides about the target variable. The attribute with the highest information gain is chosen as the splitting feature at each node.

\begin{equation*} \mathbf{IG} = E_{parent} - E_{children}  \end{equation*}

Where $E_{parent}$ is the entropy of the parent node and $E_{children}$ is the weighted average entropy of the child nodes produced by the split.
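
As a small worked example (with made-up class counts), these two formulas can be evaluated directly: the parent node below is perfectly mixed, and the candidate split reduces its entropy.

# Worked example with made-up class counts: entropy and information gain of a split
import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()               # class proportions (empty classes contribute nothing)
    return -np.sum(p * np.log2(p))

parent = [10, 10]                         # 10 samples of class 0, 10 of class 1
left, right = [8, 2], [2, 8]              # class counts sent to each child by a candidate split

E_parent = entropy(parent)
# Weighted average entropy of the children (weights = fraction of samples in each child)
E_children = (sum(left) * entropy(left) + sum(right) * entropy(right)) / sum(parent)
print(f"Entropy of parent node: {E_parent:.3f}")              # 1.000 (perfectly mixed)
print(f"Information gain of the split: {E_parent - E_children:.3f}")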

The Gini Index is another alternative for splitting a decision tree. While Entropy and Information Gain focus on the purity or impurity of a node, the Gini Index measures the probability of a randomly chosen sample being misclassified if it were labeled according to the class distribution of the node. The formula for the Gini Index is:

\begin{equation*} \mathbf{Gini} = 1 - \sum_{i=1}^{n}p_i^2  \end{equation*}

The lower the Gini Index, the lower the probability of misclassification; a Gini Index of 0 corresponds to a perfectly pure node and therefore to the best possible split.
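
A matching sketch for the Gini Index, reusing the made-up class counts from the entropy example above:

# Gini impurity for the same made-up class counts used in the entropy example
import numpy as np

def gini(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    return 1.0 - np.sum(p ** 2)

print(f"Gini of the parent node [10, 10]: {gini([10, 10]):.3f}")   # 0.500 (maximally mixed)
print(f"Gini of the left child [8, 2]:    {gini([8, 2]):.3f}")     # 0.320
print(f"Gini of a pure node [10, 0]:      {gini([10, 0]):.3f}")    # 0.000 (best possible)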


Algorithms for Building Decision Trees

There are several popular algorithms used to build Decision Trees, including CART (Classification and Regression Trees), ID3 (Iterative Dichotomiser 3), and C4.5.

  • CART: The CART algorithm is a binary tree algorithm, meaning that each parent node splits into exactly two child nodes. It can handle both numerical and categorical variables and can be used for both classification and regression tasks. CART uses the Gini Index as the metric to create decision points for classification tasks and the mean squared error for regression tasks; a short sketch after this list shows how this choice is exposed in scikit-learn.
  • ID3: It uses a top-down, greedy approach to construct a decision tree, selecting the best attribute to place at the root node based on the Information Gain metric. The ID3 algorithm handles categorical input attributes well but struggles with continuous attributes. It also doesn't handle missing values, so any missing data needs to be preprocessed before it can be used by the ID3 algorithm.
  • C4.5: This algorithm is an extension of ID3 that adjusts the Information Gain using a "Gain Ratio" to handle the bias towards attributes with many outcomes. It can handle both continuous and categorical attributes, and also allows for missing feature values. The C4.5 model includes a pruning step, in which it uses a statistical test to estimate whether expanding or collapsing some branches improves its predictions, which helps to avoid overfitting.
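
As a brief sketch of how these ideas surface in practice, scikit-learn (used in the implementation below) is documented as implementing an optimized version of CART, and its DecisionTreeClassifier exposes the split-quality metric through the criterion parameter; the comparison here is only illustrative.

# Sketch: choosing the split-quality metric in scikit-learn's CART-based decision trees
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for criterion in ['gini', 'entropy']:
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    acc = cross_val_score(tree, X, y, cv=10, scoring='accuracy').mean()
    print(f"criterion='{criterion}': cross-validated accuracy = {acc:.3f}")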

Implementing the Algorithm in Python

In this section, I explain an implementation of the Decision Tree algorithm in Python using the scikit-learn library. Here I use the Iris dataset, a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. The task is to build a Decision Tree model that classifies the iris species based on their properties.


Importing Libraries

First, we import the libraries for this script, which include:

# Importing libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from graphviz import Source

  • numpy: It adds support for multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
  • pandas: Library for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series.
  • sklearn: It is a machine learning library for Python. It features various machine learning algorithms and also supports Python's numerical and scientific libraries. We specifically imported the load_iris function for loading the Iris dataset, train_test_split for splitting our data into train and test sets, DecisionTreeClassifier for building the Decision Tree model, export_graphviz for exporting the tree in a Graphviz format, and accuracy_score for calculating the accuracy of the model.
  • matplotlib: It is a visualization library in Python for creating static, animated, and interactive visualizations.
  • Graphviz: This is an open-source graph visualization software. It is used to represent structural information as diagrams of abstract graphs and networks. In this case, it is used to plot the tree diagram of the trained model.

Loading the Iris Dataset

This dataset is freely available and built into the scikit-learn library. It includes three iris species (setosa, versicolor and virginica) with 50 samples each, as well as some properties about each flower (sepal length, sepal width, petal length and petal width). We first load this dataset and split it into input features X and the target variable y. The input features are the measurements of each flower and the target variable is the species of the flower.

# Loading the Iris dataset
iris = load_iris()
# Splitting the data into input features (X) and target variable (y)
X = iris.data
y = iris.target

Next, we create a Pandas dataframe with the data and the column names. This allows us to easily manipulate and explore the data. 

# Creating a Pandas DataFrame with the data and the column names
df = pd.DataFrame(data=X, columns=iris.feature_names)
# Adding the target variable column to the DataFrame
df['target'] = y

To make the data more understandable, we map the integer values of the target variable to the names of the Iris species.

# Creating a dictionary to map the integer label values to the class names
class_mapping = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
# Map the integer label values to the class names
df['target'] = df['target'].map(class_mapping)

Finally, we print out a random sample of 5 rows from our dataset. This is useful to get a quick overview of the data we are working with.

df.sample(n = 5, random_state = 0)

The random sample is shown below:


Applying the Decision Tree Model

Now that we have loaded the dataset, we can move on to applying the Decision Tree model. First, we split the data into train and test sets, which allows us to evaluate the performance of our model on unseen data. We use the train_test_split function from the scikit-learn library, so that 25% of the samples belong to the test set and 75% to the train set.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

Next, we create an instance of the DecisionTreeClassifier and train it using the training dataset.

# Creating an instance of DecisionTreeClassifier
decision_tree = DecisionTreeClassifier(criterion = 'gini', splitter = 'best', random_state = 0)
# Training the model using the train set
decision_tree.fit(X_train, y_train)

Once the model is trained, we apply it to make predictions on the test data. We also calculate the accuracy score to evaluate the performance of the model.

# Make predictions on the test set
y_pred = decision_tree.predict(X_test)
# Computing the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("The accuracy is:", accuracy)

The resulting accuracy of the model is 0.974, which indicates good performance.


Understanding the Structure of the Tree

One of the advantages of Decision Trees is that they are relatively easy to interpret. We can visualize the trained Decision Tree model using sklearn and the graphviz libraries.

# Generating a Graphviz representation of the Decision Tree
dot_data = export_graphviz(decision_tree, out_file=None, feature_names=iris.feature_names, class_names=iris.target_names, rounded=True)
# Creating a graph from the Graphviz representation
graph = Source(dot_data)
# Displaying the graph
graph

The export_graphviz function generates a Graphviz representation of the Decision Tree, which is a text language for describing graphs. Then, we use the Source function from the graphviz library to create a graph from the Graphviz representation of the Decision Tree. Finally, we display the graph, which is below:


Visualizing Feature Importance

Another advantage of Decision Trees is that they can compute the relative importance of features. This can be very useful in understanding which features are the most influential in making the predictions.

# Get the feature importances
importances = decision_tree.feature_importances_

# Creating a bar plot to visualize the feature importances
plt.rcParams['font.size'] = '12'
fig, ax = plt.subplots(figsize=(8,6), facecolor='#F5F5F5')
ax.set_facecolor('#F5F5F5')
bars = ax.bar(range(len(importances)), importances, tick_label=iris.feature_names)
ax.set_ylabel('Importance')
ax.set_title('Importance of features in decision tree model')
ax.set_xticklabels(iris.feature_names, rotation=45)

# Adding the value above each bar
for bar in bars:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 4), ha='center', va='bottom')

plt.show()

The feature_importances_ attribute of the model gives us the importance of each feature. These importances were computed as the total reduction of the criterion (Gini) brought by that feature. Then, we plot these feature importances on a bar plot for easy visualization.

This graph indicates that sepal length was not used in the decision-making process of the model, as it has an importance of 0.0. Sepal width, with an importance of 0.0201, played a minor role. In contrast, the petal length and petal width features were the most influential in the decision-making process, with importances of 0.3993 and 0.5806, respectively. This suggests that these two features contributed the most to the reduction of impurity in the nodes; in other words, they are the most important features for the Decision Tree model.


Training a New Decision Tree Model with Fewer Features

The importance scores obtained from the initial Decision Tree model showed that petal length and petal width are the most influential features for classifying the Iris species. To illustrate the impact of these features, we train a new Decision Tree model using only these two features.

We create new training and testing sets that only include the petal length and petal width features.

# Selecting only the 'petal length' and 'petal width' features for training and testing
X_train_reduced = X_train[:, 2:4]
X_test_reduced = X_test[:, 2:4]

Then, we define a new instance of the DecisionTreeClassifier and train it using the reduced-features training data.

# Creating a new DecisionTreeClassifier model
decision_tree_2 = DecisionTreeClassifier(criterion = 'gini', splitter = 'best', random_state = 0)
# Train the new model using the reduced training set
decision_tree_2.fit(X_train_reduced, y_train)

Finally, we compute the accuracy of the new model.

# Make predictions on the reduced testing set
y_pred_2 = decision_tree_2.predict(X_test_reduced)
# Computing the accuracy of the new model
accuracy_2 = accuracy_score(y_test, y_pred_2)
print("The accuracy of the new model is:", accuracy_2)

The accuracy of this new model, which only uses the petal length and petal width features, is 0.947. This is slightly lower than the 0.974 accuracy of the initial model, but it's important to note that the reduction is minimal.

Moreover, the fact that we are able to achieve nearly the same level of accuracy while reducing the number of features by half is quite significant. In larger, more complex datasets, reducing the dimensionality of the data can lead to substantial computational savings and can also help to mitigate the risk of overfitting.


Conclusion

In this post, we have explored the Decision Tree algorithm and its application in classification tasks. Through our hands-on implementation with the Iris dataset, we have seen how the model can be trained, how predictions can be made, and how the model's performance can be evaluated. We have also seen the power of feature selection: by focusing on the most influential features, we were able to create a more efficient model that maintained nearly the same level of accuracy. This highlights the balance between model complexity and performance, a key consideration in machine learning.

In conclusion, we have learned about the power and importance of Decision Tree models; their simplicity, interpretability, and versatility make them a valuable tool in any data scientist's toolkit.


