Saturday, April 8, 2023

Decision Trees for Classification

The Decision Tree algorithm is a powerful and versatile machine learning technique. It belongs to the supervised learning paradigm and can be applied to both classification and regression tasks, although it is most commonly used for classification problems. These algorithms work by building a model that predicts the value of a target variable from simple decision rules learned from the data features.

The importance of Decision Trees lies in their simplicity and interpretability. Unlike many other machine learning algorithms, Decision Trees require little data preparation: they can work with data that has not been normalized or encoded, and they can handle both categorical and numerical data. They also provide a clear indication of which features are most important, making them a valuable tool for understanding complex datasets.

Decision Trees are also fundamental components of Random Forests, which are among the most powerful Machine Learning algorithms available today. In the real world, Decision Trees have a wide range of applications: they are used in the medical field to aid diagnosis, in finance to assess investment risk, in retail to predict a customer's likely purchases, and in many other fields.


Structure of Decision Trees

The structure of a Decision Tree is straightforward and intuitive. It consists of a root node, internal nodes, and leaf nodes. Each internal node represents a test on a feature, each branch represents an outcome of that test, and each leaf node represents a class label. The topmost node in a Decision Tree is known as the root node.

The tree starts at the root node and splits the data on the feature that results in the largest information gain (IG). This splitting procedure is then repeated at each child node until the leaves are pure, meaning that all the samples at a leaf belong to the same class.



Metrics for Split Quality

Several metrics are used to evaluate the quality of a split in Decision Trees; the most commonly used are Entropy, Information Gain, and Gini Impurity.

Entropy is a measure of randomness or uncertainty in a set of samples. When building a Decision Tree, the goal is to produce nodes whose samples carry the most valuable information for classification, which is achieved by minimizing the entropy of those nodes; this is precisely what the information gain criterion does. The formula for entropy is:

\begin{equation*} \mathbf{E} = -\sum_{i=1}^{n}p_i \log_2 (p_i)  \end{equation*}

where $p_i$ is the proportion of samples belonging to class $i$ at the node.

Information Gain is the reduction in entropy or impurity achieved by a split. It measures how well a given attribute separates the training examples according to their target classification, that is, how much information a feature provides about the target variable. The attribute with the highest information gain is chosen as the splitting feature at each node.

\begin{equation*} \mathbf{IG} = E_{parent} - E_{children}  \end{equation*}

where $E_{parent}$ is the entropy of the parent node and $E_{children}$ is the weighted average entropy of the child nodes produced by the split, with each child weighted by its share of the parent's samples.
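To make these two formulas concrete, below is a small illustrative sketch; the helper names are my own and are not part of any library. It computes the entropy of a toy parent node and the information gain of a candidate split.

# Illustrative helpers for entropy and information gain (names are my own)
import numpy as np

def entropy(labels):
    # E = -sum(p_i * log2(p_i)) over the classes present at the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # IG = entropy of the parent minus the weighted average entropy of the children
    n = len(parent)
    e_children = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - e_children

parent = [0, 0, 0, 0, 1, 1, 1, 1]               # evenly mixed node, entropy = 1.0
left, right = [0, 0, 0, 1], [0, 1, 1, 1]        # an informative but imperfect split
print(information_gain(parent, [left, right]))  # roughly 0.19 bits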

The Gini Index is another criterion for splitting a Decision Tree. While Entropy and Information Gain focus on the purity or impurity of a node, the Gini Index measures the probability that a randomly chosen sample would be misclassified if it were labeled at random according to the class distribution of the node. The formula for the Gini Index is:

\begin{equation*} \mathbf{Gini} = 1 - \sum_{i=1}^{n}p_i^2  \end{equation*}

The lower the Gini Index, the lower the probability of misclassification; a Gini impurity of 0 corresponds to a perfectly pure node, which is the best possible outcome of a split.
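As a quick illustrative check of this formula (again with a helper name of my own), a perfectly pure node has a Gini impurity of 0, while an evenly mixed two-class node has an impurity of 0.5:

# Illustrative Gini impurity helper (name is my own)
import numpy as np

def gini(labels):
    # Gini = 1 - sum(p_i^2) over the classes present at the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))  # pure node -> 0.0
print(gini([0, 0, 1, 1]))  # evenly mixed node -> 0.5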


Algorithms for Building Decision Trees

There are several popular algorithms used to build Decision Trees, including CART (Classification and Regression Trees), ID3 (Iterative Dichotomiser 3), and C4.5.

  • CART: The CART algorithm is a binary tree algorithm, meaning that each parent node splits into exactly two child nodes. It can handle both numerical and categorical variables and can be used for both classification and regression tasks. For classification, CART uses the Gini Index to choose decision points, and for regression it uses the mean squared error; a from-scratch sketch of this idea is shown after this list.
  • ID3: It uses a top-down, greedy approach to construct a decision tree, selecting the attribute with the highest Information Gain to place at each node, starting from the root. The ID3 algorithm handles categorical input attributes well but struggles with continuous attributes. It also doesn't handle missing values, so any missing data needs to be dealt with in preprocessing before the data can be used with ID3.
  • C4.5: This algorithm is an extension of ID3 that adjusts the Information Gain using a "Gain Ratio" to correct the bias towards attributes with many outcomes. It can handle both continuous and categorical attributes, and it also allows for missing feature values. C4.5 includes a pruning step, in which a statistical test is used to estimate whether expanding or collapsing some branches improves predictions, which helps to avoid overfitting.
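To illustrate the CART approach referenced above, here is a minimal from-scratch sketch that grows a binary tree by choosing, at each node, the feature and threshold with the lowest weighted Gini impurity. The helper names are my own, and the sketch deliberately omits the stopping criteria (maximum depth, minimum samples per leaf) and pruning that production implementations provide; for real work we will rely on scikit-learn's DecisionTreeClassifier, which implements an optimized version of CART.

# Minimal CART-style tree builder: binary splits chosen by weighted Gini impurity
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Try every feature/threshold pair and keep the one with the lowest
    # weighted impurity of the two resulting child nodes
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature])[:-1]:
            left = X[:, feature] <= threshold
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

def build_tree(X, y):
    # Create a leaf when the node is pure or no valid split remains;
    # otherwise split on the best feature/threshold and recurse on both children
    split = best_split(X, y)
    if len(np.unique(y)) == 1 or split is None:
        values, counts = np.unique(y, return_counts=True)
        return {'class': values[np.argmax(counts)]}
    _, feature, threshold = split
    mask = X[:, feature] <= threshold
    return {'feature': feature, 'threshold': threshold,
            'left': build_tree(X[mask], y[mask]),
            'right': build_tree(X[~mask], y[~mask])}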

Implementing the Algorithm in Python

In this section, I explain an implementation of the Decision Tree algorithm in Python using the scikit-learn library. Here I use the Iris dataset, a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. The task is to build a Decision Tree model that classifies the iris species based on their measurements.


Importing Libraries

First, we import the libraries needed for this script:

# Importing libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from graphviz import Source

  • numpy: It adds support for multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
  • pandas: Library for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series.
  • sklearn: It is a machine learning library for Python. It features various machine learning algorithms and also supports Python's numerical and scientific libraries. We specifically imported the load_iris function for loading the Iris dataset, train_test_split for splitting our data into train and test sets, DecisionTreeClassifier for building the Decision Tree model, export_graphviz for exporting the tree in a Graphviz format, and accuracy_score for calculating the accuracy of the model.
  • matplotlib: It is a visualization library in Python for creating static, animated, and interactive visualizations, as well as drawing attractive and informative statistical graphs.
  • Graphviz: This is an open-source graph visualization software. It is used to represent structural information as diagrams of abstract graphs and networks. In this case, it is used to plot the tree diagram of the trained model.

Loading the Iris Dataset

This dataset is freely available and built into the scikit-learn library. It includes three iris species (setosa, versicolor, and virginica) with 50 samples each, along with four measurements for each flower (sepal length, sepal width, petal length, and petal width). We first load this dataset and split it into the input features X and the target variable y. The input features are the measurements of each flower and the target variable is the species of the flower.

# Loading the Iris dataset
iris = load_iris()
# Splitting the data into input features (X) and target variable (y)
X = iris.data
y = iris.target

Next, we create a Pandas dataframe with the data and the column names. This allows us to easily manipulate and explore the data. 

# Creating a Pandas DataFrame with the data and the column names
df = pd.DataFrame(data=X, columns=iris.feature_names)
# Adding the target variable column to the DataFrame
df['target'] = y

To make the data more understandable, we map the integer values of the target variable to the names of the Iris species.

# Creating a dictionary to map the integer label values to the class names
class_mapping = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
# Map the integer label values to the class names
df['target'] = df['target'].map(class_mapping)

Finally, we print out a random sample of 5 rows from our dataset. This is useful to get a quick overview of the data we are working with.

df.sample(n = 5, random_state = 0)

The random sample is shown below:


Applying the Decision Tree Model

Now that we have loaded the dataset, we can move on to applying the Decision Tree model. First, we split the data into training and test sets, which allows us to evaluate the performance of the model on unseen data. We use the train_test_split function from the scikit-learn library, so that 25% of the samples go to the test set and 75% to the training set.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
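As a quick, optional sanity check, printing the shapes of the resulting arrays should confirm the 75/25 split (with the 150 Iris samples, this works out to 112 training and 38 test samples).

# Optional sanity check of the split sizes
print(X_train.shape, X_test.shape)  # expected: (112, 4) (38, 4)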

Next, we create an instance of the DecisionTreeClassifier and train it using the training dataset.

# Creating an instance of DecisionTreeClassifier
decision_tree = DecisionTreeClassifier(criterion = 'gini', splitter = 'best', random_state = 0)
# Training the model using the train set
decision_tree.fit(X_train, y_train)

Once the model is trained, we apply it to make predictions on the test data. We also calculate the accuracy score to evaluate the performance of the model.

# Make predictions on the test set
y_pred = decision_tree.predict(X_test)
# Computing the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("The accuracy is:", accuracy)

The resulting accuracy of the model is 0.974, which indicates good performance.


Understanding the Structure of the Tree

One of the advantages of Decision Trees is that they are relatively easy to interpret. We can visualize the trained Decision Tree model using the sklearn and graphviz libraries.

# Generating a Graphviz representation of the Decision Tree
dot_data = export_graphviz(decision_tree, out_file=None, feature_names=iris.feature_names, class_names=iris.target_names, rounded=True)
# Creating a graph from the Graphviz representation
graph = Source(dot_data)
# Displaying the graph
graph

The export_graphviz function generates a Graphviz representation of the Decision Tree, which is a text language for describing graphs. Then, we use the Source class from the graphviz library to create a graph object from that representation. Finally, we display the graph, which is shown below:
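If the code is not running in a notebook that renders the graph inline, the diagram can also be saved to a file with graphviz's render method (the file name below is just an example). The trained model's get_depth and get_n_leaves methods also give a quick numeric summary of the tree's size.

# Saving the tree diagram to a PNG file (the file name is arbitrary)
graph.render('iris_decision_tree', format='png', cleanup=True)
# Quick numeric summary of the tree structure
print("Depth:", decision_tree.get_depth())
print("Leaves:", decision_tree.get_n_leaves())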


Visualizing Feature Importance

Another advantage of Decision Trees is that they can compute the relative importance of features. This can be very useful in understanding which features are the most influential in making the predictions.

# Get the feature importances
importances = decision_tree.feature_importances_

# Creating a bar plot to visualize the feature importances
plt.rcParams['font.size'] = '12'
fig, ax = plt.subplots(figsize=(8,6), facecolor='#F5F5F5')
ax.set_facecolor('#F5F5F5')
bars = ax.bar(range(len(importances)), importances, tick_label=iris.feature_names)
ax.set_ylabel('Importance')
ax.set_title('Importance of features in decision tree model')
ax.set_xticklabels(iris.feature_names, rotation=45)

# Adding the value above each bar
for bar in bars:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 4), ha='center', va='bottom')

plt.show()

The feature_importances_ attribute of the model gives us the importance of each feature. These importances are computed as the (normalized) total reduction of the criterion (here, the Gini impurity) brought by that feature. We then plot these feature importances on a bar plot for easy visualization.

This graph indicates that sepal length was not used in the decision-making process of the model, as it has an importance of 0.0. Sepal width, with an importance of 0.0201, played a minor role. In contrast, the petal length and petal width features were the most influential in the decision-making process, with importances of 0.3993 and 0.5806, respectively. This suggests that these two features contributed the most to the reduction of impurity in the nodes; in other words, they are the most important features for the Decision Tree model.
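For a plain-text view of the same information, the importances can also be paired directly with the feature names:

# Printing each feature name with its importance score
for name, score in zip(iris.feature_names, importances):
    print(f"{name}: {score:.4f}")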


Training a New Decision Tree Model with Fewer Features

The importance scores obtained from the initial Decision Tree model showed that petal length and petal width are the most influential features for classifying the Iris species. To illustrate the impact of these features, we train a new Decision Tree model using only these two features.

We create new training and testing sets that only include the petal length and petal width features.

# Selecting only the 'petal length' and 'petal width' features (columns 2 and 3) for training and testing
X_train_reduced = X_train[:, 2:4]
X_test_reduced = X_test[:, 2:4]

Then, we define a new instance of the DecisionTreeClassifier and train it using the reduced-features training data.

# Creating a new DecisionTreeClassifier model
decision_tree_2 = DecisionTreeClassifier(criterion = 'gini', splitter = 'best', random_state = 0)
# Train the new model using the reduced training set
decision_tree_2.fit(X_train_reduced, y_train)

Finally, we compute the accuracy of the new model.

# Make predictions on the reduced testing set
y_pred_2 = decision_tree_2.predict(X_test_reduced)
# Computing the accuracy of the new model
accuracy_2 = accuracy_score(y_test, y_pred_2)
print("The accuracy of the new model is:", accuracy_2)

The accuracy of this new model, which only uses the petal length and petal width features, is 0.947. This is slightly lower than the accuracy of the initial model, which was 0.974, but the reduction is minimal.

Moreover, the fact that we are able to achieve nearly the same level of accuracy while reducing the number of features by half is quite significant. In larger, more complex datasets, reducing the dimensionality of the data can lead to substantial computational savings and can also help to mitigate the risk of overfitting.


Conclusion

In this post, we have explored the Decision Tree algorithm and its application in classification tasks. Through our hands-on implementation with the Iris dataset, we have seen how the model can be trained, how predictions can be made, and how the model's performance can be evaluated. We have also seen the power of feature selection: by focusing on the most influential features, we were able to create a simpler model that maintained nearly the same level of accuracy. This highlights the balance between model complexity and performance, a key consideration in machine learning.

In conclusion, we have learned about the power and importance of Decision Tree models; their simplicity, interpretability, and versatility make them a valuable tool in any data scientist's toolkit.

