08 Decision Trees
Decision trees are popular machine learning algorithms that use a tree-like graph of decisions to make predictions. Each internal node represents a decision based on an input feature, and each leaf node represents a predicted outcome.
Decision Tree Example
Deciding whether an animal is a cat or not, based on its features.
| Ear Shape | Face Shape | Whiskers | Is Cat |
|---|---|---|---|
| Pointy | Round | Yes | Yes |
| Floppy | Not round | Yes | Yes |
| Floppy | Round | No | No |
| Pointy | Not round | Yes | No |
| … | … | … | … |
```mermaid
graph TD
A([Ear Shape]) -->|Pointy| B([Face Shape])
A -->|Floppy| C([Whiskers])
B -->|Round| D[Cat]
B -->|Not round| E[Not Cat]
C -->|Yes| F[Cat]
C -->|No| G[Not Cat]
```
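As a minimal sketch, the tree above can be read as nested if/else checks (the function and argument names here are just for illustration):

```python
def is_cat(ear_shape: str, face_shape: str, whiskers: bool) -> bool:
    """Follow the decision tree from the root to a leaf."""
    if ear_shape == "Pointy":
        # Pointy-eared animals are classified by face shape.
        return face_shape == "Round"
    # Floppy-eared animals are classified by whiskers.
    return whiskers

# First row of the table: pointy ears, round face, whiskers -> cat.
print(is_cat("Pointy", "Round", True))   # True
# Third row: floppy ears, round face, no whiskers -> not a cat.
print(is_cat("Floppy", "Round", False))  # False
```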
Learning Process
The learning process of a decision tree is recursive: at each node, select the best feature to split the data on, then repeat for each branch until a stopping condition is met (a code sketch follows the list below).
There are multiple decisions to make:
- Which feature to split on? Select the feature that maximizes purity.
- When to stop splitting? When a node is pure, when its purity is above a threshold, or when a maximum depth is reached.
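A rough sketch of this recursive procedure (the data format and helper structure are illustrative; the split-selection step is a placeholder until information gain is defined below):

```python
from collections import Counter

def build_tree(examples, labels, features, max_depth, depth=0):
    """Recursively grow a decision tree over binary features (sketch)."""
    # Stopping criteria: the node is pure, no features remain, or the
    # maximum depth is reached -> create a leaf with the majority class.
    if len(set(labels)) <= 1 or not features or depth == max_depth:
        return {"leaf": Counter(labels).most_common(1)[0][0] if labels else None}

    # Placeholder split selection: a real implementation picks the feature
    # that maximizes purity / information gain (see the sections below).
    feature = features[0]
    remaining = [f for f in features if f != feature]

    left = [(x, y) for x, y in zip(examples, labels) if x[feature]]
    right = [(x, y) for x, y in zip(examples, labels) if not x[feature]]

    return {
        "split_on": feature,
        "left": build_tree([x for x, _ in left], [y for _, y in left],
                           remaining, max_depth, depth + 1),
        "right": build_tree([x for x, _ in right], [y for _, y in right],
                            remaining, max_depth, depth + 1),
    }

# Tiny example with binarized versions of the features from the table above.
X = [{"pointy": True, "round": True, "whiskers": True},
     {"pointy": False, "round": False, "whiskers": True},
     {"pointy": False, "round": True, "whiskers": False}]
y = ["Cat", "Cat", "Not Cat"]
print(build_tree(X, y, features=["pointy", "round", "whiskers"], max_depth=2))
```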
Variable Naming:
- $w^{\text{left}}$: Fraction of examples that go to the left branch.
- $w^{\text{right}}$: Fraction of examples that go to the right branch.
- $p_0$: Fraction of negative examples.
- $p_1$: Fraction of positive examples.
- $p_i^{\text{parent}}$: Fraction of positive/negative examples in the parent node.
- $p_i^{\text{left}}$: Fraction of positive/negative examples in the left branch.
- $p_i^{\text{right}}$: Fraction of positive/negative examples in the right branch.
- $H(p_1)$: Entropy as a function of the fraction of positive examples.
Purity
Purity measures how homogeneous the labels within a node are. Entropy is a common measure of impurity (it measures the randomness in the data). It is calculated as: $$H(p_1) = -p_1 \log_2(p_1) - (1-p_1) \log_2(1 - p_1)$$
Example
For the cat example, the entropy would be calculated as: $$H(X) = -p_{\text{Cat}} \log_2(p_{\text{Cat}}) - p_{\text{Not Cat}} \log_2(p_{\text{Not Cat}})$$ Where $p_{\text{Cat}}$ and $p_{\text{Not Cat}}$ are the probabilities of the classes.
In a set of 6 animals, if 5 are cats and 1 is not a cat, the entropy would be: $$H(p_1) = -\left(\frac{5}{6} \log_2 \frac{5}{6} + \frac{1}{6} \log_2 \frac{1}{6}\right) \approx 0.65$$
When calculating entropy, use the convention $0 \log_2(0) = 0$!
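A minimal sketch of this entropy calculation (the function name is illustrative):

```python
import math

def entropy(p1: float) -> float:
    """Entropy H(p1), where p1 is the fraction of positive examples in a node."""
    if p1 == 0.0 or p1 == 1.0:
        return 0.0  # convention: 0 * log2(0) = 0, so a pure node has entropy 0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

print(entropy(5 / 6))  # ~0.65, matching the 5-cats-out-of-6 example
print(entropy(0.5))    # 1.0, maximum impurity
```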
Information Gain
Information gain is the measure of how much the entropy decreases after splitting the data. It is calculated as: $$\text{Information Gain} = H(p_1^{\text{parent}}) - (w^{\text{left}}H(p_1^{\text{left}}) + w^{\text{right}}H(p_1^{\text{right}}))$$
Example
For the cat example, the information gain when splitting on the ear shape would be calculated as: $$\text{Information Gain} = H\left(\frac{5}{10}\right) - \left(\frac{5}{10}H\left(\frac{4}{5}\right) + \frac{5}{10}H\left(\frac{1}{5}\right)\right) \approx 0.28$$
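A small sketch that reproduces this number, reusing the same entropy definition as above:

```python
import math

def entropy(p1: float) -> float:
    # Same entropy as above, with the 0 * log2(0) = 0 convention.
    return 0.0 if p1 in (0.0, 1.0) else -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def information_gain(p_parent, p_left, p_right, w_left, w_right):
    """Reduction in entropy obtained by splitting the parent node into two branches."""
    return entropy(p_parent) - (w_left * entropy(p_left) + w_right * entropy(p_right))

# Ear-shape split: 10 animals at the root (5 cats), 5 go left with 4 cats,
# 5 go right with 1 cat.
print(information_gain(p_parent=5/10, p_left=4/5, p_right=1/5,
                       w_left=5/10, w_right=5/10))  # ~0.28
```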
One Hot Encoding
Decision trees work naturally with categorical data. One-hot encoding is a technique to convert categorical data into numerical data: a categorical feature with $k$ possible values is replaced by $k$ binary features, one per category, where exactly one of them is 1 for each example.
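A brief illustration with pandas (the three-valued `EarShape` column is hypothetical; the table above only shows two values):

```python
import pandas as pd

# Hypothetical categorical feature with three possible values.
animals = pd.DataFrame({"EarShape": ["Pointy", "Floppy", "Oval", "Pointy"]})

# One-hot encode: each category becomes its own binary (0/1) column.
encoded = pd.get_dummies(animals, columns=["EarShape"], dtype=int)
print(encoded)
#    EarShape_Floppy  EarShape_Oval  EarShape_Pointy
# 0                0              0                1
# 1                1              0                0
# 2                0              1                0
# 3                0              0                1
```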
Regression Trees
Decision trees can also be used for regression tasks. Instead of predicting a class, they predict a continuous value (typically the mean of the training targets that end up in a leaf).
Instead of information gain, regression trees use variance reduction: choose the split that maximizes $\text{Var}^{\text{parent}} - (w^{\text{left}}\,\text{Var}^{\text{left}} + w^{\text{right}}\,\text{Var}^{\text{right}})$. Variance is a measure of how spread out the data is. It is calculated as: $$\text{Variance} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
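A minimal sketch of variance reduction, mirroring the information-gain formula (the target values below are made up):

```python
import numpy as np

def variance_reduction(y_parent, y_left, y_right):
    """Decrease in variance from splitting the parent node (regression analogue of information gain)."""
    w_left = len(y_left) / len(y_parent)
    w_right = len(y_right) / len(y_parent)
    return np.var(y_parent) - (w_left * np.var(y_left) + w_right * np.var(y_right))

# Hypothetical continuous targets (e.g. animal weights) split by some feature.
parent = np.array([7.2, 8.4, 7.6, 10.2, 9.2, 8.8])
left, right = parent[:3], parent[3:]
print(variance_reduction(parent, left, right))  # positive: the split reduces variance
```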
Tree Ensemble
Decision trees can be combined into an ensemble: train multiple trees and combine their predictions (majority vote for classification, average for regression). This reduces sensitivity to small changes in the training data.
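A tiny sketch of how the predictions might be combined (the individual tree outputs below are hypothetical):

```python
from collections import Counter
from statistics import mean

def ensemble_classify(tree_predictions):
    """Majority vote over the classes predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def ensemble_regress(tree_predictions):
    """Average of the values predicted by the individual trees."""
    return mean(tree_predictions)

print(ensemble_classify(["Cat", "Cat", "Not Cat"]))  # "Cat"
print(ensemble_regress([7.4, 8.1, 7.9]))             # ~7.8
```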
Random Forest
Random forests are an ensemble of decision trees. Each tree is trained on a random sample of the data, drawn with replacement from the original training set (a bootstrap sample). Predictions are then made with all trees and combined.
In addition to the random sampling of data, random forests also use random feature selection: at each split, only a random subset of the features is considered when searching for the best split (a common choice is $k \approx \sqrt{n}$ out of $n$ features). This tends to help most on larger datasets with many features.
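As an illustration, a random forest could be trained with scikit-learn along these lines (the synthetic dataset and parameter values are only for demonstration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic tabular data standing in for the animal features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 100 trees, each fit on a bootstrap sample (drawn with replacement) and
# considering a random subset of features (sqrt of the total) at every split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))
```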
XGBoost
XGBoost is an optimized implementation of boosted decision trees. Instead of picking samples from the training set with equal probability (as bagging does), boosting makes it more likely to pick examples that were misclassified by the previously trained trees, so each new tree focuses on the ensemble's current mistakes.
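A minimal usage sketch, assuming the `xgboost` package is installed (its `XGBClassifier` exposes the scikit-learn fit/predict interface; the dataset and parameter values are arbitrary):

```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Boosted trees: each new tree is trained to focus on the examples
# the previous trees got wrong.
model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X, y)
print(model.predict(X[:5]))
```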
Decision Trees vs Neural Networks
Decision Trees / Tree Ensembles:
- Work very well for tabular data.
- Are fast to train and easy to implement.
- Smaller forests or single trees are human interpretable.
Neural Networks:
- Work well with all kinds of data, structured and unstructured.
- May be slower to train than decision trees.
- Work with transfer learning (a pre-trained network can be fine-tuned).
- May be easier to string together when building a system of multiple models.