08 Decision Trees
Decision trees are popular machine learning algorithms that use a tree-like graph of decisions to make predictions. Each internal node represents a decision based on an input feature, and each leaf node represents a predicted outcome.
Decision Tree Example
Deciding whether an animal is a cat or not, based on its features.
| Ear Shape | Face Shape | Whiskers | Is Cat |
|---|---|---|---|
| Pointy | Round | Yes | Yes |
| Floppy | Not round | Yes | Yes |
| Floppy | Round | No | No |
| Pointy | Not round | Yes | No |
| … | … | … | … |
```mermaid
graph TD
A([Ear Shape]) -->|Pointy| B([Face Shape])
A -->|Floppy| C([Whiskers])
B -->|Round| D[Cat]
B -->|Not round| E[Not Cat]
C -->|Yes| F[Cat]
C -->|No| G[Not Cat]
```
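As a minimal sketch, the tree above can be read as nested if/else checks (the function and argument names here are just for illustration):

```python
def is_cat(ear_shape: str, face_shape: str, whiskers: bool) -> bool:
    """Follow the decision tree from the root to a leaf."""
    if ear_shape == "Pointy":
        # Pointy-eared animals are classified by face shape.
        return face_shape == "Round"
    # Floppy-eared animals are classified by whiskers.
    return whiskers

# First row of the table: pointy ears, round face, whiskers -> cat.
print(is_cat("Pointy", "Round", True))   # True
# Third row: floppy ears, round face, no whiskers -> not a cat.
print(is_cat("Floppy", "Round", False))  # False
```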
Learning Process
The learning process of a decision tree is recursive: at each node, select the best feature to split the data on, then repeat for each branch until a stopping condition is met (a code sketch follows the list below).
There are multiple decisions to make:
- Which feature to split on? Select the feature that maximizes purity.
- When to stop splitting? When a node is pure, when its purity is above a threshold, or when a maximum depth is reached.
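A rough sketch of this recursive procedure (the data format and helper structure are illustrative; the split-selection step is a placeholder until information gain is defined below):

```python
from collections import Counter

def build_tree(examples, labels, features, max_depth, depth=0):
    """Recursively grow a decision tree over binary features (sketch)."""
    # Stopping criteria: the node is pure, no features remain, or the
    # maximum depth is reached -> create a leaf with the majority class.
    if len(set(labels)) <= 1 or not features or depth == max_depth:
        return {"leaf": Counter(labels).most_common(1)[0][0] if labels else None}

    # Placeholder split selection: a real implementation picks the feature
    # that maximizes purity / information gain (see the sections below).
    feature = features[0]
    remaining = [f for f in features if f != feature]

    left = [(x, y) for x, y in zip(examples, labels) if x[feature]]
    right = [(x, y) for x, y in zip(examples, labels) if not x[feature]]

    return {
        "split_on": feature,
        "left": build_tree([x for x, _ in left], [y for _, y in left],
                           remaining, max_depth, depth + 1),
        "right": build_tree([x for x, _ in right], [y for _, y in right],
                            remaining, max_depth, depth + 1),
    }

# Tiny example with binarized versions of the features from the table above.
X = [{"pointy": True, "round": True, "whiskers": True},
     {"pointy": False, "round": False, "whiskers": True},
     {"pointy": False, "round": True, "whiskers": False}]
y = ["Cat", "Cat", "Not Cat"]
print(build_tree(X, y, features=["pointy", "round", "whiskers"], max_depth=2))
```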
Variable Naming:
- $w^{\text{left}}$: Fraction of examples that go to the left branch.
- $w^{\text{right}}$: Fraction of examples that go to the right branch.
- $p_0$: Fraction of negative examples.
- $p_1$: Fraction of positive examples.
- $p_i^{\text{parent}}$: Fraction of positive/negative examples in the parent node.
- $p_i^{\text{left}}$: Fraction of positive/negative examples in the left branch.
- $p_i^{\text{right}}$: Fraction of positive/negative examples in the right branch.
- $H(p_1)$: Entropy as a function of the fraction of positive examples.
Purity
Purity measures how homogeneous the labels within a node are. Entropy is a common measure of impurity (it measures the randomness in the data). It is calculated as: $$H(p_1) = -p_1 \log_2(p_1) - (1-p_1) \log_2(1 - p_1)$$
Example
For the cat example, the entropy would be calculated as: $$H(X) = -p_{\text{Cat}} \log_2(p_{\text{Cat}}) - p_{\text{Not Cat}} \log_2(p_{\text{Not Cat}})$$ Where $p_{\text{Cat}}$ and $p_{\text{Not Cat}}$ are the probabilities of the classes.
In a set of 6 animals, if 5 are cats and 1 is not a cat, the entropy would be: $$H(p_1) = -\left(\frac{5}{6} \log_2 \frac{5}{6} + \frac{1}{6} \log_2 \frac{1}{6}\right) \approx 0.65$$
When calculating entropy, use the convention $0 \log_2(0) = 0$!
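A minimal sketch of this entropy calculation (the function name is illustrative):

```python
import math

def entropy(p1: float) -> float:
    """Entropy H(p1), where p1 is the fraction of positive examples in a node."""
    if p1 == 0.0 or p1 == 1.0:
        return 0.0  # convention: 0 * log2(0) = 0, so a pure node has entropy 0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

print(entropy(5 / 6))  # ~0.65, matching the 5-cats-out-of-6 example
print(entropy(0.5))    # 1.0, maximum impurity
```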
Information Gain
Information gain is the measure of how much the entropy decreases after splitting the data. It is calculated as: $$\text{Information Gain} = H(p_1^{\text{parent}}) - (w^{\text{left}}H(p_1^{\text{left}}) + w^{\text{right}}H(p_1^{\text{right}}))$$
Example
For the cat example, the information gain when splitting on the ear shape would be calculated as: $$\text{Information Gain} = H\left(\frac{5}{10}\right) - \left(\frac{5}{10}H\left(\frac{4}{5}\right) + \frac{5}{10}H\left(\frac{1}{5}\right)\right) \approx 0.28$$
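A small sketch that reproduces this number, reusing the same entropy definition as above:

```python
import math

def entropy(p1: float) -> float:
    # Same entropy as above, with the 0 * log2(0) = 0 convention.
    return 0.0 if p1 in (0.0, 1.0) else -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def information_gain(p_parent, p_left, p_right, w_left, w_right):
    """Reduction in entropy obtained by splitting the parent node into two branches."""
    return entropy(p_parent) - (w_left * entropy(p_left) + w_right * entropy(p_right))

# Ear-shape split: 10 animals at the root (5 cats), 5 go left with 4 cats,
# 5 go right with 1 cat.
print(information_gain(p_parent=5/10, p_left=4/5, p_right=1/5,
                       w_left=5/10, w_right=5/10))  # ~0.28
```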
One Hot Encoding
Decision trees work naturally with categorical data. One-hot encoding is a technique to convert categorical data into numerical data: a categorical feature with $k$ possible values is replaced by $k$ binary features, one per category, where exactly one of them is 1 for each example.
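A brief illustration with pandas (the three-valued `EarShape` column is hypothetical; the table above only shows two values):

```python
import pandas as pd

# Hypothetical categorical feature with three possible values.
animals = pd.DataFrame({"EarShape": ["Pointy", "Floppy", "Oval", "Pointy"]})

# One-hot encode: each category becomes its own binary (0/1) column.
encoded = pd.get_dummies(animals, columns=["EarShape"], dtype=int)
print(encoded)
#    EarShape_Floppy  EarShape_Oval  EarShape_Pointy
# 0                0              0                1
# 1                1              0                0
# 2                0              1                0
# 3                0              0                1
```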
Regression Trees
Decision trees can also be used for regression tasks. Instead of predicting a class, they predict a continuous value (typically the mean of the training targets that end up in a leaf).
Instead of information gain, regression trees use variance reduction: choose the split that maximizes $\text{Var}^{\text{parent}} - (w^{\text{left}}\,\text{Var}^{\text{left}} + w^{\text{right}}\,\text{Var}^{\text{right}})$. Variance is a measure of how spread out the data is. It is calculated as: $$\text{Variance} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
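A minimal sketch of variance reduction, mirroring the information-gain formula (the target values below are made up):

```python
import numpy as np

def variance_reduction(y_parent, y_left, y_right):
    """Decrease in variance from splitting the parent node (regression analogue of information gain)."""
    w_left = len(y_left) / len(y_parent)
    w_right = len(y_right) / len(y_parent)
    return np.var(y_parent) - (w_left * np.var(y_left) + w_right * np.var(y_right))

# Hypothetical continuous targets (e.g. animal weights) split by some feature.
parent = np.array([7.2, 8.4, 7.6, 10.2, 9.2, 8.8])
left, right = parent[:3], parent[3:]
print(variance_reduction(parent, left, right))  # positive: the split reduces variance
```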
Tree Ensemble
Decision trees can be combined into an ensemble: train multiple trees and combine their predictions (majority vote for classification, average for regression). This reduces sensitivity to small changes in the training data.
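A tiny sketch of how the predictions might be combined (the individual tree outputs below are hypothetical):

```python
from collections import Counter
from statistics import mean

def ensemble_classify(tree_predictions):
    """Majority vote over the classes predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def ensemble_regress(tree_predictions):
    """Average of the values predicted by the individual trees."""
    return mean(tree_predictions)

print(ensemble_classify(["Cat", "Cat", "Not Cat"]))  # "Cat"
print(ensemble_regress([7.4, 8.1, 7.9]))             # ~7.8
```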
Random Forest
Random forests are an ensemble of decision trees. Each tree is trained on a random sample of the data, drawn with replacement from the original training set (a bootstrap sample). Predictions are then made with all trees and combined.
In addition to the random sampling of data, random forests also use random feature selection: at each split, only a random subset of the features is considered when searching for the best split (a common choice is $k \approx \sqrt{n}$ out of $n$ features). This tends to help most on larger datasets with many features.
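As an illustration, a random forest could be trained with scikit-learn along these lines (the synthetic dataset and parameter values are only for demonstration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic tabular data standing in for the animal features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 100 trees, each fit on a bootstrap sample (drawn with replacement) and
# considering a random subset of features (sqrt of the total) at every split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))
```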
XGBoost
XGBoost is an optimized implementation of boosted decision trees. Instead of picking samples from the training set with equal probability (as bagging does), boosting makes it more likely to pick examples that were misclassified by the previously trained trees, so each new tree focuses on the ensemble's current mistakes.
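A minimal usage sketch, assuming the `xgboost` package is installed (its `XGBClassifier` exposes the scikit-learn fit/predict interface; the dataset and parameter values are arbitrary):

```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Boosted trees: each new tree is trained to focus on the examples
# the previous trees got wrong.
model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X, y)
print(model.predict(X[:5]))
```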
Decision Trees vs Neural Networks
Decision Trees / Tree Ensembles:
- Work very well for tabular data.
- Are fast to train and easy to implement.
- Smaller forests or single trees are human interpretable.
Neural Networks:
- Work well with all kinds of data, structured and unstructured.
- May be slower to train than decision trees.
- Work with transfer learning (a pre-trained network can be fine-tuned).
- May be easier to string together when building a system of multiple models.