How to train a decision tree?
Let's say I want to train an algorithm that can predict whether I will bike or drive to work on a given day. I come up with a list of factors that I think might affect this decision. They are: (i) is it raining?, (ii) do I have to carry my backpack?, and (iii) am I tired?
To train a model that can predict my transportation medium, I collect some historical data to use as training data.
Raining | Backpack | Tired | Transportation Medium |
---|---|---|---|
Yes | Yes | No | Drive
No | No | No | Bike
No | Yes | No | Bike
No | Yes | Yes | Drive
I intuitively know that rain is probably going to be the main deciding factor, so I try to simplify the problem. Limiting the data to just rain, I get the following dataset.
Raining | Transportation Medium |
---|---|
Yes | Drive |
No | Bike |
No | Bike |
No | Drive |
Based on the data, I come up with a simple algorithm.
    if raining:
        drive
    else:
        bike
This algorithm gives the correct prediction for 3 out of the 4 observations we see here. 3 out of 4 is not bad, but can I improve it? How about I add the tiredness factor too?
Raining | Tired | Transportation Medium |
---|---|---|
Yes | No | Drive
No | No | Bike |
No | No | Bike |
No | Yes | Drive |
Now, I can update the algorithm to account for both factors.
    if raining:
        drive
    else:
        if tired:
            drive
        else:
            bike
Now the algorithm predicts all 4 observations correctly. Awesome!
What I built above is a decision tree that predicts an outcome (bike or drive) based on certain inputs (raining, backpack, tired). To get a prediction for a new observation, I will start at the tree's root (raining), follow the branches (yes or no) and other splits (tired, yes or no) until I reach a leaf (drive or bike).
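One way to picture this traversal in code is with a small nested dictionary. This is just an illustrative sketch of the tree above; the representation and the predict helper are my own, not part of any library.

    # A hypothetical encoding of the tree above: each internal node is a dict
    # keyed by the parameter it tests, and each leaf is the final prediction.
    tree = {
        "raining": {
            True: "drive",
            False: {"tired": {True: "drive", False: "bike"}},
        }
    }

    def predict(node, observation):
        # Walk from the root, following the branch that matches the
        # observation's value for the tested parameter, until a leaf.
        while isinstance(node, dict):
            parameter = next(iter(node))
            node = node[parameter][observation[parameter]]
        return node

    print(predict(tree, {"raining": False, "tired": True}))  # prints "drive"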
In this particular example, my intuition guided my choice of the root node as well as the other node to base the decision on.
The more general problem is training a decision tree to predict an outcome y based on n input parameters x_1, x_2, ..., x_n.
To start with, we have some observational data.
x_1 | x_2 | ... | x_n | y |
---|---|---|---|---|
a_1 | a_2 | ... | a_n | y_a |
b_1 | b_2 | ... | b_n | y_b |
c_1 | c_2 | ... | c_n | y_c |
... | ... | ... | ... | ... |
To build a decision tree to predict y based on x_1, x_2, ..., x_n, we need to choose the parameters to split on (raining and tired in the above example) and the order in which to consider them (split on raining before tired in the above example).
The answer to both of these comes from the idea of mutual information. The mutual information of the decision y with respect to a parameter (say) x_1 is a measure of how much of the decision's information is contained in that parameter. This is a fancy way of saying that the mutual information of y with respect to x_1 refers to how "similar" y is to x_1, or how much of y can be inferred from x_1.
Mathematically, the mutual information of y with respect to x_1 is the difference between the entropy of y and the entropy of y given x_1.
mutual_information(y, x_1) = entropy(y) - entropy(y | x_1)
Entropy is a measure of the randomness of a variable. A high value of mutual information means that the entropy of y given x_1 is significantly less than the entropy of y. That is, knowing x_1 significantly reduces the randomness of y.
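As a concrete sketch, entropy and mutual information can be computed directly from value counts. The helper functions below are my own and assume categorical data:

    from collections import Counter
    from math import log2

    def entropy(values):
        # H(Y) = -sum over observed values v of p(v) * log2(p(v)).
        counts = Counter(values)
        total = len(values)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def conditional_entropy(y, x):
        # H(Y | X): entropy of y within each group of x, weighted by group size.
        groups = {}
        for xi, yi in zip(x, y):
            groups.setdefault(xi, []).append(yi)
        total = len(y)
        return sum(len(g) / total * entropy(g) for g in groups.values())

    def mutual_information(y, x):
        # I(Y; X) = H(Y) - H(Y | X), matching the formula above.
        return entropy(y) - conditional_entropy(y, x)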
Now let's see how to use mutual information to train our decision tree.
Similar to our intuition with the bike or drive example above, we want to use the input parameters that will help us get to the answer the fastest. This is exactly what mutual information tells us. A parameter with large mutual information implies that it can significantly reduce the randomness of the output, i.e., it can predict the output with a high degree of confidence. So, we choose the parameter with the largest mutual information as the first parameter to split on.
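Using the helpers sketched above on the four observations from the bike-or-drive table (values transcribed from that table), picking the first split could look like this:

    # The four observations from the bike-or-drive example.
    raining  = ["Yes", "No", "No", "No"]
    backpack = ["Yes", "No", "Yes", "Yes"]
    tired    = ["No", "No", "No", "Yes"]
    y        = ["Drive", "Bike", "Bike", "Drive"]

    candidates = {"raining": raining, "backpack": backpack, "tired": tired}
    scores = {name: mutual_information(y, x) for name, x in candidates.items()}
    # On a dataset this small several parameters can tie; max() keeps the first.
    root = max(scores, key=scores.get)
    print(scores, root)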
Let's say we decide to split on x_1. Our decision tree would start to shape up like this.
    switch x_1:
        case value_1:
            ...
        case value_2:
            ...
        ...
        case value_k:
            ...
x_1 can take multiple values (and not just true or false), so we end up with as many branches as the number of values that x_1 can take.
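In code, a multi-way split amounts to partitioning the training rows by the value they take for the chosen parameter. A minimal sketch, assuming each row is a dict of parameter values (the helper name is my own):

    def partition(rows, parameter):
        # One branch per observed value of `parameter`.
        branches = {}
        for row in rows:
            branches.setdefault(row[parameter], []).append(row)
        return branches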
Now we need to populate each of these branches with the next parameter to base the decision on. Simple! We will just repeat the same process for each of the branches. For a branch where x_1 takes the value value_1, we have effectively reduced the problem and now only need to consider the training data where x_1 = value_1. In this subset of data, we need to find the next parameter to base the decision on. Mutual information to the rescue again. We compute the mutual information of y with all the remaining parameters x_2, x_3, ..., x_n, and choose the parameter with the highest mutual information as the one to split on.
We will continue this process for all the branches and sub-branches till we have populated the entire tree.
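Putting these pieces together, a bare-bones greedy builder could look like the sketch below. It reuses mutual_information and partition from the earlier snippets, represents each training row as a dict, and is only meant to illustrate the recursion, not to be a production implementation.

    from collections import Counter

    def build_tree(rows, parameters, target):
        y = [row[target] for row in rows]
        # Stop with a leaf when the output is pure or no parameters remain;
        # the leaf predicts the most common output in this subset.
        if len(set(y)) == 1 or not parameters:
            return Counter(y).most_common(1)[0][0]
        # Greedily pick the parameter with the largest mutual information.
        best = max(parameters,
                   key=lambda p: mutual_information(y, [row[p] for row in rows]))
        remaining = [p for p in parameters if p != best]
        # Recurse into one sub-tree per observed value of the chosen parameter.
        return {best: {value: build_tree(subset, remaining, target)
                       for value, subset in partition(rows, best).items()}}

The nested dictionary it returns has the same shape as the hand-written tree sketched earlier, so the same predict helper can walk it.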
As an example, for a three-input problem (x_1, x_2, x_3) where each input can take only two values (true or false), the decision tree would look like the one below.
    if x_1:
        if x_2:
            if x_3:
                prediction_true_true_true
            else:
                prediction_true_true_false
        else:
            if x_3:
                prediction_true_false_true
            else:
                prediction_true_false_false
    else:
        if x_2:
            if x_3:
                prediction_false_true_true
            else:
                prediction_false_true_false
        else:
            if x_3:
                prediction_false_false_true
            else:
                prediction_false_false_false
Done? Not quite.
There are two problems with the model.
The first is that it requires 2^n observations to cover every path for n two-valued input parameters (e.g., 8 observations for the 3 input parameters above). This isn't always feasible.
Let's say we solve this problem by ignoring the paths for which we don't have training data. The second problem with the model is that it is prone to overfitting.
To understand overfitting, let's go back to our bike or drive example. Let's say we had this observation in the training dataset.
Raining | Backpack | Tired | Transportation Medium |
---|---|---|---|
Yes | Yes | Yes | Bike |
If we follow the above algorithm, we might end up with a path in the tree that says:
    if raining:
        if tired:
            if backpack:
                bike
However, this does not seem like a reasonable path. In the future, if there is a day when it is raining, I am tired, and I have a backpack to carry, I would expect the model to predict that I would drive to work. However, the model we have just trained would predict that I would bike.
This is because the model has overfit to the training data. This observation in the training data could have been because of some other unknown factor. For example, maybe the car was being serviced and thus unavailable that day, which caused me to commute via bike. Such observations are called noise in the data.
Overfitting refers to when the model fits the training data extremely well but generalizes to unseen data poorly. In trying to fit the training data well, the model starts fitting to the noise instead of the underlying pattern.
Complex models are more prone to overfitting because they tend to start modeling the noise. The principle of preferring simpler models to avoid overfitting is a form of inductive bias.
For decision trees, the inductive bias is to look for the shortest tree that performs well on the training data and would also generalize to unseen data. To that end, we stop splitting a branch when either of the following becomes true: (i) the entropy of the output y is low enough, or (ii) the mutual information of y with every remaining parameter is low enough. In either of these scenarios, splitting any further does not provide much value and runs the risk of overfitting to the training data. The threshold for "low enough" is a parameter that can be tuned to get low error rates on test data.
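In the build_tree sketch above, these checks could be expressed as two tunable thresholds; the names and values below are arbitrary placeholders, not standard settings.

    MIN_ENTROPY = 0.2      # "entropy of y is low enough"
    MIN_MUTUAL_INFO = 0.1  # "mutual information is low enough"

    def should_stop(y, rows, parameters):
        # Stop splitting if the output is already nearly pure, or if even the
        # best remaining parameter tells us little about the output.
        if entropy(y) < MIN_ENTROPY:
            return True
        best_info = max((mutual_information(y, [row[p] for row in rows])
                         for p in parameters), default=0.0)
        return best_info < MIN_MUTUAL_INFO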
Decision trees are useful for models where the input parameters are categorical, i.e., they can take values only from a fixed set. Examples of categorical variables are raining or t-shirt size. These are different from continuous parameters that can take any value from a range, e.g., the amount of rain (in cm) or the length of a t-shirt (in inches). Continuous variables are not good candidates for decision tree models because a split on a continuous parameter would need a branch for each of the infinitely many values it can take. Of course, we could group the values of continuous variables into ranges and convert them to categorical values if it makes sense for the problem at hand.
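For example, a continuous measurement like the amount of rain could be bucketed into a few categories before training; the cut points below are made up purely for illustration.

    def rain_category(rain_cm):
        # Map a continuous rainfall amount (in cm) to a coarse category.
        if rain_cm == 0:
            return "none"
        if rain_cm < 1:
            return "light"
        return "heavy"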
To recap:
- Decision trees work by greedily choosing the parameters that best describe the output and splitting on those.
- Mutual information is a way to measure the similarity of the output to an input parameter.
- Inductive bias is a way to avoid overfitting by choosing a simpler model. For decision trees, finding the smallest tree that provides a "low enough" error rate is the inductive bias.
References
- Lectures from the "Introduction to Machine Learning (10-701)" course from Carnegie Mellon University