Parameters in decision trees
One of the most important parameters for a decision tree is the stopping criterion. As tree building nears completion, the final few decisions are often somewhat arbitrary, relying on only a small number of samples. Using such specific nodes can result in trees that significantly overfit the training data. Instead, a stopping criterion can be used to ensure that the decision tree does not grow to this level of specificity.
Instead of using a stopping criterion, the tree could be created in full and then trimmed. This trimming process removes nodes that do not provide much information to the overall model. This is known as pruning, and it generally results in a model that does better on new datasets because it has not overfit the training data.
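Scikit-learn supports this approach through minimal cost-complexity pruning, controlled by the ccp_alpha parameter of DecisionTreeClassifier. The following is a minimal sketch, assuming the Iris dataset purely for illustration; the ccp_alpha value of 0.02 is arbitrary and would normally be chosen with cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

# A fully grown tree keeps splitting until its leaves are pure, which tends to overfit
full_tree = DecisionTreeClassifier(random_state=14)
full_tree.fit(X_train, y_train)

# Setting ccp_alpha > 0 prunes branches that contribute little impurity reduction;
# the value 0.02 here is only for illustration
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=14)
pruned_tree.fit(X_train, y_train)

print("Unpruned leaves:", full_tree.get_n_leaves(),
      "accuracy:", full_tree.score(X_test, y_test))
print("Pruned leaves:  ", pruned_tree.get_n_leaves(),
      "accuracy:", pruned_tree.score(X_test, y_test))
```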
The decision tree implementation in scikit-learn provides the following options for stopping the building of the tree:
- min_samples_split: This specifies how many samples are needed to create a new node in the decision tree
- min_samples_leaf: This specifies how many samples must end up in each of the nodes resulting from a split for that split to be kept
The first dictates whether a decision node will be created, while the second dictates whether a decision node will be kept.
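Both options are passed as keyword arguments when constructing the classifier. The sketch below assumes the Iris dataset purely for illustration, and the thresholds themselves are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# By default, a node can be split with as few as two samples
default_tree = DecisionTreeClassifier(random_state=14).fit(X, y)

# Require at least 20 samples to split a node, and at least 10 samples
# in each resulting leaf for the split to be kept
constrained_tree = DecisionTreeClassifier(min_samples_split=20,
                                          min_samples_leaf=10,
                                          random_state=14).fit(X, y)

print("Leaves in the default tree:    ", default_tree.get_n_leaves())
print("Leaves in the constrained tree:", constrained_tree.get_n_leaves())
```

The constrained tree stops growing earlier, trading a little training accuracy for a simpler model that is less likely to overfit.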
Another parameter for decision trees is the criterion for creating a decision. Gini impurity and information gain are two popular options for this parameter:
- Gini impurity: This is a measure of how often a decision node would incorrectly predict a sample's class
- Information gain: This uses information-theory-based entropy to indicate how much extra information is gained by the decision node
These criteria do approximately the same thing: they decide which rule and value to use to split a node into subnodes. The parameter simply selects which metric is used to evaluate that split; however, this choice can have a significant impact on the final model.
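The criterion is set the same way. As a small sketch (the class proportions and dataset below are assumptions chosen only for illustration), both metrics can be computed directly from a node's class proportions, and either one can be passed to the classifier; note that scikit-learn's name for the information gain criterion is 'entropy':

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Both metrics are functions of the class proportions at a node
proportions = np.array([0.7, 0.2, 0.1])

gini = 1 - np.sum(proportions ** 2)                    # Gini impurity: 1 - sum(p_i^2)
entropy = -np.sum(proportions * np.log2(proportions))  # entropy: -sum(p_i * log2(p_i))
print("Gini impurity:", gini)
print("Entropy:      ", entropy)

# Choosing which metric the tree uses to evaluate candidate splits
X, y = load_iris(return_X_y=True)
gini_tree = DecisionTreeClassifier(criterion="gini").fit(X, y)
entropy_tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
```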