A decision tree recursively partitions the feature space using binary splits. Each internal node tests a condition on one feature. Each leaf node makes a prediction based on the training data that falls into that region.
The tree chooses splits that maximize the reduction in impurity. For classification, with class proportions p_k in a node, the two standard impurity measures are the Gini index G = sum_k p_k (1 - p_k) and the entropy H = -sum_k p_k log2 p_k.
Gini and entropy produce nearly identical splits in practice; Gini is cheaper to compute because it avoids the logarithm.
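As a quick sanity check, the two impurity measures can be sketched in a few lines (a minimal illustration, not code from the app):

```python
import numpy as np

def gini(p):
    """Gini impurity for a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy (in bits) for a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

# A pure node has zero impurity under both measures;
# a 50/50 node is maximally impure: Gini = 0.5, entropy = 1 bit.
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # → 0.5 1.0
```

Both curves peak at uniform class proportions and vanish at purity, which is why they rank candidate splits almost identically.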
Each split is chosen to maximize the information gain at that node. The algorithm is greedy: it optimizes locally, not globally. Press Step to watch the tree grow one split at a time.
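The greedy scoring of a single candidate split can be sketched as follows (an illustrative snippet using entropy; the toy labels and threshold are made up for the example):

```python
import numpy as np

def entropy_of_labels(y):
    """Entropy (bits) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, left_mask):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(y)
    y_left, y_right = y[left_mask], y[~left_mask]
    child = (len(y_left) / n) * entropy_of_labels(y_left) \
          + (len(y_right) / n) * entropy_of_labels(y_right)
    return entropy_of_labels(y) - child

y = np.array([0, 0, 0, 1, 1, 1])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# A perfect split recovers the full parent entropy of 1 bit;
# a poor split recovers much less.
print(information_gain(y, x <= 3.5))  # → 1.0
print(information_gain(y, x <= 1.5))  # ≈ 0.19
```

The algorithm evaluates a gain like this for every feature and threshold at the current node and keeps the best one, never revisiting earlier choices.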
Deep trees memorize training data (low bias, high variance). Shallow trees underfit (high bias, low variance). The optimal depth minimizes test error.
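The depth/overfitting tradeoff is easy to reproduce outside the app; here is a minimal sketch with scikit-learn on a synthetic dataset (the dataset and parameter values are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 3, 10, None):  # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
# Training accuracy rises to 1.0 with depth, while test accuracy
# stops improving (and can fall) once the tree starts memorizing.
```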
Rather than setting max depth directly, cost-complexity pruning penalizes the number of leaves: it minimizes R_alpha(T) = R(T) + alpha * |T|, where R(T) is the training error of tree T and |T| is its number of leaves.
Large alpha means aggressive pruning (fewer leaves). The optimal alpha is found by cross-validation.
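scikit-learn exposes this directly; a minimal sketch (the dataset is an illustrative choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Compute the sequence of effective alphas at which subtrees get pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Refit at increasing alpha: a stronger penalty yields fewer leaves.
for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    print(round(float(alpha), 4), pruned.get_n_leaves())
```

Sweeping `ccp_alpha` over `path.ccp_alphas` and scoring each tree by cross-validation is the usual way to pick the pruning level.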
The Bootstrap & CV app demonstrates how K-fold cross-validation can formally select the best tree depth or pruning parameter.
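The depth selection the app demonstrates looks roughly like this in code (a sketch with scikit-learn; the dataset and depth grid are illustrative, not the app's):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

depths = [1, 2, 3, 5, 8, None]
# Mean 5-fold CV accuracy for each candidate depth.
scores = [cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                          X, y, cv=5).mean() for d in depths]
best = depths[int(np.argmax(scores))]
print(best, max(scores))
```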
Bagging fits many trees to bootstrap samples of the training data, then averages their predictions. Each bootstrap sample draws n observations with replacement.
On average, each bootstrap sample contains about 63.2% of the distinct original observations: the chance that a given observation appears at least once is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632. The remaining 36.8% are 'out-of-bag' (OOB) for that sample.
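The 63.2% figure is easy to verify by simulation (a quick sketch; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)       # one bootstrap sample: n draws with replacement
frac_in = np.unique(sample).size / n      # fraction of distinct observations drawn

print(frac_in)                 # ≈ 0.632
print(1 - (1 - 1 / n) ** n)    # analytic value → 1 - 1/e ≈ 0.632
```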
If the B trees each have variance sigma^2 and pairwise correlation rho, the variance of their average is rho * sigma^2 + (1 - rho) * sigma^2 / B. If trees are uncorrelated (rho near 0), averaging B trees divides the variance by B. If trees are highly correlated (rho near 1), bagging helps less. This motivates random forests.
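Plugging numbers into the variance formula makes the limitation concrete (a direct transcription of the formula, nothing more):

```python
def bagged_variance(sigma2, rho, B):
    """Variance of the average of B identically distributed predictors
    with individual variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / B

print(bagged_variance(1.0, 0.0, 100))  # → 0.01: uncorrelated trees, variance / B
print(bagged_variance(1.0, 0.9, 100))  # ≈ 0.901: correlation caps the benefit
```

With rho = 0.9, even infinitely many trees can only push the variance down to rho * sigma^2 = 0.9, which is why decorrelating the trees matters more than adding more of them.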
Each observation's OOB prediction uses only the trees where that observation was not in the bootstrap sample. This provides a free estimate of test error without needing a separate test set.
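In scikit-learn this estimate comes for free by setting one flag (a minimal sketch; the dataset is an illustrative choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
# OOB accuracy: each observation is scored only by the trees that
# never saw it, so no separate test set is needed.
print(rf.oob_score_)
```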
At each split, random forests consider only a random subset of mtry features. This decorrelates the trees, reducing the correlation term in the variance formula.
Small mtry means more decorrelation (lower rho) but each tree has higher bias. Common defaults: mtry = sqrt(p) for classification and p/3 for regression, where p is the total number of features.
Permutation importance measures how much the OOB error increases when a variable's values are randomly shuffled. Variables that matter will show large increases.
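scikit-learn's `permutation_importance` implements this idea (it scores on whatever data you pass rather than OOB samples specifically; the synthetic dataset below is constructed so that only feature 0 carries signal):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# shuffle=False keeps the single informative feature in column 0;
# the remaining four columns are pure noise.
X, y = make_classification(n_samples=400, n_features=5, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # column 0 should dominate
```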
The key hyperparameters are mtry, ntree, and min node size. Use cross-validation to select optimal values.
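A grid search over these hyperparameters might look like this (a sketch; the dataset and grid values are illustrative, and scikit-learn calls mtry `max_features`):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_features": ["sqrt", 0.5, None],  # sklearn's name for mtry
                "min_samples_leaf": [1, 5, 10]},
    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

ntree (`n_estimators`) is usually not tuned this way: more trees never hurt accuracy, only runtime, so it is set as large as the compute budget allows.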
This is your sandbox. Fit single trees or random forests on any dataset. Compare methods. Run quick experiments to build intuition.
Each lesson sweeps one parameter while holding others fixed: