◆ THE BLUEPRINT
What Is a Decision Tree?

A decision tree recursively partitions the feature space using binary splits. Each internal node tests a condition on one feature. Each leaf node makes a prediction based on the training data that falls into that region.

Impurity Measures

The tree chooses splits that maximize the reduction in impurity. For classification:

\(\text{Gini}(t) = \sum_{k=1}^{K} \hat{p}_k(1 - \hat{p}_k)\)
\(\text{Entropy}(t) = -\sum_{k=1}^{K} \hat{p}_k \log \hat{p}_k\)

Gini and entropy produce nearly identical splits in practice; Gini is slightly cheaper to compute because it avoids the logarithm.
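Both impurities are simple functions of the class proportions at a node; a minimal NumPy sketch (function names are my own):

```python
import numpy as np

def gini(p):
    """Gini impurity from a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def entropy(p):
    """Shannon entropy; 0 * log(0) is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

# A pure node has zero impurity; a 50/50 node is maximally impure.
print(gini([1.0, 0.0]))   # 0.0
print(gini([0.5, 0.5]))   # 0.5
```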

Greedy Splitting

Each split is chosen to maximize the information gain at that node. The algorithm is greedy: it optimizes locally, not globally. Press Step to watch the tree grow one split at a time.

\(\Delta I = I(\text{parent}) - \frac{n_L}{n}I(\text{left}) - \frac{n_R}{n}I(\text{right})\)
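The gain \(\Delta I\) can be evaluated directly from class counts at the parent and the two children; a small sketch (helper names are illustrative):

```python
import numpy as np

def gini_counts(counts):
    """Gini impurity from raw class counts at a node."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return float(np.sum(p * (1.0 - p)))

def gain(parent, left, right):
    """Impurity reduction Delta I for one candidate split."""
    n, nL, nR = sum(parent), sum(left), sum(right)
    return (gini_counts(parent)
            - nL / n * gini_counts(left)
            - nR / n * gini_counts(right))

# Splitting a mixed node [10, 10] into two pure children
# captures the full 0.5 of Gini impurity.
print(gain([10, 10], [10, 0], [0, 10]))  # 0.5
```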
Slide Max Depth to watch train error decrease while test error follows a U-shape. The sweet spot balances bias and variance.
The Complexity Tradeoff

Deep trees memorize training data (low bias, high variance). Shallow trees underfit (high bias, low variance). The optimal depth minimizes test error.

\(\text{MSE}_{\text{test}} = \text{Bias}^2 + \text{Variance} + \sigma^2\)
Cost-Complexity Pruning

Rather than setting max depth directly, cost-complexity pruning penalizes the number of leaves:

\(R_\alpha(T) = R(T) + \alpha|T|\)

Large alpha means aggressive pruning (fewer leaves). The optimal alpha is found by cross-validation.
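In scikit-learn, for instance, this penalty is the `ccp_alpha` parameter, and `cost_complexity_pruning_path` returns the sequence of alphas at which subtrees get pruned away (a sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# The pruning path lists the effective alphas of the weakest links;
# refitting at each alpha yields a nested sequence of pruned subtrees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

leaves = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y).get_n_leaves()
    for a in path.ccp_alphas
]

# Larger alpha -> fewer leaves; the largest alpha prunes to the root.
print(leaves[0], leaves[-1])
```

In practice each candidate alpha would be scored with cross-validation rather than refit on the full training set.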

Finding the Right Depth

The Bootstrap & CV app demonstrates how K-fold cross-validation can formally select the best tree depth or pruning parameter.

Bootstrap Aggregation

Bagging fits many trees to bootstrap samples of the training data, then averages their predictions. Each bootstrap sample draws n observations with replacement.

On average, each bootstrap sample contains about 63.2% of the original observations. The remaining 36.8% are 'out-of-bag' (OOB).
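The 63.2% figure comes from \(1-(1-1/n)^n \to 1 - e^{-1}\); a quick NumPy simulation confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 1000, 2000

# Fraction of distinct observations in each bootstrap sample of size n.
in_bag = np.mean(
    [len(np.unique(rng.integers(0, n, n))) / n for _ in range(B)]
)

# P(an observation is included) = 1 - (1 - 1/n)^n -> 1 - 1/e
print(round(in_bag, 3))            # ~0.632
print(round(1 - (1 - 1/n) ** n, 3))
```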

Variance Reduction
\(\text{Var}(\bar{f}) = \rho\sigma^2_{\text{tree}} + \frac{1-\rho}{B}\sigma^2_{\text{tree}}\)

If trees are uncorrelated (rho near 0), averaging B trees divides the variance by B. If trees are highly correlated (rho near 1), bagging helps less. This motivates random forests.
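The variance formula can be checked numerically by averaging B correlated "predictions" built as a shared component plus independent noise (a sketch of the algebra, not of bagging itself):

```python
import numpy as np

rng = np.random.default_rng(1)
B, sigma2, rho = 50, 1.0, 0.3
n_sims = 100_000

# Each row is one draw of B correlated tree predictions:
# shared part has variance rho*sigma2, noise has variance (1-rho)*sigma2,
# so any pair has variance sigma2 and correlation rho.
shared = rng.normal(0.0, np.sqrt(rho * sigma2), size=(n_sims, 1))
noise = rng.normal(0.0, np.sqrt((1 - rho) * sigma2), size=(n_sims, B))
avg = (shared + noise).mean(axis=1)

empirical = avg.var()
formula = rho * sigma2 + (1 - rho) / B * sigma2
print(round(empirical, 3), round(formula, 3))  # both ~0.314
```

Note that the correlated term \(\rho\sigma^2\) does not shrink with B, which is exactly why decorrelating the trees matters.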

OOB Error

Each observation's OOB prediction uses only the trees where that observation was not in the bootstrap sample. This provides a free estimate of test error without needing a separate test set.
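In scikit-learn, for example, this is exposed via `oob_score=True` (synthetic data here; the app's datasets will differ):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=8, n_informative=5,
                       noise=10.0, random_state=0)

# oob_score=True scores each point using only the trees whose
# bootstrap sample excluded it -- no held-out set needed.
rf = RandomForestRegressor(n_estimators=300, oob_score=True,
                           random_state=0).fit(X, y)
print(round(rf.oob_score_, 3))  # R^2 estimated from OOB predictions
```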

Related Topics
OOB Error Convergence
Variable Importance
mtry Comparison
TREES
OOB MSE
TEST MSE
MTRY
◆ THE BLUEPRINT
Random Forest = Bagging + Feature Subsampling

At each split, random forests consider only a random subset of mtry features. This decorrelates the trees, reducing the correlation term in the variance formula.

The mtry Hyperparameter
\(\text{Var}(\bar{f}) \approx \rho\sigma^2_{\text{tree}} + \frac{1-\rho}{B}\sigma^2_{\text{tree}}\)

Small mtry means more decorrelation (lower rho) but each tree has higher bias. Common defaults:

  • Regression: \(\lfloor p/3 \rfloor\)
  • Classification: \(\lfloor\sqrt{p}\rfloor\)
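In scikit-learn the analogue of mtry is `max_features`; a quick sweep on synthetic data (values chosen for illustration only):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=9, n_informative=4,
                       random_state=0)
p = X.shape[1]

# p // 3 mirrors the regression default quoted above.
for m in (1, p // 3, p):
    rf = RandomForestRegressor(n_estimators=100, max_features=m,
                               oob_score=True, random_state=0).fit(X, y)
    print(m, round(rf.oob_score_, 3))
```

With `max_features=p` this reduces to plain bagging, since every split sees every feature.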
Variable Importance

Permutation importance measures how much the OOB error increases when a variable's values are randomly shuffled. Variables that matter will show large increases.

\(\text{Imp}_j = \text{OOB}_{\text{permuted}_j} - \text{OOB}_{\text{original}}\)
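scikit-learn's `permutation_importance` follows the same idea, though it measures the score drop on a supplied dataset rather than on OOB samples; a sketch on synthetic data where only the first feature carries signal:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# shuffle=False keeps the single informative feature in column 0;
# the remaining four columns are pure noise.
X, y = make_regression(n_samples=300, n_features=5, n_informative=1,
                       shuffle=False, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)

# The informative feature should dominate the importance ranking.
print(int(np.argmax(imp.importances_mean)))  # 0
```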
Tuning

The key hyperparameters are mtry, ntree, and min node size. Use cross-validation to select optimal values.
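A minimal cross-validated grid search over these knobs, using scikit-learn's names (`max_features` for mtry, `n_estimators` for ntree, `min_samples_leaf` for min node size):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=6, n_informative=4,
                       noise=5.0, random_state=0)

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={
        "max_features": [1, 2, 6],    # mtry
        "n_estimators": [100],        # ntree (held fixed here)
        "min_samples_leaf": [1, 5],   # min node size
    },
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```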

The Workbench

This is your sandbox. Fit single trees or random forests on any dataset. Compare methods. Run quick experiments to build intuition.

Quick Lessons

Each lesson sweeps one parameter while holding others fixed:

  • Sweep mtry: Fits RF at several mtry values. Shows test MSE vs mtry.
  • Sweep ntree: Fits RF at several ntree values. Shows how error stabilizes.
  • Tree vs RF: Compares a single tree at various depths against RF.
Key Defaults
  • Regression mtry: \(\lfloor p/3 \rfloor\)
  • Typical ntree: 200-500 (more is rarely harmful)
  • Min node size: 5 (controls tree depth indirectly)
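The Tree vs RF lesson can be reproduced in a few lines of scikit-learn (synthetic data; the app's own datasets may differ):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=15.0,
                       random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# One fully grown tree versus a bagged, feature-subsampled ensemble.
tree_mse = mean_squared_error(
    yte, DecisionTreeRegressor(random_state=0).fit(Xtr, ytr).predict(Xte))
rf_mse = mean_squared_error(
    yte, RandomForestRegressor(n_estimators=300,
                               random_state=0).fit(Xtr, ytr).predict(Xte))

print(round(tree_mse, 1), round(rf_mse, 1))
```

The single tree's high variance shows up directly as a larger test MSE.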