A decision tree recursively partitions the feature space using binary splits. Each internal node tests a condition on one feature. Each leaf node makes a prediction based on the training data that falls into that region.
The tree chooses splits that maximize the reduction in impurity. For classification, with class proportions p_k in a node, the two standard impurity measures are the Gini index G = sum_k p_k (1 - p_k) and the entropy H = -sum_k p_k log2 p_k.
Gini and entropy produce nearly identical splits in practice; Gini is cheaper to compute because it avoids the logarithm.
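As a quick sanity check, the two impurity measures can be sketched in a few lines (a minimal illustration, not code from the app):

```python
import numpy as np

def gini(p):
    """Gini impurity for a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy (in bits) for a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

# A pure node has zero impurity under both measures;
# a 50/50 node is maximally impure: Gini = 0.5, entropy = 1 bit.
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # → 0.5 1.0
```

Both curves peak at uniform class proportions and vanish at purity, which is why they rank candidate splits almost identically.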
Each split is chosen to maximize the information gain at that node. The algorithm is greedy: it optimizes locally, not globally. Press Step to watch the tree grow one split at a time.
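The greedy scoring of a single candidate split can be sketched as follows (an illustrative snippet using entropy; the toy labels and threshold are made up for the example):

```python
import numpy as np

def entropy_of_labels(y):
    """Entropy (bits) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, left_mask):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(y)
    y_left, y_right = y[left_mask], y[~left_mask]
    child = (len(y_left) / n) * entropy_of_labels(y_left) \
          + (len(y_right) / n) * entropy_of_labels(y_right)
    return entropy_of_labels(y) - child

y = np.array([0, 0, 0, 1, 1, 1])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# A perfect split recovers the full parent entropy of 1 bit;
# a poor split recovers much less.
print(information_gain(y, x <= 3.5))  # → 1.0
print(information_gain(y, x <= 1.5))  # ≈ 0.19
```

The algorithm evaluates a gain like this for every feature and threshold at the current node and keeps the best one, never revisiting earlier choices.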
Deep trees memorize training data (low bias, high variance). Shallow trees underfit (high bias, low variance). The optimal depth minimizes test error.
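The depth/overfitting tradeoff is easy to reproduce outside the app; here is a minimal sketch with scikit-learn on a synthetic dataset (the dataset and parameter values are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 3, 10, None):  # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
# Training accuracy rises to 1.0 with depth, while test accuracy
# stops improving (and can fall) once the tree starts memorizing.
```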
Rather than setting max depth directly, cost-complexity pruning penalizes the number of leaves: it minimizes R_alpha(T) = R(T) + alpha * |T|, where R(T) is the training error of tree T and |T| is its number of leaves.
Large alpha means aggressive pruning (fewer leaves). The optimal alpha is found by cross-validation.
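scikit-learn exposes this directly; a minimal sketch (the dataset is an illustrative choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Compute the sequence of effective alphas at which subtrees get pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Refit at increasing alpha: a stronger penalty yields fewer leaves.
for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    print(round(float(alpha), 4), pruned.get_n_leaves())
```

Sweeping `ccp_alpha` over `path.ccp_alphas` and scoring each tree by cross-validation is the usual way to pick the pruning level.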
The Bootstrap & CV app demonstrates how K-fold cross-validation can formally select the best tree depth or pruning parameter.
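The depth selection the app demonstrates looks roughly like this in code (a sketch with scikit-learn; the dataset and depth grid are illustrative, not the app's):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

depths = [1, 2, 3, 5, 8, None]
# Mean 5-fold CV accuracy for each candidate depth.
scores = [cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                          X, y, cv=5).mean() for d in depths]
best = depths[int(np.argmax(scores))]
print(best, max(scores))
```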
Bagging fits many trees to bootstrap samples of the training data, then averages their predictions. Each bootstrap sample draws n observations with replacement.
On average, each bootstrap sample contains about 63.2% of the distinct original observations: the chance that a given observation appears at least once is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632. The remaining 36.8% are 'out-of-bag' (OOB) for that sample.
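The 63.2% figure is easy to verify by simulation (a quick sketch; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)       # one bootstrap sample: n draws with replacement
frac_in = np.unique(sample).size / n      # fraction of distinct observations drawn

print(frac_in)                 # ≈ 0.632
print(1 - (1 - 1 / n) ** n)    # analytic value → 1 - 1/e ≈ 0.632
```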
If the B trees each have variance sigma^2 and pairwise correlation rho, the variance of their average is rho * sigma^2 + (1 - rho) * sigma^2 / B. If trees are uncorrelated (rho near 0), averaging B trees divides the variance by B. If trees are highly correlated (rho near 1), bagging helps less. This motivates random forests.
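Plugging numbers into the variance formula makes the limitation concrete (a direct transcription of the formula, nothing more):

```python
def bagged_variance(sigma2, rho, B):
    """Variance of the average of B identically distributed predictors
    with individual variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / B

print(bagged_variance(1.0, 0.0, 100))  # → 0.01: uncorrelated trees, variance / B
print(bagged_variance(1.0, 0.9, 100))  # ≈ 0.901: correlation caps the benefit
```

With rho = 0.9, even infinitely many trees can only push the variance down to rho * sigma^2 = 0.9, which is why decorrelating the trees matters more than adding more of them.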
Each observation's OOB prediction uses only the trees where that observation was not in the bootstrap sample. This provides a free estimate of test error without needing a separate test set.
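In scikit-learn this estimate comes for free by setting one flag (a minimal sketch; the dataset is an illustrative choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
# OOB accuracy: each observation is scored only by the trees that
# never saw it, so no separate test set is needed.
print(rf.oob_score_)
```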
At each split, random forests consider only a random subset of mtry features. This decorrelates the trees, reducing the correlation term in the variance formula.
Small mtry means more decorrelation (lower rho) but each tree has higher bias. Common defaults: mtry = sqrt(p) for classification and p/3 for regression, where p is the total number of features.
Permutation importance measures how much the OOB error increases when a variable's values are randomly shuffled. Variables that matter will show large increases.
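scikit-learn's `permutation_importance` implements this idea (it scores on whatever data you pass rather than OOB samples specifically; the synthetic dataset below is constructed so that only feature 0 carries signal):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# shuffle=False keeps the single informative feature in column 0;
# the remaining four columns are pure noise.
X, y = make_classification(n_samples=400, n_features=5, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # column 0 should dominate
```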
The key hyperparameters are mtry, ntree, and min node size. Use cross-validation to select optimal values.
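A grid search over these hyperparameters might look like this (a sketch; the dataset and grid values are illustrative, and scikit-learn calls mtry `max_features`):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_features": ["sqrt", 0.5, None],  # sklearn's name for mtry
                "min_samples_leaf": [1, 5, 10]},
    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

ntree (`n_estimators`) is usually not tuned this way: more trees never hurt accuracy, only runtime, so it is set as large as the compute budget allows.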
This is your sandbox. Fit single trees or random forests on any dataset. Compare methods. Run quick experiments to build intuition.
Each lesson sweeps one parameter while holding others fixed: