◆ THE BLUEPRINT
The Bootstrap Principle

The sample is to the population as the bootstrap sample is to the sample. By resampling with replacement from our observed data, we approximate the sampling distribution of any statistic.

The Procedure
  1. Draw a sample of size n from the population.
  2. Resample n observations with replacement from the sample.
  3. Compute the statistic on the resample.
  4. Repeat B times. The distribution of those B statistics is the bootstrap distribution.
Bootstrap Standard Error
\(SE_{boot} = \text{sd}(\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_B^*)\)

The standard deviation of the bootstrap distribution estimates the standard error of the statistic.
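The procedure above is a few lines of code. A minimal numpy sketch (the data and the choice of statistic are illustrative; the median is a good example because its SE has no simple closed-form formula):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_se(sample, statistic, B=2000, rng=rng):
    """Estimate the SE of `statistic` by resampling with replacement B times."""
    n = len(sample)
    boot_stats = np.array([
        statistic(rng.choice(sample, size=n, replace=True))  # one resample per draw
        for _ in range(B)
    ])
    return boot_stats.std(ddof=1)  # sd of the bootstrap distribution

# A right-skewed sample, where the median's SE has no simple formula.
sample = rng.exponential(scale=2.0, size=100)
se_median = bootstrap_se(sample, np.median)
```

Any statistic that takes an array and returns a number can be plugged in for `np.median`.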

How to Interact

Draw a sample, then run the bootstrap. Compare the bootstrap SE to the true SE. Try different populations and statistics. The bootstrap works well even when the population is skewed or when the statistic has no simple formula for its SE.

◆ THE BLUEPRINT
Bootstrap Confidence Intervals

The bootstrap gives us the sampling distribution of a statistic without relying on formulas. We can use that distribution to build confidence intervals.

Percentile Method
\(CI = [\hat{\theta}^*_{\alpha/2}, \; \hat{\theta}^*_{1-\alpha/2}]\)

Take the middle 95% (or other level) of the bootstrap distribution directly.

Normal Method
\(CI = \hat{\theta} \pm z^* \cdot SE_{boot}\)

Uses the bootstrap SE but assumes the sampling distribution is normal. This works well when the bootstrap distribution is approximately bell-shaped.
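Both methods can be computed from the same bootstrap distribution. A sketch with simulated data (using z* ≈ 1.96 for a 95% level; the sample and statistic are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=5.0, size=80)
theta_hat = sample.mean()

# Bootstrap distribution of the sample mean.
boot = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                 for _ in range(2000)])

# Percentile method: take the middle 95% of the bootstrap distribution.
ci_percentile = np.percentile(boot, [2.5, 97.5])

# Normal method: point estimate ± z* times the bootstrap SE.
z = 1.96                       # 97.5th percentile of the standard normal
se_boot = boot.std(ddof=1)
ci_normal = (theta_hat - z * se_boot, theta_hat + z * se_boot)
```

With a symmetric, bell-shaped bootstrap distribution like this one, the two intervals come out nearly identical; they diverge when the distribution is skewed.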

Coverage

A 95% confidence interval should contain the true parameter 95% of the time. The coverage experiment repeats the entire process many times and checks how often the CI actually captures the truth.

Coverage can be below the nominal level when the sample is small or when the population is heavily skewed. Try the right-skewed population with a small n to see this.

How to Interact

Compute a single CI to see it on the bootstrap distribution. Then run the coverage experiment to see how reliable that CI method is across many samples.

◆ THE BLUEPRINT
K-Fold Cross-Validation

Split the data into K equally-sized folds. Train on K-1 folds, test on the held-out fold. Rotate through all K folds and average the test errors.

The Procedure
  1. Randomly assign each observation to one of K folds.
  2. For fold k: train the model on all data except fold k.
  3. Predict the held-out fold k observations. Compute the test error.
  4. Average test error across all K folds.
CV Error
\(CV(K) = \frac{1}{K} \sum_{k=1}^{K} MSE_k\)
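The four steps above can be sketched directly in numpy, here with K = 5 and a simple linear fit as the model (the data-generating process is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: y = 2x + noise.
x = rng.uniform(-1, 1, size=60)
y = 2 * x + rng.normal(scale=0.3, size=60)

K = 5
folds = rng.permutation(np.arange(60) % K)   # random fold labels 0..K-1

fold_mses = []
for k in range(K):
    train, test = folds != k, folds == k
    slope, intercept = np.polyfit(x[train], y[train], deg=1)  # train on K-1 folds
    pred = slope * x[test] + intercept                        # predict held-out fold
    fold_mses.append(np.mean((y[test] - pred) ** 2))

cv_mse = np.mean(fold_mses)   # CV(K) = average of the K per-fold MSEs
```

Each observation is used for testing exactly once, and for training K − 1 times.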
Choosing K

Small K (e.g., 3): each training set uses only (K−1)/K of the data, so the error estimate is biased upward (pessimistic), but it has lower variance. Large K (e.g., LOOCV, where K = n): nearly unbiased, but higher variance, since the training sets overlap almost completely and the fold estimates are highly correlated.

K = 5 or K = 10 is the standard compromise.

The Punchline

Training error always decreases with model complexity. CV error reveals the true U-shape. The gap between the two is the key to understanding overfitting.
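This gap can be reproduced in a few lines. A sketch fitting polynomials of increasing degree to quadratic data: training MSE falls monotonically, while CV MSE traces the U-shape and picks out a low degree (the simulated data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=80)
y = x**2 + rng.normal(scale=0.5, size=80)   # true signal is quadratic

K = 5
folds = rng.permutation(np.arange(80) % K)

def mse(a, b):
    return np.mean((a - b) ** 2)

train_mse, cv_mse = {}, {}
for deg in range(1, 9):
    # Training error: fit and score on the same data.
    coefs = np.polyfit(x, y, deg)
    train_mse[deg] = mse(y, np.polyval(coefs, x))
    # CV error: average held-out MSE over K folds.
    fold_scores = []
    for k in range(K):
        c = np.polyfit(x[folds != k], y[folds != k], deg)
        fold_scores.append(mse(y[folds == k], np.polyval(c, x[folds == k])))
    cv_mse[deg] = np.mean(fold_scores)

best_degree = min(cv_mse, key=cv_mse.get)   # minimum of the CV curve
```

Degree 1 underfits badly (high CV error), and the highest degrees have the lowest training error but worse CV error.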

How to Interact

Generate data, then use the fold viewer slider to step through each fold's train/test split. Toggle the CV error curve to see training vs CV error across polynomial degrees. The minimum of the CV curve picks the right complexity.

◆ THE BLUEPRINT
CV for Hyperparameter Tuning

Regularized regression adds a penalty to prevent overfitting. The penalty strength is controlled by a tuning parameter. Cross-validation finds the value that minimizes out-of-sample error.

The Penalized Objective
\(\min_\beta \; \frac{1}{2n}\|y - X\beta\|^2 + \lambda \left[ \frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1 \right]\)

Ridge (α=0): shrinks all coefficients toward zero but never removes them.

Lasso (α=1): shrinks some coefficients exactly to zero (variable selection).

Elastic Net (0<α<1): a blend of both.
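The qualitative difference between the three penalties is easy to see on simulated data with only a few true signals. A scikit-learn sketch; note that sklearn's parameterization differs from the objective above (sklearn's `alpha` plays the role of λ, and its `l1_ratio` plays the role of α):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(5)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)   # only 3 true nonzero coefficients
y = X @ beta + rng.normal(scale=1.0, size=n)

ridge = Ridge(alpha=10.0).fit(X, y)                   # L2: shrinks, never zeroes
lasso = Lasso(alpha=1.0).fit(X, y)                    # L1: sets some coefs to exactly 0
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # blend of both

n_zero_ridge = np.sum(ridge.coef_ == 0)   # expect 0: ridge never removes variables
n_zero_lasso = np.sum(lasso.coef_ == 0)   # expect most noise coefs zeroed out
```

The ridge fit keeps all ten coefficients nonzero, while the lasso drops the noise variables entirely: that is the variable-selection property.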

Choosing λ

Slide λ and watch coefficients shrink. Small λ = weak penalty (complex model). Large λ = strong penalty (simple model). The coefficient paths show the full trajectory.

\(\lambda_{min} = \arg\min_\lambda CV(\lambda)\)
The 1-SE Rule

λ1se is the largest λ whose CV error is within 1 standard error of the minimum. It produces a simpler model that performs nearly as well.
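Given a grid of λ values with a mean CV error and standard error for each, the rule is a two-line computation. A sketch with hypothetical CV results (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical CV results over a lambda grid: mean CV error and its standard error.
lambdas = np.array([0.01, 0.05, 0.1, 0.5, 1.0, 5.0])
cv_err  = np.array([1.20, 1.05, 1.00, 1.02, 1.10, 1.40])
cv_se   = np.array([0.06, 0.05, 0.05, 0.05, 0.06, 0.08])

i_min = np.argmin(cv_err)                    # lambda_min: smallest CV error
threshold = cv_err[i_min] + cv_se[i_min]     # min error + 1 SE

# lambda_1se: the LARGEST lambda whose CV error is within 1 SE of the minimum.
lam_1se = lambdas[cv_err <= threshold].max()
```

Here λ_min = 0.1 but λ_1se = 0.5: a stronger penalty, hence a simpler model, at essentially no cost in CV error.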

Going Further: Joint Tuning

This app tunes λ for a fixed α. In practice, you can tune both jointly. Run CV over a grid of (α, λ) pairs and pick the combination with the lowest CV error. Packages like caret and tidymodels automate this.
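In Python the same joint grid search can be sketched with scikit-learn (again, sklearn's `l1_ratio` corresponds to α above and its `alpha` to λ; the data is simulated):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(6)
X = rng.normal(size=(120, 8))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=120)

# Grid over (alpha, l1_ratio) pairs; CV picks the combination with lowest error.
grid = {"l1_ratio": [0.1, 0.5, 0.9], "alpha": [0.01, 0.1, 1.0]}
search = GridSearchCV(ElasticNet(max_iter=10000), grid,
                      scoring="neg_mean_squared_error", cv=5).fit(X, y)
best = search.best_params_   # e.g. {"alpha": ..., "l1_ratio": ...}
```

`GridSearchCV` refits the model on the full data at the best pair, so `search` can be used directly for prediction afterwards.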

How to Interact
  1. Generate data. Look at the correlations.
  2. Pick a penalty type and click Fit Model.
  3. Slide lambda. Watch coefficients appear and disappear.
  4. Click Run CV to see what lambda cross-validation would choose.