◆ THE BLUEPRINT
The Bootstrap Principle

The sample is to the population as the bootstrap sample is to the sample. By resampling with replacement from our observed data, we approximate the sampling distribution of any statistic.

The Procedure
  1. Draw a sample of size n from the population.
  2. Resample n observations with replacement from the sample.
  3. Compute the statistic on the resample.
  4. Repeat B times. The distribution of those B statistics is the bootstrap distribution.
Bootstrap Standard Error
\(SE_{boot} = \text{sd}(\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_B^*)\)

The standard deviation of the bootstrap distribution estimates the standard error of the statistic.
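The procedure above is a few lines of code. A minimal numpy sketch (the data and the choice of statistic are illustrative; the median is a good example because its SE has no simple closed-form formula):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_se(sample, statistic, B=2000, rng=rng):
    """Estimate the SE of `statistic` by resampling with replacement B times."""
    n = len(sample)
    boot_stats = np.array([
        statistic(rng.choice(sample, size=n, replace=True))  # one resample per draw
        for _ in range(B)
    ])
    return boot_stats.std(ddof=1)  # sd of the bootstrap distribution

# A right-skewed sample, where the median's SE has no simple formula.
sample = rng.exponential(scale=2.0, size=100)
se_median = bootstrap_se(sample, np.median)
```

Any statistic that takes an array and returns a number can be plugged in for `np.median`.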

How to Interact

Draw a sample, then run the bootstrap. Compare the bootstrap SE to the true SE. Try different populations and statistics. The bootstrap works well even when the population is skewed or when the statistic has no simple formula for its SE.

◆ THE BLUEPRINT
Bootstrap Confidence Intervals

The bootstrap gives us the sampling distribution of a statistic without relying on formulas. We can use that distribution to build confidence intervals.

Percentile Method
\(CI = [\hat{\theta}^*_{\alpha/2}, \; \hat{\theta}^*_{1-\alpha/2}]\)

Take the middle 95% (or other level) of the bootstrap distribution directly.

Normal Method
\(CI = \hat{\theta} \pm z^* \cdot SE_{boot}\)

Uses the bootstrap SE but assumes the sampling distribution is normal. This works well when the bootstrap distribution is approximately bell-shaped.
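Both methods can be computed from the same bootstrap distribution. A sketch with simulated data (using z* ≈ 1.96 for a 95% level; the sample and statistic are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=5.0, size=80)
theta_hat = sample.mean()

# Bootstrap distribution of the sample mean.
boot = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                 for _ in range(2000)])

# Percentile method: take the middle 95% of the bootstrap distribution.
ci_percentile = np.percentile(boot, [2.5, 97.5])

# Normal method: point estimate ± z* times the bootstrap SE.
z = 1.96                       # 97.5th percentile of the standard normal
se_boot = boot.std(ddof=1)
ci_normal = (theta_hat - z * se_boot, theta_hat + z * se_boot)
```

With a symmetric, bell-shaped bootstrap distribution like this one, the two intervals come out nearly identical; they diverge when the distribution is skewed.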

Coverage

A 95% confidence interval should contain the true parameter 95% of the time. The coverage experiment repeats the entire process many times and checks how often the CI actually captures the truth.

Coverage can be below the nominal level when the sample is small or when the population is heavily skewed. Try the right-skewed population with a small n to see this.

How to Interact

Compute a single CI to see it on the bootstrap distribution. Then run the coverage experiment to see how reliable that CI method is across many samples.

◆ THE BLUEPRINT
K-Fold Cross-Validation

Split the data into K equally-sized folds. Train on K-1 folds, test on the held-out fold. Rotate through all K folds and average the test errors.

The Procedure
  1. Randomly assign each observation to one of K folds.
  2. For fold k: train the model on all data except fold k.
  3. Predict the held-out fold k observations. Compute the test error.
  4. Average test error across all K folds.
CV Error
\(CV(K) = \frac{1}{K} \sum_{k=1}^{K} MSE_k\)
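The four steps above can be sketched directly in numpy, here with K = 5 and a simple linear fit as the model (the data-generating process is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: y = 2x + noise.
x = rng.uniform(-1, 1, size=60)
y = 2 * x + rng.normal(scale=0.3, size=60)

K = 5
folds = rng.permutation(np.arange(60) % K)   # random fold labels 0..K-1

fold_mses = []
for k in range(K):
    train, test = folds != k, folds == k
    slope, intercept = np.polyfit(x[train], y[train], deg=1)  # train on K-1 folds
    pred = slope * x[test] + intercept                        # predict held-out fold
    fold_mses.append(np.mean((y[test] - pred) ** 2))

cv_mse = np.mean(fold_mses)   # CV(K) = average of the K per-fold MSEs
```

Each observation is used for testing exactly once, and for training K − 1 times.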
Choosing K

Small K (e.g., 3): each training set uses only (K−1)/K of the data, so the error estimate is biased upward (pessimistic), but it has lower variance. Large K (e.g., LOOCV, where K = n): nearly unbiased, but higher variance, since the training sets overlap almost completely and the fold estimates are highly correlated.

K = 5 or K = 10 is the standard compromise.

The Punchline

Training error always decreases with model complexity. CV error reveals the true U-shape. The gap between the two is the key to understanding overfitting.
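This gap can be reproduced in a few lines. A sketch fitting polynomials of increasing degree to quadratic data: training MSE falls monotonically, while CV MSE traces the U-shape and picks out a low degree (the simulated data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=80)
y = x**2 + rng.normal(scale=0.5, size=80)   # true signal is quadratic

K = 5
folds = rng.permutation(np.arange(80) % K)

def mse(a, b):
    return np.mean((a - b) ** 2)

train_mse, cv_mse = {}, {}
for deg in range(1, 9):
    # Training error: fit and score on the same data.
    coefs = np.polyfit(x, y, deg)
    train_mse[deg] = mse(y, np.polyval(coefs, x))
    # CV error: average held-out MSE over K folds.
    fold_scores = []
    for k in range(K):
        c = np.polyfit(x[folds != k], y[folds != k], deg)
        fold_scores.append(mse(y[folds == k], np.polyval(c, x[folds == k])))
    cv_mse[deg] = np.mean(fold_scores)

best_degree = min(cv_mse, key=cv_mse.get)   # minimum of the CV curve
```

Degree 1 underfits badly (high CV error), and the highest degrees have the lowest training error but worse CV error.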

How to Interact

Generate data, then use the fold viewer slider to step through each fold's train/test split. Toggle the CV error curve to see training vs CV error across polynomial degrees. The minimum of the CV curve picks the right complexity.

◆ THE BLUEPRINT
CV for Hyperparameter Tuning

Regularized regression adds a penalty to prevent overfitting. The penalty strength is controlled by a tuning parameter. Cross-validation finds the value that minimizes out-of-sample error.

The Penalized Objective
\(\min_\beta \; \frac{1}{2n}\|y - X\beta\|^2 + \lambda \left[ \frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1 \right]\)

Ridge (α=0): shrinks all coefficients toward zero but never removes them.

Lasso (α=1): shrinks some coefficients exactly to zero (variable selection).

Elastic Net (0<α<1): a blend of both.
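The qualitative difference between the three penalties is easy to see on simulated data with only a few true signals. A scikit-learn sketch; note that sklearn's parameterization differs from the objective above (sklearn's `alpha` plays the role of λ, and its `l1_ratio` plays the role of α):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(5)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)   # only 3 true nonzero coefficients
y = X @ beta + rng.normal(scale=1.0, size=n)

ridge = Ridge(alpha=10.0).fit(X, y)                   # L2: shrinks, never zeroes
lasso = Lasso(alpha=1.0).fit(X, y)                    # L1: sets some coefs to exactly 0
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # blend of both

n_zero_ridge = np.sum(ridge.coef_ == 0)   # expect 0: ridge never removes variables
n_zero_lasso = np.sum(lasso.coef_ == 0)   # expect most noise coefs zeroed out
```

The ridge fit keeps all ten coefficients nonzero, while the lasso drops the noise variables entirely: that is the variable-selection property.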

Choosing λ

Slide λ and watch coefficients shrink. Small λ = weak penalty (complex model). Large λ = strong penalty (simple model). The coefficient paths show the full trajectory.

\(\lambda_{min} = \arg\min_\lambda CV(\lambda)\)
The 1-SE Rule

λ1se is the largest λ whose CV error is within 1 standard error of the minimum. It produces a simpler model that performs nearly as well.
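Given a grid of λ values with a mean CV error and standard error for each, the rule is a two-line computation. A sketch with hypothetical CV results (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical CV results over a lambda grid: mean CV error and its standard error.
lambdas = np.array([0.01, 0.05, 0.1, 0.5, 1.0, 5.0])
cv_err  = np.array([1.20, 1.05, 1.00, 1.02, 1.10, 1.40])
cv_se   = np.array([0.06, 0.05, 0.05, 0.05, 0.06, 0.08])

i_min = np.argmin(cv_err)                    # lambda_min: smallest CV error
threshold = cv_err[i_min] + cv_se[i_min]     # min error + 1 SE

# lambda_1se: the LARGEST lambda whose CV error is within 1 SE of the minimum.
lam_1se = lambdas[cv_err <= threshold].max()
```

Here λ_min = 0.1 but λ_1se = 0.5: a stronger penalty, hence a simpler model, at essentially no cost in CV error.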

Going Further: Joint Tuning

This app tunes λ for a fixed α. In practice, you can tune both jointly. Run CV over a grid of (α, λ) pairs and pick the combination with the lowest CV error. Packages like caret and tidymodels automate this.
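In Python the same joint grid search can be sketched with scikit-learn (again, sklearn's `l1_ratio` corresponds to α above and its `alpha` to λ; the data is simulated):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(6)
X = rng.normal(size=(120, 8))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=120)

# Grid over (alpha, l1_ratio) pairs; CV picks the combination with lowest error.
grid = {"l1_ratio": [0.1, 0.5, 0.9], "alpha": [0.01, 0.1, 1.0]}
search = GridSearchCV(ElasticNet(max_iter=10000), grid,
                      scoring="neg_mean_squared_error", cv=5).fit(X, y)
best = search.best_params_   # e.g. {"alpha": ..., "l1_ratio": ...}
```

`GridSearchCV` refits the model on the full data at the best pair, so `search` can be used directly for prediction afterwards.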

How to Interact
  1. Generate data. Look at the correlations.
  2. Pick a penalty type and click Fit Model.
  3. Slide lambda. Watch coefficients appear and disappear.
  4. Click Run CV to see what lambda cross-validation would choose.