Fit & Residuals
Loss Function Shapes
Total Loss
◆ THE BLUEPRINT
What You're Looking At

A regression line positioned by your sliders. The vertical segments are residuals: the error between each data point and your line. The loss function decides how much each error costs.

Key Equations
Total loss: \(L = \sum_{i=1}^{n} \rho(e_i)\) where \(e_i = y_i - \hat{y}_i\)
\(\rho_{L2}(e) = e^2 \qquad \rho_{L1}(e) = |e| \qquad \rho_{Huber}(e) = \begin{cases} e^2/2 & |e| \leq \delta \\ \delta|e| - \delta^2/2 & |e| > \delta \end{cases}\)
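The three penalty functions above translate directly into code. A minimal NumPy sketch (function names are illustrative; `delta` is the Huber threshold \(\delta\)):

```python
import numpy as np

def rho_l2(e):
    return e ** 2

def rho_l1(e):
    return np.abs(e)

def rho_huber(e, delta=1.0):
    # Quadratic for small errors, linear beyond the threshold delta.
    return np.where(np.abs(e) <= delta,
                    0.5 * e ** 2,
                    delta * np.abs(e) - 0.5 * delta ** 2)

def total_loss(y, y_hat, rho):
    # L = sum_i rho(e_i) with e_i = y_i - y_hat_i
    return np.sum(rho(y - y_hat))
```

Swapping `rho` lets you score the same residuals under each loss and see which errors dominate the total.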
How to Interact

Drag the intercept and slope sliders to reposition your line. Watch the total loss numbers update. Try to minimize each loss manually, then toggle "Show Optimal Lines" to see the true minimizer.

Click "Add Outlier" and compare how L2 and L1 respond. The L2 loss spikes quadratically with the outlier's distance; the L1 loss grows only linearly.

Related Topics
L2 (OLS)
L1 (Median)
Method Comparison
Takeaway
◆ THE BLUEPRINT
What You're Looking At

The same data fit by OLS (L2) and median regression (L1), side by side. Add outliers and watch OLS get pulled while L1 holds steady.

Why L2 Fails With Outliers
L2 cost grows as \(e^2\). An outlier with \(e = 10\) costs \(100\). The optimizer moves the entire line to reduce that one giant squared cost.
L1 cost grows as \(|e|\). The same outlier costs only \(10\). Each point's influence is bounded.
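The asymmetry is easy to check numerically. With hypothetical residuals where the last point is the outlier:

```python
import numpy as np

residuals = np.array([0.5, -1.0, 0.8, -0.3, 10.0])  # last entry is the outlier

l2_costs = residuals ** 2      # 0.25, 1.0, 0.64, 0.09, 100.0
l1_costs = np.abs(residuals)   # 0.5, 1.0, 0.8, 0.3, 10.0

# Share of the total loss contributed by the one outlier:
print(l2_costs[-1] / l2_costs.sum())  # ~0.98 under L2
print(l1_costs[-1] / l1_costs.sum())  # ~0.79 under L1
```

Under L2 a single bad point dominates the objective, so the optimizer bends the whole line toward it; under L1 its influence stays bounded.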
When to Use Each

L2 (OLS): clean data with Gaussian errors.

L1 (Median): known outliers or heavy-tailed error distributions.

Huber: a compromise between the two. Quadratic near zero (like L2), linear for large errors (like L1); the threshold \(\delta\) sets the crossover.

Related Topics
Contour Plot
Coefficients
Interpretation
Coefficient Paths vs log(λ)
Cross-Validation Error
Selected Model
◆ THE BLUEPRINT
What You're Looking At

Each colored line traces one coefficient's value across a range of regularization strengths. The x-axis is \(\log_{10}(\lambda)\). Moving right means more regularization. At the left, all coefficients are near their OLS values. At the right, all approach zero.

Key Quantities
\(\lambda_{\min}\): the \(\lambda\) that minimizes 10-fold cross-validation error.
\(\lambda_{1se}\): the largest \(\lambda\) whose mean CV error is within one standard error of the error at \(\lambda_{\min}\). Produces a simpler model at negligible CV cost.
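Both quantities can be computed from a fold-by-fold CV error matrix. A sketch with synthetic errors (any array of shape n_lambdas × n_folds works; the error curve here is made up for illustration):

```python
import numpy as np

# Synthetic CV errors: rows follow an increasing lambda grid, columns are 10 folds.
rng = np.random.default_rng(0)
lambdas = np.logspace(-3, 1, 20)
curve = 1.0 + (np.log10(lambdas) + 1.0) ** 2          # U-shaped mean error, min near lambda = 0.1
cv_errors = curve[:, None] + 0.1 * rng.standard_normal((20, 10))

mean_err = cv_errors.mean(axis=1)
se_err = cv_errors.std(axis=1, ddof=1) / np.sqrt(10)  # standard error across folds

i_min = mean_err.argmin()
lam_min = lambdas[i_min]

# 1-SE rule: largest lambda whose mean CV error stays within one SE of the minimum.
threshold = mean_err[i_min] + se_err[i_min]
lam_1se = lambdas[mean_err <= threshold].max()
```

Because \(\lambda_{\min}\) always satisfies its own threshold, `lam_1se >= lam_min` holds by construction, so the 1-SE model is never less regularized than the minimizer.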
Lasso vs Ridge Paths

Lasso paths hit zero at finite \(\lambda\). The coefficient flatlines and stays there. This is variable selection.

Ridge paths approach zero asymptotically but never reach it. All predictors remain in the model.
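The zero-vs-asymptotic contrast is visible in closed form for an orthonormal design, where Lasso soft-thresholds each OLS coefficient and Ridge rescales it (standard textbook identities, sketched here on made-up coefficients):

```python
import numpy as np

beta_ols = np.array([3.0, 0.4, -1.2])

def lasso_shrink(beta, lam):
    # Soft-thresholding: exactly zero once lam >= |beta|.
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def ridge_shrink(beta, lam):
    # Proportional shrinkage: approaches zero but never reaches it.
    return beta / (1.0 + lam)

print(lasso_shrink(beta_ols, 0.5))  # the 0.4 coefficient is already zeroed
print(ridge_shrink(beta_ols, 0.5))  # every coefficient shrinks, none to zero
```

This is why a Lasso path flatlines at zero at finite \(\lambda\) while a Ridge path only decays toward it.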

How to Interact

Switch datasets to compare sparse vs dense true coefficient structures. On a sparse dataset, Lasso should correctly identify the true nonzero predictors near \(\lambda_{\min}\).

Related Topics
Coefficient Comparison
Train MSE
Test MSE
Nonzero Coefs
Residuals vs Fitted
◆ THE BLUEPRINT
The Anvil

This is the unguided workbench. Combine any dataset, method, and regularization strength. Evaluate using train/test MSE and the coefficient comparison.

What to Try

Start with a sparse dataset and Lasso at auto-λ. Note how many coefficients are zeroed. Switch to Ridge with the same data. Test MSE is often similar, but every coefficient remains nonzero.

Try a dense dataset where all coefficients are nonzero. Ridge often outperforms Lasso here because there is nothing to select away.

Other Forge Tools