Fit & Residuals
Loss Function Shapes
Total Loss
◆ THE BLUEPRINT
What You're Looking At

A regression line positioned by your sliders. The vertical segments are residuals: the error between each data point and your line. The loss function decides how much each error costs.

Key Equations
Total loss: \(L = \sum_{i=1}^{n} \rho(e_i)\) where \(e_i = y_i - \hat{y}_i\)
\(\rho_{L2}(e) = e^2 \qquad \rho_{L1}(e) = |e| \qquad \rho_{Huber}(e) = \begin{cases} e^2/2 & |e| \leq \delta \\ \delta|e| - \delta^2/2 & |e| > \delta \end{cases}\)
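The three penalty functions above translate directly into code. A minimal NumPy sketch (function names are illustrative; `delta` is the Huber threshold \(\delta\)):

```python
import numpy as np

def rho_l2(e):
    return e ** 2

def rho_l1(e):
    return np.abs(e)

def rho_huber(e, delta=1.0):
    # Quadratic for small errors, linear beyond the threshold delta.
    return np.where(np.abs(e) <= delta,
                    0.5 * e ** 2,
                    delta * np.abs(e) - 0.5 * delta ** 2)

def total_loss(y, y_hat, rho):
    # L = sum_i rho(e_i) with e_i = y_i - y_hat_i
    return np.sum(rho(y - y_hat))
```

Swapping `rho` lets you score the same residuals under each loss and see which errors dominate the total.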
How to Interact

Drag the intercept and slope sliders to reposition your line. Watch the total loss numbers update. Try to minimize each loss manually, then toggle "Show Optimal Lines" to see the true minimizer.

Click "Add Outlier" and compare how L2 and L1 respond. The L2 loss spikes quadratically with the outlier's distance; the L1 loss grows only linearly.

Related Topics
L2 (OLS)
L1 (Median)
Method Comparison
Takeaway
◆ THE BLUEPRINT
What You're Looking At

The same data fit by OLS (L2) and median regression (L1), side by side. Add outliers and watch OLS get pulled while L1 holds steady.

Why L2 Fails With Outliers
L2 cost grows as \(e^2\). An outlier with \(e = 10\) costs \(100\). The optimizer moves the entire line to reduce that one giant squared cost.
L1 cost grows as \(|e|\). The same outlier costs only \(10\). Each point's influence is bounded.
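The asymmetry is easy to check numerically. With hypothetical residuals where the last point is the outlier:

```python
import numpy as np

residuals = np.array([0.5, -1.0, 0.8, -0.3, 10.0])  # last entry is the outlier

l2_costs = residuals ** 2      # 0.25, 1.0, 0.64, 0.09, 100.0
l1_costs = np.abs(residuals)   # 0.5, 1.0, 0.8, 0.3, 10.0

# Share of the total loss contributed by the one outlier:
print(l2_costs[-1] / l2_costs.sum())  # ~0.98 under L2
print(l1_costs[-1] / l1_costs.sum())  # ~0.79 under L1
```

Under L2 a single bad point dominates the objective, so the optimizer bends the whole line toward it; under L1 its influence stays bounded.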
When to Use Each

L2 (OLS): clean data with Gaussian errors.

L1 (Median): known outliers or heavy-tailed error distributions.

Huber: a compromise between the two. Quadratic near zero (like L2), linear for large errors (like L1); the threshold \(\delta\) sets the crossover.

Related Topics
Contour Plot
Coefficients
Interpretation
Coefficient Paths vs log(λ)
Cross-Validation Error
Selected Model
◆ THE BLUEPRINT
What You're Looking At

Each colored line traces one coefficient's value across a range of regularization strengths. The x-axis is \(\log_{10}(\lambda)\). Moving right means more regularization. At the left, all coefficients are near their OLS values. At the right, all approach zero.

Key Quantities
\(\lambda_{\min}\): the \(\lambda\) that minimizes 10-fold cross-validation error.
\(\lambda_{1se}\): the largest \(\lambda\) whose mean CV error is within one standard error of the error at \(\lambda_{\min}\). Produces a simpler model at negligible CV cost.
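Both quantities can be computed from a fold-by-fold CV error matrix. A sketch with synthetic errors (any array of shape n_lambdas × n_folds works; the error curve here is made up for illustration):

```python
import numpy as np

# Synthetic CV errors: rows follow an increasing lambda grid, columns are 10 folds.
rng = np.random.default_rng(0)
lambdas = np.logspace(-3, 1, 20)
curve = 1.0 + (np.log10(lambdas) + 1.0) ** 2          # U-shaped mean error, min near lambda = 0.1
cv_errors = curve[:, None] + 0.1 * rng.standard_normal((20, 10))

mean_err = cv_errors.mean(axis=1)
se_err = cv_errors.std(axis=1, ddof=1) / np.sqrt(10)  # standard error across folds

i_min = mean_err.argmin()
lam_min = lambdas[i_min]

# 1-SE rule: largest lambda whose mean CV error stays within one SE of the minimum.
threshold = mean_err[i_min] + se_err[i_min]
lam_1se = lambdas[mean_err <= threshold].max()
```

Because \(\lambda_{\min}\) always satisfies its own threshold, `lam_1se >= lam_min` holds by construction, so the 1-SE model is never less regularized than the minimizer.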
Lasso vs Ridge Paths

Lasso paths hit zero at finite \(\lambda\). The coefficient flatlines and stays there. This is variable selection.

Ridge paths approach zero asymptotically but never reach it. All predictors remain in the model.
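The zero-vs-asymptotic contrast is visible in closed form for an orthonormal design, where Lasso soft-thresholds each OLS coefficient and Ridge rescales it (standard textbook identities, sketched here on made-up coefficients):

```python
import numpy as np

beta_ols = np.array([3.0, 0.4, -1.2])

def lasso_shrink(beta, lam):
    # Soft-thresholding: exactly zero once lam >= |beta|.
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def ridge_shrink(beta, lam):
    # Proportional shrinkage: approaches zero but never reaches it.
    return beta / (1.0 + lam)

print(lasso_shrink(beta_ols, 0.5))  # the 0.4 coefficient is already zeroed
print(ridge_shrink(beta_ols, 0.5))  # every coefficient shrinks, none to zero
```

This is why a Lasso path flatlines at zero at finite \(\lambda\) while a Ridge path only decays toward it.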

How to Interact

Switch datasets to compare sparse vs dense true coefficient structures. On a sparse dataset, Lasso should correctly identify the true nonzero predictors near \(\lambda_{\min}\).

Related Topics
Coefficient Comparison
Train MSE
Test MSE
Nonzero Coefs
Residuals vs Fitted
◆ THE BLUEPRINT
The Anvil

This is the unguided workbench. Combine any dataset, method, and regularization strength. Evaluate using train/test MSE and the coefficient comparison.

What to Try

Start with a sparse dataset and Lasso at auto-λ. Note how many coefficients are zeroed. Switch to Ridge with the same data. Test MSE is often similar, but every coefficient remains nonzero.

Try a dense dataset where all coefficients are nonzero. Ridge often outperforms Lasso here because there is nothing to select away.

Other Forge Tools