Regularization: Keeping Models in Check with L1 and L2

Learn how Lasso (L1) and Ridge (L2) prevent overfitting and improve your models.

What is Regularization? Keeping Models in Check

Imagine training a machine learning model, like one predicting house prices. If the model gets too complex (maybe using too many features or high-degree polynomials), it might learn the training data perfectly – including all the random noise and tiny details. That sounds good, but it usually hurts: the model has memorized quirks of the training set instead of the underlying pattern. This is called overfitting.

An overfit model might ace the test on data it’s already seen, but it fails miserably when shown new, unseen data. How can we prevent this?

That’s where Regularization comes in. It’s a set of techniques used to prevent overfitting by adding a penalty to the model’s learning process, discouraging it from becoming too complex.

Main Technical Concept: Regularization adds a penalty term to the model’s loss function (the function it tries to minimize during training). This penalty is based on the size of the model’s coefficients (weights). By forcing the model to keep its weights small, regularization helps create simpler models that generalize better to new data.

How Does Adding a Penalty Help?

The Core Idea: Penalizing Complexity

Think of a model’s coefficients (often denoted as β or w) as representing how much importance the model gives to each input feature. Complex models that overfit often have very large coefficients.

Regularization adds a “cost” based on these coefficients:

Regularized Loss = Error(y, ŷ) + λ * Penalty(Coefficients)

λ (lambda) = The Regularization Parameter (called alpha in scikit-learn). Controls how strong the penalty is.

By adding this penalty, we force the model to find a balance: it can’t just make coefficients huge to minimize error; it also has to keep them small.
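To make that balance concrete, here is a minimal sketch of the formula above. Everything in it is an assumption for illustration: the helper names (`regularized_loss`, `abs_sum`, `preferred`), the placeholder penalty (sum of absolute coefficient values), and the two made-up candidate models with their error values.

```python
def regularized_loss(error, coefs, lam, penalty):
    """Training error plus lam times a penalty on the coefficients."""
    return error + lam * penalty(coefs)


def abs_sum(coefs):
    # Placeholder penalty for illustration: total absolute coefficient size.
    return sum(abs(b) for b in coefs)


# Two hypothetical fitted models (made-up numbers, for illustration only).
model_a = {"error": 0.10, "coefs": [8.0, -6.0]}  # fits tightly, huge weights
model_b = {"error": 0.50, "coefs": [1.5, 0.5]}   # fits a bit worse, small weights


def preferred(lam):
    """Return which candidate has the lower regularized loss at this lam."""
    loss_a = regularized_loss(model_a["error"], model_a["coefs"], lam, abs_sum)
    loss_b = regularized_loss(model_b["error"], model_b["coefs"], lam, abs_sum)
    return "A" if loss_a < loss_b else "B"


print(preferred(0.0))  # A: with no penalty, the lowest raw error wins
print(preferred(0.1))  # B: 0.10 + 0.1*14 = 1.5 loses to 0.50 + 0.1*2 = 0.7
```

With the penalty switched on, the model with smaller weights wins even though its raw error is higher, which is exactly the trade-off λ controls.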

L1 Regularization (Lasso Regression)

The “Absolute Value” Penalty

L1 adds a penalty equal to the sum of the absolute values of the coefficients:

L1 Penalty = Σ |βj|, so the term added to the loss is λ * Σ |βj|

Key Effect: Sparsity and Feature Selection

The most striking effect: L1 can force some coefficients to become exactly zero!
This means Lasso effectively performs automatic feature selection
Results in a sparse model (fewer active features)

When to Use L1? When you suspect many features are irrelevant and want a simpler, more interpretable model.
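The "exactly zero" behavior can be seen in the simplest possible case. The sketch below assumes a single feature, no intercept, and a sum-of-squared-errors loss; under those assumptions the L1-penalized solution has a closed form known as soft-thresholding. The function name `lasso_1d` and the toy data are made up for illustration.

```python
def lasso_1d(x, y, lam):
    """Minimizer of sum((y_i - b*x_i)**2) + lam*|b| for one feature, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Soft-thresholding: pull |sxy| toward zero by lam/2 and clamp at zero.
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # exactly y = 2x, so the unpenalized slope is 2

for lam in (0.0, 28.0, 56.0, 100.0):
    print(lam, lasso_1d(x, y, lam))
# slope goes 2.0 -> 1.0 -> 0.0 -> 0.0: past a threshold it is exactly zero
```

The clamp at zero is what produces sparsity: once λ is large enough, the coefficient does not just get small, it disappears.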

L2 Regularization (Ridge Regression)

The “Squared Value” Penalty

L2 adds a penalty equal to the sum of the squared values of the coefficients:

L2 Penalty = Σ βj², so the term added to the loss is λ * Σ βj²

Key Effect: Coefficient Shrinkage

L2 encourages coefficients to be small and spread out more evenly
Shrinks coefficients towards zero but rarely forces them to exactly zero
All features are typically kept, but their influence is moderated

When to Use L2? Good general-purpose regularizer when most features are useful but you want to prevent any from dominating.
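Shrinkage without zeroing shows up in the same simple setting. This sketch again assumes a single feature, no intercept, and a sum-of-squared-errors loss, where the L2-penalized slope has a one-line closed form. The name `ridge_1d` and the toy data are hypothetical.

```python
def ridge_1d(x, y, lam):
    """Minimizer of sum((y_i - b*x_i)**2) + lam*b**2 for one feature, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # The penalty adds lam to the denominator, scaling the slope toward zero.
    return sxy / (sxx + lam)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # exactly y = 2x, so the unpenalized slope is 2

for lam in (0.0, 14.0, 56.0, 1000.0):
    print(lam, ridge_1d(x, y, lam))
# the slope keeps shrinking as lam grows, but never reaches exactly zero
```

Because λ only enlarges the denominator, the slope is scaled down smoothly; no finite λ can make it exactly zero unless it was zero to begin with.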

The Tuning Knob: Lambda (λ or Alpha)

High λ: Stronger penalty → simpler models, risk of underfitting
Low λ: Weaker penalty → model behaves more like standard regression, risk of overfitting
λ = 0: No penalty, equivalent to unregularized regression

Finding a good λ (alpha) is crucial, and it is usually done with cross-validation.
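One way such a search might look in scikit-learn is sketched below. The synthetic dataset, the alpha grid, and the choice of 5 folds are all assumptions made for illustration, not recommendations; the features are standardized inside the pipeline as a precaution so the penalty treats them evenly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 20 features, 5 informative.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Candidate penalty strengths on a log scale; the range is an arbitrary choice.
alphas = np.logspace(-3, 2, 20)

# Keeping the scaler in the pipeline means it is refit on each training fold.
pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))

search = GridSearchCV(
    pipeline,
    param_grid={"lasso__alpha": alphas},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["lasso__alpha"]
best_lasso = search.best_estimator_.named_steps["lasso"]
print("best alpha:", best_alpha)
print("non-zero coefficients:", int(np.sum(best_lasso.coef_ != 0)))
```

Swapping `Lasso` for `Ridge` (and the parameter key for `ridge__alpha`) would run the same search for an L2 penalty.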

Effect of Lambda on Model Behavior

         Underfit              Optimal              Overfit
         ─────────────────────────────────────────────────────
λ:       Very High               ★                  λ = 0

L1:      Most coefs = 0       Few non-zero       Many non-zero
         (too sparse)         (feature sel.)     (no pruning)

L2:      All coefs ≈ 0        Small coefs        Large coefs
         (too shrunken)       (well-balanced)    (unconstrained)

Test     High                 Low                High
Error:   (underfitting)       (sweet spot)       (overfitting)

→ Use cross-validation to find the λ that minimizes validation error, the stand-in for test error (★)

Feature Scaling is Critical!

Regularization penalizes the size of coefficients. If features are on vastly different scales, regularization will unfairly penalize features based on their scale rather than importance.

Solution: Always scale features (standardization or normalization) before applying L1 or L2, and fit the scaler on the training data only so nothing leaks in from held-out data.
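A small sketch of why the units matter, assuming a single feature, an unpenalized intercept, and an L2 penalty. The house sizes and prices are made-up numbers, and the helper names (`ridge_slope`, `standardize`, `kept_fraction`) are hypothetical.

```python
from statistics import mean, pstdev


def ridge_slope(x, y, lam):
    """Ridge slope for one feature with an unpenalized intercept (data centered first)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / (sxx + lam)


def standardize(x):
    """Rescale to zero mean and unit (population) standard deviation."""
    mx, sd = mean(x), pstdev(x)
    return [(xi - mx) / sd for xi in x]


def kept_fraction(x, y, lam):
    """Share of the unpenalized slope that survives the penalty (1.0 = no shrinkage)."""
    return ridge_slope(x, y, lam) / ridge_slope(x, y, 0.0)


# Same houses, same information, two different units (made-up data).
size_sqft = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
size_ksqft = [s / 1000.0 for s in size_sqft]  # thousands of square feet
price = [200.0, 290.0, 410.0, 500.0, 600.0]   # in thousands of dollars

lam = 10.0
print(kept_fraction(size_sqft, price, lam))    # very close to 1: barely penalized
print(kept_fraction(size_ksqft, price, lam))   # about 0.2: heavily penalized
print(kept_fraction(standardize(size_sqft), price, lam))   # about 1/3
print(kept_fraction(standardize(size_ksqft), price, lam))  # same, about 1/3
```

The same λ barely touches the feature measured in square feet but crushes it when measured in thousands of square feet. After standardizing, the choice of units no longer changes how hard the penalty bites.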

L1 (Lasso) vs. L2 (Ridge): Key Differences

Aspect              L1 (Lasso)                          L2 (Ridge)
Penalty             Σ|β|                                Σβ²
Effect              Forces some coefs to exactly zero   Shrinks coefs, rarely zero
Feature Selection   Automatic (sparsity)                Keeps all features
Use Case            Many irrelevant features            General purpose, multicollinearity

Elastic Net adds both the L1 and L2 penalties at once, with a mixing weight that sets how much of the total comes from each.
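As a rough sketch, the combined penalty can be written as a weighted mix of the two sums. The weighting below follows the convention scikit-learn documents for its `ElasticNet` estimator (`alpha` for overall strength, `l1_ratio` for the mix, and a factor of 0.5 on the L2 part); the function name `elastic_net_penalty` and the example coefficients are made up for illustration.

```python
def elastic_net_penalty(coefs, lam, l1_ratio):
    """Weighted mix of L1 and L2 penalties; l1_ratio=1 is pure L1, 0 is pure L2."""
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    # Assumed convention: the L2 part carries an extra factor of 0.5.
    return lam * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)


coefs = [3.0, -4.0]  # sum of |b| is 7, sum of b^2 is 25
print(elastic_net_penalty(coefs, lam=1.0, l1_ratio=1.0))  # 7.0  (L1 only)
print(elastic_net_penalty(coefs, lam=1.0, l1_ratio=0.0))  # 12.5 (L2 only, halved)
print(elastic_net_penalty(coefs, lam=1.0, l1_ratio=0.5))  # 9.75 (even mix)
```

This adds a second knob to tune: the mix ratio is usually searched alongside the overall strength.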

Regularization: Key Takeaways

Regularization prevents overfitting by adding a penalty for complex models
L1 (Lasso) uses an absolute value penalty (Σ|β|), leading to sparsity and automatic feature selection
L2 (Ridge) uses a squared value penalty (Σβ²), leading to coefficient shrinkage
The strength is controlled by lambda (λ) / alpha (α), which needs careful tuning
Feature scaling is crucial before applying regularization
L1 is preferred when feature selection is desired; L2 for general regularization and stability