Regularization: Keeping Models in Check with L1 and L2
Master L1 (Lasso) and L2 (Ridge) regularization. Learn how adding penalty terms prevents overfitting, understand feature selection via L1, and tune the crucial lambda hyperparameter.
Learn how Lasso (L1) and Ridge (L2) prevent overfitting and improve your models.
What is Regularization? Keeping Models in Check
Imagine training a machine learning model, like one predicting house prices. If the model gets too complex (maybe using too many features or high-degree polynomials), it might learn the training data perfectly, including all the random noise and tiny details. This sounds good, but it usually means the model has memorized quirks of the training set rather than the underlying pattern. This is called overfitting.
An overfit model might ace the test on data it’s already seen, but it fails miserably when shown new, unseen data. How can we prevent this?
That’s where Regularization comes in. It’s a set of techniques used to prevent overfitting by adding a penalty to the model’s learning process, discouraging it from becoming too complex.
Main Technical Concept: Regularization adds a penalty term to the model’s loss function (the function it tries to minimize during training). This penalty is based on the size of the model’s coefficients (weights). By forcing the model to keep its weights small, regularization helps create simpler models that generalize better to new data.
How Does Adding a Penalty Help?
The Core Idea: Penalizing Complexity
Think of a model’s coefficients (often denoted as β or w) as representing how much importance the model gives to each input feature. Complex models that overfit often have very large coefficients.
Regularization adds a “cost” based on these coefficients:
Regularized Loss = Error(y, ŷ) + λ * Penalty(Coefficients)
λ (lambda) = the regularization parameter (called alpha in scikit-learn's Lasso and Ridge). It controls how strong the penalty is.
By adding this penalty, we force the model to find a balance: it can’t just make coefficients huge to minimize error; it also has to keep them small.
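To make the formula concrete, here is a minimal NumPy sketch of the regularized loss. It assumes mean squared error as the error term (the text above leaves the error function open), and the function name `regularized_loss` and the three-point dataset are hypothetical, chosen only for illustration.

```python
import numpy as np

def regularized_loss(X, y, beta, lam, penalty="l2"):
    """Mean squared error plus lam times a coefficient penalty."""
    residuals = y - X @ beta
    mse = np.mean(residuals ** 2)
    if penalty == "l1":
        size = np.sum(np.abs(beta))   # sum of absolute values
    else:
        size = np.sum(beta ** 2)      # sum of squares
    return mse + lam * size

# Tiny hypothetical dataset: y is exactly 2 * x, so beta = [2.0] has zero error.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
beta = np.array([2.0])

print(regularized_loss(X, y, beta, lam=0.0))                # 0.0
print(regularized_loss(X, y, beta, lam=0.5, penalty="l1"))  # 1.0
print(regularized_loss(X, y, beta, lam=0.5, penalty="l2"))  # 2.0
```

Even with a perfect fit, the loss is non-zero once λ > 0, so the optimizer has a reason to prefer smaller coefficients if they cost only a little extra error.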
L1 Regularization (Lasso Regression)
The “Absolute Value” Penalty
L1 adds a penalty term equal to λ times the sum of the absolute values of the coefficients:
L1 Penalty = λ * Σ |βj|
Key Effect: Sparsity and Feature Selection
- The most striking effect: L1 can force some coefficients to become exactly zero!
- This means Lasso effectively performs automatic feature selection
- Results in a sparse model (fewer active features)
When to Use L1? When you suspect many features are irrelevant and want a simpler, more interpretable model.
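The feature-selection effect can be seen in a short scikit-learn sketch. The data here is synthetic and hypothetical: 10 features, of which only the first 3 actually influence the target. The value `alpha=0.1` is an arbitrary illustrative choice, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 10 features, only the first 3 affect y.
X = rng.normal(size=(200, 10))
true_beta = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_beta + 0.1 * rng.normal(size=200)

# alpha is scikit-learn's name for lambda; 0.1 is an arbitrary illustrative value.
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

kept = np.flatnonzero(lasso.coef_)   # indices of features with non-zero weight
print("Features kept:", kept)
print("Coefficients:", np.round(lasso.coef_, 2))
```

On data like this, the irrelevant features should come out with coefficients of exactly 0.0, while the relevant ones are kept but shrunk slightly toward zero.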
L2 Regularization (Ridge Regression)
The “Squared Value” Penalty
L2 adds a penalty term equal to λ times the sum of the squared values of the coefficients:
L2 Penalty = λ * Σ βj²
Key Effect: Coefficient Shrinkage
- L2 encourages coefficients to be small and spreads weight more evenly across correlated features
- Shrinks coefficients towards zero but rarely forces them to exactly zero
- All features are typically kept, but their influence is moderated
When to Use L2? Good general-purpose regularizer when most features are useful but you want to prevent any from dominating.
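Shrinkage is easy to observe with the closed-form ridge solution. The sketch below assumes a sum-of-squared-errors loss with no intercept, which gives the solution (XᵀX + λI)⁻¹Xᵀy; the helper name `ridge_fit` and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^-1 X^T y (no intercept)."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.5, 0.0]) + rng.normal(size=50)

# Larger lambda pulls every coefficient toward zero without zeroing any out.
for lam in [0.0, 1.0, 10.0, 100.0]:
    beta = ridge_fit(X, y, lam)
    print(f"lambda={lam:>5}: coefs={np.round(beta, 2)}, "
          f"sum of squares={np.sum(beta ** 2):.2f}")
```

The sum of squared coefficients falls as λ grows, but unlike the L1 case the individual coefficients stay non-zero.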
The Tuning Knob: Lambda (λ or Alpha)
- High λ: Stronger penalty → simpler models, risk of underfitting
- Low λ: Weaker penalty → model behaves more like standard regression, risk of overfitting
- λ = 0: No penalty, equivalent to unregularized regression
Finding the optimal λ/α is crucial and is usually done with cross-validation.
Effect of Lambda on Model Behavior
| | Underfit | Optimal | Overfit |
|---|---|---|---|
| λ | Very high | ★ | λ = 0 |
| L1 | Most coefs = 0 (too sparse) | Few non-zero (feature selection) | Many non-zero (no pruning) |
| L2 | All coefs ≈ 0 (too shrunken) | Small coefs (well-balanced) | Large coefs (unconstrained) |
| Test error | High (underfitting) | Low (sweet spot) | High (overfitting) |
→ Use cross-validation to find the λ that minimizes error on held-out data, which serves as an estimate of test error (★)
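As a rough sketch of that search, scikit-learn's `LassoCV` tries each candidate alpha with k-fold cross-validation and keeps the one with the lowest held-out error. The synthetic data, the log-spaced grid of 20 candidates, and the choice of 5 folds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 10 features, only the first 3 affect y.
X = rng.normal(size=(200, 10))
y = X @ np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0]) + rng.normal(size=200)

# Candidate penalty strengths on a log scale (the grid itself is an assumption).
alphas = np.logspace(-3, 1, 20)

# 5-fold cross-validation: each alpha is scored on held-out folds.
model = LassoCV(alphas=alphas, cv=5)
model.fit(X, y)

print("Selected alpha:", model.alpha_)
print("Non-zero coefficients:", np.count_nonzero(model.coef_))
```

`RidgeCV` offers the same kind of search for the L2 penalty. A log-spaced grid is a common starting point because useful values of λ often span several orders of magnitude.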
Feature Scaling is Critical!
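Regularization penalizes the size of coefficients, and a coefficient's size depends on the units of its feature. A feature recorded in small numbers (area in thousands of square feet, say) needs a much larger coefficient than the same feature recorded in square feet, so it is penalized more heavily for reasons that have nothing to do with its importance.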
Solution: Always scale features (Standardization or Normalization) before applying L1 or L2.
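The sketch below illustrates the problem and the fix with scikit-learn. The house data is synthetic and hypothetical, `alpha=10.0` is an arbitrary value, and `fit_predict` is a small helper defined only for this example. The same area column is expressed in two different units to show that an unscaled Ridge model depends on the unit while a standardized one does not.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical house data: area in square feet and number of rooms.
area_sqft = rng.normal(1500, 400, size=100)
rooms = rng.integers(1, 7, size=100)
y = 100 * area_sqft + 5000 * rooms + rng.normal(0, 1000, size=100)

X_sqft = np.column_stack([area_sqft, rooms])
X_ksqft = np.column_stack([area_sqft / 1000, rooms])  # same data, different unit

def fit_predict(model, X):
    """Fit on (X, y) and return predictions for the same rows."""
    return model.fit(X, y).predict(X)

# Without scaling, the choice of unit changes the fitted model.
raw_a = fit_predict(Ridge(alpha=10.0), X_sqft)
raw_b = fit_predict(Ridge(alpha=10.0), X_ksqft)
print("Unscaled predictions agree:", np.allclose(raw_a, raw_b))

# With standardization inside a pipeline, the unit no longer matters.
scaled_a = fit_predict(make_pipeline(StandardScaler(), Ridge(alpha=10.0)), X_sqft)
scaled_b = fit_predict(make_pipeline(StandardScaler(), Ridge(alpha=10.0)), X_ksqft)
print("Scaled predictions agree:", np.allclose(scaled_a, scaled_b))
```

The first check is expected to print False and the second True. Putting the scaler inside the pipeline also means its mean and standard deviation are learned only from the data passed to `fit`, which keeps held-out data from leaking into the scaling step during cross-validation.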
L1 (Lasso) vs. L2 (Ridge): Key Differences
| Feature | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | Σ\|β\| | Σβ² |
| Effect | Forces some coefs to exactly zero | Shrinks coefs, rarely zero |
| Feature Selection | Automatic (sparsity) | Keeps all features |
| Use Case | Many irrelevant features | General purpose, multicollinearity |
Elastic Net combines the L1 and L2 penalties in a single model, giving some of Lasso's sparsity along with Ridge's stability when features are correlated.
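For completeness, here is a minimal sketch using scikit-learn's `ElasticNet`, where `alpha` sets the overall penalty strength and `l1_ratio` sets the mix between the two penalties. The synthetic data and both parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 10 features, only the first 3 affect y.
X = rng.normal(size=(200, 10))
y = X @ np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0]) + 0.1 * rng.normal(size=200)

# alpha sets the overall penalty strength; l1_ratio sets the L1/L2 mix
# (1.0 is pure L1, 0.0 is pure L2). Both values here are illustrative.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)

print("Coefficients:", np.round(enet.coef_, 2))
print("Non-zero coefficients:", np.count_nonzero(enet.coef_))
```

In practice both `alpha` and `l1_ratio` would be tuned with cross-validation rather than fixed by hand.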
Regularization: Key Takeaways
- Regularization prevents overfitting by adding a penalty for complex models
- L1 (Lasso) uses absolute value penalty (Σ|β|), leading to sparsity and automatic feature selection
- L2 (Ridge) uses squared value penalty (Σβ²), leading to coefficient shrinkage
- The strength is controlled by lambda (λ) / alpha (α), which needs careful tuning
- Feature scaling is crucial before applying regularization
- L1 is preferred when feature selection is desired; L2 for general regularization and stability