Deriving Logistic Regression's Update Rule
Core Concepts to Understand
- Problem Setup: Defining inputs, outputs, parameters for binary classification.
- Sigmoid Function: Its mathematical form, properties, and crucially, its derivative.
- Probabilistic Model: How logistic regression models P(y=1|x) and P(y=0|x).
- Maximum Likelihood Estimation (MLE): The principle of finding parameters that maximize the likelihood of observed data.
- Log-Likelihood: Why we use it (numerical stability, easier differentiation).
- Cost Function: The negative log-likelihood (or binary cross-entropy).
- Gradient Derivation: Applying the chain rule step-by-step.
- Gradient Descent Update Rule: The final form for parameter updates.
- Regularization (L1 & L2): How it modifies the cost function and gradient, and its implications.
Derivation Walkthrough
Problem Setup
For binary classification with logistic regression:
- Input features: x ∈ ℝd
- Binary labels: y ∈ {0, 1}
- Parameters: θ = [θ₀, θ₁, ..., θd]T
- Linear combination: z = θTx = θ₀ + θ₁x₁ + ... + θdxd
Step 1: Sigmoid Function and Its Properties
The sigmoid function maps any real number to (0,1):
σ(z) = 1 / (1 + e-z)
Key Properties:
- σ(z) ∈ (0, 1) for all z ∈ ℝ
- σ(0) = 0.5
- σ(-z) = 1 - σ(z)
Derivative of Sigmoid (Critical for elegant update rule):
dσ(z)/dz = σ(z) × (1 - σ(z))
Proof:
dσ(z)/dz = d/dz [1 / (1 + e-z)]
= -1 / (1 + e-z)2 × d/dz(1 + e-z)
= -1 / (1 + e-z)2 × (-e-z)
= e-z / (1 + e-z)2
= [1 / (1 + e-z)] × [e-z / (1 + e-z)]
= σ(z) × [e-z / (1 + e-z)]
Since e-z / (1 + e-z) = (1 + e-z - 1) / (1 + e-z) = 1 - 1/(1 + e-z) = 1 - σ(z)
Therefore: dσ(z)/dz = σ(z)(1 - σ(z))
Step 2: Probabilistic Model
Logistic regression models the probability:
P(y = 1 | x, θ) = σ(θTx) = 1 / (1 + e-θTx)
P(y = 0 | x, θ) = 1 - σ(θTx) = σ(-θTx)
This can be written compactly as:
P(y | x, θ) = σ(θTx)y × (1 - σ(θTx))(1-y)
Step 3: Why Log-Likelihood?
Maximum Likelihood Estimation (MLE) Principle: We want to find θ that maximizes the probability of observing our training data.
For n training examples {(x₁, y₁), (x₂, y₂), ..., (xn, yn)}:
Likelihood:
L(θ) = ∏i=1n P(yi | xi, θ) = ∏i=1n σ(θTxi)yi × (1 - σ(θTxi))(1-yi)
Why take the logarithm?
- Numerical stability: Products of small probabilities → underflow
- Computational efficiency: Products become sums
- Optimization: Monotonic transformation preserves maxima
- Differentiability: Easier to differentiate sums than products
Log-Likelihood:
ℓ(θ) = log L(θ) = Σi=1n [yi log σ(θTxi) + (1-yi) log(1 - σ(θTxi))]
Step 4: Cost Function
We minimize the negative log-likelihood (cross-entropy loss):
J(θ) = -ℓ(θ) = -Σi=1n [yi log σ(θTxi) + (1-yi) log(1 - σ(θTxi))]
For a single example:
J(θ) = -[y log σ(θTx) + (1-y) log(1 - σ(θTx))]
Step 5: Gradient Derivation
To find the gradient ∇θJ(θ), we compute ∂J/∂θj for each parameter θj.
For a single training example:
∂J/∂θj = ∂/∂θj [-y log σ(θTx) - (1-y) log(1 - σ(θTx))]
Chain rule application:
∂J/∂θj = -y × (1/σ(θTx)) × ∂σ(θTx)/∂θj - (1-y) × (1/(1-σ(θTx))) × ∂(1-σ(θTx))/∂θj
Since ∂(1-σ(θTx))/∂θj = -∂σ(θTx)/∂θj:
∂J/∂θj = -y × (1/σ(θTx)) × ∂σ(θTx)/∂θj + (1-y) × (1/(1-σ(θTx))) × ∂σ(θTx)/∂θj
Factor out ∂σ(θTx)/∂θj:
∂J/∂θj = [-y/σ(θTx) + (1-y)/(1-σ(θTx))] × ∂σ(θTx)/∂θj
Now use the chain rule for ∂σ(θTx)/∂θj:
∂σ(θTx)/∂θj = σ(θTx)(1 - σ(θTx)) × ∂(θTx)/∂θj = σ(θTx)(1 - σ(θTx)) × xj
Substituting back:
∂J/∂θj = [-y/σ(θTx) + (1-y)/(1-σ(θTx))] × σ(θTx)(1 - σ(θTx)) × xj
Simplifying:
∂J/∂θj = [-y(1 - σ(θTx)) + (1-y)σ(θTx)] × xj
= [-y + yσ(θTx) + σ(θTx) - yσ(θTx)] × xj
= [σ(θTx) - y] × xj
Step 6: The Elegant Update Rule
For n training examples:
∂J/∂θj = Σi=1n (σ(θTxi) - yi) × xij
In vector form:
∇θJ(θ) = Σi=1n (σ(θTxi) - yi) × xi = XT(σ(Xθ) - y)
Gradient Descent Update:
θ := θ - α × ∇θJ(θ)
θ := θ - α × XT(σ(Xθ) - y)
Where α is the learning rate.
Why This Update Rule is "Elegant"
- Simple Form: The gradient has the intuitive form (prediction - actual) × feature
- No Complex Terms: Despite the non-linear sigmoid, the gradient is linear in the error
- Automatic Weighting: Larger errors get more weight in the update
- Sigmoid Derivative Cancellation: The σ(z)(1-σ(z)) terms cancel out beautifully
- Resembles Linear Regression: Same form as linear regression gradient!
Regularized Logistic Regression
L2 Regularization (Ridge)
Modified Cost Function:
J(θ) = -Σi=1n [yi log σ(θTxi) + (1-yi) log(1 - σ(θTxi))] + λ/2 Σj=1d θj2
Note: We typically don't regularize the bias term θ0.
Modified Gradient:
∂J/∂θj = Σi=1n (σ(θTxi) - yi) × xij + λθj (for j ≠ 0)
∂J/∂θ0 = Σi=1n (σ(θTxi) - yi) × xi0 (bias term, no regularization)
Update Rule:
θj := θj - α[Σi=1n (σ(θTxi) - yi) × xij + λθj]
θj := θj(1 - αλ) - α Σi=1n (σ(θTxi) - yi) × xij
Computational Implications of L2:
- Advantages:
- Smooth, differentiable everywhere
- Closed-form solutions possible (for linear models)
- Shrinks parameters proportionally
- Computational Cost: O(d) additional operations per iteration
- Memory: No additional memory overhead
L1 Regularization (Lasso)
Modified Cost Function:
J(θ) = -Σi=1n [yi log σ(θTxi) + (1-yi) log(1 - σ(θTxi))] + λ Σj=1d |θj|
Challenge: |θj| is not differentiable at θj = 0.
Subgradient:
∂|θj|/∂θj = {
+1 if θj > 0
-1 if θj < 0
[-1, +1] if θj = 0
}
Update Rule (Soft Thresholding):
θj := sign(θj - α∇j) × max(0, |θj - α∇j| - αλ)
Where ∇j is the gradient of the unregularized loss.
Computational Implications of L1:
- Advantages:
- Promotes sparsity (automatic feature selection)
- Robust to outliers
- Disadvantages:
- Non-differentiable at zero
- Requires specialized algorithms (proximal gradient, coordinate descent)
- Slower convergence than L2
- Computational Cost: O(d) additional operations but requires more sophisticated updates
- Memory: May require storing active set of non-zero parameters
Why This Derivation Matters
- Fundamental Understanding: Knowing this derivation solidifies your understanding of how logistic regression learns.
- Basis for Other Models: The concepts of likelihood, log-likelihood, cross-entropy, and gradient descent are foundational to many other machine learning models, especially in deep learning.
- Debugging & Modification: Understanding the "why" behind the update rule helps in debugging custom implementations or modifying the loss function for specific needs.
- Regularization Intuition: Seeing how regularization terms are added to the cost and gradient provides a clear picture of their impact.
- Interview Preparedness: This is a classic "ML fundamentals" interview question.