Logistic Regression: Gradient Descent Derivation — ML Depth

Deriving Logistic Regression's Update Rule

Core Concepts to Understand

Problem Setup: Defining inputs, outputs, parameters for binary classification.
Sigmoid Function: Its mathematical form, properties, and crucially, its derivative.
Probabilistic Model: How logistic regression models P(y=1|x) and P(y=0|x).
Maximum Likelihood Estimation (MLE): The principle of finding parameters that maximize the likelihood of observed data.
Log-Likelihood: Why we use it (numerical stability, easier differentiation).
Cost Function: The negative log-likelihood (or binary cross-entropy).
Gradient Derivation: Applying the chain rule step-by-step.
Gradient Descent Update Rule: The final form for parameter updates.
Regularization (L1 & L2): How it modifies the cost function and gradient, and its implications.

Derivation Walkthrough

Interviewer: Welcome! Today, we're going to dive deep into the fundamentals of logistic regression. Can you start by deriving the gradient descent update rule for logistic regression from first principles? Let's begin with the problem setup and the sigmoid function.

Candidate: Absolutely.

Problem Setup

For binary classification with logistic regression:

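Input features: x ∈ ℝ^d, augmented with a constant x₀ = 1 so that the bias folds into the dot product
Binary labels: y ∈ {0, 1}
Parameters: θ = [θ₀, θ₁, ..., θ_d]^T ∈ ℝ^(d+1), where θ₀ is the bias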
Linear combination: z = θ^Tx = θ₀ + θ₁x₁ + ... + θ_dx_d

Step 1: Sigmoid Function and Its Properties

The sigmoid function maps any real number to (0,1):

σ(z) = 1 / (1 + e^-z)

Key Properties:

σ(z) ∈ (0, 1) for all z ∈ ℝ
σ(0) = 0.5
σ(-z) = 1 - σ(z)

Derivative of Sigmoid (Critical for elegant update rule):

dσ(z)/dz = σ(z) × (1 - σ(z))

Proof:

dσ(z)/dz = d/dz [1 / (1 + e^-z)]
                     = -1 / (1 + e^-z)² × d/dz(1 + e^-z)
                     = -1 / (1 + e^-z)² × (-e^-z)
                     = e^-z / (1 + e^-z)²
                     = [1 / (1 + e^-z)] × [e^-z / (1 + e^-z)]
                     = σ(z) × [e^-z / (1 + e^-z)]

Since e^-z / (1 + e^-z) = (1 + e^-z - 1) / (1 + e^-z) = 1 - 1/(1 + e^-z) = 1 - σ(z)

Therefore: dσ(z)/dz = σ(z)(1 - σ(z))
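
The identity can be sanity-checked numerically. The following is a minimal sketch in plain Python, not a reference implementation; the helper names (sigmoid, sigmoid_grad, numeric_grad) are illustrative, and the branch on the sign of z is one common way to avoid overflow in the exponential.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written so that exp() never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)              # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def sigmoid_grad(z: float) -> float:
    """Analytic derivative from the proof: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_grad(f, z: float, h: float = 1e-6) -> float:
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Compare the analytic and numeric derivatives at a few points.
for z in (-3.0, 0.0, 2.5):
    print(z, sigmoid_grad(z), numeric_grad(sigmoid, z))
```

The two printed columns should agree to several decimal places, and the derivative peaks at z = 0, where σ(z) = 0.5.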

Interviewer: Excellent. Now, how do we use this to model probabilities, and what's the likelihood function for our training data? Explain why we typically move to the log-likelihood.

Candidate:

Step 2: Probabilistic Model

Logistic regression models the probability:

P(y = 1 | x, θ) = σ(θ^Tx) = 1 / (1 + e^{-θ^Tx})
P(y = 0 | x, θ) = 1 - σ(θ^Tx) = σ(-θ^Tx)

This can be written compactly as:

P(y | x, θ) = σ(θ^Tx)^y × (1 - σ(θ^Tx))^(1-y)

Step 3: Why Log-Likelihood?

Maximum Likelihood Estimation (MLE) Principle: We want to find θ that maximizes the probability of observing our training data.

For n training examples {(x₁, y₁), (x₂, y₂), ..., (x_n, y_n)}:

Likelihood:

L(θ) = ∏_{i=1}^{n} P(y_i | x_i, θ) = ∏_{i=1}^{n} σ(θ^Tx_i)^{y_i} × (1 - σ(θ^Tx_i))^{1-y_i}

Why take the logarithm?

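Numerical stability: A product of many probabilities in (0, 1) underflows to zero in floating point; a sum of logs does not
Computational efficiency: The log turns the product into a sum with one term per example
Optimization: log is strictly increasing, so L(θ) and log L(θ) are maximized by the same θ
Differentiability: A sum is differentiated term by term, with no product rule across n factors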

Log-Likelihood:

ℓ(θ) = log L(θ) = Σ_{i=1}^{n} [y_i log σ(θ^Tx_i) + (1-y_i) log(1 - σ(θ^Tx_i))]
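
The underflow argument is easy to see numerically. The sketch below uses made-up per-example probabilities (all equal to 0.5, chosen only for illustration) and compares the raw product with the sum of logs.

```python
import math

# Hypothetical values of P(y_i | x_i, theta) for 2000 examples.
probs = [0.5] * 2000

# Likelihood as a raw product: 0.5**2000 is below the smallest positive float64.
likelihood = 1.0
for p in probs:
    likelihood *= p

# Log-likelihood as a sum of logs: stays comfortably within floating-point range.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)       # underflows to 0.0
print(log_likelihood)   # finite, equal to 2000 * log(0.5)
```

Once the product has underflowed to zero, its logarithm is undefined and its gradient carries no information, which is why practical implementations work with the sum directly.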

Interviewer: Very clear. How does this lead us to the cost function we aim to optimize?

Candidate:

Step 4: Cost Function

We minimize the negative log-likelihood (cross-entropy loss):

J(θ) = -ℓ(θ) = -Σ_{i=1}^{n} [y_i log σ(θ^Tx_i) + (1-y_i) log(1 - σ(θ^Tx_i))]

For a single example:

J(θ) = -[y log σ(θ^Tx) + (1-y) log(1 - σ(θ^Tx))]
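
Evaluating this loss literally can fail when σ(θ^Tx) rounds to exactly 0 or 1, because log(0) is undefined. One common workaround, sketched here under the assumption that the loss is computed directly from the logit z = θ^Tx, uses the identities log σ(z) = -log(1 + e^-z) and log(1 - σ(z)) = -log(1 + e^z). The function names softplus and bce_from_logit are illustrative.

```python
import math

def softplus(z: float) -> float:
    """log(1 + e^z), computed without overflow for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def bce_from_logit(z: float, y: int) -> float:
    """Single-example loss -[y log sigma(z) + (1 - y) log(1 - sigma(z))]."""
    # log sigma(z) = -softplus(-z) and log(1 - sigma(z)) = -softplus(z)
    return y * softplus(-z) + (1 - y) * softplus(z)

print(bce_from_logit(0.0, 1))     # log(2): the model is maximally unsure
print(bce_from_logit(800.0, 0))   # about 800: confidently wrong, but still finite
```

For moderate z this agrees with the direct formula; the rewrite only matters at the extremes, where the direct form would attempt log(0).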

Interviewer: Perfect. Now for the core part: please derive the gradient of this cost function J(θ) with respect to a single parameter θ_j.

Candidate:

Step 5: Gradient Derivation

To find the gradient ∇_θJ(θ), we compute ∂J/∂θ_j for each parameter θ_j.

For a single training example:

∂J/∂θ_j = ∂/∂θ_j [-y log σ(θ^Tx) - (1-y) log(1 - σ(θ^Tx))]

Chain rule application:

∂J/∂θ_j = -y × (1/σ(θ^Tx)) × ∂σ(θ^Tx)/∂θ_j - (1-y) × (1/(1-σ(θ^Tx))) × ∂(1-σ(θ^Tx))/∂θ_j

Since ∂(1-σ(θ^Tx))/∂θ_j = -∂σ(θ^Tx)/∂θ_j:

∂J/∂θ_j = -y × (1/σ(θ^Tx)) × ∂σ(θ^Tx)/∂θ_j + (1-y) × (1/(1-σ(θ^Tx))) × ∂σ(θ^Tx)/∂θ_j

Factor out ∂σ(θ^Tx)/∂θ_j:

∂J/∂θ_j = [-y/σ(θ^Tx) + (1-y)/(1-σ(θ^Tx))] × ∂σ(θ^Tx)/∂θ_j

Now use the chain rule for ∂σ(θ^Tx)/∂θ_j:

∂σ(θ^Tx)/∂θ_j = σ(θ^Tx)(1 - σ(θ^Tx)) × ∂(θ^Tx)/∂θ_j = σ(θ^Tx)(1 - σ(θ^Tx)) × x_j

Substituting back:

∂J/∂θ_j = [-y/σ(θ^Tx) + (1-y)/(1-σ(θ^Tx))] × σ(θ^Tx)(1 - σ(θ^Tx)) × x_j

Simplifying:

∂J/∂θ_j = [-y(1 - σ(θ^Tx)) + (1-y)σ(θ^Tx)] × x_j
                    = [-y + yσ(θ^Tx) + σ(θ^Tx) - yσ(θ^Tx)] × x_j
                    = [σ(θ^Tx) - y] × x_j
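
A finite-difference check is a quick way to confirm the result (σ(θ^Tx) - y) × x_j. The following is a minimal sketch for a single example; the values of theta, x and y are made up, and x[0] = 1 is assumed to play the role of the bias feature.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def loss(theta, x, y):
    """Single-example negative log-likelihood."""
    s = sigmoid(sum(t * v for t, v in zip(theta, x)))
    return -(y * math.log(s) + (1 - y) * math.log(1.0 - s))

def analytic_grad(theta, x, y):
    """Derived result: (sigma(theta^T x) - y) * x_j for every j."""
    err = sigmoid(sum(t * v for t, v in zip(theta, x))) - y
    return [err * x_j for x_j in x]

def numeric_grad(theta, x, y, h=1e-6):
    """Central differences, perturbing one parameter at a time."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += h
        down[j] -= h
        grad.append((loss(up, x, y) - loss(down, x, y)) / (2.0 * h))
    return grad

theta = [0.2, -0.5, 1.0]     # illustrative parameters
x = [1.0, 2.0, -1.5]         # x[0] = 1 is the bias feature
y = 1

print(analytic_grad(theta, x, y))
print(numeric_grad(theta, x, y))
```

The two printed vectors should match to roughly six decimal places. The same check is useful when modifying the loss, since a mismatch points to an error in the hand-derived gradient.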

Interviewer: That's the key simplification! So, what is the full gradient descent update rule, and why is this result often described as "elegant"?

Candidate:

Step 6: The Elegant Update Rule

For n training examples:

∂J/∂θ_j = Σ_{i=1}^{n} (σ(θ^Tx_i) - y_i) × x_ij

In vector form:

∇_θJ(θ) = Σ_{i=1}^{n} (σ(θ^Tx_i) - y_i) × x_i = X^T(σ(Xθ) - y)

Gradient Descent Update:

θ := θ - α × ∇_θJ(θ)
θ := θ - α × X^T(σ(Xθ) - y)

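Where α is the learning rate, X is the n × (d+1) design matrix whose i-th row is x_i^T, y is the vector of labels, and σ is applied elementwise.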
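
Put together, batch gradient descent is only a few lines. This is a minimal sketch in plain Python rather than a production implementation: the toy data, the step size alpha = 0.05 and the iteration count are all made up for illustration, and the gradient is the summed form used above (not divided by n), so a suitable alpha shrinks as the dataset grows.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def gradient(theta, X, y):
    """Summed gradient: sum_i (sigma(theta^T x_i) - y_i) * x_i."""
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        z = sum(t * v for t, v in zip(theta, x_i))
        err = sigmoid(z) - y_i                  # prediction minus label
        for j, x_ij in enumerate(x_i):
            grad[j] += err * x_ij
    return grad

def fit(X, y, alpha=0.05, n_iters=5000):
    """Batch gradient descent: theta := theta - alpha * gradient."""
    theta = [0.0] * len(X[0])
    for _ in range(n_iters):
        g = gradient(theta, X, y)
        theta = [t - alpha * g_j for t, g_j in zip(theta, g)]
    return theta

# Made-up, non-separable toy data; the first column is the bias feature x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
y = [0, 0, 1, 0, 1, 1]

theta = fit(X, y)
print(theta)
print(gradient(theta, X, y))   # close to the zero vector once converged
```

The toy labels overlap on purpose. On perfectly separable data the unregularized maximum-likelihood estimate does not exist (the weights grow without bound), which is one motivation for regularization.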

Why This Update Rule is "Elegant"

Simple Form: Each example contributes (prediction - actual) × feature
No Complex Terms: The sigmoid appears only inside the prediction; the gradient is linear in the error σ(θ^Tx_i) - y_i
Automatic Weighting: Examples with larger errors contribute more to the update
Sigmoid Derivative Cancellation: The σ(z)(1-σ(z)) factor from the chain rule cancels against the 1/σ(z) and 1/(1-σ(z)) factors from the log terms
Resembles Linear Regression: Least squares with loss ½‖Xθ - y‖² has gradient X^T(Xθ - y), the same form with the prediction Xθ in place of σ(Xθ)

Interviewer: Excellent derivation and explanation. Now, let's consider regularization. How would you modify this for L2 regularized logistic regression, and what are the key computational implications?

Candidate:

Regularized Logistic Regression

L2 Regularization (Ridge)

Modified Cost Function:

J(θ) = -Σ_{i=1}^{n} [y_i log σ(θ^Tx_i) + (1-y_i) log(1 - σ(θ^Tx_i))] + (λ/2) Σ_{j=1}^{d} θ_j²

Note: We typically don't regularize the bias term θ₀.

Modified Gradient:

∂J/∂θ_j = Σ_{i=1}^{n} (σ(θ^Tx_i) - y_i) × x_ij + λθ_j  (for j ≠ 0)
∂J/∂θ₀ = Σ_{i=1}^{n} (σ(θ^Tx_i) - y_i) × x_i0  (bias term, with x_i0 = 1; no regularization)

Update Rule:

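For j ≠ 0:
θ_j := θ_j - α[Σ_{i=1}^{n} (σ(θ^Tx_i) - y_i) × x_ij + λθ_j]
Equivalently, shrink θ_j by the factor (1 - αλ) and then take the usual step (weight decay):
θ_j := θ_j(1 - αλ) - α Σ_{i=1}^{n} (σ(θ^Tx_i) - y_i) × x_ij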
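
The regularized step can be sketched as a small function. The name l2_step and its arguments are illustrative; data_grad is assumed to hold the gradient of the unregularized loss, and index 0 is assumed to be the bias, which is left unpenalized.

```python
def l2_step(theta, data_grad, alpha, lam):
    """One gradient step for the loss plus (lam / 2) * sum_{j >= 1} theta_j ** 2."""
    new_theta = []
    for j, (t, g) in enumerate(zip(theta, data_grad)):
        if j == 0:
            new_theta.append(t - alpha * g)                        # bias: no shrinkage
        else:
            new_theta.append(t * (1.0 - alpha * lam) - alpha * g)  # weight-decay form
    return new_theta

# Illustrative numbers only.
theta = [1.0, 2.0, -3.0]
data_grad = [0.5, -1.0, 0.25]
print(l2_step(theta, data_grad, alpha=0.1, lam=0.5))
```

Each penalized weight is first multiplied by (1 - αλ), so the shrinkage is proportional to the weight's current size; this requires αλ < 1 for the factor to stay positive.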

Computational Implications of L2:

Advantages:
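- Smooth and differentiable everywhere, so plain gradient descent applies unchanged
- A closed-form solution exists only for the linear regression analogue (ridge); logistic regression still needs an iterative solver
- Shrinks each penalized parameter in proportion to its current size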
Computational Cost: O(d) additional operations per iteration
Memory: No additional memory overhead

Interviewer: Good. And what about L1 regularization (Lasso)? How does that differ, and what are its implications?

Candidate:

L1 Regularization (Lasso)

Modified Cost Function:

J(θ) = -Σ_{i=1}^{n} [y_i log σ(θ^Tx_i) + (1-y_i) log(1 - σ(θ^Tx_i))] + λ Σ_{j=1}^{d} |θ_j|

Challenge: |θ_j| is not differentiable at θ_j = 0.

Subgradient:

∂|θ_j|/∂θ_j = {
    +1  if θ_j > 0
    -1  if θ_j < 0
    [-1, +1]  if θ_j = 0
}

Update Rule (Soft Thresholding):

θ_j := sign(θ_j - α∇_j) × max(0, |θ_j - α∇_j| - αλ)

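Where ∇_j is the j-th component of the gradient of the unregularized loss. This is a proximal gradient step: take the ordinary gradient step, then shrink the result toward zero by αλ and clip it to exactly zero if it would cross. The bias θ₀ is not penalized, so it gets the plain gradient step.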
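
A minimal sketch of this update in plain Python follows. The names soft_threshold and l1_prox_step are illustrative, data_grad is assumed to hold the unregularized gradient, and index 0 is assumed to be the unpenalized bias.

```python
import math

def soft_threshold(v: float, tau: float) -> float:
    """sign(v) * max(0, |v| - tau): shrink toward zero, clipping at zero."""
    return math.copysign(max(0.0, abs(v) - tau), v)

def l1_prox_step(theta, data_grad, alpha, lam):
    """One proximal gradient step for the loss plus lam * sum_{j >= 1} |theta_j|."""
    new_theta = []
    for j, (t, g) in enumerate(zip(theta, data_grad)):
        v = t - alpha * g                     # ordinary gradient step
        new_theta.append(v if j == 0 else soft_threshold(v, alpha * lam))
    return new_theta

# Illustrative numbers only: the threshold is alpha * lam = 0.25.
theta = [1.0, 0.3, -2.0]
data_grad = [0.0, 1.0, -1.0]
print(l1_prox_step(theta, data_grad, alpha=0.1, lam=2.5))
```

In this example the second weight lands inside the threshold after its gradient step and is set to exactly zero, while the third is only shrunk. That exact zeroing is what produces sparse solutions; a plain subgradient step would rarely hit zero exactly.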

Computational Implications of L1:

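Advantages:
- Promotes sparsity: many θ_j become exactly zero, which acts as automatic feature selection
- Effective when many features are irrelevant or redundant
Disadvantages:
- Non-differentiable at zero
- Requires specialized algorithms (proximal gradient, coordinate descent)
- Plain subgradient descent converges more slowly than gradient descent on the smooth L2 objective
Computational Cost: O(d) additional operations per iteration for the thresholding step
Memory: No extra memory is strictly required, though solvers may store the active set of non-zero parameters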

Interviewer: That's a very comprehensive answer. You've covered the derivation from first principles beautifully and addressed the nuances of regularization well. Thank you.

Candidate: Thank you! It was a good exercise to walk through it.

Why This Derivation Matters

Fundamental Understanding: Knowing this derivation solidifies your understanding of how logistic regression learns.
Basis for Other Models: The concepts of likelihood, log-likelihood, cross-entropy, and gradient descent are foundational to many other machine learning models, especially in deep learning.
Debugging & Modification: Understanding the "why" behind the update rule helps in debugging custom implementations or modifying the loss function for specific needs.
Regularization Intuition: Seeing how regularization terms are added to the cost and gradient provides a clear picture of their impact.
Interview Preparedness: This is a classic "ML fundamentals" interview question.