Model Selection: AIC, BIC & Cross-Validation — ML Depth

Model Selection Criteria

Core Concepts for AIC/BIC & CV

Model Selection: Choosing the best model from a set of candidate models.
Likelihood L: Probability (or probability density) of the observed data under the model, viewed as a function of the model parameters.
Log-Likelihood LL: log(L).
Number of Parameters k: A measure of model complexity.
Number of Samples n: Size of the dataset.
Akaike Information Criterion (AIC): Based on information theory, estimates prediction error and penalizes model complexity.
- AIC = 2k - 2LL
Bayesian Information Criterion (BIC): Derived from a Bayesian perspective, also penalizes complexity, and more harshly than AIC once n ≥ 8.
- BIC = k log(n) - 2LL
Kullback-Leibler (KL) Divergence: Measures information loss when approximating a true distribution with a model. AIC is related to this.
Cross-Validation (CV): Resampling method to estimate model performance on unseen data.
- K-Fold CV, Leave-One-Out CV (LOOCV).
- Estimates out-of-sample error.
Theoretical Properties: Consistency (BIC), Asymptotic Efficiency (AIC).

Model Selection Criteria Explained

Interviewer: Let's discuss model selection. Can you derive or at least explain the mathematical basis for the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC)? Explain how they balance model fit and complexity, and when you might prefer one over the other.

Candidate: Certainly. AIC and BIC are both information criteria used for model selection. They help us choose among a finite set of models by estimating the quality of each model relative to the others, balancing goodness of fit with model complexity.

Let L be the maximized value of the likelihood function for a model, k be the number of estimated parameters in the model, and n be the number of data points.

Akaike Information Criterion (AIC)

Conceptual Derivation/Basis:

AIC is derived from information theory. It aims to estimate the Kullback-Leibler (KL) divergence between the true underlying data generating process and the fitted model. The KL divergence measures the information lost when the model is used to approximate the true process. AIC provides an asymptotically unbiased estimator of this expected relative KL divergence, up to an additive constant.

A simplified way to think about its derivation involves:

Starting with the idea that we want to maximize the expected log-likelihood on future data (a measure of predictive accuracy).
The in-sample log-likelihood (log L) is a biased (overly optimistic) estimate of this.
Akaike showed that, asymptotically, the bias is approximately equal to k, the number of parameters.
So, an approximately unbiased estimator of the expected out-of-sample log-likelihood is (log L - k).
AIC is then defined by multiplying this estimator by -2 (a historical convention that puts it on a loss scale like deviance):

AIC = 2k - 2 log(L)

Alternatively, using deviance D = -2 log(L):

AIC = D + 2k
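The bias-correction steps above can be written compactly. This is only a sketch under the usual regularity conditions (large n, candidate model close to the true process). The symbols \hat{\theta}, y and y' are notation introduced here for illustration and do not appear elsewhere in this discussion.

```latex
% y: observed data; y': an independent replicate from the same process
% \hat{\theta}(y): maximum-likelihood estimate fitted on y
% \log L = \log p(y \mid \hat{\theta}(y)): maximized in-sample log-likelihood
\begin{align*}
\mathbb{E}_{y}\,\mathbb{E}_{y'}\bigl[\log p\bigl(y' \mid \hat{\theta}(y)\bigr)\bigr]
  &\approx \mathbb{E}_{y}\bigl[\log L\bigr] - k
  && \text{(in-sample optimism is about } k\text{)} \\
\mathrm{AIC} &= -2\,(\log L - k) = 2k - 2\log L
\end{align*}
```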

Balancing Fit and Complexity:

-2 log(L) (or D): This term represents the goodness of fit. A smaller value (larger log-likelihood) indicates a better fit to the training data.
2k: This term is the penalty for model complexity. It increases as the number of parameters k increases.

AIC aims to find a model that fits the data well (low -2 log(L)) without being overly complex (low 2k). We select the model with the lowest AIC value.
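As a rough illustration of the fit-versus-penalty trade-off, the sketch below computes AIC for two candidate regression models under an assumed Gaussian noise model. The data, the helper names (`gaussian_log_likelihood`, `fit_line`) and the choice to count the noise variance as a parameter are assumptions made for this example, not something fixed by the discussion above.

```python
import math
import random

def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood, using the MLE variance RSS / n."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)

def aic(log_likelihood, k):
    """AIC = 2k - 2 log(L); lower is better."""
    return 2.0 * k - 2.0 * log_likelihood

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical data: a linear trend plus Gaussian noise.
random.seed(0)
xs = [i / 50.0 for i in range(100)]
ys = [1.0 + 2.0 * x + random.gauss(0.0, 0.5) for x in xs]

# Candidate 1: intercept only (k = 2: mean and noise variance).
mean_y = sum(ys) / len(ys)
aic_constant = aic(gaussian_log_likelihood([y - mean_y for y in ys]), k=2)

# Candidate 2: straight line (k = 3: intercept, slope, noise variance).
a, b = fit_line(xs, ys)
line_residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
aic_line = aic(gaussian_log_likelihood(line_residuals), k=3)

print(f"AIC constant: {aic_constant:.1f}, AIC line: {aic_line:.1f}")
```

Here the extra slope parameter costs 2 AIC units but improves the fit by far more, so the line has the lower AIC. With a weaker trend or fewer points the comparison could go the other way.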

Bayesian Information Criterion (BIC)

Conceptual Derivation/Basis:

BIC (also known as the Schwarz criterion) is derived from a Bayesian perspective. It aims to find the model that is most probable given the data, assuming a prior over the candidate models. For large n, BIC approximates -2 times the log of the marginal likelihood of the data given the model, -2 log P(D|M), up to terms that do not grow with n.

Using a Laplace approximation for the marginal likelihood, under certain assumptions (e.g., unit information prior), BIC can be derived as:

BIC = k log(n) - 2 log(L)

Where n is the number of data points.
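The Laplace step can be sketched as follows. This is a heuristic outline rather than a full proof. The symbols \theta, \hat{\theta}, H and \hat{I} (the average per-observation information) are notation introduced here. Terms that stay bounded as n grows are collected into O(1) and dropped.

```latex
% \theta \in \mathbb{R}^k; \hat{\theta}: MLE; L = P(D \mid \hat{\theta}, M)
% H \approx n\,\hat{I}: negative Hessian of the log joint density at \hat{\theta}
\begin{align*}
\log P(D \mid M)
  &= \log \int P(D \mid \theta, M)\, p(\theta \mid M)\, d\theta \\
  &\approx \log L + \log p(\hat{\theta} \mid M)
     + \tfrac{k}{2}\log(2\pi) - \tfrac{1}{2}\log\lvert H \rvert \\
  &= \log L - \tfrac{k}{2}\log n + O(1)
  && \text{since } \lvert H \rvert \approx n^{k}\,\lvert \hat{I} \rvert \\
-2 \log P(D \mid M) &\approx k \log n - 2 \log L = \mathrm{BIC}
\end{align*}
```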

Balancing Fit and Complexity:

-2 log(L): Same as AIC, represents goodness of fit.
k log(n): This is the penalty for model complexity. Compared to AIC's penalty (2k), BIC's penalty k log(n) depends on the sample size n.

BIC also aims to find a model that fits well but is not too complex. We select the model with the lowest BIC value.
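To make the size of the k log(n) penalty concrete, the sketch below scores three candidates with both criteria. The candidate names, parameter counts, log-likelihoods and n = 500 are made-up numbers chosen purely for illustration. They are not results from any real fit.

```python
import math

def aic(log_likelihood, k):
    """AIC = 2k - 2 log(L)."""
    return 2.0 * k - 2.0 * log_likelihood

def bic(log_likelihood, k, n):
    """BIC = k log(n) - 2 log(L), with log the natural logarithm."""
    return k * math.log(n) - 2.0 * log_likelihood

# Hypothetical candidates: name -> (number of parameters k, maximized log-likelihood).
n = 500
candidates = {
    "small": (3, -520.0),
    "medium": (6, -515.0),
    "large": (12, -511.0),
}

aic_scores = {name: aic(ll, k) for name, (k, ll) in candidates.items()}
bic_scores = {name: bic(ll, k, n) for name, (k, ll) in candidates.items()}

best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
print(best_by_aic, best_by_bic)  # prints: medium small
```

With these illustrative numbers, the 5-unit log-likelihood gain of "medium" over "small" is enough to pay AIC's penalty for three extra parameters (6 units) but not BIC's (about 18.6 units at n = 500).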

When to Prefer AIC vs. BIC

Penalty Term:
- AIC penalty: 2k
- BIC penalty: k log(n)
- For n ≥ 8 (since ln(8) ≈ 2.08 > 2, with log denoting the natural logarithm), the BIC penalty for complexity (k log(n)) is harsher than AIC's penalty (2k).
Model Selection Goal:
- AIC: Aims for predictive accuracy. It tends to select models that might be slightly more complex if they improve prediction on new data. It's asymptotically efficient, meaning it will select the model that minimizes the mean squared error of prediction as n → ∞, assuming the true model is infinitely complex or not in the candidate set.
- BIC: Aims for consistency. It tends to select the "true" model (if it's among the candidates) as the sample size n grows large. It often prefers simpler models than AIC, especially for larger datasets.
Practical Preference:
- If the goal is prediction, and you believe the true model is complex or not among your candidates, AIC might be preferred as it's less likely to underfit.
- If the goal is explanation or finding a parsimonious model that is likely the true data generating process (and you believe the true model is relatively simple and in your candidate set), BIC might be preferred due to its stronger penalty on complexity and its consistency property.
- For smaller datasets (n < 8), AIC actually has a stronger penalty than BIC, but this is rarely the case in ML contexts.
- If results from AIC and BIC differ, it suggests uncertainty about the appropriate model complexity. BIC's choice will be the simpler of the two whenever n ≥ 8, which can be more interpretable and less prone to overfitting, while AIC's choice may predict better if the extra parameters capture real structure.
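The penalty comparison above can be checked numerically. The sketch below finds the sample size at which BIC becomes the harsher criterion and shows how much log-likelihood one extra parameter must gain to lower each criterion: more than 1 for AIC, more than log(n)/2 for BIC. The function names are hypothetical helpers written for this example.

```python
import math

def aic_penalty(k):
    """AIC complexity penalty: 2k."""
    return 2.0 * k

def bic_penalty(k, n):
    """BIC complexity penalty: k log(n)."""
    return k * math.log(n)

# Smallest integer sample size at which BIC penalizes harder than AIC.
crossover_n = next(n for n in range(1, 100) if math.log(n) > 2.0)

def min_loglik_gain_per_parameter(n):
    """Log-likelihood gain one extra parameter must exceed to lower each criterion."""
    return {"aic": 1.0, "bic": 0.5 * math.log(n)}

print("crossover n:", crossover_n)
for n in (5, 8, 100, 10_000, 1_000_000):
    gains = min_loglik_gain_per_parameter(n)
    print(f"n={n:>9}: AIC needs > {gains['aic']:.2f}, BIC needs > {gains['bic']:.2f}")
```

The BIC threshold grows without bound (slowly) as n increases, while AIC's stays fixed at 1. That difference is why BIC tends toward simpler models on large datasets.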

Interviewer: That's a very good explanation of AIC and BIC. Now, how does cross-validation provide an alternative approach to model selection, and what are its theoretical properties or guarantees, if any?

Candidate:

Cross-Validation (CV) for Model Selection

Cross-validation is a resampling technique used to estimate how well a model will generalize to an independent dataset. It provides a more direct estimate of out-of-sample prediction error than AIC/BIC, which rely on in-sample fit plus a complexity penalty.

Mechanism (e.g., K-Fold Cross-Validation):

The training dataset is randomly partitioned into K equal-sized (or nearly equal-sized) subsamples or "folds".
Of the K folds, one fold is retained as the validation data for testing the model, and the remaining K-1 folds are used as training data.
The model is fit on the K-1 training folds and then evaluated on the held-out validation fold (e.g., by calculating MSE, accuracy, etc.).
This process is repeated K times, with each of the K folds used exactly once as the validation data.
The K results from the folds can then be averaged (or otherwise combined) to produce a single estimate of the model's performance.

Model Selection with CV: To select among different models (or different hyperparameters for the same model type), you would perform this K-fold CV procedure for each candidate model. The model that yields the best average performance metric on the validation folds is chosen as the preferred model.
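The K-fold procedure and the selection step can be sketched in plain Python. This is a minimal illustration, assuming squared-error loss and two simple hypothetical candidates (a constant model and a least-squares line). The data and all function names are invented for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k folds whose sizes differ by at most one."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_constant(xs, ys):
    """Candidate 1: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Candidate 2: ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def cross_val_mse(fit, xs, ys, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(xs)) if i not in held_out_set]
        # Fit on the K-1 training folds only, then score on the held-out fold.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - predict(xs[i])) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

# Hypothetical data: a linear trend plus Gaussian noise.
rng = random.Random(1)
xs = [i / 50.0 for i in range(100)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

candidates = {"constant": fit_constant, "line": fit_line}
scores = {name: cross_val_mse(fit, xs, ys, k=5) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

The same seed is used for every candidate so that all models are scored on identical folds, which keeps the comparison paired.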

Theoretical Properties and Guarantees (or lack thereof)

Estimate of Prediction Error: CV aims to provide an approximately unbiased estimate of the true prediction error on unseen data.
- Leave-One-Out CV (LOOCV), where K=n, is almost unbiased for the true prediction error because each model is trained on n-1 points. However, it can have high variance (the estimate can change a lot from one sampled dataset to another), and it requires n model fits, which is computationally very expensive for large n.
- K-Fold CV (e.g., K=5 or 10) typically has a slight pessimistic bias (because each model is trained on a fraction (K-1)/K of the data, which is less than the full dataset), but often has lower variance than LOOCV, making it a more stable estimate. It's a good compromise.
No Guarantee of Finding the "True" Model: Unlike BIC which is consistent (selects the true model if it's in the set, as n → ∞), CV is generally focused on predictive performance. It will select the model that is expected to predict best on new data, which may not be the simplest or "true" underlying model, especially if the true model is very complex or the data is noisy.
Asymptotic Efficiency (Similar to AIC in some senses): For certain loss functions and under some conditions, leave-one-out CV is asymptotically equivalent to AIC in terms of model selection behavior, meaning the two tend to select the same models as n gets large. K-fold CV with a fixed K is not exactly equivalent, but it often behaves similarly in practice, favoring slightly more complex models for better predictive power if the true model is complex.
Dependence on K: The choice of K in K-fold CV matters.
- Small K (e.g., 2 or 3): Higher bias in error estimate, lower variance.
- Large K (closer to n, like LOOCV): Lower bias, higher variance, computationally expensive.
- K=5 or K=10 are common choices offering a good balance.
No Closed-Form Penalty: Unlike AIC/BIC, CV doesn't have an explicit penalty term for model complexity. The "penalty" is implicit: overly complex models will overfit the K-1 training folds and perform poorly on the held-out validation fold.
Robustness: CV is generally considered more robust and direct for estimating predictive performance than AIC/BIC, especially when the assumptions underlying AIC/BIC (e.g., models being close to the true model, large sample sizes for asymptotic properties to hold) are violated.
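The dependence on K comes down to simple arithmetic: each fit sees roughly a fraction (K-1)/K of the data, and K fits are needed. The sketch below tabulates this for a hypothetical dataset of n = 1000 points. The function name `k_fold_summary` is an illustrative helper, not part of any library.

```python
def k_fold_summary(n, k):
    """Return (number of model fits, approximate training size per fit, training fraction)."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    train_size = (n * (k - 1)) // k  # approximate when n is not divisible by k
    return k, train_size, (k - 1) / k

n = 1000  # hypothetical dataset size
for k in (2, 5, 10, n):
    fits, train_size, fraction = k_fold_summary(n, k)
    label = "LOOCV" if k == n else f"{k}-fold"
    print(f"{label:>8}: {fits:>4} fits, ~{train_size} training points each ({fraction:.1%})")
```

Small K trains on noticeably less data than the final model will see, which is the source of the pessimistic bias. Large K removes most of that gap at the cost of many more fits.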

In practice, CV is often preferred for model selection when computational resources allow, as it directly estimates generalization performance. AIC and BIC are useful when CV is too expensive or as quick guides, especially when comparing models with different numbers of parameters based on their likelihoods.

Interviewer: That's a very thorough comparison, highlighting the strengths and contexts for each approach. Well explained!

Candidate: Thank you.

Why AIC, BIC & CV Matter

Principled Model Selection: Provide systematic ways to choose among competing models beyond just looking at training error.
Balancing Fit and Complexity: AIC and BIC explicitly penalize model complexity, helping to avoid overfitting. CV does this implicitly.
Estimating Generalization: Cross-validation directly estimates how well a model will perform on unseen data.
Theoretical Underpinnings: AIC is rooted in information theory and KL divergence. BIC has a Bayesian foundation. Understanding these provides deeper insight.
Practical Tools: Widely used in statistical modeling and machine learning for comparing models and tuning hyperparameters.
Understanding Tradeoffs: Knowing when to use AIC, BIC, or CV depends on the goals (prediction vs. explanation), dataset size, and computational budget.