Dealing with Imbalanced Data: Building Fairer Models

Learn Over-sampling and Under-sampling techniques to build fairer ML models.

Imagine you’re building a model to detect a rare disease. Most people in your data are healthy (Class 0), and only a tiny fraction have the disease (Class 1). If you train a model on this data directly, it might become very good at predicting “healthy” simply because that’s the most common case. It might achieve high accuracy but completely fail at identifying the rare, important cases!

This is the problem of imbalanced data, which is very common in real-world scenarios like fraud detection, medical diagnosis, and anomaly detection. When one class (the majority class) vastly outnumbers another (the minority class), standard models often become biased towards the majority.
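To make the accuracy trap concrete, here is a minimal sketch using made-up numbers (990 healthy patients and 10 diseased ones are an assumption for illustration, not real data). A "model" that always predicts the majority class scores very well on accuracy while finding none of the cases that matter.

```python
# Hypothetical screening data: 990 healthy (0) and 10 diseased (1) patients.
y_true = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the minority class: fraction of diseased patients that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2%}")                # accuracy = 99.00%
print(f"minority recall = {minority_recall:.2%}")  # minority recall = 0.00%
```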

Main Technical Concept: Imbalanced data refers to classification datasets where the classes are not represented equally. Standard machine learning algorithms trained on such data tend to be biased towards the majority class, leading to poor performance on the minority class. Techniques like Under-sampling and Over-sampling are used to balance the class distribution before training.

Two Main Resampling Approaches

1. Over-Sampling: Increasing Minority Samples

Random Over-Sampling: Simply duplicate existing samples from the minority class.

Pros: Doesn’t lose information
Cons: Can lead to overfitting (model sees exact copies)
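As a rough illustration of random over-sampling, the sketch below uses only the Python standard library. The helper name random_oversample, the toy data, and the choice to balance the classes exactly are assumptions made for this example; in practice a library implementation would normally be used instead.

```python
import random
from collections import Counter


def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size.

    Hypothetical helper for a two-class problem; not a library function.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_majority = len(y) - len(minority_idx)
    n_extra = n_majority - len(minority_idx)

    # Sample with replacement, so the same row may be copied several times.
    extra_idx = [rng.choice(minority_idx) for _ in range(n_extra)]

    X_res = list(X) + [X[i] for i in extra_idx]
    y_res = list(y) + [y[i] for i in extra_idx]
    return X_res, y_res


# Toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
X = [[float(i), float(i) * 2] for i in range(10)]
y = [0] * 8 + [1] * 2

X_res, y_res = random_oversample(X, y, minority_label=1)
print(Counter(y_res))  # both classes now have 8 rows
```

Every added row is an exact copy of an existing minority row, which is why this approach keeps all the original information but can encourage overfitting.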

SMOTE (Synthetic Minority Over-sampling TEchnique): Create new, synthetic minority class samples instead of just copying.

How it works: For a minority sample, find its k nearest minority-class neighbors, pick one of them at random, then generate a new synthetic sample at a random point on the line segment between the two in feature space.
Pros: Avoids simple duplication, often leads to better generalization
Cons: Can create noisy samples if minority instances are very close to majority instances
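The interpolation idea behind SMOTE can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the reference algorithm or the imbalanced-learn implementation: the function name smote_sketch, the default k=3, and the toy points are all assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a minority sample at random.
        base = rng.choice(minority)
        # 2. Find its k nearest minority neighbours (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # 3. Pick one neighbour and interpolate at a random position
        #    on the line segment between the two points.
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # value in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with four 2-D points (made up for illustration).
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print(point)
```

Because each synthetic point lies between two real minority points, it is new to the model rather than an exact copy. The same property explains the downside: if majority samples sit between those two points, the synthetic sample lands among them.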

2. Under-Sampling: Reducing Majority Samples

NearMiss: Keep only the majority class samples that are “close” (by distance in feature space) to minority class samples, and discard the rest.

Pros: Significantly reduces dataset size, can speed up training
Cons: Risk of losing important information from removed majority samples
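NearMiss comes in several variants. The sketch below is loosely modelled on the idea of keeping the majority points with the smallest average distance to their k nearest minority points. The function name nearmiss_sketch, the default k=3, and the toy coordinates are assumptions for illustration, and the code is not the imbalanced-learn implementation.

```python
import math


def nearmiss_sketch(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points closest, on average, to their k nearest minority points."""

    def mean_dist_to_nearest_minority(point):
        dists = sorted(math.dist(point, m) for m in minority)[:k]
        return sum(dists) / len(dists)

    # Rank majority points from "closest to the minority class" to "farthest".
    ranked = sorted(majority, key=mean_dist_to_nearest_minority)
    return ranked[:n_keep]


# Toy data (made up): 3 minority points near the origin, 6 majority points.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
majority = [[0.5, 0.5], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0], [2.0, 2.0], [9.0, 9.0]]

# Keep as many majority points as there are minority points.
kept = nearmiss_sketch(majority, minority, n_keep=len(minority))
print(kept)  # [[0.5, 0.5], [1.0, 1.0], [2.0, 2.0]]
```

The three far-away majority points are dropped entirely, which shows both the size reduction and the information loss described above.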

Recommendation: Over-sampling (especially SMOTE) is often preferred over under-sampling as it doesn’t discard potentially useful data.

⚠️ GOLDEN RULE: Resample ONLY Training Data After Splitting!

Split your data first, then apply resampling to the TRAINING set only. Never resample the full dataset, and never resample the test set.

Why This Is Critical

If you resample before splitting:

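With Over-sampling: Duplicates of a minority sample, or synthetic samples interpolated from it, can land in both the training and testing sets. The model is then tested on points it has effectively already seen (data leakage), and scores look better than they really are
With Under-sampling: The test set is thinned out along with the rest of the data, so it no longer reflects the real-world class ratio and metrics such as precision become misleading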

Your test set must remain untouched to get honest evaluation of real-world performance.

Correct Workflow:

1. Load original data (X, y)
2. Split into Training (X_train, y_train) and Testing (X_test, y_test)
3. Apply resampling (e.g., SMOTE) ONLY to X_train and y_train
4. Train your model on resampled training data
5. Evaluate on original, untouched test data

Tips for Success

Best Practices:

Resample Training Data Only: Avoid data leakage at all costs
Choose Appropriate Metrics: Don’t just use accuracy. Look at Precision, Recall, F1-Score (especially for minority class), ROC AUC
Try Different Strategies: Experiment with under-sampling, random over-sampling, SMOTE, and combinations
Combine with Other Techniques: Use alongside class weights during model training (for example, class_weight='balanced' in scikit-learn estimators that accept a class_weight parameter)
Consider Cost-Sensitive Learning: If misclassifying the minority class is much more costly, explore cost-sensitive algorithms
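To show what class weighting does, the sketch below computes weights with the formula scikit-learn documents for class_weight='balanced', namely n_samples / (n_classes * class_count). The helper name balanced_class_weights and the 950/50 label split are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical training labels: 950 majority (0) and 50 minority (1).
y_train = [0] * 950 + [1] * 50
weights = balanced_class_weights(y_train)

for label, weight in sorted(weights.items()):
    print(f"class {label}: weight {weight:.3f}")
# class 0: weight 0.526
# class 1: weight 10.000
```

With these weights, an error on a minority sample counts roughly 19 times as much as an error on a majority sample during training, which pushes the model to pay attention to the rare class without changing the data itself.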

Handling Imbalanced Data: Key Takeaways

Imbalanced data (unequal class distribution) can bias standard ML models towards the majority class
Two main resampling approaches:
- Under-sampling: Reduces majority class (Risk: data loss)
- Over-sampling: Increases minority class (Risk: overfitting or noisy samples)
Crucial Rule: Apply resampling techniques ONLY to the training data after splitting
SMOTE is often preferred over random over-sampling for better generalization
Evaluate performance using metrics suitable for imbalanced data (Precision, Recall, F1, AUC)
The imbalanced-learn library (imported as imblearn) provides ready-made resamplers, including RandomOverSampler, SMOTE, and NearMiss