Supervised Learning Fundamentals
Core Concepts to Master
- Problem Type: The crucial difference between Regression (predicting a value) and Classification (predicting a category).
- Model Linearity: Understanding if a model assumes a straight-line relationship or if it can handle complex curves and interactions.
- Interpretability: How easy is it to explain the model's predictions to a non-expert?
- Key Assumptions: The rules a model requires the data to follow in order to work correctly.
- Data Preprocessing: How different models demand different preparation steps, especially for categorical data.
- Overfitting vs. Underfitting: The risk of a model being too simple (underfit) or too complex (overfit), and how to control it.
Interview Walkthrough
- Linear Regression is a "Measuring Tape": It's for predicting a specific, continuous number.
- Logistic Regression is a "Sorting Machine": It's for classifying items into one of two boxes (e.g., Yes/No).
- A Decision Tree is a "Flowchart": It makes a prediction by asking a series of simple questions.
Here’s how I see them visually and conceptually.
1. Linear Regression (Regression)
It finds the best-fitting straight line to describe the relationship between inputs and a numerical output. For example, predicting a house price based on its square footage.
Key Assumption: Linearity. It assumes the underlying relationship is a straight line.
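As a minimal sketch of the "best-fitting straight line" idea, the snippet below fits a one-input least-squares line in plain Python. The square-footage and price figures are made up purely for illustration, and `fit_line` is a hypothetical helper name, not part of any library.

```python
# Illustrative, made-up data: square footage and sale price (in $1000s).
sqft = [1000, 1500, 2000, 2500, 3000]
price = [200, 290, 410, 500, 600]


def fit_line(xs, ys):
    """Ordinary least squares for a single input; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sqft, price)

# Predict the price of a hypothetical 1800 sq ft house from the fitted line.
predicted = slope * 1800 + intercept
print(f"slope={slope:.3f}, intercept={intercept:.1f}, predicted={predicted:.1f}")
```

The slope reads directly as "extra price per extra square foot", which is why the coefficients of a linear model are easy to explain.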
2. Logistic Regression (Classification)
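It predicts the probability of an item belonging to a class. It passes a weighted sum of the inputs through an S-shaped (sigmoid) curve, which maps the result to a value between 0 and 1. Thresholding that probability (commonly at 0.5) produces a linear decision boundary.
Key Assumption: Linearity in the log-odds. It assumes the log-odds of the class are a linear function of the inputs, so the boundary between the groups is a straight line (a flat plane with more inputs). The classes do not need to be perfectly separable; overlap is expected and is exactly what the predicted probabilities describe.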
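To make the sigmoid mapping concrete, here is a small sketch in plain Python. The `intercept` and `weight` values are hypothetical, chosen only so the numbers are easy to follow; in practice they would be learned from data.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical, hand-picked coefficients (not fitted to any data).
intercept = -4.0
weight = 0.5


def predict_proba(x):
    """Probability of the positive class for a single input x."""
    return sigmoid(intercept + weight * x)


def predict_class(x, threshold=0.5):
    """Turn the probability into a Yes/No (1/0) label."""
    return int(predict_proba(x) >= threshold)


for x in (2, 8, 10):
    print(x, round(predict_proba(x), 3), predict_class(x))
```

With these coefficients the linear score is zero at `x = 8`, so that is where the probability crosses 0.5 and the predicted class flips.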
3. Decision Tree (Classification or Regression)
It creates a model that predicts by learning simple decision rules inferred from the data features, like a flowchart that partitions the data.
Key Assumption: No major assumptions about linearity! It can model complex, non-linear relationships. This is its greatest strength.
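The "learning simple decision rules" step can be sketched with the smallest possible tree: a single question (often called a stump). The code below is an illustrative plain-Python version that tries each candidate threshold and keeps the one with the lowest squared error. `best_split` is a hypothetical helper and the data is made up.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors around the mean of a group."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_prediction, right_prediction) for one feature."""
    best = None
    candidates = sorted(set(xs))
    # Try a threshold halfway between each pair of neighbouring values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            best = (error, threshold, mean(left), mean(right))
    return best[1:]


# Made-up, step-shaped data that a straight line would fit poorly.
xs = [1, 2, 3, 4, 5, 6]
ys = [10, 10, 10, 50, 50, 50]

threshold, left_value, right_value = best_split(xs, ys)
print(f"IF x <= {threshold} THEN predict {left_value} ELSE predict {right_value}")
```

A full tree repeats this search inside each resulting group, which is how it builds up a flowchart of questions without assuming any straight-line relationship.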
| Attribute | Linear/Logistic Regression | Decision Tree |
|---|---|---|
| Interpretability | High (Coefficients are easy to explain) | Very High (Flowchart is intuitive) |
| Performance on Non-linear Data | Poor 😞 | Excellent 😀 |
| Data Prep Effort | Medium (Requires encoding; scaling when regularized) | Low (No scaling needed; categorical support depends on the library) |
| Risk of Overfitting | Low (More likely to underfit complex data, but can overfit with many features) | Very High (Can memorize the training data) |
- For Linear & Logistic Regression: You must convert categories into numbers. These models are mathematical equations and can't handle text. The standard method is One-Hot Encoding, where a column like 'Color' becomes several `Is_Red`, `Is_Green` columns with 1s and 0s.
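- For Decision Trees: The algorithm can split on categorical variables directly, creating a rule like `IF Color == 'Red' THEN...`, and it needs no feature scaling. Whether you can skip encoding depends on the implementation, though: some libraries accept categorical columns as-is, while others (scikit-learn's tree estimators, for example) expect numeric input, so categories still have to be encoded first. Even so, the preparation is usually lighter than for linear models.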
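The One-Hot Encoding step described above can be sketched in a few lines of plain Python. `one_hot` is a hypothetical helper written for illustration; real projects would more likely use an encoder from a data library.

```python
def one_hot(values, prefix):
    """Turn a list of category labels into 0/1 indicator columns."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


# A 'Color' column with three distinct categories.
colors = ["Red", "Green", "Red", "Blue"]
columns, rows = one_hot(colors, "Is")

print(columns)
for color, row in zip(colors, rows):
    print(color, row)
```

Each row has exactly one 1, marking which category the original value belonged to.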
- Linear & Logistic Regression are inherently simple models with low complexity (high bias, low variance). Their risk of overfitting is low when there are many more data points than features; they are more likely to underfit if the data has complex patterns. They can still overfit when the number of features is large relative to the number of samples. The main way to control their complexity is regularization, which penalizes large coefficient values: L2 (ridge) shrinks all coefficients toward zero, while L1 (lasso) can push some to exactly zero, effectively dropping those features.
- Decision Trees are the opposite. They are high-complexity models (low bias, high variance) and are extremely prone to overfitting. Left unconstrained, a tree will keep splitting the data until every leaf is perfectly pure, essentially memorizing the training set. To control this, we must use techniques like:
- Pruning: Cutting back branches after the tree is built.
- Setting `max_depth`: Limiting how many "questions" the tree can ask in a row.
- Setting `min_samples_leaf`: Requiring every leaf to contain at least a certain number of training points, so a split that would leave too few points on either side is not made.
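The depth and leaf-size controls listed above can be sketched as follows, assuming scikit-learn and NumPy are installed. The synthetic data, the 20% label noise, and the specific values `max_depth=3` and `min_samples_leaf=10` are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the true rule is "x0 > 0", with 20% of labels flipped as noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(300) < 0.2
y[flip] = 1 - y[flip]

X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

# Unconstrained tree: free to keep splitting until every leaf is pure.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: at most 3 questions in a row, at least 10 points per leaf.
limited = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_train, y_train)

for name, model in (("full", full), ("limited", limited)):
    print(
        name,
        "depth:", model.get_depth(),
        "leaves:", model.get_n_leaves(),
        "train acc:", round(model.score(X_train, y_train), 3),
        "test acc:", round(model.score(X_test, y_test), 3),
    )
```

The unconstrained tree fits the noisy training labels perfectly by growing many leaves, while the constrained tree is forced to stay small. Comparing the train and test accuracy of the two is a quick way to see how much of that perfect fit was memorization.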
This is why single decision trees are often not used in practice, but they form the basis for powerful ensemble methods like Random Forests that specifically address this overfitting problem.
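For the linear-model side of this comparison, the effect of L2 regularization on a coefficient can be sketched in plain Python. With a single input and an unpenalized intercept, the ridge slope has a simple closed form: the usual least-squares numerator divided by a denominator enlarged by the penalty strength. `ridge_slope` and `lam` are hypothetical names, and the data is made up.

```python
def ridge_slope(xs, ys, lam):
    """Slope of a one-input ridge fit; lam=0 gives ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # The penalty lam is added to the denominator, shrinking the slope.
    return sxy / (sxx + lam)


# Made-up data lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

for lam in (0, 10, 90):
    print(f"lam={lam}: slope={ridge_slope(xs, ys, lam):.2f}")
```

As `lam` grows, the slope shrinks toward zero, which is the sense in which regularization limits how much influence a feature can have.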
Why This Comparison Matters in an Interview
- Shows Foundational Strength: A clear answer proves you have mastered the basics, which is a prerequisite for any ML role.
- Demonstrates Critical Thinking: Comparing models isn't just about reciting facts; it's about understanding trade-offs. This shows you can choose the right tool for a given business problem.
- Connects Theory to Practice: Discussing data prep (encoding) and model tuning (overfitting) shows you've moved beyond textbook knowledge to practical application.
- Highlights Communication Skills: Using analogies and visuals proves you can explain complex topics to diverse audiences, a vital skill for collaborating with business stakeholders.
Which Model Fits Best?
For each scenario, choose the most suitable model based on the requirements.
Scenario 1: Feature Interactions
A discount works well for young customers but not old ones. Which model can capture this combined effect automatically?
Scenario 2: Outlier Sensitivity
A single house is mis-priced at $10M. Which model's predictions will be most skewed by this one error?
Scenario 3: Extrapolation
A salary model trained on people with 1-10 years of experience is asked to predict for someone with 30 years. Which might give an absurdly high salary?