The Hierarchy of Visual Understanding

Core Concepts to Master

  • Granularity of Output: The key difference between the tasks, from a single label for the whole image (classification) to bounding boxes (detection) to pixel-level masks (segmentation).
  • Problem Formulation: How the goal changes from "What is it?" to "What is it and where is it?" to "What is the exact outline of everything?"
  • Model Architecture: How architectures evolve, from a simple CNN classifier to two-stage detectors (like Faster R-CNN) and encoder-decoder structures (like U-Net for segmentation).
  • Evaluation Metrics: The shift from simple accuracy to Intersection over Union (IoU) based metrics like mean Average Precision (mAP) for detection and Mean IoU (mIoU) for segmentation.
  • Class Imbalance: A common and critical problem in all three tasks, but with unique challenges in object detection and segmentation due to background vs. foreground imbalance.

Interview Walkthrough

Interviewer: Let's talk about some core tasks in computer vision. Can you explain the difference between image classification, object detection, and semantic segmentation? Please focus on how their problem formulation and evaluation metrics differ.
Candidate: Of course. These three tasks represent a hierarchy of increasing complexity in understanding the content of an image.

Analogy: Describing a Photo

  • Image Classification answers the question: "What's in this photo?" It assigns a single label to the entire image. (e.g., "This is a photo of a cat.")
  • Object Detection answers: "What objects are in this photo, and where is each one?" It identifies individual objects and draws a bounding box around each. (e.g., "There is a cat at these coordinates and a dog at these other coordinates.")
  • Semantic Segmentation answers: "What is the exact class of every single pixel in this photo?" It creates a pixel-level mask for each object class. (e.g., "All of these pixels are 'cat,' all of these are 'dog,' and all of these are 'grass.'")

Figure: the three tasks side by side.

  • Image Classification: assigns a single label to the entire image (output: "Cat").
  • Object Detection: draws a bounding box around each object and labels it (Cat, Dog).
  • Semantic Segmentation: classifies every single pixel in the image (Cat, Dog, BG for background).
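To make the difference in output granularity concrete, the sketch below builds placeholder outputs for one toy 4x6 image. The class list, scores, box coordinates, and mask regions are all invented for illustration; they are not produced by a real model.

```python
import numpy as np

CLASSES = ["background", "cat", "dog"]  # hypothetical label set
H, W = 4, 6                             # toy image size

# Classification: one probability vector for the whole image.
class_probs = np.array([0.05, 0.90, 0.05])      # shape (num_classes,)

# Detection: one row per object as (x, y, width, height), plus a label and a score.
boxes = np.array([[0, 0, 3, 2],                  # cat
                  [3, 1, 3, 3]])                 # dog
box_labels = np.array([1, 2])
box_scores = np.array([0.88, 0.75])

# Semantic segmentation: one class index per pixel.
mask = np.zeros((H, W), dtype=int)               # start as all background
mask[0:2, 0:3] = 1                               # cat pixels
mask[1:4, 3:6] = 2                               # dog pixels

print(class_probs.shape, boxes.shape, mask.shape)
print(CLASSES[int(class_probs.argmax())])
```

The shapes carry the point: a vector of length `num_classes`, a table with one row per object, and an array as large as the image itself.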

Technical Differences

1. Image Classification

  • Problem Formulation: A single-label or multi-label classification problem. The input is an image; the output is a class probability vector.
  • Model Architecture: Typically a standard CNN (like ResNet) with convolutional and pooling layers, ending in a fully connected layer with a Softmax output (single-label) or a Sigmoid output (multi-label).
  • Evaluation Metric: Standard classification metrics like Accuracy, Precision, Recall, and F1-Score.
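As a rough illustration of the two output layers and the metrics named above, here is a minimal NumPy sketch. The helper names and the toy labels are assumptions made for this example only.

```python
import numpy as np

def softmax(logits):
    """Single-label case: probabilities sum to 1 across classes."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sigmoid(logits):
    """Multi-label case: each class gets an independent probability."""
    return 1.0 / (1.0 + np.exp(-logits))

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 with one class treated as the positive class."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels: 0 = cat, 1 = dog (made-up data).
y_true = np.array([0, 0, 0, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0])

accuracy = np.mean(y_true == y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(accuracy, precision, recall, f1)
```

On this toy data the model is right on 4 of 6 images, yet precision and recall for "dog" are both 0.5, which is why accuracy alone can hide weak performance on a less frequent class.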

2. Object Detection

  • Problem Formulation: A multi-task problem. The model must solve two problems simultaneously for each object: classification ("what is it?") and localization (a regression problem to predict the `x, y, width, height` of a bounding box).
  • Model Architecture: More complex, often with two stages. Architectures like Faster R-CNN have a "region proposal network" to find potential objects, followed by a classifier and regressor for each proposal. Single-stage detectors like YOLO perform both tasks in one pass.
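  • Evaluation Metric: Based on Intersection over Union (IoU), which measures the overlap between a predicted bounding box and a ground-truth box; a prediction counts as a true positive when its IoU meets a chosen threshold. The main metric is mean Average Precision (mAP): the Average Precision (the area under the precision-recall curve) is computed per class and then averaged across classes. PASCAL VOC-style evaluation uses a single IoU threshold of 0.5, while COCO-style evaluation also averages over IoU thresholds from 0.5 to 0.95.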
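The IoU and Average Precision ideas above can be sketched as follows. This is a simplified illustration, not a benchmark implementation: it handles one class in one image, uses greedy matching, and uses a non-interpolated AP, whereas standard evaluators use interpolated precision and aggregate over the whole dataset. The function names and the sample boxes are assumptions.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x, y, width, height)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, iou_thr=0.5):
    """AP for one class. preds: list of (score, box); gts: list of boxes."""
    preds = sorted(preds, key=lambda p: p[0], reverse=True)
    matched, hits = set(), []
    for _, box in preds:
        # Greedily match each prediction to the best still-unmatched ground truth.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            iou = box_iou(box, gt)
            if j not in matched and iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)
            hits.append(1)   # true positive
        else:
            hits.append(0)   # false positive
    if not gts or not hits:
        return 0.0
    hits = np.array(hits)
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision at each rank where a true positive occurs.
    return float(np.sum(precision * hits) / len(gts))

gts = [(0, 0, 10, 10), (20, 20, 10, 10)]
preds = [(0.9, (1, 1, 10, 10)), (0.8, (50, 50, 5, 5)), (0.6, (20, 20, 10, 10))]
ap = average_precision(preds, gts)
print(round(ap, 4))
```

mAP would then be the mean of such AP values over all classes (and, in COCO-style evaluation, over several values of `iou_thr`).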

3. Semantic Segmentation

  • Problem Formulation: A dense, pixel-wise classification problem. For every pixel in the input image, the model must predict which class it belongs to.
  • Model Architecture: Typically an encoder-decoder architecture, like a U-Net. The encoder (a standard CNN) downsamples the image to learn features, and the decoder upsamples it back to the original resolution to generate the pixel-level mask.
  • Evaluation Metric: Also based on IoU. The standard metric is Mean Intersection over Union (mIoU), which is the IoU calculated for each class and then averaged across all classes. Pixel accuracy is also sometimes used but is less robust to class imbalance.
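A small NumPy sketch shows why pixel accuracy is less robust than mIoU under class imbalance. The masks are toy data and the helper names are assumptions; here mIoU is averaged only over classes that appear in the ground truth or the prediction.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = y_true.ravel() * num_classes + y_pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(cm):
    """Per-class IoU = TP / (TP + FP + FN), averaged over classes that appear."""
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    present = union > 0
    return float(np.mean(tp[present] / union[present]))

def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class matches the ground truth."""
    return float(np.diag(cm).sum() / cm.sum())

# Toy 4x4 masks: class 0 = background (dominant), class 1 = a small object.
gt = np.zeros((4, 4), dtype=int)
gt[0, 0:2] = 1                       # 2 object pixels, 14 background pixels
pred = np.zeros((4, 4), dtype=int)   # a model that predicts background everywhere

cm = confusion_matrix(gt, pred, num_classes=2)
print(pixel_accuracy(cm), mean_iou(cm))
```

The all-background prediction scores 0.875 on pixel accuracy but only 0.4375 on mIoU, because the missed object class contributes an IoU of 0 to the average.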
Interviewer: That's a perfect breakdown. You mentioned class imbalance. How would you handle class imbalance in an object detection task where, for example, 'cars' appear thousands of times but 'bicycles' appear only a few dozen times?
Candidate: That's a very common and critical challenge in object detection. The model can become heavily biased towards the frequent class ('cars') and perform poorly on the rare class ('bicycles'). There are two main categories of solutions: data-level and model-level approaches.

1. Data-Level Strategies

The goal here is to re-balance the data the model sees during training.

  • Data Augmentation for the Minority Class: We can artificially increase the number of 'bicycle' examples. This goes beyond simple flips and rotations: we can copy-paste bicycle instances onto different backgrounds or generate synthetic examples with a GAN.
  • Class-Aware Sampling: Instead of sampling images randomly, we can oversample images that contain the rare class ('bicycle') or undersample images that only contain the frequent class ('cars'). This ensures the model sees a more balanced distribution of classes during training.
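Class-aware sampling can be sketched with the standard library alone. The dataset below is hypothetical, and weighting each image by the inverse frequency of its rarest class is just one reasonable choice among several.

```python
import random
from collections import Counter

# Hypothetical dataset: each image is listed with the classes it contains.
images = [
    {"id": 0, "classes": {"car"}},
    {"id": 1, "classes": {"car"}},
    {"id": 2, "classes": {"car"}},
    {"id": 3, "classes": {"car", "bicycle"}},
]

# Count how many images contain each class.
class_counts = Counter(c for img in images for c in img["classes"])

def image_weight(img):
    """Weight an image by the inverse frequency of its rarest class."""
    return max(1.0 / class_counts[c] for c in img["classes"])

weights = [image_weight(img) for img in images]

# Draw one epoch with replacement; images with rare classes are drawn more often.
rng = random.Random(0)
epoch = rng.choices(images, weights=weights, k=1000)
bicycle_share = sum("bicycle" in img["classes"] for img in epoch) / len(epoch)
print(weights, bicycle_share)
```

Only one image in four contains a bicycle, but it is expected to make up roughly 57% of the draws (1.0 out of a total weight of 1.75). In a PyTorch pipeline, per-image weights like these could be passed to `torch.utils.data.WeightedRandomSampler`.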

2. Model-Level (Loss Function) Strategies

The goal here is to modify the model's learning objective to pay more attention to the rare, hard-to-classify objects.

  • Focal Loss: This is probably the best-known loss-level remedy, introduced in the RetinaNet paper to address the extreme foreground-background imbalance in single-stage detectors. Standard Cross-Entropy loss can be overwhelmed by the vast number of easy negatives (e.g., background patches). Focal Loss multiplies the cross-entropy loss by a modulating factor `(1-p_t)^γ`, where `p_t` is the probability the model assigns to the true class. This factor down-weights the loss on well-classified examples, so learning focuses on hard, misclassified examples, which often include the rare class instances.
  • Class-Weighted Loss: We can assign a higher weight to the classification loss for the minority class. This tells the model that making a mistake on a 'bicycle' is much more costly than making a mistake on a 'car', forcing it to pay more attention to getting the rare class right.
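Both loss-level ideas can be illustrated on a handful of probabilities. This is a minimal sketch: it omits the additional balancing factor α that the RetinaNet formulation also uses, the default `gamma=2.0` follows the value reported in that paper, and the probabilities, labels, and class weights are invented.

```python
import numpy as np

def cross_entropy(p_t):
    """Standard cross-entropy; p_t is the probability given to the true class."""
    return -np.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Cross-entropy scaled by the modulating factor (1 - p_t)^gamma."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)

def class_weighted_ce(p_t, labels, class_weights):
    """Cross-entropy with a per-class weight looked up from each label."""
    return class_weights[labels] * cross_entropy(p_t)

# Three examples: easy, medium, and hard (made-up probabilities).
p_t = np.array([0.95, 0.60, 0.10])
labels = np.array([0, 0, 1])               # 0 = car, 1 = bicycle (hypothetical)
class_weights = np.array([1.0, 10.0])      # hypothetical weights

ce = cross_entropy(p_t)
fl = focal_loss(p_t)
wce = class_weighted_ce(p_t, labels, class_weights)
print(fl / ce)   # how much each example's loss is scaled down
```

With γ = 2 the easy example (`p_t = 0.95`) keeps only 0.25% of its cross-entropy loss, while the hard one (`p_t = 0.10`) keeps 81%. Setting `gamma=0` recovers plain cross-entropy.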

In practice, a combination of these techniques is often most effective. For instance, using class-aware sampling at the data level and Focal Loss at the model level can be a very powerful strategy for handling severe class imbalance in object detection.

Why This Comparison Matters in an Interview

  • Demonstrates Foundational CV Knowledge: Understanding this hierarchy is foundational to computer vision. A strong answer shows you know the landscape of possible tasks.
  • Connects Problem to Solution: A strong candidate can clearly map the problem formulation (e.g., "what and where") to the necessary model output (class + bounding box) and the right evaluation metric (mAP).
  • Highlights Practical Challenges: Discussing class imbalance and solutions like Focal Loss shows you've moved beyond textbook problems to real-world complexities.
  • Shows Architectural Awareness: Knowing that segmentation typically uses an encoder-decoder structure (like U-Net) while detection uses models like YOLO or Faster R-CNN demonstrates a deeper understanding of model design.
Pro-Tip: To showcase advanced knowledge, you can introduce Instance Segmentation as the next step in the hierarchy. Explain that it combines the goals of both object detection and semantic segmentation: it produces a pixel-level mask for each object and also distinguishes between different instances of the same class (e.g., "this is cat_1," "this is cat_2"). This is the most granular of the four tasks and is solved by models like Mask R-CNN.
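The distinction between semantic and instance output can be shown with toy arrays. The masks and names below are invented for illustration and do not come from a real model.

```python
import numpy as np

CAT = 1  # hypothetical class index; 0 is background

# Semantic mask: both cats share the same class index.
semantic = np.array([[1, 1, 0, 0, 1],
                     [1, 1, 0, 0, 1],
                     [0, 0, 0, 0, 0]])

# Instance output: one binary mask per object, each with its own class label.
cat_1 = np.zeros_like(semantic, dtype=bool)
cat_1[0:2, 0:2] = True
cat_2 = np.zeros_like(semantic, dtype=bool)
cat_2[0:2, 4] = True

instance_masks = np.stack([cat_1, cat_2])    # shape (num_instances, H, W)
instance_labels = np.array([CAT, CAT])

# Collapsing the instances by class recovers the semantic mask.
collapsed = np.zeros_like(semantic)
for m, label in zip(instance_masks, instance_labels):
    collapsed[m] = label

print(instance_masks.shape, np.array_equal(collapsed, semantic))
```

Going the other way is not possible from the semantic mask alone: it records that pixels are "cat" but not which cat each pixel belongs to.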

What's the Right Task?

For each business problem, choose the computer vision task that best solves it.

Scenario 1: Counting Cars

A city wants to count the number of cars passing through an intersection to analyze traffic volume. They don't need to track individual cars, just get a total count in each frame of a video.

 
Scenario 2: Medical Imaging

A radiologist needs a tool that highlights the exact, pixel-perfect area of a tumor in a medical scan to measure its size and shape precisely.

 
Scenario 3: Evaluation Metric

You have trained an object detector. What is the standard, primary metric you would use to evaluate its performance on a test set?

 

 

Nerchuko Academy · Free DS Interview Prep