Text Representation: From Counts to Context — ML Breadth

Bag of Words, TF-IDF, Word2Vec, BERT

Core Concepts to Master

Vector Space Models: The core idea of representing text as numerical vectors so that mathematical operations can be performed.
Sparsity vs. Density: Understanding the difference between high-dimensional, sparse representations (BoW, TF-IDF) and low-dimensional, dense representations (Word2Vec, BERT).
Semantics & Syntax: Does the representation capture the meaning (semantics) of a word, or just its presence/frequency? Does it understand grammar (syntax)?
Context Independence vs. Dependence: The most crucial modern distinction. Does a word always have the same vector (static), or does its vector change based on the surrounding words (contextual)?
Pre-training & Fine-tuning: The modern paradigm where large models are pre-trained on massive text corpora and then fine-tuned for specific downstream tasks.

Interview Walkthrough

Interviewer: Let's talk about how we represent text for machine learning models. Can you compare Bag of Words, TF-IDF, Word2Vec, and BERT embeddings, and discuss when you'd use each?

Candidate: Of course. These four methods represent a clear evolution in NLP, moving from simple word counts to deep contextual understanding. Each has its place depending on the complexity of the task and available resources.

The Evolution of Text Representation

Bag of Words (BoW): Like a shopping list. It tells you which words are in a document and how many times they appear, but you lose the order and context. "man bites dog" and "dog bites man" get exactly the same vector.
TF-IDF: Like a smarter library index. It still counts words, but it gives more importance to words that are frequent in one document but rare across all other documents, helping to identify key terms.
Word2Vec: Like a dictionary definition. Each word gets a fixed vector that captures its semantic relationships. It knows "king" is similar to "queen" and that "king" - "man" + "woman" is close to "queen."
BERT: Like a context-aware personal assistant. It understands that the meaning of a word changes based on the sentence. The vector for "bank" in "river bank" is different from the vector for "bank" in "bank account."

1. Bag of Words (BoW)

Mechanism: Creates a vector for each document where each dimension corresponds to a unique word in the entire corpus. The value in each dimension is simply the count of that word in the document.
Use Case: Simple text classification, document clustering, or as a baseline when you need a fast and simple representation.
Pros: Simple to understand and implement, very fast.
Cons: Ignores word order and context, results in very high-dimensional and sparse vectors, treats all words equally.
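The mechanism above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming whitespace tokenization and lowercasing; the helper names `build_vocabulary` and `bag_of_words` are hypothetical and not taken from any library.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each unique token in the corpus to a fixed vector index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def bag_of_words(document, vocabulary):
    """Return a count vector with one dimension per vocabulary word."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:  # tokens outside the vocabulary are dropped
            vector[vocabulary[token]] = count
    return vector

corpus = ["man bites dog", "dog bites man", "the dog chased the cat"]
vocab = build_vocabulary(corpus)
vectors = [bag_of_words(doc, vocab) for doc in corpus]

print(vocab)
print(vectors[0] == vectors[1])  # True: word order is lost
```

Even in this three-document corpus most entries are zero. With a realistic vocabulary of tens of thousands of words, almost every dimension of a document vector is zero, which is where the sparsity comes from.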


2. TF-IDF (Term Frequency - Inverse Document Frequency)

Mechanism: It's an improvement on BoW. It calculates a score for each word in a document based on two factors:
- Term Frequency (TF): How often a word appears in the document.
- Inverse Document Frequency (IDF): How rare the word is across all documents. A common form is `log(N/df)`, where N is the total number of documents and df is the number of documents containing the term.
The final score is `TF * IDF`. This down-weights common words (like "the") and up-weights important, topic-specific words.
Use Case: Information retrieval, search engine scoring, text summarization, and a stronger baseline for text classification than BoW.
Pros: Still simple and fast, but provides more informative features than BoW by highlighting important words.
Cons: Still ignores word order and context, and still produces sparse vectors.
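As a rough sketch of the scoring described above, the following computes `TF * IDF` using raw counts for TF and the unsmoothed `log(N/df)` for IDF. The function name `tfidf_vectors` and the toy corpus are assumptions made for illustration; library implementations often use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute TF * IDF per document with raw-count TF and IDF = log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
weights = tfidf_vectors(corpus)

print(weights[0]["the"])                      # 0.0: "the" appears in every document
print(weights[0]["mat"] > weights[0]["cat"])  # True: "mat" is rarer than "cat"
```

Note that "the" occurs twice in the first document yet scores zero, because its IDF is `log(3/3) = 0`. That is the down-weighting of common words in action.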


3. Word2Vec

Mechanism: A predictive, neural network-based model that learns dense word embeddings (vectors). It's trained by predicting a word given its context (CBOW model) or predicting the context given a word (Skip-gram model). Words that appear in similar contexts will have similar vectors.
Use Case: Any task where semantic understanding is important: sentiment analysis, text similarity, as input features for deeper neural networks.
Pros: Captures semantic relationships between words, produces low-dimensional and dense vectors, and pre-trained models are widely available.
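Cons: It is context-independent. The word "bank" has the same single vector regardless of how it's used. It also cannot handle out-of-vocabulary words: a word never seen in training has no vector at all (subword-based variants such as fastText address this).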
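To make the "fixed vector per word" idea concrete, here is a toy sketch of a static embedding table with cosine similarity and the analogy arithmetic mentioned earlier. The 3-dimensional vectors are hand-picked, hypothetical values, not output from a trained Word2Vec model (real embeddings typically have 100 to 300 learned dimensions), and the `analogy` helper is likewise an illustrative assumption.

```python
import math

# Hypothetical 3-dimensional vectors, hand-picked for illustration only.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = (w for w in EMBEDDINGS if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))

print(analogy("king", "man", "woman"))  # queen
print(EMBEDDINGS.get("blockchain"))     # None: out-of-vocabulary word
```

The last line shows the out-of-vocabulary limitation: a pure lookup table has nothing to return for a word it never saw.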


4. BERT

Mechanism: A large, pre-trained Transformer-based model. Unlike Word2Vec, it generates contextual embeddings. It processes the entire sentence at once, using a self-attention mechanism to understand how all words in the sentence relate to each other.
Use Case: A strong default for many language understanding tasks: question answering, sentiment analysis, named entity recognition, and text classification. It can be used as a feature extractor or fine-tuned for a specific task. As an encoder-only model it is not designed for text generation, so tasks like machine translation typically use encoder-decoder or decoder-only Transformers instead.
Pros: Deeply understands context and polysemy (words with multiple meanings), and set state-of-the-art results on a wide range of benchmarks when it was released.
Cons: Computationally expensive and slow to run compared to the other methods. Fine-tuning and high-throughput inference generally call for GPUs, and standard BERT models cap input length at 512 tokens.


Interviewer: That's a perfect summary of the evolution. You mentioned a key difference between Word2Vec and BERT. Could you elaborate on what exactly makes a contextual embedding different from a static embedding?

Candidate: Absolutely. This is the fundamental leap that models like BERT made over previous methods like Word2Vec or GloVe.

Static Embeddings (e.g., Word2Vec, GloVe)

In a static embedding model, there is a fixed, one-to-one mapping between a word in the vocabulary and its vector representation. Think of it like a giant dictionary where each word has a single, unchanging definition (its vector).

The word "bank" will have exactly one vector in the entire model.
This vector is a blend of all the contexts the word appeared in during training. So, the vector for "bank" is a weird average of its financial meaning and its geographical meaning.
When you look up the embedding for "bank" in the sentence "I sat on the river bank," you get the exact same vector as you would for "I went to the bank to deposit money." The model has no way to distinguish between these two uses.

Contextual Embeddings (e.g., BERT, ELMo)

In a contextual model, a word's embedding is generated dynamically based on the entire sentence it appears in. The model does start from a per-token lookup at its input layer, but the representation it outputs is not a fixed dictionary entry.

The word "bank" does not have a single pre-defined vector.
Instead, the BERT model takes the entire sentence as input ("I sat on the river bank"). It uses its self-attention mechanism to analyze the relationships between "bank" and all other words like "river," "sat," and "on."
Based on this context, it generates a vector for "bank" that reflects its geographical meaning. If you then feed it the sentence "I went to the bank to deposit money," it will generate a noticeably different vector for "bank" that reflects its financial meaning.

In short, the key difference is: Static embeddings are a dictionary lookup; contextual embeddings are a function of the entire input sentence. This allows contextual models to handle polysemy and capture a much richer, more nuanced understanding of language, which is why they have revolutionized the field of NLP.
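The "lookup versus function" distinction can be illustrated with a deliberately tiny sketch. It assumes hypothetical 2-dimensional vectors and a stripped-down form of self-attention with no learned weights, no multiple heads or layers, and no positional information, so it is far from what BERT actually computes. It only shows why the output for "bank" depends on its neighbours.

```python
import math

# Hypothetical 2-dimensional static vectors: (finance-ness, nature-ness).
STATIC = {
    "river": [0.0, 1.0],
    "money": [1.0, 0.0],
    "bank":  [0.5, 0.5],
}

def static_embedding(word, sentence):
    """Dictionary lookup: the sentence argument is ignored."""
    return STATIC[word]

def contextual_embedding(word, sentence):
    """Toy attention: a softmax-weighted average of every token vector in the sentence."""
    query = STATIC[word]
    keys = [STATIC[tok] for tok in sentence]
    # Dot-product scores between the target word and each token.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(len(query))]

river_sentence = ["river", "bank"]
money_sentence = ["bank", "money"]

static_river = static_embedding("bank", river_sentence)
static_money = static_embedding("bank", money_sentence)
context_river = contextual_embedding("bank", river_sentence)
context_money = contextual_embedding("bank", money_sentence)

print(static_river == static_money)  # True: same vector in both sentences
print(context_river, context_money)  # the two vectors lean in opposite directions
```

In the river sentence the resulting vector leans toward the nature dimension, and in the money sentence toward the finance dimension, while the static lookup returns the same vector both times.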

Why This Comparison Matters in an Interview

Shows Historical Perspective: Explaining the progression from BoW to BERT demonstrates an understanding of the history and evolution of NLP.
Highlights Core Trade-offs: A strong answer articulates the fundamental trade-off between computational cost/simplicity and semantic richness/performance.
Mastery of Modern Concepts: Clearly explaining the difference between static and contextual embeddings is a litmus test for a candidate's understanding of modern NLP.
Demonstrates Practical Judgment: Knowing when to use a simple baseline like TF-IDF versus when to bring in a heavy-hitter like BERT shows practical project-planning skills.

Pro-Tip: Frame your choice in a project context. For example, "For an initial text classification baseline, I would start with TF-IDF followed by a logistic regression model because it's fast and highly interpretable. If performance is not sufficient, I would then move to using pre-trained Word2Vec embeddings as features. Finally, for maximum performance where computational cost is less of an issue, I would fine-tune a pre-trained BERT-based model on the specific task."

What's the Right Representation?

For each scenario, choose the best text representation method.

Scenario 1: Semantic Search

You need to build a system that finds documents with similar *meanings*, not just matching keywords. For example, a search for "US President" should find documents about "Joe Biden".

Scenario 2: Resource Constraints

You are building a simple text classifier on a laptop with limited CPU and RAM. You need a fast baseline. Which method is the most practical starting point?

Scenario 3: Polysemy

Your task is to analyze financial news. Which model's limitation would cause it to confuse "the bank of a river" with "a financial bank"?