Bag of Words, TF-IDF, Word2Vec, BERT
Core Concepts to Master
- Vector Space Models: The core idea of representing text as numerical vectors so that mathematical operations can be performed.
- Sparsity vs. Density: Understanding the difference between high-dimensional, sparse representations (BoW, TF-IDF) and low-dimensional, dense representations (Word2Vec, BERT).
- Semantics & Syntax: Does the representation capture the meaning (semantics) of a word, or just its presence/frequency? Does it understand grammar (syntax)?
- Context Independence vs. Dependence: The most crucial modern distinction. Does a word always have the same vector (static), or does its vector change based on the surrounding words (contextual)?
- Pre-training & Fine-tuning: The modern paradigm where large models are pre-trained on massive text corpora and then fine-tuned for specific downstream tasks.
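To make the vector-space and sparsity ideas concrete, here is a minimal sketch of cosine similarity over sparse vectors stored as dictionaries. The helper name `sparse_cosine`, the toy documents, and the vocabulary size are illustrative assumptions, not part of any library.

```python
import math

def sparse_cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Sparse documents: only the non-zero dimensions are stored.
doc_a = {"dog": 1, "bites": 1, "man": 1}
doc_b = {"dog": 1, "chases": 1, "cat": 1}

sim_ab = sparse_cosine(doc_a, doc_b)  # one shared term out of three each
sim_aa = sparse_cosine(doc_a, doc_a)  # a vector compared with itself

# With a hypothetical 50,000-word vocabulary, almost every dimension is zero.
vocab_size = 50_000
fraction_zero = 1 - len(doc_a) / vocab_size
```

A dense embedding, by contrast, would store a few hundred numbers per word or document, nearly all of them non-zero.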
Interview Walkthrough
The Evolution of Text Representation
- Bag of Words (BoW): Like a shopping list. It tells you which words are in a document and how many times each appears, but it discards order and context, so "man bites dog" and "dog bites man" get identical vectors.
- TF-IDF: Like a smarter library index. It still counts words, but it gives more importance to words that are frequent in one document but rare across all other documents, helping to identify key terms.
- Word2Vec: Like a dictionary definition. Each word gets a fixed vector that captures its semantic relationships. It knows "king" is similar to "queen" and that "king" - "man" + "woman" is close to "queen."
- BERT: Like a context-aware personal assistant. It understands that the meaning of a word changes based on the sentence. The vector for "bank" in "river bank" is different from the vector for "bank" in "bank account."
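The "king" - "man" + "woman" arithmetic can be sketched with hand-made toy vectors. The two dimensions (royalty, gender) and every value below are invented for illustration; real Word2Vec vectors have hundreds of learned dimensions with no such clean interpretation.

```python
import math

# Hand-made 2-d toy vectors: (royalty, gender), with gender +1 = male, -1 = female.
toy_vectors = {
    "king":   [1.0,  1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0,  1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8,  1.0],
}

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded, as is conventional for analogy tests.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman", toy_vectors)
```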
1. Bag of Words (BoW)
- Mechanism: Creates a vector for each document where each dimension corresponds to a unique word in the entire corpus. The value in each dimension is simply the count of that word in the document.
- Use Case: Simple text classification, document clustering, or as a baseline when you need a fast and simple representation.
- Pros: Simple to understand and implement, very fast.
- Cons: Ignores word order and context, produces very high-dimensional and sparse vectors, and treats all words as equally important, so frequent function words like "the" can dominate the counts.
Bag of Words
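A minimal sketch of the BoW mechanism, assuming lowercase whitespace tokenization (a simplification; real pipelines usually handle punctuation and stop words). The function names are illustrative.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each unique token in the corpus to a fixed column index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def bow_vector(document, vocab):
    """Count-based vector with one dimension per vocabulary word."""
    vector = [0] * len(vocab)
    for tok, count in Counter(document.lower().split()).items():
        if tok in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[tok]] = count
    return vector

corpus = ["man bites dog", "dog bites man", "the dog sleeps"]
vocab = build_vocabulary(corpus)
vectors = [bow_vector(doc, vocab) for doc in corpus]

# Word order is lost: the first two documents map to the same vector.
same_vector = vectors[0] == vectors[1]
```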
2. TF-IDF (Term Frequency - Inverse Document Frequency)
- Mechanism: It's an improvement on BoW. It calculates a score for each word in a document based on two factors:
- Term Frequency (TF): How often a word appears in the document.
- Inverse Document Frequency (IDF): How rare the word is across all documents, computed as `log(N/df)`, where N is the total number of documents and df is the number of documents containing the term. The final score for a word is the product TF × IDF.
- Use Case: Information retrieval, search engine scoring, text summarization, and a stronger baseline for text classification than BoW.
- Pros: Still simple and fast, but provides more informative features than BoW by highlighting important words.
- Cons: Still ignores word order and context, and still produces sparse vectors.
TF-IDF
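The TF × IDF computation can be sketched in a few lines. This version assumes TF is the term count divided by document length and uses the unsmoothed `log(N/df)` from above; library implementations often apply smoothing or normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

corpus = ["the dog bites the man", "the cat sleeps", "the market fell"]
scores = tf_idf(corpus)

# "the" appears in every document, so its IDF is log(3/3) = 0.
the_score = scores[0]["the"]
# "dog" appears in only one document, so it gets a positive score.
dog_score = scores[0]["dog"]
```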
3. Word2Vec
- Mechanism: A predictive, neural network-based model that learns dense word embeddings (vectors). It's trained by predicting a word given its context (CBOW model) or predicting the context given a word (Skip-gram model). Words that appear in similar contexts will have similar vectors.
- Use Case: Any task where semantic understanding is important: sentiment analysis, text similarity, as input features for deeper neural networks.
- Pros: Captures semantic relationships between words, produces low-dimensional and dense vectors, and pre-trained models are widely available.
- Cons: It is context-independent. The word "bank" has the same single vector regardless of how it's used. It also cannot handle out-of-vocabulary words: a word never seen during training has no vector at all.
Word2Vec (Static Embeddings)
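The difference between the two training objectives shows up in how training examples are built from a sliding window. The sketch below only generates the examples; it does not train a network, and the function names and window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (center, context) pair asks the model to predict context from center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: each (context_words, target) example asks the model to predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples

tokens = "the quick brown fox".split()
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)
```

With `window=1`, Skip-gram yields pairs such as `("quick", "the")` and `("quick", "brown")`, while CBOW yields `(["the", "brown"], "quick")` for the same position.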
4. BERT
- Mechanism: A large, pre-trained Transformer-based model. Unlike Word2Vec, it generates contextual embeddings. It processes the entire sentence at once, using a self-attention mechanism so that each word's representation draws on the words to both its left and its right.
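- Use Case: A strong choice for language-understanding tasks such as question answering, sentiment analysis, and named entity recognition. Because BERT is an encoder-only model, it is not a natural fit on its own for generation tasks such as machine translation, which typically use encoder-decoder or decoder-only models. It can be used as a powerful feature extractor or fine-tuned for a specific task.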
- Pros: Captures context and polysemy (words with multiple meanings), and achieved state-of-the-art results on a wide range of benchmarks when it was released.
- Cons: Computationally expensive and slow compared with the other methods. Fine-tuning and high-throughput inference generally call for GPUs, although small-scale inference can run on a CPU.
BERT (Contextual Embeddings)
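The self-attention step can be sketched in plain Python. This is a heavily simplified, single-head version that assumes queries, keys, and values are all the raw input vectors; BERT itself uses learned projection matrices, multiple heads, positional information, and many stacked layers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    outputs = []
    for q in X:
        # Score this token against every token in the sequence, itself included.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # The output is a weighted mix of all token vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return outputs

# The same first token, placed in two different "sentences".
same_context = self_attention([[1.0, 0.0], [1.0, 0.0]])
mixed_context = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

The first token starts as `[1.0, 0.0]` in both inputs, but its output differs because the second token differs. That dependence on neighbors is what makes the representation contextual.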
Static Embeddings (e.g., Word2Vec, GloVe)
In a static embedding model, there is a fixed, one-to-one mapping between a word in the vocabulary and its vector representation. Think of it like a giant dictionary where each word has a single, unchanging definition (its vector).
- The word "bank" will have exactly one vector in the entire model.
- This vector is a blend of all the contexts the word appeared in during training. So the vector for "bank" ends up as a mixture of its financial meaning and its geographical meaning, weighted toward whichever sense was more common in the training corpus.
- When you look up the embedding for "bank" in the sentence "I sat on the river bank," you get the exact same vector as you would for "I went to the bank to deposit money." The model has no way to distinguish between these two uses.
Contextual Embeddings (e.g., BERT, ELMo)
In a contextual model, a word's embedding is generated dynamically based on the entire sentence it appears in. There is no fixed dictionary lookup.
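- The word "bank" does not have a single pre-defined output vector. The model starts from a fixed input embedding for each token, but the representation it produces depends on the rest of the sentence.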
- Instead, the BERT model takes the entire sentence as input ("I sat on the river bank"). It uses its self-attention mechanism to analyze the relationships between "bank" and all other words like "river," "sat," and "on."
- Based on this context, it generates a vector for "bank" that reflects its geographical meaning. If you then feed it the sentence "I went to the bank to deposit money," it will generate a noticeably different vector for "bank" that reflects its financial meaning.
In short, the key difference is: Static embeddings are a dictionary lookup; contextual embeddings are a function of the entire input sentence. This allows contextual models to handle polysemy and capture a much richer, more nuanced understanding of language, which is why they have revolutionized the field of NLP.
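The lookup-versus-function distinction can be sketched as follows. The 2-d table (finance, nature), its values, and the neighbor-averaging rule are toy assumptions chosen for illustration; the `contextual_embedding` function is a stand-in for the idea, not how BERT computes its vectors.

```python
# Toy static table with dimensions (finance, nature). All values are invented.
STATIC_TABLE = {
    "bank":    [0.5, 0.5],
    "river":   [0.0, 1.0],
    "deposit": [0.9, 0.1],
    "money":   [1.0, 0.0],
}

def static_embedding(tokens, index):
    """Dictionary lookup: the surrounding tokens are ignored entirely."""
    return list(STATIC_TABLE[tokens[index]])

def contextual_embedding(tokens, index):
    """Toy contextual function: mix a word's vector with the mean of its known neighbors."""
    own = STATIC_TABLE[tokens[index]]
    neighbors = [STATIC_TABLE[t] for i, t in enumerate(tokens)
                 if i != index and t in STATIC_TABLE]
    if not neighbors:
        return list(own)
    mean = [sum(col) / len(neighbors) for col in zip(*neighbors)]
    return [0.5 * o + 0.5 * m for o, m in zip(own, mean)]

river_sentence = "i sat on the river bank".split()
money_sentence = "i went to the bank to deposit money".split()
i_river = river_sentence.index("bank")
i_money = money_sentence.index("bank")

static_river = static_embedding(river_sentence, i_river)
static_money = static_embedding(money_sentence, i_money)
ctx_river = contextual_embedding(river_sentence, i_river)
ctx_money = contextual_embedding(money_sentence, i_money)
```

Here `static_river` and `static_money` are identical, while `ctx_river` leans toward the nature dimension and `ctx_money` toward the finance dimension.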
Why This Comparison Matters in an Interview
- Shows Historical Perspective: Explaining the progression from BoW to BERT demonstrates an understanding of the history and evolution of NLP.
- Highlights Core Trade-offs: A strong answer articulates the fundamental trade-off between computational cost/simplicity and semantic richness/performance.
- Mastery of Modern Concepts: Clearly explaining the difference between static and contextual embeddings is a litmus test for a candidate's understanding of modern NLP.
- Demonstrates Practical Judgment: Knowing when to use a simple baseline like TF-IDF versus when to bring in a heavy-hitter like BERT shows practical project-planning skills.
What's the Right Representation?
For each scenario, choose the best text representation method.
Scenario 1: Semantic Search
You need to build a system that finds documents with similar *meanings*, not just matching keywords. For example, a search for "US President" should find documents about "Joe Biden".
Scenario 2: Resource Constraints
You are building a simple text classifier on a laptop with limited CPU and RAM. You need a fast baseline. Which method is the most practical starting point?
Scenario 3: Polysemy
Your task is to analyze financial news. Which model's limitation would cause it to confuse "the bank of a river" with "a financial bank"?