Advanced Prompting: Chain-of-Thought, Self-Consistency, Tree of Thoughts

This is Part 9 of the AI Agents series. Part 8 covered Zero-Shot and Few-Shot prompting — the foundation. This post covers what to reach for when those techniques aren’t enough: Chain-of-Thought, Self-Consistency, and Tree of Thoughts.

1. Why LLMs need to “think” before answering

LLMs generate text one token at a time, each time predicting a likely next token given the context so far. When you ask a direct question, the model starts producing its answer immediately, with no separate step for reasoning through the problem.

For simple factual questions, that’s fine. For logic puzzles, multi-step math, or ambiguous problems, it’s not. The model commits to an answer before it has fully worked through the problem.

The three techniques in this post all share the same underlying principle: give the model space to reason before committing to a final answer. The approaches differ in how many reasoning paths are explored and whether they’re sequential or parallel.

2. Chain-of-Thought (CoT) prompting

Chain-of-Thought instructs the model to show its reasoning step by step rather than jumping straight to an answer.

The problem it solves:

Prompt: A farmer has 15 cows, all but 8 died. How many cows are there?

Without CoT, many models answer 7 — they subtract 8 from 15. The correct answer is 8 (eight survived; “all but 8 died” means 8 are still alive). The model gets this wrong because it pattern-matches to subtraction without parsing the sentence carefully.

With CoT, the model works through the language first and gets it right.

Zero-Shot CoT

The simplest form: append “Let’s think step by step” to your prompt.

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

prompt = """A farmer has 15 cows, all but 8 died. How many cows are there?

Let's think step by step."""

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": prompt}]
)

print(response.choices[0].message.content)

That single phrase prompts the model to write out its intermediate reasoning before answering. The model parses “all but 8 died,” identifies that 8 survived, and arrives at the correct answer.
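Because the trigger is just text appended to the question, it can be factored into a small helper, which makes it easy to try different phrasings against the same question. This is a minimal sketch; `add_cot_trigger` is an illustrative name, not part of the Groq SDK.

```python
# Illustrative helper (hypothetical name): append a zero-shot CoT trigger to a question.
DEFAULT_TRIGGER = "Let's think step by step."


def add_cot_trigger(question: str, trigger: str = DEFAULT_TRIGGER) -> str:
    """Return the question, a blank line, then the trigger phrase."""
    return f"{question.strip()}\n\n{trigger}"


question = "A farmer has 15 cows, all but 8 died. How many cows are there?"
print(add_cot_trigger(question))
```

The returned string can be passed as the user message content exactly as in the snippet above. Any other trigger phrase can be supplied through the `trigger` argument.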

Other zero-shot CoT triggers that work similarly:

"Think through this carefully before answering."
"Work through this step by step, then give your final answer."
"Reason through the problem first, then state your conclusion."

Few-Shot CoT

Few-Shot CoT provides an example of the reasoning process itself — not just an input/output pair, but a worked example that shows how to think through the problem.

prompt = """Solve logic problems step by step.

Problem: I have 3 shirts and 2 pairs of pants. How many distinct outfits can I make?
Reasoning: Each shirt can be paired with any pair of pants. So outfit count = shirts × pants = 3 × 2 = 6.
Answer: 6 outfits.

Problem: A store sells apples for $1.20 each and oranges for $0.80 each. I bought 4 apples and 3 oranges. How much did I spend?
Reasoning: Apples cost 4 × $1.20 = $4.80. Oranges cost 3 × $0.80 = $2.40. Total = $4.80 + $2.40 = $7.20.
Answer: $7.20.

Problem: A train leaves at 9:00 AM traveling at 60 mph. Another train leaves the same station at 11:00 AM traveling in the same direction at 90 mph. When does the second train catch up?
Reasoning:"""

# Reuses the `client` created in the Zero-Shot CoT example above
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": prompt}]
)

print(response.choices[0].message.content)

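The model sees the Reasoning → Answer pattern in the two worked examples and applies the same structure to the new problem. For reference, the expected answer is 3:00 PM: the first train has a 2-hour, 120-mile head start, and the second closes the gap at 90 − 60 = 30 mph, so it needs 4 hours after its 11:00 AM departure. Few-Shot CoT is a good fit for domains with a consistent problem structure: math word problems, legal reasoning, diagnostic logic.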
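When the worked examples live in data rather than in a hand-written string, the same prompt can be assembled programmatically. The sketch below is one way to do that; the function name `build_few_shot_cot_prompt` and the dictionary keys are assumptions made for this example.

```python
# Illustrative sketch: assemble a Few-Shot CoT prompt from worked examples.
# The function name and the dictionary keys are assumptions for this example.

def build_few_shot_cot_prompt(instruction, examples, new_problem):
    """Format each example as Problem / Reasoning / Answer, then leave the
    final Reasoning slot open for the model to complete."""
    parts = [instruction, ""]
    for ex in examples:
        parts.append(f"Problem: {ex['problem']}")
        parts.append(f"Reasoning: {ex['reasoning']}")
        parts.append(f"Answer: {ex['answer']}")
        parts.append("")  # blank line between examples
    parts.append(f"Problem: {new_problem}")
    parts.append("Reasoning:")
    return "\n".join(parts)


examples = [
    {
        "problem": "I have 3 shirts and 2 pairs of pants. How many distinct outfits can I make?",
        "reasoning": "Each shirt can be paired with any pair of pants. So outfit count = shirts × pants = 3 × 2 = 6.",
        "answer": "6 outfits.",
    },
    {
        "problem": "A store sells apples for $1.20 each and oranges for $0.80 each. I bought 4 apples and 3 oranges. How much did I spend?",
        "reasoning": "Apples cost 4 × $1.20 = $4.80. Oranges cost 3 × $0.80 = $2.40. Total = $4.80 + $2.40 = $7.20.",
        "answer": "$7.20.",
    },
]

few_shot_prompt = build_few_shot_cot_prompt(
    "Solve logic problems step by step.",
    examples,
    "A train leaves at 9:00 AM traveling at 60 mph. Another train leaves the same "
    "station at 11:00 AM traveling in the same direction at 90 mph. "
    "When does the second train catch up?",
)
print(few_shot_prompt)
```

Ending the prompt on an open `Reasoning:` line is what nudges the model to continue the pattern instead of jumping to an answer.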

3. Self-Consistency prompting

Self-Consistency is an extension of Chain-of-Thought. Instead of running a reasoning prompt once, you run it multiple times and take the majority answer.

The intuition: if the same answer appears across most independent runs, it’s more likely correct. One run can be confidently wrong. Five runs that agree are much harder to dismiss.
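To put a rough number on that intuition, suppose each run is independently correct with probability p. The chance that more than half of n runs are correct is then a binomial sum. This is a simplification: real runs from the same model are not fully independent, so treat the figures as illustrative rather than as a guarantee. The function name below is made up for this sketch.

```python
from math import comb


def majority_correct_probability(p: float, n: int) -> float:
    """Probability that more than half of n independent runs are correct,
    when each run is correct with probability p (simplifying assumption)."""
    threshold = n // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )


for n in (1, 5, 15):
    print(n, round(majority_correct_probability(0.7, n), 3))
```

Under these assumptions, a model that is right 70% of the time per run reaches a correct majority about 84% of the time with five runs. The same formula also shows the limit of the technique: if a single run is right less than half the time, voting makes the result worse, not better.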

import os
from collections import Counter
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

prompt = """A bat and a ball together cost $1.10. The bat costs $1.00 more than the ball. How much does the ball cost?

Think step by step, then state your final answer as a single number in cents (e.g., "Answer: 5 cents")."""

answers = []

for _ in range(5):
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )
    output = response.choices[0].message.content
    answers.append(output)

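# Extract the first "Answer:" line from each response
final_answers = []
for ans in answers:
    for line in ans.split("\n"):
        if line.strip().lower().startswith("answer:"):
            # Drop the prefix and normalize so identical answers vote together
            value = line.strip()[len("answer:"):].strip().rstrip(".").lower()
            final_answers.append(value)
            break

# Guard against the case where no response contained an "Answer:" line
if final_answers:
    answer, count = Counter(final_answers).most_common(1)[0]
    print(f"Most common answer: {answer} (appeared {count}/{len(answers)} times)")
else:
    print("No parsable answers found.")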

A few implementation notes:

Set temperature > 0. At temperature 0 the model is close to deterministic, so every run produces nearly the same output and majority voting adds nothing. You need variation across runs for the vote to be meaningful. 0.5–0.9 is a reasonable starting range.
Parse answers consistently — ask the model to output its final answer in a structured format (as shown above) so extraction is reliable.
5 runs is a practical default for most tasks. For high-stakes decisions, go higher.
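String comparison of answer lines is fragile: "Answer: 5 cents" and "ANSWER: 5 cents." would count as different votes. One way to make the vote more robust is to extract the numeric value and vote on that. The sketch below assumes the "Answer: <number>" format requested in the prompt above; `extract_number` and `majority_vote` are hypothetical helper names, and the sample responses are hand-written, not real model output.

```python
import re
from collections import Counter

# Assumed format: the response contains a line such as "Answer: 5 cents".
ANSWER_PATTERN = re.compile(r"answer:\s*\$?(\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_number(response_text):
    """Return the number after the last 'Answer:' in the text, or None."""
    matches = ANSWER_PATTERN.findall(response_text)
    return float(matches[-1]) if matches else None


def majority_vote(responses):
    """Return (answer, votes, parsed_count); answer is None if nothing parsed."""
    parsed = [extract_number(r) for r in responses]
    valid = [p for p in parsed if p is not None]
    if not valid:
        return None, 0, 0
    answer, votes = Counter(valid).most_common(1)[0]
    return answer, votes, len(valid)


# Hand-written sample outputs (not real model responses) to show the behavior.
sample_responses = [
    "Let x be the ball. x + (x + 100) = 110, so x = 5.\nAnswer: 5 cents",
    "The ball is 10 cents.\nAnswer: 10 cents",
    "2x + 100 = 110, so x = 5.\nANSWER: 5 cents.",
    "I am not sure.",
]
print(majority_vote(sample_responses))  # (5.0, 2, 3)
```

Reporting how many responses parsed at all is useful in practice: a 2-of-3 majority with two unparsed runs is weaker evidence than a 4-of-5 majority.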

When to use it: Complex math, critical classification decisions, and any problem where repeated single CoT runs give different answers when you test them manually.

When not to: Simple factual questions (overkill), anything cost-sensitive at scale (5× the API calls = 5× the tokens).

4. Tree of Thoughts (ToT) prompting

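Chain-of-Thought follows a single reasoning path, and Self-Consistency samples the same prompt several times and votes only on the final answers. Tree of Thoughts instead generates several distinct candidate lines of reasoning, evaluates each one, and keeps the most promising. That makes it useful for open-ended problems that don’t have a single obvious approach. The original formulation runs this as a search across multiple model calls; the example below is a simplified single-prompt version.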

The analogy: if your website’s conversion rate drops 15%, you shouldn’t assume one cause and investigate only that. You should simultaneously explore: was there a code deploy? Did ad targeting change? Are competitors running promotions? Is it seasonal? Each is a valid branch worth investigating before you discard it.

ToT structures this as three phases:

Brainstorm — generate multiple distinct hypotheses
Evaluate — assess the evidence for and against each
Conclude — identify the most probable root cause and next steps

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

problem = "Our e-commerce website's conversion rate dropped 15% last week with no intentional changes made."

system_prompt = "You are a senior product analyst. Reason through problems systematically."

tot_prompt = f"""Problem: {problem}

**Phase 1 — Brainstorm:**
Generate exactly 3 distinct, plausible hypotheses that could explain this drop. Each should be a different category of cause (e.g., technical, behavioral, competitive, external).

**Phase 2 — Evaluate:**
For each hypothesis:
- What evidence would confirm it?
- What evidence would rule it out?
- Rate likelihood as High / Medium / Low based on how common this type of issue is.

**Phase 3 — Conclude:**
Based on your evaluation, identify the single most probable root cause.
State what you would investigate first and what data you need to check within the next 24 hours."""

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": tot_prompt}
    ]
)

print(response.choices[0].message.content)

This produces a structured analysis rather than a guess. The model surfaces the most likely cause (often a technical issue or recent deploy) and gives you a concrete investigation path.
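The single prompt above asks the model to run all three phases in one response. A variant that keeps the branches separate is to make one call for brainstorming, one call per hypothesis for evaluation, and one final call to conclude. The sketch below is an assumption about how that could be wired, not the method used in this post: `tree_of_thoughts` is a hypothetical helper, and `llm` stands for any function that takes a prompt string and returns the model's text.

```python
from typing import Callable, List


def tree_of_thoughts(problem: str, llm: Callable[[str], str], n_branches: int = 3) -> str:
    """Brainstorm hypotheses, evaluate each in its own call, then conclude."""
    # Phase 1: ask for one hypothesis per line.
    brainstorm = llm(
        f"Problem: {problem}\n"
        f"List exactly {n_branches} distinct hypotheses, one per line, no numbering."
    )
    hypotheses: List[str] = [
        line.strip() for line in brainstorm.splitlines() if line.strip()
    ][:n_branches]

    # Phase 2: evaluate each branch in a separate call so one hypothesis
    # does not bias the assessment of another.
    evaluations = []
    for hypothesis in hypotheses:
        evaluation = llm(
            f"Problem: {problem}\nHypothesis: {hypothesis}\n"
            "State what evidence would confirm it, what would rule it out, "
            "and rate its likelihood as High / Medium / Low."
        )
        evaluations.append(f"Hypothesis: {hypothesis}\nEvaluation: {evaluation}")

    # Phase 3: compare the evaluated branches and pick one.
    summary = "\n\n".join(evaluations)
    return llm(
        f"Problem: {problem}\n\n{summary}\n\n"
        "Identify the single most probable root cause and what to investigate first."
    )


# Example wiring, assuming the `client` created in the snippet above:
# def groq_llm(prompt: str) -> str:
#     r = client.chat.completions.create(
#         model="llama-3.3-70b-versatile",
#         messages=[{"role": "user", "content": prompt}],
#     )
#     return r.choices[0].message.content
#
# print(tree_of_thoughts(problem, groq_llm))
```

This version costs `n_branches + 2` calls instead of one, which is the trade-off for evaluating each hypothesis in isolation.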

5. Choosing the right technique

Technique	When to use	Token cost
Zero-Shot CoT	Logic, math, ambiguous language — simple cases	Low
Few-Shot CoT	Consistent problem structure, domain-specific reasoning	Low–Medium
Self-Consistency	High-stakes answers, unreliable single-run results	High (N× calls)
Tree of Thoughts	Open-ended problems, root cause analysis, no single right answer	Very High

The decision path:

Is the task simple/factual?
└─ Yes → Zero-shot (Part 8), done.

Does it involve logic or multi-step reasoning?
└─ Yes → Try Zero-Shot CoT first ("Let's think step by step")
         └─ Still inconsistent? → Few-Shot CoT with worked examples
                                  └─ High-stakes and needs confidence? → Self-Consistency (5 runs)

Is the problem open-ended with multiple valid root causes?
└─ Yes → Tree of Thoughts (3-phase prompt)
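The same decision path can be written as a small function, which is handy if technique selection is part of an application rather than a manual choice. This is a sketch of the tree above, with two assumptions it does not state: open-ended problems are checked before multi-step ones, and anything that matches no branch falls back to plain zero-shot. The function and parameter names are illustrative.

```python
def choose_technique(
    simple_or_factual: bool,
    multi_step: bool = False,
    zero_shot_cot_inconsistent: bool = False,
    high_stakes: bool = False,
    open_ended: bool = False,
) -> str:
    """Map task properties to a prompting technique, following the decision path."""
    if simple_or_factual:
        return "Zero-Shot"
    if open_ended:  # assumption: checked before multi-step reasoning
        return "Tree of Thoughts"
    if multi_step:
        if not zero_shot_cot_inconsistent:
            return "Zero-Shot CoT"
        if high_stakes:
            return "Self-Consistency"
        return "Few-Shot CoT"
    return "Zero-Shot"  # assumption: default when no branch applies


print(choose_technique(simple_or_factual=False, multi_step=True))
print(choose_technique(simple_or_factual=False, open_ended=True))
```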

6. One practical rule

Don’t use the most sophisticated technique by default. Tree of Thoughts on a simple math problem wastes tokens and doesn’t improve accuracy. Self-Consistency on a factual lookup is pointless overhead.

Match the technique to the complexity of the task. Most real-world prompts in production applications are Zero-Shot or Zero-Shot CoT. Self-Consistency and ToT are tools you reach for when the simpler approaches demonstrably fail on your own test cases.

What’s next

Part 10 is the final post in the prompt engineering series. It covers two frameworks that give LLMs access to the external world: ReAct (Reason + Act — connecting LLMs to live tools like web search and APIs) and RAG (Retrieval-Augmented Generation — grounding answers in your own private documents). Together, these are what turn a plain LLM into an AI agent.
