ReAct and RAG: Giving LLMs Access to the External World

This is Part 10 of the AI Agents series and the final part of the prompt engineering sub-series. Parts 8–9 covered Zero-Shot, Few-Shot, Chain-of-Thought, Self-Consistency, and Tree of Thoughts — techniques for improving how an LLM reasons over the information you give it.

This post covers the next level: giving the LLM access to information it doesn’t have at all — live data from the internet, real-time APIs, and your own private documents.

1. The two gaps prompting alone can’t fix

Standard prompting — even with advanced techniques — is still limited to what the model learned during training:

Temporal gap: Training data has a cutoff. Ask an LLM for last weekend’s box office results or the current temperature in Hyderabad and it cannot answer accurately. It will guess.
Private knowledge gap: Your company’s internal documents, policies, and data were never in the training set. Ask about “Nerchuko’s remote work policy” and it will hallucinate a plausible-sounding but wrong answer.

Two frameworks address these gaps: ReAct for live external data, RAG for private internal knowledge.

2. ReAct: Reason + Act

ReAct (Reason + Act) gives an LLM access to external tools — web search, APIs, databases, calculators — and lets it decide when and how to use them.

The model doesn’t just generate text. It operates in a loop:

Reason  →  Act (call a tool)  →  Observe (read the result)  →  Reason again

It keeps looping until it has enough information to answer confidently.

Example — current weather:

User: What's the current temperature in Hyderabad?

Reason: I don't have real-time weather data. I need to call a weather API.
Act: call weather_api(city="Hyderabad")
Observe: {"temperature": 32, "condition": "partly cloudy"}
Reason: I now have the data. I can answer.
Answer: It's currently 32°C and partly cloudy in Hyderabad.

Without ReAct, the model guesses a temperature based on seasonal patterns in its training data. With ReAct, it fetches the actual current reading.
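The loop in the trace above can be sketched without any LLM at all. This is a minimal illustration, not a real agent: fake_model is a scripted stand-in for the model, weather_api returns canned data, and the max_steps cap is an added assumption so the loop cannot run forever.

```python
import json

def weather_api(city: str) -> dict:
    """Hypothetical tool: returns canned data instead of calling a real API."""
    return {"temperature": 32, "condition": "partly cloudy"}

TOOLS = {"weather_api": weather_api}

def fake_model(history: list[dict]) -> dict:
    """Stand-in for an LLM: requests the tool once, then answers from the observation."""
    observations = [m for m in history if m["role"] == "observation"]
    if not observations:
        return {"type": "act", "tool": "weather_api", "args": {"city": "Hyderabad"}}
    data = json.loads(observations[-1]["content"])
    text = f"It's currently {data['temperature']}°C and {data['condition']} in Hyderabad."
    return {"type": "answer", "text": text}

def react_loop(question: str, model, max_steps: int = 5) -> str:
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = model(history)                         # Reason: decide the next move
        if step["type"] == "answer":
            return step["text"]
        result = TOOLS[step["tool"]](**step["args"])  # Act: run the requested tool
        # Observe: add the tool result to the history for the next reasoning step
        history.append({"role": "observation", "content": json.dumps(result)})
    return "Stopped: step limit reached without an answer."

answer = react_loop("What's the current temperature in Hyderabad?", fake_model)
print(answer)
```

With a real model, the scripted fake_model is replaced by an API call, but the control flow stays the same.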

Implementing ReAct with tool calling

Most LLM providers support tool calling natively. You define the tools available, and the model decides when to call them.

import os
import json
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

# Define the tools the model can use
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city name"
                    }
                },
                "required": ["city"]
            }
        }
    }
]

# Simulate the tool's actual implementation
def get_weather(city: str) -> dict:
    # In production: call a real weather API here
    return {"city": city, "temperature": 32, "condition": "partly cloudy"}

messages = [{"role": "user", "content": "What's the current temperature in Hyderabad?"}]

# First call: model reasons and decides to call the tool
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=messages,
    tools=tools
)

# tool_calls is a list (or None); the model may request more than one call
message = response.choices[0].message
if message.tool_calls:
    # Keep the assistant's tool request in the history before the results
    messages.append(message)

    for tool_call in message.tool_calls:
        # Arguments arrive as a JSON string
        args = json.loads(tool_call.function.arguments)
        result = get_weather(**args)  # only one tool is defined here

        # Feed each result back, matched to its call by ID
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })

    # Second call: model observes the results and generates the final answer
    final_response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages
    )
    print(final_response.choices[0].message.content)
else:
    # The model answered directly without calling a tool
    print(message.content)

The model handles the Reason and Observe steps. You implement the actual tool functions. For a real application, get_weather would call a live weather API instead of returning hardcoded data.
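The example above calls get_weather directly because only one tool is defined. With several tools, you would typically dispatch on the name the model returns. The sketch below is one possible way to do that; TOOL_REGISTRY and run_tool are hypothetical names, not part of the Groq SDK, and the error handling is an assumption about what is useful to send back to the model.

```python
import json

def get_weather(city: str) -> dict:
    # Same hardcoded stub as above; a real version would call a weather API
    return {"city": city, "temperature": 32, "condition": "partly cloudy"}

# Hypothetical registry mapping declared tool names to Python functions
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool(name: str, arguments: str) -> str:
    """Execute one tool call and return a JSON string for the "tool" message."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments)  # the model sends arguments as a JSON string
        return json.dumps(func(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong argument names: report back instead of crashing
        return json.dumps({"error": str(exc)})

print(run_tool("get_weather", '{"city": "Hyderabad"}'))
print(run_tool("get_stock_price", '{"ticker": "ABC"}'))
```

Returning the error as the tool result, rather than raising, gives the model a chance to correct its call on the next turn.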

When to use ReAct

ReAct is the right choice when answers require:

Real-time data: weather, stock prices, sports scores, news
Computation: a calculator tool for math that needs guaranteed accuracy
External actions: sending emails, querying databases, running code
Information beyond the training cutoff: anything recent
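As an illustration of the computation case, here is a rough sketch of a calculator tool. The name calculate is hypothetical. It parses the expression with Python's ast module and evaluates only basic arithmetic, on the assumption that a tool fed model-generated input should never pass it to eval.

```python
import ast
import operator

# Only these arithmetic operators are allowed
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def _evaluate(node: ast.AST):
    # Plain numbers (bool is excluded even though it subclasses int)
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    # Names, function calls, attribute access, etc. are all rejected
    raise ValueError("unsupported expression")

def calculate(expression: str):
    """Hypothetical calculator tool: evaluates plain arithmetic, rejects everything else."""
    tree = ast.parse(expression, mode="eval")
    return _evaluate(tree.body)

print(calculate("1234 * 5678"))
```

The function can be exposed to the model with a tool definition in the same shape as get_weather, taking a single string parameter.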

3. RAG: Retrieval-Augmented Generation

RAG solves the private knowledge problem. Instead of fine-tuning a model on your internal documents (expensive, slow, and stale the moment documents update), RAG retrieves the relevant documents at query time and includes them in the prompt.

Think of it as an open-book test. The LLM doesn’t need to have memorized your company handbook — it just needs to be handed the right page before it answers.

How RAG works

User query
    │
    ▼
Convert query to vector embedding
    │
    ▼
Search vector database for similar document chunks
    │
    ▼
Retrieve top-k matching chunks
    │
    ▼
Inject chunks into prompt as context
    │
    ▼
LLM answers based only on the provided context

The vector database stores your documents, split into chunks, as numerical embeddings: vectors that represent meaning rather than exact text. When a question comes in, it’s embedded with the same model and matched against the stored chunks by similarity. Only the most relevant chunks are retrieved.

This keeps the context window manageable. You’re not dumping an entire 500-page handbook into every prompt — you’re retrieving the 2–3 sections that actually answer the question.
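The retrieval steps can be sketched in a few lines. Note the big assumption: embed here is a toy bag-of-words counter standing in for a real embedding model, so it matches shared words rather than meaning, and the in-memory list stands in for a vector database. The expense-report chunk is made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]

chunks = [
    "Nerchuko employees are entitled to 22 paid holidays per year.",
    "Expense reports must be submitted within 30 days.",  # hypothetical unrelated chunk
    "Holidays include national public holidays and company-specific days as announced annually.",
]
top_chunks = retrieve("How many paid holidays do employees get?", chunks, k=2)
print(top_chunks)
```

The two holiday chunks rank above the unrelated one, and only those would be injected into the prompt.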

RAG prompt pattern

Once you’ve retrieved the relevant document chunks, the prompt structure is straightforward:

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def answer_from_context(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)

    prompt = f"""Answer the question using only the information provided in the context below.
If the answer is not in the context, say "I don't have that information."
Do not use any outside knowledge.

Context:
{context}

Question: {question}"""

    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content


# Example: company policy documents retrieved from vector DB
retrieved = [
    "Nerchuko employees are entitled to 22 paid holidays per year.",
    "Holidays include national public holidays and company-specific days as announced annually."
]

print(answer_from_context("How many holidays do Nerchuko employees get?", retrieved))

The instruction "Do not use any outside knowledge" is critical. Without it, the model may blend retrieved content with its training data and hallucinate. The constraint does not guarantee compliance, but it keeps answers much closer to the source documents and therefore easier to trace.
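One way to make that traceability concrete is to label each chunk with its source and ask the model to cite the labels. The sketch below only builds the prompt string. The function name build_cited_prompt, the citation instruction, and the file name holiday_policy.md are all illustrative assumptions, not part of the original example.

```python
def build_cited_prompt(question: str, chunks: list[dict]) -> str:
    """Build a RAG prompt whose context chunks carry source labels the model can cite."""
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the information provided in the context below.\n"
        "Cite the bracketed number of each chunk you rely on, for example [1].\n"
        'If the answer is not in the context, say "I don\'t have that information."\n'
        "Do not use any outside knowledge.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# File names below are hypothetical placeholders
retrieved = [
    {"source": "holiday_policy.md",
     "text": "Nerchuko employees are entitled to 22 paid holidays per year."},
    {"source": "holiday_policy.md",
     "text": "Holidays include national public holidays and company-specific days as announced annually."},
]
prompt = build_cited_prompt("How many holidays do Nerchuko employees get?", retrieved)
print(prompt)
```

The resulting string can be sent as the user message exactly as in answer_from_context.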

Why not just fine-tune instead?

Fine-tuning bakes knowledge into the model weights. RAG keeps it in documents.

	Fine-tuning	RAG
Update documents	Retrain the model	Update the database
Cost	High	Low
Answer traceability	Hard	Easy (you know the source chunk)
Good for	Behavior/style changes	Knowledge/facts

For knowledge that changes (policies, product docs, FAQs), RAG is almost always the better choice. Fine-tuning is for changing how the model behaves, not what it knows.

4. ReAct vs RAG: which one to use

	ReAct	RAG
Problem	Real-time or external data	Private or internal knowledge
Data source	APIs, web, tools	Your own documents
Latency	Depends on tool response time	Depends on vector search speed
Hallucination risk	Lower (grounded in tool results)	Lower (grounded in retrieved docs)
Use case	“What’s the weather?”	“What’s our refund policy?”

They’re not mutually exclusive. A real AI agent often uses both: RAG for internal knowledge and ReAct for anything requiring live external data.
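A common way to combine the two is to expose retrieval itself as a tool, so the model decides when to consult internal documents just as it decides when to call a weather API. The sketch below assumes a hypothetical tool named search_company_docs, with a naive word-overlap ranker as a placeholder for a real vector search.

```python
# Hypothetical tool schema, in the same format as the get_weather definition earlier
search_docs_tool = {
    "type": "function",
    "function": {
        "name": "search_company_docs",
        "description": "Search internal company documents and return relevant passages",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "What to look up"}
            },
            "required": ["query"],
        },
    },
}

DOCUMENTS = [
    "Nerchuko employees are entitled to 22 paid holidays per year.",
    "Holidays include national public holidays and company-specific days as announced annually.",
]

def search_company_docs(query: str, k: int = 1) -> list[str]:
    """Placeholder retriever: ranks passages by shared words; swap in a vector search."""
    query_words = set(query.lower().split())
    ranked = sorted(
        DOCUMENTS,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

print(search_company_docs("how many paid holidays per year"))
```

Appending search_docs_tool to the tools list would let a single agent answer both kinds of question.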

5. The full picture: plain LLM → AI Agent

Looking back at the full series:

Parts 1–5: What LLMs are, how to use APIs, how to control output
Parts 6–7: Open-source models via Groq and locally via Ollama
Parts 8–10: Prompt engineering — from basic prompting to tool use and retrieval

The progression across Parts 8–10 traces how prompts evolve:

Zero/Few-Shot — tell the model what format you want
Chain-of-Thought — tell the model how to reason
ReAct + RAG — give the model external capabilities

At this point, you have a complete foundation. An LLM with a well-structured prompt, tool access via ReAct, and a document retrieval layer via RAG is, functionally, an AI agent: it can reason, act, retrieve, and respond using live and private information.

What’s next

Part 11 goes deep on RAG specifically — how the retrieval layer actually works: chunking strategies, vector embeddings, cosine similarity vs Euclidean distance, and a full implementation using ChromaDB.

Full video walkthrough is embedded above.