How LLMs Work: From Tokens to AI Agents
Before you can build an AI agent, you need to understand the engine inside it. A ground-up walkthrough of LLMs — tokenization, transformers, training, and the limits that make agents necessary.
This is the kickoff post for the AI Agents series. Before we build agents, we need to understand the thing they’re built around: the LLM.
If you can already explain tokenization, attention, and why LLMs need agents to know today’s news — skip this one. Otherwise, this is the foundation the rest of the series builds on.
What is an LLM?
LLM = Large Language Model.
It’s a neural network — specifically a transformer — trained on huge amounts of text scraped from the internet. The training objective is brutally simple:
Given a sequence of words, predict the next word.
That’s it. Everything ChatGPT, Claude, and Gemini do — answering questions, writing code, translating between languages — comes out of that one objective, scaled to billions of parameters.
How LLMs read text
You and I read whole sentences at once. LLMs don’t. They break input into smaller chunks called tokens.
This step is called tokenization. A few common flavors:
- Word-level: split on every word. "The quick brown fox" → 4 tokens.
- Sub-word: break a word into pieces. "running" → "run" + "ning".
- Byte Pair Encoding (BPE): the most common in modern LLMs. It learns which character combinations appear most often in the training data and treats those as single tokens.
For the rest of this post I’ll use word-level tokenization to keep things simple.
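To make the flavors above concrete, here is a minimal sketch in Python. The function names (`word_tokenize`, `build_vocab`, `most_frequent_pair`, `merge_pair`) are made up for this post, and the BPE part shows only a single merge step on a three-word toy corpus. Real tokenizers repeat that step thousands of times and handle bytes, whitespace, and special tokens.

```python
import re
from collections import Counter

def word_tokenize(text):
    """Word-level tokenization: lowercase the text and keep runs of letters/digits."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocab(tokens):
    """Give each distinct token an integer ID, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def most_frequent_pair(words):
    """One BPE step: find the adjacent symbol pair that occurs most often."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a symbol list with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Word-level: each word becomes one token, then one integer ID.
tokens = word_tokenize("The quick brown fox")
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]
print(tokens)     # ['the', 'quick', 'brown', 'fox']
print(token_ids)  # [0, 1, 2, 3]

# BPE: start from characters and merge the most frequent adjacent pair once.
corpus = [list("running"), list("runner"), list("run")]
pair = most_frequent_pair(corpus)
corpus = [merge_pair(word, pair) for word in corpus]
print(pair, corpus)
```

The integer IDs are what the model actually consumes. The strings are only there for humans.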
From tokens to numbers: embeddings
Neural networks don’t speak English. They speak math.
So every token gets converted into a vector — just an array of numbers of some fixed dimension. This is called an embedding.
Vectors start out random. Through training, the model learns to push similar words close together in vector space — king and queen end up near each other; king and banana don’t.
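Here is a rough sketch of both halves of that idea: a randomly initialized embedding table, and a similarity check between vectors. The "trained" vectors below are hand-picked for illustration, not taken from any real model, and `EMBED_DIM = 4` is far smaller than anything used in practice.

```python
import math
import random

random.seed(0)
EMBED_DIM = 4  # toy size; real models use hundreds or thousands of dimensions

vocab = {"king": 0, "queen": 1, "banana": 2}

# Before training: one random vector per token ID.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab
]

def embed(token):
    """Look up the vector for a token via its integer ID."""
    return embedding_table[vocab[token]]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked stand-ins for what training might produce (illustrative only).
trained = {
    "king":   [0.9, 0.8, 0.1, 0.0],
    "queen":  [0.8, 0.9, 0.2, 0.0],
    "banana": [0.0, 0.1, 0.9, 0.8],
}

print(len(embed("king")))                                      # 4
print(cosine_similarity(trained["king"], trained["queen"]))    # close to 1
print(cosine_similarity(trained["king"], trained["banana"]))   # close to 0
```

Cosine similarity is one common way to measure "closeness" in vector space. It is an assumption here, not something the model itself is required to use.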
Positional encoding: why order matters
Here’s a quirk of the transformer: all words in the input go into the model at the same time, not one after another. This is part of why transformers train fast on GPUs.
But "fox the over jumps" and "the fox jumps over" are the same set of words in different orders. The model needs a way to know which word came first.
That’s the job of positional encoding — a separate signal added to each embedding that says “this is position 1, this is position 2, …”.
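One well-known way to build that signal is the sinusoidal scheme from the original transformer paper. The sketch below assumes that scheme. Many modern models use learned or rotary position encodings instead, so treat this as one option, not the way every LLM does it. Positions are 0-indexed here.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal encoding: even indices use sine, odd indices use cosine."""
    encoding = []
    for i in range(dim):
        # Each sin/cos pair shares one frequency; later pairs vary more slowly.
        angle = position / (10000 ** ((2 * (i // 2)) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_position(embeddings):
    """Add the position signal to each token embedding, element-wise."""
    dim = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, positional_encoding(pos, dim))]
        for pos, vec in enumerate(embeddings)
    ]

# The same word at two different positions ends up with two different vectors.
same_word = [0.5, 0.5, 0.5, 0.5]
with_position = add_position([same_word, same_word])
print(with_position[0])                      # [0.5, 1.5, 0.5, 1.5]
print(with_position[0] != with_position[1])  # True
```

Because the position signal is simply added to the embedding, the rest of the model needs no special handling for word order.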
Attention: what to focus on
This is the heart of the transformer.
When the model predicts the next word, it asks: “Which of the previous words actually matter for this prediction?”
Example: The cat sat on the mat.
- To predict "sat", the model leans hard on "cat" (who sat?).
- To predict "on", it leans on "sat" (sat where? doing what?).
- The two "the"s? Mostly ignored.
That weighting is the attention score.
Multi-head attention just means the model does this multiple times in parallel. Each “head” learns a different kind of relationship — grammatical, semantic, positional — and the heads get combined into a richer representation.
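A common formulation of that weighting is scaled dot-product attention. Here is a minimal single-head sketch in plain Python. The query, key, and value vectors are hand-picked toy numbers. In a real transformer they come from learned projection matrices, which this sketch leaves out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector (one head)."""
    scale = math.sqrt(len(query))
    # Score each earlier word by how well its key matches the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy setup: predicting the word after "The cat".
words = ["The", "cat"]
keys = [[0.1, 0.0], [1.0, 1.0]]   # made-up key vectors
values = keys                     # reuse keys as values to keep the sketch short
query = [2.0, 2.0]                # made-up query for the position being predicted

weights, output = attention(query, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")  # "cat" gets most of the weight
```

Multi-head attention would run several copies of `attention` with different projections and concatenate the outputs.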
The full transformer pipeline
Here’s the whole flow, top to bottom:
- Tokenize the input.
- Embed each token as a vector.
- Add positional encoding so order is preserved.
- Multi-head attention — figure out which words depend on which.
- Feed-forward network — non-linear transformation on the result.
- Output layer — produce a probability distribution over the next token.
- Forward + backward propagation — update weights to minimize loss.
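Stack the attention and feed-forward steps dozens of times, train on a large slice of the public internet, and you have a modern LLM. The last step applies only during training; at inference time the model just runs the forward pass and emits the next token.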
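The output-layer step in the list above can be sketched in a few lines. The three-word vocabulary and the logits are invented for illustration. A real model scores tens of thousands of tokens, and it often samples from the distribution instead of always taking the top choice.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for what might follow "The cat sat on the".
vocab = ["mat", "dog", "moon"]
logits = [3.0, 1.0, 0.5]

probs = softmax(logits)
# Greedy decoding: pick the single most probable token.
next_token = vocab[probs.index(max(probs))]

for token, p in zip(vocab, probs):
    print(f"{token}: {p:.2f}")
print("next token:", next_token)  # next token: mat
```

To generate a whole sentence, the chosen token is appended to the input and the pipeline runs again, one token at a time.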
How LLMs are actually trained
Training happens in two big phases.
1. Pre-training (self-supervised)
You can’t manually label internet-scale data. So the trick is self-supervised learning: generate input/output pairs directly from the text itself.
From "The cat sat on the mat":
- Input: "The cat sat on the" → Target: "mat"
- Input: "The cat sat on" → Target: "the"
- Input: "The cat sat" → Target: "on"
- …and so on.
One sentence becomes many training examples. Scale that to Wikipedia, GitHub, books, and news, and the model picks up grammar, facts, code, and reasoning along the way.
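Generating those pairs takes only a few lines. This sketch splits on whitespace to stay consistent with the word-level simplification used in this post. `next_word_pairs` is a name made up for this example.

```python
def next_word_pairs(sentence):
    """Turn one sentence into (context, target) pairs for next-word prediction."""
    words = sentence.split()
    # Every prefix of the sentence is an input; the word after it is the target.
    return [(words[:i], words[i]) for i in range(1, len(words))]

pairs = next_word_pairs("The cat sat on the mat")
for context, target in pairs:
    print(" ".join(context), "->", target)
# The -> cat
# The cat -> sat
# The cat sat -> on
# The cat sat on -> the
# The cat sat on the -> mat
```

No human labeled anything here: the text supplies its own answers, which is what makes the approach scale.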
2. Fine-tuning + RLHF
Pre-training gives you a generalist — okay at everything, great at nothing.
To make it good at a specific task, you fine-tune it on labeled examples for that task. For example, the base model can write standard SQL no problem, but if you want it to generate code in your custom query language, you have to show it examples.
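Then there’s RLHF (Reinforcement Learning from Human Feedback). Humans rank model outputs, those rankings are used to train a reward model, and the LLM is then optimized to produce answers the reward model scores highly. In effect, the model learns “this is the kind of answer I should produce.” This is a large part of what turns a raw text completer into a helpful assistant.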
What LLMs are good at
- Text generation and completion
- Q&A and grammar correction
- Translation (the training data was multilingual)
- Code generation (it was trained on GitHub)
- Summarization and creative writing
Where LLMs break down
This part is critical, because the gaps are exactly why AI agents exist.
1. They hallucinate. The model’s only job is to predict the next plausible word. If it lacks the facts or misreads your question, it will still produce fluent, confident-sounding text that may be wrong. Confidence ≠ correctness.
2. They have a knowledge cutoff. The model was trained at some point in time. Ask “Who won the election today?” and it has no idea — that data didn’t exist when it was trained.
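3. They have no memory. The model itself is stateless: each conversation starts fresh, and it only sees what is inside the current context window. A vanilla LLM can’t remember what you said yesterday, and it only “remembers” what you said three turns ago if the application sends those turns again with the new request.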
4. They’re computationally expensive. Billions or trillions of parameters means GPUs, datacenters, and a lot of money. Only large labs can afford to train frontier models from scratch.
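Limitation 3 is easy to see in code. Here is a simplified sketch of how a chat application might trim history to fit a context window. It counts words as tokens, which is an assumption made to match the word-level simplification in this post, and `fit_to_context` is a hypothetical helper, not part of any library.

```python
def fit_to_context(messages, max_tokens):
    """Keep the most recent messages whose combined word count fits the budget."""
    kept = []
    used = 0
    # Walk backwards from the newest message and stop once the budget is spent.
    for message in reversed(messages):
        cost = len(message.split())
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))

history = [
    "My favourite team is RCB",
    "Who won yesterday",
    "Will they win the next match",
]

visible = fit_to_context(history, max_tokens=10)
print(visible)  # ['Who won yesterday', 'Will they win the next match']
```

The first message no longer fits, so from the model's point of view the user never mentioned RCB.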
Why this leads to AI agents
Here’s the punchline.
An AI agent is an LLM with tools and memory bolted on.
- Tools patch the knowledge cutoff. Give the LLM a web search tool, a news scraper, or an API client, and it can fetch real-time information instead of guessing.
- Memory patches the amnesia. Store conversation history (and summaries of it), and the agent can reason across turns.
Concrete example. You ask: “RCB won today — will they win the next match?” The agent calls a Cricbuzz tool to get the schedule. Then you follow up: “Will they win the one after that?” — no team mentioned. The agent uses memory to remember you were talking about RCB.
Tools + memory is most of what makes an agent feel intelligent.
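The RCB example can be sketched as a toy agent. Everything here is hypothetical: `get_schedule` stands in for a real sports API and returns placeholder data, `KNOWN_TEAMS` is a made-up list, and the point where a real agent would call the LLM is replaced by a formatted string. It only shows where tools and memory plug in.

```python
KNOWN_TEAMS = ("RCB",)  # hypothetical list of team names the agent recognizes

def get_schedule(team):
    """Hypothetical tool: a real agent would call a live sports API here."""
    fixtures = {"RCB": ["Opponent A", "Opponent B"]}  # placeholder data
    return fixtures.get(team, [])

class ToyAgent:
    def __init__(self, tools):
        self.tools = tools   # name -> callable the agent may use
        self.memory = []     # earlier user turns, oldest first

    def resolve_team(self, question):
        """Look for a team in the question, then fall back to earlier turns."""
        for text in [question] + self.memory[::-1]:
            for team in KNOWN_TEAMS:
                if team in text:
                    return team
        return None

    def ask(self, question):
        team = self.resolve_team(question)
        self.memory.append(question)  # remember this turn for later
        if team is None:
            return "Which team do you mean?"
        fixtures = self.tools["get_schedule"](team)
        # A real agent would hand the fixtures and history to the LLM here.
        return f"{team} upcoming fixtures: {', '.join(fixtures)}"

agent = ToyAgent(tools={"get_schedule": get_schedule})
first = agent.ask("RCB won today, will they win the next match?")
second = agent.ask("Will they win the one after that?")  # no team mentioned
print(first)
print(second)  # still about RCB, thanks to memory
```

Without the `memory` list, the second question would fall through to "Which team do you mean?".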
What’s next in the series
Upcoming posts will build on this foundation:
- Prompt engineering — how to actually steer an LLM.
- RAG and vector databases — giving the LLM access to your own documents.
- Building real AI agents — putting it all together with tools and memory.
If you’d rather watch than read, the full video walkthrough is embedded above.