A technical primer for beginners — grounded in how transformers actually work.
If you've attended my workshops, you've heard me say this: "A language model on its own is stateless, memoryless, and passive." It waits. It responds. It forgets. So how do we get from that — a next-token predictor — to an autonomous agent that can book your flights, write code, run tests, and fix its own bugs?
That's what going agentic means. And it builds directly on transformer fundamentals you already know. Let's go layer by layer.
Before we build an agent, let's re-anchor on the mechanism. A transformer takes a sequence of tokens, processes them through stacked self-attention + feed-forward layers, and outputs a probability distribution over the vocabulary. That's it.
The critical insight for agentic AI: the output doesn't have to be prose. It can be structured JSON. A function call. A decision. This is the bridge from "chatbot" to "agent."
The transformer doesn't change — only what we ask it to generate.
Key insight: "Tool calling" is not a special mode. It's just token generation with a prompt that says: "If you need to act, output JSON in this schema." The model samples from its vocabulary distribution — and we've shaped that distribution to prefer well-formed JSON over prose.
A single generation call gives you one response. An agent chains many. The ReAct pattern (Yao et al., 2022) is the most common architecture: the model alternates between reasoning in natural language and executing tool calls, with each observation fed back as context for the next step.
This works because of something fundamental about transformer attention: the entire history is in the context window. Every prior reasoning step and observation is attended to at every layer. The model doesn't just see the last message — it attends across the full trajectory.
Each loop iteration appends tokens to context. The transformer attends to the full history at every step.
Here's the transformer connection that often gets missed: every observation from a tool becomes part of the key-value (KV) cache. When the model "thinks" in the next step, self-attention queries span the entire trajectory. The model isn't summarizing what happened — it's attending to it directly. This is why long context windows matter enormously for agentic reliability.
The model generates "scratch-pad" tokens — reasoning steps like "I need to search for flights from Phoenix to JFK. Let me call search_flights." These tokens are in the context window. They attend to the original task, system prompt, and any prior results. This is just autoregressive generation — no magic.
In your prompt, you typically say: "Think step-by-step. When you need to act, output a JSON tool call."
The model outputs structured JSON instead of prose. Your orchestrator (Python, TypeScript, whatever) parses that JSON, routes it to the right function — a search API, a database query, a code executor — and captures the result.
Example output: {"tool": "search_flights", "args": {"origin": "PHX", "dest": "JFK", "date": "2026-05-01"}}
The tool result gets serialized as a string and appended to the conversation as a new message — typically tagged as a "tool result" role. This extends the context window. On the next forward pass, every attention head can query this result.
This is the secret: the model "remembers" results not through internal state, but through extended context that it attends to.
The orchestrator checks if the model returned a final answer or another tool call. If it's a tool call, repeat. If it's a final answer, return it to the user. Simple loops run 3–5 iterations. Complex agents may run dozens.
Your main engineering challenge: stopping conditions, error handling, and context window budget management.
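Beyond a hard step limit, one cheap stopping condition is detecting when the model issues the same tool call several times in a row, which usually signals a loop. A minimal sketch (the window size and the serialized-call representation are assumptions):

```python
def is_looping(call_history: list[str], window: int = 3) -> bool:
    """Heuristic stop: the agent is likely stuck if its last few tool calls are identical.

    call_history holds each tool call serialized to a canonical JSON string.
    """
    if len(call_history) < window:
        return False
    return len(set(call_history[-window:])) == 1
```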
A well-built agent has five distinct concerns. Understanding which layer handles what will save you days of debugging.
This is your frozen pre-trained network. It brings world knowledge, reasoning ability, and the capacity to produce structured outputs. You're not training it — you're steering its generation through context.
Key parameters you control: temperature (lower = more deterministic tool calls), max_tokens (budget per step), and stop_sequences (tell it when to stop and wait for a tool result).
Agentic tasks often want lower temperature (~0.1–0.4) than creative tasks. You need reliable JSON, not poetic JSON.
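Putting those three knobs together, a single agent step might look like the sketch below. The parameter names vary by provider, and `llm.generate` is the same placeholder client used in the loop at the end of this post:

```python
response = llm.generate(
    messages,
    temperature=0.2,                  # low temperature: reliable JSON, not poetic JSON
    max_tokens=800,                   # budget for one reasoning step plus one tool call
    stop_sequences=["Observation:"],  # stop before the model invents a tool result
)
```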
Let's be precise about the transformer internals here, because this is what I always cover in workshops and what most blog posts skip.
In a decoder-only transformer, each token's representation is computed as a weighted sum over the value vectors of the tokens before it, with the attention weights determining how much each one contributes. During an agentic loop, this means the model's "reasoning" at step N is informed by everything: the original task, every prior think-step, every tool call, every observation. The KV cache stores the key and value projections for all prior tokens, so incremental generation is efficient: you don't recompute the whole context for every new token.
Attention is not bounded to "recent" steps. As long as the original task is still in the context window, the model can attend to it directly, even 40 steps later.
Why context window size is an architectural decision, not a setting: with a KV cache, every new token the agent generates attends to all n tokens already in context, so per-token cost grows linearly with context length (and reprocessing a long context from scratch is quadratic). Long chains in large contexts get expensive fast. Budget your context like memory: know when to summarize or use RAG to compress old observations.
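One practical consequence: treat the message list like a memory budget. The sketch below assumes a `count_tokens` helper and a `summarize` call (both placeholders for whatever tokenizer and summarization step you actually use) and simply shrinks older tool observations once a budget is exceeded:

```python
CONTEXT_BUDGET = 60_000  # tokens; derive from your model's window minus headroom

def compress_if_needed(messages: list[dict], count_tokens, summarize) -> list[dict]:
    """Compress old tool observations when the context grows past the budget.

    count_tokens(messages) -> int and summarize(text) -> str are placeholders.
    """
    if count_tokens(messages) <= CONTEXT_BUDGET:
        return messages
    compressed = []
    for msg in messages:
        # Shrink older tool observations; leave everything else, and the most
        # recent message, intact.
        if msg["role"] == "tool" and msg is not messages[-1]:
            compressed.append({"role": "tool", "content": summarize(msg["content"])})
        else:
            compressed.append(msg)
    return compressed
```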
Single agents hit limits: context overflow, task complexity, and parallelism. The solution is multi-agent orchestration — multiple transformer instances, each running its own ReAct loop, coordinated by an orchestrator.
| Pattern | Structure | When to use |
|---|---|---|
| Single agent | One model, one loop | Tasks with <10 steps, small context |
| Supervisor + workers | Orchestrator model routes sub-tasks to specialized agents | Parallel work, role specialization |
| Pipeline | Agent A's output = Agent B's input | Sequential, verifiable stages (write → review → deploy) |
| Debate / critic | Two agents produce + critique | High-stakes outputs requiring self-correction |
Each worker is an independent transformer instance with its own context. The orchestrator aggregates results in its context window.
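A sketch of the supervisor-plus-workers pattern in the same pseudocode style. The role prompts and the task-splitting convention are illustrative assumptions, and `run_worker` stands in for a single-agent ReAct loop like the one at the end of this post:

```python
def run_supervisor(task: str, run_worker) -> str:
    """Minimal supervisor: split the task, fan sub-tasks out to workers, aggregate.

    run_worker(role_prompt, sub_task) -> str is a placeholder for a single-agent
    ReAct loop parameterized by a role prompt.
    """
    plan = run_worker(
        "You are a planner. Split the task into independent sub-tasks, one per line.",
        task,
    )
    sub_tasks = [line.strip() for line in plan.splitlines() if line.strip()]

    # Each worker is an independent transformer instance with its own context.
    results = [run_worker("You are a specialist worker.", sub) for sub in sub_tasks]

    # The supervisor aggregates worker outputs in its own context window.
    return run_worker(
        "You are a supervisor. Combine these results into one coherent answer.",
        task + "\n\n" + "\n\n".join(results),
    )
```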
Agentic systems fail in slow, compounding ways. An LLM chatbot either answers or it doesn't. An agent can drift across 20 steps, confidently doing the wrong thing, with every step looking plausible in isolation.
| Failure mode | Root cause | Mitigation |
|---|---|---|
| Context drift | Early task context gets diluted by many observations; attention weights shift | Re-state the goal every N steps; use a "goal anchor" in system prompt |
| Tool hallucination | Model invents tool names or args not in schema | Strict JSON schema validation; retry with error feedback in context |
| Infinite loops | No stopping condition; model keeps seeking more info | Hard step limit; detect repeated tool calls |
| Context overflow | Long chains exceed the context window | Sliding window; RAG compression; hierarchical agents |
| Cascading errors | Bad tool result poisons all subsequent reasoning | Error-handling branches; retry with different strategy in prompt |
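The most mechanical of these mitigations, schema validation with retry feedback, fits in a few lines. The check below uses plain Python for illustration (a real system might use a JSON Schema validator), and the error-message wording is an assumption:

```python
import json

def validate_tool_call(raw: str, registry: dict) -> tuple[dict | None, str | None]:
    """Return (tool_call, None) if valid, otherwise (None, error message for the model)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Your last output was not valid JSON ({exc}). Re-emit the tool call."
    if not isinstance(call, dict) or call.get("tool") not in registry:
        return None, f"Output must be a JSON object naming one of these tools: {sorted(registry)}."
    if not isinstance(call.get("args"), dict):
        return None, "The 'args' field must be a JSON object."
    return call, None
```

On failure, append the error string to the conversation as a tool-role message and generate again; the model usually corrects itself once the error is in context.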
Here's the minimal pattern. Python pseudocode, but the logic applies everywhere:
```python
# 1. Define your tools (just Python functions + docstrings)
def search_web(query: str) -> str: ...
def run_python(code: str) -> str: ...

def run_agent(task: str) -> str:
    # 2. Build a context window
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT + tool_schemas},
        {"role": "user", "content": task},
    ]

    # 3. ReAct loop
    for step in range(MAX_STEPS):
        response = llm.generate(messages, temperature=0.2)

        if response.is_final_answer:
            return response.text

        # Parse the tool call from JSON output
        tool_call = parse_json(response.text)
        result = dispatch(tool_call)  # routes to the right Python function

        # Append both the model's output AND the result to context
        messages.append({"role": "assistant", "content": response.text})
        messages.append({"role": "tool", "content": str(result)})

    # If we hit MAX_STEPS without a final answer, surface that to the user
    raise AgentTimeoutError("Max steps reached")
```
That's the core. Every agentic framework — LangChain, LlamaIndex, CrewAI, the Anthropic Claude SDK — is this loop with more error handling, observability, and tool abstractions around it.
Agentic AI isn't a new paradigm — it's the same transformer you know from my workshops, extended with a loop. The model's attention mechanism is its working memory. The context window is the agent's state. Tool outputs are just tokens.
Once you see it this way, you can reason about agent failure the same way you reason about transformer behavior: it's a function of what's in the context, how attention distributes, and whether the training distribution covers what you're asking it to do.
Topics we'll go deeper on in the next session: