A technical primer for beginners — grounded in how transformers actually work.
If you've attended my workshops, you've heard me say this: "A language model on its own is stateless, memoryless, and passive." It waits. It responds. It forgets. So how do we get from that — a next-token predictor — to an autonomous agent that can book your flights, write code, run tests, and fix its own bugs?
That's what going agentic means. And it builds directly on transformer fundamentals you already know. Let's go layer by layer.
Before we build an agent, let's re-anchor on the mechanism. A transformer takes a sequence of tokens, processes them through stacked self-attention + feed-forward layers, and outputs a probability distribution over the vocabulary. That's it.
The critical insight for agentic AI: the output doesn't have to be prose. It can be structured JSON. A function call. A decision. This is the bridge from "chatbot" to "agent."
The transformer doesn't change — only what we ask it to generate.
Key insight: "Tool calling" is not a special mode. It's just token generation with a prompt that says: "If you need to act, output JSON in this schema." The model samples from its vocabulary distribution — and we've shaped that distribution to prefer well-formed JSON over prose.
A single generation call gives you one response. An agent chains many. The ReAct pattern (Yao et al., 2022) is the most common architecture: the model alternates between reasoning in natural language and executing tool calls, with each observation fed back as context for the next step.
This works because of something fundamental about transformer attention: the entire history is in the context window. Every prior reasoning step and observation is attended to at every layer. The model doesn't just see the last message — it attends across the full trajectory.
Each loop iteration appends tokens to context. The transformer attends to the full history at every step.
Here's the transformer connection that often gets missed: every observation from a tool becomes part of the key-value (KV) cache. When the model "thinks" in the next step, self-attention queries span the entire trajectory. The model isn't summarizing what happened — it's attending to it directly. This is why long context windows matter enormously for agentic reliability.
The model generates "scratch-pad" tokens — reasoning steps like "I need to search for flights from Phoenix to JFK. Let me call search_flights." These tokens are in the context window. They attend to the original task, system prompt, and any prior results. This is just autoregressive generation — no magic.
In your prompt, you typically say: "Think step-by-step. When you need to act, output a JSON tool call."
The model outputs structured JSON instead of prose. Your orchestrator (Python, TypeScript, whatever) parses that JSON, routes it to the right function — a search API, a database query, a code executor — and captures the result.
Example output: {"tool": "search_flights", "args": {"origin": "PHX", "dest": "JFK", "date": "2026-05-01"}}
The tool result gets serialized as a string and appended to the conversation as a new message — typically tagged as a "tool result" role. This extends the context window. On the next forward pass, every attention head can query this result.
This is the secret: the model "remembers" results not through internal state, but through extended context that it attends to.
The orchestrator checks if the model returned a final answer or another tool call. If it's a tool call, repeat. If it's a final answer, return it to the user. Simple loops run 3–5 iterations. Complex agents may run dozens.
Your main engineering challenge: stopping conditions, error handling, and context window budget management.
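Beyond a hard step limit, one cheap stopping condition is detecting when the model issues the same tool call several times in a row, which usually signals a loop. A minimal sketch (the window size and the serialized-call representation are assumptions):

```python
def is_looping(call_history: list[str], window: int = 3) -> bool:
    """Heuristic stop: the agent is likely stuck if its last few tool calls are identical.

    call_history holds each tool call serialized to a canonical JSON string.
    """
    if len(call_history) < window:
        return False
    return len(set(call_history[-window:])) == 1
```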
A well-built agent has five distinct concerns. Understanding which layer handles what will save you days of debugging.
This is your frozen pre-trained network. It brings world knowledge, reasoning ability, and the capacity to produce structured outputs. You're not training it — you're steering its generation through context.
Key parameters you control: temperature (lower = more deterministic tool calls), max_tokens (budget per step), and stop_sequences (tell it when to stop and wait for a tool result).
Agentic tasks often want lower temperature (~0.1–0.4) than creative tasks. You need reliable JSON, not poetic JSON.
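Putting those three knobs together, a single agent step might look like the sketch below. The parameter names vary by provider, and `llm.generate` is the same placeholder client used in the loop at the end of this post:

```python
response = llm.generate(
    messages,
    temperature=0.2,                  # low temperature: reliable JSON, not poetic JSON
    max_tokens=800,                   # budget for one reasoning step plus one tool call
    stop_sequences=["Observation:"],  # stop before the model invents a tool result
)
```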
Let's be precise about the transformer internals here, because this is what I always cover in workshops and what most blog posts skip.
In a decoder-only transformer, each token's representation is computed as a weighted sum over the value vectors of the tokens before it, with the attention weights determining how much each one contributes. During an agentic loop, this means the model's "reasoning" at step N is informed by everything: the original task, every prior think-step, every tool call, every observation. The KV cache stores the key and value projections for all prior tokens, so incremental generation is efficient: you don't recompute the whole context for every new token.
Attention is not bounded to "recent" steps. As long as the original task is still in the context window, the model can attend to it directly, even 40 steps later.
Why context window size is an architectural decision, not a setting: with a KV cache, every new token the agent generates attends to all n tokens already in context, so per-token cost grows linearly with context length (and reprocessing a long context from scratch is quadratic). Long chains in large contexts get expensive fast. Budget your context like memory: know when to summarize or use RAG to compress old observations.
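One practical consequence: treat the message list like a memory budget. The sketch below assumes a `count_tokens` helper and a `summarize` call (both placeholders for whatever tokenizer and summarization step you actually use) and simply shrinks older tool observations once a budget is exceeded:

```python
CONTEXT_BUDGET = 60_000  # tokens; derive from your model's window minus headroom

def compress_if_needed(messages: list[dict], count_tokens, summarize) -> list[dict]:
    """Compress old tool observations when the context grows past the budget.

    count_tokens(messages) -> int and summarize(text) -> str are placeholders.
    """
    if count_tokens(messages) <= CONTEXT_BUDGET:
        return messages
    compressed = []
    for msg in messages:
        # Shrink older tool observations; leave everything else, and the most
        # recent message, intact.
        if msg["role"] == "tool" and msg is not messages[-1]:
            compressed.append({"role": "tool", "content": summarize(msg["content"])})
        else:
            compressed.append(msg)
    return compressed
```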
Single agents hit limits: context overflow, task complexity, and parallelism. The solution is multi-agent orchestration — multiple transformer instances, each running its own ReAct loop, coordinated by an orchestrator.
| Pattern | Structure | When to use |
|---|---|---|
| Single agent | One model, one loop | Tasks with <10 steps, small context |
| Supervisor + workers | Orchestrator model routes sub-tasks to specialized agents | Parallel work, role specialization |
| Pipeline | Agent A's output = Agent B's input | Sequential, verifiable stages (write → review → deploy) |
| Debate / critic | Two agents produce + critique | High-stakes outputs requiring self-correction |
Each worker is an independent transformer instance with its own context. The orchestrator aggregates results in its context window.
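A sketch of the supervisor-plus-workers pattern in the same pseudocode style. The role prompts and the task-splitting convention are illustrative assumptions, and `run_worker` stands in for a single-agent ReAct loop like the one at the end of this post:

```python
def run_supervisor(task: str, run_worker) -> str:
    """Minimal supervisor: split the task, fan sub-tasks out to workers, aggregate.

    run_worker(role_prompt, sub_task) -> str is a placeholder for a single-agent
    ReAct loop parameterized by a role prompt.
    """
    plan = run_worker(
        "You are a planner. Split the task into independent sub-tasks, one per line.",
        task,
    )
    sub_tasks = [line.strip() for line in plan.splitlines() if line.strip()]

    # Each worker is an independent transformer instance with its own context.
    results = [run_worker("You are a specialist worker.", sub) for sub in sub_tasks]

    # The supervisor aggregates worker outputs in its own context window.
    return run_worker(
        "You are a supervisor. Combine these results into one coherent answer.",
        task + "\n\n" + "\n\n".join(results),
    )
```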
Agentic systems fail in slow, compounding ways. An LLM chatbot either answers or it doesn't. An agent can drift across 20 steps, confidently doing the wrong thing, with every step looking plausible in isolation.
| Failure mode | Root cause | Mitigation |
|---|---|---|
| Context drift | Early task context gets diluted by many observations; attention weights shift | Re-state the goal every N steps; use a "goal anchor" in system prompt |
| Tool hallucination | Model invents tool names or args not in schema | Strict JSON schema validation; retry with error feedback in context |
| Infinite loops | No stopping condition; model keeps seeking more info | Hard step limit; detect repeated tool calls |
| Context overflow | Long chains exceed the context window | Sliding window; RAG compression; hierarchical agents |
| Cascading errors | Bad tool result poisons all subsequent reasoning | Error-handling branches; retry with different strategy in prompt |
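The most mechanical of these mitigations, schema validation with retry feedback, fits in a few lines. The check below uses plain Python for illustration (a real system might use a JSON Schema validator), and the error-message wording is an assumption:

```python
import json

def validate_tool_call(raw: str, registry: dict) -> tuple[dict | None, str | None]:
    """Return (tool_call, None) if valid, otherwise (None, error message for the model)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Your last output was not valid JSON ({exc}). Re-emit the tool call."
    if not isinstance(call, dict) or call.get("tool") not in registry:
        return None, f"Output must be a JSON object naming one of these tools: {sorted(registry)}."
    if not isinstance(call.get("args"), dict):
        return None, "The 'args' field must be a JSON object."
    return call, None
```

On failure, append the error string to the conversation as a tool-role message and generate again; the model usually corrects itself once the error is in context.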
Here's the minimal pattern. Python pseudocode, but the logic applies everywhere:
```python
# 1. Define your tools (just Python functions + docstrings)
def search_web(query: str) -> str: ...
def run_python(code: str) -> str: ...

def run_agent(task: str) -> str:
    # 2. Build a context window
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT + tool_schemas},
        {"role": "user", "content": task},
    ]

    # 3. ReAct loop
    for step in range(MAX_STEPS):
        response = llm.generate(messages, temperature=0.2)

        if response.is_final_answer:
            return response.text

        # Parse the tool call from JSON output
        tool_call = parse_json(response.text)
        result = dispatch(tool_call)  # routes to the right Python function

        # Append both the model's output AND the result to context
        messages.append({"role": "assistant", "content": response.text})
        messages.append({"role": "tool", "content": str(result)})

    # If we hit MAX_STEPS without a final answer, surface that to the user
    raise AgentTimeoutError("Max steps reached")
```
That's the core. Every agentic framework — LangChain, LlamaIndex, CrewAI, the Anthropic Claude SDK — is this loop with more error handling, observability, and tool abstractions around it.
Agentic AI isn't a new paradigm — it's the same transformer you know from my workshops, extended with a loop. The model's attention mechanism is its working memory. The context window is the agent's state. Tool outputs are just tokens.
Once you see it this way, you can reason about agent failure the same way you reason about transformer behavior: it's a function of what's in the context, how attention distributes, and whether the training distribution covers what you're asking it to do.
Topics we'll go deeper on in the next session: