LinkedIn Article · AI Engineering

Going Agentic:
From Tokens to
Autonomous Action

A technical primer for beginners — grounded in how transformers actually work.

Tags: Transformers · Agentic AI · ReAct Loops · Tool Use

If you've attended my workshops, you've heard me say this: "A language model on its own is stateless, memoryless, and passive." It waits. It responds. It forgets. So how do we get from that — a next-token predictor — to an autonomous agent that can book your flights, write code, run tests, and fix its own bugs?

That's what going agentic means. And it builds directly on transformer fundamentals you already know. Let's go layer by layer.


What the transformer is actually doing

Before we build an agent, let's re-anchor on the mechanism. A transformer takes a sequence of tokens, processes them through stacked self-attention + feed-forward layers, and outputs a probability distribution over the vocabulary. That's it.

The critical insight for agentic AI: the output doesn't have to be prose. It can be structured JSON. A function call. A decision. This is the bridge from "chatbot" to "agent."

Fig 1 — From next-token prediction to action selection
[Diagram: input tokens "Search" / "for" / "flights" → Transformer (attention + FFN × N layers) → logits P(vocab) → sample → structured action {"action": "search_flights", "origin": "PHX", "dest": "JFK"}. Same model, different prompt.]

The transformer doesn't change — only what we ask it to generate.

Key insight: "Tool calling" is not a special mode. It's just token generation with a prompt that says: "If you need to act, output JSON in this schema." The model samples from its vocabulary distribution — and we've shaped that distribution to prefer well-formed JSON over prose.
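To make that concrete, here is a minimal sketch of prompt-shaped tool calling. The schema, prompt wording, and sample completion are illustrative assumptions, not any provider's real API:

```python
import json

# Hypothetical tool schema: the names and fields here are illustrative.
SEARCH_FLIGHTS_SCHEMA = {
    "name": "search_flights",
    "description": "Search for flights between two airports.",
    "parameters": {"origin": "IATA code", "dest": "IATA code", "date": "YYYY-MM-DD"},
}

def build_system_prompt(schema: dict) -> str:
    """Shape the output distribution with plain instructions: the model is
    still just sampling tokens, but the prompt makes well-formed JSON likely."""
    return (
        "You are a travel agent. If you need to act, output ONLY a JSON object "
        f"matching this schema, with no surrounding prose:\n{json.dumps(schema)}"
    )

# What a well-behaved completion might look like. It is ordinary text until we parse it.
raw_completion = '{"action": "search_flights", "origin": "PHX", "dest": "JFK"}'
action = json.loads(raw_completion)  # prose in, structured action out
```

There is no special decoding pathway here: the "tool call" is just text that happens to parse.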


The ReAct loop: Reason → Act → Observe

A single forward pass gives you one response. An agent chains many. The ReAct pattern (Yao et al., 2022) is the most common architecture: the model alternates between reasoning in natural language and executing tool calls, with each observation fed back as context for the next step.

This works because of something fundamental about transformer attention: the entire history is in the context window. Every prior reasoning step and observation is attended to at every layer. The model doesn't just see the last message — it attends across the full trajectory.

Fig 2 — The ReAct agentic loop
[Diagram: Task (user goal) → Think (chain-of-thought in context) → Act (tool-call JSON → API) → Observe (tool result → context) → back to Think, until Done (final answer). The context window accumulates: system prompt + task + [think + act + observe] × n.]

Each loop iteration appends tokens to context. The transformer attends to the full history at every step.

Here's the transformer connection that often gets missed: every observation from a tool becomes part of the key-value (KV) cache. When the model "thinks" in the next step, self-attention queries span the entire trajectory. The model isn't summarizing what happened — it's attending to it directly. This is why long context windows matter enormously for agentic reliability.

Thinking: Chain-of-thought in context

The model generates "scratch-pad" tokens — reasoning steps like "I need to search for flights from Phoenix to JFK. Let me call search_flights." These tokens are in the context window. They attend to the original task, system prompt, and any prior results. This is just autoregressive generation — no magic.

In your prompt, you typically say: "Think step-by-step. When you need to act, output a JSON tool call."

Acting: Structured JSON → external API

The model outputs structured JSON instead of prose. Your orchestrator (Python, TypeScript, whatever) parses that JSON, routes it to the right function — a search API, a database query, a code executor — and captures the result.

Example output: {"tool": "search_flights", "args": {"origin": "PHX", "dest": "JFK", "date": "2026-05-01"}}
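Here is a minimal orchestrator-side sketch of that parse-and-route step. The tool registry, `dispatch` signature, and stub `search_flights` implementation are illustrative assumptions:

```python
import json

# Stub tool implementation: a stand-in for a real flight-search API.
def search_flights(origin: str, dest: str, date: str) -> str:
    return f"3 flights found {origin}->{dest} on {date}"

TOOL_REGISTRY = {"search_flights": search_flights}

def dispatch(raw_model_output: str) -> str:
    """Parse the model's JSON and route it to the matching Python function."""
    call = json.loads(raw_model_output)
    fn = TOOL_REGISTRY[call["tool"]]  # KeyError here means a hallucinated tool name
    return fn(**call["args"])         # TypeError here means bad arguments

result = dispatch('{"tool": "search_flights", '
                  '"args": {"origin": "PHX", "dest": "JFK", "date": "2026-05-01"}}')
# result is a plain string, ready to be appended to context as an observation
```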

Observing: Results become tokens

The tool result gets serialized as a string and appended to the conversation as a new message — typically tagged as a "tool result" role. This extends the context window. On the next forward pass, every attention head can query this result.

This is the secret: the model "remembers" results not through internal state, but through extended context that it attends to.

Repeat until done

The orchestrator checks if the model returned a final answer or another tool call. If it's a tool call, repeat. If it's a final answer, return it to the user. Simple loops run 3–5 iterations. Complex agents may run dozens.

Your main engineering challenge: stopping conditions, error handling, and context window budget management.


Anatomy of a production agent

A well-built agent has five distinct concerns. Understanding which layer handles what will save you days of debugging.

The model — the transformer brain

This is your frozen pre-trained network. It brings world knowledge, reasoning ability, and the capacity to produce structured outputs. You're not training it — you're steering its generation through context.

Key parameters you control: temperature (lower = more deterministic tool calls), max_tokens (budget per step), and stop_sequences (tell it when to stop and wait for a tool result).

Agentic tasks often want lower temperature (~0.1–0.4) than creative tasks. You need reliable JSON, not poetic JSON.
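As a rough sketch, the two parameter profiles might look like this. Exact parameter names vary by provider and SDK; treat these as illustrative defaults, not recommendations:

```python
# Illustrative request parameters for an agentic step.
AGENT_STEP_PARAMS = {
    "temperature": 0.2,        # low: reliable, near-deterministic JSON tool calls
    "max_tokens": 1024,        # per-step budget: one thought plus one tool call
    "stop_sequences": ["</tool_call>"],  # halt so the orchestrator can run the tool
}

# Illustrative parameters for a creative-writing task, for contrast.
CREATIVE_PARAMS = {
    "temperature": 0.9,        # high: diverse, surprising continuations
    "max_tokens": 4096,
}
```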


How attention makes agents work

Let's be precise about the transformer internals here, because this is what I always cover in workshops and what most blog posts skip.

In a standard transformer, each token's representation is computed as a weighted sum of all other tokens — the attention weights. During an agentic loop, this means the model's "reasoning" at step N is informed by everything: the original task, every prior think-step, every tool call, every observation. The KV cache stores the key and value projections for all prior tokens, so incremental generation is efficient — you don't recompute the whole context on every new token.
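Here is a toy single-head sketch of that incremental decoding: each new token projects one query, key, and value; the key and value go into the cache; and attention spans the full cached history. Random weights stand in for a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # toy head dimension
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []                # grows across the agent loop

def decode_step(x: np.ndarray) -> np.ndarray:
    """One incremental step: project only the NEW token, attend over all cached keys."""
    q = x @ W_q
    k_cache.append(x @ W_k)              # cache; never recompute old tokens
    v_cache.append(x @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)          # attention over the FULL history
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

for t in range(5):                       # e.g. tokens from think/act/observe steps
    out = decode_step(rng.standard_normal(d))
```

The cache length equals the full trajectory length, which is exactly why observations from step 1 are still "visible" at step 40.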

Fig 3 — KV cache accumulation across agent steps
[Diagram: the KV cache grows with each agent step: System + Task (step 0) → Think 1 → Act 1 → Observe 1 → Think 2 → … All prior keys/values stay cached with no recompute; Think 2 attends to all prior tokens, with the highest attention weight on the tool result.]

Attention is not bounded to "recent" steps. The model can attend to the original task with full weight, even 40 steps later.

Why context window size is an architectural decision, not a setting: Each agent step costs O(n) in attention computation (n = current context length). Long chains in large contexts get expensive fast. Budget your context like memory — know when to summarize or use RAG to compress old observations.
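A quick back-of-the-envelope sketch of that cost: summing per-token attention over a growing context gives roughly quadratic total cost, so doubling the trajectory length quadruples the attention work (the token counts below are toy numbers):

```python
def total_attention_cost(step_tokens: list[int]) -> int:
    """Sum of per-token attention cost: each new token attends to all prior tokens."""
    cost, n = 0, 0
    for new in step_tokens:
        for _ in range(new):
            cost += n                    # this token attends to n prior tokens
            n += 1
    return cost

c10 = total_attention_cost([500] * 10)   # 10-step agent, 500 tokens per step
c20 = total_attention_cost([500] * 20)   # 2x the tokens -> roughly 4x the cost
```

This is why summarizing or RAG-compressing old observations pays off so heavily on long chains.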


Multi-agent systems: when one loop isn't enough

Single agents hit limits: context overflow, task complexity, and parallelism. The solution is multi-agent orchestration — multiple transformer instances, each running its own ReAct loop, coordinated by an orchestrator.

Pattern | Structure | When to use
Single agent | One model, one loop | Tasks with <10 steps, small context
Supervisor + workers | Orchestrator model routes sub-tasks to specialized agents | Parallel work, role specialization
Pipeline | Agent A's output = Agent B's input | Sequential, verifiable stages (write → review → deploy)
Debate / critic | Two agents produce + critique | High-stakes outputs requiring self-correction
Fig 4 — Supervisor / worker multi-agent pattern
[Diagram: User → Orchestrator (a transformer model that plans and routes) → Search Agent (web + retrieval tools), Code Agent (exec + test tools), Writer Agent (doc + format tools) → merged result back to the user. Worker results are injected into the orchestrator's context for a final synthesis pass.]

Each worker is an independent transformer instance with its own context. The orchestrator aggregates results in its context window.
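A minimal sketch of the supervisor/worker pattern, with plain functions standing in for worker model instances. The agent names and the plan format are illustrative assumptions:

```python
# Stub workers: in a real system each is a model instance running its own
# ReAct loop over its own context window.
def search_agent(subtask: str) -> str:
    return f"[search result for: {subtask}]"

def code_agent(subtask: str) -> str:
    return f"[code + passing tests for: {subtask}]"

WORKERS = {"search": search_agent, "code": code_agent}

def orchestrate(task: str, plan: list[tuple[str, str]]) -> str:
    """Route sub-tasks to workers, then merge results for the supervisor's context."""
    observations = [WORKERS[role](subtask) for role, subtask in plan]
    # A real supervisor model would make a final synthesis pass over these.
    return "\n".join(observations)

summary = orchestrate(
    "Build a flight-price tracker",
    [("search", "current PHX->JFK fares"), ("code", "price-alert script")],
)
```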


What actually goes wrong (and why)

Agentic systems fail in slow, compounding ways. An LLM chatbot either answers or it doesn't. An agent can drift across 20 steps, confidently doing the wrong thing, with every step looking plausible in isolation.

Failure mode | Root cause | Mitigation
Context drift | Early task context gets diluted by many observations; attention weights shift | Re-state the goal every N steps; use a "goal anchor" in the system prompt
Tool hallucination | Model invents tool names or args not in the schema | Strict JSON schema validation; retry with error feedback in context
Infinite loops | No stopping condition; model keeps seeking more info | Hard step limit; detect repeated tool calls
Context overflow | Long chains exceed the context window | Sliding window; RAG compression; hierarchical agents
Cascading errors | A bad tool result poisons all subsequent reasoning | Error-handling branches; retry with a different strategy in the prompt

Building your first agent in ~50 lines

Here's the minimal pattern. Python pseudocode, but the logic applies everywhere:

# 1. Define your tools (just Python functions + docstrings)
def search_web(query: str) -> str: ...
def run_python(code: str) -> str: ...

class AgentTimeoutError(RuntimeError):
    """Raised when the loop hits MAX_STEPS without a final answer."""

def run_agent(task: str) -> str:
    # 2. Build a context window
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT + tool_schemas},
        {"role": "user",   "content": task},
    ]

    # 3. ReAct loop
    for step in range(MAX_STEPS):
        response = llm.generate(messages, temperature=0.2)

        if response.is_final_answer:
            return response.text

        # Parse the tool call from the JSON output
        tool_call = parse_json(response.text)
        result = dispatch(tool_call)  # routes to the matching Python function

        # Append both the model's output AND the result to context
        messages.append({"role": "assistant", "content": response.text})
        messages.append({"role": "tool",      "content": str(result)})

    # If we hit MAX_STEPS without a final answer, surface that to the user
    raise AgentTimeoutError("Max steps reached")

That's the core. Every agentic framework — LangChain, LlamaIndex, CrewAI, the Anthropic Claude SDK — is this loop with more error handling, observability, and tool abstractions around it.

The core insight to take home

Agentic AI isn't a new paradigm — it's the same transformer you know from my workshops, extended with a loop. The model's attention mechanism is its working memory. The context window is the agent's state. Tool outputs are just tokens.

Once you see it this way, you can reason about agent failure the same way you reason about transformer behavior: it's a function of what's in the context, how attention distributes, and whether the training distribution covers what you're asking it to do.

Topics we'll go deeper on in the next session: