
I Built a Multi-Agent AI Runtime in Go Because Python Wasn't an Option

DEV Community · by Clinton Adedeji · April 4, 2026 · 13 min read


The idea that started everything

Some weeks ago, I was thinking about Infrastructure as Code.

The reason IaC became so widely adopted is not because it's technically superior to clicking through a cloud console. It's because it removed the barrier between intent and execution. You write what you want, not how to do it. A DevOps engineer doesn't need to understand the internals of how an EC2 instance is provisioned — they write a YAML file, and the machine figures it out.

I started wondering: why doesn't this exist for AI agents?

If I want to run a multi-agent workflow today, I have two choices. I learn Python and use LangGraph or CrewAI, or I build my own tooling from scratch. Neither option is satisfying. The first forces me into an ecosystem and a language I might not want. The second means rebuilding primitives every time.

What if I could write a YAML file that described what I wanted — which agents, which tools, which LLM providers — and a runtime would just handle the rest? What if a non-developer could read that file and understand what the system does? What if I didn't have to understand how an agent works internally before I could use one?

That question became Routex.

Why Go, not Python

Nearly every AI agent framework that exists today is written in Python. LangChain, LangGraph, CrewAI, AutoGen — Python all the way down. And for good reason: Python has the richest ML ecosystem, the most tutorials, and the lowest barrier to entry for data scientists.

But I'm a Go developer. And I kept thinking: Go should be a natural fit for this.

Here's why. An AI agent is fundamentally a concurrent system. An agent waits for an LLM response, executes tools, waits for tool results, calls the LLM again. Multiple agents run in parallel, passing results to each other through a dependency graph. This is exactly what Go was designed for.

Goroutines are cheap enough that you can run one per agent without thinking about thread pool sizing. Channels give you typed, safe communication between agents without shared state. The context package gives you cancellation and timeout propagation that flows naturally through the entire call stack. You get a single, statically compiled binary you can deploy anywhere without a runtime, a virtualenv, or a requirements.txt.

Go already had everything the problem needed — it just didn't have the framework yet. So I built it.

What Routex looks like to a user

The core idea is that you should be able to describe an entire multi-agent crew in a YAML file, run it with a single command, and get results — without writing a single line of Go. Here's what that looks like:

agents.yaml:

```yaml
runtime:
  name: "research-crew"
  llm_provider: "anthropic"
  model: "claude-haiku-4-5-20251001"
  api_key: "env:ANTHROPIC_API_KEY"

task:
  input: "Compare the top Go web frameworks in 2026"

agents:
  - id: "researcher"
    role: "researcher"
    goal: "Find detailed information about Go web frameworks"
    tools: ["web_search", "wikipedia"]

  - id: "writer"
    role: "writer"
    goal: "Write a clear, structured report from the research"
    depends: ["researcher"]

tools:
  - name: "web_search"
  - name: "wikipedia"
```

Run it:

```sh
routex run agents.yaml
```

That's the entire user experience for the common case. The researcher runs first, uses web search and Wikipedia to gather information, then the writer agent picks up those results and produces a report. The dependency is declared — depends: ["researcher"] — and the runtime handles the ordering automatically.

A non-developer can read this file and understand exactly what it does. A developer can extend it with custom tools, different LLM providers per agent, Redis-backed memory, and OpenTelemetry tracing — all from YAML, all without touching the runtime code.

The technical core: goroutines, channels, and a topological scheduler

Under the YAML surface, Routex is built on three Go primitives: goroutines, channels, and a topological sort.

Each agent is a long-lived goroutine. It sits waiting on an Inbox channel. The scheduler sends it a task, it runs its thinking loop — calling the LLM, executing tools, calling the LLM again — and sends its result back through an Output channel. This model maps so naturally onto Go that the core agent loop is less than fifty lines.
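As a sketch of that shape, an agent-as-goroutine looks roughly like the following. The Task and Result types here are stand-ins, not Routex's real message types, and the thinking loop is stubbed out:

```go
package main

import "fmt"

// Task and Result stand in for Routex's real message types (assumed names).
type Task struct{ Input string }
type Result struct{ Output string }

// runAgent is a minimal sketch of the agent-as-goroutine model: block on an
// inbox channel, run the "thinking loop" (stubbed here), send the result back.
func runAgent(id string, inbox <-chan Task, output chan<- Result) {
	for task := range inbox {
		// In the real loop this is: call LLM, execute tools, call LLM again.
		output <- Result{Output: id + " handled: " + task.Input}
	}
}

func main() {
	inbox := make(chan Task)
	output := make(chan Result)
	go runAgent("researcher", inbox, output)

	inbox <- Task{Input: "Go web frameworks"}
	fmt.Println((<-output).Output) // researcher handled: Go web frameworks
}
```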

The scheduler uses Kahn's algorithm to determine execution order. It builds a dependency graph from your YAML, identifies which agents have no dependencies, and runs them all in parallel as the first "wave." When that wave completes, it identifies agents whose dependencies are now satisfied and runs those. This continues until all agents have run.

In practice, this means independent agents run concurrently without you having to think about it. If you have three researcher agents gathering data about different topics, they all run at almost the same time. The writer agent waits until all three are done, then synthesizes their results in a single pass.
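The wave idea can be sketched in a few lines. This is an illustrative Kahn-style grouping, not Routex's actual scheduler: waves returns groups of agent IDs whose dependencies are all satisfied by earlier groups, so each group can run concurrently.

```go
package main

import (
	"fmt"
	"sort"
)

// waves groups agent IDs into execution "waves". deps maps an agent ID to
// the IDs it depends on; every agent in a wave has all its dependencies
// completed by earlier waves.
func waves(deps map[string][]string) [][]string {
	done := map[string]bool{}
	var out [][]string
	for len(done) < len(deps) {
		var wave []string
		for id, ds := range deps {
			if done[id] {
				continue
			}
			ready := true
			for _, d := range ds {
				if !done[d] {
					ready = false
					break
				}
			}
			if ready {
				wave = append(wave, id)
			}
		}
		if len(wave) == 0 {
			panic("dependency cycle detected")
		}
		sort.Strings(wave) // deterministic output for the example
		for _, id := range wave {
			done[id] = true
		}
		out = append(out, wave)
	}
	return out
}

func main() {
	deps := map[string][]string{
		"researcher-a": nil,
		"researcher-b": nil,
		"writer":       {"researcher-a", "researcher-b"},
	}
	fmt.Println(waves(deps)) // [[researcher-a researcher-b] [writer]]
}
```

Note that an empty wave with agents still pending means no agent is runnable, which is exactly the cycle case Kahn's algorithm detects.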

The thing I didn't plan for: what happens when an agent fails

I finished the scheduler and felt good about it. Then I realised I had a problem.

LLM calls fail. They time out, hit rate limits, return malformed responses. If an agent fails halfway through a crew run, every agent that depends on it gets the wrong answer — or no answer at all. The writer agent would try to synthesise results that don't exist. The whole run corrupts silently.

I needed a way to handle failure that was more principled than wrapping everything in a retry loop.

I started reading about how Erlang handles this problem. Erlang was built by Ericsson in the 1980s for telephone switches — systems that cannot go down. Their solution was the supervision tree: every process is watched by a supervisor, and when a process crashes, the supervisor decides what to do based on a policy. The philosophy is "let it crash" — don't write defensive code trying to handle every possible failure, just let things fail fast and trust the supervisor to recover cleanly.

This maps perfectly onto agents. An agent fails — the supervisor checks its policy:

  • one_for_one — restart only this agent, leave the others running

  • one_for_all — restart the entire crew

  • rest_for_one — restart this agent and everything that depends on it

The supervisor also tracks a restart budget. If an agent crashes three times within one minute, the supervisor stops trying and declares it permanently failed rather than looping forever burning API tokens.

In Routex, you configure this in one line:

```yaml
agents:
  - id: "researcher"
    role: "researcher"
    restart: "one_for_one"
```

The bug that taught me everything about channel protocols

Here is a story about a bug I will not forget.

I had finished the supervisor. Restart policies were working. The supervisor correctly restarted failed agents. I was feeling very good about myself.

Then I ran a test with a researcher agent that was configured to fail on its first attempt and recover on the second. I kicked it off and watched the logs. The supervisor saw the failure. It applied the one_for_one policy. It restarted the agent goroutine. The logs said:

```
supervisor: agent "researcher" restarted
agent "researcher": waiting for message
```

And then... nothing. The application just sat there.

No timeout error. No panic. No LLM calls. Just silence.

I stared at my terminal for a full two minutes assuming the LLM was being slow. I went and made tea. I came back. Still nothing. I started wondering if the Anthropic API was down. I checked the Anthropic status page. Everything was fine.

I added more logging. The agent was alive — it was sitting in its select loop, genuinely waiting for a message on its Inbox channel. It had been restarted correctly. The problem was that the scheduler had no idea.

The scheduler had sent the original task to the agent's Inbox before the failure. The agent crashed mid-run. The supervisor restarted a fresh agent goroutine. That fresh goroutine was now sitting patiently waiting for a new task to arrive on the channel — which it never would, because from the scheduler's perspective, the task had already been sent. The scheduler was blocked waiting for a result from the old goroutine that no longer existed.

Two goroutines. Both alive. Both waiting. Neither knowing the other was waiting. A perfect deadlock dressed up as a slow LLM.

The fix was the FailureReport / Decision protocol. The scheduler now never moves on after a failure until the supervisor explicitly tells it what to do:

```go
// Scheduler sends this when an agent fails
type FailureReport struct {
	AgentID string
	Err     error
	Reply   chan<- Decision
}

// Supervisor responds with this
type Decision struct {
	AgentID string
	Retry   bool
	Err     error
}
```

When an agent fails, the scheduler sends a FailureReport and blocks on the Reply channel. The supervisor restarts the agent, then sends Decision{Retry: true} back. The scheduler receives this, re-sends the original task to the agent's Inbox, and waits for the result again.

Now the scheduler always knows. The agent always gets its message. And I no longer spend time checking the Anthropic status page when my own code is broken.
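Putting the two structs together, the handshake looks roughly like this toy version, where the supervisor restarts an agent once and then gives up. The retry policy here is invented for the example; the important part is that the scheduler blocks on Reply until a Decision arrives:

```go
package main

import (
	"errors"
	"fmt"
)

type Decision struct {
	AgentID string
	Retry   bool
	Err     error
}

type FailureReport struct {
	AgentID string
	Err     error
	Reply   chan<- Decision
}

// supervise is a toy supervisor: allow one restart per agent, then give up.
func supervise(failures <-chan FailureReport) {
	seen := map[string]int{}
	for f := range failures {
		seen[f.AgentID]++
		f.Reply <- Decision{AgentID: f.AgentID, Retry: seen[f.AgentID] <= 1, Err: f.Err}
	}
}

func main() {
	failures := make(chan FailureReport)
	go supervise(failures)

	// Scheduler side: report the failure, then BLOCK until the supervisor
	// decides. Only after Retry arrives does it re-send the original task.
	reply := make(chan Decision)
	failures <- FailureReport{AgentID: "researcher", Err: errors.New("rate limited"), Reply: reply}
	d := <-reply
	fmt.Println(d.Retry) // true: first failure, so retry
}
```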

Parallel tool calls: the LLM asked for three tools at once

When a language model responds with a tool call, most agent frameworks execute it, wait for the result, then call the LLM again. One tool at a time, sequentially.

But modern LLMs can request multiple tools in a single response when those tools are independent. Claude might decide it needs to search the web, read a file, and query Wikipedia simultaneously — and return all three requests in one response. Running them sequentially wastes time.

In Routex, when the LLM returns multiple tool calls, they all execute concurrently:

```go
var wg sync.WaitGroup
results := make([]toolResult, len(toExecute))

for i, tc := range toExecute {
	wg.Add(1)
	go func(i int, tc llm.ToolCallRequest) {
		defer wg.Done()
		out, err := registry.Execute(ctx, tc.ToolName, tc.Input)
		results[i] = toolResult{output: out, err: err}
	}(i, tc)
}

wg.Wait()
```

All results are appended to history in order before the next LLM call. From the LLM's perspective, it asked for three tools and got three results — the parallelism is invisible to it, but the wall-clock time is the slowest single tool rather than the sum of all three.

Calling LLM APIs directly with net/http

Every LLM SDK is just an HTTP client under the hood. Both the Anthropic and OpenAI adapters in Routex use net/http directly — no anthropic-sdk-go, no go-openai in go.mod. The wire format is straightforward JSON over HTTP.

```go
req, err := http.NewRequestWithContext(ctx, http.MethodPost, c.baseURL+"/v1/messages", body)
req.Header.Set("x-api-key", c.apiKey)
req.Header.Set("anthropic-version", "2023-06-01")
req.Header.Set("content-type", "application/json")
```

That is the entire Anthropic adapter setup. Removing the SDKs dropped several megabytes of transitive dependencies from the binary and made the HTTP layer completely transparent — no SDK abstractions, no version mismatches, no wrapping errors in SDK-specific types. When the API changes, you update a struct. That's it.

Multi-LLM crews: different models for different jobs

One pattern that emerges naturally from the YAML-driven design is using different LLM providers for different agents:

```yaml
agents:
  - id: "researcher"
    role: "researcher"
    llm:
      provider: "anthropic"
      model: "claude-haiku-4-5-20251001"   # fast, cheap

  - id: "writer"
    role: "writer"
    llm:
      provider: "openai"
      model: "gpt-4o"                      # more capable

  - id: "critic"
    role: "critic"
    llm:
      provider: "ollama"
      model: "llama3"                      # local, free
```

Each agent has its own LLM configuration. You can run Claude for research, GPT-4o for writing, and a local Llama model for review — all in the same crew, all declared in YAML.

MCP: connecting to the entire ecosystem

Model Context Protocol is Anthropic's open standard for connecting LLMs to external tools via JSON-RPC. Any MCP-compatible server exposes a standard interface that Routex can connect to at startup:

```yaml
tools:
  - name: "mcp"
    extra:
      server_url: "http://localhost:3000"
      server_name: "github"
      header_Authorization: "env:GITHUB_TOKEN"
```

Routex connects, calls tools/list to discover everything the server exposes, and registers each tool automatically. From that point, agents use them exactly like built-in tools.

What the build looked like

The project went through several distinct phases, each with its own surprises.

The YAML config and basic agent loop came together relatively quickly. The topological scheduler took longer — most of that time went into making sure cycles were detected cleanly and parallel waves executed correctly. The supervisor was the hardest part by far — not the restart logic itself, but making the channel protocol between the scheduler and supervisor airtight. The deadlock story above is the most vivid evidence of that difficulty.

Parallel tool calls came late in the project, after I noticed the LLM was sometimes requesting multiple tools in one response and the runtime was silently discarding all but the first. Once I understood the pattern, the implementation was clean — but the change rippled through the history format, both LLM adapters, and the agent loop simultaneously.

What I'd do differently: Start with the supervision model. I bolted it on after the scheduler was built, which meant retrofitting the channel protocol. If I were starting again, I'd design the scheduler–supervisor communication contract first and build everything else around it. The deadlock I described above would likely never have happened.

Try it

Routex v1.0.1 is available now:

```sh
go get github.com/Ad3bay0c/routex
```

Or install the CLI:

```sh
go install github.com/Ad3bay0c/routex/cmd/routex@latest
```

Scaffold a new project:

```sh
routex init my-crew
cd my-crew
```

Update the generated agents.yaml, copy .env.example to .env, fill in the environment values, then run:

```sh
routex run agents.yaml
```

If you're a Go developer who has been watching the AI agent ecosystem from the sidelines — Routex is for you.

Routex is open source under the MIT License. Source, examples, and documentation: https://github.com/Ad3bay0c/routex
