Hermes — The Complete Guide
Gardax — the network's worker agent: generates, analyzes, self-heals
In my network today, Hermes is Gardax: the comic-cast character alongside Kami, Kaylee, Box and Solis, and the network's studio/worker agent. It takes jobs from Claude Code (the orchestrator) and returns structured output: asset generation, data science, and coding tasks delegated to other coding agents. It runs on free Gemini and chats on Telegram in text and voice.
The name 'Hermes' stays because that's what it grew from: a self-healing infrastructure CLI written in Go (v0.8.0 in my stack). Its philosophy is a whitelist of permitted actions, verification after every fix, and learning from recurring failures, organized as a five-stage architecture: detect → diagnose → fix → verify → learn. It runs as a cron job or a webhook responder and persists history to SQLite/JSON. In my setup it performs autoheal for Kami and for OpenClaw (the engine behind Kaylee). For you, it's a pattern you can adopt with any CLI (or even bash scripts), because the five stages fit any production system, not just AI agents.
What this guide covers
What is Hermes? A doctor living in your server's ER
A Go CLI that detects an incident, diagnoses, fixes, verifies, and learns — without waking you
Hermes is a self-healing infrastructure CLI: a tool written in Go that runs on the server like a teammate who never sleeps. The idea is simple but powerful. As a rule of thumb, about 90% of production failures are the same 10 or so recurring problems (a container that fell over, a stuck network connection, a disk that filled up). Hermes recognizes that pattern and, instead of waking you every time, triggers a five-stage sequence: detect, diagnose, fix, verify, and learn (for the next time). In my own setup it performs autoheal for Kami and for OpenClaw (the engine behind Kaylee), but it's a pattern you can adopt on any stack: not just AI agents, but any production service. The real savings are in your sleep and in the PagerDuty bill you never have to pay again.
The Pattern in detail — how to wire up the 5 stages
Each stage is simple and testable on its own; together they form a self-healing loop
The beauty of the Hermes pattern is that each stage is a short, independently testable function, which is exactly why you can start with a minimal version (an hour's work) and grow it incrementally. This is in the spirit of the SRE practice popularized by Google: a self-healing system is built from small, safe steps, not as one giant monolith.
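To make the five stages concrete, here is a minimal sketch of the loop in Go. It is not Hermes's actual source: the `Incident`, `Stages`, and `RunOnce` names are hypothetical, and each stage is just a function field so it can be tested on its own.

```go
package main

import (
	"errors"
	"fmt"
)

// Incident is a hypothetical record of one detected problem.
type Incident struct {
	Service string
	Symptom string
}

// Stages holds the five steps as plain functions so each can be unit-tested alone.
type Stages struct {
	Detect   func() (Incident, bool)
	Diagnose func(Incident) string
	Fix      func(action string) error
	Verify   func(Incident) bool
	Learn    func(inc Incident, action string, verified bool)
}

// ErrNotVerified means the fix command ran but the service did not recover.
var ErrNotVerified = errors.New("fix ran but verification failed")

// RunOnce performs one pass: detect, diagnose, fix, verify, learn.
// It returns handled=false when Detect finds nothing to do.
func RunOnce(s Stages) (bool, error) {
	inc, found := s.Detect()
	if !found {
		return false, nil
	}
	action := s.Diagnose(inc)
	if fixErr := s.Fix(action); fixErr != nil {
		s.Learn(inc, action, false)
		return true, fmt.Errorf("fix %q: %w", action, fixErr)
	}
	verified := s.Verify(inc)
	s.Learn(inc, action, verified) // failures are recorded too
	if !verified {
		return true, ErrNotVerified
	}
	return true, nil
}

func main() {
	healthy := false // simulated service state
	s := Stages{
		Detect: func() (Incident, bool) {
			return Incident{Service: "api", Symptom: "container exited"}, !healthy
		},
		Diagnose: func(Incident) string { return "restart_container" },
		Fix:      func(string) error { healthy = true; return nil },
		Verify:   func(Incident) bool { return healthy },
		Learn: func(inc Incident, action string, verified bool) {
			fmt.Printf("learned: %s -> %s (verified=%v)\n", inc.Service, action, verified)
		},
	}
	handled, err := RunOnce(s)
	fmt.Println("handled:", handled, "err:", err)
}
```

Because every stage is injected, a minimal version can start with hard-coded functions like the ones in `main` and swap in real checks one stage at a time.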
Whitelist — what Hermes is allowed to do (and, crucially, what it isn't)
The whitelist is the safety harness of any self-healing system
The moment you give an automated script permission to run commands against production, you must define exactly what's allowed and what isn't. Hermes's whitelist is a small JSON file containing the list of permitted actions; without it, Hermes will do nothing. That's the difference between a system that lets you sleep soundly and one that accidentally wipes out your VPS.
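A default-deny whitelist could look roughly like this. The JSON shape (an `actions` map from action name to a fixed argument list) and the `Whitelist` type are assumptions made for illustration; the real file format is not shown in this guide.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Whitelist maps an action name to the exact command it may run.
// This schema is an assumption, not Hermes's documented format.
type Whitelist struct {
	Actions map[string][]string `json:"actions"`
}

// ParseWhitelist decodes the JSON document into a Whitelist.
func ParseWhitelist(data []byte) (Whitelist, error) {
	var w Whitelist
	if err := json.Unmarshal(data, &w); err != nil {
		return Whitelist{}, err
	}
	return w, nil
}

// Command returns the fixed argv for an action, or an error if it is not listed.
// An empty or missing whitelist therefore permits nothing.
func (w Whitelist) Command(action string) ([]string, error) {
	argv, ok := w.Actions[action]
	if !ok || len(argv) == 0 {
		return nil, fmt.Errorf("action %q is not whitelisted", action)
	}
	return argv, nil
}

// sample is a hypothetical whitelist with a single permitted action.
const sample = `{"actions": {"restart_api": ["docker", "restart", "api"]}}`

func main() {
	w, err := ParseWhitelist([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, action := range []string{"restart_api", "drop_database"} {
		argv, err := w.Command(action)
		if err != nil {
			fmt.Println("refused:", err)
			continue
		}
		fmt.Println("allowed:", argv)
	}
}
```

Storing the full argument list, rather than a free-form shell string, means the diagnose stage can only pick an action by name and can never inject arguments of its own.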
Verification — the key to real reliability
A fix worked only if you can prove it worked — 'the command ran' is not enough
The most common mistake junior SRE teams make: 'I ran a restart, it returned 0, it's probably fine.' It isn't. Verification is the ability to prove that after the fix the service is genuinely alive, genuinely responsive, and genuinely doing what it's supposed to do. That's the difference between a Hermes that works and a script that runs at night and lulls you into feeling everything's fine — until morning reveals that the API was returning 500 all night long.
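One way to turn "prove it worked" into code is to require several consecutive healthy responses after the fix, instead of trusting the exit code. The sketch below is an assumption about how such a check might look: `Probe`, `HTTPProbe`, `Verify`, and the `/healthz` URL in the comment are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Probe performs one health request and reports the HTTP status.
type Probe func() (status int, err error)

// HTTPProbe builds a Probe that GETs the given URL with a timeout.
func HTTPProbe(url string, timeout time.Duration) Probe {
	client := &http.Client{Timeout: timeout}
	return func() (int, error) {
		resp, err := client.Get(url)
		if err != nil {
			return 0, err
		}
		defer resp.Body.Close()
		return resp.StatusCode, nil
	}
}

// Verify returns true only after `needed` consecutive 2xx responses
// within at most `attempts` probes. Any failure resets the streak.
func Verify(p Probe, attempts, needed int, wait time.Duration) bool {
	streak := 0
	for i := 0; i < attempts; i++ {
		status, err := p()
		if err == nil && status >= 200 && status < 300 {
			streak++
			if streak >= needed {
				return true
			}
		} else {
			streak = 0
		}
		time.Sleep(wait)
	}
	return false
}

func main() {
	// Fake probe: the service answers 500 twice while warming up, then 200.
	statuses := []int{500, 500, 200, 200, 200}
	i := 0
	fake := func() (int, error) {
		s := statuses[i%len(statuses)]
		i++
		return s, nil
	}
	fmt.Println("verified:", Verify(fake, 5, 3, 0))

	// In real use, something like:
	//   Verify(HTTPProbe("http://localhost:8080/healthz", 2*time.Second), 10, 3, time.Second)
	_ = HTTPProbe
}
```

A status check only proves the service is alive and responsive; proving it is "doing what it's supposed to do" usually needs a probe that exercises a real code path, which can be plugged in as another `Probe`.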
Memory: what makes Hermes smarter every week
A Qdrant collection that remembers what worked for what — semantic search over historical fixes
Without memory, Hermes is a collection of scripts running in a loop. With memory, it becomes something that learns from your network. Every successful fix is stored as an embedding in Qdrant, and the next time a similar failure appears, a semantic search (about 40 ms) surfaces the action that worked before. That's the difference between a static system and one that gets smarter with every incident.
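The recall step can be illustrated without a vector database. The sketch below is an in-memory stand-in, not Qdrant client code: it stores past fixes with hand-made toy vectors and returns the most similar one by cosine similarity. `FixRecord`, `Memory`, and the three-dimensional vectors are all hypothetical; a real setup would get vectors from an embedding model and delegate the search to Qdrant.

```go
package main

import (
	"fmt"
	"math"
)

// FixRecord is one remembered fix: what failed, what worked, and its embedding.
type FixRecord struct {
	Symptom string
	Action  string
	Vector  []float64
}

// cosine returns the cosine similarity of two equal-length vectors (0 if undefined).
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Memory is a toy in-memory replacement for a vector collection.
type Memory struct{ records []FixRecord }

// Remember stores a fix that was verified to work.
func (m *Memory) Remember(r FixRecord) { m.records = append(m.records, r) }

// Recall returns the most similar past fix if its score reaches minScore.
func (m *Memory) Recall(query []float64, minScore float64) (FixRecord, bool) {
	best, bestScore := FixRecord{}, -1.0
	for _, r := range m.records {
		if s := cosine(query, r.Vector); s > bestScore {
			best, bestScore = r, s
		}
	}
	return best, len(m.records) > 0 && bestScore >= minScore
}

func main() {
	m := &Memory{}
	m.Remember(FixRecord{Symptom: "container exited", Action: "restart_container", Vector: []float64{1, 0, 0}})
	m.Remember(FixRecord{Symptom: "disk full", Action: "prune_logs", Vector: []float64{0, 1, 0}})

	// A new failure whose (toy) embedding is close to "container exited".
	if r, ok := m.Recall([]float64{0.9, 0.1, 0}, 0.8); ok {
		fmt.Println("try first:", r.Action)
	}
}
```

The `minScore` threshold matters as much as the search itself: below it, the failure counts as new, and falling through to escalation is safer than replaying a loosely related fix.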
Escalation — when it's right to wake you (and as little as possible)
The golden rule of self-healing: alert only when it's truly worth your sleep
Escalation is a last resort — the moment Hermes throws its hands up and says 'I can't do this, please help.' The whole point of Hermes is to cut alerts down to 10% of cases — reserved only for the new and interesting. If Hermes sends too many alerts, that's a sign the whitelist or memory isn't good enough, not a sign that 'the tool is noisy.' PagerDuty's starter plan runs $21/user/month (modern alternatives like BetterStack, Grafana OnCall or Squadcast come in cheaper still); Hermes costs $0 and saves your sleep on top.
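The escalation decision can be sketched as a small pure function. This is an illustration under assumptions, not Hermes's real logic: the `Outcome` fields, the retry limit, and the `AlertRate` helper are hypothetical. The idea is to page a human only when no whitelisted action exists or when the fix still fails verification after the allowed retries.

```go
package main

import "fmt"

// Outcome summarizes one healing attempt. All field names are illustrative.
type Outcome struct {
	Whitelisted bool // diagnosis mapped to a permitted action
	Verified    bool // the post-fix check passed
	Attempts    int  // fix attempts made for this incident so far
}

// ShouldEscalate reports whether a human must be alerted, and why.
func ShouldEscalate(o Outcome, maxAttempts int) (bool, string) {
	switch {
	case !o.Whitelisted:
		return true, "no whitelisted action for this failure"
	case o.Verified:
		return false, ""
	case o.Attempts >= maxAttempts:
		return true, "fix did not verify after retries"
	default:
		return false, "" // not verified yet, but retries remain
	}
}

// AlertRate is the share of incidents that reached a human.
func AlertRate(escalated, total int) float64 {
	if total == 0 {
		return 0
	}
	return float64(escalated) / float64(total)
}

func main() {
	outcomes := []Outcome{
		{Whitelisted: true, Verified: true, Attempts: 1},
		{Whitelisted: true, Verified: false, Attempts: 3},
		{Whitelisted: false},
	}
	escalated := 0
	for _, o := range outcomes {
		if yes, why := ShouldEscalate(o, 3); yes {
			escalated++
			fmt.Println("page a human:", why)
		}
	}
	fmt.Printf("alert rate: %.2f\n", AlertRate(escalated, len(outcomes)))
}
```

Tracking the alert rate over time gives a number to compare against the 10% target: if it drifts upward, the reasons returned by `ShouldEscalate` show whether the whitelist or the memory needs work.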
Integrating with your stack — Hermes is a Pattern, not a service
How to embed the approach inside your existing agents and services
Important note: the Hermes pattern (detect→diagnose→fix→verify→learn) lives inside the agents and services themselves, as cron jobs, webhook handlers, or in-code modules, not in one central service. That's an advantage: effective self-healing is distributed across every component.
2026 update: beyond the self-healing pattern it grew from, today in my network Hermes is also the network's studio/worker agent, the headless component that generates assets, analyzes data, and runs code on behalf of the orchestrator. Both sides coexist: the pattern that keeps the server alive, and the agent that produces work on top of it.
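As one example of embedding the pattern in an existing service, here is a hedged sketch of a webhook handler that hands an incoming alert to a healing function. The `Alert` payload, the `HealFunc` type, and the `/hermes` path are hypothetical; real monitoring tools each send their own schema. `main` exercises the handler with `httptest`, so nothing listens on a port.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Alert is a hypothetical webhook payload.
type Alert struct {
	Service string `json:"service"`
	Symptom string `json:"symptom"`
}

// HealFunc runs the diagnose-fix-verify-learn steps for one alert
// and reports whether the fix was verified.
type HealFunc func(Alert) bool

// Decide parses a payload, runs the healing function, and returns
// an HTTP status plus a short message.
func Decide(body []byte, heal HealFunc) (int, string) {
	var a Alert
	if err := json.Unmarshal(body, &a); err != nil || a.Service == "" {
		return http.StatusBadRequest, "invalid alert"
	}
	if heal(a) {
		return http.StatusOK, "healed " + a.Service
	}
	return http.StatusAccepted, "escalated " + a.Service
}

// Handler wraps Decide so it can be mounted in any existing HTTP server.
func Handler(heal HealFunc) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST only", http.StatusMethodNotAllowed)
			return
		}
		body, err := io.ReadAll(io.LimitReader(r.Body, 1<<20)) // cap payload at 1 MiB
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		status, msg := Decide(body, heal)
		w.WriteHeader(status)
		fmt.Fprintln(w, msg)
	})
}

func main() {
	// Stand-in healer: pretends only exited containers can be fixed.
	heal := func(a Alert) bool { return a.Symptom == "container exited" }

	payload := `{"service":"api","symptom":"container exited"}`
	req := httptest.NewRequest(http.MethodPost, "/hermes", strings.NewReader(payload))
	rec := httptest.NewRecorder()
	Handler(heal).ServeHTTP(rec, req)
	fmt.Println(rec.Code, strings.TrimSpace(rec.Body.String()))
}
```

The same `Decide` function can be called from a cron job that polls for problems, so the webhook and cron entry points share one healing path.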

