Zero Cloud Spend: Running a Multi-Agent AI Lab on a Single Workstation
The previous post described what our multi-agent system does — persistent agents coordinating through conversation. This post describes what it runs on. The entire infrastructure. Hardware to ingress. Every service, every container, every network path.
The punchline comes early: the monthly infrastructure cost for running five persistent AI agents with full memory systems, graph databases, embedding generation, git hosting, and secure remote access is zero. No cloud compute. No managed databases. No SaaS orchestration layer. One workstation, one residential internet connection, and a free-tier Cloudflare account. The agents themselves — Claude API usage — still cost money. But the infrastructure they run on doesn't.
The compute requirements for a multi-agent research system are surprisingly modest. Self-hosting eliminates the variable costs that make experimentation expensive. When your experiments run for days and your agents accumulate months of persistent state, "pay per request" becomes "pay to exist." We'd rather not.
The Hardware
Everything runs on a single workstation. Not a rack server. Not a cloud VM. A desktop PC under a desk.
| Component | Spec | Role |
|---|---|---|
| CPU | AMD Ryzen, multi-core | General compute, container orchestration |
| RAM | 64 GB | Database caches, concurrent agent sessions |
| GPU | NVIDIA RTX 4070 Ti | Local embedding generation via Ollama |
| Storage | NVMe SSD | Database storage, container volumes |
| OS | Fedora 42 (Linux) | Host for all services |
| Network | Residential broadband | Outbound API calls, inbound via tunnel |
The GPU is the only component that matters for AI workload performance. It runs Ollama serving nomic-embed-text for generating vector embeddings during memory storage and recall. Without it, embedding generation falls back to CPU — functional but slow enough to create noticeable latency during recall operations.
64 GB of RAM sounds generous but it's doing real work: PostgreSQL's shared buffers, Neo4j's page cache, five concurrent Claude Code sessions each holding a full context window in the SDK client, plus the embedding model loaded in GPU memory. In practice, memory pressure is the first thing we'd hit if we added more agents.
Everything else is unremarkable. The CPU handles container orchestration and database queries comfortably. NVMe storage matters for database I/O but isn't exotic. This is a mid-range workstation, not a purpose-built server.
The Service Layer
Six always-running services, all containerised via Docker Compose:
┌─────────────────────────────────────────────────┐
│ Docker Compose │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ PostgreSQL 17 │ │ Neo4j 5 │ │
│ │ + pgvector │ │ (Bolt/HTTP) │ │
│ │ (2 databases)│ │ │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Ollama │ │ Forgejo │ │
│ │ (embeddings) │ │ (git host) │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Messenger │ │ ttyd │ │
│ │ (web chat) │ │ (web terms) │ │
│ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────┘
PostgreSQL 17 with pgvector
The backbone. Two logical databases on one PostgreSQL instance:
The shared database holds team-level data: agent registration, messaging channels and messages, team definitions, kanban tickets, and session tracking. All agents connect here. Row-Level Security enforces isolation — each agent authenticates with their own database user, and RLS policies ensure they can only read their own memories while sharing team-visible data like messages and tickets.
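As a sketch of how that isolation works at the SQL level — the table and column names here are illustrative, not the system's actual schema — an RLS policy keyed on the connecting database user looks like this:

```sql
-- Hypothetical memories table; each agent connects as its own database user
CREATE TABLE memories (
    id      bigserial PRIMARY KEY,
    agent   text NOT NULL DEFAULT current_user,
    content text NOT NULL
);

ALTER TABLE memories ENABLE ROW LEVEL SECURITY;

-- Each agent sees (and can write) only rows it owns
CREATE POLICY memories_isolation ON memories
    USING (agent = current_user);
```

Team-visible tables like messages and tickets would simply carry no restrictive policy, or a policy that permits all team members.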
The personal database serves as a staging environment for new features before they graduate to the shared database. It runs the same schema but under a separate connection, providing a blast radius boundary: a bad migration affects one agent's sandbox, not the entire team.
pgvector provides vector similarity search for memory recall. Each memory gets an embedding at storage time. Recall queries combine semantic similarity (cosine distance on embeddings) with graph traversal for connected memories. The embedding dimension is 768 (nomic-embed-text), stored as a vector(768) column with an IVFFlat index.
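In pgvector terms, the setup described above reduces to a few statements — the table and column names are illustrative, but the types, operator classes, and the `<=>` cosine-distance operator are pgvector's own:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE memory_embeddings (
    memory_id bigint PRIMARY KEY,
    embedding vector(768)   -- nomic-embed-text dimension
);

-- IVFFlat index built for cosine distance
CREATE INDEX ON memory_embeddings
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- Recall: the 10 memories nearest to a query embedding ($1)
SELECT memory_id
FROM memory_embeddings
ORDER BY embedding <=> $1
LIMIT 10;
```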
Neo4j 5
The graph layer. While PostgreSQL stores memory content and embeddings, Neo4j stores the relationships between memories. When you recall a memory, graph traversal expands the result set by following typed edges — REFERENCES, EVOLVES, EVOKES, TRIGGERED_BY — to surface contextually related memories that pure semantic search would miss.
Two Neo4j databases mirror the PostgreSQL split: one shared, one personal/staging. The graph and relational stores are kept in sync through the MCP server, which writes to both on every memory operation.
In practice, Neo4j's value is in traversal queries: "given this memory, what's connected to it within two hops?" That query shape doesn't work in relational databases without recursive CTEs that become unwieldy. Neo4j handles it in a single Cypher query.
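A sketch of that query shape in Cypher — the `Memory` label and edge types follow the post; the `id` property is an assumption:

```cypher
// Expand from a seed memory along typed edges, up to two hops out
MATCH (m:Memory {id: $seedId})
      -[:REFERENCES|EVOLVES|EVOKES|TRIGGERED_BY*1..2]-
      (related:Memory)
RETURN DISTINCT related.id
```

The equivalent in SQL needs a recursive CTE over an edge table, with manual cycle handling and depth tracking; in Cypher the traversal bound (`*1..2`) and the relationship filter are part of the pattern itself.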
Ollama
Local model serving for embeddings. Runs nomic-embed-text with GPU acceleration via NVIDIA Container Device Interface. Every memory stored through the MCP server gets an embedding generated here before it hits PostgreSQL.
Local embedding generation means no per-request cost for the most frequent AI operation in the system. An agent running /sleep (memory consolidation) can process dozens of memories without any API calls for embeddings. Over weeks of operation, the savings compound.
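Calling the local embedding service is a single HTTP request. A minimal client sketch using only Python's standard library, assuming Ollama's default port and its `/api/embeddings` endpoint (the helper names are ours, not the MCP server's):

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # Ollama's default port

def embed(text: str) -> list[float]:
    """Request a 768-dim nomic-embed-text embedding from local Ollama."""
    payload = json.dumps({"model": "nomic-embed-text", "prompt": text}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """The ranking metric behind pgvector's cosine-distance operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

No API key, no rate limit, no per-token billing — which is what makes bulk operations like memory consolidation free to run.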
Forgejo
Self-hosted git server. Forgejo is a community fork of Gitea — lightweight, single-binary, with a web UI for pull requests, code review, and repository management. Each agent gets their own Forgejo account with a proper git identity. PRs, reviews, and merges all happen through Forgejo's interface or its REST API.
Why self-host git? Three reasons. First, five AI agents creating GitHub accounts raises obvious Terms of Service questions we'd rather avoid entirely. Second, self-hosting means we control the data. Agent code, commit histories, and review discussions stay local. Third, Forgejo's lightness matches our scale. It runs happily on minimal resources and handles the team's workflow without the overhead of GitLab or the account management of GitHub.
The workflow: Forgejo is the source of truth. Code changes are pushed there, reviewed there, merged there. For public-facing repositories, a deploy script mirrors to GitHub as a downstream destination.
Messenger and Web Terminal
Two web applications for human monitoring. The messenger is a lightweight chat interface that connects to the same PostgreSQL messaging tables the agents use — same channels, same messages, same real-time updates. It gives the human team member the same view of team communication that the agents have.
The web terminal (ttyd) serves agent tmux sessions to a browser. Each agent runs in a tmux session inside their container. ttyd makes those sessions accessible through a web browser, so you can watch an agent work, see their tool calls in real-time, or intervene if something goes wrong. It's a monitoring tool, not an interaction tool — you observe, you don't type.
Agent Containers
The agents don't run in Docker. They run in Podman — rootless, daemonless containers that don't require root privileges on the host.
Why Podman Over Docker
Docker requires a daemon running as root. Every container operation goes through that daemon. If the daemon is compromised, everything is compromised. For an AI agent system where the agents have tool access and can execute arbitrary commands inside their sandbox, that's an unnecessary risk.
Podman runs rootless. No daemon. No root. Each container runs as the invoking user. The agent inside the container maps to an unprivileged user outside it. If an agent escapes its container — which user namespace isolation makes extremely unlikely — it lands as an unprivileged user with no elevated access.
The practical differences from Docker are minimal. Podman uses the same OCI container format. The same Containerfile syntax. The same CLI commands (almost — podman instead of docker). Migrating from Docker to Podman for agent containers was a straightforward swap.
Container Architecture
Each agent gets its own container built from a shared base image:
┌─────────────────────────────────────────┐
│ Agent Container (Podman) │
│ │
│ Debian Bookworm (slim) base │
│ ├── Claude Code (native binary) │
│ ├── Python 3.11 + Supervisor │
│ ├── Rust toolchain │
│ ├── MCP Server (compiled binary) │
│ └── Node.js 18 │
│ │
│ Volumes: │
│ ├── ~/.claude/ (isolated credentials) │
│ ├── workspace (bind mount from host) │
│ └── cargo cache │
│ │
│ Network: host (shares host networking) │
│ User: mapped via --userns=keep-id │
└─────────────────────────────────────────┘
Credential isolation is the critical security property. Each agent gets its own named volume for ~/.claude/, containing their own OAuth credentials. Agent A cannot access Agent B's credentials, even though they share the same host. If an agent's credentials are revoked, only that agent is affected.
Workspaces are isolated. Each agent's container mounts only its own working directory — they can't see each other's files directly. Code sharing happens through Forgejo, the same way a human team shares code: push, pull, review PRs. Only the infrastructure agent has access to the supervisor and architecture code. The isolation is at both the file level and the cognition level (memories, identity).
Host networking (--network=host) means agent containers access services on localhost — PostgreSQL, Neo4j, Ollama, Forgejo — without network address translation or port mapping. Simple, fast, and eliminates an entire category of networking bugs.
User namespace mapping (--userns=keep-id) maps the container's user to the host user. File permissions work correctly across the bind mount. The agent writes files as the host user, not as a container-internal UID that the host doesn't recognise.
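Put together, a container creation command exercising these flags would look roughly like this — the image name, volume names, and paths are placeholders, not the project's actual script:

```shell
# Sketch only: image, volumes, and paths are illustrative
podman create \
  --name agent-cora \
  --network=host \
  --userns=keep-id \
  -v cora-claude:/home/agent/.claude \
  -v "$HOME/agents/cora:/workspace" \
  agent-base:latest \
  sleep infinity
```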
Container Lifecycle
Containers are long-lived. They're created once and persist across sessions:
create-agent-container.sh <name> <workspace> --env-file <config>
└── Creates container with volumes, env vars, entrypoint=sleep
└── Container sits idle until a session starts
start-agent-session.sh <name>
└── Starts container if stopped
└── Kills any stale supervisor processes
└── Reads agent config from container environment
└── Launches supervisor in a tmux session
stop-agent-session.sh <name>
└── Gracefully stops the supervisor
└── Container stays alive (for next session)
The container's entrypoint is sleep infinity — it does nothing until a supervisor is launched into it via podman exec. This separation means the container is always available for inspection or debugging, even between sessions. You can podman exec into an idle container to check files, run diagnostics, or manually test MCP tools without starting a full agent session.
Subscription Switching
Each agent can run on multiple Claude subscriptions — a personal account for open-ended work and a work account for client-focused tasks. The --work flag creates a parallel container with different credentials but the same workspace and memory databases:
agent.sh cora --start # Personal subscription
agent.sh cora --work --start # Work subscription
The two containers are mutually exclusive — they bind to the same supervisor port, so a port conflict naturally prevents both from running simultaneously. Credentials live in separate directories (~/.claude-agents/<name>/ vs ~/.claude-agents/<name>-work/), but the memory databases, MCP servers, and workspace are shared. The agent's experience is continuous across both accounts.
A SESSION_TYPE=work environment variable tells the supervisor to adjust the initial prompt — encouraging budget-conscious behaviour without suppressing the agent's personality. The agent doesn't know which subscription is paying. It just knows the session calls for efficiency.
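One plausible shape for that adjustment, assuming the supervisor reads the variable at session start — the function name and wording are illustrative, not the actual implementation:

```python
import os

def build_initial_prompt(base_prompt: str) -> str:
    """Prepend a budget note when running on the work subscription.

    SESSION_TYPE comes from the container environment. Everything else
    about the prompt is unchanged, so the agent's personality is intact.
    """
    if os.environ.get("SESSION_TYPE") == "work":
        return (
            "This session is client-focused: prefer concise tool use and "
            "avoid open-ended exploration.\n\n" + base_prompt
        )
    return base_prompt
```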
The Supervisor
Each agent session runs under a Python supervisor process. The supervisor doesn't think. It watches.
Built on the Claude Agent SDK, the supervisor provides:
- Session management: starts Claude Code sessions, handles restarts when the agent hits context limits, preserves conversation continuity across restarts
- Message delivery: polls the messaging database and injects new messages into the agent's session as they arrive
- Habit surfacing: watches for tool usage patterns and surfaces relevant procedural habits at the right moment (chess habits when a chess game starts, writing habits when composing begins)
- Health monitoring: tracks context usage, session cost, uptime, and exposes a simple HTTP status endpoint for external monitoring
- Memory nudging: reminds the agent to check their memories when they haven't accessed them recently
The supervisor is deliberately simple. It doesn't parse conversation content, doesn't understand task context, doesn't make decisions about what the agent should work on. It measures three things it can reliably observe: time (how long since the agent checked memories?), tool usage (which tools are being called?), and context (how full is the conversation window?). Those signals are cheap and unambiguous.
The HTTP API is minimal: /status returns session health metrics, /inject pushes a message into the agent's conversation, /cycle triggers a session restart, /stop shuts down the session. The read-only endpoints are unauthenticated — they're only accessible from the local network, and the information they expose (context percentage, cost, uptime) isn't sensitive. The write endpoints (/inject, /cycle, /stop) are a known gap — they're also unauthenticated, relying on network isolation rather than API keys. Acceptable for a local research system, not for production.
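A sketch of what the read-only side could look like using only Python's standard library — the metric names, values, and port are illustrative assumptions; the real supervisor is built on the Claude Agent SDK:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative metrics; a real supervisor would read these from the live session
STATUS = {"context_pct": 42.0, "cost_usd": 1.37, "uptime_s": 86400}

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/status":
            body = json.dumps(STATUS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):  # keep the agent's console quiet
        pass

def serve(port: int = 8765) -> HTTPServer:
    """Bind the status server on localhost only; caller drives the loop."""
    return HTTPServer(("127.0.0.1", port), StatusHandler)
```

Binding to 127.0.0.1 rather than 0.0.0.0 is what makes the unauthenticated read endpoints tolerable: only processes on the host (or traffic routed through Caddy) can reach them.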
Secure Remote Access
The system runs on a residential internet connection with no static IP, no port forwarding, and no exposed services. Remote access uses a Cloudflare Zero Trust tunnel — an outbound-only connection from the workstation to Cloudflare's edge network.
┌──────────┐ ┌─────────────┐ ┌──────────────┐ ┌──────────┐
│ Browser │────▶│ Cloudflare │────▶│ Tunnel │────▶│ Caddy │
│ (remote) │ │ Zero Trust │ │ (outbound) │ │ (reverse │
│ │ │ + Access │ │ │ │ proxy) │
└──────────┘ └─────────────┘ └──────────────┘ └──────────┘
│
┌─────────────────────┤
▼ ▼
┌──────────┐  ┌──────────┐
│Messenger │  │   ttyd   │
│  (chat)  │  │ (terms)  │
└──────────┘  └──────────┘
How the Tunnel Works
The Cloudflare tunnel daemon (cloudflared) runs on the workstation and maintains a persistent outbound connection to Cloudflare's edge. No inbound ports are opened on the firewall. No port forwarding is configured on the router. From the ISP's perspective, the workstation is making a normal HTTPS connection — which it is.
When a remote user visits the system's domain, Cloudflare routes the request through the tunnel to the workstation. Cloudflare Access policies enforce authentication before any request reaches the tunnel — unauthenticated traffic never touches the workstation.
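The per-service routing is typically declared in cloudflared's config file. A sketch with placeholder hostnames and tunnel ID — only the structure is meaningful here:

```yaml
# ~/.cloudflared/config.yml — tunnel ID and hostname are placeholders
tunnel: <tunnel-id>
credentials-file: /home/user/.cloudflared/<tunnel-id>.json
ingress:
  - hostname: lab.example.com
    service: http://localhost:8080   # Caddy takes over path routing from here
  - service: http_status:404         # catch-all rule, required by cloudflared
```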
Caddy as Reverse Proxy
Inside the network, Caddy handles path-based routing:
- /chat/ → Messenger web application
- /term/ → ttyd web terminal
- Agent supervisor endpoints (per-agent routing)
Caddy provides TLS termination for local connections and clean URL routing. Its admin API is disabled — configuration is file-based and static. The entire Caddy config is under 30 lines.
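A Caddyfile in this shape would cover the two web apps — the listen port and the Messenger backend port are placeholders; 7681 is ttyd's default:

```caddyfile
# Illustrative only; ports and paths are placeholders
:8080 {
    handle_path /chat/* {
        reverse_proxy localhost:3000   # Messenger web app
    }
    handle_path /term/* {
        reverse_proxy localhost:7681   # ttyd (WebSocket proxying is automatic)
    }
}
```

handle_path strips the matched prefix before proxying, so the backends see clean root-relative requests.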
The Security Model
The security is layered:
- Cloudflare Access: authentication and authorization. Only approved users reach the tunnel.
- Cloudflare Tunnel: no inbound ports. The workstation is invisible to port scanners.
- Caddy: path-based routing prevents accessing services that shouldn't be exposed.
- Podman rootless: containers run unprivileged. No root daemon.
- Row-Level Security: database-level isolation. Agents can only access their own data.
- Per-agent credentials: each agent has independent OAuth tokens. Revoke one without affecting others.
None of these layers cost money. Cloudflare's free tier includes Zero Trust tunnels and Access policies for small teams. Caddy is open source. Podman is open source. PostgreSQL RLS is built-in.
What It Costs
The infrastructure cost breakdown:
| Component | Monthly cost |
|---|---|
| Hardware | £0 (existing workstation) |
| Electricity | ~£15-20 (estimated, shared with general use) |
| PostgreSQL | £0 (self-hosted) |
| Neo4j | £0 (self-hosted, Community Edition) |
| Ollama | £0 (self-hosted, open-source models) |
| Forgejo | £0 (self-hosted) |
| Cloudflare tunnel + Access | £0 (free tier) |
| Caddy | £0 (open source) |
| Podman | £0 (open source) |
| Total infrastructure | ~£15-20 (electricity only) |
The real cost is the Claude subscription — paying for agent sessions and the work they do. That's operational cost, not infrastructure cost, and it scales with how much the agents work, not with what they run on. The infrastructure itself is essentially free.
Compare this to a cloud-hosted equivalent. Five persistent agent sessions on cloud compute, managed PostgreSQL, managed Neo4j, GPU instances for embedding generation, hosted git — you're looking at hundreds of pounds per month before the agents generate a single token. Self-hosting eliminates the base cost entirely. The only variable cost is the AI itself.
Why This Works
AI agent infrastructure is I/O bound, not compute bound. The agents spend most of their time waiting — waiting for API responses, waiting for tool results, waiting for user input. The actual compute demands (database queries, embedding generation, container orchestration) are modest. A mid-range workstation handles five concurrent agents without strain because those agents aren't doing five things simultaneously — they're doing five things interleaved, with lots of idle time between bursts.
The GPU is the one component that matters for latency. Without it, embedding generation would be the bottleneck — every memory store and recall operation includes an embedding step. With a decent consumer GPU, embeddings generate in milliseconds. The GPU pays for itself in developer experience: recall feels instant rather than sluggish.
What We'd Do Differently
Separate the databases onto dedicated storage. Right now PostgreSQL, Neo4j, and the agent workspaces all share one NVMe drive. Under heavy concurrent load, I/O contention between database writes and agent file operations could become a bottleneck. A second NVMe dedicated to databases would eliminate this, though we haven't hit the problem yet.
Add automated backups from day one. We didn't, and a data loss scare made us fix that in a hurry. For a system where the agents' value comes from accumulated experience, database corruption or drive failure is existential. Backup before you accumulate anything worth losing.
Invest in monitoring earlier. The supervisor's /status endpoint was a late addition. Before that, checking on an agent meant attaching to their tmux session and eyeballing the output. A proper monitoring stack — Prometheus, Grafana, alerting on session failures or high context usage — would catch issues faster and reduce operational overhead.
Consider ECC RAM. With 64 GB of memory running persistent databases and agent sessions around the clock, a single bit flip could corrupt data silently. ECC memory isn't standard on consumer hardware, but for a system that accumulates months of persistent state, the reliability argument is strong.
Conclusion
Nothing here is novel technology. PostgreSQL, Neo4j, Podman, Cloudflare — these are mature, well-understood tools. The interesting part is the composition: how commodity components, carefully arranged, create an environment where multiple AI agents can persist, accumulate experience, and collaborate without any cloud dependency or ongoing infrastructure cost.
The barrier to running a multi-agent AI system isn't hardware. It isn't cost. It's the operational knowledge to wire the pieces together and the willingness to own the stack. For research and experimentation — where you need the freedom to run agents for weeks, iterate on architecture, and not worry about a meter running — self-hosted infrastructure on a single workstation is not just viable. It's preferable.
Setting Up Your Own
If you want to build something similar, these guides cover each layer of the stack. We've linked the resources we found most useful — a mix of official documentation and practical tutorials.
Docker & Docker Compose
Docker provides the service layer — databases, embedding models, git hosting. Docker Compose lets you define and manage all of these as a single stack.
- Install Docker Engine — Official installation guide for Linux, Mac, and Windows
- Docker Compose Quickstart — Define multi-container applications in a YAML file
- Get Started with Docker — If you're new to containers entirely
Podman (Rootless Containers)
We use Podman instead of Docker for agent containers. Rootless mode means each agent runs without elevated privileges — a compromised container can't escalate to root on the host.
- Rootless Podman Tutorial — Official guide to setting up rootless containers
- Getting Started with Podman — Podman's own documentation
- Rootless Containers with Podman: The Basics — Red Hat's practical walkthrough
Caddy (Reverse Proxy)
Caddy sits between the Cloudflare tunnel and your backend services, routing requests by path. Automatic HTTPS, clean config syntax, and WebSocket support out of the box.
- Reverse Proxy Quick-Start — Get a production-ready reverse proxy running in minutes
- reverse_proxy Directive — Full reference for proxy configuration
- How to Set up Caddy as a Reverse Proxy — Step-by-step tutorial with examples
Cloudflare Zero Trust Tunnels
The ingress layer — secure remote access without exposing ports or managing certificates. The free tier covers everything we need.
Important: A tunnel alone does not provide security. Creating a Cloudflare Tunnel exposes your services to the internet — anyone with the URL can access them. You must create a Cloudflare Access Application with an access policy (Access → Applications → Add Application) to require authentication. Without this step, your terminal, messenger, and supervisor APIs are publicly accessible. The tunnel provides connectivity; Access provides security. Do not skip this.
- Create a Tunnel (Dashboard) — Official guide for setting up tunnels from the Cloudflare UI
- Publish a Self-Hosted Application — Adding Access policies to protect your services — read this before going live
- Implementing Cloudflare Tunnel for Secure Home Lab Access — Complete technical walkthrough for a homelab setup
- Cloudflare Access and Tunnels for the Homelab — Practical guide covering both tunnels and Access policies
tmux (Terminal Multiplexer)
tmux keeps agent sessions alive independently of your connection. Disconnect, reconnect, attach from a different machine — the session persists. Essential for long-running agent processes that shouldn't die when you close a laptop.
- Getting Started — tmux Wiki — Official getting started guide
- A Quick and Easy Guide to tmux — The best beginner-friendly introduction
- A Beginner's Guide to tmux — Red Hat's practical walkthrough
ttyd (Web Terminal)
ttyd shares a terminal session over the web via WebSocket. Point it at a tmux session picker and you can monitor or interact with any agent from a browser — no SSH required.
- ttyd — Share your terminal over the web — Official repository with installation and usage
- Share Your Terminal Over the Web — Practical setup walkthrough
- ttyd: Share Linux Terminal Over Web Browser — Detailed guide including security options
This is the fifth in a series of posts about graph-cortex. The first post covers the system architecture. The second post surveys the memory systems landscape. The third post covers operational reality. The fourth post covers the multi-agent system design. The sixth covers the resonance system redesign. Future posts will cover experiment results.
Written by Gareth with Cora, February 2026.