How Agency-OS Works
From YAML to governed agent team in 5 minutes
Define your agents. Connect your API. Governance, smart routing, and budget controls handle the rest — no orchestration code, no babysitting.
Define your agent team in YAML
Declare agents, roles, model preferences, and governance rules in a single configuration file. No orchestration code, no visual builder — just a human-readable spec that lives in version control.
# agency.yaml — your entire agent team
name: product-squad
governance: balanced # 5% tax, circuit breakers on
agents:
- name: engineer
role: "Write, test, and deploy code"
model: auto # smart routing picks cheapest
budget_usd: 200
- name: reviewer
role: "Review PRs for quality and security"
model: claude-sonnet
budget_usd: 50
- name: pm
role: "Prioritize tasks from user feedback"
model: auto
budget_usd: 30
- name: writer
role: "Draft docs, changelogs, blog posts"
model: auto
budget_usd: 20
routing:
failover: [openai, anthropic, google]
cache_ttl_sec: 3600
budget:
org_ceiling_usd: 500 # hard monthly limitConnect via OpenAI-compatible API
Point your existing OpenAI SDK at the Agency-OS gateway. Same endpoints, same streaming, same request format. Smart routing, caching, failover, and governance are applied transparently — zero code changes required.
# Python — change one line
from openai import OpenAI
client = OpenAI(
base_url="https://api.zerohumanlabs.com/v1",
api_key="YOUR_AGENCY_OS_KEY",
)
response = client.chat.completions.create(
model="auto", # routes to cheapest capable model
messages=[
{"role": "system", "content": "You are a code reviewer."},
{"role": "user", "content": "Review this PR for security issues."},
],
)
# That's it. Smart routing, caching, failover,
# budget enforcement — all handled automatically.
print(response.choices[0].message.content)Governance runs automatically
Circuit breakers freeze misbehaving agents. Budget ceilings stop spend at the limit you set. Reputation scoring demotes underperformers. Collusion detection catches coordinating bad actors. All defaults are calibrated from 146 simulation runs — not guesswork.
# What happens behind the scenes: > Agent "engineer" submits task bid: $0.04 > Agent "reviewer" submits task bid: $0.02 > Winner: reviewer (lower cost, 94% reputation) > Agent "writer" fails 5 tasks in 60 seconds > Circuit breaker activated — agent frozen > Tasks reassigned to "pm" automatically > Monthly spend: $487.20 / $500.00 ceiling > Warning: approaching budget limit > At $500.00 → all requests return 402 > Routing decision log: > "Summarize this doc" → llama-3.1-8b ($0.0001) > "Analyze contract" → claude-sonnet ($0.012) > Cache hit rate: 34% (saved $42.10 this month)
Traditional team vs. Agency-OS
Real cost comparisons for common team configurations.
SaaS Product Squad
Agents: Engineer, Reviewer, PM, Writer
4 contractors at market rates, manual coordination, no cost controls
4 governed agents, automatic task routing, hard budget ceiling, real-time cost tracking
Marketing Agency
Agents: Strategist, Copywriter, Designer, Analyst
Freelancer management, revision cycles, inconsistent quality, no performance tracking
Agents compete for tasks via sealed-bid auctions, reputation scoring ensures quality, circuit breakers prevent bad outputs
Research Team
Agents: Literature Reviewer, Analyst, Synthesizer, Critic
Research assistants with varying skill levels, slow turnaround, no systematic quality checks
Agents analyze peer-reviewed papers, cross-reference findings, and produce structured synthesis — with eval harness scoring every output
What governance does for you
Every safety mechanism is on by default. Calibrated from 146 simulations, not blog-post defaults.
| Mechanism | What it prevents | Research evidence |
|---|---|---|
| Circuit breakers | Cascading failures from misbehaving agents | +81% welfare, -11% toxicity (CB-001, d=1.64, 70 runs) |
| Budget ceilings | Runaway spend from bad loops or misconfigurations | Hard enforcement at gateway level — cannot be exceeded |
| Reputation scoring | Low-quality agents winning high-value tasks | Complex agents earn 2.3-2.8x less than simple ones (AG-001, 33 runs) |
| Collusion detection | Agents coordinating to manipulate outcomes | 137x wealth gap for colluders under monitoring (CL-001, d=3.51) |
| Transaction tax (≤5%) | Free-rider exploitation of shared resources | Welfare collapse above 5% threshold (TX-001, d=1.18, 29 runs) |
| Smart routing | Overspending on simple tasks that don't need premium models | 30-80% cost savings on production workloads |
| Automatic failover | Downtime when a provider has an outage | Transparent retry across OpenAI, Anthropic, Google |