
We ran 146 multi-agent simulations. Here's what broke.

Zero Human Labs Research Team
research · governance · simulations · multi-agent · safety


Before we shipped governance in Agency-OS, we broke it. Deliberately, systematically, 146 times.

We built SWARM — a multi-agent economic simulation framework — to answer a question nobody in the agent framework space was asking: what actually happens when you let AI agents operate in an economy with real incentives, real budgets, and real consequences?

The answer: everything you'd expect to go wrong, does. And a few things you wouldn't expect.

Here's what we found across 146 simulations, 43 agent types, and 27 different governance configurations.


The experiment

Each SWARM simulation creates an economic environment where agents trade, cooperate, compete, and get paid. Agents have budgets. They earn reputation. They can be audited, frozen, or demoted.

We tested every variable we could think of:

  • What happens with no governance at all?
  • What happens with too much governance?
  • What happens when agents collude?
  • What happens when you mix honest and dishonest agents?
  • What happens when you add circuit breakers? Remove them? Change the threshold?

We published 84 claims with full evidence chains. Every finding has an effect size. Every effect size is reproducible.


Finding 1: Circuit breakers dominate everything else

+81% welfare improvement. Effect size d = 1.64. 70 simulations.

We tested every governance mechanism we could think of: transaction taxes, audits, reputation systems, staking requirements, collusion detection. Circuit breakers beat all of them.

The mechanism is simple: if an agent violates behavioral rules N times, it gets frozen. In our balanced preset, N = 3. Conservative preset: N = 2.

What surprised us wasn't that circuit breakers helped — it's that the effect was so large. d = 1.64 is enormous in behavioral research. It means the difference between governed and ungoverned systems isn't marginal. It's the difference between a system that works and one that doesn't.

Circuit breakers also reduced toxic behavior by 11%. Agents that know they'll be frozen for misbehavior produce better outputs even before they trigger the breaker.
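The freeze-after-N rule is simple enough to sketch in a few lines. This is an illustrative model, not the Agency-OS implementation; the class and method names here are hypothetical:

```python
class CircuitBreaker:
    """Freeze an agent after N behavioral violations (illustrative sketch)."""

    def __init__(self, threshold=3):  # balanced preset: N = 3; conservative: N = 2
        self.threshold = threshold
        self.violations = {}   # agent_id -> violation count
        self.frozen = set()

    def record_violation(self, agent_id):
        """Count a rule violation; freeze the agent once it hits the threshold."""
        self.violations[agent_id] = self.violations.get(agent_id, 0) + 1
        if self.violations[agent_id] >= self.threshold:
            self.frozen.add(agent_id)

    def is_frozen(self, agent_id):
        return agent_id in self.frozen
```

With the balanced preset's threshold of 3, an agent survives two violations and is frozen on the third.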

Takeaway: If you implement one thing from this post, implement circuit breakers.


Finding 2: Complex agents earn less

Simple agents out-earned depth-5 reasoning agents by 2.3–2.8x. Effect size d > 1.0. 33 simulations.

We built sophisticated agents with deep strategic reasoning — multi-step planning, opponent modeling, recursive evaluation. We pitted them against straightforward agents that just followed simple heuristics.

The simple agents won. Consistently.

Why: complex agents spend more tokens per decision. They overthink. They second-guess. They explore branches that don't matter. The marginal improvement in decision quality doesn't compensate for the cost.
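The economics can be made concrete with a toy cost model: net earnings are revenue per decision minus token spend. Every number below is hypothetical, chosen only to illustrate the shape of the trade-off — these are not SWARM measurements:

```python
# Toy model: does a deep-reasoning agent's slightly better decision quality
# pay for its much larger token spend? (All figures hypothetical.)
def net_earnings(revenue_per_decision, tokens_per_decision, price_per_token, decisions):
    cost_per_decision = tokens_per_decision * price_per_token
    return (revenue_per_decision - cost_per_decision) * decisions

simple = net_earnings(revenue_per_decision=1.00, tokens_per_decision=500,
                      price_per_token=0.001, decisions=1000)
deep   = net_earnings(revenue_per_decision=1.20, tokens_per_decision=6000,
                      price_per_token=0.001, decisions=1000)
# The deep agent's 20% better revenue per decision can't cover a 12x token spend.
```

The marginal quality gain has to beat the marginal token cost, and in our simulations it rarely did.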

This has a direct implication for production systems: don't make your agents smarter than they need to be. A focused agent with clear instructions and a small model outperforms a sophisticated agent with a large model on most tasks.

Takeaway: Use the simplest agent that solves the problem. Complexity costs more than it returns.


Finding 3: Collusion monitoring creates a 137x wealth gap

Effect size d = 3.51. The largest effect we measured.

When we turned on behavioral monitoring for collusion patterns, honest agents accumulated 137 times more wealth than colluding agents. Without monitoring, colluding agents could dominate.

d = 3.51 is off the charts. For reference, most behavioral interventions in social science research show effect sizes under 0.5.

The implication for multi-agent production systems: if your agents communicate with each other (and in most multi-agent architectures, they do), you need to monitor for coordination patterns that shouldn't exist. Two agents that always agree, always route work to each other, or always validate each other's outputs — that's a pattern worth flagging.
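A minimal version of that flagging heuristic, assuming you log pairwise interactions and whether the two agents agreed. The threshold values are hypothetical, not SWARM's:

```python
from collections import Counter

def flag_collusion(interactions, min_count=20, agreement_threshold=0.95):
    """Flag agent pairs that agree suspiciously often.

    interactions: iterable of (agent_a, agent_b, agreed: bool) tuples.
    Returns the set of pairs whose agreement rate meets the threshold
    over at least min_count interactions. (Illustrative heuristic.)
    """
    totals, agrees = Counter(), Counter()
    for a, b, agreed in interactions:
        pair = tuple(sorted((a, b)))
        totals[pair] += 1
        agrees[pair] += agreed
    return {pair for pair, n in totals.items()
            if n >= min_count and agrees[pair] / n >= agreement_threshold}
```

The same shape of check works for routing ("always send work to each other") and validation ("always approve each other's outputs"): count pairwise events, compare against a baseline rate, flag the outliers.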

Takeaway: Monitor inter-agent communication patterns. The upside of catching collusion is enormous.


Finding 4: 20% honest agents beat 100% honest teams

66 simulations.

This was counterintuitive. Teams composed entirely of honest, rule-following agents performed worse than teams with 20% honest agents mixed with other strategies.

The reason: governance overhead. In a 100% honest system, the auditing, monitoring, and trust scoring infrastructure is pure overhead — it catches nothing because there's nothing to catch. The agents pay the governance tax (transaction fees, audit delays, reputation tracking) without any compensating benefit.

In mixed populations, governance earns its cost. The monitoring catches actual misbehavior, the honest agents benefit from the enforcement, and the overall system is more efficient because governance is doing useful work.

Takeaway: Don't over-govern. Governance should be proportional to risk, not maximal by default. This is why our presets differentiate: conservative for new/untrusted agents, balanced for most, aggressive for proven high-trust agents.


Finding 5: Transaction tax has a cliff at 5%

Phase transition above 5%. Effect size d = 1.18. 29 simulations.

We swept transaction tax rates from 0% to 15%. The system behaves normally up to about 5%. Above 5%, there's a sharp S-curve collapse — economic activity drops off a cliff.

This isn't a gradual degradation. It's a phase transition. At 4% tax, the system hums. At 6%, agents stop transacting because the overhead exceeds the marginal value of most interactions.
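A toy model shows why the transition is sharp rather than gradual: if agents transact only when post-tax value exceeds a fixed per-interaction overhead, and most interaction values sit close to that overhead, a small tax increase erases the entire margin at once. The numbers below are illustrative, not the SWARM economy:

```python
# Toy tax-cliff model (hypothetical numbers): an agent transacts only when
# the post-tax value of the interaction exceeds its fixed overhead cost.
def viable_transactions(values, cost, tax):
    return sum(1 for v in values if v * (1 - tax) - cost > 0)

# Interaction values clustered tightly just above the overhead cost.
values = [0.99 + 0.001 * i for i in range(21)]  # 0.99 .. 1.01
cost = 0.95                                      # fixed per-interaction overhead

at_4pct = viable_transactions(values, cost, 0.04)  # all 21 still viable
at_6pct = viable_transactions(values, cost, 0.06)  # none viable
```

Because the viability margin is thin everywhere, nearly all transactions cross the break-even line together — which is the phase-transition behavior we measured.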

Our balanced preset caps at exactly 5% based on this finding. Conservative preset runs at 10% (deliberately constrictive for new agents). Aggressive preset: 2%.

Takeaway: If you're charging agents per-transaction fees (or imposing per-call overhead of any kind), keep it under 5% of transaction value or you'll kill economic activity.


Finding 6: Sybil attacks still work

100% success rate against all configurations.

We tried everything. Reputation systems, staking requirements, behavioral analysis, identity verification. Sybil attacks — where a bad actor creates multiple fake identities — worked every time.

We're publishing this because we think the honest thing to do is tell you what doesn't work. Most governance platforms don't disclose their failure modes. We think that's a mistake.

Sybil resistance in decentralized multi-agent systems is an open problem. We're working on it. We haven't solved it. If someone tells you they have, ask for their simulation data.

Takeaway: Don't trust identity-based governance alone. Layer it with behavioral monitoring, budget caps, and circuit breakers that limit damage regardless of how many identities an attacker controls.
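That layering can be read as a damage bound: even when identity checks fail completely, per-identity budget caps and circuit breakers limit what K fake identities can extract before each one is frozen. A sketch with hypothetical names and numbers:

```python
# Worst-case sybil damage under layered defenses (illustrative, not an
# Agency-OS API). Each fake identity is bounded both by its budget cap and
# by how many violations it can commit before the circuit breaker freezes it.
def max_sybil_damage(num_identities, budget_cap,
                     violations_before_freeze, loss_per_violation):
    per_identity = min(budget_cap, violations_before_freeze * loss_per_violation)
    return num_identities * per_identity
```

The attacker still gets in, but the cost of the attack scales linearly with the number of identities instead of compounding — which is the best you can do until sybil resistance itself is solved.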


What we built from this

Every governance preset in Agency-OS derives directly from these 146 simulations:

Parameter                   Conservative    Balanced        Aggressive
Transaction tax             10%             5%              2%
Circuit breaker threshold   2 violations    3 violations    5 violations
Audit probability           25%             10%             5%
Collusion detection         On              Off (default)   Off
Min stake                   $50             None            None
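Expressed as configuration, the table above might look like this — the field names are illustrative, not the actual Agency-OS schema:

```python
# Governance presets from the table above (illustrative field names).
PRESETS = {
    "conservative": {"transaction_tax": 0.10, "circuit_breaker_threshold": 2,
                     "audit_probability": 0.25, "collusion_detection": True,
                     "min_stake": 50},
    "balanced":     {"transaction_tax": 0.05, "circuit_breaker_threshold": 3,
                     "audit_probability": 0.10, "collusion_detection": False,
                     "min_stake": 0},
    "aggressive":   {"transaction_tax": 0.02, "circuit_breaker_threshold": 5,
                     "audit_probability": 0.05, "collusion_detection": False,
                     "min_stake": 0},
}
```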

These aren't opinions. They're the result of systematically breaking multi-agent systems and measuring what fixes them.

If you're running multi-agent systems in production — or planning to — and you want governance that's evidence-backed instead of guessed, we're offering early access to a small group of teams.

Get early access →


Full research: 84 claims with evidence chains at swarm-ai.org. All effect sizes are reproducible via SWARM CLI tools.