
The old laws still apply

Agent swarms still obey Amdahl's Law.

More agents can search wider, keep noisy work separate, and run independent checks. Someone still has to merge the work, judge the result, and own the answer.

Eighth issue | May 6, 2026 | By Marlon McKinnie

Agent swarms don't repeal the old laws of software engineering. They make it easier to hit those limits quickly. Parallel agents are useful, but they're a concurrency tool. More workers don't make shared state or the serial path disappear.

What scales: Breadth

Agents can search, compare, test, and summarize separate branches at once.

What leaks: State

Prompts, repo context, permissions, assumptions, and edits can cross boundaries.

What caps speed: Reducer

One path still has to merge the work, verify behavior, and own the answer.

[Illustration: a newspaper-style technical drawing of many small AI agent machines feeding parallel work streams into one narrow review gate.]

**Plate I: The reducer queue.** Many agents can explore and prepare candidate work at once. The queue narrows when one result has to be chosen, merged, checked, and owned.

01 Fan out

Split only the work that can run independently.

02 Isolate

Keep noisy logs, searches, and candidate edits out of the main thread.

03 Reduce

Collapse findings into one coherent decision, not a pile of summaries.

04 Verify

Trust checks, tests, and evidence before trusting agent confidence.

The pitch is simple: if one agent helps, several should help more. One inspects the repo. Another writes tests. Two more patch the backend and frontend. A fifth reviews the result. Maybe the job finishes in a fraction of the time.

Sometimes it does. In software, though, producing more work often isn't the slow part. The slow part is deciding what should exist, keeping shared state coherent, checking behavior, and turning partial answers into one answer someone will stand behind.

LLM agents changed the workers. They didn't remove the coordination math.

Vendor docs put limits on parallelism.

The vendor docs don't say "spawn as many workers as possible." They describe isolation, supervision, and the points where parallel work starts adding cost.

OpenAI's Codex subagents documentation uses subagents to keep noisy work out of the main thread. It names exploration, tests, triage, and summarization as good fits. It warns that concurrent code changes bring conflicts and coordination overhead.

OpenAI's Codex app announcement describes multiple agents in separate threads and isolated worktrees, plus long-running tasks and a UI for supervision. Parallel work still has to come back in a form a person can review.

Anthropic's multi-agent research-system writeup describes a lead agent sending broad research branches to subagents with separate context windows, then collecting compressed findings. The post also says that multi-agent systems use many more tokens, that they make economic sense only when the task value covers that cost, and that most coding tasks have fewer truly parallel pieces than research does.

Anthropic's Claude Code subagents docs spell out another cost. Subagents get focused instructions and their own context, but they may add latency because they start clean and have to gather context.

Both vendors build supervision and isolation into the product because coordination isn't free.

More attempts help when the judge is clear.

Multiple model samples, agents, or reasoning paths can improve results when the answer can be judged cleanly. The paper "More Agents Is All You Need" found that a simple sampling-and-voting setup improved performance as agents were added, with the size of the benefit tied to task difficulty.
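The voting half of that setup can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `majorityVote` is a hypothetical helper, and the five sample answers are made up.

```javascript
// Hypothetical helper: return the most common answer among independent samples.
// Ties go to the answer that appeared first.
function majorityVote(answers) {
  const counts = new Map();
  for (const answer of answers) {
    counts.set(answer, (counts.get(answer) ?? 0) + 1);
  }
  let winner = null;
  let best = 0;
  // Map iterates in insertion order, so a strict ">" keeps the earliest answer on ties.
  for (const [answer, count] of counts) {
    if (count > best) {
      winner = answer;
      best = count;
    }
  }
  return { winner, votes: best, total: answers.length };
}

// Made-up samples from five agents answering the same question.
const voteResult = majorityVote(["42", "41", "42", "42", "40"]);
console.log(`${voteResult.winner} won with ${voteResult.votes} of ${voteResult.total} votes`);
```

Voting only works when answers can be compared for equality, which is one form of a clean judge.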

A search-heavy problem with a clean judge can use more attempts. Ten candidate fixes are useful if a reliable test suite can pick the winner. Several agents can also inspect different logs, docs, services, or versions and return separate findings.

The clean judge is what makes that work. Without it, ten agents give you ten more things to reason about.
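A clean judge can be as small as a list of pass/fail checks. The sketch below assumes each candidate fix already carries its check results. `pickWinner`, the candidate fields, and the "smallest change wins" tie-break are all hypothetical choices for illustration.

```javascript
// Hypothetical reducer step: keep only candidates that pass every check,
// then prefer the smallest change among the survivors.
function pickWinner(candidates, checks) {
  const passing = candidates.filter((candidate) =>
    checks.every((check) => check(candidate))
  );
  if (passing.length === 0) return null; // no candidate survives the judge
  return passing.reduce((a, b) => (b.linesChanged < a.linesChanged ? b : a));
}

// Made-up candidates; in practice the flags would come from real test and lint runs.
const candidates = [
  { id: "fix-a", testsPassed: true, lintClean: true, linesChanged: 40 },
  { id: "fix-b", testsPassed: false, lintClean: true, linesChanged: 5 },
  { id: "fix-c", testsPassed: true, lintClean: true, linesChanged: 12 },
];
const checks = [(c) => c.testsPassed, (c) => c.lintClean];

const winner = pickWinner(candidates, checks);
console.log(winner ? winner.id : "no passing candidate"); // fix-c
```

Without the `checks` array there is nothing to filter on, and the reducer is back to reading every candidate by hand.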

Exhibit A: The fit check

Parallel agents fit work with a natural split and a clear check. They struggle when every worker needs the same changing state.

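// Where parallel agents tend to fit, where they tend to hurt,
// and the serial work that remains either way.
const swarmFit = {
  // Natural split, clear check.
  strong: [
    "repo exploration",
    "log triage",
    "independent test runs",
    "source comparison",
    "first-pass research"
  ],
  // Every worker needs the same changing state.
  risky: [
    "shared API redesign",
    "cross-cutting refactors",
    "ambiguous product calls",
    "stateful migrations"
  ],
  // Serial steps the reducer still owns.
  reducerCost: [
    "choose the path",
    "merge the work",
    "verify behavior",
    "own the final answer"
  ]
};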

The old limits are easy to spot.

Amdahl's Law

The part that can't run in parallel caps the speedup. For agents, synthesis, review, merge, and final validation often stay serial.
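As a worked example, Amdahl's formula is speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction of the job and n is the number of workers. The 60/40 split below is an assumed figure for illustration, not a measurement.

```javascript
// Amdahl's Law: speedup = 1 / ((1 - p) + p / n)
// p = fraction of the work that can run in parallel, n = number of workers.
function amdahlSpeedup(parallelFraction, workers) {
  if (parallelFraction < 0 || parallelFraction > 1) {
    throw new RangeError("parallelFraction must be between 0 and 1");
  }
  if (workers < 1) {
    throw new RangeError("workers must be at least 1");
  }
  return 1 / ((1 - parallelFraction) + parallelFraction / workers);
}

// Assumed split: 60% of the job fans out, 40% (merge, review, validation) stays serial.
for (const n of [1, 2, 5, 10, 100]) {
  console.log(`${n} agents: ${amdahlSpeedup(0.6, n).toFixed(2)}x`);
}
// Prints 1.00x, 1.43x, 1.92x, 2.17x, 2.46x.
// The ceiling is 1 / (1 - 0.6) = 2.5x no matter how many agents are added.
```

Under that assumption, going from 10 agents to 100 buys less than a third of one extra multiple.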

Gustafson's Law

Agents gain room when the problem gets broader. They can cover more useful search space, but coupled work doesn't become independent.
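Gustafson's scaled speedup is (1 - p) + p * n, where p is the share of run time spent in the parallel part on the n-worker run. The sketch assumes the problem grows with the worker count, and the 60% figure is illustrative.

```javascript
// Gustafson's Law: scaled speedup = (1 - p) + p * n
// p = share of the n-worker run spent in parallel work, n = number of workers.
function gustafsonSpeedup(parallelFraction, workers) {
  if (parallelFraction < 0 || parallelFraction > 1) {
    throw new RangeError("parallelFraction must be between 0 and 1");
  }
  if (workers < 1) {
    throw new RangeError("workers must be at least 1");
  }
  return (1 - parallelFraction) + parallelFraction * workers;
}

// Assumed: 60% of the run is parallel search across more branches.
for (const n of [1, 5, 10]) {
  console.log(`${n} agents: ${gustafsonSpeedup(0.6, n).toFixed(1)}x`);
}
// Prints 1.0x, 3.4x, 6.4x.
```

The gain comes from covering more ground in the same wall-clock time. The serial 40% is still there, and the formula says nothing about work that is coupled.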

Brooks's Law

More workers bring more communication and review. Agents still need task boundaries, context, permissions, conventions, tests, and a handoff format.
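The communication cost can be made concrete with the pairwise-channel count from The Mythical Man-Month, n(n - 1) / 2. The hub count below is an added comparison for a lead-plus-workers layout, where every worker reports only to the lead. Both function names are hypothetical.

```javascript
// Channels if every worker may need to talk to every other worker.
function pairwiseChannels(workers) {
  return (workers * (workers - 1)) / 2;
}

// Channels if one lead talks to each of the other workers and they never talk to each other.
function hubChannels(workers) {
  return Math.max(workers - 1, 0);
}

for (const n of [2, 5, 10, 20]) {
  console.log(`${n} workers: ${pairwiseChannels(n)} pairwise, ${hubChannels(n)} via a lead`);
}
// 2 workers: 1 pairwise, 1 via a lead
// 5 workers: 10 pairwise, 4 via a lead
// 10 workers: 45 pairwise, 9 via a lead
// 20 workers: 190 pairwise, 19 via a lead
```

A hub keeps the channel count linear, but it moves the load onto the lead rather than removing it.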

Little's Law

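In a stable queue, average work in progress equals arrival rate times average time in the system. When agents produce candidates faster than the reducer can decide on them, work piles up at the reducer and every candidate waits longer. Producing candidates gets easy. Accepting or rejecting them becomes the bottleneck.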
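Little's Law is L = λW: average items in the system equal the arrival rate times the average time each item spends there. The sketch pairs that with a simple backlog estimate for the unstable case. All rates are assumed numbers, and both function names are hypothetical.

```javascript
// Little's Law for a stable queue: items in system = arrival rate * time in system.
function itemsInSystem(arrivalRatePerHour, hoursInSystem) {
  return arrivalRatePerHour * hoursInSystem;
}

// If candidates arrive faster than they are reviewed, the queue is not stable
// and the backlog grows linearly with time.
function backlogAfter(hours, arrivalRatePerHour, reviewRatePerHour) {
  return Math.max(0, (arrivalRatePerHour - reviewRatePerHour) * hours);
}

// Assumed: 4 candidates per hour, each taking 1.5 hours from submission to decision.
console.log(itemsInSystem(4, 1.5)); // 6 candidates in flight on average

// Assumed: agents produce 12 candidates per hour, the reducer decides on 4 per hour.
console.log(backlogAfter(8, 12, 4)); // 64 undecided candidates after an 8-hour day
```

Adding agents raises the arrival rate only. The backlog shrinks only when the review rate goes up or fewer candidates are sent.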

Goodhart's Law

When a measure becomes a target, it stops being a good measure. Agent count, PR count, and benchmark pass rate can all rise, and autonomy theater can look more impressive, while product judgment gets worse.

Software has shared state.

Research often splits cleanly. One agent searches vendor docs. Another searches papers. A third checks pricing. The lead compresses the findings into a report. Those subagents usually aren't changing the same source of truth while they work.

Software is usually more coupled. A backend change reaches the frontend. A schema change reaches tests, migrations, docs, and customer behavior. A product decision changes copy, permissions, analytics, support expectations, and rollout order. Separate files can still share one design.

Write-heavy swarms can be locally right and globally wrong. One agent changes the API. Another writes UI against the old shape. A third updates tests around another reading. Each agent can reason well and still leave pieces that don't agree.

Single-agent benchmarks miss coordination and competition between agents, which is why MultiAgentBench tests those dynamics. Multi-SWE-bench broadens software-agent evaluation beyond mostly Python issue resolution. The failure taxonomy in "Why Do Multi-Agent LLM Systems Fail?" names three categories: system design, inter-agent misalignment, and task verification. Those failures happen in the coordination layer.

Field note

Translation catalogs were a good split.

The ArchiBot Console globalization work had many locale catalogs and a lot of repeated review. Each catalog could move alone, but the product voice still had to hold together.

Batch: 3 + 5 + 4

Early RTL catalogs, western European catalogs, then CJK and Korean.

Split: 1 per locale

Worker agents reviewed isolated catalog files instead of fighting over one change.

Reducer: Checks

i18n validation, build, persona smoke tests, and a final human cleanup pass.

Why that parallel job worked

Locale catalogs are naturally parallel. French, German, Portuguese, Italian, Dutch, Japanese, Korean, Simplified Chinese, and Traditional Chinese can be reviewed in separate worktrees and separate agent contexts without each worker needing to redesign the product. I kept the manual sweep, western-locale batch, and CJK/Korean batch isolated so branch state stayed clean while the agents worked one catalog at a time. The ownership boundary is obvious: this agent owns this catalog, this batch, and this review pass.

The agents still over-touched shared material. Automated translation tried to localize CSS utility classes, reserved example domains, provider hostnames, credential labels, and acronym-heavy UI copy. Protected literals and stronger i18n checks caught that. A build, smoke tests, and one final pass separated product language from code-adjacent material.
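A protected-literal check of that kind can be sketched as follows. This is not the project's actual validator: `findBrokenLiterals`, the literal list, and the sample strings are all made up for illustration.

```javascript
// Hypothetical check: every protected literal that appears in the English source
// must appear unchanged in the translated string.
function findBrokenLiterals(source, translated, protectedLiterals) {
  return protectedLiterals.filter(
    (literal) => source.includes(literal) && !translated.includes(literal)
  );
}

// Illustrative literals: a reserved example domain, an acronym, a CSS utility class.
const protectedLiterals = ["example.com", "API", "btn-primary"];
const source = "Send the API key to example.com";

const goodTranslation = "Envoyez la clé API à example.com";
const badTranslation = "Envoyez la clé IPA à exemple.com";

console.log(findBrokenLiterals(source, goodTranslation, protectedLiterals)); // []
console.log(findBrokenLiterals(source, badTranslation, protectedLiterals));
// [ 'example.com', 'API' ]
```

A check like this is deterministic, so it can run on every catalog before merge without another agent having to judge the translation.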

Rule

The more parallel the candidate work became, the more the shared contract mattered: stable English source, protected technical strings, one catalog per worker, and deterministic checks before merge.

I use extra agents as sidecars.

One lead thread owns the problem. Sidecars get bounded work that won't block the next decision: inspect this module, compare two approaches, run this test slice, summarize these docs, audit this narrow risk, or prepare one candidate patch.

A sidecar should return evidence, not vibes: file paths, commands, test output, exact assumptions, and a clear "I did not check this" section. File changes stay small and disjoint. Exploration comes back short enough to use without dragging all the noise into the lead thread.
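One way to enforce "evidence, not vibes" is to reject reports that skip a section. The report shape below is a hypothetical one built from the list above. The field names, paths, and commands are illustrative and do not come from any tool.

```javascript
// Hypothetical report fields, one per kind of evidence a sidecar should return.
const REQUIRED_FIELDS = ["filePaths", "commands", "testOutput", "assumptions", "notChecked"];

// Returns the names of fields that are absent or empty.
// "notChecked" may be an empty list, but it has to be stated explicitly.
function missingEvidence(report) {
  return REQUIRED_FIELDS.filter((field) => {
    const value = report[field];
    if (field === "notChecked") return !Array.isArray(value);
    if (Array.isArray(value) || typeof value === "string") return value.length === 0;
    return true; // absent or wrong type
  });
}

// Made-up example report.
const report = {
  filePaths: ["src/auth/session.js"],
  commands: ["npm test -- session"],
  testOutput: "12 passed, 0 failed",
  assumptions: ["token refresh is out of scope"],
  notChecked: ["behavior behind the feature flag"],
};

console.log(missingEvidence(report)); // []
console.log(missingEvidence({ testOutput: "looks fine" }));
// [ 'filePaths', 'commands', 'assumptions', 'notChecked' ]
```

The lead can run this before reading anything, so incomplete reports go back to the sidecar instead of into the lead thread.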

The reducer chooses a path, merges state, runs checks, catches contradictions, and decides whether the result can ship. Demos tend to stop before this part. Without it, a swarm is parallel note taking with side effects.

Exhibit B: The contract

The lead agent owns the consistency boundary for the parallel work.

agent_swarm_contract:
  lead:
    owns: decision, integration, final verification
  sidecars:
    own: bounded subtasks with disjoint context
    output: evidence, patch, test result, caveat
  shared_state:
    default: read-only
    writes: isolated until reviewed
  done:
    requires: deterministic checks plus one responsible reducer

Use the split you can verify.

Use more agents for independent branches, expensive search, noisy logs, broad comparisons, or work with a clean oracle. Keep one strong agent on coherent product judgment, shared architectural state, or a long chain of dependent decisions.

I don't expect useful software work to look like a thousand agents chatting with each other. It's more likely to use familiar systems controls: isolated worktrees, explicit ownership, scoped permissions, artifact contracts, deterministic checks, queues, traces, review surfaces, and one place where responsibility lands.

Without those controls, the extra workers just fill the review queue.

AI agents are already capable enough to recreate the coordination problems we were trying to automate away.