Aegis — the autonomous on-call war room on Band

Simulate an incident // live, in your browser


Press Simulate incident — five agents open a war room and resolve a SEV1 live. Watch the validator reject the first fix, then pass the revised one.

When Band keys are configured on the server, this runs the genuine reactive coordination through Band (real @mention handoffs, including a recruited security agent); otherwise it streams the deterministic offline cascade. Either way, the run always completes.

The problem // on-call is the worst seat in engineering

A 3am page. A scramble to assemble a war room. ~42 minutes of mean-time-to-resolve while revenue bleeds and one tired person guesses under pressure — scale it? roll it back? fail over? — with no way to prove the fix before touching prod.

Pages don't reason

An alert says something broke. It can't diagnose, can't argue, can't disprove a bad fix.

Alert fatigue

Hundreds of alerts, mostly noise. The signal that matters gets lost; responders burn out.

Downtime is money

Every minute of downtime costs revenue and trust. At 42 minutes × revenue-per-minute, a single incident is a five-figure hit.

The solution // a 5-agent war room coordinating through Band

@observer

Detect

Z-score anomaly detection; correlates the spike with a deploy and opens the room.

@diagnostician

Root cause

Confidence-scored hypothesis from the evidence — "memory leak from v2.3.1".

@remediator

Propose

Proposes a fix — and revises when the validator shoots the first one down.

@validator

Disprove

Runs a chaos replay and holds a veto; rejects fixes that still breach SLO.

@commander

Gate & execute

Recruits a security sign-off, gates on a human, executes, files the postmortem.
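
The @observer's detect step can be sketched roughly as follows. This is a minimal illustration, not the project's actual detector: the trailing-window size, the 3-sigma threshold, the sample latencies, the deploy timeline, and the function names are all assumptions made for the example.

```python
from statistics import mean, stdev

def zscore_anomalies(samples, window=5, threshold=3.0):
    """Return indices whose value sits more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        sigma = stdev(baseline)
        if sigma == 0:
            continue  # flat baseline: the z-score is undefined, skip it
        z = (samples[i] - mean(baseline)) / sigma
        if abs(z) > threshold:
            anomalies.append(i)
    return anomalies

def latest_deploy_before(index, deploys):
    """Return the most recent deploy at or before `index`, or None."""
    prior = [i for i in deploys if i <= index]
    return deploys[max(prior)] if prior else None

# Hypothetical p99 latency samples (ms) and a hypothetical deploy timeline.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 97, 450, 460]
deploys = {2: "v2.3.0", 6: "v2.3.1"}

spikes = zscore_anomalies(latency_ms)
suspect = latest_deploy_before(spikes[0], deploys) if spikes else None
print(spikes, suspect)
```

Only the first spike sample is flagged here, because once the spike enters the trailing window it inflates the baseline's standard deviation. A production detector would likely freeze or robustify the baseline during an incident.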

The reject-then-fix beat is the differentiator. One agent disproves another's fix mid-incident with evidence it generated itself — a chaos replay. That argument physically happens inside Band, over real @mentions. When the fix is irreversible, the commander recruits a security specialist into the room at runtime for a risk sign-off — a new collaborator joining a live incident, which a fixed pipeline can't do. Then exactly one human approves.
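
The reject-then-fix flow described above might look something like this in miniature. Everything here is a hedged sketch: the `Fix` fields, the 300 ms SLO, the projected latencies, and the `chaos_replay` stand-in are hypothetical, and a real replay would exercise the system rather than read a precomputed number.

```python
from dataclasses import dataclass

SLO_P99_MS = 300.0  # hypothetical SLO threshold for the example

@dataclass
class Fix:
    name: str
    projected_p99_ms: float  # what the (simulated) replay observes with the fix applied
    reversible: bool

def chaos_replay(fix):
    """Stand-in for a chaos replay: returns (passed, evidence string)."""
    passed = fix.projected_p99_ms <= SLO_P99_MS
    evidence = f"{fix.name}: p99={fix.projected_p99_ms:.0f}ms vs SLO {SLO_P99_MS:.0f}ms"
    return passed, evidence

def run_war_room(candidates, human_approves):
    """Try fixes in order; the validator vetoes any that still breach SLO."""
    transcript = []
    for fix in candidates:
        passed, evidence = chaos_replay(fix)
        if not passed:
            transcript.append(f"@validator REJECT {evidence}")
            continue  # the remediator's next (revised) proposal is tried
        transcript.append(f"@validator PASS {evidence}")
        if not fix.reversible:
            # Irreversible fixes need an extra specialist in the room.
            transcript.append("@commander recruits @security for risk sign-off")
        verdict = "EXECUTE" if human_approves(fix) else "HOLD"
        transcript.append(f"@commander {verdict} {fix.name}")
        return transcript
    transcript.append("@commander ESCALATE no candidate passed replay")
    return transcript

proposals = [Fix("scale-out", 520.0, True), Fix("rollback", 180.0, False)]
transcript = run_war_room(proposals, human_approves=lambda fix: True)
for line in transcript:
    print(line)
```

The key design point is that the veto is backed by evidence the validator produced itself, and that the human gate runs exactly once, after a fix has already survived the replay.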

The numbers // business value in one screen

MTTR 42m → ~1.5m

A 28× collapse vs a ~42-minute manual SEV1 — modeled from the steps, deterministic.

~$38k averted

Downtime cost avoided per incident, quantified from the service's revenue-at-risk.

$35 to fix

Remediation cost. The economics aren't close — and they're computed, not asserted.
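
As a sanity check, the three headline figures can be related with a few lines of arithmetic. The inputs are the page's own numbers; the formula (averted cost = minutes saved × revenue per minute) is an assumption about how the model works, and the implied revenue-per-minute rate is backed out from it rather than stated anywhere on the page.

```python
# Figures quoted on the page.
MANUAL_MTTR_MIN = 42.0
AEGIS_MTTR_MIN = 1.5
AVERTED_USD = 38_000.0
REMEDIATION_USD = 35.0

# MTTR improvement factor.
speedup = MANUAL_MTTR_MIN / AEGIS_MTTR_MIN

# Assumed model: averted cost = minutes saved * revenue per minute.
minutes_saved = MANUAL_MTTR_MIN - AEGIS_MTTR_MIN
implied_revenue_per_min = AVERTED_USD / minutes_saved

# Net benefit per incident after paying for the remediation itself.
net_benefit_usd = AVERTED_USD - REMEDIATION_USD

print(f"speedup: {speedup:.0f}x")
print(f"implied revenue at risk: ${implied_revenue_per_min:,.0f}/min")
print(f"net benefit: ${net_benefit_usd:,.0f}")
```

Under that assumed model, the quoted figures are mutually consistent with a service earning a little under $1,000 per minute.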

Watch the 3-minute demo // live in a Band room


How it's built // Band is the coordination layer, not a channel

Agents never talk directly. Every inter-agent message is a RoomMessage on a shared bus. Flip one config value and the exact same agent code runs in a real Band room, with agents reacting to each other's @mentions over Phoenix Channels, so every collaboration beat physically happens inside Band.
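
One way to picture the shared-bus design is the sketch below. Only the name `RoomMessage` comes from the text above; its fields, the `InMemoryBus` class, the `TRANSPORTS` registry, and the `transport` config key are hypothetical. A Band-backed transport is deliberately left out, since the band-sdk API is not shown here.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class RoomMessage:
    sender: str
    text: str

    @property
    def mentions(self):
        """Agent names addressed with an @mention in the message text."""
        return re.findall(r"@(\w+)", self.text)

class InMemoryBus:
    """Offline transport: delivers each message to the agents it @mentions."""
    def __init__(self):
        self.handlers = {}
        self.log = []

    def register(self, name, handler):
        self.handlers[name] = handler

    def publish(self, message):
        self.log.append(message)
        for name in message.mentions:
            handler = self.handlers.get(name)
            reply = handler(message) if handler else None
            if reply is not None:
                self.publish(reply)  # replies travel over the same bus

# The single config value selects the transport. A real Band transport
# would be registered here under another key (not implemented in this sketch).
TRANSPORTS = {"offline": InMemoryBus}
config = {"transport": "offline"}
bus = TRANSPORTS[config["transport"]]()

# Agents only see RoomMessages and only answer with RoomMessages.
bus.register("diagnostician", lambda m: RoomMessage(
    "diagnostician", "@remediator memory leak from v2.3.1"))
bus.register("remediator", lambda m: RoomMessage(
    "remediator", "proposing a fix"))

bus.publish(RoomMessage("observer", "@diagnostician latency spike after deploy"))
for m in bus.log:
    print(f"{m.sender}: {m.text}")
```

Because agents depend only on `publish` and their handler signature, swapping the transport does not touch agent code, which is the property the text above relies on.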

⬡ Band

The shared agent room. Per-agent identities post and react over the band-sdk; the commander even recruits a new agent into the room mid-incident via Band's participant tools.

✦ Featherless AI

Powers @diagnostician & @validator (the skeptics) behind the CrewAI agents.

✦ AI/ML API

Powers @observer, @remediator & @commander behind the LangGraph agents.

Cross-framework (LangGraph ⇄ CrewAI ⇄ orchestrator) and cross-provider (AI/ML API + Featherless) by design.
