AEGIS.
Band of Agents Hackathon

The autonomous on-call war room
that runs inside Band.

Five specialist agents coordinate through Band to resolve a production incident — detect, diagnose, prove a fix against a chaos replay (reject-then-fix), get a security sign-off, take one human approval, and fail over. MTTR 42 min → ~1.5 min.

42m → ~1.5m
Mean time to resolve
~$38k
Downtime averted / incident
5+1
Agents · incl. recruited security
1
Human approval, the only manual gate

Simulate an incident // live, in your browser

Press Simulate incident — five agents open a war room and resolve a SEV1 live. Watch the validator reject the first fix, then pass the revised one.

When Band keys are configured on the server, this runs the genuine reactive coordination through Band (real @mention handoffs, including a recruited security agent). Otherwise it streams the deterministic offline cascade. Either way, the run always completes.

The problem // on-call is the worst seat in engineering

A 3am page. A scramble to assemble a war room. ~42 minutes of mean-time-to-resolve while revenue bleeds and one tired person guesses under pressure — scale it? roll it back? fail over? — with no way to prove the fix before touching prod.

Pages don't reason

An alert says something broke. It can't diagnose, can't argue, can't disprove a bad fix.

Alert fatigue

Hundreds of alerts, mostly noise. The signal that matters gets lost; responders burn out.

Downtime is money

Every minute of downtime costs revenue and trust. 42 minutes × revenue-per-minute is a five-figure hit.

The solution // a 5-agent war room coordinating through Band

@observer
Detect

z-score anomaly detection; correlates the spike to a deploy and opens the room.
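
As a rough illustration of the z-score idea, here is a minimal sketch that flags a sample when it sits more than a few standard deviations away from a trailing baseline. The function name, the window size, and the threshold are assumptions made for this example, not values taken from Aegis.

```python
from statistics import mean, stdev


def zscore_anomalies(samples, window=20, threshold=3.0):
    """Return indices of samples that deviate sharply from the trailing window.

    `window` and `threshold` are illustrative defaults, not Aegis settings.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: the z-score is undefined, so treat
            # any change at all as anomalous.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        z = (samples[i] - mu) / sigma
        if abs(z) > threshold:
            anomalies.append(i)
    return anomalies
```

Calling `zscore_anomalies(latencies_ms)` on a metric series returns the positions of the spikes; correlating those positions with deploy timestamps would be a separate step.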

@diagnostician
Root cause

Confidence-scored hypothesis from the evidence — "memory leak from v2.3.1".

@remediator
Propose

Proposes a fix — and revises when the validator shoots the first one down.

@validator
Disprove

Runs a chaos replay and holds a veto; rejects fixes that still breach SLO.
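
The veto can be pictured as a simple check over replay results. The sketch below assumes the replay yields one success/failure outcome per replayed request and that the SLO is a maximum error rate; `Verdict`, `validate_fix`, and the 1% default are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    approved: bool
    error_rate: float
    reason: str


def validate_fix(replay_outcomes, slo_max_error_rate=0.01):
    """Approve a fix only if the replayed traffic stays within the SLO.

    replay_outcomes: list of bools, True when a replayed request succeeded
    with the candidate fix applied (an assumed shape for the replay data).
    """
    if not replay_outcomes:
        # No evidence means no approval.
        return Verdict(False, 1.0, "no replay evidence")
    failures = sum(1 for ok in replay_outcomes if not ok)
    error_rate = failures / len(replay_outcomes)
    if error_rate > slo_max_error_rate:
        reason = (
            f"replay error rate {error_rate:.1%} "
            f"breaches SLO {slo_max_error_rate:.1%}"
        )
        return Verdict(False, error_rate, reason)
    return Verdict(True, error_rate, "replay stays within SLO")
```

The `reason` string is what the validator would post back to the room, so the rejection carries its own evidence.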

@commander
Gate & execute

Recruits a security agent for sign-off, gates on one human approval, executes the fix, and files the postmortem.

The reject-then-fix beat is the differentiator. One agent disproves another's fix mid-incident with evidence it generated itself — a chaos replay. That argument physically happens inside Band, over real @mentions. When the fix is irreversible, the commander recruits a security specialist into the room at runtime for a risk sign-off — a new collaborator joining a live incident, which a fixed pipeline can't do. Then exactly one human approves.
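
The control flow described above can be sketched as a small loop: propose, validate, revise on rejection, then pass the security and human gates before executing. This is an assumed shape, not the Aegis implementation; `run_incident`, its callback parameters, the `irreversible` flag, and the event names are all hypothetical.

```python
def run_incident(propose, validate, security_signoff, human_approves, execute,
                 max_attempts=3):
    """Drive one incident through reject-then-fix and the approval gates.

    Returns the ordered list of event names so the flow can be inspected.
    """
    events = []
    feedback = None
    fix = None
    for _ in range(max_attempts):
        # The remediator sees the validator's last rejection, if any.
        fix = propose(feedback)
        events.append(f"proposed:{fix['name']}")
        ok, feedback = validate(fix)
        if ok:
            events.append("validated")
            break
        events.append("rejected")
    else:
        # No candidate survived validation within the attempt budget.
        events.append("escalated")
        return events

    # Only irreversible fixes need the recruited security specialist.
    if fix.get("irreversible"):
        if not security_signoff(fix):
            events.append("security_blocked")
            return events
        events.append("security_signed_off")

    # Exactly one human approval before anything touches production.
    if not human_approves(fix):
        events.append("human_declined")
        return events
    events.append("human_approved")
    execute(fix)
    events.append("executed")
    return events
```

Passing the agents in as plain callables keeps the loop independent of how each agent is implemented or hosted.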

The numbers // business value in one screen

MTTR 42m → ~1.5m

A 28× reduction versus a ~42-minute manual SEV1. The figure is modeled deterministically from the incident steps.

~$38k averted

Downtime cost avoided per incident, quantified from the service's revenue-at-risk.

$35 to fix

Remediation cost. The economics aren't close — and they're computed, not asserted.
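
The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the approximate numbers quoted on this page; the function name is hypothetical, and the implied revenue per minute assumes the averted cost equals revenue per minute multiplied by the minutes saved.

```python
def incident_economics(manual_mttr_min, automated_mttr_min,
                       downtime_averted_usd, remediation_cost_usd):
    """Derive ratios from the page's headline figures."""
    minutes_saved = manual_mttr_min - automated_mttr_min
    return {
        "speedup": manual_mttr_min / automated_mttr_min,
        "minutes_saved": minutes_saved,
        # Assumption: averted cost = revenue per minute * minutes saved.
        "implied_revenue_per_min": downtime_averted_usd / minutes_saved,
        "benefit_to_cost": downtime_averted_usd / remediation_cost_usd,
    }


figures = incident_economics(42.0, 1.5, 38_000, 35)
# speedup is 28.0, minutes_saved is 40.5,
# implied_revenue_per_min is about 938, benefit_to_cost is about 1086
```

Because the inputs are rounded ("~1.5 min", "~$38k"), the derived ratios are approximate as well.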

Watch the 3-minute demo // live in a Band room

DEMO VIDEO
Replace DEMO_VIDEO_URL in landing.html with your YouTube / Loom embed link

How it's built // Band is the coordination layer, not a channel

Agents never talk to each other directly. Every inter-agent message is a RoomMessage on a shared bus. Flip one config value and the exact same agent code runs in a real Band room, with agents reacting to each other's @mentions over Phoenix Channels, so every collaboration beat physically happens inside Band.
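
One way to picture this design is a transport interface with an offline implementation that routes messages by @mention. The sketch below is an assumption about the shape, not the Aegis source: only the `RoomMessage` name comes from the text above, while its fields, `Transport`, `InMemoryRoom`, `make_transport`, and the `"offline"` mode string are hypothetical. It deliberately makes no band-sdk calls.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class RoomMessage:
    sender: str
    text: str

    @property
    def mentions(self) -> list[str]:
        # Handles addressed in the text, e.g. "@validator" -> "validator".
        return re.findall(r"@(\w+)", self.text)


Handler = Callable[[RoomMessage], None]


class Transport(Protocol):
    def post(self, message: RoomMessage) -> None: ...
    def subscribe(self, handle: str, handler: Handler) -> None: ...


class InMemoryRoom:
    """Offline transport: delivers each message to the agents it @mentions."""

    def __init__(self) -> None:
        self.log: list[RoomMessage] = []
        self._handlers: dict[str, Handler] = {}

    def subscribe(self, handle: str, handler: Handler) -> None:
        self._handlers[handle] = handler

    def post(self, message: RoomMessage) -> None:
        self.log.append(message)
        for handle in message.mentions:
            handler = self._handlers.get(handle)
            if handler is not None:
                handler(message)


def make_transport(mode: str) -> Transport:
    # The single config value: agents only ever see the Transport interface.
    if mode == "offline":
        return InMemoryRoom()
    raise NotImplementedError("a live Band-backed transport would be wired in here")
```

Since agents depend only on `post` and `subscribe`, swapping the in-memory room for a live one would not require changes to agent code.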

⬡ Band

The shared agent room. Each agent posts and reacts under its own identity via the band-sdk; the commander even recruits a new agent into the room mid-incident via Band's participant tools.

✦ Featherless AI

Powers @diagnostician & @validator (the skeptics) behind the CrewAI agents.

✦ AI/ML API

Powers @observer, @remediator & @commander behind the LangGraph agents.

Cross-framework (LangGraph ⇄ CrewAI ⇄ orchestrator) and cross-provider (AI/ML API + Featherless) by design.

Sign in to Aegis

The incident demo above is public. Sign in to open the full console (incidents, jobs, history).

Demo-grade auth: passwords are PBKDF2-hashed (never stored in plaintext). Not production-hardened.
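
For readers unfamiliar with PBKDF2, here is a minimal sketch of salted hashing and verification using Python's standard library. The hash function (SHA-256), the salt length, the iteration count, and the function names are assumptions for illustration; the page does not state which parameters Aegis uses.

```python
from __future__ import annotations

import hashlib
import hmac
import os

ITERATIONS = 200_000  # illustrative work factor, not the Aegis setting


def hash_password(password: str, salt: bytes | None = None) -> tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is generated if none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

Only the salt and digest would be stored; the plaintext password is never persisted.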