Case Study · Modeled Scenario · Alarm Flood

The 1,200-alarm shift. One transmitter failure. Ninety minutes of pure noise.

A modeled-scenario walk-through of what happens when a single bad transmitter cascades into 1,200 alarms during a 90-minute window — and how alarm rationalization plus event-correlation analytics keep the operator informed instead of overwhelmed. EEMUA 191's nightmare scenario, demonstrated.

Aevus / Intrepid Logic · Modeled scenario · For operators · control-room supervisors · Published 2026-05-21
Modeled scenario. Composite operator, real failure pattern. Numbers modeled from realistic alarm-management research (EEMUA 191 cites the "alarm flood" scenario explicitly and gives industry benchmarks). Aevus is pre-revenue; the mitigation pattern below is what we'd ship, not a measured customer outcome.

The setup

"Cypress Refining" runs a 95,000 barrel-per-day refinery in the Houston Ship Channel. Their DCS is Honeywell Experion. Their HMI follows ISA-101 design principles for the newer screens — but the alarm system was rationalized in 2018 by an outside consultant and hasn't been touched since. The control room has six operator consoles, with one operator covering 4-5 process units depending on shift.

Operator Sam covers the crude distillation unit (CDU) on the 6 PM-6 AM shift. CDU has roughly 2,800 alarms configured. EEMUA 191's recommended ceiling for a single operator is 6 per hour, average. Cypress's measured average for CDU is 14 alarms per hour — over twice EEMUA's recommendation, and Sam has long since habituated to it.

2,800
Configured alarms on CDU
14/hr
Avg alarms per operator (vs EEMUA 6)
1
Operator covering CDU
8
Years since alarm rationalization

What happened

At 11:47 PM, a pressure transmitter on the CDU overhead system began rapidly oscillating between zero and full-scale. The transmitter had developed an internal connection fault — vibration from a nearby pump had finally worn through a wire bond. The instrument was reporting noise, not pressure.

The transmitter feeds 87 different alarm conditions across the DCS: direct alarms (low pressure, high pressure, deviation, rate-of-change), interlocked alarms (high-pressure permissive on downstream equipment), and computed alarms (calculated columns that use pressure as an input). Every spike from the failed transmitter raised some subset of those 87 alarms. Every drop cleared a different subset. Most cleared on acknowledgment, then re-raised on the next sample.
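To make the raise/clear/re-raise cycle concrete, here is a minimal sketch of how one oscillating input turns into repeated alarm raises. The tag suffixes (`PT-4203.HI`, `PT-4203.LO`), the trip points, and the ten-sample signal are illustrative assumptions, not values from Cypress's configuration.

```python
from dataclasses import dataclass

@dataclass
class AlarmCondition:
    tag: str      # hypothetical alarm tag
    kind: str     # "high" or "low"
    limit: float  # trip point, percent of span (assumed)

def count_raises(samples, conditions):
    """Count raise events (inactive -> active transitions) per condition."""
    active = {c.tag: False for c in conditions}
    raises = {c.tag: 0 for c in conditions}
    for value in samples:
        for c in conditions:
            tripped = value > c.limit if c.kind == "high" else value < c.limit
            if tripped and not active[c.tag]:
                raises[c.tag] += 1
            active[c.tag] = tripped
    return raises

# A failed transmitter swinging between zero and full scale.
samples = [0.0, 100.0] * 5
conditions = [
    AlarmCondition("PT-4203.HI", "high", 90.0),
    AlarmCondition("PT-4203.LO", "low", 10.0),
]
raises = count_raises(samples, conditions)
total = sum(raises.values())
```

Two conditions and ten samples already yield ten raise events. With 87 conditions hanging off the same input, the count scales accordingly.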

Over 90 minutes, 1,217 alarms appeared on Sam's screen. None of them were wrong — the transmitter was reporting wildly varying values. Each alarm had a valid configured condition that was being correctly triggered. The system did exactly what the engineers programmed it to do.

What the system didn't do was tell Sam: "all of these alarms originate from a single failed instrument — silence them, send a maintenance ticket on PT-4203, and proceed with normal operations."

The shift, hour by hour

23:47
First alarm. PT-4203 low-pressure P1 critical. Sam acknowledges. Standard procedure: confirm reading on a redundant transmitter (there isn't one on this loop), check downstream flow (looks normal), open maintenance ticket. Halfway through, the alarm re-raises.
23:52
34 alarms in 5 minutes. Sam stops acknowledging individually and starts clicking "acknowledge all" repeatedly. The HMI's alarm panel scrolls past what fits on screen.
00:10
Sam calls the control-room supervisor. The supervisor agrees PT-4203 is suspect. They manually mask the alarm from the DCS workstation, but the masking has to be done tag-by-tag on each downstream alarm condition, not just on the source instrument. Sam doesn't know which 87 alarms are involved; he masks the obvious 12.
00:30
525 alarms total. The 75 unmasked derived alarms are still raising continuously. Sam is firefighting acknowledgments while trying to monitor the rest of CDU. A separate genuine alarm appears — a sticking control valve on a different stream — and is lost in the noise.
00:58
Field tech arrives. Confirms PT-4203 has failed. Pulls it from the loop. Alarms stop multiplying. Sam begins working through the remaining queue of 600+ unacknowledged alarms.
01:17
1,217 total alarms. Final count when the queue stops growing. Sam is now spending the rest of the shift doing alarm cleanup instead of operating CDU. The genuinely concerning sticking-valve alarm from 00:30 has been unanswered for 47 minutes.
06:00
Day shift relief. The sticking-valve issue is found during shift handover — it's been getting progressively worse for hours. A controlled trim adjustment fixes it. Day shift handles it normally. No actual incident — but it could have been one.

The damage

No regulatory incident. No equipment failure. No injury. The "damage" was operator cognitive overload and a near-miss on the sticking valve. But by EEMUA 191 standards, this was an explicit operating-procedure failure:

1,217
Alarms in 90 minutes
811
Alarms per hour (average over the 90-minute flood)
135×
Over EEMUA's 6/hr ceiling
47 min
Unacknowledged genuine alarm (sticking valve)
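The rate figures in these tiles follow directly from the scenario's own numbers. A short worked calculation, using only values stated in this case study:

```python
total_alarms = 1217        # alarms during the flood
window_minutes = 90        # 23:47 to 01:17
eemua_avg_per_hour = 6     # EEMUA 191 average ceiling cited above
baseline_per_hour = 14     # Cypress's measured CDU average

flood_rate = total_alarms / (window_minutes / 60)      # alarms per hour
flood_multiple = flood_rate / eemua_avg_per_hour
baseline_multiple = baseline_per_hour / eemua_avg_per_hour

print(round(flood_rate), round(flood_multiple), round(baseline_multiple, 1))
# 811 135 2.3
```

Note that 811 per hour is the rate averaged over the whole 90-minute window; shorter bursts inside the window could run higher or lower.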
"An alarm flood doesn't show up in the incident log. It shows up in the operator's ability to handle the next real thing that goes wrong. The cost is invisible until the day a real emergency happens during a flood, and that day, the cost is catastrophic." (EEMUA 191 commentary, paraphrased)

What Aevus would have done

23:50 — root-cause clustering

Aevus's alarm-correlation engine continuously tracks which alarm conditions share input dependencies. Within 3 minutes of the first PT-4203-derived alarm, Aevus would have identified that all the new alarms in the flood share PT-4203 as a direct or indirect input. The system would have surfaced a single Severity-2 advisory: "87 alarm conditions are derivatives of PT-4203, which is currently exhibiting high-frequency oscillation consistent with instrument failure. Recommend isolating PT-4203 and silencing derivative alarms via prepared bypass."
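One way to picture root-cause clustering is as a walk over an input-dependency map. The sketch below is illustrative only and is not Aevus's implementation: the map contents, the tags other than AL-87432 and PT-4203, and the `min_share` threshold are all assumptions.

```python
from collections import Counter

# Hypothetical dependency map: tag -> direct inputs (instruments or computed tags).
DEPENDS_ON = {
    "AL-87432": ["CALC-OVHD-DP"],
    "CALC-OVHD-DP": ["PT-4203", "PT-4210"],
    "AL-10001": ["PT-4203"],
    "AL-20417": ["FV-3301"],
}

def root_inputs(tag, depends_on):
    """Return the leaf instruments a tag ultimately depends on."""
    seen, stack, roots = set(), [tag], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        inputs = depends_on.get(node, [])
        if not inputs:
            roots.add(node)  # no further inputs: treat as a source instrument
        stack.extend(inputs)
    return roots

def suspect_root(active_alarms, depends_on, min_share=0.8):
    """Return the instrument shared by at least min_share of active alarms, else None."""
    counts = Counter()
    for alarm in active_alarms:
        counts.update(root_inputs(alarm, depends_on))
    if not counts:
        return None
    tag, hits = counts.most_common(1)[0]
    return tag if hits / len(active_alarms) >= min_share else None

flood = ["AL-87432", "AL-10001"]
suspect = suspect_root(flood, DEPENDS_ON)
```

A real engine would also need evidence that the shared input is misbehaving (for example, the oscillation check mentioned in the advisory) before recommending isolation; shared ancestry alone is only a hint.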

23:55 — automatic alarm-group suggestion

Aevus doesn't (and architecturally cannot) mute the actual DCS alarms — that boundary lives in IL-9000. But the platform would have presented Sam with a one-click recommended-bypass list, scoped to the 87 derivative alarms identified, with the non-derivative alarms (i.e., the rest of CDU) explicitly preserved.
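A recommended-bypass list of this kind amounts to a partition of the configured alarms: those derived from the suspect instrument, and everything else. The following is a hedged sketch; the function name, the `roots_by_alarm` mapping, and the example tags other than PT-4203 are hypothetical, and the output is a recommendation record only, with nothing written to the DCS.

```python
def build_bypass_recommendation(configured_alarms, roots_by_alarm, source):
    """Split alarms into a suggested bypass list and an explicitly preserved list."""
    bypass, preserved = [], []
    for tag in configured_alarms:
        if source in roots_by_alarm.get(tag, set()):
            bypass.append(tag)
        else:
            preserved.append(tag)
    return {
        "source": source,
        "bypass": bypass,
        "preserved": preserved,
        # The operator applies (or rejects) the bypass in the DCS; this is advice only.
        "requires_operator_confirmation": True,
    }

# Hypothetical mapping: alarm tag -> source instruments it derives from.
roots_by_alarm = {
    "AL-87432": {"PT-4203", "PT-4210"},
    "AL-10001": {"PT-4203"},
    "AL-20417": {"FV-3301"},
}
rec = build_bypass_recommendation(list(roots_by_alarm), roots_by_alarm, "PT-4203")
```

Keeping the preserved list explicit is the point of the design: the operator can see at a glance that alarms outside the PT-4203 family stay live.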

00:30 — the sticking valve gets surfaced

With the PT-4203 derivative alarms aggregated into a single suppressed group, the sticking-valve alarm at 00:30 would have appeared on Sam's screen in normal-condition prominence — not buried in a queue of 525 unrelated alarms. Sam would have addressed it in real time.

"The genuinely scary part of an alarm flood isn't the flood. It's the alarm that comes in during the flood and gets missed. Our job is to keep that signal visible." (Aevus design rationale on alarm rationalization)

Why the conventional system couldn't do this

  1. Alarm derivation isn't introspectable. Cypress's DCS knows that alarm AL-87432 exists. At alarm time, it doesn't show that AL-87432 is computed from PT-4203's value. That dependency graph lives in the engineering team's heads (and in the DCS configuration database) but is not surfaced to the operator when the alarm fires.
  2. Alarm masking is per-tag and manual. Sam knows how to mask one alarm. He can't mask "all 87 alarms derived from PT-4203" in one action, because that action doesn't exist in the DCS.
  3. The dependency map is the difference. Aevus computes the alarm-dependency map from the DCS configuration and the historical correlation of which alarms tend to fire together. With that map, root-cause clustering becomes a one-step operation. Without it, every flood is firefighting.
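The "historical correlation" half of that map can be approximated by counting which alarm tags raise inside the same time window. This is a simplified sketch under stated assumptions: fixed (non-sliding) 60-second windows, an event log of `(timestamp_seconds, tag)` tuples, and placeholder tags. A production system would likely use a more careful statistic.

```python
from collections import defaultdict
from itertools import combinations

def co_occurrence(events, window_s=60):
    """Count how often two alarm tags raise within the same fixed time window.

    events: iterable of (timestamp_seconds, tag) tuples.
    """
    buckets = defaultdict(set)
    for ts, tag in events:
        buckets[int(ts // window_s)].add(tag)
    pairs = defaultdict(int)
    for tags in buckets.values():
        for a, b in combinations(sorted(tags), 2):
            pairs[(a, b)] += 1
    return dict(pairs)

# Placeholder history: A and B raise together twice, C raises alone.
history = [(0, "A"), (10, "B"), (70, "A"), (80, "B"), (130, "C")]
pairs = co_occurrence(history)
```

Pairs with high counts are candidates for a shared upstream input, which can then be checked against the dependency information in the DCS configuration.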

The EEMUA 191 context

The Engineering Equipment and Materials Users Association's EEMUA Publication 191 ("Alarm Systems — A Guide to Design, Management and Procurement") defines the industry-standard alarm performance benchmarks operators are expected to meet:

  • ≤ 1 alarm per 10 minutes (peak operating period)
  • ≤ 1 alarm per 30 seconds during upset
  • ≤ 6 alarms per hour, average
  • ≤ 5% of alarm queue active at any time
  • ≥ 80% of alarms acknowledged within 2 minutes

By any of these benchmarks, the 1,217-alarm shift was a Class-A failure. The mitigation isn't to add more alarms or escalate severity levels — it's to reduce the operator's cognitive load when one bad instrument is masquerading as a system failure. That's what alarm-management analytics done right looks like.
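Two of the listed benchmarks (the 6-per-hour average and the 80%-within-2-minutes acknowledgment share) are easy to check against a shift log. A minimal sketch follows; the function name is hypothetical and the acknowledgment delays in the example are invented for illustration, apart from the 47-minute sticking-valve delay taken from the timeline.

```python
def check_benchmarks(alarm_count, hours, ack_delays_s):
    """Compare a period's alarm load against two of the benchmarks listed above."""
    avg_per_hour = alarm_count / hours
    acked_in_time = sum(1 for d in ack_delays_s if d <= 120)
    ack_share = acked_in_time / len(ack_delays_s) if ack_delays_s else 1.0
    return {
        "avg_per_hour": avg_per_hour,
        "avg_ok": avg_per_hour <= 6,        # <= 6 alarms per hour, average
        "ack_share_2min": ack_share,
        "ack_ok": ack_share >= 0.8,         # >= 80% acknowledged within 2 minutes
    }

# 1,217 alarms in 1.5 hours; delays of 30 s, 5 min, and 47 min (illustrative sample).
result = check_benchmarks(1217, 1.5, [30, 300, 47 * 60])
```

Run against the modeled flood, both checks fail by a wide margin, which matches the assessment above.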

For more depth, see our alarm management interactive demo and the ISA-101 HMI philosophy article.

If you've worked an alarm flood.

Every operator with more than two years in a control room has lived a version of this shift. If you want to be on the team that catches the next one with a recommendation instead of 1,200 acknowledgments, that's the conversation.