UX LEAD CASE STUDY · DECISION SYSTEMS · INCIDENT OPERATIONS

Designing Decision Authority
Under Pressure

How we redesigned who gets to decide — and when — inside high-stakes SaaS incident operations.

02:47 AM — The Decision Nobody Wants to Make

It is 02:47 AM. A "Latency Degraded" alert fires for the payment gateway. The on-call engineer is alone. Customer support is silent—no tickets yet. But the dashboard shows a 12% drift in response times, just below the automatic paging threshold.

The engineer faces a critical operational trade-off. Escalating limits potential data loss but risks a false positive; staying silent protects the team from alert fatigue but risks a major outage. The system forces a human to resolve this signal ambiguity without sufficient data.

Incident Management Is Not a Messaging Problem

Incident management is often reduced to 'ChatOps'—a tooling problem. However, tools cannot resolve the fundamental conflict of distributed truth. When metrics, customer reports, and engineering logs disagree, the system forces humans to synthesize reality under duress. This is an impossible cognitive load.

Perception · Customer-reported
  • Payments feel slow · SIGNAL
  • Intermittent login failures · SIGNAL
  • Support chat spike · NOISE
Metrics · Platform signals
  • p95 latency +12% · SIGNAL
  • Error rate normal · NOISE
  • DB latency spike · SIGNAL
Context · Engineer-known
  • Deploy 40 min ago · SIGNAL
  • Migration window active · NOISE
  • No config change · NOISE

Candidate state: INCIDENT · Alignment: 2/3 signal groups · Δ mismatch
Decision gap: requires human confirmation
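To make the synthesis problem concrete, here is a minimal sketch of how a candidate state could be derived from the three signal groups above. The type names, the per-group vote, and the 2-of-3 alignment rule are illustrative assumptions, not the production logic.

```typescript
// Sketch: deriving a candidate state from conflicting signal groups.
// All names and the alignment rule are illustrative assumptions.

type Classification = "SIGNAL" | "NOISE";

interface SignalGroup {
  source: "perception" | "metrics" | "context";
  readings: { label: string; classification: Classification }[];
}

interface CandidateState {
  candidate: "INCIDENT" | "NORMAL";
  alignment: string;            // e.g. "2/3"
  requiresConfirmation: boolean;
}

// A group "votes" for an incident if any of its readings is a SIGNAL.
const votesIncident = (g: SignalGroup): boolean =>
  g.readings.some((r) => r.classification === "SIGNAL");

function deriveCandidate(groups: SignalGroup[]): CandidateState {
  const votes = groups.filter(votesIncident).length;
  return {
    candidate: votes >= 2 ? "INCIDENT" : "NORMAL",
    alignment: `${votes}/${groups.length}`,
    // Any mismatch between groups (the Δ above) leaves a decision gap:
    // the system proposes, a human confirms.
    requiresConfirmation: votes > 0 && votes < groups.length,
  };
}
```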

Who This System Is For

Primary User: The On-Call Engineer (SRE/DevOps). They hold the pager and the liability. They are often sleep-deprived and socially pressured to avoid false positives.

Stakeholders: Engineering Leadership (VP/Director) who demand uptime, and Customer Success teams who demand transparency. The tension between these groups defines the decision space.

The Real Problem: Hero Culture and Decision Avoidance

We observed a consistent pattern of 'Hero Culture' masking systemic failure. Engineers would manually suppress valid alerts to 'save the team' from interruption, absorbing the risk personally. This decision avoidance was rational: the reputational cost of crying wolf was higher than the cost of a delayed response. The system needed to remove this social calculus.

On-Call Engineer · Single Point of Blame
  • False alarm risk
  • Losing credibility
  • Being wrong publicly
  • SLA breach exposure
  • Customer impact
  • Waking the team
Pressure is not shared. It is absorbed.

“Is this a systemic failure or a transient anomaly?”

Design Goals

  • Eliminate Ambiguity. A state is either 'Normal' or 'Incident'. No 'Watching'.
  • Shift Liability. The system, not the human, declares the status based on pre-set thresholds.
  • Enforce Protocol. Mandatory escalation paths. No skipping steps to 'just fix it'.

Principles That Shaped the System

  • Authority must be explicit, not derived.
  • Silence is a valid state; noise is a failure mode.
  • The burden of proof belongs to the system.

The System That Decides

We rejected full automation because context matters. A database migration looks like an outage to a metric, but is expected behavior to an engineer. The system does not decide for the human; it gates the options available to the human. It ingests conflicting signals and presents a unified 'Candidate State' that must be confirmed or rejected.

Signals (informational): customer reports · system metrics · engineering context
Logic: system evaluation correlates signals and applies thresholds
Authority gate (system-owned): emits a candidate state, e.g. Candidate: INCIDENT · Confidence: Partial
Human checkpoint (constrained): Declare Incident or Snooze Investigation
Available actions are determined by system state.
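The gate itself is simple to express. The sketch below assumes a plain state-to-actions whitelist; the state and action names are hypothetical. The point is structural: the operator picks from a menu the system computed, never from an open set.

```typescript
// Sketch of the authority gate: the system does not act, it constrains
// which actions the human checkpoint may take. Names are illustrative.

type SystemState = "NORMAL" | "CANDIDATE_INCIDENT" | "INCIDENT";
type HumanAction = "DECLARE_INCIDENT" | "SNOOZE_INVESTIGATION" | "ACKNOWLEDGE";

// A pure mapping: actions are derived from system state,
// never chosen freely by the operator.
const PERMITTED_ACTIONS: Record<SystemState, HumanAction[]> = {
  NORMAL: [],
  CANDIDATE_INCIDENT: ["DECLARE_INCIDENT", "SNOOZE_INVESTIGATION"],
  INCIDENT: ["ACKNOWLEDGE"],
};

function attempt(state: SystemState, action: HumanAction): void {
  if (!PERMITTED_ACTIONS[state].includes(action)) {
    throw new Error(`Action ${action} is not available in state ${state}`);
  }
  // ...dispatch through the normal escalation path
}
```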

Role-Based Authority

We redistributed decision rights to align with signal fidelity. The Customer owns 'Perception' (impacting triage priority). The Platform owns 'Metrics' (triggering candidates). The Engineer owns 'Remediation' (confirming reality). This stripped leadership of the ability to 'override' a technical incident based on gut feeling, creating significant initial friction.

Customer · Perceived Impact (authority domain: Input)
  • Reports degradation
  • Signals urgency
  • Cannot declare incidents
Platform · System Reality (authority domain: Logic)
  • Measures metrics
  • Correlates signals
  • Enforces transitions
Engineer · Contextual Judgment (authority domain: Action)
  • Applies context
  • Executes remediation
  • Cannot bypass system status
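The same split can be expressed as data. This sketch uses hypothetical role and capability names; what matters is that no role carries a capability outside its domain, so a leadership "override" has nothing to attach to.

```typescript
// Sketch of role-scoped decision rights, mirroring the three domains above.
// Role and capability names are assumptions for illustration.

type Role = "customer" | "platform" | "engineer";
type Capability =
  | "report_degradation"     // input
  | "correlate_signals"      // logic
  | "enforce_transition"     // logic
  | "confirm_incident"       // action
  | "execute_remediation";   // action

const AUTHORITY: Record<Role, ReadonlySet<Capability>> = {
  customer: new Set(["report_degradation"]),
  platform: new Set(["correlate_signals", "enforce_transition"]),
  engineer: new Set(["confirm_incident", "execute_remediation"]),
};

// No role holds an "override" capability, so gut-feel escalation
// or suppression simply has no code path.
function can(role: Role, capability: Capability): boolean {
  return AUTHORITY[role].has(capability);
}
```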

The Decision Surface

The interface is strictly hierarchical. Real-time metrics are subordinated to 'Decision Cards'. We intentionally hid raw log streams during the 'Triage' phase to prevent analysis paralysis. The 'Declare Incident' action is irreversible by design—once triggered, it initiates a legal and compliance chain that cannot be undone, forcing a deliberate commitment.

[Figure: the Decision Surface interface]

The Decision Moment: Snooze vs Declare

The 'Snooze' action is not a pause; it is a declaration of safety. Snoozing a valid incident silences the alarm for 4 hours, during which a minor degradation can spiral into data loss. 'Declare' immediately pages executive leadership. We clarified the stakes: falsely snoozing is a dereliction of duty; falsely declaring is merely a process error.
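One way to make that declaration concrete is to model a snooze as an accountable, expiring record rather than a mute switch. The 4-hour window comes from the description above; the field names and the required reason are assumptions of this sketch.

```typescript
// Sketch of snooze-as-declaration: a snooze is a signed statement of
// safety with a hard expiry, not an open-ended mute. Field names are
// illustrative assumptions.

interface Snooze {
  alertId: string;
  snoozedBy: string;   // accountability is explicit, not implicit
  reason: string;      // required: why this is believed safe
  expiresAt: Date;
}

const SNOOZE_WINDOW_MS = 4 * 60 * 60 * 1000; // 4 hours, per the text

function snooze(alertId: string, engineer: string, reason: string): Snooze {
  if (!reason.trim()) {
    throw new Error("A snooze is a declaration of safety; it needs a reason.");
  }
  return {
    alertId,
    snoozedBy: engineer,
    reason,
    expiresAt: new Date(Date.now() + SNOOZE_WINDOW_MS),
  };
}
```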

Proof: The System Mechanics

The logic is governed by a strict Finite State Machine. An engineer cannot jump from 'Alert' to 'Resolved' without passing through 'Investigation' and 'Remediation'. This friction prevents the common anti-pattern of 'ghost fixing'—resolving an issue without documenting the root cause.

ALERT (signal received) → TRIAGE (assess & confirm) → INCIDENT (declared state) → REMEDIATION (fix in progress) → RESOLVED (closed & logged)
Blocked transitions:
  • Skipping states (e.g. ALERT → RESOLVED)
  • Skipping investigation: 'ghost fixing'
A false positive is closed out through TRIAGE, never silently dropped.
State skipping is impossible by design.
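A minimal sketch of the transition guard implied by the diagram, assuming a whitelist of legal transitions. The TRIAGE → RESOLVED edge for confirmed false positives is my reading of the 'False Positive' exit, not a documented rule.

```typescript
// Sketch of the finite state machine guard. Legal transitions are a
// whitelist; anything else (e.g. ALERT -> RESOLVED, the "ghost fix")
// throws. State names come from the diagram above.

type IncidentState =
  | "ALERT" | "TRIAGE" | "INCIDENT" | "REMEDIATION" | "RESOLVED";

const TRANSITIONS: Record<IncidentState, IncidentState[]> = {
  ALERT: ["TRIAGE"],
  TRIAGE: ["INCIDENT", "RESOLVED"], // RESOLVED here = confirmed false positive
  INCIDENT: ["REMEDIATION"],
  REMEDIATION: ["RESOLVED"],
  RESOLVED: [],
};

function transition(from: IncidentState, to: IncidentState): IncidentState {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`Blocked: ${from} -> ${to}. State skipping is impossible.`);
  }
  return to;
}

// transition("ALERT", "RESOLVED") throws: ghost fixing is unrepresentable.
```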

Impact

Impact here wasn’t about speed. It was about making decisions predictable instead of person-dependent.

                          Before (human-dependent)   After (system-enforced)
Incident declaration      Highly variable            Bounded
False escalation rate     Unpredictable              Reduced
On-call cognitive load    High                       Lowered
Decision accountability   Implicit                   Explicit
Process consistency       Person-dependent           System-guaranteed
Secondary Effects
  • Clearer postmortems
  • Reduced reliance on individual heroics
The system did not make decisions faster by automating humans out of the loop. It made decisions safer by constraining when and how humans decide.

My Role

I owned the core decision framework and the role-based authority model. I specifically decided to hide raw logs during the "Triage" phase to force binary decision-making, a choice that initially generated significant friction with senior engineering leads. I accepted the risk that this constraint might delay root cause analysis in edge cases, arguing that the reduction in decision paralysis was the higher-value outcome.

Reflection

Designing for authority requires removing the comfort of ambiguity. It forces teams to confront their own broken processes. This system did not just clean up the UI; it exposed and corrected the political power dynamics of incident management.