Anomaly Detection on Analytics Dashboards: When the Alert Fires

TL;DR: Most analytics dashboards have anomaly detection that is either too quiet to be useful or too loud to be trusted. The two failure modes have a common cause: the detector was tuned to the data without being tuned to the human loop that responds to it. The right method depends on the metric's seasonality, the cost of a false positive, the cost of a missed positive, and the willingness of the on-call rotation to acknowledge an alert at 3 a.m. on a Sunday. This essay surveys what works (Western Electric control rules, STL decomposition with robust residual scoring, Twitter's S-H-ESD, Prophet for explicit seasonality, and a few practitioner habits) and what does not, and ends with a calibration loop that treats alerts as a feedback system rather than as code.

A note on examples. The 4% Tuesday revenue drop is a composite of patterns observed across advisory engagements with mid-market commerce and SaaS operators. The Twitter, Facebook, and AT&T projects referenced are public, the academic sources are cited, and the operational figures (alert volumes, MTTA, calibration rates) are anonymized partner-engagement observations rather than vendor benchmarks.

A 4% Drop on a Tuesday

On a Tuesday in November, the daily revenue dashboard at a mid-market direct-to-consumer brand reads 4.1% below the seven-day trailing average. The Slack channel is quiet. There is no alert because the threshold has been calibrated to fire only on drops larger than 5%, an adjustment made three months earlier after the on-call team complained about noise. By Thursday, finance reconciles the week and finds a $63,000 gap. The payment processor had rejected a slice of European 3-D Secure transactions due to a misconfigured Strong Customer Authentication update. The fix takes eleven minutes. The lost revenue does not come back.

That is one branch. On the other branch, the same Tuesday drop fires an alert. The on-call analyst opens the dashboard, sees the dip, looks at the segments, and discovers the same payment failure within fifteen minutes. The fix lands the same day, the cumulative loss is contained at about $8,000, and the team writes a post-incident note that updates the runbook.

The difference between the two branches is not the model. Both teams used a similar statistical baseline (a rolling thirty-day mean with a two-sigma envelope). The difference is the alert volume each team was willing to tolerate, the loop that closed an alert back into action, and the trust that the alert was worth acknowledging. The model is a small part of the system. The loop is most of it.

This essay is about that loop. It surveys the methods most analytics teams reach for (statistical process control, STL decomposition, Twitter's S-H-ESD, Prophet, isolation forests for multivariate cases), points out where each is misapplied, and proposes a small set of practitioner habits that work better than choosing the right model. The right model is necessary. It is not sufficient.

Statistical Process Control: The Original Method

The grandparent of all anomaly detection on monitored metrics is statistical process control, the framework Walter Shewhart developed at Bell Telephone Laboratories in the 1920s and that the Western Electric Statistical Quality Control Handbook (1956) codified into the rules every quality engineer has memorized since. The handbook predates the web by more than three decades, and its rules are still the most reliable starting point for monitoring a business metric.

The Shewhart control chart plots a metric against a center line at the mean, with horizontal bands at one, two, and three standard deviations above and below it. The Western Electric rules then declare a process "out of control" if any of four patterns appear:

A single point beyond three sigma from the mean
Two of three consecutive points beyond two sigma on the same side
Four of five consecutive points beyond one sigma on the same side
Eight consecutive points on the same side of the mean
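The four rules are simple enough to check directly. The sketch below is one possible reading of them, not a reference implementation: the function name is hypothetical, the center line and sigma are assumed to come from an in-control baseline period, and a flag is attached to the last point of the window that completes a pattern.

```python
def western_electric_flags(values, center, sigma):
    """Return (index, rule_number) pairs for every window that completes a rule.

    `center` and `sigma` are assumed to be estimated from an in-control baseline.
    """
    z = [(v - center) / sigma for v in values]
    flags = []
    for i in range(len(z)):
        # Rule 1: a single point beyond three sigma from the mean.
        if abs(z[i]) > 3:
            flags.append((i, 1))
        # Rules 2-4 are one-sided, so check the upper and lower side separately.
        for sign in (1, -1):
            window = [sign * x for x in z[max(0, i - 7): i + 1]]
            # Rule 2: two of three consecutive points beyond two sigma.
            if len(window) >= 3 and sum(x > 2 for x in window[-3:]) >= 2:
                flags.append((i, 2))
            # Rule 3: four of five consecutive points beyond one sigma.
            if len(window) >= 5 and sum(x > 1 for x in window[-5:]) >= 4:
                flags.append((i, 3))
            # Rule 4: eight consecutive points on the same side of the mean.
            if len(window) == 8 and all(x > 0 for x in window):
                flags.append((i, 4))
    return flags


# Illustrative values already expressed in sigma units (center 0, sigma 1).
spike = western_electric_flags([0.1, -0.2, 3.5], center=0.0, sigma=1.0)
drift = western_electric_flags([0.5] * 8, center=0.0, sigma=1.0)
```

The drift case is the one a plain three-sigma threshold never catches: no single point is remarkable, but eight in a row on one side of the mean is.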

The reason these rules persist is calibrational, not theoretical. Each rule is constructed so that under the null hypothesis (a process that is genuinely in control with normally distributed noise) the probability of triggering is small and of the same order, from roughly 0.3% for the three-sigma rule to roughly 0.8% for the eight-point run rule. The four rules together raise the false-positive rate to about 1% to 1.5%, which is the level of noise a quality engineer at AT&T's manufacturing lines in 1956 was willing to investigate. That tolerance is also roughly what most operational analytics teams will tolerate today before they reach for the mute button.
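The per-rule probabilities can be checked with a few lines of arithmetic. This is a sketch under the stated null (independent, normally distributed points). Each figure is the chance that one fixed window of the relevant length matches the pattern, which is close to, but not identical to, the long-run alert rate per point. The function name is hypothetical.

```python
from statistics import NormalDist


def rule_probabilities():
    """Chance that a single fixed window matches each Western Electric rule
    when the process is in control (independent standard-normal noise)."""
    def tail(k):
        # One-sided probability of landing more than k sigma above the mean.
        return 1.0 - NormalDist().cdf(k)

    p1, p2 = tail(1), tail(2)
    return {
        "rule_1": 2 * tail(3),                            # one point beyond 3 sigma
        "rule_2": 2 * (3 * p2**2 * (1 - p2) + p2**3),     # at least 2 of 3 beyond 2 sigma
        "rule_3": 2 * (5 * p1**4 * (1 - p1) + p1**5),     # at least 4 of 5 beyond 1 sigma
        "rule_4": 2 * 0.5**8,                             # 8 in a row on one side
    }


probs = rule_probabilities()
```

Under these assumptions the four values come out at roughly 0.27%, 0.31%, 0.55%, and 0.78%. The patterns overlap, so the combined rate is lower than their sum.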

The honest critique of pure SPC for modern analytics dashboards is that very few business metrics behave like a manufacturing process. Daily revenue is not a stationary normal variable. It has weekly cycles, monthly billing cycles, holiday effects, growth trends, campaign spikes, and heteroskedasticity (variance that scales with the mean). The Shewhart chart will fire continuously on Saturday-Sunday drops and miss a Tuesday slump that sits inside the weekly cycle's bounds. So SPC alone is not enough. But the rules are still the right calibration target. A more sophisticated detector should aim for roughly the same false-positive rate as Western Electric's rules deliver in a stationary process, because the on-call rotation's tolerance for false alarms has not improved since 1956 and arguably has worsened.

Decomposition: Separating Seasonality From Surprise

The first improvement on a raw SPC chart is to remove the part of the variance that is structurally predictable before scoring residuals. The standard procedure is seasonal-trend decomposition, the cleanest version of which is STL (Cleveland, Cleveland, McRae, and Terpenning, 1990, Journal of Official Statistics). STL decomposes a series into three additive components:

$Y_t = T_t + S_t + R_t$

where $T_t$ is the slow-moving trend, $S_t$ is the periodic seasonal component, and $R_t$ is the residual. The detector runs on $R_t$. If the trend captures the growth curve and the seasonal component captures the weekly cycle, the residual is closer to the kind of stationary noise that SPC was designed for. The weekday-weekend problem disappears, because the Saturday drop is now part of $S_t$, not part of $R_t$.

Daily Revenue Decomposed: Raw, Trend, Seasonal, Residual (Advisory Partner Series, Q2 2024)

In the decomposition above, the only day where the residual is meaningfully different from zero is W2 Tuesday, which is the actual anomaly. Every weekend drop is absorbed into the seasonal component, where it belongs. A naive threshold on the raw series would have alerted twice every weekend and missed Tuesday. A residual-based detector alerts on Tuesday and stays silent on the weekends.
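The additive split $Y_t = T_t + S_t + R_t$ can be illustrated without any library. The sketch below is deliberately not STL: it swaps STL's loess smoothing for a classical centered moving average (trend) and a per-weekday median (seasonal). The function name and the synthetic four-week series are assumptions made for illustration.

```python
from statistics import mean, median


def decompose_additive(y, period=7):
    """Classical additive decomposition: moving-average trend, median seasonal.

    A simplified stand-in for STL; assumes an odd `period` and at least a few
    full cycles of data.
    """
    n, half = len(y), period // 2
    # Trend: centered moving average over one full cycle, edges held flat.
    core = [mean(y[t - half: t + half + 1]) for t in range(half, n - half)]
    trend = [core[0]] * half + core + [core[-1]] * half
    detrended = [value - level for value, level in zip(y, trend)]
    # Seasonal: median detrended value at each position in the cycle.
    seasonal_by_position = [median(detrended[p::period]) for p in range(period)]
    seasonal = [seasonal_by_position[t % period] for t in range(n)]
    residual = [d - s for d, s in zip(detrended, seasonal)]
    return trend, seasonal, residual


# Synthetic series: weekdays at 100, weekends at 60, week-2 Tuesday dips to 90.
week = [100.0] * 5 + [60.0] * 2
series = week * 4
series[8] = 90.0

trend, seasonal, residual = decompose_additive(series)
worst_day = max(range(len(residual)), key=lambda t: abs(residual[t]))
```

On this toy series the weekend drops land in the seasonal component and the largest residual is the week-2 Tuesday. A production pipeline would use a real STL implementation, which handles drifting seasonality and edge effects far better than this.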

The catch with STL is that the decomposition itself can be unstable when the series is short, when the seasonality changes over time, or when a single extreme event distorts the trend estimate. The Cleveland paper addressed the first two with robust loess weighting and time-varying seasonal smoothing. The third is the practitioner's responsibility: known one-off events (Black Friday, a Super Bowl ad, a launch day) should be marked as exogenous and excluded from the seasonal estimate, or the model will quietly absorb them into the seasonality and start expecting a Super Bowl spike every February.

A common operating choice is to use STL to remove seasonality and trend, then apply a robust score (median absolute deviation, MAD, rather than standard deviation) on the residual. The MAD-based score is less sensitive to the very anomalies the detector is supposed to find, which means the threshold does not creep upward each time a real incident occurs. This is the residual-MAD pattern that Twitter's open-source AnomalyDetection package implemented as Seasonal Hybrid ESD (Vallis, Hochenbaum, and Kejariwal, 2014).
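A short sketch of why the robust score matters. It compares a MAD-based score with a classical z-score on the same residuals. The 1.4826 factor is the usual consistency constant that makes MAD comparable to a standard deviation under normality. The residual values and the function name are illustrative assumptions.

```python
from statistics import mean, median, stdev


def mad_scores(residuals):
    """Robust z-like score: distance from the median in units of scaled MAD."""
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    if mad == 0:
        raise ValueError("MAD is zero; the series is too flat to score this way")
    scale = 1.4826 * mad  # comparable to a standard deviation for normal data
    return [(r - center) / scale for r in residuals]


# Illustrative residuals with one real incident at index 7.
residuals = [0.4, -0.6, 0.1, 0.3, -0.2, -0.5, 0.2, -9.0, 0.0, 0.5]

robust = mad_scores(residuals)
classical = [(r - mean(residuals)) / stdev(residuals) for r in residuals]
```

On these numbers the incident inflates the standard deviation enough that its own classical z-score stays inside three sigma, while the MAD-based score puts it far outside any plausible threshold. That self-masking is what the robust score avoids.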

Twitter's S-H-ESD and the Lineage of Robust Detectors

Twitter open-sourced its AnomalyDetection R package in January 2015, with a Python port (anomalydetection) following shortly after. The package implemented Seasonal Hybrid ESD, a refinement of Bernard Rosner's generalized extreme studentized deviate test from 1983. The idea is to run STL-like decomposition, take the residual, and apply ESD iteratively, replacing the mean and standard deviation at each step with the median and MAD to make the test robust to its own findings.

The headline benefit, according to the Twitter engineering blog post, was the ability to detect both global anomalies (a clear outlier that any method would catch) and local anomalies (a point that is unusual in its current seasonal-trend context but unremarkable in absolute terms). For monitoring server-side metrics on the Twitter platform, the local case was the more important one: a spike in tweet-creation latency that happens to fall during a normal traffic peak is the dangerous one, because it is masked by the legitimate traffic.

The S-H-ESD pipeline (Vallis, Hochenbaum, Kejariwal, 2014)


In practice, the S-H-ESD method works reasonably well as a daily-grain detector on metrics with weekly or daily seasonality and moderate noise. It works less well in three situations the original paper did not emphasize:

Metrics with sparse data. Hourly or sub-hourly metrics with low base rates produce many zero or near-zero observations, and STL's seasonal estimation degrades. Practitioners switch to count-based methods (Poisson or negative binomial monitoring) at this resolution.
Multivariate dependencies. S-H-ESD operates on one series at a time. A revenue drop caused by a checkout-flow regression appears as a single anomaly on the revenue series but as multiple coordinated anomalies on the add-to-cart, payment-initiation, and checkout-completion series. The univariate detector misses the coordination, which is the most diagnostic feature of the failure.
Regime changes that are not anomalies. A pricing change, a campaign launch, or a product migration shifts the level of the metric in a way the detector flags as anomalous. The detector is technically correct, in that the new level is different from the old, but operationally wrong, because the new level is intended.
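For the sparse-data case, a count-based check can be as small as a Poisson tail probability. The sketch below assumes the expected hourly rate is already known (for example from the same hour in prior weeks). That rate, the function name, and the example numbers are illustrative assumptions. It does not cover the overdispersed case, where a negative binomial would be the better fit.

```python
import math


def poisson_lower_tail(observed, expected_rate):
    """P(X <= observed) for X ~ Poisson(expected_rate)."""
    term = math.exp(-expected_rate)  # P(X = 0)
    total = term
    for k in range(1, observed + 1):
        term *= expected_rate / k    # P(X = k) from P(X = k - 1)
        total += term
    return total


# Illustrative: an hour that normally sees about 6 orders.
p_zero_orders = poisson_lower_tail(0, 6.0)   # a silent hour
p_three_orders = poisson_lower_tail(3, 6.0)  # a slow hour
```

With an expected rate of six, a silent hour has a probability of roughly a quarter of a percent and is worth an alert, while three orders has a probability of about 15% and is ordinary noise.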

The Twitter team's own use case (server-side performance metrics on a relatively stable platform) was mostly free of all three problems. Most business analytics use cases are not.

Prophet, BSTS, and Bayesian Decomposition

Facebook open-sourced Prophet in February 2017, with the accompanying paper (Taylor and Letham, 2017, PeerJ Preprints). Prophet is positioned as a forecasting library, not an anomaly detector, but the architectural choice it embeds (an additive model with explicit trend, multiple seasonalities, and holiday effects) makes it natural to adapt for anomaly detection: forecast the next period, compare the actual to the forecast and its credible interval, flag the deviation if it falls outside.

The Prophet model is approximately:

$y(t) = g(t) + s(t) + h(t) + \epsilon_t$

where $g(t)$ is a piecewise-linear or logistic growth term, $s(t)$ is a Fourier-series seasonal component (often a daily plus a weekly plus a yearly term), $h(t)$ is a holiday effects term defined on user-supplied calendars, and $\epsilon_t$ is the residual. The model is fit with Stan. In the default configuration that means a maximum a posteriori point estimate, with full Bayesian sampling available as an option (the mcmc_samples setting). Either way the forecast carries uncertainty intervals, and those intervals double as the bounds for anomaly detection.

Univariate Anomaly Methods, Tuned for Daily Business Metrics

Method	Best Use Case	Weakness	Typical False-Positive Rate	Calibration Effort
Static threshold (raw value)	Smoke test on a flat metric	Fires on seasonality, misses real dips	10-25% on weekly-seasonal data	Trivial but useless
Rolling Z-score (30-90 days)	Metric with mild trend, no strong seasonality	Treats Saturday-Sunday as anomalies on B2B data	5-15%	Low
STL + MAD residual	Daily-grain metric with weekly cycle	Sensitive to outliers in training window, exogenous events must be marked	1-3%	Medium
Seasonal Hybrid ESD (Twitter)	Daily or hourly metric with stable seasonality	Single-series only, no multivariate coordination	1-3%	Medium
Prophet residual band	Daily metric with multiple seasonalities and holidays	Heavy on parameters, can over-fit if holidays are mis-specified	1-5%	Medium-High
Bayesian Structural Time Series (CausalImpact)	Treatment-effect estimation, not real-time monitoring	Not designed for streaming use	Not applicable	High
Isolation Forest on feature vector	Multivariate point anomalies (e.g., session features)	No temporal context, treats every point independently	Tunable via contamination param	Medium
LSTM-based reconstruction error	Long, complex series with hard-to-specify structure	Black box, debugging an alert is painful	Tunable but opaque	High

Prophet's strengths are the holiday-effect framework and the credible intervals. Its weaknesses are well known in the practitioner literature. The default settings often produce trend changes that lag actual structural breaks, the multiplicative-versus-additive choice for seasonality is sensitive to the data scale, and the credible intervals are sometimes wider than the practitioner expects because the changepoint prior is permissive by default. A team should not adopt Prophet without fitting the model to a year of historical data, marking known holidays and events explicitly, and reviewing the changepoint locations the model selects against the team's institutional knowledge of what actually happened on those dates.

Bayesian Structural Time Series (BSTS), implemented in the CausalImpact R package by Brodersen, Gallusser, Koehler, Remy, and Scott at Google (2015, Annals of Applied Statistics), uses a similar additive-state-space architecture but is built around counterfactual estimation rather than streaming detection. BSTS is the right tool for "did the campaign cause a lift" questions and the wrong tool for "fire an alert in the next 30 seconds." Many teams confuse the two and then complain that BSTS is too slow.

The Multivariate Case: When the Failure Hides Behind Coordination

The most expensive analytics outages we have observed in advisory work are not the ones that show up cleanly on a single metric. They are the ones where the headline metric is technically inside its band, but a coordinated pattern across three or four supporting metrics tells the real story. A payment processor outage that affects only European customers does not register on aggregate revenue if European revenue is 18% of the total and the residual band is tuned to two sigma. The European-specific revenue series would show it. So would the European checkout-completion rate, the European average-order-value series, and the rate of European cards being declined.
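The dilution is simple arithmetic. If a slice carries a share $w$ of total revenue and drops by a fraction $d$ while every other slice holds steady, the aggregate moves by

$\Delta_{\text{agg}} = w \cdot d$

With $w = 0.18$, a 20% European drop moves the aggregate by only $0.18 \times 0.20 = 3.6\%$. Assuming, purely for illustration, a residual standard deviation of 2.5% on the aggregate series, the two-sigma band sits at 5%, and the European slice would have to lose $0.05 / 0.18 \approx 28\%$ of its revenue before the aggregate detector fired.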

The single-series detector misses the coordination because it cannot see across series. The multivariate response is one of three families:

Family 1: Slice-aware single-series detectors. Run the same univariate detector on a fixed set of slices (region, channel, device, traffic source, payment method) in parallel. The cost is one detector run per slice, which scales linearly. The benefit is that the slice-level alert is more diagnostic than the aggregate alert.

Family 2: Multivariate dimensionality reduction. Fit a model that projects the multivariate state into a lower-dimensional representation, then detect anomalies on the reconstruction error, as an autoencoder does. Isolation Forests (Liu, Ting, Zhou, 2008, IEEE ICDM) and one-class SVMs are the usual alternatives in this family; they score how easily a point is isolated or how far it sits outside a learned boundary rather than a reconstruction error. The benefit is that the detector finds the coordinated patterns the slice-level detector requires the analyst to look for. The cost is that the detector cannot tell you which slice caused the alert without a separate explainability step.

Family 3: Causal-graph-aware detectors. Encode the known dependency structure between metrics (revenue depends on add-to-cart, which depends on session count and conversion rate) and detect violations of the expected relationship rather than violations of the level. This is the family that maps best to operational diagnostics, but it is rarely worth the engineering cost outside of large e-commerce platforms.

Aggregate Revenue (Inside Band) vs EU Slice Revenue (Outside Band), Advisory Partner Series

The aggregate series above stays inside its band for the entire week, even as the European slice drifts visibly below its lower band starting Wednesday. The single-series aggregate detector remains silent. A slice-aware detector raises a region-specific alert on Wednesday, the day the anomaly begins. The aggregate detector never fires at all, because the aggregate stays just inside its band until the recovery.
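The slice-aware pattern (Family 1) amounts to running one univariate scorer per slice. The sketch below assumes the inputs are already deseasonalized daily values and uses a MAD-based score. The slice names, revenue figures, threshold, and function names are all illustrative assumptions.

```python
from statistics import median


def robust_score(history, today):
    """Distance of today's value from the historical median, in scaled-MAD units."""
    center = median(history)
    mad = median(abs(v - center) for v in history)
    # A zero MAD means a perfectly flat history; report no deviation.
    return (today - center) / (1.4826 * mad) if mad else 0.0


def slice_alerts(history_by_slice, today_by_slice, threshold=3.5):
    """Run the same univariate detector on every slice and collect breaches."""
    alerts = []
    for name, history in history_by_slice.items():
        score = robust_score(history, today_by_slice[name])
        if abs(score) > threshold:
            alerts.append((name, round(score, 1)))
    return alerts


# Illustrative daily revenue in thousands, by region.
eu = [18.2, 17.9, 18.1, 18.4, 17.8, 18.0, 18.3]
us = [55.0, 57.0, 53.0, 56.0, 54.0, 58.0, 55.0]
row = [27.0, 29.0, 26.0, 28.0, 27.0, 30.0, 28.0]
history = {"EU": eu, "US": us, "ROW": row,
           "ALL": [e + u + r for e, u, r in zip(eu, us, row)]}
today = {"EU": 14.0, "US": 57.0, "ROW": 28.0, "ALL": 14.0 + 57.0 + 28.0}

alerts = slice_alerts(history, today)
```

In this example the European slice breaches its band while the aggregate score stays well under one. The cost is one detector run per slice, as noted above.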

From Experience

A 2024 advisory engagement with a European-headquartered DTC brand running its own commerce platform

The team's main revenue dashboard had a Prophet-based detector with a 95% credible interval. It had not fired in eight weeks. We layered a parallel detector on the European-card-decline-rate series and found a step-change six days earlier that the aggregate detector had missed because the European loss was being offset by a US campaign that fell in the same week. Once the slice-level detector was added to the main monitoring rotation, the team caught the next two coordination failures (a Stripe-side 3-D Secure regression and a regional fraud-screen rule that was rejecting legitimate transactions) within hours rather than days. The engineering cost of adding the slice-level detector was about two weeks. The recovered revenue in the first quarter after deployment was about eleven times the engineering cost.

The Alert Fatigue Problem

The single most common cause of monitoring failure is not a missed anomaly. It is alert fatigue: the alert fires, the on-call analyst mutes it, the analyst's threshold for action rises, and within ninety days the channel is silenced or the rules are loosened until the detector is no longer functionally a detector.

A useful way to think about alert fatigue is as a feedback loop. The alert fires at rate $\lambda$. Each alert imposes a per-event cost on the on-call rotation: interruption time, context switching, the cognitive cost of investigation. Some fraction $p$ of alerts are true positives. The marginal benefit of the next alert depends on the precision $p$ and the cost of missing a true positive. When the precision drops below a threshold, the on-call rotation rationally disengages, and the operational system reverts to "no detector."
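One way to make that threshold concrete is a simplifying model that is ours rather than anything measured: each alert costs the rotation a fixed $c$ to investigate, and each true positive caught is worth $B$. The expected net value of responding to one alert is then

$V = p \cdot B - c$

which is positive only when $p > c / B$. If, for illustration, an investigation costs 30 minutes and a caught incident saves about 90 minutes of later firefighting, the break-even precision is one in three. Below that, ignoring the channel is the rational response. Total load on the rotation scales as $\lambda \cdot c$, so raising the alert rate without raising precision makes the problem worse.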

The medical-monitoring literature has measured this carefully. Christopher Bonafide, Phipps, Murray, et al. (2017, Pediatrics) found that as the rate of non-actionable alarms increased on inpatient wards, nurses' time-to-response to subsequent alarms grew. A 2014 ECRI Institute report listed alarm hazards as the top patient-safety risk in hospitals for the third consecutive year. The numbers are not portable to a B2B SaaS dashboard, but the structural finding is: humans habituate. The detector that overflows the human channel loses its capacity to trigger action.

The right design choice is to tune the detector for the precision the team can sustain, not the recall a theoretical analysis recommends. There are two operational tactics that help.

Tactic 1: Two-stage alerting. The first stage is a low-precision, high-recall detector that writes to a quiet channel reviewed daily. The second stage is a high-precision, lower-recall detector that pages the on-call rotation. The high-recall channel is the analyst's morning reading. The high-precision channel is the page. The two channels separate the cost of investigation (cheap, daily) from the cost of interruption (expensive, immediate).

Tactic 2: Acknowledgement-gated escalation. Every alert requires a human acknowledgement within a fixed window (say, fifteen minutes). Unacknowledged alerts escalate to the next responder in the rotation. Acknowledged alerts are tagged with one of three labels (real, false positive, deferred), and the labels feed back into the calibration of the detector. The system measures not just whether it fired correctly but whether the team responded, and the team's response rate is the metric the system optimizes against.
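The two tactics combine into a small amount of routing logic. The sketch below is one possible shape for it, not a prescribed design: the class, function names, score thresholds, and the use of minutes-since-midnight timestamps are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    metric: str
    score: float                         # detector score, e.g. a robust z-score
    fired_at_min: int                    # minutes since midnight, for simplicity
    acked_at_min: Optional[int] = None   # None until a human acknowledges
    label: Optional[str] = None          # "real", "false_positive", or "deferred"


def route(alert, quiet_threshold=3.0, page_threshold=5.0):
    """Tactic 1: weak signals go to the quiet channel, strong ones page."""
    magnitude = abs(alert.score)
    if magnitude >= page_threshold:
        return "page"
    if magnitude >= quiet_threshold:
        return "quiet"
    return "drop"


def needs_escalation(alert, now_min, ack_window_min=15):
    """Tactic 2: unacknowledged alerts escalate once the window has passed."""
    return alert.acked_at_min is None and now_min - alert.fired_at_min > ack_window_min


revenue_alert = Alert("eu_revenue", score=-6.2, fired_at_min=540)
channel = route(revenue_alert)                              # "page"
escalate = needs_escalation(revenue_alert, now_min=556)     # True, 16 minutes unacked
```

The label field is what feeds the calibration described in the next section. Without it, the system knows that alerts fired but not whether they were worth firing.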

Calibration: The Loop That Closes

The calibration loop is the part of the system that most teams underbuild. Most teams ship the detector, configure a Slack channel, and call it done. The detector then degrades for the standard reasons: the metric's seasonality shifts, exogenous events accumulate without being marked, the team loosens thresholds to silence specific alerts and never tightens them again. Six months in, the detector is firing at half the original rate and missing the kind of incident it was supposed to catch.

A healthy calibration loop has four components.

1. Labeled alert history. Every fired alert gets a label after the fact: true positive (real incident), false positive (noise), or operationally irrelevant (technically anomalous but the team would not have acted on it). The labels are recorded in a structured way, ideally in a small table that the detector itself reads on the next training cycle.

2. Periodic threshold re-fitting. Every thirty to ninety days, the detector's parameters are re-fit on the most recent training window, with the labeled history as supervision. The threshold that produces the team's target precision is recomputed and applied. The team's target precision is a configuration choice, not a default.

3. Exogenous event registry. A small calendar table that marks dates the team knows are anomalous for legitimate reasons (launches, campaigns, outages elsewhere, holidays). The detector excludes these dates from the training data and from the alert calculation. The registry is maintained by the team, not the detector.

4. Periodic dashboard of detector health. A small dashboard showing alert volume, acknowledgement rate, precision, and time-to-acknowledge over the last 30, 60, and 90 days. This is the detector watching itself. If the acknowledgement rate drops below 60%, the team's first instinct should be to raise the precision (raise the threshold or move alerts to the quiet channel), not to send a "please respond to alerts" Slack message.
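Components 1 through 3 fit together in a few lines. The sketch below picks the lowest score threshold whose historical alerts would have met a target precision, after dropping dates listed in the event registry. The data layout, label strings, function name, and example history are assumptions for illustration. A real re-fit would also re-estimate the detector's own parameters.

```python
from datetime import date


def threshold_for_precision(history, target_precision, event_registry=frozenset()):
    """Lowest |score| threshold whose past alerts meet the target precision.

    history: iterable of (day, score, label) tuples from the labeled alert log.
    Returns None if no threshold reaches the target.
    """
    usable = [(abs(score), label == "true_positive")
              for day, score, label in history
              if day not in event_registry]
    # Scanning upward from the lowest score keeps recall as high as possible.
    for threshold in sorted({score for score, _ in usable}):
        fired = [is_tp for score, is_tp in usable if score >= threshold]
        if sum(fired) / len(fired) >= target_precision:
            return threshold
    return None


# Illustrative labeled alert history.
log = [
    (date(2024, 11, 4), 3.1, "false_positive"),
    (date(2024, 11, 6), -3.4, "irrelevant"),
    (date(2024, 11, 12), -3.6, "true_positive"),
    (date(2024, 11, 15), 3.9, "false_positive"),
    (date(2024, 11, 19), -4.2, "true_positive"),
    (date(2024, 11, 21), -4.8, "true_positive"),
    (date(2024, 11, 29), 7.5, "false_positive"),   # Black Friday spike
    (date(2024, 12, 3), -5.5, "true_positive"),
    (date(2024, 12, 10), -6.1, "true_positive"),
]
registry = {date(2024, 11, 29)}

with_registry = threshold_for_precision(log, 0.70, registry)
without_registry = threshold_for_precision(log, 0.70)
```

In this toy log, leaving Black Friday unmarked pushes the threshold up from 3.4 to 3.6, which is the quiet degradation the registry exists to prevent.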

The calibration loop: alerts as a feedback system


The single most underrated practitioner move in this loop is the third component, the exogenous event registry. Most detectors quietly degrade because the team forgot to mark Black Friday and the seasonality estimator absorbed it. The registry takes about thirty minutes per quarter to maintain. The recovered detector accuracy is sometimes the difference between a useful monitoring system and a disabled one.

Choosing the Right Method

The decision tree below collapses six months of advisory conversation into a short flow. It is not a substitute for the calibration loop. It is a starting configuration that a team can use to avoid the most common misapplications.

Decision path: Picking a daily-grain anomaly detector

Does the metric have a clear weekly cycle?

If yes: Do you have at least 13 weeks of clean history?
- If yes: Are there holidays or campaigns that materially shift the level on specific dates?
  - If yes: Outcome: Prophet with explicit holiday/event calendar; alert on residual outside the 95% credible interval.
  - If no: Outcome: STL + MAD on residuals; alert at residual greater than 3.5x MAD from zero.
- If no: Outcome: Use a Z-score with a 30-day rolling window; revisit once you have 13+ weeks of data.
If no: Is the metric multivariate (multiple related series moving together)?
- If yes: Outcome: Slice-aware univariate detectors first; isolation forest only if dependency patterns are well understood.
- If no: Outcome: Western Electric rules on a Shewhart chart with a 60-day baseline.
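The same decision path, written as a function so it can sit next to a metric's configuration. The function and parameter names are hypothetical. The branches and outcomes mirror the tree above and nothing more.

```python
def pick_detector(weekly_cycle, weeks_of_history, dated_level_shifts, multivariate):
    """Starting configuration for a daily-grain detector, following the tree above."""
    if weekly_cycle:
        if weeks_of_history < 13:
            return "rolling z-score, 30-day window (revisit at 13+ weeks)"
        if dated_level_shifts:
            return "Prophet with event calendar, alert outside the 95% interval"
        return "STL + MAD, alert when |residual| > 3.5 x MAD"
    if multivariate:
        return "slice-aware univariate detectors first"
    return "Western Electric rules on a Shewhart chart, 60-day baseline"


# Example: weekly-seasonal revenue, 40 weeks of history, known campaign dates.
choice = pick_detector(weekly_cycle=True, weeks_of_history=40,
                       dated_level_shifts=True, multivariate=False)
```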

The decision tree skips the question of whether to use a streaming detector versus a batch one. The answer for most analytics dashboards is batch. Streaming anomaly detection is appropriate for high-frequency operational metrics (latency, error rate, throughput) where the cost of detection delay is measured in seconds. Analytics-dashboard anomalies have detection-delay tolerances measured in minutes to hours, and a daily-grain batch detector is cheaper to operate and easier to calibrate.

The other question the tree skips is what to do when the metric is genuinely non-stationary, meaning the underlying distribution is changing over time in ways that are not seasonal. This is the case for a fast-growing product where revenue is doubling every quarter. The right response is to detect anomalies on a normalized series (revenue per active customer, or revenue per acquisition channel) rather than on the absolute level. A Prophet model on a series with a 5% weekly growth rate will produce credible intervals so wide that meaningful drops fall inside them.

Human-in-the-Loop and the Question of Auto-Remediation

The frontier of anomaly detection on analytics dashboards is not statistical. It is operational. The methods above produce alerts. The question of what to do with an alert remains a human one in most cases, and in most cases that is the right design.

A few categories of alert lend themselves to auto-remediation. A spike in 5xx error rates on a payment endpoint can be auto-routed to a circuit breaker that fails the request over to a backup processor. A drop in adoption behind a feature flag can trigger an automatic rollback if the deploy was recent. A regional drop in conversion that correlates with a CDN incident reported by the CDN's status page can be auto-routed to a degraded-mode landing page.

Most analytics alerts are not in this category. A 4% revenue drop on a Tuesday is too ambiguous to auto-remediate. The set of possible causes (payment outage, pricing bug, normal variance, campaign exhaustion, a competitor's launch) requires human triage. The right system for these is one that produces a high-quality alert with the right diagnostic context attached: the slice breakdown, the recent change history, the most similar prior incidents, the on-call runbook entry that matches the pattern.

Alert Lifecycle Metrics That Predict Detector Survival

Metric	Healthy Range (Advisory Observation)	Warning Signal
Mean alerts per week	3-12 for a single team	>25 per week; rotation will mute
Acknowledgement rate (15 min)	≥75%	<60% indicates fatigue
Median time-to-acknowledge	<5 min during work hours	>20 min indicates the channel is not trusted
Precision (last 30 days)	≥65% on label-after review	<50% means the team is wasting time
False-positive label rate	≤35%	>50% indicates the threshold is too sensitive
Alerts disabled or muted manually	<5% of total	>15% means the calibration loop is broken
Median time-to-incident-resolution post-alert	<2 hours for revenue-class alerts	>8 hours indicates either a hard problem or no runbook

The metrics in the table are the ones we have found most diagnostic of whether a detector is going to survive its first year. The single most important one is the second: acknowledgement rate. A detector firing into a channel where 60% of alerts are acknowledged within fifteen minutes is operationally alive. A detector firing into a channel where 30% are acknowledged is operationally dead, regardless of what the algorithm is.
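Three of the table's metrics can be computed from the alert log alone. The sketch below assumes each alert record carries the minutes until acknowledgement (or None if nobody responded) and a post-hoc label. The field names, function name, and sample records are illustrative assumptions.

```python
from statistics import median


def detector_health(alerts, ack_window_min=15):
    """Acknowledgement rate, median time-to-acknowledge, and precision."""
    if not alerts:
        raise ValueError("no alerts in the window")
    ack_times = [a["minutes_to_ack"] for a in alerts
                 if a["minutes_to_ack"] is not None]
    acked_in_window = [t for t in ack_times if t <= ack_window_min]
    labeled = [a for a in alerts if a["label"] is not None]
    true_positives = sum(a["label"] == "true_positive" for a in labeled)
    return {
        "ack_rate": len(acked_in_window) / len(alerts),
        "median_minutes_to_ack": median(ack_times) if ack_times else None,
        "precision": true_positives / len(labeled) if labeled else None,
    }


# Illustrative 30-day alert log.
minutes = [2, 4, 3, 25, None, 6, 1, 12]
labels = ["true_positive", "true_positive", "false_positive", "true_positive",
          "false_positive", "true_positive", "true_positive", "false_positive"]
log = [{"minutes_to_ack": m, "label": l} for m, l in zip(minutes, labels)]

health = detector_health(log)
```

On this sample the acknowledgement rate is 75% and the median time-to-acknowledge is four minutes, both inside the healthy range, while precision at 62.5% sits just under the 65% line in the table.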

The detector is the easy part. The loop is the hard part. The calibration is the boring part. Most teams ship the easy part, skip the loop, and forget the calibration, and then wonder why the channel goes silent.

The composite recommendation we make in advisory engagements, summarized:

Start with the Western Electric rules on a Shewhart chart. Even if you replace this later, the rules give you a sanity-check false-positive rate to compare against.
Decompose the series before detecting. STL or Prophet, depending on whether you need explicit holiday effects. Detect on the residual, not the level.
Use robust statistics for the residual score. MAD instead of standard deviation. Iteratively trim known anomalies from the training window.
Alert at the slice level, not just the aggregate. The aggregate alert is the easiest to miss because it is the last to fire.
Run a two-stage channel. Quiet channel for daily review, paging channel for confirmed incidents. Keep the precision on the paging channel above 65%.
Maintain an exogenous-event registry. Thirty minutes per quarter. Removes most of the recurring false-positive sources.
Build a detector-health dashboard. Alert volume, acknowledgement rate, precision, time-to-acknowledge. Treat these as the system's own metrics.
Label every alert post-hoc. The label history is the supervision signal for the next re-fit cycle.

The methods are unglamorous. The S-H-ESD paper is short, the STL paper is from 1990, the Western Electric rules are from a manufacturing handbook that predates the personal computer. The reason they remain the right starting points is that the operational problem (a small team responding to a finite number of alerts in finite time) has not changed. The model only matters to the extent that the loop survives.

A footnote on what we have stopped recommending. In 2021 and 2022 a fashionable choice was to train a small LSTM or transformer autoencoder on the metric history, score the reconstruction error, and alert on the high-error windows. This approach can work on long, stable series with subtle structure that the additive decomposition misses. In our advisory observations it almost never beat a properly tuned STL-plus-MAD pipeline on daily-grain business metrics, and the debugging cost (when an alert fires, the team cannot easily explain which features of the reconstruction were anomalous) routinely defeated the marginal accuracy improvement. The right place for the heavier neural detectors is the multivariate operational telemetry stack at large platform companies, not the analytics dashboard at a mid-market commerce operator.

The other piece we have learned to remove from monitoring rotations is the bare percentile-of-history detector that some teams still maintain. The pattern is to compute, for the current day, where the day-of-week-matched value falls in the trailing 26-week empirical distribution, and to alert when it falls below the 5th percentile. The detector is intuitive and easy to explain. It has two failure modes that the residual-MAD pattern handles better. First, growth: in a growing series the percentile detector either flags the bottom 5 percent of every quarter as anomalies (if the percentile is computed across the full window) or under-fires (if the window is too short to be a reliable percentile). Second, structural breaks: a pricing or product change shifts the distribution, and the percentile detector takes 26 weeks to recover. The residual detector recovers in two to three weeks because the trend term absorbs the level change. Most teams that have stuck with percentile detectors have done so for explainability reasons, and most have eventually replaced them.

Key Takeaways

The right model matters less than the loop. A Shewhart chart with the Western Electric rules will outperform a sophisticated detector that the team mutes within ninety days. The deciding factor is the calibration loop, not the algorithm.
Decompose before detecting. STL or Prophet residuals score better than raw series scores on every metric with non-trivial seasonality, and the gap widens as the series gets longer.
Alert at the slice level. The coordinated multivariate failures hide inside aggregate-level bands. Slice-aware univariate detectors catch them earlier than aggregate detectors and are cheaper than full multivariate models.
Tune for the precision the team will sustain. Beyond a certain alert volume, lowering thresholds reduces operational effectiveness because the rotation disengages. The empirically useful target is the highest recall achievable at 65 to 75 percent precision.
Run a two-stage channel and maintain the event registry. A quiet channel for daily review and a paging channel for confirmed incidents separates investigation cost from interruption cost. The exogenous-event registry removes most of the recurring false-positive sources at a maintenance cost of about thirty minutes a quarter.
Measure the detector watching itself. Alert volume, acknowledgement rate, precision, and time-to-acknowledge are the metrics that predict whether a detector survives. If acknowledgement drops below 60 percent, raise the threshold; do not raise the volume.