Outlier Detection

What this alert condition does

Alert when a signal is significantly different from its peers in the same time period.

When to use this alert condition

Useful for identifying inconsistent behavior among a population of emitters (within the same time period), such as which node in a cluster is using more CPU than the others.

Note

To compare current signal values to past values of the same signal, use Sudden Change or Historical Anomaly.

Example

You could use this condition to detect a host that was never added to your load balancer, or a problem between a host and the load balancer. For example, if a metric tracks requests routed to each host behind a load balancer, you could trigger an outlier alert when the value of the metric is more than 2.5 standard deviations below the mean of similar signals for 80% of 5 minutes.
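As a rough illustration of the comparison in this example, the following Python sketch flags hosts whose request count is more than 2.5 standard deviations below the mean of their peers at a single point in time. The function name, host names, and values are hypothetical, and the use of the population standard deviation is an assumption; the sketch does not model the "80% of 5 minutes" duration and is not the product's implementation.

```python
from statistics import mean, pstdev


def low_outliers(values_by_host, num_deviations=2.5):
    """Return hosts whose value is more than num_deviations below the mean."""
    values = list(values_by_host.values())
    norm = mean(values)
    spread = pstdev(values)  # assumption: population standard deviation
    if spread == 0:
        return []  # every host reports the same value, so none is an outlier
    floor = norm - num_deviations * spread
    return [host for host, value in values_by_host.items() if value < floor]


# Hypothetical requests routed to each of 12 hosts in the same time period.
counts = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 100, 5]
requests_per_host = {f"host-{i}": v for i, v in enumerate(counts, start=1)}

print(low_outliers(requests_per_host))
```

Here only the last host, which receives almost no traffic, falls below the floor; the other hosts sit slightly above the mean because the outlier pulls the mean down.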

Basic settings

PARAMETER | VALUES | USAGE NOTES
Alert when | Too high, Too low, Too high or Too low | none
Trigger Sensitivity | Low, Medium, High, Custom | Approximately how often alerts will be triggered, where Low can result in fewer alerts being triggered and alerts taking longer to clear (least flappy). Choose Custom to modify the settings that determine triggering and clearing sensitivity (listed below).

Advanced settings

PARAMETER | VALUES | USAGE NOTES
Define thresholds by | Deviations from norm, Norm plus percentage change | Whether to express the comparison as a statistic (number of deviations) or as a percentage.
Normal based on (when Define thresholds by is Deviations from norm) | Mean plus standard deviation, Median plus median absolute deviation | Median plus median absolute deviation is recommended for small populations (fewer than 15 members).
Normal defined by (when Define thresholds by is Norm plus percentage change) | Mean, Median | Median is less influenced by extreme values.
(optional) Group by | Dimension or property chosen from dropdown menu | Use a dimension or property when you want the norm to differ according to the values of that dimension or property. For example, if you choose aws_availability_zone and your zones are US-east and US-west, instances in US-east are compared only to other instances in US-east, and likewise for US-west. If you choose None, there is one norm, and all members are compared to it.
Trigger threshold and Clear threshold (when Define thresholds by is Deviations from norm) | Number >= 0; Clear threshold must be lower than Trigger threshold. | The number of deviations away from the norm required to trigger or clear an alert. For example, a trigger value of 3.5 will trigger an alert when the values being compared differ from the norm by 3.5 standard deviations or more. Higher trigger values result in lower sensitivity and potentially fewer alerts. A clear value of 2.5 will clear the alert when the values being compared differ from the norm by 2.5 standard deviations or less. Lower clear values result in alerts taking longer to clear, because the signal must return closer to the norm.
Trigger threshold and Clear threshold (when Define thresholds by is Norm plus percentage change) | Number between 0 and 100, inclusive; Clear threshold must be lower than Trigger threshold. | The percentage change required to trigger or clear the alert. For example, a trigger value of 30 will trigger an alert when the values being compared differ from the norm by 30% or more. Higher trigger values result in lower sensitivity and potentially fewer alerts. A clear value of 20 will clear the alert when the values being compared differ from the norm by 20% or less. A larger gap between Trigger threshold and Clear threshold results in alerts taking longer to clear.
Trigger duration | Percent: Integer between 1 and 100; Time indicator: Integer >= 1, followed by time indicator (s, m, h, d, w), e.g. 30s, 10m, 2h, 5d, 1w | The number of times the signal must meet the trigger threshold, compared to the number of expected datapoints. Higher percentages and/or longer time periods result in lower sensitivity and potentially fewer alerts. For more information about this option, see Using the Duration option.
Clear duration | Percent: Integer between 1 and 100; Time indicator: Integer >= 1, followed by time indicator (s, m, h, d, w), e.g. 30s, 10m, 2h, 5d, 1w | The number of times the signal must meet the clear threshold, compared to the number of expected datapoints. Higher percentages and/or longer time periods result in longer times for alerts to clear (i.e. increased confidence that the alert condition is in fact no longer occurring). For more information about this option, see Using the Duration option.
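To make the interplay between Trigger threshold and Clear threshold more concrete, here is a minimal Python sketch of the hysteresis they create. The function name and the input series are hypothetical; each input value is assumed to be how far one member is from the norm, already expressed in deviations. Trigger duration and Clear duration are deliberately ignored, so this is a simplification rather than the product's logic.

```python
def alert_states(deviation_scores, trigger=3.5, clear=2.5):
    """Return, for each score, whether the alert is active after seeing it."""
    if clear >= trigger:
        raise ValueError("Clear threshold must be lower than Trigger threshold")
    active = False
    states = []
    for score in deviation_scores:
        if not active and score >= trigger:
            active = True   # far enough from the norm to trigger
        elif active and score <= clear:
            active = False  # close enough to the norm again to clear
        states.append(active)
    return states


# Hypothetical distances from the norm, in deviations, over six time steps.
scores = [1.0, 3.6, 3.0, 2.7, 2.4, 3.0]
print(alert_states(scores))
```

Scores between the two thresholds leave the state unchanged: the alert stays active at 3.0 and 2.7 after triggering at 3.6, and the final 3.0 does not re-trigger it after it cleared at 2.4.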

Using the Duration option

The Trigger duration and Clear duration options are used to trigger or clear alerts based on how many datapoints met the threshold during the specified time window, compared to how many datapoints were expected.

  • Specifying 100% means that all expected datapoints arrived (there were no delayed or missing datapoints) and all met the threshold. In other words, if you specify 100% of a time range, an alert will not be triggered if any datapoints are delayed or do not arrive at all during that time range, even if all the datapoints that are received do meet the threshold. (For more information about delayed or missing datapoints, see Handling delayed or missing datapoints.)

    Tip

    To specify that an alert triggers immediately, specify 100% of 1 second.

  • Specifying a percentage below 100 has a few effects:

    • For Trigger duration, a lower percentage is more sensitive (may trigger more alerts) than 100%, because fewer anomalous datapoints are needed to trigger an alert. It can also trigger alerts even if some datapoints are missing, as long as the required number of anomalous datapoints arrive.
    • For Clear duration, a lower percentage can clear alerts more quickly than 100%, because fewer normal datapoints are needed to meet the clear condition. It can also clear an alert even if some datapoints are missing, as long as the required number of normal datapoints arrive.
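The arithmetic behind the Duration option can be sketched in Python as follows. The function names are hypothetical, and rounding the required count up to a whole datapoint is an assumption; the key point the sketch illustrates is that the percentage applies to expected datapoints, not received ones.

```python
def required_datapoints(percent, window_seconds, resolution_seconds):
    """Return (expected, required) datapoint counts for a duration setting."""
    expected = window_seconds // resolution_seconds
    # Integer ceiling of expected * percent / 100 (rounding up is an assumption).
    required = (expected * percent + 99) // 100
    return expected, required


def duration_met(matching_received, percent, window_seconds, resolution_seconds):
    """True if enough datapoints meeting the threshold arrived in the window."""
    _, required = required_datapoints(percent, window_seconds, resolution_seconds)
    return matching_received >= required


# 80% of 10 minutes at 10-second resolution: 60 expected, 48 required.
print(required_datapoints(80, 600, 10))
# 47 matching datapoints are not enough, no matter how few datapoints arrived.
print(duration_met(47, 80, 600, 10))
```

Note that the number of datapoints actually received never appears in the check; only the count that met the threshold is compared against the required share of expected datapoints.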

The following examples illustrate how this option would affect triggering and clearing alerts in various situations.

Alert example 1

  • Percent of duration you specify: 100% of 10 minutes

  • Resolution of the signal: 10 seconds

  • Number of datapoints expected in 10 minutes: 6 per minute * 10 minutes (60)

  • Number of anomalous datapoints (how many times the threshold must be met) to trigger alert: 100% of 60 (60)

    Total datapoints expected | Total datapoints received | Anomalous datapoints required | Anomalous datapoints received | Alert is triggered?
    60 | 60 | 60 | 60 | Yes
    60 | 60 | 60 | 59 or fewer | No
    60 | 59 | 60 | 59 | No

    Note that in the last example above, even though 100% of the datapoints that arrived were anomalous, the required number of anomalous datapoints (60) did not arrive. Therefore, the alert will not be triggered. The percent you specify represents percent of expected datapoints, not percent of received datapoints.

Alert example 2

  • Percent of duration you specify: 80% of 10 minutes

  • Resolution of the signal: 10 seconds

  • Number of datapoints expected in 10 minutes: 6 per minute * 10 minutes (60)

  • Number of anomalous datapoints (how many times the threshold must be met) to trigger alert: 80% of 60 (48)

    Total datapoints expected | Total datapoints received | Anomalous datapoints required | Anomalous datapoints received | Alert is triggered?
    60 | 60 | 48 | 48-60 | Yes
    60 | 50 | 48 | 48-50 | Yes
    60 | 50 | 48 | 47 | No

    Note that in the last example above, even though 47/50 is greater than the 80% you specified, the required number of anomalous datapoints (48) did not arrive. Therefore, the alert will not be triggered. The percent you specify represents percent of expected datapoints, not percent of received datapoints.

Clear example 1

  • Percent of duration you specify: 100% of 15 minutes

  • Resolution of the signal: 30 seconds

  • Number of datapoints expected in 15 minutes: 2 per minute * 15 minutes (30)

  • Number of normal datapoints (how many times the clear threshold must be met) to clear alert: 100% of 30 (30)

    Total datapoints expected | Total datapoints received | Normal datapoints required | Normal datapoints received | Alert is cleared?
    30 | 30 | 30 | 30 | Yes
    30 | 30 | 30 | 29 or fewer | No
    30 | 25 | 30 | 25 | No

    Note that in the last example above, even though 100% of the datapoints that arrived were normal, only 25 of the 30 expected datapoints arrived. Therefore, the alert will not be cleared. The percent you specify represents percent of expected datapoints, not percent of received datapoints.

Clear example 2

  • Percent of duration you specify: 50% of 15 minutes

  • Resolution of the signal: 30 seconds

  • Number of datapoints expected in 15 minutes: 2 per minute * 15 minutes (30)

  • Number of normal datapoints (how many times the clear threshold must be met) to clear alert: 50% of 30 (15)

    Total datapoints expected | Total datapoints received | Normal datapoints required | Normal datapoints received | Alert is cleared?
    30 | 30 | 15 | 15-30 | Yes
    30 | 20 | 15 | 15-20 | Yes
    30 | 20 | 15 | 14 | No

    Note that in the last example above, even though 14 of the 20 received datapoints were normal, and 14/20 is greater than the 50% you specified, the required number of normal datapoints (15) did not arrive. Therefore, the alert will not be cleared. The percent you specify represents percent of expected datapoints, not percent of received datapoints.

Further reading

PARAMETER(S) | REMARK(S)
Alert when | The setting “Too high or Too low” will trigger an alert for a signal that oscillates between above and below the band (provided it spends enough time outside the band).
Trigger and clear duration | These parameters should be considerably longer than the data resolution.
Trigger threshold and Outlier algorithm | Mean plus standard deviation will never trigger an alert for n standard deviations if n^2 + 1 is greater than or equal to the size of the population being monitored. Therefore, Median plus median absolute deviation is recommended for small populations (fewer than 15 members).
Trigger threshold and clear threshold | These produce dynamic thresholds, which can be somewhat disorienting. For example, an alert can be triggered when the signal value is 31.4 (in units of the original metric, not deviations) and clear when the value is 55.1 (because the rest of the population now also shows elevated values).
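
The small-population remark above can be checked numerically. With the population standard deviation, no member of a population of size N can be more than sqrt(N - 1) standard deviations from the mean, which is consistent with the n^2 + 1 rule in the table. The Python sketch below compares the two ways of measuring distance from the norm on a hypothetical five-member population. The function names and values are made up for illustration, the population standard deviation is an assumption, and the median absolute deviation is left unscaled (some implementations multiply it by a constant).

```python
from math import sqrt
from statistics import mean, median, pstdev


def std_score(values, x):
    """Distance of x from the mean, in population standard deviations."""
    return abs(x - mean(values)) / pstdev(values)


def mad_score(values, x):
    """Distance of x from the median, in (unscaled) median absolute deviations."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    return abs(x - center) / mad


# Hypothetical population of five members with one obvious outlier (50).
population = [10, 11, 9, 10, 50]

print(std_score(population, 50))   # stays below sqrt(5 - 1) = 2
print(sqrt(len(population) - 1))
print(mad_score(population, 50))   # far larger, so a 3.5 threshold can fire
```

With five members, a trigger threshold of 3.5 can never be reached using mean and standard deviation, however extreme the outlier is, while the median-based score clearly separates it from its peers.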