When a Detector Does Not Trigger an Alert 🔗

You can access detectors from the Alerts page, from the Detector menu (bell icon) on a chart, or through the Infrastructure Navigator.

If a detector does not trigger an alert under conditions when you expected one, open it and take the following troubleshooting steps:

1. Consider whether your detector is robust enough to accommodate aperiodic data. For example, metrics sent when events occur are not likely to be generated on a regular and predictable cadence.

2. Check the Alert Signal, Alert Condition, and Alert Settings tabs for the detector. If detector rules seem correct, data might have been delayed enough to arrive outside a specified time window, or the resolution of the input data might have changed since the detector was created.

3. Compare the resolution of the signal against the resolution reported in the Detail View of the detector. A mismatch between the two can show up in either of two ways:

  • If the resolution in the Detail View is coarser than the resolution of the signal, then your reporting interval has likely changed, and the detector no longer fires because it considers the data too unreliable to trigger alert events.
  • If the resolution in the Detail View is finer than the resolution of the signal, then rolled-up data might be causing an inconsistency in the criteria being evaluated (see the sketch after this list).
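
As a generic illustration of the second case, the following sketch uses made-up numbers (not taken from any real detector) to show how rolling up finer-resolution data can change whether a threshold condition appears to be met.

```python
# Hypothetical signal reported every 10 seconds, evaluated against a
# threshold of 50, then rolled up to 1-minute resolution with an average rollup.
ten_second_values = [60, 60, 10, 10, 10, 10]  # one minute of 10-second data
threshold = 50

above_at_native_resolution = [v > threshold for v in ten_second_values]
rolled_up_value = sum(ten_second_values) / len(ten_second_values)  # 1-minute average

print(above_at_native_resolution)   # [True, True, False, False, False, False]
print(rolled_up_value > threshold)  # False: the rolled-up minute never crosses 50
```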

Avoiding issues with timestamp mismatches 🔗

Apart from not triggering an expected alert, delayed data can sometimes create the opposite problem of triggering an alert where none seems warranted. If data that you see in the detector does not match the chart preview shown in an alert message, then data might have been unavailable because it was delayed or missing while the detector was running.

For example, if your detector triggers an alert when a metric is above 50 for 5 minutes, and your data comes in once every minute, the following table shows how some metric data points arrive several minutes late:

Value   Timestamp on metric (when it was expected to arrive)   Time metric actually arrived
30      11:07                                                   11:07
55      11:08                                                   11:08
40      11:09                                                   11:15
30      11:10                                                   11:15
20      11:11                                                   11:16
45      11:12                                                   11:16
20      11:13                                                   11:16

In this scenario, the detector does not trigger an alert if all the data arrives on time. From the detector's real-time point of view, however, the most recent value it had received was over 50 from 11:08 until 11:15, when the 11:09 data point with a value of 40 finally arrived. Given the detector parameters (above 50 for 5 minutes), an alert is triggered 5 minutes after 11:08, at 11:13.

When you look at the detector later, however, the data points shown in the chart reflect the correct timestamps. That is, all the data points from 11:09 onward show values under 50 at the times when the metrics were sent and expected to arrive, so it doesn't look as though the triggering threshold condition was ever met.
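
To see why the alert fires at 11:13, the following sketch (plain Python, not product code; the evaluation logic is simplified for illustration) replays the table from the detector's real-time point of view, where only data that has already arrived is visible and the most recently received value stands in for the current one.

```python
from datetime import datetime, timedelta

# (value, timestamp on metric, time the metric actually arrived)
points = [
    (30, "11:07", "11:07"),
    (55, "11:08", "11:08"),
    (40, "11:09", "11:15"),
    (30, "11:10", "11:15"),
    (20, "11:11", "11:16"),
    (45, "11:12", "11:16"),
    (20, "11:13", "11:16"),
]

def t(s):
    return datetime.strptime(s, "%H:%M")

threshold, duration = 50, timedelta(minutes=5)
above_since = None

# Evaluate once per minute, using only the data that has arrived by then.
now = t("11:07")
while now <= t("11:16"):
    arrived = [(value, t(ts)) for value, ts, arrival in points if t(arrival) <= now]
    latest = max(arrived, key=lambda p: p[1])[0]  # most recently received value
    if latest > threshold:
        above_since = above_since or now
        if now - above_since >= duration:
            print(f"alert at {now:%H:%M}")        # prints: alert at 11:13
            break
    else:
        above_since = None
    now += timedelta(minutes=1)
```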

You can use several strategies to avoid this problem:

  • If you manually set a Max Delay value on the detector, reset that value to Auto. Letting Splunk Infrastructure Monitoring adjust max delay automatically based on incoming data usually prevents delayed data from inadvertently triggering alerts.
  • Data points that are not sent within the expected time frame are considered null by default and excluded from calculations. Setting the extrapolation policy to 0 (zero) for the detector treats the missing values as 0, which, for an above-threshold condition like the one in this example, keeps delayed or missing data from triggering the alert.
  • You can change the signal and condition that trigger the alert. For example, add the Mean analytics function to the signal with a transformation value of 5 minutes, and have the detector fire immediately. In the example table, there is no 5-minute period during which the mean value is over 50, so no alert is triggered (see the sketch after this list).
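
As a quick check of the last strategy, this sketch (plain Python, not product code) computes the 5-minute rolling mean of the values in the example table, assuming one data point per minute; none of the means exceeds 50, so a detector firing on the transformed signal stays quiet.

```python
values = [30, 55, 40, 30, 20, 45, 20]  # 11:07 through 11:13, one point per minute
window = 5                             # 5-minute Mean transformation

rolling_means = [
    sum(values[i - window + 1 : i + 1]) / window
    for i in range(window - 1, len(values))
]
print(rolling_means)                       # [35.0, 38.0, 31.0]
print(any(m > 50 for m in rolling_means))  # False: the mean never exceeds 50
```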

Avoiding correlation conflict 🔗

Correlation is a function that shows how strongly a pair of variables is related or associated, that is, the degree to which the two change together at a constant rate.

A time series is defined by a metric name and a set of dimensions. If you use a custom alert threshold that compares two plots, each holding a metric, and the two metrics do not have the same dimensions, a correlation conflict between them might prevent the alert from firing.

When one metric holds dimensions that the other does not, the analytic engine cannot compare (correlate) the two metrics to each other without extra help.

To fix this, aggregate the plots, which strips the problematic dimensions and allows the remaining series to be correlated, as the sketch below illustrates. For more information, see the Splunk Blogs post Metadata Correlation: The Magic Behind the Math.
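
The following sketch (plain Python with hypothetical metric names, dimensions, and values, not product code) illustrates the conflict and the fix: the first plot carries a service dimension that the second plot does not, so the series cannot be paired until the first plot is aggregated down to the shared host dimension.

```python
from collections import defaultdict

# Plot A: hypothetical cache.hits metric, reported per host and per service.
plot_a = {("host-1", "svc-a"): 80, ("host-1", "svc-b"): 40, ("host-2", "svc-a"): 90}
# Plot B: hypothetical cache.requests metric, reported per host only.
plot_b = {("host-1",): 150, ("host-2",): 100}

# The dimension sets differ, so the series in A and B cannot be matched one-to-one.
# Aggregating A over the service dimension (summing per host) strips the extra
# dimension and leaves one series per host, which lines up with B.
a_by_host = defaultdict(int)
for (host, _service), value in plot_a.items():
    a_by_host[(host,)] += value

hit_ratio = {key: a_by_host[key] / plot_b[key] for key in plot_b}
print(hit_ratio)  # {('host-1',): 0.8, ('host-2',): 0.9}
```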

Using the count function to determine whether an instance should be down 🔗

In an ephemeral infrastructure environment, where instances are constantly coming and going, traditional monitoring mechanisms do not work well: they require manual configuration for new elements, and they assume that non-reporting of a metric is always alert-worthy. That assumption is a problem when non-reporting is the expected effect of autoscaling, as when an instance is turned down on purpose. By using analytics, however, you can alert only when non-reporting is unexpected.

The analytics function that helps in this situation is count. Be sure to select the analytics function and not the rollup. The count function tells you how many time series are reporting a value at a given point in time. If an instance stops reporting a metric, for example, because it has been terminated purposefully, then its time series is not counted.

You can take advantage of this function to tell you how many instances are reporting, but you need a property that tells you the expected state of the instance. For example, Amazon publishes the state of an EC2 instance: terminated, running, and so on. Splunk Infrastructure Monitoring imports that as aws_state. With this information, you can do the following:

  1. Set up a plot that uses a heartbeat metric of your choosing such as memory.free.
  2. Filter out the emitters that have been terminated on purpose, for example, !aws_state:terminated.
  3. Apply the count function with a group-by on a dimension that represents a single emitter, for example, aws_tag_Name.

This plot then emits a 0 or 1. An alert that is triggered when the output is 0 tells you that the instance is down unexpectedly.
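
The following sketch (plain Python with made-up sample data, not product code or SignalFlow) walks through the three steps above for a single point in time: terminated emitters are filtered out, the remaining heartbeat time series are counted per aws_tag_Name, and a count of 0 marks an instance that is down unexpectedly.

```python
from collections import Counter

# Time series currently reporting the heartbeat metric (memory.free), with dimensions.
reporting_now = [
    {"aws_tag_Name": "web-1", "aws_state": "running"},
    {"aws_tag_Name": "web-2", "aws_state": "running"},
    # web-3 has stopped reporting but is not marked terminated
]

# Emitters we know about, with the state property Amazon publishes for them.
expected_state = {
    "web-1": "running",
    "web-2": "running",
    "web-3": "running",      # still expected to report
    "web-4": "terminated",   # scaled down on purpose
}

# Steps 2 and 3: filter out terminated emitters, then count reporting
# time series grouped by aws_tag_Name.
counts = Counter(
    ts["aws_tag_Name"] for ts in reporting_now if ts["aws_state"] != "terminated"
)

for name, state in expected_state.items():
    if state == "terminated":
        continue                  # expected to be silent; no alert
    if counts[name] == 0:         # the count output is 0: heartbeat missing
        print(f"{name} is down unexpectedly")  # prints only: web-3 is down unexpectedly
```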

You can apply this general concept to anything you want, as long as you have three things:

  • A heartbeat metric that reports regularly
  • A canonical dimension that represents the emitter or source that you care about
  • A property on that dimension that denotes the expected state of the emitter

These items are packaged in the Heartbeat Check built-in alert condition. For more information about that alert condition, see Heartbeat Check.