
Find the root cause of a problem with Splunk APM


The original µAPM product, released in 2019, is now called µAPM Previous Generation (µAPM PG). Wherever you see it, the name µAPM now refers to the product released on March 31, 2020.

If you’re using µAPM Previous Generation (µAPM PG), see Overview of SignalFx Microservices APM Previous Generation (µAPM PG).

The following example shows one way in which Splunk APM can help you quickly narrow down the cause of a problem. The example assumes the following:

  • A high error rate on an endpoint
  • A cloud-native environment running several microservices
  • A containerized infrastructure

Suppose that you just received a high error rate alert on the /checkout endpoint of the api service. From the alert dialog box, click Troubleshoot to go directly to troubleshooting in APM, with the time, service, endpoint, and environment context carried over.
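An alert like this typically comes from a detector on the endpoint's error rate. The following is a minimal sketch of one possible detector written in SignalFlow; the metric, dimension names, and threshold are assumptions for this example, not a prescribed configuration:

```signalflow
# Error rate for the api service's /checkout endpoint.
errors = data('spans.count',
              filter=filter('sf_service', 'api') and
                     filter('sf_operation', '/checkout') and
                     filter('sf_error', 'true')).sum()
total = data('spans.count',
             filter=filter('sf_service', 'api') and
                    filter('sf_operation', '/checkout')).sum()
error_rate = (errors / total) * 100

# Fire when more than 10% of requests error for 5 minutes (threshold is illustrative).
detect(when(error_rate > 10, '5m')).publish('High error rate on api:/checkout')
```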


In the service map, the circle inside the api:/checkout endpoint has hashed lines, which indicates that the error is rooted in a service farther downstream.


Because you are investigating a high error rate issue, click the Requests and Errors card to get more insight. Information about the error sources is displayed at the bottom of the docked card in an error stack. An error stack identifies the full path of the error. In this example, there is one error stack, identified by the name of the service in which the error is rooted (payment).


Click the error stack payment to display the full error path. The errors originate in the payment service, propagate to the checkout service, and finally reach the api service.


With that understanding, you can now filter on the whole path. In the service map, double-click the checkout service and then the payment service to see the full error path. The circle inside the payment service is solid, which indicates that the error originates in that service (a root cause error).


Now, let’s see if there are any trends in the errors observed in the payment service. Top tags in error spans surfaces the indexed tags that have the highest error count in the selected service (payment). It looks like the problem is with a particular Kubernetes node (node6), as every request to it results in an error.


Let’s explore further. You know that node6 has problems. From the Breakdown drop-down menu in the service map, select kubernetes_node to confirm that the issue affects only node6.


Next, find out whether a particular tenant within node6 is having issues. Select tenant from the Breakdown drop-down menu; this confirms that the gold, platinum, and silver tenants all have the same issue. All of the problems are rooted in a single node (node6), which the tag analysis already surfaced.


Now, let’s look at an example trace. Click a point that corresponds to high errors in the Request Rate chart to display a list of example traces to choose from. Click a trace ID to see the trace.


Click /payment/execute (the most downstream span with errors) to display that span's metadata. You can see all of the tags, including the kubernetes_node tag that identifies the node the problematic span ran on.


Now, let’s explore what’s going on with node6 by navigating to the Kubernetes Navigator. In the node details, you can see the containers running on node6. Notice that one container (robot) is using approximately 90% of the memory on this node, which puts memory pressure on the payment pod. Click robot to open the sidebar and drill down into details without losing context. In this case, the container has no memory limit, which is likely why it is consuming most of the memory on the node.
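The usual remedy for a container like this is to give it explicit resource requests and limits so the scheduler and kubelet can contain it. The following is a minimal sketch of what that could look like in the robot container's pod spec; the image name and memory sizes are illustrative assumptions, not values from the example environment:

```yaml
# Illustrative pod spec fragment: bound the robot container's memory use.
apiVersion: v1
kind: Pod
metadata:
  name: robot
spec:
  containers:
    - name: robot
      image: example/robot:latest   # placeholder image
      resources:
        requests:
          memory: "256Mi"   # memory the scheduler reserves for the container
        limits:
          memory: "512Mi"   # the kubelet OOM-kills the container above this
```

With a limit in place, the container can no longer starve other pods on the node, such as the payment pod in this example.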


In summary, a “noisy neighbor” put memory pressure on the pod that the payment service was running on, causing errors that then propagated all the way upstream to the api service, which triggered a high error rate alert.

To get started implementing APM, see Get started with Splunk APM.