Monitoring Elastic Applications 🔗

In the last several years, many applications have been written to take advantage of cloud platforms like Amazon Web Services or newer container-based offerings like Kubernetes or Docker Cloud (see the Docker plugin for collectd). While there are many benefits to doing so, there are also some new challenges. By definition, cloud infrastructure is meant to grow or shrink with your applications as needed, and the instances or containers hosting the services that make up your application are also meant to spin up or down on demand.

The elastic and ephemeral nature of this infrastructure and the applications they support can cause a variety of issues for monitoring tools. In this document, we describe common problems with monitoring elastic applications and discuss how you can use SignalFx to address them.

Challenges with monitoring elastic applications 🔗

Elastic applications (and the services they are composed of) scale dynamically based on load. Effective monitoring of elastic applications and services needs to consider the following:

  • Dynamic node count. An elastic service runs on different numbers of instances over time. Nodes are added or removed as load on the service changes, making it tricky to obtain meaningful views of the service as a whole.
  • Load balancing. Work is shared among nodes using a load balancing algorithm. How well is the algorithm performing? Are some nodes doing more than their fair share of work, and what is the impact to overall application performance?
  • Ensuring capacity. When do nodes need to be added or removed? Do you have enough capacity to handle the demands on your application or service?

An example 🔗

Let’s say you are running an application, ‘cloudapp’, that runs on your favorite cloud provider, and that cloudapp is composed of a number of services, including one named ‘websearch’. Every instance used by cloudapp has a unique identifier, e.g. an instance_id or hostname, and each instance is sending in instance.cpu (a system metric). In addition, websearch instances are sending in the metric websearch_num_requests (an application-specific metric).
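To make the setup concrete, here is a rough Python sketch of the datapoints each instance might report. The helper name, the instance identifier, and the payload shape are illustrative assumptions, not the actual SignalFx wire format.

```python
# Hypothetical sketch: each instance tags its datapoints with both a
# unique identifier (instance_id) and a service-level dimension shared
# by all nodes of that service.

def make_datapoint(metric, value, instance_id, service):
    """Build a datapoint carrying a unique instance id plus a
    service dimension common to every node of the service."""
    return {
        "metric": metric,
        "value": value,
        "dimensions": {"instance_id": instance_id, "service": service},
    }

# Every instance reports the system metric...
cpu = make_datapoint("instance.cpu", 42.0, "i-0abc123", "websearch")
# ...and websearch instances also report an application-specific metric.
reqs = make_datapoint("websearch_num_requests", 1500, "i-0abc123", "websearch")
```

The key point is that the `service` dimension is stable even as `instance_id` values come and go.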

Dynamic node count 🔗

As it turns out, cloudapp is predominantly used by people in North America, so in the daytime, it scales up to handle the traffic accordingly, and at night it scales back down, shedding instances that are unnecessary. The identifiers for the instances change regularly, confounding your tools that rely on consistent host names.

In the case of SignalFx, it is easy to build charts that automatically handle dynamic node count.

  • Instead of using the unique, per-instance identifiers, you can use a dimension that is common across all of your websearch nodes for filtering and aggregation.
  • Alternatively, you can name your nodes using a pattern that works with wildcard searches, e.g. websearch*.

SignalFx refreshes the set of instances matching the dimension or wildcard search every few minutes, and adjusts the charts it draws accordingly.

As an example, let’s say you’ve added a dimension to all of the metrics being sent in, service:[name_of_service]. Now, when you want to look at CPU utilization across only your websearch nodes, you specify the metric as usual (instance.cpu), and then apply the appropriate dimension filter (service:websearch).

[Figure: chart of instance.cpu filtered to service:websearch]
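Conceptually, the dimension filter keeps only datapoints whose `service` dimension matches, then aggregates over whatever instances happen to exist. A minimal plain-Python illustration (the datapoint values here are made up):

```python
# Simulated datapoints from two services; only the dimension value
# distinguishes them -- no per-instance identifiers are needed.
datapoints = [
    {"metric": "instance.cpu", "value": 40.0, "dimensions": {"service": "websearch"}},
    {"metric": "instance.cpu", "value": 60.0, "dimensions": {"service": "websearch"}},
    {"metric": "instance.cpu", "value": 90.0, "dimensions": {"service": "checkout"}},
]

# Filter: metric is instance.cpu AND dimension service == websearch.
websearch_cpu = [
    dp["value"] for dp in datapoints
    if dp["metric"] == "instance.cpu"
    and dp["dimensions"].get("service") == "websearch"
]

# Aggregate across however many websearch nodes are currently reporting.
mean_cpu = sum(websearch_cpu) / len(websearch_cpu)  # 50.0
```

If a websearch node is added or removed, the filter picks up the change automatically; nothing in the query refers to a specific host.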

Or, if you want to compare average CPU utilization of websearch vs. another service, you can use the dimension as a group‑by value.

[Figure: average instance.cpu grouped by the service dimension]
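The group-by works the same way: partition datapoints by the dimension value, then aggregate within each partition. A rough sketch with invented sample values:

```python
from collections import defaultdict

# (service, cpu) samples from the currently reporting instances.
samples = [
    ("websearch", 40.0),
    ("websearch", 60.0),
    ("checkout", 80.0),
    ("checkout", 90.0),
]

# Group by the service dimension.
by_service = defaultdict(list)
for service, cpu in samples:
    by_service[service].append(cpu)

# Average CPU within each group, for side-by-side comparison.
mean_by_service = {svc: sum(v) / len(v) for svc, v in by_service.items()}
# {"websearch": 50.0, "checkout": 85.0}
```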

SignalFx calculates these values dynamically, so a chart spanning one week will automatically reflect the actual set of instances in use at each point in time.

Load balancing 🔗

Another common issue with elastic services is understanding their load balancing effectiveness. If load is not shared evenly across the nodes that comprise the service, performance may suffer. Traditional monitoring tools aren’t particularly useful for this use case because of their focus on node availability. In other words, even though all of the nodes in a service are up, some instances may be overloaded, causing degraded performance.

SignalFx addresses this use case by making it easy to apply statistical functions to incoming metrics. For example, one way to determine load balancing effectiveness is to calculate the ratio of work being performed by the least to most loaded instance. The closer the resulting ratio is to 1, the better the load balancing. In our example, we can calculate this ratio easily by comparing the minimum and maximum websearch_num_requests.

[Figure: ratio of minimum to maximum websearch_num_requests across nodes]
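The calculation itself is simple. As a plain-Python illustration (node names and request counts are hypothetical):

```python
# Requests handled per websearch node over some interval.
requests_per_node = {"node-a": 900, "node-b": 1000, "node-c": 1100}

# Ratio of the least-loaded to the most-loaded node:
# 1.0 means perfectly even load; values near 0 mean severe skew.
effectiveness = min(requests_per_node.values()) / max(requests_per_node.values())
# 900 / 1100, roughly 0.82 -- reasonably well balanced
```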

You may want to use a moving average function, e.g. mean(10m), to smooth out transient variations, or set up detectors to alert you when the load balancing effectiveness ratio falls below a threshold.
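To show what that smoothing does, here is a sketch of a trailing moving average over per-minute effectiveness ratios. This is an illustration of the idea, not SignalFx's implementation; the window of 3 and the sample values are arbitrary.

```python
def moving_mean(values, window):
    """Trailing moving average; early points average over however
    many samples exist so far."""
    out = []
    for i in range(len(values)):
        w = values[max(0, i - window + 1): i + 1]
        out.append(sum(w) / len(w))
    return out

# Noisy per-minute load balancing effectiveness ratios: the dip to 0.4
# is transient and should not, by itself, trip an alert.
ratios = [0.9, 0.4, 0.9, 0.8, 0.9]
smoothed = moving_mean(ratios, window=3)
```

After smoothing, the single-sample dip is dampened, so a detector on the smoothed series fires only on sustained imbalance.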

Ensuring capacity 🔗

In an application with a distributed architecture, understanding when and where to add more capacity can be challenging. It is not easy to predict how a surge in incoming requests, for example, will surface new bottlenecks in the pipeline of application services that process those requests. Those bottlenecks define your effective peak capacity, and you might not know it until it’s too late.

In these situations, having the ability to collect system metrics (CPU, memory, disk I/O, network) along with arbitrary application-specific metrics (number of requests, request latency, etc.) becomes important. With that information in hand, you can:

  • Identify the limiting system resource for a service. The limiting resource is one that will get saturated first as application load increases. With SignalFx it is easy to create a dashboard that shows application metrics side by side with relevant system metrics, making it easy to identify whether CPU, memory, disk or network is your scarcest resource.
  • Set up detectors to alert you when the overall utilization of that resource is outside the desired band, enabling you to add or remove instances in a timely manner.

Using our websearch example, let’s say we create a dashboard to show service performance alongside resource utilization.

[Figure: dashboard showing websearch performance alongside resource utilization]

It looks like websearch might well be CPU-limited (although we should probably look into disk too!). We can then construct a chart to look at the average CPU utilization across the websearch nodes.

[Figure: average CPU utilization across websearch nodes]

And if we want to manage our capacity for websearch, we simply define a detector that determines if the average CPU utilization exceeds an 80% threshold, say, for 15 minutes.

[Figure: detector firing when average CPU utilization exceeds 80% for 15 minutes]
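The detector condition (above threshold for a sustained duration) can be sketched in plain Python. This is only a conceptual model of the "for 15 minutes" logic, assuming one averaged CPU sample per minute; it is not how SignalFx detectors are defined.

```python
def should_alert(avg_cpu_per_minute, threshold=80.0, duration=15):
    """Fire only when average CPU stays above the threshold for
    `duration` consecutive per-minute samples."""
    streak = 0
    for cpu in avg_cpu_per_minute:
        streak = streak + 1 if cpu > threshold else 0
        if streak >= duration:
            return True
    return False
```

A brief excursion above 80% does not fire; only a sustained 15-minute breach does, which avoids paging on transient spikes while still leaving time to add capacity.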

Doing so prevents websearch from becoming the bottleneck for cloudapp, and ensures that its needs as an elastic service are being met.