
etcd

DESCRIPTION

This integration primarily consists of the Smart Agent monitor collectd/etcd. Below is an overview of that monitor.

Smart Agent Monitor

Monitors an etcd key/value store using the collectd etcd Python plugin.

Requires etcd 2.0.8 or later.

INSTALLATION

This integration is part of the SignalFx Smart Agent as the collectd/etcd monitor. You should first deploy the Smart Agent to the same host as the service you want to monitor, and then continue with the configuration instructions below.

CONFIGURATION

To activate this monitor in the Smart Agent, add the following to your agent config:

monitors:  # All monitor config goes under this key
 - type: collectd/etcd
   ...  # Additional config

For a list of monitor options that are common to all monitors, see Common Configuration.

| Config option | Required | Type | Description |
| --- | --- | --- | --- |
| pythonBinary | no | string | Path to a Python binary that should be used to execute the Python code. If not set, a built-in runtime will be used. Can include arguments to the binary as well. |
| host | yes | string | |
| port | yes | integer | |
| clusterName | yes | string | An arbitrary name of the etcd cluster to make it easier to group together and identify instances. |
| sslKeyFile | no | string | Client private key if using client certificate authentication. |
| sslCertificate | no | string | Client public key if using client certificate authentication. |
| sslCACerts | no | string | Certificate authority or host certificate to trust. |
| skipSSLValidation | no | bool | If true, etcd's SSL certificate will not be verified. Enabling this option results in the sslCACerts option being ignored. (default: false) |
| enhancedMetrics | no | bool | (default: false) |
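
As an illustration of how the options above fit together, here is a minimal sketch of a monitor entry that uses client certificate authentication. The host address, cluster name, and file paths are placeholders invented for this example, not values from this page; 2379 is etcd's conventional client port and may differ in your deployment.

```yaml
monitors:
  - type: collectd/etcd
    # Required options (placeholder values)
    host: 127.0.0.1
    port: 2379                # conventional etcd client port; adjust as needed
    clusterName: example-cluster
    # Optional: client certificate authentication (hypothetical paths)
    sslKeyFile: /etc/etcd/ssl/client-key.pem
    sslCertificate: /etc/etcd/ssl/client.pem
    sslCACerts: /etc/etcd/ssl/ca.pem
    # Keep validation enabled; setting this to true causes sslCACerts to be ignored
    skipSSLValidation: false
```

If etcd is served without TLS, the four ssl-related lines can simply be omitted.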

USAGE

Interpreting Built-in dashboards

  • ETCD CLUSTER:

    • Number of Followers: Shows the number of followers in the cluster. A cluster of 2n + 1 members can tolerate the failure of n members. Because the Raft consensus algorithm needs a majority of members to make progress, a cluster should have at least 3 members.

    • Number of Watchers: Shows the total number of watchers across all members of the cluster. Gives an overview of the memory consumed by watchers on the cluster as a whole.

    • Followers with Max Number of Watchers: Shows the members that receive the most watch requests. Watching is memory intensive. The list gives each member (host:port) and its corresponding state.

    • Top Current Latency: Lists the followers (host:port) with the highest current latency to the leader. Because Raft relies on replicating the log to all members, this chart helps identify the followers with the highest latency.

    • Total RPC Requests (successful/failed): A stacked chart that shows successful (in green) and failed (in red) RPC requests per second across all the followers. The leader sends RPC requests and the followers receive them.

    • Per Member Failed RPCs: A stacked chart showing failed RPC requests per second for each follower. Comparing this chart with the one above helps identify the followers that cause the most failures.

    • Top RPC Requests: Lists the followers with the most RPC requests, both successful and failed.

    • Store operations (successful/failed): A group of charts, one each for Creates, Sets, Updates, Deletes, Compare-and-Swaps and Compare-and-Deletes. Each is a stacked chart that shows successful operations (in green) and failed operations (in red) per second, which gives an idea of the ratio between success and failure for each operation type.

    • Receive Packet Rate: A stacked chart of the packets received per second by each follower. At any given point in time, followers receive packets from the leader, which sends them as part of log replication.

    • Receive Append Requests: A stacked chart of the append requests received per second by each follower. At any given point in time, followers receive append requests from the leader, which sends them as part of log replication.

    • Send Packet Rate: A chart of the packets sent per second by the leader. At any given point in time, only the leader sends packets. Ideally, every packet sent by the leader is received by one of the followers. Comparing this chart with Receive Packet Rate shows whether packets are failing to reach the followers (or an individual follower). Latency can also be observed through these charts.

    • Send Append Requests: A chart of the append requests sent per second by the leader. At any given point in time, only the leader sends append requests. Ideally, every append request sent by the leader is received by one of the followers. Comparing this chart with Receive Append Requests shows whether append requests are failing to reach the followers (or an individual follower). Latency can also be observed through these charts.

  • ETCD INSTANCE:

    • Number of Watchers: Shows the number of watchers on this particular instance. Watching is memory intensive and might explain high memory utilization.

    • Expire Rate: The number of keys and directories that expire per second. This value is common to the whole distributed key-value store. However, when a member leaves the cluster, this metric becomes instance specific.

    • Gets (successful/failed): A stacked chart that shows successful gets (in green) and failed gets (in red) per second. This gives insight into the ratio between successful and failed get requests for the instance. A high failure count for gets may be caused by a high expire rate.

    • Receive / Send Bandwidth Rate: A line graph showing both the sent (in blue) and received (in green) bandwidth rate for the instance. Followers receive and the leader sends.

    • Receive / Send Append Requests: A line graph showing both the sent (in blue) and received (in green) append requests per second for the instance. Followers receive and the leader sends.

  • ETCD INSTANCES: Provides metrics from the etcd instances on a particular host.

    • Number of instances: The total number of etcd instances running on the host, grouped by type (follower/leader).

    • Instances by Number of Watchers: A line graph that shows the number of watchers on each of the instances on the host. Instances with more watchers consume more memory.

    • Instances with Most Number of Watchers: Shows the instances with the most watchers. Watching is memory intensive.

    • Packets Exchange Trend: A stacked chart showing packets sent (in blue) and received (in green) across all instances on the host. Gives an idea of bandwidth usage.

    • Bandwidth Trend Rate: A stacked chart showing send bandwidth (in blue) and receive bandwidth (in green) rates across all instances on the host. Gives an idea of bandwidth usage and should show similar trends to the chart above.

    • Top Bandwidth Rate: Lists the instances that consume the most bandwidth, sending and receiving combined.

    • Gets Successful Trend: A stacked chart showing the number of successful get operations per second for each of the instances running on the host.

    • Gets Failed Trend: A stacked chart showing the number of failed get operations per second for each of the instances running on the host. Compare with the chart above to analyze the success ratio.

    • Top Gets per second: Lists the instances on the host that perform the most gets per second, successful and failed combined.

    • Expire Rate Trend: A line chart showing the rate at which keys and directories expire for all the instances on the host.

    • Top Expire Rate: Lists the instances with the highest expire rates. Can be used to analyze whether gets fail due to a high expire rate.

All metrics reported by the etcd collectd plugin will contain the following dimensions by default:

  • state, whether the member is a follower or a leader
  • cluster, human-readable cluster name used to group members by cluster
  • follower, metrics from the leader endpoint carry this dimension so that they can be grouped by follower

A few other details:

  • plugin is always set to etcd
  • plugin_instance will contain the IP address and the port of the member given in the configuration
  • To add metrics from the /metrics endpoint, use the configuration options described in the CONFIGURATION section. If metrics are being included individually, make sure to give names that are valid, for example etcd_debugging_mvcc_slow_watcher_total or etcd_network_peer_sent_bytes_total.
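
The configuration table on this page lists an enhancedMetrics option without a description. The sketch below assumes, without confirmation from this page, that this is the flag that turns on collection from the /metrics endpoint; the host, port, and cluster name are placeholders.

```yaml
monitors:
  - type: collectd/etcd
    host: 127.0.0.1               # placeholder
    port: 2379                    # placeholder
    clusterName: example-cluster  # placeholder
    # Assumption: enables metrics scraped from etcd's /metrics endpoint
    enhancedMetrics: true
```

After changing the config, running agent-status monitors on the agent host is one way to check which metrics are actually being emitted.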

METRICS

| Metric Name | Description | Type |
| --- | --- | --- |
| counter.etcd.leader.counts.fail | Total number of failed RPC requests with a follower | counter |
| counter.etcd.leader.counts.success | Total number of successful RPC requests with a follower | counter |
| counter.etcd.self.recvappendreq.cnt | Total number of append requests received by a member | counter |
| counter.etcd.self.sendappendreq.cnt | Total number of append requests sent by a member | counter |
| counter.etcd.store.compareanddelete.fail | Total number of failed compare-and-delete operations | counter |
| counter.etcd.store.compareanddelete.success | Total number of successful compare-and-delete operations | counter |
| counter.etcd.store.compareandswap.fail | Total number of failed compare-and-swap operations | counter |
| counter.etcd.store.compareandswap.success | Total number of successful compare-and-swap operations | counter |
| counter.etcd.store.create.fail | Total number of failed create operations | counter |
| counter.etcd.store.create.success | Total number of successful create operations | counter |
| counter.etcd.store.delete.fail | Total number of failed delete operations | counter |
| counter.etcd.store.delete.success | Total number of successful delete operations | counter |
| counter.etcd.store.expire.count | Total number of items expired due to TTL | counter |
| counter.etcd.store.gets.fail | Total number of failed get operations | counter |
| counter.etcd.store.gets.success | Total number of successful get operations | counter |
| counter.etcd.store.sets.fail | Total number of failed set operations | counter |
| counter.etcd.store.sets.success | Total number of successful set operations | counter |
| counter.etcd.store.update.fail | Total number of failed update operations | counter |
| counter.etcd.store.update.success | Total number of successful update operations | counter |
| gauge.etcd.leader.latency.average | Average latency of a follower with respect to the leader | gauge |
| gauge.etcd.leader.latency.current | Current latency of a follower with respect to the leader | gauge |
| gauge.etcd.leader.latency.max | Max latency of a follower with respect to the leader | gauge |
| gauge.etcd.leader.latency.min | Min latency of a follower with respect to the leader | gauge |
| gauge.etcd.leader.latency.stddev | Std dev latency of a follower with respect to the leader | gauge |
| gauge.etcd.self.recvbandwidth.rate | Bandwidth rate of a follower | gauge |
| gauge.etcd.self.recvpkg.rate | Rate at which a follower receives packages | gauge |
| gauge.etcd.self.sendbandwidth.rate | Bandwidth rate of a leader | gauge |
| gauge.etcd.self.sendpkg.rate | Rate at which a leader sends packages | gauge |
| gauge.etcd.store.watchers | Number of watchers | gauge |

Metrics that are categorized as container/host (default) are in bold and italics in the list below.

These are the metrics available for this integration.

  • counter.etcd.leader.counts.fail (counter)
    Total number of failed RPC requests with a follower
  • counter.etcd.leader.counts.success (counter)
    Total number of successful RPC requests with a follower
  • counter.etcd.self.recvappendreq.cnt (counter)
    Total number of append requests received by a member
  • counter.etcd.self.sendappendreq.cnt (counter)
    Total number of append requests sent by a member
  • counter.etcd.store.compareanddelete.fail (counter)
    Total number of failed compare-and-delete operations
  • counter.etcd.store.compareanddelete.success (counter)
    Total number of successful compare-and-delete operations
  • counter.etcd.store.compareandswap.fail (counter)
    Total number of failed compare-and-swap operations
  • counter.etcd.store.compareandswap.success (counter)
    Total number of successful compare-and-swap operations
  • counter.etcd.store.create.fail (counter)
    Total number of failed create operations
  • counter.etcd.store.create.success (counter)
    Total number of successful create operations
  • counter.etcd.store.delete.fail (counter)
    Total number of failed delete operations
  • counter.etcd.store.delete.success (counter)
    Total number of successful delete operations
  • counter.etcd.store.expire.count (counter)
    Total number of items expired due to TTL
  • counter.etcd.store.gets.fail (counter)
    Total number of failed get operations
  • counter.etcd.store.gets.success (counter)
    Total number of successful get operations
  • counter.etcd.store.sets.fail (counter)
    Total number of failed set operations
  • counter.etcd.store.sets.success (counter)
    Total number of successful set operations
  • counter.etcd.store.update.fail (counter)
    Total number of failed update operations
  • counter.etcd.store.update.success (counter)
    Total number of successful update operations
  • gauge.etcd.leader.latency.average (gauge)
    Average latency of a follower with respect to the leader
  • gauge.etcd.leader.latency.current (gauge)
    Current latency of a follower with respect to the leader
  • gauge.etcd.leader.latency.max (gauge)
    Max latency of a follower with respect to the leader
  • gauge.etcd.leader.latency.min (gauge)
    Min latency of a follower with respect to the leader
  • gauge.etcd.leader.latency.stddev (gauge)
    Std dev latency of a follower with respect to the leader
  • gauge.etcd.self.recvbandwidth.rate (gauge)
    Bandwidth rate of a follower
  • gauge.etcd.self.recvpkg.rate (gauge)
    Rate at which a follower receives packages
  • gauge.etcd.self.sendbandwidth.rate (gauge)
    Bandwidth rate of a leader
  • gauge.etcd.self.sendpkg.rate (gauge)
    Rate at which a leader sends packages
  • gauge.etcd.store.watchers (gauge)
    Number of watchers

Non-default metrics (version 4.7.0+)

The following information applies to the agent version 4.7.0+ that has enableBuiltInFiltering: true set on the top level of the agent config.

To emit metrics that are not default, you can add those metrics in the generic monitor-level extraMetrics config option. Metrics that are derived from specific configuration options that do not appear in the above list of metrics do not need to be added to extraMetrics.

To see a list of metrics that will be emitted you can run agent-status monitors after configuring this monitor in a running agent instance.
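
A minimal sketch of the extraMetrics option described above. The metric name is taken from the list on this page purely for illustration; whether it counts as non-default depends on your agent version, and the host, port, and cluster name are placeholders.

```yaml
enableBuiltInFiltering: true      # set at the top level of the agent config

monitors:
  - type: collectd/etcd
    host: 127.0.0.1               # placeholder
    port: 2379                    # placeholder
    clusterName: example-cluster  # placeholder
    # Metrics to emit in addition to the defaults (illustrative choice)
    extraMetrics:
      - gauge.etcd.leader.latency.max
```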

Legacy non-default metrics (version < 4.7.0)

The following information only applies to agent versions older than 4.7.0. If you have a newer agent and have set enableBuiltInFiltering: true at the top level of your agent config, see the section above. See upgrade instructions in Old-style whitelist filtering.

If you have a reference to the whitelist.json in your agent's top-level metricsToExclude config option, and you want to emit metrics that are not in that whitelist, then you need to add an item to the top-level metricsToInclude config option to override that whitelist (see Inclusion filtering). Alternatively, you can copy the whitelist.json, modify it, and reference the modified copy in metricsToExclude.
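
A sketch of what a metricsToInclude override might look like on a legacy agent. The item layout (metricNames plus monitorType) is an assumption based on the agent's filter syntax and should be checked against the Inclusion filtering documentation; the metric name is an illustrative choice from the list on this page.

```yaml
# Top level of the agent config (legacy agents older than 4.7.0)
metricsToInclude:
  - metricNames:
      - gauge.etcd.leader.latency.max   # illustrative metric to let through
    monitorType: collectd/etcd          # limit the override to this monitor
```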