etcd

Metadata associated with the etcd plugin for collectd can be found here. The relevant code for the plugin can be found here.

DESCRIPTION

This is the SignalFx etcd plugin. Follow these instructions to install the etcd plugin for collectd.

The etcd-collectd plugin collects metrics from etcd instances by querying two kinds of endpoints: statistics (default metrics) and metrics (optional metrics).
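As a rough illustration of the two kinds of endpoints, the sketch below builds the URLs a collector might poll on one member. The /v2/stats/self, /v2/stats/store and /v2/stats/leader paths belong to etcd's v2 statistics API, and /metrics is its Prometheus-format endpoint. That the plugin polls exactly these paths is an assumption, and the helper name endpoint_urls and its parameters are hypothetical.

```python
def endpoint_urls(host, port, include_metrics_endpoint=False, use_ssl=False):
    """Return the URLs to poll on one etcd member (hypothetical helper)."""
    scheme = "https" if use_ssl else "http"
    base = "{0}://{1}:{2}".format(scheme, host, port)
    # Default metrics: etcd v2 statistics endpoints (assumed to be what
    # the plugin calls "statistics").
    urls = [base + "/v2/stats/" + name for name in ("self", "store", "leader")]
    # Optional metrics: polled only when data from /metrics is wanted.
    if include_metrics_endpoint:
        urls.append(base + "/metrics")
    return urls


for url in endpoint_urls("localhost", 2379, include_metrics_endpoint=True):
    print(url)
```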

FEATURES

Built-in dashboards

  • ETCD CLUSTER: Provides a high-level overview of metrics for a single etcd cluster.

image1

image2

  • ETCD INSTANCE: Provides metrics from a single etcd instance.

image3

  • ETCD INSTANCES: Provides metrics from the etcd instances running on a particular host.

image4

REQUIREMENTS AND DEPENDENCIES

Version information

Software | Version
collectd | 4.9 or later
python | 2.6 or later
etcd | 2.0.8 or later
Python plugin for collectd | (included with SignalFx collectd agent)

INSTALLATION

If you are using the new Smart Agent, see the docs for the collectd/etcd monitor for more information. The configuration documentation below may be helpful as well, but consult the Smart Agent repo’s docs for the exact schema.

  1. Download collectd-etcd. Place the etcd_plugin.py file in /usr/share/collectd/collectd-etcd
  2. Download the sample configuration file for this plugin to /etc/collectd/managed_config
  3. Modify the sample configuration file as described in Configuration, below
  4. Install the Python requirements with sudo pip install -r requirements.txt
  5. Restart collectd

CONFIGURATION

Using the example configuration file 10-etcd.conf as a guide, provide values for the configuration options listed below that make sense for your environment and allow you to connect to the etcd members.

Configuration option | Definition | Example value
ModulePath | Path on disk where collectd can find this module | “/usr/share/collectd/collectd-etcd/”
Host | Host name of the etcd member | “localhost”
Port | Port at which the member can be reached | “2379”
Cluster | Name of this etcd cluster | “1”
EnhancedMetrics | Boolean indicating whether stats from the /metrics endpoint are needed | “false”
IncludeMetric | Metric name from the /metrics endpoint to include (valid when EnhancedMetrics is “false”) | “etcd_debugging_mvcc_slow_watcher_total”
ExcludeMetric | Metric name from the /metrics endpoint to exclude (valid when EnhancedMetrics is “true”) | “etcd_server_has_leader”
Dimension | Space-separated key-value pair for a user-defined dimension | dimension_name dimension_value
Interval | Number of seconds between calls to the etcd API | 10
ssl_keyfile | Path to the keyfile | “path/to/file”
ssl_certificate | Path to the certificate | “path/to/file”
ssl_ca_certs | Path to the CA file | “path/to/file”
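
The interaction between EnhancedMetrics, IncludeMetric and ExcludeMetric can be sketched as follows. This illustrates the rules stated in the option descriptions and is not the plugin's actual code; the function name select_optional_metrics is hypothetical.

```python
def select_optional_metrics(available, enhanced_metrics, include=(), exclude=()):
    """Pick which /metrics names to report, following the documented rules."""
    if enhanced_metrics:
        # EnhancedMetrics true: report everything except ExcludeMetric entries.
        blocked = set(exclude)
        return [name for name in available if name not in blocked]
    # EnhancedMetrics false: report only the IncludeMetric entries.
    wanted = set(include)
    return [name for name in available if name in wanted]


available = [
    "etcd_server_has_leader",
    "etcd_debugging_mvcc_slow_watcher_total",
    "etcd_network_peer_sent_bytes_total",
]
# Only the explicitly included metric is reported.
print(select_optional_metrics(
    available, False, include=["etcd_debugging_mvcc_slow_watcher_total"]))
# Everything except the excluded metric is reported.
print(select_optional_metrics(
    available, True, exclude=["etcd_server_has_leader"]))
```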

Example configuration:

LoadPlugin python
<Plugin python>
  ModulePath "/usr/share/collectd/collectd-etcd/"

  Import etcd_plugin
  <Module etcd_plugin>
    Host "localhost"
    Port "2379"
    Interval 10
    Cluster "1"
    Dimension dimension_name dimension_value
    EnhancedMetrics False
    IncludeMetric metric_name_from_metrics_endpoint
    ssl_keyfile "/Users/as001/work/play/etcd/etcd-ca/etcd-ca/private/etcd-client.key"
    ssl_certificate "/Users/as001/work/play/etcd/etcd-ca/etcd-ca/certs/etcd-client.crt"
    ssl_ca_certs "/Users/as001/work/play/etcd/etcd-ca/etcd-ca/certs/ca.crt"
  </Module>
</Plugin>

The plugin can be configured to collect metrics from multiple instances in the following manner.

LoadPlugin python

<Plugin python>
  ModulePath "/usr/share/collectd/collectd-etcd/"
  Import etcd_plugin
  <Module etcd_plugin>
    Host "localhost"
    Port "2379"
    Interval 10
    Cluster "prod"
  </Module>
  <Module etcd_plugin>
    Host "localhost"
    Port "22379"
    Interval 10
    Cluster "prod"
    IncludeMetric "etcd_debugging_mvcc_slow_watcher_total"
    IncludeMetric "etcd_debugging_store_reads_total"
    IncludeMetric "etcd_server_has_leader"
    IncludeMetric "etcd_network_peer_sent_bytes_total"
  </Module>
  <Module etcd_plugin>
    Host "localhost"
    Port "32379"
    Interval 10
    Cluster "test"
  </Module>
</Plugin>

USAGE

Interpreting Built-in dashboards

  • ETCD CLUSTER:
  • Number of Followers: Shows the number of followers in the cluster. A cluster with 2n + 1 members can tolerate the failure of n members. Because the Raft consensus algorithm needs a majority of members to make progress, a cluster should have at least 3 members to survive any failure.

image5

  • Number of Watchers: Shows the total number of watchers on all the members of the cluster put together. Gives an overview of memory consumption by the watchers on the cluster as a whole.

image6

  • Followers with Max Number of Watchers: Gives an overview of the members that serve the most watchers. Watching is memory intensive. The list shows the members (host:port information) and their corresponding states.

image7

  • Top Current Latency: Gives an overview of the followers (host:port) with the highest current latency with respect to the leader. Since Raft relies on log replication across all the members, this helps identify the followers that lag the most.

image8

  • Total RPC Requests (successful/failed): A stacked chart that shows successful (in green) and failed (in red) RPC requests per second across all the followers. The leader sends RPC requests and the followers receive them.

image9

  • Per Member Failed RPCs: A stacked chart showing failed RPC requests per second on a per-follower basis. Comparing this chart with the one above helps identify the followers that cause the most failures.

image10

  • Top RPC Requests: Followers with top RPC requests, both successful and failed.

image11

  • Store operations (successful/failed): This includes the following charts: Creates, Sets, Updates, Deletes, Compare-and-Swaps and Compare-and-Deletes. These charts are stacked charts that show successful operations (in green) and failed operations (in red) per second. This gives an idea of the ratio between success and failure for each operation type.
image12
image13
image14
image15
image16
image17
  • Receive Packet Rate: Stacked chart of the packets received per second for each follower. At any given point in time, followers receive packets from the leader (the leader sends information as part of log replication).

image18

  • Receive Append Requests: Stacked chart of the append requests received per second for each follower. At any given point in time, followers receive append requests from the leader (the leader sends information as part of log replication).

image19

  • Send Packet Rate: Chart of the packets sent per second by the leader. At any given point in time, only the leader sends packets. Ideally, every packet sent by the leader is received by one of the followers. Comparing this chart with Receive Packet Rate shows whether packets are not being received by the followers (or by an individual follower). Latency can also be observed through these charts.

image20

  • Send Append Requests: Chart of the append requests sent per second by the leader. At any given point in time, only the leader sends append requests. Ideally, every append request sent by the leader is received by one of the followers. Comparing this chart with Receive Append Requests shows whether append requests are not being received by the followers (or by an individual follower). Latency can also be observed through these charts.

image21

  • ETCD INSTANCE:
  • Number of Watchers: Shows the number of watchers on this particular instance. Watching is memory intensive and might explain high memory utilization.

image22

  • Expire Rate: The number of keys and directories that expire per second. This is common to the distributed key-value store. However, when a member leaves the cluster, this metric becomes instance specific.

image23

  • Gets (successful/failed): A stacked chart that shows successful gets (in green) and failed gets (in red) per second. This gives insight into the ratio between successful and failed get requests per second for the instance. A high fail count for gets may be caused by a high expire rate.

image24

  • Receive / Send Bandwidth Rate: A line graph showing both the sent (in blue) and received (in green) bandwidth rates for the instance. Followers receive and the leader sends.

image25

  • Receive / Send Append Requests: A line graph showing both the sent (in blue) and received (in green) append requests per second for the instance. Followers receive and the leader sends.

image26

  • ETCD INSTANCES: Provides metrics from the etcd instances running on a particular host.
  • Number of instances: The total number of etcd instances running on the host, grouped by type (follower/leader).

image27

  • Instances by Number of Watchers: A line graph that shows the number of watchers on each of the instances on the host. Instances with more watchers consume more memory.

image28

  • Instances with Most Number of Watchers: Shows the instances with the most watchers. Watching is memory intensive.

image29

  • Packets Exchange Trend: A stacked chart showing packets sent (in blue) and received (in green) across all instances on the host. Gives an idea of bandwidth usage.

image30

  • Bandwidth Trend Rate: A stacked chart showing send bandwidth (in blue) and receive bandwidth (in green) rates across all instances on the host. Gives an idea of bandwidth usage and should show trends similar to the chart above.

image31

  • Top Bandwidth Rate: Gives a list of the instances that consume max bandwidth, both for sending and receiving put together.

image32

  • Gets Successful Trend: A stacked chart showing the number of successful get operations per second for each of the instances running on the host.

image33

  • Gets Failed Trend: A stacked chart showing the number of failed get operations per second for each of the instances running on the host. Compare with the chart above to analyze the success ratio.

image34

  • Top Gets per second: A list of the instances on the host that perform the most gets per second, counting successful and failed gets together.

image35

  • Expire Rate Trend: A line chart showing the rate of expiry of keys/directories for all the instances on the host.

image36

  • Top Expire Rate: A list of instances with top expire rates. Can be used to analyze if gets fail due to a high expiry rate.

image37
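
The fault-tolerance rule cited under Number of Followers (a cluster of 2n + 1 members tolerates n failures) follows from Raft requiring a majority of members to agree. A minimal sketch of that arithmetic, with hypothetical function names:

```python
def quorum(members):
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can fail while a majority still remains."""
    return members - quorum(members)


for size in (1, 3, 4, 5):
    print("members=%d quorum=%d tolerated_failures=%d"
          % (size, quorum(size), tolerated_failures(size)))
```

A four-member cluster tolerates only one failure, the same as a three-member cluster, which is why odd cluster sizes are generally preferred.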

All metrics reported by the etcd collectd plugin will contain the following dimensions by default:

  • state, whether the member is a follower or a leader
  • cluster, human-readable cluster name used to group members by cluster
  • follower, metrics from the leader endpoint will have this dimension to group by follower

A few other details:

  • plugin is always set to etcd
  • plugin_instance will contain the IP address and the port of the member given in the configuration
  • To add metrics from the /metrics endpoint, use the configuration options described in Configuration. If metrics are included individually, make sure the names are valid, for example etcd_debugging_mvcc_slow_watcher_total or etcd_network_peer_sent_bytes_total.

METRICS

By default, metrics about a member, leader and store are provided. Metrics from the /metrics endpoint can be activated through the configuration file. Note that SignalFx does not support the histogram and summary metric types, so metrics of these types are skipped even if they are named in the configuration. See usage for details.
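
To illustrate how histogram and summary metrics could be filtered out, the sketch below reads the "# TYPE <name> <type>" declarations of the Prometheus text format served by /metrics and keeps only the other types. It is not the plugin's code: the function name is hypothetical and the sample text is hand-written for the example.

```python
UNSUPPORTED_TYPES = {"histogram", "summary"}


def supported_metric_types(exposition_text):
    """Map metric name -> type, leaving out histogram and summary metrics."""
    kept = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        # Type declarations look like: "# TYPE <name> <type>"
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            name, metric_type = parts[2], parts[3]
            if metric_type not in UNSUPPORTED_TYPES:
                kept[name] = metric_type
    return kept


# Hand-written sample in Prometheus text format (illustrative values only).
sample = """\
# HELP etcd_server_has_leader Whether or not a leader exists.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_sum 0.5
"""
print(supported_metric_types(sample))
```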

Metric naming

Default metric names reported by the plugin follow the format <metric type>.etcd.<endpoint name>.<name of metric>. Optional metrics are named as they appear on the /metrics endpoint, with each _ replaced by a period (.).
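
The two naming rules can be expressed in a few lines. The helper names below are hypothetical and only restate the format described above.

```python
def default_metric_name(metric_type, endpoint, name):
    """Build a default name: <metric type>.etcd.<endpoint name>.<name of metric>."""
    return "{0}.etcd.{1}.{2}".format(metric_type, endpoint, name)


def optional_metric_name(metrics_endpoint_name):
    """Optional metrics keep their /metrics name, with _ replaced by a period."""
    return metrics_endpoint_name.replace("_", ".")


print(default_metric_name("counter", "store", "gets.success"))
print(optional_metric_name("etcd_server_has_leader"))
```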

Below is a list of all metrics.

Metric Name | Brief | Type
counter.etcd.leader.counts.fail | Total number of failed RPC requests to a follower | counter
counter.etcd.leader.counts.success | Total number of successful RPC requests to a follower | counter
counter.etcd.self.recvappendreq.cnt | Total number of append requests received by a member | counter
counter.etcd.self.sendappendreq.cnt | Total number of append requests sent by a member | counter
counter.etcd.store.compareanddelete.fail | Total number of failed compare-and-delete operations | counter
counter.etcd.store.compareanddelete.success | Total number of successful compare-and-delete operations | counter
counter.etcd.store.compareandswap.fail | Total number of failed compare-and-swap operations | counter
counter.etcd.store.compareandswap.success | Total number of successful compare-and-swap operations | counter
counter.etcd.store.create.fail | Total number of failed create operations | counter
counter.etcd.store.create.success | Total number of successful create operations | counter
counter.etcd.store.delete.fail | Total number of failed delete operations | counter
counter.etcd.store.delete.success | Total number of successful delete operations | counter
counter.etcd.store.expire.count | Total number of items expired due to TTL | counter
counter.etcd.store.gets.fail | Total number of failed get operations | counter
counter.etcd.store.gets.success | Total number of successful get operations | counter
counter.etcd.store.sets.fail | Total number of failed set operations | counter
counter.etcd.store.sets.success | Total number of successful set operations | counter
counter.etcd.store.update.fail | Total number of failed update operations | counter
counter.etcd.store.update.success | Total number of successful update operations | counter
gauge.etcd.leader.latency.average | Average latency of a follower with respect to the leader | gauge
gauge.etcd.leader.latency.current | Current latency of a follower with respect to the leader | gauge
gauge.etcd.leader.latency.max | Max latency of a follower with respect to the leader | gauge
gauge.etcd.leader.latency.min | Min latency of a follower with respect to the leader | gauge
gauge.etcd.leader.latency.stddev | Standard deviation of the latency of a follower with respect to the leader | gauge
gauge.etcd.self.recvbandwidth.rate | Receive bandwidth rate of a follower | gauge
gauge.etcd.self.recvpkg.rate | Rate at which a follower receives packages | gauge
gauge.etcd.self.sendbandwidth.rate | Send bandwidth rate of a leader | gauge
gauge.etcd.self.sendpkg.rate | Rate at which a leader sends packages | gauge
gauge.etcd.store.watchers | Number of watchers | gauge
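
The counter metrics above are cumulative totals, so a per-second rate like those shown on the dashboards can be derived from two successive samples. The following is a minimal sketch, assuming samples taken one Interval apart; the function names and sample values are made up for illustration and do not describe how SignalFx computes its charts.

```python
def per_second_rate(previous_total, current_total, interval_seconds):
    """Rate of change of a cumulative counter between two samples."""
    delta = current_total - previous_total
    if delta < 0:
        # The counter went backwards (for example after a member restart),
        # so no meaningful rate can be computed for this interval.
        return None
    return delta / float(interval_seconds)


def failure_ratio(success_rate, fail_rate):
    """Fraction of operations that failed, or 0.0 when there was no traffic."""
    total = success_rate + fail_rate
    return fail_rate / total if total else 0.0


# Two samples of counter.etcd.store.gets.success and .fail, 10 seconds apart.
success = per_second_rate(1000, 1090, 10)
fail = per_second_rate(50, 60, 10)
print(success, fail, failure_ratio(success, fail))
```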

counter.etcd.leader.counts.fail

counter

The total number of failed RPC requests to a follower. This metric is reported with the dimension state to indicate the current state, cluster name and follower name.

counter.etcd.leader.counts.success

counter

The total number of successful RPC requests to a follower. This metric is reported with the dimension state to indicate the current state, cluster name and follower name.

counter.etcd.self.recvappendreq.cnt

counter

The total number of append requests received by a member. Followers receive append requests from the leader of the cluster. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.self.sendappendreq.cnt

counter

The total number of append requests sent by a member. The leader sends append requests to the followers in the cluster. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.compareanddelete.fail

counter

The total number of failed compare-and-delete operations in the store. This metric is common to all the members of the cluster and therefore, only reported by the leader. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.compareanddelete.success

counter

The total number of successful compare-and-delete operations in the store. This metric is common to all the members of the cluster and therefore, only reported by the leader. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.compareandswap.fail

counter

The total number of failed compare-and-swap operations in the store. This metric is common to all the members of the cluster and therefore, only reported by the leader. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.compareandswap.success

counter

The total number of successful compare-and-swap operations in the store. This metric is common to all the members of the cluster and therefore, only reported by the leader. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.create.fail

counter

The total number of failed create operations in the store. This metric is common to all the members of the cluster and therefore, only reported by the leader. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.create.success

counter

The total number of successful create operations in the store. This metric is common to all the members of the cluster and therefore, only reported by the leader. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.delete.fail

counter

The total number of failed delete operations in the store. This metric is common to all the members of the cluster and therefore, only reported by the leader. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.delete.success

counter

The total number of successful delete operations in the store. This metric is common to all the members of the cluster and therefore, only reported by the leader. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.expire.count

counter

The total number of keys/directories expired due to TTL. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.gets.fail

counter

The total number of failed get operations in the store. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.gets.success

counter

The total number of successful get operations in the store. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.sets.fail

counter

The total number of failed set operations in the store. This metric is common to all the members of the cluster and therefore, only reported by the leader. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.sets.success

counter

The total number of successful set operations in the store. This metric is common to all the members of the cluster and therefore, only reported by the leader. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.update.fail

counter

The total number of failed update operations in the store. This metric is common to all the members of the cluster and therefore, only reported by the leader. This metric is reported with the dimension state to indicate the current state and cluster name.

counter.etcd.store.update.success

counter

The total number of successful update operations in the store. This metric is common to all the members of the cluster and therefore, only reported by the leader. This metric is reported with the dimension state to indicate the current state and cluster name.

gauge.etcd.leader.latency.average

gauge

The average latency of a follower with respect to the leader. This metric is reported with the dimension state to indicate the current state, cluster name and follower name.

gauge.etcd.leader.latency.current

gauge

The current latency of a follower with respect to the leader. This metric is reported with the dimension state to indicate the current state, cluster name and follower name.

gauge.etcd.leader.latency.max

gauge

The max latency of a follower with respect to the leader. This metric is reported with the dimension state to indicate the current state, cluster name and follower name.

gauge.etcd.leader.latency.min

gauge

The min latency of a follower with respect to the leader. This metric is reported with the dimension state to indicate the current state, cluster name and follower name.

gauge.etcd.leader.latency.stddev

gauge

The standard deviation of the latency of a follower with respect to the leader. This metric is reported with the dimension state to indicate the current state, cluster name and follower name.

gauge.etcd.self.recvbandwidth.rate

gauge

The receive bandwidth rate of a follower. This metric is reported with the dimension state to indicate the current state and cluster name.

gauge.etcd.self.recvpkg.rate

gauge

The rate at which a follower receives packages. This metric is reported with the dimension state to indicate the current state and cluster name.

gauge.etcd.self.sendbandwidth.rate

gauge

The send bandwidth rate of a leader. This metric is reported with the dimension state to indicate the current state and cluster name.

gauge.etcd.self.sendpkg.rate

gauge

The rate at which a leader sends packages. This metric is reported with the dimension state to indicate the current state and cluster name.

gauge.etcd.store.watchers

gauge

The number of watchers. This metric is reported with the dimension state to indicate the current state and cluster name.