Elasticsearch

Metadata associated with the Elasticsearch collectd plugin can be found here. The relevant code for the plugin can be found here.

DESCRIPTION

This is the SignalFx Elasticsearch plugin for collectd. It sends data about Elasticsearch to SignalFx, enabling built-in Elasticsearch monitoring dashboards.

Use this plugin to monitor the following types of information from an Elasticsearch node:

  • node statistics (CPU, OS, JVM, indexing, search, thread pools, etc.)
  • per-index statistics
  • cluster statistics

Original Elasticsearch documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html

FEATURES

Built-in dashboards

  • Elasticsearch: Overview of all data from Elasticsearch hosts.

  • Elasticsearch Cluster: Focus on a single Elasticsearch cluster.

  • Elasticsearch Node: Focus further on a single Elasticsearch node.

  • Elasticsearch Indexes: Overview of all Elasticsearch indexes.

  • Elasticsearch Index: Focus on a single Elasticsearch index.

REQUIREMENTS AND DEPENDENCIES

Version information

Software                     Version
collectd                     4.9 or later
Elasticsearch                1.0.0 or later
Python plugin for collectd   (included with the SignalFx collectd agent)

INSTALLATION

If you are using the new Smart Agent, see the docs for the collectd/elasticsearch monitor for more information. The configuration documentation below may be helpful as well, but consult the Smart Agent repo's docs for the exact schema.

  1. Download the collectd-elasticsearch Python module.
  2. Download SignalFx's sample configuration file to /etc/collectd/managed_config.
  3. Modify the configuration file to provide values that make sense for your environment, as described below.
  4. Restart collectd.

CONFIGURATION

Using the example configuration file 20-elasticsearch.conf as a guide, provide values for the configuration options listed below that make sense for your environment and allow you to connect to the Elasticsearch instance to be monitored.

The plugin is intended to be run on a per-node basis; define only one "Module" element per 20-elasticsearch.conf configuration file. A sample configuration sketch follows the table below.

Each configuration option below is listed with its definition and an example value.

ModulePath: Path on disk where collectd can find this module. Example: "/usr/share/collectd/collectd-elasticsearch"
Verbose: Enable verbose logging. Example: false
Cluster: A name for this cluster. Appears in the dimension cluster. Example: "elasticsearch"
Indexes: Identifies the indexes for which the plugin should collect statistics. See note below. Example: ["_all"]
EnableIndexStats: Enable or disable collection of index statistics. Example: false
IndexStatsMasterOnly: When true, index stats are sent only if the node is the active master. When false, index stats are sent if the node is master-eligible. Requires EnableIndexStats to be true. Example: false
EnableClusterHealth: Enable or disable collection of cluster health statistics. Example: true
Interval: The interval in seconds at which the plugin reports metrics, independent of the overall collectd collection interval. Example: 10
Host: The hostname of this instance of Elasticsearch. Example: "localhost"
Port: The port number of this instance of Elasticsearch. Example: "9200"
DetailedMetrics: Turns on additional metric time series. Acceptable values: true or false. Example: false
IndexInterval: Interval in seconds at which the plugin reports index metrics. Must be greater than or equal to, and divisible by, Interval; incorrect values are automatically rounded to a compatible value. Example: 300
AdditionalMetrics: A Python list of additional metrics to be emitted. The names provided must match a metric defined in the elasticsearch_collectd.py file. Example: [""]
Username: The plain-text username for accessing the Elasticsearch installation (Basic Authentication only). Unconfigured by default.
Password: The plain-text password for accessing the Elasticsearch installation (Basic Authentication only). Unconfigured by default.
ThreadPools: The "search" and "index" thread pools are required, but additional thread pools can be added to the list. See the note on available thread pools below. Example: ["search", "index"]
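
Putting these options together, a minimal 20-elasticsearch.conf might look like the following sketch. The surrounding LoadPlugin/Plugin wrapper follows the usual collectd Python-plugin layout and may already be present in the sample file; the values shown are illustrative for a single local node, not recommendations:

LoadPlugin python
<Plugin python>
    ModulePath "/usr/share/collectd/collectd-elasticsearch"
    Import "elasticsearch_collectd"
    <Module "elasticsearch_collectd">
        # Connection and identification
        Host "localhost"
        Port "9200"
        Cluster "elasticsearch"
        # Reporting behavior
        Interval 10
        EnableClusterHealth true
        EnableIndexStats false
        ThreadPools ["search", "index"]
    </Module>
</Plugin>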

Note: Available thread pools

The following table indicates thread pools that can be monitored by this plugin in each version of Elasticsearch. Add thread pools of interest to the configuration parameter ThreadPools.

thread pool name      ES 1.x   ES 2.0   ES 2.1+
merge                   ✓
optimize                ✓
bulk                    ✓        ✓        ✓
flush                   ✓        ✓        ✓
generic                 ✓        ✓        ✓
get                     ✓        ✓        ✓
snapshot                ✓        ✓        ✓
warmer                  ✓        ✓        ✓
refresh                 ✓        ✓        ✓
fetch_shard_started              ✓        ✓
fetch_shard_store                ✓        ✓
listener                         ✓        ✓
management                       ✓        ✓
percolate                        ✓        ✓
suggest                          ✓        ✓
force_merge                               ✓
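
For example, to monitor the bulk and refresh pools in addition to the required search and index pools, the ThreadPools line in 20-elasticsearch.conf might look like the following sketch (the extra pool names are illustrative; choose pools supported by your Elasticsearch version, per the table above):

<Module "elasticsearch_collectd">
    # search and index are required; bulk and refresh are illustrative additions
    ThreadPools ["search", "index", "bulk", "refresh"]
</Module>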

Note: Using this plugin from a container deployment

If you are running the Elasticsearch plugin via a collectd deployment within a container, configure the Host and Port values inside of the 20-elasticsearch.conf file to correspond to the desired Elasticsearch instance.

Example:

<Module "elasticsearch_collectd">
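    # Hostname or IP address and port of the Elasticsearch instance,
    # as reachable from inside the collectd container (placeholders below)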
    Host "XXX.XXX.XXX.XXX"
    Port "XXXX"
</Module>

Note: Authentication

Currently only Basic Authentication is supported for the plugin.
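
If your installation requires authentication, credentials are passed in the same Module block; a sketch with placeholder values:

<Module "elasticsearch_collectd">
    # Placeholder credentials; sent using HTTP Basic Authentication
    Username "es_monitor"
    Password "changeme"
</Module>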

Note: Collecting index statistics

By default, the configuration parameter Indexes is set to "_all". This means that when EnableIndexStats is set to true, the plugin will collect statistics about all indexes. To collect statistics from only one index, set the configuration parameter Indexes to the name of that index: for example, ["index1"]. To collect statistics from multiple indexes (but not all), include them as a comma-separated list: for example, ["index1", "index2"].

This plugin collects index statistics only on master-eligible Elasticsearch nodes.

The call to collect index statistics can be CPU-intensive. For this reason, SignalFx recommends using the IndexInterval configuration parameter to reduce the frequency at which index statistics are reported.
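
Putting the index-statistics options together, an illustrative sketch for a master-eligible node that reports stats for two hypothetical indexes every five minutes:

<Module "elasticsearch_collectd">
    EnableIndexStats true
    # Send index stats only from the active master
    IndexStatsMasterOnly true
    Indexes ["index1", "index2"]
    # Report index stats every 300 seconds instead of every Interval
    IndexInterval 300
</Module>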

USAGE

For samples of the built-in dashboards this plugin enables in SignalFx, see the FEATURES section above.

METRICS

Below is a list of all metrics. Each line gives the metric name, followed by a brief description and the metric type (counter or gauge).
counter.indices.get.total The total number of get requests since node startup counter
counter.indices.indexing.index-total The total number of index requests since node startup counter
counter.indices.merges.total Total number of merges since node startup counter
counter.indices.search.query-time Total time spent in search queries (milliseconds) counter
counter.indices.search.query-total The total number of search requests since node startup counter
counter.indices.total.indexing.index-total The total number of index requests per cluster counter
counter.indices.total.merges.total Total number of merges per cluster counter
counter.indices.total.search.query-total The total number of search requests per cluster counter
counter.jvm.gc.time Total garbage collection time (milliseconds) counter
counter.thread_pool.bulk.rejected Number of rejected bulk requests counter
counter.thread_pool.flush.rejected Number of rejected flush requests counter
counter.thread_pool.generic.rejected Number of rejected generic requests counter
counter.thread_pool.get.rejected Number of rejected get requests counter
counter.thread_pool.index.rejected Number of rejected index requests counter
counter.thread_pool.merge.rejected Number of rejected merge requests counter
counter.thread_pool.optimize.rejected Number of rejected optimize requests counter
counter.thread_pool.refresh.rejected Number of rejected refresh requests counter
counter.thread_pool.rejected Number of rejected thread pool requests counter
counter.thread_pool.search.rejected Number of rejected search requests counter
counter.thread_pool.snapshot.rejected Number of rejected snapshot requests counter
gauge.cluster.active-primary-shards The number of active primary shards gauge
gauge.cluster.active-shards The number of active shards gauge
gauge.cluster.initializing-shards The number of currently initializing shards gauge
gauge.cluster.number-of-data_nodes The current number of data nodes in the cluster gauge
gauge.cluster.number-of-nodes Total number of nodes in the cluster gauge
gauge.cluster.relocating-shards The number of shards that are currently being relocated gauge
gauge.cluster.status The health status of the cluster gauge
gauge.cluster.unassigned-shards The number of shards that are currently unassigned gauge
gauge.indices.cache.field.size Field data size (bytes) gauge
gauge.indices.cache.filter.size Filter cache size (bytes) gauge
gauge.indices.docs.count Number of documents on this node gauge
gauge.indices.docs.deleted Number of deleted documents on this node gauge
gauge.indices.merges.current Number of active merges gauge
gauge.indices.segments.count Number of segments on this node gauge
gauge.indices.total.docs.count Number of documents in the cluster gauge
gauge.indices.total.fielddata.memory-size Field data size (bytes) gauge
gauge.indices.total.filter-cache.memory-size Filter cache size (bytes) gauge
gauge.jvm.mem.heap-committed Total heap committed by the process (bytes) gauge
gauge.jvm.mem.heap-used Total heap used (bytes) gauge
gauge.process.open_file_descriptors Number of currently open file descriptors gauge
gauge.thread_pool.active Number of active threads in the thread pool gauge
gauge.thread_pool.largest Highest number of active threads in the thread pool gauge
gauge.thread_pool.queue Number of tasks queued in the thread pool gauge
gauge.thread_pool.threads Number of threads in the thread pool gauge

counter.indices.get.total

counter

How many get requests have been serviced by this node since it started.

This metric tracks the number of get requests issued to this node since its startup. A get request retrieves a document by its identifier.

counter.indices.indexing.index-total

counter

How many index requests have been serviced by this node since it started.

This metric tracks the number of index requests to this node since its startup. This also includes index requests that originate from bulk operations.

counter.indices.merges.total

counter

The number of merges that happened on this node since it started.

Merges happen as changes accumulate in an index: refreshes create new segments, and because Elasticsearch's segments are immutable, merges group smaller segments together into bigger ones. This metric tracks merges across all indexes that exist on the node.

counter.indices.search.query-time

counter

How much time has been spent in search queries on this node since it started.

This metric indicates the cumulative time that the node has spent executing search requests since it started. The ratio between this metric and counter.indices.search.query-total can be used as a rough indicator of query efficiency: the larger the ratio, the more time each query is taking, and the more worthwhile tuning or optimization becomes.
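
As a hypothetical illustration: if query-time increases by 5,000 ms over a window in which query-total increases by 100, queries averaged roughly 50 ms each during that window.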

counter.indices.search.query-total

counter

How many search requests have been serviced by this node since it started.

This includes queries to both primary and replica shards. Queries run against an Elasticsearch cluster are executed on all the shards across all nodes in the cluster. This means that this metric doesn’t map 1:1 to the number of queries issued by the client.

counter.indices.total.indexing.index-total

counter

How many index requests have been serviced by the cluster.

This metric indicates the number of index requests serviced by the cluster. This can also be examined per index, by using the index dimension. This metric includes index requests that originate from bulk operations.

counter.indices.total.merges.total

counter

The number of merges that happened on this cluster.

Merges happen as changes accumulate in an index: refreshes create new segments, and because Elasticsearch's segments are immutable, merges group smaller segments together into bigger ones. This metric tracks the number of merges across the cluster. It can be examined per index using the index dimension.

counter.indices.total.search.query-total

counter

How many search requests have been serviced by the cluster.

This includes queries to both primary and replica shards across the entire cluster. Queries that run against an Elasticsearch cluster are executed on all the shards across all nodes in the cluster. Therefore, this metric doesn’t map 1:1 to the number of queries issued by the client.

counter.jvm.gc.time

counter

Total time in milliseconds spent in garbage collection since the node has started.

This counter will increase after garbage collections. If its rate spikes, it may indicate memory pressure on the system since the JVM is trying to free up memory by running bigger and more frequent garbage collections.

counter.thread_pool.bulk.rejected

counter

The number of bulk requests that have been rejected on this node since it started. Superseded by counter.thread_pool.rejected.

If the bulk request queue fills up to its limit, new work units will begin to be rejected, and you will see that reflected in this rejected metric. This is often a sign that your cluster is starting to bottleneck on some resources, since a full queue means your node/cluster is processing at maximum speed but unable to keep up with the influx of work.

counter.thread_pool.flush.rejected

counter

The number of flush requests that have been rejected on this node since it started. Superseded by counter.thread_pool.rejected.

If the flush request queue fills up to its limit, new work units will begin to be rejected, and you will see that reflected in this rejected metric. This is often a sign that your cluster is starting to bottleneck on some resources, since a full queue means your node/cluster is processing at maximum speed but unable to keep up with the influx of work.

counter.thread_pool.generic.rejected

counter

The number of generic requests that have been rejected on this node since it started. Superseded by counter.thread_pool.rejected.

If the generic request queue fills up to its limit, new work units will begin to be rejected, and you will see that reflected in this rejected metric. This is often a sign that your cluster is starting to bottleneck on some resources, since a full queue means your node/cluster is processing at maximum speed but unable to keep up with the influx of work.

counter.thread_pool.get.rejected

counter

The number of get requests that have been rejected on this node since it started. Superseded by counter.thread_pool.rejected.

If the get request queue fills up to its limit, new work units will begin to be rejected, and you will see that reflected in this rejected metric. This is often a sign that your cluster is starting to bottleneck on some resources, since a full queue means your node/cluster is processing at maximum speed but unable to keep up with the influx of work.

counter.thread_pool.index.rejected

counter

The number of index requests that have been rejected on this node since it started. Superseded by counter.thread_pool.rejected.

If the index request queue fills up to its limit, new work units will begin to be rejected, and you will see that reflected in this rejected metric. This is often a sign that your cluster is starting to bottleneck on some resources, since a full queue means your node/cluster is processing at maximum speed but unable to keep up with the influx of work.

counter.thread_pool.merge.rejected

counter

The number of merge requests that have been rejected on this node since it started. Superseded by counter.thread_pool.rejected.

If the merge request queue fills up to its limit, new work units will begin to be rejected, and you will see that reflected in this rejected metric. This is often a sign that your cluster is starting to bottleneck on some resources, since a full queue means your node/cluster is processing at maximum speed but unable to keep up with the influx of work.

counter.thread_pool.optimize.rejected

counter

The number of optimize requests that have been rejected on this node since it started. Superseded by counter.thread_pool.rejected.

If the optimize request queue fills up to its limit, new work units will begin to be rejected, and you will see that reflected in this rejected metric. This is often a sign that your cluster is starting to bottleneck on some resources, since a full queue means your node/cluster is processing at maximum speed but unable to keep up with the influx of work.

counter.thread_pool.refresh.rejected

counter

The number of refresh requests that have been rejected on this node since it started. Superseded by counter.thread_pool.rejected.

If the refresh request queue fills up to its limit, new work units will begin to be rejected, and you will see that reflected in this rejected metric. This is often a sign that your cluster is starting to bottleneck on some resources, since a full queue means your node/cluster is processing at maximum speed but unable to keep up with the influx of work.

counter.thread_pool.rejected

counter

The number of requests that have been rejected per thread pool on this node since it started.

If a thread pool's request queue fills up to its limit, new work units will begin to be rejected, and you will see that reflected in this metric. This is often a sign that your cluster is starting to bottleneck on some resources, since a full queue means your node/cluster is processing at maximum speed but unable to keep up with the influx of work.

This metric reports the number of rejected requests for each thread pool; the thread pool represented by a time series is indicated by the dimension thread_pool.

The following thread pools may be configured:

Each entry below gives the thread pool name, the description of its rejection count, and whether the pool is in ThreadPools by default.

bulk: The number of bulk requests that have been rejected on this node since it started. Default: false
force_merge: The number of force merges that have been rejected on this node since it started. Default: false
fetch_shard_started: The number of fetch_shard_started requests that have been rejected on this node since it started. Default: false
fetch_shard_store: The number of fetch_shard_store requests that have been rejected on this node since it started. Default: false
flush: The number of flush requests that have been rejected on this node since it started. Default: false
generic: The number of generic requests that have been rejected on this node since it started. Default: false
get: The number of get requests that have been rejected on this node since it started. Default: false
index: The number of index requests that have been rejected on this node since it started. Default: true
listener: The number of listener requests that have been rejected on this node since it started. Default: false
management: The number of management requests that have been rejected on this node since it started. Default: false
merge: The number of merge requests that have been rejected on this node since it started. Default: false
optimize: The number of optimize requests that have been rejected on this node since it started. Default: false
percolate: The number of percolate requests that have been rejected on this node since it started. Default: false
refresh: The number of refresh requests that have been rejected on this node since it started. Default: false
search: The number of search requests that have been rejected on this node since it started. Default: true
snapshot: The number of snapshot requests that have been rejected on this node since it started. Default: false
suggest: The number of suggest requests that have been rejected on this node since it started. Default: false
warmer: The number of warmer requests that have been rejected on this node since it started. Default: false

counter.thread_pool.search.rejected

counter

The number of search requests that have been rejected on this node since it started. Superseded by counter.thread_pool.rejected.

If the search request queue fills up to its limit, new work units will begin to be rejected, and you will see that reflected in this rejected metric. This is often a sign that your cluster is starting to bottleneck on some resources, since a full queue means your node/cluster is processing at maximum speed but unable to keep up with the influx of work.

counter.thread_pool.snapshot.rejected

counter

The number of snapshot requests that have been rejected on this node since it started. Superseded by counter.thread_pool.rejected.

If the snapshot request queue fills up to its limit, new work units will begin to be rejected, and you will see that reflected in this rejected metric. This is often a sign that your cluster is starting to bottleneck on some resources, since a full queue means your node/cluster is processing at maximum speed but unable to keep up with the influx of work.

gauge.cluster.active-primary-shards

gauge

How many primary shards are currently active across the cluster.

This is an aggregate total of all primary shards across all indexes.

gauge.cluster.active-shards

gauge

How many shards are currently active, including replica shards.

This is an aggregate total of all shards across all indexes, which includes replica shards.

gauge.cluster.initializing-shards

gauge

How many shards are currently being initialized.

This is a count of shards that are being freshly created. For example, when you first create an index, the shards all briefly reside in the initializing state. This is typically a transient event, and shards shouldn't linger in this state too long. You may also see initializing shards when a node is first restarted: as shards are loaded from disk, they start as initializing.

gauge.cluster.number-of-data_nodes

gauge

How many data nodes the cluster currently has.

This metric indicates the number of data nodes that are currently participating in the cluster. Each Elasticsearch node can either be allowed to store data locally or not. Storing data locally means that shards of different indexes can be allocated to that node. By default, each node is considered to be a data node. This can be turned off by setting node.data to false.

gauge.cluster.number-of-nodes

gauge

How many nodes are currently participating in the cluster.

This metric indicates the number of nodes currently in the cluster. This number may change as nodes leave or join the cluster.

gauge.cluster.relocating-shards

gauge

How many shards are currently being relocated.

This shows the number of shards that are currently moving from one node to another node. This number is often zero, but can increase when Elasticsearch decides a cluster is not properly balanced, a new node is added, or a node is taken down.

gauge.cluster.status

gauge

Whether the cluster is green, yellow or red.

This gauge can have three different values:

  • 0, meaning green: all primary and replica shards are allocated. Your cluster is 100% operational.
  • 1, meaning yellow: all primary shards are allocated, but at least one replica is missing. No data is missing, so search results will still be complete. However, your high availability is compromised to some degree. If more shards disappear, you might lose data. Think of yellow as a warning that should prompt investigation.
  • 2, meaning red: at least one primary shard (and all of its replicas) are missing. This means that you are missing data: searches will return partial results, and indexing into that shard will return an exception.

gauge.cluster.unassigned-shards

gauge

How many shards are currently unassigned.

These are shards that exist in the cluster state, but cannot be found in the cluster itself. A common source of unassigned shards is unassigned replicas. For example, an index with five shards and one replica will have five unassigned replicas in a single-node cluster. Unassigned shards will also be present if your cluster is red (since primaries are missing).

gauge.indices.cache.field.size

gauge

The size of the field data on this node in bytes.

The memory used by field data, which is used for aggregations, sorting, etc.

gauge.indices.cache.filter.size

gauge

The size of the filter cache on this node in bytes.

This metric indicates the amount of memory used by the cached filter bitsets.

gauge.indices.docs.count

gauge

How many documents are currently on this node.

This metric tracks the number of documents that are currently stored on this node. This includes primary and replica shards. This also includes documents that may actually be deleted but have not been cleaned up yet (through a merge).

gauge.indices.docs.deleted

gauge

How many documents are currently deleted on this node.

Deleted documents are not cleaned up until a merge happens, and Elasticsearch's merge policy may not favor cleaning up deleted documents promptly. A high ratio of deleted documents can impact search performance and cache sizes.

gauge.indices.merges.current

gauge

The number of merges that are currently being executed.

Merge statistics can be important if your cluster is write heavy. Merging consumes a large amount of disk I/O and CPU resources.

gauge.indices.segments.count

gauge

The number of segments that are active on this node.

This number doesn’t include the segments of indexes that are closed.

gauge.indices.total.docs.count

gauge

How many documents are currently in this cluster.

This metric tracks the number of documents that are currently stored in the cluster. This includes primary and replica shards. This also includes documents that may actually be deleted but have not been cleaned up yet (through a merge). This metric is also available per index.

gauge.indices.total.fielddata.memory-size

gauge

The size of the field data across the cluster in bytes.

The memory used by field data, which is used for aggregations, sorting, etc. This metric is also available per index.

gauge.indices.total.filter-cache.memory-size

gauge

The size of the filter cache across the cluster in bytes.

This metric indicates the amount of memory used by the cached filter bitsets across the cluster. It is also available per index.

gauge.jvm.mem.heap-committed

gauge

The size of the heap that has been committed and is actually allocated to the process.

This tracks the maximum amount of heap that can be used before the JVM tries to allocate more memory for the process (unless the configured maximum has already been reached).

gauge.jvm.mem.heap-used

gauge

The size of the used java heap in bytes.

This tracks how much heap the JVM is currently using. The heap usage may include objects that have not been garbage collected yet.

gauge.process.open_file_descriptors

gauge

The number of file descriptors used by the Elasticsearch process.

File descriptors are used for files as well as for network connections. Usually this metric is correlated with the number of segments that Elasticsearch is managing (gauge.indices.segments.count).

gauge.thread_pool.active

gauge

The number of active threads in the current thread pool.

gauge.thread_pool.largest

gauge

The highest number of active threads in the current thread pool.

gauge.thread_pool.queue

gauge

The number of tasks in the queue for the current thread pool.

gauge.thread_pool.threads

gauge

The number of threads in the current thread pool.