Kafka

DESCRIPTION

This integration primarily consists of the Smart Agent monitor collectd/kafka. Below is an overview of that monitor.

Smart Agent Monitor

Monitors a Kafka instance using collectd's GenericJMX plugin. See the collectd/genericjmx monitor for more information on how to configure custom MBeans, as well as information on troubleshooting JMX setup.

This monitor comes with a set of built-in MBeans, for which it pulls metrics from Kafka's JMX endpoint.

This monitor supports Kafka v0.8.2.x and above. For Kafka v1.x.x and above, in addition to the default metrics, kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs is worth monitoring because it shows how long brokers wait for requests to ZooKeeper to complete. Because ZooKeeper is an integral part of a Kafka cluster, monitoring it with the Zookeeper monitor is recommended. It is also a good idea to monitor disk utilization and network metrics of the underlying host.

INSTALLATION

This integration is part of the SignalFx Smart Agent as the collectd/kafka monitor. You should first deploy the Smart Agent to the same host as the service you want to monitor, and then continue with the configuration instructions below.

CONFIGURATION

To activate this monitor in the Smart Agent, add the following to your agent config:

monitors:  # All monitor config goes under this key
 - type: collectd/kafka
   ...  # Additional config

For a list of monitor options that are common to all monitors, see Common Configuration.

| Config option | Required | Type | Description |
| --- | --- | --- | --- |
| host | yes | string | Host to connect to. JMX must be configured for remote access and be accessible from the agent. |
| port | yes | integer | JMX connection port (NOT the RMI port) on the application. This corresponds to the com.sun.management.jmxremote.port Java property that should be set on the JVM when running the application. |
| name | no | string | |
| serviceName | no | string | This is how the service type is identified in the SignalFx UI so that you can get built-in content for it. For custom JMX integrations, it can be set to whatever you like, and metrics will get the special property sf_hostHasService set to this value. |
| serviceURL | no | string | The JMX connection string. This is rendered as a Go template and has access to the other values in this config. NOTE: under normal circumstances it is not advised to set this string directly; setting the host and port as specified above is preferred. (default: service:jmx:rmi:///jndi/rmi://{{.Host}}:{{.Port}}/jmxrmi) |
| instancePrefix | no | string | Prefixes the generated plugin instance with prefix. If a second instancePrefix is specified in a referenced MBean block, the prefix specified in the Connection block appears at the beginning of the plugin instance, and the prefix specified in the MBean block is appended to it. |
| username | no | string | Username to authenticate to the server |
| password | no | string | User password to authenticate to the server |
| customDimensions | no | map of strings | Takes in key-value pairs of custom dimensions at the connection level. |
| mBeansToCollect | no | list of strings | A list of the MBeans defined in mBeanDefinitions to actually collect. If not provided, all defined MBeans are collected. |
| mBeansToOmit | no | list of strings | A list of the MBeans to omit. This comes in handy when only a few MBeans need to be omitted from the default list. |
| mBeanDefinitions | no | map of objects (see below) | Specifies how to map JMX MBean values to metrics. Specific service monitors such as cassandra, kafka, or activemq come pre-loaded with a set of mappings, and any that you add in this option are merged with those. See collectd GenericJMX for more details. |
| clusterName | yes | string | Cluster name to which the broker belongs |
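
As a minimal sketch of how the required options above fit together, the fragment below configures the monitor for a single broker. The host name, port number, cluster name, and custom dimension are assumed values chosen for illustration; substitute the values from your own deployment.

```yaml
monitors:
 - type: collectd/kafka
   host: kafka-broker-1     # assumed broker hostname
   port: 7099               # assumed JMX port (com.sun.management.jmxremote.port)
   clusterName: my-kafka    # hypothetical cluster name
   # Optional: extra dimensions attached at the connection level.
   customDimensions:
     environment: staging   # hypothetical key and value
```

With these values, the default serviceURL template would render as service:jmx:rmi:///jndi/rmi://kafka-broker-1:7099/jmxrmi, so serviceURL itself does not need to be set.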

The nested mBeanDefinitions config object has the following fields:

| Config option | Required | Type | Description |
| --- | --- | --- | --- |
| objectName | no | string | Sets the pattern used to retrieve MBeans from the MBeanServer. If more than one MBean is returned, use the instanceFrom option to make the identifiers unique. |
| instancePrefix | no | string | Prefixes the generated plugin instance with prefix |
| instanceFrom | no | list of strings | The object names used by JMX to identify MBeans include so-called "properties", which are essentially key-value pairs. If the given object name is not unique and multiple MBeans are returned, the values of those properties usually differ. You can use this option to build the plugin instance from the appropriate property values. This option is optional and may be repeated to generate the plugin instance from multiple property values. |
| values | no | list of objects (see below) | The value blocks map one or more attributes of an MBean to a value list in collectd. There must be at least one value block within each MBean block. |
| dimensions | no | list of strings | |

The nested values config object has the following fields:

| Config option | Required | Type | Description |
| --- | --- | --- | --- |
| type | no | string | Sets the data set used within collectd to handle the values of the MBean attribute |
| table | no | bool | Set this to true if the returned attribute is a composite type. If set to true, the keys within the composite type are appended to the type instance. (default: false) |
| instancePrefix | no | string | Works like the option of the same name directly beneath the MBean block, but sets the type instance instead |
| instanceFrom | no | list of strings | Works like the option of the same name directly beneath the MBean block, but sets the type instance instead |
| attribute | no | string | Sets the name of the attribute from which to read the value. You can access the keys of composite types by using a dot to concatenate the key name to the attribute name, for example "attrib0.key42". If table is set to true, the path must point to a composite type; otherwise it must point to a numeric type. |
| attributes | no | list of strings | The plural form of the attribute config above. Used to derive multiple metrics from a single MBean. |
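
To illustrate how the mBeanDefinitions, values, and attribute options nest, here is a hedged sketch that adds the ZooKeeper request latency MBean mentioned in the description to the built-in set. The definition name zookeeper-request-latency, the instancePrefix value, and the connection values are hypothetical. The attribute name 99thPercentile is an assumption about what the broker exposes for this MBean, so check the MBean's attributes with a JMX browser before relying on it.

```yaml
monitors:
 - type: collectd/kafka
   host: kafka-broker-1     # assumed broker hostname
   port: 7099               # assumed JMX port
   clusterName: my-kafka    # hypothetical cluster name
   mBeanDefinitions:
     # Hypothetical definition name; merged with the built-in Kafka MBeans.
     zookeeper-request-latency:
       objectName: "kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs"
       values:
        - type: gauge
          # Hypothetical type-instance prefix for the resulting metric.
          instancePrefix: kafka.zookeeper-request-latency.99th
          # Assumed attribute name; verify against the broker's JMX endpoint.
          attribute: "99thPercentile"
```

Because mBeansToCollect is left unset here, the built-in MBeans and this additional definition would all be collected.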

USAGE

Sample of built-in dashboard in SignalFx:

[Image: dashboard_kafka.png]

METRICS

| Metric Name | Description | Type |
| --- | --- | --- |
| counter.kafka-bytes-in | Number of bytes received per second across all topics | cumulative |
| counter.kafka-bytes-out | Number of bytes transmitted per second across all topics | cumulative |
| counter.kafka-isr-expands | When a broker is brought up after a failure, it starts catching up by reading from the leader | cumulative |
| counter.kafka-isr-shrinks | When a broker goes down, ISR for some of the partitions will shrink | cumulative |
| counter.kafka-leader-election-rate | Number of leader elections | cumulative |
| counter.kafka-messages-in | Number of messages received per second across all topics | cumulative |
| counter.kafka-unclean-elections-rate | Number of unclean leader elections | cumulative |
| counter.kafka.fetch-consumer.total-time.count | Number of fetch requests from consumers per second across all partitions | cumulative |
| counter.kafka.fetch-follower.total-time.count | Number of fetch requests from followers per second across all partitions | cumulative |
| counter.kafka.logs.flush-time.count | Number of log flushes | cumulative |
| counter.kafka.produce.total-time.count | Number of producer requests | cumulative |
| gauge.jvm.threads.count | Number of JVM threads | gauge |
| gauge.kafka-active-controllers | Specifies whether the broker is an active controller | gauge |
| gauge.kafka-max-lag | Maximum lag in messages between the follower and leader replicas | gauge |
| gauge.kafka-offline-partitions-count | Number of partitions that don't have an active leader and are hence not writable or readable | gauge |
| gauge.kafka-request-queue | Number of requests in the request queue across all partitions on the broker | gauge |
| gauge.kafka-underreplicated-partitions | Number of underreplicated partitions across all topics on the broker | gauge |
| gauge.kafka.fetch-consumer.total-time.99th | 99th percentile of time in milliseconds to process fetch requests from consumers | gauge |
| gauge.kafka.fetch-consumer.total-time.median | Median time it takes to process a fetch request from consumers | gauge |
| gauge.kafka.fetch-follower.total-time.99th | 99th percentile of time in milliseconds to process fetch requests from followers | gauge |
| gauge.kafka.fetch-follower.total-time.median | Median time it takes to process a fetch request from followers | gauge |
| gauge.kafka.logs.flush-time.99th | 99th percentile of time in milliseconds to flush logs | gauge |
| gauge.kafka.logs.flush-time.median | Median time it takes to flush logs | gauge |
| gauge.kafka.produce.total-time.99th | 99th percentile of time in milliseconds to process produce requests | gauge |
| gauge.kafka.produce.total-time.median | Median time it takes to process a produce request | gauge |
| gauge.loaded_classes | Number of classes loaded in the JVM | gauge |
| invocations | Total number of garbage collection events | cumulative |
| jmx_memory.committed | Amount of memory guaranteed to be available in bytes | gauge |
| jmx_memory.init | Amount of initial memory at startup in bytes | gauge |
| jmx_memory.max | Maximum amount of memory that can be used in bytes | gauge |
| jmx_memory.used | Current memory usage in bytes | gauge |
| total_time_in_ms.collection_time | Amount of time spent garbage collecting in milliseconds | cumulative |

These are the metrics available for this monitor. Metrics that are categorized as container/host (default) are in bold and italics in the list below.

  • counter.kafka-bytes-in (cumulative)
    Number of bytes received per second across all topics
  • counter.kafka-bytes-out (cumulative)
    Number of bytes transmitted per second across all topics
  • counter.kafka-isr-expands (cumulative)
    When a broker is brought up after a failure, it starts catching up by reading from the leader. Once it is caught up, it gets added back to the ISR.
  • counter.kafka-isr-shrinks (cumulative)
    When a broker goes down, ISR for some of the partitions will shrink. When that broker is up again, ISR will be expanded once the replicas are fully caught up. Other than that, the expected value for both ISR shrink rate and expansion rate is 0.
  • counter.kafka-leader-election-rate (cumulative)
    Number of leader elections
  • counter.kafka-messages-in (cumulative)
    Number of messages received per second across all topics
  • counter.kafka-unclean-elections-rate (cumulative)
    Number of unclean leader elections. This happens when a leader goes down and an out-of-sync replica is chosen to be the leader
  • counter.kafka.fetch-consumer.total-time.count (cumulative)
    Number of fetch requests from consumers per second across all partitions
  • counter.kafka.fetch-follower.total-time.count (cumulative)
    Number of fetch requests from followers per second across all partitions
  • counter.kafka.logs.flush-time.count (cumulative)
    Number of log flushes
  • counter.kafka.produce.total-time.count (cumulative)
    Number of producer requests
  • gauge.kafka-active-controllers (gauge)
    Specifies whether the broker is an active controller
  • gauge.kafka-max-lag (gauge)
    Maximum lag in messages between the follower and leader replicas
  • gauge.kafka-offline-partitions-count (gauge)
    Number of partitions that don't have an active leader and are hence not writable or readable
  • gauge.kafka-request-queue (gauge)
    Number of requests in the request queue across all partitions on the broker
  • gauge.kafka-underreplicated-partitions (gauge)
    Number of underreplicated partitions across all topics on the broker
  • gauge.kafka.fetch-consumer.total-time.99th (gauge)
    99th percentile of time in milliseconds to process fetch requests from consumers
  • gauge.kafka.fetch-consumer.total-time.median (gauge)
    Median time it takes to process a fetch request from consumers
  • gauge.kafka.fetch-follower.total-time.99th (gauge)
    99th percentile of time in milliseconds to process fetch requests from followers
  • gauge.kafka.fetch-follower.total-time.median (gauge)
    Median time it takes to process a fetch request from followers
  • gauge.kafka.logs.flush-time.99th (gauge)
    99th percentile of time in milliseconds to flush logs
  • gauge.kafka.logs.flush-time.median (gauge)
    Median time it takes to flush logs
  • gauge.kafka.produce.total-time.99th (gauge)
    99th percentile of time in milliseconds to process produce requests
  • gauge.kafka.produce.total-time.median (gauge)
    Median time it takes to process a produce request

Group jvm

All of the following metrics are part of the jvm metric group. All of the non-default metrics below can be turned on by adding jvm to the monitor config option extraGroups:

  • gauge.jvm.threads.count (gauge)
    Number of JVM threads
  • gauge.loaded_classes (gauge)
    Number of classes loaded in the JVM
  • invocations (cumulative)
    Total number of garbage collection events
  • jmx_memory.committed (gauge)
    Amount of memory guaranteed to be available in bytes
  • jmx_memory.init (gauge)
    Amount of initial memory at startup in bytes
  • jmx_memory.max (gauge)
    Maximum amount of memory that can be used in bytes
  • jmx_memory.used (gauge)
    Current memory usage in bytes
  • total_time_in_ms.collection_time (cumulative)
    Amount of time spent garbage collecting in milliseconds
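
As a short sketch of the extraGroups option described above, the following fragment turns on the jvm group for this monitor. The connection values and cluster name are assumed for illustration.

```yaml
monitors:
 - type: collectd/kafka
   host: kafka-broker-1     # assumed broker hostname
   port: 7099               # assumed JMX port
   clusterName: my-kafka    # hypothetical cluster name
   # Enable every metric in the jvm group listed above.
   extraGroups:
    - jvm
```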

Non-default metrics (version 4.7.0+)

The following information applies to agent versions 4.7.0+ that have enableBuiltInFiltering: true set at the top level of the agent config.

To emit metrics that are not default, you can add those metrics in the generic monitor-level extraMetrics config option. Metrics that are derived from specific configuration options that do not appear in the above list of metrics do not need to be added to extraMetrics.
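
A minimal sketch of the extraMetrics option, assuming an agent at version 4.7.0 or later. The two metric names are taken from the jvm group above and are assumed here to be non-default; the connection values and cluster name are hypothetical.

```yaml
# Top-level agent option that enables the built-in filtering described above.
enableBuiltInFiltering: true

monitors:
 - type: collectd/kafka
   host: kafka-broker-1     # assumed broker hostname
   port: 7099               # assumed JMX port
   clusterName: my-kafka    # hypothetical cluster name
   # Individual metrics to emit in addition to the defaults.
   extraMetrics:
    - gauge.jvm.threads.count
    - jmx_memory.used
```

Listing individual metrics this way is the narrower alternative to enabling a whole group with extraGroups.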

To see a list of metrics that will be emitted you can run agent-status monitors after configuring this monitor in a running agent instance.

Legacy non-default metrics (version < 4.7.0)

The following information applies only to agent versions older than 4.7.0. If you have a newer agent and have set enableBuiltInFiltering: true at the top level of your agent config, see the section above. See upgrade instructions in Old-style whitelist filtering.

If you reference whitelist.json in your agent's top-level metricsToExclude config option and you want to emit metrics that are not in that whitelist, add an item to the top-level metricsToInclude config option to override the whitelist (see Inclusion filtering). Alternatively, copy whitelist.json, modify it, and reference the copy in metricsToExclude.
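
The override described above might look like the sketch below. The filter fields metricNames and monitorType are assumptions about the legacy filter syntax and are not confirmed by this page, so check them against the Inclusion filtering documentation for your agent version. The metric name is taken from the jvm group as an illustration.

```yaml
# Top-level agent config (legacy filtering, agent versions older than 4.7.0).
metricsToInclude:
 # Assumed filter shape: metric names plus the monitor type they apply to.
 - metricNames:
    - gauge.jvm.threads.count
   monitorType: collectd/kafka
```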