Kafka

Metadata associated with SignalFx’s integration with Kafka can be found here. The relevant code for the plugin can be found here.

DESCRIPTION

This is the Kafka plugin for collectd. It will send data about Kafka to SignalFx, enabling built-in Kafka monitoring dashboards.

FEATURES

Built-in dashboards

  • Broker: Focus on a single Kafka broker.
  • Brokers: Overview of data from all Kafka brokers. The cluster dimension, which is available when using the SignalFx Agent, can be used to get a per-cluster view of brokers.
  • Producer: Focus on a single Java-based Producer.
  • Producers: Overview of Java-based Producers.
  • Consumer: Focus on a single Java-based Consumer.
  • Consumers: Overview of Java-based Consumers.
  • JVM: Focus on the Java Virtual Machine performance of a single instance.

Note: Metrics from Java-based Kafka consumers and producers are available by default only when using the SignalFx Agent.

REQUIREMENTS AND DEPENDENCIES

Version information

Software | Version
collectd | 4.9 or later

INSTALLATION

If you are using the new Smart Agent, see the docs for the collectd/kafka monitor for more information. The configuration documentation below may be helpful as well, but consult the Smart Agent repo’s docs for the exact schema.
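For Smart Agent users, the monitor is configured in the agent’s YAML file rather than in collectd config files. Below is a minimal sketch; option names such as clusterName are recalled from the Smart Agent’s collectd/kafka monitor and should be verified against the Smart Agent docs, and the host, port, and cluster name are placeholders:

```yaml
monitors:
  - type: collectd/kafka
    host: 127.0.0.1
    port: 7099              # JMX port exposed by the Kafka broker (placeholder)
    clusterName: my-cluster # reported as the cluster dimension on broker metrics
```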

  1. RHEL/CentOS and Amazon Linux users: Install the Java plugin for collectd if it is not already installed.

  2. Download SignalFx’s example Kafka configuration file to /etc/collectd/managed_config: 20-kafka82.conf.
    Note: If you’re using a version of Kafka earlier than v0.8.2, download this sample Kafka configuration file instead: 20-kafka.conf.
  3. Modify your Kafka configuration file to provide values that make sense for your environment, as described in Configuration, below.

  4. Restart collectd.

CONFIGURATION

Using the example configuration file 20-kafka.conf as a guide, provide values for the configuration options listed below that make sense for your environment and allow you to connect to the Kafka instance to be monitored.

Value | Description
ServiceURL | URL of your JMX application.
Host | The name of your host. Appears as dimension host in SignalFx. Note: Please leave the identifier [hostHasService=kafka] in the host name.
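For orientation, both options live inside the GenericJMX Connection block of the example file. The fragment below is a hedged sketch, not the full example file: the JMX port and broker name are placeholders, and the Collect statements from the real 20-kafka.conf are elided.

```
<Plugin java>
  <Plugin "GenericJMX">
    <Connection>
      # Placeholders below; keep [hostHasService=kafka] in the Host value
      ServiceURL "service:jmx:rmi:///jndi/rmi://localhost:7099/jmxrmi"
      Host "kafka-broker-1[hostHasService=kafka]"
      # ... Collect statements from the example file go here ...
    </Connection>
  </Plugin>
</Plugin>
```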

USAGE

Sample of built-in dashboard in SignalFx:


METRICS

For a comprehensive list of metrics, other than the ones available by default, see here.

Note that the metrics prefixed with kafka.consumer and kafka.producer are available only via the kafka_consumer and kafka_producer monitors of the SignalFx Agent. Also, if you are using the SignalFx Agent, broker metrics will be reported with a user-provided cluster dimension.

Below is a list of all metrics.

Metric Name | Brief | Type
counter.kafka-all-bytes-in | Number of bytes received per second across all topics | cumulative counter
counter.kafka-all-bytes-out | Number of bytes transmitted per second across all topics | cumulative counter
counter.kafka-isr-expands | Increase in ISR of partitions | counter
counter.kafka-isr-shrinks | Decrease in ISR of partitions | counter
counter.kafka-leader-election-rate | Number of leader elections | counter
counter.kafka-log-flushes | Number of log flushes per second | cumulative counter
counter.kafka-messages-in | Number of messages received per second across all topics | cumulative counter
counter.kafka-unclean-elections | Number of unclean leader elections | counter
counter.kafka.fetch-consumer.total-time.count | Number of fetch requests from consumers per second across all partitions | cumulative counter
counter.kafka.fetch-follower.total-time.count | Number of fetch requests from followers per second across all partitions | cumulative counter
counter.kafka.produce.total-time.99th | 99th percentile of time in milliseconds to process produce requests | gauge
counter.kafka.produce.total-time.count | Number of producer requests | cumulative counter
counter.kafka.produce.total-time.median | Median time it takes to process a produce request | gauge
gauge.kafka-active-controllers | Specifies whether the broker is an active controller | gauge
gauge.kafka-log-flush-time-ms-p95 | 95th percentile of log flush time in milliseconds | gauge
gauge.kafka-log-flush-time-ms | Average number of milliseconds to flush a log | gauge
gauge.kafka-max-lag | Max lag in messages between the follower and leader replicas | gauge
gauge.kafka-partition-count | Number of partitions available in the cluster | gauge
gauge.kafka-request-queue | Number of requests in the request queue across all partitions on the broker | gauge
gauge.kafka-response-queue | Amount of time the request waits in the response queue | gauge
gauge.kafka-total-fetch-requests | Total fetch requests per second | gauge
gauge.kafka-total-produce-requests | Total produce requests per second | gauge
gauge.kafka-underreplicated-partitions | Number of underreplicated partitions across all topics on the broker | gauge
gauge.kafka.consumer.bytes-consumed-rate | Average number of bytes consumed per second | gauge
gauge.kafka.consumer.fetch-rate | Number of fetch requests per second | gauge
gauge.kafka.consumer.fetch-size-avg | Average number of bytes fetched per request | gauge
gauge.kafka.consumer.records-consumed-rate | Average number of records consumed per second | gauge
gauge.kafka.consumer.records-lag-max | Consumer lag in number of records | gauge
gauge.kafka.fetch-consumer.total-time.99th | 99th percentile of time in milliseconds to process fetch requests from consumers | gauge
gauge.kafka.fetch-consumer.total-time.median | Median time it takes to process a fetch request from consumers | gauge
gauge.kafka.fetch-follower.total-time.99th | 99th percentile of time in milliseconds to process fetch requests from followers | gauge
gauge.kafka.fetch-follower.total-time.median | Median time it takes to process a fetch request from followers | gauge
gauge.kafka.producer.byte-rate | Byte rate per topic | gauge
gauge.kafka.producer.compression-rate | Compression rate per topic | gauge
gauge.kafka.producer.io-wait-time-ns-avg | I/O wait time in ns | gauge
gauge.kafka.producer.outgoing-byte-rate | Average outgoing bytes rate | gauge
gauge.kafka.producer.record-error-rate | Record sends that resulted in error per topic | gauge
gauge.kafka.producer.record-retry-rate | Record retry rate per topic | gauge
gauge.kafka.producer.record-send-rate | Record send rate per topic | gauge
gauge.kafka.producer.request-latency-avg | Request latency in ms | gauge
gauge.kafka.producer.request-rate | Rate of produce requests sent | gauge
gauge.kafka.producer.response-rate | Rate of responses received | gauge
gauge.kafka.zk.request-latency | ZooKeeper client request latency | gauge

counter.kafka-all-bytes-in

cumulative counter

Number of bytes received per second across all topics.

Use this metric to find out how many bytes each Kafka broker is receiving.
The more bytes a broker receives, the more work it has to do to flush them to its logs.

If the value of this metric on one broker differs significantly from that of other brokers:

  • Kafka partitions may not be balanced properly across brokers. Check gauge.kafka-log-flush-time-ms-p95 to see if log latency is particularly high on the brokers processing more bytes.
  • If the partitions are balanced across brokers, some of the topics might have bigger messages than others.
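The imbalance check above can be sketched offline. Given per-broker counter.kafka-all-bytes-in rates (the numbers below are hypothetical), flag brokers whose rate deviates substantially from the fleet median:

```python
from statistics import median

def imbalanced_brokers(bytes_in_by_broker, tolerance=0.5):
    """Return brokers whose bytes-in rate deviates from the fleet
    median by more than `tolerance` (as a fraction of the median)."""
    med = median(bytes_in_by_broker.values())
    return sorted(
        broker for broker, rate in bytes_in_by_broker.items()
        if med > 0 and abs(rate - med) / med > tolerance
    )

# Hypothetical per-broker bytes-in rates (bytes/sec)
rates = {"broker-1": 10_000_000, "broker-2": 9_500_000, "broker-3": 21_000_000}
print(imbalanced_brokers(rates))  # ['broker-3'] — roughly 2x the median traffic
```

The 50% default tolerance is arbitrary; tune it to the normal spread of traffic in your cluster.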

counter.kafka-all-bytes-out

cumulative counter

Number of bytes transmitted per second across all topics.

Use this metric to find out how many bytes each Kafka broker is transmitting.
Bytes are transmitted to both consumers and to replicas.

This metric usually increases when:

  • New Kafka instances have come online and partitions are being synced to them.
  • New consumers have come online and are requesting more data.

counter.kafka-isr-expands

counter

When a broker is brought up after a failure, it starts catching up by reading from the leader. Once it is caught up, it gets added back to the ISR (in-sync-replicas).

counter.kafka-isr-shrinks

counter

When a broker goes down, the ISR (in-sync replicas) for some of the partitions will shrink. When that broker is up again, the ISR will expand once the replicas are fully caught up. Other than that, the expected value for both the ISR shrink rate and the expansion rate is 0.

counter.kafka-leader-election-rate

counter

Number of leader elections. Non-zero when there are broker failures.

counter.kafka-log-flushes

cumulative counter

Number of log flushes per second across all partitions on the broker.

Each Kafka partition has a log associated with it. Use this metric to find out how many times logs are flushed per second.

counter.kafka-messages-in

cumulative counter

Number of messages received per second across all topics.

Use this metric to find out how many messages each Kafka broker is receiving.

If the value of this metric on one broker differs significantly from other brokers:

  • This broker may have a disproportionately high number of Kafka topic partitions. Check gauge.kafka-log-flush-time-ms-p95 to see if log latency is particularly high on the brokers getting more messages.
  • Some topics or partitions might be receiving more traffic than others. Consider rebalancing the partitions across brokers so that all brokers have similar levels of log flush latency and CPU utilization.

counter.kafka-unclean-elections

counter

Number of unclean leader elections. This happens when a leader goes down and an out-of-sync replica is chosen to be the leader.

counter.kafka.fetch-consumer.total-time.count

cumulative counter

Number of fetch requests from consumers per second.

Use this value to check how many fetch requests per second each Kafka broker is receiving from consumers.

counter.kafka.fetch-follower.total-time.count

cumulative counter

Number of fetch requests from followers per second.

Use this value to check how many fetch requests per second each Kafka broker is receiving from followers.

counter.kafka.produce.total-time.99th

gauge

99th percentile of time in milliseconds to process produce requests from all producers to a broker.

counter.kafka.produce.total-time.count

cumulative counter

Number of producer requests across all partitions on a particular broker.

counter.kafka.produce.total-time.median

gauge

Median time it takes to process a produce request from all producers to that broker.

Use this value to check how long it takes to process produce requests across all partitions.

gauge.kafka-active-controllers

gauge

Set to 1 if the broker is an active controller, 0 otherwise.

For each independent Kafka cluster there should be a single broker which is the active controller at any time.
The sum of this metric across all brokers in any given cluster should be one.
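The sums-to-one invariant described above is easy to assert from raw gauge values. A small sketch with hypothetical per-broker readings:

```python
def controller_count(active_controller_by_broker):
    """Sum gauge.kafka-active-controllers across one cluster's brokers.
    A healthy cluster sums to exactly 1; 0 means no active controller,
    and >1 indicates a split-brain condition."""
    return sum(active_controller_by_broker.values())

readings = {"broker-1": 0, "broker-2": 1, "broker-3": 0}  # hypothetical values
assert controller_count(readings) == 1  # exactly one active controller
```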

gauge.kafka-log-flush-time-ms-p95

gauge

95th percentile of log flush time in milliseconds across all partitions on the broker.

Each Kafka partition has a log associated with it. Use this metric to find out how much time a log flush takes in 95% of cases.

If the value of this metric on one broker is higher than on the others:

  • Check if this broker is getting more traffic (messages or bytes) than other brokers.
  • If this broker is receiving a balanced amount of traffic, then the disks on that machine might have degraded. Check disk performance metrics on the broker and consider replacing it.

gauge.kafka-log-flush-time-ms

gauge

Average number of milliseconds to flush logs across all partitions on the broker.

Each Kafka partition has a log associated with it. Use this metric to find out the average time a log flush takes.

If the value of this metric on one broker is higher than on the others:

  • Check if this broker is getting more traffic (messages or bytes) than other brokers.
  • If this broker is receiving a balanced amount of traffic, then the disks on that machine might have degraded. Check disk performance metrics on the broker and consider replacing it.

gauge.kafka-max-lag

gauge

Max lag in messages between the follower and leader replicas.

gauge.kafka-partition-count

gauge

Number of partitions available in the cluster.

gauge.kafka-request-queue

gauge

Number of requests in the request queue across all partitions on the broker.

If this number is consistently high or growing in size, then the broker is unable to keep up with incoming requests.

  • The broker may be overloaded. Check the CPU and memory usage on the broker to see whether it has enough resources.
  • Check the metric gauge.kafka-log-flush-time-ms-p95 to find out if log flush time has increased, causing requests to take longer to process.

gauge.kafka-response-queue

gauge

Amount of time the request waits in the response queue.

gauge.kafka-total-fetch-requests

gauge

Total fetch requests per second.

gauge.kafka-total-produce-requests

gauge

Total produce requests per second.

gauge.kafka-underreplicated-partitions

gauge

Number of partitions that are under replicated, for which this broker is the leader.

Each topic has a configured number of brokers that its partitions should be replicated to. A non-zero value for this metric means that a broker is having trouble talking to other broker(s) for partition replication. This increases the risk of losing data that has been acknowledged as committed.

gauge.kafka.consumer.bytes-consumed-rate

gauge

Average number of bytes consumed per second. This metric has either the client-id dimension alone, or both the client-id and topic dimensions. The former is an aggregate across all topics of the latter.
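The aggregate/per-topic relationship can be sketched: summing the per-topic series for one client-id should reproduce that client’s topic-less aggregate series. The client and topic names below are hypothetical:

```python
def aggregate_across_topics(per_topic_rates):
    """Collapse {(client_id, topic): rate} into {client_id: rate},
    mirroring how the topic-less series aggregates the per-topic ones."""
    totals = {}
    for (client_id, _topic), rate in per_topic_rates.items():
        totals[client_id] = totals.get(client_id, 0.0) + rate
    return totals

per_topic = {("consumer-1", "orders"): 4096.0, ("consumer-1", "payments"): 1024.0}
print(aggregate_across_topics(per_topic))  # {'consumer-1': 5120.0}
```

The same relationship applies to the other consumer metrics that carry both dimension variants, such as fetch-size-avg and records-consumed-rate (for averages the aggregate is a weighted combination rather than a plain sum).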

gauge.kafka.consumer.fetch-rate

gauge

Number of fetch requests per second across all topics.

gauge.kafka.consumer.fetch-size-avg

gauge

Average number of bytes fetched per request. This metric has either the client-id dimension alone, or both the client-id and topic dimensions. The former is an aggregate across all topics of the latter.

gauge.kafka.consumer.records-consumed-rate

gauge

Average number of records consumed per second. This metric has either the client-id dimension alone, or both the client-id and topic dimensions. The former is an aggregate across all topics of the latter.

gauge.kafka.consumer.records-lag-max

gauge

Maximum lag in number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers.
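One simple way to spot the "increasing over time" pattern in gauge.kafka.consumer.records-lag-max is to check whether successive samples keep rising. A sketch over hypothetical samples:

```python
def lag_is_growing(samples, min_increases=3):
    """Return True if records-lag-max rose for at least `min_increases`
    consecutive sampling intervals, suggesting consumers are falling behind."""
    streak = 0
    for prev, cur in zip(samples, samples[1:]):
        streak = streak + 1 if cur > prev else 0
        if streak >= min_increases:
            return True
    return False

print(lag_is_growing([120, 150, 240, 400, 650]))  # True: lag rose 4 intervals in a row
print(lag_is_growing([120, 90, 130, 80, 110]))    # False: lag oscillates, not growing
```

In practice you would build the equivalent as a detector on the metric itself; this offline version just makes the decision rule explicit.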

gauge.kafka.fetch-consumer.total-time.99th

gauge

99th percentile of time in milliseconds it takes to process fetch requests from consumers.

gauge.kafka.fetch-consumer.total-time.median

gauge

Median time it takes to process a fetch request from consumers.

gauge.kafka.fetch-follower.total-time.99th

gauge

99th percentile of time in milliseconds to process fetch requests from followers.

gauge.kafka.fetch-follower.total-time.median

gauge

Median time it takes to process a fetch request from followers.

gauge.kafka.producer.byte-rate

gauge

Average number of bytes sent per second for a topic. This metric has client-id and topic dimensions.

gauge.kafka.producer.compression-rate

gauge

Average compression rate of record batches for a topic. This metric has client-id and topic dimensions.

gauge.kafka.producer.io-wait-time-ns-avg

gauge

Average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds. This metric has client-id dimension.

gauge.kafka.producer.outgoing-byte-rate

gauge

Average number of outgoing bytes sent per second to all servers. This metric has client-id dimension.

gauge.kafka.producer.record-error-rate

gauge

Average per-second number of record sends that resulted in errors for a topic. This metric has client-id and topic dimensions.

gauge.kafka.producer.record-retry-rate

gauge

Average per-second number of retried record sends for a topic. This metric has client-id and topic dimensions.

gauge.kafka.producer.record-send-rate

gauge

Average number of records sent per second for a topic. This metric has client-id and topic dimensions.

gauge.kafka.producer.request-latency-avg

gauge

Average request latency in ms: the average time it takes for the producer to receive a response from the broker after sending a request. This metric is related to gauge.kafka.producer.response-rate (the response rate). This metric has the client-id dimension.

gauge.kafka.producer.request-rate

gauge

Average number of requests sent to Broker per second. This metric has client-id dimension.

gauge.kafka.producer.response-rate

gauge

Average number of responses received from Broker per second. This metric has client-id dimension.

gauge.kafka.zk.request-latency

gauge

ZooKeeper client request latency (available from v1.x.x).