Cassandra

Metadata associated with SignalFx’s Cassandra integration with collectd can be found here. The relevant code for the plugin can be found here.

DESCRIPTION

Monitor Cassandra using SignalFx’s configuration of the Java plugin for collectd.

Use this integration to monitor the following types of information from Cassandra nodes:

  • read/write/range-slice requests
  • read/write/range-slice errors (timeouts and unavailable)
  • read/write/range-slice latency (median, 99th percentile, maximum)
  • compaction activity
  • hint activity

FEATURES

Built-in dashboards

  • Cassandra Nodes: Overview of data from all Cassandra nodes.

  • Cassandra Node: Focus on a single Cassandra node.

REQUIREMENTS AND DEPENDENCIES

Version information

Software                     Version
collectd                     4.9+
Java plugin for collectd     (match with collectd version)
Cassandra                    2.0.10+

INSTALLATION

If you are using the new Smart Agent, see the docs for the collectd/cassandra monitor for more information. The configuration documentation below may be helpful as well, but consult the Smart Agent repo’s docs for the exact schema.
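
For reference, a minimal Smart Agent monitor entry might look like the sketch below. The host and port values are assumptions for a node exposing JMX locally on the default port; the authoritative option names are in the Smart Agent repo’s docs.

    monitors:
      - type: collectd/cassandra
        host: 127.0.0.1   # JMX host of the Cassandra node (assumed local)
        port: 7199        # default Cassandra JMX port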

System modifications

Open the JMX port on your Cassandra app. Cassandra listens for JMX connections on port 7199 (port 8080 in versions earlier than 0.8.0-beta1). More information can be found at the Cassandra Project site. There is also a page covering a few common issues.
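
On most installations the JMX port is set in conf/cassandra-env.sh. The fragment below is a sketch of the relevant settings for a collectd agent running on the same host; variable names and defaults vary by Cassandra version, so treat it as illustrative only.

    # conf/cassandra-env.sh (illustrative; exact contents vary by Cassandra version)
    JMX_PORT="7199"
    # Recent versions bind JMX to localhost by default (LOCAL_JMX=yes); a collectd
    # agent on the same host can connect to localhost:7199 without changing this.
    LOCAL_JMX=yes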

Install Cassandra integration

  1. RHEL/CentOS and Amazon Linux users: Install the Java plugin for collectd (https://docs.signalfx.com/en/latest/integrations/integrations-reference/integrations.java.html) if it is not already installed.
  2. Download SignalFx’s example Cassandra configuration file to /etc/collectd/managed_config: 20-cassandra.conf
  3. Modify 20-cassandra.conf to provide values that make sense for your environment, as described in Configuration, below.
  4. Restart collectd.
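
On a typical Linux host, steps 2 and 4 might look like the commands below. This is a sketch: download 20-cassandra.conf from the link in step 2 first, and use whichever service manager your distribution provides.

    # Place the example configuration where collectd will read it
    sudo mkdir -p /etc/collectd/managed_config
    sudo cp 20-cassandra.conf /etc/collectd/managed_config/

    # Apply the change
    sudo systemctl restart collectd    # or: sudo service collectd restart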

CONFIGURATION

Using the example configuration file 20-cassandra.conf as a guide, provide values for the configuration options listed below that make sense for your environment and allow you to connect to the Cassandra instance to be monitored.

Configuration Option Description Default
ServiceURL URL of your JMX application. service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi
Host The name of your host. Appears as dimension host in SignalFx. Note: Please leave the identifier [hostHasService=cassandra] in the host name. testcassandraserver[hostHasService=cassandra]
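
In 20-cassandra.conf these options sit inside a Connection block of collectd’s GenericJMX plugin, which is loaded through the Java plugin. The fragment below sketches that shape; it is not the full shipped file, and the Collect line names an MBean block that is assumed to be defined elsewhere in the example configuration.

    <Plugin java>
      <Plugin "GenericJMX">
        <Connection>
          # JMX endpoint of the Cassandra node to monitor
          ServiceURL "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi"
          # Keep the [hostHasService=cassandra] identifier in the host name
          Host "testcassandraserver[hostHasService=cassandra]"
          # Each Collect statement references an MBean block defined in the file
          Collect "cassandra-client-read-latency"
        </Connection>
      </Plugin>
    </Plugin>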

USAGE

For sample charts built from these metrics, see the built-in dashboards (Cassandra Nodes and Cassandra Node) described under Features.

METRICS

Below is a list of all metrics.

Metric Name Brief Type
counter.cassandra.ClientRequest.RangeSlice.Latency.Count Count of range slice operations since server start cumulative_counter
counter.cassandra.ClientRequest.RangeSlice.Timeouts.Count Count of range slice timeouts since server start cumulative_counter
counter.cassandra.ClientRequest.RangeSlice.Unavailables.Count Count of range slice unavailables since server start cumulative_counter
counter.cassandra.ClientRequest.Read.Latency.Count Count of read operations since server start cumulative_counter
counter.cassandra.ClientRequest.Read.Timeouts.Count Count of read timeouts since server start cumulative_counter
counter.cassandra.ClientRequest.Read.Unavailables.Count Count of read unavailables since server start cumulative_counter
counter.cassandra.ClientRequest.Write.Latency.Count Count of write operations since server start cumulative_counter
counter.cassandra.ClientRequest.Write.Timeouts.Count Count of write timeouts since server start cumulative_counter
counter.cassandra.ClientRequest.Write.Unavailables.Count Count of write unavailables since server start cumulative_counter
counter.cassandra.Compaction.TotalCompactionsCompleted.Count Number of compaction operations since node start cumulative_counter
gauge.cassandra.ClientRequest.RangeSlice.Latency.50thPercentile 50th percentile (median) of Cassandra range slice latency gauge
gauge.cassandra.ClientRequest.RangeSlice.Latency.99thPercentile 99th percentile of Cassandra range slice latency gauge
gauge.cassandra.ClientRequest.RangeSlice.Latency.Max Maximum Cassandra range slice latency gauge
gauge.cassandra.ClientRequest.Read.Latency.50thPercentile 50th percentile (median) of Cassandra read latency gauge
gauge.cassandra.ClientRequest.Read.Latency.99thPercentile 99th percentile of Cassandra read latency gauge
gauge.cassandra.ClientRequest.Read.Latency.Max Maximum Cassandra read latency gauge
gauge.cassandra.ClientRequest.Write.Latency.50thPercentile 50th percentile (median) of Cassandra write latency gauge
gauge.cassandra.ClientRequest.Write.Latency.99thPercentile 99th percentile of Cassandra write latency gauge
gauge.cassandra.ClientRequest.Write.Latency.Max Maximum Cassandra write latency gauge
gauge.cassandra.Compaction.PendingTasks.Value Number of compaction operations waiting to run gauge
gauge.cassandra.Storage.Load.Count Storage used for Cassandra data in bytes gauge
gauge.cassandra.Storage.TotalHints.Count Total hints since node start gauge
gauge.cassandra.Storage.TotalHintsInProgress.Count Total pending hints gauge
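
Each metric is read from a Cassandra JMX MBean by an MBean block in 20-cassandra.conf. As a rough illustration of that mapping (the block name and instance prefix below are assumptions, not the exact contents of SignalFx’s file), a read-latency counter corresponds to a GenericJMX definition along these lines:

    # Illustrative only: roughly how counter.cassandra.ClientRequest.Read.Latency.Count
    # is derived from Cassandra's ClientRequest metrics MBean.
    <MBean "cassandra-client-read-latency">
      ObjectName "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency"
      <Value>
        Type "counter"
        Table false
        Attribute "Count"
        InstancePrefix "cassandra.ClientRequest.Read.Latency.Count"
      </Value>
    </MBean>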

counter.cassandra.ClientRequest.RangeSlice.Latency.Count

cumulative_counter

Count of range slice operations since server start

This metric indicates the range slice load of the server.

counter.cassandra.ClientRequest.RangeSlice.Timeouts.Count

cumulative_counter

Count of range slice timeouts since server start

This typically indicates a server overload condition.

If this value is increasing across the cluster then the cluster is too small for the application range slice load.

If this value is increasing for a single server in a cluster, then one of the following conditions may be true:

  • one or more clients are directing more load to this server than the others
  • the server is experiencing hardware or software issues and may require maintenance.

counter.cassandra.ClientRequest.RangeSlice.Unavailables.Count

cumulative_counter

Count of range slice unavailables since server start

A non-zero value means that insufficient replicas were available to fulfil a range slice request at the requested consistency level.

This typically means that one or more nodes are down. To fix this condition, any down nodes must be restarted, or removed from the cluster.

counter.cassandra.ClientRequest.Read.Latency.Count

cumulative_counter

Count of read operations since server start

This metric indicates the read load of the server.

counter.cassandra.ClientRequest.Read.Timeouts.Count

cumulative_counter

Count of read timeouts since server start

This typically indicates a server overload condition.

If this value is increasing across the cluster then the cluster is too small for the application read load.

If this value is increasing for a single server in a cluster, then one of the following conditions may be true:

  • one or more clients are directing more load to this server than the others
  • the server is experiencing hardware or software issues and may require maintenance.

counter.cassandra.ClientRequest.Read.Unavailables.Count

cumulative_counter

Count of read unavailables since server start

A non-zero value means that insufficient replicas were available to fulfil a read request at the requested consistency level.

This typically means that one or more nodes are down. To fix this condition, any down nodes must be restarted, or removed from the cluster.

counter.cassandra.ClientRequest.Write.Latency.Count

cumulative_counter

Count of write operations since server start

This metric indicates the write load of the server.

counter.cassandra.ClientRequest.Write.Timeouts.Count

cumulative_counter

Count of write timeouts since server start

This typically indicates a server overload condition.

If this value is increasing across the cluster then the cluster is too small for the application write load.

If this value is increasing for a single server in a cluster, then one of the following conditions may be true:

  • one or more clients are directing more load to this server than the others
  • the server is experiencing hardware or software issues and may require maintenance.

counter.cassandra.ClientRequest.Write.Unavailables.Count

cumulative_counter

Count of write unavailables since server start

A non-zero value means that insufficient replicas were available to fulfil a write request at the requested consistency level.

This typically means that one or more nodes are down. To fix this condition, any down nodes must be restarted, or removed from the cluster.

counter.cassandra.Compaction.TotalCompactionsCompleted.Count

cumulative_counter

Number of compaction operations since node start

If this value does not increase steadily over time then the node may be experiencing problems completing compaction operations.

gauge.cassandra.ClientRequest.RangeSlice.Latency.50thPercentile

gauge

50th percentile (median) of recent Cassandra range slice latency

This value should be similar across all nodes in the cluster. If some nodes have higher values than the rest of the cluster then they may have more connected clients or may be experiencing heavier than usual compaction load.

gauge.cassandra.ClientRequest.RangeSlice.Latency.99thPercentile

gauge

99th percentile of recent Cassandra range slice latency

This value should be similar across all nodes in the cluster. If some nodes have higher values than the rest of the cluster then they may have more connected clients or may be experiencing heavier than usual compaction load.

gauge.cassandra.ClientRequest.RangeSlice.Latency.Max

gauge

Maximum recent Cassandra range slice latency

gauge.cassandra.ClientRequest.Read.Latency.50thPercentile

gauge

50th percentile (median) of recent Cassandra read latency

This value should be similar across all nodes in the cluster. If some nodes have higher values than the rest of the cluster then they may have more connected clients or may be experiencing heavier than usual compaction load.

gauge.cassandra.ClientRequest.Read.Latency.99thPercentile

gauge

99th percentile of recent Cassandra read latency

This value should be similar across all nodes in the cluster. If some nodes have higher values than the rest of the cluster then they may have more connected clients or may be experiencing heavier than usual compaction load.

gauge.cassandra.ClientRequest.Read.Latency.Max

gauge

Maximum recent Cassandra read latency

gauge.cassandra.ClientRequest.Write.Latency.50thPercentile

gauge

50th percentile (median) of recent Cassandra write latency

This value should be similar across all nodes in the cluster. If some nodes have higher values than the rest of the cluster then they may have more connected clients or may be experiencing heavier than usual compaction load.

gauge.cassandra.ClientRequest.Write.Latency.99thPercentile

gauge

99th percentile of recent Cassandra write latency

This value should be similar across all nodes in the cluster. If some nodes have higher values than the rest of the cluster then they may have more connected clients or may be experiencing heavier than usual compaction load.

gauge.cassandra.ClientRequest.Write.Latency.Max

gauge

Maximum recent Cassandra write latency

gauge.cassandra.Compaction.PendingTasks.Value

gauge

Number of compaction operations waiting to run

If this value is continually increasing then the node may be experiencing problems completing compaction operations.

gauge.cassandra.Storage.Load.Count

gauge

Storage used for Cassandra data in bytes

Use this metric to see how much storage is being used for data by a Cassandra node.

The value of this metric is influenced by:

  • total data stored in the database
  • compaction behavior

gauge.cassandra.Storage.TotalHints.Count

gauge

Total hints since node start

Hints indicate that write operations could not be delivered to a node, usually because that node was down. If this value is increasing and all nodes are up then there may be some connectivity issue between nodes in the cluster.

gauge.cassandra.Storage.TotalHintsInProgress.Count

gauge

Total pending hints

Pending hints indicate that write operations cannot currently be delivered to a node, usually because that node is down. If this value is increasing and all nodes are up then there may be some connectivity issue between nodes in the cluster.