
Riak KV

Metadata associated with the Riak KV collectd Configuration can be found here. The relevant code for the cURL-JSON plugin can be found here.

DESCRIPTION

From the Basho site:

Riak KV is a distributed NoSQL database with a key/value design and advanced local and multi-cluster replication that guarantees reads and writes even in the event of hardware failures or network partitions.

This plugin captures the following metrics about Riak:

  • vNode Put and Get metrics
  • Read Repair counters
  • Search index and query times
  • Memory
  • CPU activity

The plugin also captures the following enterprise metrics:

  • Multi-Datacenter replication throughput
  • queue backlog

REQUIREMENTS AND DEPENDENCIES

This plugin requires:

Software          Version
collectd          4.9+
cURL-JSON plugin  (match with collectd version)
Riak KV           1.4.0+

INSTALLATION

This plugin is included with SignalFx’s collectd package.

  1. Download SignalFx’s sample configuration file for this plugin.

  2. Modify the sample configuration file as described in Configuration, below.

  3. Add the following line to /etc/collectd.conf, replacing the example path with the location of the configuration file:

    Include "/path/to/10-riak.conf"
    
  4. Restart collectd.

CONFIGURATION

Using the example configuration file 10-riak.conf as a guide, provide values for the configuration options listed below that make sense for your environment and allow you to connect to the Riak KV instance to be monitored.

Setting                      Value
Hostname                     riak1
Base directory for collectd  /var/lib/collectd
collectd .pid file           /var/run/collectd.pid
collectd plugin directory    /usr/local/lib/collectd
collectd types.db file       /usr/local/share/collectd/types.db
Riak stats URL               http://localhost:8098/stats
Riak Repl stats URL          http://localhost:8098/riak-repl/stats
Riak node/instance name      riak1@127.0.0.1

Note: Monitoring Riak Multi-Datacenter Replication

Replication is part of the Riak KV enterprise package. Unless this feature is enabled, all the metrics available at ../riak-repl/stats will be empty.

USAGE

Below are screen captures of dashboards created for this plugin by SignalFx, illustrating the metrics emitted by this plugin.

For general reference on how to monitor Riak performance, see Riak Stats and Monitoring.

Monitoring Riak Clusters

Throughput for Riak KV can be measured in a few different ways. PUTs and GETs are the most common types of request. These metrics can be gathered for nodes as well as vNodes.

image1

vNode PUTs and GETs showing the number of cluster operations per minute

Riak Search throughput can be measured separately in terms of number of documents indexed and queries performed. If using Search, it’s a good idea to keep an eye on these numbers since running queries can heavily tax system resources.

image2

Search throughput helps show where CPU, memory, and network usage might be coming from

Latency metrics help determine whether Riak is slowing down requests from applications. Riak should respond to PUTs and GETs very quickly, within single-digit milliseconds; otherwise applications start to suffer. If there is a serious issue with Riak, latency is most likely the first indicator that something is going wrong. Keeping an eye on the 95th and 99th percentile metrics is also useful for uncovering issues that the mean latency hides.

image3

Mean PUT times below 5 ms mean our applications are very happy
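
The latency check described above can be sketched in a few lines of Python. This is only an illustration, not part of the plugin: it assumes the stats have already been fetched into a dictionary, that FSM times are reported in microseconds (as stated in the metrics table), and that 5 ms (taken from the dashboard caption) is a reasonable alert threshold. The function name slow_latency_metrics and the percentile stat names (node_get_fsm_time_95 and similar) are assumptions for this sketch.

```python
# Stats to inspect. The mean stats appear in the metrics table of this page;
# the percentile names are assumed to follow the same naming pattern.
WATCHED_LATENCY_STATS = (
    "node_get_fsm_time_mean",
    "node_put_fsm_time_mean",
    "node_get_fsm_time_95",
    "node_get_fsm_time_99",
    "node_put_fsm_time_95",
    "node_put_fsm_time_99",
)


def slow_latency_metrics(stats, threshold_ms=5.0):
    """Return {stat name: latency in ms} for watched stats above threshold_ms.

    `stats` is a dict of stat name -> value in microseconds.
    Stats missing from `stats` are skipped rather than treated as errors.
    """
    slow = {}
    for name in WATCHED_LATENCY_STATS:
        value_us = stats.get(name)
        if value_us is None:
            continue
        value_ms = value_us / 1000.0  # microseconds -> milliseconds
        if value_ms > threshold_ms:
            slow[name] = value_ms
    return slow
```

Calling slow_latency_metrics with a stats dictionary returns only the entries that exceed the threshold, so an empty result means every watched latency is within budget.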

General Riak health can be determined using a few metrics available from the stats URL. Here we show read repairs, which indicate how healthy the data in the cluster is. A significant increase in repairs could indicate nodes going offline or missing vnode data.

image4

Consistent and low count of read repairs indicates a healthy cluster
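
One simple way to flag a "significant increase" in read repairs is to compare the latest sample with the average of the preceding ones. The sketch below is an assumption-laden illustration, not the plugin's behavior: the function name read_repairs_spiking, the input format (a list of per-interval read repair counts, oldest first), and the default factor of 3 are all hypothetical choices.

```python
def read_repairs_spiking(history, factor=3.0):
    """Return True if the newest sample exceeds `factor` times the earlier mean.

    `history` is a list of read repair counts, oldest first.
    With fewer than two samples there is no baseline, so nothing is flagged.
    """
    if len(history) < 2:
        return False
    *earlier, latest = history
    baseline = sum(earlier) / len(earlier)
    if baseline == 0:
        # No repairs seen before: any repair at all is an increase.
        return latest > 0
    return latest > factor * baseline
```

A fixed multiplier is crude; in practice the factor and window length would need tuning against the normal repair rate of the cluster being monitored.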

Riak Multi-Datacenter Replication Monitoring

Note: This section applies to Riak KV Enterprise Edition.

Some production workloads require replicating all data from one Riak cluster to another Riak cluster (often in another datacenter or Availability Zone). This is useful if one cluster is lost, or for other distributed workloads. Each node in each cluster can participate in replication. There is a local queue on each Riak node that should be monitored to ensure it is not filling up.

image5
image6

The queue is very low so replication is working

This is not an exhaustive list of metrics; there are dozens of metrics that are useful to keep an eye on regularly. In addition to this plugin, collectd can measure CPU, memory, disk IO, and network, all of which are relevant to maintaining a healthy cluster.

METRICS

Note: Discover all Available Riak Metrics

There are nearly 400 metrics that can be sent to SignalFx using the cURL-JSON plugin. The configurations here include some commonly used stats. The exhaustive list can be obtained by querying the stats URL of any Riak KV node with curl.

curl -X GET http://localhost:8098/stats

Optionally, use jsonpp to pretty-print the output into an easily readable list (the example installs it with Homebrew):

brew install jsonpp
curl -X GET http://localhost:8098/stats | jsonpp
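
The same listing can be produced without jsonpp. The Python sketch below is a minimal example under stated assumptions: the stats URL is the example value from the configuration table, the response is a flat JSON object, and only numeric top-level values are treated as candidate metrics. The function names fetch_stats and numeric_metric_names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_stats(url="http://localhost:8098/stats", timeout=5):
    """Fetch the stats document from a Riak KV node and decode the JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def numeric_metric_names(stats):
    """Return the sorted names of top-level stats whose values are numbers."""
    return sorted(
        name
        for name, value in stats.items()
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    )
```

For example, printing each name returned by numeric_metric_names(fetch_stats()) gives a one-per-line list of stats that could be added as keys in the cURL-JSON configuration.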

The metrics documented for this plugin are listed below.

Metric Name                   Type   Brief
gauge.node_get_fsm_time_mean  gauge  Time between reception of client read request and subsequent response to client in microseconds
gauge.node_gets               gauge  Reads coordinated by this node in the last minute
gauge.node_put_fsm_time_mean  gauge  Time between reception of client write request and subsequent response to client in microseconds
gauge.node_puts               gauge  Writes coordinated by this node in the last minute

gauge.node_get_fsm_time_mean

gauge

Time between reception of client read request and subsequent response to client in microseconds

gauge.node_gets

gauge

Reads coordinated by this node in the last minute

gauge.node_put_fsm_time_mean

gauge

Time between reception of client write request and subsequent response to client in microseconds

gauge.node_puts

gauge

Writes coordinated by this node in the last minute