Mesos

DESCRIPTION

Use the Mesos plugin for collectd to monitor the following information about Mesos:

  • Cluster status: number of activated slaves, schedulers and tasks
  • CPU, disk and memory usage for Mesos
  • Tasks finished, lost, and failed

FEATURES

Built-in dashboards

  • Mesos Clusters: Overview of data from all Mesos clusters.

image1

  • Mesos Cluster: Focus on a single Mesos cluster.

image2

  • Mesos Master: Focus further on a single Mesos master.

image3

  • Mesos Slave: Focus further on a single Mesos slave.

image4

REQUIREMENTS AND DEPENDENCIES

This plugin requires:

  • collectd 4.9+
  • Python plugin for collectd (included with SignalFx collectd agent)
  • Python 2.3+ (2.7.5+ for DC/OS strict mode)
  • Mesos 0.19.0 or greater

INSTALLATION

  1. Download the three Python modules for Mesos from the following URL:
    https://github.com/signalfx/collectd-mesos. Place them in a convenient location (for example, /usr/share/collectd/mesos-collectd-plugin).
  2. Download SignalFx’s sample configuration files for a Mesos master or a Mesos slave to /etc/collectd/managed_config.

  3. Modify the configuration file to contain values that make sense for your environment, as described below.

  4. Restart collectd.

  5. OPTIONAL: Follow this step only if the Mesos cluster being monitored runs under a DC/OS cluster operating in strict mode.

    • Make a new user on DC/OS.
    • Give the new user the following permission strings:
      • dcos:mesos:agent:endpoint:path:/metrics/snapshot read
      • dcos:mesos:master:endpoint:path:/metrics/snapshot read
    • Configure the plugin with the required options. See below.

Note: The /system/health/v1 endpoint on port 1050 for DC/OS is not available if operating in strict mode.
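
The strict-mode steps above imply a two-stage exchange: the plugin logs in at the authentication endpoint, then presents the resulting token when it reads /metrics/snapshot. The following is a minimal sketch of that exchange written for Python 3, not the plugin's actual code. The function names are hypothetical, and the JSON body shape (uid, password) and the "token=" Authorization header are assumptions about the DC/OS API that this guide does not confirm. The requests are only built, never sent.

```python
import json
import urllib.request

# Default authentication endpoint listed in the CONFIGURATION section.
DCOS_URL = "https://leader.mesos/acs/api/v1/auth/login"


def build_login_request(username, password, dcos_url=DCOS_URL):
    """Build (but do not send) the POST that exchanges credentials for a token.

    Assumption: the endpoint accepts a JSON body with "uid" and "password".
    """
    body = json.dumps({"uid": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        dcos_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_metrics_request(scheme, host, port, token):
    """Build (but do not send) an authenticated GET for /metrics/snapshot.

    Assumption: the token is passed as "Authorization: token=<token>".
    """
    url = "{}://{}:{}/metrics/snapshot".format(scheme, host, port)
    return urllib.request.Request(url, headers={"Authorization": "token=" + token})
```

Passing either request to urllib.request.urlopen would perform the call. If the login is rejected, check the user's permission strings first.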

CONFIGURATION

Using the example configuration files 10-mesos-master.conf or 10-mesos-slave.conf as a guide, provide values for the configuration options listed below that make sense for your environment and allow you to connect to the Mesos instance to be monitored.

| Configuration option | Definition | Default value |
|---|---|---|
| ModulePath | Path on disk where collectd can find the Mesos Python modules. | “/usr/share/collectd/mesos-collectd-plugin” |
| Cluster | The name of the cluster to which the Mesos instance belongs. Appears in the dimension cluster. | “cluster-0” |
| Instance | The name of this Mesos master/slave instance. Appears in the dimension plugin_instance. | “master-0” / “slave-0” |
| Path | The location of the mesos-master/mesos-slave binary. | “/usr/sbin” |
| scheme | Scheme the plugin uses to fetch metrics. It is either “http” or “https”. | “http” |
| Host | The hostname or IP address of the Mesos instance to be monitored. | “%%%MASTER_IP%%%” |
| Port | The port on which the Mesos instance is listening for connections. | %%%MASTER_PORT%%% |
| Verbose | Enable verbose logging from this plugin to collectd’s log file. | false |
| IncludeSystemHealth | Enable the sending of DC/OS System Service Health Metrics (this option is only applicable for a DC/OS master). | false |
| ca_file_path | Path to the CA file required for server verification. If not provided, verification is skipped (this option is only applicable if SSL is enabled). | “path/to/file” |
| dcos_sfx_username | New DC/OS username created for the plugin (this option is only applicable for DC/OS in strict mode). | sfx-collectd |
| dcos_sfx_password | Password of the above username (this option is only applicable for DC/OS in strict mode). | signalfx |
| dcos_url | The DC/OS authentication endpoint (this option is optional and is only applicable for DC/OS in strict mode). | https://leader.mesos/acs/api/v1/auth/login |

Note: (Applicable only when operating DC/OS in strict mode) The default dcos_url relies on the leader.mesos hostname provided by DC/OS. If that hostname does not resolve in your environment, set dcos_url explicitly, as in the example below.

Below is an example configuration:

<LoadPlugin "python">
  Globals true
</LoadPlugin>

<Plugin "python">
  ModulePath "/opt/collectd-mesos"

  Import "mesos-master"

  <Module "mesos-master">
    Cluster "cluster-0"
    Instance "master-0"
    Path "/usr/sbin"
    scheme "https"
    Host "10.0.142.190"
    Port 5050
    Verbose false
    IncludeSystemHealth false
    dcos_sfx_username "test-collectd"
    dcos_sfx_password "1234"
    # Here https://sfx-dco-elasticl-qyuyl8k0dc99-1879689557.us-west-2.elb.amazonaws.com is the
    # base URL of the DC/OS UI, and /acs/api/v1/auth/login is the authentication endpoint the
    # plugin uses to obtain a token for subsequent requests. The default dcos_url is
    # https://leader.mesos/acs/api/v1/auth/login
    dcos_url "https://sfx-dco-elasticl-qyuyl8k0dc99-1879689557.us-west-2.elb.amazonaws.com/acs/api/v1/auth/login"
  </Module>
</Plugin>
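
Before restarting collectd, it can help to confirm that the scheme, Host, Port and ca_file_path values actually reach the Mesos instance. The following is a minimal sketch for Python 3 under stated assumptions: the function names are hypothetical, the /metrics/snapshot path is taken from the permission strings in the INSTALLATION section, and the certificate handling only mirrors the behavior described for ca_file_path (verify against the CA file when given, otherwise skip verification). It is not the plugin's own code.

```python
import json
import ssl
import urllib.request


def make_ssl_context(ca_file_path=None):
    """Verify the server against ca_file_path if given; otherwise skip verification."""
    if ca_file_path:
        return ssl.create_default_context(cafile=ca_file_path)
    context = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be set to CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch_snapshot(scheme, host, port, ca_file_path=None, timeout=5):
    """Fetch and decode /metrics/snapshot from a Mesos master or slave."""
    url = "{}://{}:{}/metrics/snapshot".format(scheme, host, port)
    context = make_ssl_context(ca_file_path) if scheme == "https" else None
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, fetch_snapshot("http", "10.0.142.190", 5050) would return the master's metrics as a dictionary if the instance is reachable without authentication. A DC/OS cluster in strict mode additionally requires a token.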

USAGE

Below are screen captures of dashboards created for this plugin by SignalFx, illustrating the metrics emitted by this plugin.

Monitoring Mesos clusters

image5

It’s important to keep track of the status of tasks in the cluster. An increase in failed tasks for a master or slave can indicate a problem with a framework.

image6

It can be important to analyze performance per Mesos host. An increase in failed tasks for many masters and slaves on a single host may indicate a hardware problem.
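
The per-host reasoning above can be sketched as a small grouping step. This is an illustrative sketch only: the function name, the input tuple layout and the host and instance names are hypothetical, and it assumes you have already computed the increase in failed tasks for each Mesos instance over some window.

```python
from collections import defaultdict


def hosts_with_widespread_failures(increases, min_instances=2):
    """Return hosts where at least `min_instances` Mesos instances saw new failed tasks.

    `increases` is a list of (host, plugin_instance, failed_task_increase) tuples.
    """
    affected = defaultdict(set)
    for host, instance, increase in increases:
        if increase > 0:
            affected[host].add(instance)
    return sorted(h for h, inst in affected.items() if len(inst) >= min_instances)
```

A host returned here has failures spread across several instances, which points toward the machine rather than a single framework.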

image7

Track week-over-week growth of tasks in your cluster to be informed of changing workloads.
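
Week-over-week growth is a simple ratio of two readings taken one week apart. The following is a minimal sketch, assuming you compare a task count such as a weekly average of gauge.master_tasks_running. The function name and the choice of metric are illustrative, not prescribed by the plugin.

```python
def week_over_week_growth(last_week, this_week):
    """Return fractional growth (0.25 means a 25% increase), or None without a baseline."""
    if last_week == 0:
        return None
    return (this_week - last_week) / last_week
```

For instance, 400 tasks last week and 500 this week gives a growth of 0.25.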

Monitoring Mesos masters and slaves

image8

An unexpectedly low number of connected slaves on a Mesos master can indicate a network problem preventing them from connecting. To verify this, check whether there’s an unexpectedly high number of dropped messages in counter.master_dropped_messages.
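
The check described above combines two signals: fewer connected slaves than expected, and a rise in dropped messages. The following is a minimal sketch of that decision, assuming you supply the expected slave count yourself and take two readings of counter.master_dropped_messages. The function name and the returned labels are hypothetical.

```python
def diagnose_low_slaves(expected_slaves, connected_slaves, dropped_before, dropped_after):
    """Classify a master's state from connected slaves and dropped-message readings."""
    missing = expected_slaves - connected_slaves
    new_drops = dropped_after - dropped_before
    if missing <= 0:
        return "ok"
    if new_drops > 0:
        # Slaves are missing and messages are being dropped at the same time.
        return "possible network problem"
    return "slaves missing, no dropped messages"
```

When slaves are missing but no messages were dropped, the cause is more likely on the slave side (for example, stopped processes) than in the network.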

image9

On the Mesos master dashboard, you can view in detail the number of tasks that are finished, failed, lost or errored out. Monitoring connected and active frameworks can help you determine the health of your Mesos scheduler.

For additional information on how to monitor Mesos, see Apache’s Mesos monitoring guide.

METRICS

Below is a list of all metrics.

| Metric Name | Brief | Type |
|---|---|---|
| bytes.system_mem_free_bytes | Free memory in bytes | gauge |
| bytes.system_mem_total_bytes | Total memory available in bytes | gauge |
| counter.datapoints | Count of datapoints produced by this collectd. | cumulative counter |
| counter.master_dropped_messages | Number of dropped messages | counter |
| counter.master_invalid_framework_to_executor_messages | Number of invalid framework to executor messages | counter |
| counter.master_invalid_status_update_acknowledgements | Number of invalid status update acknowledgements | counter |
| counter.master_invalid_status_updates | Number of invalid status updates | counter |
| counter.master_messages_authenticate | Number of authentication messages | counter |
| counter.master_messages_deactivate_framework | Number of framework deactivation messages | counter |
| counter.master_messages_decline_offers | Number of offers declined | counter |
| counter.master_messages_exited_executor | Number of terminated executor messages | counter |
| counter.master_messages_framework_to_executor | Number of messages from a framework to an executor | counter |
| counter.master_messages_kill_task | Number of kill task messages | counter |
| counter.master_messages_launch_tasks | Number of launch task messages | counter |
| counter.master_messages_reconcile_tasks | Number of reconcile task messages | counter |
| counter.master_messages_register_framework | Number of framework registration messages | counter |
| counter.master_messages_register_slave | Number of slave registration messages | counter |
| counter.master_messages_reregister_framework | Number of framework re-registration messages | counter |
| counter.master_messages_reregister_slave | Number of slave re-registration messages | counter |
| counter.master_messages_resource_request | Number of resource request messages | counter |
| counter.master_messages_revive_offers | Number of offer revival messages | counter |
| counter.master_messages_status_update | Number of status update messages | counter |
| counter.master_messages_status_update_acknowledgement | Number of status update acknowledgement messages | counter |
| counter.master_messages_unregister_framework | Number of framework unregistration messages | counter |
| counter.master_messages_unregister_slave | Number of slave unregistration messages | counter |
| counter.master_recovery_slave_removals | Number of slaves not re-registered during master failover | counter |
| counter.master_slave_registrations | Number of slaves that were able to cleanly re-join the cluster and connect back to the master after the master is disconnected | counter |
| counter.master_slave_removals | Number of slaves removed for various reasons, including maintenance | counter |
| counter.master_slave_reregistrations | Number of slave re-registrations | counter |
| counter.master_slave_shutdowns_canceled | Number of cancelled slave shutdowns | counter |
| counter.master_slave_shutdowns_scheduled | Number of slaves which have failed their health check and are scheduled to be removed | counter |
| counter.master_tasks_error | Number of tasks that were invalid | counter |
| counter.master_tasks_failed | Number of failed tasks | counter |
| counter.master_tasks_finished | Number of finished tasks | counter |
| counter.master_tasks_killed | Number of killed tasks | counter |
| counter.master_tasks_lost | Number of lost tasks | counter |
| counter.master_valid_framework_to_executor_messages | Number of valid framework to executor messages | counter |
| counter.master_valid_status_update_acknowledgements | Number of valid status update acknowledgement messages | counter |
| counter.master_valid_status_updates | Number of valid status update messages | counter |
| counter.notifications | Count of notifications produced by this collectd. | cumulative counter |
| counter.slave_executors_terminated | Number of terminated executors | counter |
| counter.slave_invalid_framework_messages | Number of invalid framework messages | counter |
| counter.slave_invalid_status_updates | Number of invalid status updates | counter |
| counter.slave_recovery_errors | Number of errors encountered during slave recovery | gauge |
| counter.slave_tasks_failed | Number of failed tasks | counter |
| counter.slave_tasks_finished | Number of finished tasks | counter |
| counter.slave_tasks_killed | Number of killed tasks | counter |
| counter.slave_tasks_lost | Number of lost tasks | counter |
| counter.slave_valid_framework_messages | Number of valid framework messages | counter |
| counter.slave_valid_status_updates | Number of valid status updates | counter |
| gauge.master_cpus_total | Number of CPUs available | gauge |
| gauge.master_cpus_used | Number of allocated (used) CPUs | gauge |
| gauge.master_disk_total | Disk space available in MB | gauge |
| gauge.master_disk_used | Allocated (used) disk space in MB | gauge |
| gauge.master_elected | Whether this is the elected master | gauge |
| gauge.master_event_queue_dispatches | Number of dispatches in the event queue | gauge |
| gauge.master_event_queue_http_requests | Number of HTTP requests in the event queue | gauge |
| gauge.master_event_queue_messages | Number of messages in the event queue | gauge |
| gauge.master_frameworks_active | Number of active frameworks | gauge |
| gauge.master_frameworks_connected | Number of connected frameworks | gauge |
| gauge.master_frameworks_disconnected | Number of disconnected frameworks | gauge |
| gauge.master_frameworks_inactive | Number of inactive frameworks | gauge |
| gauge.master_mem_total | Memory available in MB | gauge |
| gauge.master_mem_used | Allocated (used) memory in MB | gauge |
| gauge.master_outstanding_offers | Number of outstanding resource offers | gauge |
| gauge.master_slaves_active | Number of active slaves | gauge |
| gauge.master_slaves_connected | Number of connected slaves | gauge |
| gauge.master_slaves_disconnected | Number of disconnected slaves | gauge |
| gauge.master_slaves_inactive | Number of inactive slaves | gauge |
| gauge.master_tasks_running | Number of running tasks | gauge |
| gauge.master_tasks_staging | Number of staging tasks | gauge |
| gauge.master_tasks_starting | Number of starting tasks | gauge |
| gauge.master_uptime_secs | Uptime in seconds | gauge |
| gauge.registrar_queued_operations | Number of queued operations in registry | gauge |
| gauge.registrar_registry_size_bytes | Registry size in bytes | gauge |
| gauge.registrar_state_fetch_ms | Registry read latency in ms | gauge |
| gauge.registrar_state_store_ms | Registry write latency in ms | gauge |
| gauge.registrar_state_store_ms_count | Registry write count | gauge |
| gauge.registrar_state_store_ms_max | Maximum registry write latency in ms | gauge |
| gauge.registrar_state_store_ms_min | Minimum registry write latency in ms | gauge |
| gauge.registrar_state_store_ms_p50 | Median registry write latency in ms | gauge |
| gauge.registrar_state_store_ms_p90 | 90th percentile registry write latency in ms | gauge |
| gauge.registrar_state_store_ms_p95 | 95th percentile registry write latency in ms | gauge |
| gauge.registrar_state_store_ms_p99 | 99th percentile registry write latency in ms | gauge |
| gauge.registrar_state_store_ms_p999 | 99.9th percentile registry write latency in ms | gauge |
| gauge.registrar_state_store_ms_p9999 | 99.99th percentile registry write latency in ms | gauge |
| gauge.sine | A sine wave | gauge |
| gauge.slave_cpus_total | Number of CPUs available | gauge |
| gauge.slave_cpus_used | Number of allocated (used) CPUs | gauge |
| gauge.slave_disk_total | Disk space available in MB | gauge |
| gauge.slave_disk_used | Allocated (used) disk space in MB | gauge |
| gauge.slave_executors_registering | Number of executors registering | gauge |
| gauge.slave_executors_running | Number of executors running | gauge |
| gauge.slave_executors_terminating | Number of terminating executors | gauge |
| gauge.slave_frameworks_active | Number of active frameworks | gauge |
| gauge.slave_mem_total | Memory available in MB | gauge |
| gauge.slave_mem_used | Allocated (used) memory in MB | gauge |
| gauge.slave_registered | Whether this slave is registered with a master | gauge |
| gauge.slave_tasks_running | Number of running tasks | gauge |
| gauge.slave_tasks_staging | Number of staging tasks | gauge |
| gauge.slave_tasks_starting | Number of starting tasks | gauge |
| gauge.slave_uptime_secs | Uptime in seconds | gauge |
| gauge.system_cpus_total | Number of CPUs available | gauge |
| gauge.system_load_15min | Load average for the past 15 minutes | gauge |
| gauge.system_load_1min | Load average for the past minute | gauge |
| gauge.system_load_5min | Load average for the past 5 minutes | gauge |
| percent.master_cpus_percent | Percentage of allocated (used) CPUs | gauge |
| percent.master_disk_percent | Percentage of allocated (used) disk space | gauge |
| percent.master_mem_percent | Percentage of allocated (used) memory | gauge |
| percent.slave_cpus_percent | Percentage of allocated (used) CPUs | gauge |
| percent.slave_disk_percent | Percentage of allocated (used) disk space | gauge |
| percent.slave_mem_percent | Percentage of allocated (used) memory | gauge |
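
The metric names listed above follow a visible pattern: a type prefix (bytes, counter, gauge or percent), then a component (master, slave, system or registrar), then the measurement. The following is a minimal sketch that exploits this pattern. Both function names are hypothetical, and the mapping from slash-separated Mesos keys (such as master/slave_shutdowns_completed) is inferred from the names alone, so the plugin's actual translation may differ.

```python
def to_metric_name(mesos_key, prefix):
    """Turn a slash-separated Mesos key into a dotted, underscore-joined metric name."""
    return "{}.{}".format(prefix, mesos_key.replace("/", "_"))


def component_of(metric_name):
    """Return the first name segment after the type prefix, e.g. 'master' or 'slave'."""
    _, _, rest = metric_name.partition(".")
    return rest.split("_", 1)[0]
```

Grouping by component_of is a quick way to separate master, slave, system and registrar metrics when building charts. Metrics that do not come from Mesos, such as counter.datapoints and gauge.sine, simply return their own name.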

bytes.system_mem_free_bytes

gauge

Free memory in bytes, on this system.

bytes.system_mem_total_bytes

gauge

Total memory available in bytes, on this system.

counter.datapoints

cumulative counter

Count of datapoints.

Each time any plugin (including this one) emits a datapoint, it is counted, and the count seen is reported on every reporting interval.

counter.master_dropped_messages

counter

Number of dropped messages, on this master.

counter.master_invalid_framework_to_executor_messages

counter

Number of invalid framework to executor messages, on this master.

counter.master_invalid_status_update_acknowledgements

counter

Number of invalid status update acknowledgements, on this master.

counter.master_invalid_status_updates

counter

Number of invalid status updates, on this master.

counter.master_messages_authenticate

counter

Number of authentication messages, on this master.

counter.master_messages_deactivate_framework

counter

Number of framework deactivation messages, on this master.

counter.master_messages_decline_offers

counter

Number of offers declined, on this master.

counter.master_messages_exited_executor

counter

Number of terminated executor messages, on this master.

counter.master_messages_framework_to_executor

counter

Number of messages from a framework to an executor, on this master.

counter.master_messages_kill_task

counter

Number of kill task messages, on this master.

counter.master_messages_launch_tasks

counter

Number of launch task messages, on this master.

counter.master_messages_reconcile_tasks

counter

Number of reconcile task messages, on this master.

counter.master_messages_register_framework

counter

Number of framework registration messages, on this master.

counter.master_messages_register_slave

counter

Number of slave registration messages, on this master.

counter.master_messages_reregister_framework

counter

Number of framework re-registration messages, on this master.

counter.master_messages_reregister_slave

counter

Number of slave re-registration messages, on this master.

counter.master_messages_resource_request

counter

Number of resource request messages, on this master.

counter.master_messages_revive_offers

counter

Number of offer revival messages, on this master.

counter.master_messages_status_update

counter

Number of status update messages, on this master.

counter.master_messages_status_update_acknowledgement

counter

Number of status update acknowledgement messages, on this master.

counter.master_messages_unregister_framework

counter

Number of framework unregistration messages, on this master.

counter.master_messages_unregister_slave

counter

Number of slave unregistration messages, on this master.

counter.master_recovery_slave_removals

counter

Number of slaves not re-registered during master failover, on this master.

counter.master_slave_registrations

counter

Number of slaves that were able to cleanly re-join the cluster and connect back to the master after the master is disconnected, on this master.

counter.master_slave_removals

counter

Number of slaves removed for various reasons, including maintenance, on this master.

counter.master_slave_reregistrations

counter

Number of slave re-registrations, on this master.

counter.master_slave_shutdowns_canceled

counter

Number of cancelled slave shutdowns, on this master. This happens when the slave removal rate limit allows for a slave to reconnect and send a PONG to the master before being removed.

counter.master_slave_shutdowns_scheduled

counter

Number of slaves which have failed their health check and are scheduled to be removed, on this master. They will not be immediately removed due to the Slave Removal Rate-Limit, but master/slave_shutdowns_completed will start increasing as they do get removed.

counter.master_tasks_error

counter

Number of tasks that were invalid, on this master. A task is invalid when the task launch attempt failed because of an error in the task specification.

counter.master_tasks_failed

counter

Number of failed tasks, on this master. A task has failed when the task aborted with an error.

counter.master_tasks_finished

counter

Number of finished tasks, on this master. A task has finished when the task completes successfully.

counter.master_tasks_killed

counter

Number of killed tasks, on this master. A task has been killed when the task was killed by the executor.

counter.master_tasks_lost

counter

Number of lost tasks, on this master. A task is lost when the task was running on an agent that has lost contact with the current master (typically due to a network partition or the agent host crashing).

counter.master_valid_framework_to_executor_messages

counter

Number of valid framework to executor messages, on this master.

counter.master_valid_status_update_acknowledgements

counter

Number of valid status update acknowledgement messages, on this master.

counter.master_valid_status_updates

counter

Number of valid status update messages, on this master.

counter.notifications

cumulative counter

Count of notifications.

Each time any plugin (including this one) emits a notification, it is counted, and the count seen is reported on every reporting interval.

counter.slave_executors_terminated

counter

Number of terminated executors, on this slave.

counter.slave_invalid_framework_messages

counter

Number of invalid framework messages, on this slave.

counter.slave_invalid_status_updates

counter

Number of invalid status updates, on this slave.

counter.slave_recovery_errors

gauge

Number of errors encountered during slave recovery, on this slave.

counter.slave_tasks_failed

counter

Number of failed tasks, on this slave. A task has failed when the task aborted with an error.

counter.slave_tasks_finished

counter

Number of finished tasks, on this slave. A task has finished when the task completes successfully.

counter.slave_tasks_killed

counter

Number of killed tasks, on this slave. A task has been killed when the task was killed by the executor.

counter.slave_tasks_lost

counter

Number of lost tasks, on this slave. A task is lost when the task was running on an agent that has lost contact with the current master (typically due to a network partition or the agent host crashing).

counter.slave_valid_framework_messages

counter

Number of valid framework messages, on this slave.

counter.slave_valid_status_updates

counter

Number of valid status updates, on this slave.

gauge.master_cpus_total

gauge

Number of CPUs available, in this cluster.

gauge.master_cpus_used

gauge

Number of allocated (used) CPUs, in this cluster.

gauge.master_disk_total

gauge

Disk space available in MB, in this cluster.

gauge.master_disk_used

gauge

Allocated (used) disk space in MB, in this cluster.

gauge.master_elected

gauge

Whether this is the elected master (1 if it is, 0 if not).

gauge.master_event_queue_dispatches

gauge

Number of dispatches in the event queue, on this master.

gauge.master_event_queue_http_requests

gauge

Number of HTTP requests in the event queue, on this master.

gauge.master_event_queue_messages

gauge

Number of messages in the event queue, on this master.

gauge.master_frameworks_active

gauge

Number of active frameworks with tasks, on this master.

gauge.master_frameworks_connected

gauge

Number of connected frameworks, on this master.

gauge.master_frameworks_disconnected

gauge

Number of disconnected frameworks, on this master.

gauge.master_frameworks_inactive

gauge

Number of inactive frameworks, on this master.

gauge.master_mem_total

gauge

Memory available in MB, in this cluster.

gauge.master_mem_used

gauge

Allocated (used) memory in MB, in this cluster.

gauge.master_outstanding_offers

gauge

Number of outstanding resource offers, on this master.

gauge.master_slaves_active

gauge

Number of active slaves with tasks, on this master.

gauge.master_slaves_connected

gauge

Number of connected slaves, on this master.

gauge.master_slaves_disconnected

gauge

Number of disconnected slaves, on this master.

gauge.master_slaves_inactive

gauge

Number of inactive slaves, on this master.

gauge.master_tasks_running

gauge

Number of running tasks, on this master. A task is running after it starts running successfully.

gauge.master_tasks_staging

gauge

Number of staging tasks, on this master. A task is staging when the master has received the framework’s request to launch the task but the task has not yet started to run.

gauge.master_tasks_starting

gauge

Number of starting tasks, on this master. A task is starting when a custom executor has learned about the task (and maybe started fetching its dependencies) but has not yet started to run it.

gauge.master_uptime_secs

gauge

Uptime in seconds, on this master.

gauge.registrar_queued_operations

gauge

Number of queued operations in registry.

gauge.registrar_registry_size_bytes

gauge

Registry size in bytes.

gauge.registrar_state_fetch_ms

gauge

Registry read latency in ms.

gauge.registrar_state_store_ms

gauge

Registry write latency in ms.

gauge.registrar_state_store_ms_count

gauge

Registry write count.

gauge.registrar_state_store_ms_max

gauge

Maximum registry write latency in ms.

gauge.registrar_state_store_ms_min

gauge

Minimum registry write latency in ms.

gauge.registrar_state_store_ms_p50

gauge

Median registry write latency in ms.

gauge.registrar_state_store_ms_p90

gauge

90th percentile registry write latency in ms.

gauge.registrar_state_store_ms_p95

gauge

95th percentile registry write latency in ms.

gauge.registrar_state_store_ms_p99

gauge

99th percentile registry write latency in ms.

gauge.registrar_state_store_ms_p999

gauge

99.9th percentile registry write latency in ms.

gauge.registrar_state_store_ms_p9999

gauge

99.99th percentile registry write latency in ms.

gauge.sine

gauge

A sine wave.

A sine wave is a curve representing periodic oscillations of constant amplitude, as given by a sine function. This metric is sent because it is a convenient way to demonstrate a gauge.

gauge.slave_cpus_total

gauge

Number of CPUs available, on this slave.

gauge.slave_cpus_used

gauge

Number of allocated (used) CPUs, on this slave.

gauge.slave_disk_total

gauge

Disk space available in MB, on this slave.

gauge.slave_disk_used

gauge

Allocated (used) disk space in MB, on this slave.

gauge.slave_executors_registering

gauge

Number of executors registering, on this slave.

gauge.slave_executors_running

gauge

Number of executors running, on this slave.

gauge.slave_executors_terminating

gauge

Number of terminating executors, on this slave.

gauge.slave_frameworks_active

gauge

Number of active frameworks, on this slave.

gauge.slave_mem_total

gauge

Memory available in MB, on this slave.

gauge.slave_mem_used

gauge

Allocated (used) memory in MB, on this slave.

gauge.slave_registered

gauge

Whether this slave is registered with a master (1 if it is, 0 if not).

gauge.slave_tasks_running

gauge

Number of running tasks, on this slave. A task is running after it starts running successfully.

gauge.slave_tasks_staging

gauge

Number of staging tasks, on this slave. A task is staging when the master has received the framework’s request to launch the task but the task has not yet started to run.

gauge.slave_tasks_starting

gauge

Number of starting tasks, on this slave. A task is starting when a custom executor has learned about the task (and maybe started fetching its dependencies) but has not yet started to run it.

gauge.slave_uptime_secs

gauge

Uptime in seconds, on this slave.

gauge.system_cpus_total

gauge

Number of CPUs available, on this system.

gauge.system_load_15min

gauge

Load average for the past 15 minutes, on this system.

gauge.system_load_1min

gauge

Load average for the past minute, on this system.

gauge.system_load_5min

gauge

Load average for the past 5 minutes, on this system.

percent.master_cpus_percent

gauge

Percentage of allocated (used) CPUs, in this cluster.

percent.master_disk_percent

gauge

Percentage of allocated (used) disk space, in this cluster.

percent.master_mem_percent

gauge

Percentage of allocated (used) memory, in this cluster.

percent.slave_cpus_percent

gauge

Percentage of allocated (used) CPUs, on this slave.

percent.slave_disk_percent

gauge

Percentage of allocated (used) disk space, on this slave.

percent.slave_mem_percent

gauge

Percentage of allocated (used) memory, on this slave.
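
The percent.* metrics describe the same allocation that the matching *_used and *_total gauges report. The following is a minimal sketch of that relationship, assuming a 0 to 100 scale. This guide does not state which scale the plugin reports, and the function name is hypothetical.

```python
def allocation_percent(used, total):
    """Return allocated resources as a percentage of the total (0.0 when total is 0)."""
    if total <= 0:
        return 0.0
    return 100.0 * used / total
```

For example, 6 allocated CPUs out of 24 available corresponds to 25 percent.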