Deploying a Smart Gateway Cluster

To benefit from high availability or to handle large trace volumes, you can deploy multiple instances of the SignalFx Smart Gateway that work together as a cluster. The Smart Gateway instances are intended to be deployed behind a traditional HTTP load balancer (HAProxy, NGINX, AWS ALB, …), with each instance responsible for part of the incoming volume of traces. SignalFx’s NoSample tail-based sampling requires that all the spans for a given trace are processed by the same Smart Gateway instance. The cluster therefore divides responsibility for the trace ID space among its instances and, when necessary, redirects incoming trace spans to the instance that owns them.

To achieve this, a Smart Gateway cluster internally creates an Etcd cluster. Bootstrapping this cluster requires that one instance be started as the seed, with further instances joining this established seed. Additionally, for larger clusters of more than 7 instances, we recommend that additional instances connect to the cluster as clients instead of joining the cluster. The instructions below will guide you through deploying a small 3-node cluster behind an NGINX load balancer, including steps for bootstrapping a new cluster.

Deployment

In this example, we will install and configure a highly available deployment of the SignalFx Smart Gateway, using a 3-node cluster of Smart Gateway instances behind two NGINX load balancers accessible through a single round-robin DNS endpoint. This architecture ensures that there is no single point of failure and that any one instance going down does not impact the availability of the Smart Gateway.

Instance provisioning

The first step is to provision instances. The workload on the load balancers is mostly CPU-driven, while the load on the Smart Gateway instances is a combination of CPU (for processing), memory (for live state), and disk usage (for trace span buffering). For more information, see the Smart Gateway instance sizing guidelines.

The load balancer instances must be accessible from your instrumented applications and hosts via HTTP. Similarly, your Smart Gateway instances must be accessible from those load balancer instances. Finally, the Smart Gateway instances must be able to connect out to the Internet to reach SignalFx’s cloud platform. Make sure that your instances can communicate appropriately before continuing.

Install and configure the NGINX load balancers

The easiest way to run NGINX on your load balancer instances depends on your preferred application deployment method. For example, on an Amazon Linux system, you can install NGINX with yum install nginx. Alternatively, you can run NGINX as a Docker container:

$ docker run -v /host/path/nginx.conf:/etc/nginx/nginx.conf:ro -d nginx

In both cases, create your nginx.conf configuration file to set up NGINX for load balancing across the three provisioned Smart Gateway instances:

events {
}

http {
    upstream smart-gateway {
        server IP_OF_SGW_1:8080;
        server IP_OF_SGW_2:8080;
        server IP_OF_SGW_3:8080;
    }

    server {
        listen 80;

        location / {
            proxy_set_header Host $host;
            proxy_pass http://smart-gateway;
        }
    }
}

This will configure NGINX to listen on port 80 and proxy incoming requests to one of the configured Smart Gateway backend instances, on port 8080. For more information on configuring NGINX, see NGINX’s load balancing documentation.
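If the set of Smart Gateway addresses changes often, the configuration above can be generated rather than edited by hand. The following is a minimal sketch, assuming a POSIX shell; render_nginx_conf is a hypothetical helper name (not a SignalFx or NGINX tool), and the addresses in the usage comment are placeholders.

```shell
#!/bin/sh
# Sketch: print an nginx.conf that load-balances across the given
# Smart Gateway addresses on port 8080, mirroring the example above.
# render_nginx_conf is a hypothetical helper name.
render_nginx_conf() (
    echo 'events {'
    echo '}'
    echo
    echo 'http {'
    echo '    upstream smart-gateway {'
    # One "server" line per address passed on the command line.
    for ip in "$@"; do
        echo "        server ${ip}:8080;"
    done
    echo '    }'
    echo
    echo '    server {'
    echo '        listen 80;'
    echo
    echo '        location / {'
    echo '            proxy_set_header Host $host;'
    echo '            proxy_pass http://smart-gateway;'
    echo '        }'
    echo '    }'
    echo '}'
)

# Example (placeholder addresses):
#   render_nginx_conf 10.0.0.11 10.0.0.12 10.0.0.13 > nginx.conf
#   nginx -t -c "$PWD/nginx.conf"   # ask NGINX to validate the file
```

Running nginx -t before reloading is a cheap way to catch a typo in the generated file before it affects traffic.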

Configure DNS

To ensure high availability of your load balancers, configure a DNS name as a round-robin entry pointing at both of your provisioned load balancer instances. This can be done by defining two A records for the same DNS name in your DNS zone file. If you are deployed on AWS and rely on Route 53, see the Route 53 documentation on configuring Simple Routing with multiple values.

Use this DNS name as the ingest address for metrics and traces when configuring your applications and SignalFx Smart Agents.

Install and configure the Smart Gateway instances

Note about realms

A realm is a self-contained deployment of SignalFx in which your organization is hosted. Different realms have different API endpoints (e.g. the endpoint for sending data is ingest.us1.signalfx.com for the us1 realm, and ingest.eu0.signalfx.com for the eu0 realm).

Various statements in the instructions below include a YOUR_SIGNALFX_REALM placeholder that you should replace with the actual name of your realm. This realm name is shown on your profile page in SignalFx. If you do not include the realm name when specifying an endpoint, SignalFx will interpret it as pointing to the us0 realm.

The configuration of clustered Smart Gateway instances is similar to that of a single-instance deployment, with the addition of some important cluster-related settings that must be set at the top level of the gateway’s JSON configuration file:

  • A common ClusterName that identifies the application environment this gateway cluster is deployed for.
  • A ClusterDataDir configured to a persistent on-disk location for Etcd cluster state; if running inside a container this location must persist across container restarts (for example using a bind-mounted volume).
  • ListenOnClientAddress and ListenOnPeerAddress, defining the address and port to listen for Etcd client and peer connections.
  • AdvertiseClientAddress and AdvertisePeerAddress, defining the address and port to advertise the client and peer addresses into Etcd itself (this is often necessary when running inside a container).
  • A list of TargetClusterAddresses, pointing at all the deployed Smart Gateway instances’ advertised client addresses.

Additionally, the signalfx forwarder must be in use and configured with a few additional settings:

  • An IngestAddress matching the gateway’s signalfx listener HTTP endpoint (here on port 8080).
  • ListenRebalanceAddress, defining the address on which to accept connections from other gateways during cluster rebalancing events.
  • AdvertiseRebalanceAddress, defining the rebalance address to advertise (again, often necessary when running inside a container).

Adding the above, here is what the Smart Gateway’s configuration for the first instance of our 3-node cluster looks like:

{
  "ServerName": "smart-gateway-1",
  "ClusterName": "prod",
  "ClusterDataDir": "/var/lib/gateway/etcd",
  "ListenOnClientAddress": "0.0.0.0:2379",
  "ListenOnPeerAddress": "0.0.0.0:2380",
  "AdvertiseClientAddress": "IP_OF_SGW_1:2379",
  "AdvertisePeerAddress": "IP_OF_SGW_1:2380",
  "TargetClusterAddresses": [
    "IP_OF_SGW_1:2379",
    "IP_OF_SGW_2:2379",
    "IP_OF_SGW_3:2379"
  ],
  "StatsDelay": "10s",
  "LogDir": "/var/log/gateway",
  "ListenFrom": [
    {
      "Type": "signalfx",
      "ListenAddr": "0.0.0.0:8080"
    }
  ],
  "ForwardTo": [
    {
      "Type": "signalfx",
      "URL": "https://ingest.YOUR_SIGNALFX_REALM.signalfx.com/v2/datapoint",
      "EventURL": "https://ingest.YOUR_SIGNALFX_REALM.signalfx.com/v2/event",
      "TraceURL": "https://ingest.YOUR_SIGNALFX_REALM.signalfx.com/v1/trace",
      "DefaultAuthToken": "YOUR_SIGNALFX_API_TOKEN",
      "Name": "smart-gateway-forwarder",
      "TraceSample": {
        "BackupLocation": "/var/lib/gateway/data",
        "IngestAddress": "http://IP_OF_SGW_1:8080",
        "ListenRebalanceAddress": "0.0.0.0:2382",
        "AdvertiseRebalanceAddress": "IP_OF_SGW_1:2382"
      }
    }
  ]
}

The configuration of the other nodes of the cluster is similar, with two exceptions: ServerName must be unique for each instance, and AdvertiseClientAddress, AdvertisePeerAddress, AdvertiseRebalanceAddress, and IngestAddress must all point to the current node’s IP address (reachable by the other nodes).
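To make the per-node differences concrete, here is a small sketch that prints the settings that change from one instance to the next, using the same ports as the example configuration. It assumes a POSIX shell; node_overrides is a hypothetical helper name, and the name and address in the usage comment are placeholders.

```shell
#!/bin/sh
# Sketch: print the settings that differ per node for a given
# server name and IP address. Ports follow the example configuration
# (2379 client, 2380 peer, 8080 ingest, 2382 rebalance).
# node_overrides is a hypothetical helper name.
node_overrides() (
    name=$1
    ip=$2
    printf '"ServerName": "%s",\n' "$name"
    printf '"AdvertiseClientAddress": "%s:2379",\n' "$ip"
    printf '"AdvertisePeerAddress": "%s:2380",\n' "$ip"
    printf '"IngestAddress": "http://%s:8080",\n' "$ip"
    printf '"AdvertiseRebalanceAddress": "%s:2382"\n' "$ip"
)

# Example (placeholder values for the second node):
#   node_overrides smart-gateway-2 10.0.0.12
```

All other settings, including ClusterName and TargetClusterAddresses, stay identical across the nodes of the cluster.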

Bootstrapping

Because the Smart Gateway embeds Etcd, bootstrapping a new Smart Gateway cluster (or restarting an existing cluster from scratch) cannot be done by simply bringing all the instances up at the same time. First, a seed node must be started by running one Smart Gateway instance with the seed cluster operation; once it is operational, the rest of the cluster can be started using the join cluster operation. When running a cluster larger than 7 nodes, we recommend that additional nodes past 7 be started using the client cluster operation. This makes the additional Smart Gateway instances simply connect to the existing Etcd cluster as clients instead of joining that cluster and its consensus operations.

Starting the seed node

Start the seed node by passing the appropriate command-line flag when starting the Smart Gateway:

$ ./smart-gateway --configfile gateway.conf --cluster-op seed

Alternatively, the same result can be achieved by setting the SFX_CLUSTER_OPERATION=seed environment variable when starting the Smart Gateway. This can be useful when running the Smart Gateway as a container.

Note

Once started, the seed node is already a full member of the cluster and does not need to be restarted with the join operation. However, if this node is later restarted for any reason, it must be started with --cluster-op join so that it rejoins the existing cluster instead of seeding a new one.

Starting the rest of the cluster

First, verify that the first Smart Gateway instance is up by checking its health endpoint:

$ curl -s http://IP_OF_SGW_1:8080/healthz
OK

Then, start the remaining nodes, instructing them to join the cluster created by the seed node:

$ ./smart-gateway --configfile gateway.conf --cluster-op join

Once again, the same result can be achieved by setting the SFX_CLUSTER_OPERATION=join environment variable.
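When scripting the bootstrap, the "wait for the seed, then join" ordering can be enforced with a small retry loop. The sketch below assumes a POSIX shell and curl; retry_until is a hypothetical helper name, and the attempt count and delay are arbitrary illustrative values.

```shell
#!/bin/sh
# Sketch: run a command repeatedly until it succeeds.
# Usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt failed.
# retry_until is a hypothetical helper name.
retry_until() (
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            exit 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    exit 1
)

# Example: wait up to about a minute for the seed node's health
# endpoint, then start this node with the join operation.
#   retry_until 30 2 curl -sf http://IP_OF_SGW_1:8080/healthz \
#     && ./smart-gateway --configfile gateway.conf --cluster-op join
```

The -f flag makes curl exit with a non-zero status on HTTP errors, so the loop only stops once the health endpoint answers successfully.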

You can check that the cluster contains the members you expect by querying Etcd directly:

$ curl -s http://IP_OF_SGW_1:2379/v2/members
{
  "members": [
    {
        "id": "ff3eb1de3791519",
        "name": "smart-gateway-1",
        "clientURLs": [ "http://IP_OF_SGW_1:2379" ],
        "peerURLs": [ "http://IP_OF_SGW_1:2380" ]
    },
    {
        "id": "4f95f8065f774793",
        "name": "smart-gateway-2",
        "clientURLs": [ "http://IP_OF_SGW_2:2379" ],
        "peerURLs": [ "http://IP_OF_SGW_2:2380" ]
    },
    {
        "id": "df410f818d6c6227",
        "name": "smart-gateway-3",
        "clientURLs": [ "http://IP_OF_SGW_3:2379" ],
        "peerURLs": [ "http://IP_OF_SGW_3:2380" ]
    }
  ]
}
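For scripted checks, the member names can be pulled out of that response without a JSON parser. This is a rough sketch, assuming a POSIX shell with a grep that supports -o (GNU and BSD grep both do); member_names is a hypothetical helper name, and a real JSON tool would be more robust if one is available.

```shell
#!/bin/sh
# Sketch: read an Etcd /v2/members response on stdin and print one
# member name per line. Handles both pretty-printed and compact JSON,
# but is a text match, not a real JSON parser.
# member_names is a hypothetical helper name.
member_names() (
    grep -o '"name": *"[^"]*"' | sed 's/.*"\([^"]*\)"$/\1/'
)

# Example: list the members known to the first node.
#   curl -s http://IP_OF_SGW_1:2379/v2/members | member_names
# Example: count them (3 is expected for the cluster in this guide).
#   curl -s http://IP_OF_SGW_1:2379/v2/members | member_names | wc -l
```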

Cluster operations

Graceful instance shutdown

To avoid data loss, Smart Gateway instances should be gracefully terminated by sending the process a SIGTERM signal and waiting for the process to terminate. Upon receipt of this signal, the Smart Gateway implements a graceful shutdown procedure that ensures it correctly drains its incoming connections and buffers, and transfers to the rest of the cluster any in-flight trace spans that still need to be analyzed.
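The "send SIGTERM, then wait" procedure can be sketched as a small shell function. It assumes a POSIX shell and that the caller is allowed to signal the process; graceful_stop is a hypothetical helper name, and the 60-second default timeout is an arbitrary illustrative value, not a documented Smart Gateway limit.

```shell
#!/bin/sh
# Sketch: send SIGTERM to a process and wait for it to exit.
# Usage: graceful_stop PID [TIMEOUT_SECONDS]
# Returns 0 once the process is gone, 1 if it is still running
# after the timeout. It never escalates to SIGKILL, since that
# would skip the graceful shutdown procedure.
# graceful_stop is a hypothetical helper name.
graceful_stop() (
    pid=$1
    timeout=${2:-60}
    # Nothing to do if the process is already gone.
    kill -0 "$pid" 2>/dev/null || exit 0
    kill -TERM "$pid" || exit 1
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "process $pid still running after ${timeout}s" >&2
            exit 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    exit 0
)

# Example (the PID lookup is an assumption about your setup):
#   graceful_stop "$(pgrep -x smart-gateway)" 120
```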

Performing a rolling restart or upgrade

You can restart and upgrade Smart Gateway clusters with no downtime by performing a rolling restart of the Smart Gateway instances. The safest approach to performing a rolling restart is to restart one instance at a time, waiting for the new process to join the cluster and reach steady state before moving on to the next instance. When restarting an instance within an existing established cluster, you should never use the seed cluster operation.

Follow the steps below to perform a rolling restart or upgrade of a Smart Gateway cluster:

  • Identify a node to restart or upgrade.
  • Gracefully shut down the identified node by sending a SIGTERM signal to the Smart Gateway process.
  • Wait for the process to gracefully terminate.
  • Start the new replacement process; ensure that you use either a join or client cluster operation based on your desired cluster topology.
  • Wait for the process to start (HTTP requests to /healthz on the listener port should return 200 OK).
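The steps above, repeated one node at a time, can be sketched as a loop. How you stop, start, and probe a node depends on your environment (SSH, a service manager, a container orchestrator), so stop_node, start_node, and node_healthy are hypothetical hooks that you would define yourself; the retry limit and delay are arbitrary illustrative values.

```shell
#!/bin/sh
# Sketch: restart nodes one at a time, waiting for each to become
# healthy before moving on. The three hooks are hypothetical and
# must be defined by the caller:
#   stop_node NODE        gracefully stop the gateway (SIGTERM and wait)
#   start_node NODE OP    start the gateway with the given cluster operation
#   node_healthy NODE     succeed once /healthz on the node returns 200 OK
rolling_restart() {
    for node in "$@"; do
        echo "restarting $node"
        stop_node "$node" || return 1
        # Never "seed" here: the cluster already exists.
        start_node "$node" join || return 1
        tries=0
        until node_healthy "$node"; do
            tries=$((tries + 1))
            if [ "$tries" -ge 30 ]; then
                echo "$node did not become healthy" >&2
                return 1
            fi
            sleep 2
        done
    done
}

# Example, once the hooks are defined:
#   rolling_restart IP_OF_SGW_1 IP_OF_SGW_2 IP_OF_SGW_3
```

The loop stops at the first failure so that a problem on one node does not cascade to the rest of the cluster. For nodes that run as Etcd clients, pass client instead of join in start_node.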

Performing an instance replacement

When replacing an instance, or often when restarting an instance in a containerized environment, it is likely that the new instance will not have the same IP address as the instance it replaces. As cluster members’ IP addresses are embedded into the configuration’s TargetClusterAddresses field, it is important to understand the implications of replacing an instance.

Because the cluster members primarily learn about each other through Etcd, the TargetClusterAddresses list does not need to be an always up-to-date and accurate list of the other members’ IP addresses. This list is only used when a Smart Gateway instance starts up and connects to an established cluster; once connected, all other coordination and cluster membership information is exchanged through Etcd. Therefore, the absolute minimum requirement for the TargetClusterAddresses list is that at least one of its addresses belongs to a valid and active cluster instance that can be reached to establish this connection.

When replacing an instance, it is therefore not necessary to reconfigure and restart the rest of the cluster to make it aware of this new instance, as this discovery will happen through Etcd for all instances. The TargetClusterAddresses list might of course become stale in long-running instances. If the cluster has changed significantly since the last restart, we recommend that you refresh this part of the configuration with an up-to-date list before (re)starting a Smart Gateway process.

Configuration reference

| Configuration | Environment Variable | Description | Default Value |
|---|---|---|---|
| ServerName | SFX_SERVER_NAME | The name the server should be identified by. This value should be unique for each instance in the cluster. Should be set to what the Smart Agent emits as host. | <none> |
| ClusterName | <none> | The name the cluster should be identified by. This value should be the same for each instance in the cluster. An instance will not be allowed to join a cluster without being configured with the same cluster name as existing members. | gateway |
| ClusterDataDir | SFX_CLUSTER_DATA_DIR | A file system path for the etcd server to store data in. Note: if running in a container, make sure this path points to a location that is persisted outside the container. | ./etcd-data |
| ClusterOperation | SFX_CLUSTER_OPERATION | The cluster operation that the Smart Gateway should perform on startup. Options are "seed", "join", or "client". If left blank, the Smart Gateway will not operate in cluster mode. Note that the command-line flag --cluster-op overrides both the config file and the environment variable. | <none> |
| TargetClusterAddresses | SFX_TARGET_CLUSTER_ADDRESSES | A list of client addresses and ports for the Smart Gateway to join. If using the environment variable, assign the list as a single string of addresses with ports separated by commas. These addresses are static. For example: SFX_TARGET_CLUSTER_ADDRESSES="127.0.0.1:2379,127.0.0.2:2379" | <none> |
| ListenOnPeerAddress | SFX_LISTEN_ON_PEER_ADDRESS | The address and port that the etcd server listens on for peer connections. This address is static. | 127.0.0.1:2380 |
| AdvertisePeerAddress | SFX_ADVERTISE_PEER_ADDRESS | The address and port advertised by the etcd server for peer connections. This address is static. | 127.0.0.1:2380 |
| ListenOnClientAddress | SFX_LISTEN_ON_CLIENT_ADDRESS | The address and port that the etcd server listens on for client connections. This address is static. | 127.0.0.1:2379 |
| AdvertiseClientAddress | SFX_ADVERTISE_CLIENT_ADDRESS | The address and port advertised by the etcd server for client connections. This address is static. | 127.0.0.1:2379 |
| ETCDMetricsAddress | SFX_ETCD_METRICS_ADDRESS | The address and port used to expose Prometheus-style metrics about the embedded etcd server. This address is static. | 127.0.0.1:2381 |
| UnhealthyMemberTTL | SFX_UNHEALTHY_MEMBER_TTL | The duration after which an etcd member should be removed from the cluster when it is presumed unhealthy. | 5s |
| RemoveMemberTimeout | SFX_REMOVE_MEMBER_TIMEOUT | The time to wait for the instance to remove itself from the etcd cluster when shutting down the Smart Gateway instance. | 1s |
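
When configuring instances through environment variables (for example in containers), the comma-separated value for SFX_TARGET_CLUSTER_ADDRESSES can be assembled from a list of node addresses. This is a minimal sketch, assuming a POSIX shell and the client port 2379 used throughout this guide; join_cluster_addresses is a hypothetical helper name and the addresses in the usage comment are placeholders.

```shell
#!/bin/sh
# Sketch: turn a list of node IPs into the comma-separated
# "IP:2379,IP:2379,..." form expected by SFX_TARGET_CLUSTER_ADDRESSES.
# join_cluster_addresses is a hypothetical helper name.
join_cluster_addresses() (
    out=""
    for ip in "$@"; do
        # Prefix with a comma only when out is already non-empty.
        out="${out:+$out,}${ip}:2379"
    done
    printf '%s\n' "$out"
)

# Example (placeholder addresses):
#   SFX_TARGET_CLUSTER_ADDRESSES="$(join_cluster_addresses 10.0.0.11 10.0.0.12 10.0.0.13)"
#   export SFX_TARGET_CLUSTER_ADDRESSES
```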