Deploying the SignalFx Smart Gateway

The Smart Gateway is a key component in your deployment of SignalFx Microservices APM: it receives all the distributed traces from your instrumented applications, generates metrics for each unique span and trace path, and selects the interesting, erroneous or outlier traces to forward to SignalFx. It is designed to run within your environment, close to your application, and operate reliably and at scale.

The Smart Gateway is an extension of SignalFx’s Gateway (formerly known as the Metric Proxy) with support for Microservices APM’s NoSample™ Tail-Based Distributed Tracing features. It is available for download as a statically-linked Linux x86_64 binary that you can deploy and run in your environment. While most applications only require a single Smart Gateway instance, high availability and scale can be achieved by running multiple Smart Gateway instances, configured to operate as a coordinated cluster.

Instance sizing

Many factors are involved in the resource utilization of the Smart Gateway, ranging from how detailed your tracing instrumentation is to, of course, the volume of transactions being captured and analyzed by the Smart Gateway. Generally, we recommend current generation instances like the c5-class AWS instances, offering high speed networking, fast CPU cores and plenty of system memory.

The recommended instance sizes for running the Smart Gateway are as follows, based on your expected volume of trace spans per minute (SPM). If you’re unsure which one to use, err on the side of larger instances and monitor your Smart Gateway instances to evaluate your actual resource utilization.

SPM        | AWS EC2 Type
Up to 6M   | c5.18xlarge
Up to 3M   | c5.9xlarge
Up to 1.5M | c5.4xlarge
Up to 750k | c5.2xlarge
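
As a rough aid for reading the sizing table, the lookup can be sketched in Python. The thresholds are copied from the table; the function name recommend_instance is hypothetical, and the result is only a starting point to be checked against your actual resource utilization.

```python
# Thresholds copied from the sizing table: (max spans per minute, EC2 type).
SIZING = [
    (750_000, "c5.2xlarge"),
    (1_500_000, "c5.4xlarge"),
    (3_000_000, "c5.9xlarge"),
    (6_000_000, "c5.18xlarge"),
]


def recommend_instance(spans_per_minute):
    """Return the smallest listed instance type that covers the given SPM.

    Returns None above 6M SPM, where the table lists no single-instance option.
    """
    if spans_per_minute < 0:
        raise ValueError("spans_per_minute must be non-negative")
    for limit, instance_type in SIZING:
        if spans_per_minute <= limit:
            return instance_type
    return None
```

For example, recommend_instance(2_000_000) returns "c5.9xlarge". A None result suggests looking at the clustered deployment described later in this guide.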

Deploying the Smart Gateway

Note about realms

A realm is a self-contained deployment of SignalFx in which your organization is hosted. Different realms have different API endpoints (e.g. the endpoint for sending data is ingest.us1.signalfx.com for the us1 realm, and ingest.eu0.signalfx.com for the eu0 realm).

Various statements in the instructions below include a YOUR_SIGNALFX_REALM placeholder that you should replace with the actual name of your realm. This realm name is shown on your profile page in SignalFx. If you do not include the realm name when specifying an endpoint, SignalFx will interpret it as pointing to the us0 realm.
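
To make the placeholder substitution concrete, here is a minimal Python sketch that builds the realm-specific URLs used in this guide. The helper name realm_endpoints and the realm-name pattern (lowercase letters followed by digits, as in us1 or eu0) are assumptions made for illustration; the URL paths are the ones that appear in the configuration examples in this guide.

```python
import re


def realm_endpoints(realm):
    """Return the realm-specific URLs used in the gateway configuration."""
    # Assumed shape of a realm name, based on the examples us1 and eu0.
    if not re.fullmatch(r"[a-z]+[0-9]+", realm):
        raise ValueError(f"unexpected realm name: {realm!r}")
    ingest = f"https://ingest.{realm}.signalfx.com"
    return {
        "URL": f"{ingest}/v2/datapoint",
        "EventURL": f"{ingest}/v2/event",
        "TraceURL": f"{ingest}/v1/trace",
        "api": f"https://api.{realm}.signalfx.com",
    }
```

The sketch deliberately requires an explicit realm rather than falling back to a default, so that a forgotten placeholder fails loudly.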

The Smart Gateway is available as a statically-linked Linux x86_64 binary that you can download from SignalFx if you are a µAPM customer. You can also download the latest Smart Gateway’s binary from the command line if you know your organization token and your realm name:

$ curl -qs -H"X-SF-Token:ORG_TOKEN" \
    https://api.YOUR_SIGNALFX_REALM.signalfx.com/v2/smart-gateway/download | gunzip > smart-gateway

Building a Docker image

If you intend to deploy your Smart Gateway as a Docker image for easier orchestration, you can use the following Dockerfile to build a functioning Docker image. This step is optional; you can distribute and orchestrate the Smart Gateway whichever way best fits your current infrastructure and environment.

FROM scratch
ADD smart-gateway /
ADD https://raw.githubusercontent.com/signalfx/gateway/master/ca-bundle.crt /etc/pki/tls/certs/ca-bundle.crt
VOLUME /var/lib/gateway
CMD ["/smart-gateway", "--configfile", "/var/lib/gateway/etc/gateway.conf"]

Assuming both the smart-gateway binary and this Dockerfile are in your current working directory, you can build the corresponding Docker image with:

$ docker build -t signalfx-smart-gateway .

Smart Gateway configuration

The Smart Gateway, like the SignalFx Gateway it extends, reads its configuration from a single JSON file. By default, the Smart Gateway will look for this configuration file at /etc/gateway.conf; to specify a different location, use the command line flag --configfile. The configuration is composed of three main elements: top-level configuration of the Smart Gateway, listeners (ListenFrom) and forwarders (ForwardTo); for a complete reference of those sections, refer to this section in the SignalFx Gateway’s configuration reference.

The following instructions enable our NoSample™ Tail-Based Distributed Tracing feature, and “transform” the SignalFx Gateway into the SignalFx Smart Gateway. To this effect, there are four key configuration elements to pay attention to:

  • the ServerName of this Smart Gateway instance. This should match the hostname of the underlying instance as reported by the SignalFx Smart Agent to enable monitoring of the Smart Gateway;
  • the ClusterName of the cluster this Smart Gateway is a part of. This is typically an environment name, like qa or prod. It is required even for single-instance deployments;
  • a signalfx listener must be configured for the Smart Gateway to receive traces;
  • a signalfx forwarder with a TraceSample section and persistent BackupLocation must be configured to enable smart trace sampling and send the selected traces to SignalFx.

When put together, your Smart Gateway configuration should look as follows (replace YOUR_SIGNALFX_REALM with the name of the SignalFx realm your organization is hosted in and YOUR_SIGNALFX_API_TOKEN with your organization token):

{
  "ServerName": "smart-gateway-1",
  "StatsDelay": "10s",
  "LogDir": "/var/log/gateway",
  "ListenFrom": [
    {
      "Type": "signalfx",
      "ListenAddr": "0.0.0.0:8080"
    }
  ],
  "ForwardTo": [
    {
      "Type": "signalfx",
      "URL": "https://ingest.YOUR_SIGNALFX_REALM.signalfx.com/v2/datapoint",
      "EventURL": "https://ingest.YOUR_SIGNALFX_REALM.signalfx.com/v2/event",
      "TraceURL": "https://ingest.YOUR_SIGNALFX_REALM.signalfx.com/v1/trace",
      "DefaultAuthToken": "YOUR_SIGNALFX_API_TOKEN",
      "Name": "smart-gateway-forwarder",
      "TraceSample": {
        "BackupLocation": "/var/config/gateway/data"
      }
    }
  ]
}

Note that the ServerName of your SignalFx Smart Gateway must be unique within a given Smart Gateway cluster.
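
The four key configuration elements listed above can be checked before starting the gateway. The following Python sketch is an illustration only: check_smart_gateway_config is a hypothetical helper, not part of the Smart Gateway, and it looks only at the keys named in this section.

```python
def check_smart_gateway_config(config):
    """Return a list of problems; an empty list means all four elements are present."""
    problems = []
    if not config.get("ServerName"):
        problems.append("ServerName is missing")
    if not config.get("ClusterName"):
        problems.append("ClusterName is missing")
    listeners = config.get("ListenFrom", [])
    if not any(item.get("Type") == "signalfx" for item in listeners):
        problems.append("no signalfx listener in ListenFrom")
    forwarders = [f for f in config.get("ForwardTo", []) if f.get("Type") == "signalfx"]
    if not forwarders:
        problems.append("no signalfx forwarder in ForwardTo")
    elif not any(f.get("TraceSample", {}).get("BackupLocation") for f in forwarders):
        problems.append("signalfx forwarder has no TraceSample.BackupLocation")
    return problems
```

A typical use would be check_smart_gateway_config(json.load(open("/etc/gateway.conf"))) after importing json. Note that the listing above does not set ClusterName, so this sketch would flag it; the cluster options table later in this guide lists gateway as the default value.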

Dockerized Smart Gateway configuration

If you are packaging your Smart Gateway in a Docker image as described above, you must ensure that the configured BackupLocation persists across container restarts. This can be achieved by placing your BackupLocation on a bind-mounted volume like /var/lib/gateway in the example above. Similarly, you might want to redirect the Smart Gateway’s log to standard out for Docker to capture (using "LogDir": "-"), or to a file placed on a bind-mounted volume.

Using the Dockerfile above, make sure to place this configuration under /var/lib/gateway/etc/gateway.conf (replace YOUR_SIGNALFX_REALM with the name of the SignalFx realm your organization is hosted in and YOUR_SIGNALFX_API_TOKEN with your organization token):

{
  "ServerName": "smart-gateway-1",
  "StatsDelay": "10s",
  "LogDir": "/var/lib/gateway/logs",
  "ListenFrom": [
    {
      "Type": "signalfx",
      "ListenAddr": "0.0.0.0:8080"
    }
  ],
  "ForwardTo": [
    {
      "Type": "signalfx",
      "URL": "https://ingest.YOUR_SIGNALFX_REALM.signalfx.com/v2/datapoint",
      "EventURL": "https://ingest.YOUR_SIGNALFX_REALM.signalfx.com/v2/event",
      "TraceURL": "https://ingest.YOUR_SIGNALFX_REALM.signalfx.com/v1/trace",
      "DefaultAuthToken": "YOUR_SIGNALFX_API_TOKEN",
      "Name": "smart-gateway-forwarder",
      "TraceSample": {
        "BackupLocation": "/var/lib/gateway/data"
      }
    }
  ]
}

Running the Smart Gateway

Start your Smart Gateway by running its binary, optionally specifying the configuration file location if you’re not using the default at /etc/gateway.conf:

$ ./smart-gateway --configfile /wherever/is/gateway.conf

Or, if you’re using Docker:

$ docker run -d --name smart-gateway -v /var/lib/gateway:/var/lib/gateway -p 8080:8080 signalfx-smart-gateway

You can verify that your Smart Gateway is running and accepting traces by sending an empty payload to its /v1/trace endpoint and expecting a return value of "OK":

$ curl -d'[]' -H'Content-Type:application/json' http://<your-gateway>:8080/v1/trace
"OK"

Finally, make sure your deployed SignalFx Smart Agents are configured to send data through your Smart Gateway by configuring their ingestUrl to http://<your-gateway>:8080/ (as per the Smart Agent Configuration documentation).
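
The curl check from this section can also be scripted, for example as part of a deployment pipeline. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the only behavior assumed of the gateway is what the curl example shows (an empty JSON array posted to /v1/trace returns "OK").

```python
import json
import urllib.request


def build_trace_probe(gateway_host, port=8080):
    """Build the same request as the curl check: POST '[]' to /v1/trace."""
    return urllib.request.Request(
        f"http://{gateway_host}:{port}/v1/trace",
        data=b"[]",
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_ok_response(body):
    """True when the response body is the JSON string "OK"."""
    try:
        return json.loads(body) == "OK"
    except ValueError:
        return False


def gateway_accepts_traces(gateway_host, port=8080, timeout=5):
    """Send the probe and report whether the gateway answered "OK"."""
    try:
        request = build_trace_probe(gateway_host, port)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return is_ok_response(response.read())
    except OSError:
        # Covers connection failures, timeouts and HTTP error statuses.
        return False
```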

Monitoring the Smart Gateway

Monitoring your Smart Gateway is paramount to ensure its correct operation and that you have deployed the appropriate Smart Gateway capacity to handle your workload. To this effect, SignalFx provides pre-built, curated dashboards for the Smart Gateway. They will appear in your organization soon after you deploy the Smart Gateway.

For these dashboards to fully populate with all the data you need to monitor the health and resource utilization of your Smart Gateway, you need to deploy the SignalFx Smart Agent on your Smart Gateway instances. This makes sure that the relevant metrics are being reported to SignalFx.

When configuring the Smart Agent on your gateway instances, pay attention to the following settings:

  • the hostname used by the agent should match the ServerName used by the Smart Gateway;
  • the collectd/signalfx-metadata monitor must be enabled, and configured to report its metrics with two additional dimensions: a source: gateway dimension, and a cluster dimension whose value matches the ClusterName used by the Smart Gateway.

Finally, the following monitors must be configured to provide the appropriate metrics for the Smart Gateway monitoring dashboards:

hostname: REPLACE-WITH-SERVER-NAME
signalFxAccessToken: YOUR_SIGNALFX_API_TOKEN
ingestUrl: https://ingest.YOUR_SIGNALFX_REALM.signalfx.com

monitors:
  - type: host-metadata
  - type: collectd/cpu
  - type: collectd/cpufreq
  - type: collectd/df
  - type: collectd/disk
  - type: collectd/interface
  - type: collectd/load
  - type: collectd/memory
  - type: collectd/vmem
  - type: collectd/signalfx-metadata
    extraDimensions:
      source: gateway
      cluster: REPLACE-WITH-CLUSTER-NAME

Install and configure a clustered Smart Gateway

To benefit from high-availability or to handle large trace volumes, you can deploy multiple instances of the SignalFx Smart Gateway that work together as a cluster. To configure a clustered gateway, perform the following steps after completing the initial installation and configuration steps. Once your Smart Gateway instances are installed and configured, you will need to deploy an HTTP load balancer in front of them (HAProxy or Nginx are good options).

Cluster configuration options

The configurations for clustering the Smart Gateway are defined in the gateway.conf file, but can be overridden at start up using their corresponding environment variables.

Please note that all cluster configurations will be ignored if a cluster operation is not specified (“seed” or “join”).

Review the configuration options below, and then continue to Configuring the clustered Smart Gateway.

Configuration | Environment Variable | Description | Default Value
ServerName | SFX_SERVER_NAME | The name the server should be identified by. This value should be unique for each instance in the cluster. Should be set to what the Smart Agent emits as host. | <none>
ClusterName | <none> | The name the cluster should be identified by. This value should be the same for each instance in the cluster. An instance will not be allowed to join a cluster without being configured with the same cluster name as existing members. | gateway
ClusterOperation | SFX_CLUSTER_OPERATION | The cluster operation that the Smart Gateway should perform on startup. Options are “join” or “seed”. If left blank, the Smart Gateway will not operate in cluster mode. Please note that the command line flag “-cluster-op” will override both the config file and the environment variable. | <none>
TargetClusterAddresses | SFX_TARGET_CLUSTER_ADDRESSES | A comma-separated list of peer addresses and ports for the Smart Gateway to join. If using the environment variable, assign the list as a single string of addresses with ports separated by commas. These addresses are static. For example: SFX_TARGET_CLUSTER_ADDRESSES="127.0.0.1:2379,127.0.0.1:2380" | <none>
ListenOnPeerAddress | SFX_LISTEN_ON_PEER_ADDRESS | The address and port that the etcd server listens on for peer connections. This address is static. | 127.0.0.1:2380
AdvertisePeerAddress | SFX_ADVERTISE_PEER_ADDRESS | The address and port advertised by the etcd server for peer connections. This address is static. | 127.0.0.1:2380
ListenOnClientAddress | SFX_LISTEN_ON_CLIENT_ADDRESS | The address and port that the etcd server listens on for client connections. This address is static. | 127.0.0.1:2379
AdvertiseClientAddress | SFX_ADVERTISE_CLIENT_ADDRESS | The address and port advertised by the etcd server for client connections. This address is static. | 127.0.0.1:2379
ETCDMetricsAddress | SFX_ETCD_METRICS_ADDRESS | The address and port used to expose Prometheus-style metrics about the embedded etcd server. This address is static. | 127.0.0.1:2381
ClusterDataDir | SFX_CLUSTER_DATA_DIR | A file system path for the etcd server to store data in. NOTE: If running in a container, make sure this is persisted outside the container. | ./etcd-data
UnhealthyMemberTTL | SFX_UNHEALTHY_MEMBER_TTL | The duration after which an etcd member should be removed from the cluster when it is presumed unhealthy. | 5s
RemoveMemberTimeout | SFX_REMOVE_MEMBER_TIMEOUT | The time to wait for the instance to remove itself from the etcd cluster when shutting down the Smart Gateway instance. | 1s
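
The precedence described in this section (config file, then environment variables, then the cluster operation flag) can be pictured with a short Python sketch. It only models the rules stated above and is not the gateway's own code; the function name and its cluster_op_flag parameter are hypothetical.

```python
import os

# Environment variables from the options table, mapped to their config keys.
ENV_OVERRIDES = {
    "SFX_SERVER_NAME": "ServerName",
    "SFX_CLUSTER_OPERATION": "ClusterOperation",
    "SFX_TARGET_CLUSTER_ADDRESSES": "TargetClusterAddresses",
    "SFX_LISTEN_ON_PEER_ADDRESS": "ListenOnPeerAddress",
    "SFX_ADVERTISE_PEER_ADDRESS": "AdvertisePeerAddress",
    "SFX_LISTEN_ON_CLIENT_ADDRESS": "ListenOnClientAddress",
    "SFX_ADVERTISE_CLIENT_ADDRESS": "AdvertiseClientAddress",
    "SFX_ETCD_METRICS_ADDRESS": "ETCDMetricsAddress",
    "SFX_CLUSTER_DATA_DIR": "ClusterDataDir",
    "SFX_UNHEALTHY_MEMBER_TTL": "UnhealthyMemberTTL",
    "SFX_REMOVE_MEMBER_TIMEOUT": "RemoveMemberTimeout",
}


def effective_cluster_settings(file_config, environ=None, cluster_op_flag=None):
    """Return (settings, clustered) after applying overrides in order."""
    environ = os.environ if environ is None else environ
    settings = dict(file_config)
    for env_name, key in ENV_OVERRIDES.items():
        value = environ.get(env_name)
        if not value:
            continue
        if key == "TargetClusterAddresses":
            # The environment variable holds one comma-separated string.
            value = [item.strip() for item in value.split(",") if item.strip()]
        settings[key] = value
    if cluster_op_flag:
        # The command line flag wins over both the file and the environment.
        settings["ClusterOperation"] = cluster_op_flag
    clustered = settings.get("ClusterOperation") in ("seed", "join")
    return settings, clustered
```

When clustered comes back False, the remaining cluster settings are ignored and the gateway runs standalone, as noted above.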

Configuring the clustered Smart Gateway

After reviewing the information in Cluster configuration options above, create a config similar to the one below for each instance in the cluster. Notice that you must set ServerName to something different for each member of the cluster; the name should also be globally unique.

You’ll notice this example is a config file for a cluster of 3 members. They should all be listed in the config file. If you leave out any of the addresses they’ll default to listening on localhost.

If a cluster operation is not specified then the Smart Gateway will ignore all other cluster specific configurations and start in standalone mode.

Note that the cluster mode requires the SignalFx listener to be configured, and for the IngestAddress and ListenRebalanceAddress options to be set in the TraceSample stanza. If you’re inside a container you will probably want to specify the AdvertiseRebalanceAddress so that you can listen on a different host/port combination from what the real machine exposes. If you don’t specify the AdvertiseRebalanceAddress the listener will advertise the ListenRebalanceAddress.

Here is an example of what might go on a non-containerized machine, where the IP of that machine is 10.1.77.44 in a cluster of 3 machines.

{
  "StatsDelay": "10s",
  "LogDir": "/var/log/gateway",
  "ServerName": "smart-gateway1",
  "ClusterName": "prod-apj",
  "ListenOnPeerAddress": "10.1.77.44:2380",
  "AdvertisePeerAddress": "10.1.77.44:2380",
  "ListenOnClientAddress": "10.1.77.44:2379",
  "AdvertiseClientAddress": "10.1.77.44:2379",
  "ETCDMetricsAddress": "10.1.77.44:2381",
  "ClusterDataDir": "/var/config/gateway/etcd",
  "ClusterOperation": "join",
  "UnhealthyMemberTTL": "5s",
  "RemoveMemberTimeout": "1s",
  "TargetClusterAddresses": [
    "10.1.77.44:2379",
    "10.5.88.55:2379",
    "10.0.140.232:2379"
  ],
  "ListenFrom": [
    {
      "Type": "signalfx",
      "ListenAddr": "0.0.0.0:8080"
    }
  ],
  "ForwardTo": [
    {
      "Type": "signalfx",
      "DefaultAuthToken": "PUTYOURTOKENHERE",
      "Name": "smart-gateway-forwarder",
      "TraceSample": {
        "BackupLocation": "/var/config/gateway/data",
        "ListenRebalanceAddress": "0.0.0.0:2382",
        "IngestAddress": "http://10.1.77.44:8080"
      }
    }
  ]
}
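
Since each member needs its own copy of this file with a different ServerName and IP address, the per-member configs can be generated rather than edited by hand. The Python sketch below mirrors the non-containerized example above; the helper names are hypothetical, and UnhealthyMemberTTL and RemoveMemberTimeout are left out on the assumption that their defaults are acceptable.

```python
def member_config(server_name, ip, member_ips, cluster_name, token):
    """Build one member's config, following the non-containerized example."""
    return {
        "StatsDelay": "10s",
        "LogDir": "/var/log/gateway",
        "ServerName": server_name,
        "ClusterName": cluster_name,
        "ListenOnPeerAddress": f"{ip}:2380",
        "AdvertisePeerAddress": f"{ip}:2380",
        "ListenOnClientAddress": f"{ip}:2379",
        "AdvertiseClientAddress": f"{ip}:2379",
        "ETCDMetricsAddress": f"{ip}:2381",
        "ClusterDataDir": "/var/config/gateway/etcd",
        "ClusterOperation": "join",
        "TargetClusterAddresses": [f"{member}:2379" for member in member_ips],
        "ListenFrom": [{"Type": "signalfx", "ListenAddr": "0.0.0.0:8080"}],
        "ForwardTo": [{
            "Type": "signalfx",
            "DefaultAuthToken": token,
            "Name": "smart-gateway-forwarder",
            "TraceSample": {
                "BackupLocation": "/var/config/gateway/data",
                "ListenRebalanceAddress": "0.0.0.0:2382",
                "IngestAddress": f"http://{ip}:8080",
            },
        }],
    }


def cluster_configs(members, cluster_name, token):
    """Build configs for a list of (server_name, ip) pairs, keyed by server name."""
    names = [name for name, _ in members]
    if len(set(names)) != len(names):
        raise ValueError("ServerName must be unique within the cluster")
    ips = [ip for _, ip in members]
    return {
        name: member_config(name, ip, ips, cluster_name, token)
        for name, ip in members
    }
```

Each resulting dictionary can be written out with json.dump to produce that member's gateway.conf.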

Here is an example of what might go onto a containerized machine where the external IP of the machine is still 10.1.77.44 but internal ports are mapped externally with a 2 in front of them.

{
  "StatsDelay": "10s",
  "LogDir": "/var/log/gateway",
  "ServerName": "smart-gateway1",
  "ClusterName": "prod-apj",
  "ListenOnPeerAddress": "0.0.0.0:2380",
  "AdvertisePeerAddress": "10.1.77.44:22380",
  "ListenOnClientAddress": "0.0.0.0:2379",
  "AdvertiseClientAddress": "10.1.77.44:22379",
  "ETCDMetricsAddress": "0.0.0.0:2381",
  "ClusterDataDir": "/var/config/gateway/etcd",
  "ClusterOperation": "join",
  "UnhealthyMemberTTL": "5s",
  "RemoveMemberTimeout": "1s",
  "AdditionalDimensions": {"cluster":"bb"},
  "TargetClusterAddresses": [
    "10.1.77.44:22379",
    "10.5.88.55:22379",
    "10.0.140.232:22379"
  ],
  "ListenFrom": [
    {
      "Type": "signalfx",
      "ListenAddr": "0.0.0.0:8080"
    }
  ],
  "ForwardTo": [
    {
      "Type": "signalfx",
      "DefaultAuthToken": "PUTYOURTOKENHERE",
      "Name": "smart-gateway-forwarder",
      "TraceSample": {
        "BackupLocation": "/var/config/gateway/data",
        "ListenRebalanceAddress": "0.0.0.0:2382",
        "AdvertiseRebalanceAddress": "10.1.77.44:22382",
        "IngestAddress": "http://10.1.77.44:28080"
      }
    }
  ]
}
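
The port convention in the containerized example (each internal port is published on the host with a 2 in front) determines every advertised address. A small Python sketch of that mapping follows; the function names are hypothetical, and the convention itself is only the one used in this example, not a requirement of the Smart Gateway.

```python
def mapped_port(internal_port):
    """Host port for an internal port under the example's convention: prefix a 2."""
    return int("2" + str(internal_port))


def advertised_addresses(host_ip):
    """Advertised addresses for a containerized member reachable at host_ip.

    AdvertiseRebalanceAddress and IngestAddress belong in the forwarder's
    TraceSample section; the other two are top-level settings.
    """
    return {
        "AdvertisePeerAddress": f"{host_ip}:{mapped_port(2380)}",
        "AdvertiseClientAddress": f"{host_ip}:{mapped_port(2379)}",
        "AdvertiseRebalanceAddress": f"{host_ip}:{mapped_port(2382)}",
        "IngestAddress": f"http://{host_ip}:{mapped_port(8080)}",
    }
```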

Start the gateway

To start the first instance in the cluster, you need to override the join operation from the config file and seed the cluster instead, with a command line such as ./smart-gateway --cluster-op seed --configfile /var/config/gateway/gateway.conf. The remaining instances, and any later restarts, only require the --configfile parameter.

After the first node has completely stood up, start up each additional node in the cluster one by one, waiting for each node to completely stand up before starting the next node. If StatsDelay is configured on the gateway, then you can verify that the node joined the cluster by looking at the reported cluster size via the metric proxy.tracing.sampler.clusterSize.

Finally, make sure your deployed SignalFx Smart Agents are configured to send data through your Smart Gateway by configuring their ingestUrl to http://<your-gateway>:8080/ (as per the Smart Agent Configuration documentation).

Stop the gateway

To stop the gateway, send a SIGTERM to the gateway process and wait for the process to complete. You must do this one by one in the cluster.

Restart the gateway

To restart the gateway, stop and start the gateway one by one and ensure that the gateway has stood up before continuing to the next one.
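
The one-by-one procedure can be sketched as a small Python loop. How a node is stopped, started and checked depends on your orchestration, so those steps are passed in as callables; all names here are hypothetical. The stop callable is expected to send SIGTERM and block until the process has exited.

```python
import time


def rolling_restart(nodes, stop, start, is_up,
                    poll_seconds=5.0, timeout_seconds=300.0, sleep=time.sleep):
    """Restart nodes strictly one at a time, waiting for each to come back up."""
    for node in nodes:
        stop(node)   # expected to return only once the process has exited
        start(node)
        waited = 0.0
        while not is_up(node):
            if waited >= timeout_seconds:
                raise TimeoutError(f"{node} did not come up within {timeout_seconds}s")
            sleep(poll_seconds)
            waited += poll_seconds
```

Raising on timeout stops the rollout before the next node is touched, so at most one member is down at any time.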

Metrics emitted by the SignalFx Smart Gateway

In addition to the metrics listed in Traces, spans, and SignalFx metrics, we also emit the metrics listed below. All metrics sent by the Smart Gateway have the dimensions host:ServerName and source:gateway on them.

Metric Name | Additional Dimensions | Description
gateway.commit | samplerCommit: the sampler’s commit SHA, gatewayCommit: the gateway’s commit SHA | Gauge value emits the value 1 and contains the SHAs of the components that make up the Smart Gateway.
gateway.processedTraces | none | Cumulative counter of all traces processed by this gateway or cluster
gateway.processedSpans | none | Cumulative counter of all spans processed by this gateway or cluster
gateway.sentTraces | none | Cumulative counter of all traces that were selected by the Smart Gateway and sent to SignalFx
gateway.sentSpans | none | Cumulative counter of all spans that were selected by the Smart Gateway and sent to SignalFx
dropped_spans | reason: the reason the span was dropped | Cumulative counter of all spans dropped by the Smart Gateway
dropped_traces | reason: the reason the trace was dropped | Cumulative counter of all traces dropped by the Smart Gateway
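
Because these are cumulative counters, rates and ratios come from the difference between two readings. As an illustration, the Python sketch below estimates the fraction of processed traces that were sent to SignalFx between two readings. The function name and the shape of the readings (plain dictionaries keyed by metric name) are assumptions for the example.

```python
def sampled_fraction(earlier, later):
    """Fraction of processed traces sent to SignalFx between two readings.

    Returns None when no traces were processed in the interval or when a
    counter went backwards (for example after a restart).
    """
    processed = later["gateway.processedTraces"] - earlier["gateway.processedTraces"]
    sent = later["gateway.sentTraces"] - earlier["gateway.sentTraces"]
    if processed <= 0 or sent < 0:
        return None
    return sent / processed
```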

Troubleshooting

Etcd is embedded inside of the Smart Gateway and is used for cluster management.

In some circumstances the etcd cluster may become unhealthy if a Smart Gateway is terminated but etcd was unable to cleanly remove the member from the cluster.

In this situation, the remaining nodes should eventually remove the member from etcd.

You can verify this by pointing etcdctl at one of the other cluster members. etcdctl is distributed as a binary executable in the etcd GitHub repository.

Verify that the member that was terminated has been removed by listing the members in the cluster:

$ ./etcdctl --endpoints=http://<client address>:2379 member list
8a052b9d07b922c8: name=gateway-1 peerURLs=http://<host>:<port> clientURLs=http://<host>:<port> isLeader=true
abb83521a48373b5: name=gateway-2 peerURLs=http://<host>:<port> clientURLs=http://<host>:<port> isLeader=false
d936b01b7ddff746: name=gateway-3 peerURLs=http://<host>:<port> clientURLs=http://<host>:<port> isLeader=false

You can also check the health of each cluster member using etcdctl:

$ ./etcdctl --endpoints=http://<client address>:<client port> cluster-health
member 8a052b9d07b922c8 is healthy: got healthy result from http://<host>:<port>
member abb83521a48373b5 is healthy: got healthy result from http://<host>:<port>
member d936b01b7ddff746 is healthy: got healthy result from http://<host>:<port>

If a Smart Gateway is failing to restart because the cluster is “unhealthy,” check that the Smart Gateway is no longer listed as a member of the cluster using the above two commands. If the Smart Gateway still appears in the list of members, try removing the Smart Gateway manually via etcdctl using the Smart Gateway instance’s etcd member ID. The member ID is printed first on each entry in the member list.

$ ./etcdctl member remove 8a052b9d07b922c8
# Member 8a052b9d07b922c8 removed from cluster ef37ad9dc622a7c4

Once the member has been successfully removed from the etcd cluster, try restarting the Smart Gateway instance.
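
When scripting this cleanup, the member ID can be looked up by name from the member list output. The Python sketch below parses lines in the format shown in the member list example above; the function name is hypothetical, and other etcdctl versions or output formats may need a different parser.

```python
def parse_member_list(output):
    """Parse 'etcdctl member list' output into a dict keyed by member name."""
    members = {}
    for line in output.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        # The member ID is everything before the first colon.
        member_id, _, rest = line.partition(":")
        fields = dict(item.split("=", 1) for item in rest.split() if "=" in item)
        members[fields.get("name", "")] = {"id": member_id.strip(), **fields}
    return members
```

With the parsed result, members["gateway-1"]["id"] gives the value to pass to etcdctl member remove.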

High cardinality span identities generated by variables in span names

Many applications are instrumented with variable span names. This is an anti-pattern and will lead to poor performance and sampling accuracy of the Smart Gateway; it creates a very large number of span and trace identities, which results in the Smart Gateway’s inability to construct consistent baselines for those spans and traces while also consuming more memory resources. This pattern will also impact the performance and user experience of the SignalFx APM UI.

We recommend using span tags rather than variable span names. However, if you are unable to modify your application so that it emits fixed span names and carries the variable elements in tags, the Smart Gateway can turn these high cardinality names into tags using a configurable set of replacement rules.

Doing this will make the span identity space much smaller and will allow the Smart Gateway to establish span-level and trace-level baselines accurately, restoring the quality and accuracy of the trace selection algorithm while retaining all the required information on the spans, making them available for analysis by our Outlier Analyzer.
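
To illustrate the idea of turning variable parts of a span name into tags, here is a Python sketch using regular expressions with named groups. It shows the transformation only: the rule format, the example patterns and the function name are all hypothetical and do not represent the Smart Gateway's actual rule syntax.

```python
import re

# Hypothetical rules: each named group becomes a tag, and its text in the
# span name is replaced by a {group} placeholder.
RULES = [
    re.compile(r"^/api/users/(?P<user_id>\d+)/orders/(?P<order_id>\d+)$"),
    re.compile(r"^/api/users/(?P<user_id>\d+)$"),
]


def normalize_span(name, tags, rules=RULES):
    """Return (span_name, tags) with variable name parts moved into tags."""
    for rule in rules:
        match = rule.match(name)
        if match is None:
            continue
        new_tags = dict(tags)
        new_name = name
        # Replace from the rightmost group so earlier offsets stay valid.
        for group in sorted(match.groupdict(), key=match.start, reverse=True):
            start, end = match.span(group)
            new_name = new_name[:start] + "{" + group + "}" + new_name[end:]
            new_tags[group] = match.group(group)
        return new_name, new_tags
    return name, dict(tags)
```

With these example rules, every /api/users/<number> request collapses into the single span name /api/users/{user_id}, while the original value stays available in the user_id tag.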

What’s Next?

Continue to Deploying the SignalFx Smart Agent.