Introducing Signal Analog, the troposphere-like Library for Automating Monitoring Resources

This article originally appeared as part of the Nike Engineering publication.

In March 2018, Nike open sourced signal_analog, a library that makes it easy to define, version, and deploy monitoring resources in SignalFx. SignalFx is a platform that provides near real-time monitoring capabilities across a wide range of technologies and platforms in use today. At Nike, we like to monitor our stuff, so we've become heavy users of SignalFx to make our microservice ecosystem more visible to engineers and stakeholders alike.

The concept of observability isn't new, and at Nike we're constantly iterating on how to get better insight into our running services. We've run the gamut of monitoring tools from an in-house Graphite cluster, to New Relic, to Splunk (yes, for monitoring), to SignalFx. All of these tools have their strengths, but in today's article we'll focus on some of the things SignalFx provides that have made it a good complement to our observability suite:

  • A breadth of chart types that makes it simple to monitor the health of services
  • Sophisticated alerting rules (e.g. resource exhaustion) and configurability for straightforward, actionable alerts
  • API parity with the UI, making common actions in the UI automatable
  • A Domain Specific Language (DSL) called SignalFlow that makes representing charts as programs straightforward

What we were missing was a way to store our monitoring configuration as code, a practice that we rely on heavily for our application configuration, service deployments, and infrastructure.

signal_analog was the jigsaw piece to complete our monitoring puzzle. This Python library makes declarative charts, dashboards, and detectors easy to manage as a part of your application's configuration. We use this library today alongside other staples like boto and troposphere to automate our application deployments.

A Brief Primer on SignalFx

For those new to SignalFx, we'll spend some time building a shared understanding of concepts in the platform. We'll then build on this foundation in subsequent sections, introducing the signal_analog library in more detail. If at any point you're interested in going deeper, SignalFx has some great API documentation available for browsing.

SignalFx is a near real-time monitoring system designed to make it easy to uncover patterns in service behavior. To do this, it builds a representation of your application's data into a time series. A time series is a set of data points that are organized chronologically. Time series data can optionally have a number of dimensions - key/value pairs that represent additional information about the time series. Request count is an example of a time series in SignalFx for which we can add additional context to request counts by including dimensions like status code, latency, endpoint uri, and so on. We can have different kinds of time series, depending on the type of data point we're sending. In our request count example, this would be a counter, but we can also have cumulative counters and gauges to represent other kinds of data. We won't get into data ingestion in this article, but be sure to check out the SignalFx documentation for pointers on getting started.
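To make the vocabulary concrete, here is a toy model of a time series with dimensions in plain Python. It is only a sketch for building intuition: the point layout, the `sum_by` helper, and the sample values are made up for illustration and do not reflect SignalFx's ingestion format or API.

```python
from collections import defaultdict

# Toy data points for one metric. Each point carries a timestamp, a value,
# and a dict of dimensions (key/value pairs). All values are made up.
points = [
    {"metric": "request.count", "ts": 1, "value": 3, "dims": {"app": "my_app", "uri": "/users"}},
    {"metric": "request.count", "ts": 1, "value": 5, "dims": {"app": "my_app", "uri": "/orders"}},
    {"metric": "request.count", "ts": 2, "value": 2, "dims": {"app": "my_app", "uri": "/users"}},
    {"metric": "request.count", "ts": 2, "value": 7, "dims": {"app": "other_app", "uri": "/users"}},
]


def sum_by(points, dimension, **filters):
    """Sum values per timestamp, grouped by one dimension, after filtering.

    Hypothetical helper: `filters` maps dimension names to required values.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for p in points:
        if all(p["dims"].get(k) == v for k, v in filters.items()):
            totals[p["dims"][dimension]][p["ts"]] += p["value"]
    # Convert the nested defaultdicts to plain dicts for readability.
    return {group: dict(series) for group, series in totals.items()}


# One series per uri, restricted to a single application.
by_uri = sum_by(points, "uri", app="my_app")
print(by_uri)
```

Each distinct combination of dimension values behaves like its own series, which is what makes filtering and grouping by dimension possible later on.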

Once a time series is receiving consistent data points, we can build charts representing this data graphically. Continuing our request count example, we could create a line chart that plots request count over a period of time. Charts are not limited to a single time series; we can compose them, transform them, and hide irrelevant time series, so the final chart represents exactly what we're looking to observe.

We can also build detectors on top of time series, which is primarily how we build alerting in our systems. Detectors provide flexible alerting strategies that notify us on different channels based on signals like severity.

Finally, we have Dashboards and Dashboard Groups, which are encapsulations for collections of charts. We can logically group charts by purpose, service, or any other organizational scheme. Dashboard-level filtering allows us to filter a set of charts by a particular dimension, which enables us to reuse common dashboards across a set of services in a domain.

We'll now see how SignalFlow layers functionality on top of SignalFx for more nuanced usage.

Advanced SignalFx Usage with SignalFlow

Now that we're familiar with some of the basic concepts in SignalFx, it's time to dig a little deeper into the SignalFlow DSL. The SignalFlow DSL encapsulates the description of time series data and the operations we can perform on them. For example, let's say that we have our request count chart from earlier, and we define a dimension app that specifies the application under observation. To express this in the SignalFlow language, we would write a simple program, like so:

data('request.count', filter=filter('app', 'my_app'))

We can take this simple program and create as many charts as we want with it. SignalFx will handle the creation of charts for us!

Now, let's say we also have a dimension called uri that specifies which endpoint in our application was called. If we want to group our counts by that uri, we might extend our program to read like so:

data('request.count', filter=filter('app', 'my_app')).sum(by='uri')

And finally, if we want to actually see this data represented in a chart, we would call the publish function to make it visible:

data('request.count', filter=filter('app', 'my_app')).sum(by='uri').publish(label='request count by uri')

SignalFlow is very expressive and lets us do some pretty neat things that would be cumbersome in a point-and-click interface. What's missing from SignalFlow is a runtime for the DSL that works outside of the SignalFx platform. This is where signal_analog comes in. It provides a library for writing, composing, and validating SignalFx programs. It doesn't add new functionality on top of the SignalFlow DSL itself, but instead aims to make it as simple as possible to author and share SignalFlow programs across engineering teams.
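One way to picture what such a library does is a set of Python objects that render themselves as SignalFlow program text. The sketch below is a toy stand-in written for this explanation, not the signal_analog implementation: the `Call` class and its methods are hypothetical, and it only covers the three calls used in the examples above.

```python
class Call:
    """A toy node that renders itself as SignalFlow-style text.

    Illustrative only; this is not how signal_analog is implemented.
    """

    def __init__(self, name, *args, **kwargs):
        self.name = name
        self.args = args
        self.kwargs = kwargs
        self.chain = []  # method calls applied after the initial call

    def _then(self, name, **kwargs):
        self.chain.append((name, kwargs))
        return self

    def sum(self, by):
        return self._then("sum", by=by)

    def publish(self, label):
        return self._then("publish", label=label)

    @staticmethod
    def _render_value(value):
        # Nested calls render recursively; other values use repr(), which
        # yields single-quoted text for simple strings.
        return str(value) if isinstance(value, Call) else repr(value)

    def __str__(self):
        parts = [self._render_value(a) for a in self.args]
        parts += ["%s=%s" % (k, self._render_value(v)) for k, v in self.kwargs.items()]
        text = "%s(%s)" % (self.name, ", ".join(parts))
        for name, kwargs in self.chain:
            rendered = ", ".join(
                "%s=%s" % (k, self._render_value(v)) for k, v in kwargs.items()
            )
            text += ".%s(%s)" % (name, rendered)
        return text


program = (
    Call("data", "request.count", filter=Call("filter", "app", "my_app"))
    .sum(by="uri")
    .publish(label="request count by uri")
)
print(program)
```

Because the program is an ordinary Python object until it is rendered, it can be checked, parameterized, and shared like any other code before anything is sent to SignalFx.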

The Basics of signal_analog

The signal_analog library is written in Python, which gives us the range of functionality of the full Python language and plays well with the expressivity of the SignalFlow DSL. This means that we can package, version, and distribute anything from small SignalFlow programs all the way up to pre-built Dashboards that can be readily used by our engineering teams.

If you'd like to play along with the following examples, we encourage you to follow our installation instructions.

Starting Small: a Simple SignalFlow Program

Resources begin life as simple programs. We provide an abstraction for the SignalFlow language in the signal_analog.flow module. For this example, let's monitor the memory utilization for a sample service in AWS:

from signal_analog.flow import Data

data = Data('memory.utilization').publish(label='memory util')

The similarity in syntax to the SignalFlow examples above is intentional. We've tried to keep the library's syntax as close to the DSL as possible so that engineers aren't forced to learn a separate syntax. You'll notice that Data is a Python object. This lets us maintain the same function-calling style as the DSL while adding validations we've found useful when authoring SignalFlow programs. These range from checks for invalid metric names all the way up to program-level validations in the Program object.

Now, let's add a filter to our program so that we're only looking at memory utilization for our application:

from signal_analog.flow import Data, Filter

data = Data('memory.utilization', filter=Filter('app', 'my_app')).publish(label='memory util')

If we wanted to add another filter, let's say to monitor our application in a specific region, we could compose filters with the And combinator:

from signal_analog.flow import Data, Filter
from signal_analog.combinators import And


all_filters = And(Filter('app', 'my_app'), Filter('aws_region', 'us-east-1'))
data = Data('memory.utilization', filter=all_filters).publish(label='memory util')
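As a rough mental model of what a combinator does, the sketch below joins filters into a single SignalFlow-style expression using the DSL's `and` keyword. `ToyFilter` and `ToyAnd` are hypothetical names invented for this illustration; the real classes in signal_analog may validate and render differently, and this sketch does not parenthesize nested combinators.

```python
class ToyFilter:
    """Illustrative filter on one dimension; renders SignalFlow-style text."""

    def __init__(self, dimension, value):
        if not dimension or not value:
            raise ValueError("dimension and value must be non-empty")
        self.dimension = dimension
        self.value = value

    def __str__(self):
        return "filter(%r, %r)" % (self.dimension, self.value)


class ToyAnd:
    """Illustrative combinator that joins two or more operands with 'and'."""

    def __init__(self, *operands):
        if len(operands) < 2:
            raise ValueError("ToyAnd needs at least two operands")
        self.operands = operands

    def __str__(self):
        return " and ".join(str(op) for op in self.operands)


combined = ToyAnd(ToyFilter("app", "my_app"), ToyFilter("aws_region", "us-east-1"))
print(combined)
```

Checking operands in the constructor is one way to surface mistakes while the configuration is being written instead of when the program is submitted.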

Leveling Up with Charts and Dashboards

Now that we have a simple program it's time to create our first chart. We have a number of options available, but for the time being we'll create a straightforward line chart using the TimeSeriesChart object:

from signal_analog.flow import Data, Filter, Program
from signal_analog.combinators import And
from signal_analog.charts import TimeSeriesChart


all_filters = And(Filter('app', 'my_app'), Filter('aws_region', 'us-east-1'))
data = Data('memory.utilization', filter=all_filters).publish(label='memory util')
chart = TimeSeriesChart()\
  .with_name('Memory Utilization for my_app')\
  .with_program(Program(data))

In order to create this chart in the API, we have two options:

  • Create it via the object itself
  • Use the provided CLI builder to generate a command line application to create resources on our behalf

The option you choose depends on your needs. If you want to quickly iterate on a few charts, you can create them via the objects themselves. If you want to maintain your monitoring configuration via an automated tool like Jenkins, the CLI builder gives you a convenient way to do so without having to persist things like your API key in source control.

We show both approaches below:

### Option 1: Create the chart from the object directly.
# chart is defined in the previous code snippet
chart.with_api_token('my-token-here').create()

### Option 2: Create the chart using the CLI builder.
from signal_analog.cli import CliBuilder
if __name__ == '__main__':
  cli = CliBuilder().with_resources(chart).build()
  cli()
# We could then run this script from the command line like so:
# python my_config.py --api-key my-token-here create

Be sure to check out our CLI documentation for more on what you can do with the provided CLI builder.

When the library creates resources in SignalFx, it takes an opinionated approach to managing things like duplicate resources. If there is only one resource with a given name, then the library will update accordingly. Otherwise, it'll either ask the user for a selection or require a resource ID to update. It's rare that we run into resource collisions with this approach, and in the event that a resource is clobbered, we can quickly fix the offending charts and recreate resources from our configurations in source control.
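The decision described above can be sketched as a small function. This is a paraphrase of the behavior for illustration, not the library's code: the `choose_action` name, the shape of `existing`, and the choice to create a resource when no name matches are all assumptions, and the interactive prompt is replaced by an error asking for an explicit ID.

```python
def choose_action(existing, name, resource_id=None):
    """Decide what to do with a resource called `name`.

    `existing` is a list of dicts with 'id' and 'name' keys, standing in for
    whatever the API returns. Returns a tuple of (action, resource id or None).
    """
    if resource_id is not None:
        # An explicit ID always wins, regardless of name matches.
        return ("update", resource_id)

    matches = [r for r in existing if r["name"] == name]
    if not matches:
        # Assumption: no match means a fresh resource is created.
        return ("create", None)
    if len(matches) == 1:
        return ("update", matches[0]["id"])
    # Several resources share the name, so a person or an explicit ID must decide.
    raise LookupError(
        "%d resources named %r; pass resource_id to pick one" % (len(matches), name)
    )


# Made-up resources for demonstration.
existing = [
    {"id": "a1", "name": "Memory Utilization for my_app"},
    {"id": "b2", "name": "Request Count"},
    {"id": "c3", "name": "Request Count"},
]
print(choose_action(existing, "Memory Utilization for my_app"))
print(choose_action(existing, "Brand New Chart"))
print(choose_action(existing, "Request Count", resource_id="c3"))
```

Calling `choose_action(existing, "Request Count")` without an ID raises `LookupError`, which mirrors the case where the library needs a selection before it can update anything.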

Finally, adding a chart to a dashboard is as simple as bringing in the Dashboard object:

from signal_analog.dashboards import Dashboard


dashboard = Dashboard()\
  .with_name('My Application Dashboard')\
  .with_charts(chart)  # Defined in a previous code snippet

So far, we've barely scratched the surface of what's possible with signal_analog and the SignalFx platform. We encourage you to read the official docs as well as our signal_analog README to see what you can do to monitor your applications.

Composing signal_analog Objects to Maximize Re-use

One of the challenges in introducing new technologies to large organizations is making sure that teams are trending in the same direction. The last thing a technologist wants to see is the proliferation of one-off configurations that fail to scale beyond a single engineering team. One of the driving architectural decisions behind signal_analog is to make it easy to author and share configurations, so engineering teams can learn from and use each other's creations like we would a library in our own services.

Let's run through our previous examples and see how we might share them with different teams.

For our memory utilization program, we might choose to parameterize the application name so it's applicable to more teams. It makes sense to use a simple Python function here:

from signal_analog.flow import Data, Filter


def memory_utilization(app_name):
  return Data('memory.utilization', filter=Filter('app', app_name))

If we want to make the chart extensible, we can extend the chart classes themselves like so:

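from signal_analog.flow import Program
from signal_analog.charts import TimeSeriesChart


class MemoryUtilizationChart(TimeSeriesChart):
  def __init__(self, app_name):
    # Initialize the base chart before calling its builder methods
    super(MemoryUtilizationChart, self).__init__()
    # memory_utilization is defined in the previous code snippet
    data = memory_utilization(app_name).publish(label='memory util')
    self.with_name('Memory Utilization for ' + app_name.title())
    self.with_program(Program(data))


# And to use our new chart...
memutil_my_app = MemoryUtilizationChart('my_app')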

We've applied this strategy to a number of charts that are common to our application teams. Everything from request counts to error rates is encapsulated in re-usable charts and dashboards that provide out-of-the-box monitoring for new applications, while still allowing teams to make specific changes for their use case. We've found this pattern to be straightforward for developers to implement, even if they aren't familiar with Python itself. It has allowed us to build greater uniformity in how our various teams monitor their applications.
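The idea of shared defaults with per-team customizations can be sketched without the library at all. In the toy example below, plain dictionaries stand in for chart objects; the `standard_charts` function, the chart ids, and the field names are hypothetical and exist only to show the pattern.

```python
def standard_charts(app_name, overrides=None):
    """Return default chart definitions for `app_name`, keyed by chart id.

    `overrides` maps a chart id to the fields a team wants to change.
    Plain dicts stand in for real chart objects in this sketch.
    """
    charts = {
        "memory": {
            "name": "Memory Utilization for " + app_name,
            "metric": "memory.utilization",
        },
        "requests": {
            "name": "Request Count for " + app_name,
            "metric": "request.count",
        },
    }
    for chart_id, changes in (overrides or {}).items():
        if chart_id not in charts:
            # Fail loudly on typos instead of silently ignoring an override.
            raise KeyError("unknown chart id: " + chart_id)
        charts[chart_id].update(changes)
    return charts


# A new application gets the defaults as they are.
defaults = standard_charts("my_app")

# Another team keeps the defaults but renames one chart.
custom = standard_charts(
    "my_app", overrides={"memory": {"name": "Heap Usage for my_app"}}
)
print(custom["memory"]["name"])
```

Keeping the defaults in one shared function means a fix to a common chart reaches every team that uses it, while overrides stay small and visible in each team's own configuration.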

What's Next for signal_analog

We believe signal_analog is a great way for teams to build shareable configurations with SignalFx today, but we aren't quite done yet. We're continuing to roll out this experience to engineering teams across the organization while folding any lessons learned into signal_analog and our internal patterns project.

Beyond simple fixes, we also intend to keep pace with SignalFx additions to the API and SignalFlow DSL. As time allows, we'll continue to add validations that make it harder to send "bad programs" to SignalFx with error messages that approach those that exist in the Elm language today.

Getting Started with signal_analog

To get started using the library today, check out our documentation as well as the upstream SignalFx documentation. Together they will get you up and running quickly.

If you have feature requests or are interested in getting involved, we curate issues publicly via GitHub issues here.