
Instrumenting an application

First, a caveat—there are a lot of programming languages and monitoring systems. We will be talking about various monitoring systems later in the chapter, and there are libraries for all sorts of languages and systems. So, just because I am providing examples here with specific languages and libraries, it does not mean that you cannot do something very similar with your language and monitoring system of choice.

For the first example, we will use Ruby and StatsD. Ruby is a popular scripting language and tends to be what I use when I want to build something quickly. Also, some very large websites use Ruby, including GitHub, Spotify, and Hulu. StatsD is a monitoring system from Etsy. It is open source and used by many companies including Kickstarter and Hillary for America.

I have commented on this simple application as much as possible. However, if you need more documentation than my comments, see the references section.

Sinatra is a simple web framework. It creates a domain-specific language inside of Ruby for responding to web requests:

require "sinatra"

One of the most popular Ruby StatsD libraries is installed by running gem install statsd-ruby. require is Ruby's import function:

require "statsd"

Logger is part of the Ruby standard library and lets us write out to the console with a predefined format:

require "logger"

Here, we configure our StatsD library to write a debug message to standard output whenever it sends a metric to the StatsD server. This is useful for debugging, and it also lets us see what our example application is doing:

Statsd.logger = Logger.new(STDOUT)

This creates an instance of the Statsd class to talk to a StatsD server running on port 9125 on the local machine. StatsD traditionally runs over UDP instead of TCP. Because of that, the StatsD service does not even need to be running for us to test our code: UDP does not retry on failure, so the requests will simply go into the void. It would be better if a server were running, but because we are logging every request to standard output with the preceding line, we can still see what is happening without having set up a server to receive the data. We talk about the UDP and TCP protocols in depth in Chapter 9, Networking Foundations.

$statsd = Statsd.new 'localhost', 9125

This is a part of the DSL given to us by Sinatra. It says that if the HTTP server receives a request to / with the GET method (which is just a normal page load), we should run the code inside the block. The block is defined between the do and end keywords.

get "/" do

This StatsD function times the length of the block that is passed to it and, when the block finishes, sends that time to StatsD.

  $statsd.time "request.time" do

This increments the request counter by one. Note that the original StatsD does not natively support tags or any other way to annotate a metric. Some other StatsD implementations do, but it is not supported by default. As such, this counter does not get any of the useful extra data mentioned earlier, such as path, method, or status code.

    $statsd.increment "request.count"

In Ruby, the last expression in a block is its return value. In this case, "hello world" is returned from the timing block, then returned from the Sinatra block, and Sinatra sends the returned string to the user as the body of the HTTP response.

    "hello world"
  end
end

This is another part of Sinatra's DSL. It fires whenever Sinatra cannot match the URL a user is looking for. So, if they visit any URL besides / on this server, this block will run. We want to know about all types of requests, so we increment request.count here as well as recording request.time. The new thing we do here, though, is increment request.error.

not_found do
  $statsd.time "request.time" do
    $statsd.increment "request.count"
    $statsd.increment "request.error"

    "This is not found."
  end
end

If any error other than a not found error happens, this block will run. It looks identical to the not_found block, except for the return message, because both blocks do the same thing for slightly different error conditions.

error do
  $statsd.time "request.time" do
    $statsd.increment "request.count"
    $statsd.increment "request.error"

    "An error occurred!"
  end
end

While this example is somewhat straightforward, you may be thinking it is annoying to have to write code like this for each route in your application. Also note that this example is being overly explicit. Often, you would want to use a library that gives you these metrics for free, or to use a service like NGINX, in between the user and your application, that records some request metrics. However, the main takeaway from this code sample is the two monitoring function calls:

  • increment
  • time

With these two functions, you can do a lot. With time you could measure how long it takes to talk to a dependency. With increment you could do anything you want, from the number of requests received to the number of times a user performs an action. StatsD also supports gauges for numbers you want to take a snapshot of, which is similar to a timing, but without the unit of time. We will be talking later about other things you might want to monitor and how to talk to your team about what is important for them to monitor.
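
To give a feel for what a gauge looks like in code, here is a minimal, hypothetical sketch. It jumps ahead to the Prometheus Go client used in the next example, since gauges behave similarly across monitoring libraries; the metric name is invented purely for illustration.

package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
  // A gauge holds a value that can move up and down, which you sample
  // rather than accumulate. The metric name here is made up.
  queueDepth := prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "job_queue_depth",
    Help: "Current number of jobs waiting in the queue.",
  })
  prometheus.MustRegister(queueDepth)

  queueDepth.Set(42) // take a snapshot of the current value
  queueDepth.Inc()   // unlike counters, gauges can also decrease
  queueDepth.Dec()
}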

The next example is written in Go, using the Prometheus monitoring library. Prometheus is a relatively new and increasingly popular monitoring system, written by the folks at SoundCloud. It is very similar to Google's internal monitoring system, Borgmon. While StatsD is a push-based monitoring system, Prometheus is a pull-based system. As for the programming language this example is written in, it was chosen because Go has well-written examples and opinionated tools that make development and processes easier.

Like the previous example, for more details please check out the references section, which contains links to the official documentation.

Go files always start with a package declaration. main is the package for creating an executable.

package main

import (

The import section is how Go imports packages. log is part of the Go standard library. It lets us write out messages prepended with a timestamp. net/http is also part of the Go standard library and includes functions for building an HTTP server. Lastly, time is part of the Go standard library and includes functions for dealing with time. We will use it to determine how long it takes for something to run.

  "log"
  "net/http"
  "time"

The next two libraries are not part of the standard library. github.com/prometheus/client_golang/prometheus is the third-party library for building Prometheus metrics. Although it looks like a URL, do not expect to find a web page there. Go structures import paths for external packages like URLs: the beginning of the path is the domain and path of the source repository, and the remaining path segments are directories inside that repository. To install the package for use, run go get github.com/prometheus/client_golang/prometheus. Some domains, such as gopkg.in, publish metadata redirecting to another source repository, which the go get tool follows to download the code when you install the package.

Unlike StatsD, an application instrumented with Prometheus publishes its metrics over HTTP from the server it runs on, and the Prometheus server pulls them from there. promhttp is the second third-party library in this import; its job is to make running a Prometheus metrics endpoint easy for us. It is also part of the Prometheus Go library.

  "github.com/prometheus/client_golang/prometheus"
  "github.com/prometheus/client_golang/prometheus/promhttp"
)

Prometheus requires us to declare metrics before we can write to them. First, we define the request count metric we will be incrementing, then a summary that will record request durations. All Prometheus metrics require a name and a help description to make them more readable. We are also providing an optional list of label names, path and method; every time we record to one of these metrics, we will need to supply a value for each label.

var (
  httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
      Name: "http_request_count_total",
      Help: "HTTP request counts",
    },
    []string{"path", "method"},
  )
  httpDurations = prometheus.NewSummaryVec(
    prometheus.SummaryOpts{
      Name: "http_request_durations_microseconds",
      Help: "HTTP request latency distributions.",
      Objectives: map[float64]float64{},
    },
    []string{"path", "method"},
  )
)

Go runs the init function at startup, before main. Here, we register our metrics with the Prometheus library, which checks that they are properly defined; if they are not, the application dies in a panic. This "Must" prefix is a common Go pattern for functions that panic instead of returning an error.

func init() {
  prometheus.MustRegister(httpRequests)
  prometheus.MustRegister(httpDurations)
}

This is a very small middleware, which will wrap any handler passed to it, increment the request count, and record duration. For brevity and simplicity, we aren't recording errors.

func MetricHandler(handlerFunc func(http.ResponseWriter, *http.Request)) http.HandlerFunc {
    

This function takes in a function and returns a new function that wraps it. When the wrapper is called, we start by recording the current time. Then, we call the function we are wrapping, which does all of the actual work for the request; we are essentially building a generic wrapper that yields to the function we passed in. The writer variable is what actually receives the method calls that modify the HTTP response Go is building. After the wrapped function returns, we calculate the elapsed time in microseconds and submit that as the duration metric. We then increment the request count, with path and method labels on both metrics:

  return http.HandlerFunc(func(writer http.ResponseWriter, request *http.Request) {
    start := time.Now()

    handlerFunc(writer, request)

    elapsed := time.Since(start)
    usElapsed := elapsed / time.Microsecond

    httpDurations.WithLabelValues(request.URL.Path, request.Method).Observe(float64(usElapsed))
    httpRequests.WithLabelValues(request.URL.Path, request.Method).Inc()
  })
}

func main() {

This is a handler function, saved to a variable, that writes the text "hello world" to the response whenever it is given a request:

  helloWorld := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "text/plain")
    w.Write([]byte("hello world"))
  })

This says that all requests to / go first to our metric handler, to which we have passed our helloWorld function. MetricHandler will then call helloWorld where it says handlerFunc(writer, request):

  http.Handle("/", MetricHandler(helloWorld))

The Handler function provides a default handler to expose metrics via an HTTP server. /metrics is the normal endpoint for that. If you visit /metrics on this server, you will see the current value of all of the metrics this server exports. It is also the path that Prometheus will scrape to store metrics in its datastore. You configure Prometheus to scrape paths in its configuration, so you can actually have this at whatever path you want: /metrics is just a common default.

  http.Handle("/metrics", promhttp.Handler())

This starts the actual server on port 8080, and if it dies, logs the final status to the console:

  log.Fatal(http.ListenAndServe(":8080", nil))
}

The structure of this code is very different to the previous example because it is a different language and monitoring system, but the general philosophy remains the same. Just like with StatsD, you create data sources and, over time, insert data points at timestamps. Whatever you want to measure, there are often many ways to record it, which are available in a multitude of programming languages.

Note that in both of these examples we are ignoring logging. There are lots of ways to log information, including writing to a plain text file, writing JSON objects to standard output, or even writing to a database. Sometimes, this is called event-driven monitoring. It is just another set of tools and can be used to acquire similar data, just in a different way.
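
As an illustration, a structured log event written to standard output might look like this minimal Go sketch; the event and its field names are invented for illustration.

package main

import (
  "encoding/json"
  "os"
  "time"
)

// LogEvent is one structured, event-style log line: a single JSON object
// written to standard output. The fields here are made up for illustration.
type LogEvent struct {
  Time   time.Time `json:"time"`
  Level  string    `json:"level"`
  Event  string    `json:"event"`
  Path   string    `json:"path"`
  Status int       `json:"status"`
}

func main() {
  enc := json.NewEncoder(os.Stdout)
  enc.Encode(LogEvent{
    Time:   time.Now(),
    Level:  "info",
    Event:  "request_finished",
    Path:   "/",
    Status: 200,
  })
}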

Now that you have a view of the starting point of how to record metrics, we should pivot back to the theoretical. What should we monitor?

What should we measure?

Here is one of my favorite SRE interview questions: "What is your monitoring philosophy?" I find this question fascinating because it tends to spark discussion. A monitoring philosophy is how you decide what to monitor and when. Some people believe in the monitor everything philosophy. They think that if something happens, or changes, a counter should be incremented or a log line produced (or both). Another philosophy is a more minimal wait and see approach. In this philosophy, if in a postmortem the team wishes they had a piece of data, then it should be instrumented, monitored, and stored.

Note

Postmortems are retrospectives after incidents. We will talk about them in depth in Chapter 4, Postmortems. You can also read a paragraph about them in Chapter 1, Introduction.

The only response to this question that I have ever rejected was from a friend who was looking to get started in software and claimed his monitoring philosophy was that of On Kawara's I Got Up art piece. Kawara is a conceptual artist who, throughout the 1960s and 1970s, mailed postcards from the cities he awoke in each morning to friends and random people, containing a timestamp of when he woke up that day, his current address, and nothing else. Because the recipients were semi-random, there was no way to view when he woke up over time. Museums now hold subsets of this collection, but there are significant gaps. It is fascinating to look at this project as an early form of human logging, but I cannot imagine a more frustrating scenario than having to look at multiple data stores to view a single metric over an irregular period. I am pretty sure my friend was joking, but the idea is still interesting to think about. It is a good reminder that monitoring is not unique to computers.

Note

Kawara's art is wonderful for envisioning monitoring pre-computers. He also created works of art around the walks he took each day, drew on maps, made lists of who he met for years, and sent telegrams to people telling them he was still alive.

Philosophies aside, if you are new to monitoring a system, it can be hard to decide on what you want to monitor. You may have storage or cost concerns preventing you from monitoring everything, and your service may be too new to have postmortems that provide you with valuable insights on what you are not monitoring.

My recommendations are probably predictable given the previous section and its examples. I always say that it is best to start with the basics. If your service is a web server, then start by making sure error counts, request counts, and request durations are collected. If your service is some sort of data pipeline, batch service, or cron job, then a better starting place is job invocations, job durations, and job successes. If you need a cute little acronym or abbreviation, I like ERD for web servers and IDS for background services.

After you lay those foundations, there are lots of areas you can start expanding. Remember that your goal is to validate that your service is doing what you expect it to. I tend to start recording business logic things, like how many records there are in the database, metrics around connecting to our dependencies, how many types of actions are happening, and so on. After that, I focus on capacity.

Note

We talk about capacity metrics in greater depth in Chapter 6, Capacity Planning.

Capacity is often a recording of the resources your server is using—metrics such as hard drive storage, network bandwidth, CPU, and memory. To provide an example, let us start with an arbitrary image upload service. Maybe it is the server that receives all of the images that users upload to Instagram. Maybe it is the code in your blog that handles uploading images for posts. Whatever the case, this service has a single URL that you post image uploads to. It then writes the images to the local disk, inspects them, and stores the metadata to a database.

The service then uploads the images to a cloud storage service, where we put all of our user images.

Diagram of the example image upload service

So, what would you monitor? Take a second and try to answer these questions—if a developer introduced a bug, what would you be concerned about? What metrics could tell you that this server is doing its job?

I would monitor the following things (a sketch of how some of these might be declared as metrics follows the list):

  • Request errors, request duration, and request count: This is our ERD for the server, since we are a web server. It is at our very baseline of are things working?
  • Bytes uploaded: This will account for how much bandwidth we are using. Since some of our requests could be very large, we want to know if we start seeing a spike in how much data is coming into our server, because not every request will be the same.
  • Image size in bytes: This provides the same information but from a different angle. After we have received the bytes and processed them, how big are the final images? This way, if, for example, our upload client changes (such as a new iPhone launches with a more powerful camera, or we change our upload code to compress before sending), we can see the effect versus our network bandwidth.
  • Images uploaded: This metric will only be useful if we see significant changes. If the number of images uploaded changes by an order of magnitude, it might explain why things fail in the event of an outage.
  • Count of image metadata in database: We keep a count of this to compare it to the image count uploaded, to make sure we are not dropping image data on the floor if the database is down.
  • Count of images in storage: This makes sure that images are getting to the data source that we eventually serve images from. If they are not, maybe our cloud storage is down or there is a network issue.
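
As a rough sketch of how some of these might be declared, sticking with the Prometheus Go client from the earlier example, the following could extend that program's var and init blocks. All of the metric names are invented for illustration, and the ERD metrics are already covered by the httpRequests and httpDurations metrics defined earlier.

var (
  // Bytes uploaded: total bandwidth coming into the upload endpoint.
  bytesUploaded = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "image_upload_bytes_total",
    Help: "Total bytes received from image uploads.",
  })
  // Image size in bytes: a distribution of final, processed image sizes.
  imageSizeBytes = prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "image_size_bytes",
    Help:    "Size of processed images, in bytes.",
    Buckets: prometheus.ExponentialBuckets(1024, 4, 10),
  })
  // Images uploaded: a simple count of successful uploads.
  imagesUploaded = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "images_uploaded_total",
    Help: "Number of images successfully uploaded.",
  })
  // Counts of metadata rows and stored images, sampled periodically.
  imageMetadataRows = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "image_metadata_rows",
    Help: "Count of image metadata rows in the database.",
  })
  imagesInStorage = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "images_in_cloud_storage",
    Help: "Count of images in cloud storage.",
  })
)

func init() {
  prometheus.MustRegister(bytesUploaded, imageSizeBytes, imagesUploaded,
    imageMetadataRows, imagesInStorage)
}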

This is not an exhaustive list! You may have listed other metrics, but the key is just to start somewhere. Just because you write monitoring today, it does not mean you are done. This is something you will want to evolve. As your service grows, you will discover what metrics you look at and which ones are useless. As your team grows, some other developer can, and should, add metrics they care about. I like to review everything we are monitoring once a year with the team, to make sure we are consistent with how we monitor. Are metric names consistent? Do the metrics have units? Are we collecting and storing our metrics consistently?

Note

We haven't mentioned it yet, but it is worth at least a shout out that, when thinking about what to monitor, there are often two ways to look at the problem—black-box monitoring and white-box monitoring.

Black-box monitoring assumes that your monitoring tool knows nothing or very little about your application. Often, it is a probe or regular series of requests to check whether things are working the way you expect.

White-box monitoring is the instrumentation of your code. It knows everything about how things work because it is coming from inside of the application. If you were to compare this to something like the electronics of an airplane, white-box monitoring is the gauges in the cockpit. Black-box monitoring is the updates the radio towers send and receive as planes fly over them.
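
To make the black-box idea concrete, a very small probe might look something like this Go sketch. The target URL is a placeholder, and a real probe would report its result to a monitoring system rather than just logging it.

package main

import (
  "log"
  "net/http"
  "time"
)

func main() {
  // A minimal black-box probe: it knows nothing about the application's
  // internals, it just checks that the service answers from the outside.
  // The URL is a placeholder for wherever your service lives.
  target := "http://localhost:8080/"

  client := &http.Client{Timeout: 5 * time.Second}

  start := time.Now()
  resp, err := client.Get(target)
  elapsed := time.Since(start)

  if err != nil {
    log.Printf("probe failed: %v", err)
    return
  }
  defer resp.Body.Close()

  // A real probe would push these values to a monitoring system.
  log.Printf("probe status=%d duration=%s", resp.StatusCode, elapsed)
}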

Once you have started to monitor things, you can start working with your product and business team to make sure that your assumptions about system health match theirs.

A short introduction to SLIs, SLOs, and error budgets

At some point, you are going to discuss whether or not it is safe to deploy an update to your software. Change is the thing that improves our software, but it also introduces the possibility of an outage. So, at times of great importance, how do you convince someone in the business that it is safe to make a software change? Can you push a code change on Black Friday, or Singles' Day, to an e-commerce website and know you will not lose the business millions of dollars? We will talk about this in more detail in Chapter 5, Testing and Releasing, but a good starting place is figuring out what your team views as its most important metric.

Service levels

A Service Level Indicator (SLI) is a candidate for the business's most important metric. For websites, a common SLI is the percentage of requests responded to healthily. For other types of services, an SLI can be a performance indicator, such as the percentage of search results returned in under 100 milliseconds. To be clear, an SLI is the metric behind such a goal. Request count would be the SLI behind the percentage of requests responded to healthily. Search result request duration would be the SLI behind the percentage of search results returned in under 100 milliseconds. For background services, you could use something like the percentage of jobs run on time, or the percentage of jobs that completed successfully. For some sites, the only metric recorded is the response to a GET request sent every five minutes by a service like Pingdom. That metric of did we even respond to the request? can be used as a rough approximation behind a goal like the percentage of requests we responded to.
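
As a toy illustration of the arithmetic behind a goal like the percentage of requests responded to healthily, here is a minimal Go sketch. In practice, your monitoring system does this math for you over whatever time window you care about, using counters like the ones instrumented earlier in the chapter; the example numbers are invented.

package main

import "fmt"

// availability turns raw SLI counts into a percentage. Your monitoring
// system normally does this math for you over a chosen time window.
func availability(totalRequests, errorRequests float64) float64 {
  if totalRequests == 0 {
    return 100 // no traffic means nothing has failed yet
  }
  return 100 * (totalRequests - errorRequests) / totalRequests
}

func main() {
  // 3 failed requests out of 10,000 over some window.
  fmt.Printf("%.2f%%\n", availability(10000, 3)) // prints 99.97%
}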

Note

A request sent to your service that is identical every time and sent at a regular interval is often called a health check. Many load balancers use this to determine whether a server should continue receiving traffic. We talk more about the basics of load balancers in Chapter 10, Linux and Cloud Foundations.
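
In the style of the Go example from earlier in the chapter, a minimal health check endpoint might be added to main like this sketch. The path /healthz is just a common convention, not something load balancers require, and a more thorough check could verify dependencies such as the database before answering.

  http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    // Answering 200 tells the load balancer this instance is healthy.
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("ok"))
  })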

A Service Level Objective (SLO) is a goal built around an SLI. It is usually a percentage, is tied to a time period, and is typically measured in a number of nines. Example time periods include the last thirty days, the last 24 hours, or the current financial quarter. When someone says they have a certain number of nines of uptime, they mean the percentage of time their service is available. Without an attached time period, it is a relatively useless statement, so I usually assume people mean a rolling cumulative average over the past thirty days.

The following are some examples of nines:

  • 90% (one nine of uptime): Meaning you were down for 10% of the period, or three days out of the last thirty.
  • 99% (two nines of uptime): Meaning 1%, or 7.2 hours, of downtime over the last thirty days.
  • 99.9% (three nines of uptime): Meaning 0.1%, or 43.2 minutes, of downtime.
  • 99.95% (three and a half nines of uptime): Meaning 0.05%, or 21.6 minutes, of downtime.
  • 99.99% (four nines of uptime): Meaning 0.01%, or 4.32 minutes, of downtime.
  • 99.999% (five nines of uptime): Meaning 0.001%, or about 26 seconds, of downtime.

In our modern world of distributed systems, with cheap hardware and constant iteration, rarely do systems operate with reliability better than five nines. Some examples of things that are built with better reliability than this are satellites, airplanes, and missile guidance systems. The point of counting the number of nines, though, is that you usually want to think of things in terms of orders of magnitude.
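
If you want to reproduce the downtime figures in the list above, the arithmetic is just the allowed failure fraction multiplied by the time window. Here is a minimal Go sketch:

package main

import (
  "fmt"
  "time"
)

// allowedDowntime returns how long a service can be down over the given
// window while still meeting the given SLO percentage.
func allowedDowntime(sloPercent float64, window time.Duration) time.Duration {
  failureFraction := 1 - sloPercent/100
  return time.Duration(failureFraction * float64(window)).Round(time.Second)
}

func main() {
  window := 30 * 24 * time.Hour // a rolling thirty-day window
  for _, slo := range []float64{90, 99, 99.9, 99.95, 99.99, 99.999} {
    fmt.Printf("%.3f%% -> %s of allowed downtime\n", slo, allowedDowntime(slo, window))
  }
}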

Note

An order of magnitude is a factor of ten. We tend to think in orders of magnitude because it is at those jumps in scale that we start running into issues. For example, if you can do 10 of a thing, you can probably do 40, but it is notably different to do 100 or 1000 of that same thing.

Finally, a Service Level Agreement (SLA) is a legal agreement a business publishes around an SLO. Often, it provides monetary compensation to customers for the time a business is outside of their defined SLO. This is most common in Software as a Service (SaaS) companies, but could exist in other places as well.

Companies that publish SLAs often also have internal SLOs that are tighter and more restrictive than the published SLA, so the company can take on some risk without sitting right at the edge of having to pay customers every time it slips past its internal target.

Error budgets

Once you have a rough idea of your SLO, and you have measured it for a bit, you can use it to make decisions about doing things that are risky. An example of this is something I worked on at First Look Media. We had no significant downtime for a few months and had been sitting at 100% uptime for the previous 30 days. Our SLO was that 99.9% of HTTP GET requests from an external polling service, sent at a one-minute interval, had to return the status code 200. So, we did something risky that we had been putting off because the possibility of downtime was high: we reconfigured our services to share clusters, instead of each having its own cluster. In this case, a cluster was a pool of shared servers and each service was a collection of Docker containers scheduled with Amazon's Elastic Container Service. We did this one service at a time, and only had downtime on one service, but we felt comfortable doing it because we had an agreement with our business team as to what was acceptable downtime. The inverse also held: in months when we had many outages and did not meet our agreed upon SLO, we stopped deploying.
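
To put a number on an error budget, the arithmetic is again simple: the budget is the downtime your SLO allows for the window, and you spend it with every outage. Here is a minimal Go sketch with invented numbers, repeating the allowedDowntime helper from the earlier sketch so the example stands alone.

package main

import (
  "fmt"
  "time"
)

// allowedDowntime is the same helper as in the earlier sketch: the downtime
// an SLO percentage permits over a given window.
func allowedDowntime(sloPercent float64, window time.Duration) time.Duration {
  return time.Duration((1 - sloPercent/100) * float64(window)).Round(time.Second)
}

func main() {
  window := 30 * 24 * time.Hour
  budget := allowedDowntime(99.9, window) // 43m12s for a thirty-day window
  spent := 12 * time.Minute               // downtime observed so far this window

  fmt.Printf("budget: %s, spent: %s, remaining: %s\n", budget, spent, budget-spent)
}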

Note

This is a quick overview of deployments. We go into more details on deploying, releasing, and release agreements in Chapter 5, Testing and Releasing.

Not every service needs SLOs and not every organization needs error budgets, but they can be useful tools for working with people when trying to decide when it is safe to take risks. Sometimes, people will say, "Trust but verify," or "Have data to back your claims." When taking risks that could affect your business, this is important. Just like you want to know if and why your human resources team is going to change your health insurance, your coworkers want to know that you are only taking risks in a way that the business can handle. It is not confidence inspiring if you walk into a room with no data and say, "Yeah, everything will be fine." It's much better if you can show up and say, "You can trust us because we have made 100 deploys in the past month with no downtime, and if this does fail, we can roll back and will still be within our agreed SLO. If we do not meet our SLO this month due to a failure, we will stop deploying until things have stabilized and we are back under our agreement."

Services without SLOs run the risk of letting customers define what an acceptable level of performance is. That being said, even a published SLO does not guarantee that your users will see your service's availability as acceptable. For example, Amazon Web Services has an SLA for most services it sells, but customers are often frustrated with the downtime they see. This may be illogical, as customers are using a product that they have been told will not perform at the level they want, and they continue paying for it. However, unless Amazon sees a significant change in user behavior, it has no incentive to update its SLA (and the underlying SLO), because the more restrictive the SLO, the higher the chance it will have to give customers money back. If Amazon were to see a significant number of users leaving, that might change. If you are seeing your revenue decline due to an SLA, or if your SLO is so loose that customers are building workarounds to deal with your differing expectations, it might be time to either define an SLO, if your service lacks one, or update your existing one to meet your customers' needs. An SLO can then help you to prioritize work.

Note

In Chapter 7, Building Tools we will talk more about the prioritization of work. SLOs should never be the only reason work is prioritized. If you are chasing just SLOs, your team may find themselves very disconnected from the business and its needs. If this happens, take a step back and work with the team and management to better define your place in the organization.

As we stated at the beginning of the chapter, monitoring is used to show change over time. People find it useful for proving a history of consistency. Sadly, due to the constraints of technology, we cannot easily store everything forever without spending a lot of money or resources. So, let's talk about how we collect and store this information.