Metrics and alerting

Why monitor?

We need to present a system that is responsive, consistent and reliable to our users. We therefore need to observe what they are seeing, and to be made aware when the services we build are not available.

What to monitor?

In heavily virtualised, containerised, autoscaling and frequently unreliable commodity systems, the number of things we can measure is overwhelming. So we need to focus on the signals that are most useful for ensuring the service is up for our users, and do so without taking up all of our attention.

There are two broad categories of metrics:

  • metrics for seeing what the customer is seeing
  • metrics for seeing what your system is doing.

The first set is the most appropriate to watch and alert on. Watching the second without knowing what to look for is infeasibly time-consuming and, without a formalised approach, can yield very little useful information.

The top metrics to watch are service-specific, but usually fall into:

  • error-rates (HTTP, application, consumer)
  • latency (percentiles)
  • traffic throughput

These categories are the ones that roughly reflect the customer experience.
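As a quick illustration of why the latency bullet is expressed as percentiles rather than averages, here is a self-contained Python sketch over a made-up set of request latencies; a handful of slow requests barely move the mean but dominate the p99:

```python
import statistics

# Hypothetical request latencies in milliseconds: mostly fast, one very slow.
latencies_ms = [42, 45, 47, 50, 51, 53, 55, 58, 61, 2400]

def percentile(samples, pct):
    """Approximate nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    index = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[index]

print("mean:", statistics.mean(latencies_ms))  # ~286 ms, hides the outlier
print("p50:", percentile(latencies_ms, 50))    # 51 ms, the typical experience
print("p99:", percentile(latencies_ms, 99))    # 2400 ms, the worst-case experience
```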

Where to get the data:

  • HTTP edge logs and metrics (usually an API Gateway's total request time and HTTP status codes)
  • Load balancer metrics:
    • status codes
    • latencies
    • surge queues
    • unhealthy hosts
  • AWS Lambda invocation errors
  • Application custom metrics (a publishing sketch follows this list)
  • DynamoDB, Kinesis and SQS usage metrics
  • Mobile Apdex and error reporting (Datadog, New Relic, etc.)
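As a sketch of how the 'application custom metrics' source might be fed, the following uses boto3's CloudWatch put_metric_data call from application code; the namespace, metric names and dimension values are invented for illustration:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_request(latency_ms, succeeded):
    """Publish one latency sample and one error count for a single request."""
    cloudwatch.put_metric_data(
        Namespace="MyService/Checkout",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "RequestLatency",
                "Value": latency_ms,
                "Unit": "Milliseconds",
                "Dimensions": [{"Name": "Environment", "Value": "production"}],
            },
            {
                "MetricName": "Errors",
                "Value": 0 if succeeded else 1,
                "Unit": "Count",
                "Dimensions": [{"Name": "Environment", "Value": "production"}],
            },
        ],
    )
```

Dashboards and alarms can then aggregate these per-request samples into percentiles and error rates.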

What to monitor and alert on:

Some examples (an alarm sketch for the first follows this list):

  • HTTP 5XX errors
  • Lambda invocation errors
  • unhealthy ECS hosts
  • DynamoDB rate-limit throttling
  • latency thresholds
  • messages landing in SQS bad-data and redrive queues
  • failure to see service heartbeats
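As a sketch of what alerting on the first example could look like, the following creates a CloudWatch alarm on an ALB's standard HTTPCode_Target_5XX_Count metric; the load balancer dimension, threshold and SNS topic ARN are placeholders to adapt per service:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Page the on-call channel when the load balancer returns more than 25
# HTTP 5XX responses in each of two consecutive one-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-alb-5xx-errors",  # hypothetical alarm name
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/checkout/0123456789abcdef"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=25,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # quiet periods are not failures
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
)
```

The same call shape works for Lambda invocation errors, DynamoDB throttling or SQS redrive-queue depth by swapping the namespace, metric and dimensions.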

Tracing and 'observability'

Metrics are often lossy, orientated around previously known problems, or operate at percentiles. What happened to a particular customer's request is not typically retrievable from the server side (though client-side trackers may be able to capture it from the exterior; see Raygun, Crashlytics, etc.).

'Observability', the practice of tracing individual requests throughout distributed systems, is an interrelated and emerging field, but well worth looking into:

  • Microservices tend to fan out traffic or push it around asynchronously in queues. Tracking high-value flows with SaaS tooling or AWS X-Ray can help debug specific customer complaints (a minimal X-Ray sketch follows this list).
  • Traffic and error rates are often too difficult to alert on with simple thresholds. More sophisticated anomaly-detection algorithms can help spot outages significantly quicker than simple threshold-based metrics systems, before the phone calls from customers start arriving.
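A minimal sketch of the AWS X-Ray side, using the aws_xray_sdk Python library inside a Lambda handler; the table name, key and event field are placeholders, and it assumes active tracing is enabled on the function:

```python
import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch supported libraries (boto3, requests, ...) so their downstream calls
# appear as subsegments of the current trace.
patch_all()

dynamodb = boto3.resource("dynamodb")
orders_table = dynamodb.Table("orders")  # hypothetical table

@xray_recorder.capture("load_order")
def load_order(order_id):
    # The DynamoDB call is traced automatically because boto3 is patched.
    return orders_table.get_item(Key={"orderId": order_id}).get("Item")

def handler(event, context):
    # With active tracing the Lambda runtime creates the segment; the
    # decorator above only adds a named subsegment to it.
    return load_order(event["orderId"])
```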

Other considerations:
