Metrics and alerting

Why monitor?

We need to present a system that is responsive, consistent and reliable to our users. We therefore need to observe what they are seeing, and to be made aware when the services we build are not available.

What to monitor?

In heavily virtualised, containerised, autoscaling and frequently unreliable commodity systems, the number of things we can measure is overwhelming. So we need to focus on the signals that are most useful for ensuring the service is up for our users, and do so without taking up all of our attention.

There are two broad categories of metrics:

  • metrics for seeing what the customer is seeing
  • metrics for seeing what your system is doing.

The first set is the most appropriate to watch and alert on. Watching the second without knowing what to look for is infeasibly time-consuming and, without a formalised approach, can yield very little useful information.

The top metrics to watch are service-specific, but usually fall into:

  • error-rates (HTTP, application, consumer)
  • latency (percentiles)
  • traffic throughput

These categories are the ones that roughly reflect the customer experience.
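As a quick illustration of why the latency bullet is expressed as percentiles rather than averages, here is a self-contained Python sketch over a made-up set of request latencies; a handful of slow requests barely move the mean but dominate the p99:

```python
import statistics

# Hypothetical request latencies in milliseconds: mostly fast, one very slow.
latencies_ms = [42, 45, 47, 50, 51, 53, 55, 58, 61, 2400]

def percentile(samples, pct):
    """Approximate nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    index = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[index]

print("mean:", statistics.mean(latencies_ms))  # ~286 ms, hides the outlier
print("p50:", percentile(latencies_ms, 50))    # 51 ms, the typical experience
print("p99:", percentile(latencies_ms, 99))    # 2400 ms, the worst-case experience
```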

Where to get the data:

  • HTTP edge logs and metrics (usually an API Gateway's total request time and HTTP status codes)
  • Load balancer metrics:
    • status codes
    • latencies
    • surge queues
    • unhealthy hosts
  • AWS Lambda invocation errors
  • Application custom metrics (a publishing sketch follows this list)
  • DynamoDB, Kinesis and SQS usage metrics
  • Mobile Apdex and error reporting (Datadog, New Relic, etc.)
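As a sketch of how the 'application custom metrics' source might be fed, the following uses boto3's CloudWatch put_metric_data call from application code; the namespace, metric names and dimension values are invented for illustration:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_request(latency_ms, succeeded):
    """Publish one latency sample and one error count for a single request."""
    cloudwatch.put_metric_data(
        Namespace="MyService/Checkout",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "RequestLatency",
                "Value": latency_ms,
                "Unit": "Milliseconds",
                "Dimensions": [{"Name": "Environment", "Value": "production"}],
            },
            {
                "MetricName": "Errors",
                "Value": 0 if succeeded else 1,
                "Unit": "Count",
                "Dimensions": [{"Name": "Environment", "Value": "production"}],
            },
        ],
    )
```

Dashboards and alarms can then aggregate these per-request samples into percentiles and error rates.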

What to monitor and alert on:

Some examples (an alarm sketch for the first follows this list):

  • HTTP 5XX errors
  • Lambda invocation errors
  • unhealthy ECS hosts
  • DynamoDB rate-limit throttling
  • latency thresholds
  • messages landing in SQS bad-data and redrive queues
  • failure to see service heartbeats
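As a sketch of what alerting on the first example could look like, the following creates a CloudWatch alarm on an ALB's standard HTTPCode_Target_5XX_Count metric; the load balancer dimension, threshold and SNS topic ARN are placeholders to adapt per service:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Page the on-call channel when the load balancer returns more than 25
# HTTP 5XX responses in each of two consecutive one-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-alb-5xx-errors",  # hypothetical alarm name
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/checkout/0123456789abcdef"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=25,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # quiet periods are not failures
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
)
```

The same call shape works for Lambda invocation errors, DynamoDB throttling or SQS redrive-queue depth by swapping the namespace, metric and dimensions.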

Tracing and 'observability'

Metrics are often lossy, orientated around previously known problems, or operate at percentiles. What happened to a particular customer's request is not typically retrievable from the server side (though client-side trackers may be able to capture it from the exterior; see Raygun, Crashlytics, etc.).

'Observability', the practice of tracing individual requests throughout distributed systems, is an interrelated and emerging field, but well worth looking into:

  • Microservices tend to fan out traffic or push it around asynchronously in queues. Tracking high-value flows with SaaS tooling or AWS X-Ray can help debug specific customer complaints (a minimal X-Ray sketch follows this list).
  • Traffic and error rates are often too difficult to alert on with simple thresholds. More sophisticated anomaly-detection algorithms can help spot outages significantly quicker than simple threshold-based metrics systems, before the phone calls from customers start arriving.
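A minimal sketch of the AWS X-Ray side, using the aws_xray_sdk Python library inside a Lambda handler; the table name, key and event field are placeholders, and it assumes active tracing is enabled on the function:

```python
import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch supported libraries (boto3, requests, ...) so their downstream calls
# appear as subsegments of the current trace.
patch_all()

dynamodb = boto3.resource("dynamodb")
orders_table = dynamodb.Table("orders")  # hypothetical table

@xray_recorder.capture("load_order")
def load_order(order_id):
    # The DynamoDB call is traced automatically because boto3 is patched.
    return orders_table.get_item(Key={"orderId": order_id}).get("Item")

def handler(event, context):
    # With active tracing the Lambda runtime creates the segment; the
    # decorator above only adds a named subsegment to it.
    return load_order(event["orderId"])
```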

Other considerations:
