Why monitor?
We need to present a system that is responsive, consistent and reliable to our users. We therefore need to observe what they are seeing, and be made aware when the services we build are not available.
In heavily virtualised, containerised, autoscaling and frequently unreliable commodity systems, the number of things we can measure is overwhelming. We need to focus on the signals that are most useful for ensuring the service is up for our users, and do so without taking up all of our attention.
There are two broad categories of metrics:
- metrics for seeing what the customer is seeing
- metrics for seeing what your system is doing.
The first set is the most appropriate to watch and alert on. Watching the second without knowing what to look for is infeasibly time-consuming and, without a formalised approach, yields very little useful information.
The top metrics to watch are service-specific, but usually fall into the following (a CloudWatch alarm sketch for the first two appears after the list):
- error-rates (HTTP, application, consumer)
- latency (percentiles)
- traffic throughput
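For the first two of these, a minimal sketch (using boto3 and CloudWatch) of alarming directly on load balancer metrics might look like the following. The load balancer identifier, thresholds and SNS topic below are placeholders to swap for your own:

```python
# A sketch of alerting on what the customer sees: 5XX counts and p99 latency
# at the application load balancer. All names/ARNs below are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")
ALARM_TOPIC = "arn:aws:sns:eu-west-1:123456789012:on-call"  # hypothetical SNS topic

# Page when the ALB serves more than ten 5XX responses in a five-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="api-5xx-errors",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-api/abc123"}],  # placeholder
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[ALARM_TOPIC],
)

# Page when p99 response time stays above two seconds for three consecutive periods.
cloudwatch.put_metric_alarm(
    AlarmName="api-p99-latency",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-api/abc123"}],  # placeholder
    ExtendedStatistic="p99",
    Period=300,
    EvaluationPeriods=3,
    Threshold=2.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[ALARM_TOPIC],
)
```

Tune the thresholds and evaluation periods to your own traffic: an alarm that fires constantly is an alarm that gets ignored.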
These error-rate, latency and throughput signals roughly reflect what the customer experiences, and typically come from:
- HTTP edge logs and metrics (usually an API Gateway's total request time and HTTP status codes)
- Load balancer metrics:
  - status codes
  - latencies
  - surge queues
  - unhealthy hosts
- AWS Lambda invocation errors
- Application custom metrics (a sketch of emitting one follows this list)
- DynamoDB, Kinesis and SQS usage metrics
- Mobile Apdex and error reporting (Datadog, New Relic etc)
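For the application custom metrics mentioned above, a minimal sketch of publishing your own datapoints with boto3 could look like this; the namespace, metric names and dimensions are illustrative rather than prescriptive:

```python
# A sketch of emitting application custom metrics to CloudWatch.
# Namespace, metric names and dimensions are made-up examples.
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_checkout(duration_seconds: float, succeeded: bool) -> None:
    """Publish one latency datapoint and one error datapoint per checkout."""
    cloudwatch.put_metric_data(
        Namespace="MyApp/Checkout",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "CheckoutDuration",
                "Value": duration_seconds,
                "Unit": "Seconds",
                "Dimensions": [{"Name": "Environment", "Value": "production"}],
            },
            {
                "MetricName": "CheckoutErrors",
                "Value": 0 if succeeded else 1,
                "Unit": "Count",
                "Dimensions": [{"Name": "Environment", "Value": "production"}],
            },
        ],
    )
```

These custom datapoints then feed the same alerting pipeline as the built-in AWS metrics.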
Some examples:
- HTTP 5XX errors
- Lambda invocation errors
- unhealthy ECS hosts
- DynamoDB rate-limit throttling
- latency thresholds
- messages landing in SQS bad-data and redrive queues
- failure to see service heartbeats (see the heartbeat alarm sketch below)
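As a sketch of that last example, the missing-heartbeat alert: the service publishes a heartbeat metric every minute, and the alarm treats silence as a breach. The names, namespace and ARN are placeholders:

```python
# A sketch of a heartbeat alarm. The crucial part is TreatMissingData="breaching":
# if the service stops emitting the metric, the absence itself pages us.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="payments-worker-heartbeat-missing",
    Namespace="MyApp/Workers",  # hypothetical custom namespace
    MetricName="Heartbeat",
    Dimensions=[{"Name": "Service", "Value": "payments-worker"}],  # placeholder
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:on-call"],  # placeholder
)
```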
Metrics are often lossy and oriented around previously known problems, or operate at percentiles. What happened to a particular customer's request is not typically retrievable from the server side (though client-side trackers may be able to capture it from the exterior; see Raygun, Crashlytics etc).
'Observability', or the practice of tracing individual requests throughout distributed systems, is a related and emerging field, but well worth looking into:
- Microservices tend to fan out traffic or push it around asynchronously in queues. Tracking high-value flows with SaaS tooling or AWS X-Ray can help debug specific customer complaints (a tracing sketch follows this list)
- Traffic and error rates are often too difficult to alert on with simple thresholds. More sophisticated anomaly-detection algorithms can spot outages significantly quicker than simple threshold-based systems, often before the phone calls from customers arrive
- How is this different to logging? @copyconstruct has all the answers.
- Never forget that a service's uptime is determined by its dependencies.
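As a rough sketch of the tracing point above, here is what instrumenting a Python Lambda with the AWS X-Ray SDK (aws-xray-sdk) can look like. It assumes active tracing is enabled on the function, and the table, handler and field names are made up:

```python
# A sketch of per-request tracing with AWS X-Ray. patch_all() instruments
# supported libraries (boto3, requests, ...) so downstream calls appear as
# subsegments in each trace.
import boto3
from aws_xray_sdk.core import patch_all, xray_recorder

patch_all()

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")  # hypothetical table


@xray_recorder.capture("load_order")
def load_order(order_id: str) -> dict:
    # Annotations are indexed, so traces can later be filtered by order_id
    # when chasing a specific customer complaint.
    xray_recorder.put_annotation("order_id", order_id)
    # The DynamoDB call shows up as its own subsegment, so a slow or failing
    # downstream dependency is visible per request.
    return table.get_item(Key={"order_id": order_id}).get("Item", {})


def handler(event, context):
    return load_order(event["order_id"])
```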