Monitoring in a Reliability Engineering World
Site reliability engineering (SRE), popularized by Google, lets teams think about operational work in a different light:
- Are we providing an acceptable customer experience?
- Should we focus on reliability and resilience work?
- How do we balance new features against toil?
SRE metrics include the following:
- Service Level Indicators (SLI): How do I measure whether my customers are happy?
- Service Level Objective (SLO): What is the minimum I can allow my SLI to be to ensure that my customers are happy?
- Service Level Agreement (SLA): What SLO am I willing to agree to that has consequences?
With these defined, we can choose the SLIs, SLOs, and SLAs we commit to for our customers, and monitor them.
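As a minimal sketch of how an SLI is measured and compared against an SLO, assume availability is counted as the share of successful requests. The function names and the sample numbers below are hypothetical and only illustrate the idea.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded (the SLI)."""
    if total_requests == 0:
        # Assumption: with no traffic there is nothing to fail.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 500 of them failed.
    sli = availability_sli(999_500, 1_000_000)
    print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

An SLA would typically be set looser than the internal SLO, so that the SLO is breached (and acted on) before the contractual promise is.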
Monitoring Availability
When thinking about availability for an online store, decide which features are non-negotiable and which are nice to have:
- Which features would be catastrophic if they were down?
- What is the shortest mean time to recovery (MTTR) we can promise?
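To make the link between an availability target and MTTR concrete, here is a small sketch that converts a target into allowed downtime and then into the number of incidents that fit at a given MTTR. The 30-day window, the function names, and the sample targets are assumptions for illustration.

```python
def allowed_downtime_minutes(availability_target: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted in the period for a given target (e.g. 0.999)."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_target)


def incidents_within_budget(availability_target: float, mttr_minutes: float,
                            period_days: int = 30) -> int:
    """How many full incidents of length MTTR fit into the downtime allowance."""
    if mttr_minutes <= 0:
        raise ValueError("mttr_minutes must be positive")
    return int(allowed_downtime_minutes(availability_target, period_days) // mttr_minutes)


if __name__ == "__main__":
    budget = allowed_downtime_minutes(0.999)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
    print("Incidents that fit at a 15-minute MTTR:", incidents_within_budget(0.999, 15))
```

The shorter the MTTR we can actually deliver, the stricter the availability target we can afford to promise.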
Monitoring Errors
Lock wait timeouts can be a sign of escalating row-lock contention and can signal downtime later on.
Aborted connections can indicate many client-side retries, which consume resources.
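Error signals like these are usually exposed as ever-increasing counters, so what matters is the rate of change between two samples. The sketch below assumes the counters were read from `SHOW GLOBAL STATUS` (for example `Aborted_clients` and `Innodb_row_lock_waits`, the latter being a proxy for lock contention rather than a count of timeouts) and turns two snapshots into per-second rates. The helper name and the sample values are hypothetical.

```python
from typing import Dict


def counter_rates(previous: Dict[str, int], current: Dict[str, int],
                  interval_seconds: float) -> Dict[str, float]:
    """Per-second rate of each counter present in both snapshots."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    rates: Dict[str, float] = {}
    for name, now in current.items():
        before = previous.get(name)
        if before is None:
            continue  # counter not present in the earlier snapshot
        delta = now - before
        if delta < 0:
            # Counter went backwards: assume a server restart reset it to zero.
            delta = now
        rates[name] = delta / interval_seconds
    return rates


if __name__ == "__main__":
    # Hypothetical snapshots taken 60 seconds apart.
    earlier = {"Aborted_clients": 100, "Innodb_row_lock_waits": 40}
    later = {"Aborted_clients": 160, "Innodb_row_lock_waits": 70}
    for name, rate in counter_rates(earlier, later, 60).items():
        print(f"{name}: {rate:.2f}/s")
```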
Proactive monitoring
Steady state monitoring (monitoring for differences from the normal state) lets us know if something unexpected is happening to the system.
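One simple way to detect a departure from the normal state is to compare a new reading against the mean and standard deviation of a recent baseline. This is only a sketch of that idea; the three-sigma threshold, the function name, and the sample data are assumptions, and real systems often need seasonality-aware baselines.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates_from_baseline(baseline: Sequence[float], value: float,
                           threshold: float = 3.0) -> bool:
    """True when value is more than `threshold` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Perfectly flat baseline: any change at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


if __name__ == "__main__":
    # Hypothetical queries-per-second readings during normal operation.
    normal_qps = [100, 102, 98, 101, 99]
    print(deviates_from_baseline(normal_qps, 101))  # within the usual range
    print(deviates_from_baseline(normal_qps, 120))  # far outside the usual range
```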
Disk growth
- Taking up too much space on disk can lead to disk space errors and degraded performance. Try to monitor the growth of disk space on instances.
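Monitoring growth, rather than just current usage, lets us estimate when the disk will fill. The sketch below assumes one usage sample per day and a straight-line projection from the first to the last sample; the function name and the numbers are hypothetical.

```python
from typing import Optional, Sequence


def days_until_full(used_gb_by_day: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Linear estimate of days left before usage reaches capacity.

    Returns None when usage is flat or shrinking.
    """
    if len(used_gb_by_day) < 2:
        raise ValueError("need at least two daily samples")
    growth_per_day = (used_gb_by_day[-1] - used_gb_by_day[0]) / (len(used_gb_by_day) - 1)
    if growth_per_day <= 0:
        return None
    return (capacity_gb - used_gb_by_day[-1]) / growth_per_day


if __name__ == "__main__":
    # Hypothetical: four daily samples on a 500 GB volume.
    print(days_until_full([400, 410, 420, 430], capacity_gb=500))
```

Alerting on the projected days remaining tends to give more lead time than alerting on a fixed percentage used.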
Connection growth
- MySQL can handle a finite pool of connections, whose size is set by configuring max_connections. If the number of connections grows above 100, you can use pt-kill to kill stray connections.
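A common way to track connection growth is to compare the `Threads_connected` status variable with the configured `max_connections`. The sketch below assumes those two values have already been read from the server; the 80% alert threshold and the function names are assumptions, not values taken from the text above.

```python
def connection_utilization(threads_connected: int, max_connections: int) -> float:
    """Share of the connection limit currently in use (0.0 to 1.0)."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return threads_connected / max_connections


def should_alert(threads_connected: int, max_connections: int,
                 threshold: float = 0.8) -> bool:
    """True when utilization has reached the alert threshold."""
    return connection_utilization(threads_connected, max_connections) >= threshold


if __name__ == "__main__":
    # Hypothetical reading: 135 open connections against a limit of 150.
    print(f"utilization: {connection_utilization(135, 150):.0%}")
    print("alert:", should_alert(135, 150))
```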
Replication Lag
If a replica is falling behind its source, it can mean the system is handling more writes than replication can keep up with, and may call for an architectural improvement.
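Brief lag spikes are normal, so it usually makes sense to alert only on lag that persists. The sketch below assumes lag samples in seconds, such as the `Seconds_Behind_Source` field of `SHOW REPLICA STATUS`, collected at a fixed interval; the function name and thresholds are hypothetical.

```python
from typing import Sequence


def lag_is_sustained(lag_samples_seconds: Sequence[float], threshold_seconds: float,
                     min_consecutive: int) -> bool:
    """True when lag exceeds the threshold for `min_consecutive` samples in a row."""
    run = 0
    for lag in lag_samples_seconds:
        if lag > threshold_seconds:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0  # lag recovered, start counting again
    return False


if __name__ == "__main__":
    # Hypothetical samples: one per minute.
    print(lag_is_sustained([2, 40, 45, 50, 3], threshold_seconds=30, min_consecutive=3))
    print(lag_is_sustained([2, 40, 3, 45, 50], threshold_seconds=30, min_consecutive=3))
```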
I/O Utilization
If I/O utilization is high, or close to 100%, it can indicate inefficient queries, such as full table scans or queries that are not using indexes.
Auto-increment space
Auto-increment primary keys are often defined as signed INT, which tops out at 2,147,483,647 and can run out on a busy table. Monitor how much of the range has been used, and consider BIGINT for tables that grow quickly.
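To watch auto-increment space, compare the next auto-increment value of a table with the maximum its column type can hold. The sketch below assumes the next value has already been read (for example from `information_schema.TABLES.AUTO_INCREMENT`); the lookup table and function name are illustrative.

```python
# Largest value each MySQL integer type can store.
MAX_VALUES = {
    "int": 2**31 - 1,
    "int unsigned": 2**32 - 1,
    "bigint": 2**63 - 1,
    "bigint unsigned": 2**64 - 1,
}


def auto_increment_used_fraction(next_value: int, column_type: str) -> float:
    """Share of the key space already consumed for the given column type."""
    return next_value / MAX_VALUES[column_type.lower()]


if __name__ == "__main__":
    # Hypothetical table whose next id is a little over one billion.
    next_id = 1_073_741_824
    for column_type in ("int", "int unsigned", "bigint"):
        used = auto_increment_used_fraction(next_id, column_type)
        print(f"{column_type}: {used:.2%} used")
```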
Backup creation/restore time
Make sure backups can actually be used for recovery, and track how long they take to create and restore. If those times grow too long, consider backing up only the business-critical parts of your database.
Percentiles are your friend: don’t use averages for SLIs; use the median and other percentiles instead!
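A small sketch shows why: a single slow request drags the average far away from what most users experience, while percentiles keep the typical case and the tail separate. The nearest-rank percentile helper and the latency values are hypothetical.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Hypothetical query latencies in milliseconds, with one outlier.
    latencies_ms = [10, 11, 12, 12, 13, 13, 14, 15, 16, 900]
    print("mean:", mean(latencies_ms))
    print("p50: ", percentile(latencies_ms, 50))
    print("p99: ", percentile(latencies_ms, 99))
```

Here the mean suggests every request is slow, the median shows the typical request is fast, and the 99th percentile exposes the outlier.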