Prev: mysql-architecture Next: performance-schema
SRE’s, popularized by Google, now lets teams think about operational work in a different light:
SRE metrics include the following:
After this, we can select an SLI/SLO/SLA to meet our customers with, and monitor those.
Talking about availability for an online store, what features are nonnegotiable and what are nice to have:
What features being down are catastrophic?
What is the shortest possible mean time to recovery (MTTR) we can promise?
Lock wait timeouts can be a sign of escalating row-lock contention, and can signal downtime later on:
Aborted connections can be an indicator of lots of client-side retries, which consumes resources.
Steady state monitoring (monitoring for differences from the normal state) lets us know if something unexpected is happening to the system.
max_connections
. If this grows above 100, you
can use pt-kill
to kill stray connections.If an application is replicating too slowly, it can point to too many writes being handled by the system, and necessitate an architectural improvement.
If you find that I/O utilization is high, or close to 100%, it can indicate inefficient queries, like full table scans or not hitting enough indexes.
Primary Keys in MySQL are signed integers, and can quickly run out of space. Be careful about this, maybe using BigInt types?
Make sure to utilize backups for recovery purposes, and maybe consider only backing up business critical parts of your database.
Percentiles are your friend: Don’t use average, use medians for SLIs!
Prev: mysql-architecture Next: performance-schema