, Scalable and Maintainable Applications”
Reliable, Scalable, and Maintainable Applications
Next: data-models-and-query-languages
Data Intensive Applications use the following building blocks:
- Storing Data (Databases)
- Cache Data (Caches)
- Index Data (Search Index)
- Send Messages (Stream Processing)
- Crunch Data (Batch Processing)
Aspects of software systems
- Reliability
- The system should continue to work correctly (even with software, hardware and human fault)
- Scalability
- As the system grows (in data volume, traffic volume, or complexity) there should be reasonable ways of dealing with that growth.
- Maintainability
- Over time, many people will be working on the system. They should be able to make changes (fixes and features) productively.
Reliability
Faults: faults occur when one part of the system misbehaves.
Failure: the system as a whole misbehaves.
Hardware faults:
- RAM becomes faulty
- The grid blacks/browns out
- The internet stops working
- Hard disks fail
Hard disks have a mean time to failure of 10 to 50 years. Therefore, on a cluster with 10,000 disks, one disk should die per day.
Software Errors:
- A leap year second bug in 2012 which caused applications to hang in the linux kernel
- An infinite loop eats up all of a shared resource, starving other processes.
- An external service slows down
- Cascading failure
Human Errors:
- Make interfaces hard to use wrong.
- Create sandbox environments
- Test thoroughly at all levels
- Allow quick recovery (rolling back)
- Add lots of metrics
Scalability
Scalability is generally in response to one part of the load increasing. That could be read or write volume, or read or write throughput.
Describing Load
- Posting a tweet:
A User can publish a new message to their followers (4.6k rps, 12krps @peak)
- Viewing Home Timeline:
A user can view tweets posted by the people they follow (300k rps)
How do we handle both operations?
- Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time). In a relational database like in Figure 1-2, you could write a query such as:
SELECT tweets.*, users.* FROM tweets
JOIN users ON tweets.sender_id = users.id
JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = current_user
- Maintain a cache for each user’s home timeline---like a mailbox of tweets for each recipient user (see Figure 1-3). When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time.
Twitter does the second approach these days. This is because the read volume is much higher than the write volume, but this has the tradeoff of increasing write volume (if everyone has on average 75 followers, 4.6k * 75 == 345k writes per second).
Some celebrities might have 1M+ followers. In that case, twitter employs the first approach, as every time a celebrity would write a tweet, 1M writes in a few seconds would overload the writers to this service.
Describing Performance
- When you increase a load parameter and keep the system resources the same, how is performance affected?
- When you increase a load parameter, how much do you need to increase resources to keep performance unchanged?
Normally we use p90, p95, p99, p999 (90th, 95th, 99th, 99.9th) percentiles to measure performance.
Amazon for example optimized for the p999 customer’s experience, because they spent a lot on the platform, and were more likely to spend more. However, the p9999 experience didn’t matter as much, so they stopped optimizing at that point.
SLAs (service level agreements) are a contractual bound where customers can find out what is the expectation for the service to be up.
Approaches for Coping with Load
An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. For a fast growing service, you’ll want to rethink your architecture every so often to deal with that.
To create systems like this, you may want to go for a shared-nothing architecture, which increases load horizontally.
Remember that a lot of this is reliant on system load; a system built to handle 100,000 rps with a payload of 1kB is different from one with 3 rps with a 2GB payload, even though both systems have the same throughput.
Maintainability
- Operability
- Make it easy for operations teams to keep the system running smoothly.
- Simplicity
- Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system.
- Evolvability
- Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change.
Operability: Making life easy for Operations
- The operations team should make the following easy:
- Monitoring the health of the system
- Making the platform easy to debug
- Keeping software up to date
- Anticipating future capacity problems before they occur
- Establishing good practices for deployment and configuration management
- Maintaining processes of the system
- Preserving the organization’s knowledge about the system.
Simplicity: Managing Complexity
Making a system simpler often means reducing accidental complexity, and finding good abstractions.
Evolvability: Making Change Easy
Following agile, TDD, allowing the system to be more modular through the use of interface segregation.