Maintaining Consistency
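- Timestamps alone won’t give a reliable order because this is a distributed system: clocks on different machines are never perfectly synchronized.
- Instead, we want the server that processes messages to increment a sequence number, which preserves ordering for messages from the same client.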
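A minimal sketch of the sequence-number idea, assuming a single server process holds the counters in memory. The `Sequencer` class and the `client_id`/`seq`/`body` field names are hypothetical, not part of any real API; a real deployment would need the counter to survive restarts and be owned by exactly one node per client.

```python
from collections import defaultdict
from itertools import count


class Sequencer:
    """Assigns an increasing number per client (hypothetical helper)."""

    def __init__(self):
        # One independent counter per client_id, each starting at 1.
        self._counters = defaultdict(lambda: count(1))

    def stamp(self, client_id, body):
        # The server, not the client clock, decides the order.
        seq = next(self._counters[client_id])
        return {"client_id": client_id, "seq": seq, "body": body}


sequencer = Sequencer()
first = sequencer.stamp("alice", "hi")
second = sequencer.stamp("alice", "how are you?")
other = sequencer.stamp("bob", "hello")

# Messages may reach a reader out of order; sorting by seq restores the order.
restored = sorted([second, first], key=lambda m: m["seq"])
print([m["body"] for m in restored])
```

Ordering here is only guaranteed per client; comparing `seq` values across different clients is meaningless in this sketch.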
500 million daily active users. Each user sends 40 messages daily.
20B messages per day.
A message is 100 bytes. So per day, we need 2TB of storage.
For about 5 years of chat history, we’d need about 3.65 PB of storage (2TB x 365 x 5).
This works out to about 23MB of data per second, upstream and downstream (2TB / 86,400 seconds).
(We can always compress and decompress data. Given textual data, assume a 2-4x compression rate, so we would need roughly 6-12MB per second up and downstream.)
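The back-of-envelope numbers above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures stated in these notes (decimal units, 1TB = 10^12 bytes); the constant names are made up for illustration.

```python
DAILY_ACTIVE_USERS = 500_000_000
MESSAGES_PER_USER_PER_DAY = 40
BYTES_PER_MESSAGE = 100
SECONDS_PER_DAY = 86_400

messages_per_day = DAILY_ACTIVE_USERS * MESSAGES_PER_USER_PER_DAY
bytes_per_day = messages_per_day * BYTES_PER_MESSAGE
bytes_per_second = bytes_per_day / SECONDS_PER_DAY
five_year_bytes = bytes_per_day * 365 * 5

print(f"messages/day:   {messages_per_day:,}")
print(f"storage/day:    {bytes_per_day / 1e12:.1f} TB")
print(f"throughput:     {bytes_per_second / 1e6:.1f} MB/s")
print(f"5-year storage: {five_year_bytes / 1e15:.2f} PB")

# Assumed 2-4x compression for text, as in the notes.
for ratio in (2, 4):
    print(f"compressed {ratio}x:  {bytes_per_second / ratio / 1e6:.1f} MB/s")
```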
Can we archive some of this data/hold it in non-hot storage?
What is the cost of this?
1GB/Month costs about 1 cent for Disk. 1GB/Month costs about 10 cents for SSD. 1GB/Month costs about $1 for RAM storage.
About $3.65M a month to store all of it in RAM (5 years of history is about 3.65M GB, at $1 per GB).
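To make the archiving question concrete, here is a rough cost sketch using the per-GB prices listed above. The 30-day "hot" window is purely an assumption for illustration, as are the names `HOT_DAYS` and `monthly_cost_dollars`.

```python
# Prices in cents per GB per month, taken from the notes above.
PRICE_CENTS_PER_GB_MONTH = {"disk": 1, "ssd": 10, "ram": 100}

GB_PER_DAY = 2_000               # 2TB of new messages per day
TOTAL_GB = GB_PER_DAY * 365 * 5  # 5 years of history


def monthly_cost_dollars(gigabytes, tier):
    return gigabytes * PRICE_CENTS_PER_GB_MONTH[tier] / 100


# Cost of keeping the whole history in a single tier.
all_in_one_tier = {
    tier: monthly_cost_dollars(TOTAL_GB, tier) for tier in PRICE_CENTS_PER_GB_MONTH
}

# Assumption: only the most recent 30 days stay in RAM, the rest goes to disk.
HOT_DAYS = 30
hot_gb = GB_PER_DAY * HOT_DAYS
cold_gb = TOTAL_GB - hot_gb
tiered = monthly_cost_dollars(hot_gb, "ram") + monthly_cost_dollars(cold_gb, "disk")

for tier, cost in all_in_one_tier.items():
    print(f"all on {tier}: ${cost:,.0f}/month")
print(f"30 days in RAM + rest on disk: ${tiered:,.0f}/month")
```

Under these assumptions, tiering cuts the monthly bill by well over an order of magnitude compared with keeping everything in RAM.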
user_id: (message_id, from_id, to_id, message), with an index on (from_id, to_id).
But how do we make updates fast? (Use pagination, not offsets, here.)
How do we guarantee linearizability and stronger consistency? What should we do when certain nodes aren’t reachable?
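Use cursor-based pagination for range queries: remember the last message_id seen and fetch the rows after it. LIMIT/OFFSET doesn’t work well here, because the database still has to scan and discard every skipped row.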
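A small sketch of cursor-based pagination, using an in-memory SQLite database as a stand-in for the real store. The column names mirror the message_id, from_id, to_id, message fields mentioned in these notes; the table name `messages`, the index name, and the `fetch_page` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "message_id INTEGER PRIMARY KEY, from_id INTEGER, to_id INTEGER, message TEXT)"
)
# Including message_id in the index lets the range scan follow the index order.
conn.execute("CREATE INDEX idx_from_to ON messages (from_id, to_id, message_id)")
conn.executemany(
    "INSERT INTO messages (message_id, from_id, to_id, message) VALUES (?, ?, ?, ?)",
    [(i, 1, 2, f"msg {i}") for i in range(1, 8)],
)


def fetch_page(conn, from_id, to_id, after_id=0, page_size=3):
    """Return one page of messages plus the cursor for the next page."""
    rows = conn.execute(
        "SELECT message_id, message FROM messages "
        "WHERE from_id = ? AND to_id = ? AND message_id > ? "
        "ORDER BY message_id LIMIT ?",
        (from_id, to_id, after_id, page_size),
    ).fetchall()
    # The last message_id on this page is the cursor for the next request.
    next_cursor = rows[-1][0] if rows else None
    return rows, next_cursor


page1, cursor = fetch_page(conn, 1, 2)
page2, cursor2 = fetch_page(conn, 1, 2, after_id=cursor)
print(page1, cursor)
print(page2, cursor2)
```

Because each request seeks directly to `message_id > cursor`, the cost of fetching page N does not grow with N, which is the main reason to prefer this over OFFSET.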
Pull (polling) is old school. Not all devices support bidirectional sockets (some only speak plain HTTP/1.1 request/response), so we may have to support polling for very old devices.
If every one of the 500M daily active users polled once per second, that would be up to 500M queries per second to web services.
Push is more efficient, since the overhead of opening an HTTP connection can be amortized over the life of a session. Server-Sent Events are also very useful for this.
Most smartphones support push.
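As a sketch of what a Server-Sent Events push looks like on the wire, the helper below serializes messages into the `text/event-stream` frame format (`id:`, `event:`, and `data:` fields, terminated by a blank line). The function names and the `seq`/`body` message fields are hypothetical; a real server would write these frames to a long-lived HTTP response.

```python
import json


def format_sse(data, event=None, event_id=None):
    """Serialize one Server-Sent Events frame."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Multi-line payloads need one "data:" field per line.
    for chunk in str(data).splitlines() or [""]:
        lines.append(f"data: {chunk}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def message_stream(messages):
    """Yield one frame per message over a single open connection."""
    for msg in messages:
        yield format_sse(json.dumps(msg), event="message", event_id=msg["seq"])


frames = list(message_stream([{"seq": 1, "body": "hi"}, {"seq": 2, "body": "yo"}]))
print(frames[0], end="")
```

Sending the sequence number as the event `id` lets a reconnecting client report the last event it saw via the `Last-Event-ID` header, so the server can resume from there.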
user_id, so we can fetch all messages for a user.