Things to know
- Can post tweets
- Timeline
- Follow
- ...
- A working solution
- Analysis and communication
- Tradeoffs: pros and cons
- Solve the problem by following the requirements
- Knowledge base
  - Where should you put the cache?
  - How do you update the cache?
- Understanding the Requirements (Clarify the Requirements)
- Capacity Estimation
- System APIs
- High-level System Design
- Data Storage
- Scalability
Don't just build the thing in your own mind; ask the client what they need. The main goals:
- Requirements
  - Traffic size (e.g., Daily Active Users)
  - Tweet
    - Create
    - Delete
  - Timeline/Feed
    - Home ( content from everyone the user follows )
    - User ( content from a single user )
  - Follow a user
  - Like a tweet
  - Search tweets
  - ...
- Consistency
  - Get recent content
    - E.g., when another user posts a new tweet, it should show up in (near) real time
  - Sacrifice: eventual consistency
    - Reason: it does not noticeably affect the final user experience
- Availability
  - Every request receives a (non-error) response, without the guarantee that it contains the most recent write
- The system must be scalable
  - Performance: low latency
    - E.g., under high load
- Partition tolerance (fault tolerance)
  - The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.
The main point of this analysis:
- Understand where the bottleneck is
- 200 million DAU, 100 million new tweets per day
- Each user visits the home timeline 5 times and other users' timelines 3 times per day
- Each timeline/page has 20 tweets
- Each tweet: 280 bytes of text (140 characters × 2 bytes), plus 30 bytes of metadata
- Per photo: 200KB; 20% of tweets have images
- Per video: 2MB; 10% of tweets have video; 30% of videos will be watched
- Daily write size:
  - Text: 100M new tweets * (280 + 30) bytes/tweet = 31GB/day
  - Images: 100M new tweets * 20% with images * 200KB per image = 4TB/day
  - Videos: 100M new tweets * 10% with videos * 2MB per video = 20TB/day
  - Total: 31GB + 4TB + 20TB ≈ 24TB/day
Daily tweet read volume
- 200M * ( 5 home visits + 3 user visits ) * 20 tweets/page = 32B tweets/day
Daily read bandwidth
- Text: 32B * 280 bytes / 86400s ≈ 100MB/s
- Images: 32B * 20% with images * 200KB per image / 86400s ≈ 14GB/s
- Videos: 32B * 10% with videos * 30% watched * 2MB per video / 86400s ≈ 20GB/s
- Total: ≈ 35GB/s ( see the sanity check below )
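A quick sanity check of the arithmetic above, as a minimal Python sketch. The constants are just the assumptions listed in this section; note the exact read total comes to ~37GB/s before the per-term rounding down that gives 35GB/s.

```python
# Back-of-the-envelope check of the capacity estimates above.
DAY_SECONDS = 86_400
DAU = 200_000_000                 # daily active users
NEW_TWEETS = 100_000_000          # new tweets per day
TEXT_BYTES = 280 + 30             # tweet text + metadata
IMAGE_BYTES = 200 * 1_000         # 200KB per photo
VIDEO_BYTES = 2 * 1_000_000       # 2MB per video

# Daily write size (text + images + videos).
writes = (NEW_TWEETS * TEXT_BYTES
          + NEW_TWEETS * 0.20 * IMAGE_BYTES
          + NEW_TWEETS * 0.10 * VIDEO_BYTES)
print(f"writes: {writes / 1e12:.1f} TB/day")     # ~24.0 TB/day

# Daily read volume and bandwidth.
reads = DAU * (5 + 3) * 20                       # 32B tweets read per day
bandwidth = (reads * 280
             + reads * 0.20 * IMAGE_BYTES
             + reads * 0.10 * 0.30 * VIDEO_BYTES) / DAY_SECONDS
print(f"reads: {bandwidth / 1e9:.0f} GB/s")      # ~37 GB/s before rounding
```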
Notes: [opt] is optional
- String userToken
  - identifies the authorized user
- Int tweet_id
  - identifies which tweet
- String tweet ( contents )
  - content from the user
- Int pageSize
  - number of tweets per page
- bool like
  - like or unlike
- [opt] String pageToken
  - used for pagination ( turning pages )
- postTweet( userToken, tweet )
- deleteTweet( userToken, tweet_id )
- likeOrUnlikeTweet( userToken, tweet_id, like )
- readHomeTimeline( userToken, pageSize, [opt] pageToken )
- readUserTimeline( userToken, pageSize, [opt] pageToken )
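A minimal sketch of how these endpoints could look as a service interface. The names and types mirror the parameter list above; the `Page` type, the snake_case method names, and the empty bodies are placeholders, not a confirmed API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:
    tweets: list[str]               # tweet contents for this page
    next_page_token: Optional[str]  # pass back to fetch the next page

class TweetService:
    """Signatures mirror the API list above; bodies are placeholders."""

    def post_tweet(self, user_token: str, tweet: str) -> int:
        """Create a tweet for the authorized user; returns its tweet_id."""
        raise NotImplementedError

    def delete_tweet(self, user_token: str, tweet_id: int) -> None:
        raise NotImplementedError

    def like_or_unlike_tweet(self, user_token: str, tweet_id: int,
                             like: bool) -> None:
        raise NotImplementedError

    def read_home_timeline(self, user_token: str, page_size: int,
                           page_token: Optional[str] = None) -> Page:
        raise NotImplementedError

    def read_user_timeline(self, user_token: str, page_size: int,
                           page_token: Optional[str] = None) -> Page:
        raise NotImplementedError
```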
![CC_HLSD_Post_Tweet.drawio (2)](./CC_HLSD_Post_Tweet.drawio (2).png)
![CC_HLSD_UserTimeline.drawio (1)](./CC_HLSD_UserTimeline.drawio (1).png)
- Fan out on write ( update caches when a user posts a new tweet )
  - This is only one approach Twitter may use ( warning: this description may be wrong )
  - Get the followers' info
  - Write the tweet into each follower's list ( home timeline )
  - When a user reads their recent timeline, the content is returned from the cache
![CC_HLSD_HomeTimeline.drawio (2)](./CC_HLSD_HomeTimeline.drawio (2).png)
Option 1: Fan out on read ( Pull )
- How
  - Fetch tweets from N followees from the DB, merge, and return ( sketched below )
- Pros
  - Write is fast: O(1)
- Cons
  - Read is slow: O(N) DB reads
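A minimal sketch of the pull model, assuming hypothetical accessors `db.followees()` and `db.tweets_by_user()`, where each per-followee list holds `(created_at, tweet_id)` pairs sorted newest-first.

```python
import heapq
from itertools import islice

def read_home_timeline_pull(db, user_id: int, page_size: int = 20) -> list[int]:
    """Fan out on read: O(N) DB reads for N followees, merged per request."""
    # Each per-followee list is (created_at, tweet_id) pairs, newest first.
    per_followee = [db.tweets_by_user(f) for f in db.followees(user_id)]
    # Merge the already-sorted lists lazily and take one page.
    merged = heapq.merge(*per_followee, key=lambda t: t[0], reverse=True)
    return [tweet_id for _, tweet_id in islice(merged, page_size)]
```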
Option 2: Fan out on write ( Push )
- How
  - Maintain a feed list in cache for each user
  - Fan out on write ( sketched below )
- Pros
  - Read is fast: O(1) from the feed list in cache
- Cons
  - Write needs more effort: O(N) writes for each new tweet
    - Async tasks
  - Delay in showing the latest tweets ( eventual consistency )
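A minimal sketch of the push model; `db.followers()` and the in-process `home_feed` map are stand-ins for the real follower store and cache.

```python
from collections import defaultdict, deque

FEED_LEN = 800  # assumed cap on each cached home feed

# user_id -> tweet_ids, newest first; an in-process stand-in for the cache.
home_feed: defaultdict[int, deque] = defaultdict(lambda: deque(maxlen=FEED_LEN))

def fan_out_on_write(db, author_id: int, tweet_id: int) -> None:
    """O(N) cache writes for N followers; run as an async task in practice."""
    for follower_id in db.followers(author_id):
        home_feed[follower_id].appendleft(tweet_id)  # newest first

def read_home_timeline_push(user_id: int, page_size: int = 20) -> list[int]:
    """Read is O(1): just slice the precomputed feed list."""
    return list(home_feed[user_id])[:page_size]
```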
- Non-realtime users ( e.g., zombie or offline users ): fanning out to them is wasted work
- With writes taking O(N), tweets from the most-followed users cannot be delivered in real time
- Solution: hybrid solution ( sketched after this list )
  - Non-hot users:
    - Fan out on write ( Push ): write to each follower's home timeline cache
    - Do not fan out to non-active users
  - Hot users:
    - Fan in on read ( Pull ): read their tweets from the tweets cache during the timeline request, and aggregate with the results from non-hot users
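A minimal sketch of the hybrid approach. The `db`/`cache` objects and every method on them (`is_hot_user`, `followers`, `is_active`, `push_feed`, `get_feed`, `hot_followees`, `recent_tweets`) are hypothetical; feed entries are `(created_at, tweet_id)` pairs.

```python
def on_new_tweet_hybrid(db, cache, author_id: int, tweet_id: int,
                        created_at: float) -> None:
    """Push path: fan out on write, but only for non-hot authors
    and only to active followers."""
    if db.is_hot_user(author_id):
        return  # hot users are pulled at read time instead
    for follower_id in db.followers(author_id):
        if db.is_active(follower_id):  # skip zombie/offline users
            cache.push_feed(follower_id, (created_at, tweet_id))

def read_home_timeline_hybrid(db, cache, user_id: int,
                              page_size: int = 20) -> list[int]:
    """Pull path: merge the precomputed feed with hot followees' tweets."""
    entries = list(cache.get_feed(user_id))          # pushed by non-hot users
    for hot_id in db.hot_followees(user_id):
        entries.extend(cache.recent_tweets(hot_id))  # pulled on demand
    entries.sort(reverse=True)                       # newest created_at first
    return [tweet_id for _, tweet_id in entries[:page_size]]
```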
Notes: PK = Primary Key

User table:

| PK | userID: Integer |
|---|---|
| | name: varchar(256) |
| | email: varchar(100) |
| | creationTime: DateTime |
| | lastLogin: DateTime |
| | isHotUser: Boolean |

Tweet table:

| PK | tweetID: Integer |
|---|---|
| | userID: Integer ( references User ) |
| | creationTime: DateTime |
| | content: varchar(140) |
| | ... |

Follow table ( userID1 follows userID2; composite PK ):

| PK | Columns |
|---|---|
| PK | userID1: Integer |
| PK | userID2: Integer |
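A runnable sketch of the three tables above using SQLite (types adapted to SQLite syntax; the column lists follow the tables as written).

```python
import sqlite3

# Create the schema above in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    userID        INTEGER PRIMARY KEY,
    name          VARCHAR(256),
    email         VARCHAR(100),
    creationTime  DATETIME,
    lastLogin     DATETIME,
    isHotUser     BOOLEAN
);
CREATE TABLE tweets (
    tweetID       INTEGER PRIMARY KEY,
    userID        INTEGER REFERENCES users(userID),
    creationTime  DATETIME,
    content       VARCHAR(140)
);
CREATE TABLE follows (          -- userID1 follows userID2
    userID1       INTEGER REFERENCES users(userID),
    userID2       INTEGER REFERENCES users(userID),
    PRIMARY KEY (userID1, userID2)
);
""")
```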
- SQL database
  - Relational data
    - E.g., user table
- NoSQL database
  - Non-relational data
    - E.g., timeline
- File system
  - Media or large files
    - Media files: images, audio, video, ...
- Identify potential bottlenecks
- Discuss solutions, focusing on tradeoffs
  - Data sharding
    - Data store, cache
  - Load balancing
    - E.g., user <-> application server; application server <-> cache server; application server <-> DB
  - Data caching
    - Read-heavy system
- Data sharding
- Why?
- Impossible to store/process all data in a single machine
- How?
- Break large tables into smaller shards on multiple servers
- Pros
- Horizontal scaling
- Cons
- Complexity ( distributed query, resharding )
Option 1: Shard by tweet's creation time
- Pros:
  - Limited number of shards to query
- Cons:
  - Hot/cold data issue
  - New shards fill up quickly
![hot_cold-tables.drawio (1)](./hot_cold-tables.drawio (3).png)
Option 2: Shard by hash ( userID )
- Pros
  - Simple
  - Querying a user timeline is straightforward
- Cons
  - Home timeline still needs to query multiple shards
  - Non-uniform distribution of storage
    - A user's data might not fit into a single shard
  - Hot users
    - Availability ( hot data )
Option 3: Shard by hash ( tweetID )
- Pros
  - Uniform distribution
  - High availability
- Cons
  - Need to query all shards in order to generate a user / home timeline
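A minimal sketch contrasting options 2 and 3 as routing functions. The shard count and the use of plain `hash()` are assumptions; production systems typically use consistent hashing to ease resharding.

```python
NUM_SHARDS = 1024  # assumed shard count

def shard_by_user(user_id: int) -> int:
    """Option 2: all of a user's tweets live on one shard, so the user
    timeline is a single-shard query, but hot users skew the load."""
    return hash(user_id) % NUM_SHARDS

def shard_by_tweet(tweet_id: int) -> int:
    """Option 3: tweets spread uniformly, but a timeline query must
    scatter-gather across all shards."""
    return hash(tweet_id) % NUM_SHARDS
```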
Data caching
- Why
  - Distributed queries can be slow and costly
  - Heavy read traffic ( typical of a social network )
- How
  - Store hot / precomputed data in memory; reads become much faster
- Timeline service
  - User timeline: user_id → {tweet_id} # varies a lot: 1k~100k; Trump: ~60k
  - Home timeline: user_id → {tweet_id} # can be huge: sum of followees' tweets
  - Tweets: tweet_id → tweet # common data that can be shared
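A minimal in-memory sketch of the three maps above (plain dicts standing in for the real cache).

```python
# user_id -> [tweet_id]; per-user timelines vary a lot in length (1k~100k).
user_timeline: dict[int, list[int]] = {}
# user_id -> [tweet_id]; home feeds can be huge (sum of followees' tweets).
home_timeline: dict[int, list[int]] = {}
# tweet_id -> tweet content; shared by every feed that references it.
tweets: dict[int, str] = {}

def render_home(user_id: int, page_size: int = 20) -> list[str]:
    """A home page is ids from the feed map hydrated via the shared tweet map."""
    return [tweets[tid] for tid in home_timeline.get(user_id, [])[:page_size]]
```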
- Topics
  - Caching policy
  - Sharding
  - Performance