Book notes: System Design Interview: Volume 1
A 2020 book by Alex Xu: a hands-on guide to system design interviews, with real-world examples of scalable systems.
Here are my notes:
📃 Content:
- 1. System Design Basics
- 2. Estimations
- 3. Framework for System Design Interviews
- 4. Example: Design A Rate Limiter
- 5. Design Consistent Hashing
- 6. Example: Design A Key-value Store
- 7. Distributed Unique ID Generation
- 8. Example: URL Shortener
- 9. Example: Web Crawler
- 10. Example: Notification System
- 11. Example: News Feed System
- 12. Example: Search Autocomplete System
- 13. Example: Chat System
- 14. Example: YouTube
- 15. Example: Google Drive
1. System Design Basics
How to scale from zero to millions of users:
- Start simple, a single server, and iterate only when needed.
- Metrics: user base, traffic, use case latency, costs…
- Which Database to use?
- Relational (RDBMS) by default. Worked well for over 40 years.
- If the use case is not suitable, ask:
- Does the application require super-low latency?
- Unstructured data or non‑relational data?
- Do you only need to serialize and deserialize data (JSON, XML, etc.)?
- Do you need to store a massive amount of data?
- Solution: Non-Relational (NoSQL)
- Key-value (Redis, Amazon DynamoDB…)
- Graph (Neo4j…)
- Column (Cassandra, HBase…)
- Document (MongoDB, CouchDB…)
- Time series (Prometheus…)
- Scalability
- Vertical: scale-up, CPU, RAM, etc.
- Simple, and good for low traffic.
- Limited by hardware in a single server.
- No failover or redundancy. Server down means business down.
- Horizontal: scale-out, more servers to the pool.
- Load Balancer
- If one server goes down, the balancer redirects traffic to available servers.
- If traffic grows, more servers can be added without changing the request flow.
- Database Replication
- Master (original) / slave (copies), only the master accepts writes.
- In general, systems do many more reads than writes, so there are more slaves than masters.
- Better performance: more queries running in parallel.
- More reliability: if one database is destroyed, data is preserved due to replication.
- Disaster recovery:
- Slave down: traffic goes to another slave or even to the master.
- Master down: a slave is promoted to be a new master (data might need to be restored).
- Other option: multi-master and circular replication.
- Cache
- Temporary storage, much faster than the database, that stores expensive responses of frequently accessed data in memory.
- Use when data is accessed frequently but modified infrequently.
- Cache is volatile; important data must be persisted in data stores.
- Define an expiration policy; it is a good practice.
- Not too short (to avoid unnecessary data reload).
- Not too long (data can become stale or unused).
- Consistency trade-off: keeping the cache and the data store in sync is hard because they are not updated in a single transaction.
- Possible Single point of failure (SPOF).
- Recommend multiple cache servers across data centers.
- Overprovision the required memory by certain percentages.
- Eviction policy: once the cache is full, adding new items causes existing items to be removed.
- Least-recently-used (LRU) policy (most common).
- Least-frequently-used (LFU) policy.
- First-in-first-out (FIFO) policy.
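The most common policy, LRU, can be sketched with an ordered map. This is a minimal in-process illustration (a real cache tier such as Redis or Memcached implements this internally; the `LRUCache` class name is just for the example):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least-recently-used key when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used item

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touch "a", so "b" becomes the least recently used
cache.put("c", 3)  # cache is full: "b" is evicted
```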
- Content delivery network (CDN)
- Content is delivered from the closest (lowest latency) CDN server.
- Cache only relevant and frequent data (to avoid unnecessary costs).
- Define cache expiration.
- Not too long if data is time-sensitive; data needs to be fresh.
- Not too short; to avoid repeated reloading.
- Fallback: what happens if CDN is not working? Get data directly from the origin.
- Invalidate files:
- Call the CDN API for specific object invalidation.
- Object versioning: e.g. image.png?v=2.
- Stateless web servers
- Apply REST principles.
- Store sessions in persistent storage.
- Sticky Sessions can be a problem.
- Multiple data centers/zones
- Geo-routed worldwide, routes client traffic to the nearest data center.
- In failover: data is replicated in multiple data centers, and traffic routed to the available one.
- For example: GeoDNS
- Data should be geo‑replicated in order to achieve multi-data-center consistency.
- Automate deployments and tests in multiple data centers to guarantee consistency.
- Message queue
- Enables asynchronous communication and asynchronous request processing.
- Decouples requests from processing, which adds more reliability to the system.
- If workers are unavailable, producers can still publish messages to the queues.
- Workers can work when producers are not available.
- The number of workers can be scaled depending on queue size.
- Logging & Metrics for Observability
- Monitor error logs at the server level or use tools to aggregate them to a centralized service for easy search and view.
- Metrics:
- Host level: CPU, memory, disk I/O, etc.
- Aggregated level of tiers: database, web servers, cache.
- Key business metrics: users, retention, revenue, etc.
- Database scaling
- Vertical: more CPU, RAM.
- Single point of failure.
- High cost.
- Horizontal scaling: Sharding
- Same schema, but database is partitioned in smaller, easy-to-manage parts (shards).
- Define a good partition key to distribute data
- One or more columns that determine how data is distributed physically in shards.
- Challenges:
- Resharding data
- One shard is full, and the sharding function needs to change to balance data between shards.
- Solution: Consistent hashing
- Celebrity problem (hotspot key problem).
- Some data inside a shard receives a disproportionate amount of traffic (celebrity).
- We may need to allocate dedicated shards for each celebrity.
- Join and de-normalization
- Data is split across shards, making join operations difficult.
- Solution: De-normalize the database to perform queries on a single table.
- Scaling is an iterative process
- For example, the business may realize it needs to split services into smaller microservices.
2. Estimations
Power of two table:
- 2^3 = 8 bits = 1 Byte (1 ASCII character)
- 2^10 = 1,024 Bytes ≈ 1 Thousand = 1 Kilobyte (1 KB)
- 2^20 = 1,024 KB ≈ 1 Million = 1 Megabyte (1 MB)
- 2^30 = 1,024 MB ≈ 1 Billion = 1 Gigabyte (1 GB)
- 2^40 = 1,024 GB ≈ 1 Trillion = 1 Terabyte (1 TB)
- 2^50 = 1,024 TB ≈ 1 Quadrillion = 1 Petabyte (1 PB)
Latency Numbers Every Programmer Should Know
- L1 cache reference 0.5 ns
- Branch mispredict 5 ns
- L2 cache reference 7 ns
- Mutex lock/unlock 100 ns
- Main memory reference 100 ns
- Compress 1K bytes with Zippy 10,000 ns = 10 µs
- Send 2K bytes over 1 Gbps network 20,000 ns = 20 µs
- Read 1 MB sequentially from memory 250,000 ns = 250 µs
- Round trip within the same datacenter 500,000 ns = 500 µs
- Disk seek 10,000,000 ns = 10 ms
- Read 1 MB sequentially from the network 10,000,000 ns = 10 ms
- Read 1 MB sequentially from disk 30,000,000 ns = 30 ms
- Send packet CA (California) ->Netherlands->CA 150,000,000 ns = 150 ms
High availability numbers (1 year)
- 99% - 3.65 days
- 99.9% - 8.77 hours
- 99.99% - 52.60 minutes
- 99.999% - 5.26 minutes
- 99.9999% - 31.56 seconds
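These downtime figures follow directly from (1 − availability) × one year; a quick sanity check, assuming a 365.25-day year as the table does:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 31,557,600 seconds

def downtime_per_year(availability_pct: float) -> float:
    """Seconds of allowed downtime per year for a given availability percentage."""
    return (1 - availability_pct / 100) * SECONDS_PER_YEAR

assert round(downtime_per_year(99.9) / 3600, 2) == 8.77    # 99.9%  -> 8.77 hours
assert round(downtime_per_year(99.99) / 60, 1) == 52.6     # 99.99% -> 52.6 minutes
assert round(downtime_per_year(99.999) / 60, 2) == 5.26    # 99.999% -> 5.26 minutes
```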
Queries Per Second (QPS) and QPS peak
Tips:
- Round and approximate; precision is not expected.
- Write down your assumptions.
- Practice commonly asked estimations: QPS, peak QPS, storage, cache, number of servers, etc.
3. Framework for System Design Interviews
- Understand the problem and establish design scope (3 - 10min)
- What specific features are we going to build?
- How many users does the product have?
- How fast does the company anticipate scaling up? What are the anticipated scales in 3 months, 6 months, and a year?
- What is the company’s technology stack? What existing services might you leverage to simplify the design?
- Propose high-level design and iterate (10 - 15min)
- Come up with an initial design
- Ask for feedback
- Treat your interviewer as a teammate and work together
- Draw box diagrams with key components on the whiteboard or paper (clients (mobile/web), APIs, web servers, data stores, cache, CDN, message queue)
- Do calculations to evaluate if your design fits the scale constraints
- Think out loud.
- Communicate with your interviewer if calculations are necessary before starting.
- Design deep dive (10 - 25min)
- Work with the interviewer to identify and prioritize components in the architecture:
- High-level design
- Or performance characteristics
- Or bottlenecks and resource estimations
- etc.
- Time management is essential:
- Demonstrate your abilities
- You must be armed with signals to show your interviewer
- Try not to get into unnecessary details
- Wrap up (3 - 5min)
- If you have the freedom to discuss additional points:
- Identify the system bottlenecks and discuss potential improvements
- Never say your design is perfect and nothing can be improved; show your critical thinking.
- Error cases (server failure, network loss, etc.)
- Next scale curve 100k => 1 million => 10 million users
4. Example: Design A Rate Limiter
- Prevent Denial of Service (DoS) attack.
- Reduce cost by reducing the number of servers.
- Prevent server overload.
- Requirements:
- Limit excessive requests.
- Low latency. The rate limiter should not slow down HTTP response time.
- Use as little memory as possible.
- Distributed rate limiting. The rate limiter can be shared across multiple servers or processes.
- Exception handling. Show clear exceptions to users when their requests are throttled.
- High fault tolerance. If there are any problems with the rate limiter (for example, a cache server goes offline), it does not affect the entire system.
- Design
- Building your own rate limiting service takes time. If the effort is too high, a commercial API Gateway is a better option.
- Authentication and authorization
- Service discovery integration
- Response caching
- Retry policies, circuit breaker, and QoS
- Rate limiting and throttling
- Load balancing
- Logging, tracing, correlation
- Headers, query strings, and claims transformation
- IP allow-listing
- SSL termination
- Serving static content.
- Responses will return: HTTP 429 Too Many Requests.
- Rate Limiting Algorithms (To determine when to block requests):
- Token bucket
- Easy to implement.
- Memory efficient.
- Cons: two parameters (bucket size and refill rate) that may be difficult to tune properly.
- Leaky bucket
- Memory efficient (limits queue size).
- Requests are processed at a fixed rate.
- Cons: if too many old requests fill the queue, newer requests get rejected.
- Cons: two parameters (bucket size and outflow rate) that may be difficult to tune properly.
- Fixed window
- Memory efficient.
- Easy to understand.
- Cons: spike in traffic at the edge of a window can cause more requests than allowed.
- Sliding window log
- Very accurate. In any rolling window, requests will not exceed the rate limit.
- Cons: consumes a lot of memory because even if a request is rejected, its timestamp might still be stored in memory.
- Sliding window counter
- Combines the fixed window counter and sliding window log.
- It smooths out spikes in traffic because the rate is based on the average rate of the previous window.
- Cons: it only works for not-so-strict look-back windows. It is an approximation of the actual rate because it assumes requests in the previous window are evenly distributed. Cloudflare: only 0.003% of requests are wrongly allowed or rate limited among 400 million requests.
- Algorithm chosen for the design: token bucket.
- Add a database (Redis, for example) to store the counters and parameters.
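A minimal in-process token bucket sketch. This is illustrative only: in the distributed design the token count lives in Redis, and the `TokenBucket` name and the injected clock are assumptions made here for testability:

```python
import time

class TokenBucket:
    """Token bucket: `rate` tokens are refilled per second, up to `capacity`."""

    def __init__(self, capacity: int, rate: float, now=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.now = now
        self.tokens = float(capacity)  # the bucket starts full
        self.last_refill = now()

    def allow(self) -> bool:
        current = self.now()
        # Refill tokens proportionally to elapsed time, capped at capacity.
        elapsed = current - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A fake clock makes the example deterministic.
clock = [0.0]
bucket = TokenBucket(capacity=3, rate=1.0, now=lambda: clock[0])
assert [bucket.allow() for _ in range(4)] == [True, True, True, False]
clock[0] += 2.0  # two seconds pass -> two tokens refilled
assert bucket.allow() and bucket.allow()
assert not bucket.allow()
```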
- Deep dive
- How are rate rules defined? Where are they stored?
- Load and cache rules from the rate limit service.
- Define rules and sync with cache.
- How to handle requests that are rate limited?
- Response code
- Headers
- Rate limiter in a distributed environment
- Race conditions (when updating the request counter in Redis from multiple threads).
- Locks slow down the system.
- Common Solutions: Lua script or Redis Sorted Sets
- Synchronization issues (when scaling with multiple rate limiter instances, recommended approach: centralized data store for counters, e.g. Redis).
5. Design Consistent Hashing
A method that helps distribute data among servers during horizontal scaling.
Define hashing function.
Simple hashing:
- serverIndex = hash(key) % N
- Works well when the pool size is fixed and the data is evenly distributed.
- Problem when adding or removing servers: N changes, so most keys map to a different server and need to be redistributed (causing a storm of cache misses).
- Not ideal.
Consistent hashing
- Hash space and hash ring
- Example: SHA-1 as hash function.
- => 0,…,2^160 - 1 hash space
- Ring with Servers and keys using the hash space
- Servers: S0, S1, S2, S3 (using hashing function with server name or IP address).
- Hash keys with the same function: k0, k1, k2, k3.
- k0 is stored in the first server encountered moving clockwise in the ring.
- If we add S4 between S3 and S0, ONLY the keys in that segment need to be redistributed to S4.
- The same applies when removing a server.
- This algorithm only requires redistributing a fraction of the keys.
- Cons: it is hard to keep partition sizes even; servers can end up owning very different fractions of the hash space, and keys can be distributed non-uniformly.
- Solution: virtual nodes. Each server appears at multiple positions on the ring, for example S0_1, S0_2, S0_3.
  - Virtual nodes cover multiple different segments of the hash space.
  - More virtual nodes => more balanced distribution, but more metadata to store.
  - The number of virtual nodes can be tuned.
- Real-world uses:
  - Data partitioning (e.g. Amazon DynamoDB, Cassandra): minimizes data movement during rebalancing.
  - CDNs: distribute web content evenly between servers.
  - Load balancers (e.g. Google load balancers): distribute persistent connections evenly.
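The ring can be sketched as a sorted list of virtual-node hashes plus a binary search. MD5 stands in for the hash function here (the section's example uses SHA-1), and the `server_i` virtual-node naming is illustrative:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # MD5 for illustration; the section's example uses SHA-1.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring with virtual nodes (vnodes per server)."""

    def __init__(self, servers, vnodes: int = 100):
        self.ring = sorted(
            (_hash(f"{server}_{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def get_server(self, key: str) -> str:
        # First virtual node clockwise from the key's position (wrapping around).
        idx = bisect.bisect(self.keys, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["S0", "S1", "S2", "S3"])
before = {k: ring.get_server(k) for k in ("k0", "k1", "k2", "k3")}
ring2 = HashRing(["S0", "S1", "S2", "S3", "S4"])  # add server S4
moved = [k for k in before if ring2.get_server(k) != before[k]]
# Only keys landing in segments now owned by S4 move; all other keys stay put.
```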
6. Example: Design A Key-value Store
Non-relational database where data is stored and retrieved by a unique key. Examples: Memcached, Redis, etc.
Short keys work better.
- Requirements:
- The size of a key-value pair is small: less than 10 KB.
- Ability to store big data.
- High availability: The system responds quickly, even during failures.
- High scalability: The system can be scaled to support large data sets.
- Automatic scaling: The addition/deletion of servers should be automatic based on traffic.
- Tunable consistency.
- Low latency.
- Design:
Initial approach: hash table in memory.
- Improvements
- Data compression.
- Store only frequently used data in memory; store the rest on disk.
CAP theorem
- It is impossible for a distributed system to simultaneously provide more than two of the following three guarantees:
- Consistency
- Availability
- Partition Tolerance
- Possibilities:
- CA => Network failure can always happen, so network partition must always be tolerated. CA cannot exist in real-world distributed systems.
- CP => Consistency and Partition Tolerance (Sacrifices Availability).
- AP => Availability and Partition Tolerance (Sacrifices Consistency).
- Distributed system components
- Data partition
- The full data set does not fit on a single server.
- Challenges:
- Distributing data across multiple servers evenly.
- Minimizing data movement when nodes are added or removed.
- Consistent hashing is good to solve these problems.
- Pros:
- Automatic scaling based on load.
- Heterogeneity: Servers with higher capacity can have more virtual nodes.
- Data replication
- Data must be replicated across multiple data centers to prevent data loss during an outage, disaster, or network failure.
- Consistency
- Replicas must be synchronized.
- Quorum consensus can guarantee consistency for read and write (N,W,R)
- N => Number of replicas.
- W => Number of ACKs from replicas needed to consider write successful.
- R => Number of ACKs from replicas needed to consider read successful.
- R = 1 & W = N: the system is optimized for fast read.
- W = 1 & R = N: the system is optimized for fast write.
- W + R > N: strong consistency is guaranteed (usually N = 3, W = R = 2).
- W + R <= N: strong consistency is not guaranteed.
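The W + R > N rule works because any write set and any read set of those sizes must share at least one replica (pigeonhole principle); a brute-force check of that overlap:

```python
from itertools import combinations

def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every possible write quorum shares >= 1 replica with every read quorum."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )

# W + R > N  =>  every read is guaranteed to see a replica holding the latest write.
assert quorums_always_overlap(3, 2, 2)      # N=3, W=R=2: strong consistency
assert not quorums_always_overlap(3, 1, 2)  # W + R <= N: a read can miss the write
```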
- Model:
  - Strong consistency: the client never sees out-of-date data.
    - Achieved by blocking new reads/writes until all replicas have acknowledged the current write.
    - Blocks operations.
  - Weak consistency: requests may not see the most updated value.
  - Eventual consistency: a specific form of weak consistency; given enough time, all updates propagate and replicas become consistent.
    - Example: Dynamo and Cassandra adopt eventual consistency.
    - Allows inconsistent values to enter the system, forcing the client to perform reconciliation on read.
- Inconsistency resolution
- Solved with versioning and vector clocks.
- Versioning: Treating each data modification as a new immutable version of the data (v1, v2, …).
- A vector clock is a [server, version] pair associated with a data item. It can be used to check if one version precedes, succeeds, or is in conflict with others.
- Adds complexity to the client because it needs to implement conflict resolution logic.
- Vector clock could grow rapidly. Use a threshold for the length, and if it exceeds the limit, the oldest pairs are removed (solution used by Dynamo without problems in production).
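A minimal comparison routine for vector clocks represented as `{server: version}` dicts (the `compare` helper and its return labels are illustrative):

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks: each is a {server: version} mapping."""
    servers = set(a) | set(b)
    a_le_b = all(a.get(s, 0) <= b.get(s, 0) for s in servers)
    b_le_a = all(b.get(s, 0) <= a.get(s, 0) for s in servers)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a precedes b"  # b is a descendant of a; no conflict
    if b_le_a:
        return "b precedes a"
    return "conflict"          # sibling versions; the client must reconcile

assert compare({"s0": 1}, {"s0": 1, "s1": 1}) == "a precedes b"
assert compare({"s0": 2, "s1": 1}, {"s0": 1, "s1": 2}) == "conflict"
```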
- Handling failures
- Detecting if a server is down requires at least two independent sources of information.
- Multicasting is inefficient when there are many servers.
- Better to use the gossip protocol:
- A failure detection method.
- Each node maintains a node membership list, which contains member IDs and heartbeat counters.
- Each node periodically increments its heartbeat counter.
- Each node periodically sends heartbeats to a set of random nodes, which in turn propagate to another set of nodes.
- When nodes receive heartbeats, the membership list is updated with the latest info.
- If a member’s heartbeat has not increased for a predefined period, it is considered offline.
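A toy sketch of the membership-list bookkeeping described above (the 5-second threshold and the class names are arbitrary choices for illustration; real gossip also randomizes which peers receive each round):

```python
class Member:
    def __init__(self, node_id):
        self.node_id = node_id
        self.heartbeat = 0
        self.last_updated = 0.0

class GossipNode:
    """Tracks peers' heartbeat counters; a silent peer is declared offline."""

    FAIL_AFTER = 5.0  # seconds without a heartbeat increase (arbitrary)

    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.members = {p: Member(p) for p in peers}

    def receive_gossip(self, heartbeats: dict, now: float):
        # Merge: keep the highest heartbeat counter seen for each member.
        for node_id, hb in heartbeats.items():
            member = self.members.setdefault(node_id, Member(node_id))
            if hb > member.heartbeat:
                member.heartbeat = hb
                member.last_updated = now

    def offline_members(self, now: float):
        return [m.node_id for m in self.members.values()
                if now - m.last_updated > self.FAIL_AFTER]

node = GossipNode("n0", ["n1", "n2"])
node.receive_gossip({"n1": 3, "n2": 3}, now=1.0)
node.receive_gossip({"n1": 4}, now=4.0)       # n2 goes silent
assert node.offline_members(now=8.0) == ["n2"]
```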
- Temporary failures: if the system uses strict quorum, reads and writes could be blocked during failures.
  - To improve availability, use sloppy quorum: the system chooses healthy W write servers and R read servers.
  - Writes can be temporarily stored on an available node (hinted handoff); when the failed server is back up, changes are pushed to it to achieve data consistency.
- Permanent failures: to keep replicas in sync, use the anti-entropy protocol.
  - Compares each piece of data on replicas and updates each replica to the newest version.
  - Uses Merkle trees for efficient data comparison; the amount of data synchronized is proportional to the differences between the two replicas.
  - Examples: Apache Cassandra, Amazon DynamoDB, Amazon S3, Google Cloud Storage, Redis, Memcached, etc.
- Data center outage: replicate data across multiple data centers.
- System architecture diagram.
- Key-value store communication with APIs: get(key) and put(key, value).
- A coordinator node acts as a proxy between the client and the key-value store.
- Nodes are distributed on a ring using consistent hashing.
- The system is completely decentralized, so adding and removing nodes can be automatic.
- Data is replicated at multiple nodes.
- There is no single point of failure as every node has the same set of responsibilities.
- Read path (for this specific problem)
  - Read from the memory cache first.
  - If not found in the cache, check a Bloom filter to find which SSTables may contain the key, then read from persistence: a sorted-string table (SSTable), a sorted list of <key, value> pairs.
7. Distributed Unique ID Generation
- Requirements:
- IDs must be unique.
- IDs are numerical values only.
- IDs fit into 64 bits.
- IDs are ordered by date.
- Generate 10,000 unique IDs per second.
- Design:
- Multi-master replication
- AUTO_INCREMENT
- Hard to scale due to synchronization issues.
- Can be predictable, posing a possible security issue.
- Universally unique identifier (UUID)
- Not numeric; 128-bit.
- No coordination, so no synchronization issues.
- Easy to scale.
- Ticket server
- Generates distributed primary keys (centralized AUTO_INCREMENT).
- Numeric IDs.
- SPOF (Single Point of Failure).
- Data synchronization issues if scaled.
- Twitter snowflake approach
- Numeric; 64-bit.
- Divide the ID into:
  - Sign bit: 1 bit.
  - Timestamp: 41 bits (milliseconds since a custom epoch); allows for around 69 years.
  - Datacenter ID: 5 bits (2^5 = 32 possible data centers).
  - Machine ID: 5 bits (2^5 = 32 possible machines).
  - Sequence number: 12 bits (incremented by 1 for every ID generated on a machine/process, reset to 0 every millisecond); maximum of 2^12 = 4096 IDs per millisecond per machine.
- Easy to scale.
- No coordination, so no synchronization issues.
- Considerations:
- Clock synchronization => Network Time Protocol (NTP) between machines.
- High availability: ID generation is mission-critical.
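The bit layout above can be sketched with plain shifts. The epoch constant below is Twitter's original, but any fixed start date works; this sketch omits the real generator's clock handling and sequence rollover:

```python
# Layout: 1 sign bit | 41-bit timestamp | 5-bit datacenter | 5-bit machine | 12-bit sequence
CUSTOM_EPOCH_MS = 1288834974657  # Twitter's epoch; any fixed start date works

def snowflake_id(timestamp_ms: int, datacenter_id: int, machine_id: int, sequence: int) -> int:
    assert 0 <= datacenter_id < 32 and 0 <= machine_id < 32 and 0 <= sequence < 4096
    elapsed = timestamp_ms - CUSTOM_EPOCH_MS
    return (elapsed << 22) | (datacenter_id << 17) | (machine_id << 12) | sequence

# IDs are sortable by time: a later timestamp always yields a bigger ID.
a = snowflake_id(1_700_000_000_000, 1, 2, 0)
b = snowflake_id(1_700_000_000_001, 1, 2, 0)
assert b > a
assert a.bit_length() <= 63       # fits in a signed 64-bit integer
assert (a >> 12) & 0x1F == 2      # machine ID bits round-trip
```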
8. Example: URL Shortener
Short URL: https://www.aaaaaaaaaaa.com/q=chatsystem&c=loggedin&v=v3&l=long => https://bbbbbbb.com/y7keocwj
Hash table: shortUrl -> longUrl.
[0-9, a-z, A-Z], containing 10 + 26 + 26 = 62 possible characters.
62^7 => 3,521,614,606,208 possible URLs with 7 characters (few collisions).
Options to generate the short key:
- Hash + collision resolution: MD5 or SHA-1 produce too long a hash, so it must be shortened, which requires resolving collisions.
  - Low collision probability.
  - Fixed length.
  - Secure; difficult to predict the next URL.
- Base 62 conversion: 11157 (ID in the database) => 2TX.
  - Variable length.
  - Easy to predict URLs; not secure.
- Unique ID generator:
  - Example: Snowflake => 2009215674938 (13 digits).
  - ID in base 62 => zn9edcu.
Improvements for scalability:
- Relational database (id, shortUrl, longUrl)
- Clustered index on shortUrl.
- Database scaling.
- Bloom filter.
- Cache.
- Load balancer (scales web servers).
- Rate limiter.
- Availability, consistency, reliability.
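Base 62 conversion from a numeric ID is a repeated divmod; this sketch reproduces the section's own examples (11157 => 2TX and 2009215674938 => zn9edcu) with the [0-9, a-z, A-Z] alphabet:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n: int) -> str:
    """Convert a numeric database ID to a short base-62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def from_base62(s: str) -> int:
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

assert to_base62(11157) == "2TX"                 # the database-ID example
assert to_base62(2009215674938) == "zn9edcu"     # the Snowflake-ID example
assert from_base62("2TX") == 11157
```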
9. Example: Web Crawler
A robot or spider starts by collecting a few web pages and then follows links on those pages to collect new content.
- Requirements:
- Scalability: millions of web pages.
- Robustness: not all links are good.
- Politeness: Do not make too many requests to a website within a short time interval.
- Extensibility: Minimal changes needed to support new content types, for example, image collection.
- Initial Design:
- URL Frontier.
- HTML downloader.
- Content parser (save if not seen).
- Link extractor.
- URL Filter (save if not seen).
- Return to URL Frontier.
- Design:
- The Web can be considered a graph.
- DFS vs BFS
- BFS commonly used.
- Problem: links from the same web page often point back to the same host; downloading them in parallel is inefficient (impolite) and can trigger DoS protection.
  - Solution: queues for politeness. Maintain a mapping from website hostnames to download (worker) threads; each downloader thread has a separate FIFO queue and only downloads URLs from that queue.
- Problem: does not take priority into consideration (e.g., home page > blog post).
  - Solution: prioritize URLs based on usefulness (PageRank, traffic, update frequency, etc.).
- Robots.txt.
- Extension: At the link extractor level, add workers to look for other kinds of URLs.
- Consistent hashing to distribute load across multiple instances.
- Database replication and sharding.
- Server-side rendering to support dynamic links generated by JS.
- Remove redundant content using checksums.
- Avoid infinite loops by setting a max URL length.
- Exclude noise, such as advertisements.
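The per-host politeness queues can be sketched as a hostname-to-queue mapping. Class and method names here are illustrative; a real frontier also adds priority queues and enforces a delay between requests to the same host:

```python
from collections import deque
from urllib.parse import urlparse

class PolitenessFrontier:
    """Routes each URL to a per-host FIFO queue; one worker drains one queue,
    so a single host is never hit by many parallel downloads."""

    def __init__(self, num_workers: int):
        self.queues = [deque() for _ in range(num_workers)]

    def add(self, url: str):
        host = urlparse(url).netloc
        # The same host always maps to the same queue (and thus the same worker).
        self.queues[hash(host) % len(self.queues)].append(url)

    def next_for_worker(self, worker_id: int):
        queue = self.queues[worker_id]
        return queue.popleft() if queue else None

frontier = PolitenessFrontier(num_workers=4)
frontier.add("https://example.com/a")
frontier.add("https://example.com/b")
frontier.add("https://other.org/x")
# Both example.com URLs land in the same queue, in FIFO order.
same_queue = hash("example.com") % 4
assert frontier.next_for_worker(same_queue) == "https://example.com/a"
assert frontier.next_for_worker(same_queue) == "https://example.com/b"
```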
10. Example: Notification System
- Requirements:
- Push notifications, emails, SMS.
- Millions of notifications per day.
- Design
- Push:
  - iOS: APNs (device token, payload).
  - Android: FCM (device token, payload).
- SMS
- Twilio
- Nexmo
- Email
- Sendgrid
- Mailchimp
- Architecture
- Load balancer
- API servers (expose APIs for communication)
- Enable horizontal scaling
- Implement data validation
- External DB (contains device tokens, phone numbers, emails)
- A user can have multiple devices
- Cache
- Triggers:
- Can be on-demand (service call), cron job, etc.
- Examples: Item purchase, reminder, order status change.
- Add queues by notification type to avoid SPOF and bottlenecks.
- A worker reads from the queue and performs the final sync call to the third-party notification service.
- Also, implement retry-on-error patterns.
- Design deep dive
- Reliability
- How to prevent data loss?
- Notifications can be reordered or delayed, but never lost.
- Data needs to be persisted to guarantee delivery.
- DB notification log to mark notifications as sent or pending.
- How to make sure a notification is received only once?
  - Deduplicate with a notification ID: check whether the ID has already been sent.
  - But not always: exactly-once delivery cannot be guaranteed in a distributed system ("You Cannot Have Exactly-Once Delivery").
- Other important topics
- Template reuse
- Settings (e.g., contact only by email, legal)
- Rate limiting
- Security (manage AppKey/appSecret)
- Monitor queues and notifications
- Event tracking (states: start, pending, sent, error, delivered, click, unsubscribe)
11. Example: News Feed System
- Requirements:
- Mobile and web news feed.
- Publish a post and see friends’ posts on the news feed page.
- 5,000 friends and 10 million DAU.
- Media files: Images and videos.
- Design
- For publishing:
- Load balancer.
- API endpoints
- POST /v1/me/feed
- Post Service
- Post cache.
- Post DB.
- Fan-out Service (pushes new content to friends’ feeds)
- News Feed cache.
- Notification Service (informs friends that new content is available).
- For reading news feeds:
- Load balancer.
- API endpoints
- GET /v1/me/feed
- News Feed Service
- News feed cache.
- Design deep dive
- Add authentication and rate limiting (to publish posts).
- Fan-out service:
Model
- Fan-out on write (push model)
- Feeds are pre-computed during write.
- Update all friends’ caches immediately after publication.
- Fetching is fast because feeds are pre-computed.
- Cons:
- Hotkey problem: many friends => many pre-calculations.
- Wasted computation for inactive users.
- Fan-out on read (pull model)
- Reads are generated on-demand; recent posts are pulled when the user loads the homepage.
- No hotkey problem.
- Inactive users do not consume computation resources.
- Cons:
- Fetching new feeds is slow because they are not pre-computed.
Solution: hybrid approach
- Push model for the majority of users (pre-computation).
- For celebrities, use the pull model (on-demand).
- Consistent hashing is useful to mitigate the hotkey problem.
Fan-out on write flow:
- Get friends’ IDs using a graph DB, e.g. Neo4j.
- Get friends’ data from the user cache/DB, filtering out, e.g., muted friends or inactive accounts.
- Put fan-out jobs on a message queue; fan-out workers consume them and update the news feed cache.
  - Only IDs are stored in memory (post_id, user_id).
  - Memory size stays small; set a configurable limit (users are usually only interested in the latest content, so the cache miss rate is low).
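A toy sketch of fan-out on write, using in-memory dicts in place of the friends graph DB, the message queue, and the news feed cache (all names here are illustrative):

```python
from collections import defaultdict, deque

FEED_LIMIT = 100  # configurable cap: keep only the newest post IDs per user
feed_cache = defaultdict(lambda: deque(maxlen=FEED_LIMIT))  # user_id -> post_ids
friends = {"alice": ["bob", "carol"], "bob": ["alice"]}

def fanout_on_write(author: str, post_id: int):
    """Push the new post's ID into every friend's news-feed cache (push model).
    In the real design this runs in queue workers, and celebrities are skipped:
    their posts are pulled on read instead."""
    for friend in friends.get(author, []):
        feed_cache[friend].appendleft(post_id)  # newest first

fanout_on_write("alice", 101)
fanout_on_write("alice", 102)
assert list(feed_cache["bob"]) == [102, 101]
assert list(feed_cache["carol"]) == [102, 101]
assert "alice" not in feed_cache  # the author's own feed is built separately
```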
- News feed retrieval
- Add authentication and rate limiting (to read posts).
- Add CDN.
- Read news feed cache (post_id, user_id) => this gives the most recent posts.
- Read user cache.
- Read post cache.
- Cache architecture in a news feed system.
- 5 layers of cache:
- News feed (stores only IDs).
- Content (hot cache for popular posts, normal cache for others).
- Social graph (follower, following).
- Action (liked, replied, etc.).
- Counters (like counter, reply counter, etc.).
- Other topics
- Vertical scaling vs Horizontal scaling
- SQL vs NoSQL
- Master-slave replication
- Read replicas
- Consistency models
- Database sharding
- Keep web tier stateless
- Cache data as much as you can
- Support multiple data centers
- Loosely coupled components with message queues
- Monitor key metrics. For instance, QPS during peak hours and latency while users refresh their news feed.
12. Example: Search Autocomplete System
Known as: Design top-K most searched queries.
- Requirements:
- 5 results, ordered by popularity.
- Decide whether to send request on every character, every 3 characters, or wait a few ms since the last keystroke.
- Design
- Store word–frequency table.
- Search with SQL using “WHERE word LIKE ‘prefix%’”.
- Good solution for small datasets.
- Design deep dive
- Trie data structure
- Designed for string retrieval operations.
- Each node represents a character or a prefix.
- Steps:
- O(p): Find a prefix in the trie.
- O(c): Get all the final nodes of the prefix’s subtree.
- O(c log c): Results are in alphabetical order; we need to reorder by frequency.
- Good algorithm, but too slow because we need to traverse the entire subtree to get top K.
- Optimizations:
- Limit max length of a prefix.
- Cache top search queries at each node (e.g., top 5).
- O(1) if cached. This requires significant space.
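The "cache top queries at each node" optimization can be sketched as follows: every trie node keeps its own top-k list, so a lookup costs only the prefix walk (updates happen offline via workers rather than per query). Class names are illustrative:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top = []  # cached top-k (frequency, query) pairs for this prefix

class AutocompleteTrie:
    """Trie where each node caches its top-k queries, so lookup is O(prefix length)."""

    def __init__(self, k: int = 5):
        self.root = TrieNode()
        self.k = k

    def add(self, query: str, frequency: int):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            # Refresh this prefix's cached top-k with the new query's frequency.
            node.top = sorted(
                [(f, q) for f, q in node.top if q != query] + [(frequency, query)],
                reverse=True,
            )[: self.k]

    def suggest(self, prefix: str):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for _, q in node.top]

trie = AutocompleteTrie(k=2)
trie.add("twitter", 35)
trie.add("twitch", 10)
trie.add("twilight", 25)
assert trie.suggest("tw") == ["twitter", "twilight"]  # top 2 by frequency
assert trie.suggest("twi") == ["twitter", "twilight"]
```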
- Real time:
- Updating the trie on every query (billions) slows down the query service.
- Top suggestions may not change much once the trie is built, so frequent updates are not necessary.
- Solution: Data is cached; frequency and trie are updated weekly (for example) by workers using logs and analytics.
- Logs should be aggregated first to feed the workers that will update the trie.
- Store the trie
- Serialize and store in a document DB like MongoDB.
- Key-value store (prefix -> data).
- Load balancer
- API server reads directly from trie cache; if not found, reads from DB and stores in the cache to avoid cache misses for subsequent queries.
- Web uses Ajax for background fetching of results.
- Browser cache: Suggestions do not change much (e.g., Google caches for 1 hour for a single user, “cache-control: private, max-age=3600”).
- Use data sampling for logging.
- Update the trie
- Recreate full trie and replace.
- Update individual nodes => Avoid because it requires updating all parents up to the root (and updating the cache in every node).
- Add a filter layer to remove unwanted words from the trie.
- Consider sharding the trie: e.g., a–z across 26 servers. However, it is better to create a “sharding manager” to balance based on size and frequency (e.g., words starting with “a” are much more common than x, y, z).
- Multiple languages? Consider supporting Unicode.
- Different countries? Maybe different tries by country, depending on usage.
- What about trending real-time search?
- Use sharding.
- Change weight system and give more value to recently searched queries.
- Real-time processing requires streaming data processing (Hadoop MapReduce, Spark, Kafka, etc.).
13. Example: Chat System
Requirements:
- 1:1 chat with low delivery latency.
- Small group chat (max 100 people).
- Guarantee message delivery and save messages if offline for a limited time.
- Online presence.
- Multiple device support: The same account can be logged in to multiple devices simultaneously.
- Push notifications.
Design:
Flow: sender -> chat service -> receiver.
Chat service responsibilities:
- Store message
- Relay message
Protocol
- Option: HTTP with keep-alive header for sender; polling, long polling, or WebSocket for receiver.
- Polling (short)
- Wastes resources.
- Long polling
- Longer connection time = fewer requests.
- WebSocket
- Handshake and then bi-directional channel connection.
- More difficult to scale; connection persists and needs to be managed in the server.
Solution: Hybrid approach: HTTP for sending messages and WebSockets for receiving them.
Components:
- Stateless services (load balancer, authentication, profile, group management).
- Push notification (third party, do not reinvent).
- FCM, APNS, etc.
- Stateful services (WebSockets and chat services).
- Modern servers can handle 100,000 connections.
Online/offline presence can be handled by presence servers (real-time layer like chats).
Normal data in relational database.
Chat data in key-value stores.
- Easy to scale horizontally.
- Low latency.
- Relational databases have a long-tail problem for random access (expensive).
- For searching in the chat or jumping to specific messages.
- Facebook uses HBase and Discord uses Cassandra.
Data model:
Message:
* message_id (pk) bigint
* message_from bigint
* message_to bigint
* content text
* created_at timestamp
Group_Message:
* channel_id (pk) bigint
* message_id (pk) bigint
* user_id bigint
* content text
* created_at timestamp
* message_id is used to order the messages because they can be created at the same time.
* ID: Local sequence number (local to the group or chat), much easier to implement than a global unique ID like Snowflake’s solution.
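The local sequence number idea can be sketched as a per-channel counter. This is an in-memory toy: a real chat service would persist and shard these counters, but the point is that IDs only need to be unique and increasing *within* a channel, which is much simpler than a global Snowflake-style generator:

```python
import itertools
from collections import defaultdict

class ChannelSequencer:
    """Hands out message_ids that are unique and increasing within a
    single channel only -- no global coordination required."""
    def __init__(self):
        # One independent counter per channel, created lazily.
        self._counters = defaultdict(itertools.count)

    def next_id(self, channel_id):
        return next(self._counters[channel_id])

seq = ChannelSequencer()
a = [seq.next_id("chat-1") for _ in range(3)]  # 0, 1, 2
b = seq.next_id("chat-2")                      # independent counter: 0
```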
Design:
- Service Discovery to assign the best server to the user based on geographical location, server capacity, etc. Example: Apache Zookeeper.
- Send message while offline:
- Inbox pattern.
- What happens if sender and receiver are connected to different servers?
- Need a synchronization queue (stored in K/V store) to send the message to another server and finally reach the client. Example: etcd for distributed synchronization.
- Communication between these services could be RPC.
- What happens if the receiver is not online?
=> Inbox pattern using a message queue. When online again, messages are delivered; once delivered, they are removed from the inbox.
- How to deliver a message to 100 users in a group:
=> Fan-out pattern: online users get the message pushed directly; offline members get it in their inbox.
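A minimal sketch of the fan-out, assuming an in-memory set of online users and a dict of per-user inboxes (all names hypothetical):

```python
def fan_out(message, members, online, inboxes):
    """Deliver `message` to every group member: online users get it
    pushed immediately, offline users have it queued in their inbox."""
    delivered, queued = [], []
    for user in members:
        if user in online:
            delivered.append(user)  # in a real system: push over the WebSocket
        else:
            inboxes.setdefault(user, []).append(message)
            queued.append(user)
    return delivered, queued

inboxes = {}
d, q = fan_out("hi", members=["a", "b", "c"], online={"a"}, inboxes=inboxes)
# "a" is delivered directly; "b" and "c" find "hi" in their inboxes later.
```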
- Online presence:
- We cannot rely on active WebSocket connections because the app could be in the background, sleeping, etc.
- The client sends a heartbeat (ping/keep-alive) to the presence server; if no ping is received within 1 minute, the user is marked offline.
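The heartbeat rule above as a sketch; the 60-second timeout matches the note, and the function name is hypothetical:

```python
def online_users(last_ping, now, timeout=60.0):
    """A user counts as online only if the presence server heard a
    heartbeat within `timeout` seconds; otherwise they are offline."""
    return {user for user, t in last_ping.items() if now - t <= timeout}

pings = {"alice": 100.0, "bob": 20.0}
# At t=130, alice's ping is 30s old (online); bob's is 110s old (offline).
who = online_users(pings, now=130.0)
```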
Wrap up:
- Adding media to chats: Thumbnails, compression.
- End-to-end encryption.
- Caching messages on the client.
- Message resend mechanism.
- Ordering via message IDs, timestamps, or vector clocks.
14. Example: Youtube
- Requirements
- Ability to upload videos fast.
- Smooth video streaming.
- Ability to change video quality.
- Low infrastructure cost.
- High availability, scalability, and reliability requirements.
- Clients supported: Mobile apps, web browsers, and smart TVs.
- Design
- Use Blob storage or CDNs.
- Upload video flow:
- Load balancer.
- API servers.
- Metadata cache/DB.
- Original storage (e.g., blob storage).
- Binary large objects (BLOB) are a collection of binary data stored as a single entity in a database management system.
- Transcoding servers (encoding).
- Transform video to other formats.
- Queue to completion handlers (also update metadata).
- Transcoding storage to CDNs.
- Video streaming flow:
- Your device continuously receives a video stream from a remote source video, instead of downloading the whole file first.
- Video Streaming Protocols:
- MPEG–DASH. MPEG stands for “Moving Picture Experts Group” and DASH stands for “Dynamic Adaptive Streaming over HTTP”.
- Apple HLS. HLS stands for “HTTP Live Streaming”.
- Microsoft Smooth Streaming.
- Adobe HTTP Dynamic Streaming (HDS).
- Directly from the closest server using CDN.
- Design:
- Transcoding
- Different bitrates/qualities/resolutions require different internet speeds.
- Raw video consumes a large amount of storage.
- Compatibility with devices.
- Deliver high resolution if high bandwidth is available.
- Switch video quality depending on network conditions.
- Different containers (avi, mp4, webm, etc.) and codecs for compression and decompression (H.264, VP9, HEVC, 4K ProRes, etc.); this will depend on devices (e.g., old phones).
- Directed acyclic graph (DAG) model.
- Transcoding a video is computationally expensive and time-consuming.
- Execute sequential or parallel tasks during encoding (e.g., Facebook DAG model).
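The DAG model can be sketched with Python's `graphlib`: independent tasks may run in parallel, while dependent ones wait for their prerequisites. The task names below (split, encode, watermark, assemble) are hypothetical, not the book's exact pipeline:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "split":     [],
    "video_enc": ["split"],
    "audio_enc": ["split"],
    "thumbnail": ["split"],
    "watermark": ["video_enc"],
    "assemble":  ["watermark", "audio_enc"],
}

# static_order() yields a valid execution order: every task appears
# after all of its prerequisites.
order = list(TopologicalSorter(dag).static_order())
```

A real scheduler would dispatch all currently-ready tasks to workers concurrently rather than running them one by one.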
- Safety optimization: Pre-signed upload URL (e.g., Amazon S3) or Shared Access Signature (e.g., Azure Blob Storage).
- This speeds up upload times because content is delivered directly to blob storage.
- Enable multipart upload (multiple chunks) in parallel.
- For performance: Transcode and distribute segments while uploading.
- Cost-saving.
- More-viewed videos go to CDNs.
- Less-viewed videos go to video servers.
- Less-viewed videos do not need all encodings or can be processed on-demand.
- Popular videos by region do not require global distribution.
- Errors
- Implement recoverable error strategies like retry.
- Live streaming
- Even lower latency is required, so another protocol is likely needed.
- .mpd (for DASH manifest)
- .m3u8 (for HLS manifest)
- The player decides, based on network conditions, which segment to request and play next.
15. Example: Google Drive
Requirements
- Upload, download files.
- File sync.
- Notifications.
- Encrypted.
- All file formats.
- Data loss is not an option.
- High availability and scalability.
Design
- Single server
- Upload endpoint.
- Normal upload for small files.
- Resumable upload for big files.
- Download.
- All protected by SSL using HTTPS.
- Considering FTP?
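The resumable upload mentioned above can be sketched as an offset-based protocol: after a failure, the client asks the server how many bytes arrived and resumes from there. This is a toy in-memory model; real services (e.g., Amazon S3) expose multipart upload APIs for this:

```python
class ResumableUpload:
    """Server-side sketch: track received bytes so a client can
    resume from the last confirmed offset instead of restarting."""
    def __init__(self):
        self.received = bytearray()

    def offset(self):
        return len(self.received)  # client queries this after a failure

    def append(self, chunk, at):
        if at != len(self.received):  # reject out-of-order chunks
            raise ValueError("resume from offset %d" % len(self.received))
        self.received.extend(chunk)

up = ResumableUpload()
up.append(b"hello ", at=0)
# ...connection drops; client asks for the current offset and resumes...
up.append(b"world", at=up.offset())
```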
- Upload endpoint.
- Use Amazon S3.
- Fast, secure, and cheap storage.
- Bucket redundancy in the same region.
- Bucket redundancy in multiple regions.
- Load balancer.
- Multiple web servers for horizontal scaling.
- Sync conflicts:
- Two users modify the same file or folder.
- Solution: First modification is accepted; the second one gets a conflict and is saved as a conflicted copy:
- File.txt
- File (1).txt
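First-writer-wins conflict detection can be sketched with optimistic versioning (a hypothetical in-memory model): each save carries the version it was based on, and a stale version is rejected so the client can keep a conflicted copy:

```python
class VersionedFile:
    """First write against the current version wins; a writer holding
    a stale version gets a conflict and must save a separate copy."""
    def __init__(self, content=b"", version=0):
        self.content, self.version = content, version

    def save(self, new_content, base_version):
        if base_version != self.version:
            return False  # conflict: caller saves "File (1).txt" instead
        self.content = new_content
        self.version += 1
        return True

f = VersionedFile()
ok1 = f.save(b"user1 edit", base_version=0)  # accepted
ok2 = f.save(b"user2 edit", base_version=0)  # stale base -> conflict
```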
- Differential Synchronization for multiple users.
- Divide and conquer: Use block servers.
- Files are split into blocks.
- Blocks get a content hash.
- Block size of 4MB (like Dropbox).
- To reassemble a file, fetch its blocks and join them in order.
- Files get a checksum.
- Chunking, compression, and encryption are implemented in a single server (in a single language/platform, instead of having this responsibility in the clients, for security and maintainability).
- Inactive blocks eventually go to cold storage.
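The block-splitting scheme above can be sketched as content-addressed chunks: each block is keyed by the hash of its content, and the file keeps an ordered list of block hashes plus an overall checksum. A tiny block size is used here for the demo instead of 4 MB:

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, like Dropbox

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a file into fixed-size blocks addressed by content hash;
    identical blocks are automatically stored only once."""
    blocks, order = {}, []
    for i in range(0, len(data), block_size):
        chunk = data[i:i + block_size]
        h = hashlib.sha256(chunk).hexdigest()
        blocks[h] = chunk        # duplicate blocks de-duplicate here
        order.append(h)
    checksum = hashlib.sha256(data).hexdigest()
    return blocks, order, checksum

blocks, order, checksum = split_into_blocks(b"aaaabbbbaaaa", block_size=4)
# "aaaa" appears twice in the file but is stored once.
```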
- Metadata database.
- User info, blocks, versions, upload timestamp.
- Notification service.
- With offline backup queue.
- Changes will be synced when the client is online again.
- Notify clients when files are added, modified, or deleted.
- Long polling with clients (like Dropbox).
- Clients try to stay connected.
- WebSockets are not needed: a bidirectional channel is unnecessary, and files do not change frequently.
- Download file.
- For sync, get changes triggered by notification and only download changes.
- Save storage:
- De-duplicate data blocks among file versions. Redundant blocks are not necessary; only changes are stored.
- Limit the number of versions.
- Keep valuable versions only.
- Move infrequently used data to cold storage (e.g., Amazon S3 Glacier).
Delta synchronization:
- Only modified blocks are synced, instead of the whole file, using a sync algorithm.
- Compress blocks to reduce data size.
- The compression algorithm depends on the file type.
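Delta sync then reduces to comparing block-hash lists between file versions: only blocks whose hashes the server has not seen need to be uploaded. A minimal sketch:

```python
def changed_blocks(old_hashes, new_hashes):
    """Return the block hashes in the new version that the server
    does not already have -- only these must be uploaded."""
    known = set(old_hashes)
    return [h for h in new_hashes if h not in known]

old = ["h1", "h2", "h3"]
new = ["h1", "hX", "h3"]   # one block was modified
to_upload = changed_blocks(old, new)
```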
Blocks are encrypted before being sent to cloud storage.
Strong consistency:
- Data in cache replicas and the master is consistent.
- Invalidate cache on database writes.
- In relational databases, consistency is easy because of ACID; in NoSQL, ACID properties need to be programmatically incorporated.
Error handling.
- Load balancer failure => A secondary becomes active.
- Block server failure => Other servers pick up unfinished or pending jobs.
- S3 failure => Another copy is used because of geographical and regional redundancy.
- API server failure => API servers are stateless, so traffic is redirected to another active instance.
- Metadata cache failure => Automatic access to another one; caches are replicated and failed ones will be replaced.
- Metadata DB failure.
- Master down: Promote a slave to temporarily be a new master, create a new slave.
- Slave down: Use another slave, create a new slave.
- Notifications service failure => Long-poll connections are closed; client connections will try to eventually reconnect to another healthy server.
- Offline backup queue failure => Queues are replicated; consumers need to re-subscribe to the backup queue.