Book notes: System Design Interview: Volume 1

A 2020 book by Alex Xu offering a hands-on introduction to system design interviews, with examples of real-world scalable systems.

More about System Design

Here are my notes:

📃 Content:

1. System Design Basics

How to scale from zero to millions of users:

  • Start simple, a single server, and iterate only when needed.
    • Metrics: user base, traffic, use case latency, costs…
  • Which Database to use?
    • Relational (RDBMS) by default. Worked well for over 40 years.
    • If the use case is not suitable, ask:
      • Does the application require super-low latency?
      • Unstructured data or non‑relational data?
      • Do you only need to serialize and deserialize data (JSON, XML, etc.)?
      • Do you need to store a massive amount of data?
      • Solution: Non-Relational (NoSQL)
  • Scalability
    • Vertical: scale-up, CPU, RAM, etc.
      • Simple, and good for low traffic.
      • Limited by hardware in a single server.
      • No failover or redundancy. Server down means business down.
    • Horizontal: scale-out, more servers to the pool.
  • Load Balancer
    • If one server goes down, the balancer redirects traffic to available servers.
    • If traffic grows, more servers can be added without changing the request flow.
  • Database Replication
    • Master (original) / slave (copies), only the master accepts writes.
    • In general, systems do many more reads than writes, so there are more slaves than masters.
    • Better performance: more queries running in parallel.
    • More reliability: if one database is destroyed, data is preserved due to replication.
    • Disaster recovery:
      • Slave down: traffic goes to another slave or even to the master.
      • Master down: a slave is promoted to be a new master (data might need to be restored).
        • Other option: multi-master and circular replication.
  • Cache
    • Temporary storage, much faster than the database, that stores expensive responses of frequently accessed data in memory.
    • Use when data is accessed frequently but modified infrequently.
    • Cache is volatile; important data must be persisted in data stores.
    • Defining an expiration policy is good practice.
      • Not too short (to avoid unnecessary data reload).
      • Not too long (data can become stale or unused).
    • Consistency trade-off: sync cache with data store; they are not in the same transaction.
    • Possible Single point of failure (SPOF).
      • Recommend multiple cache servers across data centers.
      • Overprovision the required memory by certain percentages.
    • Eviction policy: once the cache is full, adding new items requires evicting existing ones.
      • Least-recently-used (LRU) policy (most common).
      • Least-frequently-used (LFU) policy.
      • First-in-first-out (FIFO) policy.
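The LRU policy above can be sketched in a few lines with Python's `OrderedDict`; the class and method names are illustrative, not from the book:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least-recently-used key when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the LRU entry
```

A production cache (Redis, Memcached) would add TTL-based expiration and thread safety on top of this core idea.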
  • Content delivery network (CDN)
    • Content is delivered from the closest (lowest latency) CDN server.
    • Cache only relevant and frequent data (to avoid unnecessary costs).
    • Define cache expiration.
      • Not too long if data is time-sensitive; data needs to be fresh.
      • Not too short; to avoid repeated reloading.
    • Fallback: what happens if CDN is not working? Get data directly from the origin.
    • Invalidate files:
      • Call the CDN API for specific object invalidation.
      • Object versioning: e.g. image.png?v=2.
  • Stateless web servers
    • Apply REST principles.
    • Store sessions in persistent storage.
    • Sticky Sessions can be a problem.
  • Multiple data centers/zones
    • Geo-routed worldwide, routes client traffic to the nearest data center.
    • In failover: data is replicated in multiple data centers, and traffic routed to the available one.
    • For example: GeoDNS
    • Data should be geo‑replicated in order to achieve multi-data-center consistency.
    • Automate deployments and tests in multiple data centers to guarantee consistency.
  • Message queue
    • Enables asynchronous communication and asynchronous request processing.
    • Decouples requests from processing, which adds more reliability to the system.
    • If workers are unavailable, producers can still publish messages to the queues.
    • Workers can work when producers are not available.
    • The number of workers can be scaled depending on queue size.
  • Logging & Metrics for Observability
    • Monitor error logs at the server level or use tools to aggregate them to a centralized service for easy search and view.
    • Metrics:
      • Host level: CPU, memory, disk I/O, etc.
      • Aggregated level of tiers: database, web servers, cache.
      • Key business metrics: users, retention, revenue, etc.
  • Database scaling
    • Vertical: more CPU, RAM.
    • Single point of failure.
    • High cost.
    • Horizontal scaling: Sharding
      • Same schema, but database is partitioned in smaller, easy-to-manage parts (shards).
      • Define a good partition key to distribute data
        • One or more columns that determine how data is distributed physically in shards.
      • Challenges:
        • Resharding data
          • One shard is full, and the sharding function needs to change to balance data between shards.
          • Solution: Consistent hashing
        • Celebrity problem (hotspot key problem).
          • Some data inside a shard receives a disproportionate amount of traffic (celebrity).
          • We may need to allocate dedicated shards for each celebrity.
        • Join and de-normalization
          • Data is split across shards, making join operations difficult.
          • Solution: De-normalize the database to perform queries on a single table.
  • Scaling is an iterative process
    • For example, the business may realize it needs to split services into smaller microservices.

2. Estimations

  • Powers of two as bytes:

    • 2^3 = 8 bits = 1 byte (1 ASCII character)
    • 2^10 = 1,024 bytes ≈ 1 thousand = 1 Kilobyte (1 KB)
    • 2^20 ≈ 1 million bytes = 1 Megabyte (1 MB)
    • 2^30 ≈ 1 billion bytes = 1 Gigabyte (1 GB)
    • 2^40 ≈ 1 trillion bytes = 1 Terabyte (1 TB)
    • 2^50 ≈ 1 quadrillion bytes = 1 Petabyte (1 PB)
  • Latency Numbers Every Programmer Should Know

    • L1 cache reference 0.5 ns
    • Branch mispredict 5 ns
    • L2 cache reference 7 ns
    • Mutex lock/unlock 100 ns
    • Main memory reference 100 ns
    • Compress 1K bytes with Zippy 10,000 ns = 10 µs
    • Send 2K bytes over 1 Gbps network 20,000 ns = 20 µs
    • Read 1 MB sequentially from memory 250,000 ns = 250 µs
    • Round trip within the same datacenter 500,000 ns = 500 µs
    • Disk seek 10,000,000 ns = 10 ms
    • Read 1 MB sequentially from the network 10,000,000 ns = 10 ms
    • Read 1 MB sequentially from disk 30,000,000 ns = 30 ms
    • Send packet CA (California) -> Netherlands -> CA 150,000,000 ns = 150 ms
  • High availability numbers (1 year)

    • 99% - 3.65 days
    • 99.9% - 8.77 hours
    • 99.99% - 52.60 minutes
    • 99.999% - 5.26 minutes
    • 99.9999% - 31.56 seconds
  • Daily Active Users (DAU)

  • Queries Per Second (QPS) and QPS peak

  • Tips:

    • Round and approximate; precision is not expected.
    • Write down your assumptions.
    • Practice commonly asked estimations: QPS, peak QPS, storage, cache, number of servers, etc.
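A worked example of these estimations, using made-up numbers (10 million DAU, 10 requests per user per day, a 2x peak factor) purely for illustration:

```python
# All inputs below are assumptions for illustration; round aggressively.
dau = 10_000_000            # assumed daily active users
requests_per_user = 10      # assumed requests per user per day
seconds_per_day = 86_400

qps = dau * requests_per_user / seconds_per_day   # average queries per second
peak_qps = 2 * qps                                # rule of thumb: peak ~= 2x average

# Storage: assume 10% of requests write a 1 KB record, retained for 5 years.
writes_per_day = dau * requests_per_user * 0.10
storage_5y_tb = writes_per_day * 1_000 * 365 * 5 / 1e12
```

This gives roughly 1,200 QPS average, ~2,300 QPS at peak, and ~18 TB of storage over 5 years; precision is not the point, the order of magnitude is.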

3. Framework for System Design Interviews

  1. Understand the problem and establish design scope (3 - 10min)
    • What specific features are we going to build?
    • How many users does the product have?
    • How fast does the company anticipate scaling up? What are the anticipated scales in 3 months, 6 months, and a year?
    • What is the company’s technology stack? What existing services might you leverage to simplify the design?
  2. Propose high-level design and iterate (10 - 15min)
    • Come up with an initial design
    • Ask for feedback
    • Treat your interviewer as a teammate and work together
    • Draw box diagrams with key components on the whiteboard or paper (clients (mobile/web), APIs, web servers, data stores, cache, CDN, message queue)
    • Do calculations to evaluate if your design fits the scale constraints
    • Think out loud.
    • Communicate with your interviewer if calculations are necessary before starting.
  3. Design deep dive (10 - 25min)
    • Work with the interviewer to identify and prioritize components in the architecture:
      • High-level design
      • Or performance characteristics
      • Or bottlenecks and resource estimations
      • etc.
    • Time management is essential:
      • Demonstrate your abilities
      • You must be armed with signals to show your interviewer
      • Try not to get into unnecessary details
  4. Wrap up (3 - 5min)
    • If you have the freedom to discuss additional points:
    • Identify the system bottlenecks and discuss potential improvements
    • Never say your design is perfect and nothing can be improved; show your critical thinking.
    • Error cases (server failure, network loss, etc.)
    • Next scale curve 100k => 1 million => 10 million users

4. Example: Design A Rate Limiter

  1. Requirements:
    • Limit excessive requests.
    • Low latency. The rate limiter should not slow down HTTP response time.
    • Use as little memory as possible.
    • Distributed rate limiting. The rate limiter can be shared across multiple servers or processes.
    • Exception handling. Show clear exceptions to users when their requests are throttled.
    • High fault tolerance. If there are any problems with the rate limiter (for example, a cache server goes offline), it does not affect the entire system.
  2. Design
    • Building your own rate-limiting service takes time. If the effort is too high, a commercial API gateway is a better option. API gateways typically provide:
      • Authentication and authorization
      • Service discovery integration
      • Response caching
      • Retry policies, circuit breaker, and QoS
      • Rate limiting and throttling
      • Load balancing
      • Logging, tracing, correlation
      • Headers, query strings, and claims transformation
      • IP allow-listing
      • SSL termination
      • Serving static content.
    • Responses will return: HTTP 429 Too Many Requests.
    • Rate Limiting Algorithms (To determine when to block requests):
      • Token bucket
        • Easy to implement.
        • Memory efficient.
        • Cons: two parameters, bucket size and refill rate.
      • Leaky bucket
        • Memory efficient (limits queue size).
        • Requests are processed at a fixed rate.
        • Cons: if too many old requests fill the queue, newer requests get rejected.
        • Cons: two parameters, bucket size and outflow rate.
      • Fixed window
        • Memory efficient.
        • Easy to understand.
        • Cons: spike in traffic at the edge of a window can cause more requests than allowed.
      • Sliding window log
        • Very accurate. In any rolling window, requests will not exceed the rate limit.
        • Cons: consumes a lot of memory because even if a request is rejected, its timestamp might still be stored in memory.
      • Sliding window counter
        • Hybrid of the fixed window counter and the sliding window log.
        • Smooths out traffic spikes; memory efficient.
        • Cons: an approximation; it assumes requests in the previous window are evenly distributed.
    • Use a fast in-memory store (for example, Redis, with INCR and EXPIRE) for the counters; disk-based databases are too slow.
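As a sketch, the token bucket algorithm with its two parameters might look like this (a single-process toy, not the distributed Redis-backed version):

```python
import time

class TokenBucket:
    """Token bucket limiter; bucket_size and refill_rate are its two parameters."""

    def __init__(self, bucket_size: int, refill_rate: float):
        self.bucket_size = bucket_size
        self.refill_rate = refill_rate        # tokens added per second
        self.tokens = float(bucket_size)      # bucket starts full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at bucket size
        self.tokens = min(self.bucket_size,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # request should be rejected with HTTP 429
```

In a distributed setup the token count would live in a centralized store such as Redis, with care taken around race conditions.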
  3. Deep dive
    • How are rate rules defined? Where are they stored?
      • Load and cache rules from the rate limit service.
      • Define rules and sync with cache.
    • How to handle requests that are rate limited?
      • Response code
      • Headers
    • Rate limiter in a distributed environment
      • Race conditions (when updating the request counter in Redis from multiple threads).
      • Synchronization issues (when scaling with multiple rate limiter instances, recommended approach: centralized data store for counters, e.g. Redis).

5. Design Consistent Hashing

  • Video

  • A method that helps distribute data among servers during horizontal scaling.

  • Define hashing function.

  • Simple hashing:

    • serverIndex = hash(key) % N
    • Works when the pool size is fixed and the data distribution is even.
      • Problem when adding or removing servers: N changes, so most keys map to a different server and need to be redistributed (e.g., going from N = 4 to N = 3 remaps most keys).
      • Not ideal.
  • Consistent hashing

    • Hash space and hash ring
    • Example: SHA-1 as hash function.
    • => 0,…,2^160 - 1 hash space
    • Ring with Servers and keys using the hash space
    • Servers: S0, S1, S2, S3 (using hashing function with server name or IP address).
    • Hash keys with the same function: k0, k1, k2, k3.
    • k0 is stored in the first server encountered moving clockwise in the ring.
    • If we add S4 between S3 and S0, ONLY the keys in that segment need to be redistributed to S4.
    • The same applies when removing a server.
    • This algorithm only requires redistributing a fraction of the keys.
    • Cons: with few servers, each one can end up owning uneven fractions of the hash space, so keys are not evenly distributed.
      => Solution: virtual nodes. Servers appear in multiple locations in the ring, for example S0_1, S0_2, S0_3.
      => Virtual nodes should cover multiple different segments of the hashing space.
      => More virtual nodes => distribution is more balanced, but it takes more space to store metadata.
      => Virtual node number can be tuned.
    • (Examples: Amazon DynamoDB and Cassandra) use this for data partitioning
      => Helps to minimize data movement while doing rebalancing.
    • (Example: CDNs)
      => Distribute web content evenly between servers.
    • (Example: load balancers like Google Load Balancers)
      => Distribute persistent connections evenly.
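A minimal sketch of a hash ring with virtual nodes, using SHA-1 as in the example above (class and parameter names are illustrative):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # SHA-1 gives a 160-bit hash space: 0 .. 2^160 - 1
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, vnodes: int = 100):
        self.ring = []                       # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):          # virtual nodes, e.g. "S0_1", "S0_2"
                self.ring.append((_hash(f"{server}_{i}"), server))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    def lookup(self, key: str) -> str:
        # first server encountered moving clockwise from the key's position
        idx = bisect.bisect(self.keys, _hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

Rebuilding the ring with one extra server remaps only the fraction of keys falling in the new server's segments, which is the property the notes highlight.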

6. Example: Design A Key-value Store

  • Video

  • Non-relational database where data is stored and identified by unique key. Example: Memcached, Redis, etc.

  • Short keys work better.

  1. Requirements:
    • The size of a key-value pair is small: less than 10 KB.
    • Ability to store big data.
    • High availability: The system responds quickly, even during failures.
    • High scalability: The system can be scaled to support large data sets.
    • Automatic scaling: The addition/deletion of servers should be automatic based on traffic.
    • Tunable consistency.
    • Low latency.
  2. Design:
    • Initial approach: hash table in memory.

      • Improvements
      • Data compression.
      • Store only frequently used data in memory; store the rest on disk.
    • CAP theorem

      • It is impossible for a distributed system to simultaneously provide more than two of the following three guarantees:
      • Consistency
      • Availability
      • Partition Tolerance
      • Possibilities:
        • CA => Network failure can always happen, so network partition must always be tolerated. CA cannot exist in real-world distributed systems.
        • CP => Consistency and Partition Tolerance (Sacrifices Availability).
        • AP => Availability and Partition Tolerance (Sacrifices Consistency).
  3. Distributed system components
    • Data partition
      • All data does not fit in a single server.
      • Challenges:
      • Distributing data across multiple servers evenly.
      • Minimizing data movement when nodes are added or removed.
      • Consistent hashing is good to solve these problems.
      • Pros:
      • Automatic scaling based on load.
      • Heterogeneity: Servers with higher capacity can have more virtual nodes.
    • Data replication
      • Data must be replicated across multiple data centers to prevent data loss during an outage, disaster, or network failure.
    • Consistency
      • Replicas must be synchronized.
      • Quorum consensus can guarantee consistency for read and write (N,W,R)
        • N => Number of replicas.
        • W => Number of ACKs from replicas needed to consider write successful.
        • R => Number of ACKs from replicas needed to consider read successful.
        • R = 1 & W = N: the system is optimized for fast read.
        • W = 1 & R = N: the system is optimized for fast write.
        • W + R > N: strong consistency is guaranteed (usually N = 3, W = R = 2).
        • W + R <= N: strong consistency is not guaranteed.
      • Model:
        • Strong consistency: The client never sees out-of-date data.
          => Achieved by blocking new reads/writes until all replicas have acknowledged the current write.
          => Blocks operations.
        • Weak consistency: Requests may not see the most updated value.
        • Eventual consistency: A specific type of weak consistency; given enough time, all updates are propagated, and replicas become consistent.
          => Example: Dynamo and Cassandra adopt eventual consistency.
          => Allows inconsistent values to enter the system, forcing the client to perform reconciliation on read.
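The (N, W, R) quorum rules above reduce to simple arithmetic; a tiny helper, with profile labels taken from the notes:

```python
def quorum_profile(n: int, w: int, r: int):
    """Classify an (N, W, R) configuration and check for strong consistency."""
    if w == 1 and r == n:
        profile = "fast write"
    elif r == 1 and w == n:
        profile = "fast read"
    else:
        profile = "balanced"
    # W + R > N guarantees the read and write quorums overlap in >= 1 replica
    strongly_consistent = w + r > n
    return profile, strongly_consistent
```

For example, the common N = 3, W = R = 2 configuration is balanced and strongly consistent, while W = R = 1 trades consistency for latency.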
    • Inconsistency resolution
      • Solved with versioning and vector clocks.
      • Versioning: Treating each data modification as a new immutable version of the data (v1, v2, …).
      • A vector clock is a [server, version] pair associated with a data item. It can be used to check if one version precedes, succeeds, or is in conflict with others.
      • Adds complexity to the client because it needs to implement conflict resolution logic.
      • Vector clock could grow rapidly. Use a threshold for the length, and if it exceeds the limit, the oldest pairs are removed (solution used by Dynamo without problems in production).
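A sketch of the vector clock comparison described above, representing a clock as a `{server: version}` dict (an assumed encoding; Dynamo stores it as a list of pairs):

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks: precedes, succeeds, equal, or conflict."""
    servers = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(s, 0) <= vc_b.get(s, 0) for s in servers)
    b_le_a = all(vc_b.get(s, 0) <= vc_a.get(s, 0) for s in servers)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "precedes"   # a is an ancestor of b; b can safely overwrite a
    if b_le_a:
        return "succeeds"
    return "conflict"       # concurrent siblings: the client must reconcile
```

The "conflict" case is exactly where the client-side reconciliation logic mentioned in the notes kicks in.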
    • Handling failures
      • Detecting if a server is down requires at least two independent sources of information.
      • Multicasting is inefficient when there are many servers.
      • Better to use the gossip protocol:
        • A failure detection method.
        • Each node maintains a node membership list, which contains member IDs and heartbeat counters.
        • Each node periodically increments its heartbeat counter.
        • Each node periodically sends heartbeats to a set of random nodes, which in turn propagate to another set of nodes.
        • When nodes receive heartbeats, the membership list is updated with the latest info.
        • If a member’s heartbeat has not increased for a predefined period, it is considered offline.
      • Temporary failures: If the system uses strict quorum, reads and writes could be blocked during failures.
        => To improve availability, use sloppy quorum (the system chooses healthy W & R servers).
        => This allows the write to be temporarily stored on an available node (hinted handoff).
        => When the server is back up, changes are pushed to it to achieve data consistency.
      • Permanent failures: To keep replicas in sync, use the anti-entropy protocol.
        => Compares each piece of data on replicas and updates each replica to the newest version.
        => Uses Merkle trees for efficient data comparison; the amount of data synchronized is proportional to the differences between the two replicas.
        => Example: Apache Cassandra, Amazon DynamoDB, Amazon S3, Google Cloud Storage, Redis, Memcached, etc.
      • Data center outage:
        => Replicate data across multiple data centers.
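The gossip-based failure detection steps above can be sketched as follows, using a logical clock instead of wall time for determinism (a simplification; real implementations also randomize gossip targets):

```python
class GossipNode:
    """Sketch of gossip-style failure detection using a logical clock."""

    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.clock = 0
        # membership list: node_id -> [heartbeat counter, local time it last grew]
        self.membership = {p: [0, 0] for p in list(peers) + [node_id]}

    def tick(self):
        """One local time step: increment our own heartbeat counter."""
        self.clock += 1
        self.membership[self.node_id][0] += 1

    def receive(self, remote_membership):
        """Merge a gossiped membership list, keeping the highest heartbeat seen."""
        for node, (heartbeat, _) in remote_membership.items():
            entry = self.membership.setdefault(node, [0, 0])
            if heartbeat > entry[0]:
                entry[0], entry[1] = heartbeat, self.clock

    def offline(self, timeout):
        """Members whose heartbeat has not increased within `timeout` ticks."""
        return {n for n, (_, last_seen) in self.membership.items()
                if n != self.node_id and self.clock - last_seen > timeout}
```

A node that stops heartbeating will eventually be flagged offline by every peer that has stopped seeing its counter grow.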
    • System architecture diagram.
    • Key-value store communication with APIs: get(key) and put(key, value).
    • A coordinator node acts as a proxy between the client and the key-value store.
    • Nodes are distributed on a ring using consistent hashing.
      • The system is completely decentralized, so adding and removing nodes can be automatic.
      • Data is replicated at multiple nodes.
      • There is no single point of failure as every node has the same set of responsibilities.
    • Read path (For this specific problem)
      • Read from the memory cache.
      • If not found in cache or Bloom filter, read from persistence: a sorted-string table (SSTable), which is a sorted list of “<key, value>” pairs.

7. Distributed Unique ID Generation

  1. Requirements:
    • IDs must be unique.
    • IDs are numerical values only.
    • IDs fit into 64 bits.
    • IDs are ordered by date.
    • Generate 10,000 unique IDs per second.
  2. Design:
  • Multi-master replication
    • AUTO_INCREMENT
    • Hard to scale due to synchronization issues.
    • Can be predictable, posing a possible security issue.
  • Universally unique identifier (UUID)
    • Not numeric; 128-bit.
    • No coordination, so no synchronization issues.
    • Easy to scale.
  • Ticket server
    • Generates distributed primary keys (centralized AUTO_INCREMENT).
    • Numeric IDs.
    • SPOF (Single Point of Failure).
    • Data synchronization issues if scaled.
  • Twitter snowflake approach
    • Numeric; 64-bit.
    • Divide ID into:
      • Sign bit: 1 bit.
      • Timestamp: 41 bits (milliseconds since epoch).
        => Allows for around 69 years.
      • Datacenter ID: 5 bits (2^5 => 32 possible data centers).
      • Machine ID: 5 bits (2^5 => 32 possible machines).
      • Sequence number: 12 bits (incremented by 1 in every machine/process, reset to 0 every millisecond).
        => Maximum of 2^12 = 4096 IDs per millisecond per machine.
    • Easy to scale.
    • No coordination, so no synchronization issues.
    • Considerations: clock synchronization, section length tuning (e.g., fewer sequence bits and more timestamp bits for low-concurrency, long-lived applications), and high availability.
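The bit layout above can be packed with shifts; the epoch constant below is Twitter's commonly cited snowflake epoch, but any fixed start date works:

```python
# Bit layout from the notes: 1 sign + 41 timestamp + 5 datacenter + 5 machine + 12 sequence.
DC_BITS, MACHINE_BITS, SEQ_BITS = 5, 5, 12
CUSTOM_EPOCH_MS = 1_288_834_974_657  # Twitter's epoch (Nov 2010); any fixed epoch works

def make_id(timestamp_ms: int, dc_id: int, machine_id: int, seq: int) -> int:
    """Pack the four fields into one 64-bit integer."""
    assert 0 <= dc_id < 2 ** DC_BITS and 0 <= machine_id < 2 ** MACHINE_BITS
    assert 0 <= seq < 2 ** SEQ_BITS
    return (((timestamp_ms - CUSTOM_EPOCH_MS) << (DC_BITS + MACHINE_BITS + SEQ_BITS))
            | (dc_id << (MACHINE_BITS + SEQ_BITS))
            | (machine_id << SEQ_BITS)
            | seq)
```

Because the timestamp occupies the highest bits, IDs generated later compare larger, which is what makes them sortable by date.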

8. Example: URL Shortener

  • Video

  • Short URL: https://www.aaaaaaaaaaa.com/q=chatsystem&c=loggedin&v=v3&l=long => https://bbbbbbb.com/y7keocwj

  • Hash table: shortUrl -> longUrl.

  • [0-9, a-z, A-Z], containing 10 + 26 + 26 = 62 possible characters.

  • 62^7 => 3,521,614,606,208 possible URLs with 7 characters (few collisions).

  • Hash functions (MD5, SHA-1) produce hashes that are too long; take only the first 7 characters and resolve collisions (e.g., by re-querying until a free short URL is found).
    => Low collision probability.
    => Fixed size.
    => Secure; difficult to predict the next URL.

  • Base 62 conversion: 11157 (using ID in database) => 2TX.
    => Variable length.
    => Easy to predict URLs; not secure.

  • Unique ID generator
    => Example: Snowflake => 2009215674938 (13 digits).
    => ID in base 62 => zn9edcu.
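A base-62 encoder matching the note's example (11157 => 2TX), assuming the digit order 0-9, a-z, A-Z:

```python
CHARSET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n: int) -> str:
    """Convert a non-negative integer (e.g. a database ID) to base 62."""
    if n == 0:
        return CHARSET[0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, 62)
        digits.append(CHARSET[remainder])
    return "".join(reversed(digits))  # most significant digit first
```

With 7 characters this covers 62^7 ≈ 3.5 trillion URLs, matching the capacity estimate above.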

  • Improvements for scalability:

    • Relational database (id, shortUrl, longUrl)
      • Clustered index on shortUrl.
      • Database scaling.
      • Bloom filter.
    • Cache.
    • Load balancer.
      => Scales web servers.
    • Rate limiter.
    • Availability, consistency, reliability.

9. Example: Web Crawler

A robot or spider starts by collecting a few web pages and then follows links on those pages to collect new content.

  1. Requirements:
    • Scalability: millions of web pages.
    • Robustness: not all links are good.
    • Politeness: Do not make too many requests to a website within a short time interval.
    • Extensibility: Minimal changes needed to support new content types, for example, image collection.
  2. Initial Design:
    • URL Frontier.
    • HTML downloader.
    • Content parser (save if not seen).
    • Link extractor.
    • URL Filter (save if not seen).
    • Return to URL Frontier.
  3. Design:
    • The Web can be considered a graph.
    • DFS vs BFS
      • BFS commonly used.
      • Problem: Links from the same web page often link back to the same host. Parallel downloads can be inefficient and trigger DoS protection.
        => Solution: Queues for politeness. Maintain a mapping from website hostnames to download (worker) threads. Each downloader thread has a separate FIFO queue and only downloads URLs from that queue.
      • Does not take priority into consideration (e.g., home page > blog post).
        => Solution: Prioritize URLs based on usefulness (PageRank, traffic, update frequency, etc.).
    • Robots.txt.
    • Extension: At the link extractor level, add workers to look for other kinds of URLs.
    • Consistent hashing to distribute load across multiple instances.
    • Database replication and sharding.
    • Server-side rendering to support dynamic links generated by JS.
    • Remove redundant content using checksums.
    • Avoid infinite loops by setting a max URL length.
    • Exclude noise, such as advertisements.

10. Example: Notification System

  1. Requirements:
    • Push notifications, emails, SMS.
    • Millions of notifications per day.
  2. Design
    • Push:
      • iOS: APNs
        • (Device Token)
        • Payload
      • Android: FCM
        • (Device Token)
        • Payload
    • SMS
      • Twilio
      • Nexmo
    • Email
      • Sendgrid
      • Mailchimp
    • Architecture
      • Load balancer
      • API servers (expose APIs for communication)
        • Enable horizontal scaling
        • Implement data validation
      • External DB (contains device tokens, phone numbers, emails)
        • A user can have multiple devices
        • Cache
      • Triggers:
        • Can be on-demand (service call), cron job, etc.
        • Examples: Item purchase, reminder, order status change.
      • Add queues by notification type to avoid SPOF and bottlenecks.
        • A worker reads from the queue and performs the final sync call to the third-party notification service.
        • Also, implement retry-on-error patterns.
  3. Deep dive
    • Reliability
      • How to prevent data loss?
      • Notifications can be reordered or delayed, but never lost.
      • Data needs to be persisted to guarantee delivery.
        • DB notification log to mark notifications as sent or pending.
      • Exactly-once delivery? Deduplicate: when a notification event arrives, discard it if its event ID has been seen before.
      • Other important topics
        • Template reuse
        • Settings (e.g., contact only by email, legal)
        • Rate limiting
        • Security (manage AppKey/appSecret)
        • Monitor queues and notifications
        • Event tracking (states: start, pending, sent, error, delivered, click, unsubscribe)

11. Example: News Feed System

  1. Requirements:
    • Mobile and web news feed.
    • Publish a post and see friends’ posts on the news feed page.
    • 5,000 friends and 10 million DAU.
    • Media files: Images and videos.
  2. Design
    • For publishing:
      • Load balancer.
      • API endpoints
        • POST /v1/me/feed
      • Post Service
        • Post cache.
        • Post DB.
      • Fan-out Service (pushes new content to friends’ feeds)
        • News Feed cache.
      • Notification Service (informs friends that new content is available).
    • For reading news feeds:
      • Load balancer.
      • API endpoints
        • GET /v1/me/feed
      • News Feed Service
        • News feed cache.
  3. Deep dive
    • Add authentication and rate limiting (to publish posts).
    • Fan-out service:
      • Model

        • Fan-out on write (push model)
          • Feeds are pre-computed during write.
          • Update all friends’ caches immediately after publication.
          • Fetching is fast because feeds are pre-computed.
          • Cons:
            • Hotkey problem: many friends => many pre-calculations.
            • Wasted computation for inactive users.
        • Fan-out on read (pull model)
          • Reads are generated on-demand; recent posts are pulled when the user loads the homepage.
          • No hotkey problem.
          • Inactive users do not consume computation resources.
          • Cons:
            • Fetching new feeds is slow because they are not pre-computed.
      • Solution: hybrid approach

        • Push model for the majority of users (pre-computation).
        • For celebrities, use the pull model (on-demand).
          • Consistent hashing is useful to mitigate the hotkey problem.
      • Get friends’ IDs using a Graph DB e.g. Neo4j.

      • Get friends’ data from User cache/DB.

        • e.g., a friend is muted, is an inactive account, etc.
      • Message queue:
        => Fan-out workers.
        => Update news feed cache.
        * Only IDs are stored in memory (post_id, user_id).
        * Memory size is small => set a configurable limit (users are usually only interested in the latest content, so cache miss rate is low).

    • News feed retrieval
      • Add authentication and rate limiting (to read posts).
      • Add CDN.
      • Read news feed cache (post_id, user_id) => this gives the most recent posts.
      • Read user cache.
      • Read post cache.
    • Cache architecture in a news feed system.
      • 5 layers of cache:
      • News feed (stores only IDs).
      • Content (hot cache for popular posts, normal cache for others).
      • Social graph (follower, following).
      • Action (liked, replied, etc.).
      • Counters (like counter, reply counter, etc.).
  4. Other topics
    • Vertical scaling vs Horizontal scaling
    • SQL vs NoSQL
    • Master-slave replication
    • Read replicas
    • Consistency models
    • Database sharding
    • Keep web tier stateless
    • Cache data as much as you can
    • Support multiple data centers
    • Loosely coupled components with message queues
    • Monitor key metrics. For instance, QPS during peak hours and latency while users refresh their news feed.

12. Example: Search Autocomplete System

Known as: Design top-K most searched queries.

  1. Requirements:
    • 5 results, ordered by popularity.
    • Decide whether to send request on every character, every 3 characters, or wait a few ms since the last keystroke.
  2. Design
    • Store word–frequency table.
    • Search with SQL using WHERE word LIKE 'prefix%'.
    • Good solution for small datasets.
  3. Deep dive
    • Trie data structure
      • Designed for string retrieval operations.
      • Each node represents a character or a prefix.
      • Steps:
        • O(p): Find a prefix in the trie.
        • O(c): Get all the final nodes of the prefix’s subtree.
        • O(c log c): Results are in alphabetical order; we need to reorder by frequency.
      • Good algorithm, but too slow because we need to traverse the entire subtree to get top K.
    • Optimizations:
      • Limit max length of a prefix.
      • Cache top search queries at each node (e.g., top 5).
        • O(1) if cached. This requires significant space.
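A sketch combining the trie with the cached top-k optimization described above (names are illustrative, and re-inserting an existing query with a new frequency is simplified):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.freq = 0   # > 0 marks the end of a complete query
        self.top = []   # cached top-k (query, freq) pairs for this prefix

class AutocompleteTrie:
    def __init__(self, k: int = 5):
        self.root = TrieNode()
        self.k = k

    def insert(self, query: str, freq: int):
        node = self.root
        path = [node]
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            path.append(node)
        node.freq = freq
        # refresh the cached top-k on every node along the insertion path
        for n in path:
            n.top = sorted(set(n.top + [(query, freq)]),
                           key=lambda t: -t[1])[: self.k]

    def suggest(self, prefix: str):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for q, _ in node.top]   # O(1) once cached, no subtree traversal
```

This trades memory (each node stores up to k queries) for O(prefix length) lookups, which is the trade-off the notes point out.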
    • Real time:
      • Updating the trie on every query (billions) slows down the query service.
      • Top suggestions may not change much once the trie is built, so frequent updates are not necessary.
      • Solution: Data is cached; frequency and trie are updated weekly (for example) by workers using logs and analytics.
        • Logs should be aggregated first to feed the workers that will update the trie.
    • Store the trie
      • Serialize and store in a document DB like MongoDB.
      • Key-value store (prefix -> data).
    • Load balancer
    • API server reads directly from trie cache; if not found, reads from DB and stores in the cache to avoid cache misses for subsequent queries.
    • Web uses Ajax for background fetching of results.
    • Browser cache: Suggestions do not change much (e.g., Google caches results for 1 hour per user: "Cache-Control: private, max-age=3600").
    • Use data sampling for logging.
    • Update the trie
      • Recreate full trie and replace.
      • Update individual nodes => Avoid because it requires updating all parents up to the root (and updating the cache in every node).
    • Add a filter layer to remove unwanted words from the trie.
    • Consider sharding the trie: e.g., a–z across 26 servers. However, it is better to create a “sharding manager” to balance based on size and frequency (e.g., words starting with “a” are much more common than x, y, z).
    • Multiple languages? Consider supporting Unicode.
    • Different countries? Maybe different tries by country, depending on usage.
    • What about trending real-time search?
      • Use sharding.
      • Change weight system and give more value to recently searched queries.
      • Real-time processing requires streaming data processing (Hadoop MapReduce, Spark, Kafka, etc.).

13. Example: Chat System

  1. Requirements:

    • 1:1 chat with low delivery latency.
    • Small group chat (max 100 people).
    • Guarantee message delivery and save messages if offline for a limited time.
    • Online presence.
    • Multiple device support: The same account can be logged in to multiple devices simultaneously.
    • Push notifications.
  2. Design:

    • Sender

    • Chat services

      • Store message
      • Relay message
    • Receiver

    • Protocol

      • Option: HTTP with keep-alive header for sender; polling, long polling, or WebSocket for receiver.
      • Polling (short)
        • Wastes resources.
      • Long polling
        • Longer connection time = fewer requests.
      • WebSocket
        • Handshake and then bi-directional channel connection.
        • More difficult to scale; connections are persistent and must be managed on the server.
    • Solution: Hybrid approach: HTTP for sending messages and WebSockets for receiving them.

    • Components:

      • Stateless services (load balancer, authentication, profile, group management).
      • Push notification (third party, do not reinvent).
        • FCM, APNS, etc.
      • Stateful services (WebSockets and chat services).
        • Modern servers can handle 100,000 connections.
    • Online/offline presence can be handled by presence servers (real-time layer like chats).

    • Normal data in relational database.

    • Chat data in key-value stores.

      • Easy to scale horizontally.
      • Low latency.
      • Relational databases handle the long tail of chat data poorly: once indexes grow large, random access is expensive.
        • Needed for searching within a chat or jumping to specific messages.
      • Facebook uses HBase and Discord uses Cassandra.
    • Data model:
      Message:
      * message_id (pk) bigint
      * message_from bigint
      * message_to bigint
      * content text
      * created_at timestamp
      Group_Message:
      * channel_id (pk) bigint
      * message_id (pk) bigint
      * user_id bigint
      * content text
      * created_at timestamp
      * message_id is used to order the messages because they can be created at the same time.
      * ID: Local sequence number (local to the group or chat), much easier to implement than a global unique ID like Snowflake’s solution.
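The local sequence-number idea can be sketched like this (an in-memory toy; a real system would persist and shard the counters):

```python
import itertools
import threading

class ChannelSequencer:
    """Issues message_ids that are unique and increasing *within* a channel.

    Much simpler than a globally unique generator (e.g., Snowflake):
    message ordering only has to hold inside one chat or group."""

    def __init__(self):
        self._counters = {}
        self._lock = threading.Lock()

    def next_id(self, channel_id):
        with self._lock:
            counter = self._counters.setdefault(channel_id, itertools.count(1))
            return next(counter)

seq = ChannelSequencer()
print(seq.next_id("chat:42"))  # 1
print(seq.next_id("chat:42"))  # 2
print(seq.next_id("chat:99"))  # 1  (independent per channel)
```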

  3. Design:

    • Service Discovery to assign the best server to the user based on geographical location, server capacity, etc. Example: Apache Zookeeper.
    • Send message while offline:
      • Inbox pattern.
    • What happens if sender and receiver are connected to different servers?
      • Need a synchronization queue (stored in K/V store) to send the message to another server and finally reach the client. Example: etcd for distributed synchronization.
      • Communication between these services could be RPC.
    • What happens if the receiver is not online?
      => Inbox pattern using a message queue. When online again, messages are delivered; once delivered, they are removed from the inbox.
    • How to deliver a message to 100 users in a group:
      => Fan-out pattern
      • Online users get the message; offline members get the message in their inbox.
    • Online presence:
      • We cannot rely on active WebSocket connections because the app could be in the background, sleeping, etc.
      • The client app sends a heartbeat (ping/keep-alive) to the presence servers; if none is received within 1 minute, the user is marked offline.
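The heartbeat-based presence rule can be sketched as follows (the injectable clock is only there to make the demo deterministic):

```python
import time

class PresenceTracker:
    """Marks a user offline if no heartbeat arrives within `timeout` seconds.

    Clients ping periodically; an app that is backgrounded or loses network
    simply stops pinging, so no explicit 'disconnect' event is required."""

    def __init__(self, timeout=60.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock          # injectable for testing
        self.last_seen = {}

    def heartbeat(self, user_id):
        self.last_seen[user_id] = self.clock()

    def is_online(self, user_id):
        last = self.last_seen.get(user_id)
        return last is not None and (self.clock() - last) < self.timeout

# Simulated clock so the example is deterministic
now = [0.0]
tracker = PresenceTracker(timeout=60.0, clock=lambda: now[0])
tracker.heartbeat("alice")
now[0] = 30.0
print(tracker.is_online("alice"))  # True  (30s since last ping)
now[0] = 90.0
print(tracker.is_online("alice"))  # False (no ping for 90s > 60s timeout)
```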
  4. Wrap up:

    • Adding media to chats: Thumbnails, compression.
    • End-to-end encryption.
    • Caching messages on the client.
    • Message resend mechanism.
      • IDs.
      • Timestamps.
      • Vector clocks.

14. Example: Youtube

  1. Requirements
    • Ability to upload videos fast.
    • Smooth video streaming.
    • Ability to change video quality.
    • Low infrastructure cost.
    • High availability, scalability, and reliability requirements.
    • Clients supported: Mobile apps, web browsers, and smart TVs.
  2. Design
    • Use Blob storage or CDNs.
    • Upload video flow:
      • Load balancer.
      • API servers.
      • Metadata cache/DB.
      • Original storage (e.g., blob storage).
        • Binary large objects (BLOB) are a collection of binary data stored as a single entity in a database management system.
      • Transcoding servers (encoding).
        • Transform video to other formats.
        • Queue to completion handlers (also update metadata).
      • Transcoding storage to CDNs.
    • Video streaming flow:
      • The client device continuously receives small chunks of the video stream from remote servers instead of downloading the whole file.
      • Video Streaming Protocols:
        • MPEG–DASH. MPEG stands for “Moving Picture Experts Group” and DASH stands for “Dynamic Adaptive Streaming over HTTP”.
        • Apple HLS. HLS stands for “HTTP Live Streaming”.
        • Microsoft Smooth Streaming.
        • Adobe HTTP Dynamic Streaming (HDS).
      • Directly from the closest server using CDN.
  3. Design:
    • Transcoding
      • Different bitrates/qualities/resolutions require different internet speeds.
      • Raw video consumes a large amount of storage.
      • Compatibility with devices.
      • Deliver high resolution if high bandwidth is available.
      • Switch video quality depending on network conditions.
      • Different containers (avi, mp4, webm, etc.) and codecs for compression and decompression (H.264, VP9, HEVC, 4K ProRes, etc.); this will depend on devices (e.g., old phones).
      • Directed acyclic graph (DAG) model.
      • Transcoding a video is computationally expensive and time-consuming.
      • Execute sequential or parallel tasks during encoding (e.g., Facebook DAG model).
    • Safety optimization: Pre-signed upload URL (e.g., Amazon S3) or Shared Access Signature (e.g., Azure Blob Storage).
      • This speeds up upload times because content is delivered directly to blob storage.
      • Enable multipart upload (multiple chunks) in parallel.
      • For performance: Transcode and distribute segments while uploading.
    • Cost-saving.
      • Most-viewed videos are served from CDNs.
      • Less-viewed videos are served from our own video servers.
      • Less-viewed videos do not need every encoding, or can be transcoded on demand.
      • Popular videos by region do not require global distribution.
    • Errors
      • Implement recoverable error strategies like retry.
    • Live streaming
      • Even lower latency is required, so another protocol is likely needed.
      • .mpd (for DASH manifest)
      • .m3u8 (for HLS manifest)
      • The browser will decide, depending on status, which segment to continue playing.
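The DAG model for transcoding can be sketched with Python's stdlib `graphlib`; the task names below are invented for illustration. Tasks run as soon as all their dependencies finish, which is what lets independent encodings proceed in parallel:

```python
from graphlib import TopologicalSorter

# Each task maps to the tasks it depends on (hypothetical pipeline).
transcoding_dag = {
    "split_video":        ["download_original"],
    "extract_audio":      ["download_original"],
    "encode_720p":        ["split_video"],
    "encode_1080p":       ["split_video"],
    "generate_thumbnail": ["download_original"],
    "assemble_outputs":   ["encode_720p", "encode_1080p", "extract_audio"],
}

ts = TopologicalSorter(transcoding_dag)
ts.prepare()
batches = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # tasks whose dependencies all finished
    batches.append(ready)
    for task in ready:              # a real system dispatches these to workers
        ts.done(task)

for batch in batches:
    print("run in parallel:", batch)
```

Each printed batch is a set of tasks with no remaining dependencies, so everything inside a batch can be fanned out to transcoding workers at once.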

15. Example: Google Drive

  1. Requirements

    • Upload, download files.
    • File sync.
    • Notifications.
    • Encrypted.
    • All file formats.
    • Data loss is not an option.
    • High availability and scalability.
  2. Design

    • Single server
      • Upload endpoint.
        • Normal upload for small files.
        • Resumable upload for big files.
      • Download.
      • All protected by SSL using HTTPS.
        • Considering FTP?
    • Use Amazon S3.
      • Fast, secure, and cheap storage.
      • Bucket redundancy in the same region.
      • Bucket redundancy in multiple regions.
    • Load balancer.
    • Multiple web servers for horizontal scaling.
    • Sync conflicts:
      • Two users modify the same file or folder.
      • Solution: First modification is accepted; the second one gets a conflict.
        • File.txt
        • File (1).txt
      • Differential Synchronization for multiple users.
    • Divide and conquer: Use block servers.
      • Files are split into blocks.
      • Blocks get a content hash.
      • Block size of 4MB (like Dropbox).
      • To reconstruct a file, fetch its blocks and join them in order.
      • Files get a checksum.
      • Chunking, compression, and encryption are implemented in a single server (in a single language/platform, instead of having this responsibility in the clients, for security and maintainability).
    • Inactive blocks eventually go to cold storage.
    • Metadata database.
      • User info, blocks, versions, upload timestamp.
    • Notification service.
      • With offline backup queue.
      • Changes will be synced when the client is online again.
      • Notify clients when files are added, modified, or deleted.
      • Long polling with clients (like Dropbox).
        • Clients try to stay connected.
      • WebSockets are not a good fit here: a bidirectional channel is unnecessary and file-change notifications are infrequent.
    • Download file.
      • For sync, get changes triggered by notification and only download changes.
    • Save storage:
      • De-duplicate data blocks among file versions. Redundant blocks are not necessary; only changes are stored.
      • Limit the number of versions.
      • Keep valuable versions only.
      • Move infrequently used data to cold storage (e.g., Amazon S3 Glacier).
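The block-server ideas above (fixed-size blocks, content hashes, de-duplication across file versions) can be sketched as follows, using 8-byte blocks instead of 4 MB purely for the demo:

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, like Dropbox

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split file bytes into fixed-size blocks keyed by content hash."""
    blocks = {}
    order = []  # hashes in file order, so the file can be reassembled
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        blocks[digest] = block       # identical blocks dedupe automatically
        order.append(digest)
    return blocks, order

# Two versions of a file sharing their first block: only the changed
# block needs to be stored or uploaded again.
v1 = b"A" * 8 + b"B" * 8
v2 = b"A" * 8 + b"C" * 8
store = {}
for version in (v1, v2):
    blocks, order = split_into_blocks(version, block_size=8)
    new = [h for h in order if h not in store]
    store.update(blocks)
    print(f"{len(order)} blocks, {len(new)} new")
```

Because the key is a content hash, any block already present in the store is skipped, which is exactly the de-duplication-across-versions saving described above.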
  3. Design

  • Delta synchronization: only the modified blocks are synced, not the whole file.

  • Data compression

    • Compress blocks to reduce data size.
    • Compression algorithm depends on file type.
  • Block encryption before sending to cloud storage.

  • Strong consistency.

    • Data in cache replicas and the master must stay consistent.
    • Invalidate caches on database writes.
    • In relational databases, consistency comes easily thanks to ACID; in NoSQL, ACID-like guarantees must be incorporated programmatically.
  • Error handling.

    • Load balancer failure => A secondary becomes active.
    • Block server failure => Other servers pick up unfinished or pending jobs.
    • S3 failure => Another copy is used because of geographical and regional redundancy.
    • API server failure => API servers are stateless, so traffic is redirected to another active instance.
    • Metadata cache failure => Automatic access to another one; caches are replicated and failed ones will be replaced.
    • Metadata DB failure.
      • Master down: Promote a slave to temporarily be a new master, create a new slave.
      • Slave down: Use another slave, create a new slave.
    • Notifications service failure => Long-poll connections are closed; client connections will try to eventually reconnect to another healthy server.
    • Offline backup queue failure => Queues are replicated; consumers need to re-subscribe to the backup queue.
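The invalidate-cache-on-write rule from the consistency notes can be sketched as follows (plain dicts stand in for the metadata cache and database):

```python
class MetadataStore:
    """Write-invalidate: every DB write deletes the cached entry, so the
    next read repopulates the cache from the source of truth."""

    def __init__(self):
        self.db = {}      # stands in for the metadata database
        self.cache = {}   # stands in for the metadata cache

    def write(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)   # invalidate on write, never update in place

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db[key]        # cache miss: go to the database
        self.cache[key] = value
        return value

store = MetadataStore()
store.write("file:1", {"version": 1})
print(store.read("file:1"))          # populates the cache
store.write("file:1", {"version": 2})
print(store.read("file:1"))          # stale entry was invalidated, so v2
```

Deleting instead of updating the cache entry avoids the race where a concurrent reader repopulates the cache with stale data mid-write; the next read always goes back to the database.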