Book notes: System Design Interview: Volume 1

A 2020 book by Alex Xu offering a hands-on introduction to system design interviews, with examples of real-world scalable systems.

More about System Design

Here are my notes:

📃 Content:

1. System Design Basics

How to scale from zero to millions of users:

  • Start simple, a single server, and iterate only when needed.
    • Metrics: user base, traffic, use case latency, costs…
  • Which Database to use?
    • Relational (RDBMS) by default. Worked well for over 40 years.
    • If the use case is not suitable, ask:
      • Does the application require super-low latency?
      • Unstructured data or non‑relational data?
      • Do you only need to serialize and deserialize data (JSON, XML, etc.)?
      • Do you need to store a massive amount of data?
      • Solution: Non-Relational (NoSQL)
  • Scalability
    • Vertical: scale-up, CPU, RAM, etc.
      • Simple, and good for low traffic.
      • Limited by hardware in a single server.
      • No failover or redundancy. Server down means business down.
    • Horizontal: scale-out, more servers to the pool.
  • Load Balancer
    • If one server goes down, the balancer redirects traffic to available servers.
    • If traffic grows, more servers can be added without changing the request flow.
  • Database Replication
    • Master (original) / slave (copies), only the master accepts writes.
    • In general, systems do many more reads than writes, so there are more slaves than masters.
    • Better performance: more queries running in parallel.
    • More reliability: if one database is destroyed, data is preserved due to replication.
    • Disaster recovery:
      • Slave down: traffic goes to another slave or even to the master.
      • Master down: a slave is promoted to be a new master (data might need to be restored).
        • Other option: multi-master and circular replication.
  • Cache
    • Temporary storage, much faster than the database, that stores expensive responses of frequently accessed data in memory.
    • Use when data is accessed frequently but modified infrequently.
    • Cache is volatile; important data must be persisted in data stores.
    • Defining an expiration policy is good practice.
      • Not too short (to avoid unnecessary data reload).
      • Not too long (data can become stale or unused).
    • Consistency trade-off: sync cache with data store; they are not in the same transaction.
    • Possible Single point of failure (SPOF).
      • Recommend multiple cache servers across data centers.
      • Overprovision the required memory by certain percentages.
    • Eviction policy: once the cache is full, adding new items requires evicting existing ones.
      • Least-recently-used (LRU) policy (most common).
      • Least-frequently-used (LFU) policy.
      • First-in-first-out (FIFO) policy.
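The LRU policy above can be sketched in a few lines with Python's `OrderedDict`; the class and method names are illustrative, not from the book:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least-recently-used key when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the LRU entry
```

A production cache (Redis, Memcached) would add TTL-based expiration and thread safety on top of this core idea.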
  • Content delivery network (CDN)
    • Content is delivered from the closest (lowest latency) CDN server.
    • Cache only relevant and frequent data (to avoid unnecessary costs).
    • Define cache expiration.
      • Not too long if data is time-sensitive; data needs to be fresh.
      • Not too short; to avoid repeated reloading.
    • Fallback: what happens if CDN is not working? Get data directly from the origin.
    • Invalidate files:
      • Call the CDN API for specific object invalidation.
      • Object versioning: e.g. image.png?v=2.
  • Stateless web servers
    • Apply REST principles.
    • Store sessions in persistent storage.
    • Sticky Sessions can be a problem.
  • Multiple data centers/zones
    • Geo-routed worldwide, routes client traffic to the nearest data center.
    • In failover: data is replicated in multiple data centers, and traffic routed to the available one.
    • For example: GeoDNS
    • Data should be geo‑replicated in order to achieve multi-data-center consistency.
    • Automate deployments and tests in multiple data centers to guarantee consistency.
  • Message queue
    • Enables asynchronous communication and asynchronous request processing.
    • Decouples requests from processing, which adds more reliability to the system.
    • If workers are unavailable, producers can still publish messages to the queues.
    • Workers can work when producers are not available.
    • The number of workers can be scaled depending on queue size.
  • Logging & Metrics for Observability
    • Monitor error logs at the server level or use tools to aggregate them to a centralized service for easy search and view.
    • Metrics:
      • Host level: CPU, memory, disk I/O, etc.
      • Aggregated level of tiers: database, web servers, cache.
      • Key business metrics: users, retention, revenue, etc.
  • Database scaling
    • Vertical: more CPU, RAM.
    • Single point of failure.
    • High cost.
    • Horizontal scaling: Sharding
      • Same schema, but database is partitioned in smaller, easy-to-manage parts (shards).
      • Define a good partition key to distribute data
        • One or more columns that determine how data is distributed physically in shards.
      • Challenges:
        • Resharding data
          • One shard is full, and the sharding function needs to change to balance data between shards.
          • Solution: Consistent hashing
        • Celebrity problem (hotspot key problem).
          • Some data inside a shard receives a disproportionate amount of traffic (celebrity).
          • We may need to allocate dedicated shards for each celebrity.
        • Join and de-normalization
          • Data is split across shards, making join operations difficult.
          • Solution: De-normalize the database to perform queries on a single table.
  • Scaling is an iterative process
    • For example, the business may realize it needs to split services into smaller microservices.

2. Estimations

  • Powers of two as bytes:

    • 2^3 = 8 bits = 1 byte (1 ASCII character)
    • 2^10 = 1,024 bytes ≈ 1 thousand = 1 Kilobyte (1 KB)
    • 2^20 ≈ 1 million bytes = 1 Megabyte (1 MB)
    • 2^30 ≈ 1 billion bytes = 1 Gigabyte (1 GB)
    • 2^40 ≈ 1 trillion bytes = 1 Terabyte (1 TB)
    • 2^50 ≈ 1 quadrillion bytes = 1 Petabyte (1 PB)
  • Latency Numbers Every Programmer Should Know

    • L1 cache reference 0.5 ns
    • Branch mispredict 5 ns
    • L2 cache reference 7 ns
    • Mutex lock/unlock 100 ns
    • Main memory reference 100 ns
    • Compress 1K bytes with Zippy 10,000 ns = 10 µs
    • Send 2K bytes over 1 Gbps network 20,000 ns = 20 µs
    • Read 1 MB sequentially from memory 250,000 ns = 250 µs
    • Round trip within the same datacenter 500,000 ns = 500 µs
    • Disk seek 10,000,000 ns = 10 ms
    • Read 1 MB sequentially from the network 10,000,000 ns = 10 ms
    • Read 1 MB sequentially from disk 30,000,000 ns = 30 ms
    • Send packet CA (California) -> Netherlands -> CA 150,000,000 ns = 150 ms
  • High availability numbers (1 year)

    • 99% - 3.65 days
    • 99.9% - 8.77 hours
    • 99.99% - 52.60 minutes
    • 99.999% - 5.26 minutes
    • 99.9999% - 31.56 seconds
  • Daily Active Users (DAU)

  • Queries Per Second (QPS) and QPS peak

  • Tips:

    • Round and approximate; precision is not expected.
    • Write down your assumptions.
    • Practice commonly asked estimations: QPS, peak QPS, storage, cache, number of servers, etc.
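A worked example of these estimations, using made-up numbers (10 million DAU, 10 requests per user per day, a 2x peak factor) purely for illustration:

```python
# All inputs below are assumptions for illustration; round aggressively.
dau = 10_000_000            # assumed daily active users
requests_per_user = 10      # assumed requests per user per day
seconds_per_day = 86_400

qps = dau * requests_per_user / seconds_per_day   # average queries per second
peak_qps = 2 * qps                                # rule of thumb: peak ~= 2x average

# Storage: assume 10% of requests write a 1 KB record, retained for 5 years.
writes_per_day = dau * requests_per_user * 0.10
storage_5y_tb = writes_per_day * 1_000 * 365 * 5 / 1e12
```

This gives roughly 1,200 QPS average, ~2,300 QPS at peak, and ~18 TB of storage over 5 years; precision is not the point, the order of magnitude is.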

3. Framework for System Design Interviews

  1. Understand the problem and establish design scope (3 - 10min)
    • What specific features are we going to build?
    • How many users does the product have?
    • How fast does the company anticipate scaling up? What are the anticipated scales in 3 months, 6 months, and a year?
    • What is the company’s technology stack? What existing services might you leverage to simplify the design?
  2. Propose high-level design and iterate (10 - 15min)
    • Come up with an initial design
    • Ask for feedback
    • Treat your interviewer as a teammate and work together
    • Draw box diagrams with key components on the whiteboard or paper (clients (mobile/web), APIs, web servers, data stores, cache, CDN, message queue)
    • Do calculations to evaluate if your design fits the scale constraints
    • Think out loud.
    • Communicate with your interviewer if calculations are necessary before starting.
  3. Design deep dive (10 - 25min)
    • Work with the interviewer to identify and prioritize components in the architecture:
      • High-level design
      • Or performance characteristics
      • Or bottlenecks and resource estimations
      • etc.
    • Time management is essential:
      • Demonstrate your abilities
      • You must be armed with signals to show your interviewer
      • Try not to get into unnecessary details
  4. Wrap up (3 - 5min)
    • If you have the freedom to discuss additional points:
    • Identify the system bottlenecks and discuss potential improvements
    • Never say your design is perfect and nothing can be improved; show your critical thinking.
    • Error cases (server failure, network loss, etc.)
    • Next scale curve 100k => 1 million => 10 million users

4. Example: Design A Rate Limiter

  1. Requirements:
    • Limit excessive requests.
    • Low latency. The rate limiter should not slow down HTTP response time.
    • Use as little memory as possible.
    • Distributed rate limiting. The rate limiter can be shared across multiple servers or processes.
    • Exception handling. Show clear exceptions to users when their requests are throttled.
    • High fault tolerance. If there are any problems with the rate limiter (for example, a cache server goes offline), it does not affect the entire system.
  2. Design
    • Building your own rate-limiting service takes time. If the effort is too high, a commercial API gateway is a better option. API gateways typically provide:
      • Authentication and authorization
      • Service discovery integration
      • Response caching
      • Retry policies, circuit breaker, and QoS
      • Rate limiting and throttling
      • Load balancing
      • Logging, tracing, correlation
      • Headers, query strings, and claims transformation
      • IP allow-listing
      • SSL termination
      • Serving static content.
    • Responses will return: HTTP 429 Too Many Requests.
    • Rate Limiting Algorithms (To determine when to block requests):
      • Token bucket
        • Easy to implement.
        • Memory efficient.
        • Cons: two parameters, bucket size and refill rate.
      • Leaky bucket
        • Memory efficient (limits queue size).
        • Requests are processed at a fixed rate.
        • Cons: if too many old requests fill the queue, newer requests get rejected.
        • Cons: two parameters, bucket size and outflow rate.
      • Fixed window
        • Memory efficient.
        • Easy to understand.
        • Cons: spike in traffic at the edge of a window can cause more requests than allowed.
      • Sliding window log
        • Very accurate. In any rolling window, requests will not exceed the rate limit.
        • Cons: consumes a lot of memory because even if a request is rejected, its timestamp might still be stored in memory.
      • Sliding window counter
        • Hybrid of the fixed window counter and the sliding window log.
        • Smooths out traffic spikes; memory efficient.
        • Cons: an approximation; it assumes requests in the previous window are evenly distributed.
    • Use a fast in-memory store (for example, Redis, with INCR and EXPIRE) for the counters; disk-based databases are too slow.
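As a sketch, the token bucket algorithm with its two parameters might look like this (a single-process toy, not the distributed Redis-backed version):

```python
import time

class TokenBucket:
    """Token bucket limiter; bucket_size and refill_rate are its two parameters."""

    def __init__(self, bucket_size: int, refill_rate: float):
        self.bucket_size = bucket_size
        self.refill_rate = refill_rate        # tokens added per second
        self.tokens = float(bucket_size)      # bucket starts full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at bucket size
        self.tokens = min(self.bucket_size,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # request should be rejected with HTTP 429
```

In a distributed setup the token count would live in a centralized store such as Redis, with care taken around race conditions.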
  3. Deep dive
    • How are rate rules defined? Where are they stored?
      • Load and cache rules from the rate limit service.
      • Define rules and sync with cache.
    • How to handle requests that are rate limited?
      • Response code
      • Headers
    • Rate limiter in a distributed environment
      • Race conditions (when updating the request counter in Redis from multiple threads).
      • Synchronization issues (when scaling with multiple rate limiter instances, recommended approach: centralized data store for counters, e.g. Redis).

5. Design Consistent Hashing

  • Video

  • A method that helps distribute data among servers during horizontal scaling.

  • Define hashing function.

  • Simple hashing:

    • serverIndex = hash(key) % N
    • Works when the pool size is fixed and the data distribution is even.
      • Problem when adding or removing servers: N changes, so most keys map to a different server and need to be redistributed (e.g., going from N = 4 to N = 3 remaps most keys).
      • Not ideal.
  • Consistent hashing

    • Hash space and hash ring
    • Example: SHA-1 as hash function.
    • => 0,…,2^160 - 1 hash space
    • Ring with Servers and keys using the hash space
    • Servers: S0, S1, S2, S3 (using hashing function with server name or IP address).
    • Hash keys with the same function: k0, k1, k2, k3.
    • k0 is stored in the first server encountered moving clockwise in the ring.
    • If we add S4 between S3 and S0, ONLY the keys in that segment need to be redistributed to S4.
    • The same applies when removing a server.
    • This algorithm only requires redistributing a fraction of the keys.
    • Cons: with few servers, each one can end up owning uneven fractions of the hash space, so keys are not evenly distributed.
      => Solution: virtual nodes. Servers appear in multiple locations in the ring, for example S0_1, S0_2, S0_3.
      => Virtual nodes should cover multiple different segments of the hashing space.
      => More virtual nodes => distribution is more balanced, but it takes more space to store metadata.
      => Virtual node number can be tuned.
    • (Examples: Amazon DynamoDB and Cassandra) use this for data partitioning
      => Helps to minimize data movement while doing rebalancing.
    • (Example: CDNs)
      => Distribute web content evenly between servers.
    • (Example: load balancers like Google Load Balancers)
      => Distribute persistent connections evenly.
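A minimal sketch of a hash ring with virtual nodes, using SHA-1 as in the example above (class and parameter names are illustrative):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # SHA-1 gives a 160-bit hash space: 0 .. 2^160 - 1
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, vnodes: int = 100):
        self.ring = []                       # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):          # virtual nodes, e.g. "S0_1", "S0_2"
                self.ring.append((_hash(f"{server}_{i}"), server))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    def lookup(self, key: str) -> str:
        # first server encountered moving clockwise from the key's position
        idx = bisect.bisect(self.keys, _hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

Rebuilding the ring with one extra server remaps only the fraction of keys falling in the new server's segments, which is the property the notes highlight.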

6. Example: Design A Key-value Store

  • Video

  • Non-relational database where data is stored and identified by unique key. Example: Memcached, Redis, etc.

  • Short keys work better.

  1. Requirements:
    • The size of a key-value pair is small: less than 10 KB.
    • Ability to store big data.
    • High availability: The system responds quickly, even during failures.
    • High scalability: The system can be scaled to support large data sets.
    • Automatic scaling: The addition/deletion of servers should be automatic based on traffic.
    • Tunable consistency.
    • Low latency.
  2. Design:
    • Initial approach: hash table in memory.

      • Improvements
      • Data compression.
      • Store only frequently used data in memory; store the rest on disk.
    • CAP theorem

      • It is impossible for a distributed system to simultaneously provide more than two of the following three guarantees:
      • Consistency
      • Availability
      • Partition Tolerance
      • Possibilities:
        • CA => Network failure can always happen, so network partition must always be tolerated. CA cannot exist in real-world distributed systems.
        • CP => Consistency and Partition Tolerance (Sacrifices Availability).
        • AP => Availability and Partition Tolerance (Sacrifices Consistency).
  3. Distributed system components
    • Data partition
      • All data does not fit in a single server.
      • Challenges:
      • Distributing data across multiple servers evenly.
      • Minimizing data movement when nodes are added or removed.
      • Consistent hashing is good to solve these problems.
      • Pros:
      • Automatic scaling based on load.
      • Heterogeneity: Servers with higher capacity can have more virtual nodes.
    • Data replication
      • Data must be replicated across multiple data centers to prevent data loss during an outage, disaster, or network failure.
    • Consistency
      • Replicas must be synchronized.
      • Quorum consensus can guarantee consistency for read and write (N,W,R)
        • N => Number of replicas.
        • W => Number of ACKs from replicas needed to consider write successful.
        • R => Number of ACKs from replicas needed to consider read successful.
        • R = 1 & W = N: the system is optimized for fast read.
        • W = 1 & R = N: the system is optimized for fast write.
        • W + R > N: strong consistency is guaranteed (usually N = 3, W = R = 2).
        • W + R <= N: strong consistency is not guaranteed.
      • Model:
        • Strong consistency: The client never sees out-of-date data.
          => Achieved by blocking new reads/writes until all replicas have acknowledged the current write.
          => Blocks operations.
        • Weak consistency: Requests may not see the most updated value.
        • Eventual consistency: A specific type of weak consistency; given enough time, all updates are propagated, and replicas become consistent.
          => Example: Dynamo and Cassandra adopt eventual consistency.
          => Allows inconsistent values to enter the system, forcing the client to perform reconciliation on read.
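The (N, W, R) quorum rules above reduce to simple arithmetic; a tiny helper, with profile labels taken from the notes:

```python
def quorum_profile(n: int, w: int, r: int):
    """Classify an (N, W, R) configuration and check for strong consistency."""
    if w == 1 and r == n:
        profile = "fast write"
    elif r == 1 and w == n:
        profile = "fast read"
    else:
        profile = "balanced"
    # W + R > N guarantees the read and write quorums overlap in >= 1 replica
    strongly_consistent = w + r > n
    return profile, strongly_consistent
```

For example, the common N = 3, W = R = 2 configuration is balanced and strongly consistent, while W = R = 1 trades consistency for latency.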
    • Inconsistency resolution
      • Solved with versioning and vector clocks.
      • Versioning: Treating each data modification as a new immutable version of the data (v1, v2, …).
      • A vector clock is a [server, version] pair associated with a data item. It can be used to check if one version precedes, succeeds, or is in conflict with others.
      • Adds complexity to the client because it needs to implement conflict resolution logic.
      • Vector clock could grow rapidly. Use a threshold for the length, and if it exceeds the limit, the oldest pairs are removed (solution used by Dynamo without problems in production).
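A sketch of the vector clock comparison described above, representing a clock as a `{server: version}` dict (an assumed encoding; Dynamo stores it as a list of pairs):

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks: precedes, succeeds, equal, or conflict."""
    servers = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(s, 0) <= vc_b.get(s, 0) for s in servers)
    b_le_a = all(vc_b.get(s, 0) <= vc_a.get(s, 0) for s in servers)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "precedes"   # a is an ancestor of b; b can safely overwrite a
    if b_le_a:
        return "succeeds"
    return "conflict"       # concurrent siblings: the client must reconcile
```

The "conflict" case is exactly where the client-side reconciliation logic mentioned in the notes kicks in.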
    • Handling failures
      • Detecting if a server is down requires at least two independent sources of information.
      • Multicasting is inefficient when there are many servers.
      • Better to use the gossip protocol:
        • A failure detection method.
        • Each node maintains a node membership list, which contains member IDs and heartbeat counters.
        • Each node periodically increments its heartbeat counter.
        • Each node periodically sends heartbeats to a set of random nodes, which in turn propagate to another set of nodes.
        • When nodes receive heartbeats, the membership list is updated with the latest info.
        • If a member’s heartbeat has not increased for a predefined period, it is considered offline.
      • Temporary failures: If the system uses strict quorum, reads and writes could be blocked during failures.
        => To improve availability, use sloppy quorum (the system chooses healthy W & R servers).
        => This allows the write to be temporarily stored on an available node (hinted handoff).
        => When the server is back up, changes are pushed to it to achieve data consistency.
      • Permanent failures: To keep replicas in sync, use the anti-entropy protocol.
        => Compares each piece of data on replicas and updates each replica to the newest version.
        => Uses Merkle trees for efficient data comparison; the amount of data synchronized is proportional to the differences between the two replicas.
        => Example: Apache Cassandra, Amazon DynamoDB, Amazon S3, Google Cloud Storage, Redis, Memcached, etc.
      • Data center outage:
        => Replicate data across multiple data centers.
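The gossip-based failure detection steps above can be sketched as follows, using a logical clock instead of wall time for determinism (a simplification; real implementations also randomize gossip targets):

```python
class GossipNode:
    """Sketch of gossip-style failure detection using a logical clock."""

    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.clock = 0
        # membership list: node_id -> [heartbeat counter, local time it last grew]
        self.membership = {p: [0, 0] for p in list(peers) + [node_id]}

    def tick(self):
        """One local time step: increment our own heartbeat counter."""
        self.clock += 1
        self.membership[self.node_id][0] += 1

    def receive(self, remote_membership):
        """Merge a gossiped membership list, keeping the highest heartbeat seen."""
        for node, (heartbeat, _) in remote_membership.items():
            entry = self.membership.setdefault(node, [0, 0])
            if heartbeat > entry[0]:
                entry[0], entry[1] = heartbeat, self.clock

    def offline(self, timeout):
        """Members whose heartbeat has not increased within `timeout` ticks."""
        return {n for n, (_, last_seen) in self.membership.items()
                if n != self.node_id and self.clock - last_seen > timeout}
```

A node that stops heartbeating will eventually be flagged offline by every peer that has stopped seeing its counter grow.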
    • System architecture diagram.
    • Key-value store communication with APIs: get(key) and put(key, value).
    • A coordinator node acts as a proxy between the client and the key-value store.
    • Nodes are distributed on a ring using consistent hashing.
      • The system is completely decentralized, so adding and removing nodes can be automatic.
      • Data is replicated at multiple nodes.
      • There is no single point of failure as every node has the same set of responsibilities.
    • Read path (For this specific problem)
      • Read from the memory cache.
      • If not found in cache or Bloom filter, read from persistence: a sorted-string table (SSTable), which is a sorted list of “<key, value>” pairs.

7. Distributed Unique ID Generation

  1. Requirements:
    • IDs must be unique.
    • IDs are numerical values only.
    • IDs fit into 64 bits.
    • IDs are ordered by date.
    • Generate 10,000 unique IDs per second.
  2. Design:
  • Multi-master replication
    • AUTO_INCREMENT
    • Hard to scale due to synchronization issues.
    • Can be predictable, posing a possible security issue.
  • Universally unique identifier (UUID)
    • Not numeric; 128-bit.
    • No coordination, so no synchronization issues.
    • Easy to scale.
  • Ticket server
    • Generates distributed primary keys (centralized AUTO_INCREMENT).
    • Numeric IDs.
    • SPOF (Single Point of Failure).
    • Data synchronization issues if scaled.
  • Twitter snowflake approach
    • Numeric; 64-bit.
    • Divide ID into:
      • Sign bit: 1 bit.
      • Timestamp: 41 bits (milliseconds since epoch).
        => Allows for around 69 years.
      • Datacenter ID: 5 bits (2^5 => 32 possible data centers).
      • Machine ID: 5 bits (2^5 => 32 possible machines).
      • Sequence number: 12 bits (incremented by 1 in every machine/process, reset to 0 every millisecond).
        => Maximum of 2^12 = 4096 IDs per millisecond per machine.
    • Easy to scale.
    • No coordination, so no synchronization issues.
    • Considerations: clock synchronization, section length tuning (e.g., fewer sequence bits and more timestamp bits for low-concurrency, long-lived applications), and high availability.
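The bit layout above can be packed with shifts; the epoch constant below is Twitter's commonly cited snowflake epoch, but any fixed start date works:

```python
# Bit layout from the notes: 1 sign + 41 timestamp + 5 datacenter + 5 machine + 12 sequence.
DC_BITS, MACHINE_BITS, SEQ_BITS = 5, 5, 12
CUSTOM_EPOCH_MS = 1_288_834_974_657  # Twitter's epoch (Nov 2010); any fixed epoch works

def make_id(timestamp_ms: int, dc_id: int, machine_id: int, seq: int) -> int:
    """Pack the four fields into one 64-bit integer."""
    assert 0 <= dc_id < 2 ** DC_BITS and 0 <= machine_id < 2 ** MACHINE_BITS
    assert 0 <= seq < 2 ** SEQ_BITS
    return (((timestamp_ms - CUSTOM_EPOCH_MS) << (DC_BITS + MACHINE_BITS + SEQ_BITS))
            | (dc_id << (MACHINE_BITS + SEQ_BITS))
            | (machine_id << SEQ_BITS)
            | seq)
```

Because the timestamp occupies the highest bits, IDs generated later compare larger, which is what makes them sortable by date.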

8. Example: URL Shortener

  • Video

  • Short URL: https://www.aaaaaaaaaaa.com/q=chatsystem&c=loggedin&v=v3&l=long => https://bbbbbbb.com/y7keocwj

  • Hash table: shortUrl -> longUrl.

  • [0-9, a-z, A-Z], containing 10 + 26 + 26 = 62 possible characters.

  • 62^7 => 3,521,614,606,208 possible URLs with 7 characters (few collisions).

  • Hash functions (MD5, SHA-1) produce hashes that are too long; take only the first 7 characters and resolve collisions (e.g., by re-querying until a free short URL is found).
    => Low collision probability.
    => Fixed size.
    => Secure; difficult to predict the next URL.

  • Base 62 conversion: 11157 (using ID in database) => 2TX.
    => Variable length.
    => Easy to predict URLs; not secure.

  • Unique ID generator
    => Example: Snowflake => 2009215674938 (13 digits).
    => ID in base 62 => zn9edcu.
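A base-62 encoder matching the note's example (11157 => 2TX), assuming the digit order 0-9, a-z, A-Z:

```python
CHARSET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n: int) -> str:
    """Convert a non-negative integer (e.g. a database ID) to base 62."""
    if n == 0:
        return CHARSET[0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, 62)
        digits.append(CHARSET[remainder])
    return "".join(reversed(digits))  # most significant digit first
```

With 7 characters this covers 62^7 ≈ 3.5 trillion URLs, matching the capacity estimate above.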

  • Improvements for scalability:

    • Relational database (id, shortUrl, longUrl)
      • Clustered index on shortUrl.
      • Database scaling.
      • Bloom filter.
    • Cache.
    • Load balancer.
      => Scales web servers.
    • Rate limiter.
    • Availability, consistency, reliability.

9. Example: Web Crawler

A robot or spider starts by collecting a few web pages and then follows links on those pages to collect new content.

  1. Requirements:
    • Scalability: millions of web pages.
    • Robustness: not all links are good.
    • Politeness: Do not make too many requests to a website within a short time interval.
    • Extensibility: Minimal changes needed to support new content types, for example, image collection.
  2. Initial Design:
    • URL Frontier.
    • HTML downloader.
    • Content parser (save if not seen).
    • Link extractor.
    • URL Filter (save if not seen).
    • Return to URL Frontier.
  3. Design:
    • The Web can be considered a graph.
    • DFS vs BFS
      • BFS commonly used.
      • Problem: Links from the same web page often link back to the same host. Parallel downloads can be inefficient and trigger DoS protection.
        => Solution: Queues for politeness. Maintain a mapping from website hostnames to download (worker) threads. Each downloader thread has a separate FIFO queue and only downloads URLs from that queue.
      • Does not take priority into consideration (e.g., home page > blog post).
        => Solution: Prioritize URLs based on usefulness (PageRank, traffic, update frequency, etc.).
    • Robots.txt.
    • Extension: At the link extractor level, add workers to look for other kinds of URLs.
    • Consistent hashing to distribute load across multiple instances.
    • Database replication and sharding.
    • Server-side rendering to support dynamic links generated by JS.
    • Remove redundant content using checksums.
    • Avoid infinite loops by setting a max URL length.
    • Exclude noise, such as advertisements.

10. Example: Notification System

  1. Requirements:
    • Push notifications, emails, SMS.
    • Millions of notifications per day.
  2. Design
    • Push:
      • iOS: APNs
        • (Device Token)
        • Payload
      • Android: FCM
        • (Device Token)
        • Payload
    • SMS
      • Twilio
      • Nexmo
    • Email
      • Sendgrid
      • Mailchimp
    • Architecture
      • Load balancer
      • API servers (expose APIs for communication)
        • Enable horizontal scaling
        • Implement data validation
      • External DB (contains device tokens, phone numbers, emails)
        • A user can have multiple devices
        • Cache
      • Triggers:
        • Can be on-demand (service call), cron job, etc.
        • Examples: Item purchase, reminder, order status change.
      • Add queues by notification type to avoid SPOF and bottlenecks.
        • A worker reads from the queue and performs the final sync call to the third-party notification service.
        • Also, implement retry-on-error patterns.
  3. Deep dive
    • Reliability
      • How to prevent data loss?
      • Notifications can be reordered or delayed, but never lost.
      • Data needs to be persisted to guarantee delivery.
        • DB notification log to mark notifications as sent or pending.
      • Exactly-once delivery? Deduplicate: when a notification event arrives, discard it if its event ID has been seen before.
      • Other important topics
        • Template reuse
        • Settings (e.g., contact only by email, legal)
        • Rate limiting
        • Security (manage AppKey/appSecret)
        • Monitor queues and notifications
        • Event tracking (states: start, pending, sent, error, delivered, click, unsubscribe)

11. Example: News Feed System

  1. Requirements:
    • Mobile and web news feed.
    • Publish a post and see friends’ posts on the news feed page.
    • 5,000 friends and 10 million DAU.
    • Media files: Images and videos.
  2. Design
    • For publishing:
      • Load balancer.
      • API endpoints
        • POST /v1/me/feed
      • Post Service
        • Post cache.
        • Post DB.
      • Fan-out Service (pushes new content to friends’ feeds)
        • News Feed cache.
      • Notification Service (informs friends that new content is available).
    • For reading news feeds:
      • Load balancer.
      • API endpoints
        • GET /v1/me/feed
      • News Feed Service
        • News feed cache.
  3. Deep dive
    • Add authentication and rate limiting (to publish posts).
    • Fan-out service:
      • Model

        • Fan-out on write (push model)
          • Feeds are pre-computed during write.
          • Update all friends’ caches immediately after publication.
          • Fetching is fast because feeds are pre-computed.
          • Cons:
            • Hotkey problem: many friends => many pre-calculations.
            • Wasted computation for inactive users.
        • Fan-out on read (pull model)
          • Reads are generated on-demand; recent posts are pulled when the user loads the homepage.
          • No hotkey problem.
          • Inactive users do not consume computation resources.
          • Cons:
            • Fetching new feeds is slow because they are not pre-computed.
      • Solution: hybrid approach

        • Push model for the majority of users (pre-computation).
        • For celebrities, use the pull model (on-demand).
          • Consistent hashing is useful to mitigate the hotkey problem.
      • Get friends’ IDs using a Graph DB e.g. Neo4j.

      • Get friends’ data from User cache/DB.

        • e.g., a friend is muted, is an inactive account, etc.
      • Message queue:
        => Fan-out workers.
        => Update news feed cache.
        * Only IDs are stored in memory (post_id, user_id).
        * Memory size is small => set a configurable limit (users are usually only interested in the latest content, so cache miss rate is low).

    • News feed retrieval
      • Add authentication and rate limiting (to read posts).
      • Add CDN.
      • Read news feed cache (post_id, user_id) => this gives the most recent posts.
      • Read user cache.
      • Read post cache.
    • Cache architecture in a news feed system.
      • 5 layers of cache:
      • News feed (stores only IDs).
      • Content (hot cache for popular posts, normal cache for others).
      • Social graph (follower, following).
      • Action (liked, replied, etc.).
      • Counters (like counter, reply counter, etc.).
  4. Other topics
    • Vertical scaling vs Horizontal scaling
    • SQL vs NoSQL
    • Master-slave replication
    • Read replicas
    • Consistency models
    • Database sharding
    • Keep web tier stateless
    • Cache data as much as you can
    • Support multiple data centers
    • Loosely coupled components with message queues
    • Monitor key metrics. For instance, QPS during peak hours and latency while users refresh their news feed.

12. Example: Search Autocomplete System

Known as: Design top-K most searched queries.

  1. Requirements:
    • 5 results, ordered by popularity.
    • Decide whether to send request on every character, every 3 characters, or wait a few ms since the last keystroke.
  2. Design
    • Store word–frequency table.
    • Search with SQL using WHERE word LIKE 'prefix%'.
    • Good solution for small datasets.
  3. Deep dive
    • Trie data structure
      • Designed for string retrieval operations.
      • Each node represents a character or a prefix.
      • Steps:
        • O(p): Find a prefix in the trie.
        • O(c): Get all the final nodes of the prefix’s subtree.
        • O(c log c): Results are in alphabetical order; we need to reorder by frequency.
      • Good algorithm, but too slow because we need to traverse the entire subtree to get top K.
    • Optimizations:
      • Limit max length of a prefix.
      • Cache top search queries at each node (e.g., top 5).
        • O(1) if cached. This requires significant space.
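A sketch combining the trie with the cached top-k optimization described above (names are illustrative, and re-inserting an existing query with a new frequency is simplified):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.freq = 0   # > 0 marks the end of a complete query
        self.top = []   # cached top-k (query, freq) pairs for this prefix

class AutocompleteTrie:
    def __init__(self, k: int = 5):
        self.root = TrieNode()
        self.k = k

    def insert(self, query: str, freq: int):
        node = self.root
        path = [node]
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            path.append(node)
        node.freq = freq
        # refresh the cached top-k on every node along the insertion path
        for n in path:
            n.top = sorted(set(n.top + [(query, freq)]),
                           key=lambda t: -t[1])[: self.k]

    def suggest(self, prefix: str):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for q, _ in node.top]   # O(1) once cached, no subtree traversal
```

This trades memory (each node stores up to k queries) for O(prefix length) lookups, which is the trade-off the notes point out.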
    • Real time:
      • Updating the trie on every query (billions) slows down the query service.
      • Top suggestions may not change much once the trie is built, so frequent updates are not necessary.
      • Solution: Data is cached; frequency and trie are updated weekly (for example) by workers using logs and analytics.
        • Logs should be aggregated first to feed the workers that will update the trie.
    • Store the trie
      • Serialize and store in a document DB like MongoDB.
      • Key-value store (prefix -> data).
    • Load balancer
    • API server reads directly from trie cache; if not found, reads from DB and stores in the cache to avoid cache misses for subsequent queries.
    • Web uses Ajax for background fetching of results.
    • Browser cache: Suggestions do not change much (e.g., Google caches results for 1 hour per user: "Cache-Control: private, max-age=3600").
    • Use data sampling for logging.
    • Update the trie
      • Recreate full trie and replace.
      • Update individual nodes => Avoid because it requires updating all parents up to the root (and updating the cache in every node).
    • Add a filter layer to remove unwanted words from the trie.
    • Consider sharding the trie: e.g., a–z across 26 servers. However, it is better to create a “sharding manager” to balance based on size and frequency (e.g., words starting with “a” are much more common than x, y, z).
    • Multiple languages? Consider supporting Unicode.
    • Different countries? Maybe different tries by country, depending on usage.
    • What about trending real-time search?
      • Use sharding.
      • Change weight system and give more value to recently searched queries.
      • Real-time processing requires streaming data processing (Hadoop MapReduce, Spark, Kafka, etc.).

13. Example: Chat System

  1. Requirements:

    • 1:1 chat with low delivery latency.
    • Small group chat (max 100 people).
    • Guarantee message delivery and save messages if offline for a limited time.
    • Online presence.
    • Multiple device support: The same account can be logged in to multiple devices simultaneously.
    • Push notifications.
  2. Design:

    • Sender

    • Chat services

      • Store message
      • Relay message
    • Receiver

    • Protocol

      • Option: HTTP with keep-alive header for sender; polling, long polling, or WebSocket for receiver.
      • Polling (short)
        • Wastes resources.
      • Long polling
        • Longer connection time = fewer requests.
      • WebSocket
        • Handshake and then bi-directional channel connection.
        • More difficult to scale; connections are persistent and must be managed on the server.
    • Solution: Hybrid approach: HTTP for sending messages and WebSockets for receiving them.

    • Components:

      • Stateless services (load balancer, authentication, profile, group management).
      • Push notification (third party, do not reinvent).
        • FCM, APNS, etc.
      • Stateful services (WebSockets and chat services).
        • Modern servers can handle 100,000 connections.
    • Online/offline presence can be handled by presence servers (real-time layer like chats).

    • Normal data in relational database.

    • Chat data in key-value stores.

      • Easy to scale horizontally.
      • Low latency.
      • Relational databases handle the long tail of chat data poorly: once indexes grow large, random access is expensive.
        • Needed for searching within a chat or jumping to specific messages.
      • Facebook uses HBase and Discord uses Cassandra.
    • Data model:
      Message:
      * message_id (pk) bigint
      * message_from bigint
      * message_to bigint
      * content text
      * created_at timestamp
      Group_Message:
      * channel_id (pk) bigint
      * message_id (pk) bigint
      * user_id bigint
      * content text
      * created_at timestamp
      * message_id is used to order the messages because they can be created at the same time.
      * ID: Local sequence number (local to the group or chat), much easier to implement than a global unique ID like Snowflake’s solution.
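The local sequence-number idea can be sketched like this (an in-memory toy; a real system would persist and shard the counters):

```python
import itertools
import threading

class ChannelSequencer:
    """Issues message_ids that are unique and increasing *within* a channel.

    Much simpler than a globally unique generator (e.g., Snowflake):
    message ordering only has to hold inside one chat or group."""

    def __init__(self):
        self._counters = {}
        self._lock = threading.Lock()

    def next_id(self, channel_id):
        with self._lock:
            counter = self._counters.setdefault(channel_id, itertools.count(1))
            return next(counter)

seq = ChannelSequencer()
print(seq.next_id("chat:42"))  # 1
print(seq.next_id("chat:42"))  # 2
print(seq.next_id("chat:99"))  # 1  (independent per channel)
```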

  3. Design:

    • Service Discovery to assign the best server to the user based on geographical location, server capacity, etc. Example: Apache Zookeeper.
    • Send message while offline:
      • Inbox pattern.
    • What happens if sender and receiver are connected to different servers?
      • Need a synchronization queue (stored in K/V store) to send the message to another server and finally reach the client. Example: etcd for distributed synchronization.
      • Communication between these services could be RPC.
    • What happens if the receiver is not online?
      => Inbox pattern using a message queue. When online again, messages are delivered; once delivered, they are removed from the inbox.
    • How to deliver a message to 100 users in a group:
      => Fan-out pattern
      • Online users get the message; offline members get the message in their inbox.
    • Online presence:
      • We cannot rely on active WebSocket connections because the app could be in the background, sleeping, etc.
      • The client app sends a heartbeat (ping/keep-alive) to the presence servers; if none is received within 1 minute, the user is marked offline.
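The heartbeat-based presence rule can be sketched as follows (the injectable clock is only there to make the demo deterministic):

```python
import time

class PresenceTracker:
    """Marks a user offline if no heartbeat arrives within `timeout` seconds.

    Clients ping periodically; an app that is backgrounded or loses network
    simply stops pinging, so no explicit 'disconnect' event is required."""

    def __init__(self, timeout=60.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock          # injectable for testing
        self.last_seen = {}

    def heartbeat(self, user_id):
        self.last_seen[user_id] = self.clock()

    def is_online(self, user_id):
        last = self.last_seen.get(user_id)
        return last is not None and (self.clock() - last) < self.timeout

# Simulated clock so the example is deterministic
now = [0.0]
tracker = PresenceTracker(timeout=60.0, clock=lambda: now[0])
tracker.heartbeat("alice")
now[0] = 30.0
print(tracker.is_online("alice"))  # True  (30s since last ping)
now[0] = 90.0
print(tracker.is_online("alice"))  # False (no ping for 90s > 60s timeout)
```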
  4. Wrap up:

    • Adding media to chats: Thumbnails, compression.
    • End-to-end encryption.
    • Caching messages on the client.
    • Message resend mechanism.
      • IDs.
      • Timestamps.
      • Vector clocks.

14. Example: Youtube

  1. Requirements
    • Ability to upload videos fast.
    • Smooth video streaming.
    • Ability to change video quality.
    • Low infrastructure cost.
    • High availability, scalability, and reliability requirements.
    • Clients supported: Mobile apps, web browsers, and smart TVs.
  2. Design
    • Use Blob storage or CDNs.
    • Upload video flow:
      • Load balancer.
      • API servers.
      • Metadata cache/DB.
      • Original storage (e.g., blob storage).
        • Binary large objects (BLOB) are a collection of binary data stored as a single entity in a database management system.
      • Transcoding servers (encoding).
        • Transform video to other formats.
        • Queue to completion handlers (also update metadata).
      • Transcoding storage to CDNs.
    • Video streaming flow:
      • The client device continuously receives small chunks of the video stream from remote servers instead of downloading the whole file.
      • Video Streaming Protocols:
        • MPEG–DASH. MPEG stands for “Moving Picture Experts Group” and DASH stands for “Dynamic Adaptive Streaming over HTTP”.
        • Apple HLS. HLS stands for “HTTP Live Streaming”.
        • Microsoft Smooth Streaming.
        • Adobe HTTP Dynamic Streaming (HDS).
      • Directly from the closest server using CDN.
  3. Design:
    • Transcoding
      • Different bitrates/qualities/resolutions require different internet speeds.
      • Raw video consumes a large amount of storage.
      • Compatibility with devices.
      • Deliver high resolution if high bandwidth is available.
      • Switch video quality depending on network conditions.
      • Different containers (avi, mp4, webm, etc.) and codecs for compression and decompression (H.264, VP9, HEVC, 4K ProRes, etc.); this will depend on devices (e.g., old phones).
      • Directed acyclic graph (DAG) model.
      • Transcoding a video is computationally expensive and time-consuming.
      • Execute sequential or parallel tasks during encoding (e.g., Facebook DAG model).
    • Safety optimization: Pre-signed upload URL (e.g., Amazon S3) or Shared Access Signature (e.g., Azure Blob Storage).
      • This speeds up upload times because content is delivered directly to blob storage.
      • Enable multipart upload (multiple chunks) in parallel.
      • For performance: Transcode and distribute segments while uploading.
    • Cost-saving.
      • Most-viewed videos are served from CDNs.
      • Less-viewed videos are served from our own video servers.
      • Less-viewed videos do not need every encoding, or can be transcoded on demand.
      • Popular videos by region do not require global distribution.
    • Errors
      • Implement recoverable error strategies like retry.
    • Live streaming
      • Even lower latency is required, so another protocol is likely needed.
      • .mpd (for DASH manifest)
      • .m3u8 (for HLS manifest)
      • The browser will decide, depending on status, which segment to continue playing.
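The DAG model for transcoding can be sketched with Python's stdlib `graphlib`; the task names below are invented for illustration. Tasks run as soon as all their dependencies finish, which is what lets independent encodings proceed in parallel:

```python
from graphlib import TopologicalSorter

# Each task maps to the tasks it depends on (hypothetical pipeline).
transcoding_dag = {
    "split_video":        ["download_original"],
    "extract_audio":      ["download_original"],
    "encode_720p":        ["split_video"],
    "encode_1080p":       ["split_video"],
    "generate_thumbnail": ["download_original"],
    "assemble_outputs":   ["encode_720p", "encode_1080p", "extract_audio"],
}

ts = TopologicalSorter(transcoding_dag)
ts.prepare()
batches = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # tasks whose dependencies all finished
    batches.append(ready)
    for task in ready:              # a real system dispatches these to workers
        ts.done(task)

for batch in batches:
    print("run in parallel:", batch)
```

Each printed batch is a set of tasks with no remaining dependencies, so everything inside a batch can be fanned out to transcoding workers at once.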

15. Example: Google Drive

  1. Requirements

    • Upload, download files.
    • File sync.
    • Notifications.
    • Encrypted.
    • All file formats.
    • Data loss is not an option.
    • High availability and scalability.
  2. Design

    • Single server
      • Upload endpoint.
        • Normal upload for small files.
        • Resumable upload for big files.
      • Download.
      • All protected by SSL using HTTPS.
        • Considering FTP?
    • Use Amazon S3.
      • Fast, secure, and cheap storage.
      • Bucket redundancy in the same region.
      • Bucket redundancy in multiple regions.
    • Load balancer.
    • Multiple web servers for horizontal scaling.
    • Sync conflicts:
      • Two users modify the same file or folder.
      • Solution: First modification is accepted; the second one gets a conflict.
        • File.txt
        • File (1).txt
      • Differential Synchronization for multiple users.
    • Divide and conquer: Use block servers.
      • Files are split into blocks.
      • Blocks get a content hash.
      • Block size of 4MB (like Dropbox).
      • To reconstruct a file, fetch its blocks and join them in order.
      • Files get a checksum.
      • Chunking, compression, and encryption are implemented in a single server (in a single language/platform, instead of having this responsibility in the clients, for security and maintainability).
    • Inactive blocks eventually go to cold storage.
    • Metadata database.
      • User info, blocks, versions, upload timestamp.
    • Notification service.
      • With offline backup queue.
      • Changes will be synced when the client is online again.
      • Notify clients when files are added, modified, or deleted.
      • Long polling with clients (like Dropbox).
        • Clients try to stay connected.
      • WebSockets are not a good fit here: a bidirectional channel is unnecessary and file-change notifications are infrequent.
    • Download file.
      • For sync, get changes triggered by notification and only download changes.
    • Save storage:
      • De-duplicate data blocks among file versions. Redundant blocks are not necessary; only changes are stored.
      • Limit the number of versions.
      • Keep valuable versions only.
      • Move infrequently used data to cold storage (e.g., Amazon S3 Glacier).
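The block-server ideas above (fixed-size blocks, content hashes, de-duplication across file versions) can be sketched as follows, using 8-byte blocks instead of 4 MB purely for the demo:

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, like Dropbox

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split file bytes into fixed-size blocks keyed by content hash."""
    blocks = {}
    order = []  # hashes in file order, so the file can be reassembled
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        blocks[digest] = block       # identical blocks dedupe automatically
        order.append(digest)
    return blocks, order

# Two versions of a file sharing their first block: only the changed
# block needs to be stored or uploaded again.
v1 = b"A" * 8 + b"B" * 8
v2 = b"A" * 8 + b"C" * 8
store = {}
for version in (v1, v2):
    blocks, order = split_into_blocks(version, block_size=8)
    new = [h for h in order if h not in store]
    store.update(blocks)
    print(f"{len(order)} blocks, {len(new)} new")
```

Because the key is a content hash, any block already present in the store is skipped, which is exactly the de-duplication-across-versions saving described above.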
  3. Design

  • Delta synchronization: only the modified blocks are synced, not the whole file.

  • Data compression

    • Compress blocks to reduce data size.
    • Compression algorithm depends on file type.
  • Block encryption before sending to cloud storage.

  • Strong consistency.

    • Data in cache replicas and the master must stay consistent.
    • Invalidate caches on database writes.
    • In relational databases, consistency comes easily thanks to ACID; in NoSQL, ACID-like guarantees must be incorporated programmatically.
  • Error handling.

    • Load balancer failure => A secondary becomes active.
    • Block server failure => Other servers pick up unfinished or pending jobs.
    • S3 failure => Another copy is used because of geographical and regional redundancy.
    • API server failure => API servers are stateless, so traffic is redirected to another active instance.
    • Metadata cache failure => Automatic access to another one; caches are replicated and failed ones will be replaced.
    • Metadata DB failure.
      • Master down: Promote a slave to temporarily be a new master, create a new slave.
      • Slave down: Use another slave, create a new slave.
    • Notifications service failure => Long-poll connections are closed; client connections will try to eventually reconnect to another healthy server.
    • Offline backup queue failure => Queues are replicated; consumers need to re-subscribe to the backup queue.
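The invalidate-cache-on-write rule from the consistency notes can be sketched as follows (plain dicts stand in for the metadata cache and database):

```python
class MetadataStore:
    """Write-invalidate: every DB write deletes the cached entry, so the
    next read repopulates the cache from the source of truth."""

    def __init__(self):
        self.db = {}      # stands in for the metadata database
        self.cache = {}   # stands in for the metadata cache

    def write(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)   # invalidate on write, never update in place

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db[key]        # cache miss: go to the database
        self.cache[key] = value
        return value

store = MetadataStore()
store.write("file:1", {"version": 1})
print(store.read("file:1"))          # populates the cache
store.write("file:1", {"version": 2})
print(store.read("file:1"))          # stale entry was invalidated, so v2
```

Deleting instead of updating the cache entry avoids the race where a concurrent reader repopulates the cache with stale data mid-write; the next read always goes back to the database.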