Designing Twitter: Introduction


When asked to design a particular application, you are not expected to design the entire thing with a detailed analysis of every component. Not only is that a herculean task, it also requires extensive research, discussion, and analysis, which simply cannot be covered in a single article.

Therefore, we essentially look at the minimum viable product (MVP), which encompasses the app's core functionalities. After identifying the core functionalities, we look at ways to implement the application. Keeping in mind the characteristics of the application (traffic, usage, etc.), we make design decisions that deliver optimal performance. Each design decision should be justified.

MVP and Characteristics of Twitter


Below are the core requirements of the Twitter application:

User can:
  1. Tweet
  2. Follow
  3. View Timelines: User timeline & Home timeline
  4. Search

Characteristics of Twitter:

  1. 300M daily users
  2. 6,000 tweets/s
  3. Read-heavy app

Data model

The use case for designing the MVP of Twitter involves relatively simple data tables. Because the data is structured and simple in nature, we can use a relational database for persistence. Below is the relational table design.

  • Users(user_id (primary key), name)
  • Tweets(tweet_id (primary key), content, user_id (foreign key to Users))
  • Followers(user_id, follower_id), with a composite primary key; both columns are foreign keys to Users

The ER diagram for this schema shows a one-to-many relationship between Users and Tweets, and a many-to-many relationship between users themselves through the Followers table.
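
A minimal sketch of this schema, using SQLite purely for illustration (the table and column names follow the design above; the composite key on Followers and the 280-character check are assumptions of this sketch):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway database, for illustration only
    conn.executescript("""
        CREATE TABLE Users (
            user_id INTEGER PRIMARY KEY,
            name    TEXT NOT NULL
        );

        CREATE TABLE Tweets (
            tweet_id INTEGER PRIMARY KEY,
            content  TEXT NOT NULL CHECK (length(content) <= 280),
            user_id  INTEGER NOT NULL REFERENCES Users(user_id)
        );

        -- Many-to-many: each row means "follower_id follows user_id".
        CREATE TABLE Followers (
            user_id     INTEGER NOT NULL REFERENCES Users(user_id),
            follower_id INTEGER NOT NULL REFERENCES Users(user_id),
            PRIMARY KEY (user_id, follower_id)
        );
    """)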

Feature 1: Writing tweets

Use case: The user is able to post a short message (<= 280 characters), which will be visible to the people following them, as well as in the user's home timeline.

Flow:

  1. User posts the tweet
  2. The request hits the load balancer, which redirects it to a nearby data center
  3. The request is handed to a database machine, which processes it and writes to the tables
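
A minimal sketch of this write path, reusing the sqlite3 connection from the data-model sketch above (the function name and the explicit length check are illustrative, not Twitter's actual API):

    def post_tweet(conn, user_id: int, content: str) -> int:
        """Persist a tweet for user_id and return the new tweet_id."""
        if len(content) > 280:
            raise ValueError("tweet exceeds 280 characters")
        cur = conn.execute(
            "INSERT INTO Tweets (content, user_id) VALUES (?, ?)",
            (content, user_id),
        )
        conn.commit()
        return cur.lastrowid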

Why is a load balancer needed?

A load balancer distributes network traffic across a number of servers. The aim is to increase the number of requests handled concurrently and to make the application more reliable.

  1. Increase the number of requests handled concurrently:
    • By effectively distributing load across the servers
    • By choosing the best server for a user with an effective load-balancing algorithm
    • Common load-balancing algorithms (sketched below): round-robin (requests are assigned to servers in sequence), least connections (the least busy server handles the request), IP hash (the client's IP address is hashed to determine the server)
  2. Make the application more reliable:

    The load balancer performs health checks and tracks the status of all the servers. If a server is down, client requests are redirected to another server, making the application more reliable.

Examples of load balancers: Nginx, AWS ELB
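
To make those algorithms concrete, here is a minimal sketch of round-robin, least-connections, and IP-hash server selection. The server pool and the hashing scheme are invented for illustration; real load balancers such as Nginx implement these strategies natively.

    import hashlib
    from itertools import cycle

    servers = ["app-server-1", "app-server-2", "app-server-3"]  # hypothetical pool

    # Round-robin: hand requests to servers in sequence.
    round_robin = cycle(servers)

    def pick_round_robin() -> str:
        return next(round_robin)

    # Least connections: pick the server currently handling the fewest requests
    # (the counters would be updated as requests start and finish).
    active = {s: 0 for s in servers}

    def pick_least_connections() -> str:
        return min(active, key=active.get)

    # IP hash: the same client IP always maps to the same server.
    def pick_by_ip(client_ip: str) -> str:
        digest = hashlib.md5(client_ip.encode()).hexdigest()
        return servers[int(digest, 16) % len(servers)]

    print(pick_round_robin())          # app-server-1
    print(pick_by_ip("203.0.113.42"))  # always the same server for this IP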

Feature 2: Rendering Home timeline

Use case: Render the home timeline for the user. The home timeline is the user's feed, where they see tweets posted by the people they follow.

Naive approach:

  1. Get the IDs of the people the user is following
  2. Get the tweets of those people
  3. Merge them in chronological order

Problem: This operation requires a join, which is computationally heavy for large databases and does not scale.
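
For reference, the naive approach is essentially the following query against the relational tables defined earlier (a sketch; tweet_id is used as a proxy for recency because the minimal schema has no timestamp column):

    def home_timeline_naive(conn, user_id: int, limit: int = 800):
        """Compute the home timeline on the fly with a join (does not scale)."""
        return conn.execute(
            """
            SELECT t.tweet_id, t.content, t.user_id
            FROM Tweets t
            JOIN Followers f ON f.user_id = t.user_id
            WHERE f.follower_id = ?
            ORDER BY t.tweet_id DESC
            LIMIT ?
            """,
            (user_id, limit),
        ).fetchall()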

Twitter's fanout solution for rendering the home timeline quickly:

The idea is to precompute the home timelines of all users. That is, whenever a tweet arrives, do the heavy processing up front and figure out where the tweet should go: Twitter fetches all the people who follow the person who tweeted and adds the tweet to each of their home timelines. So a user's home timeline is constantly being updated, even when the user is not active.

This makes fetching and reading the home timeline faster and easier, with the tradeoff that writing tweets is slower. However, the stats make it clear that Twitter is read-heavy, with writes being far rarer, so this tradeoff works out for the better.

High-level overview:

  1. You create a request when you write a tweet
  2. The request goes to the load balancer
  3. The load balancer pushes your tweet into a Redis cluster, where it is replicated 3 times to ensure high availability
  4. Note: the tweet may be added to one Redis machine faster than to the other two, but eventually all 3 Redis machines will hold the same tweets. We are trading off consistency for high availability.
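
A minimal sketch of the fanout write path with the redis-py client, assuming three hypothetical replica hosts and the Followers table from earlier. In a real deployment the Redis cluster would replicate internally rather than the application writing to three machines, so treat this purely as an illustration of the idea.

    import redis  # pip install redis

    # Hypothetical replica hosts; a production cluster handles replication itself.
    REPLICAS = [redis.Redis(host=h) for h in ("redis-1", "redis-2", "redis-3")]

    def fanout_tweet(conn, author_id: int, tweet_id: int) -> None:
        """Push a freshly written tweet onto every follower's precomputed timeline."""
        followers = conn.execute(
            "SELECT follower_id FROM Followers WHERE user_id = ?", (author_id,)
        ).fetchall()
        for (follower_id,) in followers:
            key = f"home_timeline:{follower_id}"
            for r in REPLICAS:
                # Writes may land on the three replicas at different times;
                # the timelines converge eventually (eventual consistency).
                r.lpush(key, tweet_id)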

Why use Redis for fanout?

  1. Redis is an in-memory, key-value store.
  2. In-memory: none of the operations require a round trip to disk; data is accessed directly from memory, which dramatically speeds up operations and allows more of them to run concurrently.
  3. Key-value store: the data in Twitter is simple in nature and can easily be modelled in key-value form.

Each user's home timeline sits in the Redis cluster with a maximum of 800 entries. The data structure for a user's home timeline is the native Redis list, which keeps tweets ordered by recency. Removal of the oldest tweets is also easy: O(1) insert and delete.
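
A sketch of that list handling with redis-py (host and key names are illustrative): LPUSH puts the newest tweet at the head, and LTRIM caps the list at 800 entries, dropping at most one old entry per insert.

    import redis

    r = redis.Redis(host="redis-1")   # hypothetical timeline shard

    def append_to_timeline(follower_id: int, tweet_id: int) -> None:
        key = f"home_timeline:{follower_id}"
        r.lpush(key, tweet_id)    # newest tweet goes to the head of the list
        r.ltrim(key, 0, 799)      # drop anything past the 800th entry

    def newest(follower_id: int, count: int = 10) -> list:
        # Reads come back newest-first, i.e. in reverse chronological order.
        return r.lrange(f"home_timeline:{follower_id}", 0, count - 1)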

Optimizing fanout

The Redis cluster would require a tremendous amount of memory to store the timelines of all users. One way to optimize this is to store only the timelines of users who have been active in the past few days or weeks. So if a user has not been active in the last 2 weeks, don't keep them in the Redis cluster, and avoid the excessive precomputation.

What to do if an inactive user logs in?

Use the SQL approach and fetch the tweets for their home timeline on demand. Have a rule that decides whether to classify the user as active; if active, add their entry to the Redis cluster.
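
A sketch of that fallback, combining the SQL join from the naive approach with repopulating the Redis entry. The key name is illustrative, and the activity rule is simplified to "a user who just logged in counts as active again"; the real rule is a policy decision.

    def get_home_timeline(conn, r, user_id: int):
        """Return the home timeline, falling back to SQL for inactive users."""
        key = f"home_timeline:{user_id}"
        cached = r.lrange(key, 0, 799)
        if cached:
            return [int(t) for t in cached]

        # Inactive user: not in the Redis cluster, so compute the timeline on demand.
        rows = conn.execute(
            """
            SELECT t.tweet_id
            FROM Tweets t JOIN Followers f ON f.user_id = t.user_id
            WHERE f.follower_id = ?
            ORDER BY t.tweet_id DESC LIMIT 800
            """,
            (user_id,),
        ).fetchall()
        tweet_ids = [tid for (tid,) in rows]

        # The user just logged in, so by our simplified rule they are active
        # again: repopulate their Redis entry.
        if tweet_ids:
            r.rpush(key, *tweet_ids)   # rows are newest-first, so append in order
            r.ltrim(key, 0, 799)
        return tweet_ids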

Flow of operation

  1. The user goes to Twitter via the browser and sends a GET request, which is handled by a load balancer
  2. The load balancer reads the nature of the request (a read) and queries the 3 Redis databases that hold the user's timeline
  3. The database with the quickest response sends the timeline to the load balancer
  4. The load balancer sends the response to the browser & the timeline is rendered for the user

Note: The load balancer needs to know which 3 Redis databases to send the request to, out of potentially thousands. This mapping can be maintained with a hash map: key = user_id, value = [server1, server2, server3].
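
A sketch of that lookup, with a hypothetical in-application map from user_id to the three Redis hosts holding that user's timeline. In production the routing table would live in a shared service, and the three replicas would be raced for the quickest response; here we simply try them in order.

    import redis

    # Hypothetical routing table: which 3 Redis machines hold each user's timeline.
    timeline_servers = {
        42: ["redis-7", "redis-19", "redis-113"],
    }

    def read_home_timeline(user_id: int, count: int = 50):
        hosts = timeline_servers[user_id]
        for host in hosts:
            try:
                r = redis.Redis(host=host, socket_timeout=0.05)
                return r.lrange(f"home_timeline:{user_id}", 0, count - 1)
            except redis.exceptions.RedisError:
                continue   # replica down or slow: fall through to the next one
        raise RuntimeError("no timeline replica reachable")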

Fanout challenge: What if a celebrity with millions of followers tweets?

A write would have to happen for each of their millions of followers, times 3 for replication, which is a huge computational overhead. There is significant lag in the updates as well: some followers see the tweet immediately, while others see it much later.

Solution: For celebrities, use the SQL-based approach. In other words, load their tweets at the time a follower requests their timeline.
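
A sketch of the resulting hybrid read: the precomputed Redis timeline is merged at read time with the recent tweets of any celebrities the user follows, pulled straight from SQL. The celebrity set and the merge-by-id ordering are illustrative assumptions.

    CELEBRITY_IDS = {1001, 1002}   # hypothetical accounts excluded from fanout

    def hybrid_timeline(conn, r, user_id: int, limit: int = 800):
        # Precomputed part: everyone the user follows except celebrities.
        precomputed = [int(t) for t in r.lrange(f"home_timeline:{user_id}", 0, limit - 1)]

        # Pull-based part: celebrities the user follows, fetched at read time.
        followed_celebs = [
            uid
            for (uid,) in conn.execute(
                "SELECT user_id FROM Followers WHERE follower_id = ?", (user_id,)
            )
            if uid in CELEBRITY_IDS
        ]
        celeb_tweets = []
        for celeb in followed_celebs:
            celeb_tweets += [
                tid
                for (tid,) in conn.execute(
                    "SELECT tweet_id FROM Tweets WHERE user_id = ? "
                    "ORDER BY tweet_id DESC LIMIT 100",
                    (celeb,),
                )
            ]

        # Merge, using tweet_id as a proxy for recency in this minimal schema.
        return sorted(set(precomputed) | set(celeb_tweets), reverse=True)[:limit]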

Tradeoffs of fanning out

  1. Eventual consistency in exchange for high availability
  2. Space consumption: high, but tweets are limited to 280 characters, making this approach manageable in terms of space
  3. Fast reads, slower writes

Feature 3: Following

Use case: Allow users to follow/unfollow other users & fetch the followers of a user for fanout.

A regular Followers table would suffice for this. Because follower relationships don't get updated very frequently, there is no need for an in-memory database.

The following table should be sufficient: Followers(user_id, follower_id), a many-to-many relation.

  1. The design of the table is simple. According to the requirements, the queries on the Followers table will never require joins. Absolutely avoid unnecessary indexing.
  2. MySQL has a limit of 64 TB per table and PostgreSQL 32 TB per table (which can be raised manually), so a well-designed table can handle hundreds of billions of records without much difficulty.
  3. For best performance you need sufficient disk space, good disk performance (high IOPS), and a large amount of memory, all of which can be provisioned from a cloud provider.
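
A minimal sketch of follow/unfollow against the Followers table from the data model, reusing the sqlite3 connection from earlier (INSERT OR IGNORE is SQLite-specific and stands in for whatever duplicate handling the production database would use):

    def follow(conn, follower_id: int, followee_id: int) -> None:
        """follower_id starts following followee_id."""
        conn.execute(
            "INSERT OR IGNORE INTO Followers (user_id, follower_id) VALUES (?, ?)",
            (followee_id, follower_id),
        )
        conn.commit()

    def unfollow(conn, follower_id: int, followee_id: int) -> None:
        conn.execute(
            "DELETE FROM Followers WHERE user_id = ? AND follower_id = ?",
            (followee_id, follower_id),
        )
        conn.commit()

    def followers_of(conn, user_id: int) -> list[int]:
        """Used by fanout: everyone who follows user_id."""
        return [fid for (fid,) in conn.execute(
            "SELECT follower_id FROM Followers WHERE user_id = ?", (user_id,)
        )]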

Feature 4: Searching

Traditional searching involves inverted indices and batch index updates. While Twitter makes use of inverted indices, it also requires real-time index updates.

As per Twitter's whitepaper on Earlybird, Twitter's search requirements are:

  1. Low-latency, high-throughput query evaluation
  2. Ability to ingest content rapidly (and handle data spikes as well) and make it searchable immediately

As a result of 1 & 2, concurrent reads and writes are necessary: index structures must be constantly updated as documents are ingested.

Timestamps are extremely important for ordering search results on Twitter.

Distributed search engine

Twitter's requirements correspond to a distributed search engine, on top of which sits Earlybird, Twitter's real-time retrieval engine. The search system is discussed briefly below.

Low latency and high throughput in query evaluation can be achieved using a distributed search system, where user queries are routed on the basis of server load and/or geographical location.

To ensure robustness and high availability, the database machines are organized in a replicated, broker-coordinated (load-balancer-coordinated) manner. The document collection is logically partitioned (sharded) into disjoint segments, and individual machines are responsible for serving the indexes that correspond to these document partitions.

Flow of search:

  1. The user sends a search query, which is received by a load balancer
  2. The load balancer contacts the database servers that hold the relevant shards of the documents requested in the query
  3. Each database server processes its part of the query and sends its results to the load balancer
  4. The load balancer aggregates the results and sends them back to the user

Filtering the documents according to relevance is a two-phase process:

  1. Phase 1: A fast, computationally cheap algorithm such as PageRank is applied to generate a list of potentially relevant candidates
  2. Phase 2: A more expensive, slower algorithm is applied to the list of candidates generated in phase 1

Twitter uses the open-source Lucene engine to achieve all of this.
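
A toy sketch of the two-phase idea over a tiny in-memory inverted index. The candidate cutoff of 100 and both scoring functions are placeholders to show the shape of the computation; they are not Lucene's or Twitter's actual ranking.

    from collections import defaultdict

    # Tiny inverted index: term -> list of (tweet_id, static_score).
    index = defaultdict(list)

    def add_document(tweet_id: int, text: str, static_score: float) -> None:
        for term in set(text.lower().split()):
            index[term].append((tweet_id, static_score))

    def search(query: str, k: int = 10):
        # Phase 1: cheap candidate generation using the precomputed static score.
        candidates = {}
        for term in query.lower().split():
            for tweet_id, static_score in index.get(term, []):
                candidates[tweet_id] = max(candidates.get(tweet_id, 0.0), static_score)
        shortlist = sorted(candidates, key=candidates.get, reverse=True)[:100]

        # Phase 2: a more expensive scorer applied only to the shortlist.
        def expensive_score(tweet_id: int) -> float:
            return candidates[tweet_id] + tweet_id * 1e-9   # placeholder: prefer recency

        return sorted(shortlist, key=expensive_score, reverse=True)[:k]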

Earlybird: Twitter's real-time search engine

Earlybird processes a tweet as soon as it is published, independently of the fanout process. Here is how Earlybird works:

  1. Tweets enter the ingestion pipeline, where they are tokenized and annotated with additional metadata (for example, language). To handle large volumes, tweets are hash-partitioned across Earlybird servers.
  2. When a user searches for a query, a Blender (front-end server) parses the user's query, augments it with the user's local social graph, and sends it to multiple Earlybird servers.
  3. Each server receives the query and performs the search using Lucene. Earlybird then performs filtering and personalization based on a relevance score (a simplified sketch follows this list), which is calculated using:
    • Static signals, added during indexing
    • Resonance signals, which are dynamically updated (for example, the number of retweets) and pushed by a component called the Updater
    • Information about the searcher (location, interests, social graph)
  4. The most recent and relevant tweets are filtered and returned to the Blender, which merges and re-ranks results from multiple Earlybird servers.
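
A heavily simplified sketch of how those three kinds of signals might be combined into a single relevance score. The field names and weights are invented purely for illustration; the actual formula is not described in this article.

    from dataclasses import dataclass

    @dataclass
    class TweetSignals:
        text_match: float      # static signal computed at indexing time
        retweets: int          # resonance signal, pushed later by the Updater
        author_in_graph: bool  # searcher signal: author is in the user's social graph

    def relevance(s: TweetSignals) -> float:
        # Invented weights, only to show the shape of the computation.
        score = 1.0 * s.text_match
        score += 0.1 * s.retweets
        if s.author_in_graph:
            score *= 1.5
        return score

    print(relevance(TweetSignals(text_match=0.8, retweets=12, author_in_graph=True)))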

Notes:

  1. In production, Earlybird servers constantly receive search requests and indexing requests simultaneously.
  2. Earlybird is written entirely in Java because:
    • it had to integrate with Lucene, which is a Java library
    • Java's concurrency model is easy to understand

Conclusion

In this article, we talked about how one can design Twitter. Starting with the requirements and characteristics, we identified the core functional requirements and the most common non-functional requirements: scalability & reliability. Keeping these in mind, we came up with high-level designs, identified the pain points, and provided solutions to them. We also justified the design decisions at each step.

Well then, until next time.