When asked to design a particular application, you are not expected to design the entire application with a detailed analysis of each and every component. Not only is that a herculean task, it also requires extensive research, discussion and analysis, which simply cannot be covered in a single article.
Therefore, we essentially look at the minimum viable product (MVP), which encompasses the app's core functionalities. After identifying the core functionalities, we look at ways to implement the application. Keeping in mind the characteristics of the application (traffic, usage, etc.), we make design decisions that deliver optimal performance. Every design decision needs a justification.
Below are the core requirements of the Twitter application.
User can:
- post a short tweet (at most 280 characters)
- follow/unfollow other users
- view their home timeline
- search tweets

Characteristics of Twitter:
- read-heavy: timelines are read far more often than tweets are written
- needs to scale to a very large number of users and tweets, and stay reliable
The use case for designing the MVP of Twitter has relatively simple data tables. Because the data is structured and simple in nature, we can use a relational database for persistence. Below is the relational table design.
The ER diagram for the same is as follows:
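To make the schema concrete, here is a minimal sketch of what these tables might look like, written against SQLite via Python just to keep it self-contained; the exact table and column names (users, tweets, followers) are assumptions for illustration, not Twitter's actual schema.

```python
import sqlite3

# A minimal, assumed schema for the MVP: users, tweets, and a many-to-many
# followers relation. SQLite is used only to keep the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id   INTEGER PRIMARY KEY,
    username  TEXT NOT NULL UNIQUE
);

CREATE TABLE tweets (
    tweet_id   INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    body       TEXT NOT NULL CHECK (length(body) <= 280),
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE TABLE followers (
    user_id     INTEGER NOT NULL REFERENCES users(user_id),  -- the person being followed
    follower_id INTEGER NOT NULL REFERENCES users(user_id),  -- the person following them
    PRIMARY KEY (user_id, follower_id)
);
""")
```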
Use case: A user is able to post a short message (at most 280 characters), which will be visible to the people following them, as well as in the user's home timeline.
A load balancer distributes incoming network traffic across a number of servers. The aim is to increase the number of requests handled concurrently and to make the application more reliable.
The load balancer performs health checks and tracks the status of all the servers. If a server is down, client requests are redirected to another server, making the application more reliable.
Examples of load balancers: Nginx, AWS ELB
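Just to illustrate the idea, here is a toy round-robin load balancer with health checks; the server list and the /health endpoint are assumptions, and a real deployment would use something like Nginx or AWS ELB rather than hand-rolled code.

```python
import itertools
import urllib.request

# Hypothetical pool of application servers sitting behind the load balancer.
SERVERS = ["http://app1:8000", "http://app2:8000", "http://app3:8000"]

def is_healthy(server: str) -> bool:
    """Health check: consider the server up if /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{server}/health", timeout=1) as resp:
            return resp.status == 200
    except OSError:
        return False

_round_robin = itertools.cycle(SERVERS)

def pick_server() -> str:
    """Round-robin over the pool, skipping servers that fail the health check."""
    for _ in range(len(SERVERS)):
        server = next(_round_robin)
        if is_healthy(server):
            return server
    raise RuntimeError("no healthy servers available")
```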
Use case: Render the home timeline for the user. The home timeline is the user's feed, where they see tweets posted by the people they follow.
Problem: This operation requires a join, which is computationally heavy for large databases and does not scale.
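For reference, the pull-based query would look roughly like the following (using the assumed schema from the earlier sketch); the join over followers and tweets has to re-scan every tweet from every followed account on each timeline load.

```python
# Pull-based timeline: join the followers table with the tweets table at read time.
# Fine for small data, but the work is redone on every request and scales poorly.
HOME_TIMELINE_SQL = """
SELECT t.tweet_id, t.user_id, t.body, t.created_at
FROM tweets AS t
JOIN followers AS f ON f.user_id = t.user_id   -- accounts the viewer follows
WHERE f.follower_id = ?                        -- the viewer
ORDER BY t.created_at DESC
LIMIT 800;
"""

def home_timeline_pull(conn, viewer_id: int):
    return conn.execute(HOME_TIMELINE_SQL, (viewer_id,)).fetchall()
```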
The idea is to precompute the home timelines of all users: whenever a tweet arrives, do the processing up front and figure out where the tweet should go. Twitter fetches all the people who follow the person who tweeted, and adds the tweet to each of their home timelines. So a user's home timeline is constantly being updated, even when the user is not active.
This makes fetching and reading the home timeline faster and easier, with the tradeoff that writing tweets becomes slower. However, the stats make it clear that Twitter's consumption is read-heavy, with writes being far less frequent, so this tradeoff works out for the better.
Note: It could be the case that the tweet is added to one Redis machine faster than the other two, but eventually all three Redis machines will have the same tweets. We are trading off consistency for high availability.
Each user's home timeline sits in a Redis cluster, with a maximum of 800 entries. The data structure for a user's home timeline is a native Redis list, which keeps entries in chronological order, and removing the oldest tweets is equally easy: O(1) insert and delete at the ends.
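A hedged sketch of the fan-out-on-write path, assuming the redis-py client and the followers table from the earlier schema; the key naming convention and the 800-entry cap follow the description above, but the details are assumptions.

```python
import json
import redis  # redis-py client, assumed installed and pointed at the cluster

r = redis.Redis(host="localhost", port=6379)
TIMELINE_MAX = 800  # keep only the newest 800 entries per home timeline

def fanout_tweet(conn, tweet_id: int, author_id: int, body: str) -> None:
    """On write: push the new tweet onto every follower's precomputed timeline."""
    entry = json.dumps({"tweet_id": tweet_id, "author_id": author_id, "body": body})
    follower_ids = [row[0] for row in conn.execute(
        "SELECT follower_id FROM followers WHERE user_id = ?", (author_id,))]
    for follower_id in follower_ids:
        key = f"timeline:{follower_id}"
        r.lpush(key, entry)                # newest tweet goes to the head of the list
        r.ltrim(key, 0, TIMELINE_MAX - 1)  # drop the oldest entries beyond 800

def read_timeline(user_id: int, count: int = 50):
    """On read: the timeline is already materialised, so this is a cheap list read."""
    return [json.loads(x) for x in r.lrange(f"timeline:{user_id}", 0, count - 1)]
```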
The Redis cluster would require a tremendous amount of memory to store the timelines of all users. One way to optimize this is to store only the timelines of users who have been active in the past few days or weeks. So if a user has not been active in the last 2 weeks, don't keep them in the Redis cluster, and avoid the excessive precomputation.
What about an inactive user who comes back? Use the SQL approach and fetch the tweets into their home timeline. Have a rule that decides whether to classify the user as active; if active, add their entry back to the Redis cluster.
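One possible shape for that rule, reusing the helpers from the sketches above (r, read_timeline, home_timeline_pull, TIMELINE_MAX); the 2-week window mirrors the example, and tracking a last_active timestamp per user is an assumption here.

```python
import json
from datetime import datetime, timedelta

ACTIVE_WINDOW = timedelta(weeks=2)  # threshold from the example above; tune as needed

def is_active(last_active: datetime) -> bool:
    return datetime.utcnow() - last_active <= ACTIVE_WINDOW

def get_home_timeline(conn, user_id: int, last_active: datetime):
    """Serve from Redis for active users; otherwise rebuild via SQL and re-admit them."""
    if is_active(last_active) and r.exists(f"timeline:{user_id}"):
        return read_timeline(user_id)
    # Inactive (or evicted) user: fall back to the SQL join once...
    rows = home_timeline_pull(conn, user_id)
    # ...and repopulate the Redis timeline so the next read is cheap again.
    for tweet_id, author_id, body, _created_at in reversed(rows):
        r.lpush(f"timeline:{user_id}",
                json.dumps({"tweet_id": tweet_id, "author_id": author_id, "body": body}))
    r.ltrim(f"timeline:{user_id}", 0, TIMELINE_MAX - 1)
    return rows
```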
Note: The load balancer should know which 3 Redis DBs to send the request to, out of potentially thousands of DBs. This can be maintained with the help of a hash map: key = user_id, value = [server1, server2, server3].
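A toy illustration of that lookup; in practice the mapping would live in a cluster-management or service-discovery layer, and the host names here are made up.

```python
# user_id -> the three Redis hosts holding replicas of that user's home timeline.
# In reality this is owned by the cluster-management layer, not a literal dict.
timeline_replicas = {
    101: ["redis-a1:6379", "redis-b7:6379", "redis-c3:6379"],
    102: ["redis-a4:6379", "redis-b2:6379", "redis-c9:6379"],
}

def replicas_for(user_id: int) -> list[str]:
    return timeline_replicas[user_id]
```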
Problem: For users with millions of followers (celebrities), roughly 3 x the number of their followers write operations would have to take place for every tweet (one per replica), which is a huge computational overhead. There is significant lag in the updates as well: some people are able to see the tweet right away, while others see it much later.
Solution: For celebrities, use the SQL-based approach. In other words, load their tweets at the time a follower requests their timeline.
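A sketch of that hybrid read path, reusing read_timeline from the earlier sketch; how celebrities are identified (here, a hard-coded set of ids) and the sort key are assumptions for illustration.

```python
# Assumed: a set of celebrity user ids maintained elsewhere (e.g. accounts above
# some follower-count threshold); these accounts are skipped during fanout.
CELEBRITY_IDS = {42, 1337}

def home_timeline_hybrid(conn, viewer_id: int, limit: int = 50):
    """Merge the precomputed Redis timeline with celebrity tweets pulled at read time."""
    merged = read_timeline(viewer_id, limit)  # fanned-out tweets from regular accounts
    followed_celebs = [row[0] for row in conn.execute(
        "SELECT user_id FROM followers WHERE follower_id = ?", (viewer_id,))
        if row[0] in CELEBRITY_IDS]
    for celeb_id in followed_celebs:
        for tweet_id, body in conn.execute(
                "SELECT tweet_id, body FROM tweets "
                "WHERE user_id = ? ORDER BY created_at DESC LIMIT ?", (celeb_id, limit)):
            merged.append({"tweet_id": tweet_id, "author_id": celeb_id, "body": body})
    # Tweet ids are assumed to be roughly time-ordered, so sort newest first.
    merged.sort(key=lambda t: t["tweet_id"], reverse=True)
    return merged[:limit]
```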
Use case: Allow users to follow/unfollow other users, and fetch the followers of a user for fanout.
A regular relational table would suffice for this. Because follower relationships don't change very frequently, there is no need for an in-memory DB.
The following table should be sufficient: Followers(user_id, follower_id), a many-to-many relation.
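For completeness, the follow/unfollow and fanout-lookup operations against that table are just single-row writes and a simple select (again using the assumed SQLite schema from the earlier sketch).

```python
def follow(conn, follower_id: int, user_id: int) -> None:
    conn.execute("INSERT OR IGNORE INTO followers (user_id, follower_id) VALUES (?, ?)",
                 (user_id, follower_id))

def unfollow(conn, follower_id: int, user_id: int) -> None:
    conn.execute("DELETE FROM followers WHERE user_id = ? AND follower_id = ?",
                 (user_id, follower_id))

def followers_of(conn, user_id: int) -> list[int]:
    """Used by the fanout step: everyone who should receive this user's tweets."""
    return [row[0] for row in conn.execute(
        "SELECT follower_id FROM followers WHERE user_id = ?", (user_id,))]
```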
Traditional search involves inverted indices and batch index updates. While Twitter also makes use of inverted indices, it requires real-time index updates.
As per Twitter's whitepaper on search using Earlybird, Twitter's search requirements are:
As a result of 1 & 2, concurrent reads and writes are necessary: index structures must be constantly updated as documents are ingested.
Timestamps are extremely important for ordering search results on Twitter.
Twitter's requirements correspond to a distributed search engine, on top of which sits Earlybird, Twitter's real-time retrieval engine. The search system is briefly discussed below.
Low latency and high throughput query evaluation can be achieved using a distributed search system, where user queries are routed based on server load and/or geographical location.
To ensure robustness and high availability, the database machines are organized in a replicated, broker (load balancer) coordinated manner. The document collection is logically partitioned (sharded) into disjoint segments, and individual machines are responsible for serving the indexes that correspond to these document partitions.
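To illustrate the core idea only (this is not Earlybird's actual implementation), here is a toy in-memory inverted index that is updated as each tweet is ingested and returns newest-first results; the whole structure is an assumption for illustration.

```python
from collections import defaultdict

class ToyTweetIndex:
    """A toy inverted index: term -> list of tweet ids, appended at ingest time.

    Tweet ids are assumed to be monotonically increasing with time, so sorting
    matches by id descending gives newest-first results. This only illustrates
    real-time index updates; Earlybird itself is far more involved (segments,
    sharding, concurrency control, Lucene under the hood).
    """

    def __init__(self) -> None:
        self._postings: dict[str, list[int]] = defaultdict(list)

    def ingest(self, tweet_id: int, body: str) -> None:
        # Index the tweet as soon as it is published: one append per distinct term.
        for term in set(body.lower().split()):
            self._postings[term].append(tweet_id)

    def search(self, query: str, limit: int = 20) -> list[int]:
        # Intersect the posting lists of the query terms, newest results first.
        terms = query.lower().split()
        if not terms:
            return []
        candidate_sets = [set(self._postings.get(t, [])) for t in terms]
        matches = set.intersection(*candidate_sets)
        return sorted(matches, reverse=True)[:limit]
```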
Flow of search:
Filtering the documents by relevance is a 2-step process:
Twitter uses the open-source Lucene engine to achieve all this.
Earlybird processes a tweet as soon as it is published, independently of the fanout process. The following is how Earlybird works:
Notes:
In this article, we talked about how one can design Twitter. Starting with the requirements and characteristics, we identified the core functional requirements and the most common non-functional requirements: scalability and reliability. Keeping these in mind, we came up with high-level designs, identified the pain points, and provided solutions for them. We also justified the design decisions at each step.
Well then, until next time.