Facebook Schema and Performance
From the article “Facebook shares some secrets on making MySql scale”
“800 million users and handling more than 60 million queries per second” …”4 million row changes per second.”
and that was almost a two years ago. Think what it’s like now!
Ever wonder why Facebook limits your friends to 5000? Does Facebook want to stop people from using it to promote themselves?
Ever see this message “There are no more posts to show right now” on Facebook?
I got this message when scrolled back in “friends” status updates. After scrolling back a few days I hit this message. The message is strange since thousands of more status updates that Facebook could have shown me. Why did I run into this message?
Is Facebook just being capricious or dictatorial in how it is used? I don’t know but I think the more likely answer is much more mundane and possibly quite interesting. The reason may be just simple technical limitations.
How could would/should/could status updates be stored on Facebook?
The first thing that comes to mind is something like these tables in a relational database:
friend’s id (contact id or c_id)
ID of the account making the update
date of status update
So if email@example.com logs onto Facebook, then Facebook needs to go and get the status updates of her friends/contacts. First step is to get a list of friends and second step is to get a list of updates from those friends. In SQL this might look like:
Select id, status From updates where id in (select c_id from contacts where id=2) order by date
As the number of friends and status updates increases, then this query is going to take longer and longer. Maybe this is the reason why Facebook limits the number of friends and the history. How can the response time for the retreval of updates of friends be kept at constant time ?
First, the home page only has to show, at least initially, something like 20 updates. The above query can be wrapped with a top 20 s0mething like
select * from ( Select id,status From updates where id in (select c_id from contacts where id=2) order by date) where rownum < 20;
But really, that’s not going to do much good because the query still has to create the result set before sorting it by date then limiting the output to 20 rows. You could add a date limiter on the updates:
select * from ( Select id,status From updates where id in (select c_id from contacts where id=2) and date <= current_date - 2_days order by date) where rownum < 20;
Seems facebook has a limit on the number of days returned and the number of friends, but there isn’t AFAIK, a limit on the number of updates that friends can do, so as they do more updates, the query takes longer and longer.
What kind of other design could be used? To speed up the query data could be denormalized a lot or a little. For a small change in the data, the date could be added to the list of friends meaning we can limit updates by the date field in friends instead of all the updates themselves as in:
Select status From updates where id in ( select c_id from (select c_id from contacts where id=2 order by date) where rownum < 20 ) order by date
Instead of having to select status updates from all the friends, the query just selects the 20 (or less) friends who have had the most recent updates.
Or one could go a step farther such that when you post a status update, a row gets inserted for each of your friends, such that every friend has your update associeted with them and then all that has to be done is select the top 20 updates from that list. No joining. And if indexed, then the rows returned can be precisely limited to those 20 rows. On the other hand this creates an enormous amount of insert data and data redundancy. Maybe have two tables, 1 status updates with a unique id and 2 a table with all friends updates. The second table would have every user and for each user a line that contains the status update ids of all their friends and a timestamp. So if I wanted status updates for my friends, I just get the last 20 status update ids from this table for me and then get the actual content for 20 status updates. Still this keeps a lot of unnecessary information. On the other hand I don’t need to keep the data for that long – maybe the last couple days and beyond that the system could fall back to some of the join query above.
What other kinds of optimizations could they do ? What would the pros be of a other methods? What are the cons?
This has already been solved a number of times at a number of places. I haven’t been involved in any nor am I involved in any of these architectural questions right now, but it’s interesting to think about.
Why does Facebook want to know who your close friends are? Is it because they care or because it helps prioritize what status up dates to denormalize? Why do the limit friends to 5000? Is it because they really care or is scaling issue?
Facebook lamp stack
how does Facebook do it
dealing with stale data