top of page
Writer's picturekyle Hailey

Facebook Schema and Performance


 

“800 million users and handling more than 60 million queries per second” …”4 million row changes per second.”

and that was almost a two years ago. Think what it’s like now!

Ever wonder why Facebook limits your friends to 5000?  Does Facebook want to stop people from using it to promote themselves?

Ever see this message “There are no more posts to show right now” on Facebook?


Notice it says “There are no more posts to show right now.”

I got this message when scrolled back in “friends” status updates. After scrolling back a few days I hit this message. The message is strange since  thousands of more status updates that Facebook could have shown me. Why did I run into this message?

Is Facebook just being capricious or dictatorial in how it is used? I don’t know but I think the more likely answer is much more mundane and possibly quite interesting. The reason may be just simple technical limitations.

How could would/should/could status updates be stored on Facebook?

The first thing that comes to mind is something like these tables in a relational database:

In the above design there are 3 tables

  1. Facebook accounts

  2. ID

  3. email

  4. other fields

  5. Friends

  6. user id

  7. friend’s id (contact id or c_id)

  8. Status Updates

  9. ID of the account making the update

  10. status update

  11. date of status update

So if sue@y.com logs onto Facebook, then Facebook needs to go and get the status updates of her friends/contacts. First step is to get a list of friends and second step is to get a list of updates from those friends. In SQL this might look like:

    Select  id, status
    From updates
    where id in (select c_id from contacts where id=2)
    order by date

As the number of friends and status updates increases, then this query is going to take longer and longer. Maybe this is the reason why Facebook limits the number of friends and the history.  How can the response time for  the retreval of updates of friends be kept at constant time ?

First, the home page only has to show, at least initially, something like 20 updates. The above query can be wrapped with a top 20 s0mething like

   select * from (
      Select  id,status
      From updates
      where id in (select c_id from contacts where id=2)
      order by date)
   where rownum < 20;

But really, that’s not going to do much good because the query still has to create the result set before sorting it by date then limiting the output to 20 rows. You could add a date limiter on the updates:

   select * from (
      Select  id,status
      From updates
      where id in (select c_id from contacts where id=2) and
      date <= current_date - 2_days
      order by date)
   where rownum < 20;

Seems facebook has a limit on the number of days returned and the number of friends, but there isn’t AFAIK, a limit on the number of updates that friends can do, so as they do more updates, the query takes longer and longer.

What kind of other design could be used? To speed up the query data could be denormalized a lot or a little. For a small change in the data, the date could be added to the list of friends meaning we can limit updates by the date field in  friends instead of all the updates themselves  as in:

Now the query becomes something like

   Select  status
   From updates
   where id in  (  select c_id from
                    (select c_id from contacts where id=2  order by date)
               where rownum < 20 )
   order by date

Instead of having to select status updates from all the friends, the query just selects the 20 (or less) friends who have had the most recent updates.

Or one could go a step farther such that when you post a status update,  a row gets inserted for each of your friends,  such that every friend has your update associeted with them and then all that has to be done is select the top 20 updates from that list. No joining. And if  indexed, then the rows returned can be precisely limited to those 20 rows. On the other hand this creates an enormous amount of insert data and data redundancy. Maybe have two tables, 1 status updates with a unique id and 2  a table with all friends updates. The second table would have every user and for each user a line that contains the status update ids of all their friends and a timestamp.    So if I wanted status updates for my friends, I just get the last 20 status update ids from this table for me and then get the actual content for 20 status updates. Still this keeps a lot of unnecessary information. On the other hand I don’t need to keep the data for that long – maybe the last couple days and beyond that the system could fall back to some of the join query above.

What other kinds of optimizations could they do ?  What would the pros be of a other methods? What are the cons?

This has already been solved a number of times at a number of places.  I haven’t been involved in any nor am I involved in any of these architectural questions right now, but it’s interesting to think about.

Why does Facebook want to know who your close friends are? Is it because they care or because it helps prioritize what status up dates to denormalize? Why do the limit friends  to 5000? Is it because they really care or is scaling issue?

Related Reading:

Twitter

id generation

Facebook schema

Facebook lamp stack

how does Facebook do it

ebay

high scalability

scaling

Flickr

Myspace

dealing with stale data

Facebook schema

1 view0 comments

Commentaires


bottom of page