How Twitter (and Facebook) solve problems partially

Exp growth data

Status updates and realtime search are very hard problems, especially with tons of data and million/billion of updates. If you think about status updates then they grow exponentially with a very small factor as the social graph grows. Exponential growth is a _big_ problem and would require tons of hardware to scale. But luckily there is a partial solution for this.

I will tell you a little secret on how to solve the problem with exponential growth. The secret is simple: you only solve 80% of the problem! Now that you know this secret, then look at how Twizzle and Facebook has implemented their status updates... If you are lazy then I can tell you: they keep their status update list fixed. So what does this mean? This means that for every user they have a fixed list of latest updates, this list follows First In, First Out principle.

Want a proof of this? Check out how many pages you can go back in your tweet feed:

The problem domain gets _a lot_ simpler with fixed lists and you can do some crazy optimizations, since the data footprint is very small.

Twizzle's realtime search

Dali time

Realtime search is also a big problem. One can read more about the challenges in this blog post:

How does Twitter (and Facebook) solve this problem? By limiting their data. Twitter can only search 2 months back in time! Want a proof?:

Right now Plurk solves this without cheating. Twizzle (and Facebook) solve this by solving the problem partially by reducing the data they search in.

Are these issues?

It depends. It's clear that the focus of Twitter search is to have recent results, so if you are looking for recent stuff, then you don't have an issue. It's clear that Twitter has again solved the problem partially by focusing on following:

  • at least 80% of Twitter searches are interested in recent stuff
  • 20% of Twitter searches might be searches that are interested in older stuff

By not solving the last 20% they simplify their problem a lot, but of course, they haven't solved the problem completely.

So what can you learn from this? Sometimes you don't have to solve the problem 100% to succeed... You can solve it for 80% of the use cases and still be successful.

Code · Design · Plurk · Tips 22. Aug 2009
12 comments so far

I think it makes a lot of sense.

I think there can be irony in the fact that compared to larger search engines, smaller startups that don't have the resources can definitely give the big boys a run for their money because of the real-time, and possibly entirely in-memory design decisions.

let's say you have a backend that scales comfortably at millions of rows of persistent data from any point of time Versus a cyclic/ring buffer that just has data for the past 1 hour / day / week. or the index's could be persistent, while the inverted-index is built from data only from the past 1 hour/day/week for the topic election. The possibility that a search query/ topic would not be in that finite set is high, but if it does make a hit in this finite set - the relevence and possible real-time nature would i reckon outplay years of content.

eg: deciding to rather show 1 picture/updates from 10 friends, rather than 10 pictures/updates from 1 busy friend. or to show the most recent activity on elections ( recent is gifted since data freshness could expire after certain timeouts / rowsize / space)

infact we've been putting it to great use within our in-memory caches/data structures at 'hover.in'. what started as administrative tasks such as which are the 100 most recent global hits, or recent crawled pages from each user, or globally saved items. These are infact completely in-memory datastructures that served the purpose so well from an internal point of view - there was no reason to stop using similiar in-memory cyclic/ring buffers for live/user-facing trends,etc. other user-facing fun cyclic buffers could be - 100 most recent hot topics, most active users , most active items, etc after doing some number crunching.

I've fallen in love so much with the idea, that I've also been experimenting these with more persistent storage & really questioning why our crawler index should'nt be based on similiar ring buffers as well. it reduces the time to build index's provided you accomodate 2-4x the required sample-set. it helps me save space in times like this - makes it easier to optimize the backend eg: if i plan to store a million most recent rows, fix the bnum at 2-4M, and then plan on tweaking at that configuration. I've been tinkering with tokyocabinet/tyrants capnum and capsize flags as well.

I'm just getting done with a few benchmarking tests on the various flags, and fixed length tests on tokyocabinet. but i can vouch for the whole theory of how in-memory datastructures of fixed-length datasets may not be a substiture for critical metrics, but how it makes sense beyond time & space complexity and the inherent real-time frills :)

Keep Clicking,
Bhasker V Kode,
Co-founder & CTO , hover.in

Hi Amir, very nice write up, I can see that doing large scale applications are putting a bit pragmatism into you :)

In many cases the 100% solution, even if feasible, is usually an effect of our personal search for perfection and a lot less the answer to a real problem. I rarely see problems that requires the complete solution to "work", even in physics we only take it as far as it's practical, even more so out in the "real" world there in the world where time and money have a tighter relationship.

I would like to hear how you are doing, if you have the time feel free to write me a mail, its been way too long. :|

Bhasker:
Thanks for the comment. I mostly tinker with bnum and fpow, since we use the hash database - - and there is a lot to gain into tinkering with them, especially selecting a good fpow value. Do you run in memory only or do you run a memory master and one persistent?

Morten:
I have always been rather pragmatic :), but I find it quite funny that companies with shit loads of cash and people can't solve these problems properly. Anyway, sure let's talk, you may also want to register a Plurk account, that's where I usually hang out.

Social networking sites, such as Twitter and Facebook make it essentially easier for webmasters to communicate with their audience with a click of a button. It is also a great tool to make money online and invoke affiliate sales. That is why many web site owners and blog owners incorporate it besides just Feedburner.

Yousef Issa:
Yeah, I totaly agree with you!

I think your point is very valid, but I think you are characterizing Pareto. Perhaps more apt would be how Amdahl's law usually gets summarized as, "Make the frequent case fast!"

What can I say, there is nothing new under the sun or rather what was once old is now new again.

Oops... I meant that I think you slightly mischaracterized Pareto. :) I should read my comments more carefully before posting.

James:
I think you are right, but I also think that these solutions are inspired by Pareto principle, but I don't know if it's enough to say they only solve 20% of the problem domain. Anyhow, to not confuse too much I have removed this connection to the Pareto principle.

I like to think of an old saying I learned long ago:
It takes 20% of the time to do 80% of the work, and 80% of the time to do the last 20%.

I have such a problem with facebook. They said they sent me an email with a CODE needed to login. However, I never got it and it is impossible to go in without it. Why won't they respond? I just want my CODE!!!

Hi Amix,

Nice writeup. I never knew this. However, for kicks, I tried Ashton Kutch's Twitter account, since he has 3million + "followers".

How do you explain this ?

http://twitter.com/APlusK?max_...

I can go all the way upto page 160 in his history, not the 40 page limit you are mentioning...

Post a comment
Commenting on this post has expired.
© 2000-2009 amix. Powered by Skeletonz.