Fast polling using C, memcached, nginx and libevent

Comet

In this post I'll show you how to implement really fast polling using C and libevent, memcached and nginx. The performance of the server is over 2400 request pr. second on a not optimized Mac Book - that's 144.000 requests pr. minute.

At Plurk we use polling and we have thousands of live users hammering the service with poll requests. It's beginning to be pretty expensive so I set a goal to optimize it. We currently use this approach in production and it uses around 2% of CPU and _very_ little memory.

Choosing the stack

I could have chosen different stacks, but I chose following components:

  • C and libevent: C is as low as you get (if you don't want to code in assembler) and libevent implements asynchronous IO for C. libevent also features a pretty nifty HTTP library for building scaleable HTTP servers. libevent is used in memcached.
  • memached: Proven by tons of startups like Facebook. memcached is lightweight, scaleable and extremely optimized.
  • nginx: Russian engineering when it's best. Enough said.

All these technologies are non-blocking and have proven their performance and scalability.

The architecture of fast polling

This is the architecture of the fast polling:

Fast polling architecture

The thing to note is that we will only hit the Python cores if the cache is empty. This means that we populate the cache with an empty value when we don't have any new stuff for the client.

Hardest part, configuration of nginx

The polling has to support POST and GET requests and it seems that nginx favors GET requests. A problem I encountered was following:

  • nginx's error_page handler strips the POST data that is sent with the POST request...

This was a major problem and it took a bit hacking and reading of nginx's mailing list to figure out a solution. After some time and experimenting I found it:

location /Poll/ {
    proxy_pass http://localhost:8080/;
    error_page 501 404 502 = @fallback;
}

location @fallback {
    proxy_pass http://localhost:14002;
}

This means that we will first proxy to http://localhost:8080/, if this fails, we will use @fallback. In greater detail:

  • on cache hit, we simply proxy redirect to the poll server and return the result
  • on cache miss, we proxy to the Python core

Problem solved!

I didn't use nginx'es memcached module since it only supports one memcached server and only GET requests.

Benchmark

Before I show you the C code, I want to show you an Apache ab benchmark (around 2400 request pr. second on a Mac Book):

amixs-macbook:~ amix$ ab -c 50 -n 5000 http://127.0.0.1/Poll/getMessages
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests


Server Software:        nginx/0.7.44
Server Hostname:        127.0.0.1
Server Port:            80

Document Path:          /Poll/getMessages
Document Length:        5 bytes

Concurrency Level:      50
Time taken for tests:   2.068 seconds
Complete requests:      5000
Failed requests:        0
Write errors:           0
Total transferred:      735000 bytes
HTML transferred:       25000 bytes
Requests per second:    2417.43 [#/sec] (mean)
Time per request:       20.683 [ms] (mean)
Time per request:       0.414 [ms] (mean, across all concurrent requests)
Transfer rate:          347.03 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.6      0       7
Processing:     5   20   3.2     20      39
Waiting:        4   20   3.1     19      39
Total:          6   21   3.0     20      39

Percentage of the requests served within a certain time (ms)
  50%     20
  66%     21
  75%     21
  80%     22
  90%     25
  95%     26
  98%     29
  99%     32
 100%     39 (longest request)

Why not use Comet?

Comet is basically server pushing out updates. Comet may be the new buzz word, but I still think polling is easiest to scale and implement. From what I have seen, all comet solutions are very hacky, quirky and very hard to scale. The scaling is especially hard since the whole web-platform is built around the request-response model. This may change in the future, but until then I think polling is the safest way to go right now.

The C code

This is some of my first C code, so please report if you spot any problems or have suggestions. I actually enjoyed hacking this C program together!

#include <stdio.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/queue.h>
#include <string.h>
#include <stdlib.h>
#include <err.h>
#include <event.h>
#include <evhttp.h>
#include <libmemcached/memcached.h>


/*
 * Global 
 */
typedef struct {
    char* host;
    int port;
} MServer;

int NUM_OF_SERVERS = 0;
MServer memcache_servers[4];
memcached_st *tcp_client;


/*
 * Memcached info
 */
void add_mserver(char* host, int port) {
    memcache_servers[NUM_OF_SERVERS].host = host;
    memcache_servers[NUM_OF_SERVERS].port = port;
    NUM_OF_SERVERS++;
}

void init_memcache_servers() {
    add_mserver("localhost", 11211);
    add_mserver("192.168.0.51", 11211);

    //Add to memcached client
    tcp_client = memcached_create(NULL);

    int i;
    for(i=0; i < NUM_OF_SERVERS; i++) {
        memcached_server_add(tcp_client, 
                             memcache_servers[i].host,
                             memcache_servers[i].port); 
    }
}


/*
 * Takes an `uri` and strips the query arguments.
 */
char* parse_path(const char* uri) {
	char c, *ret;
	int i, j, in_query = 0;

	ret = malloc(strlen(uri) + 1);

	for (i = j = 0; uri[i] != '\0'; i++) {
		c = uri[i];
		if (c == '?') {
            break;
        }
        else {
		    ret[j++] = c;
        }
    }
    ret[j] = '\0';

    return ret;
}


/*
 * Checks if the path is in memcached, if not
 * the connection gets dropped without an answer
 * this is done so a proxy redirection 
 * can be dropped in nginx.
 */
void memcache_handler(struct evhttp_request *req, void *arg) {
    struct evbuffer *buf;
    buf = evbuffer_new();

    if (buf == NULL) {
        err(1, "failed to create response buffer");
    }

    char key[200];
    
    strcpy(key, "/Poll");

    char *request_uri = parse_path(req->uri);

    //Ensure no buffer overflows
    if(strlen(request_uri) > 125) {
        evhttp_connection_free(req->evcon);
    }
    else {
        strcat(key, request_uri);

        /* Fetch from memcached */
        memcached_return rc;

        char *cached;
        size_t string_length;
        uint32_t flags;

        cached = memcached_get(tcp_client, key, strlen(key),
                               &string_length, &flags, &rc);

        /* Return a result to the client */
        if(cached) {
            evbuffer_add_printf(buf, "%s", cached);
            evhttp_send_reply(req, HTTP_OK, "OK", buf);
            free(cached);
        }
        else {
            evhttp_connection_free(req->evcon);
        }
    }

    free(request_uri);
    evbuffer_free(buf);
}


int main(int argc, char **argv) {
    init_memcache_servers();

    struct evhttp *httpd;

    event_init();
    httpd = evhttp_start(argv[1], atoi(argv[2]));

    evhttp_set_gencb(httpd, memcache_handler, NULL);

    event_dispatch();

    evhttp_free(httpd);
    memcached_free(tcp_client);

    return 0;
}

I was a bit lazy to create a config reader, mostly because in Python I would have written open("server.config").readlines(), in C it seems a bit more cumbersome :-p

As always, happy hacking and I hope you enjoyed this post!

AJAX and comet · Benchmarks · Code · Python · Tips 24. Mar 2009
36 comments so far

Why limiting to 4 (overflow?)
MServer memcache_servers[4];

Pretty cool, the more I find out about the tech at Plurk the more I start to admire you guys, especially compared to the crap that twitter was running.

Here's a quick code review of your C code for security/safety reasons:

add_mserver -- Should have asserts protecting against NULL on the host param and that NUM_OF_SERVERS isn't out of range.

init_memcache_servers -- You don't validate the return of memcached_create.

parse_path -- You don't validate the return of malloc. Using strlen is a buffer overflow. Pass in the length of the string or use a lib like bstring. For loop uses \0 instead of a predetermined length, which you already have from strlen. This function is externally accessible so these are real problems.

memcache_handler -- Arbitrary fixed size for key. Using strcpy rather than a string with a predefined size. Ironically, your strlen checking for no buffer overflows has a buffer overflow. Use strnlen, actually just avoid all str functions that also don't have a length param. You don't check the return value of parse_path. You pass in key to
memcached_get but the str functions that make key are notorious for not honoring the \0. You should know the size of key after building and add your own \0 to double check it.

main -- You don't check return values of any function in here.

Finally, run this under valgrind and assume valgrind is always right. It sometimes isn't, but it's usually more right than a programmer so it's your job to prove why valgrind is wrong. Usually if you tricked valgrind then your code is too tricky. Quit doing that.

Cool, hope that helps clean it up. Feel free to shoot me edits for feedback.

In your path parsing bits, you could make use of standard functions like strchr() etc. to avoid over allocating the string.

Something like:

char* parse_path(const char *uri) {
  char *ret;
  int len;
  char *qmark = strchr(uri, '?');
  if (qmark) {
    len = qmark - uri;
  } else {
    len = strlen(uri);
  }
  ret = malloc(len + 1);
  memcopy(ret, uri, len);
  ret[len] = '\0';
  return ret;
}

Not a huge difference overall as your strings are short and short lived, but it's a bit more readable aside from the little bit of pointer math.

Is there a reason why you haven't used memcached support from nginx?

http://weichhold.com/2008/09/1...

Comet isn't especially hard, especially once you know how to use libevent.

Can you share any numbers? 2% cpu on a single server? How often a single web page polls for updates? (ie: what is the worse-scenario latency)

@Marek Majkowski:


I didn't use nginx'es memcached module since it only supports one memcached server and only GET requests.

You need http://www.reversehttp.net/

ret = malloc(strlen(uri) + 1);

=>

ret = strdup(uri);


... or you might consider modifying it in-place since you're not using req->uri elsewhere...

This could just be me, but why would you do this over a proxy setup with Varnish, Squid og Apache with mpm_event and mod_cache? Seems a little like you've reinvented the wheel.

Also, as a previous commenter has suggested, why did you not just use the memcached support in nginx, and if it was because of the lack of multiple memcache servers, why not patch it and send this patch back to the community?

Or I should perhaps read what the code is before suggesting something like that :-) Disregard (or delete if you're lenient) my comment

Zed Shaw:
Thanks for the great review. I'll update the code and re-publish it.

Brynjar:
It's arbitrary. I'll probably create a config reader in the next iteration and remove this limit.

Charlie:
Thanks for the suggestion.

Marek:
Every live user polls after updates every couple of seconds, so it's thousands of polling requests pr. second.

Mads Sülau Jørgensen:
There are two reasons why I didn't patch nginx with this support:

  • Separation of concerns: Running a separate polling server architecture I can monitor it and scale it independently
  • Patching nginx would not have been that easy

I prefer using memcached than Varnish or other caching options and memcached gives me a lot of options (deletion of multiple keys at once, consistent hashing and easy and proven scaling to terabytes of cache - - just to name a few).

Ah, but you could, easily, monitor a running nginx instance. The same goes for a running varnish instance. I'm not necessarily talking about running the nginx memcached module in the same process or the same server as your main proxy is running. So you'd archive the seperation by running another process or running the process on another box. Easy.

And the memcached module of nginx does not seem /that/ difficult to work with. Sure you'd have to do alot of handling of dead servers, and timeouts that could block the whole server. The current one probably handles this already. I don't use it, so I'm not one to say.

With your keys structured correctly, you can flush (purge) multiple keys easily in varnish. You can provide your own hashing function as well. You could also use nginx's hash balancer to send the request to the "correct" backend server.

The thing about hacking some server together with libevent in a language, that might not be onces main language is, that you'll end up missing a lot of things, and it may end up exploding in your face. From the top of my head, what happens if a memcached server dies, memory leaks, what happens if the process segfaults (see memory leak).

It might be super-duper fast, but is it stable :)

Mads Sülau Jørgensen:

plurk@web04:~$ ps aux | grep poll_server
plurk     29835  2.3  0.0   2376  1048 ?        
S    Mar24  27:23 /home/plurk/poll_server/poll_server 0.0.0.0 16000

It's pretty stable (has not crashed since I started it), uses very little CPU and very little memory. If I stop the server, the nginx server will just route everything to Python servers. I doubt Varnish or patching Nginx would have been faster or more stable - - and they would have added a big conceptual complexity.

Personally, I would much rather fix errors in a 100 line program than a thousands line program. It's separation of concerns and separation of complexity. So sure, I could have extended nginx+memcached or hacked Varnish - - but I chose to keep it simple, learn some new things and solve the problem in a very specialized way.

Amix, you should listen to these guys, they're right. They're not criticizing your work, but giving you high level advice. Your system might seem stable now but unfortunately that's not necessarily an indicator of its lack of defects or security.

By using an off the shelf battle tested solution like a varnish fronted HTTP cache through nginx, you're standing on the shoulders of giants. For this particular problem, it seems that this would work just fine.

One of the things I've learned over the years is that you should actually enjoy being a position when you can throw out code. One less thing to worry about, one less thing to go wrong. Another thing is that you can never be sure that there no bugs lurking in your polling server -- Zed outlined a few things and of course there are probably more. Eliminating this risk and shifting the burden onto more trustworthy components is not just the smart thing to do, but your obligation as an engineer it will result in better stability and security.

There is no hacking in the varnish solution what so ever. Varnish is a http proxy cache, it would be doing what is was designed to do. And my guess is, that it would be doing it well. My guess is also, that you'd end up using it alot, and to do more that using it with this one goal.

With a specialized solution, you have to handle a new process, you have to handle the errors that process can give, and you have to keep it running.

Using the tools already available makes more sense to me, and that's what I'm doing when I run into a problem. Others, more proficient programmers usually have had the same problem before me.

Hi Amix,

Regarding memcached and Facebook: they're not actually using memcached as is. As far as I know, they've modified it to run under UDP.

Ed, facebook contributed a lot of code back to memcached. They are responsible for the threading in memcached, and possibly the udp sockets as well. So they are in deed using a stock memcached.

Hey,

Very nice. I really like libevent. What about funky path like /../../etc/passwd and ~/.ssh/authorized_keys. You may already be sure you get clean uri's but you can never be too safe.

Martin

Mads Sülau Jørgensen:

You're wrong ;-) Check out their work at Github:

http://github.com/fbmarc/faceb...

They are certainly not using stock, they have a heavily modified version. They even removed a few files compared to stock.

I know memcached is fast, but this:

cached = memcached_get(tcp_client, key, strlen(key),
&string_length, &flags, &rc);

still seems like a costly operation to do inside a nonblocking
event loop.

Chris, So I am :)

I actually thought they'd gotten most of their changes comitted back into the main memcached. Seems I was a wee bit off ;)

Mads Sülau Jørgensen:
Varnish is a good HTTP proxy and cache, no doubt about that, but it lacks a lot of things to be used in my setup. For example, how can I invalidate 1000 keys at once and do it in a split of a second? Their purging capabilities seem to be pretty primitive...

Plus, who said that I won't iterate over this and go into a new direction - for example, benchmark this against a long-polling "comet" solution. Going into this direction would mean dropping Varnish - while I only need to change and add some lines of code to create a long polling solution of this.

Great stuff!

When talking about speed and async architectures I always find the Talkinator (Java, from the Mailinator guy) amazing.

"In my benchmarks now, on my quad-core desktop the talkinator server can push about 39000 messages per second."

Stephan
http://twitter.com/codemonkeyi...

Well, with a properly designed URL schema, you'd do something like this:

PURGE /poll/user/(?:mads|john|abc*)+/status

And then you'd have the VCL to handle that purge via purge_url(). The above will then purge the urls matching the regular expression, (so mads or john or starting with abc).

It actually works both fast, and well. It does not purge the cache when you tell it, but when you request an object. That makes invalidation _fast_.

See http://varnish.projects.linpro... for more information.

You can more or less configure varnish to do what you want it to do via it's configuration file.

As for the comet solution, your post more or less said that. Right about where it said something about push being hard to scale, which strikes me as a very odd comment. With push, you just have to be able to handle a horde of mostly idle connections, for which http://code.google.com/p/erlyc... or something like that, would probably be the way to go.

While Zed Shaw provided some very valuable feedback on your C code, I'll just nitpick here a bit:

If you know beforehand that request URIs aren't going to be longer than a certain amount, use a fixed size char array rather than mallocing/freeing on every request. Malloc/Free is *relatively* expensive and if you're handling that many requests per second and doing a malloc/free on each one, it will add up.

So, instead, define a max length constant and use that with your request URI buffer. No freeing necessary.

Mads Sülau Jørgensen:
I am pretty tired of keep repeating myself. So this will be the last comment I answer you.

Varnish is a good cache, but it's not really suited where I am headed. An example, how would you change the Varnish solution to support long polling? I can explore this option pretty easily with a basic understanding of how to build a custom server that can perform really good and that can be customized for my needs. I personally want to learn how to build things myself (also things that are complex and that require some thinking and engineering). I know I'll need this skill longer down the road and practice makes a master - so the more practice I have, the better I'll be at doing this.

Mitchell:
Thanks, great tip.

Stephan Schmidt:
I am unsure if Talkinator is using non-blocking IO, because of this:

he also did a great post on the architecture of Talkinator:

Both of these are very interesting and recommended reads.

I am unsure if this applies to C thought, especially since libevent is such an amazing library to work with.

Looks very nice. I am going to try this out on my Django site and see how it scales up.

"The performance of the server is over 2400 request pr. second on a not optimized Mac Book." - That is really brilliant!

I have a question, your example looks like you have say 1000 people looking for an update that is common to all. What if I have 1000 users looking for individual updates? This would fill up the cache with 1000 users data right? Will it scale as well?

I'd like to see your ab results with the -k flag. (Keep-alive)

1) Run ab test with more requests. Your test was only 2 sec long. It is not representative.
2) In cache-benchmark-test you should stuck in your network, which is not your case with your only 347.03 [Kbytes/sec], so keep optimising.
3) Out apache-memcache cache is about 12000 req\sec on one machine and with bigger file than yours (approx. 100 KiB).

Very interesting article for those of us working with nginx at scale. Thanks for taking the time to write it.

Please excuse me for a very beginner question:

How does the C code get invoked by Nginx?
Is it by FastCGI? or Is the code itself a Nginx module?

Thanks in advance.

btw, any good Nginx reference site u guys recommend?

@Frank

http://wiki.nginx.org/Main

it is not invoked by nginx. code creates little "http server" that listen to defined port. nginx just passes request to it and returns it's response.

Regarding nginx and multiple memcached's, there's a patch I've used in production for more than a year now:

http://openhack.ru/nginx-patch...

I've been told that's it's still meant to be rolled into memcached proper at some point, but not clear on the status of this.

Lars:
Great link, thanks!

Post a comment
Commenting on this post has expired.
© 2000-2009 amix. Powered by Skeletonz.