Consistent hashing: Improving distribution

Testing hash_ring's distribution of keys I found out that the distribution wasn't uniform. A big problem, since it would overload some nodes in the system.

I then took a look at how libketama implements hashing and found out that:

  • they add lots of virtual nodes. For example, adding 3 servers would add 360 points in the ring (120 pr. server). My implementation only replicated one node 4 times
  • they do binary operations to the md5 checksum which reduces the number of bits to around 10 (instead of around 40)

These two things messed up hash_ring's distribution. A simple test that tests random distribution of 100.000 words confirms this:

from hash_ring import HashRing

memcache_servers = ['192.168.0.246:11212',
                    '192.168.0.247:11212',
                    '192.168.0.249:11212']

ring = HashRing(memcache_servers)
iterations = 100000

def genCode(length=10):
    chars = string.letters + string.digits
    return ''.join([random.choice(chars) for i in range(length)])

def random_distribution():
    counts = {}
    for s in memcache_servers:
        counts[s] = 0

    for i in range(0, iterations):
        word = genCode(10)
        counts[ring.get_node(word)] += 1

    for k in counts:
        print '%s: %s' % (k, counts[k])

random_distribution()

Running it results in following distribution:

amixs$ python tests/random_distribution.py 
192.168.0.247:11212: 6868
192.168.0.246:11212: 20657
192.168.0.249:11212: 72475

Creating the ring with a lot more virtual points improves the distribution:

ring = HashRing(memcache_servers, replicas=120)

Re-running it yields:

amixs$ python tests/random_distribution.py 
192.168.0.247:11212: 35797
192.168.0.246:11212: 32258
192.168.0.249:11212: 31945

Much better. Performance gets worse as one of the lookups is O(n), where n is number of nodes in the ring. A fix thought is to use binary search from Python's bisect module, this eliminates the O(n) lookup and replaces it with a O(log(n)) lookup. On a 900 node ring the binary search improvement improves the speed from 40 seconds to only 5 seconds (testing again with 100.000 random words) - which is pretty nice speedup.

I have gone a bit longer and implemented the way libketama creates the ring. This makes hash_ring compatible with libketama, which is nice, since libketama is ported to Java, Erlang, Perl... This also means that one can assign node weights - this is pretty practical if you got a memcached server with 2GB of memory, while the others only have 1GB:

memcache_servers = ['192.168.0.246:11212',
                    '192.168.0.247:11212',
                    '192.168.0.249:11212']
weights = {
    '192.168.0.246:11212': 1,
    '192.168.0.247:11212': 2,
    '192.168.0.249:11212': 1
}

ring = HashRing(memcache_servers, weights)

Running the weighted ring yields following result:

amixs$ python tests/random_distribution.py 
192.168.0.247:11212: 50806
192.168.0.246:11212: 22244
192.168.0.249:11212: 26950

How adding/removing of servers affects distribution

The main reason we have consistent hashing is so removing / adding nodes does not re-assign all the keys, and this is crucial to test as well. I created another test that uses Peter Norvig's World's Longest Palindrome Sentence as the input data to test how the words get re-distributed as we add extra nodes to the system. Output of the test is following:

Distribution of server_sets_3::
192.168.0.247:11212: 4609
192.168.0.248:11212: 3961
192.168.0.246:11212: 3827

Distribution of server_sets_5::
192.168.0.247:11212: 2612
192.168.0.248:11212: 2472
192.168.0.250:11212: 2254
192.168.0.246:11212: 2525
192.168.0.249:11212: 2534

Testing how alike server_sets_3 are to the sets in server_sets_5::
192.168.0.247:11212: 4609 in init_set
192.168.0.247:11212: 2612 in new_set
192.168.0.247:11212: 2612 in both init_set and new_set

192.168.0.248:11212: 3961 in init_set
192.168.0.248:11212: 2472 in new_set
192.168.0.248:11212: 2472 in both init_set and new_set

192.168.0.246:11212: 3827 in init_set
192.168.0.246:11212: 2525 in new_set
192.168.0.246:11212: 2525 in both init_set and new_set

The distribution looks pretty good as we keep adding nodes! Testing code for this is following:

from hash_ring import HashRing

text = open('tests/palindromes.txt').read()
text = text.replace('\n', '').replace('a ', '').replace('an ', '')
palindromes = [t.strip() for t in text.split(',')]

#--- Helper functions ----------------------------------------------
def create_sets(servers):
    server_sets = {}
    for s in servers:
        server_sets[s] = set()

    ring = HashRing(servers)
    for word in palindromes:
        node = ring.get_node(word)
        server_sets[node].add(word)

    return server_sets

def print_distributions(name, server_sets):
    print '\nDistribution of %s::' % name
    for s in server_sets:
        print '%s: %s' % (s, len(server_sets[s]))

def print_set_info(servers_init, servers_new):
    for init_server in servers_init:
        init_set = servers_init[init_server]
        new_set = servers_new[init_server]

        print ''
        print '%s: %s in init_set' %\
                (init_server, len(init_set))
        print '%s: %s in new_set' %\
                (init_server, len(new_set))
        print '%s: %s in both init_set and new_set' %\
                (init_server, len(init_set.intersection(new_set)))

#--- Testing ----------------------------------------------
init_servers = ['192.168.0.246:11212',
                '192.168.0.247:11212',
                '192.168.0.248:11212']
server_sets_3 = create_sets(init_servers)

print_distributions('server_sets_3', server_sets_3)

extra_servers = ['192.168.0.246:11212',
                 '192.168.0.247:11212',
                 '192.168.0.248:11212',
                 '192.168.0.249:11212',
                 '192.168.0.250:11212']
server_sets_5 = create_sets(extra_servers)

print_distributions('server_sets_5', server_sets_5)

print_set_info(server_sets_3, server_sets_5)

The new version of hash_ring is out with these improvements.

Code · Python 23. Nov 2008
4 comments so far

A non-uniform distribution may lead to overload on some nodes, but a uniform distribution will only ensure an even load with a uniform demand for the content on each server.

The idea to distribute the data more evenly is probably still better than nothing as you reduce the randomness to the demand, which is probably quite even on sites like plurk. Problem is, that if the demand is not uniform enough, the solution wont really help you.

Have you looked into any solutions on that regard? I figure it would be possible to adapt to a less uniform but steady demand by having a dynamic weight based on load (enabled by the relatively cheap server add/removal), but its suddenly not so simple any more.

Spand:
You are right that one could end up with some overloaded nodes. Solving overloaded nodes is a simple problem thought for key-value stores as it's easy to add replication and even the load on these nodes. An example: there is recached which adds multi-master replication to memcached.

amix:
I guess that would solve it also :)

Hi,

When I do consistent hashing, I always use the "simple" method suggested in the very last paragraph of section 4 of the Karger e.a. paper:

"""
A simple approach to constructing a consistent hash function is
to assign random scores to buckets, independently for each item.
Sorting the scores defines a random permutation, and therefore has
the good properties proved in the this section. However, finding the
bucket an item belongs in requires computing all the scores. This
could be restrictivly slow for large bucket sets.
"""

In python:

def whichnode( nodes, item ):
hashes = [ ( hash((n,item)), n ) for n in nodes ]
return min(hashes)[1]

Now, the above implementation is very crude. A tight implementation should use a fast hash and calculate the minimum on the go. (You can calculate a hash for each node upon creation and a hash for the item outside the loop and just combine the stuff inside the loop.)

The remark that it may be "restrictively slow" feels like it is meant to scare people off this simpler implementation. For small sets (into the hundreds) it works like a charm.

Cheers,
Frank

Post a comment
Commenting on this post has expired.
© 2000-2009 amix. Powered by Skeletonz.