Hash performance improvements in Ruby 2.4

Some corrections and clarifications from this week’s talk on hashes in Ruby 2.4.

On converting hashval to index:

I gave an example of using modulus on a hashval (which might be huge) to an index in an array (like `152`), so it can be inserted in a table. Ruby does not actually do modulus for performance reasons; it actually takes some of the lower bits of the hashval instead (discarding the other bits). This is generally not a good idea, but apparently the alternatives are worse.

On collision resolution:

In MRI 2.4, collisions result in a secondary hash operation to find a new index, but I was incorrect when I said that Ruby did the equivalent of:

new_index = Hash(index)

This would result in all collisions continuing to collide until a free index was found. All else equal, this would almost certainly still be better than closed addressing, but we can do better. The algorithm is more accurately summarized as:

new_index = Hash(index + lower_five_bits(hashval))

which is much better than the behavior I described. When two keys collide (in bucket 152, for example), the next index to be generated via double hashing is likely to be distinct, because the hashvals of the colliding keys are probably distinct (after all, the conversion from hash to index involves discarding bits).

On guaranteeing that empty entries are found:

The mathematical theorem I was looking for was “Hull-Dobell theorem”, which guarantees a cycle linear congruential generator. Ruby uses this theorem to guarantee that during a collision the secondary hash function will eventually find an empty index, if it exists in the table.

On load factor:

In ruby trunk the load factor is currently 0.5, so if half of the table fills up, it should trigger a rehash.

Leave a Reply

Your email address will not be published. Required fields are marked *