Lately I’ve been listening to lectures from Prof. Erik Demaine and Prof. Charles Leiserson from their MIT Introduction to Algorithms course while I’m out taking a walk at night. These lectures are available as part of MIT’s OpenCourseWare project, which is opening content to ultimately all of MIT’s courses I think.
I know of Prof. Leiserson from the famous and fabulous textbook Introduction to Algorithms by Cormen, Leiserson, and Rivest, which is known colloquially as just CLR. I used this as a textbook for at least 4 classes in college and it’s still my first place to look on questions of algorithms, complexity theory, etc. The lectures from the course are every bit as good. A lot of it is review for me, but they all push my understanding further than it already was and get me thinking about things.
For example, I just recently listened to Lecture 7 on Hashing, which got me to wondering what hash function is used in java.util.HashMap and I did a little code investigation. If you look in there for where items are added the map, you’ll see that HashMap is applying it’s own supplemental hash function over the values that come out of hashCode(): [source:java]
/**
-
Applies a supplemental hash function to a given hashCode, which
-
defends against poor quality hash functions. This is critical
-
because HashMap uses power-of-two length hash tables, that
-
otherwise encounter collisions for hashCodes that do not differ
-
in lower bits. Note: Null keys always map to hash 0, thus index 0.
*/
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h »> 20) ^ (h »> 12);
return h ^ (h »> 7) ^ (h »> 4);
}
[/source]
Also of interest is the equivalent function in ConcurrentHashMap, which is different: [source:java]
/**
-
Applies a supplemental hash function to a given hashCode, which
-
defends against poor quality hash functions. This is critical
-
because ConcurrentHashMap uses power-of-two length hash tables,
-
that otherwise encounter collisions for hashCodes that do not
-
differ in lower or upper bits.
*/
private static int hash(int h) {
// Spread bits to regularize both segment and index locations,
// using variant of single-word Wang/Jenkins hash.
h += (h « 15) ^ 0xffffcd7d; h ^= (h »> 10);
h += (h « 3); h ^= (h »> 6);
h += (h « 2) + (h « 14); return h ^ (h »> 16);
}
[/source]
I would love to read more about how these functions were developed and why they differ, especially since I know it has improved over the years. Good hash functions are important as a hash table effectively turns from a map to a linked list in the worst case (all keys in the same bucket). There are also other considerations that come into play (with conflicting forces) such as the performance of hash calculation and the number of buckets. Listening to the lecture will give you a little more insight into why this is important and some of the challenges involved.
I did find some insight in this article from Dr. Heinz Kabutz of the well-known JavaSpecialists’ newsletter, particularly in relation to changes that occurred in JDK 1.4 to force hash tables to have 2^k buckets. And here’s a fun bug report that happened when all Doubles started hashing to the 0 bucket. It also mentions the source of the pre-1.4 hash function, which came from the Linux buffer cache.
Another interesting bug to look at describes some problems with the ConcurrentHashMap hash() function and talks a little about why it differs from the HashMap function – mainly that it wants not only to spread the keys evenly, but also spread them among the “segments” (the separately locked super-buckets) in the ConcurrentHashMap. Apparently this bug is fixed in Dolphin (aka Java 7) so my source references above (pointing to OpenJDK) are pointing to the fixed version.
If anyone in the know can point to discussion or talk more about the actual functions used above, please post a comment! I suspect the answers lie with people like Doug Lea and Josh Bloch….
FYI, you can also get to the lectures from the algorithms course as a podcast in iTunes.