Pure Danger Tech


clojure-conj: Parallel programming and fork-join

25 Oct 2010

I wasn’t sure what to expect from David Liebke’s talk on concurrency and parallelism in Clojure but I found myself leaning forward in my chair through the entire talk. I was unaware of much of the work going on to leverage fork-join and it blew my freaking mind.

First, David went through the existing pmap and it works in conjunction with lazy seqs to keep your processors humming. I refer you to the excellent pictures in David’s talk. My experience building presentations tells me this must have taken a ton of time but is really worthwhile from a learning perspective.

David also talked about the problem of handling uneven workloads and how that can lead to idle threads when using pmap. The solution is to chunk work, basically spreading the risk of a single big task holding up the line. The perf numbers for pmap show that as expected, on small tasks it is a bit slower than serial execution, on medium size tasks it’s a wash and on bigger tasks, it’s a big improvement. David’s numbers are hard to learn too much from as they are all on a 2 core machine, so at most you’d expect a 2x improvement. Really, these tests need to be run on a range of cores to see the real benefit.

That was cool stuff but the really interesting part of it to me was the new work on leveraging Doug Lea’s fork-join library which will be added in JDK 7. I describe fork-join a bit in my article about the Groovy concurrency lib Gpars here and there are more references at the end of that article. Fork-join applies a thread-pool to work on a network of fine-grained computational tasks that may have dependencies. In particular, it is ideal for divide-and-conquer style algorithms over arrays of data elements.

Doug Lea split the fork-join work into the core task processing / threading library and the parallel array work. The latter will not be included in JDK 7 but the latter won’t make it in until JDK 8 at the earliest. However, it is effectively being integrated directly in Rich’s work of fjvtree and the pvmap and pvreduce algorithms on top of fork-join. In other words, the ParallelArray ideas are effectively replaced by the fjvtree and pvmap/pvreduce functions.

The key thing here is that in Java, the ParallelArray implementation is an OO construct wrapped around the real data which must be held in a raw Java array so all indices can be accessed directly and independently. In Clojure it is implemented directly on the implementation of PersistentVector which already contains so much goodness in its persistence and immutability features. That means that you get all of the parallelizable benefits from the algorithms like pvmap and pvreduce while still working directly in the data structures already at the core of your existing code. I can’t underscore how important this is. Java, Scala, and Groovy all have ways to leverage the fork-join library but none of them (even Java!) taps into this power while still staying so integrated to the core data structures already pervasive in the language.

One key thing to know about the fork-join based solutions is that they are NOT suitable for lazy-style seqs or streams where pmap does quite well. You really want to throw all of the data at fork-join at once.

Perhaps the key choice available to you as a user of fork-join is when you stop dividing the problem (aka the task granularity). At some point, the overhead of dividing the problem up dominates the time it would take to actually do the work of N items. Choosing the right granularity is essential in maximizing the performance of fork-join. The fjvtree function leverages the existing chunks of 32 values already inherent inside the persistent vector implementation and basically hard-codes 32 as the granularity size.

Thus, the first question asked after the presentation was my first question as well: how do you influence the granularity? David and Rich have obviously thought about this issue and said the other obvious choice is to allow a granularity of 1 and rely on the caller to chunk the work such that each item of the sequence is one chunk. I think both of these choices are the right place to start (because they give you a good default aligned with the data structure and an escape hatch to get full control), but will not be sufficient in the long term.

Really, we will want and need the ability for the language to auto-tune granularity. I talked to Rich about this a little and he mentioned opportunities to start working on tasks, time them, and automatically determine an optimal chunk size at run-time. This is exactly the kind of optimization we’ll need long-term to make fork-join the killer solution for certain kinds of problems. From a performance perspective, Rich mentioned that he did this work in a couple hours just to give Doug Lea some feedback and he expects the constant-time factors to be dramatically better before this work becomes finalized.

This work also stands to benefit hugely from the work Rich has been doing on making primitive numeric access pervasive and uniform. Once we have true primitives everywhere, accessed via persistent vectors, and can operate on them with fork-join-based algorithms, we can leverage everything latent in the JVM. I suspect the next version of Clojure will have some truly jaw-dropping performance when this comes together (with the potential for even more as we learn to optimize key knobs like granularity). Having that cake while also eating it via our existing data structures and functions is imho what we need in a programming language.

This direction also ties in directly with what Guy Steele was talking about in his keynote at Strange Loop. His main point is that accumulator-style algorithms are inherently bad for parallelism and divide-and-conquer style algorithms provide the “wiggle room” for compilers and language implementations to optimize execution, exactly in choosing things like granularity.

To support these kinds of optimizations, we do need to be concerned about how to annotate and leverage key algebraic properties of our functions. David’s implementation of pvfilter over pvreduce highlighted the need for writing a reduce function that is associative to work well with fork-join. Associativity is one of the properties that Guy mentioned as being crucial, along with properties like commutativity, idempotence, identity, etc.

I suspect you’ll be hearing these ideas more and more in the coming years.