Pure Danger Tech


JavaOne: Cliff Click on a Scalable Non-Blocking Coding Style

09 May 2008

Cliff has been working for a while on developing highly concurrent data structures for use on the Azul hardware which supports 700+ hardware threads. We’re going through the transition right now from 1 to small numbers of cores. Cliff is trying to address the next order of magnitude.

A non-blocking algorithm means that stopping any particular thread does not prevent global progress. No thread locks a resource and then gets preempted or blocks in I/O, and there are no critical sections (synchronization). This is achieved using hardware CAS (compare-and-swap) support via the Atomic* classes. These are fast with no contention, almost as fast even under contention, and much, much faster than synchronization.
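To make the CAS idea concrete, here is a minimal sketch (my own illustration, not Cliff's code) of a non-blocking counter built on AtomicInteger. The key property: if the CAS fails, it is because some other thread's CAS succeeded, so the system as a whole always makes progress.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger count = new AtomicInteger(0);

    // Non-blocking increment: read, compute, CAS, retry on failure.
    // A failed CAS means some OTHER thread succeeded, so global
    // progress is guaranteed even if this thread stalls mid-loop.
    public int increment() {
        for (;;) {
            int current = count.get();
            int next = current + 1;
            if (count.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    public int get() { return count.get(); }

    public static void main(String[] args) throws InterruptedException {
        CasCounter c = new CasCounter();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) c.increment();
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(c.get()); // 4000
    }
}
```

No thread ever holds a lock here, so preempting any one thread in the middle of its retry loop cannot stall the others.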

Large shared-memory multiprocessor systems allow very fast concurrent reads but are still limited to the speed of a cache miss on writes. So we must avoid having all CPUs write to the same location. Even with a reader-writer lock, it is not possible to scale past the 50-100 CPU range.
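A standard way to keep all CPUs from writing the same location is to stripe the writes across an array of slots. Here is a minimal sketch (my illustration, not Cliff's code) using AtomicLongArray, where each thread hashes to its own stripe:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Striped counter sketch: each thread CASes its own slot, so writers
// rarely touch the same memory location. (A production version would
// also pad the slots apart to avoid false sharing on cache lines.)
public class StripedCounter {
    private static final int STRIPES = 16;
    private final AtomicLongArray slots = new AtomicLongArray(STRIPES);

    public void increment() {
        // Hash the current thread to a stripe; write contention only
        // occurs when two threads land on the same slot.
        int i = (int) (Thread.currentThread().getId() % STRIPES);
        slots.incrementAndGet(i);
    }

    // Reads sum all slots; concurrent reads scale well.
    public long sum() {
        long total = 0;
        for (int i = 0; i < STRIPES; i++) total += slots.get(i);
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        StripedCounter c = new StripedCounter();
        Thread[] ts = new Thread[8];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) c.increment();
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.sum()); // 8000
    }
}
```

The trade-off is that exact reads must sum every stripe, which is fine here because reads are cheap and writes are the bottleneck.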

Cliff has worked through 2.5 different data structures, developing non-blocking versions of them, and this talk is a first attempt to pull out a process for developing such data structures. He did a talk last year at JavaOne on his non-blocking hash table implementation, which has since been simplified a bit and posted on SourceForge. He has also now developed a lock-free bit vector and is in the process of working out the details of a non-blocking queue.

The basic style consists of having a resizeable array to hold the shared data. Different threads will generally write to different slots in the array the majority of the time, which avoids write contention. You then develop a finite state machine to model the state of the data, and the FSM must include resizing of the array. Any time data is changed, you use CAS instructions via the Atomic classes, which avoids ever locking. The only place you need a memory barrier is at one point during the array resize. The resize happens incrementally and never blocks everyone. The CAS reliance ensures that even when there is contention, SOME thread makes progress.
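As a toy illustration of the style (my sketch, not Cliff's actual code), here is an array of slots where each slot moves through a tiny EMPTY-to-value state machine using only CAS. The first writer wins the transition, losers observe the new state, and nobody blocks; the genuinely hard part, incremental resize, is omitted:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch of the FSM-over-an-array style: each slot is either null
// (EMPTY) or holds a value. State transitions happen only via CAS,
// so a stalled thread never blocks others; a failed CAS means some
// other thread made progress. (Incremental resize is omitted here.)
public class CasSlots {
    private final AtomicReferenceArray<String> slots;

    public CasSlots(int size) {
        slots = new AtomicReferenceArray<>(size);
    }

    // Attempt the transition EMPTY -> value.
    // Returns true if this thread won the race for the slot.
    public boolean claim(int i, String value) {
        return slots.compareAndSet(i, null, value);
    }

    public String get(int i) { return slots.get(i); }

    public static void main(String[] args) {
        CasSlots s = new CasSlots(4);
        System.out.println(s.claim(0, "first"));  // true: EMPTY -> "first"
        System.out.println(s.claim(0, "second")); // false: slot already taken
        System.out.println(s.get(0));             // first
    }
}
```

Because every transition is a single CAS, a reader can always interpret whatever state it observes; there is never a half-done update hidden behind a lock.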

I won’t even try to repeat the details of the FSMs and other detailed descriptions… :)

One interesting result is that with the non-blocking hashtable he saw linear scaling on 768 CPUs, maxing out at 1 billion reads/sec and 10 million updates/sec.

I asked a question about how this intersected with the JSR 166 work on fork-join and parallel ops. They definitely both seem to be attacking the same problem of lock contention with queue-based work processing, although fork-join seems more focused on the 8-32 core range and Cliff is focused on the 100+ range. It didn’t sound like they had talked much.

Someone else asked about when you might want to use one of these data structures, and it sounded like the main determinants were the # of cores and write contention. Cliff’s is slightly faster until you get out to around 32 cores, and then becomes much faster. Also, at as few as 4 cores, if you have 50% write contention (admittedly rare), he can beat ConcurrentHashMap.