Pure Danger Tech


JavaOne: Cliff Click on Data Races

08 May 2008

Cliff’s talk focused on what data races are, why they happen, and how to debug them. A data race occurs when at least two threads access the same data, at least one of them writes, and the Java memory model defines no ordering between those accesses (ordering comes from synchronization, volatiles, Locks, etc.). In that case, instruction reordering in either the HotSpot JIT or the hardware itself can give you surprising results.

In fixing data races, a key thing to remember is that the memory barrier (synchronization, etc.) must be present on all threads involved, readers as well as writers. If it is missing on any one of them, reorderings can still cause data races.
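
To make the "barrier on all threads" point concrete, here is a minimal sketch (mine, not from Cliff's slides). The class and field names are made up for illustration. The writer publishes through a volatile write and the reader spins on a volatile read of the same field, so both sides take part in the ordering.

```java
// Hypothetical example class; not from the talk.
final class BothSidesBarrier {
    private static int payload;            // plain field, published via 'ready'
    private static volatile boolean ready; // volatile on BOTH the write and the read side

    static int runOnce() {
        payload = 0;
        ready = false;
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (!ready) {       // volatile read: pairs with the volatile write below
                Thread.yield();
            }
            seen[0] = payload;     // guaranteed to observe 42
        });
        reader.start();
        payload = 42;              // ordinary write...
        ready = true;              // ...ordered before this volatile write
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(runOnce()); // prints 42
    }
}
```

If `ready` were a plain field, the writer's ordering alone would not help: the reader's loop could be hoisted into a single read and spin forever, or it could see `ready` without seeing the new `payload`.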

Cliff outlined the most common data races he sees:

  1. Double read with write in the middle:

     <pre>
     T1                      T2
     if (p != null)
                             p = null
         p.field
     </pre>

     This one is particularly nasty because in T1 (the left side) the JIT’ed code will usually do the read of p and the read of the field together, so most of the time it shows no issue. It only happens on infrequently used paths that happen to come under heavy contention. In other words, the NPE will show up just as the system comes under heavy load.

  2. Two writes with read in the middle:

     <pre>
     T1                                          T2
     size = size * 2;
     array = new[size]
     </pre>

  3. Double-checked locking – horse.kill().beat()
  4. Partial or no synchronization on data structures like HashMap – lots of subtle bugs if you believe you can use unsynchronized maps with a single writer and cleverly catch exceptions
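
For the first two patterns, one common way out (my sketch, not something shown in the talk) is to read the shared reference exactly once into a local, and to publish related state through a single reference rather than two separate writes. `RaceFixes`, `Node`, and the method names below are hypothetical, and the reader side of the second pattern is my assumption about what the racing thread does.

```java
import java.util.Arrays;

// Hypothetical example class; names are illustrative only.
final class RaceFixes {
    static final class Node {
        final int field;
        Node(int field) { this.field = field; }
    }

    // Pattern 1: another thread may set p to null at any time.
    private volatile Node p;

    void set(Node n) { p = n; }

    // Racy form would be: if (p != null) return p.field;  (two reads of p)
    int readField(int fallback) {
        Node local = p;            // single read of the shared reference
        return (local != null) ? local.field : fallback;
    }

    // Pattern 2: keep no separate 'size' field; the array carries its own
    // length, so one volatile write publishes both together.
    private volatile int[] array = new int[4];

    void grow() {                  // assumes a single writer thread
        int[] old = array;
        array = Arrays.copyOf(old, old.length * 2);
    }

    int capacity() { return array.length; }

    int last() {
        int[] snapshot = array;    // one read; length and contents stay consistent
        return snapshot[snapshot.length - 1];
    }
}
```

The local copy removes the window between the null check and the dereference. It does not make compound updates atomic, so concurrent writers would still need a lock or an atomic class.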

Cliff then went through tools for debugging data races:

  * Visual inspection – current state of the practice, very slow, doesn’t scale, need an expert in both concurrency AND in the domain
  * Printing / logging – make noise at each read/write of the shared variable (but be careful to avoid string construction). Use a per-thread ring buffer. When an error occurs, timestamp all buffers so you can reconstruct what happened. Usually the offending thread will have been messing with the data just before the error occurred. This works but is very invasive.
  * Tools – currently tools don’t scale into production. He did plug FindBugs as being able to detect many of the common data race problems (which matches my experience as well).
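
The per-thread ring buffer idea could look roughly like the sketch below. This is my own reconstruction under assumptions, not code from the talk: `RaceTrace`, the numeric `site` id, and the choice to stamp every entry with `System.nanoTime()` are all mine. The hot path stores only primitives, and strings are built at dump time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical tracing helper; names and layout are illustrative only.
final class RaceTrace {
    private static final int CAPACITY = 1024; // power of two for cheap wrap-around

    /** One buffer per thread; only the owning thread writes to it. */
    static final class Ring {
        final String threadName = Thread.currentThread().getName();
        final long[] times = new long[CAPACITY];
        final int[] sites = new int[CAPACITY];
        final long[] values = new long[CAPACITY];
        long count; // total events recorded by this thread

        void add(int site, long value) {
            int i = (int) (count & (CAPACITY - 1));
            times[i] = System.nanoTime();
            sites[i] = site;
            values[i] = value;
            count++;
        }
    }

    private static final class Event {
        final long time;
        final String text;
        Event(long time, String text) { this.time = time; this.text = text; }
    }

    private static final ConcurrentLinkedQueue<Ring> ALL = new ConcurrentLinkedQueue<>();
    private static final ThreadLocal<Ring> LOCAL = ThreadLocal.withInitial(() -> {
        Ring r = new Ring();
        ALL.add(r);
        return r;
    });

    /** Record a read or write of a shared variable; no string building here. */
    static void record(int site, long value) {
        LOCAL.get().add(site, value);
    }

    /**
     * Merge every thread's buffer into one timeline, oldest first.
     * Best effort: it reads other threads' buffers without synchronization,
     * so call it once the traced threads have stopped or been joined.
     */
    static List<String> dump() {
        List<Event> events = new ArrayList<>();
        for (Ring r : ALL) {
            long end = r.count;
            long start = Math.max(0, end - CAPACITY);
            for (long k = start; k < end; k++) {
                int i = (int) (k & (CAPACITY - 1));
                events.add(new Event(r.times[i], r.times[i] + " " + r.threadName
                        + " site=" + r.sites[i] + " value=" + r.values[i]));
            }
        }
        events.sort((a, b) -> Long.compare(a.time, b.time));
        List<String> lines = new ArrayList<>();
        for (Event e : events) {
            lines.add(e.text);
        }
        return lines;
    }
}
```

You would call `RaceTrace.record(SITE_ID, value)` next to each read or write of the suspect variable and `RaceTrace.dump()` from the error handler. The last few entries before the failure are usually the interesting ones.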

He mentioned some flags from the Azul VM that might be incorporated into the Sun VM at some point to help detect problems: one would autolock unsynchronized collections (since they rarely have contention), and another would throw an exception on a race in a collection.

In general, this talk was weird in that it had two possible audiences and probably didn’t serve either too well. Those experienced in concurrency could probably have gotten what they needed in 10 minutes, and those inexperienced were probably completely lost in the data flow analyses, as he went through them really fast. As it was, he hit his talk summary about 25 minutes in and opened for questions at 30 minutes, which seemed like strange pacing.