Pure Danger Tech


JavaOne: Concurrent garbage collectors

09 May 2008

This talk was by Gil Tene and Michael Wolf from Azul. Azul has their own concurrent garbage collector, but this talk focused mostly on the ideas and concepts of concurrent collectors in general and didn’t really dive into their own collector in detail (my only real disappointment in an otherwise fascinating talk).

Concurrent garbage collectors are ones that run while your app is running. This is desirable because it allows your garbage to be cleaned up while minimizing stop-the-world pauses, which makes app performance more predictable. In a non-concurrent collector, failure consists solely of collecting something that’s not garbage (or other such bugs). In a concurrent collector, failure also happens if the pause time of the collector causes an application to miss its required response times.

Concurrent collectors run alongside your app while it’s running and creating garbage. That means the concurrent collector has to keep up with the garbage to avoid a stop-the-world pause. The problem is that keeping up usually requires some tuning, and the tuned setup is very sensitive. If load increases or the app changes slightly, it can suddenly fall off the cliff with no warning.

Some terms… Mark is the process of traversing references and determining live objects. Sweep is a phase to collect dead (non-marked) objects. An alternative to sweep is compaction, which moves the live objects together so that the free space ends up contiguous.
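To make the terms concrete, here is a minimal sketch of mark and sweep over a toy object graph. It is not from the talk and it is a plain stop-the-world version, not a concurrent one; the `ToyHeap` and `Node` names are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Toy model of a heap (hypothetical): each Node is an "object" that may reference others.
class ToyHeap {
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        boolean marked;
        Node(String name) { this.name = name; }
    }

    final List<Node> objects = new ArrayList<>(); // everything ever allocated and not yet swept
    final List<Node> roots = new ArrayList<>();   // stand-in for thread stacks and statics

    Node allocate(String name) {
        Node n = new Node(name);
        objects.add(n);
        return n;
    }

    // Mark: traverse references from the roots and flag every reachable object.
    void mark() {
        Deque<Node> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Node n = pending.pop();
            if (n.marked) continue; // already visited, also handles cycles
            n.marked = true;
            pending.addAll(n.refs);
        }
    }

    // Sweep: drop every unmarked object and clear the marks for the next cycle.
    // Returns the number of objects collected.
    int sweep() {
        int collected = 0;
        for (Iterator<Node> it = objects.iterator(); it.hasNext(); ) {
            Node n = it.next();
            if (n.marked) {
                n.marked = false;
            } else {
                it.remove();
                collected++;
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        ToyHeap heap = new ToyHeap();
        Node a = heap.allocate("a");
        Node b = heap.allocate("b");
        heap.allocate("c"); // never referenced, so it is garbage
        a.refs.add(b);
        heap.roots.add(a);
        heap.mark();
        // prints "collected 1, live 2"
        System.out.println("collected " + heap.sweep() + ", live " + heap.objects.size());
    }
}
```

A concurrent collector has the harder job of doing the mark step while the app keeps modifying `refs` underneath it, which is where the mutation rate discussed in the talk comes in.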

There was a lengthy discussion about different metrics to consider, like heap population (the live set), allocation rate (new objects), mutation rate (modified references), cycle time, etc., and how these differ according to load. I don’t think I can do that discussion justice but it was pretty interesting.

In talking about testing, there were several important points. One was that all concurrent GCs have to deal with fragmentation (either by sweep or compaction). These are the most taxing parts of any GC (and often stop-the-world operations). Any test load that doesn’t experience those worst parts of the GC cycle is not really testing the garbage collector. It’s important a) not to engineer your test so that GC doesn’t occur (because that’s what you want to understand) and b) to actually design your test to incur the worst GC quickly enough that you can rerun the test a lot. They gave 20-30 minutes as a good target for a stable test, and you typically want to see at least 5 bad GCs during the test to make sure you’re ok.
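One way to check that a test run actually saw collections is to snapshot the JVM’s own counters before and after the load. Here is a rough sketch using the standard `java.lang.management` beans; the `GcTally` class is my own invention, and which collector name corresponds to the “bad” GCs depends on the JVM and the collector you run, so treat that mapping as something you have to work out yourself.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: counts collections per collector across a test run.
class GcTally {
    // Snapshot of collection counts, keyed by the collector name the JVM reports.
    static Map<String, Long> collectionCounts() {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount() returns -1 when the count is undefined; treat that as 0.
            counts.put(gc.getName(), Math.max(0L, gc.getCollectionCount()));
        }
        return counts;
    }

    // Collections per collector that happened between two snapshots.
    static Map<String, Long> delta(Map<String, Long> before, Map<String, Long> after) {
        Map<String, Long> diff = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : after.entrySet()) {
            diff.put(e.getKey(), e.getValue() - before.getOrDefault(e.getKey(), 0L));
        }
        return diff;
    }

    public static void main(String[] args) {
        Map<String, Long> before = collectionCounts();
        // ... run the load test here ...
        Map<String, Long> after = collectionCounts();
        System.out.println(delta(before, after));
    }
}
```

If the count for the collector that handles your worst-case cycles is still below 5 at the end of the run, the test probably hasn’t told you much yet.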

They actually have an open source tool called Fragger that adds a small number of objects to the heap, but in such a way that it induces fragmentation, which quickly forces bad GC. You can run this load generator in your app alongside the normal load. They demonstrated it, and in a 1 GB heap they were able to cause bad GC pauses while only using 70 MB of memory, so the heap was mostly empty. Pretty cool. I could definitely see this being useful in the performance testing we do at Terracotta.
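I haven’t seen Fragger’s source, so the following is only my guess at the general idea, not how the tool works: hold a lot of small objects long enough that they sit next to each other, then release most of them in an interleaved pattern so the survivors are scattered among the holes. All names and sizes here are made up, and whether this really fragments anything depends heavily on the collector and heap layout.

```java
// Hypothetical sketch of inducing fragmentation; not Fragger's actual code.
class FragmentationSketch {
    // Phase 1: allocate many small arrays and hold all of them so they age together.
    static byte[][] fill(int count, int size) {
        byte[][] slots = new byte[count][];
        for (int i = 0; i < count; i++) {
            slots[i] = new byte[size];
        }
        return slots;
    }

    // Phase 2: release everything except every keepEvery-th array, leaving
    // small survivors with dead neighbours on both sides. Returns the survivor count.
    static int punchHoles(byte[][] slots, int keepEvery) {
        int survivors = 0;
        for (int i = 0; i < slots.length; i++) {
            if (i % keepEvery == 0) {
                survivors++;
            } else {
                slots[i] = null;
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        byte[][] slots = fill(100_000, 256);
        int survivors = punchHoles(slots, 10);
        // A later, larger allocation needs contiguous space that the holes may not provide.
        byte[] big = new byte[4 * 1024 * 1024];
        System.out.println(survivors + " survivors, big allocation of " + big.length + " bytes");
    }
}
```

The live set stays tiny (a tenth of what was allocated), which matches the spirit of the demo: bad GC behavior without the heap being anywhere near full.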

It turns out there are some common patterns in real apps (that are often NOT in benchmarks) that can cause exactly this kind of fragmentation. One of the most popular is an LRU cache: eviction causes the “active” set to turn over at regular intervals, creating garbage. If the concurrent collector can’t keep up, it will ultimately get in trouble. Apparently the SPECjbb benchmark does a really bad job of simulating this kind of “real app” behavior.
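For anyone who hasn’t built one, here is a minimal LRU cache sketch on top of `LinkedHashMap`, which is the usual quick way to do it in Java. The `LruCache` name, the capacity and the value sizes are mine, not from the talk. Every `put` past capacity evicts the least recently used entry, so entries that lived long enough to get old become garbage at a steady rate.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical minimal LRU cache built on LinkedHashMap's access ordering.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // true = order entries by access, not insertion
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<Integer, byte[]> cache = new LruCache<>(1_000);
        // Steady turnover: the live set stays at 1,000 entries while old entries keep dying.
        for (int i = 0; i < 100_000; i++) {
            cache.put(i, new byte[1024]);
        }
        System.out.println("entries still cached: " + cache.size());
    }
}
```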

Another interesting point was what they called the “mostly” concurrent secret. It’s really important what exactly happens in stop-the-world. Some things that aren’t talked about much but can be important depending on your app are things like class unloading, perm gen collection, weak/soft reference management, stack scanning, code cache cleanup, etc. Since the length of stop-the-world ultimately drives your worst case, this stuff happens to be pretty important.

The Azul collector sounds very cool, as they do concurrent young GC and old GC with a guaranteed single-pass mark which is oblivious to mutation rate. They also have a concurrent compactor where objects can be moved without stopping the mutator, and an entire generation can be relocated in every GC cycle. There is no stop-the-world fallback cliff as with CMS.

But they only had one slide about the Azul collector without much more detail than that. I’d love to see a more detailed description (maybe it’s there in papers already) and also some more detailed comparison of Azul’s collector vs G1.

If anyone out there has some deep experience in garbage collection and is interested in distributed GC, we are going to be doing some heavy rework of the Terracotta DGC and memory manager this year and would hire the right person to help.