I had lunch yesterday with a friend who is using ehcache (BTW, did you notice that “ehcache” is a palindrome?) on a production web site in a cluster. He’s had a problem where one server in the cluster goes “bad” and becomes CPU-bound. When that happens, it starts to “infect” the other servers in the cluster, because as the cache tries to maintain coherency it communicates with the other servers to spread the cache updates. Ultimately, this causes the whole cluster to die. My friend had a 30-minute outage yesterday due to this problem.
Anyhow, we were discussing how Terracotta’s clustered version of ehcache might help alleviate this problem. At the moment, this is just a thought exercise – I haven’t run any tests to back this up. Terracotta works by introducing a set of one or more Terracotta servers in addition to your application servers. The application servers then communicate with the Terracotta servers instead of directly with each other.
If a particular server becomes “bad”, the other servers will generally be unaffected. Their coherency and state are managed through the Terracotta server, not through communication with other clients. The bad server can then be killed and restarted; it will rejoin the cluster and see the contents of the distributed cache as necessary. In Terracotta, the distributed shared memory is faulted into each VM as needed.
Of course, if the bad server is holding distributed locks it will have an effect on the other servers as they will not be able to proceed until the lock is released. We kicked around some ideas internally of how this could be improved with timed locks or perhaps even timed wait/notify.
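To make the timed-lock idea a bit more concrete, here is a minimal sketch using plain `java.util.concurrent` inside a single JVM. It is not Terracotta's API and says nothing about how a clustered timed lock would actually be implemented; the class name `TimedLockSketch` and the helper `updateWithTimeout` are made up for illustration. A second thread stands in for the "bad" server that grabs the lock and never lets go.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TimedLockSketch {

    // Hypothetical helper: run the update only if the lock can be acquired
    // within timeoutMillis; otherwise give up and report failure so the
    // caller can degrade (skip the update, serve stale data, etc.).
    static boolean updateWithTimeout(ReentrantLock lock, long timeoutMillis, Runnable update) {
        try {
            if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false; // lock holder did not release in time
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        try {
            update.run();
            return true;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Stand-in for the "bad" server: takes the lock and sits on it.
        Thread bad = new Thread(() -> {
            lock.lock();
            try {
                held.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        });
        bad.start();
        held.await();

        // A healthy server waits at most 200 ms instead of blocking forever.
        System.out.println("while lock is held: " + updateWithTimeout(lock, 200, () -> {}));

        // Once the bad holder goes away, updates go through again.
        release.countDown();
        bad.join();
        System.out.println("after release: " + updateWithTimeout(lock, 200, () -> {}));
    }
}
```

The first call reports `false` after roughly 200 ms and the second reports `true`. The point is only that the healthy servers get control back after a bounded wait and can decide what to do next, rather than stalling along with the bad one.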