Clojure agent thread pools

I tangled with Clojure agents recently and while agents rock, I had some unpleasant operational experiences that led me to dig in a bit deeper. This post contains several issues and possible solutions.

Background:

Clojure agents can be sent messages with either `send` or `send-off`:

  * `send` uses a fixed pool of size (# of processors + 2) – this is considered good practice for building a CPU-bound thread pool (see JCIP).
  * `send-off` uses an expandable thread pool with cached threads and a 1 minute keep-alive – necessary for an IO-bound thread pool to avoid blocking the world.
  * Neither of these thread pools uses daemon threads (the JVM exits only when all non-daemon threads complete).
  * Using agents with `send` anywhere in your program means there is a pool of non-daemon threads that prevents JVM exit.
  * Using agents with `send-off` anywhere in your program means there is a pool of non-daemon threads that may take up to 1 minute to allow JVM exit (once no actions are enqueued).
  * You must use `(shutdown-agents)` to kill these pools gracefully (by draining all actions in the queue).
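
To make the background concrete, here is a minimal sketch of both dispatch functions and the explicit shutdown. The `counter` agent and the 50 ms sleep (standing in for blocking I/O) are just illustrative.

```clojure
;; A counter agent, used only for illustration.
(def counter (agent 0))

;; `send` runs the action on the fixed-size pool (meant for CPU-bound work).
(send counter inc)

;; `send-off` runs the action on the cached, growable pool (meant for blocking work).
(send-off counter (fn [n]
                    (Thread/sleep 50) ; stand-in for blocking I/O
                    (+ n 10)))

;; Block until the actions dispatched so far from this thread have run.
(await counter)
(println @counter) ; prints 11

;; Without this call the non-daemon pool threads keep the JVM alive.
(shutdown-agents)
```

Dropping the final `(shutdown-agents)` from a script like this is the easiest way to reproduce the "my program no longer exits" behavior.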

Some things I think can be improved with the agent thread pools:

  * **Thread naming** – It’s a best practice to name the threads in an executor thread pool with a custom ThreadFactory so that the purpose of these threads is clear in thread dumps and other runtime operational tools. By default, executor threads are named something like pool-%d-thread-%d, and this is what you’ll see for the agent thread pools. I created a patch that gives the threads names like:
      - clojure-agent-send-pool-%d – should be a fixed number of threads
      - clojure-agent-send-off-pool-%d – threads will be added and removed over time

    I have logged this as [ticket #378](https://www.assembla.com/spaces/clojure/tickets/378-set-thread-names-on-agent-thread-pools).
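
For readers who haven’t written a custom ThreadFactory before, the idea looks roughly like the sketch below. This is not the patch from the ticket, just an illustration on a standalone executor: `named-thread-factory` and `pool-thread-name` are hypothetical helpers, and starting the counter at 1 is my assumption.

```clojure
(import '(java.util.concurrent Callable Executors ExecutorService ThreadFactory)
        '(java.util.concurrent.atomic AtomicLong))

(defn named-thread-factory
  "Hypothetical helper: returns a ThreadFactory that names each new thread by
  applying `format` to name-format and an incrementing counter (starting at 1)."
  [name-format]
  (let [counter (AtomicLong. 0)]
    (reify ThreadFactory
      (newThread [_ runnable]
        (Thread. ^Runnable runnable
                 ^String (format name-format (.incrementAndGet counter)))))))

(defn pool-thread-name
  "Runs a task on the pool and returns the name of the thread that ran it."
  [^ExecutorService pool]
  (let [^Callable task (fn [] (.getName (Thread/currentThread)))]
    (.get (.submit pool task))))

(let [pool (Executors/newFixedThreadPool
            2 (named-thread-factory "clojure-agent-send-pool-%d"))]
  (println (pool-thread-name pool)) ; prints clojure-agent-send-pool-1
  (.shutdown pool))
```

With names like these, a thread dump immediately shows which threads belong to which pool.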

  * **Non-daemon threads prevent shutdown** – I had the experience of using a Clojure library (the very nice plaza RDF library) which happens to use an agent to protect a data structure (this is totally hidden inside the library). I was very surprised that adding the usage of this library caused my program to no longer exit. For me, this egregiously violated the principle of least surprise. 
    I understand that there are use cases where you want your main thread to kick off a bunch of work, exit, and expect the program to continue happily running forever based on actions in the agent queues. Is there anecdotal usage information on agents and whether this is common? If we expect this to be uncommon and/or a pattern used by more knowledgeable Clojurites, I would argue that it is less surprising to have the default agent pools contain \*daemon\* threads that do not prevent JVM exit, and to make more experienced users do some additional thing to get the current behavior.
    
    Seems to me there are several possible solution paths:
    
      1. Always use daemon threads and force users of the “main can exit” kind of program to create a keep-alive thread. I assume this path probably breaks some existing programs, but it would be easy to fix such programs by adding a call to a well-known function that spins up a thread that never exits. A function like this would be sufficient: 
        <pre>;; Start a non-daemon thread that blocks forever by joining itself.
(defn start-keepalive []
  (.start (Thread. (fn [] (.join (Thread/currentThread))) "keepalive thread")))</pre>
        
        This function creates a non-daemon thread that joins itself on startup (creating a 1-thread deadlock!). It would also be possible to have the thread await on a CountDownLatch and have an additional function that releases the latch so that you could easily stop the keepalive.
    
      2. Always use daemon threads by default but allow users of agents to specify their own custom executor service in which to execute send/send-off actions. This would give you the opportunity to make the threads non-daemon threads, give your threads priorities, set a different size for the pool, change thread keep-alive times, etc. I’m assuming you would want to specify this per-agent at agent creation time with new options like :sendPool and/or :sendOffPool that take an ExecutorService. 
        To me this has a certain appeal as it opens up a world of advanced options for agents that don’t currently exist, but still works well by default. What I don’t know are the consequences of using multiple task pools for different agent sets and whether that affects task ordering in ways that are weird when crossing different agent pools.
    
      3. Create new variants of send or some \*var\* that chooses whether to use a daemon or non-daemon pool. I suspect this is not very elegant and kind of gross.
      4. Something else???
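
    The latch-based keepalive mentioned under option 1 might look something like this. It’s only a sketch: returning a map with a `:stop` function is a made-up API for illustration, not something that exists in Clojure.

    ```clojure
    (import '(java.util.concurrent CountDownLatch))

    (defn start-keepalive
      "Starts a non-daemon thread that parks on a latch, keeping the JVM alive.
      Returns a map with the thread and a no-arg :stop function (hypothetical API)."
      []
      (let [latch  (CountDownLatch. 1)
            thread (Thread. ^Runnable (fn [] (.await latch)) "keepalive thread")]
        (.start thread)
        {:thread thread
         :stop   (fn [] (.countDown latch))}))

    ;; Keep the JVM alive...
    (def keepalive (start-keepalive))

    ;; ...and later release the latch so the thread ends and the JVM may exit.
    ((:stop keepalive))
    ```

    Unlike the self-join version, this one can be stopped cleanly without interrupting the thread.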
    
    If this issue were addressed in some way, I would also recommend deprecating `(shutdown-agents)` and making it a no-op.
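
    For what it’s worth, newer Clojure versions (1.5 and later, as far as I know) ship `send-via`, which takes an explicit executor and so gets close to option 2. Here is a minimal sketch of an agent driven only by daemon threads, assuming such a version; `daemon-thread-factory` is a hypothetical helper. Note that `await` is deliberately avoided because it dispatches through `send` and would wake up the built-in non-daemon pool.

    ```clojure
    (import '(java.util.concurrent Executors ThreadFactory))

    (defn daemon-thread-factory
      "Hypothetical helper: a ThreadFactory whose threads do not block JVM exit."
      []
      (reify ThreadFactory
        (newThread [_ runnable]
          (doto (Thread. ^Runnable runnable)
            (.setDaemon true)))))

    (def daemon-pool
      (Executors/newFixedThreadPool 2 (daemon-thread-factory)))

    (def state (agent []))
    (def done (promise))

    ;; Deliver the promise once the agent's state has actually changed.
    (add-watch state :done (fn [_ _ _ new-state] (deliver done new-state)))

    ;; Run the action on our own executor instead of the built-in pools.
    (send-via daemon-pool state conj :hello)

    (println (deref done 5000 :timed-out)) ; prints [:hello]
    ```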

  * **shutdown waits for existing queue actions to complete** – Should we also have a function to abort without waiting for actions to complete? We could add a `(kill-agents)` that calls `shutdownNow()` instead of `shutdown()` on the executor service. You could do this now directly with: 
    <pre>(.shutdownNow clojure.lang.Agent/pooledExecutor)</pre>
    
    which of course would be an incredibly evil thing to do in any Clojure library! Note that this issue is possibly moot if #2 is addressed.
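
    To see how `shutdownNow()` differs from `shutdown()` without touching the shared agent pools, here is a sketch on a private executor. `abandoned-task-count` is a made-up name, and the 10 second sleep just stands in for a long-running action.

    ```clojure
    (import '(java.util.concurrent CountDownLatch Executors ExecutorService))

    (defn abandoned-task-count
      "Queues n slow tasks on a private single-thread pool, waits for the first
      one to start, then calls shutdownNow. Returns how many queued tasks were
      discarded without ever running."
      [n]
      (let [^ExecutorService pool (Executors/newSingleThreadExecutor)
            started (CountDownLatch. 1)
            ^Runnable task (fn []
                             (.countDown started)
                             (try
                               (Thread/sleep 10000)
                               ;; shutdownNow interrupts the running task
                               (catch InterruptedException _ nil)))]
        (dotimes [_ n]
          (.execute pool task))
        (.await started)
        ;; shutdownNow returns the tasks that were still waiting in the queue.
        (count (.shutdownNow pool))))

    (println (abandoned-task-count 5)) ; prints 4
    ```

    A plain `shutdown()` would instead have let all five tasks run to completion, which is what `(shutdown-agents)` does today.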

  * **`(shutdown-agents)` is non-reversible** – As I just mentioned, it’s possible to do evil things in a Clojure library that break other libraries and your own program with respect to agents. I’m ok with that – if libs do evil things, people shouldn’t use them. 
    But it does seem like it would be _possible_ to make agent pools automatically restartable, albeit with some non-trivial concurrency code to do it properly (to take into account thundering-herd type problems on re-creation). I’m not really sure this is worth doing (especially if #2 made the need for shutdown-agents moot) but I mention it for completeness.

From my point of view, #1 (thread naming) is a no-brainer that makes Clojure easier to use operationally in a JVM, and I’m happy to file a ticket with a patch. #2 (non-daemon threads) is, imho, a real usability issue with agents, and I think solution paths 1 and/or 2 under it make a lot of sense. #3 and #4 (the two shutdown items) are of lesser importance, especially if #2 is addressed somehow.

I’m very interested in feedback on these ideas and I’m happy to work on patches to provide this functionality.

Pure Danger Tech
