Pure Danger Tech


Comparison of Protobuff, Thrift, Avro, etc

27 May 2011

Lately I’ve been doing some research and prototyping for the purposes of building a custom client-server communication library. This roughly can be broken into:

  • Data serialization – how you translate between language-specific data structures and a byte stream
  • Inter-process communication – transport and RPC/IPC

Below is some free-form commentary based on what I’ve read. This space is evolving at a good pace so don’t take anything here as gospel. Please correct me in the comments when I say something done.

Data serialization

These days there are lots of good options out there for high performance, language-independent data formats. For my own needs, Java is a requirement, Clojure-specific bindings are nice to have (but Java will work), and it would be cool if other languages (like Javascript) could play.

If you don’t care about other languages and Java is your thing, then you could just use plain ol’ Java serialization. The main downside of that is that Java serialization is well-known for being large and slow, mostly due to trying to capture Java object graphs and retain object identity relationships as well as arbitrary class identities. Most other data serialization formats make the simplifying assumption that messages are non-cyclic trees of a well-known set of unique objects.

Some of the options that I’ve looked into are:

  • Protocol Buffers – open-sourced by Google. Protobuf defines serializations using a custom language. Language-specific bindings are then generated by running a C++ “compiler”. Google provides implementations for Java, C++, and Python and open source plugins exist for most languages sane people use. Protocol buffer contains a few nods to defining RPC protocols but does not include an official implementation (they do have one internally).
  • Apache Thrift – originally developed at Facebook by a developer who interned at Google. Messages are defined in a custom language. Language-specific bindings are then generated by running a C++ “compiler”. Thrift provides bindings for a dozen languages or so. Thrift also includes the RPC transport layer in these languages which is a key differentiator vs Protobuf (although open-source libs do exist).
  • Apache Avro – Avro is a newer project designed to accomplish many of the same goals of Protobuf or Thrift but without the static compilation step and greater interop with dynamic languages. Avro is being driven largely by Hadoop, afaict. Messages are defined in JSON (truly more painful than Protobuf or Thrift). Language bindings exist for C, C++, Java, Python, Ruby, and PHP with RPC available in all of those but PHP. Code gen is available if you want to generate code from your messages but your data can be built with generic APIs.
  • Kryo – a Java-specific serialization and RPC library (KryoNet). Kryo is not multi-language and is specifically targeted at high-performance Java serialization and TCP/UDP connections.

There are a bunch of other options for this stuff, but these looked like some of the most promising for my purposes. I wanted to avoid the static compilation step and ended up prototyping systems with both Kryo and Avro. In my tests they were both about the same performance-wise and Avro met a few more of my needs (multi-language, transport-level security, etc) so I’ve settled on Avro for the time being. I hope to blog a few more posts about Avro soon.