XML, JSON, Protocol Buffer, Thrift and Avro

28 Nov

XML is the standard format for information exchange. It is well understood, extensible and widely supported. For anyone implementing a web service, SOAP support is almost always a must.

SOAP has real issues, though. The default invocation uses an elaborate POST, and a SOAP envelope wraps both the request and the response. The price of that robust, rich interface is that it can be cumbersome to use and limits the performance that can be achieved. In particular, SOAP-style invocations are very hard to cache. As a result, although it is considered a step backwards for SOAP, mainstream web service implementations also allow plain HTTP GET and HTTP POST, which help improve performance or simplify programming at the cost of robust error handling. They also allow REST to be exposed directly at the interface.
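To make the caching point concrete, here is a minimal sketch that builds the same call both ways. The service URL, the `GetPrice` operation and the `symbol` parameter are made up for illustration; only the SOAP 1.1 envelope namespace is real. With GET the whole request lives in the URL, which is what an HTTP cache keys on, while the SOAP version hides it inside a POST body.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace


def soap_request_body(operation, params, service_ns="http://example.com/stock"):
    """Wrap a call in a SOAP envelope; this would be sent as a POST body.

    service_ns is a hypothetical namespace used only for this sketch.
    """
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{service_ns}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def get_request_url(base_url, operation, params):
    """Express the same call as a URL; sorting keeps the cache key stable."""
    return f"{base_url}/{operation}?{urlencode(sorted(params.items()))}"


if __name__ == "__main__":
    params = {"symbol": "GOOG"}
    print(get_request_url("http://example.com/stock", "GetPrice", params))
    print(soap_request_body("GetPrice", params).decode("utf-8"))
```

The GET form is shorter and cacheable, but it gives up the envelope's structured headers and fault reporting, which is the tradeoff described above.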

But people still need lighter-weight data formats and web services. Besides the CPU and memory needed to process the extra bytes, network bandwidth and available storage are always limited. No one wants to pay more than necessary.

JSON is thus used in most of the scenarios that originally needed XML. With standard compression and native support in JavaScript, it is among the top choices for simplicity, support and speed. BSON is also sometimes used to reduce string-to-binary conversions, but it is considerably less popular due to problems similar to those of Protocol Buffer below…
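A quick way to see the size difference for yourself is to serialize the same record both ways and compare, with and without gzip. This is only a rough sketch: the record is invented, the XML layout is one of many possible mappings, and tiny payloads like this say little about real workloads.

```python
import gzip
import json
import xml.etree.ElementTree as ET

# Hypothetical sample record, made up for illustration.
record = {"id": 12345, "name": "widget", "tags": ["a", "b", "c"], "price": 9.99}


def to_json_bytes(rec):
    """Compact JSON encoding (no extra whitespace)."""
    return json.dumps(rec, separators=(",", ":")).encode("utf-8")


def to_xml_bytes(rec):
    """One simple element-per-field XML mapping of the same record."""
    root = ET.Element("record")
    for key, value in rec.items():
        if isinstance(value, list):
            parent = ET.SubElement(root, key)
            for item in value:
                ET.SubElement(parent, "item").text = str(item)
        else:
            ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="utf-8")


def sizes(payload):
    """Return (raw size, gzip-compressed size) in bytes."""
    return len(payload), len(gzip.compress(payload))


if __name__ == "__main__":
    for label, payload in (("json", to_json_bytes(record)),
                           ("xml", to_xml_bytes(record))):
        raw, zipped = sizes(payload)
        print(f"{label}: {raw} bytes raw, {zipped} bytes gzipped")
```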

Google Protocol Buffer attempts to go further than BSON by compressing integer data with a variable-length encoding. To make it even faster, field names are replaced with integer numbers. It is a tradeoff that gives up parseability, readability and extensibility for performance. As a result, Protocol Buffer messages can only be meaningfully parsed with the message definition at hand, normally through code generated from it, and different versions of a message have to be accessed differently. To make it worse, the length-prefixed encoding gets in the way of streaming and nesting large messages. One can implement it otherwise, but with less performance, which in general defeats the rationale for Protocol Buffer.
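The two tricks mentioned here, variable-length integers and numbered fields, are small enough to sketch by hand. The functions below are my own illustration of the wire format (base-128 varints, and a key of field number shifted left by three bits plus a wire type), not the protobuf library API, and they handle non-negative integers only.

```python
WIRE_VARINT = 0  # wire type used for plain integer fields


def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 bits per byte, high bit set if more bytes follow."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes to come
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_int_field(field_number: int, value: int) -> bytes:
    """A field is a varint key (number and wire type) followed by the value."""
    key = (field_number << 3) | WIRE_VARINT
    return encode_varint(key) + encode_varint(value)


if __name__ == "__main__":
    # Field 1 holding 150 takes three bytes; no field name is on the wire.
    print(encode_int_field(1, 150).hex())  # prints 089601
```

Nothing in those three bytes says what field 1 means, which is why a reader without the message definition is stuck.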

While Protocol Buffer was not yet open sourced, Apache Thrift was developed for people living outside the luxury of Google. The integer compression was thrown out as too much complication for limited gain, and a full RPC stack was added to make it more complete, albeit less lightweight. If Protocol Buffer already chose to be an RPC replacement, why not just create a language-neutral RPC to replace platform-dependent solutions like RMI, .NET Remoting and Sun/ONC RPC? Thus it is a more complete replacement for SOAP. Unfortunately, the attempt to provide a full stack for multiple languages and platforms means it is harder to get everything right and up to date.

Apache Avro attempts to move the balance back towards a real replacement for XML in most cases, with both binary (compact) and JSON serialization and a great design. It has the best of both worlds. The schema is included with the data to allow parsing without pre-generated code. The framing feature also means streaming is possible again, even in the binary format. On the other hand, it also shows an appreciation of “less is more” and decided against a transport implementation like the one Thrift attempted. These make it a strong contender among the top choices of data format. The only catch is that it is more complex and can sometimes be slower than comparable straight JSON. Unfortunately, people may hate choices and new things… While whoever needs and likes it may be enthusiastic, it still needs momentum and a blazed path to attract pragmatists and conservatives.
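The idea of shipping the schema with the data and framing the records can be shown with a toy format. To be clear, this is not Avro's actual encoding or its library API; the layout (a length-prefixed JSON schema followed by length-prefixed records) and the function names are assumptions for this sketch only. It shows why a reader needs no generated code and can consume records one frame at a time.

```python
import io
import json
import struct


def write_stream(out, schema, records):
    """Write the schema once, then one length-prefixed frame per record."""
    header = json.dumps(schema).encode("utf-8")
    out.write(struct.pack(">I", len(header)))
    out.write(header)
    for rec in records:
        # Only values are written; field names live in the schema.
        body = json.dumps([rec[f] for f in schema["fields"]]).encode("utf-8")
        out.write(struct.pack(">I", len(body)))
        out.write(body)


def read_stream(inp):
    """Yield records one frame at a time, using the embedded schema."""
    (size,) = struct.unpack(">I", inp.read(4))
    schema = json.loads(inp.read(size))
    while True:
        prefix = inp.read(4)
        if len(prefix) < 4:  # end of stream
            return
        (size,) = struct.unpack(">I", prefix)
        values = json.loads(inp.read(size))
        yield dict(zip(schema["fields"], values))


if __name__ == "__main__":
    buf = io.BytesIO()
    write_stream(buf, {"fields": ["id", "name"]},
                 [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
    buf.seek(0)
    for record in read_stream(buf):
        print(record)
```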

This brings us back to the tradeoffs. It is natural to speculate that Google uses protobuf for IPC messages that carry a lot of integer data and codes. Sounds like some serious data warehousing and mining.

References:

http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking

http://www.thrift.pl/Thrift-tutorial-running-tutorial.html

https://developers.google.com/protocol-buffers/docs/overview

http://www.json.org/

http://bsonspec.org/

http://www.w3.org/TR/


2 Responses to “XML, JSON, Protocol Buffer, Thrift and Avro”

  1. loopluke September 28, 2013 at 1:03 am #

    People sometimes tell me that they are confused by the array of choices. Or they send me performance comparisons that frankly compare apples to oranges. Why do people keep reinventing the wheel? Simple: one size doesn’t fit all, and benchmarking is only a secondary consideration.

    HTML5: JSON
    Data Sharing: JSON (simple), AVRO (compact), XML (general purpose)
    RPC: SOAP/Thrift/Protobuf …
    Human: INI/YAML, HTML, documents, media

    Binary XML never really took off, as users balk at paying for what is virtually vendor lock-in. There were open source solutions, but they were less mature due to the complexity. Platform support is also spotty. Users realize that for simple data they can do it cheaply with JSON/AVRO. For big, complex data, compressed XML works not too badly, even though generic compression is not the highest performing.
