Out of Context

28 Sep

Lots of things can be taken out of context. It is, however, the author's burden to clarify. So here is some redemption from a technologist, drawn from various battlefields.

1. Protobuf vs. JSON, XML or Avro. It is no secret that I "hate" Protobuf. In the context of exchangeable data, a rigid RPC solution doesn't make much sense.

2. Java vs. C#, Python or C(++)/PHP/Perl, Ruby, JavaScript …

Performance and platform zealots use C(++). PHP is at home with them for scripting their C(++) solutions for the web, and Perl serves the backend. If we need the highest performance and want to stay fully open source, it is the best solution. This is the choice of Facebook and, to some extent, Google.

Java's speed is not bad, and it is more approachable: people like its wealth of libraries and having fewer things to worry about. But it is always memory-hungry and starts slowly, and its syntax and basic data type support are not the most desirable. I have no problem using it for projects that fit it and that customers desire. It is certainly cheaper than C(++) for development, and it is also easy to set up clusters of middleware for business-critical systems.

C# is comparable to Java. But for some crazy reason Microsoft decided that only the Windows platform was worth supporting, leaving server and mobile to Java. When Java (Android) came to rule mobile and the server, even the staunchest Windows fans had to decamp to where the jobs are. Windows became the old standard desktop-only solution, and so went C#. Small businesses still use Microsoft for everything, but large enterprises use Microsoft only for the desktop. MonoTouch/Unity attempted to piggyback on the skills of the few die-hard Microsoft developers left, carving out a little survival space in the corner. If one doesn't deliver what customers want, customers will just leave and not come back. That said, hacking out quick client-side solutions in C# is still much easier and less frustrating than in Java.

Python is a very pleasant language, also with a lot of libraries and frameworks. The widespread Python 2.x is not new either. But it is fun, practical and useful. Networking, data analysis and visualization are among its strengths.

Ruby arose because Jython was never really good enough to do everything CPython can do. A new language without the huge libraries (an asset, but also a formidable task to support) fits the bill. It is smaller than Python to install, but like Python it is more maintainable than Perl for new users, so it fits rapid prototyping and system administration.

JavaScript started life in the browser. It is more familiar to web developers and, like Ruby, was designed and implemented in C/JVM early on. It is also sometimes used for data query and server-side development (node.js).

We also have other notable languages like Scala, Go, F#, and the old SQL, LISP…

In the end, it is moot to judge these languages by a simple shootout. Use the right tool for the right task. If the algorithm is chosen well and IO bottlenecks are resolved, they can all do a lot of different jobs. If we need to do simple things in nanoseconds, we can do it in C or hand-tuned assembly. For general server programming, Java, PHP and sometimes Python are the choices.

3. Revolution vs. evolution in development

Depending on the case, sometimes it is easy to burn the old code and redo it from almost scratch. Most of the time, evolution makes more sense. It depends on whether there is a big requirement change or misfit, and on how maintainable the current code base is. It really sucks to have to maintain bug-for-bug compatibility; it is wrong and crazy, and a typical symptom of lousy product management or business.

4. High level and low level programming

Most of the time we should all develop at the highest level possible, since computer hours are cheap and development effort is expensive. In real life we often have to plumb into the details to process the amount of data fast enough. So we go from tools like Hive, Pig, MATLAB, R, Python and SQL down to Java and C. We also use columnar stores, bitmap indexes, Bloom filters and estimation for OLAP, plus SSDs and in-memory data grids, to improve performance in general. These days a lot of things are achieved with massive data and computing power (like Google translation). Since we never have enough hardware resources due to cost, it often doesn't hurt to be as efficient as possible.
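As a quick illustration of one of those low-level tricks, here is a minimal Bloom filter sketch in Python. The bit count, hash count and MD5-based hashing are arbitrary choices for demonstration, not tuned production values:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: probabilistic membership, no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a plain int doubles as the bit array

    def _positions(self, item):
        # derive k bit positions by salting one hash function
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means probably present
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))  # True: added items are always found
```

The payoff for OLAP is skipping disk reads: if the filter says a key is absent from a block, the block never needs to be touched.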

5. MySQL vs. PostgreSQL or such

It is really simple: MySQL does simple things faster, with good replication/clustering support. It is good for OLTP kinds of workloads (when data is not critical) and for sharding. PostgreSQL (including products in the family, like Greenplum and Vertica) works better with OLAP-type analytics. We have Hadoop for even larger and less structured data, and specialized document or graph databases.

6. Strong Consistency vs. Eventual Consistency

Although in the majority of cases strong consistency of data is more important, for cloud-based services eventual consistency can be cheaper, more scalable and thus more desirable. We can architect and implement systems for either situation.

7. Synchronous vs. Asynchronous

Synchronous is simpler and more predictable. Asynchronous allows much better scale and is often a MUST for performance. There are also thread pools, fibers (green threads, greenlets) and pure event-based asynchronous programming models. Use what fits and keep it simple.
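A tiny sketch of the event-based model, using Python's asyncio. The `fetch` coroutine is a stand-in for any IO-bound call; the point is that three 0.1-second waits overlap instead of running back to back:

```python
import asyncio
import time

async def fetch(name, delay):
    # stand-in for an IO-bound call (network, disk)
    await asyncio.sleep(delay)
    return name

async def main():
    start = time.monotonic()
    # the three "requests" run concurrently on one thread
    results = await asyncio.gather(
        fetch("a", 0.1), fetch("b", 0.1), fetch("c", 0.1))
    elapsed = time.monotonic() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results)        # ['a', 'b', 'c']
print(elapsed < 0.3)  # True: ~0.1s total, not 3 x 0.1s sequential
```

The same three calls done synchronously would take the sum of the delays, which is exactly why asynchronous IO scales better for servers.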


REST is the basic method for Object-Oriented Big Data Programming

28 Sep

Define a piece of data, define the standard operations, give it a handle and publish it to the world. This is what REST is about.

While standard HTTP/HTML provides UI and interactions, and SOAP tunnels programming fabrics, REST lays down the foundation for Object-Oriented Big Data Programming. With either JSON or XML, we have the read-write semantic web.
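A minimal sketch of that idea: a resource gets a handle (a URI path) and a small, uniform set of operations. The in-memory store, the dispatcher and the `/tickers/IBM` path below are illustrative inventions, not any particular framework's API:

```python
import json

# a toy in-memory resource store keyed by URI path
store = {}

def handle(method, path, body=None):
    """Dispatch the uniform REST verbs against a resource handle."""
    if method == "PUT":       # create or replace the resource
        store[path] = json.loads(body)
        return 200, store[path]
    if method == "GET":       # read the resource
        return (200, store[path]) if path in store else (404, None)
    if method == "DELETE":    # remove the resource
        return 200, store.pop(path, None)
    return 405, None          # everything else is not allowed

handle("PUT", "/tickers/IBM", '{"price": 190.5}')
status, doc = handle("GET", "/tickers/IBM")
print(status, doc)  # 200 {'price': 190.5}
```

That is the whole contract: define the data, give it a handle, and the standard operations come for free.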

Technology like Avatar is an easy framework for REST on top of Java EE, just as GWT is for HTML.

XML, JSON, Protocol Buffer, Thrift and Avro

28 Nov

XML is the standard format for information exchange. It is well understood, extensible and widely supported. To implement a web service, SOAP is almost always a MUST to support.

There are actually issues with SOAP: the default invocation uses an elaborate POST, and SOAP envelopes are used for both request and response. As a tradeoff for the robust and rich interface, it can be cumbersome to use and it limits the performance that can be achieved. In particular, SOAP-style invocation is very hard to cache. As a result, although it is considered a step backwards from SOAP, plain HTTP GET and HTTP POST are allowed in mainstream web service implementations to help improve performance or simplify programming, at the cost of robust error handling. They also allow REST to be exposed directly at the interface.

But people still need a more lightweight data format and web service. Besides the CPU and memory needed to process extra bytes, network bandwidth and available storage are always limited. No one wants to pay more than necessary.

JSON is thus used in most of the scenarios that originally needed XML. With standard compression and native support in JavaScript, it is among the top choices for simplicity, support and speed. BSON is also sometimes used to reduce string/binary conversions, but it is considerably less popular due to problems similar to Protocol Buffer's, discussed below.
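The point about standard compression is easy to check: repetitive JSON shrinks dramatically under a stock compressor. The record layout here is just an illustration of typical row-shaped data:

```python
import json
import zlib

# a repetitive payload, as row-oriented telemetry tends to be
records = [{"symbol": "IBM", "price": 190.5 + i, "volume": 100 * i}
           for i in range(1000)]
raw = json.dumps(records).encode("utf-8")
packed = zlib.compress(raw)

# the repeated field names compress away almost entirely
print(len(raw), len(packed))
```

The verbose field names that make JSON readable are exactly what the compressor eliminates, which is why plain JSON plus gzip is so often "good enough".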

Google Protocol Buffers attempted to go further than BSON by compressing integer data. To make it even faster, field names are replaced with integer tags. It is a tradeoff of parseability, readability and extensibility for performance. As a result, Protocol Buffer messages can only be parsed by specifically generated code, and different versions of a message have to be accessed differently. To make it worse, the length-prefix scheme prevents streaming and nesting of large messages. One can implement otherwise, but less performantly, which in general defeats the rationale for Protocol Buffers.
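The integer compression in question is the base-128 "varint": values are emitted in 7-bit groups, low bits first, with the high bit of each byte marking continuation. A small sketch:

```python
def encode_varint(n):
    """Encode a non-negative int as a Protocol-Buffers-style varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F       # low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

print(encode_varint(1).hex())    # '01' -- one byte for small values
print(encode_varint(300).hex())  # 'ac02' -- two bytes instead of four
```

Small integers, the common case in field tags and counters, take a single byte rather than the fixed four or eight, which is where Protobuf's size win comes from.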

When Protocol Buffers were not yet open-sourced, Apache Thrift was developed for the people living outside the luxury of Google. While the integer compression was thrown out due to complication for limited gain, a full RPC stack was added to make it more complete, albeit less lightweight. If Protocol Buffers already chose to be an RPC replacement, why not create a language-neutral RPC to replace platform-dependent solutions like RMI, .NET Remoting and SUN/ONC RPC? Thus it is a more complete replacement for SOAP. Unfortunately, the attempt to provide a full stack for multiple languages and platforms means it is harder to get everything right and kept up to date.

Apache Avro attempted to move the balance back towards a real replacement for XML in most cases, with both binary (compressed) and JSON serialization and great design. It has the best of both worlds. The schema is included, allowing parsing without pre-generated code, and the framing feature means streaming is possible again, even in the binary format. On the other hand, it also shows an appreciation of "less is more" and decided against the transport implementation that Thrift attempted. These make it a strong contender for the top choice of data format. The only catch is that it is more complex and can sometimes be slower than comparable straight JSON. Unfortunately, people may hate choices and new things. While whoever needs and likes it may be enthusiastic, it still needs momentum and a blazed path to attract pragmatists and conservatives.
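The self-describing property can be sketched without the Avro library itself: ship the schema alongside the data, and any reader can interpret the records with no generated code. The container format and helpers below are a deliberate simplification for illustration, not Avro's actual wire format:

```python
import json

# an Avro-style record schema: field names and types travel with the data
schema = {
    "type": "record",
    "name": "Trade",
    "fields": [
        {"name": "symbol", "type": "string"},
        {"name": "price", "type": "double"},
    ],
}

def write_container(records):
    # embed the schema in the payload, as an Avro container file does
    return json.dumps({"schema": schema, "records": records})

def read_container(blob):
    doc = json.loads(blob)
    names = [f["name"] for f in doc["schema"]["fields"]]
    # decode every record using only the embedded schema
    return [dict(zip(names, rec)) for rec in doc["records"]]

blob = write_container([["IBM", 190.5], ["MSFT", 26.8]])
print(read_container(blob)[0])  # {'symbol': 'IBM', 'price': 190.5}
```

Contrast this with Protobuf, where the reader must already hold generated code for exactly the right message version.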

This brings us back to the tradeoffs. It is natural to speculate that Google uses Protobuf for IPC messages carrying a lot of integer data codes. Sounds like some serious data warehousing and mining.


Join them if you cannot beat them: Microsoft's distribution of Hadoop is now available

16 Nov

More than a year ago, Microsoft announced plans to move away from Dryad and offer Hadoop to its customers. It has actually been available for download, with just a few months' delay from the original roadmap.

As usual, Mary Jo Foley has more background information for the more inquisitive readers:

My biggest mistakes with big data

14 Nov

Seven years ago, when I was asked to look into our first time-series database solution, I had no idea what challenges we were facing. I wrote a long email about how the B+-tree's disk representation was optimized for random reads. Our visionary COO didn't even blame me; he just laughed at my ignorance. After talking with others and understanding the problem, I designed the data model and tested the performance with teammates. A compressed blob stored in Microsoft SQL Server 2005 was our first version of a time-series database. Since then it has evolved under other teammates.

Flash back to when I heard a CTO declare, "our dataset is so small that we can put it in the memory of a single server." I couldn't help but laugh out loud in private. It is also important to look into the needs of the business use cases: update, retrieval, availability, partitioning and management. A big or fast database touted by vendors may or may not be the solution.

An equally short-sighted call was that I had been against Hadoop back in 2009. I just thought MapReduce was too slow and the processing too batch-oriented and Java-centric. Instead, I supported building our computing platform on Condor. Our data was not that large, and a simpler platform is easier to manage.

Now Hadoop is finally growing out of MapReduce. It is, in my opinion, the de facto standard for really big data for the 90%. Any serious data provider should be big-data proof. While our Condor-based solution still works well for smaller sets of data, I wish I had had the foresight to embrace Hadoop/HBase, using it to create new products before competitors realized.


Is Facebook ditching Hadoop? Hardly.

14 Nov

There are so many sensational headlines these days about Facebook ditching Hadoop. Actually, it is quite the contrary.

“Hadoop Corona is the next version of Map-Reduce. The current Map-Reduce has a single Job Tracker that reached its limits at Facebook. The Job Tracker manages the cluster resource and tracks the state of each job. In Hadoop Corona, the cluster resources are tracked by a central Cluster Manager. Each job gets its own Corona Job Tracker which tracks just that one job. The design provides some key improvements…”


What Facebook did is like the others: they improved the JobTracker, and their much-improved Corona now remedies issues with the existing MapReduce implementation. It is no different from their creation of Hive as a query tool, or Cloudera's Impala, which added a real-time query engine on HBase, or Quantcast's and MapR's tweaking on the file system side.

The point is that HBase (the BigTable model) has already won over Cassandra (the Dynamo model) in architectural design to become the de facto standard in really big data. Now various vendors and users are adopting Hadoop, tweaking it and building upon it. As a young product, there will be more improvements and dramatic makeovers for Hadoop/HBase.

Separately, Google has also moved on from GFS/BigTable to Colossus/Spanner during these years.



Hopefully Hadoop can also make big quantum leaps soon, with increased adoption and battle testing. Before that happens, we still have to rely on the Hadoop/HBase of today.

Cassandra vs. HBase (or Dynamo vs. Big Table)

9 Nov

Internet scale. Big data. Which to choose?

With really big data, performance is limited by the network and disk drives, and there are hundreds to thousands of nodes. Brewer's CAP theorem tells us that we will be looking at either CP (BigTable), like HBase, or AP (Dynamo), exemplified by Cassandra.

Benchmarks showed that the winner of the ultimate race is Cassandra…

It is indeed a good and easy story to sell! But the devil is in the details: even objective statistics need skilled and well-thought-through interpretation. Otherwise, good benchmarks could become just lies, lies and damned lies.

Why did Facebook choose HBase over Cassandra? "Storage Infrastructure Behind Facebook Messages – HPTS" has the statistics on page 8:

▪ 75+ Billion R+W ops/day
▪ At peak: 1.5M ops/sec
▪ ~55% Read vs. 45% Write ops
▪ Avg write op inserts ~16 records across multiple column families.

With effective read caching, write performance is more important. And strong consistency is key: if the data is only eventually consistent, how can the cache be free of stale data? It doesn't help when vector clocks are replaced by cheaper client timestamps.

No wonder that in a follow-up presentation, this is made clearer on page 21:

Simpler Consistency Model
▪ Eventual consistency: tricky for applications fronted by a cache
▪ replicas may heal eventually during failures
▪ but stale data could remain stuck in cache

For example, a simple "like" counter cannot be reliably implemented with eventual consistency.
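A sketch of why: with last-write-wins reconciliation by timestamp, two replicas that each apply an increment concurrently can silently lose one of them. The two-replica model below is a deliberate simplification:

```python
# each replica holds (timestamp, value); reconciliation keeps the newest write
replica_a = (1, 10)  # both replicas start at 10 likes
replica_b = (1, 10)

# two users "like" concurrently, each hitting a different replica
replica_a = (2, replica_a[1] + 1)  # 11 likes, written at t=2
replica_b = (3, replica_b[1] + 1)  # 11 likes, written at t=3

# last-write-wins merge: the t=3 write shadows the t=2 increment
merged = max(replica_a, replica_b)
print(merged[1])  # 11, not the correct 12: one like was silently lost
```

A cache fronting these replicas only makes it worse, since the stale value can keep being served even after the replicas heal.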

Will Cassandra be able to remedy this by adding atomic operations and using R+W>N to ensure consistency? According to Sarma, this is not going to work unless re-joined nodes are resynchronized first. With all these additional requirements, the overall performance advantage of Cassandra vanishes.
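The R+W>N rule itself is just set arithmetic: if a write touches W of N replicas and a read consults R of them, R+W>N forces at least one replica in common, so the read sees the latest write. A brute-force check of that pigeonhole argument:

```python
from itertools import combinations

def overlap_guaranteed(n, r, w):
    """True iff every possible read quorum intersects every write quorum."""
    replicas = range(n)
    return all(set(rs) & set(ws)
               for rs in combinations(replicas, r)
               for ws in combinations(replicas, w))

print(overlap_guaranteed(3, 2, 2))  # True: 2 + 2 > 3, reads see latest write
print(overlap_guaranteed(3, 1, 1))  # False: disjoint quorums can miss it
```

The catch the post raises still applies: the guarantee assumes the W replicas actually hold the latest write, which fails when re-joined nodes have not been resynchronized.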

Then how about network awareness, for avoiding network contention and for better DR? Going back to our network-bound criteria, HBase is the clear winner instead.

Obviously, we are just at the start of big data systems. Each system will learn from the others and from its own experience. If consistency is not that important and network contention is not an issue, Cassandra and other AP models can be a good choice. Otherwise, stick with the CP model and BigTable/HBase. It got a lot of things right in the first place and is also improving quickly.
