Cassandra vs. HBase (or Dynamo vs. Big Table)

9 Nov

Internet scale. Big data. Which to choose?

With truly big data, performance is limited mainly by the network and the disk drives, and clusters span hundreds to thousands of nodes. Brewer's CAP theorem tells us we will be choosing between CP (BigTable-style) systems like HBase and AP (Dynamo-style) systems exemplified by Cassandra.

Benchmarks showed Cassandra winning the raw-throughput race:
http://research.yahoo.com/Web_Information_Management/YCSB
http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf

It is indeed a good and easy story to sell! But the devil is in the details: even objective statistics need skilled, well-thought-through interpretation. Otherwise good benchmarks become just lies, damned lies, and statistics.

Why did Facebook choose HBase over Cassandra? Storage Infrastructure Behind Facebook Messages (HPTS) has the statistics on page 8:

▪75+ Billion R+W ops/day
▪At peak: 1.5M ops/sec
▪~ 55% Read vs. 45% Write ops
▪Avg write op inserts ~16 records across multiple column families.

With effective read caching, write performance matters more. And strong consistency is key: if the data is only eventually consistent, how can the cache be kept free of stale entries? It doesn't help that vector clocks were replaced by cheaper client timestamps.

No wonder a follow-up presentation makes this clearer on page 21:

Simpler Consistency Model
▪ Eventual consistency: tricky for applications fronted by a cache
▪ replicas may heal eventually during failures
▪ but stale data could remain stuck in cache

For example, a simple "like" counter cannot be implemented reliably on top of eventual consistency.
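The "like" counter failure above can be sketched in a few lines. This is a hypothetical toy model, not Cassandra's actual API: two replicas reconcile writes with last-write-wins on client timestamps, and two concurrent read-increment-write cycles lose an update.

```python
# Toy model (assumption: last-write-wins on client timestamps, as in the
# post's "cheaper client timestamps" point) showing a lost "like".

class Replica:
    def __init__(self):
        self.value = 0   # stored counter value
        self.ts = 0      # timestamp of the last accepted write

    def write(self, value, ts):
        # Last-write-wins: keep only the write with the newest timestamp.
        if ts > self.ts:
            self.value, self.ts = value, ts

r1, r2 = Replica(), Replica()

# Both clients read the counter before either write has propagated.
a = r1.value             # client A sees 0
b = r2.value             # client B sees 0

r1.write(a + 1, ts=1)    # A writes 1
r2.write(b + 1, ts=2)    # B also writes 1 -- it never saw A's increment

# Anti-entropy eventually converges both replicas to the newest write.
winner_value = r2.value if r2.ts > r1.ts else r1.value
winner_ts = max(r1.ts, r2.ts)
for r in (r1, r2):
    r.write(winner_value, winner_ts)

print(r1.value, r2.value)  # 1 1 -- two likes happened, one is lost
```

The replicas do "heal eventually," as the slide says, but they converge on the wrong count, and any cache that captured the intermediate value may stay stale even longer.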

Can Cassandra remedy this by adding atomic operations and using R + W > N to ensure consistency? According to Sarma, this does not work unless rejoined nodes are resynchronized first. With all these additional requirements, Cassandra's overall performance advantage vanishes.
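The R + W > N rule is a pigeonhole argument: if a read contacts R replicas and a write was acknowledged by W replicas out of N, the two sets must intersect whenever R + W > N, so the read sees at least one up-to-date copy. A minimal sketch of that counting argument (an illustration, not Cassandra's implementation) checks every possible read set against every possible write set:

```python
# Brute-force check of the quorum-overlap rule R + W > N.
# Every read set of size r must share a replica with every write set of
# size w; otherwise a read can miss the latest acknowledged write.
from itertools import combinations

def always_overlap(n, r, w):
    replicas = range(n)
    return all(set(rs) & set(ws)
               for rs in combinations(replicas, r)
               for ws in combinations(replicas, w))

assert always_overlap(3, 2, 2)       # quorum reads + quorum writes: safe
assert not always_overlap(3, 1, 1)   # R=W=1 on N=3: a read can miss the write
```

Note what the check does not cover: it assumes all N replicas hold current data. A node that rejoins after a failure without resynchronizing can still serve a stale copy inside a "valid" quorum, which is exactly the caveat raised above.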

Then what about network awareness, both for avoiding network contention and for better disaster recovery? Going back to our network-bound criteria, HBase is the clear winner here.

Obviously we are just at the start of big data systems; each will learn from the others and from its own experience. If consistency is not that important and network contention is not an issue, Cassandra and other AP systems can be a good choice. Otherwise, stick with the CP model and BigTable/HBase: it got a lot of things right from the start and is also improving quickly.

