Several attendees who stopped by our booth asked this question, so I reflected on it for a couple of days. Yes, these databases keep multiple copies of the data, so why do they need backup? I knew I had a good answer, but how could I phrase it so that the need for backup becomes self-explanatory? Does the built-in replication of these products alone ensure business continuity in all failure scenarios? To answer this question, let’s look back at the evolution of computer systems.
- Before the advent of disk RAID, systems ran on single disks with no redundancy whatsoever. RAID offered protection against disk failures by maintaining redundant data. Since its introduction in the late 1980s, RAID has become the industry standard for protecting against disk failures. But having redundant data in the disk array did not eliminate the need for regular backups of your applications.
- To improve redundancy at the system level, clustering technology kept applications available in case of system failures. However, clustering did not eliminate the need for regular backups either.
- To protect against site failures, storage vendors implemented synchronous and asynchronous replication. People still perform backups.
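
To make the RAID point above concrete, here is a minimal Python sketch of parity-style redundancy. It is an illustration only, not how any particular controller is implemented: the three "disks", their contents, and the `xor_blocks` helper are all hypothetical. It shows that parity can rebuild a lost disk, but it will just as faithfully rebuild data that was overwritten by mistake.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data "disks" plus one parity block computed from them.
disks = [b"ACCT", b"2024", b"DATA"]
parity = xor_blocks(disks)

# Disk failure: disk 1 is lost, so rebuild it from the survivors and parity.
rebuilt = xor_blocks([disks[0], disks[2], parity])

# Logical error: an application overwrites disk 1 with bad data.
# The parity is updated to match, exactly as it would be for a good write.
disks[1] = b"OOPS"
parity = xor_blocks(disks)

# Rebuilding now returns the bad data; the original contents are gone.
rebuilt_after_error = xor_blocks([disks[0], disks[2], parity])
```

Redundancy protects against losing a component. Only a copy taken at an earlier point in time lets you go back to the data as it was before the mistake.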
So how do these features stack up against the scale-out databases that we have all come to love lately?
| Feature | Traditional IT | Scale-out databases/file systems |
|---|---|---|
| Parity-based protection | RAID-5/6 | Erasure coding |
| Clusters | Operating-system-based clustering | Clustering functionality built into the application |
| Site-wide replication | Synchronous replication of storage | Replication between racks in a datacenter. Caveat: not all databases understand the datacenter topology. For example, Cassandra supports different snitches that enable replication at various levels. |
| Geo replication | Asynchronous replication to two or more geographical locations | Datacenter-wide replication |