We are going to review databases, their challenges and architectural approaches.
Firstly let’s revisit the CAP theorem.
It states that a shared distributed data system can guarantee only 2 of the 3 properties (Consistency, Availability, Partition tolerance) at any one time.
For example, Cassandra prefers Availability and Partition tolerance (an A-P system with eventual consistency). You can get strong consistency with Cassandra at the cost of increased latency (an almost C-A-P system), since it lets the client choose a consistency level for each read/write – weak (ANY), strong (ALL), or in between (ONE, QUORUM).
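As a minimal sketch of per-statement consistency levels (assuming the Python cassandra-driver and a hypothetical users table – not taken from any particular production setup):

```python
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

# Hypothetical cluster address and keyspace, for illustration only.
session = Cluster(["127.0.0.1"]).connect("app")

# Fast, weakly consistent write: any single replica (even a hinted handoff) is enough.
fast_write = SimpleStatement(
    "INSERT INTO users (user_id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ANY)
session.execute(fast_write, (123, "Alice"))

# Read at QUORUM: combined with QUORUM writes this gives read-your-writes behaviour,
# at the price of waiting for a majority of replicas.
quorum_read = SimpleStatement(
    "SELECT name FROM users WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
row = session.execute(quorum_read, (123,)).one()
```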
Cassandra was invented at Facebook, and it combines Amazon Dynamo’s fully distributed design with Google Bigtable’s column-oriented data model.
HBase, on the other hand, prefers Consistency and Partition tolerance (a C-P system). Other examples are MongoDB, Redis, etc.
The default consistency model in HBase is strong: reads and writes for a row go through a single server, which serializes the updates and returns all data that was written and acknowledged.
Other examples of A-P systems with eventual consistency (in their default configurations) are GCP Bigtable and AWS DynamoDB.
With Google's Bigtable, when you write to one cluster the data becomes available in the other clusters only after replication between clusters has completed. Google has a workaround to gain strong consistency in Bigtable: configure single-cluster routing in your application's app profile and use the other clusters only for failover. After a failover, Bigtable reverts to eventual consistency, and you will read stale data for a brief time until replication completes for those rows.
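A sketch of that workaround (assuming the google-cloud-bigtable Python client; the project, instance, and cluster names are made up for illustration):

```python
from google.cloud import bigtable
from google.cloud.bigtable import enums

# Hypothetical identifiers, for illustration only.
client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")

# Route all requests from this profile to one cluster, so reads observe its writes;
# the remaining clusters are kept purely for failover.
profile = instance.app_profile(
    app_profile_id="single-cluster-profile",
    routing_policy_type=enums.RoutingPolicyType.SINGLE,
    cluster_id="my-cluster-us-east1",
    allow_transactional_writes=True,
    description="Strong consistency via single-cluster routing",
)
profile.create(ignore_warnings=True)
```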
Some cloud databases and storage services
Google Datastore is strongly consistent for ancestor queries (those that execute against a single entity group) and lookups by key (get() calls), but only eventually consistent for global queries (see the sketch after this list).
Cloud object storage (e.g. AWS S3, Google Cloud Storage) provides strong consistency for reads after writes or deletes, object/bucket listing, etc.
GCP Spanner provides strong consistency + high availability + partition tolerance (ALMOST C-A-P, for most practical purposes)
GCP Bigtable can be Eventually consistent, read-your-writes consistent, or strongly consistent, depending on configuration.
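To illustrate the Datastore point, here is a minimal sketch (assuming the google-cloud-datastore Python client and a hypothetical Customer/Order hierarchy):

```python
from google.cloud import datastore

client = datastore.Client(project="my-project")  # hypothetical project

# Entities that share an ancestor form an entity group.
customer_key = client.key("Customer", "alice")
order_key = client.key("Order", 1001, parent=customer_key)
order = datastore.Entity(key=order_key)
order.update({"total": 42.5})
client.put(order)

# Lookup by key and ancestor queries are strongly consistent...
fresh = client.get(order_key)
ancestor_query = client.query(kind="Order", ancestor=customer_key)
orders_for_alice = list(ancestor_query.fetch())

# ...while a global query across all entity groups is only eventually consistent
# and may not see the write above for a short while.
global_query = client.query(kind="Order")
all_orders = list(global_query.fetch())
```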
RDBMS systems
RDBMSs in a cluster, like MySQL, Oracle, or SQL Server, prefer Consistency and Availability (C-A systems): if the cluster struggles to communicate between nodes, it becomes unavailable in order to maintain consistency. That's why RDBMS vendors put so much emphasis on having a dedicated interconnect for RAC.
A single-server RDBMS doesn't provide availability, but it does guarantee strong consistency (e.g. when a row is locked for an update, or the server is down, you won't be able to work with that record). Transaction commit or rollback and failover provide "eventual availability" in the respective situations (still a C-A system).
What exactly are Consistency, Availability, Partition tolerance?
- Consistency means each session reads the most recent write result, or the operation fails (we can think of it as linearizability of reads and writes)
- Availability means every session gets a response, for both reads and writes, though the response is not guaranteed to contain the latest version of the data. In other words, the service is always available: the system allows operations on all records all the time
- Partition tolerance means the system continues serving sessions even when network communication between the system's nodes degrades and messages start getting dropped (e.g. due to disconnects between racks or data centers). The challenge here is that the system must decide what's better: continue being available but serve potentially stale data until it gets updated, or preserve consistency by refusing data modifications – which means sacrificing availability for consistency.
Many databases utilize Write-ahead logging (WAL) to avoid data loss. It’s a family of techniques that enables durability and atomicity. The system first writes changes to the log on disk, before writing them to the database, and in most cases both undo and redo information is stored in this log.
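A toy sketch of the WAL idea (not how any particular database implements it): changes are appended and flushed to a log before the actual data structures are touched, so they can be replayed after a crash.

```python
import json
import os

class TinyKV:
    """Toy key-value store: every update hits the log (and disk) before the data."""

    def __init__(self, log_path="wal.log"):
        self.log_path = log_path
        self.data = {}
        self._replay()  # redo any changes already recorded in the log

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as f:
            for line in f:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        record = {"key": key, "value": value, "old": self.data.get(key)}  # redo + undo info
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())   # durable on disk *before* the apply step
        self.data[key] = value      # only now is it safe to change the actual data

store = TinyKV()
store.put("balance:alice", 100)
```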
A-P systems give you a huge performance boost by always being available, even if the data hasn't (yet) been fully synchronized between nodes; such systems are typically called "eventually consistent" or BASE (Basically Available, Soft state, Eventual consistency, with soft state meaning the system maintains a lot of in-memory data, e.g. memtables) – the opposite end of the spectrum from ACID (Atomicity, Consistency, Isolation, Durability).
C-P systems like HBase make sure that all the sessions see consistent data, but it means the users of one node may need to wait a bit until the other node(s) get synchronized (availability suffers here).
Based on the above you can start thinking about which system suits your needs best. Banks, for example, put such a strong emphasis on consistency that an RDBMS is almost always their default choice.
But there’s a caveat.
The CAP theorem is only about 20 years old, yet it is already too simplistic to describe today's distributed systems.
Distributed systems today provide a bit of each – Consistency, Availability, and Partition tolerance – depending on how the system is configured. Hence, it would not be correct to categorize these systems as either pure C-P or pure A-P.
Today’s MySQL can be C-A or C-P, depending on the configurations.
With its default configuration, MySQL is C-A due to its master-slave replication principle (https://dev.mysql.com/doc/internals/en/replication.html).
When network communication fails, ClusterControl's automatic recovery will pick a new master candidate if the old master goes offline. Partition tolerance is sacrificed here, and the system needs a reverse proxy / load balancer in front of the database instances to route clients to the new master.
A couple of words about the Google Spanner database.
Google Spanner is a very attractive solution when it's critical to have ordering of reads and writes and external consistency in a globally distributed database.
Spanner uses locks to achieve serializability, and it uses TrueTime to get inter-cluster consistency (see references below).
Spanner prefers consistency during partitioning, so it will give up availability when the network is partitioned. But it's a C-P system on steroids, with actual availability so high that you can treat it as an effectively C-A-P system even though it operates across a wide geographical area.
Another remarkable feature of Spanner (as a truly horizontally scalable database) is snapshot consistency: you can do consistent reads without locking (when it's important to get only committed transactions' data – and fast). It also makes it easy to restore from backups, as no dirty transactions need to be cleaned up manually.
If you are running a MapReduce analytics job on a system that doesn't provide snapshot consistency, your results will be inconsistent. Running the same job against Spanner guarantees consistency because it can read at a precise timestamp.
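A minimal sketch of such snapshot reads (assuming the google-cloud-spanner Python client; instance, database, and table names are hypothetical):

```python
import datetime
from google.cloud import spanner

# Hypothetical identifiers, for illustration only.
client = spanner.Client(project="my-project")
database = client.instance("my-instance").database("my-db")

# Strong read: observes everything committed before the read started, without locks.
with database.snapshot() as snapshot:
    rows = list(snapshot.execute_sql("SELECT AccountId, Balance FROM Accounts"))

# Stale read at an exact staleness: still a consistent snapshot, just 15 seconds old,
# which Spanner can often serve from a closer replica.
with database.snapshot(exact_staleness=datetime.timedelta(seconds=15)) as snapshot:
    stale_rows = list(snapshot.execute_sql("SELECT AccountId, Balance FROM Accounts"))
```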
Choosing the right DB for you
But the CAP theorem is only one side of the story. When choosing the right DB for your application you also have to consider costs (TCO), the skillset available, and how easy it will be to work with the data. For example, unlike an RDBMS, most NoSQL databases do not provide features like joins or aggregation, and do not let you use SQL to query data (though they may offer their own data access language, like the Cassandra Query Language) – leading to costly workarounds.
This in turn raises the question of an appropriate data model: you have to consider data access patterns, the potential need to de-normalize and duplicate data for read performance, and so on.
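A small sketch of what that query-first, de-normalized modeling can look like in practice (CQL issued through the Python driver; the users domain and column names are made up for illustration):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")  # hypothetical cluster/keyspace

# One table per read pattern: the same user data is deliberately duplicated.
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_id (
        user_id uuid PRIMARY KEY, email text, name text)""")
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_email (
        email text PRIMARY KEY, user_id uuid, name text)""")

# Every write now has to touch both tables (often in a logged BATCH) --
# the price you pay for fast, single-partition reads.
```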
Then you also have to consider patching and maintenance across the whole cluster of nodes, which requires a strong DevOps culture.
Another point is future migrations: it's one thing to migrate from Netezza to Greenplum (still not an entirely painless exercise), but it's a whole new ballgame in the NoSQL world, where the lack of standardization works against you.
If schema-less development and JSON are a must for you, don't forget that Postgres, Greenplum, MySQL, and Oracle (12.1.0.2 onwards) support JSON storage and querying, while also providing SQL support and ACID.
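For instance, a minimal sketch of JSON in Postgres (via psycopg2; the orders table and connection string are hypothetical):

```python
import json
import psycopg2

conn = psycopg2.connect("dbname=shop")  # hypothetical database
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS orders (id serial PRIMARY KEY, doc jsonb)")
cur.execute("INSERT INTO orders (doc) VALUES (%s)",
            [json.dumps({"customer": "alice", "status": "shipped", "total": 42.5})])

# Regular SQL plus JSON operators, all inside an ACID transaction.
cur.execute("SELECT doc->>'customer' FROM orders WHERE doc @> %s::jsonb",
            [json.dumps({"status": "shipped"})])
print(cur.fetchall())
conn.commit()
```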
Last, but not least: Security of data at rest and in transit. You’ll need to review what level of security, access mechanisms and encryption each system supports.
Lambda architecture brief overview
Most databases we know are better at Consistency than Availability, so there have been various attempts to reach the nirvana of both, and the Lambda architecture is one of them.
The Lambda architecture was designed to come as close to all three guarantees as possible, and to allow for more human-fault tolerance when developing shared data systems. It was introduced by Nathan Marz in 2011. As organizations realized the importance of data, they started collecting much more of it and storing it for later processing. The amount of data quickly became unmanageable: an old-style RDBMS can only be scaled vertically so far – you cannot simply keep adding resources indefinitely, and the complexity and relationships in the data get in the way.
The goal of collecting all this "big data" is to understand the knowledge it holds: to examine the data and extract the maximum insight and value from it.
Some of the data becomes obsolete sooner than the rest, which poses a challenge: how do you process data in real time or near real time?
The data may also come from multiple sources in parallel streams – at a very high rate that overwhelms the system, or at sporadic rates that create periods of near-stillness followed by huge inflows.
So maybe we could discard the older data and just process a small window of the most recent data? For some business needs it works, and sometimes you can approximate the results sufficiently. But sometimes you need to process “all the data”, not just a small time window.
Lambda layers: Batch, Speed, Serving
Lambda architecture defines 3 layers: batch, speed, and serving.
The batch layer can handle gigantic amounts of data, and its master dataset is immutable (append-only). It is also responsible for pre-calculating the views, and this comes at the price of latency: while the batch layer is recalculating the views, up-to-date results are not available from it.
The speed layer caches the most recent data and pre-calculates the most recent views.
The serving layer combines results from the batch and speed layers and presents a unified view to the user. Then, as the most recent data is absorbed by the batch layer, the batch views are recalculated, which allows the speed layer to discard that chunk of data. This process runs at regular intervals.
So, as you can see, Lambda uses pre-computed results to achieve low-latency responses.
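A toy sketch of that merge step (hypothetical page-view counts; a real serving layer queries stored views rather than in-memory dicts):

```python
# Batch view: complete but lagging, recomputed from the immutable master dataset.
batch_view = {"/home": 10_500, "/pricing": 2_300}

# Speed view: incremental counts for events that arrived after the last batch run.
speed_view = {"/home": 42, "/blog/lambda": 7}

def serve(page: str) -> int:
    """Unified answer = batch view + whatever the speed layer has seen since."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(serve("/home"))          # 10542
print(serve("/blog/lambda"))   # 7
```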
Apache Hadoop, Google BigQuery, AWS Redshift
Typically the batch layer is implemented on Apache Hadoop, an open-source framework for processing very large data sets across clusters of computers. Hadoop offers a popular implementation of the MapReduce programming model (invented at Google). MapReduce is composed of three steps: map, shuffle, and reduce, and it uses the cluster for distributed parallel computation.
Google BigQuery and AWS Redshift are also used as a batch layer.
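To make the three steps concrete, here is a single-process toy word count; a real job would run the same map and reduce functions distributed across the cluster:

```python
from collections import defaultdict

documents = ["lambda batch speed", "batch serving batch"]

# Map: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values by key (the framework normally does this across nodes).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each key's values into the final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'lambda': 1, 'batch': 3, 'speed': 1, 'serving': 1}
```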
Apache Spark, HBase, GCP Bigtable, Cassandra, Azure Cosmos DB
The speed layer needs to compensate for the high latency of the batch layer, because by the time the batch layer completes its calculation, the result is already somewhat out of date.
Apache Spark is typically used to implement the speed layer because of its fast performance. The data provided by the speed layer is not as complete as the batch layer's, but it is served almost immediately after arriving from the input stream. The output of Spark jobs is stored in NoSQL databases like Apache HBase, GCP Bigtable, Cassandra, or Azure Cosmos DB.
The batch layer will process the new data sets and provide the complete and accurate views, so the speed layer's incomplete data will eventually be replaced with complete data.
Apache Kafka – gluing all of this together
A message broker like Apache Kafka is capable of serving massive amounts of data very fast, and it is typically used to buffer the input stream and feed it to both the speed and batch layers asynchronously, decoupling senders and receivers.
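A minimal sketch of a speed-layer job reading that stream (PySpark Structured Streaming with the Kafka source; the broker address and topic are hypothetical, the spark-sql-kafka package must be on the classpath, and a real job would write to one of the NoSQL stores above rather than the console):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("speed-layer").getOrCreate()

# Consume the raw event stream that Kafka is buffering for us.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "page-views")
          .load())

# Maintain incremental per-page counts -- the "recent views" the batch layer
# has not gotten around to yet.
counts = (events
          .selectExpr("CAST(value AS STRING) AS page")
          .groupBy("page")
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")      # stand-in for an HBase/Bigtable/Cassandra sink
         .start())
query.awaitTermination()
```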
The serving layer needs to work with relatively small amounts of data – the pre-calculated, stored batch and speed views.
To store these views, the same class of NoSQL databases as above can be used, or Apache Impala, which can query data in Hadoop (HDFS), AWS S3, and HBase.
Pros and Cons of Lambda architecture
This architecture is highly tolerant of hardware failures and human mistakes, but it is very complex and highly redundant. The same business logic has to be coded in both the batch and speed layers, making it hard to keep the changes in sync.
This type of architecture requires a high level of DevOps discipline, automation, and a very consistent data management approach that supports versioned generic schemas.
Versioning your schemas and APIs allows you to decouple the moving parts as much as possible.
Sometimes you need to work with all the data at once, not just the sum of batch and speed.
If the data analytics task is a function of all the data (for example when the data is bi-temporal, or when it's incoherent, arrives out of order, or is dirty), then the Lambda architecture cannot be used, because the speed layer is naturally separated from the batch datasets. Another example is when the calculation is a type of statistical function that needs to process all the data at once.
Kappa architecture
Sometimes it's also acceptable to forgo the batch layer altogether and rebuild views by replaying the stream, significantly simplifying the system design – this is called the Kappa architecture.
References: