Data in cloud: comparing cloud databases

The public cloud adoption is changing how enterprises host and manage their database systems.
With a range of Database as a Service (DBaaS) options, and public cloud providers relieved many organizations from mundane and costly database management tasks like provisioning, updates, upgrades and administration.

Let’s compare database products that some of the verndors – Amazon, GCP, and Pivotal – are offering.

Data warehousing solutions: Amazon, Google, Pivotal

BigQuery is Google’s petabyte-scale cloud data warehouse.
Redshift is Amazon’s petabyte-scale cloud data warehouse.
Greenplum is Pivotal’s petabyte-scale data warehouse rhat can be run on premised and in the cloud – and it’s open-source.

First let’s compare Amazon Redshift vand BigQuery, and then we’ll get to Greenplum:

Basics

  • Both Redshift and BigQuery are mature, BigQuery being couple of years older than Redshift
  • Both provide immediate consistency, user-defined functions (in Redshift they can be implemented as Python functions, while in BigQuery as SQL expressions or JavaScript)
  • Both support SQL: Reshift supports it since it’s based on PostgreSQL 8, and Google BigQuery supports full set of ANSI SQL
  • Both support stored procedures/server side scripting

Data types: Arrays and JSON support

  • Arrays are supported only in BigQuery and Greenplum
  • JSON is supported only in BigQuery (as STRUCT type) and Greenplum
  • Redshift, just like BigQuery, supports most of SQL data types, but it doesn’t support arrays and JSON types.

Amazon Redshift and Pivotal Greenplum comparison: MPP databases

Since both Redshift and Greenplum have similar architecture, and both are based on PostgreSQL, let’s review their similarities.
Client applications talk to a leader node and the leader talks to the compute notes in the cluster to carry out database operations. Data is partitioned/sliced according to the distribution keys defined per table.

  • Both architectures rely on hi-bandwidth interconnections between the leader node and the compute nodes.
  • Both are optimized for querying large datasets for analytics, business intelligence and data mining.
  • Both are Massively Parallel Processing databases
  • Both allow unique, primary key, and foreign key constraints, but they are for information only. They are not enforced by the system, though they are used by the query planner.

Google’s BigQuery doesn’t use indexes or foreign keys.

Column-based or Row-based storage. Let’s review the storage types of these three databases.

Columnar Data Storage: Instead of storing data as a series of rows, Amazon Redshift organizes the data by column. Unlike row-based systems, which are ideal for transaction processing, column-based systems are ideal for data warehousing and analytics, where queries often involve aggregates performed over large data sets. Since only the columns involved in the queries are processed and columnar data is stored sequentially on the storage media, column-based systems require far fewer I/Os, greatly improving query performance.

Compression in Redshift benefits from column-based storage

BigQuery uses columnar storage that supports semi-structured data — nested and repeated fields.

Greenplum allows you to pick either one – Row or Column-Oriented Storage

  • Row-oriented storage: good for OLTP types of workloads with many iterative transactions and many columns of a single row needed all at once, so retrieving is efficient.
  • Column-oriented storage: good for data warehouse workloads with aggregations of data computed over a small number of columns, or for single columns that require regular updates without modifying other column data.

NoSQL offerings from Amazon and GCP.

Google Cloud Datastore vs Amazon DocumentDB

Google Cloud Datastore was released in 2008, and Amazon DocumentDB in 2019.
You’d pick one of these databases if you were looking for automatically scaling schema-free NoSQL Database vs fully managed schema-free Mongo-compatible storage.

  • Amazon DocumentDB offers immediate consistency
  • Google Cloud Datastore offers immediate consistency or eventual consistency – depending on type of query and configuration you want
  • Foreign keys are supported only in Google Cloud Datastore, via Reference Properties or Ancestor paths
  • Transaction support: ACID in Datastore, with Amazon DocumentDB adding support in November 2020
  • Triggers support: Google Datastore supports triggers before and after put() and delete() operations in Appengine Standard
  • Additionally, Google Datastore supports MapReduce via GCP DataFlow

How does Google Cloud Datastore compare to Amazon DynamoDB (Released in 2012)?

DynamoDB is a document store, key-value store – while Google Datastore is a document store

  • Both are schema-free
  • Secondary indexes are supported in both databases
  • SQL support: there’s no SQL support in Amazon DynamoDB. Datastore supports SQL-like query language (GQL), and Amazon caught up with Google in 2019 with PartiQL.
  • Trigger support: Datastore the triggers are supported via callbacks for pre/post load, put, delete operations in Appengine; and in DynamoDB via AWS lambdas for insert, modify and remove events.
  • Transactions support: Google Datasore is ACID-comliant. As for AWS – since 2018 Dynamo DB also supports atomicity, consistency, isolation, and durability (ACID) across one or more tables within a single AWS account and region.

How does Google Cloud Datastore compare to Amazon SimpleDB (Released in 2007)?

SimpleDB is a key-value store – while Google Datastore is a document store

  • Schema: Both SimpleDB and Datastore are schema-free
  • Secondary indexes support: both SimpleDB and Datastore support secondary indices, in fact SimpleDB creates index on every field
  • SQL support: only simple SQL select queries are supported in SimpleDB. Google Datastore supports SQL-like query language (GQL) which allows more elaborate select statements.
  • Trigger support: Datastore the triggers are supported via callbacks for pre/post load, put, delete operations in Appengine; but Amazon SimpleDB has no trigger support.
  • Foreign keys are supported only in Google Cloud Datastore, via Reference Properties or Ancestor paths
  • Transactions support: Google Datasore is ACID-comliant, but there’s no transactions suppor in Amazon SimpleDB

Additionally, Google has Bigtable as a NoSQL wide-columnar database (like Cassandra, HBase, Microsoft Azure Cosmos DB).
Just like AWS DynamoDB it’s a BASE (eventually consistent) database, but – in contrast to DynamoDB – Google allows a Bigtable configuration that provides strong consistency – all you need to do is cofigure single-cluster routing in your application and use other clusters only for failover.
Google Bigtable is the same database that powers many core Google services, like Google Search, Analytics, Maps, and Gmail.
Bigtable provides some ACID features: atomic single-row operations
There’s no support for foreign keys, no triggers, and no indexes – and Bigtable database is blazingly fast.

Fully managed RDBMS offerings from Amazon and GCP.

We will compare fully managed database services from AWS and GCP.

  • Amazon offers 6 flavors: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server instances.
  • Google offers 3 flavors: PostgreSQL, MySQL, and SQL Server instances

The key here is fully-managed: this will lower your costs as you get automated backups, better scalability, and 99.95% availability anywhere in the world.

You also get replication on schedule, automatic patches and updates – all you need to do is set a maintenance window.

Tasks like cloning a DB is a single-screen operation in console, very easy to do, and the same goes for a failover.

In addition you can use REST API or gcloud/aws CLI commands:

  • Single step in GCP:
   gcloud sql instances clone [SOURCE_INSTANCE_NAME] [TARGET_INSTANCE_NAME]
  • Two steps in AWS:
   aws rds create-db-snapshot \
    --db-instance-identifier db7-old-instance \
    --db-snapshot-identifier db7-old-snapshot

   aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier db7-new-instance \
    --db-snapshot-identifier db7-old-snapshot \
    --db-instance-class db.t3.small

Or if you only want to add a read replica in AWS:

aws rds create-db-instance-read-replica \
--db-instance-identifier db7-old-instance-repl \
--source-db-instance-identifier db7-old-instance

Google Cloud SQL is an implementation of RDBMs on top of Google’s Compute Engine and Persistent Disk.

You get to configure your options: storage type, the initial storage capacity, failover replica, automated backups, binary logging, and you can get automatic storage capacity expansion with your instance.

Google gives you a simplified settings page, which is very easy to read.

Comparing to Amazon – Aurora, for example does show you all of the MySQL settings and you can change them.

Both AWS and Google provide well-tuned defaults.

Connecting to an instance

Let’s take a look at GCP example since AWS has a similar approach.

Connecting to an instance is possible either from a Private IP that’s available in VPC, or externally via public IP – but a public IP must be authorized using either the Cloud SQL proxy or authorized networks.

Other benefits

With backups enabled you can perform the point-in-time recovery just like that, and it’s very quick:

gcloud sql instances clone [SOURCE_INSTANCE_NAME] [NEW_INSTANCE_NAME] \
--point-in-time [TIMESTAMP]

Scaling up or down your instance is a single line command, and you can go as low as 1 vCPU/4GB RAm to 96 vCPUs/640GB and anything in between.

Assuming your instance was created as db-n1-standard-1 (1 vCPU, 4GB RAM) and you want to scale it to db-n1-highmem-4 (4 vCPUs, 28GB RAM):

  • create instance
gcloud sql instances create sql_instance1 --tier=db-n1-standard-1 --activation-policy=ALWAYS
  • scale to db-n1-highmem-4
gcloud sql instances patch sql_instance1 --tier=db-n1-highmem-4 --activation-policy=ALWAYS

When the machine is scaled (either via CLI or GUI or REST API), it will restart, so you will have some downtime because the volume need to be moved to a different instance – usually a few minutes.

Business IntelLigence

At the end, we should also mention a BI tool that Google provides: GCP Looker.

It’s an enterprise platform for business intelligence, data applications, and embedded analytics.

You can connect, analyze, and visualize data across Google Cloud, Azure, AWS, on-premises databases.

Looker can connect to most of the popular databases, for example: Amazon Aurora, Redshift, Google BigQuery, Google Cloud SQL, Google Cloud Spanner, Greenplum, IBM Netezza, IBM DB2, including dashDB on cloud, MariaDB, Microsoft Azure SQL Database, Microsoft Azure PostgreSQL, Microsoft SQL Server (MSSQL), MongoDB, MySQL, MySQL on Amazon RDS, Oracle, PostgreSQL, PostgreSQL on Heroku, SAP HANA, Snowflake, Teradata and many more.