BigQuery

BigQuery is probably the single most compelling reason to adopt GCP right now. It is a data warehousing service that is fast, price-competitive, and easy to use. Unlike some other Google Cloud services, BigQuery is widely used, so its behavior is well understood and predictable. I like to joke that if you ask a Google Cloud sales professional a question, no matter what it is, the answer that comes back is, "Just use BigQuery for that!"

In this chapter, you will learn about the following:

BigQuery as Google's fully managed, petabyte-scale, serverless database
Architecture of BigQuery
Working with BigQuery using the web console
Working with BigQuery using the CLI

BigQuery competes with proprietary data warehousing solutions such as Teradata, but has obvious and major advantages over them, notably that it is cloud-based, serverless, and supports auto-scaling (so that you really pay only for what you need). There is no need for prohibitively expensive purchases of proprietary hardware.

Within the world of cloud providers, BigQuery probably competes most directly with Amazon's Redshift, and comparisons between these two technologies get folks quite riled up. In a nutshell, Redshift lets you provision nodes (similar to Bigtable or Spanner in GCP), and the more you provision, the better the performance, but also the higher the cost. With BigQuery, on the other hand, you don't provision a cluster, create indices, or really do any ops at all. The advantage is convenience; the downside is that you have less control over performance and essentially no control over how your queries are executed. This can take some getting used to. The idea that you don't create indices, can't specify failover replicas, and can't interface with the underlying hardware at all is quite different from the traditional way of doing either OLTP or OLAP.
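To make the cost trade-off concrete, here is a minimal sketch of the two billing shapes described above. It assumes the provisioned model charges per node-hour whether or not queries are running, and that the serverless model charges per terabyte scanned by queries; the text does not state either pricing rule, and every rate and workload figure below is a made-up placeholder, not a real Redshift or BigQuery price.

```python
TB = 10**12  # bytes per (decimal) terabyte, used only for this illustration


def provisioned_cost(nodes: int, hourly_rate: float, hours: float) -> float:
    """Cost of a cluster that is billed for every hour it exists."""
    return nodes * hourly_rate * hours


def on_demand_cost(bytes_scanned: int, price_per_tb: float) -> float:
    """Cost of a serverless service billed by the data that queries scan."""
    return bytes_scanned / TB * price_per_tb


# Hypothetical workload: a 4-node cluster kept up for a 720-hour month,
# versus queries that scan 50 TB in total over the same month.
cluster = provisioned_cost(nodes=4, hourly_rate=1.0, hours=720)
queries = on_demand_cost(bytes_scanned=50 * TB, price_per_tb=5.0)

print(f"provisioned: {cluster:.2f}")
print(f"on-demand:   {queries:.2f}")
```

With different placeholder numbers the comparison can easily flip, which is the point: in the provisioned model cost follows the size of the cluster, while in the serverless model it follows how much work the queries actually do.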

In the previous chapters, we saw the storage options that GCP offers for various use cases. For blob storage, we have Cloud Storage buckets; for VM-attached storage, Compute Engine provides persistent disks; for schema-strict relational databases, we have Cloud SQL and Cloud Spanner; and for NoSQL databases, we have seen Cloud Bigtable and Datastore. All of these options require some degree of administration on the user's end, which takes time, skills, and expert administrators. Google's BigQuery goes a step further. It is a large-scale (typically petabyte-scale), fully managed data warehouse. This frees admins from managing databases so they can focus more on analysis.

To explain this further, you do not need to deploy resources, keep track of them, or even worry about scaling them. All of this is handled by Google, so you can run your queries directly and save the ones you need to retrieve later. Just like other GCP resources, BigQuery queries are managed and tracked under a project. BigQuery supports CSV, JSON, Datastore backups, and Apache Avro as input data formats.
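As a small illustration of the input formats mentioned above, the sketch below converts CSV text into newline-delimited JSON (one object per line), which is the JSON layout BigQuery load jobs generally expect. It uses only the Python standard library; the function name `csv_to_ndjson` and the sample data are hypothetical, and nothing here calls BigQuery itself.

```python
import csv
import io
import json


def csv_to_ndjson(csv_text: str) -> str:
    """Convert CSV text with a header row into newline-delimited JSON."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Emit one JSON object per line, with no enclosing array.
    return "\n".join(json.dumps(row) for row in reader)


# Hypothetical sample data for illustration only.
sample_csv = "name,visits\nalice,3\nbob,5\n"
ndjson = csv_to_ndjson(sample_csv)
print(ndjson)
```

Note that the `csv` module reads every value as a string, so `visits` comes out as `"3"` rather than `3`. In a real pipeline you would cast columns to the types your table schema expects before loading.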
