Use pre-emptible instances in your Dataproc clusters

Hadoop (and Spark) jobs constitute perhaps the single most important use case for organizations moving to the cloud. So, getting your Hadoop strategy right really matters, and Dataproc is a fairly obvious way to get started. One important cost optimization you ought to perform is to use as many pre-emptible instances as possible.

Recall that pre-emptible instances are those that the platform can take back at very short notice. If you are using a pre-emptible VM instance, it could be snatched away at any point, with only about 30 seconds to run a shutdown script and clean up your state.
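To make that 30-second window concrete, here is a minimal sketch of how a process running on a pre-emptible VM can watch for the pre-emption notice via the GCE metadata server's instance/preempted key and trigger its own cleanup. On Dataproc you normally don't handle this yourself, and the cleanup_state helper below is a purely hypothetical placeholder.

```python
import requests

# Minimal sketch: watch for the pre-emption notice from inside a pre-emptible
# VM. The GCE metadata server exposes an instance/preempted key whose value
# flips to TRUE when the VM is being reclaimed; wait_for_change turns the
# request into a hanging GET that returns only when the value changes.
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"


def cleanup_state() -> None:
    # Hypothetical placeholder: flush buffers, checkpoint progress, upload
    # partial results -- whatever fits inside the ~30-second window.
    print("Pre-emption notice received; cleaning up...")


def wait_for_preemption() -> None:
    resp = requests.get(
        METADATA_URL,
        params={"wait_for_change": "true"},
        headers={"Metadata-Flavor": "Google"},
    )
    if resp.text.strip() == "TRUE":
        cleanup_state()


if __name__ == "__main__":
    wait_for_preemption()
```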

The flip side of this inconvenience is that pre-emptible VM instances are very cheap. On an apples-to-apples basis, you can expect a pre-emptible instance to cost about 60-80% less than a non-pre-emptible instance with the same specs. And here's the kicker: Hadoop has fault tolerance built in, which makes it a perfect setting in which to exploit the affordability of pre-emptible instances.
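To see what that discount means at the cluster level, here is a back-of-the-envelope sketch. The hourly rate and the exact discount are hypothetical placeholders chosen from within the 60-80% range above; plug in current prices for your machine type and region.

```python
# Back-of-the-envelope savings estimate with hypothetical prices.
REGULAR_RATE = 0.20          # $/hour per regular worker (hypothetical)
PREEMPTIBLE_DISCOUNT = 0.70  # 70% cheaper, i.e. within the 60-80% range above


def hourly_worker_cost(num_regular: int, num_preemptible: int) -> float:
    preemptible_rate = REGULAR_RATE * (1 - PREEMPTIBLE_DISCOUNT)
    return num_regular * REGULAR_RATE + num_preemptible * preemptible_rate


# A 10-worker cluster: all regular vs. half pre-emptible.
print(round(hourly_worker_cost(10, 0), 2))  # 2.0 $/hour
print(round(hourly_worker_cost(5, 5), 2))   # 1.3 $/hour, roughly a 35% saving
```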

Recall that Hadoop is the big daddy of distributed computing apps: it practically invented horizontal scaling, in which large clusters of commodity hardware are assembled and managed by central orchestration software. This use of commodity hardware means that Hadoop always expects bad things to happen to the nodes in a cluster: it has elaborate mechanisms for sharding, replication, and handling node failures gracefully.

In fact, within a Dataproc cluster, all of the pre-emptible VMs are collectively placed inside a Managed Instance Group, and the platform takes responsibility for clearing away the old pre-empted VMs so that they don't clog up your cluster.

There are some guidelines to keep in mind while allocating pre-emptible VMs to your Dataproc clusters. Note that Dataproc's pre-emptible workers are secondary workers: they run processing tasks but do not store HDFS data, so the non-pre-emptible primary workers carry the storage load. If your Hadoop jobs are skewed toward map-only work and do not rely on HDFS a whole lot, you can probably push the envelope and use even 80-90% pre-emptible VMs without seeing performance degradation. On the other hand, if your Hadoop jobs tend to have a lot of shuffling, then using more than 50% pre-emptible VMs might be a bad idea: the pre-emption of many VMs at once can significantly slow down your job, and the additional fault-tolerance processing might even end up increasing the total cost. A sketch of creating a cluster with roughly half its workers pre-emptible follows below.
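As a concrete illustration, here is a rough sketch of creating such a cluster with the google-cloud-dataproc Python client, keeping pre-emptible secondary workers at about half the total worker count. The project ID, cluster name, machine types, and instance counts are placeholder assumptions, not recommendations.

```python
from google.cloud import dataproc_v1  # pip install google-cloud-dataproc

# Sketch: a Dataproc cluster where roughly half the workers are pre-emptible
# secondary workers. All names and sizes below are hypothetical placeholders.
project_id = "my-project"
region = "us-central1"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "shuffle-heavy-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        # Primary workers: non-pre-emptible, they also hold the HDFS data.
        "worker_config": {"num_instances": 4, "machine_type_uri": "n1-standard-4"},
        # Secondary workers: pre-emptible, processing only -- kept at ~50% of
        # the total worker count for a shuffle-heavy workload.
        "secondary_worker_config": {
            "num_instances": 4,
            "preemptibility": "PREEMPTIBLE",
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Created cluster: {operation.result().cluster_name}")
```

You can also resize the secondary worker group later without recreating the cluster, which makes it easy to experiment with the pre-emptible fraction for a given workload.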
