Hadoop on the cloud

Hadoop and MapReduce provide a standard, cookie-cutter way of taking complicated jobs, parallelizing them across a number of machines, and gathering the results back for your convenience. The power and versatility of the MapReduce programming paradigm and the great design underlying Hadoop have spawned an entire ecosystem. One side effect of Hadoop's popularity is the rise of clustered, or distributed, computing. But it is a simple fact that configuring a cluster of distributed machines is complicated and expensive. Ask yourself how many companies or organizations you know that run raw Hadoop in fully distributed mode, without relying on a packaged distribution from a vendor such as Cloudera. The answer is probably none. That's because it takes a lot of work to get Hadoop and MapReduce going in fully distributed mode, and this is where cloud computing really comes into its own.
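To make the paradigm concrete, here is a minimal sketch of the map/reduce idea in plain, single-machine Python; it is not tied to any Hadoop API, and the function names and sample input are purely illustrative:

    # A minimal sketch of the MapReduce idea in plain Python.
    # The mapper emits (word, 1) pairs; the reducer sums the counts per key.
    # In a real Hadoop job the framework shuffles and groups pairs across machines.
    from collections import defaultdict

    def mapper(line):
        # Emit a (key, value) pair for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(key, values):
        # Combine all the values seen for one key into a single result.
        return key, sum(values)

    def run_job(lines):
        # Stand-in for the shuffle phase: group the mapper output by key.
        grouped = defaultdict(list)
        for line in lines:
            for key, value in mapper(line):
                grouped[key].append(value)
        return dict(reducer(k, v) for k, v in grouped.items())

    print(run_job(["the quick brown fox", "the lazy dog jumps over the fox"]))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1, 'jumps': 1, 'over': 1}

The value of Hadoop is that it performs the mapping, shuffling, and reducing across many machines and handles failures along the way, which is precisely what is hard to set up yourself.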

Managed Hadoop services are now offered by all of the major cloud providers: Dataproc is Google's offering, and Amazon's version is called Elastic MapReduce. The basic idea behind these managed Hadoop offerings is simple and very clever: the storage component of a traditional Hadoop cluster is separated from compute by moving the data from HDFS into buckets (GCS or S3); the compute can then be performed on cloud VMs, and those VMs can be done away with as soon as the job completes.
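To illustrate that separation, here is a hedged sketch of a PySpark word-count job of the kind one might submit to a Dataproc cluster: the input and output live in a Cloud Storage bucket rather than in HDFS, so the cluster itself holds no permanent state. The bucket name and paths are placeholders, and the snippet assumes the Cloud Storage connector that Dataproc ships with, which lets Spark read gs:// paths directly:

    # Sketch of a word-count job whose data lives in GCS rather than HDFS.
    # The bucket name and paths are placeholders for this example.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split, lower, col

    spark = SparkSession.builder.appName("gcs-wordcount").getOrCreate()

    # Read input directly from Cloud Storage instead of an hdfs:// path.
    lines = spark.read.text("gs://your-bucket/input/*.txt")

    # Classic word count expressed with the DataFrame API.
    counts = (lines
              .select(explode(split(lower(col("value")), r"\s+")).alias("word"))
              .groupBy("word")
              .count())

    # Write the results back to the bucket; the cluster can then be torn down
    # without losing anything.
    counts.write.mode("overwrite").csv("gs://your-bucket/output/wordcount")

    spark.stop()

Because nothing of value remains on the cluster's local disks, the pattern of spinning up a cluster, running the job, and deleting the cluster becomes practical, and that is exactly what makes the ephemeral, pay-per-job model work.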

This is a brilliant insight because the main drawback of gigantic Hadoop/Spark clusters is their fixed costs and low utilization. Far too many companies have made the mistake of investing in enormous Hadoop clusters with hundreds or even thousands of nodes, only to find that the cluster is rarely used. Measuring the utilization of a Hadoop cluster is not all that straightforward, but figures below 20% are fairly common. When you consider the fixed-asset investment and depreciation expenses that such a cluster entails, you can see why finance professionals tend to like cloud-based solutions: there is no depreciation, no fixed assets, and no politically charged conversation around utilization (albeit at the cost of potentially higher operating expenses).

This rise of Hadoop on the cloud has had serious negative implications for Hadoop vendors such as Cloudera and Hortonworks. The complexity of Hadoop was key to their business model, and now that the cloud providers have stripped that complexity away, they could face challenging times ahead.

It is worth mentioning, in the context of Dataproc, Google's managed Hadoop offering, that Google initially had an arrangement with Hortonworks (complete with a nice press release dated January 2015). That arrangement centered on a utility called bdutil, a command-line tool that ran HDP (the Hortonworks Data Platform) on the GCP. Then, a year later, Google launched Dataproc, which made bdutil obsolete, and there has not been a peep about a special relationship with Hortonworks since. Dataproc runs Apache Hadoop; if you really need to run HDP or Cloudera, your best bet is to use virtual machines and run a third-party orchestrator (something like Cloudera Director) on those VMs. So, in a nutshell, Dataproc is the recommended, Google-approved way to run Hadoop on the GCP; if you decide to try a third-party distro, you are largely on your own.
