Google Cloud Dataproc

As mentioned earlier, Google Cloud Dataproc is a managed Spark and Hadoop service from Google. Because it is managed and runs in the cloud, users can shut clusters down when they are not needed, which can cut costs considerably. Dataproc is therefore not only simple and time-saving but also cost-effective.
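To make the cost argument concrete, here is a rough back-of-the-envelope sketch. The hourly rate, node count, and usage pattern are all hypothetical values chosen for illustration, not real GCP prices, and the flat-rate model ignores storage, networking, and any Dataproc service fee.

```python
def cluster_cost(hourly_rate_per_node: float, nodes: int, hours: float) -> float:
    """Cost of running `nodes` machines for `hours` hours at a flat hourly rate."""
    return hourly_rate_per_node * nodes * hours


# Hypothetical inputs (assumptions for illustration only, not GCP pricing).
RATE = 0.25                 # assumed cost per node per hour
NODES = 5                   # e.g. 1 master + 4 workers
HOURS_IN_MONTH = 30 * 24    # cluster left running all month
JOB_HOURS = 30 * 2          # cluster only needed about 2 hours per day

always_on = cluster_cost(RATE, NODES, HOURS_IN_MONTH)
on_demand = cluster_cost(RATE, NODES, JOB_HOURS)
savings = always_on - on_demand

print(f"always on: {always_on:.2f}")
print(f"on demand: {on_demand:.2f}")
print(f"saved:     {savings:.2f}")
```

Under these assumed numbers, the cluster that is turned off between jobs costs a small fraction of the always-on one; the actual ratio depends entirely on how many hours the cluster is really needed.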

Like other managed services from Google, Dataproc can be driven through the GCP APIs. We will get into the details later in this chapter. While the initial vision of Dataproc was to provide managed Hadoop and Spark, it now offers managed support for open source Apache Hive, Pig, Hadoop, and Spark, integrates with Cloud Storage and BigQuery through connectors, and is monitored by Stackdriver. Like Hadoop, Dataproc has master, client, and worker node configurations. Master nodes manage how data is stored in HDFS and coordinate the parallel operations run with MapReduce, while worker nodes store the data and run the computations.
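As a small illustration of the master and worker split, the sketch below builds the kind of JSON body that the Dataproc v1 REST API accepts when creating a cluster. It only constructs and prints the request; it does not call the API. The project ID, cluster name, region, and machine type are placeholder assumptions, and a real request would also need authentication and usually more configuration.

```python
import json


def build_cluster_request(project_id, cluster_name, num_workers,
                          machine_type="n1-standard-4"):
    """Return a minimal cluster definition with one master and N workers.

    Field names follow the Dataproc v1 REST resource (camelCase JSON).
    All values passed in here are placeholders chosen by the caller.
    """
    return {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            # Master node: coordinates HDFS metadata and job scheduling.
            "masterConfig": {
                "numInstances": 1,
                "machineTypeUri": machine_type,
            },
            # Worker nodes: store the data and run the computations.
            "workerConfig": {
                "numInstances": num_workers,
                "machineTypeUri": machine_type,
            },
        },
    }


# Hypothetical identifiers, used only for illustration.
PROJECT = "example-project"
REGION = "us-central1"

body = build_cluster_request(PROJECT, "example-cluster", num_workers=2)
endpoint = (
    "https://dataproc.googleapis.com/v1/"
    f"projects/{PROJECT}/regions/{REGION}/clusters"
)

print("POST", endpoint)
print(json.dumps(body, indent=2))
```

Keeping the request as plain data makes it easy to inspect or version before handing it to whichever client (REST, gcloud, or a client library) actually creates the cluster.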

Apart from these, the Hadoop web interfaces for YARN (resource management), HDFS, and MapReduce are also available on the cluster. These web interfaces can be reached through an SSH tunnel, typically combined with a SOCKS proxy.
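The following sketch shows one way the SSH and SOCKS proxy access could be set up. It only assembles the `gcloud compute ssh` command and the URLs to open; it does not run anything. It assumes the usual Dataproc convention that the master node is named after the cluster with an `-m` suffix, and the default Hadoop web UI ports. The cluster name, project, zone, and local port are placeholders.

```python
import shlex

# Default Hadoop web UI ports (assumptions: these can be reconfigured,
# and the HDFS NameNode UI is on 50070 instead of 9870 with Hadoop 2).
WEB_UI_PORTS = {
    "yarn": 8088,                # YARN ResourceManager
    "hdfs": 9870,                # HDFS NameNode (Hadoop 3)
    "mapreduce_history": 19888,  # MapReduce Job History server
}


def socks_tunnel_command(cluster_name, project, zone, local_port=1080):
    """Build a gcloud command that opens a SOCKS proxy via the master node."""
    master = f"{cluster_name}-m"  # assumed master node naming convention
    return [
        "gcloud", "compute", "ssh", master,
        f"--project={project}",
        f"--zone={zone}",
        "--",                     # everything after this is passed to ssh
        "-D", str(local_port),    # dynamic (SOCKS) port forwarding
        "-N",                     # do not open a remote shell
    ]


def web_ui_url(cluster_name, service):
    """URL of a web UI as seen from inside the cluster's network."""
    return f"http://{cluster_name}-m:{WEB_UI_PORTS[service]}"


# Hypothetical identifiers, used only for illustration.
cmd = socks_tunnel_command("example-cluster", "example-project", "us-central1-a")
print(shlex.join(cmd))
for name in WEB_UI_PORTS:
    print(name, "->", web_ui_url("example-cluster", name))
```

With the tunnel running, a browser configured to use `localhost:1080` as a SOCKS proxy can open the printed URLs as if it were inside the cluster's network.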
