How to do it...

  1. Go to https://community.cloud.databricks.com.
  1. Click on Sign Up.
  1. Choose COMMUNITY EDITION (or the full platform).
  1. Fill in the details and you will be presented with the landing page.
  1. Click on Clusters, then Create Cluster.
  1. Enter the cluster name, for example, myfirstcluster, and choose the Availability Zone (more about AZs in the next recipe). Then click on Create Cluster.
  1. Once the cluster is created, the blinking green signal next to it will turn solid green.
  1. Now go to Home and click on Notebook. Choose an appropriate notebook name, for example, config, and choose Scala as the language.
  1. Then set the AWS access parameters. There are two access parameters:
    • ACCESS_KEY: This is referred to as fs.s3a.access.key in SparkContext's Hadoop configuration.
    • SECRET_KEY: This is referred to as fs.s3a.secret.key in SparkContext's Hadoop configuration.
  1. Set ACCESS_KEY in the config notebook:
        sc.hadoopConfiguration.set("fs.s3a.access.key", "<replace with your key>")
  1. Set SECRET_KEY in the config notebook:
        sc.hadoopConfiguration.set("fs.s3a.secret.key", "<replace with your secret key>")
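As an optional sanity check (not part of the original recipe), you can read the values back from the Hadoop configuration to confirm that both keys are registered, without printing them:

        // Optional: confirm the keys are set (avoid printing the actual values)
        assert(sc.hadoopConfiguration.get("fs.s3a.access.key") != null)
        assert(sc.hadoopConfiguration.get("fs.s3a.secret.key") != null)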
  1. Load a folder from the sparkcookbook bucket (all of the data for the recipes in this book is available in this bucket):
        val yelpdata = spark.read.textFile("s3a://sparkcookbook/yelpdata")
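Since yelpdata is a Dataset[String], you can inspect it right away. The following is a minimal sketch; the exact contents and count depend on what is currently in the bucket:

        // Quick inspection of the loaded Dataset[String]
        yelpdata.take(3).foreach(println)   // print the first three lines
        println(yelpdata.count())           // total number of lines in the folder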
  1. The problem with the previous approach is that if you were to publish your notebook, your keys would be visible. To avoid this, use the Databricks File System (DBFS).
DBFS is Databricks Cloud's internal file system. It is a layer above S3: it mounts S3 buckets in a user's workspace and caches frequently accessed data on worker nodes.
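DBFS paths can be browsed like a regular file system. As a small illustrative sketch (the exact listing depends on your workspace), you can list the DBFS root from the notebook:

        // List the DBFS root; each entry is a FileInfo with path, name, and size
        dbutils.fs.ls("/").foreach(f => println(f.path))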
  1. Set the access key in the Scala notebook:
        val accessKey = "<your access key>"
  1. Set the secret key in the Scala notebook (any / characters in the key must be URL-encoded as %2F because the key is embedded in the mount URI):
        val secretKey = "<your secret key>".replace("/", "%2F")
  1. Set the bucket name in the Scala notebook:
        val bucket = "sparkcookbook"
  1. Set the mount name in the Scala notebook:
        val mount = "cookbook"
  1. Mount the bucket:
        dbutils.fs.mount(s"s3a://$accessKey:$secretKey@$bucket", s"/mnt/$mount")
  1. Display the contents of the bucket:
        display(dbutils.fs.ls(s"/mnt/$mount"))
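With the mount in place, the bucket can be read through the /mnt path and no credentials appear in your notebook. A minimal sketch, assuming the yelpdata folder from the earlier step exists in the bucket:

        // Read a folder through the mount point instead of an s3a:// URL
        val yelpdataViaMount = spark.read.textFile(s"/mnt/$mount/yelpdata")
        yelpdataViaMount.take(3).foreach(println)
        // When the mount is no longer needed, it can be removed:
        // dbutils.fs.unmount(s"/mnt/$mount")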
The rest of the recipes will assume that you have set up AWS credentials.