How to do it...

Let's start with the AWS portal:

  1. Go to http://aws.amazon.com, and log in with your username and password.
  2. Once logged in, navigate to Storage | S3 | Create Bucket.
  3. Enter the bucket name, for example, com.infoobjects.wordcount. Please make sure you enter a unique bucket name (no two S3 buckets can have the same name globally).
  4. Select a region, click on Create, and then click on the name of the bucket you created.
  5. Click on Create Folder and enter words as the folder name.
  6. Create the sh.txt text file on the local filesystem:
    $ echo "to be or not to be" > sh.txt
  7. Navigate to words | Upload | Add Files, and choose sh.txt from the dialog box.
  8. Click on Start Upload.
  9. Select sh.txt and click on Properties; this shows the details of the file.
  10. Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables (a configuration-based alternative is sketched after this list).
  11. Open the Spark shell, and load the words directory from S3 into the words dataset:
scala> val words = spark.read.textFile("s3a://com.infoobjects.wordcount/words")
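
If the environment variables from step 10 are not an option, the same credentials can be handed to the s3a connector through the Hadoop configuration instead. This is a minimal sketch for the Spark shell; the placeholder key strings are assumptions you would replace with your own values:

scala> // the s3a connector reads these properties when no environment
scala> // variables or instance credentials are available
scala> spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<your-access-key>")
scala> spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<your-secret-key>")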

Now the dataset is loaded, and you can continue doing regular transformations and actions on the dataset.
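
For example, a word count over the loaded dataset needs only the usual Dataset operations. This is a minimal sketch, assuming the default spark-shell session (where the implicit encoders from spark.implicits._ are already in scope):

scala> // split each line into words, group identical words, and count them
scala> val wordCounts = words.flatMap(_.split("\\s+")).groupByKey(identity).count()
scala> wordCounts.show()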

You can also load data from S3 directly into DataFrames in other formats. For example, here's how you would load JSON data:

scala> val ufos = spark.read.format("json").load("s3a://infoobjects.ufo/ufos")
Sometimes there is confusion between the s3://, s3n://, and s3a:// schemes. s3n:// refers to a regular object sitting in the S3 bucket, readable and writable by the outside world; this older connector puts a 5 GB limit on the file size. s3a:// is its successor: it also stores regular objects, but it removes the 5 GB limit and is the recommended scheme today. s3:// (in older Hadoop versions) refers to a block-based filesystem layered on top of the S3 bucket; it requires you to dedicate a bucket to the filesystem, is not interoperable with other S3 tools, and places no limit on file size.
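
Writing back to S3 goes through the same s3a:// scheme. As a rough sketch, the following saves the ufos DataFrame to the bucket in the Parquet format (the output path here is hypothetical):

scala> // write the DataFrame back to S3 as Parquet, replacing any previous output
scala> ufos.write.mode("overwrite").parquet("s3a://infoobjects.ufo/ufos_parquet")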