Loading data from Amazon S3

If Spark is a MapReduce killer, Amazon S3 is an HDFS killer. S3 comes close to the ideal of what cloud storage can be. It is a foundational service on Amazon Web Services (AWS), and almost every application running on AWS uses it for storage. Other AWS services also use S3 extensively; the following are a few examples:

  • Amazon Kinesis uses S3 as target storage
  • Amazon Elastic MapReduce (EMR) offers S3 as one of its storage modes
  • Amazon Elastic Block Store (EBS) uses S3 to store snapshots
  • Amazon Relational Database Service (RDS) uses S3 to store snapshots
  • Amazon Redshift uses S3 for data staging
  • Amazon DynamoDB uses S3 for data staging

The following are some of the salient features of S3:

  • 11 nines (99.999999999%) of durability
  • 4 nines (99.99%) of availability
  • A typical cost of about $30/TB per month, with lower-cost options available
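
To make these figures concrete, here is a small back-of-the-envelope sketch in Python. It assumes that "11 nines of durability" means each object has a 10^-11 chance of being lost in a given year, that "4 nines of availability" is measured over a 365-day year, and that storage costs a flat $30/TB per month. The function names are made up for illustration and are not part of any AWS or Spark API.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def expected_annual_losses(num_objects, nines=11):
    """Expected number of objects lost per year at the given durability."""
    annual_loss_probability = 10 ** -nines  # 11 nines -> 1e-11 (assumed reading)
    return num_objects * annual_loss_probability


def max_annual_downtime_hours(nines=4):
    """Hours per year a service may be unavailable at the given availability."""
    return HOURS_PER_YEAR * 10 ** -nines


def monthly_cost_usd(terabytes, usd_per_tb=30.0):
    """Monthly storage bill at a flat per-terabyte rate (assumed, no tiers)."""
    return terabytes * usd_per_tb


if __name__ == "__main__":
    print(expected_annual_losses(10_000_000))  # objects lost per year, on average
    print(max_annual_downtime_hours() * 60)    # minutes of downtime per year
    print(monthly_cost_usd(5))                 # dollars per month for 5 TB
```

Under these assumptions, storing 10 million objects works out to roughly 0.0001 expected losses per year (about one object every 10,000 years), 4 nines of availability allows a little under an hour of downtime per year, and 5 TB costs about $150 per month.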

Amazon Simple Storage Service (S3) provides developers and IT teams with a secure, durable, and scalable storage platform. The biggest advantage of Amazon S3 is that there is no upfront IT investment, and companies can add capacity as they need it, at the click of a button.

Though Amazon S3 can be used with any compute platform, it integrates particularly well with Amazon's cloud services, such as Amazon Elastic Compute Cloud (EC2) and Amazon Elastic Block Store (EBS). As a result, companies that use Amazon Web Services are likely to have significant data already stored on Amazon S3.

This makes a good case for loading data into Spark from Amazon S3, and that is exactly what this recipe is about.
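
As a rough illustration of what such a load can look like, here is a minimal PySpark sketch. It is not the recipe's own code. It assumes an existing SparkSession with the Hadoop S3A connector (hadoop-aws) on the classpath and AWS credentials already configured. The helper names, the bucket, and the key are hypothetical.

```python
def s3a_uri(bucket: str, key: str) -> str:
    """Build an s3a:// URI from a bucket name and an object key."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def load_text_from_s3(spark, bucket: str, key: str):
    """Read a text object from S3 into a DataFrame with one row per line.

    `spark` is an existing SparkSession. The S3A connector and the AWS
    credentials are assumed to be set up outside this function.
    """
    return spark.read.text(s3a_uri(bucket, key))


# Hypothetical usage, given a configured SparkSession named `spark`:
#   lines = load_text_from_s3(spark, "my-bucket", "logs/sample.txt")
#   print(lines.count())
```

Keeping the URI construction in its own function makes it easy to switch the bucket or key without touching the read call.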
