Getting ready

This recipe uses the Yelp dataset, which Yelp provides for the Yelp Dataset Challenge. The data is divided into the following six files:

yelp_academic_dataset_business.json
yelp_academic_dataset_review.json
yelp_academic_dataset_user.json
yelp_academic_dataset_checkin.json
yelp_academic_dataset_tip.json
photos (from the photos auxiliary file)

We are going to use this data for multiple purposes throughout the book. It is a good fit for this recipe because the files are related to one another, so joins come up everywhere.

For your convenience, this data is already loaded in Amazon S3 at s3a://sparkcookbook/yelpdata. Spark provides a convenient way to access S3 using the s3a:// prefix. This is not the standard way to access S3 buckets, though; they are normally accessed using an HTTP URL, which can be written in more than one way. For example, for the sparkcookbook bucket, both of the following are valid: http://sparkcookbook.s3.amazonaws.com and http://s3.amazonaws.com/sparkcookbook.
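To make the relationship between the s3a:// form and the two HTTP forms concrete, here is a minimal Python sketch that splits an s3a:// URI into its bucket and key and rebuilds both HTTP variants. The helper name s3_locations and the dictionary keys are hypothetical and chosen only for illustration; the sketch uses nothing beyond the Python standard library and does not contact S3.

```python
from urllib.parse import urlparse


def s3_locations(s3a_uri):
    """Split an s3a:// URI and return the equivalent HTTP URL forms.

    Hypothetical helper for illustration; it only manipulates strings.
    """
    parsed = urlparse(s3a_uri)
    if parsed.scheme != "s3a":
        raise ValueError(f"expected an s3a:// URI, got: {s3a_uri}")

    bucket = parsed.netloc           # for example, "sparkcookbook"
    key = parsed.path.lstrip("/")    # for example, "yelpdata" (may be empty)
    suffix = f"/{key}" if key else ""

    return {
        "bucket": bucket,
        "key": key,
        # Bucket name as part of the host name.
        "virtual_hosted": f"http://{bucket}.s3.amazonaws.com{suffix}",
        # Bucket name as the first path segment.
        "path_style": f"http://s3.amazonaws.com/{bucket}{suffix}",
    }


locations = s3_locations("s3a://sparkcookbook/yelpdata")
for name, value in locations.items():
    print(f"{name}: {value}")
```

In the URI used in this recipe, sparkcookbook is the bucket and yelpdata is the key prefix under which the files are stored.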
