Getting ready

This recipe uses the Yelp data that Yelp provides for the Yelp Data Challenge. The data is divided into the following six files:

  • yelp_academic_dataset_business.json
  • yelp_academic_dataset_review.json
  • yelp_academic_dataset_user.json
  • yelp_academic_dataset_checkin.json
  • yelp_academic_dataset_tip.json
  • photos (from the photos auxiliary file)

We are going to use this data for multiple purposes across the book. It suits this recipe well because working with it requires joins almost everywhere.

For your convenience, this data is already loaded in Amazon S3 at s3a://sparkcookbook/yelpdata. Spark provides a convenient way to access S3 using the s3a prefix. This is not the standard way to access S3 buckets, though; S3 buckets are normally accessed using an HTTP URL. There are a few ways to specify the URL. For example, in the case of the sparkcookbook bucket, the following options are valid: http://sparkcookbook.s3.amazonaws.com and http://s3.amazonaws.com/sparkcookbook.
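
The addressing forms described above can be sketched in a few lines of Python. This is only an illustration: the helper names (s3a_uri, virtual_hosted_url, path_style_url) are hypothetical, and the assumption that each JSON file sits directly under the yelpdata prefix with the filename listed earlier is not confirmed by the text.

```python
# Hypothetical helpers that build the S3 addresses discussed in the text.
BUCKET = "sparkcookbook"
PREFIX = "yelpdata"

# JSON files named in the list above. Their exact location under the
# prefix is an assumption made for illustration.
JSON_FILES = [
    "yelp_academic_dataset_business.json",
    "yelp_academic_dataset_review.json",
    "yelp_academic_dataset_user.json",
    "yelp_academic_dataset_checkin.json",
    "yelp_academic_dataset_tip.json",
]


def s3a_uri(bucket: str, key: str = "") -> str:
    """Return the s3a:// form that Spark accepts as a path."""
    key = key.strip("/")
    return f"s3a://{bucket}/{key}" if key else f"s3a://{bucket}"


def virtual_hosted_url(bucket: str) -> str:
    """Return the HTTP form with the bucket name in the host."""
    return f"http://{bucket}.s3.amazonaws.com"


def path_style_url(bucket: str) -> str:
    """Return the HTTP form with the bucket name in the path."""
    return f"http://s3.amazonaws.com/{bucket}"


if __name__ == "__main__":
    print(s3a_uri(BUCKET, PREFIX))
    print(virtual_hosted_url(BUCKET))
    print(path_style_url(BUCKET))
    for name in JSON_FILES:
        print(s3a_uri(BUCKET, f"{PREFIX}/{name}"))
```

In a Spark session that has the S3A connector and credentials configured, a path built this way could then be passed to a reader such as spark.read.json.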