Getting ready

We are going to use Wikipedia's page link data to calculate PageRank. Wikipedia publishes its data in the form of database dumps. The link data we are going to use comes in two files:

  • links-simple-sorted.txt
  • titles-sorted.txt
I have put both of them on Amazon S3, at s3a://com.infoobjects.wiki/links and s3a://com.infoobjects.wiki/nodes respectively. Since the dataset is large, it is recommended that you run this recipe on either Databricks Cloud or EMR.
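
Before loading the files into a cluster, it can help to see how their lines might map onto graph edges and vertices. The following is a minimal sketch in plain Python, not the recipe's own code. It assumes that each line of links-simple-sorted.txt has the form `source: target target ...` with integer page IDs, and that the line number of each title in titles-sorted.txt (starting at 1) is that page's ID. Neither layout is stated in this section, so check a few lines of the real files first. The function names `parse_links_line` and `index_titles` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_links_line(line: str) -> List[Tuple[int, int]]:
    """Turn one 'source: target target ...' line into (source, target) pairs.

    The line layout is an assumption, not something this section confirms.
    """
    source, _, targets = line.partition(":")
    if not source.strip():
        return []  # blank line: no edges
    src = int(source)
    return [(src, int(t)) for t in targets.split()]


def index_titles(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Pair each title with its 1-based line number, assumed to be the page ID."""
    for page_id, title in enumerate(lines, start=1):
        yield page_id, title.strip()


if __name__ == "__main__":
    # Tiny made-up sample, not real Wikipedia data.
    print(parse_links_line("1: 2 3"))
    print(list(index_titles(["First_page\n", "Second_page\n"])))
```

The same two mappings carry over to a distributed setting: the first produces the edges of the link graph, and the second produces the vertices with their titles attached.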