We are going to use Wikipedia's page link data to calculate PageRank. Wikipedia publishes its data in the form of a database dump. The link data we are going to use comes in two files:
- links-simple-sorted.txt
- titles-sorted.txt
I have put both of them on Amazon S3, at s3a://com.infoobjects.wiki/links and s3a://com.infoobjects.wiki/nodes respectively. Since the dataset is large, it is recommended that you run the computation on either Databricks Cloud or EMR.
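Before loading the files on a cluster, it can help to see how the two files fit together. The following is a minimal sketch in plain Python, not the recipe's own code. It assumes that each line of links-simple-sorted.txt has the form `from: to1 to2 ...` with numeric page IDs, and that titles-sorted.txt holds one title per line, where the 1-based line number is the page ID. The function names `parse_link_line` and `index_titles` and the sample data are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_link_line(line: str) -> Tuple[int, List[int]]:
    """Split an assumed 'from: to1 to2 ...' line into (source, targets)."""
    source, _, targets = line.partition(":")
    return int(source), [int(t) for t in targets.split()]


def index_titles(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Pair each title with its assumed 1-based line number as the page ID."""
    for page_id, title in enumerate(lines, start=1):
        yield page_id, title.strip()


# Tiny made-up sample standing in for the real files.
sample_links = ["1: 2 3", "2: 3", "3: 1"]
sample_titles = ["Alpha\n", "Beta\n", "Gamma\n"]

# Flatten adjacency lines into (source, destination) edge pairs.
edges = [
    (src, dst)
    for src, dsts in map(parse_link_line, sample_links)
    for dst in dsts
]
titles = dict(index_titles(sample_titles))
```

The edge pairs correspond to the graph's edges and the ID-to-title pairs to its vertices, which is the shape a graph library would typically expect as input for a PageRank computation.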