Let's build a small dataset of followers:
Follower | Followee
John     | Barack
Pat      | Barack
Gary     | Barack
Chris    | Mitt
Rob      | Mitt
Our goal is to find out how many followers each node has, that is, the number of incoming edges (the in-degree) of each node. Let's store this data in two files: nodes.csv and edges.csv.
The following is the content of nodes.csv, with one node ID and name per line:
1,Barack
2,John
3,Pat
4,Gary
5,Mitt
6,Chris
7,Rob
The following is the content of edges.csv, where each line holds the follower's node ID, the followee's node ID, and the edge label:
2,1,follows
3,1,follows
4,1,follows
6,5,follows
7,5,follows
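Before loading anything into a cluster, it can help to know what answer to expect. The following is a minimal sketch in plain Python, independent of any graph framework, that counts incoming edges per node from the two files shown above. The function name follower_counts and the inline string constants are illustrative assumptions, not part of the recipe itself.

```python
import csv
import io
from collections import Counter

# Inline copies of the two files shown above (assumption: used here only
# so the sketch runs without touching the filesystem or HDFS).
NODES_CSV = """1,Barack
2,John
3,Pat
4,Gary
5,Mitt
6,Chris
7,Rob
"""

EDGES_CSV = """2,1,follows
3,1,follows
4,1,follows
6,5,follows
7,5,follows
"""


def follower_counts(nodes_text, edges_text):
    """Return {name: number of followers} for every node (hypothetical helper)."""
    # Map each node ID to its name.
    names = {}
    for row in csv.reader(io.StringIO(nodes_text)):
        if row:  # skip blank lines
            names[int(row[0])] = row[1]

    # Each edge is follower_id,followee_id,label; count edges per followee.
    incoming = Counter()
    for row in csv.reader(io.StringIO(edges_text)):
        if row:
            incoming[int(row[1])] += 1

    # Nodes with no incoming edges get a count of 0.
    return {name: incoming.get(node_id, 0) for node_id, name in names.items()}


counts = follower_counts(NODES_CSV, EDGES_CSV)
for name, n in counts.items():
    print(name, n)
```

For this dataset the sketch reports 3 followers for Barack, 2 for Mitt, and 0 for everyone else, which is the result a distributed in-degree computation over the same files should reproduce.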
You can load the files into HDFS using the following commands:
$ hdfs dfs -mkdir -p data/na
$ hdfs dfs -put nodes.csv data/na/nodes.csv
$ hdfs dfs -put edges.csv data/na/edges.csv