- Load pres.csv to HDFS:
$ hdfs dfs -put pres.csv
- Start the Spark shell:
$ spark-shell
- Import the required linear algebra classes:
scala> import org.apache.spark.mllib.linalg.Vectors
scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
- Load pres.csv as an RDD:
scala> val data = sc.textFile("pres.csv")
- Transform data into an RDD of dense vectors:
scala> val parsedData = data.map( line =>
Vectors.dense(line.split(',').map(_.toDouble)))
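The parsing step above just splits each comma-separated line and converts the fields to doubles. A minimal plain-Python sketch of the same transformation (the sample rows are hypothetical; the real pres.csv rows are assumed to be comma-separated numbers):

```python
# Hypothetical stand-in for one partition of the text RDD.
lines = ["1.0,2.0,3.0", "4.0,5.0,6.0"]

def parse_line(line):
    # Same logic as line.split(',').map(_.toDouble) in the Scala snippet.
    return [float(x) for x in line.split(",")]

parsed = [parse_line(l) for l in lines]
print(parsed)  # → [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```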
- Create RowMatrix from parsedData:
scala> val mat = new RowMatrix(parsedData)
- Compute the SVD, keeping the top two singular values and also computing the U factor:
scala> val svd = mat.computeSVD(2, computeU = true)
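To see what `computeSVD(2, computeU = true)` produces, here is an illustrative NumPy sketch (not Spark) of a truncated SVD on a small hypothetical matrix: the full decomposition is taken and only the top-two factors are kept, which is what asking for `k = 2` gives you.

```python
import numpy as np

# Hypothetical small document-term matrix standing in for pres.csv.
A = np.array([[3.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 3.0],
              [2.0, 2.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top-2 singular values and vectors, like computeSVD(2, true).
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k].T

print(s_k)  # the two largest singular values, in descending order
```

The full factors always reconstruct the original matrix exactly (`U @ diag(s) @ Vt == A` up to floating point); the truncated rank-2 factors give the closest rank-2 approximation.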
- Calculate the U factor (the left singular vectors):
scala> val U = svd.U
- Calculate s, the vector of singular values:
scala> val s = svd.s
- Calculate the V factor (the right singular vectors):
scala> val V = svd.V
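How the three factors are read in latent semantic analysis can be sketched as follows (assumption, not from the source: each row of the input matrix is one document and each column one term; the counts below are made up):

```python
import numpy as np

# Hypothetical document-term counts: one row per document.
docs = np.array([[2.0, 0.0, 1.0],
                 [0.0, 3.0, 1.0],
                 [2.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(docs, full_matrices=False)

# Row i of U: document i's weight on each latent concept.
# s: strength of each concept.  Rows of Vt: term loadings per concept.
doc_scores = U[:, :2] * s[:2]  # project documents onto the top-2 concepts

print(doc_scores.shape)  # → (3, 2)
```

Comparing rows of `doc_scores` is what lets you say one document scores higher than another on a latent concept.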
If you look at s, you will see that a much higher score was assigned to the NPR article than to the Fox article.