How to do it...

  1. Start the Spark shell:
        $ spark-shell
  1. Create a DataFrame of house price and size:
        scala> val houses = spark.createDataFrame(Seq(
                 (1620000d, 2100),
                 (1690000d, 2300),
                 (1400000d, 2046),
                 (2000000d, 4314),
                 (1060000d, 1244),
                 (3830000d, 4608),
                 (1230000d, 2173),
                 (2400000d, 2750),
                 (3380000d, 4010),
                 (1480000d, 1959)
               )).toDF("price", "size")
  1. Compute the correlation:
        scala> val correlation = houses.stat.corr("price","size")
        correlation: Double = 0.8577177736252574
Since we did not specify an algorithm, corr uses Pearson by default. The corr method is overloaded to accept the algorithm name as a third parameter. A value of about 0.86 indicates a strong positive correlation: in this data, larger houses tend to cost more.
  1. Compute the correlation with Pearson passed explicitly as a parameter:
        scala> val correlation = houses.stat.corr("price","size","pearson")
        correlation: Double = 0.8577177736252574
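
To see what this number measures, here is a minimal plain-Scala sketch of the Pearson coefficient, computed as the sum of products of deviations from the mean divided by the square root of the product of the summed squared deviations. It is not Spark's implementation. The object `PearsonSketch` and the helper `pearson` are hypothetical names introduced only for this illustration, and the data is the same ten houses used in the recipe.

```scala
// A plain-Scala sketch of the Pearson correlation coefficient (no Spark needed).
// PearsonSketch and pearson are illustrative names, not part of the Spark API.
object PearsonSketch {

  // The same (price, size) pairs as in the recipe.
  val houses: Seq[(Double, Int)] = Seq(
    (1620000d, 2100), (1690000d, 2300), (1400000d, 2046), (2000000d, 4314),
    (1060000d, 1244), (3830000d, 4608), (1230000d, 2173), (2400000d, 2750),
    (3380000d, 4010), (1480000d, 1959)
  )

  val prices: Seq[Double] = houses.map(_._1)
  val sizes: Seq[Double] = houses.map(_._2.toDouble)

  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.nonEmpty && xs.size == ys.size, "inputs must be non-empty and of equal length")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    // Sum of products of deviations from the means
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    // Sums of squared deviations for each variable
    val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
    val syy = ys.map(y => (y - meanY) * (y - meanY)).sum
    sxy / math.sqrt(sxx * syy)
  }

  def main(args: Array[String]): Unit = {
    // Should agree with the value Spark reports, up to floating-point rounding.
    println(pearson(prices, sizes))
  }
}
```

Because the coefficient depends only on deviations from the mean, scaled by each variable's spread, it would stay the same if prices were expressed in thousands or sizes in square meters.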