- Start the Spark shell:
$ spark-shell
- Do the imports:
scala> import org.apache.spark.ml.classification.LogisticRegression
scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
- Create a tuple for Lebron, who is a basketball player, is 80 inches tall, and weighs 250 lbs:
scala> val lebron = (1.0,Vectors.dense(80.0,250.0))
- Create a tuple for Tim, who is not a basketball player, is 70 inches tall, and weighs 150 lbs:
scala> val tim = (0.0,Vectors.dense(70.0,150.0))
- Create a tuple for Brittany, who is a basketball player, is 80 inches tall, and weighs 207 lbs:
scala> val brittany = (1.0,Vectors.dense(80.0,207.0))
- Create a tuple for Stacey, who is not a basketball player, is 65 inches tall, and weighs 120 lbs:
scala> val stacey = (0.0,Vectors.dense(65.0,120.0))
- Create a training DataFrame:
scala> val training = spark.createDataFrame(Seq(lebron, tim, brittany, stacey)).toDF("label", "features")
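Before fitting, it can help to confirm that the DataFrame has the shape the estimator expects: each tuple becomes one row, with the `Double` as the label and the `Vector` as the features. An optional sanity check:

```scala
scala> training.printSchema   // label: double, features: vector
scala> training.show          // four rows, one per player
```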
- Create a LogisticRegression estimator:
scala> val estimator = new LogisticRegression
- Create a transformer by fitting the estimator with the training DataFrame:
scala> val transformer = estimator.fit(training)
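The fitted model is a `LogisticRegressionModel`, so you can inspect what it learned. A quick look (the exact numbers depend on the solver's convergence):

```scala
scala> transformer.coefficients   // learned weights for the height and weight features
scala> transformer.intercept      // learned bias term
```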
- Now, let's create some test data for John, who is 90 inches tall, weighs 270 lbs, and is a basketball player:
scala> val john = Vectors.dense(90.0,270.0)
- Create more test data for Tom, who is 62 inches tall, weighs 120 lbs, and is not a basketball player:
scala> val tom = Vectors.dense(62.0,120.0)
- Create a test data DataFrame:
scala> val test = spark.createDataFrame(Seq(
(1.0, john),
(0.0, tom)
)).toDF("label", "features")
- Do the prediction using the transformer:
scala> val results = transformer.transform(test)
- Print the schema of the results DataFrame:
scala> results.printSchema
root
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
As you can see, besides prediction, the transformer has also added rawPrediction and probability columns.
- Print the DataFrame results:
scala> results.show
+-----+------------+--------------------+--------------------+----------+
|label|    features|       rawPrediction|         probability|prediction|
+-----+------------+--------------------+--------------------+----------+
|  1.0|[90.0,270.0]|[-61.884758625897...|[1.32981373684616...|       1.0|
|  0.0|[62.0,120.0]|[31.4607691062275...|[0.99999999999997...|       0.0|
+-----+------------+--------------------+--------------------+----------+
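The probability column is derived from rawPrediction: for binary logistic regression, rawPrediction holds one raw margin per class, and pushing the class-1 margin through the logistic (sigmoid) function gives the class-1 probability. A minimal sketch in plain Scala of that relationship (the margin value here is made up, merely similar in scale to the output above):

```scala
// Logistic (sigmoid) function: maps a raw margin to a probability in (0, 1)
def sigmoid(margin: Double): Double = 1.0 / (1.0 + math.exp(-margin))

// A large positive margin for class 1 yields a probability very close to 1.0,
// which is why the probability column above is so extreme
val p = sigmoid(61.88)   // hypothetical margin, not taken from the model
```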
- Let's select only features and prediction:
scala> val predictions = results.select("features", "prediction")
scala> predictions.show
+------------+----------+
| features|prediction|
+------------+----------+
|[90.0,270.0]| 1.0|
|[62.0,120.0]| 0.0|
+------------+----------+
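Since the test DataFrame carries the true labels, you can also score the predictions with Spark's built-in evaluator. A sketch using MulticlassClassificationEvaluator, whose default label and prediction column names match the ones used above:

```scala
scala> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
scala> val evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy")
scala> val accuracy = evaluator.evaluate(results)   // 1.0 when both rows are classified correctly
```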