Getting started with PySpark

Before going deeper, we first need to see how to create a Spark session. It can be done as follows:

spark = SparkSession \
    .builder \
    .appName("PCAExample") \
    .getOrCreate()
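
By default, the master URL is taken from the environment (for example, when the script is submitted with spark-submit). If you run the script directly on your local machine, you may need to set the master explicitly; a minimal sketch, assuming a local run using all available cores:

# "local[*]" runs Spark locally, using all available cores;
# omit .master(...) when submitting to a cluster via spark-submit
spark = SparkSession \
    .builder \
    .appName("PCAExample") \
    .master("local[*]") \
    .getOrCreate()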

Now, under this code block, you can place your own code. For example, the following creates a small DataFrame of feature vectors and fits a PCA model to it:

data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)

result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)

The preceding code computes the top three principal components of the feature vectors in the DataFrame and uses them to project each vector into a low-dimensional space. To see everything in one place, the following complete script shows how to use the PCA algorithm in PySpark:

import sys

try:
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors
    print("Successfully imported Spark modules")
except ImportError as e:
    print("Cannot import Spark modules:", e)
    sys.exit(1)

spark = SparkSession \
    .builder \
    .appName("PCAExample") \
    .getOrCreate()

# Three rows, each holding a single five-dimensional feature vector
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])

# Project the five-dimensional vectors onto the top three principal components
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)

result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)

spark.stop()

The output is as follows:

Figure 6: PCA result after successful execution of the Python script
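
Beyond the projected vectors, the fitted PCAModel also exposes the principal components matrix (model.pc) and the proportion of variance explained by each component (model.explainedVariance), which can help you decide how many components to keep. A minimal sketch, to be placed before spark.stop():

# model.pc: the 5x3 principal components matrix (one column per component)
print(model.pc)

# model.explainedVariance: the proportion of variance captured by each
# of the k = 3 components
print(model.explainedVariance)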