Reading and manipulating raw text files

You can read a raw text file using the textFile() method. Suppose you have the following tab-separated log of some purchases:

number	product_name	transaction_id	website	price	date
0	jeans	30160906182001	ebay.com	100	12-02-2016
1	camera	70151231120504	amazon.com	450	09-08-2017
2	laptop	90151231120504	ebay.ie	1500	07--5-2016
3	book	80151231120506	packt.com	45	03-12-2016
4	drone	8876531120508	alibaba.com	120	01-05-2017
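To follow along, the sample log can be written to a tab-separated file first. The snippet below is a minimal sketch in plain Python: the helper name write_sample is hypothetical, the default file name matches the one passed to textFile() in this section, and the rows (including the malformed date 07--5-2016) are copied as they appear in the listing.

```python
import csv

# Header and rows copied from the sample log; every value is kept as a string.
HEADER = ["number", "product_name", "transaction_id", "website", "price", "date"]
ROWS = [
    ["0", "jeans", "30160906182001", "ebay.com", "100", "12-02-2016"],
    ["1", "camera", "70151231120504", "amazon.com", "450", "09-08-2017"],
    ["2", "laptop", "90151231120504", "ebay.ie", "1500", "07--5-2016"],
    ["3", "book", "80151231120506", "packt.com", "45", "03-12-2016"],
    ["4", "drone", "8876531120508", "alibaba.com", "120", "01-05-2017"],
]


def write_sample(path="sample_raw_file.txt"):
    """Write the header and rows as tab-separated lines (hypothetical helper)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(HEADER)
        writer.writerows(ROWS)
    return path
```

Calling write_sample() with no arguments creates sample_raw_file.txt in the current working directory.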

Reading the file into an RDD is straightforward with the textFile() method:

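myRDD = spark.sparkContext.textFile("sample_raw_file.txt")
# Assumption: the RDD is saved to a directory named myRDD so that it can be inspected from the shell
myRDD.saveAsTextFile("myRDD")

$ cd myRDD
$ cat part-00000
number	product_name	transaction_id	website	price	date
0	jeans	30160906182001	ebay.com	100	12-02-2016
1	camera	70151231120504	amazon.com	450	09-08-2017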

As you can see, the raw lines are not very readable. We can give the data a better structure by converting the text into a DataFrame. First, we need to collect the header line as follows:

header = myRDD.first() 

Now filter out the header and split each remaining line on the tab character as follows:

textRDD = myRDD.filter(lambda line: line != header)
newRDD = textRDD.map(lambda k: k.split("\t"))
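To see what these two transformations produce without starting a Spark session, the same logic can be mimicked on a plain Python list. This is only an illustrative sketch: the list stands in for the RDD, and the names lines, text_lines, and records are made up for the example.

```python
# Plain-Python stand-in for the RDD: the header plus the first two data rows.
lines = [
    "number\tproduct_name\ttransaction_id\twebsite\tprice\tdate",
    "0\tjeans\t30160906182001\tebay.com\t100\t12-02-2016",
    "1\tcamera\t70151231120504\tamazon.com\t450\t09-08-2017",
]

header = lines[0]                                      # like myRDD.first()
text_lines = [ln for ln in lines if ln != header]      # like filter(lambda line: line != header)
records = [ln.split("\t") for ln in text_lines]        # like map(lambda k: k.split("\t"))

for record in records:
    print(record)
# ['0', 'jeans', '30160906182001', 'ebay.com', '100', '12-02-2016']
# ['1', 'camera', '70151231120504', 'amazon.com', '450', '09-08-2017']
```

Note that every field, including price, is still a string at this point, and that the filter drops any line identical to the header, not just the first one.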

We still have an RDD, but the data is now a bit better structured. Converting it into a DataFrame will provide a better view of the transactional data.

The following code creates the DataFrame; header.split("\t") provides the names of the columns:

textDF = newRDD.toDF(header.split("\t"))
textDF.show()

The output is as follows:

Figure 10: Sample of the transactional data

Now you can register this DataFrame as a temporary view and run SQL queries against it:

textDF.createOrReplaceTempView("transactions")
spark.sql("SELECT * FROM transactions").show()
spark.sql("SELECT product_name, price FROM transactions WHERE price >= 500").show()
# price was read as a string, so cast it to sort numerically rather than as text
spark.sql("SELECT product_name, price FROM transactions ORDER BY CAST(price AS INT) DESC").show()

The output is as follows:

Figure 11: Query result on the transactional data using Spark SQL
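One point worth checking in queries like these: every column was produced by split("\t"), so price is a string column, and ordering on strings is lexicographic rather than numeric. The plain-Python sketch below illustrates the difference using the five prices from the sample data; it is an illustration of the sorting behavior, not Spark itself.

```python
# Prices as they come out of split("\t"): plain strings.
prices = ["100", "450", "1500", "45", "120"]

# Lexicographic order compares character by character.
as_text = sorted(prices, reverse=True)

# Numeric order converts each value to an integer before comparing.
as_numbers = sorted(prices, key=int, reverse=True)

print(as_text)     # ['450', '45', '1500', '120', '100']
print(as_numbers)  # ['1500', '450', '120', '100', '45']
```

In Spark SQL, an explicit CAST(price AS INT) in the ORDER BY clause is one way to make the numeric intent unambiguous.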