Grouping and creating tables in-memory

To have a function applied to a group of rows (exactly as in the case of SQL GROUP BY), you can use two similar methods. In the following example, we want to compute the average balance per gender:

In:(df.na.fill({'gender': "U", 'balance': 0.0})
.groupBy("gender").avg('balance').show())

Out: +------+------------+
|gender|avg(balance)|
+------+------------+
| F| -0.25|
| M| 2.0|
| U| 7.5|
+------+------------+

So far, we've worked with DataFrames, but, as you've seen, the distance between DataFrame methods and SQL commands is minimal. Actually, using Spark, it is possible to register the DataFrame as a SQL table to fully enjoy the power of SQL. The table is saved in memory and distributed in a way similar to an RDD. To register the table, we need to provide a name, which will be used in future SQL commands. In this case, we decide to name it users:

In: df.registerTempTable("users")

By calling the SQL method provided by the Spark SQL context, we can run any SQL-compliant table:

In: sqlContext.sql("""
SELECT gender, AVG(balance)
FROM users
WHERE gender IS NOT NULL
GROUP BY gender""").show()

Out: +------+------------+
|gender|avg(balance)|
+------+------------+
| F| -0.25|
| M| 2.0|
+------+------------+

Not surprisingly, the table output by the command (as well as the users table itself) is of the Spark DataFrame type:

In: type(sqlContext.table("users"))

Out: pyspark.sql.dataframe.DataFrame

DataFrames, tables, and RDDs are intimately connected, and RDD methods can be used on a DataFrame. Remember that each row of the DataFrame is an element of the RDD. Let's see this in detail, and first collect the whole table:

In: sqlContext.table("users").collect()

Out: [Row(balance=10.0, gender=None, user_id=0),
Row(balance=1.0, gender='M', user_id=1),
Row(balance=-0.5, gender='F', user_id=2),
Row(balance=0.0, gender='F', user_id=3),
Row(balance=5.0, gender=None, user_id=4),
Row(balance=3.0, gender='M', user_id=5)]


In: a_row = sqlContext.sql("SELECT * FROM users").first()
print(a_row)

Out: Row(balance=10.0, gender=None, user_id=0)

The output is a list of Row objects (they look like Python's namedtuple). Let's dig deeper into this. A Row contains multiple attributes, and it's possible to access them as a property or dictionary key; that is, to get the balance from the first row, we can choose between the two following ways:

In: print(a_row['balance'])
print(a_row.balance)

Out: 10.0
10.0

Also, Row can be collected as a Python dictionary using the asDict method for Row. The result contains the property names as key and property values (as dictionary values):

In: a_row.asDict()

Out: {'balance': 10.0, 'gender': None, 'user_id': 0}
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset