Read-only broadcast variables

Broadcast variables are variables shared by the driver node; that is, the node running the IPython notebook in our configuration, with all the nodes in the cluster. It's a read-only variable, as the variable is broadcast by one node and never read back if another node changes it.

Let's now see how it works in a simple example: we want to one-hot encode a dataset containing just gender information as a string. The dummy dataset contains just a feature that can be male M, female F, or unknown U (if the information is missing). Specifically, we want all the nodes to use the defined one-hot encoding, as listed in the following dictionary:

In: one_hot_encoding = {"M": (1, 0, 0), "F": (0, 1, 0),
"U": (0, 0, 1)}

In our solution, we first broadcast the Python dictionary (calling the broadcast method provided by the SparkContext, sc) inside the mapped function; using its value property, we can now access it. After doing this, we have a generic map function that can work on any one-hot map dictionary:

In: bcast_map = sc.broadcast(one_hot_encoding)
def bcast_map_ohe(x, shared_ohe):
return shared_ohe[x]

(sc.parallelize(["M", "F", "U", "F", "M", "U"])
.map(lambda x: bcast_map_ohe(x, bcast_map.value))
.collect())

Broadcast variables are saved in-memory in all the nodes composing a cluster; therefore, they never share a large amount of data, which can fill them up and make the following processing impossible.

To remove a broadcast variable, use the unpersist method on the broadcast variable. This operation will free up the memory of that variable on all the nodes:

In: bcast_map.unpersist()
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset