Broadcast variables are variables shared by the driver node; that is, the node running the IPython notebook in our configuration, with all the nodes in the cluster. It's a read-only variable, as the variable is broadcast by one node and never read back if another node changes it.
Let's now see how it works in a simple example: we want to one-hot encode a dataset containing just gender information as a string. The dummy dataset contains just a feature that can be male M, female F, or unknown U (if the information is missing). Specifically, we want all the nodes to use the defined one-hot encoding, as listed in the following dictionary:
In: one_hot_encoding = {"M": (1, 0, 0), "F": (0, 1, 0),
"U": (0, 0, 1)}
In our solution, we first broadcast the Python dictionary (calling the broadcast method provided by the SparkContext, sc) inside the mapped function; using its value property, we can now access it. After doing this, we have a generic map function that can work on any one-hot map dictionary:
In: bcast_map = sc.broadcast(one_hot_encoding)
def bcast_map_ohe(x, shared_ohe):
return shared_ohe[x]
(sc.parallelize(["M", "F", "U", "F", "M", "U"])
.map(lambda x: bcast_map_ohe(x, bcast_map.value))
.collect())
Broadcast variables are saved in-memory in all the nodes composing a cluster; therefore, they never share a large amount of data, which can fill them up and make the following processing impossible.
To remove a broadcast variable, use the unpersist method on the broadcast variable. This operation will free up the memory of that variable on all the nodes:
In: bcast_map.unpersist()