Serializing and deserializing data

As we harvest data from web APIs under rate-limit constraints, we need to store it. Because the data is processed on a distributed cluster, we also need consistent ways to save state and retrieve it for later use.

Let's now define serialization, persistence, marshalling, and caching or memoization.

Serializing a Python object converts it into a stream of bytes. The object may need to be retrieved beyond the scope of its existence, after the program has shut down. A serialized Python object can be transferred over a network or stored in persistent storage. Deserialization is the opposite: it converts the stream of bytes back into the original Python object so the program can carry on from the saved state. The most popular serialization library in Python is pickle. As a matter of fact, PySpark commands are transferred over the wire to the worker nodes as pickled data.
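A minimal round-trip with the standard-library pickle module illustrates this (the record shown is an arbitrary example object, not data from the book):

```python
import pickle

# An example Python object to serialize
record = {"user": "alice", "followers": 1024, "tags": ["spark", "python"]}

blob = pickle.dumps(record)      # serialize: object -> stream of bytes
restored = pickle.loads(blob)    # deserialize: bytes -> original object

assert restored == record        # the round trip preserves the state
```

The byte stream in `blob` is what could be written to disk or sent over a network.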

Persistence saves a program's state data to disk or a database so that the program can carry on where it left off upon restart: a Python object is saved from memory to a file or a database and loaded later with the same state.
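As a sketch of persistence using only the standard library, the shelve module gives a dictionary-like object backed by a file on disk; the state written by one run of a program can be reloaded by a later one (the file path and keys here are illustrative):

```python
import os
import shelve
import tempfile

# A file path for the persistent store (illustrative location)
path = os.path.join(tempfile.mkdtemp(), "state")

# First run: save the program's state to disk
with shelve.open(path) as db:
    db["progress"] = {"page": 42, "done": False}

# Later run: reload the state and carry on where we left off
with shelve.open(path) as db:
    state = db["progress"]

assert state == {"page": 42, "done": False}
```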

Marshalling sends Python code or data over a network connection, typically TCP, between the cores of a multicore system or the nodes of a distributed system.

Caching keeps a representation of a Python object in memory, for instance as a dictionary key, so that a previously computed result can be reused later on. Spark supports pulling a dataset into a cluster-wide, in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small reference dataset or running an iterative algorithm such as Google PageRank.
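In plain Python, the same idea appears as memoization: results are cached in memory, keyed by the call's arguments, so repeated calls skip the computation. A minimal sketch with the standard-library `functools.lru_cache` (the function and its body are illustrative stand-ins for an expensive lookup):

```python
from functools import lru_cache

calls = []  # records which inputs were actually computed

@lru_cache(maxsize=None)
def fetch_reference(key):
    calls.append(key)        # runs only on a cache miss
    return key.upper()       # stand-in for an expensive computation

fetch_reference("spark")
fetch_reference("spark")     # served from the in-memory cache
fetch_reference("python")

assert calls == ["spark", "python"]  # "spark" was computed only once
```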

Caching is a crucial concept for Spark, as it allows us to keep RDDs in memory, with spillover to disk if needed. The caching strategy can be selected based on the lineage of the data, that is, the DAG (short for Directed Acyclic Graph) of transformations applied to the RDDs, in order to minimize shuffling: heavy data exchange across the network. To achieve good performance with Spark, beware of data shuffling. A good partitioning policy, combined with RDD caching and the avoidance of unnecessary action operations, leads to better performance with Spark.
