Persistence and caching

One of Spark's distinguishing features is the ability to persist RDDs in memory. You can persist an RDD with the persist or cache method, as shown in the following:

>>> myRDD.cache()
>>> myRDD.persist()

Both of the preceding statements have the same effect: they cache the data at the MEMORY_ONLY storage level. The difference is that cache always uses the MEMORY_ONLY storage level, whereas persist accepts a different storage level as an argument when needed; the available levels are listed in the table in the Storage levels section. Neither call computes anything by itself. The first time the RDD is computed by an action, it is kept in memory on the nodes. The easiest way to see what percentage of an RDD is cached, and its size, is to check the Storage tab in the UI, as shown in Figure 3.11:

Figure 3.11: Cached RDD – percentage and size cached.
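The persist-then-act sequence described above can be sketched as follows. This is a minimal illustration, not the book's own code: persist_and_materialize and main are hypothetical names, and the local[2] master, the application name, and the sample data are assumptions made for the example. The sketch relies only on the RDD methods persist, count, getStorageLevel, and unpersist, and on the is_cached attribute.

```python
def persist_and_materialize(rdd, level):
    """Mark `rdd` for persistence at `level`, then force it to be computed.

    Hypothetical helper (not part of Spark). `rdd` is expected to be a
    PySpark RDD and `level` a pyspark.StorageLevel constant such as
    StorageLevel.MEMORY_AND_DISK.
    """
    rdd.persist(level)  # lazy: only records the requested storage level
    rdd.count()         # an action: computes the partitions and stores them
    return rdd.is_cached, rdd.getStorageLevel()


def main():
    # Imported here so the helper above can be read without Spark installed.
    from pyspark import SparkContext, StorageLevel

    # Assumed local setup for the example; adjust the master for a cluster.
    sc = SparkContext("local[2]", "persist-demo")
    try:
        my_rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
        cached, level = persist_and_materialize(
            my_rdd, StorageLevel.MEMORY_AND_DISK
        )
        print(cached, level)
        my_rdd.unpersist()  # release the cached partitions when done
    finally:
        sc.stop()
```

Calling main() in an environment with PySpark installed runs the example end to end. Note that Spark does not let you change the storage level of an RDD that already has one assigned, so call unpersist before persisting the same RDD at a different level.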

Storage levels

RDDs can be stored using different storage levels as needed by application requirements. The following table shows the storage levels of Spark and their meaning.

Storage Level                     Meaning
--------------------------------  ------------------------------------------------------------
MEMORY_ONLY                       Store RDDs in memory only. A partition that does not fit in memory will be re-computed.
MEMORY_AND_DISK                   Store RDDs in memory. A partition that does not fit in memory will be stored on disk.
MEMORY_ONLY_SER                   Store RDDs in memory only, but as serialized Java objects.
MEMORY_AND_DISK_SER               Store RDDs in memory and on disk as serialized Java objects.
DISK_ONLY                         Store the RDDs on disk only.
MEMORY_ONLY_2, MEMORY_AND_DISK_2  Same as MEMORY_ONLY and MEMORY_AND_DISK, but replicate every partition for faster fault recovery.
OFF_HEAP (experimental)           Store RDDs in Tachyon, which provides less GC overhead.

What level to choose?

Spark's storage levels provide different trade-offs between memory usage and CPU efficiency. Follow this process to select one:

  • If the entire RDD fits in memory, choose MEMORY_ONLY.
  • If it does not fit, use MEMORY_ONLY_SER for a more compact representation; the serialized objects take less space but cost extra CPU to deserialize when they are read. This does not matter for Python because objects are always serialized with the pickle library.
  • Use MEMORY_AND_DISK if re-computing is more expensive than reading from disk.
  • Do not replicate the RDD storage unless fast fault recovery is needed.
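
The selection process above can be written down as a small decision function. This is only a sketch of one reading of the list: choose_storage_level and its parameters are hypothetical, the order of the checks is an assumption, and the function returns the level's name as a string rather than a Spark object so that it runs without Spark installed.

```python
def choose_storage_level(fits_in_memory,
                         fits_when_serialized=False,
                         recompute_is_expensive=False,
                         need_fast_recovery=False):
    """Return the name of a storage level, following the list above.

    Hypothetical helper; the ordering of the checks is an assumption.
    """
    if fits_in_memory:
        level = "MEMORY_ONLY"
    elif fits_when_serialized:
        # More compact, at the cost of extra CPU on access.
        level = "MEMORY_ONLY_SER"
    elif recompute_is_expensive:
        # Spill partitions that do not fit to disk instead of recomputing.
        level = "MEMORY_AND_DISK"
    else:
        # Assumption: fall back to MEMORY_ONLY and let Spark re-compute
        # the partitions that do not fit in memory.
        level = "MEMORY_ONLY"

    # Replicate only when fast fault recovery is needed, and only for the
    # two replicated levels listed in the storage level table.
    if need_fast_recovery and level in ("MEMORY_ONLY", "MEMORY_AND_DISK"):
        level += "_2"
    return level
```

With PySpark available, the returned name can be turned into the matching constant with getattr(StorageLevel, name) and passed to persist.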