One of the unique features of Spark is the ability to persist RDDs in memory. You can persist an RDD with the cache or persist methods, as shown in the following:
>>> myRDD.cache()
>>> myRDD.persist()
Both of the preceding statements are equivalent and cache the data at the MEMORY_ONLY storage level. The difference is that cache always uses the MEMORY_ONLY storage level, whereas persist can take any of the storage levels shown in the following table. The first time the RDD is computed by an action, it is kept in memory on the nodes. The easiest way to see what fraction of an RDD is cached, and its size, is to check the Storage tab in the Spark UI, as shown in Figure 3.11:
RDDs can be stored at different storage levels according to application requirements. The following table shows Spark's storage levels and their meanings.
Spark's storage levels provide different trade-offs between memory usage and CPU efficiency. Use the following guidelines to select one:
- Use the default MEMORY_ONLY level if your RDDs fit comfortably in memory; it is the most CPU-efficient option.
- Otherwise, try MEMORY_ONLY_SER with a fast serialization library for better compactness while keeping access reasonably fast. This does not matter for Python, because objects are always serialized with the pickle library.
- Use MEMORY_AND_DISK only if re-computing the RDD is more expensive than reading it from disk.
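To illustrate the Python note above, the following sketch (plain Python, no Spark required; the sizes are illustrative) shows why serialized storage is more compact: pickle packs a list of small objects into one contiguous byte string, avoiding the per-object overhead of keeping them as live Python objects:

```python
import pickle
import sys

# 100,000 small integers, as a cached partition might hold them
# as deserialized Python objects.
data = list(range(100_000))

# Rough in-memory cost: the list itself plus one object per element.
object_size = sys.getsizeof(data) + sum(sys.getsizeof(x) for x in data)

# Serialized form: a single byte string, which is how PySpark stores
# partitions (it always pickles, so MEMORY_ONLY and MEMORY_ONLY_SER
# behave the same in Python).
serialized = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
serialized_size = len(serialized)

print(object_size, serialized_size)
```

The pickled form is several times smaller than the object graph, at the cost of deserializing (unpickling) the data every time it is accessed.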