One of the unique features of Spark is the ability to persist RDDs in memory. You can persist an RDD with the cache or persist methods, as shown in the following:
>>> myRDD.cache()
>>> myRDD.persist()
Both of the preceding statements are equivalent and cache the data at the MEMORY_ONLY storage level. The difference is that cache always uses the MEMORY_ONLY storage level, whereas persist can take any of the storage levels shown in the following table. The first time the RDD is computed by an action, it is kept in memory on the nodes. The easiest way to see what fraction of an RDD is cached, and its size, is to check the Storage tab in the Spark UI, as shown in Figure 3.11:
RDDs can be stored at different storage levels according to application requirements. The following table shows Spark's storage levels and their meanings.
Spark's storage levels provide different trade-offs between memory usage and CPU efficiency. Use the following guidelines to select one:
- Use the default MEMORY_ONLY level if your RDDs fit comfortably in memory; it is the most CPU-efficient option.
- Otherwise, try MEMORY_ONLY_SER with a fast serialization library for better compactness while keeping access reasonably fast. This does not matter for Python, because objects are always serialized with the pickle library.
- Use MEMORY_AND_DISK only if re-computing the RDD is more expensive than reading it from disk.
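To illustrate the Python note above, the following sketch (plain Python, no Spark required; the sizes are illustrative) shows why serialized storage is more compact: pickle packs a list of small objects into one contiguous byte string, avoiding the per-object overhead of keeping them as live Python objects:

```python
import pickle
import sys

# 100,000 small integers, as a cached partition might hold them
# as deserialized Python objects.
data = list(range(100_000))

# Rough in-memory cost: the list itself plus one object per element.
object_size = sys.getsizeof(data) + sum(sys.getsizeof(x) for x in data)

# Serialized form: a single byte string, which is how PySpark stores
# partitions (it always pickles, so MEMORY_ONLY and MEMORY_ONLY_SER
# behave the same in Python).
serialized = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
serialized_size = len(serialized)

print(object_size, serialized_size)
```

The pickled form is several times smaller than the object graph, at the cost of deserializing (unpickling) the data every time it is accessed.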