Handling large datasets with R

The second weakness of those mentioned earlier was related to the handling of large datasets. Where does this weakness come from? It is rooted in the core of the language: R is in-memory software, meaning that every object created and managed within an R script is stored in your computer's RAM. Consequently, the total size of your data cannot exceed the total size of your RAM, and in practice it must be smaller still, since other software is always consuming part of that RAM. A full treatment of this problem is out of the scope of this book. Nevertheless, we can briefly summarize the answers to it into three main strategies:

  • Optimizing your code, profiling it with packages such as profvis, and applying programming best practices (a brief profiling sketch follows this list).
  • Relying on external data storage and wrangling tools, such as Spark, MongoDB, and Hadoop. We will reason a bit more about this in later chapters.
  • Changing R's memory-handling behavior with packages such as ff, filehash, R.huge, or bigmemory, which avoid RAM overloading by keeping objects on disk (see the second sketch below).
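
To give a flavor of the first strategy, here is a minimal profiling sketch using profvis. The computation inside it is a deliberately inefficient toy example of my own, not code from this book: growing a vector element by element forces R to copy it repeatedly, and the profile makes that cost visible.

    library(profvis)

    profvis({
      n <- 1e4
      # Slow pattern: growing a vector inside a loop triggers repeated copies
      out <- c()
      for (i in seq_len(n)) {
        out <- c(out, sqrt(i))
      }
      # Vectorized equivalent of the same computation, far cheaper
      out_fast <- sqrt(seq_len(n))
    })

Running this in RStudio opens an interactive flame graph showing where time and memory are spent, line by line, so you know exactly where optimization effort will pay off.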

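As an illustration of the third strategy, here is a minimal sketch with bigmemory; the matrix dimensions and file names are illustrative, not taken from this book. A file-backed big.matrix is memory-mapped from a file on disk, so it can be larger than the available RAM while still supporting familiar matrix syntax.

    library(bigmemory)

    # Create a matrix backed by a file on disk instead of living in RAM
    x <- filebacked.big.matrix(
      nrow = 1e6, ncol = 10, type = "double",
      backingfile = "big_data.bin",
      descriptorfile = "big_data.desc"
    )

    x[1, ] <- rnorm(10)  # read and write with ordinary matrix indexing
    mean(x[, 1])         # only the needed slice is pulled into memory
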
The main point I would like to stress here is that even this weakness can be overcome. You should bear this in mind when you first encounter it on your journey to R mastery.

One final note: as the price of computational power continues to fall, the issue of handling large datasets will become an increasingly negligible one.
