Summary

The most important thing to keep in mind when working with large datasets in matplotlib is to use your data wisely, taking advantage of NumPy and tools such as PyTables. Moving to distributed data means taking on a significant infrastructure burden compared to working with data on a single machine. As datasets approach terabytes and petabytes, the greatest work involved has less to do with plotting and visualization and more to do with deciding what to visualize and how to get there. An increasingly common aspect of big data is real-time analysis, where matplotlib might be used to generate hundreds or thousands of plots from a fairly small set of data points. Not all problems in big data visualization are about visualizing big data!
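
For example, a minimal sketch of this idea might read only a slice of an on-disk array with PyTables before handing it to matplotlib; the file name data.h5 and the node /measurements here are hypothetical placeholders:

    # Read only what is needed from disk; the full array never has to
    # fit in memory at once.
    import tables
    import matplotlib.pyplot as plt

    with tables.open_file('data.h5', mode='r') as h5:
        node = h5.root.measurements      # an on-disk array node
        sample = node[:100000]           # pull just one slice into RAM

    plt.plot(sample, linewidth=0.5)
    plt.title('First 100,000 measurements')
    plt.show()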

Finally, it cannot be overstated that knowing your data is the most crucial component when tackling large datasets. It is very rare that an entire raw dataset is what you want to present to your colleagues or end users. Gaining an intuition for your data through an initial process of exploration will enable you to select the appropriate presentation approaches, whether that means binning the data into a simple histogram, decimating it, or simply providing statistical summaries. The biggest problem users face with big data is not getting lost, either in its sheer size or in the complex ecosystem of tools and fads that has grown up around the big data buzz. Careful thinking and an eye towards simplicity will make all the difference in having a successful experience with large datasets in matplotlib.
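
As a rough sketch, here is what those three presentation approaches might look like on synthetic data, with a million normally distributed points standing in for a real dataset:

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.random.normal(size=1000000)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Binning: collapse every point into one of 100 histogram buckets.
    ax1.hist(data, bins=100)
    ax1.set_title('Binned (100 bins)')

    # Decimation: plot every 100th point instead of all one million.
    ax2.plot(data[::100], linewidth=0.5)
    ax2.set_title('Decimated (1 in 100)')

    # Statistical summary: often more informative than any raw plot.
    print('mean={:.4f}  std={:.4f}  min={:.4f}  max={:.4f}'.format(
        data.mean(), data.std(), data.min(), data.max()))

    plt.tight_layout()
    plt.show()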
