Social Data Analytics at Scale – Spark and Amazon Web Services

In the age of big data, we face data-handling problems that did not exist before, typically characterized by the three Vs (volume, variety, and velocity). When we handle very large amounts of data, we have to change our approach entirely. For example, algorithms can no longer rely on exhaustive brute force, because such an approach might take years to complete; instead, we use intelligent filtering to reduce the search space. Another example is data with very high dimensionality, such as text analysis, where every word or combination of words in the vocabulary constitutes a dimension, and we need to adapt our algorithms to such scenarios.

Advances in cluster computing have given us a new tool to handle the challenge of big data. We no longer think of performing an analysis on a single node (your computer); we have progressed to thinking in terms of clusters of resources. Of course, cluster systems existed before, but they were not accessible to everybody. Now it is much simpler to set up a cluster for analysis purposes.

Scaling refers to increasing the computational resources available to perform a task. There are two fundamental types of scaling: vertical and horizontal.

Vertical scaling is where we increase the capacity of our current resources so that they are able to handle larger loads of data. For example, if our dataset is too large for our current node, we could increase the node's total disk space or the amount of random access memory (RAM) appropriately. This approach is simpler because you don't have to deal with networking and node resource management. However, it has limitations: you cannot increase a single node's capacity indefinitely, because machines of the required size either don't exist or are too expensive.

Horizontal scaling is where we combine multiple nodes to increase the accumulated resources of the entire cluster. In the previous example, if our dataset were too large for our current node, instead of changing the type of node, we could cut the dataset into multiple subsets (chunks) and assign them to different nodes. In the final step, we would combine the results of each subset to get the global result for the entire dataset. This approach has, in principle, no upper limit, because as the data grows we can keep adding nodes to the cluster.
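
The following is a minimal sketch of this chunk-and-combine idea in plain Python (not yet Spark), assuming a simple word-count task; the use of the standard multiprocessing module to stand in for separate nodes, and all function and variable names, are illustrative only.

```python
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Process one chunk of documents independently, as a single node would."""
    counts = Counter()
    for doc in chunk:
        counts.update(doc.lower().split())
    return counts

if __name__ == '__main__':
    documents = [
        "big data needs big clusters",
        "spark scales horizontally",
        "clusters combine many nodes",
        "nodes hold chunks of data",
    ]

    # Split the dataset into chunks, one per worker (standing in for nodes)
    n_chunks = 2
    chunks = [documents[i::n_chunks] for i in range(n_chunks)]

    # Each worker processes only its own chunk
    with Pool(n_chunks) as pool:
        partial_counts = pool.map(count_words, chunks)

    # Combine the partial results into the global result
    total = sum(partial_counts, Counter())
    print(total.most_common(3))
```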

However, here we encounter a different set of problems: no single node has knowledge of the entire dataset, and combining the results of the subsets into a global result can be difficult. One of the early, well-known examples of scaling out is Google's web indexing, for which Google invented the famous MapReduce programming model. The idea behind it is simple: instead of treating all the contents of the web in a single global context, the data is broken down into chunks; each chunk produces its results in a map stage, and a reduce stage then combines the partial results to build the global context. Today, we have even more evolved systems, such as Spark, that take MapReduce to yet another level.
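
To make the map and reduce stages concrete, here is a minimal PySpark word-count sketch. It assumes a local Spark installation with the pyspark package available; the input strings are made up for the example.

```python
from pyspark import SparkContext

# Run Spark locally, using all available cores
sc = SparkContext("local[*]", "wordcount-sketch")

# A tiny in-memory dataset; in practice this would be loaded from HDFS or S3
lines = sc.parallelize([
    "spark takes mapreduce to another level",
    "mapreduce breaks data into chunks",
])

counts = (lines
          .flatMap(lambda line: line.split())   # map: split lines into words
          .map(lambda word: (word, 1))          # map: emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))     # reduce: sum counts per word

print(counts.collect())
sc.stop()
```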

In this chapter, we will focus on horizontal scaling and cover the following topics:

  • Distributed computing on Spark
  • Text mining with Spark
  • Parallel computing
  • Distributed computing with Celery