Big Data Analytics

Big Data (as embodied by Hadoop clusters) and Big Compute (as embodied by MPI clusters) provide distinct capabilities for storing and processing large volumes of data. Hadoop clusters make distributed computing readily accessible to the Java community, while MPI clusters deliver high parallel efficiency for compute-intensive workloads. Bringing the Big Data and Big Compute communities together is an active area of research. Projects such as Apache ZooKeeper provide a centralized coordination service that enables synchronization across a cluster, one of the key building blocks of distributed computing in big data systems.

In this chapter, we will cover the following:

  • What is big data?
  • Big data characteristics
  • NoSQL databases
  • Hadoop, MapReduce, and HDFS
  • Distributed computing for big data
  • ZooKeeper for distributed computing

While technologies such as Hadoop, HBase, Accumulo, and Cassandra allow us to store, query, and index large volumes of complex data, the Dynamic Distributed Dimensional Data Model (D4M) provides a uniform mathematical framework for processing structured, semi-structured, and unstructured multidimensional data.
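To give a concrete feel for the MapReduce programming model covered later in this chapter, here is a minimal in-memory sketch of the classic word count in plain Java. This is a single-JVM simulation of the map, shuffle, and reduce phases, not the actual Hadoop API; the class and method names are illustrative choices of our own.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the MapReduce model: word count.
// The map phase emits (word, 1) pairs; the shuffle groups pairs by key;
// the reduce phase sums the counts per key. Hadoop distributes these
// same phases across a cluster; here everything runs in one JVM.
public class WordCountSketch {

    public static Map<String, Integer> wordCount(List<String> lines) {
        // TreeMap keeps keys sorted, making the output deterministic.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // "Map" phase: split each input line into words.
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // "Shuffle + reduce" phase: group by word and sum counts.
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> result =
                wordCount(List.of("big data", "big compute"));
        System.out.println(result); // prints {big=2, compute=1, data=1}
    }
}
```

In Hadoop, the map and reduce steps would be separate classes running on different nodes, with the framework handling the shuffle over the network; the logic per record, however, is exactly what this sketch shows.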
