Big Data Analytics

Big Data (as embodied by Hadoop clusters) and Big Compute (as embodied by MPI clusters) provide distinct capabilities for storing and processing large volumes of data. Hadoop clusters make distributed computing readily accessible to the Java community, while MPI clusters deliver high parallel efficiency for compute-intensive workloads. Bringing the Big Data and Big Compute communities together is an active area of research. Projects such as Apache ZooKeeper provide a centralized coordination service that enables synchronization across a cluster, one of the key building blocks of distributed computing in big data systems.

In this chapter, we will cover the following:

  • What is big data?
  • Big data characteristics
  • NoSQL databases
  • Hadoop, MapReduce, and HDFS
  • Distributed computing for big data
  • ZooKeeper for distributed computing

While technologies such as Hadoop, HBase, Accumulo, and Cassandra allow us to store, query, and index large volumes of complex data, the Dynamic Distributed Dimensional Data Model (D4M) provides a uniform mathematical framework for processing structured, semi-structured, and unstructured multidimensional data.
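To give a concrete feel for the MapReduce programming model covered later in this chapter, here is a minimal in-memory sketch of the classic word count in plain Java. This is a single-JVM simulation of the map, shuffle, and reduce phases, not the actual Hadoop API; the class and method names are illustrative choices of our own.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the MapReduce model: word count.
// The map phase emits (word, 1) pairs; the shuffle groups pairs by key;
// the reduce phase sums the counts per key. Hadoop distributes these
// same phases across a cluster; here everything runs in one JVM.
public class WordCountSketch {

    public static Map<String, Integer> wordCount(List<String> lines) {
        // TreeMap keeps keys sorted, making the output deterministic.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // "Map" phase: split each input line into words.
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // "Shuffle + reduce" phase: group by word and sum counts.
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> result =
                wordCount(List.of("big data", "big compute"));
        System.out.println(result); // prints {big=2, compute=1, data=1}
    }
}
```

In Hadoop, the map and reduce steps would be separate classes running on different nodes, with the framework handling the shuffle over the network; the logic per record, however, is exactly what this sketch shows.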
