7. Big Data Analytics

Using data to understand customers or clients and business operations to sustain (and foster) growth and profitability is an increasingly challenging task for today’s enterprises. As more and more data becomes available in various forms and fashions, timely processing of the data with traditional means becomes impractical. This phenomenon is called Big Data, and it is receiving substantial press coverage and drawing increasing interest from both business users and IT professionals. The result is that Big Data is becoming an overhyped and overused marketing buzzword.

Big Data means different things to people with different backgrounds and interests. Traditionally, the term Big Data has been used to describe the massive volumes of data analyzed by huge organizations like Google or research science projects at NASA. But for most businesses, it’s a relative term: Big depends on an organization’s size. The point is more about finding new value within and outside conventional data sources. Pushing the boundaries of data analytics uncovers new insights and opportunities, and big depends on where you start and how you proceed. Consider this popular description of Big Data: Big Data exceeds the reach of commonly used hardware environments and/or capabilities of software tools to capture, manage, and process it within a tolerable time span for its user population. Big Data has become a popular term to describe the exponential growth, availability, and use of information, both structured and unstructured. Much has been written on the Big Data trend and how it can serve as the basis for innovation, differentiation, and growth.

Where Does Big Data Come From?

A simple answer is “everywhere.” The sources of data that were ignored because of the technical limitations are now treated as gold mines. Big Data may come from any number of sources, including blogs, RFID tags, GPS, sensor networks, social networks, Internet-based text documents, Internet search indexes, detailed call records, astronomy, atmospheric science, biology, genomics, nuclear physics, biochemical experiments, medical records, scientific research, military surveillance, photography archives, video archives, and large-scale ecommerce practices.

Figure 7.1 illustrates the sources of Big Data in a three-level diagram. The traditional data sources—mainly business transactions—are illustrated as the first echelon, where the volume, variety, and velocity of the data are moderate to low. The next echelon is the data generated by the Internet and social media. This human-generated data is perhaps the most complicated and potentially most valuable for understanding collective ideas and perceptions of people. The volume, variety, and velocity of data in this echelon are moderate to high. The topmost echelon is machine-generated data. With the automation of data collection systems on many fronts, coupled with the Internet of things (which connects everything to everything else), organizations are now able to collect data at volumes and richness that were unimaginable just a few years ago. All three echelons of data sources create a wealth of information that can significantly improve an organization’s capability of solving complex problems and taking advantage of opportunities—if recognized and leveraged properly.


Figure 7.1 The Wide Range of Sources for Big Data

Big Data is not new. What is new is that the definition and the structure of Big Data constantly change. Companies have been storing and analyzing large volumes of data since the advent of data warehouses in the early 1990s. While terabytes used to be synonymous with Big Data warehouses, now it’s petabytes, and the rate of growth in data volumes continues to escalate as organizations seek to store and analyze greater levels of transaction details, as well as Web- and machine-generated data, to gain a better understanding of customer behavior and business drivers. Many—academics and industry analysts and leaders alike—think that “Big Data” is a misnomer. What it says and what it means are not exactly the same. That is, Big Data is not just “big.” The sheer volume of the data is only one of many characteristics that are often associated with Big Data, such as variety, velocity, veracity, variability, and value proposition, among others.

The Vs That Define Big Data

Big Data is typically defined by three Vs: volume, variety, and velocity. In addition to these three, we see some of the leading Big Data solution providers adding other Vs, such as veracity (IBM), variability (SAS), and value proposition (literally almost everyone—academics and industry).

Volume

Volume is obviously the most common trait of Big Data. Many factors contributed to the exponential increase in data volume, such as transaction-based data stored through the years, text data constantly streaming in from social media, increasing amounts of sensor data being collected, automatically generated RFID and GPS data, etc. In the past, excessive data volume created storage issues, both technical and financial. But with today’s advanced technologies coupled with decreasing storage costs, these issues are no longer significant; instead, other issues emerge, including how to determine relevance amid the large volumes of data and how to create value from data that is deemed to be relevant.

As mentioned before, big is a relative term. It changes over time and is perceived differently by different organizations. With the staggering increase in data volume, even the naming of the next Big Data echelon has been challenging. The highest unit of data used to be the petabyte (PB), but now we speak of the zettabyte (ZB), which is a trillion gigabytes (GB) or a billion terabytes (TB). As the volume of data increases, we are having a hard time keeping up with universally accepted naming for the next level. Table 7.1 provides an overview of the size and naming of the modern-day data volumes (Sharda et al. 2014).


Source: en.wikipedia.org/wiki/Petabyte

Table 7.1 Naming for the Increasing Volumes of Data

Consider that an exabyte of data is created on the Internet each day, which equates to 250 million DVDs’ worth of information. And the idea of even larger amounts of data—a zettabyte—isn’t too far off when it comes to the amount of information traversing the Web in a year. In fact, industry experts are already estimating that we will see 1.3 zettabytes of traffic annually over the Internet by 2016—and soon enough, we might start talking about even bigger volumes. Some Big Data scientists allege that the NSA and FBI have a yottabyte of data on people. To put this measurement in perspective, a yottabyte is the amount of storage on 250 trillion DVDs. A brontobyte, which is not an official SI unit but is apparently recognized by some people in the measurement community, is a 1 followed by 27 zeros. Sizes of such magnitude can be used to describe the type of sensor data that we will get from the Internet of things in the next decade, if not sooner. A gegobyte is a 1 followed by 30 zeros (10^30 bytes). With respect to where Big Data comes from, consider the following (Higginbotham, 2012):

• The CERN Large Hadron Collider generates 1 petabyte of data per second.

• Sensors from a Boeing jet engine create 20 terabytes of data every hour.

• Facebook databases take in 500 terabytes of new data per day.

• On YouTube, 72 hours of video are uploaded per minute, translating to 1 terabyte every four minutes.

• The proposed Square Kilometer Array telescope (the proposed biggest telescope in the world) will generate an exabyte of data per day.

From a short historical perspective, in 2009, the world had about 0.8 ZB of data; in 2010, it exceeded the 1 ZB mark; at the end of 2011, the number was 1.8 ZB. IBM has estimated that six or seven years from now, we will have 35 ZB. Though this number is astonishing in size, so are the challenges and opportunities that come with it.
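To put these naming scales in perspective, the following minimal Python sketch lists the decimal (SI) units and checks the DVD comparisons cited above. The figure of roughly 4 GB per DVD is an assumption chosen only to reproduce the cited numbers.

```python
# A minimal sketch of the decimal (SI) naming scale for data volumes and a
# back-of-the-envelope check of the DVD comparisons cited in this section.
# The DVD capacity below is an assumption used only for the comparison.

SI_UNITS = [
    ("kilobyte (KB)",  10**3),
    ("megabyte (MB)",  10**6),
    ("gigabyte (GB)",  10**9),
    ("terabyte (TB)",  10**12),
    ("petabyte (PB)",  10**15),
    ("exabyte (EB)",   10**18),
    ("zettabyte (ZB)", 10**21),
    ("yottabyte (YB)", 10**24),
]

for name, size in SI_UNITS:
    print(f"1 {name} = 10^{len(str(size)) - 1} bytes")

DVD_BYTES = 4 * 10**9  # assumed capacity of one DVD, in bytes

print(f"1 exabyte   = {10**18 // DVD_BYTES:,} DVDs")   # 250,000,000
print(f"1 yottabyte = {10**24 // DVD_BYTES:,} DVDs")   # 250,000,000,000,000
```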

Variety

Data today comes in all types of formats—ranging from traditional databases to hierarchical data stores created by end users and OLAP systems; to text documents, email, and XML; to meter-collected, sensor-captured data; to video, audio, and stock ticker data. By some estimates, 80% to 85% of all organizations’ data is in some sort of unstructured or semistructured format (i.e., a format that is not suitable for traditional database schemas). But there is no denying its value, and hence it must be included in the analyses to support decision making.
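As a simple illustration of why such data resists a fixed schema, consider the same customer captured once as a relational row and once as two semistructured events; all field names below are hypothetical, chosen only for the sketch.

```python
import json

# A relational row presumes a fixed, predetermined set of columns.
relational_row = ("C1001", "2014-03-02", 59.99)  # (customer_id, date, amount)

# Semistructured records (e.g., a social media post and a call center log)
# carry whatever attributes happen to exist; two records about the same
# customer need not share the same fields.
event_1 = {"customer_id": "C1001", "channel": "twitter",
           "text": "Great service today!", "followers": 420}
event_2 = {"customer_id": "C1001", "channel": "call_center",
           "duration_sec": 310, "sentiment": "negative",
           "audio_ref": "recordings/2014/03/02/c1001.wav"}

# Forcing both events into one fixed row-and-column layout either drops
# fields (losing richness) or pads every row with mostly empty columns,
# which is why such data is routinely kept in its native form.
for event in (event_1, event_2):
    print(json.dumps(event, indent=2))
```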

Velocity

According to Gartner, a well-known and highly respected technology consultancy company, velocity means both how fast data is being produced and how fast the data must be processed (i.e., captured, stored, and analyzed) to meet the need or demand. RFID tags, automated sensors, GPS devices, and smart meters are driving an increasing need to deal with torrents of data in near real time. Velocity is perhaps the most overlooked characteristic of Big Data. Reacting quickly enough to deal with velocity is a challenge to most organizations. For time-sensitive environments, the opportunity cost clock of the data starts ticking the moment the data is created. As time passes, the value proposition of the data degrades and eventually becomes worthless. Whether the subject matter is the health of a patient, the well-being of a traffic system, or the health of an investment portfolio, accessing the data and reacting faster to the circumstances will always create more advantageous outcomes.

In the Big Data storm that we are witnessing today, almost everyone is fixated on at-rest analytics, using optimized software and hardware systems to mine large quantities of varied data sources. Although this is critically important and highly valuable, another class of analytics that is often overlooked is driven by the velocity nature of Big Data; it is called data stream analytics or in-motion analytics. If done correctly, data stream analytics can be as valuable as—and in some business environments more valuable than—at-rest analytics. Later in this chapter we cover this topic in more detail.

Veracity

IBM is using the term veracity as a fourth V to describe Big Data. It refers to conformity to facts—to the accuracy, quality, truthfulness, or trustworthiness of data. Tools and techniques are often used to handle Big Data’s veracity by transforming the data into trustworthy insights.

Variability

In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Inconsistency in data flow makes it very difficult to properly and cost-effectively develop data infrastructures. If the resources are put in place to handle the peak times, for the rest of the time they will be significantly underutilized. One popular way to handle the problem of variability is to use pooled resources, based on the infrastructure-as-a-service business model. Cloud computing, service-oriented architecture, and massively parallel processing make variability a manageable issue not only for large businesses but also for small and medium-sized ones.

Value Proposition

The excitement around Big Data is about its value proposition. A preconceived notion about Big Data is that it contains (or has a greater potential to contain) more patterns and interesting anomalies than “small” data. Thus, by analyzing large and feature-rich data, organizations can gain greater business value than they could otherwise gain. While users can detect the patterns in small data sets by using simple statistical and machine learning methods or ad hoc query and reporting tools, Big Data means “big” analytics. Big analytics means greater insight and better decision making—something that every organization needs.

Since the exact definition of Big Data is still a matter of ongoing discussion in academic and industrial circles, it is likely that more characteristics (perhaps more Vs) will be added to this list. Regardless of what happens, the importance and value proposition of Big Data are here to stay.

Fundamental Concepts of Big Data

Big Data by itself, regardless of the size, type, or speed of the data, is worthless unless business users do something with it that delivers value to their organizations. That’s where “big” analytics comes into the picture. Although organizations have long run reports and dashboards against data warehouses, most have not opened these repositories to in-depth on-demand exploration. This is partly because analysis tools are too complex for the average user and partly because the repositories often do not contain all the data that power users need. But this is about to change (and has already changed for some) in a dramatic fashion, thanks to the new Big Data analytics paradigm.

Along with the value proposition, Big Data has also brought about big challenges for organizations. The traditional means for capturing, storing, and analyzing data are not capable of dealing with Big Data effectively and efficiently. Therefore, new breeds of technologies need to be developed (or purchased, hired, or outsourced) to take on the Big Data challenge. Before making such an investment, organizations should justify the means. Here are some questions that may help shed light on this situation. If any of the following statements are true, then you need to seriously consider embarking on a Big Data journey:

• You can’t process the amount of data that you want to process because of the limitations posed by your current platform or environment.

• You want to incorporate new/contemporary data sources (e.g., social media, RFID, sensor, Web, GPS, textual data) into your analytics platform, but you can’t because the new data does not fit into your data storage’s schema-defined rows and columns without sacrificing its fidelity or richness.

• You need to (or want to) integrate data as quickly as possible to keep your analysis current.

• You want to work with a schema-on-demand (as opposed to the predetermined schema used in RDBMSs) data storage paradigm because the nature of the new data may not be known, or there may not be enough time to determine it and develop schema for it.

• The data is arriving so fast at your organization’s doorstep that your traditional analytics platform cannot handle it.

As is the case with any other large IT investment, the success in Big Data analytics depends on a number of factors. Figure 7.2 shows a graphical depiction of the most critical success factors (Watson, 2012).


Figure 7.2 Critical Success Factors for Big Data Analytics

Following are among the most critical success factors for Big Data analytics:

A clear business need (alignment with the vision and the strategy). Business investments ought to be made for the good of the business, not for the sake of mere technology advancements. Therefore, the main driver for Big Data analytics should be the needs of the business, at any level: strategic, tactical, or operational.

Strong, committed sponsorship (i.e., executive champions). It is a well-known fact that if you don’t have strong, committed executive sponsorship, it is difficult (or even impossible) to succeed. If the scope is a single or a few analytical applications, the sponsorship can be at the department level. However, if the target is enterprise-wide organizational transformation, which is often the case for Big Data initiatives, sponsorship needs to be at the highest levels, and it needs to be organization-wide.

Alignment between the business and IT strategy. It is essential to make sure that analytics work is always supporting the business strategy—and not the other way around. Analytics should play an enabling role in execution of the business strategy.

A fact-based decision-making culture. In a fact-based decision-making culture, the numbers—rather than intuition, gut feeling, or supposition—drive decision making. There is also a culture of experimentation to see what works and doesn’t. To create a fact-based decision-making culture, the senior management needs to

• Recognize that some people can’t or won’t adjust

• Be a vocal supporter

• Stress that outdated methods must be discontinued

• Ask to see what analytics went into decisions

• Link incentives and compensation to desired behaviors

A strong data infrastructure. Data warehouses have provided the data infrastructure for analytics. This infrastructure is changing and being enhanced in the Big Data era with new technologies. Success requires marrying the old with the new for a holistic infrastructure that works synergistically.

As the size and complexity of data increases, the need for more efficient analytical systems is also increasing. In order to keep up with the computational needs of Big Data, a number of new and innovative computational techniques and platforms are being developed. These techniques, collectively called high-performance computing, include the following:

In-memory analytics solves complex problems in near real time with highly accurate insights by allowing analytical computations and Big Data to be processed in-memory and distributed across a dedicated set of nodes.

In-database analytics speeds insights and enables better data governance by performing data integration and analytic functions inside the database so you don’t have to move or convert data repeatedly.

Grid computing promotes efficiency, lower cost, and better performance by processing jobs in a shared, centrally managed pool of IT resources.

Appliances bring together hardware and software in a physical unit that is not only fast but also can be scaled on an as-needed basis.
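These platforms differ in their details, but they share the scale-out idea of splitting work into independent pieces and processing the pieces in parallel. The toy Python sketch below illustrates that idea on a single machine, with a local pool of worker processes standing in for the nodes of a grid or appliance; it is a conceptual sketch, not a depiction of any particular product.

```python
from multiprocessing import Pool

def summarize(partition):
    """Toy analytic task run on one partition of the data: sum and count."""
    return sum(partition), len(partition)

if __name__ == "__main__":
    # Pretend this list is a large data set split into partitions that a
    # grid, in-memory, or appliance platform would spread across many nodes.
    data = list(range(1_000_000))
    partitions = [data[i::8] for i in range(8)]

    with Pool(processes=8) as pool:          # local stand-in for a cluster
        partials = pool.map(summarize, partitions)

    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    print("overall average:", total / count)  # same answer, computed in parallel
```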

Computational requirements are just a small part of the list of challenges that Big Data imposes on today’s enterprises. Following is a list of challenges that business executives have found to have significant impacts on successful implementation of Big Data analytics. When considering Big Data projects and architecture, being mindful of these challenges will make the journey to analytics competency a less stressful one:

Data volume. It’s important to be able to capture, store, and process the huge volume of data at an acceptable speed so that the latest information is available to decision makers when they need it.

Data integration. It’s important to be able to combine data that is not similar in structure or source and to do so quickly and at reasonable cost.

Processing capabilities. It’s important to be able to process the data quickly, as it is captured. Traditional ways of collecting and then processing the data may not work. In many situations, data needs to be analyzed as soon as it is captured to leverage the most value. (This is called stream analytics and is covered later in this chapter.)

Data governance. It’s important to be able to keep up with the security, privacy, ownership, and quality issues of Big Data. As the volume, variety (format and source), and velocity of data change, so should the capabilities of governance practices.

Skills availability. Big Data is being harnessed with new tools and is being looked at in different ways. There is a shortage of people (often called data scientists, discussed later in this chapter) with the skills to do the job.

Solution cost. Since Big Data has opened up a world of possible business improvements, there is a great deal of experimentation and discovery taking place to determine the patterns that matter and the insights that turn into value. To ensure a positive ROI on a Big Data project, therefore, it is crucial to reduce the cost of the solutions used to find that value.

The challenges are real, but so is the value proposition of Big Data analytics. Anything that business analytics leaders can do to help prove the value of new data sources to the business will move the organization beyond experimenting and exploring Big Data to adapting and embracing it as a differentiator. There is nothing wrong with exploration, but ultimately the value of Big Data comes from putting insights into action.

The Business Problems That Big Data Analytics Addresses

The top business problems addressed by Big Data overall are process efficiency and cost reduction, as well as enhanced customer experience, but different priorities emerge when the problems are examined industry by industry. Process efficiency and cost reduction are common business problems that can be addressed by analyzing Big Data, and these are perhaps among the top-ranked problems that can be addressed with Big Data analytics for the manufacturing, government, energy and utilities, communications and media, transport, and healthcare sectors. Enhanced customer experience may be at the top of the list of problems addressed by insurance companies and retailers. Risk management usually is at the top of the list for companies in banking and education. Here is a list of problems that can be addressed using Big Data analytics:

• Process efficiency and cost reduction

• Brand management

• Revenue maximization, cross-selling, and up-selling

• Enhanced customer experience

• Churn identification and customer recruiting

• Improved customer service

• Identification of new products and market opportunities

• Risk management

• Regulatory compliance

• Enhanced security capabilities

Big Data Technologies

There are a number of technologies for processing and analyzing Big Data, but most have some common characteristics (Kelly, 2012). Namely, they take advantage of commodity hardware to enable scale-out and parallel-processing techniques; they employ nonrelational data storage capabilities in order to process unstructured and semistructured data; and they apply advanced analytics and data visualization technology to Big Data to convey insights to end users. Three Big Data technologies stand out as likely to transform the business analytics and data management markets: MapReduce, Hadoop, and NoSQL. The following sections examine these three technologies.

MapReduce

MapReduce is a technique popularized by Google that distributes the processing of very large multistructured data files across a large cluster of machines. High performance is achieved by breaking the processing into small units of work that can be run in parallel across the hundreds—potentially thousands—of nodes in the cluster. In their seminal paper on MapReduce, Dean and Ghemawat (2004) said:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

The key point to note from this quote is that MapReduce is a programming model, not a programming language. That is, it is designed to be used by programmers rather than by business users.

To see how MapReduce works, let’s look at an example. In the colored square counter in Figure 7.3, the input to the MapReduce process is a set of colored squares. The objective is to count the number of squares of each color. The programmer in this example is responsible for coding the map and reduce programs; the remainder of the processing is handled by the software system implementing the MapReduce programming model.


Figure 7.3 A Graphical Depiction of the MapReduce Process

The MapReduce system first reads the input file and splits it into multiple pieces. In this example, there are two splits, but in a real-life scenario, the number of splits would typically be much higher. These splits are then processed by multiple map programs running in parallel on the nodes of the cluster. The role of each map program in this case is to group the data in a split by color. The MapReduce system then takes the output from each map program and merges (i.e., shuffles and sorts) the results for input to the reduce program, which calculates the sum of the number of squares of each color. In this example, only one copy of the reduce program is used, but there may be more in practice. To optimize performance, programmers can provide their own shuffle and sort program, and they can also deploy a combiner that combines local map output files to reduce the number of output files that have to be remotely accessed across the cluster by the shuffling and sorting step.
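The following minimal Python sketch mirrors that flow in a single process: a map step emits a (color, 1) pair for each square, a shuffle-and-sort step groups the pairs by color, and a reduce step sums each group. It simulates the programming model only; it is not Hadoop’s or Google’s implementation.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    """Map: emit a (color, 1) pair for every square in one input split."""
    return [(color, 1) for color in split]

def shuffle_and_sort(mapped_outputs):
    """Shuffle/sort: group all intermediate pairs by their key (color)."""
    groups = defaultdict(list)
    for color, count in chain.from_iterable(mapped_outputs):
        groups[color].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each color."""
    return {color: sum(counts) for color, counts in groups.items()}

# Two input splits of colored squares, in the spirit of the figure's example.
splits = [
    ["red", "blue", "green", "red"],
    ["blue", "blue", "green", "red"],
]

mapped = [map_phase(s) for s in splits]   # in a cluster, these run in parallel
result = reduce_phase(shuffle_and_sort(mapped))
print(result)   # {'red': 3, 'blue': 3, 'green': 2}
```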

Why Use MapReduce?

MapReduce aids organizations in processing and analyzing large volumes of multistructured data. Application examples include indexing and search, graph analysis, text analysis, machine learning, and data transformation. These types of applications are often difficult to implement using the standard SQL employed by relational DBMSs.

The procedural nature of MapReduce makes it easy for skilled programmers to understand. It also has the advantage of not requiring developers to be concerned with implementing parallel computing; the system transparently handles that.

Although MapReduce is designed for programmers, nonprogrammers can exploit the value of prebuilt MapReduce applications and function libraries. Both commercial and open source MapReduce libraries are available that provide a wide range of analytic capabilities. Apache Mahout, for example, is an open source machine learning library of algorithms for clustering, classification, and batch-based collaborative filtering that are implemented using MapReduce.

Hadoop


Hadoop is an open source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data. Originally created by Doug Cutting at Yahoo!, Hadoop was inspired by MapReduce, the programming model developed by Google in the early 2000s for indexing the Web. Hadoop was designed to handle petabytes and exabytes of data, distributed over multiple nodes in parallel.

Hadoop clusters run on inexpensive commodity hardware, so projects can scale out without breaking the bank. Hadoop is now a project of the Apache Software Foundation, where hundreds of contributors continuously improve the core technology.

How Hadoop Works

Fundamentally, rather than bang away at one huge block of data with a single machine, Hadoop breaks up Big Data into multiple parts so all the parts can be processed and analyzed at the same time. A client accesses unstructured and semistructured data from sources including log files, social media feeds, and internal data stores. It breaks the data up into parts, which are then loaded into a file system made up of multiple nodes running on commodity hardware. The default file store in Hadoop is the Hadoop Distributed File System (HDFS). File systems such as HDFS are adept at storing large volumes of unstructured and semistructured data because they do not require data to be organized into relational rows and columns. Each part is replicated multiple times and loaded into the file system so that if a node fails, another node has a copy of the data contained on the failed node. A name node acts as facilitator, communicating back to the client information such as which nodes are available, where in the cluster certain data resides, and which nodes have failed.

Once the data is loaded into the cluster, it is ready to be analyzed via the MapReduce framework. The client submits a map job—usually a query written in Java—to one of the nodes in the cluster known as the job tracker. The job tracker refers to the name node to determine which data it needs to access to complete the job and where in the cluster that data is located. Once this is determined, the job tracker submits the query to the relevant nodes. Rather than bring all the data back into a central location for processing, processing occurs at each node simultaneously, or in parallel. This is an essential characteristic of Hadoop.

When each node has finished processing its given job, it stores the results. The client initiates a reduce job through the job tracker, in which results of the map phase stored locally on individual nodes are aggregated to determine the “answer” to the original query and are then loaded onto another node in the cluster. The client accesses these results, which can then be loaded into one of a number of analytic environments for analysis. The MapReduce phase has now been completed.
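To make the map and reduce roles concrete, here is a sketch of a word-count job written in the style of Hadoop Streaming, where the two programs read lines from standard input and write tab-separated key/value pairs to standard output. As noted above, MapReduce jobs are more commonly written in Java; this Python version, and the file names mentioned in the comments, are only an illustration.

```python
import sys

# mapper.py (sketch): emit "word<TAB>1" for every word read from stdin.
def mapper(lines):
    for line in lines:
        for word in line.split():
            sys.stdout.write(f"{word}\t1\n")

# reducer.py (sketch): Hadoop Streaming hands the reducer the mapper output
# sorted by key, so consecutive lines with the same word can be summed as
# they arrive, without holding the whole data set in memory.
def reducer(lines):
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
            continue
        if current_word is not None:
            sys.stdout.write(f"{current_word}\t{current_count}\n")
        current_word, current_count = word, int(count)
    if current_word is not None:
        sys.stdout.write(f"{current_word}\t{current_count}\n")

# In an actual job, the two functions would live in separate scripts handed
# to the Hadoop Streaming utility; run locally, mapper(sys.stdin) followed by
# a sort and reducer(...) reproduces the same word counts on one machine.
```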

After the MapReduce phase, the processed data is ready for further analysis by data scientists and others with advanced data analytics skills. Data scientists can manipulate and analyze the data, using a number of tools for a number of uses, including to search for hidden insights and patterns or to use as the foundation to build user-facing analytic applications. The data can also be modeled and transferred from Hadoop clusters into existing relational databases, data warehouses, and other traditional IT systems for further analysis and/or to support transactional processing.

Hadoop Technical Components

A Hadoop “stack” is made up of a number of components, including the following:

Hadoop Distributed File System (HDFS). This is the default storage layer in any given Hadoop cluster.

Name node. This is the node in a Hadoop cluster that provides the client information on where in the cluster particular data is stored and whether any nodes fail.

Secondary node. The secondary node periodically takes snapshots of (checkpoints) the name node’s metadata and stores them so they can be used to help restore the name node if it fails.

Job tracker. This is the node in a Hadoop cluster that initiates and coordinates MapReduce jobs or the processing of the data.

Slave nodes. The grunts of any Hadoop cluster, slave nodes store data and take direction to process it from the job tracker.

In addition, the Hadoop ecosystem is made up of a number of complementary subprojects. NoSQL data stores like Cassandra and HBase are used to store the results of MapReduce jobs in Hadoop. In addition to Java, some MapReduce jobs and other Hadoop functions are written in Pig, an open source language designed specifically for Hadoop. Hive is an open source data warehouse originally developed by Facebook that allows for analytic modeling within Hadoop. A rich set of Hadoop-related projects and supporting tools/platforms can be found in Sharda et al. (2014).

Hadoop Pros and Cons

The main benefit of Hadoop is that it allows enterprises to process and analyze large volumes of unstructured and semistructured data, heretofore inaccessible to them, in a cost- and time-effective manner. Because Hadoop clusters can scale to petabytes and even exabytes of data, enterprises no longer need to rely on sample data sets but can process and analyze all relevant data. Data scientists can apply an iterative approach to analysis, continually refining and testing queries to uncover previously unknown insights. Also, getting started with Hadoop doesn’t cost much. Developers can download the Apache Hadoop distribution for free and begin experimenting with Hadoop in less than a day.

The downside to Hadoop and its myriad components is that they are immature and still developing. As with any other young, raw technology, implementing and managing Hadoop clusters and performing advanced analytics on large volumes of unstructured data requires significant expertise, skill, and training. Unfortunately, there is currently a dearth of Hadoop developers and data scientists available, which means it’s impractical for many enterprises to maintain and take advantage of complex Hadoop clusters. Further, as Hadoop’s myriad components are improved by the community and as new components are created, there is (as there is with any immature open source technology/approach) a risk of forking. Finally, Hadoop is a batch-oriented framework, which means it does not support real-time data processing and analysis.

The good news is that some of the brightest minds in IT are contributing to the Apache Hadoop project, and a new generation of Hadoop developers and data scientists is coming of age. As a result, the technology is advancing rapidly, becoming both more powerful and easier to implement and manage. An ecosystem of vendors, including Hadoop-focused startups like Cloudera and Hortonworks as well as well-worn IT stalwarts like IBM and Microsoft, is working to offer commercial, enterprise-ready Hadoop distributions, tools, and services to make deploying and managing the technology a practical reality for traditional enterprises. Other bleeding-edge startups are working to perfect NoSQL (which stands for Not Only SQL) data stores capable of delivering near-real-time insights in conjunction with Hadoop.

A Few Demystifying Facts About Hadoop

Although Hadoop and related technologies, such as MapReduce and Hive, have been around for over five years now, many people still have several misconceptions about them. The following list of 10 facts is intended to clarify what Hadoop is and does relative to BI, as well as in which business and technology situations Hadoop-based BI, data warehousing, and analytics can be useful (Russom, 2013):

Fact 1: Hadoop consists of multiple products. We talk about Hadoop as if it’s one monolithic thing, but it’s actually a family of open source products and technologies overseen by the Apache Software Foundation (ASF). (Some Hadoop products are also available via vendor distributions; more on that in the next point.)

The Apache Hadoop library includes (in BI priority order) the Hadoop Distributed File System (HDFS), MapReduce, Hive, HBase, Pig, Zookeeper, Flume, Sqoop, Oozie, Hue, and so on. You can combine these in various ways, but HDFS and MapReduce (perhaps with HBase and Hive) constitute a useful technology stack for applications in BI, DWs, and analytics.

Fact 2: Hadoop is open source but available from vendors, too. Apache Hadoop’s open source software library is available from ASF at www.apache.org. For users desiring a more enterprise-ready package, a few vendors now offer Hadoop distributions that include additional administrative tools and technical support.

Fact 3: Hadoop is an ecosystem, not a single product. In addition to products from Apache, the extended Hadoop ecosystem includes a growing list of vendor products that integrate with or expand Hadoop technologies. One minute on your favorite search engine will reveal these.

Fact 4: HDFS is a file system, not a DBMS. Hadoop is primarily a distributed file system and lacks capabilities associated with a DBMS, such as indexing, random access to data, and support for SQL. That’s okay because HDFS does things DBMSs cannot do.

Fact 5: Hive resembles SQL but is not standard SQL. Many of us are handcuffed to SQL because we know it well, and our tools demand it. People who know SQL can quickly learn to hand-code Hive, but that doesn’t solve compatibility issues with SQL-based tools. The Data Warehousing Institute (TDWI) feels that over time, Hadoop products will support standard SQL, so this issue will soon be moot.

Fact 6: Hadoop and MapReduce are related but don’t require each other. Developers at Google developed MapReduce before HDFS existed, and some variations of MapReduce work with a variety of storage technologies, including HDFS, other file systems, and some DBMSs.

Fact 7: MapReduce provides control for analytics, not analytics per se. MapReduce is a general-purpose execution engine that handles the complexities of network communication, parallel programming, and fault tolerance for any kind of application that you can hand-code—not just analytics.

Fact 8: Hadoop is about data diversity, not just data volume. Theoretically, HDFS can manage the storage and access of any data type, as long as you can put the data in a file and copy that file into HDFS. As outrageously simplistic as that sounds, it’s largely true, and it’s exactly what brings many users to Apache HDFS.

Fact 9: Hadoop complements a DW; it’s rarely a replacement. Most organizations have designed their DW for structured, relational data, which makes it difficult to wring BI value from unstructured and semistructured data. Hadoop promises to complement DWs by handling the multistructured data types that most DWs can’t handle.

Fact 10: Hadoop enables many types of analytics, not just Web analytics. Hadoop gets a lot of press about how Internet companies use it for analyzing Web logs and other Web data. But other use cases exist. For example, consider the Big Data coming from sensory devices, such as robotics in manufacturing, RFID in retail, or grid monitoring in utilities. Older analytic applications that need large data samples—such as customer-base segmentation, fraud detection, and risk analysis—can benefit from the additional Big Data managed by Hadoop. Likewise, Hadoop’s additional data can expand 360-degree views to create a more complete and granular view.

NoSQL

A related new style of database called NoSQL (which stands for Not Only SQL) has emerged to, like Hadoop, process large volumes of multistructured data. However, whereas Hadoop is adept at supporting large-scale, batch-style historical analysis, NoSQL databases are aimed, for the most part (though there are some important exceptions), at serving up discrete data stored among large volumes of multistructured data to end-user and automated Big Data applications. This capability is sorely lacking from relational database technology, which simply can’t maintain needed application performance levels at a Big Data scale.

In some cases, NoSQL and Hadoop work in conjunction. HBase, for example, is a popular NoSQL database modeled after Google Bigtable that is often deployed on top of HDFS to provide low-latency, quick lookups in Hadoop. The downside of most NoSQL databases today is that they have traded ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability. Many also lack mature management and monitoring tools. Both these shortcomings are in the process of being overcome by both the open source NoSQL communities and a handful of vendors that are attempting to commercialize the various NoSQL databases. NoSQL databases currently available include HBase, Cassandra, MongoDB, Accumulo, Riak, CouchDB, and DynamoDB, among others.
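As one illustration of the “serving up discrete data” role, the sketch below performs a single-row write and read against HBase using the third-party happybase Python client. The host, table name, and column family are assumptions made for the example, and a running HBase instance with its Thrift gateway enabled is presumed.

```python
# A minimal sketch of key-based access to HBase via the third-party
# happybase client (pip install happybase). The connection details, table
# name, and column family are assumptions for illustration only.
import happybase

connection = happybase.Connection(host="localhost")   # assumed Thrift host
table = connection.table("user_profiles")             # hypothetical table

# Write one record: a row key plus a few columns in a "profile" column family.
table.put(b"user:1001", {
    b"profile:name": b"Alice",
    b"profile:last_login": b"2014-05-01T08:30:00",
})

# Low-latency read of a single row by key: the discrete, record-at-a-time
# access pattern that batch-oriented MapReduce over HDFS is not built for.
row = table.row(b"user:1001")
print(row[b"profile:name"].decode())

connection.close()
```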

Data Scientists

Data scientist is a role or a job frequently associated with Big Data or data science. In a very short time, it has become one of the most sought-after roles in the marketplace. In an article published in the October 2012 issue of Harvard Business Review, authors Thomas H. Davenport and D. J. Patil called data scientist “the sexiest job of the 21st century.” In their article, they specified data scientists’ most basic, universal skill as the ability to write code (in the latest Big Data languages and platforms). Although this may be less true in the near future, when many more people will have the title “data scientist” on their business cards, at this time it seems to be the most fundamental skill required of data scientists. A more enduring skill will be the need for data scientists to communicate in language that all their stakeholders understand—and to demonstrate the special skills involved in storytelling with data, whether verbally, visually, or—ideally—both (Davenport & Patil, 2012).

Data scientists use a combination of business and technical skills to investigate Big Data, looking for ways to improve current business analytics practices (from descriptive to predictive and prescriptive) and hence to improve decisions for new business opportunities. One of the biggest differences between a data scientist and a business intelligence user—such as a business analyst—is that a data scientist investigates and looks for new possibilities, while a BI user analyzes existing business situations and operations.

One of the dominant traits expected of data scientists is an intense curiosity—a desire to go beneath the surface of a problem, find the questions at its heart, and distill them into a very clear set of hypotheses that can be tested. This often entails the associative thinking that characterizes the most creative scientists in any field. For example, one data scientist studying a fraud problem realized that it was analogous to a type of DNA sequencing problem (Davenport & Patil, 2012). By bringing together those disparate worlds, he and his team were able to craft a solution that dramatically reduced fraud losses.

Where Do Data Scientists Come From?

Although there still is disagreement about the use of science in the name, data science is becoming less of a controversial issue. Real scientists use tools made by other scientists, or they make them if they don’t exist, as a means to expand knowledge. That is exactly what data scientists do. Experimental physicists, for example, have to design equipment, gather data, conduct multiple experiments to discover knowledge, and communicate their results. Even though they may not be wearing white coats and may not be living in a sterile lab environment, data scientists have a similar role to experimental physicists in that they use creative tools and techniques to turn data into actionable information for others to use for better decision making.

There is no consensus on what educational background a data scientist has to have. The usual suspects, like an MS or PhD in computer science, MIS, industrial engineering, or, more recently, analytics, may be necessary but not sufficient to call someone a data scientist. One of the most sought-after characteristics of a data scientist is expertise in both technical and business application domains. In that sense, the role somewhat resembles the professional engineer (PE) or project management professional (PMP) roles, where experience is valued as much as (if not more than) technical skills and educational background. It would not be a huge surprise to see in the next few years a certification specifically designed for data scientists.

Because it is a profession for a field that is still being defined and many of whose practices are still experimental and far from being standardized, companies are overly sensitive about the experience dimension of data scientists. As the profession matures and practices are standardized, experience will be less of an issue in the definition of a data scientist. Today, companies are looking for people who have extensive experience working with complex data and have had good luck recruiting among those with educational and work backgrounds in the physical or social sciences. Some of the best and brightest data scientists have been PhDs in esoteric fields like ecology and systems biology (Davenport & Patil, 2012). Even though there is no consensus on where data scientists come from, there is a common understanding of what skills and qualities they are expected to possess. Figure 7.4 shows a high-level graphical illustration of skills that a data scientist needs.


Figure 7.4 Skills That a Data Scientist Needs

Data scientists need to have soft skills such as creativity, curiosity, communication/interpersonal, domain expertise, problem definition, and managerial skills (shown on the left side of Figure 7.4). They also need sound technical skills such as data manipulation, programming/hacking/scripting, and Internet and social media/networking technologies (shown on the right side of the figure).

Big Data and Stream Analytics

Along with volume and variety, as we have seen earlier in this chapter, one of the key characteristics defining Big Data is velocity, which refers to the speed at which the data is created and streamed into the analytics environment. Organizations are looking for new means to process streaming data as it comes in so that they can react quickly and accurately to problems and opportunities to please customers and gain competitive advantage. In situations where data streams in rapidly and continuously, traditional analytics approaches that work with previously accumulated data (i.e., data at rest) often either arrive at the wrong decisions because they use too much out-of-context data, or they arrive at the correct decisions but too late to be of any use to the organization. Therefore, it is critical for a number of business situations to analyze data soon after it is created and/or as soon as it is streamed into the analytics system.

A presumption that the vast majority of modern businesses live by today is that it is important to record every piece of data because it might contain valuable information now or sometime in the near future. However, as the number of data sources increases, the store-everything approach becomes harder and harder to sustain and, in some cases, even infeasible. In fact, despite the technological advances, current total storage capacity lags far behind the digital information being generated in the world. Moreover, in the constantly changing business environment, real-time detection of meaningful changes in data, as well as of complex pattern variations within a given short time window, is essential in order to come up with actions that better fit the new environment. These facts are the main triggers for a paradigm called stream analytics. The stream analytics paradigm was born as an answer to two challenges: unbounded flows of data that cannot be permanently stored and then analyzed in a timely and efficient manner, and complex pattern variations that need to be detected and acted upon as soon as they happen.

Stream analytics (also called data-in-motion analytics and real-time data analytics) is a term commonly used for the analytic process of extracting actionable information from continuously streaming data. A stream can be defined as a continuous sequence of data elements (Zikopoulos et al., 2013). The data elements in a stream are often called tuples. In a relational database sense, a tuple is similar to a row of data (i.e., a record, an object, or an instance). However, in the context of semistructured or unstructured data, a tuple is an abstraction that represents a package of data, which can be characterized as a set of attributes for a given object. If a tuple by itself is not sufficiently informative for analysis and a correlation or another collective relationship among tuples is needed, then a window of data that includes a set of tuples is used. A window of data is a finite number/sequence of tuples, and the window is continuously updated as new data becomes available. The size of a window is determined based on the system being analyzed. Stream analytics is becoming increasingly popular for two reasons: First, the time available to act on data keeps shrinking, and second, we have the technological means to capture and process the data while it is being created.
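A minimal sketch of the window idea follows: tuples arrive one at a time, a fixed-size window of the most recent tuples is maintained, and an aggregate over the window is recomputed as each new tuple arrives. The tuple fields (a timestamp and a meter reading) are hypothetical.

```python
from collections import deque

def windowed_average(stream, window_size):
    """Maintain a sliding window over a stream of (timestamp, value) tuples
    and yield the average of the values currently in the window."""
    window = deque(maxlen=window_size)    # oldest tuples fall out automatically
    for timestamp, value in stream:
        window.append(value)
        yield timestamp, sum(window) / len(window)

# Hypothetical stream of smart-meter readings: (timestamp, kilowatts).
readings = list(enumerate([3.1, 3.0, 3.2, 7.9, 8.1, 3.3, 3.1]))

for ts, avg in windowed_average(readings, window_size=3):
    print(f"t={ts}  window average = {avg:.2f} kW")
```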

Some of the most impactful applications of stream analytics have been developed in the energy industry, specifically for smart grid (electric power supply chain) systems. The new smart grids are capable not only of creating and processing multiple streams of data in real time to determine optimal power distribution that fulfills real customer needs but also of generating accurate short-term predictions aimed at covering unexpected demand and renewable energy generation peaks.

Figure 7.5 shows a depiction of a generic use case for streaming analytics in the energy industry (a typical smart grid application). The goal is to accurately predict electricity demand and production in real time by using streaming data that is coming from smart meters, production system sensors, and meteorological models. The ability to predict near-future consumption and production trends and detect anomalies in real time can be used to optimize supply decisions (e.g., how much to produce, what sources of production to use, how to optimally adjust production capacities) as well as to adjust smart meters to regulate consumption and enable favorable energy pricing.


Figure 7.5 A Use Case of Streaming Analytics in the Energy Industry

Data Stream Mining

Data stream mining, as an enabling technology for stream analytics, is the process of extracting novel patterns and knowledge structures from continuous, rapid data records. As we have seen in this book, traditional data mining methods require data to be collected and organized in a proper file format and then processed in an iterative manner to learn the underlying patterns. In contrast, a data stream is a continuous flow of an ordered sequence of instances that, in many applications of data stream mining, can be read/processed only once or a small number of times, using limited computing and storage capabilities. Examples of data streams include sensor data, computer network traffic, phone conversations, ATM transactions, Web searches, and financial data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery.

In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream, given some knowledge about the class membership or values of previous instances in the data stream. Specialized machine learning techniques (mostly derivatives of traditional machine learning techniques) can be used to learn this prediction task from labeled examples in an automated fashion. An example of such a prediction method has been developed by Delen et al. (2005), who gradually built and refined a decision tree model by using a subset of the data at a time.
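The decision tree approach of Delen et al. (2005) is not reproduced here; as a generic illustration of learning from a stream, the sketch below uses scikit-learn’s partial_fit interface, which updates a model one mini-batch at a time instead of requiring the full data set up front. The simulated data and its labeling rule are invented for the example.

```python
# A generic sketch of incremental (online) learning over a data stream using
# scikit-learn's partial_fit interface. The streaming data is simulated, and
# this is an illustration of the idea, not the Delen et al. (2005) algorithm.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()                   # linear classifier trained online
classes = np.array([0, 1])                # all classes must be declared up front

def next_batch(batch_size=100):
    """Simulate one mini-batch of labeled tuples arriving from the stream."""
    X = rng.normal(size=(batch_size, 3))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # hypothetical labeling rule
    return X, y

for _ in range(50):                       # each batch is seen only once
    X, y = next_batch()
    model.partial_fit(X, y, classes=classes)

X_new, y_new = next_batch()
print("accuracy on the newest batch:", model.score(X_new, y_new))
```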

References

Davenport, T. H., & D. J. Patil. (2012, October). “Data Scientist.” Harvard Business Review, pp. 70–76.

Dean, J., & S. Ghemawat. (2004). MapReduce: Simplified Data Processing on Large Clusters. http://research.google.com/archive/mapreduce.html (accessed May 2014).

Delen, D., M. Kletke, & J. Kim. (2005). “A Scalable Classification Algorithm for Very Large Datasets,” Journal of Information and Knowledge Management, 4(2): 83–94.

Higginbotham, S. (2012). As Data Gets Bigger, What Comes After a Yottabyte? http://gigaom.com/2012/10/30/as-data-gets-bigger-what-comes-after-a-yottabyte (accessed June 2014).

Issenberg, S. (2012, October 29). “Obama Does It Better,” Slate.

Kelly, L. (2012). Big Data: Hadoop, Business Analytics and Beyond. wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond (accessed June 2014).

Romano, L. (2012, June 9). “Obama’s Data Advantage,” Politico.

Russom, P. (2013). “Busting 10 Myths About Hadoop,” Best of Business Intelligence, 10: 45–46.

Samuelson, D. A. (2013, February). “Analytics: Key to Obama’s Victory,” ORMS Today, pp. 20–24.

Sharda, R., D. Delen, & E. Turban. (2014). Business Intelligence and Analytics: Systems for Decision Support. New York: Prentice Hall.

Shen, G. (2013, January–February). “Big Data, Analytics and Elections,” INFORMS Analytics Magazine.

Watson, H. (2012). “The Requirements for Being an Analytics-Based Organization,” Business Intelligence Journal, 17(2): 42–44.

Watson, H., R. Sharda, & D. Schrader. (2012). “Big Data and How to Teach It,” Workshop at AMCIS 2012, Seattle.

Zikopoulos, P., et al. (2013). Harness the Power of Big Data. New York: McGraw-Hill.
