Hadoop is a hot and trendy topic that is growing in popularity as a modern technology to tackle big data and the ever-increasing amount of data being generated. It is a new approach to managing all of your structured and semi-structured data, analyzing it, and delivering results. You can find many books, blogs, articles, and websites about Hadoop. In addition, there are many conferences focusing on Hadoop, and many vendors are jumping on the Hadoop bandwagon to develop, integrate, and implement this technology. Customers and prospects are excited because it is open source and is claimed to be low cost. I will modestly cover Hadoop at a high level in its simplest form: how it relates to analytics and data management, and how it can fit into your architecture. If you desire a more in-depth understanding of Hadoop, I kindly suggest additional resources from the Internet and software vendors such as Hortonworks or Cloudera. The following topics will be covered in this chapter:
The history of Hadoop is fluid. According to my research, the fundamental technology behind Hadoop was invented by Google, with a tremendous amount of influence from Yahoo!. The underlying concept for this technology is to conveniently index and store all the rich textual and structural data being assimilated, analyze the data, and then present noteworthy and actionable results to users. It sounds pretty simple, but at that time, in the early 2000s, there was nothing in the market to process massive volumes of data (terabytes and beyond) in a distributed environment quickly and efficiently. Yet very little of this information is formatted in the traditional rows and columns of conventional databases. Google's innovative initiative was incorporated into an open source project called Nutch, which was then spun off in the mid-2000s into what is now known as Hadoop. Hadoop is the brainchild of Doug Cutting and Mike Cafarella, who continue to be influential today in open source technology. There are many definitions of what Hadoop is and what it can do. However, I try to use the most simplistic definition. In layman's terms, Hadoop is open source software to store lots of data in a file system and process massive volumes of data quickly in a distributed environment. Let's break down what Hadoop provides:
To break it down even further, MapReduce coordinates the processing of (big) data by executing the various tasks in parallel across a distributed network of machines and manages all data transfers between the various parts of the system. Figure 4.1 illustrates a simplistic view of the Hadoop architecture, where MR stands for MapReduce.
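The coordination of map and reduce tasks described above can be sketched in plain Python. This is a minimal, single-machine simulation of the canonical word-count example, intended only to show the three logical phases; a real Hadoop cluster distributes these same steps across many machines:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for every word in the input split
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data on hadoop", "hadoop stores big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["hadoop"])  # 2
```

In real MapReduce, each mapper runs on the machine holding its slice of the data, and the shuffle moves only the intermediate key/value pairs over the network.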
Hadoop has grown over the years and is now an ecosystem of products from the open source and vendors communities. Today, Hadoop's framework and ecosystem of technologies are managed and maintained by the nonprofit Apache Software Foundation (ASF), a global community of software developers and contributors.
Big data is among us, and the amount of data produced and collected is overwhelming. In recent years, due to the arrival of new technologies, devices, and communication types such as social media, the amount of data produced is growing rapidly and exponentially every year. In addition, traditional data is also growing, which I consider to be part of the data ecosystem. Table 4.1 shows the data sources that are considered big data and the description of the data that you may be collecting for your business.
Table 4.1 Big Data Sources
Data Type | Description |
Black Box | Data from airplanes, helicopters, and jets that capture voices and recordings of the flight crew and the performance information of the aircraft. |
Power Grid | Power grids data contain information about consumption by a particular node and customer with respect to a base station. |
Sensor | Sensor data come from machines or infrastructure such as ventilation equipment, bridges, energy meters, or airplane engines. They can also include meteorological patterns, underground pressure during oil extraction, or a patient's vital statistics during recovery from a medical procedure. |
Social Media | Sites such as Facebook, Twitter, Instagram, YouTube, etc., collect a lot of data points from posts, tweets, chats, and videos by millions of people across the globe. |
Stock Exchange | The stock exchange holds transaction data about the “buy” and “sell” decisions made by customers on shares of different companies. |
Transport | Transport data encompasses model, capacity, distance, and availability of a car or truck. |
Traditional (CRM, ERP, etc.) | Traditional data in rows and columns coming from CRM, ERP, financials about a product, customer, sales, etc. |
In addition to the data sources mentioned, big data also encompasses everything from call center voice data to genomic data from biological research and medicine. Companies that learn to take advantage of big data will use real-time information from sensors, radio frequency identification, and other identifying devices to understand their business environments at a more granular level, to create new products and services, and to respond to changes in usage patterns as they occur. In the life sciences, such capabilities may pave the way to treatments and cures for threatening diseases.
If you are collecting any or all of the data sources referenced in Table 4.1, then your traditional computing techniques and infrastructure may not be adequate to support the volume, variety, and velocity of the data being collected. Many organizations may need a new data platform that can handle exploding data volumes, variety, and velocity. They also need a scalable extension for existing IT systems in data warehousing, archiving, and content management. Others may need to finally get business intelligence value out of semi-structured data. Thus, Hadoop was designed to fulfill these needs with its ecosystem of products. Hadoop is not just a storage platform for big data; it's also a computational platform for analytics. This makes Hadoop ideal for companies that wish to compete on analytics, as well as retain customers, grow accounts, and improve operational excellence via analytics. Many firms believe that, for companies that get it right, big data will unleash new organizational capabilities and value that will increase their bottom line.
Before we get into how Hadoop fits into the modern architecture, let's spend a few minutes on the traditional architecture. In the traditional architecture, you likely have a configuration that consists of a data warehouse storing the structured data, which can be accessed and analyzed using your analytical and/or business intelligence tools. Figure 4.2 shows the typical and traditional architecture.
A data warehouse is good at providing consistent, integrated, and accurate data to the users. It can deliver in minutes the summaries, counts, and lists that make up the most common elements of business analytics. The combination of business analytics and the massively parallel processing of a data warehouse provides much of the self-service data access to users, as well as interactivity. Providing self-service capability is a huge benefit in itself; the ability to interact with the data can teach the user more about the business and can produce new insights.
Another advantage of the warehouse is concurrency. Concurrency in a data warehouse ranges from tens to thousands of people simultaneously accessing and querying the system. The massively parallel processing engine behind the database supports this capability with a superior level of performance and efficiency.
As your business evolves, the data warehouse may not meet the requirements of your organization. Organizations have information needs that are not completely served by a data warehouse. The needs are driven as much by the maturity of the data use in business as they are by new technology.
For example, the relational database at the center of the data warehouse is ideal for data processing that can be expressed via SQL. If the data cannot be processed via SQL, however, that limits the analysis of new data sources that are not in row-and-column format. Other data sources that do not fit nicely in the data warehouse include text, images, audio, and video, all of which are considered semi-structured data. This is where Hadoop enters the architecture.
Hadoop is a family of products (Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase, Mahout, Cassandra, YARN, Ambari, Avro, Chukwa, and Zookeeper), each with different and multiple capabilities. Please visit www.apache.org for details on these products. These products are available as open source from the Apache Software Foundation (ASF) and from software vendors.
Once the data are stored in Hadoop, the big data applications can be used to analyze the data. Figure 4.3 shows a simple stand-alone Hadoop architecture.
Hadoop: The Hadoop Distributed File System (HDFS), which is the data storage component of the open source Apache Hadoop project, is ideal for collecting the semi-structured data sources. (However, it can also host structured data.) For this example, it is a simple architecture to capture all of the semi-structured data. Hadoop HDFS is designed to run on less expensive commodity hardware and is able to scale out quickly and inexpensively across a farm of servers.
MapReduce is a key component of Hadoop. It is the resource management and processing component of Hadoop, and it allows programmers to write code that can process large volumes of data. For instance, a programmer can use MapReduce to locate friends or determine the number of contacts in a social network application, or to process web access logs to analyze web traffic volume and patterns. In addition, MapReduce can process the data where it resides (in HDFS) instead of moving it around, as is sometimes the case in a traditional data warehouse system. It also comes with a built-in recovery system, so if one machine goes down, MapReduce knows where to go to get another copy of the data. Although MapReduce processing is fast when compared to more traditional methods, its jobs must be run in batch mode. This has proven to be a limitation for organizations that need to process data more frequently and/or closer to real time. The good news is that with the release of Hadoop 2.0, the resource management functionality has been packaged separately (it's called YARN) so that MapReduce doesn't get bogged down and can stay focused on processing big data.
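To make the web-log use case concrete, here is a minimal sketch in the style of Hadoop Streaming, where the mapper and reducer are ordinary programs that exchange tab-separated text. The log format and field positions below are assumptions for illustration, and the driver simply simulates the framework's sort-and-group step locally:

```python
from itertools import groupby

def mapper(line):
    # Emit (url, 1) for each access-log line; the layout assumed here is:
    # client_ip - - [timestamp] "METHOD /path HTTP/1.1" status bytes
    parts = line.split('"')
    if len(parts) < 2:
        return  # skip malformed lines
    request = parts[1].split()
    if len(request) >= 2:
        yield (request[1], 1)

def reducer(key, values):
    # Sum the hit counts for one URL (values arrive grouped by key)
    return (key, sum(values))

# Drive the two phases locally on sample log lines
logs = [
    '10.0.0.1 - - [01/Jan/2024] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024] "GET /cart HTTP/1.1" 200 256',
    '10.0.0.3 - - [01/Jan/2024] "GET /home HTTP/1.1" 200 512',
]
pairs = sorted(p for line in logs for p in mapper(line))
hits = dict(reducer(k, (v for _, v in grp))
            for k, grp in groupby(pairs, key=lambda p: p[0]))
print(hits)  # {'/cart': 1, '/home': 2}
```

On a real cluster, the mappers would run in parallel on the HDFS nodes holding each slice of the log, which is exactly the "process the data where it resides" property described above.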
Big data applications: This is the analysis and action component using data from Hadoop HDFS. These are the applications, tools, and utilities that have been natively built for users to access, interact, analyze, and make decisions using data in Hadoop and other nonrelational storage systems. It does not include traditional BI/analytics applications or tools that have been extended to support Hadoop.
Since the inception of Hadoop, there has been a lot of noise and hype in the market as to what Hadoop can or cannot do. It is definitely not a silver bullet to solve all data management and analytical issues. Keep in mind that Hadoop is an ecosystem of products and provides multipurpose functions to manage and analyze big data. It is good at certain things and has shortcomings as well:
For the above reasons, I highly recommend integrating Hadoop with your data warehouse. The combination of Hadoop and the data warehouse offers the best of both worlds: managing structured and semi-structured data and optimizing performance for analysis. More scenarios and explanations will be described in the next chapter when I bring all of the elements together.
Now that you know what Hadoop can do, let's shift our focus on some of the best practices for Hadoop.
As I talk to customers about Hadoop, they share some dos and don'ts based on their experience. Keep in mind that there will be many more best practices as the technology matures into the mainstream and is implemented.
Now that you are aware of the technology and some of the best practices, let's check out some of the benefits, use cases, and success stories.
As described in the previous section, customers who implemented Hadoop saw business value and IT benefits. The benefits that follow are the most common ones gathered from talking to customers. As the technology becomes more mature, the benefits will continue to grow in business and IT.
A recent “Best Practices Report” conducted by The Data Warehousing Institute (TDWI) revealed some interesting findings for the usage of Hadoop across all industries.1 The report consisted of several surveys and one of them asked the participants to name the most useful application of Hadoop if their organization were to adopt and employ it. Following is a snapshot from the survey.
In my research, there are many public use cases and success stories that you can find on the web by simply searching for them. It is no surprise that Google and Yahoo! leverage Hadoop, because they were both involved in the development and maturation of this technology. I also discovered a plethora of companies across industries using Hadoop. Many of the use cases are very similar in nature and share a common theme: support of semi-structured data using HDFS and MapReduce.
Our first success story comes from an American corporation that provides online social networking services. With tens of millions of users and more than a billion page views every day, this company accumulates massive amounts of data and ends up with terabytes of data to process. This is probably far more than your typical organization handles, especially considering the amount of media it consumes. This corporation has leveraged Hadoop since its inception to explore the data and improve the user experience.
One of the challenges that this company has faced is developing a scalable way of storing and processing massive amounts of data. Collecting and analyzing historical data is a big part of this company's strategy to improve the user experience.
Years ago, it began playing around with the idea of implementing Hadoop to handle its massive data consumption and aggregation. In the early days of considering Hadoop, it was questionable whether importing some interesting data sets into a relatively small cluster of servers was feasible. Developers were quickly excited to learn that processing big data sets was possible with the MapReduce programming model, a capability that was previously out of reach due to the massive computational requirements.
Hadoop is a big hit at this organization. It has multiple Hadoop clusters deployed now, with the biggest having about 2,500 central processing unit (CPU) cores and 1 petabyte of disk space. This company is loading over 250 gigabytes of compressed data (over 2 terabytes uncompressed) into the Hadoop file system every day and has hundreds of jobs running daily against these data sets. The list of projects using this infrastructure has increased tremendously, from those generating mundane statistics about site usage to others being used to fight spam and determine application quality. A majority of the engineering staff has played with and run Hadoop at this company.
There are a number of reasons for the rapid adoption of Hadoop. First, developers are free to write MapReduce programs in the language of their choice. Second, engineers have embraced SQL as a familiar paradigm to address and operate on large data sets. Most data stored in Hadoop's file system are published as tables. Developers and engineers can explore the schemas and data of these tables much like they would do with a relational database management system (RDBMS). A combination of MapReduce and SQL can be very powerful to retrieve and process semi-structured and structured data.
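The pattern described above can be sketched with Python's built-in sqlite3 standing in for a Hive-style SQL engine (the table name, schema, and record format here are assumptions for the example): map-style code first parses semi-structured records into rows, which are then published as a table and queried with familiar SQL.

```python
import sqlite3

# Semi-structured input: raw event strings, one per line
raw_events = [
    "2024-01-01|login|alice",
    "2024-01-01|purchase|bob",
    "2024-01-02|login|alice",
]

# "Map" step: parse each record into structured fields
rows = [tuple(line.split("|")) for line in raw_events]

# Publish the parsed data as a table and explore it with SQL,
# much as engineers would with tables over HDFS
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, action TEXT, user TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

count = conn.execute(
    "SELECT COUNT(*) FROM events WHERE action = 'login'"
).fetchone()[0]
print(count)  # 2
```

The design point is the division of labor: procedural map code handles the messy parsing that SQL cannot express, and SQL handles the set-oriented aggregation that would be tedious to hand-code.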
At the corporate level, it is incredibly important that they use the information generated by and from their users to make decisions about improvements to the product and services. Hadoop has enabled them to make better use of the data at their disposal.
Our next customer is a global leader in consumer transaction technologies, turning everyday interactions with businesses into exceptional experiences. With its software, hardware, and portfolio of services, this company makes nearly 550 million transactions possible daily.
This company's motto is to help their customers respond to the demand for fast, easy, and convenient transactions with intuitive self-service and assisted-service options. But what they do goes beyond niche technologies or markets. Their solutions help businesses around the world increase revenue, build loyalty, reach new customers, and lower their costs of operations.
By continually learning about, and pioneering, how the world interacts and transacts, this global leader is helping companies not only reach their goals, but also change the way all of us shop, eat, travel, bank, and connect. Its customers include thousands of banks and retailers that depend on tens of thousands of self-service kiosks, point-of-sale (POS) terminals, ATMs (automated teller machines), and barcode scanners, among other equipment.
In the traditional architecture, this customer could only collect a snapshot of data, which limited the accuracy of its predictive models. For example, the company might collect snapshots from a sampling of ATMs (perhaps once-daily performance data) where the real need is to collect the entire set.
In addition, this company needed to avoid downtime for its clients at all costs. The traditional architecture could not collect millions of records from many data sources and analyze them in a timely manner to predict whether a device needed preventive maintenance, replacement, or upgrades. Downtime means that their clients would not be able to operate, and it hurts the reputation of the company. Furthermore, the traditional architecture could not store real-time performance data from every deployed device in banks and retailers and, therefore, could not analyze that data to detect and prevent downtime for these devices.
As mentioned in Chapter 1, Hadoop is used in the data exploration phase of the analytical data life cycle. A common use of Hadoop is as a landing zone for all data, a place to explore what you have before you apply any analysis to the data.
This customer considered Hadoop to be a landing zone for all data, as Hadoop is good at storing both structured and semi-structured data. Hadoop provides the technology to efficiently and effectively store large amounts of data—that includes data that they are not sure yet how they will use.
To resolve the downtime issue mentioned previously, the company is leveraging Hadoop as a “data lake,” where it has the capacity to gather and store real-time performance data from every deployed device in banks and retailers. This type of data is big not only in volume (millions of records) but also in the rate, or velocity, at which it is collected. From the data, this company can create rich predictive models using advanced analytics to detect and prevent downtime. Because it is able to analyze the entire set of data, not just a subset, it is able to build better predictive models. For example, the data lake built on Hadoop can monitor every ATM it manufactures and develop data models to predict failures. As a result of how this big data is collected and analyzed, downtime events are few and far between.
Another use for Hadoop is collecting a new data source, such as telephony data, to analyze the use of telepresence devices worldwide. When communications occur between offices, this customer is able to determine the most efficient path and advise where to put telephony equipment in the network. This is a purely operational use of the data, and a business analyst responsible for financial analysis of the product portfolio may never have considered looking at retail service calls. With Hadoop collecting the telephony data, this company is able to expand the scope of its queries and root cause analysis to see how a collocated printer, for example, might be shared between a point-of-sale device, an ATM, and a self-checkout device. The ability to leverage all the data in one place for analytics across lines of business, and to share analytical results, creates a culture of openness and information sharing.
Let's examine another customer success story using Hadoop.
Our next success story comes from a global payment services company that provides consumers and businesses with fast, reliable, and convenient ways to send and receive money around the world, to send payments, and to purchase money orders. With a combined network of over 500,000 agent locations in 200 countries and territories and over 100,000 ATMs and kiosks, this customer completed 255 million consumer-to-consumer transactions worldwide, moving $85 billion of principal between consumers and 484 million business payments. Its core business includes
This customer has been collecting structured data for many years. With the popularity of the Internet and web services, its customers are leveraging the online services to send money to other people, pay bills, buy or reload prepaid debit cards, and find the locations of agents for in-person services. To give you a sense of the volume, more than 70 million users tapped into their services online, in person, or by phone, and the company said it averaged 29 transactions per second across 200 countries and territories. Thus, web data were growing fast and needed to be captured and analyzed for business needs. Log-based clickstream data about user activity from the website were captured in semi-structured nonrelational formats, and then were mixed and integrated with the relational data to give this customer a true view of customer activity and engagement. All of the structured and semi-structured data were considered funnels for a data analysis pipeline. With the traditional infrastructure, only the structured data was captured, integrated, and analyzed. Now, there is a need to integrate large amounts of web data into the corporate workflow to provide a holistic view of operations.
The current infrastructure was unable to accommodate the business needs. The amount of data was so huge that analyzing it was a challenge and getting answers from massive amounts of data was very difficult. Because data are touched and shared across the enterprise, data scientists, business analysts, and marketing personnel were unable to rely on the existing technology for their data management and analytical needs.
At this money-transfer and payment services company, incorporating Hadoop into its enterprise operations also meant successfully integrating large amounts of web data into the corporate workflow. Significant work was required to make new forms of data readily accessible and usable by many staff members: data scientists, business analysts, marketing, and sales. Hadoop is the center of a self-service business intelligence and analytics platform that consists of a variety of other systems and software components. Hadoop feeds these applications the data they need, since it collects all of the semi-structured data.
This customer is adding data from mobile devices to its Hadoop environment and complementing it with online data from the Internet. Once the streams of data, structured and semi-structured, are collected, they are parsed and prepared for analysis using a collection of technologies:
The Hadoop system continues to grow in importance at this global payment company. The enterprise architecture that includes Hadoop is expanding into risk analysis and compliance with financial regulation, anti-money laundering, and other financial crimes. Hadoop is widely used and crucial to the analytics process.
There is no question that big data is here, and organizations are swimming in an expanding wealth of data that is either too voluminous or not structured enough to be managed and analyzed through traditional means and approaches. Hadoop is a technology that can complement your traditional architecture to support the semi-structured data that you are collecting. Big data has prompted organizations to rethink and restructure from the business and IT sides. Managing data has always been at the forefront, and analytics has been more of an afterthought. Now, it seems big data has brought analytics to the forefront along with data management. Organizations need to realize that the world we live in is changing, and so is the data. It is prudent for businesses to recognize these changes and react quickly and intelligently to gain the competitive advantage and the upper hand. The new paradigm is no longer stability; it is now agility and discovery. As the volume of data explodes, organizations will need a platform to support new data sources and analytic capability. As big data evolves, the Hadoop architecture will develop into a modern information ecosystem: a network of internal and external services continuously allocating information, optimizing data-driven decisions, communicating results, and generating new, strategic insights for businesses.
In the next chapter, we will bring it all together, integrating Hadoop with the data warehouse and leveraging in-database analytics and in-memory analytics in an integrated architecture.