In Chapter 2, we discussed in-database processing and the value of applying analytics where the data reside. In Chapter 3, we explained in-memory analytics and how data can be lifted into memory for game-changing results. In Chapter 4, we introduced Hadoop and how it can fit into the data management and analytical landscape. Each of the previous chapters is like a leg of a relay race, and in this chapter we tie it all together to help you through the stages—crawl, walk, and sprint. Now, let's bring all of the components together (in-database, in-memory, Hadoop) into what I call a collaborative data architecture so that your organization can run a seamless relay to conquer big data management and analytics.
This chapter will cover the following topics:
Historically, many companies have adopted one data management system to store all their data in the form of an enterprise data warehouse or departmental data mart. This single-system architecture may have met the business requirements when the data were mainly structured. Analytics and reporting applications could access the data, analyze them, and deliver results to management on request. As time progressed, three things have evolved:
Analytics workloads involve extensive calculations and are both compute-intensive and read-intensive, which can place a heavy burden on the data warehouse. The process normally requires retrieving entire data sets, not just a subset. Most analytic workloads are executed in batch mode, and the results from these workloads are fed to various downstream applications such as business intelligence, reporting, and/or visualization.
As the analytic workloads grow bigger with more complex computations and many more variables to consider, organizations are stretching their single data warehouse environment to its capacity. Businesses have three options:
Combining these options will give you the optimal environment for existing and new use cases to meet the business requirements of data-intensive analytical projects. However, a single data management system might not be the best fit for managing both structured and semi-structured data.
Your current data warehouse may scale well to support complex analytics and data models and access from hundreds of users with varying workloads in a rigorously controlled, highly governed, and managed environment, but there are classes of data that are not likely to benefit from residing in the data warehouse. You may need to recognize that the enterprise data warehouse is not always the ideal place for certain classes of data. For example, your data warehouse might allow semi-structured data, log files, or streaming sensor outputs to be stored in the form of binary large objects (BLOBs), but placing them there might not provide any analytic lift for the statisticians or data scientists. BLOBs are capable of storing virtually any content, but if your end goal is to support parsing, text analytics, and keyword lookups, then storing that data in another platform may be more suitable for optimal use.
To accommodate the various classes and types of data, the collaborative data architecture is ideal for integrating data management and analytics within a unified platform, as shown in Figure 5.1. Most likely, your data sources come from your ERP, CRM, images, audio, visual, logs, text, web, and social media. As the data sources are captured, they can be stored in any or all of the three platforms shown.
Collaborative Data Architecture provides three distinct, purpose-built data management platforms, each integrated with the others, intended for specialized needs for your data.
The staging warehouse platform is intended to be an economical platform to capture and store all your data types. This data may not be fully structured and does not yet need to be accessed for operational reporting across the enterprise. Frequently, the platform is used to collect sets of enormous files, such as web logs and machine sensor outputs, as well as semi-structured data, all of which have some analytic value. Most importantly, this data just needs to be stored in a large clustered file system.
This platform is ideal for supporting highly flexible data capture and staging for all data sources, including structured and semi-structured. For example, you can leverage the Hadoop technology for pre-processing of large volumes of data and preparing the data to be stored in the data warehouse or data mart, or be used by any of the applications such as analytics, business intelligence, or marketing.
This platform is a subset of the data warehouse that is usually purpose-built for a specific department, group, business team, or line of business. Because it is a subset of the data warehouse, the data mart improves end-user response time by giving users access to the specific type of data they need.
The fit-for-purpose data mart could be intended for specific analytic processing, organized into functional areas such as marketing, operations research, sales operations, and so on. In this case, there is a trade-off: we drop some of the user concurrency requirements in exchange for high levels of analytic flexibility.
For example, some customers leverage a data mart just for demand forecasting. All of the data residing in this data mart can be used by a select group of data modelers and data scientists who build statistical models to anticipate supply and demand for products and services. Another example is in the pharmaceutical industry, where a data mart is created for a specific drug development analysis. There are many variables and data points to analyze in bringing a product to market. In the pharmaceutical industry, strict standards and regulations are in place and must be followed before a drug is available to consumers. Thus, this data mart is highly governed and accessible only by a small group of data scientists for analytics and modeling purposes.
This platform is for delivering strategic and operational analytics throughout your organization, so users from across the company can access a single source of consistent, centralized, and integrated data.
The enterprise data warehouse is targeted at rigorous, highly structured, “ready for prime time” data that can be used by business and analytic users across all departments of the enterprise. The data warehouse addresses enterprise data readiness concerns, not only with the ability to scale to support more users, a wider range of user applications, or larger volumes of data, but also to anticipate potential failures, keep running in the event of problems, and provide a management and monitoring environment that ensures continual support of data access from around the business. It is ideally suited for modeled and integrated data to support end-to-end optimization and operationalization.
Many customers already utilize an enterprise data warehouse to gather data from multiple operational data sources, integrated together to provide a “single version of the truth.” The integrated data in the data warehouse may feed hundreds of downstream applications and thousands of business users. However, much if not all of the data are generally relational; organized into tables, rows, and columns; and accessible by SQL. Depending on the maturity of the data you are working with, other platforms may be better suited, such as the staging warehouse or the fit-for-purpose data mart.
Larger companies may utilize the different aspects of the collaborative data architecture side by side, each offering a special-purpose perspective on the end-to-end data life cycle, evolving and maturing data from its “staged” landing area on Hadoop right up to the “ready for action” enterprise data warehouse environment.
The ultimate goal for the collaborative data architecture is to enable organizations to access and analyze all their data to make smarter decisions and develop a new, data-driven approach to improving their business and the bottom line. The ability to apply analytics and to deliver deeper insights about your customers, operations, and services can deliver a unique competitive advantage. Let's examine in the next section how data and analytics can be managed in the different scenarios.
In Chapter 4, I illustrated the basic architecture of how structured data can be stored in the data warehouse and semi-structured data can be stored in Hadoop. Let's expand on the use and tie it to the collaborative data architecture.
In our first scenario, we use Hadoop as a staging platform for your enterprise data warehouse, as shown in Figure 5.2.
If your organization is just collecting structured data, this scenario can be ideal for you. In this setup, it is a simple case of capturing and managing structured data and integrating with Hadoop. Hadoop can be used as a data staging warehouse or platform to load data into your enterprise data warehouse. Instead of staging the data at the data warehouse level, you can apply the data integration process at the Hadoop platform. According to The Data Warehousing Institute (TDWI) “Best Practices Report” on Hadoop, 39 percent of the participants surveyed indicated that they use Hadoop for their data staging area for data warehousing and data integration.1
As mentioned in Chapter 4, Hadoop comes with Hadoop Distributed File System (HDFS) and MapReduce. You can leverage MapReduce by writing some code to bring the data into HDFS, transform it, and then load the integrated data into the enterprise data warehouse. In this situation, you can choose the data that you want to transform and load into the data warehouse, and the remaining data can be kept in Hadoop until it is ready to be transformed. This approach has several benefits:
If your company is capturing semi-structured data sources and you want to integrate them with your structured data, then Hadoop can help to manage all your data sources. In Figure 5.3, consider Hadoop to be your data lake for all your data sources as they are captured.
The infrastructure in Figure 5.3 is very similar to Figure 5.2, but in this instance, you are incorporating semi-structured data into the mix. Hadoop can act as your data lake—a temporary data storage for all your data sources. In this circumstance, you can integrate the data that is not currently available in your enterprise data warehouse. For example, some of your structured data sources coming into Hadoop may not be integrated into your enterprise data warehouse. In addition, you can now include the semi-structured data to provide additional insight about your customers, products, and services. Because Hadoop can store any data source, it can be complementary to the enterprise data warehouse. According to the TDWI “Best Practices Report,” 46 percent of the respondents indicate that they use Hadoop as a complementary extension of a data warehouse.2 In the same report, 36 percent of the participants say that they use Hadoop as a data lake. You should consider this architecture if you want to integrate structured and semi-structured data in one location.
In this situation, your enterprise data warehouse is the center of your universe and the key component for your analytics applications and business intelligence tools to access for analysis. This arrangement is geared toward organizations that want to keep the enterprise data warehouse as the de facto system of data. Hadoop is used as the data lake to capture all data sources and process and integrate them before they get loaded into the enterprise data warehouse.
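The staging pattern just described (capture everything in Hadoop, transform it there, and load only the needed subset into the enterprise data warehouse) can be sketched in a few lines. This is a minimal local illustration in plain Python standing in for a MapReduce job; the comma-separated record format and the field names are invented for the example, not taken from any real system.

```python
# Minimal local sketch of the Hadoop staging pattern: transform raw
# records, load a chosen subset into the warehouse, and keep the rest
# staged until it is needed. Plain Python stands in for MapReduce;
# the log format and field names are illustrative assumptions.

def transform(raw_line):
    """Map step: parse one raw log line into a structured record."""
    user_id, timestamp, amount = raw_line.strip().split(",")
    return {"user_id": user_id, "ts": timestamp, "amount": float(amount)}

def stage_and_load(raw_lines, load_predicate):
    """Transform every record, but load into the warehouse only the
    subset we need now; the rest stays staged in Hadoop."""
    warehouse, staged = [], []
    for line in raw_lines:
        record = transform(line)
        (warehouse if load_predicate(record) else staged).append(record)
    return warehouse, staged

raw = ["u1,2015-01-02,19.99", "u2,2015-01-02,5.00", "u1,2015-01-03,42.50"]
loaded, kept = stage_and_load(raw, lambda r: r["amount"] >= 10.0)
```

In a real deployment, the transform would run as a MapReduce (or similar) job across the cluster, and the load step would be a bulk load into the warehouse; the point is only that selection and transformation happen in the staging layer, not in the warehouse.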
Benefits for this scenario include the following:
In our next scenario, the environment gets a bit more complex to provide you the flexibility for the enterprise and departmental needs, as shown in Figure 5.4.
Let's expand the setup to include a data mart for data exploration and discovery. In this scenario, you can use Hadoop as the landing/staging platform and exploit the strengths of both Hadoop and the data warehouse. The concept of the data warehouse has existed for over 30 years, and this infrastructure shifts the data management paradigm in a major way. After organizations have spent years of resources and financial support building their enterprise data warehouses, Hadoop pops up as an emerging technology and disrupts the way data management has been conducted. This scenario focuses on the strengths of each technology so that organizations can take further advantage of their data.
Figure 5.4 illustrates step-by-step how it could work for your organization:
In Figure 5.4, benefits include the following:
Keep in mind that these three scenarios are just a subset of all the combinations possible when integrating analytics and data management with Hadoop, the enterprise data warehouse, and the data mart. Each setup enables companies to become more data and analytics focused:
Now that the architecture is explained and laid out, let's examine how in-database, in-memory, and Hadoop can coexist and be complementary.
Vendors offer in-database, in-memory, and Hadoop technologies. The sophistication and level of integration varies, depending on the vendors. For example, SAS and Teradata offer the end-to-end suite of technologies to enable in-database, in-memory, and Hadoop technologies for analytics and data management needs.
Let's start with in-database processing, which is offered at all stages of the analytical data life cycle. More technologies have been developed in this area because it is the most mature and offers the most bang for the buck. With reference to the collaborative data architecture, in-database processing can be enabled in all three platforms: staging, warehouse, and data mart.
As the name suggests, in-database technology is most used with a data warehouse environment. In Figure 5.5, vendors offer in-database software to help streamline the analytical process.
If your organization relies heavily on the data warehouse as the data universe, then applying the in-database processing where the data are makes logical and economic sense.
Some of the analytical vendors have partnered with the database vendors to integrate the analytical capabilities inside the database. These analytical vendors have tailored their software to run directly on the database system so you can process the data faster. For example, you can explore the data, prepare the data, develop the data model, and deploy the data model directly at the database level. Rather than having to extract a massive data set from your enterprise data warehouse and move it to an analytical server for analysis, now you can run and optimize these capabilities directly within the database. Your data are secured and the results are less error prone because you do not have to pull, copy, and analyze outside the database. Because your data warehouse has the most current data available, you can trust the results of your analysis.
You may recall that Figure 5.5 is the analytical data life cycle from Chapter 1. In-database processing can be applied to all stages of the life cycle. The processes include the following:
If your company has Hadoop, or is considering it as part of the mix, you can prepare the data in Hadoop. In this case, your data warehouse workload can be offloaded to share processing with Hadoop.
In Figure 5.6, Hadoop can capture all of your data, and the data can be explored directly. There are a number of vendors in the market that offer tools to explore data, both structured and semi-structured. Once you decide what data you want to load into the data warehouse, you can then prepare the data and develop and deploy your data model—all inside the database.
To further spread the workload away from the warehouse, you can add in-memory processing on a dedicated appliance.
In Figure 5.7, Hadoop is not part of the architecture. Although Hadoop is popular and attractive, many customers are concerned about including it in a production environment. In this scenario, you can:
In-memory analytics is very powerful, as described in Chapter 3. If your organization has large data volumes to be considered in a complex data model, in-memory analytics can quickly display all of your data sources within a user interface. As illustrated in Figure 5.7, data are lifted into memory for visualization and analysis. The same is true as you develop complex data models. The appliance that is connected to the data warehouse is dedicated to processing it in-memory to achieve super-fast performance. The combination of in-database and in-memory can provide end-to-end data management and analytics within an integrated environment (such as the collaborative data architecture).
Not all functions are available using in-database, in-memory, or Hadoop, exclusively, so you may need to combine these technologies along with your current infrastructure to make them work and complement each other for your data management and analytic needs.
The scenario in Figure 5.8 includes all of the technologies to help you:
Scenarios shown in Figures 5.5 to 5.8 have been implemented by customers in various industries. Let's examine some of these scenarios with use cases and customer stories.
I want to share with you three use cases and customer success stories that adopt one of the three configurations using in-database, in-memory, and/or Hadoop. As you can see, each customer success story comes from a different industry with a common thread, which is optimizing analytics and data management in an integrated platform.
The first success story is a large retailer based in the United States. This customer is a pioneer of the one-stop shopping concept. The retailer has hundreds of stores across the United States with thousands of employees committed to excellent customer service, low prices, quality products, and a broad selection of brands. Many of its stores are now open 24 hours a day, and most are open 364 days a year. It offers brick-and-mortar and online services to enhance the customer shopping experience.
In recent years, the retail industry has evolved and gone through some of the most dramatic changes. This brick-and-mortar business is competing with many channels, especially with other large e-commerce businesses. Today, consumers have so many choices to buy products and services across dozens of retail channels such as drugstores, dollar stores, limited assortment chains, and local mom-and-pop boutique shops.
This one-stop shop sees some top trends that are changing the ways it does business:
A study conducted by A.T. Kearney found that nearly 40 percent of retailers with loyalty programs fail to use all of the data to fully understand their customers.5 Even when they do, some retailers who gained that insight failed to ever translate it to action.
With changing times in an evolving market, this customer created an initiative to evaluate data mining tools and advanced analytics to meet the business needs. Specifically, the project focuses on the ability to analyze customers and household purchasing data at the granular level. The project has five goals:
The competitive nature of the business and the evolution of the retail industry have directed this customer to expand its reliance and use of analytics for insights that provide a competitive edge.
As the company yearns to adopt advanced analytics with a focus on predictive analytics, it needed a new, modern architecture and innovative technologies to support it. The process and journey led to the adoption of in-database and in-memory processing, after several years of evaluating and exploring different technologies and vendors.
In the traditional environment, all of the business analysts, data scientists, and statisticians have to do their analytics at their desks. There are a number of challenges with this process:
Figure 5.9 shows the process. As illustrated, there are many steps that consume both time and resources. The customer can only transfer a sample of the data set from the data warehouse to the desktop to explore, develop models, and score. Analyzing a sample of the data does not give you the big, holistic picture of the customer data and profile. In addition, data-driven decisions are only partially informed because the analysis comes from a small segment. Having a partial view of customer data is not effective when it comes to executing marketing and sales campaigns for the customer loyalty program. With this architecture, the team can develop and score fewer than five data models per week. This customer wants to expand its analytical capabilities to develop and score tens of models and eventually hundreds of models.
By working closely with the customer and understanding their goals, the direction was to explore and adopt a combination of in-database and in-memory technologies, as shown in Figure 5.10.
The shift for this organization is going from basic reporting to richer descriptive and predictive analytics. This retailer wants to leverage descriptive analytics to answer questions such as:
Descriptive analytics assesses past performance by mining historical data to look for the reasons behind the past success or failure of a decision. In this case, the one-stop shop wants to use descriptive analytics for business and management reporting—such as sales, marketing, operations, and finance.
Predictive analytics takes a leap beyond descriptive analytics. As the name suggests, it attempts to forecast an outcome based on data analysis. It combines historical performance from descriptive analytics with rules, algorithms, and various external data sources to determine the probable outcome of an event or the likelihood that a situation will occur. As with descriptive analytics, organizations apply predictive analytics to answer complex questions such as:
Predictive analytics is more sophisticated and requires involvement from statisticians and/or data scientists who can develop algorithms or data models to answer such difficult questions. As previously mentioned, this customer had dabbled in model development, but with too limited a scope and too little data to truly call it predictive analytics.
With that objective in mind, the customer wanted an analytical tool that is capable of delivering descriptive and predictive analytics and well integrated with the data warehouse. In Figure 5.10, it got just that. In this success story, the details will be around a specific focus, which is their loyalty program.
Most retailers, especially grocers, offer loyalty programs. Any loyalty program generates more data and, the retailer hopes, more insight about consumers. Loyalty programs provide a win-win, rewarding both the retailer (data) and consumers (in-store discounts). However, it is all about translating this data to understand “individual” purchase behaviors, and behaviors outside of the purchase, to gain the true meaning of the word loyalty.
Let's examine the process in detail to see how this retailer uses advanced analytics to improve its loyalty program by providing personalized offers and rewards to their customers:
When preparing the data for analytics, the analysts do the following:
The end result of this data preparation process provides the statistician and data scientist an analytical data set (ADS). Refer to Chapter 2 for more details about ADS. The streamlined process with the new architecture from Figure 5.10 shows the technologies adopted and what tasks are performed with the complemented and integrated environment.
In this scenario, data exploration allows the business analysts to choose which data they want to include in the data model development process.
Data model development is executed on a dedicated analytics appliance whose primary purpose is to develop complex data models with large volumes of data and many variables. With a dedicated analytics appliance, this retailer is able to develop many more data models compared to the traditional architecture; currently, it is producing hundreds of models for its loyalty program. All of the power inside this appliance is dedicated to processing the analytics. The advantage of this setup is that it alleviates some of the burden on the enterprise data warehouse, because the model development process runs on separate hardware. The team of statisticians and data scientists can test their hypotheses and instantly know whether the data models are working or failing: an easy way to test combinations of variables and data.
Compared to the traditional process, in which model development took days or weeks due to the cumbersome workflow, the streamlined process completes model development in minutes or hours, depending on the complexity of the model. In addition, this architecture enables a highly governed environment because all of the data used to build the models come from the data warehouse.
The ability to execute many of the functions in the database and have a dedicated analytic appliance has truly streamlined the end-to-end process for this retailer.
This retailer has made significant human investment. The plan is to hire more data scientists and statisticians to expand the support for enterprise analytics. With the expanded staff, it intends to develop many more data models, beyond hundreds of models to ultimately deliver superior customer experience through its loyalty program.
The next success story uses Hadoop as an extension to the data warehouse and uses in-memory analytics for exploring data to detect fraud for the government.
Information systems for governance and public administration have been a high-priority concern in almost all countries. Governance and public administration cannot be productive, effective, and efficient without the support of modern data management and analytic technology. The use of analytics to analyze data and distribute information has become an absolute requirement for public administration and management development. Governments tend to partner with local agencies to provide advisory services and technical assistance in this important area. In practice, however, effective use of information technology and successful development of analytical systems in the public sector are not simple, as substantiated by the numerous government digitization and information systems projects that have failed across the world.
The tremendous speed with which information technology is changing causes misperception, delays, and false expectations for these solutions by many decision makers and government officers. Like the private sector, the public sector has similar issues:
All of the above often cause information projects to fail, and at the heart of the failure is the combination of the technical and management sides of government. Understanding the purpose of a government information system and how to develop it successfully is fundamental for government decision makers and managers responsible for these critical areas. The critical issues include: appreciation and understanding of advanced analytics and data management, awareness of trends in modern information technologies and their impacts on development strategies, and cognizance of government information infrastructure and information systems development. Similar to the private sector, the public sector is recognizing the value of promoting information technology for use in public administration. In particular, analyzing data to detect fraud is a major focus in the public sector, particularly for government agencies seeking to reduce costs and the waste of taxpayers' money.
Government programs and operations are vulnerable to a wide variety of criminal activities. Fraud, if left unchecked and undetected, can become a major issue and seriously undermine the public's perception of and confidence in their government. Taxpayers rightly demand that public officials wisely use and preserve the money and power entrusted to their government. Social benefit programs such as health care, government loans, and grant programs are all exposed to false-claims fraud. Individual and organized acts of fraud all add up to huge losses for the government.
Unfortunately, government officials and investigators are largely in reactive mode, chasing the money and the fraudsters after the act has been executed. Governments must do a much better job of protecting the integrity of these social programs by proactively stopping fraud before the money exits the account and before it becomes widespread and organized. Too often, programs are constrained by antiquated or archaic information technology systems. Modernization of these systems should include sophisticated data management and analytics that enable better, faster detection and fraud-prevention controls to regain the public's faith and trust.
This customer is an international information and communication technology company serving the ministry of economy and finance in a European country. With more than 30 years of service, it is committed to modernizing the public sector by simplifying administrative procedures and increasing integration between public administrations. The company promotes development and innovation through technology to ensure the unity and effectiveness of strategic data-driven actions for the government. With thousands of employees, it has designed and built information systems for taxation, self-service analytics, and applications for government departments. The company has a cutting-edge technology infrastructure and network supporting nearly 100,000 workstations and direct connections with external entities, citizens, businesses, and professionals. Through these efforts, it has created advanced decision-making tools supporting economic and financial policy from a centralized but complex database system. This innovative company cooperates with its customers to ensure high standards of quality and safety, and invests heavily in technology, training, and research and development to prepare for the digital age of big data. Its focus on data management and analytics delivers concrete opportunities for growth and the rationalization of spending to enhance the efficiency of public information systems.
The customer collaborates with the local authorities for assistance in developing a system to better identify where, when, and how fraud occurs. Working side-by-side, the agency identifies the issues and challenges with the current infrastructure:
Let's examine how this international IT company is helping the local government to improve its process with cutting-edge technology and innovation.
The goal of the IT company is to promote and consolidate a culture of transparency and integrity for the public administration. By understanding their needs, it is architecting a system to conduct activities to improve performance, economics, and governance by identifying the areas that are most at risk and providing support for acquiring a growing awareness around fraud. This company is taking strong measures to prevent and repress criminal acts in the public administration around fraudulent activities. The development of advanced technological solutions will help to achieve its goal (see Figure 5.11).
Data sources for this government agency are structured, mostly traditional rows and columns. The data warehouse is the definitive place for all of the data; the enterprise data warehouse has been built up with all kinds of data over decades. Data in the warehouse include census information, household data, income, tax collection, and health status, for example. Instead of placing all of the analytical functions and processes on the data warehouse, the agency has adopted Hadoop to help alleviate some of the workload and augment performance. During peak hours of operation, hundreds of users access the data warehouse for various queries and analysis. Hadoop in this scenario extends the value and purpose of the data warehouse by storing the data used for fraud-specific analysis. The data from the data warehouse are copied into Hadoop to prepare for in-memory data exploration. This is where predictive analytics comes into play. The data are lifted into memory to do the following:
By selecting the right technology and consulting services to integrate data management and analytics, the combination of enterprise data warehouse, Hadoop, and in-memory data exploration has allowed the public administration process to become streamlined and effective. The agency is now able to capture anomalies for fraud detection. It can compare the cost of generic drugs against brand-name products for best- and worst-case behavioral analysis to improve the health care program. It has developed an external web application that lets all types of users within the government agency gather more information and define outliers or anomalies to identify housing tenancy fraud, procurement fraud, payroll fraud, and pension fraud. More applications are being built as the agency adopts a deeper analytical approach.
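To make the anomaly-capture idea concrete, here is a minimal sketch of the kind of in-memory comparison described above, flagging providers whose brand-name drug spend is out of line with their generic spend. The data and the threshold rule are hypothetical illustrations, not the agency's actual business rules; in the real system the records would be copied from the warehouse into Hadoop and lifted into memory at far larger scale.

```python
from collections import defaultdict

# Hypothetical drug-claim records: (provider_id, drug_type, cost).
claims = [
    (1, "generic", 12.0), (1, "brand", 48.0),
    (2, "generic", 11.0), (2, "brand", 50.0),
    (3, "generic", 13.0), (3, "brand", 240.0),
]

# Collect costs per provider and drug type.
costs = defaultdict(lambda: defaultdict(list))
for provider, drug_type, cost in claims:
    costs[provider][drug_type].append(cost)

# Ratio of average brand-name to average generic spend per provider --
# a simple stand-in for a fraud-detection business rule.
ratios = {p: (sum(t["brand"]) / len(t["brand"])) /
             (sum(t["generic"]) / len(t["generic"]))
          for p, t in costs.items()}

# Flag providers whose ratio is far above the population median.
median = sorted(ratios.values())[len(ratios) // 2]
outliers = [p for p, r in ratios.items() if r > 3 * median]
print(outliers)  # provider 3 stands out
```

The same pattern scales naturally: once the data are in memory, each pass over the claims is a cheap aggregation rather than a heavy query against the warehouse.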
When the government was considering options for consulting services and technology, a proof of concept was conducted using resources from the agency. The agency also leveraged its network and expertise to advise on standard practices and technology for data management and analytics. Centralizing and standardizing the process and analytics helps the hundreds of users look at the same data and derive the same results across government departments. The proof of concept consisted of hardware, software, and services.
Once the government selected the IT company to develop the architecture with bundled software, hardware, and services, it quickly learned that it needed to build the process from the ground up. Many of the users across the government agency were not used to exploring and visualizing the data that they had. They simply took what was there and applied basic analytics to it. Of course, the data were often not integrated and/or contained missing values, which made the analysis less effective. In recent years, many customers have leveraged the data exploration process to summarize the characteristics of the data and extract knowledge from it. They often use newer technologies such as data visualization tools with easy-to-use interfaces at their fingertips to quickly reveal trends and correlations. Visualization tools not only offer users the flexibility of a click-drag-drop user interface, they also have analytical functions for deep analysis of the data.
Building a strong foundation is key. Once you know what data you have to work with, you can use it to build a strong footing for any application, such as fraud detection. Basic fraud-detection rules were needed to build an effective system, and the old infrastructure and process did not provide them. A comprehensive fraud solution is complex, and many business rules need to be considered to detect the anomalies. Focus on a specific fraud case, such as benefit overpayment, so that you can show success in one area before expanding.
Let's examine another success story with all three components: in-database, in-memory, and Hadoop centered on an enterprise data warehouse.
The next success story is based in the United States and serves more than 10,000 customers by transporting goods and commodities that enhance our daily lives. Being one of the largest freight railroad companies, it employs over 40,000 people and operates across many regions and provinces.
With thousands of miles of track and locomotives, the company has more than 1,500 trains running per day and over 10 million carloads shipped annually. This company has developed one of the most advanced rail networks in the world by optimizing existing rail capacity and making record investments in new infrastructure and equipment directly connected to its operations. The reinvestment into the company sets high standards to improve safety and reliability.
This railroad company is paving its path to the future with analytics. It needs to make sense of all the data it is collecting. It is applying predictive analytics to enable data-driven decisions to reduce delays and proactively identify issues.
The railroad is a critical element that fuels the economy. No technology, grain for food, or automobile is worth a thing if it cannot move from the manufacturing floor to the consumer's hand. According to the most recent Commodity Flow Survey, railroads carry more than 40 percent of freight volume in the United States—more than any other mode of transportation—and provide the most fuel- and cost-efficient means for moving freight over land. To support this finding, this customer is able to move a ton of freight almost 500 miles on a single gallon of fuel. This is possible thanks to the technological advancements and innovative sensors in locomotives.
Today, the U.S. freight rail network is widely considered one of the most dynamic transportation systems in the world. The multibillion-dollar industry consists of 140,000 rail miles operated by national, regional, and local railroads. Not only does the railroad system move more freight and goods nationwide, it also offers numerous public benefits including reductions in road congestion, highway fatalities, fuel consumption and greenhouse gasses, logistics costs, and public infrastructure maintenance costs.
The U.S. freight railroads are private organizations that are responsible for their own maintenance and improvement projects. Compared with other major industries, they invest one of the highest percentages of revenues to maintain and add capacity to their system with technological advances. In general, bulk freight such as grain and coal ships in rail cars, while consumer goods such as products found at a neighborhood grocery or department store ship in containers or trailers called intermodal traffic. Intermodal traffic refers to the transport of goods on trains before and/or after transfers from other modes of transportation, such as planes, vessels, or trucks.
Almost anything can be shipped by rail. Bulk commodities such as agricultural and energy products, automobiles and their components, construction materials, chemicals, coal, equipment, food, metals, minerals, and pulp and paper make up over 90 percent of the rail freight business. Intermodal traffic is less than 10 percent, consisting of consumer goods and other miscellaneous products. Compared to other modes of transportation, the rail option offers many benefits. It is efficient at moving heavy freight over long distances. While the trucking industry excels in providing time-sensitive delivery services for high-value goods transported over medium- and short-haul distances, the rail industry delivers goods and commodities across the country. Raw materials and heavy freight going long distances are likely to continue their journey by rail, or some combination of truck, rail, and water. With the future growth in freight, it is anticipated that freight rail will continue to make investments in technology to increase capacity for long-distance shipments.
Similar to other industries, the rail sector is going through significant changes and challenges to meet customers' demands. This company wants to offer multiple transportation solutions in a constant effort to be competitive. One area where it sees growth is intermodal transportation, which allows containers carrying all types of goods to be placed on rail and moved across the country to facilities where they are transferred to other forms of transportation (such as trucks, ships, and airplanes) for the final leg to their destination. The efficiencies gained from intermodal transportation are even greater with the emergence of double-stacked containers on railcars. With improvements in service and facilities, rail intermodal will become more competitive and absorb the projected increases in freight movement caused by population growth and the expanding intermodal movement of goods.
To compete in the global marketplace and to enhance quality of life, this customer has made significant investments in technology, specifically in analytics.
Preparing for growth and the future, this rail company is using advanced analytics to help drive data-driven decisions made by management and the leadership team. With a network of thousands of miles of rail tracks and tens of thousands of locomotive movements to transport millions of containers and products, there is an enormous amount of data to analyze and make sense of. Advanced analytics can transform data into insights that business leaders can use to determine how to increase efficiency and expand the business.
Advanced analytics have become an essential and integral part of the business in the decision-making process. This company is using predictive analytics to uncover the most efficient and safest routes for its trains, adequately staff train crews, proactively order parts and equipment, inspect trains for preventive maintenance, and much more. Because railroad operations can be very complex with so many moving parts, this freight company uses all its data and applies the science to help solve problems. It is also leveraging data to anticipate the changing needs of the business and adhere to government regulations. Another use of advanced analytics is analyzing the shipments of merchandise trains. These trains carry containers of mixed commodities such as lumber, paper, machinery, industrial parts, and various types of chemicals used in manufacturing. By using advanced analytics, the company is able to facilitate efficient movement of these shipments by grouping them based on their final destinations to minimize costs and sorting. In addition, it is able to improve the flow of products, reduce the time it takes shipments to reach their final destinations, and enhance the customer experience.
Analytics is also used in crew assignment and planning. As in any industry, labor is the biggest expense, and it continues to grow. For many years, crew planning has been a manual task that takes a lot of time, energy, and resources. Planning and assigning a train crew is a complicated process that involves weighing many factors in a short amount of time. The train crew planner has to assign crews located across a large geography in the most efficient and cost-effective manner possible so that trains are not delayed or canceled.
This task becomes even more complicated with all the rules and regulations imposed by the government to maintain safe rail operations. An analytical algorithm was developed to assign crews in real time in an effort to reduce the overall cost of positioning crews to operate trains within a territory. Once the appropriate resource is assigned, the company also uses predictive analytics to accurately predict the arrival times of its trains. This information is shared with customers so they can track their shipments in real time. Customers now know when the train has left the station, when it is en route, and when it will arrive at its final destination.
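The arrival-time prediction described above can be sketched in miniature. The sketch below fits an ordinary least-squares line to hypothetical historical runs on a single route segment; the company's actual models undoubtedly consider far more factors (crew, weather, congestion), so this is only an illustration of the basic idea, with made-up numbers.

```python
# Hypothetical historical runs for one route segment:
# distance (miles) and observed travel time (hours).
distance = [100, 250, 400, 550, 700]
hours = [2.5, 5.8, 9.1, 12.6, 15.9]

# Ordinary least-squares fit of time = a * distance + b,
# a minimal stand-in for an arrival-time predictor.
n = len(distance)
mean_x = sum(distance) / n
mean_y = sum(hours) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(distance, hours)) \
    / sum((x - mean_x) ** 2 for x in distance)
b = mean_y - a * mean_x

def predict_eta(miles):
    """Predicted travel time in hours for a trip of `miles` miles."""
    return a * miles + b

print(round(predict_eta(500), 1))  # about 11.4 hours
```

Sharing the prediction, rather than the raw schedule, is what lets customers track a shipment's expected arrival as conditions change.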
Timing is everything. In an industry of transporting goods, customers rely heavily on the railroad to deliver their shipments on time so that they can provide these products to us, the consumers.
Even though this company is already using advanced and predictive analytics for many of its operations, it is continuously exploring innovative technologies and approaches to analyzing new data sources. Managing big data is nothing new for this company. It has been analyzing large amounts of data for years and relies on advanced and predictive analytics to strengthen data-driven decisions. For example, this company is gathering months of weather data and correlating it with track conditions and/or track volume to forecast expected conditions. It also leverages video data sources to assess any number of conditions, such as missing or misplaced signs or vegetation growth, alerting its field teams where to focus their efforts. To meet regulations or pursue its own continuous improvement objectives, many railroads are combining technology, processes, and shipper training—along with expertise from field-based damage prevention managers—to reduce cargo damage or loss. Rail safety begins with the quality of the underlying rail infrastructure. Maintaining and investing in tracks and equipment is key to preventing accidents. As the company moves more and more toward real-time collection and analysis of critical data, big data technologies allow it to be even more responsive to its business and, in turn, its customers.
In today's data-diverse environment, everything in the rail industry can create an electronic trail of data about its operations and logistics. The analytics team utilizes the mounds of data to generate better inputs for its complex data models and improve its daily operations. New data sources come from sensors that monitor equipment and help determine the best time to schedule railcars and locomotives for inspections and maintenance. In addition, the company uses data from GPS tools to help track the exact location of its assets. These data sources are detail-rich in nature and can help the company create effective solutions and positively impact profitability and the bottom line. Figure 5.12 illustrates the new architecture to support additional data sources.
In this scenario, this customer has invested in Hadoop, in-database, and in-memory technologies to modernize its architecture for the twenty-first century.
The rail and freight industry uses detectors and sensors to help improve safety and service, and it is an area of technology with significant investment and high expectations for return on investment. These detectors and sensors generate a lot of data for this company. Detectors are installed under, alongside, above, and attached to the track to find defects in passing equipment, which adds another layer of safety to prevent unplanned incidents. The primary purpose of these detectors is to see, hear, or feel locomotives and railcars as they pass and determine whether there are defects. Detector technologies also look for missing, damaged, or defective components, from wheels to bearings.
These detectors are generating large data sets daily. There are five types of detection technology, and this company has more than 2,000 detectors collecting data 24 hours a day, seven days a week, across the system. The data collected from these detectors are analyzed along with other data sources, and the analytics team takes these inputs to develop a data model for predicting patterns and the likelihood of an issue occurring on one of its trains. The analysis helps tweak procedures and business-rule criteria for operational improvements. In conjunction with detectors, sensors are the cutting-edge technology that produces many important data points to better manage and identify potential track and equipment defects. By integrating these with existing and other types of data, the analysis can provide crucial and critical information about the rail network. By leveraging predictive analytics, patterns can be revealed and derailments and outages can be prevented.
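A tiny sketch makes the detector-analytics idea concrete: flag a passing axle whose reading is far outside the train-wide baseline. The readings and the threshold rule here are invented for illustration; the railroad's real models combine many detector types and richer business rules.

```python
import statistics

# Hypothetical bearing-temperature readings (degrees F) from one
# trackside detector as a train passes -- one value per axle.
readings = [98, 101, 99, 103, 100, 97, 165, 102, 99]

# Compare each axle against the train-wide baseline. A reading far
# above the median is a candidate defect -- a minimal stand-in for
# the predictive business rules described above.
baseline = statistics.median(readings)
spread = statistics.stdev(readings)
alerts = [i for i, t in enumerate(readings) if t > baseline + 2 * spread]
print(alerts)  # axle 6 is flagged for inspection
```

In practice an alert like this would route the flagged axle to inspectors, augmenting rather than replacing the manual inspection described next.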
Collecting and analyzing detector data is another step toward proactive rather than reactive maintenance. Being proactive can tremendously decrease the unplanned set-outs of equipment that impact velocity, and it improves responder safety without increasing the risk of derailment. The data from these detectors can augment manual inspections. Inspectors now have additional data points on what to look for and where to look. Ultimately, these detectors focus on derailment prevention and improved safety for employees and communities.
The data from the detectors and sensors are helping this company manage its network proactively rather than reactively. In addition, it collects data from a vast array of fixed and mobile sensors to get a real-time view of its assets' condition and usage. Condition-based maintenance and engineering inspection systems will drive reliability in track and equipment, further optimizing maintenance practices. With advanced analytics using in-memory technology, the company will have the ability to model and predict the network impact of day-to-day decisions such as holding or rerouting a train.
The setup in Figure 5.12 serves many purposes for this organization and its customers. The organization can now access information from anywhere, on any device, including dashboards and mobile devices. Manual inputs have been replaced with automation. Carrying paper documents and using outdated information is a thing of the past. Documents are automated, smart, and accessible anytime and anywhere. Instead of requiring heavy data entry, information is distributed to lines of business, managers, and executives. Systems provide insight into what tomorrow will look like by integrating many different sources for more precise and accurate day-to-day decision making. Touch-, speech-, and gesture-driven applications are in use to better manage operations. The result of all of this effort behind predictive analytics is that the network should hit new records for safety, velocity, and on-time performance. Working with its customers and partners, the company continues to work not only hard but also smart to provide safe, reliable freight delivery service.
In addition to organizational benefits, the customers of this freight company are also reaping advantages from predictive analytics. For the customers, data from GPS systems can be analyzed to identify real-time shipment status and more accurate estimated time of arrival. The end goal is to capture customers' needs and proactively know any potential concerns and issues as well as opportunities to enhance customer experience.
The blend of in-database, in-memory, and Hadoop requires the most extensive investment. As expected, this combination delivers the highest competitive advantage because of its comprehensive ability to analyze all data sources and deliver greater performance at each stage of the analytical data life cycle.
In most cases, the data warehouse is the cornerstone. It has been maintained over the years to house many data sources for analytics. Investment in the data warehouse continues to be a focus since it provides data for a number of downstream applications. There are no intentions to abandon the data warehouse because it plays a critical part in the information system. Integrating in-database processing is an obvious investment to improve performance by applying analytics to the data directly. In-database processing may not be possible for all your needs, but it can definitely augment data preparation and analytics.
As mentioned earlier, 46 percent of Hadoop users complement it with a data warehouse, and it is often an extension to capture various semi-structured data sources. However, the two customer successes in this chapter use Hadoop for data staging and data landing. All of these Hadoop options help offset the load on the data warehouse. Although Hadoop is open source, it is also an ecosystem, and there is still some investment in software and hardware. In addition, if you are integrating Hadoop with other systems, it will require additional resources familiar with Hadoop to develop and maintain the programming. As open-source software, Hadoop does not come with a polished user interface; however, many software vendors in the marketplace integrate with Hadoop and provide one.
The investment for in-memory analytics revolves around data exploration and model development capabilities. There are some vertical solutions, mainly in the financial and retail industries. Customers tend to leverage in-memory analytics for high-volume model development and data exploration. When investing in in-memory analytics, it should be integrated with the enterprise data warehouse and/or Hadoop systems so that data can simply be lifted into memory for analysis. If it is not integrated, then you will need additional investment in resources to develop the integration protocol. Not only could this integration development be expensive, it could be time consuming to implement, maintain, and train the users.
Finally, you should carefully evaluate vendors that can provide end-to-end capabilities so that you can leverage their software, hardware, and consulting services in a bundled package. This streamlines the number of contracts, vendors, and support relationships you will have to manage and can speed up implementation if you are on a tight deadline. Vendors that offer this type of bundle tend to be leaders in their respective markets—analytics and data management. Integration between analytics and data management provides optimal performance, economics, and governance.
The next chapter will cover the future of analytics and data management. It provides the top five trends that customers are considering for their organizations based on some of the market trends and the ever-changing dynamics of technology.