The focus of design work, such as data architecture and data modeling, is to provide insight into how best to set up applications that make usable, accessible, and current data available to the organization. Once data is set up in warehouses, marts, and applications, significant operational work is required to maintain data so that it continues to meet organizational requirements. This chapter will describe the data management functions that focus on enabling and maintaining data, including:
Data storage and operations
The data storage and operations function is what many people think about when they think about traditional data management. This is the highly technical work carried out by database administrators (DBAs) and network storage administrators (NSAs) to ensure that data storage systems are accessible and performant and that data integrity is maintained. The work of data storage and operations is essential to organizations that rely on data to transact their business.
Database administration is sometimes seen as a monolithic function, but DBAs play different roles. They may support production environments, development work, or specific applications and procedures. DBA work is influenced by the overall database architecture of an organization (e.g., centralized, distributed, federated; tightly or loosely coupled), as well as by how databases themselves are organized (hierarchically, relationally, or non-relationally). With the emergence of new technologies, DBAs and NSAs are responsible for creating and managing virtual environments (cloud computing). Because data storage environments are quite complex, DBAs look for ways to reduce or at least manage complexity through automation, reusability, and the application of standards and best practices.
While DBAs can seem far removed from the data governance function, their knowledge of the technical environment is essential to implement data governance directives related to such things as access control, data privacy, and data security. Experienced DBAs are also instrumental in enabling organizations to adopt and leverage new technologies.
Data storage and operations is about managing data across its lifecycle, from obtaining it to purging it. DBAs contribute to this process by:
In short, DBAs make sure the engines are running. They are also first on the scene when databases become unavailable.
Data integration and interoperability
While data storage and operations activities focus on the environments for storing and maintaining data, data integration and interoperability (DII) activities include processes for moving and consolidating data within and between data stores and applications. Integration consolidates data into consistent forms, either physical or virtual. Data interoperability is the ability of multiple systems to communicate. Data to be integrated usually originates from different systems within an organization. More and more, organizations also integrate external data with data they produce.
DII solutions enable basic data management functions on which most organizations depend:
The implementation of Data Integration & Interoperability practices and solutions aims to:
The design of DII solutions needs to account for:
The main driver for DII is to ensure that data moves efficiently to and from different data stores, both within the organization and between organizations. It is very important to design with an eye toward reducing complexity. Most enterprises have hundreds, sometimes thousands, of databases. If DII is not managed efficiently, just managing interfaces can overwhelm an IT organization.
Because of its complexity, DII is dependent on other areas of data management, including:
Data Integration & Interoperability is critical to Data Warehousing & Business Intelligence, as well as Reference Data and Master Data Management, because all of these transform and integrate data from multiple source systems to consolidated data hubs and from hubs to the target systems where it can be delivered to data consumers, both system and human.
Data Integration & Interoperability is central, as well, to the emerging area of Big Data management. Big Data seeks to integrate various types of data, including data structured and stored in databases, unstructured text data in documents or files, and other types of unstructured data, such as audio, video, and streaming data. This integrated data can be mined, used to develop predictive models, and deployed in operational intelligence activities.
When implementing DII, an organization should follow these principles:
Data warehousing
Data warehouses allow organizations to integrate data from disparate systems into a common data model in order to support operational functions, compliance requirements, and Business Intelligence (BI) activities. Warehouse technology emerged in the 1980s and organizations began building warehouses in earnest in the 1990s. Warehouses promised to enable organizations to use their data more effectively by reducing data redundancy and bringing about more consistency.
The term data warehouse implies all the data is in one place, as in a physical warehouse. But data warehouses are more complicated than that. They consist of multiple parts through which data moves. During its movement, the structure and format of data may be changed, so that it can be brought together in common tables, from which it can be accessed. It may be used directly for reporting or as input for downstream applications.
Building a warehouse requires skills from across the spectrum of data management, from the highly technical skills required for data storage, operations, and integration, to the decision-making skills of data governance and data strategy leads. It also means managing the foundational processes that enable data to be secure, usable (via reliable Metadata), and of high quality.
There are different ways to build a warehouse. The approach an organization takes will depend on its goals, strategy, and architecture. Whatever the approach, warehouses share common features:
The best known approaches to data warehousing have been driven by two influential thought leaders, Bill Inmon and Ralph Kimball.
Inmon defines a data warehouse as “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process”29. A normalized relational model is used to store and manage data. Figure 19 illustrates Inmon’s approach, which is referred to as the “Corporate Information Factory.”
Kimball defines a warehouse as “a copy of transaction data specifically structured for query and analysis.” Figure 20 illustrates Kimball’s approach, which calls for a dimensional model.
Figure 19: Inmon’s Corporate Information Factory (DMBOK2, p. 388)30
As we approach the third decade of the new millennium, many organizations are building second- and third-generation warehouses or adopting data lakes to make data available. Data lakes make more data available at a higher velocity, creating the opportunity to move from retrospective analysis of business trends to predictive analytics.
Figure 20: Kimball’s Data Warehouse Chess Pieces (DMBOK2, p. 390)31
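Kimball's dimensional model can be sketched concretely. The following is a minimal, hypothetical star schema (the table and column names are illustrative, not from the source): a central fact table of sales transactions joined to surrounding dimension tables, queried the way a BI tool would slice it.

```python
import sqlite3

# Hypothetical minimal star schema in the spirit of Kimball's dimensional
# model: one fact table of transactions surrounded by dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER,
    amount REAL
);
""")
conn.execute("INSERT INTO dim_date VALUES (20240115, '2024-01-15', 2024)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
conn.execute("INSERT INTO fact_sales VALUES (20240115, 1, 3, 29.97)")

# A typical dimensional query: aggregate the facts by a dimension attribute.
row = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY p.category
""").fetchone()
print(row)
```

The design choice is the point: facts (measurable events) are kept separate from dimensions (the context for those events), which is what makes the model "specifically structured for query and analysis."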
Managing bigger data takes additional knowledge and planning. But it also requires following some principles fundamental to managing warehouse data, including:
Reference data management
Different types of data play different roles within an organization and have different data management requirements. Reference Data (for example, code and description tables) is data that is used solely to characterize other data in an organization, or solely to relate data in a database to information beyond the boundaries of the organization.32
Reference Data provides context critical to the creation and use of transactional and Master Data. It enables other data to be meaningfully understood. Importantly, it is a shared resource that should be managed at the enterprise level. Having multiple instances of the same Reference Data is inefficient and inevitably leads to inconsistency between them. Inconsistency leads to ambiguity, and ambiguity introduces risk to an organization.
Reference Data Management (RDM) entails control over defined domain values and their definitions. The goal of RDM is to ensure the organization has access to a complete set of accurate and current values for each concept represented.
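A code and description table, mentioned above as the typical form of Reference Data, can be sketched in a few lines. This is a hypothetical example (the status codes and descriptions are invented for illustration); the point is that a single controlled set of values characterizes other data, and unknown codes are rejected rather than silently accepted.

```python
# Hypothetical reference data set: a code/description table for order status.
# The codes characterize other data (orders) and carry standardized meanings.
ORDER_STATUS = {
    "NEW": "Order received, not yet processed",
    "SHP": "Order shipped to customer",
    "CAN": "Order cancelled",
}

def describe_status(code: str) -> str:
    """Resolve a status code to its description, flagging unknown codes
    rather than silently accepting them -- a simple consistency check."""
    try:
        return ORDER_STATUS[code]
    except KeyError:
        raise ValueError(f"Unknown status code: {code!r}")

print(describe_status("SHP"))  # Order shipped to customer
```

Managing the single authoritative copy of such a table, rather than letting each application maintain its own, is what RDM is about.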
Because it is a shared resource and crosses internal organizational boundaries, ownership and responsibility for Reference Data are challenging for some organizations. Some Reference Data originates outside of the organization; other Reference Data may be created and maintained within a department but have potential value elsewhere in the organization. Determining responsibility for obtaining data and applying updates is part of managing it. Lack of accountability introduces risk, as differences in Reference Data may cause misunderstanding of data context (for example, when two business units have different values to classify the same concept).
Reference data often seems simpler than other data because reference data sets are generally smaller than other kinds of data. They have fewer columns and fewer rows. Even a large reference data set, like the USPS ZIP code file, is small relative to the daily financial transactions of even a medium-sized retailer. Reference data is also generally less volatile than other forms of data. With a few notable exceptions (like currency exchange rate data), reference data changes infrequently.
The challenge with managing reference data comes with its usage. For Reference Data Management to be effective (values up-to-date and consistent across multiple applications and uses), it needs to be managed through technology that enables human and system data consumers to access it in a timely and efficient way across multiple platforms.
As with managing other forms of data, managing reference data requires planning and design. Architecture and reference data models must account for how reference data will be stored, maintained, and shared. Because it is a shared resource, it requires a high degree of stewardship. To get the most value from a centrally managed reference data system, an organization should establish governance policies that require use of that system and prevent people from maintaining their own copies of reference data sets. This may require a level of organizational change management activity, as it can be challenging to get people to give up their spreadsheets for the good of the enterprise.
Master Data Management
Like Reference Data, Master Data is a shared resource. Master Data is data about the business entities (e.g., employees, customers, products, vendors, financial structures, assets, and locations) that provide context for business transactions and analysis. An entity is a real-world object (like a person, organization, place, or thing). Entities are represented by entity instances, in the form of data / records. Master Data should represent the authoritative, most accurate data available about key business entities. When well-managed, Master Data values are trusted and can be used with confidence.
Master Data Management (MDM) entails control over Master Data values and identifiers that enable consistent use, across systems, of the most accurate and timely data about essential business entities. The goals include ensuring availability of accurate, current values while reducing the risk of ambiguous identifiers.
Put more simply: when people think of high-quality data, they usually think of well-managed Master Data. For example, a record of a customer that is complete, accurate, current, and usable is considered “well-managed.” From this record, they should be able to bring together an historical understanding of that customer. If they have enough information, they may be able to predict or influence the actions of that customer.
Master Data Management is challenging. It illustrates a fundamental challenge with data: people choose different ways to represent similar concepts and reconciliation between these representations is not always straightforward. As importantly, information changes over time and systematically accounting for these changes takes planning, data knowledge, and technical skills. In short, it takes work, including data stewardship and governance work, to manage Master Data.
Any organization that has recognized the need for MDM probably already has a complex system landscape, with multiple ways of capturing and storing references to real world entities. As a result of organic growth over time and from mergers and acquisitions, the systems that provided input to the MDM process may have different definitions of the entities themselves and very likely have different standards for data quality. Because of this complexity, it is best to approach Master Data Management one data domain at a time. Start small, with a handful of attributes, and build out over time.
Planning for Master Data Management includes several basic steps. Within a domain:
Executing the process, though, is not as simple as these steps make it sound. MDM is a lifecycle management process. In addition, Master Data must not only be managed within an MDM system, it must also be made available for use by other systems and processes. This requires technology that enables sharing and feedback. It must also be backed up by policies that require systems and business processes to use the Master Data values and prevent them from creating their own “versions of the truth.”
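One core step of that lifecycle, reconciling different representations of the same real-world entity, can be sketched simply. The following is a hypothetical, deliberately naive example (real MDM tooling uses richer, often probabilistic matching): records from different source systems are grouped by a normalized match key, and each group becomes one candidate master record.

```python
# Hypothetical sketch of one MDM step: matching records that refer to the
# same real-world customer by comparing normalized key attributes.
def normalize(record: dict) -> tuple:
    """Build a simple match key from name and email (lowercased,
    whitespace collapsed). Real MDM uses richer matching rules."""
    name = " ".join(record["name"].lower().split())
    email = record["email"].strip().lower()
    return (name, email)

def consolidate(records: list[dict]) -> dict:
    """Group source records by match key; each group becomes one candidate
    master record (here, the first record in each group survives)."""
    masters = {}
    for rec in records:
        masters.setdefault(normalize(rec), rec)
    return masters

sources = [
    {"name": "Ann  Lee", "email": "ann@example.com", "system": "CRM"},
    {"name": "ann lee", "email": "ANN@example.com ", "system": "Billing"},
    {"name": "Bob Ray", "email": "bob@example.com", "system": "CRM"},
]
print(len(consolidate(sources)))  # 2 candidate master records
```

Even this toy version shows why MDM takes data knowledge and stewardship: someone must decide which attributes constitute identity and which source record wins when they conflict.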
Still, Master Data Management has many benefits. Well-managed Master Data improves organizational efficiency and reduces the risks associated with differences in data structure across systems and processes. It also creates opportunity for enrichment of some categories of data. For example, customer and client data can be augmented with information from external sources, such as vendors that sell marketing or demographic data.
Document and content management
Documents, records, and content (for example, the information stored on internet and intranet sites) comprise a form of data with distinct management requirements. Document and Content Management entails controlling the capture, storage, access, and use of data and information stored outside relational databases.33 Like other types of data, documents and unstructured content are expected to be secure and of high quality. Ensuring their security and quality requires governance, reliable architecture, and well-managed Metadata.
Document and content management focuses on maintaining the integrity of and enabling access to documents and other unstructured or semi-structured information; this makes it roughly equivalent to data operations management for relational databases. However, it also has strategic drivers. The primary business drivers for document and content management include regulatory compliance, the ability to respond to litigation and e-discovery requests, and business continuity requirements.
Document Management is the general term used to describe storage, inventory, and control of electronic and paper documents. It encompasses the techniques and technologies for controlling and organizing documents throughout their lifecycle.
Records Management is a specialized form of document management that focuses on records – documents that provide evidence of an organization’s activities. These activities can be events, transactions, contracts, correspondence, policies, decisions, procedures, operations, personnel files, and financial statements. Records can be physical documents, electronic files and messages, or database contents.
Documents and other digital assets, such as videos, photographs, etc., contain content. Content Management refers to the processes, techniques, and technologies for organizing, categorizing, and structuring information resources so that they can be stored, published, and reused in multiple ways. Content may be volatile or static. It may be managed formally (strictly stored, managed, audited, retained or disposed of) or informally through ad hoc updates. Content management is particularly important in web sites and portals, but the techniques of indexing based on keywords and organizing based on taxonomies can be applied across technology platforms.
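The keyword-indexing technique mentioned above can be illustrated with an inverted index, the structure underlying most content search. This is a hypothetical sketch (document IDs and text are invented): each keyword maps to the set of documents that contain it.

```python
# Hypothetical sketch of keyword indexing for content management: an
# inverted index mapping each keyword to the documents containing it.
from collections import defaultdict

def build_index(docs: dict[str, str]) -> dict[str, set]:
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    "policy-001": "records retention policy",
    "memo-017": "retention schedule for personnel records",
}
index = build_index(docs)
print(sorted(index["retention"]))  # ['memo-017', 'policy-001']
```

A taxonomy adds a second, curated layer on top of this: instead of raw words, documents are tagged with controlled category terms, which is why the two techniques are usually combined.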
Successful management of documents, records, and other forms of shared content requires:
For records, retention and disposal policies are critical. Records must be kept for the required length of time, and they should be destroyed once their retention requirements are met. While they exist, records must be accessible to the appropriate people and processes and, like other content, they should be delivered through appropriate channels.
To accomplish these goals, organizations require content management systems (CMS), as well as tools to create and manage the Metadata that supports the use of content. They also require governance to oversee the policies and procedures that support content use and prevent misuse; this governance enables the organization to respond to litigation in a consistent and appropriate manner.
Big data storage
Big Data and data science are connected to significant technological changes that have allowed people to generate, store, and analyze larger and larger amounts of data and to use that data to predict and influence behavior, as well as to gain insight on a range of important subjects, such as health care practices, natural resource management, and economic development.
Early efforts to define the meaning of Big Data characterized it in terms of the Three V’s: Volume, Velocity, Variety.34 As more organizations start to leverage the potential of Big Data, the list of V’s has expanded:
Taking advantage of Big Data requires changes in technology and business processes and in the way that data is managed. Most data warehouses are based on relational models. Big Data is not generally organized in a relational model. Data warehousing depends on the concept of ETL (extract, transform, load). Big Data solutions, like data lakes, depend on the concept of ELT – loading and then transforming. This means much of the upfront work required for integration is not done for Big Data as it is for creating a data warehouse based on a data model. For some organizations and for some uses of data, this approach works, but for others, there is a need to focus on preparing data for use.
The speed and volume of data present challenges that require different approaches to critical aspects of data management, not only data integration, but also Metadata Management, data quality assessment, and data storage (e.g., on site, in a data center, or in the cloud).
The promise of Big Data – that it will provide a different kind of insight – depends on being able to manage Big Data. In many ways, because of the wide variation in sources and formats, Big Data management requires more discipline than relational data management. Each of the V’s presents the opportunity for chaos.
Principles related to Big Data management have yet to fully form, but one is very clear: organizations should carefully manage Metadata related to Big Data sources in order to have an accurate inventory of data files, their origins, and their value. Some people have questioned whether there is a need to manage the quality of Big Data, but the question itself reflects a lack of understanding of the definition of quality – fitness for purpose. Bigness, in and of itself, does not make data fit for purpose. Big Data also represents new ethical and security risks that need to be accounted for by data governance organizations (see Chapter 4).
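The kind of Metadata inventory described above can be sketched as a simple record per source file. This is a hypothetical example (the fields, paths, and values are invented for illustration): each entry records where a file landed, where it came from, and what it represents, so the lake does not become an unsearchable swamp.

```python
# Hypothetical metadata inventory entry for a Big Data source file,
# recording origin and descriptive attributes so the file stays findable.
from dataclasses import dataclass, asdict

@dataclass
class SourceFileMetadata:
    path: str          # where the file landed in the lake
    origin: str        # upstream system or provider
    format: str        # serialization format
    ingested_on: str   # load date (ISO 8601)
    description: str   # what the data represents

entry = SourceFileMetadata(
    path="/lake/raw/clickstream/2024-01-15.json",
    origin="web analytics feed",
    format="json",
    ingested_on="2024-01-15",
    description="Raw clickstream events, one JSON object per line",
)
inventory = [asdict(entry)]
print(inventory[0]["origin"])  # web analytics feed
```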
Big Data can be used for a range of activities, from data mining to machine learning and predictive analytics. But to get there, an organization must have a starting point and a strategy. An organization’s Big Data strategy needs to be aligned with and support its overall business strategy. It should evaluate:
The strategy will drive the scope and timing of the organization's Big Data capability roadmap.
Many organizations are integrating Big Data into their overall data management environment (see Figure 21). Data moves from source systems into a staging area, where it may be cleansed and enriched. It is then integrated and stored in the data warehouse (DW) and/or an operational data store (ODS). From the DW, users may access data via marts or cubes, and utilize it for various kinds of reporting. Big Data goes through a similar process, but with a significant difference: while most warehouses integrate data before landing it in tables, Big Data solutions ingest data before integrating it. Big Data BI may include predictive analytics and data mining, as well as more traditional forms of reporting.
What you need to know