Chapter 7
Enabling and Maintaining Data

The focus of design work, such as data architecture and data modeling, is to provide insight into how best to set up applications that make usable, accessible, and current data available to the organization. Once data is set up in warehouses, marts, and applications, significant operational work is required to maintain it so that it continues to meet organizational requirements. This chapter will describe the data management functions that focus on enabling and maintaining data, including:

  • Data Storage and Operations
  • Data Integration and Interoperability
  • Data Warehousing
  • Reference Data Management
  • Master Data Management
  • Document and Content Management
  • Big Data Storage

Data storage and operations

The data storage and operations function is what many people think about when they think about traditional data management. This is the highly technical work carried out by database administrators (DBAs) and network storage administrators (NSAs) to ensure that data storage systems are accessible and performant and that data integrity is maintained. The work of data storage and operations is essential to organizations that rely on data to transact their business.

Database administration is sometimes seen as a monolithic function, but DBAs play different roles. They may support production environments, development work, or specific applications and procedures. DBA work is influenced by the overall database architecture of an organization (e.g., centralized, distributed, federated; tightly or loosely coupled), as well as by how databases themselves are organized (hierarchically, relationally, or non-relationally). With the emergence of new technologies, DBAs and NSAs are responsible for creating and managing virtual environments (cloud computing). Because data storage environments are quite complex, DBAs look for ways to reduce or at least manage complexity through automation, reusability, and the application of standards and best practices.

While DBAs can seem far removed from the data governance function, their knowledge of the technical environment is essential to implement data governance directives related to such things as access control, data privacy, and data security. Experienced DBAs are also instrumental in enabling organizations to adopt and leverage new technologies.

Data storage and operations is about managing data across its lifecycle, from obtaining it to purging it. DBAs contribute to this process by:

  • Defining storage requirements
  • Defining access requirements
  • Developing database instances
  • Managing the physical storage environment
  • Loading data
  • Replicating data
  • Tracking usage patterns
  • Planning for business continuity
  • Managing backup and recovery
  • Managing database performance and availability
  • Managing alternate environments (e.g., for development and test)
  • Managing data migration
  • Tracking data assets
  • Enabling data audits and validation

In short, DBAs make sure the engines are running. They are also first on the scene when databases become unavailable.

Data integration and interoperability

While data storage and operations activities focus on the environments for storing and maintaining data, data integration and interoperability (DII) activities include processes for moving and consolidating data within and between data stores and applications. Integration consolidates data into consistent forms, either physical or virtual. Interoperability is the ability of multiple systems to communicate with one another. Data to be integrated usually originates from different systems within an organization. More and more, organizations also integrate external data with the data they produce.

DII solutions enable basic data management functions on which most organizations depend:

  • Data migration and conversion
  • Data consolidation into hubs or marts
  • Integration of vendor software packages into an organization’s application portfolio
  • Data sharing between applications and across organizations
  • Distributing data across data stores and data centers
  • Archiving data
  • Managing data interfaces
  • Obtaining and ingesting external data
  • Integrating structured and unstructured data
  • Providing operational intelligence and management decision support

The implementation of Data Integration & Interoperability practices and solutions aims to:

  • Make data available in the format and timeframe needed by data consumers, both human and system
  • Consolidate data physically and virtually into data hubs
  • Lower the cost and complexity of managing solutions by developing shared models and interfaces
  • Identify meaningful events (opportunities and threats) and automatically trigger alerts and actions
  • Support Business Intelligence, analytics, Master Data Management, and operational efficiency efforts

The design of DII solutions needs to account for:

  • Change data capture: How changes in source data are detected and propagated so that target data is correctly updated (a minimal sketch follows this list)
  • Latency: The amount of time between when data is created or captured and when it is made available for consumption
  • Replication: How data is distributed to ensure performance
  • Orchestration: How different processes are organized and executed to preserve data consistency and continuity
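
One common pattern for change data capture, shown below in a minimal Python sketch, is to track a watermark (such as a last-updated timestamp) and extract only the rows that changed since the previous run. The customer table, column names, and timestamps are hypothetical illustrations, not features of any particular DII tool.

    import sqlite3
    from datetime import datetime, timezone

    def extract_changes(conn, last_watermark):
        """Return rows changed since the previous run, plus the new watermark."""
        new_watermark = datetime.now(timezone.utc).isoformat()
        rows = conn.execute(
            "SELECT id, name, last_updated FROM customer WHERE last_updated > ?",
            (last_watermark,),
        ).fetchall()
        return rows, new_watermark

    # Toy source table; in practice this would live in a source system.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, last_updated TEXT)")
    conn.execute("INSERT INTO customer VALUES (1, 'Acme', '2024-01-02T00:00:00+00:00')")

    changed, watermark = extract_changes(conn, "2024-01-01T00:00:00+00:00")
    print(changed)  # [(1, 'Acme', '2024-01-02T00:00:00+00:00')]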

The main driver for DII is to ensure that data moves efficiently to and from different data stores, both within the organization and between organizations. It is very important to design with an eye toward reducing complexity. Most enterprises have hundreds, sometimes thousands, of databases. If DII is not managed efficiently, just managing interfaces can overwhelm an IT organization.

Because of its complexity, DII is dependent on other areas of data management, including:

  • Data Governance: For governing the transformation rules and message structures
  • Data Architecture: For designing solutions
  • Data Security: For ensuring solutions appropriately protect the security of data, whether it is persistent, virtual, or in motion between applications and organizations
  • Metadata: For tracking the technical inventory of data (persistent, virtual, and in motion), the business meaning of the data, the business rules for transforming the data, and the operational history and lineage of the data
  • Data Storage and Operations: For managing the physical instantiation of the solutions
  • Data Modeling and Design: For designing the data structures including physical persistence in databases, virtual data structures, and messages passing information between applications and organizations

Data Integration & Interoperability is critical to Data Warehousing & Business Intelligence, as well as to Reference Data and Master Data Management, because all of these functions transform and integrate data from multiple source systems into consolidated data hubs, and from those hubs to the target systems where it can be delivered to data consumers, both system and human.

Data Integration & Interoperability is central, as well, to the emerging area of Big Data management. Big Data seeks to integrate various types of data, including data structured and stored in databases, unstructured text data in documents or files, and other types of unstructured data such as audio, video, and streaming data. This integrated data can be mined, used to develop predictive models, and deployed in operational intelligence activities.

When implementing DII, an organization should follow these principles:

  • Take an enterprise perspective in design to ensure future extensibility, but implement through iterative and incremental delivery.
  • Balance local data needs with enterprise data needs, including support and maintenance.
  • Ensure business accountability for DII design and activity. Business experts should be involved in the design and modification of data transformation rules, both persistent and virtual.

Data warehousing

Data warehouses allow organizations to integrate data from disparate systems into a common data model in order to support operational functions, compliance requirements, and Business Intelligence (BI) activities. Warehouse technology emerged in the 1980s and organizations began building warehouses in earnest in the 1990s. Warehouses promised to enable organizations to use their data more effectively by reducing data redundancy and bringing about more consistency.

The term data warehouse implies that all the data is in one place, as in a physical warehouse. But data warehouses are more complicated than that. They consist of multiple parts through which data moves. During this movement, the structure and format of the data may be changed so that it can be brought together in common tables, from which it can be accessed. It may be used directly for reporting or as input for downstream applications.

Building a warehouse requires skills from across the spectrum of data management, from the highly technical skills required for data storage, operations, and integration, to the decision-making skills of data governance and data strategy leads. It also means managing the foundational processes that enable data to be secure, usable (via reliable Metadata), and of high quality.

There are different ways to build a warehouse. The approach an organization takes will depend on its goals, strategy, and architecture. Whatever the approach, warehouses share common features:

  • Warehouses store data from other systems and make it accessible and usable for analysis.
  • The act of storage includes organizing the data in ways that increase its value. In many cases this means warehouses effectively create new data that is not available elsewhere.
  • Organizations build warehouses because they need to make reliable, integrated data available to authorized stakeholders.
  • Warehouse data serves many purposes, from support of workflow to operational management to predictive analytics.

The best known approaches to data warehousing have been driven by two influential thought leaders, Bill Inmon and Ralph Kimball.

Inmon defines a data warehouse as “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process”29. A normalized relational model is used to store and manage data. Figure 19 illustrates Inmon’s approach, which is referred to as the “Corporate Information Factory.”

Kimball defines a warehouse as “a copy of transaction data specifically structured for query and analysis.” Figure 20 illustrates Kimball’s approach, which calls for a dimensional model.

Figure 19: Inmon’s Corporate Information Factory (DMBOK2, p. 388)30

As we approach the third decade of the new millennium, many organizations are building second- and third-generation warehouses or adopting data lakes to make data available. Data lakes make more data available at a higher velocity, creating the opportunity to move from retrospective analysis of business trends to predictive analytics.

Figure 20: Kimball’s Data Warehouse Chess Pieces (DMBOK2, p. 390)31
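
Kimball's dimensional approach is typically implemented as a star schema: a central fact table joined to dimension tables. The following minimal sketch, in Python with an in-memory SQLite database, assumes a hypothetical retail sales subject area; the table and column names are illustrative only.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, calendar_date TEXT, month TEXT);
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
        CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, quantity INTEGER, amount REAL);

        INSERT INTO dim_date    VALUES (20240102, '2024-01-02', '2024-01');
        INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware');
        INSERT INTO fact_sales  VALUES (20240102, 1, 3, 29.97);
    """)

    # Analysis queries join the central fact table to its dimensions.
    query = """
        SELECT d.month, p.category, SUM(f.amount) AS total_sales
        FROM fact_sales f
        JOIN dim_date d    ON f.date_key = d.date_key
        JOIN dim_product p ON f.product_key = p.product_key
        GROUP BY d.month, p.category
    """
    for row in conn.execute(query):
        print(row)  # ('2024-01', 'Hardware', 29.97)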

Managing bigger data takes additional knowledge and planning. But it also requires following some principles fundamental to managing warehouse data, including:

  • Focus on business goals: Make sure the data warehouse (DW) serves organizational priorities and solves business problems. Doing so requires a strategic perspective, which, most often, is an enterprise perspective.
  • Start with the end in mind: DW content should be driven by the business priorities and the scope of end-data-delivery for BI.
  • Think and design globally; act and build locally: Let end-vision guide the architecture, but build and deliver incrementally, through focused projects or sprints that enable more immediate return on investment.
  • Summarize and optimize last, not first: Build on the atomic data. Aggregate and summarize to meet requirements and ensure performance, not to replace the detail.
  • Promote transparency and self-service: The more context (e.g., including Metadata of multiple kinds) provided, the better able data consumers will be to get value out of the data. Keep stakeholders informed about the data and the processes by which it is integrated.
  • Build Metadata with the warehouse: Critical to DW success is the ability to explain the data. For example, being able to answer basic questions like “Why is this sum X?” “How was that computed?” and “Where did the data come from?” Metadata should be captured as part of the development cycle and managed as part of ongoing operations.
  • Collaborate: Collaborate with other data initiatives, especially those for data governance, data quality, and Metadata.
  • One size does not fit all: Use the right tools and products for each group of data consumers.

Reference data management

Different types of data play different roles within an organization and have different data management requirements. Reference Data (for example, code and description tables) is data that is used solely to characterize other data in an organization, or solely to relate data in a database to information beyond the boundaries of the organization.32

Reference Data provides context critical to the creation and use of transactional and Master Data. It enables other data to be meaningfully understood. Importantly, it is a shared resource that should be managed at the enterprise level. Having multiple instances of the same Reference Data is inefficient and inevitably leads to inconsistency between them. Inconsistency leads to ambiguity, and ambiguity introduces risk to an organization.

Reference Data Management (RDM) entails control over defined domain values and their definitions. The goal of RDM is to ensure the organization has access to a complete set of accurate and current values for each concept represented.
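
As a simple illustration, a reference data domain is often just a governed code/description table that other data is validated against. The sketch below uses real ISO 3166-1 country codes as the example domain; the functions themselves are hypothetical.

    # A governed code/description domain, managed in one place and shared.
    COUNTRY_CODES = {
        "US": "United States of America",
        "CA": "Canada",
        "MX": "Mexico",
    }

    def describe(code):
        """Resolve a code to its approved description."""
        return COUNTRY_CODES[code]

    def is_valid(code):
        """Check that a value conforms to the approved domain."""
        return code in COUNTRY_CODES

    print(describe("CA"))   # Canada
    print(is_valid("XX"))   # False: not an approved value in this domain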

Because it is a shared resource and crosses internal organizational boundaries, ownership of and responsibility for Reference Data can be challenging for some organizations. Some Reference Data originates outside the organization; other Reference Data may be created and maintained within a department but have potential value elsewhere in the organization. Determining responsibility for obtaining data and applying updates is part of managing it. Lack of accountability introduces risk, as differences in Reference Data may cause misunderstanding of data context (for example, when two business units use different values to classify the same concept).

Reference data often seems simpler than other data because reference data sets are generally smaller than other kinds of data. They have fewer columns and fewer rows. Even a large reference data set, like the USPS ZIP code file, is small relative to the daily financial transactions of even a medium-sized retailer. Reference data is also generally less volatile than other forms of data. With a few notable exceptions (like currency exchange rate data), reference data changes infrequently.

The challenge with managing reference data comes with its usage. For Reference Data Management to be effective (values up-to-date and consistent across multiple applications and uses), it needs to be managed through technology that enables human and system data consumers to access it in a timely and efficient way across multiple platforms.

As with managing other forms of data, managing reference data requires planning and design. Architecture and reference data models must account for how reference data will be stored, maintained, and shared. Because it is a shared resource, it requires a high degree of stewardship. To get the most value from a centrally managed reference data system, an organization should establish governance policies that require use of that system and prevent people from maintaining their own copies of reference data sets. This may require a level of organizational change management activity, as it can be challenging to get people to give up their spreadsheets for the good of the enterprise.

Master data management

Like Reference Data, Master Data is a shared resource. Master Data is data about the business entities (e.g., employees, customers, products, vendors, financial structures, assets, and locations) that provide context for business transactions and analysis. An entity is a real-world object (like a person, organization, place, or thing). Entities are represented by entity instances, in the form of data / records. Master Data should represent the authoritative, most accurate data available about key business entities. When well-managed, Master Data values are trusted and can be used with confidence.

Master Data Management (MDM) entails control over Master Data values and identifiers that enable consistent use, across systems, of the most accurate and timely data about essential business entities. The goals include ensuring availability of accurate, current values while reducing the risk of ambiguous identifiers.

Put more simply: when people think of high-quality data, they usually think of well-managed Master Data. For example, a record of a customer that is complete, accurate, current, and usable is considered “well-managed.” From this record, they should be able to bring together an historical understanding of that customer. If they have enough information, they may be able to predict or influence the actions of that customer.

Master Data Management is challenging. It illustrates a fundamental challenge with data: people choose different ways to represent similar concepts and reconciliation between these representations is not always straightforward. As importantly, information changes over time and systematically accounting for these changes takes planning, data knowledge, and technical skills. In short, it takes work, including data stewardship and governance work, to manage Master Data.

Any organization that has recognized the need for MDM probably already has a complex system landscape, with multiple ways of capturing and storing references to real-world entities. As a result of organic growth over time and of mergers and acquisitions, the systems that provide input to the MDM process may have different definitions of the entities themselves and very likely have different standards for data quality. Because of this complexity, it is best to approach Master Data Management one data domain at a time. Start small, with a handful of attributes, and build out over time.

Planning for Master Data Management includes several basic steps. Within a domain:

  • Identify candidate sources that will provide a comprehensive view of the Master Data entities
  • Develop rules for accurately matching and merging entity instances
  • Establish an approach to identify and restore inappropriately matched and merged data
  • Establish an approach to distribute trusted data to systems across the enterprise

Executing the process, though, is not as simple as these steps make it sound. MDM is a lifecycle management process. In addition, Master Data must not only be managed within an MDM system, it must also be made available for use by other systems and processes. This requires technology that enables sharing and feedback. It must also be backed up by policies that require systems and business processes to use the Master Data values and prevent them from creating their own “versions of the truth.”
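
To make the matching and merging steps above more concrete, the sketch below matches two hypothetical customer records on a normalized name and postal code, then merges them using a simple survivorship rule (the most recently updated non-empty value wins). Real MDM matching rules are far more sophisticated; this is only an illustration.

    def match_key(record):
        """Build a simplistic match key from normalized attributes."""
        return (record["name"].strip().lower(), record["postal_code"].strip())

    def merge(records):
        """Survivorship: the most recently updated non-empty value wins."""
        golden = {}
        for rec in sorted(records, key=lambda r: r["updated"]):
            golden.update({k: v for k, v in rec.items() if v})
        return golden

    crm   = {"name": "ACME Corp ", "postal_code": "10001", "phone": "", "updated": "2023-05-01"}
    sales = {"name": "acme corp",  "postal_code": "10001", "phone": "212-555-0100", "updated": "2024-02-01"}

    if match_key(crm) == match_key(sales):
        golden_record = merge([crm, sales])
        print(golden_record)  # one consolidated, most-current view of the customer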

Still, Master Data Management has many benefits. Well-managed Master Data improves organizational efficiency and reduces the risks associated with differences in data structure across systems and processes. It also creates opportunity for enrichment of some categories of data. For example, customer and client data can be augmented with information from external sources, such as vendors that sell marketing or demographic data.

Document and content management

Documents, records, and content (for example, the information stored on internet and intranet sites) comprise a form of data with distinct management requirements. Document and Content Management entails controlling the capture, storage, access, and use of data and information stored outside relational databases.33 Like other types of data, documents and unstructured content are expected to be secure and of high quality. Ensuring their security and quality requires governance, reliable architecture, and well-managed Metadata.

Document and content management focuses on maintaining the integrity of and enabling access to documents and other unstructured or semi-structured information; this makes it roughly equivalent to data operations management for relational databases. However, it also has strategic drivers. The primary business drivers for document and content management include regulatory compliance, the ability to respond to litigation and e-discovery requests, and business continuity requirements.

Document Management is the general term used to describe storage, inventory, and control of electronic and paper documents. It encompasses the techniques and technologies for controlling and organizing documents throughout their lifecycle.

Records Management is a specialized form of document management that focuses on records – documents that provide evidence of an organization’s activities. These activities can be events, transactions, contracts, correspondence, policies, decisions, procedures, operations, personnel files, and financial statements. Records can be physical documents, electronic files and messages, or database contents.

Documents and other digital assets, such as videos, photographs, etc., contain content. Content Management refers to the processes, techniques, and technologies for organizing, categorizing, and structuring information resources so that they can be stored, published, and reused in multiple ways. Content may be volatile or static. It may be managed formally (strictly stored, managed, audited, retained or disposed of) or informally through ad hoc updates. Content management is particularly important in web sites and portals, but the techniques of indexing based on keywords and organizing based on taxonomies can be applied across technology platforms.
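
As a small illustration of keyword-based indexing, the sketch below builds an inverted index that maps each keyword to the documents containing it; the document identifiers and text are hypothetical.

    from collections import defaultdict

    def build_index(documents):
        """Map each keyword to the set of document ids that contain it."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for word in text.lower().split():
                index[word].add(doc_id)
        return index

    docs = {
        "policy-001": "records retention policy for financial statements",
        "memo-017": "retention schedule update for personnel files",
    }
    index = build_index(docs)
    print(index["retention"])  # both documents contain the keyword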

Successful management of documents, records, and other forms of shared content requires:

  • Planning, including creating policies for different kinds of access and handling
  • Defining information architecture and Metadata required to support a content strategy
  • Enabling management of terminology, including ontologies and taxonomies, required to organize, store, and retrieve various forms of content
  • Adopting technologies that enable management of the content lifecycle, from creating or capturing content to versioning, and ensuring content is secure

For records, retention and disposal policies are critical. Records must be kept for the required length of time, and they should be destroyed once their retention requirements are met. While they exist, records must be accessible to the appropriate people and processes and, like other content, they should be delivered through appropriate channels.

To accomplish these goals, organizations require content management systems (CMS), as well as tools to create and manage the Metadata that supports the use of content. They also require governance to oversee the policies and procedures that support content use and prevent misuse; this governance enables the organization to respond to litigation in a consistent and appropriate manner.

Big data storage

Big Data and data science are connected to significant technological changes that have allowed people to generate, store, and analyze larger and larger amounts of data and to use that data to predict and influence behavior, as well as to gain insight on a range of important subjects, such as health care practices, natural resource management, and economic development.

Early efforts to define the meaning of Big Data characterized it in terms of the Three V’s: Volume, Velocity, Variety.34 As more organizations start to leverage the potential of Big Data, the list of V’s has expanded:

  • Volume: Refers to the amount of data. Big Data often has thousands of entities or elements in billions of records.
  • Velocity: Refers to the speed at which data is captured, generated, or shared. Big Data is often generated and can also be distributed and even analyzed in real-time.
  • Variety / Variability: Refers to the forms in which data is captured or delivered. Big Data requires storage of multiple formats. Data structure is often inconsistent within or across data sets.
  • Viscosity: Refers to how difficult the data is to use or integrate.
  • Volatility: Refers to how often data changes and therefore how long the data is useful.
  • Veracity: Refers to how trustworthy the data is.

Taking advantage of Big Data requires changes in technology and business processes and in the way that data is managed. Most data warehouses are based on relational models. Big Data is not generally organized in a relational model. Data warehousing depends on the concept of ETL (extract, transform, load). Big Data solutions, like data lakes, depend on the concept of ELT – loading and then transforming. This means much of the upfront work required for integration is not done for Big Data as it is for creating a data warehouse based on a data model. For some organizations and for some uses of data, this approach works, but for others, there is a need to focus on preparing data for use.
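
The difference between ETL and ELT can be seen in a minimal sketch: raw records are landed as-is, and structure is applied only when the data is read. The event payloads and the SQLite landing table below are hypothetical illustrations, not a description of any specific data lake product.

    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_events (payload TEXT)")  # schema-on-read: land data as-is

    # Extract + Load: ingest records without upfront modeling or cleansing.
    events = [
        {"user": "a1", "action": "view", "ts": "2024-03-01T10:00:00Z"},
        {"user": "a1", "action": "purchase", "ts": "2024-03-01T10:05:00Z", "amount": 19.99},
    ]
    conn.executemany("INSERT INTO raw_events VALUES (?)", [(json.dumps(e),) for e in events])

    # Transform: apply structure only at the point of use.
    purchases = []
    for (payload,) in conn.execute("SELECT payload FROM raw_events"):
        event = json.loads(payload)
        if event["action"] == "purchase":
            purchases.append(event)
    print(purchases)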

The speed and volume of data present challenges that require different approaches to critical aspects of data management, not only data integration, but also Metadata Management, data quality assessment, and data storage (e.g., on site, in a data center, or in the cloud).

The promise of Big Data – that it will provide a different kind of insight – depends on being able to manage Big Data. In many ways, because of the wide variation in sources and formats, Big Data management requires more discipline than relational data management. Each of the V’s presents the opportunity for chaos.

Principles related to Big Data management have yet to fully form, but one is very clear: organizations should carefully manage Metadata related to Big Data sources in order to have an accurate inventory of data files, their origins, and their value. Some people have questioned whether there is a need to manage the quality of Big Data, but the question itself reflects a lack of understanding of the definition of quality – fitness for purpose. Bigness, in and of itself, does not make data fit for purpose. Big Data also represents new ethical and security risks that need to be accounted for by data governance organizations (see Chapter 4).
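
One lightweight way to start is a simple inventory that records, for each Big Data source, where the file lands, where it came from, and what is known about its fitness for use. The fields and entries below are hypothetical illustrations.

    from dataclasses import dataclass

    @dataclass
    class SourceMetadata:
        path: str          # where the file or feed lands
        origin: str        # producing system or external provider
        acquired: str      # date the data was ingested
        fitness_note: str  # known value, caveats, or quality concerns

    inventory = [
        SourceMetadata("landing/clickstream/2024-03-01.json", "web analytics feed",
                       "2024-03-02", "sessionization rules unverified"),
        SourceMetadata("landing/weather/2024-03.csv", "external weather vendor",
                       "2024-03-05", "licensed for internal analytics only"),
    ]
    for item in inventory:
        print(item.path, "<-", item.origin)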

Big Data can be used for a range of activities, from data mining to machine learning and predictive analytics. But to get there, an organization must have a starting point and a strategy. An organization’s Big Data strategy needs to be aligned with and support its overall business strategy. It should evaluate:

  • What problems the organization is trying to solve and what it needs analytics for: An organization may determine that the data is to be used to understand the business or the business environment; to prove ideas about the value of new products; to explore a hypothesis; or to invent a new way to do business. It is important to establish a gating and checkpoint process to evaluate the value and feasibility of initiatives.
  • What data sources to use or acquire: Internal sources may be easy to use, but may also be limited in scope. External sources may be useful, but are outside operational control (managed by others, or not controlled by anyone, as in the case of social media). Many vendors are competing as data brokers and often multiple sources exist for the desired data sets. Acquiring data that integrates with existing ingestion items can reduce overall investment costs.
  • The timeliness and scope of the data to provision: Many elements can be provided in real-time feeds, snapshots at a point in time, or even integrated and summarized. Low latency data is ideal, but often comes at the expense of machine learning capabilities – there is a huge difference between computational algorithms directed to data-at-rest versus streaming. Do not minimize the level of integration required for downstream usage.
  • The impact on and relation to other data structures: You may need to make changes to structure or content in other data structures to make them suitable for integration with Big Data sets.
  • Influences on existing modeled data: Including extending knowledge of customers, products, and marketing approaches.

The strategy will drive the scope and timing of the organization’s Big Data capability roadmap.

Many organizations are integrating Big Data into their overall data management environment (see Figure 21). Data moves from source systems into a staging area, where it may be cleansed and enriched. It is then integrated and stored in the data warehouse (DW) and/or an operational data store (ODS). From the DW, users may access data via marts or cubes, and utilize it for various kinds of reporting. Big Data goes through a similar process, but with a significant difference: while most warehouses integrate data before landing it in tables, Big Data solutions ingest data before integrating it. Big Data BI may include predictive analytics and data mining, as well as more traditional forms of reporting.

What you need to know

  • The processes used to enable and maintain data are wide, varied, and constantly evolving.
  • Different types of data have specific maintenance requirements, but for all types an organization must account for data volatility (the rate, timing and types of expected changes) as well as quality (fitness for purpose).
  • Good planning and design can help reduce the complexity associated with these processes.
  • Reliable and appropriate technology and disciplined execution of operational processes are critical to an organization’s ability to manage its data.
  • Even as data and technology evolve (e.g., from documents to Big Data), the same fundamental principles apply in managing it.