4

XML and Data Engineering

Introduction

This chapter describes the intersection between XML and data engineering, pictured in Figure 4.1. We believe that the area of overlap is large, and this permits speculation about leveraging investments. The short lesson to take away from this chapter is that organizations must simultaneously consider reengineering their data structures as they prepare their data for use in an XML environment. Before embarking on XML-related initiatives, organizations should ensure that the maturity of their DM practices is sufficient to implement the concepts described in this chapter. XML holds a lot of promise, but when it is used in the wrong context, or used in conjunction with data structures and architectures that are already lacking, most people find that it does not live up to its reputation. The structure and thought that surround the use of XML must be coherent and reasonable in order for XML to achieve the desired results. Given a poor architecture or design for a building, the strongest building materials will fail. This is not the fault of the building materials, but of the structure into which they were put.


Figure 4.1 The intersection between XML and data engineering.

The chapter begins with a discussion of where not to stop with your initial XML implementation. We next describe data engineering as it relates to participant roles, followed by system migration. We continue, describing various data engineering tasks as well as data structure difficulties. These concepts are illustrated with a case study that shows the value XML can have when performing something as seemingly routine as the implementation of a large Enterprise Resource Planning (ERP) application. XML-related data security engineering issues are briefly described in the next section. As with quality, security (and particularly data security) cannot be engineered into finished products after the fact of production. Instead, it must be an integrated portion of the development process. The latter third of this chapter is devoted to a discussion of the fundamentals when it comes to engineering an organizational data architecture. After reading this chapter, the relationship between data engineering and the effective use of XML will be clear, enabling data managers to apply XML where it is most appropriate.

As many data managers have experienced in their careers, projects can live or die based on what data is available and how it is structured. Well-structured data that is understandable is not a blue-sky wishlist item for organizations—it is a necessity. Those who do not address poor structure typically get bitten by it until they do. But in many situations, management is not willing to invest in an enterprise-wide evaluation of data structures. In most cases, this is wise on their part, since the task can be unmanageably large. The challenge is to find a way to break down a daunting problem into chunks that can be dealt with.

Another theme that runs throughout this chapter is that work done on day-to-day systems projects affects information architecture. Data managers have the opportunity to build metadata components from the systems work that they do, and reuse these components in other ways. Some of the techniques in this chapter describe how to break things down so that the resulting components are reusable, and how to structure those components so that they can be reassembled into a larger organization goal in the future. Before we do this, though, we need to briefly discuss the way organizations typically adopt XML in the early stages, and to see how those first steps look from a data engineering perspective. (Note: Portions of this chapter were co-authored by Elizabeth R. White.)

Typical XML First Steps

The most typical approach to XML seen in organizations is to start using XML documents as “wrappers” on existing data—a one-to-one mapping between the attributes and data items in the existing system is established with XML elements, and documents are created that reflect the system in every architectural detail, the only difference being that the syntax used to express the data happens to be XML. In these situations, it should be expected that the XML documents created in such a way will have the same set of advantages and disadvantages as the original system. In such cases, there may be an incremental gain in functionality based on some of the characteristics and related technologies that the use of XML brings to the table, but the important point to note is that the system as a whole is still constrained by its original design.
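To make this concrete, imagine a legacy customer record whose fields are simply echoed, name for name, as XML elements. The record layout and field names below are invented for illustration:

    <CUST-REC>
      <CUST-NO>0004217</CUST-NO>
      <CUST-NM>SMITH, JOHN Q</CUST-NM>
      <ADDR-LN-1>123 PINE LANE</ADDR-LN-1>
      <ST-CD>VA</ST-CD>
      <ORD-DT>20011103</ORD-DT>
      <STAT-CD>A</STAT-CD>
    </CUST-REC>

The cryptic names, the compressed date format, and the single-character status code all survive intact. A consumer of this document needs exactly the same tribal knowledge that the legacy system required; only the syntax has changed.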

There is an alternative to this, however. Motivated by the desire to get the most out of XML, we hope that organizations might address the larger issue of their data architecture and the structures that are contained within that architecture. Some might view this as a distraction from the core goal of getting XML implemented, but it is one of the most important steps. People are attracted to XML partly based on what they have seen and have been told is possible with XML. Unfortunately, the underlying structure that allowed the judicious application of XML technology to shine through and deliver all of those benefits has not been discussed outside of this book. The main idea to take from this chapter is this: Architecture and structure matter, and if there is any area of a system that needs to be right, this is it. With a well-designed structure, benefits are only limited by the creativity of those working with it. With poorly designed structures, it takes all of one’s creativity just to work around the system to make it do what it was supposed to do. Next, we will take a look at what data engineering really is, and what types of talents and capabilities are necessary to make it work.

Engineering XML-Based Data Structures as Part of an Organizational Data Architecture

Data engineering is fundamentally concerned with the design and manufacture of complex data products using calculated manipulation or direction.* Controlling data evolution, moving data between systems, and ensuring the structural soundness and appropriate transformation of data are typical tasks that we associate with the definition of data engineering for the purposes of this chapter. In many places, the concept of data engineering is somewhat new. Consider for example the typical evolution of data systems within organizations. Traditionally, as a small organization starts out, individual systems are developed according to what is needed. As the organization segments into many different areas based on function, new systems are built. At some point, all of the different systems must be integrated (for example, integrating a sales system that takes orders with a manufacturing system that schedules production), but in many places there has not been a calculated, well-planned effort at putting together a system that is coherent overall. Working in the right directions requires getting the right people in the right roles to bring their talents to bear on the project.

If Software Companies Made Bridges

Let us look at an example of what is possible with real world engineering of solid objects, such as a bridge over a river. Just like an information system, a bridge is built with certain requirements in terms of how much load it must be able to bear, how frequently, where it should be located, and what function it should perform. Bridges are built of concrete, steel, and bolts, just as information systems are built of data structures, people, and processes.

In real world engineering, problems often arise with structures that must be dealt with based on the understanding of how the structure was engineered in the first place. For example, take a train bridge over a river. In the case of this particular bridge, when the river floods and comes close to covering the entire structure, the response from the engineers is to pull a train full of coal onto the bridge. The weight of the fully loaded train provides the extra weight to the bridge that prevents it from being overcome by the horizontal pressure and weight of the water flowing down the river.

The engineers knew how heavy the train could be, and that the bridge could accommodate the train. They also were able to figure out how much pressure the water would bring to bear, so they could come up with the solution. The point here is that the engineers were not thinking of the nuts and bolts in the bridge when devising a solution, but of the overall structure. In other words, even though a bridge is a complex structure, the construction and architecture of the bridge as a whole could be assessed, rather than its individual components. The engineers’ design of the bridge structure is complex enough to serve its purpose, and simple enough to be understandable. The important thing to note is that only by understanding the properties of the structure as a whole was it possible to come up with a solution to this particular problem.

In information systems, the focus is too frequently on low-level aspects of the system, with an incomplete understanding of how they combine to form the whole. Inevitably, the solutions that are often employed end up being brittle and somewhat lacking.

Part of this problem is due to the fact that civil engineers have advantages over data engineers. They receive more intensive training specific to their field, and they have a long history of knowing what works and what does not. Their industry also does not change as quickly or drastically as the data industry. Civil engineers have quantifiable requirements for their projects—a bridge should be able to carry so many tons, etc.—while it takes special skills in the data world just to elicit all of the critical requirements from users. All of this together means that data engineers already have the deck stacked against them. They need a greater-than-average amount of skill to accomplish something that is merely usable. This perspective makes it a bit clearer why assigning teams of end users, business analysts, or even database administrators yields structures that are so often suboptimal.

The process of building specific data structures really should not be that different from building a bridge. For example, there should be a clear and solid plan for accomplishing the task, and there should be a clear definition of what roles and people are needed for the project. Most people would not want a house architect building a large data structure, just as they would not want to drive across a bridge that was designed by a software engineer.

So now let’s look at the results of a project where there was not a clear plan to begin with. The Winchester House* in San Jose, California, was built by the wealthy heiress to the Winchester fortune, earned from the sales of the world’s first repeating rifle. The house started with a reasonable base, but was under continuous construction for a period of more than 30 years. Subsequent construction was somewhat haphazard, with new additions made according to whim and available space. The result is a unique house of 160 rooms that contains such puzzling features as staircases that lead nowhere and two incomplete ballrooms. There isn’t even an overall set of blueprints that describe the current structure of the house.

The experience of a person wandering around in this house is similar to that of a software developer wandering around through ad hoc data structures. In both cases, the path that must be taken is often not the intuitive one, and there is a feeling of needing to work around the system rather than with it to get something accomplished. As time goes on, just as the Winchester house had strange features added, information systems have been modified in strange ways, including tacking new fields onto existing structures, and repurposing existing fields for new (and unrelated) data. Over time, these modifications add up and mutate the original system into a monstrosity whose primary purpose seems to be to provide job security for the few that fully understand it. Systems that are built and evolved without a plan end up feeling like cranky co-workers that make even the simplest tasks more complicated than they should be.

One of the reactions to system organization of this type was a framework for building enterprise architecture that is directed and purposeful—the Zachman Framework (Zachman, 1987). This framework is of interest to data managers working with XML because it represents a method of structuring systems and information that avoids the “Winchester House” syndrome described above. Data management planning and strategy must be going on simultaneously with XML work—an XML project should not be used as a surrogate for that planning. The Zachman framework provides a skeletal structure for how information components can be arranged within an organization to get the most benefit out of systems investments. It is therefore worthy of serious consideration for data managers working with XML who want to understand how their efforts should fit into the overall picture of the organization.

While much has been written of John Zachman’s framework, it is always good to read the original article, and so we advise data managers to get a copy of the paper and read it. It has been reproduced in numerous books and collections—searching for it on the web will yield a number of sources, including the Zachman Institute for Framework Advancement at http://www.zifa.com/.

In his article, Zachman articulates the case for large and complex technology systems to be developed using frameworks similar to those used to build airplanes and buildings. There is a well-defined process for huge physical engineering projects, and in some ways the goal of the Zachman framework is to learn from those processes in order to get better results in the data world. Large, complex system development is articulated using a series of plans representing different metadata, categorized according to the six primary questions and five levels of abstraction for large complex projects (see Figure 4.2). The intersections in the grid refer to areas that must be systematically understood rather than approached in an ad hoc way.


Figure 4.2 Zachman framework organization.

While we will address repository technologies and strategies later in the book, it is sufficient at this point to know two facts. Fact 1: Data managers cannot expect to find the features in a product that will enable a multi-year investment, so they should set their repository aspirations lower and build up their capabilities. Fact 2: XML-based metadata is so flexible that data managers can create a formal metadata evolution strategy, making it easy to transform XML-based metadata to meet changing requirements.

So data managers can begin to build up their “repository capabilities” by creating small implementations such as maintaining their XML in a small online database. Some organizations are able to begin implementing repositories using just operating system file naming conventions. As these become more sophisticated, the actual metadata can be migrated and evolved with increasingly sophisticated repository management techniques. To achieve this, the XML-based metadata is stored as XML documents. These documents can be modified using XSLT technologies. The changes to the metadata and aspects of metadata implementation strategy can then be implemented using database management technologies. Figure 4.3 shows how XML-based metadata is stored as XML documents, permitting them to be incorporated into an XSLT environment in order to evolve the metadata.


Figure 4.3 XML-based metadata repository architecture.
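To make the evolution step concrete, the sketch below shows a minimal XSLT stylesheet that copies a metadata document through unchanged except for one deliberate structural change. The metadata element names (dataItem, steward) are hypothetical; the identity-template pattern itself is standard XSLT.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- Identity template: copy all existing metadata through unchanged -->
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>

      <!-- Evolution step: every dataItem that lacks a steward element
           gains one, defaulted until the business supplies a value -->
      <xsl:template match="dataItem[not(steward)]">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
          <steward>unassigned</steward>
        </xsl:copy>
      </xsl:template>

    </xsl:stylesheet>

Because the metadata is just another XML document, a transformation like this can be rerun whenever the repository structure needs to move forward.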

What is kept in these frameworks, and how is it organized? The combination of a rough categorization by row (perspective) and column (interrogative) with the XML-based metadata itself argues strongly for maintaining the metadata as XML, creating what we refer to as framework-hosted, XML-based metadata, or XM. From the planning perspective, the “what” question corresponds to a list of the business “items” that are of most interest to the organization. For a credit card company, this might be a cardholder or a statement. These items are maintained as an XML list, along with the associated metadata, making the list useful to different applications. Beyond application use, this list could be published on the organization’s intranet for perusal by key staff, and the XML document containing the “whats” from the owner’s perspective could be accessed repeatedly. Framework-based XML documents containing XML metadata are stored, published on the data management group’s intranet, and printed for strategic planning guidance. These documents are also available to other business executives via their electronic dashboards (Figure 4.4).


Figure 4.4 Metadata as operational and strategic support.
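A framework-hosted metadata document of this kind might look something like the following sketch, a Planner-row, “what”-column list of business items for the credit card example. The element names and definitions are illustrative only, not a prescribed schema.

    <frameworkCell perspective="planner" interrogative="what">
      <businessItem>
        <name>Cardholder</name>
        <definition>A person or organization holding one or more card accounts.</definition>
      </businessItem>
      <businessItem>
        <name>Statement</name>
        <definition>The periodic summary of account activity sent to a cardholder.</definition>
      </businessItem>
    </frameworkCell>

Published on the intranet or surfaced through an executive dashboard, the same small document serves both operational and strategic consumers.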

The purpose of populating the framework is to acquire information about the overall enterprise architecture. Textbooks indicate that the proper means of developing enterprise architecture is to develop the high-level enterprise model and various tactical-level models; Figure 4.5 illustrates this original, top-down information engineering approach. Reasonable estimates indicate that such an effort has grown from a 5-year to a 10-year project, and organizations generally will not invest in such long-term projects. Gone are the days when organizations will permit data managers the luxury of even a five-year data management plan.


Figure 4.5 Information engineering via the original version.

What must happen now is the discovery of the enterprise architecture as a byproduct of other data engineering activities. Since data managers clearly cannot stop what they are doing and go work on enterprise architecture for a few years, it is important that some effort be made to derive enterprise architecture components from necessary work that is being performed anyway. As metadata is developed as part of other projects, it is added to the “repository” and classified according to the framework cell or cells that it occupies. It is necessary to develop enough critical mass in the components to permit users to discover combinations of components where one plus one is equal to three. For example, in order to have the list of business processes that interact with data items, an organization must first have both the list of data items and of business processes. By combining those two pieces of information, the whole is greater than the sum of the parts.

There is a reason that the architecture must be developed from the bottom up. While large complex structures such as the National Cathedral in Washington, DC, were developed top down, guided by detailed blueprints for more than 80 years and 4 generations of craftsmen, our legacy environment has not been maintained as well as we would have liked. Those detailed blueprints that cathedral builders have are nowhere to be found for many complex systems. Simply put, “plans and documentation … are poor or non-existent, and the original designers and craftsmen are no longer available. The result is increasing maintenance costs and decreasing programmer productivity—a situation that is inefficient, wasteful and costly to our businesses” (Spewak, 1993).

Architecture-based understanding is important because only from an architectural perspective can we consider a combination as more than the sum of its parts. Enterprise architecture is the combination of the data, process, and technology architectures. Data managers need to know that what they do has potential strategic implications beyond simply supporting the business with an information system. Just as engineers must be aware of the various strengths and weaknesses of raw materials used to create the architectural components, so too must data engineers understand the various strengths and weaknesses of their system’s architectural components.

Let us go back to the bridge discussion for another example of what most organizations have done with respect to development of their data structures. The approach that some have taken is analogous to an attempt to build a bridge across a body of water that is strong enough to support a continuous line of fully-loaded, 100-ton coal cars passing each other, by throwing lots of pebbles into the river, knowing that enough pebbles will eventually create a bridge of sorts that will allow passage of trains over the river. It is probably possible to do this so that the requirements of the train weight will be met, but even under the best conditions it will be slow, costly, prone to shifting, and may have unexpected side effects for the organization (in this case, the river would be dammed).

The introduction of XML permits the organization to perform what has perhaps been viewed as being of questionable value to the organization—data standardization. But it cannot be directly referred to as “data standardization.” Business owners will pay for the development of XML data structure tags but not for data standardization because they perceive that it does not contribute to the achievement of organizational goals. The difference in an XML-based data management environment is that you need to analyze only bite-sized chunks, permitting the smaller and better-defined projects to be executed within a budget cycle, with the results made visible to the remainder of the organization.

So each time any development work is done in the business or technical environment, it will be well worth it to spend the extra 5% and formally extract, analyze, understand, and improve the organizational metadata that informs and is produced by the development work. Figure 4.6 illustrates how development activities (input, maintenance, improvement, and new development) produce more useful metadata that is captured in a repository. The tasks are easier to conceive and manage if the development activities are prepared and treated as metadata engineering analyses.


Figure 4.6 Virtually all engineering activities produce metadata for the XML-based repository.

The next section will discuss several specific data engineering challenges.

• Metadata engineering analysis magnitude
• XML and data quality engineering
• XML and metadata modeling
• Data structure difficulties
• Data engineering roles
• Measuring data engineering tasks

Metadata Engineering Analyses

Metadata engineering analyses can be defined using Figure 4.6 as focused analyses developed specifically to obtain a “chunk” of metadata that will inform a larger IT or business-development task. For example, data incorporating the owner’s descriptions of the major business items can be used as input to a metadata engineering task specifying the next layer down—the designer perspective. To justify the extra 5% investment in the metadata engineering analyses, the organization must first come to view data as an asset. XML-based metadata itself has several characteristics that are good to keep in mind while working with it—using a resource effectively always entails understanding its properties and taking advantage of them. The most important characteristics of data that are often overlooked include the following:

• Data cannot be used up. In this, data is unique—it is the only organizational resource that cannot be exhausted, no matter how it is applied. Unlike the time, skills, and patience of good workers, or even budget dollars, data does not disappear after it has been used.

• Data tends to be self-perpetuating. The more data organizations have, the more they use and tend to want. This can in some cases become a problem in which organizations capture more data than they can realistically manage and use.

• Data contains more information than expected. Individual pieces of data are nice, but combining a number of different data sources often yields new insights. The field of data mining shows us that volumes of data often contain information about hidden trends. Synthesis of disparate data sources also frequently yields more information. In the data world, the whole is often more than the sum of its parts.

Each metadata engineering project should use a combination of framework specific and metadata engineering specific vocabularies. All of this information is maintained as Repository-based, XML Metadata Documents (RXMD).

The metadata engineering diagram is presented in Figure 4.7. This form of XML-Based Metadata Engineering (XBME) reinforces the point that reengineering can only be done when the forward engineering is based on coordinated metadata engineering. It illustrates the required coordination among various metadata engineering activities. When reengineering, the basic procedure is to begin with the as-is implemented assets and reverse engineer them into the logical as-is assets. Only when the logical business-data requirements and related metadata are understood can any reengineering hope to avoid problems with the existing data implementation. An example of a repository-based XML metadata project would be expressed as “a target analysis to take the as-is implemented data pertaining to the customer data and reverse engineer it from the Subcontractor/What cell into the Builder/What cell.”


Figure 4.7 XBME activity specification framework.
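The customer-data example could itself be recorded as a small XML activity specification, so that the reengineering work is described by the same kind of repository metadata it produces. The tag names below are invented for illustration:

    <xbmeActivity type="reverse-engineering">
      <subject>customer data</subject>
      <sourceCell perspective="subcontractor" interrogative="what"/>
      <targetCell perspective="builder" interrogative="what"/>
      <input>as-is implemented data structures</input>
      <output>logical as-is data assets</output>
    </xbmeActivity>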

Consider another example, focusing this time on the billing information from an existing system. At first observation, the system’s organization, with the A–F requests queued on a single machine, is puzzling. On deeper analysis, it turns out that the hard disk drives could only maintain a 20-MB file system. Recreating that structure would not necessarily provide the best support of organizational strategy. The previous discussion of how bad ideas from past systems can creep into new systems still applies here. New systems should start off as a clean slate. Of course, good features from the previous system can be intentionally brought forward. Which features or aspects of the legacy system worked well? The important distinction to make is that some features should be brought forward because they were good and they worked, not simply because they were there. During XML-based metadata reengineering, data gets wrapped into XML structures and is available for reuse immediately. This is the point at which quality must be engineered into both data and the metadata of the repository functions.

XML and Data Quality

Data quality is an important topic when discussing the effectiveness of XML or any other data technology, because ultimately XML documents can only be as useful as the data that is contained within them. Important decisions get made based on the contents of data stores, and if the source information that the decisions are based on is wrong, there is not much reason to put faith in the decision that was made in the first place.

XML increases the importance of the quality of the metadata layer. One important point is that there is a one-to-many relationship between the architect and the users: a single structural decision made by the architect is reflected in many downstream uses, so there is far more information at the user end than at the architect end. Programmatic correction therefore fixes multiple individual problems at once; if fixes are made upstream, they truly trickle down, not only saving money but preventing future problems.

XML and Metadata Modeling

When you have prepared the data quality engineering aspects of your metadata engineering analysis, you can begin the process of extracting metadata. Use the Common Metadata Model (CM2) described below as the basis for organizing your information within each framework cell. If you do not implement the right or perfect format for the data structure, you can have XML transformations developed to correct and improve your existing metadata. The CM2 forms the basis for developing the XML-based data structures according to three specific model decompositions. In this section, we present an overview of the metamodel framework (see Figure 4.8). More on the common metadata model, including business rule translations, can be found in Chapter 9 of Finkelstein and Aiken (1998).


Figure 4.8 Three primary and seven second level model view decompositions used to organize the system metadata. This comprises the required reverse engineering analysis and conceptual structure decompositions.

The same figure also illustrates the required metadata structure to support basic repository access requirements. There are two key success factors to each XML-based metadata reengineering analysis. First, the analysis chunks should be sized in a way that makes results easy to show. Second, the analysts should be rather choosy about which framework and common metadata model components get the most focus. Smaller, more focused projects make it easier to demonstrate results.

The set of optional, contextual extensions can also be managed in a separate decomposition. Some of these metadata items may be easily gathered during the process and should be gathered if they are relevant to the analysis. Contextual extensions might include:

• Data usage type—operational or decision support
• Data residency—functional, subject, geographic, etc.
• Data archival type—continuous, event-discrete, periodic-discrete
• Data granularity type—defining the smallest unit of addressable data as an attribute, an entity, or some other unit of measure
• Data access frequency—measured by accesses per measurement period
• Data access probability—probability that an individual logical data attribute will be accessed during a processing period
• Data update probability—probability that a logical data attribute will be updated during a processing period
• Data integration requirements—the number and possible classes of integration points
• Data subject types—the number and possible subject area breakdowns
• Data location types—the number and availability of possible node locations
• Data stewardship—the business unit charged with maintaining the logical data attribute
• Data attribute system component of record—the system component responsible for maintaining the logical data attribute
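Where such extensions are gathered, they can be recorded directly as additional metadata on the attribute in question. A sketch, using invented element names for a few of the extensions listed above:

    <logicalAttribute name="customerAddress">
      <dataUsageType>operational</dataUsageType>
      <dataArchivalType>periodic-discrete</dataArchivalType>
      <dataAccessFrequency period="day">1200</dataAccessFrequency>
      <dataUpdateProbability>0.05</dataUpdateProbability>
      <dataStewardship>Customer Service</dataStewardship>
      <systemComponentOfRecord>billing master file</systemComponentOfRecord>
    </logicalAttribute>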

Data Structure Difficulties

Getting the right data structures into the XML that you use with your systems is important because the cost of working with poor data structures at the individual attribute level is very high. When the amount of data that must be considered grows large, it makes sense for the data engineer to shift focus from spending all of his or her time thinking at the attribute level to concentrating on the data structure level. After all, every attribute should be just that—an attribute of a higher-level data structure, whether that is the relational concept of the entity, or anything else. An attribute is an aspect of a phenomenon or item that an organization wants to capture data about. When viewed in isolation, an attribute is not very useful. For example, what good is the information “123 Pine Lane” when there is no customer with whom to associate this address information? The thinking process should not be focused on addresses or capturing the date on which a customer placed an order—the level of granularity is too low.

Taking the initiative to implement XML in an organization should serve as an introduction to the process of learning about and improving existing data structures. To use XML as a stopgap solution, a substitute for real analysis work, or a way of avoiding the real issue of data engineering is to exacerbate the problem in the long run. In the next sections, we will take a look at why this approach has been used so frequently, and how the cycle might be broken.

With all of the benefits of sound architecture articulated, it is still often difficult for organizations to understand the importance of data structures. One of the arguments is typically that if data have been managed in the traditional way the entire time, and the organization has gotten along, why not continue to follow this route? In fact, while continuing along the same path may or may not be a feasible option, the efficiency gain resulting from appropriate data management can never be realized without some effort on the front end. What are other reasons that organizations have had a hard time understanding the importance of data structures?

The design of data structures is a discipline that does not have many metrics or judgment guidelines associated with it. It is difficult to take two examples of a particular data model and compare them, since there may be dozens of bases on which they could be compared, each of which may be hard to quantify. Many experienced data managers can gauge the quality of a design based on ethereal concepts such as elegance or beauty, but it is extremely difficult to assess exactly what characteristics of a design within a particular context indicate that it is elegant or beautiful. The reasons why one particular data structure would work and why another would not work also tend to be subtle. How much programming code is going to be needed to support the data structure and data flow? How many people are going to be required to maintain the system? How much flexibility does the structure provide for future expansion? Even given a set of answers to all of these and a host of other questions, along with weightings for the relative importance of the answers, it is difficult to judge a “best” solution.

This is a primary reason why organizations have had difficulty understanding the importance of data structures—because they have difficulty judging which is better or best. From the perspective of the enterprise, data structures either work or they do not work, for a very complicated set of reasons, and the analysis often does not go much deeper than that.

Data structures tend to be complicated, and many of those in common use are built on foundations whose soundness is questionable. This is usually the case when a legacy system has gone through many revisions and updates, stretching its functionality well beyond its originally intended limits. The more complex a structure is, the more brittle it tends to be, which can also necessitate huge changes in code to accommodate small data changes, along with many other unintended implications. When things get complicated in this way, workers tend to approach the problem by decomposing the system into smaller and smaller pieces until the problem is manageable. This is a completely reasonable approach, but if the system is complicated enough, this approach can lead to the “can’t see the forest for the trees” type of problem. Attempting to understand a massive system in terms of individual attributes is an example of understanding a system at too low a level.

This illustrates another reason for the difficulty in understanding data structures. In many cases, organizations simply are not paying attention to them. Mired down in the details of managing complexity, thoughts of changing entire structures are unthinkable since it would require code changes, reporting changes, and perhaps even process changes. It is indeed difficult to perform brain surgery on an existing legacy system by removing brittle components and replacing them with more robust ones. This underscores the importance of getting the data structures right from the start, rather than assuming that reform sometime in the future will be an option. Examining the way that physical structures are built in a little more depth will shed some light on what lessons can be learned from an engineering practice that is already much more advanced than that of data engineering.

Engineering Roles

Systems built in the past have frequently been constructed in a very ad hoc way. This approach is fine for small systems whose mission is simple and whose volume is low. Scaling these systems up as the organization grows and evolves, though, can be enormously difficult. New systems sometimes get implemented simply because the old system has been stretched to the absolute limit of scalability, and other times because the old system contained too many brittle assumptions about the organization that simply do not apply anymore. Regardless of whether one is building a new system or attempting to update an older system, the work involved typically falls into one of three categories: “physical data work,” “logical data work,” and “architectural data work.” Each category is characterized by the types of questions about the data and the systems that are asked by people in these respective roles, as illustrated in Figure 4.9.


Figure 4.9 Different layers of data work.

Systems are complicated, and building comprehensive, flexible systems is something that takes many people with many different types of expertise. To come back to our earlier analogy, most people understand that it would take more than one skill set to design and build something like a bridge over a river—it takes a group of professionals who all specialize in different areas. The same is true of large data systems. The high-level goal of the system might be very simple, for example, to provide human resources information to an organization, just like the high-level goal of a bridge is very simple—perhaps to get a train over a river safely. But that high-level goal has many smaller requirements. In the case of the bridge, it has to be able to withstand a particular amount of weight, the horizontal force of the water in the river, the effect of the climate and weather on the materials, and so on.

One of the reasons that many traditional systems have been inadequate in the eyes of those who work with them is that they have not been properly engineered. Inside your organization, is there a clear delineation between the physical, logical, and architectural roles? In many cases, the worker in charge of running the physical platform is creating the entity-relationship description of the system; the business analyst, who is ultimately an end user of the system and a data consumer, is trying to map out the flow of data through the system; and there may or may not be anyone assigned to the architectural role. It is not that these people cannot perform the necessary duties—they can. But there is a qualitative difference between the result of an effort by someone trained in that area, and an effort by someone who must learn the area and get moving by necessity. If these assignments, levels of expertise, and role confusions were the state of events in the engineering of bridges, people might be wise to think twice about crossing rivers.

Still other considerations must be taken into account. Unlike many new physical structures that are built from scratch, earlier systems influence subsequent systems heavily, because ultimately the data must be evolved from the old system to the new.

Measuring Data Engineering Tasks

We have discussed that the best case scenario might be one attribute per minute in terms of data mapping. From anecdotal information in the industry, however, it appears that a more realistic estimate would be 3–5 hours per attribute on average. Clearly, some attributes will take more time than others, as some data structures are more heavily modified in new systems. When estimating the amount of time that it will take to work with one attribute, a number of things must be taken into consideration:

• What type of data is in the attribute, and how does that data change in the new system?

• Which coding standards are in use, and do they change? For example, in many systems, a 1-character code might have an alphabet of possibilities, each of which has a different high-level meaning for the way that record is dealt with or understood within the context of the system.

• How does this attribute match other attributes in other structures? From a relational perspective, is this attribute a primary or foreign key? If so, how does that change the requirements of what can be allowed in the attribute? Does the attribute have the same restrictions in the legacy system, and if not, how does one reconcile those differences?

• Given profiling information about the attribute (the number of distinct occurrences, total number of records, and so on), does that information change as the attribute is migrated? If so, how does that affect performance, normalization, and how it interacts with other attributes?

All of these factors and a host of others go into estimates of the actual amount of time it would take to migrate a particular attribute. As the complexity of both the legacy and target systems increases, the number of attributes tends to grow, and the implications of these issues also tend to increase.
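The answers to these questions are themselves metadata worth capturing. A single mapping might be recorded along the lines of the hypothetical sketch below (systems, field names, and code values are invented), so that the reasoning behind the mapping survives the migration project:

    <attributeMapping>
      <source system="legacy-payroll" record="EMP-MASTER" field="EMP-STAT-CD"/>
      <target system="new-hr" entity="Employee" attribute="employmentStatus"/>
      <!-- illustrative transformation rule only -->
      <transformation>
        Translate the 1-character status codes (A, L, T, R) into the new
        system's controlled vocabulary (Active, Leave, Terminated, Retired).
      </transformation>
      <keyRole source="none" target="none"/>
      <estimatedEffortHours>4</estimatedEffortHours>
    </attributeMapping>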

It becomes easier to see why the average is closer to 3–5 hours per attribute, rather than 1 minute per attribute. The effects of the new estimate on the overall length of the project are profound. If data managers retain but temper their optimism, and estimate that for a particular systems migration project each attribute will take 1 hour, the estimate grows to approximately a decade of person-work. That might be doable in 1 year of work for 10 people, but it likely could not be accomplished by 120 people in 1 month, even if there were not the massive burden of training and coordinating so many people.*

With these estimates in mind, it becomes quite clear why getting these efforts right the first time becomes critical. If the mapping is done incorrectly, or a portion of the data cannot be migrated due to an error in understanding of the attributes that were mapped, there is a good chance that substantial extra cost and effort will have to be undertaken to repair the damage. Typically, the consequences of putting the wrong roles on the project are significant budget and time overruns, paired with less than expected functionality. Keeping these ideas in mind allows us to make the most out of XML efforts related to these projects.

XML, Security, and Data Engineering

Of all of the various aspects of data management, security is one of the areas that XML did not thoroughly address until recently. In this section, we will briefly discuss what XML has to offer in the way of security, and what data managers might expect.

In many organizations, security is not really thought of as a core mission of data management. Still, it is something that most who work in the field will tell you should be part of the overall process, and not something that is tacked onto an existing system. Security should be integrated into the process from the start. When people talk about security, there are many topics they refer to that are worth considering, such as data authenticity, confidentiality, integrity, and availability. In addition, some refer to the security topic of nonrepudiation, which deals with auditing to determine who performed which action at which time, and to minimize deniability of actions.

When people refer to XML and security, there are two large areas that they speak about. The first deals with security in situations where queries are executed against XML documents, such as when a user wishes to pull a certain set of elements out of an XML document. The second deals with the security of documents themselves as they are in transit over a network. This can be loosely thought of as security within a document, versus the security of the overall document.

As of the writing of this book, there were several standards and approaches to security in querying XML data sources, but they were not particularly advanced or widely implemented. While those standards are developing and will be used in the near future, for this discussion we will focus on protecting documents in transit.

Figure 4.10 illustrates how XML security is actually considered another “layer” of security on top of several that already exist. The widely accepted OSI model of networking breaks up the functions of networking into several layers, such as the transport, protocol, and application layers. Security concepts have been developed at each layer, which is why some might suggest that XML security can be done using other technologies that already exist. Those technologies include IP Security (IPSec) at the Internet-protocol level, Transport Layer Security (TLS) for the Transmission Control Protocol (TCP) at the transport level, and Secure Sockets Layer (SSL) at the protocol layer. When people refer to XML security standards that protect documents in transit, they are usually referring to those that would logically fall into the very top row of Figure 4.10. When XML documents travel over networks, the process of communication involves all of the different layers, even if the specific technologies in those layers may differ.


Figure 4.10 XML security and the OSI model of layered standards.

There are three main security standards that data managers should be aware of for working with XML. These standards are already mature and can be implemented by data managers in existing systems, or as part of new development to secure XML documents. They are SAML, XML Signatures, and XKMS, described below.

SAML–Security Assertion Markup Language

The Security Assertion Markup Language (SAML) has been ratified by OASIS as a standard, and provides a public key infrastructure (PKI) interface for XML. The standard gives XML a framework for exchanging authentication and authorization information, but it does not provide a facility for actually checking or revoking credentials. In other words, SAML offers the XML structure needed to request and understand authentication information, but it does not actually check the information. This might sound like a drawback, but actually, it is an advantage; SAML provides a generic mechanism for gathering the data that can then be verified according to what is right for the application. In some cases, that may be a username and password, in others a cryptographic key, or in others just a pass phrase.

There are three kinds of SAML statements in the assertions schema:

1. Authentication, stating who the requester is

2. Authorization decision, stating whether or not the request was granted

3. Attribute, stating what is actually being requested

Through combinations of these statements, SAML allows application designers to determine how authorization information should be exchanged, and when that transaction should happen. SAML is a good example of an XML meta-language for security.
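For a sense of what these statements look like on the wire, the fragment below sketches an assertion carrying an authentication statement and an attribute statement. It follows the general shape of the SAML assertion schema (the namespace shown is the SAML 2.0 one, and element names vary slightly between versions), and it is abbreviated rather than complete and valid:

    <saml:Assertion xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion"
                    Version="2.0" ID="_a75adf55" IssueInstant="2004-07-01T09:30:00Z">
      <saml:Issuer>https://idp.example.org</saml:Issuer>
      <saml:Subject>
        <saml:NameID>jsmith</saml:NameID>
      </saml:Subject>
      <saml:AuthnStatement AuthnInstant="2004-07-01T09:30:00Z">
        <saml:AuthnContext>
          <saml:AuthnContextClassRef>
            urn:oasis:names:tc:SAML:2.0:ac:classes:Password
          </saml:AuthnContextClassRef>
        </saml:AuthnContext>
      </saml:AuthnStatement>
      <saml:AttributeStatement>
        <saml:Attribute Name="role">
          <saml:AttributeValue>accounts-payable-clerk</saml:AttributeValue>
        </saml:Attribute>
      </saml:AttributeStatement>
    </saml:Assertion>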

XML Signatures

XML Signatures is a standard that enables the recipient to be sure that the message was not tampered with, and to verify the identity of the sender. This is actually an extension to XML of a very common security mechanism—the signature. If documents have to pass through several hands or many different networks before they reach their final destination, it is crucial to have a signature attached that lets the recipient be confident of the information being received. If an XML signature is verified, the recipient knows that the document is in the same form in which the sender sent it, and if the signature fails to verify, the recipient would be wise to ignore the document and contact the sender. The two security concepts that are most relevant to data signatures are data integrity and nonrepudiation.
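The skeleton of such a signature, using the namespace and element names from the W3C XML Signature recommendation with the digest and signature values abbreviated, looks roughly like this:

    <Signature xmlns="http://www.w3.org/2000/09/xmldsig#">
      <SignedInfo>
        <CanonicalizationMethod
            Algorithm="http://www.w3.org/TR/2001/REC-xml-c14n-20010315"/>
        <SignatureMethod
            Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-sha1"/>
        <Reference URI="">
          <Transforms>
            <Transform
                Algorithm="http://www.w3.org/2000/09/xmldsig#enveloped-signature"/>
          </Transforms>
          <DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
          <DigestValue>j6lwx3rvEPO0vKtMup4NbeVu8nk=</DigestValue>
        </Reference>
      </SignedInfo>
      <SignatureValue>MC0CFFrVLtRlk...</SignatureValue>
      <KeyInfo>
        <KeyName>signer@example.org</KeyName>
      </KeyInfo>
    </Signature>

The SignedInfo element says what was digested and how, the SignatureValue is computed over SignedInfo, and KeyInfo gives the recipient a hint about which key to use for verification.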

The one important thing to keep in mind about signatures is that they are only as good as the signing party. If someone or some organization that you know and trust created the XML signature, then it may be useful. But if a signature was created by an identity that you are not familiar with, then the signature is only as trustworthy as the identity behind it. For this reason, when people create signatures they usually use standard identities that can be checked and verified using the public key infrastructure (PKI). The concepts behind PKI and encryption are covered at length in other texts, which describe in more detail the important aspects of trust, how it is cultivated and revoked in digital environments. Before working with XML signatures, it would be wise to familiarize yourself with this information.

XKMS–The XML Key Management Specification

XKMS provides a facility for performing PKI operations as XML requests and replies. PKI, the public key infrastructure, is the basic trust mechanism behind encryption and digital signatures. Normally, individuals and organizations have standard identities that they publish and maintain. Those identities are published on central servers where they can be requested and updated. When information is needed about a particular identity, XKMS is an XML-based way to ask for the information and receive a response.

This might be used in conjunction with XML signatures as part of a larger security solution. When documents come in with particular XML signatures, the application might utilize XKMS to fetch the identity that corresponds to the signature in the document. That identity could then be checked against a list of known identities to determine whether or not the data in the document was trustworthy.
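As a rough, abbreviated sketch of that interaction (the element names follow the W3C XKMS 2.0 specification, but this is not a complete or validated request), the application might issue a locate request such as:

    <LocateRequest xmlns="http://www.w3.org/2002/03/xkms#"
                   xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
                   Id="_locate-001"
                   Service="http://xkms.example.org/service">
      <RespondWith>http://www.w3.org/2002/03/xkms#KeyValue</RespondWith>
      <QueryKeyBinding>
        <ds:KeyInfo>
          <ds:KeyName>signer@example.org</ds:KeyName>
        </ds:KeyInfo>
      </QueryKeyBinding>
    </LocateRequest>

The service's reply carries the key information it was able to locate, which the application can then weigh against its own list of trusted identities.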

Overall, XML security standards, like other components of the XML architecture, are meant to handle a piece of a problem, and to be combined to solve larger problems. There is no one single XML technology that addresses all security concerns, nor is there likely to be. But by surveying the landscape of available XML security technologies, data managers already have the tools necessary to ensure the confidentiality and integrity of critical information transmitted in XML form.

Data Mapping Case Study

Let us take a look at the issue of migrating from a legacy system to a new system, and how XML-based architecture engineering plays into it. This example is essential in the next chapter. Often, the process of migrating data from a legacy system to a new system consists of an analyst trying to come up with a set of mappings between two systems, old and new or “as is” and “to be.” Given a particular attribute, a number of questions immediately arise about how this attribute might be migrated.

• Does this attribute map to the new system?
• Does it map to just one element or data item in the new system, or more than one?
• Do multiple attributes in the old system have to be combined for insertion into the new system (as is often the case with ill-advised “smart keys”)?
• Does this attribute remain associated with the same entity or a different one? What is the implication for the data values and normalization?

When analysis is approached in this way, it tends to remain unfocused and much too granular. In such a case, the form of the data in the legacy system has an inappropriately large impact on the way the data is thought of and moved over to the new system. One of the reasons that we implement new systems is to free ourselves from the mistakes and shortcuts of the past that caused organizational pain—and to think of the new solution and/or data in terms of the old system is to invite some of those old, ill-advised practices to crop up in the new system, albeit perhaps under a different guise.

Attempting to migrate individual attributes focuses on the wrong level of the problem and results in solving many individual many-to-many mappings. This is shown as the top half of Figure 4.11. Rather than looking at attributes, or even individual entities, what really should be considered is the overall data structure. What information is being captured, and why? Migration efforts must take into account the thinking that went into the architecture of both the old and the new systems. For example, the old legacy system stored address information at the customer “household” level (a common approach for certain types of businesses such as catalog marketers, etc.), with each household potentially containing many customers. The new system, however, stores address information at the individual customer level. If this migration is attempted at the attribute level, the complexity is profound, since the total number of records for address will likely increase substantially; the attribute has to be moved into a different entity, the ability to aggregate customer records into households changes, and so on. Considering the architectural goals of both the old and new systems can simplify things. Why was “address” put in this place, and how might we manipulate data from the old system into a coherent data structure as it would appear in the new system? Understanding not only the hows, but the whys of systems is critical to making work as efficient and accurate as possible.


Figure 4.11 The traditional systems analysis approach.
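A structure-level view of the household example can be expressed directly as a transformation: rather than mapping the address attribute in isolation, the household document is reshaped so that each customer carries the household's address. The element names below are hypothetical; only the XSLT mechanics are standard.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <xsl:template match="/households">
        <customers>
          <xsl:apply-templates select="household/customer"/>
        </customers>
      </xsl:template>

      <!-- Each customer in the new structure carries its own address,
           copied down from the enclosing household -->
      <xsl:template match="customer">
        <customer id="{@id}">
          <xsl:copy-of select="name"/>
          <xsl:copy-of select="../address"/>
        </customer>
      </xsl:template>

    </xsl:stylesheet>

The architectural decision (address now lives with the customer) is captured once, rather than being rediscovered attribute by attribute across thousands of records.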


Figure 4.12 Characteristics of the old system and the new system.

Let us take a look at a concrete example of the kinds of numbers that might be dealt with in a systems migration effort. In Figure 4.12, we see that the old payroll and personnel systems are going to be replaced by a new system. The platform is changing; the operating system and even the core data structure are changing from a flat and hierarchical system to a client/server relational database management system.

One of the first things that catches the eye is the large disparity between the number of logical and physical records in the old system and the new. The old system has a total of more than 5 million physical records, with just over 300,000 total logical records. The new system seems to reduce the total number of logical records to 250,000, and the number of physical records is being reduced by a factor of 7! How is this even possible? On the other hand, the total number of attributes is ballooning from about 2,000 in the combined old systems to 15,000 in the new system. In many system migration cases, what we find is that the new systems may appear to have substantially more or less data than the old systems. This tends to make an attribute level mapping of data from the old system to the new extraordinarily difficult. But why the huge change in data storage statistics? There are several reasons for these changes.

• As systems age, they have more and more support code and data put into them to keep them running in response to changing requirements. Almost any system that is older than a few years will have typically required modification to allow it to perform some task that was outside of its original requirements. These modifications are rarely clean or elegant, and often require “strut” data structures that are not required in the new system, since it hopefully was designed from the start to accommodate that requirement.

• Different requirements for the new system sometimes cause extra data to be moved into or out of the system. For example, mappings between zip codes and state names, past client histories, and other information might be obtained from another source in a new system, so that the new system would appear smaller than the one it is replacing. This is due to a change in processes and a change in data flow.

• Typically, new systems implementation efforts are targeted rightly or wrongly to satisfy many data needs that the users may not have had satisfied in the past. This can cause “feature creep” in the design of new systems that leads to the appearance that they are larger or smaller than the original.

One of the problems that new system development frequently falls victim to is something akin to “data packrats.” In many cases, attempts are made to bring all of the data from an old system forward to a new system, when no analysis has been done as to whether or not that old data is actually needed in the new system. This data can crop up in strange places. For each data item that a system captures and manipulates, it is always a good idea to know why that item is being captured, and what it will eventually be used for. In many cases, particular pieces of information are captured because they are needed for some small sub-process, for example, validating customer information or running a particular validation check on an account. As time moves on and processes change, that data might no longer have any reason for existing, yet still it is brought forward because everyone is “pretty sure that it is there for one reason or another.”

If moving to a new system means starting with as clean a slate as possible, why not make an effort to minimize the amount of information the new system has to cope with by taking a hard look at whether or not data-capture requirements have changed since the old system was designed? Then again, sometimes data is captured simply because the organization thinks that it might have a use for the data sometime in the future. For example, one major national credit card company captures information on the shoe size of its customers where that data is available. Perhaps the data will be useful in some data mining effort in the future, but before adding large volumes of data that must be managed to a system, it is always wise to look at the proposed benefit relative to the very real current costs of managing that data.

The tendency is to capture as much data as possible, on the theory that no one ever got in trouble for having too much information on which to base decisions and actions. Unfortunately, that is not the case. Each piece of data that is captured has to be managed and dealt with, along with all of the quality issues that can arise in data. The only thing worse than not capturing a specific piece of data is believing that the data you have is accurate when in fact it is not.

Now let us look at the proposed metrics for this attribute-mapping process, which came from a professional systems migration project (Figure 4.13). Given the problem, a mapping of 2,000 attributes onto a new set of 15,000, it was proposed that two data engineers could accomplish the work in 1 month of constant effort. With those numbers in mind, the figure shows that the work averages out to nearly one attribute per minute. In other words, we would expect these data engineers to understand and map roughly one attribute per minute, once they had located it, and then use whatever remained of that minute to analyze the documentation and perform quality assurance on their work. Clearly this is not a sustainable rate. In fact, it is an utterly unrealistic pace for anything but the most rudimentary migrations involving a minimum of data; at best, these numbers could be read as an optimistic projection for simple projects.

image

Figure 4.13 Metrics of an “extreme” data engineer.

Project Planning Metadata

Figure 4.14 shows a project plan that was submitted by one of the final four mega-consulting firms bidding on a project. That firm told the customer the implementation would require more than $35 million; the authors’ organization brought it in for under $13 million. In this case, a copy of the actual Microsoft Project document was obtained and examined carefully before being exported to a format in which it could be repurposed: XML. Once available as XML, the plan was imported into spreadsheets, databases, and data-analysis tools. Analysis of this data enabled us to determine that the firm was attempting to bill the customer for several full years of a single partner’s time, and to point out early on the infeasibility of its data conversion planning. (A sketch of this kind of analysis follows Figure 4.14.)

image

Figure 4.14 Representative project plan.
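The repurposing step itself is mundane once the plan is out of its proprietary format. The listing below is a minimal sketch of this kind of per-resource analysis, assuming the plan has been exported to XML with resource and assignment elements; the element names (Resource, Assignment, UID, Work) and the 2,000-hour person-year are illustrative assumptions rather than a guaranteed match for any particular tool’s export.

# A minimal sketch: total planned work per resource from a project plan
# exported to XML. Element names are illustrative assumptions.
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

def hours_from_duration(text):
    """Parse a duration such as 'PT120H0M0S' into hours."""
    match = re.fullmatch(r"PT(\d+)H(\d+)M(\d+)S", text or "")
    if not match:
        return 0.0
    h, m, s = (int(g) for g in match.groups())
    return h + m / 60 + s / 3600

def work_per_resource(path):
    root = ET.parse(path).getroot()
    for el in root.iter():                      # strip namespaces so the sketch
        el.tag = el.tag.split("}")[-1]          # works regardless of the export URI
    names = {r.findtext("UID"): r.findtext("Name") for r in root.iter("Resource")}
    totals = defaultdict(float)
    for a in root.iter("Assignment"):
        uid = a.findtext("ResourceUID")
        totals[names.get(uid, uid)] += hours_from_duration(a.findtext("Work"))
    return totals

if __name__ == "__main__":
    for name, hours in sorted(work_per_resource("plan.xml").items(),
                              key=lambda kv: -kv[1]):
        # ~2,000 billable hours per person-year is a rough assumption
        print(f"{name:30} {hours:8.0f} h  (~{hours / 2000:.1f} person-years)")

Something as simple as a per-resource total is often enough to surface anomalies such as the partner-time issue described above.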

The original consultant’s proposal was that the data conversion for the ERP would take 2 person-months. When we did the math, we found the following:

image Two person-months are equal to roughly 40 person-days (assuming 20 working days per month).

image In those 40 person-days, the consultant was claiming it could map 2,000 source attributes onto 15,000 target attributes.

image On the source side, mapping 2,000 attributes in 40 person-days requires a rate of 50 attributes per person-day, or 6.25 attributes analyzed per hour.

image On the target side, the team would have to analyze 15,000 data attributes in 40 person-days, a rate of 375 attributes per person-day, or 46.875 attributes per hour.

image To do a quality job, the locating, identifying, understanding, mapping, transforming, and documenting tasks would have to be accomplished at a combined rate of roughly 53 attributes per hour, or nearly one attribute for each and every minute. (A quick calculation sketch follows this list.)
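The arithmetic is simple enough to script, which also makes it easy to rerun the sanity check whenever a vendor supplies new figures. The inputs below are the proposal’s own numbers; the 20-working-day month and the 8-hour day are our assumptions.

# Sanity-checking the proposed conversion effort. The attribute counts and
# person-months come from the proposal; the workday length is an assumption.
source_attrs = 2_000
target_attrs = 15_000
person_days = 2 * 20                 # two person-months at ~20 working days each
hours_per_day = 8

source_rate = source_attrs / person_days                 # 50 per person-day
target_rate = target_attrs / person_days                 # 375 per person-day
combined_per_hour = (source_attrs + target_attrs) / (person_days * hours_per_day)

print(f"Source side: {source_rate:.0f}/person-day, {source_rate / hours_per_day:.2f}/hour")
print(f"Target side: {target_rate:.0f}/person-day, {target_rate / hours_per_day:.3f}/hour")
print(f"Combined:    {combined_per_hour:.1f}/hour, {combined_per_hour / 60:.2f}/minute")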

There are all sorts of reasons why these numbers are not credible, and they imply that the organization was either underbidding the project in hopes of an add-on contract to extend the work, or was hopelessly naive about the process required. Whether or not organizations believe a metric of one attribute per hour, if they go to the trouble of analyzing the metadata they can collect about the project, they will be better informed about the project’s size and shape. Next, we describe the process of repurposing the ERP metadata: wrapping it in XML and reusing it in a number of ways to help the project implementation.

Extracting Metadata from Vendor Packages

As this chapter is being written in 2003, the takeover battle between Oracle and PeopleSoft, which has also drawn in J. D. Edwards, is playing out. It underscores the importance of the ERP-derived data management products that can be managed using XML. Imagine the resources that will be wasted converting from J. D. Edwards to PeopleSoft and then converting again to something else.

Figure 4.15 shows the first 23 of approximately 1,000 business processes contained in the PeopleSoft Pay, Personnel, and Benefits modules, a popular combination of products purchased to replace legacy systems. This metadata can be accessed relatively easily, a fact that is not widely known, and obtaining it in an XML-based format is not difficult. Understanding the internal structures used by the ERP can vastly simplify the process of evolving your data into its new forms, and the value of the XML-wrapped business processes is easiest to see when they are viewed as hierarchical structures. (A small wrapping sketch follows Figure 4.15.)

image

Figure 4.15 A view of PeopleSoft business processes.
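The listing below is a minimal sketch of what “wrapping” this kind of metadata looks like, assuming the process catalog has already been pulled out of the ERP as flat module/process/component/panel rows. The sample rows and element names are hypothetical, not PeopleSoft’s actual catalog.

# A minimal sketch of wrapping flat ERP process metadata in hierarchical XML.
# The sample rows and element names are hypothetical.
import xml.etree.ElementTree as ET

rows = [
    # (home page / module, business process, component, panel)
    ("Administer Workforce", "Hire Employee",   "JOB_DATA",    "JOB_DATA1"),
    ("Administer Workforce", "Hire Employee",   "JOB_DATA",    "JOB_DATA2"),
    ("Administer Workforce", "Update Personal", "PERSONAL",    "PERS_DATA1"),
    ("Develop Workforce",    "Plan Careers",    "CAREER_PLAN", "CAREER_PLN1"),
]

def wrap(rows):
    """Turn flat rows into a module > process > component > panel hierarchy."""
    root = ET.Element("erp-metadata")
    index = {}                            # path tuple -> already-created element
    for module, process, component, panel in rows:
        path = ()
        for tag, name in (("module", module), ("process", process),
                          ("component", component), ("panel", panel)):
            parent = index.get(path, root)
            path = path + (name,)
            if path not in index:
                index[path] = ET.SubElement(parent, tag, name=name)
    return ET.ElementTree(root)

wrap(rows).write("processes.xml", encoding="utf-8", xml_declaration=True)

Once the catalog is in this form, the same document can feed training materials, conversion planning, and the management views described later in this section.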

Figure 4.16 illustrates the PeopleSoft internal metadata structure, and Figure 4.17 shows the physical implementation of the hierarchy described below. This hierarchy of workflow, system, and data metadata (among other structures) plays a number of roles in the implementation of the ERP, including the following:

image

Figure 4.16 PeopleSoft internal metadata structure.

image

Figure 4.17 PeopleSoft process metadata.

image Workflow metadata can be used to support business-practice analysis and realignment. It permits accurate representations of exactly what the ERP does, which can have important implications if the organization plans to become ISO 9000 certified in order to do business in the EU.

image System structure metadata can be used in requirements verification and system change analysis. XML-wrapped system-structure metadata allows the development of more precise requirements-validation techniques. We practice a requirements technique in which each requirement is precisely tied to system functionality existing on one or more uniquely identified screens of the application; for example, panels X, Y, and Z are used to implement the hiring process.

image Data metadata is data describing specific aspects of the implemented data, and it can be used to support data-related evolution sub-tasks such as data conversion, data security, and knowledge-worker training. Expressing this metadata in XML permits it to be reused in support of data conversion, for example by automating the building of extraction queries and organizing training sessions. (A brief sketch of generating extraction queries from such metadata follows this list.)
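To illustrate the extraction-query point, the sketch below generates legacy extraction SQL directly from XML-wrapped data metadata. The record and field element names, and the sample mapping document, are hypothetical; the point is that once the mappings live in XML, the queries can be produced mechanically rather than typed by hand.

# A hedged sketch: generate legacy extraction SQL from XML-wrapped data
# metadata. The element names and the sample mapping are hypothetical.
import xml.etree.ElementTree as ET

SAMPLE = """\
<data-metadata>
  <record name="PS_JOB" source-table="LEGACY.EMP_JOB">
    <field name="EMPLID"  source-column="EMP_ID"/>
    <field name="DEPTID"  source-column="DEPT_CODE"/>
    <field name="HIRE_DT" source-column="DATE_HIRED"/>
  </record>
</data-metadata>
"""

def extraction_queries(xml_text):
    """Yield one SELECT statement per mapped record in the metadata."""
    root = ET.fromstring(xml_text)
    for record in root.findall("record"):
        columns = ", ".join(
            f"{f.get('source-column')} AS {f.get('name')}"
            for f in record.findall("field"))
        yield (f"SELECT {columns} FROM {record.get('source-table')};"
               f"  -- feeds {record.get('name')}")

for query in extraction_queries(SAMPLE):
    print(query)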

ERP metadata expressed in XML has many other novel uses; see Aiken and Ngwenyama (1999) for more technical details. Let us close this section on ERP metadata products by describing two products that managers found particularly useful.

The first is a quick tool developed for managers interested in exploring the new ERP. Illustrated in Figure 4.18, a small Java application reads the XML and displays summaries of ERP metadata depending on where the user places the mouse. The figure illustrates the use of workflow metadata to show the processes associated with each home page, particularly the two with the most, Develop Workforce and Administer Workforce, along with the Administer Workforce components. The same tool was used to show why the recruiters were receiving separate training. In this way, every manager who visited the web site understood that they were benefiting from the use of XML. (A minimal sketch of the summarization idea follows Figure 4.18.)

image

Figure 4.18 Tangible use of XML by management: XML-wrapped PeopleSoft metadata.
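The original tool was a small Java application; the Python sketch below shows the same summarization idea, counting the processes wrapped under each home page. The homepage and process element names are hypothetical.

# Count the business processes wrapped under each home page in the
# XML-wrapped workflow metadata. Element names are hypothetical.
import xml.etree.ElementTree as ET

def processes_per_homepage(path):
    root = ET.parse(path).getroot()
    return {hp.get("name"): len(hp.findall(".//process"))
            for hp in root.iter("homepage")}

if __name__ == "__main__":
    for name, count in sorted(processes_per_homepage("workflow.xml").items(),
                              key=lambda kv: -kv[1]):
        print(f"{name}: {count} processes")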

The second management-oriented product is the ability to phase the various cut-overs during the implementation. Figure 4.19 shows a chart that all of management understood. XML provided a means of buffering users from transitions with high failure costs: legacy data expressed in XML and interacting with the ERP data can be managed using XML schemas and focused data engineering. As a data management product, the phasing ability will be welcomed by anyone who has suffered through a poorly managed cut-over. (A small validation sketch follows Figure 4.19.)

image

Figure 4.19 XML wrapping of PeopleSoft metadata. (TheMAT is short for The Metadata Access Technology.)
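The listing below is a minimal sketch of the buffering idea, assuming each phase’s legacy extract is wrapped in XML and validated against that phase’s schema before it is allowed to touch the ERP. The file names and the use of the third-party lxml library are our assumptions.

# Validate a phase's XML-wrapped legacy extract against that phase's schema
# before loading it. File names and the lxml dependency are assumptions.
from lxml import etree

def validate_phase(batch_path, schema_path):
    schema = etree.XMLSchema(etree.parse(schema_path))
    batch = etree.parse(batch_path)
    if schema.validate(batch):
        return True, []
    return False, [str(err) for err in schema.error_log]

ok, errors = validate_phase("phase1_extract.xml", "phase1_cutover.xsd")
if not ok:
    # Hold the batch back rather than pushing questionable data into the ERP.
    for err in errors:
        print(err)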

Real-world projects, however, tend to be complex enough to rule out using such vendor figures as a basis for estimates. In truth, the actual time these efforts take is many times the estimate. Given that, it is not difficult to see why data-engineering projects of this type tend to run late and over budget. Even the engineer of a railway bridge over a river is not burdened with figuring out how to take architectural pieces of the previous bridge and somehow work them into the new design. The task of simply understanding the complexity of a systems migration in context is considerable; in some cases, that preliminary step may take as much time as the organization has allotted for the entire effort!

So the question arises: how exactly do we measure these data-engineering tasks? To plan projects effectively and to understand what one is getting into before a project is launched, it is important to have some ability to measure what the work will take.

Chapter Summary

We have described the interrelationships that exist between XML and data engineering. At the very least, we hope you are motivated to begin reverse engineering your new systems in order to understand the data architecture of the target components. That discussion led to an argument for the importance of the architecture concepts supporting the analysis. We have also covered the role of data quality in XML metadata recovery, the development of vocabularies supporting metadata-engineering activities, and the approach to XML-based data engineering.

The purpose of these discussions is to get data managers thinking in the right direction about some of the projects currently under way inside their organizations. Looking at the Zachman framework, it is useful for data managers to know that the things they are doing in their projects today can work toward the goal of a more coherent overall architecture, even if management will not spring for an enterprise architecture initiative. Toward the end of the chapter, we discussed sizing components so that value can be shown from metadata-engineering projects, and how to structure these engagements to get the most benefit out of XML-based metadata.

Distilled into its simplest form, the lesson of this chapter is that, first and foremost, organizations need well-structured data upon which to build their XML structures. Attacking this issue at the global organizational level is too large and too expensive an undertaking, so data managers need to take a bottom-up approach to building these solutions. Throughout these projects there must, of course, be a way of fitting the components together, and that coordination can be accomplished through the effective metadata management techniques we have discussed.

References

Aiken, P.H., Ngwenyama, O., et al. Reverse engineering new systems for smooth implementation. IEEE Software. 1999;16(2):36–43.

Finkelstein, C., Aiken, P.H. Building corporate portals using XML. New York: McGraw-Hill; 1998.

Merriam-Webster’s unabridged dictionary. Springfield, MA: Merriam-Webster, 2003.

Spewak, S.H. Enterprise architecture planning. Boston: QED Publishing; 1993.

Zachman, J. A framework for information systems architecture. IBM Systems Journal. 1987;26(3):276–292.


*Merriam Webster’s Unabridged Dictionary. (2003).

*Example from Spewak (1993). http://www.winchestermysteryhouse.com

*For more information on this issue of “too many cooks in the kitchen,” see Fred Brooks’s work, The Mythical Man-Month (1995).
