Chapter 1.5

Corporate Data Analysis

Abstract

Corporate data encompass all of the data found anywhere in the corporation. The most basic division of corporate data is into structured data and unstructured data. As a rule, there is much more unstructured data than structured data. Unstructured data divide further into repetitive data and nonrepetitive data. Big data is made up of unstructured data. Nonrepetitive big data has a fundamentally different form than repetitive unstructured big data. In fact, the differences between nonrepetitive big data and repetitive big data are so large that they mark the boundaries of the “great divide.” The divide is so large that many professionals are not even aware that it exists. As a rule, nonrepetitive big data has MUCH greater business value than repetitive big data.

Keywords

Structured data; Unstructured data; Corporate data; Repetitive data; Nonrepetitive data; Business value; The great divide of data; Big data

Data are fairly worthless unless they can be analyzed. So, the data architect must always keep in mind that ultimately, the purpose of data is to support analysis.

The analysis of corporate data is much like the analysis of any other kind of data, with one exception. That exception is that most of the time, corporate data come from multiple sources and arrive as multiple types of data. The multifaceted origins of corporate data color all of the analysis of corporate data. Fig. 1.5.1 depicts the need to analyze corporate data.

Fig. 1.5.1 Examining the details.

As is the case with all data analysis, the first consideration is whether the analysis will be a formal analysis or an informal analysis. A formal analysis is one with corporate or even legal consequences. Occasionally, an organization has to do an analysis that is governed by rules of compliance. Typical governing regulations are Sarbanes-Oxley and HIPAA. And there are plenty of other types of compliance, such as audit compliance. When a formal analysis is occurring, the analyst has to be concerned with the validity and the lineage of the data. If incorrect data are used for a formal analysis, the consequences can be dire. Therefore, if a formal analysis is to occur, the veracity and the lineage of the data are very important. In the case of public corporations, an external public accounting firm must sign off on the quality and accuracy of the data.

The other type of analysis is an informal analysis. An informal analysis is done quickly and can use whatever numbers are available. While it is nice if the data used for an informal analysis are accurate, the consequences of using less than accurate information in an informal analysis are not severe.

When doing data analysis, the analyst must remain constantly aware of whether the analysis is formal or informal.

The first step in doing corporate data analysis is physically gathering the data to be analyzed. Fig. 1.5.2 shows there are usually many diverse sources of corporate data.

Fig. 1.5.2 Diverse sources of textual data.

In many cases, the sources of data are automated, so physically gathering data is not much of a problem. But in some cases, the data exist on a physical medium such as paper, and the data must pass through technology such as optical character recognition (OCR) software. Or in other cases, the data exist as conversations and must pass through voice recognition/transcription technology.
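For instance, a minimal sketch of the OCR step might look like the following. It assumes the Tesseract engine, its pytesseract Python wrapper, and the Pillow imaging library are installed; the file name is purely hypothetical.

```python
# Minimal sketch: pulling text off a scanned page with OCR.
# Assumes the Tesseract engine, the pytesseract wrapper, and Pillow
# are installed; "invoice_scan.png" is a hypothetical file name.
from PIL import Image
import pytesseract

page = Image.open("invoice_scan.png")     # a paper document, scanned
text = pytesseract.image_to_string(page)  # the OCR pass
print(text[:200])                         # raw text, now ready for analysis
```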

Usually, the physical gathering of data is the easiest part of doing analysis across the corporation. Much more challenging is the logical resolution problem. The logical resolution aspect of corporate data management addresses the issue of bringing together many disparate sources of data and reading and processing the data seamlessly. There are MANY problems with the logical resolution of corporate data. Some of the many problems are as follows:

  • Resolving key structures—a key in one part of the corporation is different from a similar key in another part of the corporation.
  • Resolving definitions—data defined one way in the corporation are defined another way in a different part of the corporation.
  • Resolving calculations—a calculation made one way in the corporation is made using a different formula in another part of the corporation.
  • Resolving data structures—data structured one way in the corporation are structured differently in another part of the corporation.

And the list goes on.
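To make the resolution problems concrete, the following is a minimal Python sketch of resolving keys and calculations across two sources. Every key, table, and figure in it is invented for illustration.

```python
# A minimal sketch of logical resolution, with invented sample data.
# Two systems identify the same customer with different keys and
# calculate "revenue" with different formulas; a cross-reference table
# and one agreed-upon formula resolve both.

# Hypothetical cross-reference: sales-system key -> corporate key
KEY_MAP = {"S-1001": "CUST-17", "S-1002": "CUST-18"}

# The sales system carries gross figures; finance carries its own "net."
sales = {"S-1001": {"gross": 500.0, "returns": 40.0}}
finance = {"CUST-17": {"net_revenue": 455.0}}

def corporate_key(local_key: str) -> str:
    """Resolve a local key to the corporate key."""
    return KEY_MAP.get(local_key, local_key)

def net_revenue(gross: float, returns: float) -> float:
    """The single agreed-upon formula, applied to every source."""
    return gross - returns

for local_key, row in sales.items():
    key = corporate_key(local_key)
    resolved = net_revenue(row["gross"], row["returns"])
    # The two systems disagree (460.0 vs 455.0); resolution means
    # settling on one key and one formula before analysis begins.
    print(key, resolved, "vs finance:", finance[key]["net_revenue"])
```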

In many cases, the problems of resolution are so deep and so ingrained in the data that resolution cannot be done satisfactorily. In this case, the corporation ends up with different analyses being done by different organizations within the corporation. The problem with each organization doing its own separate analysis and calculation is that the results are parochial to the individual organizations. No one at the corporate level is able to see what is going on at the highest level of the corporation.

The problem of data resolution is magnified when corporate data cross the boundary between structured data and big data. And even within big data, there is a challenge when data cross the boundary between repetitive unstructured data and nonrepetitive unstructured data.

There are then serious challenges when the corporation attempts to create a cohesive, holistic view of data across the entire corporation. If there is to be a true corporate foundation of data, it is necessary to integrate data, as seen in Fig. 1.5.3.

Fig. 1.5.3 Integration of data.

Once data are integrated (or at least once as much data as can be integrated are in fact integrated), the data are then reformatted into a normalized form. There is nothing particularly magical about a normalized structuring of data other than

  • normalization is a logical way to organize data, and
  • tools that do much of the analytic processing operate best on normalized data.

Fig. 1.5.4 shows that once data are normalized, they are easy to analyze.

Fig. 1.5.4 Normalized data.

The result of normalization is that data can be placed into flat file records. Once data are placed into normalized, flat file records, the data can easily be calculated, compared, and manipulated in all the other ways that normalization supports.

Normalization is an optimal state for data to be analyzed because in a normalized state, the data are at a very low level of granularity. Because the data are at a very low level of granularity, they can be categorized and calculated in many different ways. From an analogical standpoint, data in a normalized state are similar to grains of sand. Raw grains of sand can be recombined and remanufactured into many different forms: glass, computer chips, body implants, and so forth. By the same token, normalized data can be reworked into many different forms of analysis.
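A minimal Python sketch of this idea follows, with invented records. Because each record is a single low-granularity grain, the same data can be regrouped along any attribute.

```python
# A minimal sketch of granular, normalized records: one flat record per
# sale, regrouped many different ways. All data are invented.
from collections import defaultdict

sales = [  # one row per transaction: very low granularity
    {"customer": "CUST-17", "product": "widget", "region": "east", "amount": 120.0},
    {"customer": "CUST-18", "product": "widget", "region": "west", "amount": 80.0},
    {"customer": "CUST-17", "product": "gadget", "region": "east", "amount": 60.0},
]

def total_by(attribute: str) -> dict:
    """Re-aggregate the same grains along any chosen attribute."""
    totals = defaultdict(float)
    for row in sales:
        totals[row[attribute]] += row["amount"]
    return dict(totals)

print(total_by("region"))   # {'east': 180.0, 'west': 80.0}
print(total_by("product"))  # {'widget': 200.0, 'gadget': 60.0}
```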

(As a side note, normalizing data does not necessarily mean that data will be placed into a relational structure. Most of the time, the normalized data are placed into a relational structure. But it is entirely possible to place normalized data in a structure other than a relational structure if that makes sense.)

Whatever structuring of data is used, the result is that normalized data are placed into records of data that may or may not have a relational foundation, as seen in Fig. 1.5.5.

Fig. 1.5.5 Normalized records of data.

Once the data are structured into a granular state, the data can then be analyzed in many different ways. In truth, once corporate data are integrated and placed into a granular state, the analysis of corporate data is not very different from the analysis of any other kind of data.

Typically, the first step of analysis is categorization of data. Fig. 1.5.6 suggests the categorization of data.

Fig. 1.5.6 Categorization of data.
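As a simple illustration, the sketch below assigns invented customer records to categories by annual spend. The category names and thresholds are assumptions made purely for the example.

```python
# A minimal sketch of categorization: each record is assigned to a
# category by a simple rule. Thresholds and data are invented.
def spend_category(amount: float) -> str:
    if amount >= 1000:
        return "high"
    if amount >= 100:
        return "medium"
    return "low"

customers = [("CUST-17", 1500.0), ("CUST-18", 250.0), ("CUST-19", 40.0)]
for cust, spend in customers:
    print(cust, spend_category(spend))  # CUST-17 high, CUST-18 medium, ...
```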

Once data are categorized, many sorts of analysis can ensue. One of the typical forms of analysis is the identification of exceptional data. For example, the analyst may wish to find all customers who have spent more than $1,000 in the past year. Or the analyst may want to find the days when production peaked above 25 units a day. Or the analyst may want to find which red-painted products weighed more than 50 pounds. Fig. 1.5.7 depicts an exception analysis.

Fig. 1.5.7 Exceptions analysis.
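The sketch below runs the three exception searches just mentioned against invented records; each search simply filters for records outside a threshold.

```python
# A minimal sketch of exception analysis, using the three examples from
# the text. All records are invented.
customers = [{"id": "CUST-17", "spend_past_year": 1500.0},
             {"id": "CUST-18", "spend_past_year": 800.0}]
production = [{"day": "2023-03-01", "units": 27},
              {"day": "2023-03-02", "units": 22}]
products = [{"sku": "P-1", "color": "red", "weight_lbs": 65},
            {"sku": "P-2", "color": "red", "weight_lbs": 30}]

big_spenders = [c for c in customers if c["spend_past_year"] > 1000]
peak_days = [p for p in production if p["units"] > 25]
heavy_red = [p for p in products
             if p["color"] == "red" and p["weight_lbs"] > 50]

print(big_spenders)  # CUST-17 only
print(peak_days)     # 2023-03-01 only
print(heavy_red)     # P-1 only
```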

Another simple form of analysis is categorizing data and counting the records in each category. Fig. 1.5.8 shows a simple categorization and count.

Fig. 1.5.8 Simple counts of records.

And of course, once counts by category can be done, comparisons across categories can be done as well, as seen in Fig. 1.5.9.

Fig. 1.5.9 Comparisons of different records.
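A minimal sketch of counting by category and then comparing the categories follows; the records are invented.

```python
# A minimal sketch of categorize-and-count, followed by a comparison
# across the categories. Records are invented.
from collections import Counter

regions = ["east", "east", "west", "east", "central", "west"]
counts = Counter(regions)  # a simple count per category
print(counts)              # Counter({'east': 3, 'west': 2, 'central': 1})

# Comparison across categories falls straight out of the counts.
top, runner_up = counts.most_common(2)
print(f"{top[0]} leads {runner_up[0]} by {top[1] - runner_up[1]}")
```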

Another typical form of analysis is that of comparing information over time, as seen in Fig. 1.5.10.

Fig. 1.5.10 Comparing information over time.
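As a small illustration, the following sketch compares an invented monthly total month over month.

```python
# A minimal sketch of comparing a measure over time: month-over-month
# change computed on invented monthly totals.
monthly_sales = {"2023-01": 40_000, "2023-02": 44_000, "2023-03": 41_800}

months = sorted(monthly_sales)
for prev, cur in zip(months, months[1:]):
    change = (monthly_sales[cur] - monthly_sales[prev]) / monthly_sales[prev]
    print(f"{prev} -> {cur}: {change:+.1%}")  # +10.0%, then -5.0%
```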

And finally, there are key performance indicators (KPIs). Fig. 1.5.11 shows the calculation and tracking of a KPI over time.

Fig. 1.5.11 Key performance indicators (KPIs).
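As an illustration, the sketch below computes and tracks one hypothetical KPI, revenue per order, across several months; all figures are invented.

```python
# A minimal sketch of a KPI tracked over time. The KPI here, revenue
# per order, and all of its inputs are invented for illustration.
monthly = [
    {"month": "2023-01", "revenue": 40_000, "orders": 800},
    {"month": "2023-02", "revenue": 44_000, "orders": 830},
    {"month": "2023-03", "revenue": 41_800, "orders": 760},
]

for m in monthly:
    kpi = m["revenue"] / m["orders"]  # the KPI for the month
    print(f'{m["month"]}: revenue per order = {kpi:.2f}')
```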