Chapter 11
Search and Analysis

There is much confusion about what analysis and analytics mean, and vendors are a major source of it. Vendors tend to sell their product as if it were the only solution. They don’t like architectures because they see an architecture as an obstacle to making a sale. In reality, vendors don’t like anything except a sale, which leads them into the bad habit of confusing customers and the marketplace.

For a discussion of analysis and analytics that is free of vendor influence, consider the following. A corporation has a simple desire to find out how many xxxxxxx are used by yyyyyyyy in a zzzzzzzz time frame. Fig 11.1 depicts this typical analytical question.

Fig 11.1 Answering the typical analytical question

When you stop to analyze the question, you see that it contains two elements:

  • Find the data that can be used to answer the question
  • Analyze the data once found.

Fig 11.2 shows the two elements of analysis/analytics.

Fig 11.2 Understanding the two elements of an analytical question

If the criteria for finding the data are straightforward and the data is indexed, then finding the data is almost trivial. But complications can arise. Suppose the search is for something that is hidden or disguised, such as encrypted data. Or suppose the data carries only very faint markers, as with a bank account that was opened fictitiously and has been operated for clandestine purposes. There are many ways data can hide, and in these cases finding the data may not be a trivial task at all.

Another way data can hide is by lurking behind a lot of mundane data points. Suppose you wanted to find a particular man in the US and the only thing you knew was that he was a man. You would have to examine every male in the US to see whether he was the man you were interested in. Such a search would be neither easy nor efficient.
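The contrast between an indexed search and a search with weak criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, and the helper functions `find_by_id` and `scan` are hypothetical and do not come from any particular product.

```python
# Hypothetical records; names and fields are made up for illustration.
records = [
    {"id": 1, "name": "Ann", "sex": "F", "city": "Denver"},
    {"id": 2, "name": "Bob", "sex": "M", "city": "Austin"},
    {"id": 3, "name": "Carl", "sex": "M", "city": "Boston"},
    {"id": 4, "name": "Dee", "sex": "F", "city": "Austin"},
]

# Straightforward criterion plus an index: build the index once,
# then each lookup is a single dictionary access.
index_by_id = {r["id"]: r for r in records}


def find_by_id(record_id):
    """Return the record with this id, or None if there is none."""
    return index_by_id.get(record_id)


def scan(predicate):
    """Examine every record; return (matches, number of records examined)."""
    examined = 0
    matches = []
    for r in records:
        examined += 1
        if predicate(r):
            matches.append(r)
    return matches, examined


# Weak criterion ("he is a man"): every record must be examined,
# and the result is still a list of candidates, not the one person wanted.
men, examined = scan(lambda r: r["sex"] == "M")
```

With four records the difference is invisible, but the scan always touches every record, so its cost grows with the size of the data, while the indexed lookup does not.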

Once the data is found, it needs to be analyzed. Analyzing data can be complex too. If analysis means nothing more than displaying selected elements of data, then analysis is easy. But sometimes analysis entails complex algorithms and calculations. In any case, there are two very different facets of what is meant by data analysis. Fig 11.3 shows these two sides of the analysis.

Fig 11.3 Doing data analysis

There are technologies dedicated to each of these two aspects of analysis. Search has machine learning and concept search, which are dedicated to finding data when the search criteria are murky. Analysis has summarization and visualization.

Not only is analysis divided into two distinct facets, search and analysis, but there are also different kinds of search. One kind of search looks for a very finite set of data. A person may go looking for the last medical checkup record for Bill Inmon; only one such record exists at any moment in time.

Or a search may be for a large set of data. Looking for the medical records of a population is one such search, since there are many, many medical records for the population of a state or even a city. Fig 11.4 shows that there are different kinds of searches.

Fig 11.4 Understanding the two basic types of search
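The two kinds of search can be sketched as follows. The checkup records, the patient identifiers, and the function names `last_checkup` and `checkups_for_city` are all hypothetical, chosen only to mirror the medical record example above.

```python
from datetime import date

# Hypothetical checkup records; patient ids, dates, and cities are made up.
checkups = [
    {"patient": "P-001", "date": date(2015, 3, 2), "city": "Denver"},
    {"patient": "P-001", "date": date(2016, 5, 10), "city": "Denver"},
    {"patient": "P-002", "date": date(2016, 1, 15), "city": "Denver"},
    {"patient": "P-003", "date": date(2016, 2, 20), "city": "Austin"},
]


def last_checkup(patient):
    """Finite search: at most one record answers the question."""
    own = [c for c in checkups if c["patient"] == patient]
    return max(own, key=lambda c: c["date"]) if own else None


def checkups_for_city(city):
    """Set search: every checkup record for the population of a city."""
    return [c for c in checkups if c["city"] == city]


latest = last_checkup("P-001")        # a single record
denver = checkups_for_city("Denver")  # a list that grows with the population
```

The first function returns one record or nothing, so its result can be used directly. The second returns a set whose size depends on the population, which is why it usually feeds a later summarization step.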

The whole subject of search is complicated by the data being searched. It is very difficult to find anything in the untransformed data lake, because the data inside it is unintegrated. Without integration, the criteria for searching the lake are unclear, and that lack of clarity makes searching a difficult experience.

But once the data lake becomes integrated, once there are data ponds and the ponds are conditioned, searches become much easier and more straightforward. Fig 11.5 shows the difference between searching the data lake and searching the conditioned data inside the data ponds.

In fact, there are many reasons why finding the right data inside the data lake is so difficult:

  • There is so much data that data “hides” or is indistinguishable from other data
  • Once you have found something, you are not sure it is actually the data you want
  • The criteria for finding data are very unclear
  • Even after data has been found, it needs to be converted before it can be used
  • The qualifications for data are unclear.

Fig 11.5 Searching the data lake vs. searching the conditioned data inside the data ponds

And once data inside the pond has been conditioned, it is easy to access and analyze. Fig 11.6 shows why data inside the ponds is suitable for analysis.

Fig 11.6 Finding data in the data ponds is easy for several reasons

After data is found, it’s time to analyze it. Data analysis software and technology have been around for a long time, so there are many ways to analyze data once it is found. Fig 11.7 shows that analysis of data follows the search.

Fig 11.7 Analyzing the data that has been found

There are many forms of analysis. Some of those forms include:

  • The mere sorting of data. Sometimes sorting data allows important data to surface and become obvious when that data would not otherwise be so.
  • Summarizing data. On occasion, summaries of data bring to light data that would otherwise be lost or overlooked.
  • Comparing data. Comparing and contrasting one set of data with other sets often yields insight.
  • Exception analysis. Finding outliers and exceptions often leads to insight.
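The four forms of analysis listed above can be sketched with the Python standard library. The usage figures, the region names, and the two-standard-deviation threshold in `outliers` are assumptions made for illustration; a real analysis would choose its own data and its own rule for what counts as an exception.

```python
from statistics import mean, pstdev

# Hypothetical monthly usage figures for two regions (made-up numbers).
usage = {
    "east": [120, 130, 125, 128, 410, 122],
    "west": [90, 95, 92, 97, 94, 91],
}

# Sorting: the largest values surface at the front of the list.
sorted_east = sorted(usage["east"], reverse=True)

# Summarizing: reduce each set of values to a total and an average.
summary = {
    region: {"total": sum(values), "mean": mean(values)}
    for region, values in usage.items()
}

# Comparing: contrast one set of data with another.
difference = summary["east"]["mean"] - summary["west"]["mean"]


def outliers(values, threshold=2.0):
    """Exception analysis: values more than `threshold` population
    standard deviations away from the mean (an assumed rule of thumb)."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return []
    return [v for v in values if abs(v - m) > threshold * s]


east_exceptions = outliers(usage["east"])
west_exceptions = outliers(usage["west"])
```

In this made-up data, the single reading of 410 in the east region is the only value flagged as an exception, and it is also the value that sorting brings to the front of the list.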

Perhaps the most powerful form of analysis is visualization: studying data as a diagram or picture. Visualization is popular because, with a properly created visual, massive amounts of data can be depicted so that important conclusions are immediately obvious. See Fig 11.8.

Fig 11.8 Visualizing the data

Confusion Spread by the Vendors

So how do vendors confuse the marketplace when talking about analysis and analytics?

  • Vendors present their product as a final solution when it is only part of a solution
  • Vendors hate architecture because it lengthens their sales cycle
  • Vendors make assumptions about data that are simply unrealistic
  • Vendors confuse search with analysis
  • Vendors don’t recognize that they are part of a solution.

These are the most common, but there are many other ways that vendors confuse the marketplace. It is in the vendor’s best interest to sow seeds of confusion.

In Summary

There are two aspects to analysis: the search for data and the analysis of that data once the search is complete. The search is much easier and more accurate when it is done against transformed data, as found in the data lake/data pond architecture.