Chapter 13. Scaling Up

Up until now, we have reviewed a steady stream of pertinent topics concerning statistics and, specifically, predictive analytics. In this chapter, we provide a tutorial dedicated to applying those concepts and practices to very large datasets. First, we'll define the phrase very large, at least as it is used to describe the data we want to train our predictive models on or run our statistical algorithms against. Next, we will review the challenges imposed by using bigger data sources, and finally, we will offer some ideas for meeting those challenges.

Our chapter is broken down into the following sections:

  • Getting started
  • The phases of an analytics project
  • Experience and data of scale
  • The characteristics of big data
  • Training models at scale
  • The specific challenges (of big data)
  • A path forward

Starting the project

The phases of a general-purpose predictive analytics project may seem straightforward and perhaps even easy to list; it's the practice of carrying out each of these phases effectively that is challenging.

The phases of a predictive analytics project

These phases are:

  1. Define (the data).
  2. Profile & Prepare (the data).
  3. Determine the Question (what to predict).
  4. Choose the algorithm.
  5. Apply the model.

Data definition

An interesting thought:

"…Once you have enough data, you start to see patterns," he said. "You can build a model of how these data work. Once you build a model, you can predict…"

– Bertolucci, 2013

At the beginning of any (and every) analytics project, the data is defined, that is, reviewed and analyzed: its source, format, state, interval, and so on (some refer to this as the process of investigating the breadth and depth of the available data).
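
As a minimal sketch of what this definition step might look like in practice, the following snippet uses Python's pandas library (one of many possible tools); the visits_2020.csv file and the visit_date column are hypothetical names used purely for illustration:

    import pandas as pd

    # Hypothetical source file and column names, used only for illustration.
    df = pd.read_csv("visits_2020.csv", parse_dates=["visit_date"])

    print(df.shape)    # breadth: number of rows and columns
    print(df.dtypes)   # format/state: the type of each field
    print(df.head())   # a first look at the actual values

    # Interval: how far back does the data reach, and how recent is it?
    print(df["visit_date"].min(), df["visit_date"].max())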

One required exercise is to perform what is referred to as profiling the data source, or establishing your data's profile, by determining its characteristics, relationships, patterns, and context. This process will, hopefully, produce a clearer view of the content and quality of the data to be used in the project, that is, the data profile.

Then, after profiling is complete, one would most likely perform some form of data scrubbing (also referred to as cleansing or, in some cases, preparing) in an effort to improve the data's quality. During the process of cleansing or scrubbing your data, you would typically perform tasks such as aggregation, appending, merging, reformatting fields, changing variable types, filling in missing values, and so on.
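
To make these tasks concrete, here is a minimal, hypothetical sketch of a scrubbing pass, again using Python's pandas library; the file names, key, and columns (visits_2019.csv, visits_2020.csv, patients.csv, patient_id, gender, age) are assumptions made only for illustration:

    import pandas as pd

    # Appending: stack two (hypothetical) yearly extracts into one table.
    visits = pd.concat(
        [pd.read_csv("visits_2019.csv", parse_dates=["visit_date"]),
         pd.read_csv("visits_2020.csv", parse_dates=["visit_date"])],
        ignore_index=True)

    # Merging: join in patient attributes on a shared key.
    patients = pd.read_csv("patients.csv")
    df = visits.merge(patients, on="patient_id", how="left")

    # Reformatting fields and changing variable types.
    df["gender"] = df["gender"].str.strip().str.upper()
    df["age"] = pd.to_numeric(df["age"], errors="coerce")

    # Filling in missing values (here, with the median age).
    df["age"] = df["age"].fillna(df["age"].median())

    # Aggregation: one row per patient with a visit count.
    per_patient = df.groupby("patient_id").agg(visit_count=("visit_date", "count"))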

Note

Data profiling techniques can include specific analysis types, such as univariate analysis, which involves frequency analysis for categorical variables and distribution and summary statistics for continuous variables. This aids in missing value treatment, understanding distributions, and outlier treatment.
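
For example, a univariate profiling pass might look like the following sketch (pandas again; the visits.csv file and the diagnosis_code and length_of_stay columns are hypothetical):

    import pandas as pd

    df = pd.read_csv("visits.csv")  # hypothetical file

    # Frequency analysis for a categorical variable.
    print(df["diagnosis_code"].value_counts(dropna=False))

    # Distribution and summary statistics for a continuous variable.
    print(df["length_of_stay"].describe())

    # Missing values per column (informs missing value treatment).
    print(df.isna().sum())

    # A simple outlier check using the 1.5 * IQR rule.
    q1, q3 = df["length_of_stay"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["length_of_stay"] < q1 - 1.5 * iqr) |
                  (df["length_of_stay"] > q3 + 1.5 * iqr)]
    print(len(outliers), "potential outliers")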

Experience

When soliciting advice from a subject matter expert (SME), most would agree that an individual with more experience will likely provide better advice. With predictive analytics projects, the question is not simply what the data can tell us, but what the data can tell us about a specific objective or problem; therefore, the size or amount of the data source (the amount of experience) available for the project becomes much more important. Typically, the more data, the better.

So, at what point is it acceptable to say you have enough data for your predictive project? The politically correct answer is that it depends. Some types of data science and predictive analysis projects impose more specific data requirements than others, effectively setting what the minimum data volume might be.

In an extreme case, a prediction may require data spanning many years or even decades, as larger amounts of data can reveal a broader range of patterns surrounding behaviors, decisions, and so on. Why? Because, typically, analyzing (or training a model with) more data yields a more comprehensive understanding and a better prediction.

With this in mind, perhaps a general rule of thumb is to collect as much data as possible (depending upon the objective or type of application). Some experts might suggest collecting at least three years', and preferably five years', worth of data before beginning any predictive analysis project. Of course, years may not be the appropriate measure, depending upon the type of application; for example, cases or lines of text might be more appropriate, and so on.

Note

In practice, if an application was built around hospital visits, the more patient cases (typically millions) the better; a word predictor application would want to have as many text sentences or word phrases (tens of millions) as it could (to be effective).

Another predictive analytics data debate concerns the idea of sufficient versus enough data.

Given a shortage of volume or quantity, the wise data scientist will focus on the quality or suitability of the data. This means that even though the volume of data is less than hoped for, the quality of the data, judged against the objective of the project, may be deemed sufficient.

Given an understanding of all the preceding points, it is important to gauge your data to determine whether its volume has reached the tipping point, that is, the point where typical analytical activities begin to become onerous to perform.

In the next section, we will cover how to establish that data volume tipping point, as it is always better to understand and expect challenges before you begin heavy model training rather than finding out the hard way, after you've already begun.

Data of scale – big data

When we use the phrase data of scale, we are not referring to the statistical measurement scales of interval, ordinal, nominal, and dichotomous. We are using the phrase loosely to convey the size, volume, or complexity of the data source to be used in your analytics project.

The by now well-known buzzword big data might (loosely) fit here, so let us pause to define how we are using the term.

A large assemblage of data, datasets that are so large or complex that traditional data processing applications are inadequate, and data about every aspect of our lives have all been used to define or refer to big data.

The following diagram illustrates big data's three Vs:

[Figure: The three Vs of big data: volume, variety, and velocity]

In 2001, analyst Doug Laney (then at META Group, which later became part of Gartner) introduced the 3Vs concept to describe big data. The 3Vs, according to Laney, are volume, variety, and velocity. These Vs make up the dimensionality of big data: volume (the measurable amount of data), variety (the number of types of data), and velocity (the speed of processing or dealing with that data).

Using the volume, variety, and velocity concept, it is easier to foresee how a big data source can be, or quickly become, increasingly challenging to work with; as these dimensions increase or expand, they further encumber the ability to effectively train predictive models on the data.

Using Excel to gauge your data

Microsoft Excel is not a tool to be used to determine whether your data qualifies as big data.

Even if your data is too big for Microsoft Excel, it still doesn't necessarily qualify as big data. In fact, gigabytes of data are still manageable with various techniques and enterprise (and even open source) tools, especially given the lower cost of storage today.

It is important to realistically size or scale the data technology you will be using in your predictive project (keeping in mind expected data growth rates) before selecting an approach or even beginning any profiling or preparation work. This time is well spent, as it will save time later that might otherwise be lost to performance bottlenecks or to rewriting scripts to use a different approach (one that can handle bigger data sources).

So, the question becomes, how do you gauge your data – is it really big data? Is it manageable? Or does it fall into that category that will require special handling or pre-processing before it can be effectively used for your predictive analytics objective?
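
One rough, first-pass way to gauge a data source, before committing to any particular technology, is to compare its size on disk and its estimated in-memory footprint against the resources you have available. The following sketch (Python with pandas; the visits.csv file is a hypothetical example) estimates both:

    import os
    import pandas as pd

    path = "visits.csv"  # hypothetical file

    # Size on disk, in gigabytes.
    size_gb = os.path.getsize(path) / 1024 ** 3
    print(f"File size on disk: {size_gb:.2f} GB")

    # Estimate the in-memory footprint from a sample of rows,
    # then scale it by the full row count (counted in chunks).
    sample = pd.read_csv(path, nrows=100_000)
    bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
    total_rows = sum(len(chunk)
                     for chunk in pd.read_csv(path, usecols=[0], chunksize=1_000_000))
    est_gb = bytes_per_row * total_rows / 1024 ** 3
    print(f"Estimated in-memory size: {est_gb:.2f} GB for {total_rows:,} rows")

If the estimate comfortably fits in memory, ordinary tooling will usually do; if it does not, that is an early signal that sampling, chunked processing, or a platform built for bigger data deserves consideration.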
