A path forward

So, the idea of having more than enough data for training a model seems very appealing.

Big data sources would appear to answer this desire; however, in practice, a big data source is rarely (if ever) analyzed in its entirety. You can pretty much count on performing a sweeping filtering process aimed at reducing the big data into small(er) data (more on this in the next section).

In the following section, we will review various approaches to addressing the challenges of using big data as a source for your predictive analytics project.

Opportunities

In this section, we offer a few recommendations for handling big data sources in predictive analytics projects using R. We'll also offer some practical use case examples.

Bigger data, bigger hardware

We are starting with the most obvious option first.

To be clear, R keeps all of its objects in memory, which is a limitation if the data source gets too large. One of the easiest ways to deal with big data in R is simply to increase the machine's memory.

At the time of writing, R can use up to 8 TB of RAM if it runs on a 64-bit machine (compared to only about 2 GB of addressable RAM on 32-bit machines). Most machines used for predictive analytics projects are (or at least should be) 64-bit already, so you just need to add RAM.

Note

There are both 32-bit and 64-bit versions of R. Do yourself a favor and use the 64-bit version!
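
If you are not sure which build you are running, a quick check from within R itself will tell you:

    # returns 8 on a 64-bit build of R and 4 on a 32-bit build
    .Machine$sizeof.pointer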

If you know your data source well and have added an appropriate amount of memory to your machine, then you'll most likely be able to work with a big data source efficiently, especially if you use one of the approaches outlined in the following sections of this chapter.

Breaking up

One of the most straightforward and proven approaches to taming a big data source with R (or any language for that matter) is to create workable subsets of data prepared from the big data resource.

For example, suppose we have patient health records making up a current big data source. There are trillions of patient case records in the data, with more added almost every minute. These cases record both the basics (sex, age, height, weight, and so on) and specifics about the patient's background (such as whether the patient is a smoker or a drinker, is currently on medications, has ever been operated on, and so on). Luckily, our file does not contain any information that can be used to identify the patient (such as a name or social security number), so we won't be in violation of any privacy laws.

The data source is fed by hospitals and doctors' offices all over the country. Our predictive project looks to determine relationships between a patient's health and the state that they live in. Rather than attempting to train on all of the data (a mostly impractical effort), we can use some logic to prepare a series of smaller, more workable subsets. For example, we could simply separate our overall data source into 50 smaller files, one for each state. This would help, but the smaller files may still be massive, so with a little profiling of the data, we may be able to identify other measurements that we can use to divide our data.

The process of data discovery and separation might look pretty close to the following steps:

  1. Since we are dealing with a big data source and are not sure of the number of cases or records within the file, we can start by creating an R data object from our comma-separated file and restricting the number of records to be read:
    x <- read.table(file = "HCSurvey20170202.txt", sep = ",", nrows = 150)
  2. x now contains the first 150 records, which we can review to look for interesting measures that we might use to logically split our data. You can also use the summary function to evaluate the variables within the data source. For example, we see that column 9 is the patient's home state, column 5 is the patient's current body weight, and column 79 indicates the patient's weight 1 year ago (a sketch of this profiling step appears after these steps).
  3. Now, we can perhaps create a series of smaller subsets, with one file per state, each containing only the cases of patients who have gained more than five pounds in the past year (see the sketch following these steps).
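
Since the exact column layout is specific to our survey file, the following is only a minimal sketch of steps 2 and 3. It assumes the file name used in step 1, that columns 5, 79, and 9 hold the current weight, the weight one year ago, and the home state (as noted in step 2), and that the full file fits in memory; the output file names are purely illustrative:

    # step 2: profile a small slice of the file to find candidate split variables
    x <- read.table(file = "HCSurvey20170202.txt", sep = ",", nrows = 150)
    summary(x)

    # step 3: read the full file, keep only the patients who gained more than
    # five pounds in the past year, and write one smaller file per state
    x <- read.table(file = "HCSurvey20170202.txt", sep = ",")
    gained <- x[x[, 5] - x[, 79] > 5, ]
    for (st in unique(gained[, 9])) {
      write.csv(gained[gained[, 9] == st, ],
                file = paste0("HCSurvey_", st, ".csv"),
                row.names = FALSE)
    }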

We do end up with 50 files, but each file should be much smaller and easier to work with than a single, large big data source. This is also a simple example; in practice, you may (and probably will) end up rerunning the split code and stitching together multiple state files.

The preceding is one example of how big data research typically works: by constructing smaller datasets that can be efficiently analyzed!

Sampling

Another method for dealing with the volume of a big data source is with population sampling.

Sampling is the selection of a subset of cases from within a statistical population, intended to estimate or represent characteristics of the whole population. The net effect is that the size of the data to be trained on is reduced.

There is some concern that sampling may decrease the performance of a model (not in terms of processing time, but in the accuracy of the results generated). This may be somewhat true, as typically the more data the model is trained on, the better the result; but, depending upon the objective, the decrease in performance can be negligible.

Overall, it is safe to say that if sampling can be avoided, it is advisable to use another big data strategy. But if you find that sampling is necessary, it can still lead to satisfactory models.

When you use sampling as a big data predictive strategy, you should try to keep the sample as large as you can, consider carefully the size of the sample in proportion to the full population, and ensure as best you can that the sample is not biased.

One of the easiest methods for creating a sample is the R function sample, which takes a sample of the specified size from the elements of x, either with or without replacement.

The following lines of R code are a simple example of creating a random sample of 500 cases from our original data. Notice the row counts (indicated by using the R function nrow):

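What follows is only a minimal sketch of that idea; it assumes the data has already been read into the data object x, as in the earlier steps:

    nrow(x)                                # row count of the original data
    set.seed(123)                          # optional: makes the sample reproducible
    x_sample <- x[sample(nrow(x), 500), ]  # 500 rows drawn without replacement
    nrow(x_sample)                         # row count of the sample: 500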

Aggregation

Another method for reducing the size of a big data source (again, depending on your project's objectives) is statistical aggregation of the data. In other words, you simply may not require the level of granularity that is available in the data.

In statistical data aggregation, the data can be combined from several measurements. This means that groups of observations are replaced with summary statistics based on those observations. Aggregation is used a lot in descriptive analytics, but can also be used to prepare data for a predictive project.

For larger, and especially disparately located, big data sources, one might use a Hadoop and Hive (or similar technology) solution to aggregate the data. If the data is in a transactional database, you may even be able to use native SQL. In a pure R solution, you have more work to do.

R provides a convenient function named aggregate that can be used for big data aggregation, once you have determined how you want to (or need to) use the data in your project.

For example, the following code shows the function being applied to the original data (stored in the data object named x), aggregated by the variable sex (the patient's sex):

aggregate(x, by = x["sex"], FUN = mean, na.rm = TRUE)

Going back to our earlier example of splitting the data into 50 state files, we could potentially instead use R code such as the following to aggregate and generate summary statistics by state. Notice that the original case count was 5,994 and, after aggregating the data, we have a case count of 50 (one summary record for each state):

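Again, this is only a minimal sketch. It assumes the data object x from earlier and that the patient's home state is stored in a column named state (the column name is an assumption, so adjust it to match your file):

    nrow(x)                        # original case count (5,994 in our example)

    # aggregate only the numeric measures, producing one mean value per state
    num_cols <- sapply(x, is.numeric)
    byState <- aggregate(x[, num_cols], by = x["state"], FUN = mean, na.rm = TRUE)

    nrow(byState)                  # one summary record per state: 50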

Dimensional reduction

In Chapter 8, Dimensionality Reduction, we introduced the process of dimensional reduction, which (as we pointed out then) allows the data scientist to minimize the data's dimensionality. It can also reduce the overall volume of a big data source, thereby reducing the time and memory required to process the data, allowing it to be more easily visualized, eliminating features irrelevant to the model's purpose, reducing model noise, and so on.

Like breaking the data up into smaller, more manageable files, using dimensional reduction will help, but it takes a good understanding of the data, as well as perhaps plenty of processing steps, to eventually produce a workable data population.
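
As a quick reminder of what that can look like in R, here is a minimal sketch using the built-in prcomp function on the numeric columns of our data object x. The choice of prcomp, and of keeping the first five components, is purely illustrative and not a prescription from Chapter 8:

    # principal component analysis on the numeric measures, centered and scaled
    num_cols <- sapply(x, is.numeric)
    pca <- prcomp(na.omit(x[, num_cols]), center = TRUE, scale. = TRUE)

    # keep, say, the first five components (assuming at least five numeric
    # columns) as a much narrower training set
    x_reduced <- as.data.frame(pca$x[, 1:5])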
