This is the most technical chapter of the book. Metric selection is discussed in some detail before moving on to new visualization techniques to represent complex spatial processes evolving over time. It then digs deeper into the technical aspects of a range of topics.
This chapter presents some interesting techniques developed recently to handle modern business data. It is more technical than previous chapters, yet it's easy to read for someone with limited statistical, mathematical, or computer science knowledge. The selection of topics for this chapter was based on the number of analytics practitioners who found related articles useful when they were first published at Data Science Central. This chapter is as broad as possible, covering many different techniques and concepts.
Advanced statistics, in particular cross-validation, robust statistics, and experimental design, are all part of data science. Likewise, computational complexity is part of data science when it applies to algorithms used to process modern large, unstructured, streaming data sets. Therefore, many computer science and statistical techniques are discussed at a high level in this chapter.
NOTE Material that is traditionally found in statistical textbooks (such as the general linear model), as well as run-of-the-mill, old computer science concepts (such as sorting algorithms), is not presented in this book. You can check out Chapter 8, “Data Science Resources,” for an introduction to those subjects.
Identifying powerful metrics—that is, metrics with good predictive power—is a fundamental role of the data scientist. The following sections present a number of metrics used in optimizing digital marketing campaigns and in fraud detection. Predictive power is defined in Chapter 6 in the section Predictive Power of a Feature, Cross-Validations. Metrics used to measure business outcomes in a business model are called KPIs (key performance indicators). A metric used to predict a trend is called a leading indicator.
Some metrics that are not KPIs still need to be identified and tracked. For instance, if you run e-mail campaigns to boost membership, two KPIs are the number of new members per month (based on number of e-mails sent), and churn. But you still need to track open rates, even open rates broken down by ISP, segments, and message sent, as these metrics, while clearly not KPIs, are critical to understand variations in new members and improve new members stats/reduce churn. For instance, analyzing open rates will show you that sending an e-mail on a Saturday night is much less efficient than on a Wednesday morning; or that no one with a Gmail e-mail address ever opens your message—a problem that is relatively easy to fix, but must be detected first. You can't fix it if you can't detect it; you can't detect it if you don't track open rates by ISP.
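Computing open rates broken down by ISP, as recommended above, amounts to a simple grouped aggregation. Here is a minimal Python sketch; the ISP names and the (isp, opened) log format are hypothetical, since real campaign logs will differ:

```python
from collections import defaultdict

def open_rates_by_isp(events):
    """Open rate per ISP from a campaign log.

    `events` is an iterable of (isp, opened) pairs; the field names
    and log format are hypothetical -- real campaign data will differ.
    """
    sent = defaultdict(int)
    opened = defaultdict(int)
    for isp, was_opened in events:
        sent[isp] += 1
        if was_opened:
            opened[isp] += 1
    return {isp: opened[isp] / sent[isp] for isp in sent}

events = [("gmail.com", False), ("gmail.com", False),
          ("yahoo.com", True), ("yahoo.com", False)]
rates = open_rates_by_isp(events)
# A 0 percent Gmail open rate, as in this toy log, is exactly the kind
# of problem you can only detect by tracking open rates per ISP.
```

The same grouping can of course be extended to segments or message variants by using a composite key.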
Big data and fast, clustered computers have made it possible to track and compute more sophisticated metrics, test thousands of metrics at once (to identify the best predictors), and work with complex compound metrics: metrics that are a function of other metrics, such as type of IP address. These compound metrics are sometimes simple transformations (like a logarithmic transformation), sometimes complex combinations and ratios of multiple base metrics, and sometimes derived from base metrics using complex clustering algorithms, such as the previous example: type of IP address (which will be discussed later in this chapter in the section Internet Topology Mapping).
Compound metrics are to base metrics what molecules are to atoms. Just like as few as seven atoms (oxygen, hydrogen, helium, carbon, sodium, chlorine, and sulfur) can produce trillions of trillions of trillions of molecules and chemical compounds (a challenge for analytical and computational chemists designing molecules to cure cancer), the same combinatorial explosion takes place as you move from base to compound metrics. Base metrics are raw fields found in data sets.
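The distinction between base and compound metrics can be made concrete with a short Python sketch. The field names (page_views, revenue, visits) are made up for illustration; any raw fields from your data set would play the same role:

```python
import math

# Base metrics are raw fields; compound metrics are functions of them.
# The field names below are made up for illustration.
def compound_metrics(record):
    out = {}
    # A simple transformation of one base metric:
    out["log_page_views"] = math.log(1 + record["page_views"])
    # A ratio of two base metrics:
    out["revenue_per_visit"] = record["revenue"] / record["visits"]
    return out

m = compound_metrics({"page_views": 99, "revenue": 50.0, "visits": 25})
```

Even with a handful of base metrics, the number of possible transformations, ratios, and combinations explodes combinatorially, which is exactly the molecule-from-atoms analogy above.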
The following list of metrics (mostly compound metrics) is by no means comprehensive. You can use it as a starting point to think about how to create new metrics.
If you run an online newsletter, here are a number of metrics you need to track:
You should note that e-CPM is the revenue generated per thousand impressions. It is your most important metric, together with churn.
Many data scientists work on fraud detection projects. Since 1995, metrics dealing with fraud detection have been refined so that new powerful metrics are now used. The emphasis in this discussion of modern fraud detection metrics is on web log data. The metrics are presented here as rules. More than one rule must be triggered to fire an alarm if you use these metrics to build an alarm system. You may use a system such as hidden decision trees to assign a specific weight to each rule.
The key point here is that fraud detection metrics keep getting more complicated and diversified, as criminals constantly try to gain a competitive advantage by reverse-engineering fraud detection algorithms. There will always be jobs for data scientists working on fraud detection. Fraud detection also includes metrics to fight non-criminals—for instance, marketers who want to game search engines to get their websites listed first on Google, as well as plagiarists and spammers.
As a data scientist, you will often use existing tools rather than reinventing the wheel. So before diving deeper into the technical stuff, it's important that you identify and choose the right tools for your job. Following are some important questions you need to ask and criteria to keep in mind when looking for a vendor for analytics software, visualization tools, real-time products, and programming languages.
Here are a few good criteria and questions to consider that will help you select appropriate analytics software. Those I've found to be very important are indicated with an asterisk.
In some cases, three products might be needed: one for visualization, one with a large number of functions (neural networks, decision trees, constrained logistic regression, time series, and simplex), and one that will become your “workhorse” to produce the bulk of heavy-duty analytics.
Related questions you should ask include:
Ask yourself the following questions to help you select the proper visualization tools for each of your projects:
These questions will help you get a good handle on the use and selection of real-time products:
Python and R have become very popular: they're open source, so you can just download them from the web, get sample scripts, and start coding. Python offers a popular analytics library known as Pandas. R is also very popular. Java, Perl, and C++ have lost some popularity, though there are still people coding in C++ (in finance) because it's fast and lots of libraries are available for analytics as well as for visualization and database access. Perl is great and very flexible, especially if you are an independent, creative data scientist and like to code without having to worry about syntax, code versioning, teamwork, or how your code will be integrated by your client. Perl also comes with the Graph library.
Of course, you also need to know SQL. Hadoop has become the standard for MapReduce programming. SAS is widespread in some environments, especially government and clinical trials. I sometimes use R for advanced graphing (including videos) because it works well with palettes and other graphic features. R used to be limited by the memory available in your machine, although there are solutions that leverage the cloud (RHadoop).
This section is not a review of visualization tools or principles. Such a review is beyond the scope of this book and would constitute a book by itself. Readers can check the Books section in Chapter 8 for references. You can also get more information on how to choose a visualization tool at http://bit.ly/1cGlFA5.
This section teaches you how to produce advanced charts in R and how to produce videos about data, or data videos. It also includes code to produce any kind of color pattern you want, thanks to the rgb (red/green/blue) R function embedded in an R plot command. By creating videos rather than pictures, you automatically add one more dimension to your visuals—most frequently but not always, that dimension is time, as in time series. So this section is a tutorial on producing data videos with R. Finally, you can upload your data videos to YouTube. R is chosen for this discussion because it is very easy to use, and it is one of the most popular tools used by data scientists today.
Following is a step-by-step process for producing your video. The first three sections explain how to do it. Then the subsequent sections provide additional detail you may find helpful.
Videos are important because they allow you to present spatial information that changes over time (spatial-temporal data), or display one additional dimension to what is otherwise a static image. They're also used to see the convergence (or lack thereof) and speed of convergence of data science algorithms. Song-Chun Zhu is one of the pioneers in producing “data videos” (see http://www.stat.ucla.edu/∼sczhu/publication.html).
Another useful aspect of videos is the ability to show rotating 3-D charts so that the user can see the chart from all angles and find patterns that traditional 2-D projections (static images) don't show (unless you look at dozens of projections). The basic principles are:
Two important points are how the data should be visualized and who should make this decision—the data scientist, the business analyst, or the business user. The business user typically decides, but the data scientist should also propose new visualizations to the business user (depending on the type of data being analyzed and the KPIs used in the company), and let the business user eventually decide, based on how much value he can extract from a visualization and how much time it takes to extract that value.
Visualizations that communicate business value linked to KPIs, solve a costly issue, or show a big revenue opportunity, and that can be understood in 30 seconds or less, are the best. The worst graphs have one or more of the following features: (1) they're cluttered, (2) the data is rescaled, (3) unimportant data masks or hides valuable information, and (4) they display metrics that are not comparable.
Using Python, R, Perl, Excel, SAS, or any other tool, produce a text file called rfile.txt, with four columns:
Or, if you want to exactly replicate my experiment, you can download my rfile.txt (sample data to visualize) at http://www.datashaping.com/first.r.txt.
Note that the first variable in the R script is labeled iter. It is associated with an iterative algorithm that produces an updated data set of 500 (x,y) observations at each iteration. The fourth field, called new, indicates whether point (x,y) is new at a given iteration. New points appear in red; old ones appear in black.
vv<-read.table("c:/vincentg/rfile.txt",header=TRUE);
iter<-vv$iter;
for (n in 0:199) {
  x<-vv$x[iter == n];
  y<-vv$y[iter == n];
  z<-vv$new[iter == n];
  plot(x, y, xlim=c(0,1), ylim=c(0,1), pch=20, col=1+z, xlab="", ylab="", axes=FALSE);
  dev.copy(png, filename=paste("c:/vincentg/Zorm_", n, ".png", sep=""));
  dev.off();
}
There are four different ways to produce your video. When you run the R script, the following happens:
Then your three options to produce the video are as follows:
You can read about other possible solutions (such as the open source ffmpeg or the ImageMagick library) at http://stackoverflow.com/questions/1298100/creating-a-movie-from-a-series-of-plots-in-r. You could also read the article “An R Package for Creating Animations and Demonstrating Statistical Methods,” published in the April 2013 issue of the Journal of Statistical Software.
A nice thing about these videos is that you can upload them to YouTube. You can check out mine at these URLs:
You can use ActivePresenter (http://atomisystems.com) screen-cast software (free edition), as follows:
I recently created two high quality videos using ActivePresenter based on a data set with two additional columns using the following, different R script (you can download the data as a 7 MB text file at http://www.datashaping.com/rfile2.txt):
vv<-read.table("c:/vincentg/rfile2.txt",header=TRUE);
iter<-vv$iter;
for (n in 0:199) {
  x<-vv$x[iter == n];
  y<-vv$y[iter == n];
  z<-vv$new[iter == n];
  u<-vv$d2init[iter == n];
  v<-vv$d2last[iter == n];
  plot(x, y, xlim=c(0,1), ylim=c(0,1), pch=20+z, cex=3*u, col=rgb(z/2,0,u/2), xlab="", ylab="", axes=TRUE);
  Sys.sleep(0.05); # sleep 0.05 second between each iteration
}
You could also try changing the plot command to:
plot(x, y, xlim=c(0,1), ylim=c(0,1), pch=20, cex=5, col=rgb(z,0,0), xlab="", ylab="", axes=TRUE);
This second R script differs from the first as follows:
Note that d2init (fourth column in the rfile2.txt input data used to produce the video) is the distance between the location of (x,y) at the current iteration and its location at the first iteration. In addition, d2last (fifth column) is the distance between the current and previous iterations for each point. The point will be colored in a more intense blue if it made a big move between the current and previous iterations.
The function rgb accepts three parameters with values between 0 and 1, representing the intensities of the red, green, and blue channels, respectively. For instance, rgb(0,0,1) is blue, rgb(1,1,0) is yellow, rgb(0.5,0.5,0.5) is gray, rgb(1,0,1) is purple, and so on. Make sure each parameter stays between 0 and 1; otherwise, the function will crash.
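To see how the three intensities map to a color, here is a small Python imitation of the idea (this is a sketch for illustration, not R's actual rgb implementation; the hex-string output format is an assumption):

```python
def rgb(r, g, b):
    """A Python imitation of the rgb() idea: three intensities in [0, 1]
    mapped to a hex color string. Values outside the valid range raise
    an error, mirroring how R's rgb() fails on out-of-range input."""
    for v in (r, g, b):
        if not 0.0 <= v <= 1.0:
            raise ValueError("rgb intensities must be between 0 and 1")
    # Scale each channel from [0, 1] to [0, 255] and format as hex:
    return "#{:02X}{:02X}{:02X}".format(round(r * 255), round(g * 255), round(b * 255))

blue = rgb(0, 0, 1)    # "#0000FF"
yellow = rgb(1, 1, 0)  # "#FFFF00"
```

Varying the three parameters continuously, as the R scripts above do with z and u, produces smooth color gradients across frames.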
You can also create videos that involve points and arrows (not just points) continuously moving from frame to frame. Arrows are attached to points and indicate which cluster a point is moving toward or away from at any given time. Transitions from frame to frame are accelerated over time, and color gradients are also used. I recently produced a couple of these videos. You can find additional information on them at the following URLs:
Here is the R source code used in this version.
vv<-read.table("c:/vincentg/rfile3.txt",header=TRUE);
iter<-vv$iter;
for (n in 1:199) {
  x<-vv$x[iter == n];
  y<-vv$y[iter == n];
  z<-vv$new[iter == n];
  u<-vv$d2init[iter == n];
  v<-vv$d2last[iter == n];
  p<-vv$x[iter == n-1];
  q<-vv$y[iter == n-1];
  u[u>1]<-1;
  v[v>0.10]<-0.10;
  s=1/sqrt(1+n);
  if (n==1) {
    plot(p, q, xlim=c(-0.08,1.08), ylim=c(-0.08,1.09), pch=20, cex=0, col=rgb(1,1,0), xlab="", ylab="", axes=TRUE);
  }
  points(p, q, col=rgb(1-s,1-s,1-s), pch=20, cex=1);
  segments(p, q, x, y, col=rgb(0,0,1));
  points(x, y, col=rgb(z,0,0), pch=20, cex=1);
  Sys.sleep(5*s);
  segments(p, q, x, y, col=rgb(1,1,1));
}
segments(p, q, x, y, col=rgb(0,0,1)); # arrows segments
points(x, y, col=rgb(z,0,0), pch=20, cex=1);
Statistical “modeling without models” is a way to do inference, even predictions or computing confidence intervals, without using statistical distributions or models. Sometimes, understanding the physical model driving a system (for instance, the stock market) may be less important than having a trading system that works. Sometimes (as in the stock market), you can try to emulate the system (using Monte Carlo simulations to simulate stock market prices) and fit the data to the simulations without knowing exactly what you are emulating (from a physical or theoretical point of view), as long as it works and can be adapted when conditions or environment (for instance, market conditions) change.
You may be wondering how to get the data needed to produce the previously discussed videos. This section presents details on the algorithms used to produce the data for the videos. I will use my shooting stars video as the example, which means you need a bit of history on the project.
The project began as the result of several of my personal interests and research, including the following:
The statistical models behind the videos are birth and death processes—gravitational forces that create clusters that form and die—with points moving from one cluster to another throughout the process, and points coming from far away. Eventually, it became modeling without models, a popular trend in data science.
There is a generic mathematical model behind the algorithm used to create the video data, but the algorithm was created first without having a mathematical model in mind. This illustrates a new trend in data science: less and less concern about modeling, and more and more about results.
My algorithm has a bunch of parameters and features that can be fine-tuned to produce anything you want—be it a simulation of a Neyman-Scott cluster process or a simulation of some no-name stochastic process. It's a bit similar to how modern mountain climbing has evolved: from focusing on big names such as Everest in the past, to exploring deeper wilderness and climbing no-name peaks today (with their own challenges), and possibly to mountain climbing on Mars in the future.
You can fine-tune the parameters to:
The algorithm starts with a random distribution of m mobile points in the [0,1] × [0,1] square window. The points get attracted to each other (attraction is stronger to closest neighbors) and thus, over time, they group into clusters.
The algorithm has the following components:
Some special features of this algorithm include the following:
In the Perl source code shown here, the birth process for point $k is simply encoded as:
if (rand()<0.1/(1+$iteration)) { # birth and death
  $tmp_x[$k]=rand();
  $tmp_y[$k]=rand();
  $rebirth[$k]=1;
}
In this source code, in the inner loop over $k, the point ($x,$y) to be updated is referenced as point $k; that is, ($x, $y) = ($moving_x[$k], $moving_y[$k]). Also, in a loop over $l, one level deeper, ($p, $q), referenced as point $l, represents a neighboring point when computing the weighted average formula used to update ($x, $y). The distance d is computed using the function distance, which accepts four arguments ($x, $y, $p, and $q) and returns $weight, the weight w.
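The update step described above can be sketched in Python. This is a loose approximation of the mechanism, not the author's Perl code: the weight function, its constants, and the birth probability (borrowed from the Perl snippet above) are assumptions.

```python
import math
import random

def distance_weight(x, y, p, q):
    # Weight decays with distance, so attraction is strongest toward the
    # closest neighbors. The 0.01 smoothing constant is an assumption;
    # it also gives each point a heavy self-weight, slowing its motion.
    d = math.hypot(x - p, y - q)
    return 1.0 / (0.01 + d)

def update_points(pts, iteration, rng=random.random):
    """One iteration: each point moves to a distance-weighted average of
    all points, then may be reborn at a random location (birth/death)."""
    new_pts = []
    for (x, y) in pts:
        sw = sx = sy = 0.0
        for (p, q) in pts:
            w = distance_weight(x, y, p, q)
            sw += w
            sx += w * p
            sy += w * q
        nx, ny = sx / sw, sy / sw
        if rng() < 0.1 / (1 + iteration):  # birth and death, as in the Perl snippet
            nx, ny = rng(), rng()
        new_pts.append((nx, ny))
    return new_pts

pts = [(0.2, 0.2), (0.8, 0.8)]
moved = update_points(pts, iteration=0, rng=lambda: 0.99)  # rng stubbed: no rebirth
```

Iterating this update pulls points together into clusters, while the rebirth term injects new points from far away, as in the videos.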
In summary, realistic simulations used to study real-life mechanisms help you understand and replicate the mechanism in the laboratory, even if the mechanism itself is not understood. In this case, these simulations (visualized in videos) help you understand how multiple celestial bodies with various sizes, velocities, and origins work (or anything subject to forces similar to gravitational forces—possibly competitive forces in the market), including how they coalesce, are born, and die. In turn, by fitting data to simulations, you can make predictions, even if the mechanism is not well understood.
Both Perl and R have been used to produce the datasets for the video. You can download the source code for both at http://bit.ly/11Jdi4t.
The Perl code runs much faster than the R code, not because Perl itself is faster, but because of the architecture used in the program: Perl recomputes all of the frames at once and loads them into memory, while R produces each frame only as needed. The advantage of the R version is that it completes all aspects of the process, including producing the data and displaying the video frames.
This section marks the beginning of a deeper dive into more technical data science information. Let's start with a simple concept. Statistical textbooks focus on centrality (median, average, or mean) and volatility (variance). They don't mention the third fundamental class of metrics: bumpiness. This section begins by considering the relationship between centrality, volatility, and bumpiness. It then discusses the concept of bumpiness and how it can be used. Finally, you will see how a new concept is developed: a robust, modern definition is proposed first, and then a more meaningful definition is created that is based on, and compatible with, previous science.
Two different data sets can have the same centrality and volatility, but a different bumpiness. Bumpiness is linked to how the data points are ordered, whereas centrality and volatility completely ignore order. So bumpiness is useful for data sets where order matters—in particular, time series. Also, bumpiness integrates the notion of dependence among the data points, whereas centrality and variance do not. Note that a time series can have high volatility (high variance) and low bumpiness. The converse is also true.
Detecting changes in bumpiness is important. The classical example is stock market strategy: an algorithm works well for a while, and then a change in bumpiness makes it fail. Detecting when bumpiness changes can help you adapt your algorithm earlier, and hopefully decrease losses. Low bumpiness can also be associated with stronger, more accurate predictions. When possible, use data with low bumpiness or use bumpiness reduction techniques.
Given a time series, an intuitive, data science, scale-dependent, and robust metric would be the average acute angle measured on all the vertices. This metric is bounded by the following:
This metric is insensitive to outliers. It is by all means a modern metric. However, I don't want to reinvent the wheel, and thus I will define bumpiness using a classical “statistic” (as opposed to “data science”) metric that has the same mathematical and theoretical appeal and drawbacks as the old-fashioned average (to measure centrality) or variance (to measure volatility).
Bumpiness can be defined as the auto-correlation of lag 1. Figure 4-1 shows three time series with the same mean, variance, and values, but different bumpiness.
Note that the lag 1 auto-correlation is the highest of all auto-correlations, in absolute value. Therefore, it is the single best indicator of the auto-correlation structure of a time series. It is always between –1 and +1. It is close to 1 for smooth time series, close to 0 for pure noise, negative for periodic time series, and close to –1 for time series with huge oscillations. You can produce an r close to –1 by ordering pseudo-random deviates as follows: x(1), x(n), x(2), x(n–1), x(3), x(n–2)… where x(k) [k = 1, …, n] represents the order statistics for a set of n points, with x(1) = minimum and x(n) = maximum.
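These three behaviors — r close to 1, 0, and –1 — are easy to verify numerically. A minimal Python sketch (the smooth series, the seed, and the series lengths are arbitrary choices for illustration):

```python
import math
import random

def lag1_autocorr(x):
    """Lag-1 auto-correlation: the bumpiness coefficient r discussed above."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x)
    cov = sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1))
    return cov / var

random.seed(1)  # fixed seed so the example is reproducible
smooth = [math.sin(i / 20) for i in range(200)]  # smooth series: r close to 1
noise = [random.random() for _ in range(200)]    # pure noise: r close to 0

# Reorder the sorted values as x(1), x(n), x(2), x(n-1), ... to push
# r toward -1, as described above:
s = sorted(noise)
osc, lo, hi = [], 0, len(s) - 1
while lo <= hi:
    osc.append(s[lo]); lo += 1
    if lo <= hi:
        osc.append(s[hi]); hi -= 1
```

All three series here can share the same mean and variance; only the ordering of the values, and hence the bumpiness, differs.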
A better but more complicated definition would involve all the auto-correlation coefficients embedded in a sum with decaying weights. It would be better in the sense that when the value is 0, it means that the data points are truly independent for most practical purposes.
You will now consider an Excel spreadsheet showing computations of the bumpiness coefficient r for various time series. It is also of interest if you want to learn new Excel concepts such as random number generation with RAND, indirect references with INDIRECT, RANK, and LARGE, and other powerful but not well-known Excel functions. Finally, it is an example of a fully interactive Excel spreadsheet driven by two core parameters.
You can download the spreadsheet from http://bit.ly/KhU41L. It contains a base (smooth, r > 0) time series in column G and four other time series derived from the base time series:
Two core parameters can be fine-tuned: cells N1 and O1. Note that r can be positive even if the time series is trending down: r does not represent the trend. Instead, a metric that would measure trend would be the correlation with time (also computed in the spreadsheet).
The creation of a neutral time series (r = 0), based on a given set of data points (that is, preserving the average, the variance, and indeed all values), is performed by reshuffling the original values (column G) in a random order. It is based on the pseudo-random permutation in column B, itself created using random deviates generated with RAND and ranked with the RANK Excel function. The theoretical framework is based on the Analyticbridge second theorem.
A random permutation of non-independent numbers constitutes a sequence of independent numbers. This is not a real theorem per se; however, it is a rather intuitive and easy way to explain the underlying concept. In short, the more data points, the more the reshuffled series (using a random permutation) looks like random numbers (with a prespecified, typically non-uniform statistical distribution), no matter what the original numbers are. It is also easy to verify the theorem by computing a bunch of statistics on simulated reshuffled data: all these statistics (for example, auto-correlations) will be consistent with the fact that the reshuffled values are (asymptotically) independent of each other.
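The reshuffling argument is easy to verify outside of Excel. The following Python sketch builds a strongly auto-correlated series (a random walk — an arbitrary choice for illustration), reshuffles it, and computes the lag-1 auto-correlation before and after:

```python
import random

def lag1_autocorr(x):
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x)
    cov = sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1))
    return cov / var

random.seed(7)  # fixed seed for reproducibility
# A strongly auto-correlated (smooth) series: a random walk.
walk = [0.0]
for _ in range(1999):
    walk.append(walk[-1] + random.random() - 0.5)

shuffled = list(walk)
random.shuffle(shuffled)  # random permutation of the same values

r_before = lag1_autocorr(walk)     # close to 1: smooth series
r_after = lag1_autocorr(shuffled)  # close to 0: values now behave as independent
```

The shuffled series has exactly the same values (and thus the same mean and variance) as the original, yet its auto-correlation structure is destroyed, which is the content of the theorem.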
NOTE You can check out the first Analyticbridge theorem at http://www.analyticbridge.com/profiles/blogs/how-to-build-simple-accurate-data-driven-model-free-confidence-in.
Excel has some issues. In particular, its random number generator has been criticized, and values get recomputed each time you update the spreadsheet, making the results nonreplicable (unless you “freeze” the values in column B).
Economic time series should always be studied by separating periods with high and low bumpiness, understanding the mechanisms that create bumpiness, and detecting bumpiness in the first place. In some cases, the bumpiness might be too small to be noticed with the naked eye, but statistical tools should detect it.
Another application is in high-frequency trading. Stocks with highly negative bumpiness in price (over short time windows) are perfect candidates for statistical trading because they offer controlled, exploitable volatility—unlike a bumpiness close to zero, which corresponds to uncontrolled volatility (pure noise). And of course, stocks with highly positive bumpiness don't exist anymore. They did 30 years ago: they were the bread and butter of investors who kept a stock or index forever to see it automatically grow year after year.
QUESTION
How do you generalize this definition to higher dimensions—for instance, to spatial processes? You could have a notion of directional bumpiness (north-south or east-west).
Potential application: real-time flight path optimization to avoid seriously bumpy air (that is, highly negative bumpiness in wind speed and direction).
Clustering techniques have already been explored in Chapter 2. Here I describe an additional technique to help cluster large data sets. It can be added to any clustering algorithm. It consists of first creating core clusters using a sample, rather than the entire data set.
Say you have to cluster 10 million points—for instance, keywords. You have a dissimilarity function available as a text file with 100 million entries, each entry consisting of three data points:
So, in short, you can perform kNN (k-nearest neighbors) clustering or some other type of clustering, which typically is O(n²) or worse from a computational complexity point of view.
The idea is to start by sampling 1 percent (or less) of the 100 million entries and perform clustering on these pairs of keywords to create a “seed” or “base-line” cluster structure.
The next step is to browse sequentially your 10 million keywords, and for each keyword, find the closest cluster from the baseline cluster structure. If there are 1,000 base clusters, you'll have to perform 10 billion (10,000,000 × 1,000) computations (hash table accesses), a rather large number, but it can easily be done with a distributed architecture (Hadoop).
You could indeed repeat the whole procedure five times (with five different samples) and blend the five different final cluster structures that you obtain. The final output might be a table with 30 million entries, where each entry consists of a pair of keywords A and B and the number of times (if > 0) that A and B belong to the same cluster (based on the five iterations corresponding to the five different samples). The final cluster detection algorithm consists of extracting the connected components from these 30 million entries. From a graph theory point of view, these entries are edges joining two nodes A and B. (Here a node is a keyword.)
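The last step — extracting connected components from the table of keyword pairs — can be sketched with a standard union-find structure. The keyword pairs below are made up for illustration:

```python
def connected_components(pairs):
    """Extract connected components from (A, B) keyword pairs using
    union-find; each component is one final cluster."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving for speed
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

pairs = [("car", "auto"), ("auto", "vehicle"), ("sushi", "sashimi")]
comps = connected_components(pairs)  # two clusters
```

On the full 30-million-entry table, the same logic applies; only the storage (hash tables, possibly distributed) needs to scale.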
What are the conditions for such an algorithm to work? You can assume that each keyword A has on average between 2 and 50 neighboring keywords B such that d(A,B) > 0. The number of such neighbors is usually much closer to 2 than to 50. So after classifying all keywords, you might end up with 10 or 20 percent of all keywords isolated (forming single-point clusters).
Of course, you'll solve the problem if you work with 50 rather than 5 samples, or with samples that represent 25 percent of the data rather than 1 percent, but this is a time-prohibitive approach that leverages statistical sampling poorly and inefficiently. The way to optimize these parameters is by testing: if your algorithm runs very fast and requires little memory but leaves too many important keywords isolated, then increase the sample size, the number of samples, or both. Conversely, if the algorithm is very slow, stop the program and decrease the sample size or the number of samples. When you run an algorithm, it is always a good idea to have your program print to the screen (or to a log file) the size of the hash tables as they grow (so you will know when your algorithm has exhausted all your memory and can fix the problem), as well as the steps completed. In short, display a progress status. For instance:
> 10,000 keywords processed and 545 clusters created
> 20,000 keywords processed and 690 clusters created
> 30,000 keywords processed and 738 clusters created
As the program is running, after 21 seconds the first line is displayed on the screen, after 45 seconds the second line is displayed below the first line, after 61 seconds the third line is displayed, and so on. The symbol > is used because it typically shows up on the window console where the program is running. Sometimes, instead of > it's $ or something like $c://home/vincentg/myprograms>.
You can also compute the time used to complete each step (to identify bottlenecks) by calling the time function at the end of each step, and measuring time differences (see http://bit.ly/1e4eRgP for details).
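Per-step timing of this kind can be sketched generically in Python (this is a generic sketch, not the code from the URL above; the step names and workloads are placeholders):

```python
import time

def run_steps(steps):
    """Run named steps and report elapsed time per step, to spot
    bottlenecks. `steps` is a list of (name, callable) pairs."""
    timings = {}
    for name, fn in steps:
        start = time.time()
        fn()
        timings[name] = time.time() - start  # time difference per step
        print(f"{name} completed in {timings[name]:.2f} seconds")
    return timings

timings = run_steps([
    ("load", lambda: sum(range(100000))),
    ("cluster", lambda: sorted(range(50000), reverse=True)),
])
```

Printing both progress counts and per-step timings costs almost nothing and makes long-running clustering jobs far easier to debug.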
An interesting application of this keyword clustering is as follows: Traditional yellow pages have a few hundred categories and subcategories. The restaurant category has subcategories that are typically all related to type of cuisine: Italian, American, sushi, seafood, and so on. By analyzing and clustering online user queries, you can identify a richer set of subcategories for the restaurant category: type of cuisine, of course (as usual), type of restaurant (wine bar, upscale, pub, family dining, restaurant with view, romantic, ethnic, late dining), type of location (downtown, riverfront, suburban, beach, mountain), or stuff not related to the food itself (restaurant jobs, recipes, menus, restaurant furniture).
NOTE You should prioritize areas that have a high node density, and then assume those areas have a higher chance of being sampled.
With big data, you sometimes have to compute correlations involving thousands of buckets of paired observations or time series. For instance, a data bucket corresponds to a node in a decision tree, a customer segment, or a subset of observations having the same multivariate feature. Specific contexts of interest include multivariate feature selection (a combinatorial problem) or identification of the best predictive set of metrics.
In large data sets, some buckets contain outliers or meaningless data, and buckets might have different sizes. You need something better than the tools offered by traditional statistics. In particular, you need a correlation metric that satisfies the following conditions:
Note that R-Squared, a goodness-of-fit measure used to compare model efficiency across multiple models, is typically the square of the correlation coefficient between observations and predicted values, measured on a training set via sound cross-validation techniques. It suffers the same drawbacks and benefits from the same cures as traditional correlation. So I will focus here on the correlation.
To illustrate the first condition (dependence on n), consider the following made-up data set with two paired variables or time series X, Y:
X Y
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
Here n = 8 and the traditional correlation is r = 0.22. If n = 7 (delete the last observation), then r = 0. If n = 6, r = 0.29. Clearly, the correlation is relatively high when n is even, although it slowly decreases toward 0 as n grows. If you shift Y one cell down (assuming both X and Y are infinite time series), then the correlations for even n become negative. Yet this (X, Y) process is supposed to simulate a perfect zero correlation. The traditional correlation coefficient fails to capture this pattern for small n.
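A quick way to reproduce these numbers, as a minimal sketch in plain Python (no third-party libraries, so the arithmetic is transparent):

```python
# Pearson correlation of the alternating series above, for n = 8, 7, 6.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

for n in (8, 7, 6):
    x = list(range(n))
    y = [k % 2 for k in range(n)]   # 0, 1, 0, 1, ...
    print(n, round(pearson(x, y), 2))
# → 8 0.22
#   7 0.0
#   6 0.29
```

Deleting one observation at a time reproduces the even/odd oscillation described above.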
This is a problem with big data, where you might have tons of observations (monthly home prices and square feet for 300 million homes) but you compute correlations on small buckets (for instance, for all specific types of houses sold in 2013 in a particular ZIP code) to refine home value estimates for houses not on the market by taking into account local fluctuations. In particular, comparisons between buckets of different sizes become meaningless.
Our starting point to improve the standard correlation will be Spearman's rank correlation. It is the traditional correlation but measured on ranked variables. All correlations from the new family that I will introduce in the next section are also based on rank statistics. By working on ranked variables, you satisfy conditions #3 and #4 (although #4 will be further improved with my new correlation). Now denote Spearman's coefficient as s.
It is easy to prove:
s = 1 − S(X, Y)/q(n)
where:
Of course, s satisfies condition #2 by construction. An interesting and important fact, and a source of concern, is that q(n) = O(n³). In short, q(n) grows too fast.
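For reference, when there are no ties, Spearman's s satisfies the classic identity s = 1 − 6·Σd²/(n(n² − 1)), where d is the rank difference; the normalizer n(n² − 1)/6 is indeed O(n³), consistent with the growth rate mentioned above. A quick sanity check in plain Python, comparing the closed form against the Pearson correlation computed on the ranks:

```python
# Spearman's s two ways: closed form versus Pearson correlation on ranks.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

x = [0, 1, 2, 3, 4]          # already ranked: x(j) = j
y = [1, 0, 3, 2, 4]          # a permutation, i.e. the ranks of Y
n = len(x)
S = sum((a - b) ** 2 for a, b in zip(x, y))   # sum of squared rank differences
q = n * (n * n - 1) / 6                        # O(n^3) normalizer
s_formula = 1 - S / q
s_ranks = pearson(x, y)
print(s_formula, s_ranks)    # both equal 0.8
```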
Without loss of generality, from now on you can assume that X is ordered and thus x(j) = j. The new correlation will be denoted as t or t(c) to emphasize that it is a family of correlations governed by the parameter c. Bayesian statisticians might view c as a prior distribution.
The new correlation t is computed as follows:
Step A
Step B
Here are some properties of this new correlation:
NOTE Question: In Figure 4-3, which correlation makes the most sense, s or t? Answer: Neither is statistically different from 0. However, t makes more sense since the two points in the top-right and bottom-left corners look like outliers and make s positive (but not t, which is less sensitive to outliers).
Again, I assume that t is defined using c = 1. Simulations where X = (0, 1, …, n−1) and Y is a random permutation of n elements (obtained by first creating n random numbers and then extracting their ranks) show that
Finally, when t and s are quite different, usually the t value looks more natural, as in Figure 4-3; t < 0 but s > 0 because t is better at eliminating the effect of the two outliers (top-right corner, bottom-left corner). This is critical when you compute correlations across thousands of data buckets of various sizes: You are bound to find outliers in some buckets. And if you look for extreme correlations, your conclusions might be severely biased. (Read the section “The Curse of Big Data” in Chapter 2 for more information on this.)
Figure 4-4 shows the distribution of t when n = 10. When n = 10, 97.2 percent of t values are between –0.5 and +0.5. When n = 7, 92.0 percent of t values are between –0.5 and +0.5. So the distribution of t depends on n. Theoretically, what happens when c gets close to 0?
NOTE An interesting fact is that t and s agree most of the time. The classic correlation r between s and t, computed on all n! = 362,880 potential (X, Y) vectors with n = 9 observations, is equal to 0.96.
To estimate the asymptotic distribution of t (when n becomes large), you need to compute q(n). Is there an exact formula for q(n)? I ran a competition to see if someone could come up with a formula, a proof that none exists, or at least an asymptotic formula. Read the next section for the details. Let's just say for now that there were two winners, both experts in mathematical optimization, and that $1,500 in awards was offered.
Because the correlation is in the range [−1, +1] and is thus bounded, I would like a correlation of, for example, 0.5 to always represent the same percentile of t, regardless of n. There are various ways to accomplish this, but they all involve transforming t, and some research still needs to be done. For r or s, you can typically use the Fisher transformation, but it will not work on t; besides, it transforms a bounded metric into an unbounded one, making interpretation difficult.
Most of this research and these simulations involve the generation of a large number of permutations for small and large n. This section discusses how to generate these permutations.
One way is to produce all permutations: you can find source code and instructions at http://bit.ly/171bEzt. It becomes slow when n is larger than 20; however, it is easy to use MapReduce to split the task across multiple virtual machines, in parallel. Another strategy is to sample enough random permutations to get a good approximation of t's distribution, as well as of q(n). There is a one-to-one mapping between permutations of n elements and the integers between 1 and n!: each of these numbers can be uniquely represented by an n-permutation, and the other way around. Check the "numbering permutations" section of the Wikipedia article on permutations for details: http://en.wikipedia.org/wiki/Permutation#Numbering_permutations. This mapping is slow, however: O(n²) to transform a number into a permutation or the other way around. So instead, I used a rudimentary approach (you can download the source code at http://bit.ly/H4Y1oV), which is just as fast:
The algorithm to generate a random permutation is as follows:
(p(0), p(1), … , p(n–1))
For k = 0 to n–1, do
Step 1: Generate p(k) as a random number on {0, … , n–1}.
Step 2: Repeat Step 1 until p(k) is different from p(0), p(1), … , p(k–1).
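The two steps above can be sketched in Python (here a set plays the role of the auxiliary 0/1 array used for the collision check):

```python
import random

def random_permutation(n, seed=None):
    """The rudimentary rejection algorithm: draw p(k) uniformly on
    {0, ..., n-1} and redraw until it differs from p(0), ..., p(k-1)."""
    rng = random.Random(seed)
    p, seen = [], set()
    for k in range(n):
        v = rng.randrange(n)
        while v in seen:          # Step 2: redraw on collision
            v = rng.randrange(n)
        p.append(v)
        seen.add(v)
    return p

print(random_permutation(8, seed=42))
```

The `seed` parameter is an addition for reproducibility; it is not part of the algorithm itself.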
The following comments provide more insight into the differences between the two correlations s and t.
The purpose of this section is twofold. First, to solve the problem discussed in the previous section (finding an explicit formula for q(n)) using algorithms and simulations that involve processing large amounts of data; this can help you get familiar with a number of computer science concepts related to computational complexity. Second, to present an exact mathematical solution, proving that mathematical modeling can sometimes beat even the most powerful cluster of computers in finding a solution, though usually the two work hand in hand.
This function q(n) is at the core of a new type of statistical metric developed for big data: a nonparametric, robust metric to measure a new type of correlation or goodness of fit. This metric generalizes traditional metrics that have been used for centuries, and it is most useful when working with large ordinal data series, such as rank data. Although based on rank statistics, it is much less sensitive to outliers than current metrics based on rank statistics (such as Spearman's rank correlation), which were designed for rather small n, where Spearman's rank correlation is indeed robust.
Start with a caveman approach, and then go through a few rounds of improvements.
Brute force consists of visiting all n! permutations of n elements to find the maximum q(n). Computational complexity is therefore O(n!)—this is terribly slow.
Here's the code that iteratively produces all n! (factorial n) permutations of n elements stored in an array. It allows you to easily compute statistics associated with these permutations and compute aggregates over all or specific permutations. For n > 10 it can be slow, and some MapReduce architecture would help. If n > 15, you might be interested in sampling rather than visiting all permutations to compute summary statistics; more on this later. Because q(n) is a maximum computed over all permutations of size n, this code can help you compute q(n).
This code is also useful to compute all n-grams of a keyword or, more interestingly, in the context of nonparametric statistics. On Stackoverflow.com (http://bit.ly/1cKCAEA) you can find the source code in various modern languages, including Python. A Google search for calculate all permutations provides interesting links. Here is the Perl code:
use List::Permutor;
$n=5;
@array=();
for ($k=0; $k<$n; $k++) { $array[$k]=$k; }
$permutIter=0;
my $permutor = new List::Permutor(@array);
while ( my @permutation = $permutor->next() ) {
  for ($k=0; $k<$n; $k++) { print "$permutation[$k] "; }
  print "\n";
  $permutIter++;
}
It couldn't be any easier. You need the Permutor library in Perl. (In other languages no library is needed; the full code is provided.) The easiest way to install the library is as follows, and this methodology applies to any Perl library.
Don't believe people who claim that you can use ppm or CPAN to download and install the library automatically, in one click; that is by far the most difficult way. Instead, use a caveman install (supposedly the hardest way, but in fact the easiest solution) by doing the following:
That's it! Note that if you don't have the List library installed on your system (the parent Library for Permutor), you'll first have to download and install List using the same process, but everybody has List on their machine nowadays because it is part of the default Perl package.
First improvement: sampling permutations to estimate q(n). Rather than visiting all permutations, a lot of time can be saved by visiting a carefully selected tiny sample. The algorithm to generate random permutations (p(0), p(1), …, p(n–1)) was previously mentioned:
For k = 0 to n−1, do:
  Step 1: Generate p(k) as a random number on {0, … , n−1}.
  Step 2: Repeat Step 1 until p(k) is different from p(0), p(1), … , p(k−1).
End
You should also consider the computational complexity of this algorithm, as you compare it with two alternative algorithms:
You would expect the rudimentary algorithm to be terribly slow. Actually, it might never end, running indefinitely, even just to produce p(1), let alone p(n-1), which is considerably harder to produce. Yet it turns out that the rudimentary algorithm is also O(n log n). (Proving this fact would actually be a great job interview question.) Here's the outline.
For positive k strictly smaller than n, you can create p(k) in these ways:
Note that the number of shots required might be infinite (though with probability 0). On average, you need M(k) = SUM{ j * L(k, j) } = n/(n−k) shots to create p(k), where L(k, j) is the probability that exactly j shots are needed and the sum is over the positive integers j = 1, 2, …. The total number of operations is thus (on average) SUM{ M(k) } = n * (1 + 1/2 + 1/3 + … + 1/n), where the sum is over k = 0, 1, …, n−1. This turns out to be O(n log n). Note that in terms of data structure, you need an auxiliary array A of size n, initialized to 0. When p(k) is created (in the last, successful shot at iteration k), you update A as follows: A[p(k)] = 1. This array is used to check whether p(k) is different from p(0), p(1), …, p(k−1), or in other words, whether A[p(k)] = 0.
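The n·H_n figure (H_n being the harmonic number 1 + 1/2 + … + 1/n) can be checked empirically by counting the draws over many seeded runs:

```python
import random

def draws_for_permutation(n, rng):
    """Count the total number of random draws the rejection algorithm makes
    to build one permutation of n elements."""
    seen, draws = set(), 0
    for _ in range(n):
        v = rng.randrange(n); draws += 1
        while v in seen:
            v = rng.randrange(n); draws += 1
        seen.add(v)
    return draws

n, trials = 50, 2000
rng = random.Random(0)
avg = sum(draws_for_permutation(n, rng) for _ in range(trials)) / trials
harmonic = sum(1 / j for j in range(1, n + 1))
print(avg, n * harmonic)   # empirically close to n * H_n (about 225 for n = 50)
```

The sample size of 2,000 trials is an arbitrary choice; the average converges to n·H_n as the number of trials grows.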
Is O(n log n) the best that you can achieve? No, the algorithm in the next section is O(n), which is better.
The Fisher–Yates shuffle algorithm is the best for your purpose. It works as follows:
p(k) ← k, for k = 0, …, n−1
For k = n−1 down to 1, do:
  j ← random integer with 0 ≤ j ≤ k
  exchange p(j) and p(k)
End
This is the best that can be achieved, in terms of computational complexity, for the random permutation generator (that is, it is the fastest algorithm).
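A Python sketch of the shuffle above (this is essentially what Python's own random.shuffle implements):

```python
import random

def fisher_yates(n, rng=random):
    """O(n) Fisher-Yates shuffle of 0..n-1, matching the pseudocode above."""
    p = list(range(n))
    for k in range(n - 1, 0, -1):
        j = rng.randrange(k + 1)   # random integer with 0 <= j <= k
        p[j], p[k] = p[k], p[j]
    return p

print(fisher_yates(8))
```

Each element is touched once and each draw succeeds immediately, which is why the complexity drops from O(n log n) to O(n).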
Sometimes, processing vast amounts of data (billions or trillions of random permutations in this case) is not the best approach. A little bit of mathematics can solve your problem simply and elegantly.
In 2013, Jean-Francois Puget, PhD, Distinguished Engineer, Industry Solutions Analytics and Optimization at IBM, found the exact formula and came up with a proof. He was awarded a $1,000 prize by Data Science Central for his solution as the winner of their first theoretical data science competition. If you are interested in applied data science competitions (involving real data sets), you should check http://www.Kaggle.com.
Consider Dr. Puget's result:
You can find the rather lengthy and complicated proof for this at www.datashaping.com/Puget-Proof.pdf.
This section discusses a metric that measures the presence or absence of a structure or pattern in a data set. The purpose is to measure the strength of the association between two variables, generalizing the modern correlation coefficient discussed in the section Correlation and R-Squared for Big Data in a few ways:
The structured coefficient is denoted as w. I am working under the following framework:
This type of structured coefficient makes no assumption about the shape of the underlying domains, where the n points are located. These domains could be smooth, bumpy, made up of lines, made up of dual points, have holes, and so on. This type of structured coefficient can be applied to non-numeric data (for example, if the data consists of keywords), time series, spatial data, or data in higher dimensions. It is more generic than many metrics currently used to measure structure, and is entirely data-driven.
Clustering algorithms have been discussed at length in the previous chapters, as well as in this chapter. Here you will find a rule of thumb that can be automated and does not require visual inspection by a human being to determine the optimum number of clusters produced by any clustering algorithm.
Note that the concept of cluster is a fuzzy one. How do you define a cluster? Nevertheless, in many applications there is a clear optimum number of clusters. The methodology described here will solve all easy and less easy cases, and will provide a “don't know” answer to cases that are ambiguous.
The methodology is described by the following three steps:
This is based on the fact that the piece-wise linear plot of the number of clusters versus the percentage of variance explained by the clusters is a convex function with an elbow point (see Figure 4-5). The elbow point determines the optimum number of clusters. If the piece-wise linear function is approximated by a smooth curve, the optimum would be the point where the second derivative of the approximating curve vanishes; this is the inflection point of the curve in question. The methodology described here is simply an application of this elbow-point detection technique in a discrete framework (the number of clusters being a discrete number).
Row #1 represents the number of clusters (X-axis), and Row #2 is the percentage of variance explained by clusters (Y-axis) in Figure 4-5. Then the third, fourth, and fifth rows are differences (sometimes called deltas) from the row just above each of them. For instance, the second number in Row #3 is 15=80−65.
The optimum number of clusters in this example is 4, corresponding to maximum = 8 in the third differences.
NOTE If you already have a strong minimum in the second difference (not the case here), you don't need to go to the third difference: stop at level 2.
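The rows of differences can be computed mechanically. Here is a short sketch; the variance-explained values are hypothetical, chosen to be consistent with the numbers quoted above (65, 80, the delta 15, and a maximum third difference of 8), and the index offset mapping a third difference back to a cluster count is my assumption about how the rows align:

```python
def elbow(clusters, variance_explained):
    """Pick the optimum number of clusters from the maximum of the
    third differences, per the methodology described above."""
    def diffs(seq):
        return [b - a for a, b in zip(seq, seq[1:])]
    d1 = diffs(variance_explained)        # Row #3: first differences
    d2 = diffs(d1)                        # Row #4: second differences
    d3 = diffs(d2)                        # Row #5: third differences
    i = max(range(len(d3)), key=lambda k: d3[k])
    return clusters[i + 2]                # offset back to cluster counts (assumed)

clusters = [1, 2, 3, 4, 5, 6, 7]
variance = [40, 65, 80, 85, 88, 90, 91]   # hypothetical; note 15 = 80 - 65
print(elbow(clusters, variance))          # → 4
```

Here d1 = [25, 15, 5, 3, 2, 1], d2 = [−10, −10, −2, −1, −1], and d3 = [0, 8, 1, 0], so the maximum 8 points to 4 clusters.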
This section discusses a component that is often missing, yet valuable, in most systems that deal with online or mobile data (known as digital data): systems for transaction scoring, fraud detection, online marketing, marketing mix and advertising optimization, online search, plagiarism detection, and spam detection.
I call this component an Internet topology mapping. It need not be stored as a traditional database. (It could be a graph database, a file system, or a set of lookup tables.) It must be prebuilt (for example, as lookup tables with regular updates) to be used efficiently.
Essentially, Internet topology mapping is a system that matches an IP address (Internet or mobile) with a domain name (ISP). When you process a transaction in real time in production mode (for example, an online credit card transaction, to decide whether to accept or decline it), your system has only a few milliseconds to score the transaction and make the decision. In short, you have only a few milliseconds to call and run an algorithm (subprocess), on the fly, separately for each credit card transaction. If the algorithm involves matching the IP address with an ISP domain name (an operation called nslookup), it won't work: direct nslookups take between a few hundred and a few thousand milliseconds, and they will grind the system to a halt.
Because of that, Internet topology mapping is missing in most systems. Yet there is a simple workaround to build it:
When processing a transaction, access the lookup table created in the last step (stored in memory, or least with some caching available in memory) to detect the domain name. Now you can use a rule system that does incorporate domain names.
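To make the lookup step concrete, here is a minimal Python sketch. The table entries and CIDR ranges are hypothetical, and a production table with tens of millions of ranges would use sorted ranges with binary search (or a trie) rather than a linear scan:

```python
import ipaddress

# Hypothetical, tiny stand-in for the prebuilt IP-range-to-domain lookup table.
IP_TO_DOMAIN = {
    "98.139.0.0/16": "yahoo.com",
    "74.125.0.0/16": "google.com",
}

def domain_for(ip):
    """Return the domain for an IP address, or 'n/a' if unresolved
    (an unresolved IP is itself a useful fraud signal)."""
    addr = ipaddress.ip_address(ip)
    for cidr, domain in IP_TO_DOMAIN.items():
        if addr in ipaddress.ip_network(cidr):
            return domain
    return "n/a"

print(domain_for("74.125.4.22"))   # → google.com
```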
Examples of rules and metrics based on domain names include:
This is the first component of the Internet topology mapping. The second component is a clustering structure—in short, a structure (text file is OK) where a cluster is assigned to each IP address or IP range. Examples of clusters include:
Armed with these components (IP address/domain mapping + IP address cluster structure, aka Internet topology), you can now develop better algorithms: real-time or back-end, end-of-day algorithms. You need the IP address/domain mapping to build the cluster structure. If you have a data scientist on board, it should be easy for her to create this Internet topology mapping and identify the great benefits of using it.
The only issue with creating this product (assuming it will contain 20 million IP address ranges and get updated regularly) is the large amount of time spent doing millions of slow (0.5 second each) caveman nslookups. There are well-known ranges reserved for AOL and other big ISPs, so you will probably end up doing just 10 million nslookups. Given that 1 percent of them will fail (timing out after two seconds) and that you will have to run nslookup twice on some IP addresses, let's say, in short, that you are going to run 10 million nslookups, each taking one second on average. That's about 2,777 hours, or 115 days.
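The arithmetic, spelled out:

```python
# Back-of-the-envelope estimate: 10 million nslookups at ~1 second each.
lookups = 10_000_000
seconds = lookups * 1.0          # ~1 second per lookup on average
hours = seconds / 3600
days = hours / 24
print(int(hours), int(days))     # → 2777 115
```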
You can use a MapReduce environment to easily reduce the time by a factor of 20, by leveraging the distributed architecture. Even on one single machine, if you run 25 versions of your nslookup script in parallel, you should make it run four times faster—that is, it would complete in less than a month. That's why I claim that a little guy alone in his garage could create the Internet Topology Mapping in a few weeks or less. The input data set (for example, 100 million IP addresses) would require less than 20 GB of storage, even less when compressed. Pretty small.
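For comparison with the Perl script that follows, here is a sketch of the parallel approach in Python: a 25-thread pool mirrors the idea of running 25 copies of the nslookup script, and `socket.gethostbyaddr` performs the reverse lookup. The `resolver` parameter is an addition of mine so the function can be exercised without a network connection:

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def reverse_dns(ip):
    """Reverse-DNS one IP address; return 'n/a' on timeout or missing record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return "n/a"

def bulk_lookup(ips, resolver=reverse_dns, workers=25):
    """Run lookups in parallel threads (the work is network-bound,
    so threads rather than processes are enough). Returns {ip: domain}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(ips, pool.map(resolver, ips)))
```

For example, `bulk_lookup(["8.8.8.8", "1.1.1.1"])` would resolve both addresses concurrently; with 25 workers, throughput scales until DNS server or bandwidth limits kick in, which is why the speedup is far less than 25x.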
Finally, here's a Perl script that automatically performs nslookups on an input file ips.txt of IP addresses and stores the results in outip.txt. It works on my Windows laptop. You need an Internet connection to run it, and you should add an error management system to recover gracefully if you lose power or your Internet connection.
open(IN, "<ips.txt");
open(OUT, ">outip.txt");
while ($lu = <IN>) {
  $ip = $lu;
  $n++;
  $ip =~ s/\n//g;
  if ($ip eq "") { $ip = "na"; }
  `nslookup $ip | grep Name > titi.txt`;
  open(TMP, "<titi.txt");
  $domainName = "n/a";
  while ($i = <TMP>) {
    $i =~ s/\n//g;
    $i =~ s/Name://g;
    $domainName = $i;
  }
  close(TMP);
  print OUT "$ip\t$domainName\n";
  print "$n> $ip | $domainName\n";
}
close(OUT);
close(IN);
In summary, Internet topology mapping is a tool that needs to be created using the steps described here to help in a number of business problems, including spam detection, fraud detection, attribution modeling, digital marketing (including e-mail marketing), publishing (to customize content based on IP address category), and advertising (to optimize relevancy). Traditional algorithms use blacklists or whitelists of IP addresses or IP ranges (a component of the Internet topology mapping), but none, to my knowledge, use IP clusters produced using a clustering algorithm. I have seen business dashboards that show the top 100 IP addresses for a specific metric (traffic volume, amount of fraud, and so on), but this is a really bad business practice since IP addresses are very granular (so the top 100 does not communicate many actionable insights) and a few IP addresses are shared by a very large number of users (mobile IP address proxies in particular). Dashboards based on IP categories derived from Internet topology mapping are more useful to decision makers.
I have also seen IP addresses used as a metric in decision trees, which is also terrible. IP categories or IP clusters should be used in decision trees, rather than a highly granular metric such as IP address. Examples of potential fraud detection rules derived from Internet topology mapping include “corporate IP addresses (unless hacked) should be white-listed” and “IP addresses from universities, associated with computers shared by many students (proxies), are riskier (except some of them where the sys-admin is doing a great job making these networks safe).” The difficulty is not coming up with these two intuitive rules, but rather classifying an IP address as a corporate or university proxy—a problem solved with Internet topology mapping.
This section is important, as it illustrates how to develop a real-life application with JavaScript code and how to create a tool that any end user can run from a browser, without having to install R, download or purchase software, or write code. The JavaScript code provided here can be run online or offline from your browser. I believe that in the future, more data science tools will be browser-based and lighter than a full API (I like the fact that this one can be run offline from your browser). This code also provides another example of random number generation.
A big issue with data is how to securely transmit it. Here you will see some simple JavaScript code to encode numbers, such as credit card numbers, passwords made up of digits, phone numbers, Social Security numbers, dates such as 20131014, and so on.
The following is an example of an app that anyone can download as a text file, save as an HTML document, and run on their laptop, locally, from a browser (without any Internet connection). I call it an offline app, as opposed to a mobile or web or social (Facebook) app. As with most apps, you don't need any programming skills to use it. This one is original in the sense that you don't even need an Internet connection to make it work.
Here's how it works:
This code is simple (it is by no means strong encryption) and less sophisticated than uuencode. But uuencode is for the tech-savvy, whereas our app is easy to use by any non-technical person. The encoded value is also a text string, and thus easy to copy and paste in any e-mail client. The encoded value has some randomness, in the sense that encoding the same values twice will result in two different encoded values. Finally, it is more secure than it seems at first glance, if you don't tell anyone (except over the phone) where the decoder can be found.
You can make it even more secure by creating a version that accepts parameters. Here's the JavaScript/HTML code (the source code of the web page where our application is hosted). You could save it as an HTML document on your local machine, with a filename such as encode.html, in a folder such as C:/Webpages, and then open and run it from a browser on your local machine. The URL for this local web page would be file:///C:/Webpages/encode.html if you use Chrome.
<html>
<script language="Javascript">
<!--
function encrypt2() {
  var form=document.forms[0]
  if (form.encrypt.checked) {
    form.cardnumber.value=crypt(form.cardnumber.value)
  } else {
    form.cardnumber.value=decrypt(form.cardnumber.value)
  }
}

function crypt(string) {
  var len=string.length
  var intCarlu
  var carlu
  var newString="e"
  if ((string.charCodeAt(0)!=101)&&(len>0)) {
    for (var i=0; i<len; i++) {
      intCarlu=string.charCodeAt(i)
      rnd=Math.floor(Math.random()*7)
      newIntCarlu=30+10*rnd+intCarlu+i-48
      if (newIntCarlu<48) { newIntCarlu+=50 }
      if (newIntCarlu>=58 && newIntCarlu<=64) { newIntCarlu+=10 }
      if (newIntCarlu>=90 && newIntCarlu<=96) { newIntCarlu+=10 }
      carlu=String.fromCharCode(newIntCarlu)
      newString=newString.concat(carlu)
    }
    return newString
  } else {
    return string
  }
}

function decrypt(string) {
  var len=string.length
  var intCarlu
  var carlu
  var newString=""
  if (string.charCodeAt(0)==101) {
    for (var i=1; i<len; i++) {
      intCarlu=string.charCodeAt(i)
      carlu=String.fromCharCode(48+(intCarlu-i+1)%10)
      newString=newString.concat(carlu)
    }
    return newString
  } else {
    return string
  }
}
// -->
</script>
<form>
Enter Number <input type=text name=cardnumber size=19><p>
Encrypt / Decrypt <input type=checkbox name=encrypt onClick="encrypt2()">
</form>
</html>
This chapter discussed many original, new techniques and recipes used by data scientists to process modern data, including big data. I dived deep enough into the details so that you can reproduce them when needed, but without being too technical, and avoiding jargon and mathematical formulas so that business people can still quickly extract the essence.
Chapter 5 discusses additional techniques that are also part of core data science. While Chapters 4 and 5 contain the most technical material of the book, the transition has been made as smooth as possible by introducing some of the technical material in Chapter 2 (see the section Clustering and Taxonomy Creation for Massive Data Sets).