CHAPTER
2

Big Data Is Different

In Chapter 1, you considered what data science is and is not, and saw how data science is more than data analysis, computer science, or statistics. This chapter further explores data science as a new discipline.

The chapter begins by considering two of the most important issues associated with big data. Then it works through some real-life examples of big data techniques, and considers some of the communication issues involved in an effective big data team environment. Finally, it considers how statistics is and will be part of data science, and touches on the elements of the big data ecosystem.

Two Big Data Issues

There are two issues associated with big data that must be discussed and understood: the “curse” of big data and rapid data flow. These two issues are discussed in the following sections.

The Curse of Big Data

The “curse” of big data is the danger involved in recklessly applying and scaling data science techniques that have worked well for small, medium, and large data sets, but don't necessarily work well for big data. This problem is well illustrated by the flaws found in big data trading (for which solutions are proposed in this chapter).

In short, the curse of big data is that when you search for patterns in large data sets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have no predictive power. Even worse, the strongest patterns might

  • Be caused entirely by chance (like winning the lottery)
  • Be non-replicable
  • Have no predictive power
  • Include obscure weaker patterns that are ignored yet have strong predictive power

So the question is, how do you discriminate between a real and an accidental signal in vast amounts of data? Let's focus on an example: identifying strong correlations or relationships between time series.

Example

If you have 1,000 metrics (time series), you can compute 499,500 = 1,000 * 999 / 2 correlations. If you include cross-correlations with time lags (for example, stock prices for IBM today with stock prices for Google two days ago), you deal with many millions of correlations. Out of all these correlations, a few will be extremely high just by chance, and if you use such a correlation for predictive modeling, you will lose. Keep in mind that analyzing cross-correlations on all metrics is one of the first steps statisticians take at the beginning of any project — it's part of the exploratory analysis step. However, a spectral analysis of normalized time series (instead of correlation analysis) provides a more robust mechanism to identify true relationships.

To further illustrate this issue, say that you have k time series, each with n observations, for something like price deltas (price increases or decreases) computed for k different stock symbols with various time lags over the same time period consisting of n days. You want to detect patterns such as “When the Google stock price goes up, the Facebook stock price goes down 1 day later.” To detect such profitable patterns, you must compute cross-correlations over thousands of stocks with various time lags: one day and/or two days, or maybe one second and/or two seconds, depending on whether you do daily trading or extremely fast, intraday, high frequency trading.

Typically, you use a small number of observations (for example, n = 10 days or n = 10 milliseconds) because these patterns evaporate quickly. (When your competitors detect the patterns in question, those patterns stop being profitable.) In other words, you can assume that n = 10 or maybe n = 20. In other cases based on monthly data (environmental statistics, or the emergence of a new disease), maybe n = 48 for monthly data collected over a four-year time period. In some instances n might be much larger, but in that case the curse of big data is no longer a problem. The curse of big data is acute when n is smaller than 200 and k is moderately large, say k = 500. However, it is rare to find instances in which both n is large (>1,000) and k is large (>5,000).

Not all big data involves computing millions of cross-correlations. But the concept described here applies to transactional data in general, or data that is being scored after being binned into billions of (usually small) buckets. Also, big data ranges from 10 GB to petabytes, and involves more than 100 million observations — usually billions of observations per year, or sometimes billions per day or thousands per second. Note that in some cases, big data can be sparse or highly redundant and contain far less information than one might think. This is the case with social network data or streaming videos and, generally speaking, with any data set that can be compressed efficiently.

CROSS-REFERENCE Cross-correlation analysis is one of the first steps (part of the exploratory data analysis) in analyzing big data, along with the creation of a data dictionary. You can read more about these data-processing steps later in this book.

Now let's review a bit of mathematics to estimate the chance of being wrong when detecting a high correlation. You could do Monte Carlo simulations to compute the chance in question, but here you'll use plain old-fashioned statistical modeling.

Consider a new parameter, denoted as m, representing the number of paired (bivariate) independent time series selected out of the set of k time series at your disposal. You want to compute correlations for these m pairs of time series. Here's a theoretical question: assuming you have m independent paired time series, each consisting of n numbers generated via a random number generator (an observation being, for example, a simulated normalized stock price at a given time for two different stocks), what are the chances that at least one of the m correlation coefficients is higher than 0.80?

Under this design, the theoretical correlation coefficient (as opposed to the estimated correlation) is 0. To answer the question, assume (without loss of generality) that the time series (after a straightforward normalization) are Gaussian white noise. Then the estimated correlation coefficient, denoted as r, is (asymptotically — that is, approximately when n is not small) normal, with mean = 0 and variance = 1 / (n−1). The probability that r is larger than a given large number a (say, a = 0.80, meaning a strong correlation) is p = P(r>a), with P representing a normal distribution with mean = 0 and variance = 1/(n−1). The probability that among the m bivariate (paired) time series at least one has a correlation above a = 0.80 is thus equal to 1−[(1−p)^m] (1−p at power m).
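
These two formulas are easy to evaluate numerically. The following minimal Python sketch uses the normal approximation just described and reproduces the two percentages listed next; the exact figures depend slightly on how the tail probability is computed, which is why its output only approximately matches the numbers below.

  from math import sqrt
  from statistics import NormalDist

  def prob_spurious_correlation(n, m, a):
      # Probability that at least one of m independent correlation estimates,
      # each based on n observations of pure noise, exceeds the threshold a.
      # Uses the approximation r ~ Normal(mean = 0, variance = 1/(n-1)).
      p = 1 - NormalDist(mu=0, sigma=1 / sqrt(n - 1)).cdf(a)  # p = P(r > a)
      return 1 - (1 - p) ** m

  print(prob_spurious_correlation(n=20, m=10_000, a=0.80))   # roughly 0.91
  print(prob_spurious_correlation(n=20, m=100_000, a=0.90))  # roughly 0.98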

Take the following for instance:

  • If n = 20 and m = 10,000 (10,000 paired time series, each with 20 observations), then the chance of finding at least one correlation above a = 0.80 purely by chance, and thus reaching a wrong conclusion, is 90.93 percent.
  • If n = 20 and m = 100,000 (still a relatively small value for m), then the chance of finding at least one correlation above a = 0.90, a very wrong conclusion, is 98.17 percent.

Now, in practice it works as follows: You have k metrics or variables, each one providing a time series computed at n different time intervals. You compute all cross-correlations; that is, m = k*(k−1)/2. However, the assumption of independence between the m paired time series is now violated, thus concentrating correlations further away from a high value such as a = 0.90. Also, your data are not random numbers — it's not white noise. So the theoretical correlations are much higher than absolute 0, maybe approximately 0.15 when n = 20. Further, m will be much higher than, for example, 10,000 or 100,000 even when you have as few as k = 1,000 time series (for example, one time series for each stock price). These three factors (non-independence, theoretical r different from 0, and very large m) balance out and make the preceding computations quite accurate when applied to a typical big data problem. Note that the probabilities p=P(r>a) were computed using the online calculator stattrek. This online tool provides an immediate answer — you don't need to run a SAS, Python, or R program to get the number — you can compute that number from your iPad on a plane.

Conclusion

As this problem shows, it is crucial to have the right data scientist working on big data projects such as this one. You do not need to be highly technical, but you must think in a way similar to the preceding argument to identify possible causes of model failures before developing a statistical or computer science algorithm. Being a statistician helps, but you don't need to have advanced knowledge of statistics. Being a computer scientist also helps in scaling your algorithms and making them simple and efficient. Being an MBA analyst can help with understanding the problem that needs to be solved. Being all three of these at the same time is even better — and yes, such people do exist and are not that rare.

A real data scientist, faced with processing big data sets, would first know the intuitive fact that if you look at billions of patterns, some will look extreme just by chance. The real data scientist will favor robust over standard metrics as outlined here, especially when looking at, comparing, and scoring billions of data bins. He will use simulations to assess whether a pattern is indeed as rare as initially thought, and will use metrics that compare whole distributions (adjusted for size) rather than metrics representing a narrow, single summary, such as mean or correlation, depending on the context.

Finally, fake data scientists can be identified by asking people if they agree with the following three statements. If they answer “yes” to any or all of them, they are not real data scientists:

  • Algorithm complexity (the O notation) is more important than time spent in data transfers. (Wrong: With Hadoop and distributed clustered architectures, optimizing data transfer and in-memory usage is often more important than algorithmic complexity.)
  • Statistics, computer science, and machine learning cannot be mastered by one person. (Wrong: A data scientist should have good knowledge and some domains of specialization in all these areas.)
  • Business expertise is not important. A data set is a data set, no matter the business model. (Wrong: This is tunnel vision. Business models are now far more complex than in the past, and domain expertise is necessary. For example, if you develop a CAPTCHA system but are unaware that spammers have developed systems to bypass your captcha entirely, your captcha system has little if any value.)

TECHNICAL NOTE

Here is an example that illustrates the complex structure of correlations discussed in this section, and the fact that they are not independent of each other. Say you have three random variables X, Y, and Z, with corr(X,Y) = 0.70 and corr(X,Z) = 0.80. What is the minimum value for corr(Y,Z)? Can this correlation be negative?

The answer is 0.13, which is a positive number, as it must be. The proof is based on the fact that the correlation matrix is positive semidefinite, and thus its determinant must be ≥ 0 (see http://bit.ly/1evFjBJ for details).
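
As a sanity check, the bound can be verified numerically. The short sketch below (function names are my own) computes the smallest admissible third correlation from the positive semidefinite constraint and confirms it with the determinant test.

  import numpy as np

  def min_third_correlation(r_xy, r_xz):
      # Smallest corr(Y,Z) keeping the 3x3 correlation matrix positive semidefinite.
      return r_xy * r_xz - np.sqrt((1 - r_xy**2) * (1 - r_xz**2))

  def det_corr(r_xy, r_xz, r_yz):
      # Determinant of the 3x3 correlation matrix built from the three correlations.
      m = np.array([[1, r_xy, r_xz], [r_xy, 1, r_yz], [r_xz, r_yz, 1]])
      return np.linalg.det(m)

  low = min_third_correlation(0.70, 0.80)
  print(round(low, 2))                          # 0.13
  print(det_corr(0.70, 0.80, low) >= -1e-12)    # True: right at the boundary
  print(det_corr(0.70, 0.80, low - 0.05) < 0)   # True: below the bound, not a valid matrix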

When Data Flows Too Fast

A second big data challenge is figuring out what to do when data flows faster and in larger amounts than can be processed. Typically, this challenge presents itself in one of two ways: data that cannot be processed using current technology, and data that can be processed now but is still difficult to manage due to its size.

Handling Unprocessable Data

In some situations, no matter what processing and what algorithm are used, enormous amounts of data keep piling up so fast that you need to delete some of it (a bigger proportion every day) before you can even look at it.

An example of this is astronomical data used to detect new planets, new asteroids, and so on. The data keep coming faster and in larger amounts than can be processed on the cloud using massive parallelization. Maybe good sampling is a solution: carefully selecting which data to analyze and which data to ignore, before even looking at the data.

Another possibility is to develop better compression algorithms so that one day, when you have more computing power, you can analyze all of the data previously collected, compressed, and stored. Maybe in the year 2030, for example, you'll be able to look at the hourly evolution of a far-away supernova that began in 2010 and continued over a period of 20 years but was undetected in 2010 because the data was parked on a sleeping server.

In general, solutions include sampling and/or compression in cases where it makes sense. Data scientists should get familiar with robust sampling techniques, unbalanced samples, and experimental design to build predictive models. Too frequently the approach used is pure computer science (processing the whole data set), due to lack of proper training. It is sometimes argued that you cannot use sampling to compute statistics such as the number of unique users visiting a website in a month. This argument stems from a lack of statistical knowledge, and from the fact that an untrained analyst using samples to make such computations is indeed likely to get his numbers wrong. For instance, he might sample 3 days out of 30 (including a Sunday with low traffic) and then multiply by 10 to get a total count. That approach works for page views (assuming the weekday-versus-weekend bias is corrected), but it does not work for unique users. For unique users, the multiplier must be carefully estimated, and it will be well below 10 unless all of your users are one-time visitors.
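
A small simulation makes the page-view versus unique-user distinction concrete. The sketch below uses made-up traffic (the number of users, their visit days, and page views per visit are all hypothetical) and shows that the naive x10 multiplier recovers monthly page views reasonably well but badly overstates monthly unique users.

  import random

  random.seed(1)
  days_in_month = 30

  # Hypothetical traffic: 50,000 users, each active on 1 to 8 random days,
  # generating 3 page views per active day.
  users = {u: random.sample(range(days_in_month), random.randint(1, 8))
           for u in range(50_000)}

  sample_days = set(random.sample(range(days_in_month), 3))   # 3-day sample
  multiplier = days_in_month / len(sample_days)               # naive x10

  page_views_month = sum(3 * len(days) for days in users.values())
  page_views_sample = sum(3 * sum(d in sample_days for d in days)
                          for days in users.values())
  uniques_month = len(users)
  uniques_sample = sum(any(d in sample_days for d in days) for days in users.values())

  print(page_views_sample * multiplier / page_views_month)   # close to 1: fine for page views
  print(uniques_sample * multiplier / uniques_month)         # far above 1: overcounts uniques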

Sampling has been used extensively on rather big data — for instance, by the census bureau. Compression can take two forms: text or image compression for data rarely accessed, or saving compact statistical summaries (tables) of old data rather than saving the entire data set.

CROSS-REFERENCE See the section Using Big Data Versus Sampling in Chapter 7 for more information on cases where sampling is not feasible.

Processable Yet Difficult Data

This type of situation arises when data comes in fast (sometimes in real time) and in big amounts, yet all of it can be processed with modern, fast, distributed, MapReduce-powered algorithms or other techniques. The problem that presents itself is that the data are so vast, the velocity so high, and sometimes the data so unstructured that it can be processed only by crude algorithms, which typically result in bad side effects. Or at least that's what your CEO thinks. How do you handle this type of situation?

Let's focus on a few examples, particularly ones that relate to the CEO's perception of the problem, which is the most relevant to businesses. Consider the following data situations:

  • Credit card transaction scoring in real time
  • Spam detection
  • Pricing/discount optimization in real time (retail)
  • Antimissile technology (numerical optimization issue)
  • High-frequency trading
  • Systems relying on real-time 3-D streaming data (video and sound) such as autopilot (large planes flying on autopilot at low elevation in overcrowded urban skies)
  • Facebook Likes, Twitter tweets, Yelp reviews, matching content with user (Google), and so on

Using crude algorithms on data sets such as these could result in the following:

  • Too many false positives or false negatives and undetected fake reviews. For example, fake tweets could result in stock market repercussions, or fake press releases about the White House being attacked could cause a nationwide panic, yet these fake tweets may fail to be detected in real time by Twitter. This could be addressed with appropriate data governance policies, as well as by the website companies themselves; Yelp, for example, has sued a reviewer over bogus reviews (http://www.vancouversun.com/technology/personal-tech/Yelp+sues+North+Vancouver+reviewer+bogus+reviews/8958494/story.html). Google is doing the same — punishing black hat search-engine optimization (SEO) and improving their algorithms at the same time.
  • Billions of undetected fake Likes on Facebook, creating confusion for advertisers and eventually a decline in ad spending. Fake Likes also have a bad impact on users, who eventually ignore the Likes, further reducing an already low click-through rate and lowering revenue. Facebook would have to come up with some new feature to replace the Likes, and the new feature would eventually be abused again, with the abuse not detected by machine learning algorithms or other means, perpetuating a cycle of microbubbles that must be permanently kept alive with new ideas to maintain revenue streams.

By crude algorithms, I mean Naive Bayes classifiers or algorithms that are too sensitive to outliers, such as huge decision trees or linear regressions on nonlinear (skewed) data. More sophisticated algorithms include a blending of multiple classifiers, or using resampling such as in bagged or boosted classifiers. Some of them are fast and easy to implement in MapReduce.

CROSS-REFERENCE See Chapter 5 for a discussion of hidden decision trees.

At issue is deciding which data can be dumped, which data is rarely accessed, and which fields can be ignored. (I've seen log files where the user agent takes 80 percent of the space. Why not use a lookup table of user agents to reduce data volume?) Also, MapReduce environments such as Hadoop become interesting when you start collecting a lot of unstructured data (for instance, text data) or blending data from multiple sources. Eventually you might be tempted to categorize and add a structure to your unstructured data using taxonomy creation algorithms such as the one discussed in this book, or by getting all text tagged when it is created or using metadata stored in file headers. Then the data is easier to retrofit and process in traditional databases. But the trend now is for vendors to develop solutions where analysts can use SQL to access low-level data buried under the Hadoop file management system, not the other way around.

Solution

In many cases, there is too much reliance on crowdsourcing and a reluctance to use sampling techniques, mainly because there are few experts who know how to get great samples out of big data. In some cases, you still need to come up with granular predictions (not summaries) — for instance, house prices for every house (Zillow) or weather forecasts for each ZIP code. Yet even in these cases, good sampling would help.

Many reviews, Likes, Tweets, or spam flags are made by users (sometimes Botnet operators, sometimes business competitors) with bad intentions of gaming the system on a large scale. Greed can also be part of the problem. Let's consider a hypothetical scenario: If fake Likes generate revenue for Social Network A and advertisers don't notice the fakes, Social Network A could feed the advertisers with more fake Likes because Social Network A thinks it doesn't have enough relevant traffic to serve all of its advertisers, so it wants to send good traffic to the bigger clients (with higher ad spend). When clients notice this is happening, Social Network A could either discontinue the practice or wait until it gets hit with a class action lawsuit — $90 million is peanuts for today's enormous social networks, and that's what Google and others settled for when they were hit with a class action lawsuit for delivering fake traffic.

Yet there is a solution that benefits everyone (users, clients, and companies such as Google, Amazon.com, Netflix, Facebook, and Twitter): better use of data science. This isn't about developing sophisticated, expensive statistical technology; it is about simply using the following:

  • Better metrics
  • Different weights (for example, put less emphasis on data resulting from crowdsourcing)
  • Better linkage analysis and association rules to detect collusion, Botnets, and low-frequency yet large-scale fraudsters
  • Better and more frequently updated lookup tables (white lists of IP addresses)

All of this can be done without slowing down existing algorithms. Here's one example for social network data: instead of counting the number of Likes (not all Likes are created equal), you could do any of the following (a toy scoring sketch follows this list):

  • Look at users that produce hundreds of Likes per day, as well as Likes arising from IP addresses that are flagged by Project Honeypot or Stop Forum Spam. But don't put all your trust in these two websites — they also, at least partially, rely on crowdsourcing (users reporting spam) and thus are subject to false positives and abuse.
  • Identify situations such as 200 Likes that result in zero comments or 50 versions of the same “this is great” or “great post” comments — clearly you are dealing with a case of fake Likes. The advertiser should not be charged and the traffic source should be identified and discontinued.
  • Look at buckets of traffic with a high proportion of low-quality users coming back too frequently with two Likes a day, with red flags such as no referral domain or tons of obscure domains, IP address frequently not resolving to a domain, traffic coming from a sub-affiliate, and so on.
  • Use metrics such as unique users that are much more robust (more difficult to fake) than page views (but you still need to detect fake users).
  • Use a simple, robust, fast, data-driven (rather than model-driven) algorithm, such as hidden decision trees, to score Likes and comments. You can even compute confidence intervals for scores without any statistical modeling.
  • Create or improve ad relevancy algorithms with simple taxonomy concepts. This applies to many applications where relevancy is critical, not just ad delivery. It also means better algorithms to detect influential people (in one example, you had to be a member of PeerIndex to get a good score), to better detect plagiarism, to better detect friends, to better score job applicants, to improve attribution systems (advertising mix optimization, including long-term components in statistical models), and the list goes on and on.
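
To make the list above more concrete, here is a deliberately simple rule-based scorer. Everything in it is hypothetical: the field names, the thresholds, and the IP blacklist are illustrative stand-ins, not anyone's actual production logic.

  SUSPECT_IPS = {"203.0.113.7", "198.51.100.23"}   # e.g., from a honeypot-style blacklist

  def score_like(like):
      # Return a crude quality score in [0, 1] for a single Like event.
      score = 1.0
      if like["ip"] in SUSPECT_IPS:
          score -= 0.5
      if like["likes_by_user_today"] > 100:                        # hyperactive account
          score -= 0.3
      if not like["referral_domain"]:                              # no referrer at all
          score -= 0.1
      if like["comment"] in ("", "great post", "this is great"):   # boilerplate comment
          score -= 0.1
      return max(score, 0.0)

  example = {"ip": "203.0.113.7", "likes_by_user_today": 250,
             "referral_domain": "", "comment": "great post"}
  print(score_like(example))   # 0.0: probably not a Like worth charging the advertiser for

In practice you would calibrate such weights against labeled data (or use something like the hidden decision trees mentioned above) rather than hard-coding them.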

The problem of detecting Facebook Likes is part of a more general class of business techniques that includes scoring Internet traffic such as clicks, keywords, members, page impressions, and so on. The purpose is twofold: detecting fraud (fake users, fake clicks, duplicate users), and assigning a quality score (a proxy for conversion rate or value) to each element (click, keyword, user) being measured.

CROSS-REFERENCE You will find more information on fraud detection in Chapter 6.

Of course, in the case of Facebook advertising, savvy advertisers use other metrics to measure return on investment (ROI) on ad spend. They use customer lifetime value and/or engagement on their own websites. For example, Starwood Hotels presented at a global conference, explaining that the company measures an “engagement index” that includes how its fans share the hotel's content (virality and fan acquisition), patronage of the hotels, and engagement rate on its website. As the industry (advertisers, publishers, and users) matures, the black market selling Facebook Likes will get smaller and smaller, but bad guys will still find new ways to exploit new features introduced by Facebook or any publisher, as they've always done.

CROSS-REFERENCE The section Attribution Modeling in Chapter 6 provides information on a “macro-economic” way to measure ROI on ad spend. This macro-economic approach is much cheaper in terms of data storage, tracking, and processing, because you only look at big aggregates.

Note that traffic scoring is especially useful when bidding in real time, or when purchasing large numbers of new keywords for which no historical conversion data is available (eBay is doing this every day). Better algorithms and data governance (creating standards, as the IAB, the Interactive Advertising Bureau, did with click and user metrics) will help to eradicate these problems, for instance by defining what counts as a Like and which Likes should be discarded.

When Data Insights Flow Too Fast

When you can, you should analyze data (usually in real time, with automated algorithms) and extract insights even faster than they can be delivered to and digested by the end users (executives and decision makers). But you then need to do something with those insights before passing them along to the end users: it's bad when the decision makers get flooded with tons of unprioritized reports.

The solution is to prioritize reports or alerts (in a dashboard system), determine who receives what, provide good visualizations, and allow users to unsubscribe from automated reports sent by e-mail. It is also good policy to provide only summaries and light graphs in an e-mail message, along with a link to the full report; some CEOs complain about their e-mail inboxes being regularly full due to multiple large daily reports.

CROSS-REFERENCE See Chapter 4 for information on visualization tools.

Sometimes this type of situation arises from machines talking to machines; for example, eBay automatically pricing millions of bid keywords every day and automatically feeding those prices to Google AdWords via the Google API. Too many engineers receive detailed reports every day about this activity, when a summary would be good enough.

Examples of Big Data Techniques

Let's now look at some of the techniques and tools that data scientists use to handle data in light of these two big data issues. For details on how a data science project is executed, including the different steps, see the section Life Cycle of Data Science Projects in Chapter 5.

Big Data Problem Epitomizing the Challenges of Data Science

The example problem discussed here is about building better search tools. This is the kind of problem from which big data and data science originate.

Have you ever done a Google search for mining data? It returns the same results as for data mining. Yet these are two different keywords: mining data usually means data about mining (as in coal, gold, and so on). If you search for data about mining you get the same results.

Yet Google has one of the best search algorithms out there. Imagine an e-store selling products: it allows users to search for products via a catalog powered with search capabilities, but it returns irrelevant results 20 percent of the time. What a loss of money! Indeed, if you were an investor looking on Amazon.com to purchase a report on mining data, all you would find are books on data mining, and you wouldn't buy anything, which could lead to a substantial loss for Amazon.com. Repeat this millions of times a year, and the opportunity cost is billions of dollars.

Following is a discussion of a workable solution for this type of problem and an explanation in simple terms that can help any business and analytic team to easily fix such a problem.

Any search query entered by a user is processed using the following steps (a simplified sketch of the first four steps follows the list):

  1. The user query is cleaned: typos are removed.
  2. It is then stemmed: plural is replaced by singular, verbs (ing form) are replaced by nouns (thus mining becomes mine), and so on.
  3. Stop words are removed: the, or, and, of, from, about, it, with, and so on. For instance, IT jobs becomes jobs, and data about mining becomes data mining and (because of step 2) data mine.
  4. The query is normalized: the tokens are rearranged in alphabetical order (mine data becomes data mine).
  5. Algorithms are used to determine if the query contains the name of a city, book, song, product, or movie, using lookup tables.
  6. Other algorithms are used to detect whether two tokens must be attached. For example, San Francisco is one token, not two, even though it looks like two.
  7. Synonyms are detected. For example, perhaps automobile can be replaced by car to reduce the size of the index.
  8. An algorithm is used to detect user intention (sometimes based on a user profile) for queries spanning multiple categories.
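
The following toy implementation covers only steps 1 through 4, with tiny hard-coded lookup tables standing in for the real dictionaries; steps 5 through 8 need large lookup tables and user profiles and are omitted. Note also the exercise below, which argues that stop-word removal should actually precede stemming.

  TYPO_FIXES = {"insurence": "insurance"}                                 # step 1 (illustrative)
  STEMS = {"mining": "mine", "jobs": "job", "prices": "price"}            # step 2 (illustrative)
  STOP_WORDS = {"the", "or", "and", "of", "from", "about", "it", "with"}  # step 3

  def normalize_query(query):
      tokens = query.lower().split()
      tokens = [TYPO_FIXES.get(t, t) for t in tokens]        # 1. clean typos
      tokens = [STEMS.get(t, t) for t in tokens]             # 2. stem
      tokens = [t for t in tokens if t not in STOP_WORDS]    # 3. remove stop words
      return " ".join(sorted(tokens))                        # 4. normalize (alphabetical order)

  print(normalize_query("data about mining"))   # 'data mine'
  print(normalize_query("mining data"))         # 'data mine': same index entry as above
  print(normalize_query("IT jobs"))             # 'job': exactly the problem described below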

If you search for IT jobs, you get all jobs. In my case, when I search for “IT jobs,” the search results include all jobs in Issaquah, Washington, where I live. Google correctly identified that I live in Issaquah (via my IP address), but wrongly assumed that I am interested in local jobs, and forgot that I'm only interested in IT jobs. The first job ad I clicked was for an insurance agent.

EXERCISE

Question: Should removing stop words be step 1 rather than step 3?

Answer: Have you thought about the fact that mine and yours could also be stop words? So in a bad implementation, data mining would become data mine after stemming, then data, which could trigger search results associated with one of the most popular data keywords: data entry. In practice, you remove stop words before stemming. So step 3 should indeed become step 1.

If you use exact match in your Google search, you get slightly better results for “IT jobs,” “mining data,” and “data about mining.” So Google seems to have a solution. I believe the real fix is to make exact match the default option, rather than broad match. How many people know about exact match? Probably less than 1 percent. Anyway, as stated, it's more a business issue than a technical one.

I can imagine Google benefiting from showing poor results (setting the default option to broad match). Indeed, it makes users more likely to click the Google ads, and thus boosts Google revenue. But for Amazon.com, it does not make sense. Maybe Amazon.com's search box is powered by Google. If that's the case, then Amazon.com should configure the default search setting as exact match, not broad match. But I tried exact match on “data about mining” and it did not improve the search results on Amazon.com. (In fact, it made them even worse.)

Clearly, Amazon.com has a more severe problem, both from technical and business points of view. I think the problem is worse on Amazon.com than Google because Amazon.com has a much smaller inventory of search results pages from which to choose, because Amazon.com's inventory is mostly products, such as books, whereas Google's inventory is the entire Internet.

Clustering and Taxonomy Creation for Massive Data Sets

This section discusses potential algorithms that can perform clustering extremely fast on big data sets, as well as the graphical representation of such complex clustering structures. By “extremely fast” I mean a computational complexity on the order of O(n) and sometimes even faster, such as O(n/log n). This is much faster than good hierarchical agglomerative clustering (see http://en.wikipedia.org/wiki/Hierarchical_clustering), which is typically O(n^2 log n). For big data, this means several million or even possibly a billion observations.

Clustering algorithms are slow, and matrices used in clustering algorithms take a lot of in-memory space (the square of the number of observations — in this case it would be more than a quadrillion elements). No memory is big enough to store such matrices. Yet these matrices are extremely sparse. In this section, we address these issues: speed improvement, memory reduction, and memory optimization with MapReduce. Clustering is one of the most fundamental machine learning, statistical, and big data techniques. Its output is called a taxonomy in the context of unstructured text data.

This technology has been used to produce keywords representing “user intent.” Blending keywords representing the same intent (for instance, buying car insurance in California) into the same cluster allows advertisers to purchase these pre-computed clusters rather than individual keywords. The advantage to them is getting better targeting by focusing on intent, and a bigger reach by making sure all potential keywords of interest are included in the clusters in question.

Potential Applications

Creating a keyword taxonomy means categorizing the entire universe of cleaned (standardized), valuable English keywords. This is about 10 million keywords made up of one, two, or three tokens — approximately 300 times the number of keywords found in an English dictionary. The purpose might be to categorize all bid keywords that could be purchased by eBay and Amazon.com on Google (for pay-per-click ad campaigns) to better price them. The application of this is discussed here in terms of the following:

  • Clustering millions of documents (for example, books on Amazon.com)
  • Clustering web pages, or even the entire Internet, which consists of approximately 100 million top websites and billions of web pages
  • Determining when it makes sense to perform such massive clustering, and how MapReduce can help

NOTE In case you are not familiar with this terminology, a token is a term in a user query. The keyword car insurance Alabama has three tokens. N-grams are different combinations of these tokens — for instance, insurance Alabama car and Alabama insurance car.

Part 1: Building a Keyword Taxonomy

The solution described here involves two steps: preprocessing and clustering.

Step 1: Preprocessing

You gather tons of keywords over the Internet with a web crawler (crawling Wikipedia or DMOZ directories) and compute the frequencies for each keyword and for each keyword pair. A keyword pair is two keywords found on the same web page or close to each other on the same web page. A keyword might be something like California insurance; a keyword usually contains more than one token, but rarely more than three. Using all these frequencies, you can create a table (typically containing millions of keywords, even after keyword cleaning) like the following one, where each entry is a pair of keywords and three numbers.

A = California insurance
B = home insurance
x = 543
y = 998
z = 11

where:

  • x is the number of occurrences of keyword A in all the web pages that you crawled.
  • y is the number of occurrences of keyword B in all the web pages that you crawled.
  • z is the number of occurrences where A and B form a pair (for example, they are found on the same page).

This “keyword pair” table can indeed be easily and efficiently built using MapReduce. Note that the vast majority of keywords A and B do not form a keyword pair; in other words, z = 0. So by ignoring these null entries, your “keyword pair” table is still manageable and might contain as few as 100 million entries.
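
Here is a tiny in-memory sketch of this counting step, using a handful of made-up pages. At scale, this is exactly the kind of job MapReduce handles well: the map step emits (keyword, 1) and (keyword pair, 1) records per page, and the reduce step sums them.

  from collections import Counter
  from itertools import combinations

  # Toy stand-in for crawled pages: each "page" is the list of keywords found on it.
  pages = [
      ["california insurance", "home insurance", "car insurance"],
      ["california insurance", "home insurance"],
      ["home insurance", "mortgage rates"],
  ]

  keyword_counts = Counter()   # x and y: occurrences of each keyword
  pair_counts = Counter()      # z: co-occurrences of a keyword pair on the same page

  for page in pages:
      keyword_counts.update(set(page))
      for a, b in combinations(sorted(set(page)), 2):
          pair_counts[(a, b)] += 1

  a, b = "california insurance", "home insurance"
  x, y, z = keyword_counts[a], keyword_counts[b], pair_counts[(a, b)]
  print(x, y, z)   # 2 3 2 on this toy corpus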

NOTE Step 1 constitutes the final step of a number of interesting applications. For instance, it is used in search engine technology to identify or recommend keywords related to other keywords. An example of such an application is presented in Chapter 5 in the section Source Code for Keyword Correlation API. Interestingly, a few years ago a similar app was licensed to search engines by Ask for $500,000 per year.

Step 2: Clustering

To create a taxonomy, you want to put the identified keywords into similar clusters. One way to do this is to compute a similarity measure d(A,B) between two keywords A and B. For instance, d(A, B) = z / SQRT(x * y), although other choices are possible. Note that the denominator prevents extremely popular keywords (for example, “free”) from being close to all the keywords and from dominating the entire keyword relationship structure. Indeed, it favors better keyword bonds, such as lemon with law or pie, rather than lemon with free.

The higher d(A, B) is, the closer keywords A and B are to each other. Now the challenge is to perform some kind of clustering — for example, hierarchical — on the “keyword pair” table using this kind of similarity measure. Part 2 presents the solution and a potential alternative solution for your consideration.

Part 2: Using a Fast Clustering Algorithm

Although the following algorithm is described in the context of keyword clustering, it is straightforward to adapt it to other contexts. Assume that you have n = 10,000,000 unique keywords and m = 100,000,000 keyword pairs {A, B}, where d(A,B)>0. That is, you have an average of r=10 related keywords attached to each keyword.

The algorithm proceeds incrementally in several (five or six) rounds, as follows (a compact sketch in Python appears after the list):

  • Initialization (Round 0): the small data (or seeding) step. Select 10,000 seed keywords, create (for example) 100 categories, and build a hash table $hash whose key is one of the 10,000 seed keywords and whose value is the list of categories the keyword is assigned to.
    $hash{“cheap car insurance”} = {“automotive”, “finance”}

    The choice of the initial 10,000 seed keywords is very important. Until more research is done on this subject, I suggest picking the top 10,000 keywords in terms of number of associations — that is, keywords A with many Bs where d(A,B)>0. This will speed up the convergence of the algorithm.

  • Round 1: The big data step. Browse the table of m keyword pairs, from beginning to end. When you find a pair {A, B} where, for example, $hash{A} exists and $hash{B} does not, do the following:
    • $hash{B} = $hash{A}
    • $weight{B} = d(A,B)

    When you find a pair {A, B} where both A and B are already in $hash, do this:

    • If $d(A,B) > $weight(B) then { $hash{B} = $hash{A}; $weight{B} = $d(A, B) } ; # B gets re-categorized to A's category.
    • If $d(A,B) > $weight(A) then { $hash{A} = $hash{B}; $weight{A} = $d(A, B) } ; # A gets re-categorized to B's category.
  • Round 2: Repeat Round 1. ($hash and $weight are kept in memory and keep growing at each subsequent round.)
  • Round 3: Repeat Round 1.
  • Round 4: Repeat Round 1.
  • Round 5: Repeat Round 1.
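
The sketch below mirrors the $hash/$weight notation above on toy data; the seed keywords, categories, and similarity values are all invented for illustration. Seeds are given infinite weight so they are never re-categorized.

  seed = {"cheap car insurance": {"finance"}, "apple pie recipe": {"food"}}

  # (A, B, d) triples from the keyword-pair table; d is the similarity d(A, B).
  pairs = [
      ("cheap car insurance", "auto insurance quote", 0.9),
      ("auto insurance quote", "car loan", 0.4),
      ("apple pie recipe", "pie crust", 0.8),
      ("pie crust", "car loan", 0.1),       # a weak, misleading association
  ]

  category = dict(seed)                      # plays the role of $hash
  weight = {k: float("inf") for k in seed}   # plays the role of $weight

  for round_number in range(5):              # Rounds 1 through 5
      for a, b, d in pairs:
          for x, y in ((a, b), (b, a)):
              if x in category and (y not in category or d > weight[y]):
                  category[y] = category[x]  # y gets (re-)categorized to x's category
                  weight[y] = d

  print(category["car loan"])   # {'finance'}: reached via 'auto insurance quote'

Each round is one sequential pass over the pair table, which is what keeps the overall cost close to linear in the number of pairs (before accounting for hash-table growth, as noted next).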

The computational complexity is q * m=O(n), with q being the number of rounds. This is n=10,000,000 times faster than good clustering algorithms. However, all these hash-table accesses will slow it a bit to O(n log n), as $hash and $weight grow bigger at each subsequent round.

Would presorting the big table of m pairs help (sorting by d(A, B))? It would allow you to drastically reduce the number of hash-table accesses by eliminating the need for the recategorizations, but sorting is an O(n log n) process, so you would not gain anything. Note that sorting can be efficiently performed with MapReduce. The reduce step, in this case, consists of merging a bunch of small, sorted tables.

This clustering algorithm seems (at first glance) easy to implement using MapReduce. However, because the big table has only 100,000,000 entries, it might fit in RAM.

You can improve computational complexity by keeping the most important m/log n entries (based on volume and d(A,B)) in the big table and deleting the remaining entries. In practice, deleting 65 percent of the big table (the very long tail only, but not the long tail from a keyword distribution point of view) will have little impact on the performance — you will have a large bucket of uncategorized keywords, but in terms of volume, these keywords might represent less than 0.1 percent.

Alternative Algorithm

Alternatively, you could use Tarjan's strongly connected components algorithm to perform the clustering. To proceed, you first bin the distances: d(A, B) is set to 1 if it is above some pre-specified threshold; otherwise it is set to 0. This is a graph theory algorithm, where each keyword represents a node, and each pair of keywords where d(A, B) = 1 represents an edge. The computational complexity of the algorithm is O(n+m), where n is the number of keywords and m is the number of pairs (edges). To take advantage of this algorithm, you might want to store the big “keyword pair” table in a graph database (a type of NoSQL database).
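
Since the thresholded keyword graph is undirected, plain connected components give the same clusters; here is a minimal union-find sketch (with my own toy data and threshold) that produces them without any graph database.

  THRESHOLD = 0.5
  pairs = [("car insurance", "auto insurance", 0.9),
           ("auto insurance", "insurance quote", 0.7),
           ("apple pie", "pie crust", 0.8),
           ("apple pie", "car insurance", 0.1)]   # below threshold: no edge

  parent = {}

  def find(x):
      parent.setdefault(x, x)
      while parent[x] != x:
          parent[x] = parent[parent[x]]   # path halving keeps the trees flat
          x = parent[x]
      return x

  def union(x, y):
      parent[find(x)] = find(y)

  for a, b, d in pairs:
      if d >= THRESHOLD:                  # binning step: keep only strong edges
          union(a, b)
      else:
          find(a)
          find(b)                         # still register the isolated keywords

  clusters = {}
  for keyword in parent:
      clusters.setdefault(find(keyword), []).append(keyword)
  print(list(clusters.values()))          # two clusters: insurance terms and pie terms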

Other Considerations

Visualization: How do you represent these keywords, with their cluster structure determined by d(A, B), in a nice graph? Ten million keywords would fit in a 3,000 x 3,000-pixel image. If you are interested in graphical representations, see the Fruchterman and Reingold algorithm that is extensively used to display cluster structures. Note that its computational complexity is O(n^3), so you need to significantly improve it for this keyword clustering application — including the graphical representation. The graphical representation could be a raster image with millions of pixels, like a heat map where color represents category, and when you point to a pixel, a keyword value shows up (rather than a vector image with dozens of nodes). Neighboring pixels would represent strongly related keywords.

Need: Do you really need this clustering algorithm? Most of the time you are trying to answer a simpler question (for example, which keywords are related to keyword A), or you already have a taxonomy and want to extend or improve it. In this case, full clustering is not necessary. But it's nice to know that efficient, simple algorithms exist, if you need them.

Type of science: Is this stuff statistical science, computer science, or data science?

Data sets: My own big table of keyword pairs, including the d(A, B) values, is available for you to download, as described in the next section. In addition, DMOZ data (one million categorized URLs) can be downloaded for free at https://ak.quantcast.com/quantcast-top-million.zip and is a good starting point to extract millions of keywords and categories, either from the data set itself or by crawling all the URLs and related links using distributed architecture. Quantcast also offers a free list of the top one million domain names. Finally, a good source of keyword data is query logs from search engines. These query logs are particularly useful for the preprocessing discussed in Step 1.

Excel with 100 Million Rows

Like most data scientists, I've used Excel a lot in my career, and it definitely has some powerful features. Probably the greatest one is its ability to help design, test, and update models in real time — just change the value of a few core parameters, and all your cells (thousands of them) and all your charts are updated at once.

NOTE To see an example of an interactive spreadsheet where all cells and charts get updated with just one click, download the spreadsheet at http://www.analyticbridge.com/profiles/blogs/three-classes-of-metrics-centrality-volatility-and-bumpiness.

Other nice features include Excel's ability to connect to databases or Microsoft APIs via the Internet (for example, to the Bing API), extract useful information, and summarize it in cubes and pivot tables. Although cubes and pivot tables have a strong feel of being an old-fashioned, SQL relational database environment, they are still useful in many contexts.

The main drawback is Excel's slowness. It is slow in ways that are unexpected. If you sort 500,000 observations in one column it's actually quite fast. But say that you simultaneously sort two columns: A and B, where B is the log of A values. So B contains a formula in each row. This dramatically slows down the sort. It is much faster to sort A alone and leave B as is. The formulas in B will correctly update all the values, quickly and automatically.

As a general rule, any operation that involves recomputing, for example, 200,000+ rows across multiple cells linked via dependence relationships, is done slowly in these cases:

  • One of the columns includes functions such as VLOOKUP at each cell.
  • SORT or other operations that scale worse than linearly are required (that is, processes that are O(n log n) or worse).

There's an easy and efficient (but ugly) way around this, and it seems a bit odd that it's not a built-in feature of Excel and transparent to the user:

  • Replace VLOOKUP formulas with hard-coded values, perform the update on hard-coded values, and then put back the formulas.
  • Perform SORT only on the minimum number of columns where necessary, and then update cells in columns involving formulas.

It is also odd that VLOOKUP is so slow in the first place. I perform similar joins all the time, for example joining a three-million-row dataset with a 200,000-entry lookup table in Perl, and it's fast. In Excel, the dataset and the lookup table would be stored in two separate worksheets within the same spreadsheet, and it would take days to perform this “join,” if it were possible at all. (It isn't.) It may be that Excel indirectly performs a full join, thus dramatically slowing down the operation. But few people ever do a full join on large datasets. This is an area in which significant improvements could be made.
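
For readers who have never seen such a join outside of Excel, here is the idea in a few lines of Python rather than Perl (the column names and sample rows are made up). The lookup table is read once into a hash table, so each data row is matched in constant time, which is why this scales to millions of rows.

  import csv
  import io

  # Hypothetical data: a user-agent lookup table and a log file referencing it by ID.
  lookup_csv = "ua_id,user_agent\n1,Mozilla/5.0\n2,Googlebot/2.1\n"
  log_csv = "timestamp,ua_id,url\n2014-01-01,1,/home\n2014-01-02,2,/robots.txt\n"

  lookup = {row["ua_id"]: row["user_agent"]
            for row in csv.DictReader(io.StringIO(lookup_csv))}

  for row in csv.DictReader(io.StringIO(log_csv)):
      user_agent = lookup.get(row["ua_id"], "unknown")   # constant-time "VLOOKUP"
      print(row["timestamp"], user_agent, row["url"])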

Finally, leveraging the cloud would be a way to further speed computations and process data sets far bigger than one million rows. Microsoft could allow the user to export the data to some cloud in a transparent way, in one click (for example, just click an Export to Cloud button). Then you would simulate the time-consuming operations on the local Excel version on your desktop, which amounts to using a local Excel spreadsheet as a virtual spreadsheet. When done, you click Retrieve from Cloud and get an updated spreadsheet. The Retrieve from Cloud would do the following:

  1. Send your Excel formulas to the cloud via the Internet.
  2. Apply your formula to the cloud version of your data, leveraging MapReduce as needed.
  3. Get the processed data back to your local spreadsheet on your desktop.

Another painfully slow process is when you need to apply a formula to a whole column with 500,000 cells. Fortunately, there is a trick. Say you want to store the product of A and B in column C. You would do the following:

  1. Select the whole Column C.
  2. Enter the formula: =A1*B1.
  3. Press the Ctrl key and Enter key together.

I wish it were easier than that, something like ==A1*B1 (a formula with a double equal sign to indicate that the formula applies to the entire column, not just one cell). This is another example of how Excel is not user friendly.

Many times, there are some obscure ways to efficiently do something. You will see another example in the section Regression With Excel in Chapter 5, which can teach you how to write a formula that returns multiple cells — in particular with LINEST, which returns the regression coefficients associated with a linear regression. Yes, you can do it in basic Excel without an add-in.

If you are interested in creating a column with one million values (for example, to test the speed of some Excel computations), here's how to proceed (this would be a good job interview question):

  1. Start with 200 cells.
  2. Duplicate these cells; now you have 400.
  3. Duplicate these cells; now you have 800.

Another 10 iterations of this process, and you'll be at 800,000 cells. It's much faster than the naive approach (drag down with the mouse), and if your initial 200 numbers consist of the formula =RAND(), at the end you'll end up with one million pseudo-random numbers, though of poor quality.

Finally, languages such as Perl, Python, R, C, Sort, Grep, and others can be used to do data summarization before feeding stuff to Excel. But if Excel came with the features just discussed, much more could be done within it.

How to Upgrade Excel to Work With Big Data

You should use the PowerPivot add-in from Microsoft to work with large datasets. If you have Excel 2010, you can get and install it from powerpivot.com. If you have Office 2013, PowerPivot is included with the Office Professional Plus edition. (It is disabled by default, so you need to enable it.) If you want to do predictive analyses in Excel, you should look at Predixion Insight (www.predixionsoftware.com). It was developed by the same group of people who did the data mining add-ins at Microsoft. The Developer edition of Predixion Insight is free, and Predixion Insight has nice integration with PowerPivot.

The latest version of PowerPivot has extended external data connectors such as OData, Azure marketplace, and so on. PowerPivot is like having a proof-of-concept version of the SQL Server analysis server inside of Excel.

Also, 1010data (www.1010data.com) has developed the Trillion Row Spreadsheet, which enables companies to use a spreadsheet interface to perform data discovery and analyze trillions of rows of data, all through a cloud-based GUI. In addition to basic spreadsheet calculations, the Trillion Row Spreadsheet enables you to run advanced statistical analyses and create machine learning algorithms, all using the familiar spreadsheet paradigm. Best of all, this is done without the need to write MapReduce code.

Another solution is to use the Google spreadsheet plus BigQuery. See https://developers.google.com/apps-script/articles/bigquery_tutorial for a tutorial.

What MapReduce Can't Do

Let's now consider a large class of big data problems where MapReduce can't be used — at least not in a straightforward way — and a rather simple, analytic, statistical solution.

MapReduce is a technique that splits big data sets into many smaller ones, processes each small data set separately (but simultaneously) on different servers or computers, and then gathers and aggregates the results of all the subprocesses to produce the final answer. Such a distributed architecture allows you to process big data sets 1,000 times faster than traditional (nondistributed) designs if you use 1,000 servers and split the main process into 1,000 subprocesses.

MapReduce works very well in contexts where variables or observations are processed one by one. For instance, you analyze 1 terabyte of text data, and you want to compute the frequencies of all keywords found in your data. You can divide the 1 terabyte into 1,000 data sets, each 1 gigabyte. Now you produce 1,000 keyword frequency tables (one for each subset) and aggregate them to produce a final table.
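
A single-machine sketch of that keyword-counting example looks like the following: the map step produces one frequency table per chunk, and the reduce step merges them. A real deployment would push the same two functions through Hadoop (or a similar framework) over 1,000 chunks instead of three local processes.

  from collections import Counter
  from functools import reduce
  from multiprocessing import Pool

  def map_step(chunk_of_text):
      # Map: build a partial keyword frequency table for one chunk.
      return Counter(chunk_of_text.lower().split())

  def reduce_step(table_a, table_b):
      # Reduce: merge two partial frequency tables.
      table_a.update(table_b)
      return table_a

  chunks = ["big data is different",
            "big data flows fast",
            "data science and big data"]

  if __name__ == "__main__":
      with Pool(3) as pool:                        # three "servers"
          partial_tables = pool.map(map_step, chunks)
      print(reduce(reduce_step, partial_tables))   # Counter({'data': 4, 'big': 3, ...})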

However, when you need to process variables or data sets jointly (two at a time, or three at a time, as with cross-correlations), MapReduce offers no benefit over nondistributed architectures. You must come up with a more sophisticated solution.

The Problem

Say that your data set consists of n observations and k variables. For instance, the k variables represent k different stock symbols or indexes (say that k = 10,000) and the n observations represent stock price signals (up/down) measured at n different times. You want to find high correlations (ideally with time lags to make a profit) — for example, if Google is up today, Facebook is up tomorrow.

You have to compute k * (k−1) / 2 correlations to solve this problem, despite the fact that you have only k = 10,000 stock symbols. You cannot split your 10,000 stock symbols into 1,000 clusters, each containing 10 stock symbols, and then use MapReduce. The vast majority of the correlations that you have to compute will involve a stock symbol in one cluster and another one in another cluster (because you have far more correlations to compute than you have clusters). These cross-cluster computations make MapReduce useless in this case. The same issue arises if you replace the word correlation with any other function, say f, and compute on two variables rather than one. This is why I claim that we are dealing here with a large class of problems where MapReduce can't help.

Three Solutions

Three possible solutions to this problem are sampling, binning, and classical data reduction.

Solution 1: Sampling

Instead of computing all cross-correlations, just compute a fraction of them. Select m random pairs of variables — for example, m = 0.001 * k * (k-1)/2 — and compute correlations for these m pairs only. A smart strategy consists of starting with a small fraction of all possible pairs and increasing the number of pairs until the highest (most significant) correlations barely grow anymore. Or you may use a simulated-annealing approach to decide which variables to keep and which ones to add to form new pairs after computing correlations on, for example 1,000 randomly selected seed pairs of variables.
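
A bare-bones version of this sampling strategy is sketched below on simulated data (the series are pure noise, and k, n, and the sample sizes are arbitrary). In practice you would plug in your real series and stop growing the sample when the top correlations stabilize; the high values returned even on pure noise are themselves a reminder of the curse of big data discussed earlier.

  import random
  import numpy as np

  rng = np.random.default_rng(0)
  random.seed(0)
  k, n = 1_000, 20
  series = rng.standard_normal((k, n))    # k toy "stock" series with n observations each

  def top_sampled_correlation(num_pairs):
      # Highest absolute correlation found among num_pairs randomly chosen pairs.
      best = 0.0
      for _ in range(num_pairs):
          i, j = random.sample(range(k), 2)
          r = np.corrcoef(series[i], series[j])[0, 1]
          best = max(best, abs(r))
      return best

  # Grow the sample until the top correlation barely improves.
  for num_pairs in (1_000, 5_000, 25_000):
      print(num_pairs, round(top_sampled_correlation(num_pairs), 3))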

CROSS-REFERENCE In the section Automated Fast Feature Selection with Combinatorial Optimization in Chapter 6, you will find a discussion on a semi-combinatorial strategy to handle not only 2×2 combinations (as in this correlation issue), but 3×3, 4×4, and so on to find high-quality multivariate vectors (in terms of predictive power) in the context of statistical scoring or fraud detection.

Solution 2: Binning

If you can bin your variables in a way that makes sense, and if n is small (for example, 5 or less), then you can precompute all potential correlations and save them in a lookup table. In our example, variables are already binned: we are dealing with signals (up or down) rather than actual, continuous metrics such as price deltas. With n=5, there are at most 512 potential paired time series. An example of such a pair is {(up, up, down, up, down), (up, up, up, down, down)} where the first five values correspond to a particular stock and the last five values to another stock. It is thus easy to precompute all 512 correlations. You still have to browse all k * (k−1) / 2 pairs of stocks to solve your problem, but now it's much faster because for each pair you get the correlation from the lookup table — no computation required, only accessing a value in a hash table or an array with 512 cells.

Note that with binary variables, the mathematical formula for correlation simplifies significantly, and using the simplified formula on all pairs might be faster than using lookup tables to access 512 precomputed correlations. However, the principle works regardless of whether you compute a correlation or a more complicated function f.
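
Building such a lookup table takes only a few lines. For simplicity, the sketch below stores all 2^5 x 2^5 = 1,024 ordered combinations; exploiting the symmetry of correlation roughly halves that, in line with the figure quoted above. Constant (all-up or all-down) signals have an undefined correlation, which is set to 0 here by convention.

  from itertools import product
  import numpy as np

  n = 5
  signals = list(product((1, -1), repeat=n))   # up = 1, down = -1

  lookup = {}
  for a in signals:
      for b in signals:
          with np.errstate(invalid="ignore", divide="ignore"):
              r = np.corrcoef(a, b)[0, 1]      # nan when a or b is constant
          lookup[(a, b)] = 0.0 if np.isnan(r) else r

  print(len(lookup))                                     # 1,024 precomputed entries
  print(lookup[((1, 1, -1, 1, -1), (1, 1, 1, -1, -1))])  # instant lookup, no computation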

Solution 3: Classical Data Reduction

Traditional reduction techniques can also be used, such as forward or backward step-wise techniques where (in turn) you add or remove one variable at a time (or maybe two). The variable added is chosen to maximize the resulting entropy. Entropy can be measured in various ways. In a nutshell, if you have two data subsets from the same large data set such as these:

  • Set A with 100 variables, which is 1.23 GB when compressed
  • Set B with 500 variables, including the 100 variables from set A, which is 1.25 GB when compressed

then you can say that the extra 400 variables (for example, stock symbols) in set B don't bring any extra predictive power and can be ignored. In other words, the lift obtained with the set B is so small that it's probably smaller than the noise inherent to these stock price signals.
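
The compression test itself is easy to run. The sketch below uses made-up data (set B is set A plus 400 redundant copies of its rows) and LZMA as the compressor, whose large dictionary lets it spot repeated rows. Although set B holds five times as much raw data as set A, its compressed size is only modestly larger, which is the signal described above that the extra variables carry little additional information.

  import lzma
  import numpy as np

  rng = np.random.default_rng(0)

  def compressed_size(dataset):
      # Bytes after LZMA compression: a crude proxy for information content.
      return len(lzma.compress(dataset.tobytes()))

  base = rng.integers(0, 2, size=(100, 5_000), dtype=np.uint8)    # "set A": 100 binary signals
  extra = base[rng.integers(0, 100, size=400)]                    # 400 redundant copies
  set_a = base
  set_b = np.vstack([base, extra])                                # "set B": 500 signals

  size_a, size_b = compressed_size(set_a), compressed_size(set_b)
  print(size_a, size_b)              # compressed sizes differ far less than the raw 5x
  print(round(size_b / size_a, 2))   # the extra 400 variables add little information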

NOTE An interesting solution consists of using a combination of these three strategies. Also make sure that the high correlations found are not an artifact caused by the “curse of big data” discussed previously.

Conclusion: When to Use MapReduce

Data that comes sequentially as transactional data or log data — anything where you don't need computations involving large-scale point interactions between a large number of far-away points (for instance, weather patterns that are global but feed on local patterns would not work) — is a good candidate for MapReduce. Scoring algorithms, in particular, are good candidates. Processing massive amounts of unstructured text data, collaborative filtering, web crawlers, and search engines are other good examples.

It's more difficult to use MapReduce in situations involving processing n^2 points (or n^3), where n is 1 billion and where you truly need to do all n^2 computations. This n^2 issue also appeared in the section “Clustering and Taxonomy Creation for Massive Data Sets” earlier in this chapter, but the data structure used there somehow managed to get rid of the n^2 matrix. It was possible because that big data was sparse, making MapReduce still useful despite the unfriendly nature of the problem at first glance, from a parallel computation perspective. The same can be said about Google's PageRank algorithm, which works on an n x n matrix of links, where n = the number of web pages on the Internet (n > 100 billion). It can be done with MapReduce thanks to smart algorithms.

Communication Issues

The example discussed in the previous section, Big Data Problem Epitomizing the Challenges of Data Science, illustrates that the issue sometimes is not technical difficulty, but rather corporate silos, poor communication, and teams that do not collaborate optimally.

There are a few issues that make this search engine problem (matching search results with a user query) difficult to fix. The problem is easy for decision makers, CTOs, or CEOs to notice, understand, and then assess the opportunity cost (just run 200 high-value random search queries and see how many return irrelevant results), yet the communication between the analytics teams and business people is faulty: there is a short circuit somewhere.

There might be multiple analytics teams working as silos — computer scientists, statisticians, engineers — sometimes aggressively defending their own turfs and having conflicting opinions. What the decision makers eventually hear is a lot of noise and a lot of technicalities, and they don't know how to begin addressing the problem, how complex the issue is, how much it will cost to fix it, and who should fix it.

The people involved in steps 1 through 8 of the earlier example are all on different teams, or are people working independently. The decision makers don't even know who to talk to about the problem. Sometimes the use of data science terminology with non–data scientists can exacerbate the problem. For example, data scientists use terms such as n-grams to discuss Step 4, and mention computational complexity as the bottleneck preventing Step 4 from being improved. They will tell decision makers that a query involving three tokens has 6 n-grams, another involving four tokens has 24 n-grams, another with six tokens has 720 n-grams, and if they need to consider all combinations of tokens in the keyword index, the index will explode! Then maybe the decision maker says, “OK, let's buy tons of servers and a clustered architecture (Hadoop).”

But when the scientists claim that the keyword index will explode, they are wrong. They are wrong because they think like scientists, not like business people. In practice, how many queries have more than four tokens? Very few. And for queries with three or four tokens, very few n-grams have decent traffic. The vast majority can be replaced by the dominant n-gram (in terms of traffic). For instance, people sometimes search for car insurance but rarely for insurance car, so having only one entry for these two queries makes sense. But for data mining and mining data, it makes sense to keep two separate entries. So instead of a table exploding in size by a factor of 1,000, in reality you see a size increase of approximately 50 percent, which is perfectly manageable. When a user searches for mining data, the algorithm should first check whether there is an entry for this specific n-gram (there should be one), and if not, rearrange the tokens in alphabetical order — that is, look up data mining in the keyword index.
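A minimal Python sketch of this lookup logic, assuming a hypothetical keyword_index dictionary standing in for the real keyword index:

```python
# Sketch of the n-gram lookup just described. The keyword_index dictionary
# and its entries are hypothetical placeholders for a real keyword index.
keyword_index = {
    "car insurance": {"category": "insurance", "traffic": 120000},
    "data mining":   {"category": "analytics", "traffic": 45000},
    "mining data":   {"category": "mining",    "traffic": 800},
}

def lookup(query):
    """Look up a query; fall back to the alphabetically sorted token order."""
    tokens = query.lower().split()
    entry = keyword_index.get(" ".join(tokens))
    if entry is not None:
        return entry                        # exact n-gram exists (e.g., "mining data")
    sorted_query = " ".join(sorted(tokens))
    return keyword_index.get(sorted_query)  # dominant n-gram (e.g., "car insurance")

print(lookup("mining data"))      # distinct entry is kept
print(lookup("insurance car"))    # folded into "car insurance"
```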

TIP The keyword index should contain keywords no more than four tokens long.

Another fix concerns Step 2. For some words, such as mining or booking, you cannot stem (that is, replace them with mine or book). You need a lookup table of the top 10,000 words ending in "ing" that cannot be stemmed.

The same applies to Step 3: Have a lookup table of the top 10,000 combinations involving a stop word, where the stop word cannot be removed. For example, IT job would be in the lookup table attached to the stop word it.

TIP As a rule of thumb, don't remove stop words in queries consisting of only two tokens.
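The two lookup-table fixes and the two-token rule from the TIP might look something like the following sketch; the exception lists are tiny, hypothetical stand-ins for the "top 10,000" tables described above.

```python
# Sketch of the Step 2 and Step 3 lookup-table fixes plus the two-token rule.
STEM_EXCEPTIONS = {"mining", "booking"}        # "ing" words that must not be stemmed
STOP_WORD_EXCEPTIONS = {("it", "job")}         # stop-word combinations to keep
STOP_WORDS = {"it", "the", "a", "for"}

def normalize_query(query):
    tokens = query.lower().split()

    # Step 2 fix: stem "ing" words unless they are in the exception table.
    stemmed = [t[:-3] if t.endswith("ing") and t not in STEM_EXCEPTIONS else t
               for t in tokens]

    # TIP: never remove stop words from queries of two tokens or fewer.
    if len(stemmed) <= 2:
        return stemmed

    # Step 3 fix: keep a stop word when its combination is in the exception table.
    kept = []
    for i, t in enumerate(stemmed):
        nxt = stemmed[i + 1] if i + 1 < len(stemmed) else None
        if t in STOP_WORDS and (t, nxt) not in STOP_WORD_EXCEPTIONS:
            continue
        kept.append(t)
    return kept

print(normalize_query("IT job"))               # ['it', 'job'] -- two tokens, kept
print(normalize_query("booking a hotel room")) # ['booking', 'hotel', 'room']
```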

One of the challenges of effective communication is creating an optimal team structure. Perhaps most important is to have high-level professionals who work across multiple teams, a bit like management consultants. Such a person should see the big picture of the problem, be both somewhat technical and business oriented (maybe an MBA with strong analytics skills), and bridge the gaps between multiple data science groups or people.

CROSS-REFERENCE The typical process of a data science project is explained in the section “Life Cycle of Data Science Projects” in Chapter 5.

This example issue and solution are exactly why the data science position was created: to bridge the gap between statisticians, business people, computer scientists, and engineers, and frequently to deal with big data problems as in this example. Note that to develop a solution like the one presented in this section, the data scientist must have solid domain expertise — in this case, in search technology.

This problem epitomizes what data science is all about, including its challenges. A case study like this one is typical, and the problem is not limited to search engines. It is not solved just by hiring the right statistician or data miner. Indeed, if you read the solution, you don't need to hire anyone; maybe you need to fire some people instead. It is, above all, a people and organization issue.

Data Science: The End of Statistics?

This section starts with examples in which traditional statistics is misused and leads to wrong conclusions, especially when applied to modern data, which is typically bigger, less structured, and assembled by merging data sets from various sources that are not fully compatible. Then you see how modern statistics can help make data science better.

The Eight Worst Predictive Modeling Techniques

Most of the following techniques have evolved over the last 10 years to the point where most of their drawbacks have been eliminated, making the updated tools far different from, and better than, their original versions. Yet the original, flawed versions are still widely used.

  1. Linear regression relies on normality, homoscedasticity, and other assumptions and does not capture highly nonlinear, chaotic patterns. It is prone to overfitting, its parameters are difficult to interpret, and it is very unstable when independent variables are highly correlated. Fixes include reducing the number of variables, transforming your variables, and using constrained regression (for example, ridge or lasso regression; see the sketch after this list).
  2. Traditional decision trees, meaning single large trees, are unstable, difficult to interpret, and prone to overfitting. A fix is to combine multiple small decision trees instead of using one large decision tree.
  3. Linear discriminant analysis is used for supervised clustering. It is a bad technique because it assumes that clusters do not overlap and are well separated by hyperplanes. In practice, this is never the case. Use density estimation techniques instead.
  4. K-means clustering tends to produce circular clusters and does not work well with data points that are not a mixture of Gaussian distributions.
  5. Neural networks are difficult to interpret, unstable, and subject to overfitting.
  6. Maximum likelihood estimation requires your data to fit with a prespecified probabilistic distribution. It is not data driven, and in many cases the prespecified Gaussian distribution is a terrible fit for your data.
  7. Density estimation in high dimensions is subject to the curse of dimensionality. One fix is to use nonparametric kernel density estimators with adaptive bandwidths.
  8. Naive Bayes is used, for example, in fraud and spam detection and for scoring. It assumes that the variables are independent; if they are not, it fails miserably. In the context of fraud or spam detection, the variables (sometimes called rules) are highly correlated. One fix is to group the variables into independent clusters, where each cluster contains variables that are highly correlated with each other, and then apply naive Bayes to the clusters, or to use data-reduction techniques. Bad text mining techniques (for example, basic "word" rules in spam detection) combined with naive Bayes produce absolutely terrible results, with many false positives and false negatives.
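As a small illustration of the constrained-regression fix mentioned in item 1 of the list, the following sketch compares ordinary least squares with ridge regression on synthetic, highly correlated predictors. It assumes NumPy and scikit-learn are installed.

```python
# Ridge regression stabilizes coefficients when independent variables are
# highly correlated; ordinary least squares does not.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)       # large, offsetting, unstable values
print("Ridge coefficients:", ridge.coef_)     # shrunk, roughly equal, stable values
```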

TIP Always remember to use sound cross-validation techniques when testing models.

The reasons why such poor models are still widely used include the following:

  • Many university curricula use outdated textbooks; thus many students are not exposed to better data science techniques.
  • People use black-box statistical software, not knowing the limitations, the drawbacks, or how to correctly fine-tune the parameters and optimize the various knobs, or not understanding what the software actually produces.
  • Governments force regulated industries (pharmaceutical and banking — see the Basel III regulations for banks) to use 30-year-old SAS procedures for statistical compliance. For instance, better credit-scoring methods, even if available in SAS, are not allowed and are arbitrarily rejected by the authorities. The same goes for clinical-trial analyses submitted to the FDA, for which SAS is the mandatory software, allowing the FDA to replicate analyses and results from pharmaceutical companies.
  • Modern data sets are considerably more complex than and different from the data sets used when these techniques were initially developed. In short, these techniques have not been developed for modern data sets.
  • There's no perfect statistical technique that would apply to all data sets, but there are many poor techniques.

In addition, poor cross-validation allows bad models to make the cut by overestimating the true lift to be expected on future data, the true accuracy, or the true ROI outside the training set. Good cross-validation consists of the following (a minimal sketch follows the list):

  • Splitting your training set into multiple subsets (test and control subsets)
  • Including different types of clients and more recent data in the control sets than in your test sets
  • Checking the quality of forecasted values on control sets
  • Computing confidence intervals for individual errors (error defined, for example, as true value minus forecasted value) to make sure that the error is small enough and not too volatile (that it has a small variance across all control sets)
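A minimal sketch of these steps on synthetic data; the forecast() function is a hypothetical placeholder for whatever model you are validating, and NumPy is assumed.

```python
# Split into test/control subsets, score on the control sets, and look at the
# distribution of individual errors (true value minus forecasted value).
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=100, scale=10, size=1000)       # stand-in for observed values

def forecast(values):
    """Hypothetical model: here, just a noisy copy of the true values."""
    return values + rng.normal(scale=2.0, size=len(values))

errors_per_control_set = []
for split in range(5):                                # several test/control splits
    idx = rng.permutation(len(data))
    control = data[idx[:200]]                         # hold-out control subset
    errors = control - forecast(control)              # true value minus forecast
    errors_per_control_set.append(errors)
    lo, hi = np.percentile(errors, [2.5, 97.5])       # empirical confidence interval
    print(f"split {split}: mean error {errors.mean():+.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")

# The error should be small and its variance stable across control sets.
variances = [e.var() for e in errors_per_control_set]
print("variance across control sets:", np.round(variances, 2))
```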

Marrying Computer Science, Statistics, and Domain Expertise

Let's now use a real-life problem to illustrate how to blend statistics, computer science, and domain expertise to offer a data science solution applied to a modern big data problem.

Consider an article written by a team of Google scientists about how to predict ad clicks based on user queries and the ad's text. The article, available at https://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/41159.pdf, focuses on the large number of metrics in the model (billions of features), the use of logistic regression (a statistical technique), and the optimization techniques (numerical analysis gradient methods) used to solve the logistic regression (that is, to find the optimum regression coefficients). As you would expect, the authors discuss at length how the feature space is sparse and how to take advantage of sparsity to design an efficient algorithm.

All this sounds great, and it is certainly a textbook example of a correct, good, interesting application of machine learning. Indeed, this is computer science. I have two criticisms, however, and they apply to pretty much all the "pure science" work I have read: using a nuclear weapon to kill a fly without realizing it, and not showing the lift over a basic methodology designed by domain experts — in this case, experts with deep expertise simultaneously in ad technology, business management, and statistics. Although Google is doing better than many companies with its algorithms, it could do even better, with less effort, by focusing less on science and more on domain expertise. This is precisely the gap that data science is trying to bridge.

Following is an examination of the ad-prediction technique developed by Google.

What Are You Trying to Accomplish?

That's the first question all data scientists should ask. Here, maybe Google is trying to maximize the number of clicks delivered to advertisers to boost its revenue. Maybe the research paper is to be used by Google internally. Or maybe the purpose is to help advertisers, who always want more clicks — as long as those clicks convert to sales.

If the paper is for Google's internal use, there should be a discussion about the fact that boosting click-through-rate (CTR) to increase Google's revenue works only in the short term. Overboosting CTR (by the publisher, in this case Google) eventually results in lower ROI for many advertisers, as I have experienced countless times. There should at least be a discussion about long-term goals (boosting conversions) along with short-term goals (boosting CTR). Both are necessary and cannot be considered separately in any business optimization problem.

If the paper is for advertisers, it misses the point: most advertisers (those interested in real traffic by real humans) are interested in conversions. It's easy for advertisers to change the wording in their ads and add keywords to their campaigns to generate tons of clicks — and negative ROI. The exception is advertisers who are publishers and bill their advertising clients downstream using a per-impression model (where a click from Google is an impression for their clients) — in short, click arbitrageurs.

NOTE The example discussed here is probably one of the most extreme cases you could encounter, but the key point is that the statistical approach should not be underestimated. In many circles, the default solution is a system with billions of rules that covers every possibility and processes the entire big data set. The point is that there are less costly alternatives.

Do You Need a Nuclear Weapon to Kill a Fly?

Using billions of features, most of them almost never triggered, makes no sense. How do you handle co-dependencies among these features, and what statistical significance do you get from the 99.9 percent of features that are triggered no more than three times in 500 billion observations (clicks)? Sure, you could do some feature blending and clustering — a rather expensive technique, computationally speaking — but this issue of feature aggregation is not even discussed in the paper.

Also, the vast majority of these features are probably automatically created through a feature generation algorithm. This is by far the most intriguing component of the system, but it is not discussed in the paper. It's a combinatorial optimization problem, looking at all relationships (ratios, products, log transformations, and other mappings, such as IP category) among a set of base metrics such as log-file fields to discover features with predictive power. Some features are also created in bulk by analysts looking at data. This set of billions of features could be missing two or three core (but nonobvious) features that would make the algorithm far superior. The paper does not mention any of the features used in Google's algorithm.

You can solve this ad-click-prediction problem with just a handful of features (a feature is a variable) carefully selected by a domain expert. The following features are unlikely to be created by an automated feature-generation algorithm, but they are recommended for predicting ad clicks (a small sketch combining a few of them appears after this list):

  • Keyword category: Match the keyword category to the category assigned to a text ad. This means that you have an algorithm to assign categories to a user query and a text ad. You have another algorithm to standardize user queries and to discriminate, for example, between mining data (data about mining) and data mining. It also means that you have a list of 500 categories, 100,000 subcategories, and three million sub-subcategories, enough to cover 99.99 percent of all commercial user queries (where advertisers are bidding). Note that a keyword can have two or three terms, as in car insurance Alabama and two categories such as insurance and regional.
  • Special keywords: Create or acquire a list of special keywords found in a text ad (for example, 2013, new, free, best).
  • Rare sub-category: Identify text ads and user queries that share the same rare sub-subcategory. This increases the odds of the user clicking.
  • Advertiser type: Check for an abnormally high CTR driven by clicks that do not convert. Such a CTR indicates a false advertiser, because real advertisers usually have low CTRs.
  • Number of ad views: Identify how many times a user has viewed the ad (first, second, or third time?). Good ads work well initially, but if you never change your text ad, it stops performing and its CTR goes down.
  • Popular events: Identify ads and user queries related to the same popular event.
  • Trustworthy domain name: Prepare or acquire a list of domain names that are trustworthy and respected. You need to have an algorithm that scores domains, broken down by category.
  • Special characters: Prepare a list of special characters and uppercase letters present in the ad.

Of course, the best solution is to blend these features with the top features detected by an automated machine learning algorithm and by analysts.
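As a rough illustration, the following sketch combines a few of the handcrafted features listed above (keyword category match, special keywords, and trustworthy domain) into a simple rule-based click score. The category values, keyword lists, domains, and weights are all hypothetical.

```python
# Toy rule-based click score built from a few domain-expert features.
SPECIAL_KEYWORDS = {"2013", "new", "free", "best"}
TRUSTED_DOMAINS = {"example-insurer.com", "example-bank.com"}

def click_score(query_category, ad_category, ad_text, ad_domain):
    """Combine a few handcrafted features into a simple click-likelihood score."""
    score = 0.0
    if query_category == ad_category:                 # keyword category match
        score += 0.5
    tokens = set(ad_text.lower().split())
    score += 0.1 * len(tokens & SPECIAL_KEYWORDS)     # special keywords in the ad
    if ad_domain in TRUSTED_DOMAINS:                  # trustworthy domain name
        score += 0.2
    return score

print(click_score("insurance", "insurance",
                  "Best car insurance rates for 2013", "example-insurer.com"))
# 0.5 + 0.1*2 + 0.2 = 0.9
```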

Where Smart Statistics Help

After noticing the high sparsity of the feature space, I developed hidden decision trees to solve this type of problem. Do you need logistic regression with a gradient algorithm? Do you need an exact solution when the data is messy? I bet you can make great predictions using only 20 carefully selected features, and that's where the data scientist can also help: by applying statistical knowledge to create a system that runs 1,000 times faster, uses far fewer computer resources, and provides similar or better results. You don't need to use standard techniques such as robust logistic regression. I've been working with model-free statistics for a long time with great satisfaction, and yes, I have also computed model-free confidence intervals, which are discussed in Chapter 5.

Another area in which statistics can help — if you like working with billions of features — is in identifying features with predictive power. Most of the billion features used by Google have no predictive power; in fact, predictive power is never discussed in the paper mentioned previously. Sometimes two features have no individual predictive power, but combined they do. For example, country (US versus UK) and time of day have far greater predictive power together than separately. Statistical science can help define predictive power and assess when it is significant.
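Here is a minimal sketch of that idea, using synthetic data constructed so that clicks depend only on the combination of country and time of day, not on either feature alone; NumPy is assumed.

```python
# Individual versus combined predictive power: clicks depend only on the
# (country, time-of-day) combination in this synthetic example.
import numpy as np

rng = np.random.default_rng(2)
n = 20000
country = rng.choice(["US", "UK"], size=n)
daytime = rng.choice(["day", "night"], size=n)

# Click probability is high only for (US, day) and (UK, night):
p_click = np.where((country == "US") == (daytime == "day"), 0.10, 0.02)
clicks = rng.random(n) < p_click

def ctr(mask):
    return clicks[mask].mean()

print("CTR by country:", {c: round(ctr(country == c), 3) for c in ("US", "UK")})
print("CTR by time:   ", {t: round(ctr(daytime == t), 3) for t in ("day", "night")})
# Each single feature shows roughly 0.06 everywhere (no individual signal)...
combos = {(c, t): round(ctr((country == c) & (daytime == t)), 3)
          for c in ("US", "UK") for t in ("day", "night")}
print("CTR by combination:", combos)
# ...but the combinations separate cleanly into about 0.10 versus 0.02.
```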

Finally, if you have billions of features, you can find features that seem to have predictive power but actually don't. Worse is the fact that these spurious features might overshadow the ones that truly have predictive power, making your system prone to systemic errors, and resulting in chaotic predictions. This is an issue that should be addressed using ad hoc statistical analyses, not the kind of stats currently taught in university curricula.

NOTE The reason for this problem (and the fix) is explained in detail in an article on the curse of big data, available at http://www.analyticbridge.com/profiles/blogs/the-curse-of-big-data.

The Big Data Ecosystem

A book about big data would not be complete without at least a brief mention of the components of the big data ecosystem. The big data ecosystem consists of the following products and services (for which some examples are also listed):

  • Hardware providers.
  • Cloud providers, including public, private, and hybrid clouds. Examples include EMC, Cloudera, and Amazon Web Services. Security companies that protect your data also fall into this category.
  • Data integration vendors.
  • MapReduce/Hadoop environment.
  • Database vendors offering NoSQL, NewSQL, graph, and in-memory databases and analytics. Examples include Teradata and Pivotal.
  • Business Intelligence and dashboards, such as BIRT.
  • Visualization tools such as Tableau and TIBCO.
  • Data science and analytics tools, including SAS, StatSoft, Revolution Analytics, R, and Python data analysis libraries.

These products and services can be either open source or enterprise solutions.

Summary

This chapter continued the discussion of what makes data science a new discipline. It illustrated how big data differs from the data seen even just 10 years ago, showed why standard statistical techniques fail when blindly applied to such data, and identified the issues arising from both the misuse and the lack of use of statistics, along with how those issues should be investigated and addressed.

The “curse” of big data and fast-flowing data were considered first, and then real-life examples of big data techniques were presented and evaluated. Communication issues in big data team environments were discussed, followed by how and why statistics remains part of data science in the context of big data. Finally, you saw a summary of the elements of the big data ecosystem.

Chapter 3, “Becoming a Data Scientist,” discusses what a data scientist is, the different paths you can take as a data scientist, and how to acquire the right training to get hired as a data scientist.
