The previous chapter discussed material that should be part of your data science training. The material was less focused on metrics and more on applications. This chapter discusses case studies, real-life applications, and success stories. It covers various types of projects, ranging from stock market techniques based on data science, to advertising mix optimization, fraud detection, search engine optimization, astronomy, automated news feed management, data encryption, e-mail marketing, and relevancy problems (online advertising).
Following is a simple strategy used in 2013 to select and trade stocks from the S&P 500, with consistently high returns, based on data science. This section also discusses other strategies, modern trends, and an API that can be used to offer stock signals to professional traders based on technical analysis.
This pattern was found in recent price activity for the 500 stocks that are part of the S&P 500 index, and it multiplied the return by a factor of 5. For each day between 4/24 and 5/23, the companies that experienced the most extreme daily returns among these 500 companies were identified, comparing today's closing price with yesterday's.
Then, for companies that ranked either #1 or #500, the performance on the following day was examined (again comparing day-to-day closing prices). Companies that ranked #1 also experienced (on average) a boost in stock price the next day. The boost was more substantial for companies experiencing a 7.5 percent (or more) price increase. And the return boost on the next day was statistically significant and quite large, so big, in fact, that the total (non-compound) return based on this predictive signal would have been 20 percent over 30 days, versus 4.5 percent for the overall performance of the S&P 500 index.
The following things made it statistically significant:
The return the following day for these companies was positive 15 times out of 20.
These numbers were computed over a small time period, and it happened during a growth (bull market) period. However, you should note the following:
Sometimes, when a pattern stops working because of overuse, reversing the strategy (swapping buy and sell, for example, after one week with five bad daily returns) allows you to continue enjoying a good return for a few more weeks, until once again you need to reverse the strategy, creating consecutive, periodic cycles where either the strategy or the reverse strategy is used. Indeed, this pattern probably will not work anymore by the time you read this book because everyone will try to exploit it. But maybe the reverse strategy can work!
You might further increase the return by considering smaller time windows for buying or selling: intraday, or even high-frequency, rather than one buy/sell cycle per day. However, the return on a smaller time window is usually smaller, and the profit can more easily be consumed by:
Note that by focusing on the S&P 500, you can eliminate much of the stock market volatility and work with liquid stocks. This reduces the risk of being a victim of price manipulation and helps guarantee that your buy/sell transactions can be executed at the desired price. You can narrow down to stocks priced above $10 to stay in an even more robust, predictable environment. You can also check whether looking at extreme daily returns per business category might increase the lift. The S&P 500 data are broken down into approximately 10 main categories.
It is a good idea to do some simulations where you introduce a bit of noise into your data to check how sensitive your algorithm is to price variations. Also, you should not just rely on backtesting, but also walk forward to check whether a strategy will work in the future. This amounts to performing sound cross-validation before actually using the strategy.
You can get months of daily prices at once for all 500 S&P 500 stocks at http://www.stockhistoricaldata.com/daily-download. You need to provide the list of all 500 stocks in question. That list, with the stock symbol and business category for each S&P 500 company, is available in a spreadsheet at http://bit.ly/1apWiRP as well as on Wikipedia.
On a different note, can you sell stock price forecasts to stock traders? How would you price these forecasts? What would you do with unhappy clients aggressively asking for refunds when your forecast fails?
The conclusion is that you need deep domain expertise, not just pure statistical knowledge, to design professional trading signals that satisfy savvy customers. Dealing with clients (as opposed to developing purely theoretical models) creates potential liabilities and exposure to lawsuits in this context, especially if you sell your signals to people who don't have trading experience.
One of the common mistakes in optimizing statistical trading strategies consists of over-parameterizing the problem and then computing a global optimum. It is well known that this technique provides an extremely high return on historical data but does not work in practice. This section investigates this problem, and you see how it can be side-stepped. You also see how to build an efficient six-parameter strategy.
This issue is actually relevant to many real-life statistical and mathematical situations. The problem is referred to as over-parameterization or over-fitting. The explanation of why this approach fails can be illustrated with a simple example. Imagine you fit data with a 30-parameter model. If you have 30 data points (that is, the number of parameters is equal to the number of observations), you can have a perfect, fully optimized fit with your data set. However, any future data point (for example, tomorrow's stock prices) might have a bad fit with the model, resulting in huge losses. Why? You have the same number of parameters as data points. Thus, on average, each estimated parameter of the model is worth no more than one data point.
From a statistical viewpoint, you are in the same situation as if you were estimating the median U.S. salary by interviewing only one person. Chances are your estimate will be poor, even though the fit with your one-person sample is perfect. Actually, you run a 50 percent chance that the salary of the interviewee will be either very low or very high.
Roughly speaking, this is what happens when over-parameterizing a model. You obviously gain by reducing the number of parameters. However, if handled correctly, the drawback can actually be turned into an advantage. You can actually build a model with many parameters that is more robust and more efficient (in terms of return rate) than a simplistic model with fewer parameters. How is this possible? The answer to the question is in the way you test the strategy. When you use a model with more than three parameters, the strategy that provides the highest return on historical data will not be the best. You need to use more sophisticated optimization criteria.
One solution is to add boundaries to the problem, thus performing constrained optimization. Look for strategies that meet one fundamental constraint: reliability. That is, you want to eliminate all strategies that are too sensitive to small variations. Thus, you focus on that tiny part of the parameter space that shows robustness against all kinds of noise. Noise, in this case, can be trading errors, spread, and small variations in the historical stock prices or in the parameter set.
From a practical viewpoint, the solution consists of trying millions of strategies that work well under many different market conditions. Usually, it requires several months' worth of data to cover various market patterns and achieve some statistical significance. Then, for each of these strategies, you must introduce noise in millions of different ways and look at the impact. You then discard all strategies that can be badly impacted by noise and retain the tiny fraction that is robust.
The computational problem is complex because it is equivalent to testing millions of strategies. But it is worth the effort. The end result is a reliable strategy that can be adjusted over time by slightly varying the parameters. My own strategies are actually designed this way. They are associated with six parameters at most, for instance:
It would have been possible to reduce the dimensionality of the problem by imposing symmetry in the parameters (for example, parameters being identical for the buy and sell prices). Instead, this approach combines the advantage of low dimensionality (reliability) with returns appreciably higher than you would normally expect when being conservative.
Finally, when you backtest a trading system, optimize the strategy using historical data that is more than 1 month old. Then check if the real-life return obtained during the last month (outside the historical data time window) is satisfactory. If your system passes this test, optimize the strategy using the most recent data, and use it. Otherwise, do not use your trading system in real life.
This section describes how to backtest a short-term strategy to assess its long-term return distribution. It focuses on strategies that require frequent updates, which are also called adaptive strategies. You examine an unwanted long-term characteristic shared by many of these systems: long-term oscillations with zero return on average. You are presented with a solution that takes advantage of the periodic nature of the return function to design a truly profitable system.
When a strategy relies on parameters requiring frequent updates, you must design appropriate backtesting tools. I recommend that you limit the number of parameters to six. You also learned how to improve backtesting techniques using robust statistical methods and constrained optimization. For simplicity, assume that the system to be tested provides daily signals and needs monthly updates. The correct way to test such an adaptive system is to backtest it one month at a time, on historical data, as the following algorithm does.
For each month in the test period, complete the following steps:
Step 1: Backtesting. Collect the last 6 months' worth of historical data prior to the month of interest. Backtest the system on these 6 months to estimate the parameters of the model.
Step 2: Walk forward. Apply the trading system with the parameters obtained in step 1 to the month of interest. Compute the daily gains.
The whole test period should be at least 18 months long. Thus you need to gather and process 24 months' worth of historical data (18 months, plus 6 extra months for backtesting). Monthly returns obtained sequentially OUT OF SAMPLE (one month at a time) in step 2 should be recorded for further investigation. You are likely to observe the following patterns:
You now have all the ingredients to build a long-term reliable system. It's a metastrategy because it is built on top of the original system and works as follows: If the last month's return is positive, use the same strategy this month. Otherwise, use the reverse strategy by swapping the buy and sell prices.
This section provides details about stock price signals (buy/sell signals) based on predictive modeling, previously offered online via an API, to paid clients. In short, this web app is a stock forecasting platform used to get a buy/sell signal for any stock in real time. First consider the predictive model in this section and then the Internet implementation (API) in the following section.
This app relies on an original system that provides daily index and stock trending signals. The nonparametric statistical techniques described here have several advantages:
The algorithm computes the probability, for a particular stock or index, that tomorrow's close will be higher than tomorrow's open by at least a specified percentage. The algorithm can easily be adapted to compare today's close with tomorrow's close instead. The estimated probabilities are based on at most the last 100 days of historical data for the stock (or index) in question.
The first step consists of selecting a few price cross-ratios that have an average value of 1. The variables in the ratios can be selected to optimize the forecasts. In one of the applications, the following three cross-ratios were chosen:
Then each day in the historical data set is assigned to one of eight possible price configurations. The configurations are defined as follows:
Now, to compute the probability that tomorrow's close will be at least 1.25 percent higher than tomorrow's open, first compute today's price configuration. Then check all past days in the historical data set that have that configuration. Now count these days. Let N be the number of such days. Then, let M be the number of such days further satisfying the following:
Next day's close is at least 1.25 percent higher than next day's open.
The probability that you want to compute is simply M/N. This is the probability (based on past data) that tomorrow's close will be at least 1.25 percent higher than tomorrow's open. Of course, the 1.25 figure can be substituted by any arbitrary percentage.
There are different ways of assessing the performance of your stock trend predictor. Following are two approaches:
Tests showed a success rate between 54 percent and 65 percent in predicting the NASDAQ trend. Even with a 56 percent success rate in predicting the trend, the long-term (non-compounded) yearly return before costs is above 40 percent in many instances. As with many trading strategies, the system sometimes exhibits oscillations in performance.
It is possible to substantially attenuate these oscillations using the metastrategy described. In its simplest form, the technique consists of using the same system tomorrow if it worked today. If the system fails to correctly predict today's trend, then use the reverse system for tomorrow.
More generic processes for optimizing models include:
The system was offered online to paid clients as a web app (web form) and also accessible automatically via an API in real time. When users accessed the web form, they had to provide a stock symbol, a key, and a few parameters, for instance:
The key was used to verify that users were on time with payments. An active key meant that users were allowed to run the web app. The web app returned a web page with buy/sell signals to use tomorrow, as well as:
When the user clicked Submit, a Perl script (say GetSignals.pl) was run on the web server to quickly perform the computations on historical data (up to 180 days of daily stock prices for the stock in question) and display the results. The computations required less than 0.5 second per call. Alternatively, the application could be called by a machine (API call via a web robot) using a URL such as www.DataShaping.com/GetSignals.pl?stock=GE&key=95396745&strength=weak&history=60.
The results were saved on a web page, available immediately (created on the fly, in real time), that could be automatically accessed via a web robot (machine-to-machine communication) such as www.DataShaping.com/Results-GE-95396745-weak-60.txt.
The web app consisted of the following components:
No database system was used, but you could say it was a NoSQL system: actually, a file management system with stock prices accessed in real time via an HTTP request (if not in the cache) or file lookup (if in the cache).
Although stock market data are widely available, it is also useful to have a simulator, especially if you want to test intraday strategies or simulate growth, decline, or a neutral market.
The statistical process used here is a Markov Chain with infinite state space, to model the logarithm of the stock price (denoted as y). At each iteration, y is augmented by a quantity x (either positive or negative). The possible transitions use notation from the following source code:
Depending on the values in pval and ptab, the underlying process may show a positive trend, a negative trend, or be stationary. In addition, some noise is added to the process to make it more realistic. The simulated data are stored in datashap.txt.
For ease, assume the only possible transitions are −1, 0, or +1:
The transition probabilities are actually proportional to the ptab values. Also, notice that in the source code, ptab[4] is slightly greater than ptab[2], to simulate real market conditions with an overall positive long-term trend (+0.02 percent daily return rate).
You can find a simulation of a perfectly neutral market at http://bit.ly/1aatvlA. The X-axis represents the stock price in U.S. dollars. The Y-axis represents the time, with the basic time unit being one day. Surprisingly, and this is counterintuitive, there is no growth — running this experiment long enough will lead you both below and above the starting point an infinite number of times. Such a memory-less process is known as a random walk. The following code, based on the same notations introduced earlier in this section, is used to simulate this type of random walk:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>

#define N 7

int idx, l;
long k, niter = 50000L;
double u, aux, p_distrib, x, y = 0;
double ptab[N];
double pval[N];
FILE *out;

int main(void) {
  ptab[0]=0.0; ptab[1]=0.0; ptab[2]=0.5; ptab[3]=0.5;
  ptab[4]=0.51; ptab[5]=0.0; ptab[6]=0.0;
  for (l=0; l<N; l++) { aux += ptab[l]; }
  for (l=0; l<N; l++) { ptab[l] /= aux; }  /* normalize to probabilities */
  pval[0]=-3; pval[1]=-2; pval[2]=-1; pval[3]=0;
  pval[4]=1; pval[5]=2; pval[6]=3;
  srand((unsigned) time(NULL));  /* replaces the non-standard randomize() */
  out = fopen("datashap.txt", "wt");
  fprintf(out, "simulated stock prices - www.datashaping.com\n");
  for (k=0L; k<niter; k++) {
    u = ((double) rand()) / RAND_MAX;
    idx = 0;
    p_distrib = ptab[0];
    while (u > p_distrib) { idx++; p_distrib += ptab[idx]; }
    x = pval[idx];
    u = ((double) rand()) / RAND_MAX;
    x += 0.5 * (u - 0.5);            /* add noise */
    y = y + x;                       /* y is the log of the stock price */
    fprintf(out, "%lf\n", exp(y / 50.0));
  }
  fclose(out);
  return 0;
}
For those interested in more mathematical stuff, this section presents some deeper results, without diving too deeply into the technical details. You can find here a few mathematical concepts associated with Wall Street. The section on the “curse” of big data in Chapter 2, “Why Big Data Is Different,” also discusses statistics related to stock trading.
The following theory is related to the stock market and trading strategies, with roots in martingale theory, random walk processes, game theory, and neural networks. You see some of the most amazing and deep mathematical results of practical interest to the curious trader.
Say that you make 90 percent of your trading gains with 5 percent of successful trades. You can write h(0.05) = 0.90. The function h is known as the Lorenz curve. If the gains are the realizations of a random variable X with cdf F and expectation E[X], then h(p) is the share of the total expected gain produced by the top p fraction of trades: h(p) = (1/E[X]) ∫ F⁻¹(u) du, with the integral taken from u = 1 − p to u = 1.
To avoid concentrating too much gain in just a few trades, avoid strategies that have a sharp Lorenz curve. The same concept applies to losses. Related keywords are inventory management, Six Sigma, Gini index, Pareto distribution, and extreme value theory.
The Black-Scholes formula relates the price of an option to five inputs: time to expiration, strike price, value of the underlier, implied volatility of the underlier, and the risk-free interest rate. For technical details, check out www.hoadley.net. You can also look at the book A Probability Path by Sidney Resnick (Birkhauser, 1998).
The formula can be derived from the theory of Brownian motions. It relies on the fact that stock market prices can be modeled by a geometric Brownian process. The model assumes that the variance of the process does not change over time and that the drift is a linear function of the time. However, these two assumptions can be invalidated in practice.
Stock picking strategies can be optimized using discriminant analysis. Based on many years of historical stock prices, it is possible to classify all stocks in three categories—bad, neutral, or good—at any given time. The classification rule must be associated with a specific trading strategy, such as buying at close today and selling at close 7 days later.
What is the parametric distribution of the daily ratio high/low? Or the 52-week high/low? And how would you estimate the parameter for a particular stock? Interdependencies in the time series of stock prices make it difficult to compute an exact theoretical distribution. Such ratios are shown in Figure 6-1.
The distribution is characterized by two parameters: mode and interquartile range. The FTSE 100 (an international index) has a much heavier left tail than the NASDAQ, though both distributions have the same mean, making it more attractive to day traders. As a rule of thumb, indexes with a heavy left tail are good for volatility and high returns, but they are more risky.
Consider a random walk on Z (the positive and negative integers), with transition probabilities P(k to k+1) = p, P(k to k) = q, P(k to k−1) = r, with p + q + r = 1. The expected number of steps for moving above any given starting point is infinite if p ≤ r, and is equal to 1/(p − r) if p > r.
This result, applied to the stock market, means that under stationary conditions (p=r), investing in a stock using the buy-and-hold strategy may never pay off, even after an extremely long period of time.
This result explains why 50 percent of the people consistently lose money, whereas 50 percent consistently see gains. Now compare stock trading to coin flipping (tails = loss, heads = gain). Then:
Modern trading strategies include crawling Twitter postings to extract market sentiment broken down by industry, event modeling, and categorization (impact of announcements, quarterly reports, analyst reports, job reports, and economic news based on keywords using statistical modeling), as well as detection of patterns such as the following:
These patterns use data science — more precisely, pattern detection — a set of techniques that data scientists should be familiar with. The key message here is that the real world is in an almost constant state of change. Change can be slow, progressive, or abrupt. Data scientists should be able to detect when significant changes occur and adapt their techniques accordingly.
Encryption is related to data science in two different ways: it requires a lot of statistical expertise, and it helps make data transmission more secure. Credit card encoding is discussed in Chapter 4 “Data Science Craftsmanship, Part I.” I included JavaScript code for a specific encoding technique that never encodes the same message identically because of random blurring added to the encoded numbers. This section discusses modern steganography: the art and science of hiding messages in multimedia documents, e-mail encryption, and captcha technology. The steganography discussion can help you become familiar with image processing techniques.
Steganography is related to data encryption and security. Imagine that you need to transmit the details of a patent or a confidential financial transaction over the Internet. There are three critical issues:
Having the message encrypted is a first step, but it might not guarantee high security. Steganography is about using mechanisms to hide a confidential message (for example, a scanned document such as a contract) in an image, a video, an executable file, or some other outlet. Combined with encryption, it is an efficient way to transmit confidential or classified documents without raising suspicion.
Now consider a statistical technology to leverage 24-bit images (bitmaps such as Windows BMP images) to achieve this goal. Steganography, a way to hide information in the lower bits of a digital image, has been in use for more than 30 years. A more advanced, statistical technique is described that should make steganalysis (reverse engineering steganography) more difficult: in other words, safer for the user.
Although you focus here on the widespread BMP image format, you can use the technique with other lossless image formats. It even works with compressed images, as long as information loss is minimal.
The BMP image format created by Microsoft is one of the best formats to use for steganography. The format is publicly documented, and open source code is available to produce BMP images. Yet there are so many variants and parameters that it might be easier to reverse-engineer the format than to spend hours reading hundreds of pages of documentation to figure out how it works. Briefly, this 24-bit format is the easiest to work with: it consists of a 54-byte header, followed by the bitmap. Each pixel has four components: the RGB (red, green, and blue channel) values and the alpha channel (which you can ignore). Thus, it takes 4 bytes to store each pixel.
Go to http://bit.ly/19B7YVO for detailed C code about 24-bit BMP images. One way to reverse-engineer this format is to produce a blank image, add one pixel (say, purple) that is 50 percent Red, 50 percent Blue, 0 percent Green, change the color of the pixel, and then change the location of the pixel to see how the BMP binary code changes. That's how the author of this code figured out how the 256-color BMP format (also known as the 8-bit BMP format) works. Here, not only is the 24-bit easier to understand, but it is also more flexible and useful for steganography.
To hide a secret code, image, or message, you first need to choose an original (target) image. Some images are great candidates for this type of usage; others are poor and could get you compromised. Avoid images that are color-poor or that contain large uniform areas. Conversely, color-rich images with no uniform areas are good candidates. So the first piece of a good steganography algorithm is a mechanism to detect images that are good candidates in which to bury your secret message.
After you detect a great image to hide your message in, here is how to proceed. Assume that the message you want to hide is a text message based on an 80-char alphabet (26 lowercase letters, 26 uppercase letters, 10 digits, and a few special characters such as parentheses). Now assume that your secret message is 300 KB long (300,000 1-byte characters) and that you are going to bury it in a 600 x 600 pixel x 24-bit image (that is, a 1,440 KB image; 1,440 KB = 600 × 600 × (3+1) and 3 for the RGB channels, 1 for the alpha channel; in short, each pixel requires 4 bytes of storage). The algorithm consists of the following:
Step 1: Create a (one-to-many) table in which each of the 80 characters in your alphabet is associated with 1,000 RGB colors, widely spread in the RGB universe, and with no collision (no RGB component associated with more than one character). So you need an 80,000 record lookup table, each record being 3 bytes long. (So the size of this lookup table is 240 KB.) This table is somewhat the equivalent of a key in encryption systems.
Step 2: Embed your message in the target image:
The target (original) image will look exactly the same after your message has been embedded into it.
Look for the pixels that have an RGB color found in the 80,000 RGB color lookup table, and match them with the characters that they represent. It should be straightforward because this lookup table has two fields: character (80 unique characters in your alphabet) and RGB representations (1,000 different RGB representations per character).
With your message securely encoded, hidden in an image, you would think that you just have to e-mail the image to the recipient, and he will easily extract the encoded message. This is a dangerous strategy because even if the encrypted message cannot be decoded, if your e-mail account or your recipient's e-mail account is hijacked (for example, by the NSA), the hijacker can at least figure out who sent the message and/or to whom.
A better mechanism to deliver the message is to anonymously post your image on Facebook or other public forum. You must be careful about being anonymous, using bogus Facebook profiles to post highly confidential content (hidden in images using the steganography technique), as well as your 240 KB lookup table, without revealing your IP address.
The message hidden in your image should not contain identifiers, because if it is captured and decoded, the hijacker might be able to figure out who sent it, and/or to whom.
Most modern images contain metatags to help categorize and retrieve images when users do a Google search based on keywords. However, these metatags are a security risk: they might contain information about who created the image, along with an IP address, timestamp, and machine ID, and thus might help hijackers discover who you are. That's why you should alter or remove these metatags, or write the 54 bytes of the BMP header yourself. Metatags should not look fake, because that could get your image flagged as suspicious. Reusing existing images acquired externally (found on the web, not produced on your machines) for your cover images is a good solution.
One way to increase security is to use a double system of lookup tables (the 240 KB tables). Say you have 10,000 images with embedded encoded messages stored somewhere on the cloud. The lookup tables are referred to as keys, and they can be embedded into images. You can increase security by adding 10,000 bogus (decoy) images with no encoded content and two keys: A and B. Therefore, you would have 20,002 images in your repository. You need key A to decode key B, and then key B to decode the other images. Because all the image files (including the keys) look the same, you can decode the images only if you know the filenames corresponding to keys A and B. So if your cloud is compromised, it is unlikely that your encoded messages will be successfully decoded by the hijacker.
NOTE For reference, the book Steganography in Digital Media by Jessica Fridrich (Cambridge University Press, 2010) makes no mention of an algorithm based on alphabet lookup tables. (Alphabets are mentioned nowhere in the book, but they are at the core of the author's algorithm.) Of course, this does not mean that the author's algorithm is new, nor that it is better than existing algorithms. The book puts more emphasis on steganalysis (decrypting these images) than on steganography (encoding techniques).
What about creating a startup that offers encrypted e-mail that no government or entity could ever decrypt, offering safe solutions to corporations that don't want their secrets stolen by competitors, criminals, or the government?
For example, consider this e-mail platform:
Indeed, the government could create such an app and disguise it as a private enterprise; it would in this case be a honeypot app.
Note that no system is perfectly safe. If there's an invisible camera behind you filming everything you do on your computer, this system offers no protection for you—though it would still be safe for the recipient, unless he also has a camera tracking all his computer activity. But the link between you and the recipient (the fact that both of you are connected) would be invisible to any third party. And increased security can be achieved if you use the web app from an anonymous computer—maybe from a public computer in some hotel lobby.
To further improve security, the system could offer an e-mail tool such as www.BlackHoleMail.com that works as follows:
Encrypted e-mail addresses work for 48 hours only, to prevent enforcement agencies from successfully breaking into the system. If they do, they can reconstruct only e-mail addresses used during the last 48 hours. In short, this system makes it harder for the NSA and similar agencies to identify who is connected to whom.
A double-encryption system would be safer:
Here's an interesting project for future data scientists: designing an algorithm that can correctly identify the hidden code 90 percent of the time and make recommendations to improve captcha systems.
Reverse-engineering a captcha system requires six steps:
Your universal captcha decoder will probably work well with some types of captchas (blurred letters), and maybe not as well with other captchas (where letters are crisscrossed by a network of random lines).
Note that some attackers have designed technology to entirely bypass captchas. Their system does not even “read” them; it gets the right answer each time. They access the server at a deeper level, read what the correct answer should be, and then feed the web form with the correct answer for the captcha. Spam technology can bypass the most challenging questions in sign-up forms, such as factoring a product of two large primes (more than 2,000 digits each) in less than 1 second. Of course, they don't extract the prime factors; instead they read the correct answer straight out of the compromised servers, JavaScript code, or web pages.
Anyway, this interesting exercise will teach you a bit about image processing and clustering. At the end, you should be able to identify features that would make captchas more robust, such as:
This section focuses on a specific type of fraud: click fraud generated directly or indirectly by ad network affiliates, generating fake clicks to steal money from advertisers. You learn about a methodology that was used to identify a number of Botnets stealing more than 100 million dollars per year. You also learn how to select features out of trillions of feature combinations. (Feature selection is a combinatorial problem.) You also discover metrics used to measure the predictive power of a feature or set of features.
CROSS-REFERENCE For more information, see Chapter 4 for Internet technology and metrics for fraud detection, and Chapter 5 for mapping and hidden decision trees.
Click fraud is usually defined as the act of purposely clicking ads on pay-per-click programs with no interest in the target website. Two types of fraud are usually mentioned:
Although these are two important sources of non-converting traffic, there are many other sources of poor traffic. Some of them are sometimes referred to as invalid clicks rather than click fraud, but from the advertiser's or publisher's viewpoint, there is no difference. Here, consider all types of non-billable or partially billable traffic, regardless of whether it results from fraud, whether there is intent to defraud, and whether there is a financial incentive to generate the traffic in question. These sources of undesirable traffic include:
This discussion encompasses other sources of problems not generally labeled as click fraud but sometimes referred to as invalid, non-billable, or low-quality clicks. They include:
Web traffic isn't black or white, and there is a whole range from low quality to great traffic. Also, non-converting traffic might not necessarily be bad, and in many cases can actually be good. Lack of conversions might be due to poor ads or poorly targeted ads. This raises two points:
A scoring approach can help search engines determine the optimum price for multiple clicks (such as true user-generated multiple clicks, not a double click that results from a technical glitch). By incorporating the score in their smart pricing algorithm, they can reduce the loss due to the simplified business rule “one click per ad per user per day.”
Search engines, publishers, and advertisers can all win because poor quality publishers can now be accepted in a network, but are priced correctly so that the advertiser still has a positive ROI. And a good publisher experiencing a drop in quality can have its commission lowered according to click scores, rather than being discontinued outright. When its traffic gets better, its commission increases accordingly based on scores.
To make sense for search engines, a scoring system needs to be as generic as possible. Click scores should be designed to match the conversion rate distribution using generic conversions, taking into account bogus conversions and based on attribution analytics (discussed later in this section) to match a conversion with a click through correct user identification. An IP can have multiple users attached to it, and a single user can have multiple IP addresses within a 2-minute period. Cookies (particularly in server logs, less so in redirect logs) also have notorious flaws, and you should not rely on cookies exclusively when dealing with advertiser server log data.
I have personally designed scores based on click logs, relying, for instance, on network topology metrics. The scores were designed based on advertiser server logs, also relying on network topology metrics (distribution partners, unique browsers per IP cluster, and so on) and even on impression-to-click ratio and other search engine metrics, as server logs were reconciled with search engine reports to get the most accurate picture. Using search engine metrics to score advertiser traffic made it possible to design good scores for search engine data, and vice versa, because search engine scores are correlated with true conversions.
When dealing with advertiser server logs, the reconciliation process and the use of appropriate tags (for example, Google's gclid) whenever possible allow you to not count clicks that are artifacts of browser technology.
Advertiser scores are designed to be a good indicator of the conversion rate. Search engine scores use a combination of weights based both on expert knowledge and advertiser data. Scores should be smoothed and standardized using the same methodology used for credit card scoring. The best quality assessment systems rely on both real-time scores and less granular scores, such as end-of-day scores.
The use of a smooth score based on solid metrics substantially reduces false positives. If a single rule is triggered, or even two rules are triggered, the click might be barely penalized. Also, if a rule is triggered by too many clicks or is not correlated with true conversions, it is ignored. For instance, a rule formerly known as "double click" (with enough time between the two clicks) was found to be a good indicator of conversion and was changed from a rule into an anti-rule whenever the correlation is positive. A click with no external referral but otherwise normal will not be penalized after score standardization.
The scoring methodology I have developed is state-of-the-art, based on almost 30 years of experience in auditing, statistics, and fraud detection, both in real time and on historical data. It combines sophisticated cross-validation, design of experiments, linkage, unsupervised clustering to find new rules, and machine learning, together with the most advanced models ever used in scoring, a parallel implementation, and fast, robust algorithms that produce a large number of small overlapping decision trees at once. The clustering algorithm is a hybrid combination of unique decision-tree technology with a new type of PLS logistic stepwise regression designed to handle tens of thousands of highly redundant metrics. It provides meaningful regression coefficients computed in a short amount of time and efficiently handles interaction between rules.
Some aspects of the methodology show limited similarities with ridge regression, tree bagging, and tree boosting (see http://www.cs.rit.edu/~rlaz/prec20092/slides/Bagging_and_Boosting.pdf). Now you can compare the efficiency of different systems to detect click fraud on highly realistic simulated data. The criterion for comparison is the mean square error, a metric that measures the fit between scored clicks and conversions:
The reason bogus conversions are elaborated on is because their impact is worse than most people think. If not taken care of, they can make a fraud detection system seriously biased. Search engines that rely on presales or non-sales (soft) conversions, such as sign-up forms, to assess traffic performance can be misled into thinking that some traffic is good when it actually is poor, and the other way around.
Usually, the advertiser is not willing to provide too much information to the search engine, and thus conversions are computed generally as a result of the advertiser placing some JavaScript code or a clear gif image (beacon) on target conversion pages. The search engine can then track conversions on these pages. However, the search engine has no control over which “converting pages” the advertiser wants to track. Also, the search engine cannot see what is happening between the click and the conversion, or after the conversion. If the search engine has access to presale data only, the risk for bogus conversions is high. A significant increase in bogus conversions can occur from some specific traffic segment.
Another issue with bogus conversions arises when an advertiser (for example, an ad broker) purchases traffic upstream and then acts as a search engine and distributes the traffic downstream to other advertisers. This business model is widespread. If the traffic upstream is artificial but results in many bogus conversions—a conversion being a click or lead delivered downstream—the ad broker does not see a drop in ROI. She might actually see an increase in ROI. Only the advertisers downstream start to complain. When the problem starts being addressed, it might be too late and may have already caused the ad broker to lose clients.
This business flaw can be exploited by criminals running a network of distribution partners. Smart criminals will hit this type of “ad broker” advertiser harder. The criminals can generate bogus clicks to make money themselves, and as long as they generate a decent number of bogus conversions, the victim is making money too and might not notice the scheme.
It has been argued that the victims of click fraud are good publishers, not advertisers, because advertisers automatically adjust their bids. However, this does not apply to advertisers lacking good conversion metrics (for example, if the conversion takes place offline), nor to smaller advertisers who do not update bids and keywords in real time. Click fraud can actually lead advertisers to permanently eliminate whole traffic segments and miss out on the good ROI once the fraud problem gets fixed on the network. On some second-tier networks, impression fraud can lead to an advertiser being kicked out one day without the ability to ever come back. Both the search engine and the advertiser lose in this case, and the one who wins is the bad guy now displaying cheesy, irrelevant ads on the network. The website user loses, too, because all good ads have been replaced with irrelevant material.
Finally, many systems to detect fraud are still essentially based on outlier detection and detecting shifts from the average. But most fraudsters try hard to look as average as possible, avoiding expensive or cheap clicks, using the right distribution of user agents, and generating a small random number of clicks per infected computer per day, except possibly for clicks going through large proxies. This type of fraud needs a truly multivariate approach, looking at billions of combinations of several carefully selected variables simultaneously and looking for statistical evidence in billions of tiny click segments to unearth the more sophisticated fraud cases impacting a large volume of clicks.
To some extent, the technology to combat click fraud is similar to what banks use to combat credit card fraud. The best systems are based on statistical scoring technology because the transaction—a click in this context—is usually neither purely good nor purely bad.
Multiple scoring systems based on IP scores and click scores and metric mix (feature) optimization are the basic ingredients. Because of the vast amount of data, and potentially millions of metrics used in a good scoring system, combinatorial optimization is required, using algorithms such as Markov Chain Monte Carlo or simulated annealing.
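The combinatorial optimization mentioned above can be illustrated with a small simulated annealing sketch. This is a hedged illustration, not the production system: the subset size, cooling schedule, and `power` scoring function are placeholder assumptions.

```python
import math
import random

def anneal_feature_mix(features, power, k=20, steps=5000, t0=1.0):
    """Simulated annealing over k-feature subsets (k < len(features)).
    `power(subset)` returns the predictive power of a candidate subset."""
    current = random.sample(features, k)
    best = current[:]
    for step in range(1, steps + 1):
        t = t0 / step  # simple cooling schedule
        candidate = current[:]
        # swap one feature in the subset for one outside it
        candidate[random.randrange(k)] = random.choice(
            [f for f in features if f not in current])
        delta = power(candidate) - power(current)
        # always accept improvements; accept downhill moves with
        # probability exp(delta / t), which shrinks as t cools
        if delta > 0 or random.random() < math.exp(delta / t):
            current = candidate
            if power(current) > power(best):
                best = current[:]
    return best
```

The same skeleton works with a Markov Chain Monte Carlo acceptance rule; only the acceptance probability changes.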
Although scoring advertiser data can be viewed as a regression problem, the dependent variable being the conversion metric, scoring search engine data is more challenging because conversion data are not readily available. Even when dealing with advertiser data, there are several issues to address. First, the scores need to be standardized. Two identical ad campaigns might perform differently if the landing pages are different. The scoring system needs to address this issue.
Also, although scoring can be viewed as a regression problem, it is a difficult one. The metrics involved are usually highly correlated, making the problem ill-conditioned from a mathematical viewpoint. There might be more metrics (and thus more regression coefficients) than observed clicks, making the regression approach highly unstable. Finally, the regression coefficients—also referred to as weights—must be constrained to take only a few potential values. The dependent variable being binary, you are dealing with a sophisticated ridge logistic regression problem.
The best technology actually relies on hybrid systems that can handle contrarian configurations, such as “time < 4 am” is bad, “country not US” is bad, but “time < 4 am and country = UK” is good. Good cross validation is also critical to eliminate configurations and metrics with no statistical significance or poor robustness. Careful metric (feature) binning and a fast distributed feature optimization algorithm are important.
Finally, design of experiments to create test campaigns—some with a high proportion of fraud and some with no fraud—and usage of generic conversion and proper user identification are critical. And remember that failing to remove bogus conversions will result in a biased system with many false positives. Indeed, buckets of traffic with conversion rates above 10 percent should be treated separately from buckets of traffic with conversion rates below 2 percent.
Click scoring can do many things, including:
These are just a few of the applications of click scoring. Also note that scoring is not limited to clicks, but can also involve impressions and metrics such as clicks per impression.
From the advertiser's viewpoint, one important application of click scoring is to detect new sources of traffic to improve total revenue in a way that cannot be accomplished through A/B/C testing, traditional ROI optimization, or SEO. The idea consists of tapping into delicately selected new traffic sources rather than improving existing ones.
Now consider a framework in which you have two types of scores:
What can you do when you combine these two scores? Here's the solution:
The conclusion here is that, in the context of keyword bidding on pay-per-click programs such as Google AdWords, click scores aggregated by keyword are useful to predict conversion rates for keywords with little or no history. These keywords represent as much as 90 percent of all keywords purchased and more than 20 percent of total ad spend. These scores are also useful in real-time bidding.
Feature selection is a methodology used to detect the best subset of features out of dozens or hundreds of features (also called variables or rules). “Best” means with highest predictive power, a concept defined in the following subsection. In short, you want to remove duplicate features, simplify a bit the correlation structure (among features) and remove features that bring no value, such as features taking on random values, thus lacking predictive power, or features (rules) that are almost never triggered (except if they are perfect fraud indicators when triggered).
The problem is combinatorial in nature. You want a manageable, small set of features (say, 20 features) selected from, for example, a set of 500 features, to run the hidden decision trees (or some other classification/scoring technique) in a way that is statistically robust. But there are 2.7 × 10³⁵ combinations of 20 features out of 500, and you would need to evaluate all of them to find the one with maximum predictive power. This problem is computationally intractable, and you need to find an alternative solution. The good thing is that you don't need to find the absolute maximum; you just need to find a subset of 20 features that is good enough.
One way to proceed is to compute the predictive power of each feature. Then, add one feature at a time to the subset (starting with zero features) until you reach either of the following:
At each iteration, consider the two remaining features with the highest individual predictive power, and add whichever of the two increases the overall predictive power of the subset under construction the most. You have now reduced your computations from 2.7 × 10³⁵ to 40 = 2 × 20.
NOTE An additional step to boost predictive power: remove one feature at a time from the subset, and replace it with a feature randomly selected from the remaining features (outside the subset). If this new feature boosts the overall predictive power of the subset, keep it; otherwise, switch back to the old subset. Repeat this step 10,000 times or until no more gain is achieved (whichever comes first).
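The greedy selection and random-swap refinement described above can be sketched as follows. Here `subset_power` is a placeholder for whatever predictive power metric you use (such as the H function defined later in this section), and the features are assumed to be pre-sorted by individual predictive power, best first:

```python
import random

def forward_select(features, subset_power, k=20, swaps=10000):
    """Greedy forward selection of k features, then random-swap refinement.
    `features` is pre-sorted by individual predictive power (best first);
    `subset_power(subset)` scores a feature subset."""
    remaining = list(features)
    subset = []
    while len(subset) < k and remaining:
        # only the two strongest remaining features are considered,
        # hence 2 evaluations per iteration instead of a full search
        best = max(remaining[:2], key=lambda f: subset_power(subset + [f]))
        subset.append(best)
        remaining.remove(best)
    # refinement: swap one feature out for a random outside feature
    for _ in range(swaps):
        if not remaining:
            break
        out = random.choice(subset)
        incoming = random.choice(remaining)
        trial = [f for f in subset if f != out] + [incoming]
        if subset_power(trial) > subset_power(subset):
            subset = trial
            remaining.remove(incoming)
            remaining.append(out)
    return subset
```

This sketch stops after a fixed number of swaps; in practice you would also stop early once no more gain is achieved, as the NOTE suggests.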
Finally, you can add two or three features at a time, rather than one. Sometimes, combined features have far better predictive power than isolated features. For instance, if feature A = country, with values in {USA, UK} and feature B = hour of the day, with values in {“day – Pacific Time”, “night – Pacific Time”}, both features separately have little if any predictive power. But when you combine both of them, you have a much more powerful feature: UK/night is good, USA/night is bad, UK/day is bad, and USA/day is good. Using this blended feature also reduces the risk of false positives/false negatives.
Also, to avoid highly granular features, use lists. So instead of having feature A = country (with 200 potential values) and feature B = IP address (with billions of potential values), use:
This section illustrates the concept of predictive power on a subset of two features, to be a bit more general. Say you have two binary features, A and B, taking two possible values, 0 or 1. Also, in the context of fraud detection, assume that each observation in the training set is either Good (no fraud) or Bad (fraud). The fraud status (G or B) is called the response or dependent variable in statistics. The features A and B are also called rules or independent variables.
First, split your training set (the data where the response—B or G—is known) into two parts: control and test. Make sure that both parts are data-rich: if the test set is big (millions of observations) but contains only one or two clients (out of 200), it is data-poor and your statistical inference will be negatively impacted (low robustness) when dealing with data outside the training set. It is a good idea to use two different time periods for control and test. You are going to compute the predictive power (including rule selection) on the control data. When you have decided on a final, optimum subset of features, you can then compute the predictive power on the test data. If the drop in predictive power is significant in the test data (compared with the control), something is wrong with your analysis. Detect the problem, fix it, and start over again. You can use multiple control and test sets. This can give you an idea of how the predictive power varies from one control set to another. Too much variance is an issue that should be addressed.
Using the previous example with two binary features (A, B) taking on two values (0, 1), you can break the observations from the control data set into eight categories:
Now denote as n1, n2 … n8 the number of observations in each of these eight categories, and introduce the following quantities:
Assume that p, measuring the overall proportion of fraud, is less than 50 percent (that is, p < 0.5; otherwise, swap the fraud and non-fraud labels). For any r between 0 and 1, define the W function (shaped like a W), based on a parameter a (0 < a < 1; I recommend a = 0.5 − p), as follows:
Typically, r = P00, P01, P10, or P11. The W function has the following properties:
Now define the predictive power:
The function H is the predictive power for the feature subset {A, B}, having four bins, 00, 01, 10, and 11, corresponding to (A=0, B=0), (A=0, B=1), (A=1, B=0), and (A=1, B=1). Although H appears to be remotely related to entropy, H was designed to satisfy nice properties and to be parameter-driven (through a). Unlike entropy, H is not based on physical concepts or models; it is actually a synthetic (though useful) metric.
Note that the weights P00… P11 in H guarantee that bins with low frequency (that is, a low triggering rate) have low impact on H. Indeed, I recommend setting W(r) to 0 for any bin that has fewer than 20 observations. For instance, the triggering rate for bin 00 is (n1 + n5) / (n1 + … + n8), the observation count is n1 + n5, and r = P00 = n5 / (n1 + n5) for this bin. If n1 + n5 = 0, set P00 to 0 and W(P00) to 0. I actually recommend doing this not just if n1 + n5 = 0 but also whenever n1 + n5 < 20, especially if p is low. (If p is very low, say, p < 0.01, you need to over-sample bad transactions when building your training set and weight the counts accordingly.) Of course, the same rule applies to P01, P10, and P11. Note that you should avoid feature subsets that have a large proportion of observations spread across a large number of almost empty bins, as well as feature subsets that produce a large number of empty bins. Observations outside the training set are likely to belong to an empty or almost empty bin and lead to high-variance predictions. To avoid this drawback, stick to binary features, select up to 20 features, and use the (hybrid) hidden decision tree methodology for scoring transactions. Finally, Pij is the naive estimator of the probability P(A=i, B=j) for i,j = 0,1.
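For illustration, here is a minimal sketch of the bin bookkeeping just described, assuming n1 through n4 are the good (G) counts and n5 through n8 the bad (B) counts for bins 00 through 11 (consistent with r = P00 = n5 / (n1 + n5)). The W and H computations themselves follow the definitions given earlier and are not reproduced here:

```python
def bin_fraud_rates(good, bad, min_obs=20):
    """good[b] and bad[b] are observation counts for bin b in {'00','01','10','11'}.
    Returns r_b, the proportion of fraud per bin, with sparse bins zeroed out."""
    rates = {}
    for b in ('00', '01', '10', '11'):
        total = good[b] + bad[b]
        # bins with fewer than 20 observations contribute nothing
        # (equivalent to setting W(r) to 0 for that bin)
        rates[b] = bad[b] / total if total >= min_obs else 0.0
    return rates
```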
The predictive power H has interesting properties:
You can pre-compute all the bin counts ni for the top 20 features (that is, the features with highest predictive power) and store them in a small hash table with at most 2 × 2²⁰ entries (approximately 2 million; the factor is 2 because you need two measurements per bin: the number of Bs and the number of Gs). An entry in this hash table would look like this:
$Hash{01101001010110100100_G} = 56,
meaning that Bin # 01101001010110100100 has 56 good (G) observations.
The hash table is produced by parsing your training set one time, sequentially: for each observation, compute the flag vector (which rules are triggered, that is, the 01101001010110100100 vector in this example), check if it's good or bad, and update (increase count by 1) the associated hash table entry accordingly, with the following instruction:
$Hash{01101001010110100100_G}++
Then, whenever you need to measure the predictive power of a subset of these 20 features, you don't need to parse your big data set again (potentially billions of observations), but instead, just access this small hash table. This table contains all you need to build your flag vectors and compute scores for any combination of features that is a subset of the top 20.
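A minimal Python equivalent of this hash table construction might look as follows; the rule predicates and the record format are illustrative assumptions, not the original implementation:

```python
from collections import defaultdict

def build_bin_table(observations, rules):
    """observations: iterable of (record, label) pairs with label 'G' or 'B'.
    rules: list of predicates; rules[i](record) is True if rule i is triggered.
    Returns counts keyed by (flag_vector, label), the Python analogue of
    the $Hash{01101001010110100100_G}++ instruction above."""
    table = defaultdict(int)
    for record, label in observations:
        flags = ''.join('1' if rule(record) else '0' for rule in rules)
        table[(flags, label)] += 1
    return table
```

A single sequential pass over the training set fills the table; all subsequent predictive power computations read from it instead of re-parsing the data.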
You can even do better than the top 20, maybe the top 30. Although this could create a hash table with 2 × 2³⁰ (approximately 2 billion) entries, most of these entries would correspond to empty bins and thus would not be in the hash table. Your hash table might contain only 200,000,000 entries, maybe too big to fit in memory and requiring a MapReduce/Hadoop implementation.
Even better: build this hash table for the top 40 features. Then it will fully solve your feature selection problem described earlier. However, now your hash table could have up to 2 trillion entries. But if your data set has only 100 billion observations, then, of course, your hash table cannot have more than 100 billion entries. In this case, I suggest that you create a training set with 20 million observations so that your hash table will have at most 20 million entries (probably less than 10 million with all the empty bins) and thus can fit in memory.
You can compute the predictive power of a large number (say, 100) of feature subsets by parsing the big 40-feature input hash table obtained in the previous step. Then for each flag vector and G/B entry in the input hash table, loop over the 100 target feature subsets to update counts (all the ni) for these 100 feature subsets. These counts are stored/updated in an output hash table. The key in the output hash table has two components: feature ID and flag vector. You then loop over the output hash table to compute the predictive power for each feature subset. This step can be further optimized.
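The core of this step, collapsing the full flag-vector count table onto a smaller feature subset, can be sketched as follows (the table layout matches the build sketch shown earlier: counts keyed by (flag_vector, label)):

```python
from collections import defaultdict

def marginalize(table, keep):
    """Collapse a full flag-vector count table onto a feature subset.
    table: {(flags, label): count}; keep: indices of the features to retain."""
    out = defaultdict(int)
    for (flags, label), count in table.items():
        sub = ''.join(flags[i] for i in keep)
        out[(sub, label)] += count
    return out
```

Looping over 100 target subsets inside a single pass over the input table, as described above, amounts to calling this collapse once per subset per entry and accumulating into one output table keyed by (subset ID, flag vector).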
An example of collusion is when multiple affiliates (or sub-publishers) are associated in the same fraud scheme and share the same resources. Typically, this is a Botnet, that is, a set of computer IP addresses infected with a virus that automatically clicks once a day (at most) from each infected machine.
Association rules are fraud detection rules that try to correlate two fraudsters via an external metric. For instance, when an affiliate A is identified as fraudulent, you look at all the IP addresses attached to it. If 50 percent or more of these IP addresses also appear exclusively in association with another affiliate B (and nowhere else), then chances are that B is also part of the fraudulent scheme. Even better: now you look at all the IP addresses attached to B, and you discover that another affiliate, say C, shares 10 percent of them exclusively. Now you've found yet another participant in the fraud scheme. I have successfully identified several additional sources of fraud associated with one fraudster using this technique.
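The IP-sharing rule just described can be sketched as follows. The 50 percent threshold and the notion of exclusivity follow the description above; the data layout (a mapping from affiliate to its set of IP addresses) is an illustrative assumption:

```python
def linked_affiliates(ip_sets, fraudster, threshold=0.5):
    """ip_sets: {affiliate: set of IP addresses}.
    Flags affiliates that exclusively share a large fraction of the
    known fraudster's IP addresses (seen under no third affiliate)."""
    fraud_ips = ip_sets[fraudster]
    suspects = []
    for aff, ips in ip_sets.items():
        if aff == fraudster:
            continue
        shared = fraud_ips & ips
        # keep only IPs that appear under no third affiliate
        others = set().union(
            *(s for a, s in ip_sets.items() if a not in (aff, fraudster)))
        exclusive = shared - others
        if len(exclusive) >= threshold * len(fraud_ips):
            suspects.append(aff)
    return suspects
```

In practice you would re-run the rule from each newly flagged affiliate (with a lower threshold, such as the 10 percent in the example above) to walk outward through the fraud scheme.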
Another rule falling in the same category is the proportion of short visits (less than 20 seconds) that involve two clicks associated with two little, unknown affiliates or referral domains. When this proportion is above some threshold (say, 20 percent of all visits for a specific affiliate), you are likely dealing with fraud. By creating the list of all affiliates found in these two-click visits and then looking at other visits that also involve these affiliates as well as other affiliates, you can detect a larger proportion of bad affiliates.
Finally, Botnets used in click fraud are typically involved in other criminal activities as well and controlled by one operator. So detecting these Botnets with a click fraud detection algorithm can also help detect other crimes: spam, phishing, ID theft, and so on.
Because criminals don't put too much effort into creating artificial traffic that looks, from a statistical point of view, similar to natural traffic, their traffic is full of extreme events that make detection easier. These event probabilities can be estimated (in natural traffic) via Monte Carlo simulations. When you discover an event (in suspicious traffic) with a probability of occurrence below one chance in 1 million, that is a red flag. Of course, if you are tracking billions of events, you will have false positives. Read the section on the curse of big data (Chapter 2) to see how to address this issue. Rare events that were successfully used to detect fraud include:
All these rules have been successfully used to detect fraud. Multiple rules must be triggered before confirming fraud. For instance, in one case, the bad affiliate simultaneously had the domain name pattern, the identical scores pattern, and the empty web page pattern. It was first discovered because all its clicks had the same score. And by further analyzing the data, I discovered and created the two other rules: domain pattern and empty page pattern.
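The Monte Carlo estimation of event probabilities in natural traffic, mentioned earlier in this section, can be sketched as follows; the visit simulator and the event predicate are placeholders you would supply from a model of natural traffic:

```python
import random

def event_probability(simulate_visit, event, trials=1_000_000, seed=42):
    """Monte Carlo estimate of an event's probability in natural traffic.
    simulate_visit(rng) -> one synthetic visit; event(visit) -> bool."""
    rng = random.Random(seed)
    hits = sum(event(simulate_visit(rng)) for _ in range(trials))
    return hits / trials
```

An event whose estimated probability in natural traffic falls below one in a million, yet which appears in a suspicious traffic segment, is the kind of red flag described above.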
The borderline between data science and other domains, in particular business intelligence (what business analysts do), is not clear. In smaller companies, startups, and non-traditional companies (which hire a significant share of data scientists), several of the project types discussed in this section are handled by one person — the data scientist — who also plays a part-time role as business analyst. Being polyvalent allows you to command higher salaries and experience increased job security.
Domain expertise is critical to succeed in data analytics. A lot of domain expertise is shared with you here, with a focus on rules of thumb and principles that stay valid over time (such as solid SEO principles, e-mail campaign optimization, ad matching, and so on), even though the digital/ad tech industry is constantly changing.
Some of the techniques discussed in this section are typically implemented by vendors (Comscore, Rocket Fuel, and others), search engines (such as Bing, Google, Yahoo), ad networks such as AdKnowledge, or savvy publishers, rather than by advertisers themselves. Advertisers rely on vendors, though large advertisers such as eBay or Cars.com also develop and integrate their own proprietary tools on top of vendor tools. It is important for you to learn about the type of data science work (digital analytics) done by vendors and search engines, because collectively they hire more data scientists than all advertisers combined.
CROSS-REFERENCE For a more traditional data science discussion related to digital analytics, read the section Big Data Problem That Epitomizes the Challenges of Data Science in Chapter 2.
How do you effectively compute the reach and frequency of ads for advertisers? The formula discussed here computes reach and frequency for online advertisers to help them determine how many impressions they must purchase to achieve their goals. It is provided to illustrate that sometimes it is easier to spend two hours reinventing the wheel, solving the problem mathematically, and getting an efficient solution than to spend far longer trying to find the solution in the literature or blindly running expensive simulations or computations the way computer scientists traditionally do.
How many ad impressions should an advertiser purchase to achieve a specified reach on a website? I have investigated the problem and found a simple mathematical solution:
Reach = Unique Users − SUM{ Uk * (1 − P)^k }
The sum is over all positive integers k = 1, 2, and so on. Uk is the number of users turning exactly k pages in the given time period. So you need to compute the distribution of page views per unique user during the targeted time period to obtain Uk. P is the ratio of purchased impressions to total page views.
The number of unique users who see the ad n times during the time period in question is given by:
SUM{ Uk * C(k, n) * P^n * (1 − P)^(k−n) }
The sum is over all positive integers k greater than or equal to n. C(k, n) = k! / [ n! (k − n)! ] are the binomial coefficients.
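These two formulas translate directly into code. Here is a minimal sketch, where the page-view distribution is supplied as a dictionary mapping k (pages turned) to Uk (number of users):

```python
from math import comb

def reach(users_by_pages, P):
    """Reach = Unique Users - SUM{ Uk * (1 - P)^k }.
    users_by_pages[k] = number of unique users turning exactly k pages;
    P = purchased impressions / total page views."""
    total_users = sum(users_by_pages.values())
    return total_users - sum(
        u * (1 - P) ** k for k, u in users_by_pages.items())

def users_seeing_ad(users_by_pages, P, n):
    """Number of unique users who see the ad exactly n times:
    SUM over k >= n of Uk * C(k, n) * P^n * (1 - P)^(k - n)."""
    return sum(u * comb(k, n) * P**n * (1 - P) ** (k - n)
               for k, u in users_by_pages.items() if k >= n)
```

As a sanity check, the reach equals the sum of users_seeing_ad over all n >= 1, since a user is reached exactly when the ad is seen at least once.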
In this context, “unique user” represents a user that is counted as one user regardless of the number of visits during the time period in question. A user is a human being, so robots (web crawlers) should be filtered out. The IAB (Internet Advertising Bureau) provides definitions, directives, and best practices on how to measure unique users and filter out robots. Users are typically tracked via cookies. In the past, many counting methods resulted in a single user being counted multiple times; thus the term “unique user” was created to distinguish users from visits. (Defining concepts such as “unique user” is also part of the data scientist's job.)
We improved our own open rates on Data Science Central by 300 percent and dramatically improved total clicks and click-through rates using the following strategies, based on data analysis:
As a result, our mailing list management vendor informed us that our open rate is far above average — and they manage 60,000 clients, with e-mail lists ranging from small to multimillion subscribers!
SEO (search engine optimization) and SEM (search engine marketing) companies regularly hire data scientists, so you need to have a good grasp of how to implement solid, general SEM principles. Here's an example application that is based on careful analysis of keyword performance on Google Analytics:
Here is another application where data science can help. Each year, billions of dollars are poured into online advertising programs (display and pay-per-click advertising) in an effort to attract visitors to websites. Those websites that attract a high volume of visitors do so because they have great content.
Producing great content can be time-consuming and expensive, so nearly every website publisher wants to automate their content production. In addition, displaying the right content to the right user at the right time dramatically increases ad revenue. Analytic techniques can be used to achieve the goal of relevant and highly targeted content.
For example, at Data Science Central we have achieved the first steps of automated content creation. Our news feed system harvests news from keyword-targeted Google News, selected Twitter feeds and blogs, and also content that we create. Our news feed aggregator was created without writing a single line of code, instead using various widgets, Twitter feeds, and Feedburner. Our analytics news shows up on AnalyticBridge, LinkedIn, and many other places, and is updated in real time to drive good traffic to our websites.
Bigger companies are hiring analytic professionals to optimize these processes. For website publishers and anyone generating online advertising revenue, content is king. My first experiments have proved that optimizing content and news has a bigger impact than optimizing advertising campaigns in terms of increasing advertising revenue. The increased revenue comes from two sources:
The fact that Data Science Central is a social network does not play a major role in this dynamic (assuming that spam control has been taken care of effectively). What is critical is content quality and dynamic content, thus our focus on real time news and on including (in our feeds) micro-blogs from a number of small, well-respected data scientists. With strategies such as these, you can turn a social network into a profitable website with growing revenue.
Competitive intelligence is part of business intelligence, and is a sub-domain of market research. Bit.ly allows you to check the popularity of web pages published by you or your competitors. It also allows you to see how much traffic you contribute to a web page (one that's your own or some external web page that you promote as part of a marketing campaign), versus how much traffic other people or your competitors generate for the same web page.
Bit.ly, a URL shortener (like tinyurl.com), is the standard tool used in LinkedIn e-mail blasts to shorten long URLs with several tags. (The reason is that LinkedIn blasts are text-only.) Yet, bit.ly offers much more than redirect services; it also has interesting reporting capabilities.
This reporting capability makes bit.ly an excellent big data application, provided you are aware of its simple yet incredibly powerful dashboard. It is also a classic example of big data that is useless unless you know which features come with it and how to leverage them.
Here is a useful example. Consider Figure 6-2, which shows a screenshot of a bit.ly dashboard:
You can find the amount of traffic (page views) from your bit.ly redirects in the “Clicks via Your Shortlink” column. Keep in mind that the same link might have been shortened by other people. That's why there is an extra column labeled “Total Clicks” that tells you the total traffic coming to the target URL, both from you and from other people who used bit.ly to shorten the same URL.
NOTE Other important competitive intelligence tools besides bit.ly include Quantcast.com, Compete.com, and Alexa.com (which is the least reliable).
Assessing traffic from the competition is useful for website publishers or media companies. For the goPivotal link created at 5:15 p.m. (Figure 6-2), 155 clicks came from me and another 114 from competitors, so I delivered more traffic than the competition on this one. Of course, not everyone uses bit.ly in their marketing campaigns, but many do because it offers the client traffic statistics by referrer, country, and day. To see them, simply add a + sign at the end of the shortened URL: https://bit.ly/11xTDUD+.
Note that the goPivotal link created at 7:22 p.m. contains special proprietary tags and is thus tracked separately. (The bit.ly URL is different, although it points to the same page.) That's why it has 0 clicks attributable to third parties.
Still continuing with Figure 6-2, the link for “Three classes of metrics” is internal to Data Science Central, so I know that the extra 53 clicks (53 = 155–102) are from people who shared the article with others using bit.ly. If you click “Three Classes of Metrics” on the bit.ly dashboard, then you know who shared the link and how much traffic they generated (in this case, 53 clicks). This information allows you to track who contributes solid traffic to your website.
RealTime bit.ly is still a beta project (you can access it at http://rt.ly/), but if you use the real-time feature it allows you to fine-tune e-mail campaigns in real time, via A/B testing. For example, if you want to send an e-mail to 100,000 subscribers, you would:
In this example, the difference between A and B is the subject line, and the success metric is the open rate or click-out rate: in A, the subject line is “Weekly Digest - April 22,” and in B, it is “Best articles from DSC.” Note that when doing e-mail campaign optimization, you are usually not interested in optimizing just one campaign, but all your campaigns. If you want to optimize one campaign, subject line B would win. For global, long-term optimization, A will win.
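Deciding whether one subject line truly beats the other boils down to comparing two proportions. Here is a minimal sketch using the standard two-proportion z-test; the open counts below are invented for illustration, not actual campaign figures:

```python
from math import sqrt, erf

def two_proportion_z(opens_a, sent_a, opens_b, sent_b):
    """Two-sided z-test for a difference in open rates between two e-mail splits."""
    p_a, p_b = opens_a / sent_a, opens_b / sent_b
    p = (opens_a + opens_b) / (sent_a + sent_b)          # pooled open rate
    se = sqrt(p * (1 - p) * (1 / sent_a + 1 / sent_b))   # standard error under H0
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return z, p_value

# Hypothetical split: 10,000 e-mails sent with each version of the subject line
z, p = two_proportion_z(opens_a=1500, sent_a=10000, opens_b=1650, sent_b=10000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these made-up counts, a 15 percent versus 16.5 percent open rate on 10,000 recipients each is significant at the 1 percent level, which is why a 100,000-subscriber list can be tested on small splits first.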
Building a great list of Twitter followers and judiciously using #hashtags seems to be the Holy Grail to grow traffic and reap rewards from social media. A good reference on the subject is the Mashable article “How to Get the Most Out of Twitter #Hashtags” which can be accessed at http://mashable.com/2009/05/17/twitter-hashtags/.
The reality is very different. First, when was the last time you read a Tweet? Probably long ago. Also, hashtags tend to discourage people from clicking your link. In practice, even with 6,000 real followers, a Tweet will generate 0 clicks at worst and 20 clicks at best. This is true if you count only clicks coming straight from your Tweet and ignore indirect traffic and the multiplier effect described next.
Here's an example of a Tweet with two hashtags, in case you've never seen one before:
“naXys to Focus on the Study of Complex Systems”: http://ning.it/12EFrsc #bigdata #datascience
So are Tweets useless? No, they actually work. First, if your Tweet gets re-Tweeted, it is not difficult to generate tons of additional clicks that can be attributed to Twitter (and even more from other sources), even for subjects as specialized as data science. The Twitter multiplier can be as high as 3.
For instance, a great article published in our Data Science Central weekly digest gets 800 clicks that are directly attributed to the eBlast sent to more than 60,000 subscribers. After 12 months, the article has been viewed 12,000 times. When you analyze the data, you find that 2,400 of the total visits can be attributed to Twitter (people reposting your link and sharing it via Twitter).
There are also some other positive side effects. Some publishers run web crawlers to discover, fetch, and categorize information, and #hashtags are one of the tools they use to detect and classify pieces of news. Data Science Central is one of them—both as a crawler and as a crawlee. The same publishers also publish lists of top data scientists. Their lists are based on who is popular on specific hashtags, regardless of whether the Tweets in question are generated automatically (syndicated content) or by a human being.
And while a Tweet full of special characters is a deterrent for the average Internet user, it is much less so for geeks. (If you read this book, chances are that you are, at least marginally, a geek.) However, you must still comply with the following rules:
Otherwise, people will stop reading what you post and then you are faced with the challenge of finding new followers faster than you are losing old ones.
Hashtags are particularly suited to commenting on a webinar as it goes live: INFORMS could use the #INFORMS2013 hashtag for people commenting on its yearly conference. Hashtags are also helpful to promote or discuss an event. For instance, we created #abdsc to reward bloggers Tweeting about us.
Still, it is not so easy to be listed under Top rather than All People on Twitter when posting a Tweet. And if you are not in the Top list, your hashtag has no value. In addition, you can be a victim of hashtag hijacking: hashtag spammers abusing a popular hashtag to promote their own stuff.
Figure 6-3 is based on recent posts at Data Science Central tracked via bit.ly and shows that a blog post can easily see its traffic statistics triple (rightmost column), thanks to other people who like what you wrote and re-Tweet or re-post it on their blogs. Note that the fewer internal clicks, the bigger the multiplier: when we don't do much to promote one of our posts, then other bloggers promote it on our behalf. The most extreme case here is one internal click with multiplier 53.
An easy way to check whether hashtags provide a lift is to perform some A/B testing. You can use the following protocol:
For example, your Tweet (hashtag version) could be something like this:
Testing Twitter hashtags performance https://bit.ly/11wpSXi #datascience #hashtagABtest
Note that when you click https://bit.ly/11wpSXi, the query string shows up in the reconstructed URL in your browser (after the redirect). What's more, you can monitor how many clicks you get by looking at https://bit.ly/11wpSXi+ (same bit.ly URL, with + added at the end).
After 20 days, you can now perform a statistical analysis to check the following:
You can detect whether click variations are explained by time, weekend versus weekday, or some other factor such as the link itself; whether hashtags contribute positively; and, if they do, quantify this contribution.
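The 20-day analysis described above can be sketched as an ordinary least-squares regression of daily clicks on a hashtag flag and a weekend flag. The data here are simulated purely for illustration, since the real click logs would come from bit.ly:

```python
import numpy as np

rng = np.random.default_rng(0)
days = 20
hashtag = np.arange(days) % 2                  # alternate: 1 = Tweet had hashtags
weekend = (np.arange(days) % 7) >= 5           # crude weekend flag
# Simulated clicks: baseline 10/day, +3 for hashtags, -2 on weekends, plus noise
clicks = 10 + 3 * hashtag - 2 * weekend + rng.normal(0, 1, size=days)

# Design matrix: intercept, hashtag flag, weekend flag
X = np.column_stack([np.ones(days), hashtag, weekend])
coef, *_ = np.linalg.lstsq(X, clicks, rcond=None)
print(f"hashtag lift ≈ {coef[1]:.2f} clicks/day, weekend effect ≈ {coef[2]:.2f}")
```

The fitted coefficient on the hashtag flag is the quantified contribution; its sign tells you whether hashtags help or hurt after controlling for the day of the week.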
Micro-memes are emergent topics for which a hashtag is created, used widely for a few days, and then disappears. How can they be detected and leveraged? Journalists and publishers are particularly interested in this.
TrendSpottr and Top-Hashtags are websites offering this type of service. However, they rely on sample data, which is a major drawback when trying to discover new, rare events that suddenly spike. It is especially difficult because these memes use made-up keywords that never existed before, such as #abdsc or #hashtagABtest.
Also, Top-Hashtags worked with Instagram, but not with Twitter, last time I checked. Hashtags also work on Google+. For data science, the most interesting large social networks are LinkedIn (more professional), Google+, Quora (more geeky), and Twitter. So identifying, in real time, new trends emerging on these networks is important.
One solution consists of following the most influential people in your domain and getting alerts each time one of their new hashtags gets 10+ Tweets. If some new hashtag from an unknown person starts creating a lot of buzz, at least some of these influencers will quickly notice, and you will, too, because you follow them. Just pick the right list of keywords and hashtags to identify top influencers: data science, Hadoop, big data, business analytics, visualization, predictive modeling, and so on. Some lists of top influencers are publicly available.
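The alerting logic just described is easy to sketch: count the hashtags appearing in Tweets from your watched influencers and flag any previously unseen tag crossing the 10-Tweet threshold. The Tweets below are invented; in practice you would fetch them via the Twitter API:

```python
import re
from collections import Counter

ALERT_THRESHOLD = 10  # fire an alert once a new hashtag reaches 10+ Tweets

def extract_hashtags(text):
    """Return the lowercase hashtags found in one Tweet."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]

def new_hashtag_alerts(tweets, known_tags):
    """Count hashtags across influencer Tweets; alert on unseen tags at threshold."""
    counts = Counter(tag for t in tweets for tag in extract_hashtags(t))
    return {tag: n for tag, n in counts.items()
            if n >= ALERT_THRESHOLD and tag not in known_tags}

# Toy stream: 12 Tweets mentioning a brand-new tag, a few with a known one
stream = ["Big news! #newmeme"] * 12 + ["Weekly digest #datascience"] * 3
alerts = new_hashtag_alerts(stream, known_tags={"datascience", "bigdata"})
print(alerts)  # → {'newmeme': 12}
```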
The key message of this discussion is twofold: currently, good algorithms to detect signal from noise and categorize Tweets are yet to be developed, and they involve data science and NLP (natural language processing). Such algorithms are particularly useful to website publishers, content managers, and journalists.
These big data problems probably impact many search engines, which is one piece of evidence that there is still room for new startups to invent superior search engines. Data scientists are well equipped to identify these problems and quantitatively measure whether they are big or small issues, to help prioritize them. These problems can be fixed with improved analytics and data science. Here are the problems and the solutions.
Google does not do a good job of showing new or recently updated web pages. Of course, new does not mean better, and the Google algorithm favors old pages with a good ranking, on purpose—maybe because the ranking for new pages is less reliable or has less history. (That's why I created statistical scores to rank web pages with no history.) To solve this problem, the user can add 2013 to Google searches, and Google could do that by default, too. For instance, compare search results for the query data science with those for data science 2013. Which one do you like best? Better yet, Google should allow you to choose between “recent” and “permanent” search results when you do a search.
The issue here is to correctly date web pages, which is a difficult problem because webmasters can use fake timestamps to fool Google. But because Google indexes most pages every couple of days, it's easy to create a Google timestamp and keep two dates for each (static) web page: date when first indexed and date when last modified. You also need to keep a 128-bit signature (in addition to related keywords) for each web page to easily detect when it is modified. The problem is more difficult for web pages created on the fly.
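One way to sketch the 128-bit signature idea: MD5 produces exactly 128 bits, which is fine here because the goal is change detection, not cryptographic security. Comparing a stored signature with a freshly computed one tells the indexer whether the page was modified since the last crawl:

```python
import hashlib

def page_signature(html):
    """128-bit signature of a (static) web page, used to detect modifications."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()

old = page_signature("<html><body>Original article</body></html>")
new = page_signature("<html><body>Original article, updated</body></html>")
print("page modified:", old != new)  # → page modified: True
```

Alongside the signature, the indexer keeps two dates per page (first indexed, last modified); the signature simply makes the "last modified" update cheap and tamper-resistant compared with trusting webmaster-supplied timestamps.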
You write an article on your blog. It then gets picked up by another media outlet, say The New York Times. Google displays The New York Times version at the top and sometimes does not even display the original version at all, even if the search query is the title of the article, using an exact match. You might argue that The New York Times is more trustworthy than your little unknown blog or that your blog has a poor page rank. But this has two implications:
It creates a poor user experience. To the Google user, the Internet appears much smaller than it actually is.
Webmasters can use unfair strategies to defeat Google: using smart Botnet technology with the right keyword mix to manufacture tons of organic Google queries, but few clicks, and only on their own links. This lowers the organic CTR for competitors while boosting yours. In theory, this “attack” would fix your content attribution problem.
Another, meaner version of the CTR attack consists of hitting the culprit website with lots of organic Google clicks (Google redirects) but no Google impression. This could fool Google algorithms into believing that the website in question is engaging in CTR fraud by manipulating click-in ratios, and thus get them dropped from Google. At the same time it will create a terrible impression-to-conversion (click-out) ratio for advertisers showing up on the incriminated landing pages, hitting display ads particularly hard, causing advertisers to drop the publisher. In short, this strategy could automatically take care of copyright infringement using two different levers at once: click in and click out. It must be applied to one violator at a time (to avoid detection) until they are all gone. If properly executed, nobody will ever know who did the attack; maybe nobody will ever find out that an attack took place.
One easy way for Google to fix the problem is, again, to correctly identify the first version of an article, as described in the previous paragraph.
Google generates a number of search result impressions per week for every website, and this number is extremely stable. It is probably based on the number of pages, keywords, and popularity (page rank) of the website in question, as well as a bunch of other metrics (time to load, proportion of original content, niche versus generic website, and so on). If every week Google shows exactly 10,000 impressions for your website, which page or keyword match should Google favor?
Google should favor pages with low bounce rate. In practice, it does the exact opposite.
Why? Maybe if a user does not find your website interesting, he performs more Google searches and the chance of him clicking a paid Google ad increases. So bad (landing page) user experience financially rewards Google.
How do you fix it? If most users spend little time on a web page (and Google can easily measure time spent), that web page (or better, that web page or keyword combination) should be penalized in the Google index to show up less frequently. Many publishers also use Google Analytics, which provides Google with additional valuable information about bounce rate and user interest at the page level.
However, you might argue that if bounce rate is high, maybe the user found the answer to his question right away by visiting your landing page, and thus the user experience is actually great. I disagree, since at Data Science Central, each page displays links to similar articles and typically results in subsequent page views. Indeed, the worst bounce rate is associated with Google organic searches. More problematic is the fact that the bounce rate from Google organic is getting worse (while it's getting better for all other traffic sources), as if the Google algorithm lacks machine learning capabilities or is doing a poor job with new pages added daily. In the future, longer articles broken down into two or three pages will hopefully improve the bounce rate from Google organic (and from other sources as well).
Here you focus on ad relevancy in the context of online advertising such as search engine advertising. When you see ads on search result pages or elsewhere, the ads that display in front of your eyes (should) have been highly selected to maximize the chance that you convert and generate ad revenue for the publisher or search engine.
If you see irrelevant ads, either they are priced cheaply or Google's ad relevancy algorithm is not working well. Sometimes it's due to cheap ads, sometimes to poor scoring, and you could argue that sometimes the advertiser is simply doing a poor job of keyword selection and targeting.
Ad scoring algorithms used to be simple, the score being a function of the max bid paid by the advertiser and the click-through rate (CTR). This led to abuses: an advertiser could generate bogus impressions to dilute a competitor's CTR, or clicks on its own ads to boost its own CTR, or a combination of both, typically using proxies or Botnets to hide the scheme, thus gaining an unfair advantage over competitors. Note that search engines use traffic quality scores to fight this. Indeed, I have been hired in the past, as a data scientist, to reverse engineer some of these traffic quality algorithms, as well as to perform stress tests to see what makes them fail.
Recently, in addition to CTR and max bid, ad networks have added ad relevancy to their ad scoring mix (that is, to the algorithm used to determine which ads you will see, and in which order). In short, ad networks don't want to display ads that will make the user frustrated — it's all about improving user experience and reducing churn to boost long-term profits.

How Does Ad Relevancy Scoring Work?
There are three components in the mix:
The fourth important component—the landing page—is not considered in this short discussion. (Good publishers scrape advertiser landing pages to check the match between a text ad and its landing page, and eliminate bad adware.)
You will need to create three taxonomies:
The two important taxonomies are B and C, unless the ad is displayed on a generic web page, in which case A is more important than B. So ignore taxonomy A for now. The goal is to match a category from Taxonomy B with one from Taxonomy C. Taxonomies may or may not have the same categories, so in general it will be a fuzzy match where, for instance, the page hosting the ad is attached to the categories Finance/Stock Market in Taxonomy B, while the ad is attached to the categories Investing/Finance in Taxonomy C. So you need to have a system in place to measure distances between categories belonging to two different taxonomies.
There are a lot of vendors and open source solutions available on the market, but if you want to build your own taxonomies from scratch, here's one way to do it:
Say that (X, Y) is such a pair. Compute n1 = # of occurrences of X in your table, n2 = # of occurrences of Y in your table, and n12 = # of occurrences where X and Y are associated (for example, found on the same web page). A metric that tells you how close X and Y are to each other would be R = n12 / SQRT(n1 x n2). With this similarity metric (used, for instance, at http://www.frenchlane.com/kw8.html) you can cluster keywords via hierarchical clustering and eventually build a taxonomy—which is nothing more than an unsupervised clustering of the keyword universe, with labels manually assigned to (say) the top 20 clusters, each representing a category.
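The metric above is straightforward to compute from co-occurrence counts. Here is a minimal sketch (the keyword counts are invented) that scores pairs with R = n12 / SQRT(n1 x n2) and then groups keywords greedily whenever R exceeds a threshold, a crude stand-in for full hierarchical clustering:

```python
from math import sqrt

# Hypothetical counts: n[k] = occurrences of keyword k; n12[(a, b)] = co-occurrences
n = {"finance": 400, "stock market": 250, "investing": 300, "hadoop": 150}
n12 = {("finance", "stock market"): 180, ("finance", "investing"): 150,
       ("stock market", "investing"): 90, ("finance", "hadoop"): 5}

def similarity(a, b):
    """R = n12 / sqrt(n1 * n2); close to 1 means the keywords co-occur often."""
    return n12.get((a, b), n12.get((b, a), 0)) / sqrt(n[a] * n[b])

# Greedy single-link grouping at a similarity threshold of 0.3
clusters = []
for kw in n:
    for c in clusters:
        if any(similarity(kw, other) > 0.3 for other in c):
            c.add(kw)
            break
    else:
        clusters.append({kw})
print(clusters)  # finance/stock market/investing group together; hadoop stands alone
```

In a real taxonomy build you would feed the full R matrix to a proper hierarchical clustering routine and then label the top clusters by hand, as described above.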
When an advertiser uses (say) five versions of an ad for the same landing page and the same AdWords keyword group (keywords purchased on Google), how frequently should each of the five ads show up? From Google's perspective, the optimum is achieved by the ad providing the highest revenue per impression, whether the advertiser is charged by impression (CPM model) or by the click (pay-per-click). For the advertiser, the ad commanding the lowest cost-per-click (CPC) is sometimes the optimum one, although it might generate too few clicks.
However, Google cannot simply show the ad with the highest revenue per impression 100 percent of the time. This metric varies over time, and in 2 weeks the optimum ad could change, as users get bored of seeing the same ad all the time or new keywords are purchased. So Google will show the five ads semi-randomly: the optimum ad is displayed much more frequently, and the worst one very rarely.
But how do you define “much more frequently” and “very rarely”? You would expect the answer to depend on the CTR multiplied by the CPC. Interestingly, it seems that Google has chosen a logarithmic relationship between how frequently an ad is displayed and its CTR, while the CPC is ignored in Google's model, at least in this example.
What do you think about Google's choice of a logarithmic distribution? Is it:
And why is the CPC decoupled? Maybe to avoid fraud and manipulations by advertisers.
If you check the figures at http://www.analyticbridge.com/group/webanalytics/forum/topics/ad-serving-optimization you'll see that the optimum ad is served 76.29 percent of the time, and the second best one only 10.36 percent of the time, despite both having a similar CTR and a similar CPC. The CPC is not taken into account, just the CTR; models involving the CPC have a worse R^2 (goodness of fit), and models other than logarithmic also have a worse R^2 than the logarithmic one.
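Here is a hedged sketch of how such a model comparison could be run: fit share-of-impressions against log(CTR) for the five ads and compare the R^2 with a linear-in-CTR model. The numbers below are invented for illustration (constructed to follow a log relationship), not the figures from the link above:

```python
import numpy as np

# Invented serving stats for five versions of one ad: CTR and impression share
ctr   = np.array([0.080, 0.040, 0.020, 0.010, 0.005])
share = np.array([0.49, 0.37, 0.26, 0.15, 0.04])

def r_squared(x, y):
    """R^2 of a simple least-squares fit of y on an intercept plus x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - resid.var() / y.var()

r2_log = r_squared(np.log(ctr), share)  # share as a linear function of log(CTR)
r2_lin = r_squared(ctr, share)          # share as a linear function of CTR
print(f"R^2 log model: {r2_log:.3f}, linear model: {r2_lin:.3f}")
```

With data generated from a log relationship, the log model's R^2 is near 1 while the linear model's is visibly lower; on real serving data, this is exactly the comparison used to argue that the relationship is logarithmic.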
This section discusses a few data science applications in various other contexts. These applications are routinely encountered in today's markets.
A common challenge is how to make more accurate sales forecasts. The solution is to leverage external data and simplify your predictive model. In 2000, I was working with GE's analytic team to improve sales forecasts for NBC Internet, a web portal owned by NBC. The sales/finance people were using a basic formula to predict next month sales based mostly on sales from the previous month. With GE, I started to develop more sophisticated models that included time series techniques (ARMA—auto regressive models) and seasonality but still were entirely based on internal sales data.
Today, many companies still fail to use the correct methodology to get accurate sales forecasts. This is especially true for companies in growing or declining industries, or when volatility is high due to macroeconomic, structural factors. Indeed, the GE move toward using more complex models was the wrong move: it did provide a small lift, but failed to generate the huge lift that could be expected from switching to the right methodology.
Most companies with more than 200 employees use independent silos to store and exploit data. Data from the finance, advertising/marketing, and operation/inventory management/product departments are more or less independent and rarely merged together to increase ROI. Worse, external data sources are totally ignored.
Even each department has its own silos: within the BI department, data regarding paid search, organic search, and other advertising (print, TV, and so on) are treated separately by data analysts who don't talk to each other. Although lift metrics from paid search (managed by SEM people), organic search (managed by SEO people), and other advertising are clearly highly dependent from a business point of view, interaction is ignored and the different channels are independently—rather than jointly—optimized.
If a sale results from a TV ad seen 3 months ago, together with a Google ad seen twice last month, and also thanks to good SEO and landing page optimization, it will be impossible to accurately attribute the dollar amount to the various managers involved in making the sale happen. Worse, sales forecasts suffer from not using external data and econometric models.
For a startup (or an old company launching a new product), it is also important to accurately assess sales growth using auto-regressive time series models that take into account advertising spend and a decay function of time. In the NBC Internet example, TV ads were found to have an impact for about 6 months, and a simple but good model would be:

Sales(t) = f(Sales(t − 1), Sales(t − 2), ...) + g(AdSpend(t), AdSpend(t − 1), ..., AdSpend(t − 6))

where the time unit is 1 month (28 days is indeed better), and both g and f are functions that need to be identified via cross-validation and model-fitting techniques (the f function corresponding to the ARMA model previously mentioned, and g encoding the advertising decay over the 6-month window).
Following is a list of factors influencing sales:
Identify the top four metrics that drive sales among the metrics that I have suggested (do not ignore external data sources—including a sentiment analysis index by product, that is, what your customers write about your products on Twitter), and create a simple regression model. You could get it done with Excel (use the data analysis plug-in or the linest function) and get better forecasts than you would using a more sophisticated model based only on internal data coming from just one of your many data silos.
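As an alternative to Excel's linest, the same four-metric regression takes a few lines of code. The drivers and coefficients below are simulated purely for illustration; in practice the four columns would be your chosen internal and external metrics (ad spend, sentiment index, and so on):

```python
import numpy as np

rng = np.random.default_rng(42)
months = 36
# Simulated monthly drivers, e.g., ad spend, price index, sentiment, seasonality
X = rng.normal(size=(months, 4))
true_beta = np.array([3.0, -1.5, 2.0, 0.8])   # invented "true" effects
sales = 100 + X @ true_beta + rng.normal(0, 1, size=months)

# Fit sales = intercept + b1*x1 + ... + b4*x4, exactly what linest computes
design = np.column_stack([np.ones(months), X])
beta, *_ = np.linalg.lstsq(design, sales, rcond=None)
print("intercept:", round(beta[0], 1), "drivers:", np.round(beta[1:], 1))
```

The point of the exercise is not the fitting tool but the inputs: a simple regression on the four right metrics, merged from several silos, beats a sophisticated model fed only one silo's data.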
You need to hire some sort of a management consultant with analytic acumen who will interact with all departments in your organization, gather, merge, and analyze all data sources from most of your silos, integrate other external data sources, and communicate with both executives and everybody in your organization who owns or is responsible for a data silo. He will recommend a solution. Conversations should include data quality issues, which metrics you should track moving forward, and how to create a useful dashboard for executives.
The Washington Education Association (WEA, in Washington State) is partnering with Aon Hewitt (Illinois), a verification company, to eliminate a specific type of health insurance fraud: teachers reporting non-qualifying people as dependents, such as an unemployed friend with no health insurance. The fraud is used by “nice” people (teachers) to provide health insurance to people who would otherwise have none, by reporting them as spouses or kids.
Interestingly, I saw the letter sent to all WEA teachers. It requires them to fill out lots of paperwork and provide multiple identity proofs (tax forms, birth certificates, marriage certificates, and so on) similar to ID documents (I9 form) required to be allowed to work for a company.
It is easy to cheat on the paper documentation that you have to mail to the verification company (for example, by producing fake birth certificates or claiming you don't have one). In addition, asking people to fill out so much paperwork is a waste of time and natural resources (trees used to produce paper), results in lots of errors, privacy issues, and ID theft risks, and costs WEA a lot of money.
So why don't they use modern methods to detect fraud: data mining techniques to detect suspicious SSNs, identifying SSNs reported as dependent by multiple households based on IRS tax data, SSNs not showing up in any tax forms submitted to the IRS, address mismatch detection, and so on? (Note that a 5-day-old baby probably has no record in the IRS database, yet he is eligible as a dependent for tax or health insurance purposes.)
Why not use data mining technology, instead of paper, to get all of the advantages that data mining offers over paper? What advantages does paper offer? I don't see any.
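The SSN cross-checks suggested above reduce to simple aggregation over claim records. Here is a toy sketch with fabricated records; a real implementation would run against insurer enrollment files and IRS data:

```python
from collections import defaultdict

# Fabricated claim records: (household_id, dependent_ssn)
claims = [("H1", "111-22-3333"), ("H2", "111-22-3333"),
          ("H3", "444-55-6666"), ("H1", "777-88-9999")]
irs_dependents = {"444-55-6666", "777-88-9999"}  # SSNs seen on IRS tax forms

households_by_ssn = defaultdict(set)
for household, ssn in claims:
    households_by_ssn[ssn].add(household)

# Flag 1: the same dependent SSN claimed by multiple households
multi_claimed = {s for s, h in households_by_ssn.items() if len(h) > 1}
# Flag 2: a claimed dependent that never appears on any tax form
unmatched = {ssn for _, ssn in claims} - irs_dependents
print("claimed by multiple households:", multi_claimed)
print("absent from IRS records:", unmatched)
```

Flagged SSNs would then go to manual review (remembering the newborn caveat above: a legitimate dependent may simply have no IRS record yet), replacing blanket paperwork with targeted verification.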
I was once asked to perform a study for NBC. It was running advertising using a weekly mix of TV shows at various times to promote its Internet website. I was asked to identify which TV shows worked best to drive new users to the website. Each week, it was using pretty much the same mix of 10 to 12 shows out of 20 different shows. GRT data (number of eyeballs who watched a particular TV show at a particular time) were collected, as well as weekly numbers of new users, over a 6-month time period. No A/B testing was possible, so the problem was quite complicated.
Here's how I solved the problem. I computed a distance for all pairs of weeks {A, B}, based on which TV shows were used in the weeks in question, as well as the total GRT of each week (high GRT results in many new users unless the TV ad is really bad). I then looked at pairs of weeks that were very similar (in terms of GRTs and show mixes) and close in time (no more than 2 months apart). I focused on pairs where week A had one more show, say show S, compared with week B. Week B did not have S but instead had an extra show not shared with A, say show T. Actually, it was a bit more complicated than that. I allowed not just one, but two different shows. But for the sake of simplicity, stick with this example: show S in week A and not in week B, and show T in week B but not in week A. If week A had more new users than B, I would adjust the performance index of shows S and T as follows: P(S) = P(S) + 1, P(T) = P(T) − 1. The converse was done if A had fewer new users than B.
To identify the best performing shows, all that was left to do was to rank shows by their performance indexes. Law & Order was the winner and was used more often in the mix of TV shows used to promote the website.
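The pairing-and-update scheme above can be sketched as follows, with invented weekly data. (The real study first selected comparable weeks using GRT-based distances; this simplified version compares week pairs directly whenever they differ by exactly one show.)

```python
from collections import defaultdict
from itertools import combinations

# Invented data: each week = (set of shows aired, new users that week)
weeks = [({"Law & Order", "ER", "Friends"}, 5200),
         ({"ER", "Friends", "Frasier"}, 4100),
         ({"Law & Order", "Friends", "Frasier"}, 5000),
         ({"ER", "Frasier", "Friends"}, 4300)]

perf = defaultdict(int)
for (shows_a, users_a), (shows_b, users_b) in combinations(weeks, 2):
    only_a, only_b = shows_a - shows_b, shows_b - shows_a
    if len(only_a) == 1 and len(only_b) == 1:   # weeks differ by exactly one show
        s, t = only_a.pop(), only_b.pop()
        delta = 1 if users_a > users_b else -1
        perf[s] += delta                        # the show in the better week gains
        perf[t] -= delta

ranking = sorted(perf.items(), key=lambda kv: -kv[1])
print(ranking)  # Law & Order comes out on top with these invented numbers
```

Ranking shows by this accumulated index is all that remains, exactly as described above.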
This example can be used for training purposes as part of the curriculum to become a data scientist. This section describes the different steps involved in a typical, real analysis: from understanding the problem (how to forecast meteorite hits) to providing an answer.
Conclusions are based on manual estimates, which make sense in this context. This section also describes how much a consultant would charge for this analysis, disproving the notion that data scientists are necessarily expensive.
The conclusions of the statistical analysis are different from what you've read in the press recently: the Russian meteor is an event said to occur once every 40 years, according to journalists.
Following is a statistical analysis in eight steps:
Data about identified meteors—those that have not hit Earth yet but occasionally fly by—are not included in this analysis. Identifying and forecasting the number of such meteors would be the scope of another similar project.
These patterns are:
There are nine group sizes with at least one observation: from size 0 for the smallest meteorites to size 8, which represents a 9-to-66-ton meteorite like the one that exploded over Russia. According to the data set, no size 8 was found in the last 40 years, and maybe this is why people claim that the 2013 Russian bang is a once-every-40-years event. But there were five occurrences between 1900 and 1950; I am not sure how reliable these old numbers are. Here, the size is computed as INT(0.5 * log m), where m is the mass of the meteorite in grams and log is the natural logarithm.
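The size formula maps mass to a group in one line. (Taking log to be the natural logarithm is an inference that can be checked against the 9-to-66-ton range quoted for size 8; the individual masses below are examples, not rows from the data set.)

```python
from math import log

def size_group(mass_grams):
    """Size = INT(0.5 * ln m), m in grams; size 8 covers roughly 9 to 66 tons."""
    return int(0.5 * log(mass_grams))

print(size_group(9_000_000))    # 9 tons  -> size 8 (lower end of the range)
print(size_group(60_000_000))   # 60 tons -> still size 8
print(size_group(500))          # a 500 g find -> size 3
```

With natural log, size 8 spans e^16 ≈ 8.9 × 10^6 g to e^18 ≈ 6.6 × 10^7 g, that is, roughly 9 to 66 tons, matching the range above.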
Also look at size ratios, defined as the number of meteorites of size k divided by the number of meteorites of size k + 1 (for a given decade), to build a predictive model. Analyze how variations occur in these ratios, as well as ratio variations across decades for a fixed size k.
A quick regression where Y = ratio (1950–2010 aggregated data) and X = size shows a perfect linear fit (R^2 > 0.98), as shown in the top figure at http://www.analyticbridge.com/profiles/blogs/great-statistical-analysis-forecasting-meteorite-hits. In the figure, I have excluded size 0 (it has its own problems), as well as sizes 8 and 7, which are rare (only one occurrence for size 8 during the time period in question). The total number of meteorites across all sizes is 31,690.
I tried separate regressions (size versus ratio) for each decade to see if a pattern emerges. A pattern does emerge, although it is easier to detect with the brain than with a computer: the slope of the regression decreases over time (very, very roughly and on average), suggesting that more small meteorites have been found recently, whereas the number of large meteorites does not increase much over time (for size 5+).
The ratios in white font/red background in the spreadsheet represent manual estimates (sometimes overwriting actual data) based on interpolating the most stable patterns. No cross-validation was performed. Much of the validation consisted of identifying islands of stability in the ratio summary table in cells Y1:AG8.
A formula to compute yearly occurrences as a function of actual weight can easily be derived:
Yearly_Occurrences(weight) = 1/(A + B* log(weight))
Here's an exercise for you: estimate A and B based on Figure 6-4.
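The formula's shape can be sketched as follows. The coefficients A and B below are made up for illustration (solving the exercise requires Figure 6-4); they are chosen only so that a roughly 30-ton meteorite comes out near one hit every 40 years.

```python
import math

# Hypothetical coefficients -- NOT the answer to the exercise
A, B = -132.0, 10.0

def yearly_occurrences(weight_grams):
    # Expected number of hits per year for a meteorite of the given weight;
    # natural log assumed, consistent with the size formula earlier
    return 1.0 / (A + B * math.log(weight_grams))

# With these made-up values, a 30-ton (3e7 gram) meteorite hits
# roughly once every 40 years
print(round(1 / yearly_occurrences(3e7)))  # ~40
```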
The table makes the “every 40 years” claim for the 2013 Russian bang plausible. The Russian meteorite might have exploded louder than expected because of its angle, composition, or velocity. Or maybe the reported weight of the meteorite was estimated from the strength of its explosion, rather than the other way around.
There were no meteorites of sizes 9, 10, 11, or 12 in the data set and few of size 8, so the associated risks (for these sizes) are based on pure interpolation. Yet they constitute the most interesting and valuable part of this study.
This is a small analysis, and there is no real metric to measure model performance or ROI for this project; many projects have such metrics, just not this one. It is also unusual for a statistician or data scientist to make educated guesses and handmade predictions, though it is commonplace in engineering.
This section illustrates that data science can be used in many different contexts. It also focuses on experimental design and sampling, a domain of statistical science that should be included in data science. Finally, it shows that “no data” does not mean you can't do anything. When data are initially absent, data scientists should know how to first create the data (in a way that supports analysis) and then track it.
Trailhead parking lots are notorious for criminal activity. These remote, unpatrolled locations, hidden in dense forests, attract all sorts of bad people. One unusual crime involves thieves stealing the park fees paid by hikers. These fees consist of $5 bills inserted by trail users (hikers) into an envelope and dropped in a locked mailbox at the trailhead. These wooden boxes are easy to break open and repair, making it possible to steal some of the $5 bills. A smart thief would probably visit 10 busy trailhead parking lots once a week, at night, and steal 50 percent of the envelopes. (Stealing 100 percent would attract the police's attention.) The scheme could easily generate a $50,000 income a year, with little work and low risk, especially because so many agencies are involved in charging and collecting these fees. A smart thief would steal only $5,000 a year from each agency but would target 10 different agencies associated with 10 different popular hikes.
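The scheme's arithmetic can be checked with a back-of-envelope calculation; the number of paying hikers per lot per week is an assumption introduced here, chosen to land near the $50,000 figure.

```python
# Back-of-envelope check of the theft scenario described above
fee = 5                 # dollars per envelope
lots = 10               # trailhead parking lots visited
weeks = 52              # one visit per lot per week
envelopes_per_lot = 40  # assumed paying hikers per lot per week (hypothetical)
steal_fraction = 0.5    # thief takes half the envelopes

yearly_take = fee * lots * weeks * envelopes_per_lot * steal_fraction
print(yearly_take)  # 52000, close to the $50,000 figure in the text
```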
I'm not saying this crime is happening, but it is a possibility. Another type of crime would be perpetrated by official employees collecting these fees and keeping a small percentage for themselves.
How would you detect this crime? Installing webcams might be too expensive in these remote locations with no Internet and no cell phone service. How can you prevent this crime from happening? And how can you make sure every hiker pays the fee, usually $5 per day? The remainder of this section addresses some of these questions.
We asked readers on Data Science Central to answer the question: how can you detect crime and fix it with data science? The answers, which offer real data science solutions, include using experimental design, creating data, tracking numbers on dollar bills or money collected by employees, employee rotation, and others. (If you would like to see more of the answers in more detail, see http://www.analyticbridge.com/profiles/blogs/how-would-you-detect-and-stop-this-type-of-crime in the comments section.)
Data science is applied in many other domains and industries. For instance:
Data science applications cover most fields: sensor data, mobile data, smart grids (optimizing electrical networks and electricity pricing), econometrics, clinical trials, operations research (optimizing business processes), astronomy, plagiarism and copyright infringement detection, and even computational chemistry (designing new molecules to fight cancer).
This chapter discussed case studies, real-life applications, and success stories involving data science. It covered various types of initiatives, ranging from stock market techniques to advertising mix optimization, fraud detection, search engine optimization (for search companies), astronomy (forecasting meteorite hits), automated news feed management, data encryption, e-mail marketing, and various relevancy problems (online advertising).
The next chapter focuses on helping you get employed as a data scientist. It includes sample resumes, more than 90 job interview questions, companies that routinely hire data scientists, and typical job titles for data scientists.