Chapter 6 discussed case studies, real-life applications, and success stories covering various types of data science initiatives. This chapter focuses on helping you get hired as a data scientist. It starts with 90 job interview questions (including questions on solving small, real-life data science problems), then presents exercises to test your visual analytic skills, career moves from statistician to data scientist, examples of well-known data scientists and the skills they have, and finally typical job titles (few practitioners are actually called data scientists, though this is changing) and salary surveys.
These are mostly open-ended questions that prospective employers may ask to assess the breadth of technical knowledge of a senior candidate for a high-level position, such as a director, as well as of junior candidates. The answers to some of the key questions can be found at http://bit.ly/1cGlFA5.
Which is the most efficient way to proceed?
NOTE To test these assumptions and help you become familiar with the concept of distributed architecture, you can have your kid build two LEGO® cars, A and B, in parallel and then two other cars, C and D, sequentially. If the overlap between A and B (the proportion of LEGO® pieces that are identical in both A and B) is small, then the sequential approach will work best. Another concept that can be introduced is that building an 80-piece car takes more than twice as much time as building a 40-piece car. Why? (The same also applies to puzzles.)
For this question, you are provided with the solution below as a learning tool. First, let's introduce some notations:
Basic formula: P(C is a shared connection) = P(C is connected to you) × P(C is connected to B) = (y/N) × (x/N) = (x × y) / (N × N). Summing over all N members, the expected number of shared connections is z = N × (x × y) / (N × N) = (x × y) / N. Thus x = (z × N) / y, or N = (x × y) / z.
Step 1: Compute N
To build the table shown in Figure 7-1, I sampled a few of my connections that have fewer than 500 connections to find out what x and z were. My number of connections is y=9,670.
A first approximation (visual analytics without using any tool other than my brain!) yields N = (x × y) / z = (approx.) 500 × 9,670 / 5 = (approx.) 1 million. So my N is 1 million. Yours might be different. Note that the number of people on LinkedIn is well above 100 million.
Step 2: Compute x for a specific connection
Using the formula x = (z × N) / y, if a LinkedIn member shares 200 connections with me, he probably has approximately 20,000 connections, using y=10,000 rather than 9,670, as an approximation for my number of connections. If he shares only one connection with me, he's expected to have 100 connections. You can compute confidence intervals for x by first computing confidence intervals for N, by looking at the variations in the above table. You can also increase accuracy by using a variable N that depends on job title or location.
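The arithmetic in Steps 1 and 2 reduces to a pair of one-line functions. This is a sketch only; the names follow the notation above (x = the member's connections, y = your connections, z = shared connections, N = the effective pool size):

```python
def estimate_network_size(x, y, z):
    """Estimate N, the effective pool of candidate connections,
    from x (the member's connections), y (your connections),
    and z (the observed number of shared connections)."""
    return x * y / z

def estimate_connections(z, n, y):
    """Estimate x, a member's connection count, from z shared
    connections, pool size n, and your y connections."""
    return z * n / y

# Step 1: with x ~ 500, y = 9,670, z ~ 5, N comes out near one million.
n = estimate_network_size(500, 9_670, 5)   # ~967,000

# Step 2: with N ~ 1,000,000 and y rounded to 10,000, a member
# sharing 200 connections has roughly 20,000 connections.
x = estimate_connections(200, 1_000_000, 10_000)
```

Repeating Step 1 over several sampled connections, as in Figure 7-1, and looking at the spread of the resulting N estimates is what gives you the confidence intervals mentioned above.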
Here are some exercises you can do on your own that will help to prepare you for job interviews. These exercises are aimed at assessing your visual and analytic judgment.
Look at the three charts (A, B, and C) shown in Figure 7-2. Two of them exhibit patterns. Which ones? Do you know that these charts are called scatterplots? Are there other ways to visually represent this type of data?
It is clear that chart C exhibits a strong clustering pattern, unless you define your problem as points randomly distributed in an unknown domain whose boundary has to be estimated. So, the big question is: between charts A and B, which one represents randomness? Look at these charts closely for 60 seconds, then make a guess, and then read on. Note that all three charts contain the same number of points, so there's no scaling issue involved here.
Now assume that you are dealing with a spatial distribution of points over the entire 2-dimensional space, and that observations are seen through a small square window. For instance, points (observations) could be stars as seen on a picture taken from a telescope.
The first issue is that the data is censored: if you look at the distribution of nearest neighbor distances to draw conclusions, you must take into account that points near the boundary have fewer neighbors because some neighbors are outside the boundary. You can eliminate the bias by:
The second issue is that you need to use better visualization tools to see the patterns. Using + rather than a dot symbol to represent the points helps: some points are so close to each other that if you represent points with dots, you won't visually see the double points. (In our example, double points could correspond to double star systems—and these small-scale point interactions are part of what makes the distribution non-random in two of the charts.) But you can do much better: you could measure a number of metrics (averages, standard deviations, correlation between x and y, number of points in each subsquare, density estimates, and so on) and identify metrics proving that you are not dealing with pure randomness.
In these three charts, the standard deviation for either x or y—in case of pure randomness—should be 0.290 plus or minus 0.005. Only one of the three charts succeeds with this randomness test.
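The 0.290 figure is not magic: it is the standard deviation of a Uniform(0, 1) coordinate, sqrt(1/12) ≈ 0.289. A minimal sketch, using only the standard library, verifies it empirically on simulated random points:

```python
import math
import random

# Theoretical standard deviation of a Uniform(0, 1) coordinate:
# sqrt(1/12) ~= 0.2887, consistent with the 0.290 +/- 0.005 band.
theoretical = math.sqrt(1.0 / 12.0)

# Empirical check on 10,000 simulated "purely random" coordinates.
random.seed(42)
xs = [random.random() for _ in range(10_000)]
mean = sum(xs) / len(xs)
sd = math.sqrt(sum((v - mean) ** 2 for v in xs) / len(xs))
```

A chart whose x or y standard deviation falls well outside 0.290 ± 0.005 fails this first, crude randomness test.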
The third issue is that even if multiple statistical tests suggest that the data is truly random, it does not mean it actually is. For instance, all three charts show zero correlation between x and y and have mean x and y close to 0.50 (a requirement to qualify as random distribution in this case). However, only one chart exhibits randomness.
The fourth issue is that you need a mathematical framework to define and check randomness. True randomness is the realization of a Poisson stochastic process, and you need to use metrics that uniquely characterize a Poisson process to check whether a point distribution is truly random. Such metrics could be:
The fifth issue is that some of the best metrics (such as the distances between a point and its k nearest neighbors) might not have a simple mathematical formula. But you can use Monte Carlo simulations to address this issue: simulate a random process, compute the distribution of distances (with confidence intervals) based on thousands of simulations, and compare it with the distances computed on your data. If the distance distribution computed on the data set matches the results from the simulations, you are good: it means the data is probably random. However, you would have to make sure that the distance distribution uniquely characterizes a Poisson process and that no non-random process could yield the same distance distribution. This exercise is known as goodness-of-fit testing: you check whether your data supports a specific hypothesis of randomness.
The sixth issue is that if you have a million points (and in high dimensions you need more than a million points, due to the curse of dimensionality), then you have on the order of a trillion pairwise distances to compute, which is impractical. So instead you pick 10,000 points at random, compute their distances, and compare with equivalent computations based on simulated data. You need to make 1,000 simulations to get confidence intervals, but this is feasible.
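The sampling-plus-simulation recipe can be sketched as follows. This is an illustrative sketch only: it uses a much smaller sample (200 points, 50 simulations) than the 10,000 points and 1,000 simulations suggested above, and it uses the mean nearest-neighbor distance as the test statistic:

```python
import math
import random

def mean_nn_distance(points):
    """Average nearest-neighbor distance, brute force
    (fine for small samples)."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        best = float("inf")
        for j, (x2, y2) in enumerate(points):
            if i != j:
                best = min(best, math.hypot(x1 - x2, y1 - y2))
        total += best
    return total / len(points)

random.seed(0)
n = 200      # small sample, standing in for the 10,000 points
n_sims = 50  # the text suggests ~1,000; 50 keeps the sketch fast

# Reference distribution of the statistic under pure randomness:
sims = [mean_nn_distance([(random.random(), random.random())
                          for _ in range(n)])
        for _ in range(n_sims)]
lo, hi = min(sims), max(sims)
```

A data set whose statistic falls outside [lo, hi] is suspect: it is unlikely to be a realization of a Poisson (purely random) process. In practice you would use a k-d tree rather than the brute-force double loop.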
Here's how the data in charts A, B, and C were created:
The only chart exhibiting randomness is chart A. Chart B has standard deviations for x and y that are significantly too low, too few points near the boundaries, and too many points that are close to each other. In addition, chart A (the random distribution) exhibits a little bit of clustering, as well as some point alignments. This is, however, perfectly expected from a random distribution. If the number of points in each subsquare were identical, the distribution would not be random; it would correspond to a situation in which antagonistic forces keep points as far away from each other as possible. How would you test randomness if you had only two points (impossible to test), three points, or just 10 points?
Finally, once a pattern is detected (for example, abnormal close proximity between neighboring points), it should be interpreted and/or leveraged. That is, it should lead to, for example, ROI-positive trading rules if the framework is about stock trading or the conclusion that double stars do exist (based on chart B) if the framework is astronomy.
Pretend that I drew the picture shown in Figure 7-3, and then I told you that this is how Earth and its moon would be seen by a human being living on Saturn, which is 1 billion miles away from Earth. Earth's radius is approximately 4,000 miles, and the distance between Earth and its moon is approximately 200,000 miles. What would be your reaction? There's not a right or wrong answer; this simply checks your thought process.
Potential answers include:
Many time series charts seem to exhibit a pattern: an uptrend, apparent periodicity, or a stochastic process that seems not to be memoryless, and so on. The picture shown in Figure 7-4 represents stock price simulations, with the X-axis representing time (days) and the Y-axis the simulated stock price.
Do you think there is an uptrend in the data? Actually, in the long term there isn't: it's a realization of a pure random walk. At any time, the chance that it goes up or down by a given amount is exactly 50 percent. Yet for short-term periods, it is perfectly normal to observe ups and downs. That does not make this time series predictable: you could design a predictive model trained on the first 1,000 observations and then test it on the remaining 1,000 observations. You would notice that your model is no better at predicting up and down movements than randomly throwing dice.
Another test you can do to familiarize yourself with how randomness and lack of randomness look is to simulate auto-regressive time series. One of the most basic processes is X(t) = a × X(t−1) + b × X(t−2) + E + c, where t is time, c is a constant (−0.25 < c < +0.25), and E is an error term (white noise) simulated as a uniform deviate on [−1, +1]. For instance, X(t) represents a stock price at time t. You can start with X(0) = X(1) = 0. If a + b = 1, then the process is stable in equilibrium. Why? If c < 0, there is a downward trend; if c > 0, an upward trend; and if c = 0, the process goes neither up nor down (stationary process). If a = b = 0, the process is memory-free; there's no stochastic pattern embedded in it, just pure noise. Try producing some of these charts with various values for a, b, and c and see if you can visually notice the pattern and interpret it correctly. In Figure 7-4, c = 0, b = 0, a = 1. Of course, with some values of a, b, and c, the patterns are visually obvious. But if you keep both c and a + b close to zero, it is a visually more difficult exercise, and you might have to look at a long time frame for your brain to recognize the correct pattern. Try detecting the patterns with a predictive algorithm, and see when and whether your brain can beat your predictive model.
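A minimal simulation of this auto-regressive process, assuming uniform white noise on [−1, +1] as described, might look like this:

```python
import random

def simulate_ar2(a, b, c, n=2000, seed=1):
    """Simulate X(t) = a*X(t-1) + b*X(t-2) + E + c,
    with E ~ Uniform(-1, 1) and X(0) = X(1) = 0."""
    random.seed(seed)
    x = [0.0, 0.0]
    for _ in range(n - 2):
        e = random.uniform(-1.0, 1.0)
        x.append(a * x[-1] + b * x[-2] + e + c)
    return x

# Pure random walk, as in Figure 7-4: a = 1, b = 0, c = 0.
walk = simulate_ar2(1.0, 0.0, 0.0)

# Memory-free noise: a = b = c = 0, so each value is just E.
noise = simulate_ar2(0.0, 0.0, 0.0)
```

Plotting `walk` against `noise` (with matplotlib, for instance) makes the visual difference between a random walk and pure noise immediately apparent; varying a, b, and c reproduces the other cases discussed above.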
Recently a long thread grew on LinkedIn about why statisticians are not involved in big data and how to remedy the situation. You can read the entire discussion at http://bit.ly/197Jsfa. At the time of this writing, it had 160 comments from leading data scientists and statisticians. Here I mention those comments that I think will be most beneficial in helping you understand (particularly if you are a statistician) what you can do to make yourself highly marketable.
Statistics does not deal enough with applied computational aspects critical for big data, or with the business aspects that are critical for getting business value out of data. Statisticians should acquire business acumen, including being able to carry out tasks traditionally performed by business analysts, especially when working for a small company or tech startup. (The discussions on digital analytics in Chapter 6 show examples of this.) It will also be helpful for the statistician to acquire more engineering and computer science knowledge, including APIs, Python, MapReduce, NoSQL, and even Hadoop architecture—at least having a general understanding of these techniques and languages and how they impact data access, data storage, algorithm efficiency, and data processing.
Statisticians are well-equipped to make the transition to data science since they typically have critical skills that many data scientists lack, such as design of experiments, sampling, and R.
For those who believe that big data and data science are just pure engineering or computer science fields that ignore or misapply statistics, you are now reading the book that will debunk this myth. In this book, you learn that data science has its own core of statistics and statistical research. For instance, in Chapter 2, Big Data Is Different, you saw that in big data you are bound to find spurious correlations when you compute billions or trillions of correlations. Such spurious correlations overshadow real correlations, which go undetected. Instead of looking at correlations, you should compare correlograms; correlograms uniquely determine whether two time series are similar, which correlations do not. You also learned about normalizing for size. You don't need to be a statistician to identify these issues and biases and correct them. A data scientist should know these things too, as well as topics such as experimental design, applied extreme value theory, Monte Carlo simulations, confidence intervals created without an underlying statistical model (as in the Analyticbridge First Theorem), identifying non-randomness, and much more.
You can be both a data scientist and a statistician at the same time, just as you can be a data scientist and an entrepreneur at the same time (indeed, for some roles that combination is a requirement). The two are certainly not incompatible; you just have to be aware that the official image of statisticians, as pictured in ASA publications or on job boards, does not represent (for now, at least) the reality of what many statisticians do.
Still, data science and statistics are different. Many data science books can give you the impression that they are one and the same, but that's because the authors simply reused old material (not even part of data science), added a bit of R or Python, and put a new name on it. I call this fake data science. Likewise, data science without statistics (or with reckless application of statistical principles) is not real data science either.
In the end, everyone has their own idea of what data science, statistics, computer science, BI, and entrepreneurship are. You decide what you want to call yourself. As for me, I'm clearly no longer a statistician, but rather a data scientist. (I was a computational statistician to begin with anyway.) My knowledge and expertise are different from those of a statistician. (It's probably closer to computer science.) And although I have a good knowledge of experimental design, Monte Carlo, sampling, and so on, most of it I did not learn at school or in a training program. The knowledge is available for free on the Internet. Anybody—a lawyer, a politician, a geographer—can acquire it without attending statistics classes.
NOTE Part of the Data Science Central data science apprenticeship is to make this knowledge accessible to a broader group of people. My intuitive Analyticbridge Theorem on model-free confidence intervals is an example of one “statistical” tool designed to be understood by a 12-year-old, with no mathematical prerequisite and applied in big data environments with tons of data buckets.
Statistics in data science can be built from within by data scientists or brought in by outsiders (those who don't want to be called data scientists and call themselves, for example, statisticians). I am an example of an insider creating a culture of statistics inside data science. Some people like contributions from outsider statisticians, whereas others like insider contributions because the insider is already familiar with data science, understands the difference between statistics and data science, and tends to be close to the businesspeople.
While many interesting positions are offered by various government agencies and large non-governmental organizations, the creative data scientist might face a few challenges, and might thrive better in a startup environment. Some of these challenges (also preventing employers from attracting more candidates) are:
Statisticians should gain some familiarity with how to design a database: metrics to use (exact definitions), how they are captured, and how granular you need to be when keeping old data. This is part of data science. It involves working closely with DB engineers and data architects. Data scientists must have some of the knowledge of data architects and also of business people to understand exactly what they do and what they are going to capture. Data scientists should also be involved in dashboard design. Typically, all these things are tasks that many statisticians don't want to do or believe are not part of their job.
Discussing the question is the first critical, fundamental step in any project. Not doing it is like building a skyscraper without a foundation. But flexibility should be allowed. If something does not work as expected, how can you rebuild or change the project? How do you adapt to change? Is the design flexible enough to adapt? Who's in charge of defining the project and its main feature?
While business analysts (when present in your company, which is the case in larger organizations) and executives are eventually responsible for defining KPIs and dashboard design, the data scientist must get involved and communicate with these businesspeople (especially in organizations with fewer than 500 employees), rather than working in isolation. Data scientists should also communicate with data architects, product managers, and engineers/software engineers.
Such communication is crucial for several reasons: to help the data scientist to understand the business problem and gain business acumen and necessary domain expertise, to make technical implementations smoother, to influence company leaders by making recommendations to boost robustness and help them gain analytics acumen themselves, and so on. This two-way communication is useful to optimize business processes.
In my opinion, there are two types of statisticians: those who associate themselves with the ASA (American Statistical Association), and those who do not. Likewise, there are two types of big data practitioners:
I advise statisticians to apply for positions with teams that do not have strong statistical knowledge. (You can assess the team's level of competency by asking questions during the interview, such as what kind of statistical analyses they do.) You will better shine if you can convince them of the value of sampling, mathematical modeling, experimental design, imputation techniques, identifying and dealing with bad data, survival analysis, and sound confidence intervals, and show real examples of added value.
Also, when interviewing for a data scientist position (or when the interviewer is not a statistician), do not use excessive statistical jargon during the interview, but rather discuss the benefits of simple models. For example, the statistical concept of p-value is rarely used in data science, not because of ignorance but because data science uses different terms and different metrics, even though it serves a similar purpose. In data science contexts, many times there are no underlying models. You do model-free inference, such as predictions, confidence intervals, and so on, which are data-driven with no statistical model. So a data scientist interviewer will be more receptive to discussions of simple models.
NOTE Instead of p-values, I invite you to use alternatives, such as the model-free confidence intervals discussed in Chapter 5. This equivalent approach is easy to understand by non-statisticians. In another context, I use predictive power (a metric easily understood by non-statisticians, as discussed in Chapter 6) as a different, simpler, model-free approach to p-value. Also, when talking to clients, you need to use words they understand, not jargon like p-value.
Data scientists also do EDA (exploratory data analysis), but it's not something done by statisticians exclusively. (See the section Data Dictionary in Chapter 5.) As a statistician, you should easily understand this data dictionary concept, and you may already be using it when dealing with larger data sets. Indeed, I believe that the EDA process could be automated, to a large extent, with the creation of a data dictionary being the first step of such an automation.
Sometimes you need the entire data set. When you create a system to estimate the value of every single home in the United States, you probably want the entire data set to be part of the analysis; millions of people each day check millions of houses and neighboring houses for comparison shopping. In this case, it makes sense to use all the data. If your database had only 10 percent of all historical prices, sure, you could do some inference (though you would miss a lot of the local patterns), but 90 percent of the time, when a user entered an address to get a price estimate, he would also have to provide square footage, sales history, number of bathrooms, school rankings, and so on. In short, this application (Zillow.com) would be useless.
Examples of observational data where sampling is not allowed include:
Understanding the taxonomy of a data scientist can help you identify areas of potential interest for your data science career. This discussion is the result of a data science project I did recently, and it is slightly more technical than the previous section. If you like, you can skip the technicalities and focus on the main conclusions of the discussion. But since this is a data science book, I thought it good to provide some details about the methodology that I used to come to my conclusions.
The key point of this discussion is to help you determine which technical domains you should consider specializing in based on your background. You can identify (with statistical significance) the main technical sub-domains related to data science (machine learning, data mining, big data, analytics, and so on) by attaching a weight to each domain based on data publicly available on LinkedIn.
Also provided is a list of several well-known, successful data scientists. You can further check their profiles to better understand what successful data scientists actually do. Finally, I explain cross-correlations among top domains, since all of them overlap, some quite substantially.
I present here the results of a study I conducted on LinkedIn, in which I used my own LinkedIn account, which has 8,000+ data science connections, to identify the skills most frequently associated with data science, as well as the top data scientists on LinkedIn. The following lists were created by searching for data scientists with 10+ data science skill endorsements on LinkedIn, and analyzing the top five skills that they list on their profile.
The list of data science-related skills is statistically sound, but the list of top data scientists is not. The reason is that to appear in the latter list, you need at least 10 endorsements for data science in the skills section of your LinkedIn profile. Some pioneering data scientists are not listed simply because they did not add data science skills to their LinkedIn profile, for the same reason that you will not appear in lists of top big data people based on Twitter hashtags if you don't use hashtags in your tweets, or if you do not tweet at all.
Figure 7-5 shows the list of data science related skills (DS stands for Data Science). In short, you could write the data science equation as:
Note that, surprisingly, visualization does not appear at the top. This is because visualization, just like UNIX, is perceived either as a tool (for example, Tableau) or sometimes as a soft skill, rather than a hard skill, technique, or field, and thus it is frequently mentioned as a skill on LinkedIn profiles but not in the top five. Computer science is also missing, probably because it is too broad a field, and instead people will list (in their profile) a narrower field such as data mining or machine learning.
Figure 7-6 shows how (from a quantitative point of view) related skills contribute to data science, broken down per skill and per person. (This is used to compute the summary table shown in Figure 7-5.) For instance, the first row reads: Monica Rogati lists data science as skill #3 in her LinkedIn profile; she is endorsed by 61 people for data science and by 106 people for machine learning. The machine learning contribution to data science coming from her is 106/3 = 35.33. The people listed in the following table (Figure 7-6) are data science pioneers—among the top data scientists, according to LinkedIn.
The full list, based on the top 10 data scientists identified on LinkedIn, can be downloaded at http://bit.ly/1iRJXQC.
If, for each skill, instead of summing Endorsements(person, skill)/DS_Skill_Rank(person) over the 10 persons listed in the spreadsheet, you sum SQRT{Endorsements(person, skill) * DS_Endorsements(person) } over the same 10 persons, then you obtain a slightly different mix illustrated in the following:
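Both weighting formulas can be sketched in a few lines. Only the first row's numbers come from the text (Monica Rogati's figures from Figure 7-6); the second row is made up purely for illustration:

```python
import math

# Rows: (person, DS skill rank, DS endorsements, {skill: endorsements}).
# Only Monica Rogati's numbers come from Figure 7-6; "Person B" is
# a hypothetical row added so the sums have more than one term.
people = [
    ("Monica Rogati", 3, 61, {"machine learning": 106}),
    ("Person B",      2, 40, {"machine learning": 80}),
]

def weight_v1(rows, skill):
    """First formula: sum of Endorsements(person, skill)
    divided by DS_Skill_Rank(person)."""
    return sum(r[3].get(skill, 0) / r[1] for r in rows)

def weight_v2(rows, skill):
    """Second formula: sum of sqrt(Endorsements(person, skill)
    * DS_Endorsements(person))."""
    return sum(math.sqrt(r[3].get(skill, 0) * r[2]) for r in rows)

w1 = weight_v1(people, "machine learning")  # 106/3 + 80/2
w2 = weight_v2(people, "machine learning")
```

Note how the first row reproduces the 106/3 = 35.33 contribution computed in the text for machine learning.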
Whether you use the first or second formula, you are dealing with three parameters, n, m, and k:
The most complicated problem is to identify all professionals with at least 10 endorsements for data science on LinkedIn. It can be solved by using a list of 100 well-known data scientists as seed persons, and then looking at first-degree, second-degree, and third-degree-related persons. “Finding related people” means accessing the LinkedIn feature that tells you “people who visit X's profile also visit Y's profile” and extracting endorsement counts and skills, for each person, using a web crawler.
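The "related people" expansion amounts to a breadth-first search out to three degrees from the seed persons. Here is a sketch on a toy graph; all names and links are hypothetical, and in practice the graph edges would come from crawling the "people also visit" feature as described above:

```python
from collections import deque

def within_k_degrees(graph, seeds, k=3):
    """Collect everyone within k degrees of the seed persons,
    given graph as {person: set of related persons}."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        person, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k degrees
        for nxt in graph.get(person, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

# Toy "people who also visit" graph (hypothetical names):
graph = {
    "seed": {"a", "b"},
    "a": {"c"},
    "c": {"d"},
    "d": {"e"},  # e is 4 degrees away, so excluded at k=3
}
reachable = within_k_degrees(graph, ["seed"], k=3)
```

Filtering `reachable` down to people with 10 or more data science endorsements then yields the candidate list used in the study.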
WHICH IS BEST?
In your opinion, which formula is best from a methodology point of view? The first one, or the alternative? Not surprisingly, they both yield similar results. I like the second one better.
A good exercise would be to find the equivalent formulas for data mining, big data, machine learning, and so on. It's important to remember that people can be more than just data scientists—for instance, a data scientist and a musician at the same time. This explains why the skill rank, for anybody, is rarely if ever #1 for data science: even I get far more endorsements for data mining or analytics than for data science, in part because data science is relatively new.
Machine learning is part of data mining (at least for some people). Data mining and machine learning both involve analytics, big data, and data science. Big data involves analytics, data mining, machine learning, data science, and so on. So how do you handle skill interactions? Should you have multiple equations, one for data science, one for data mining, one for big data, and so on, and try to solve a linear system of equations? Each equation could be obtained using the same methodology used for the data science equation. I'll leave it to you as an exercise.
Hilary Mason (at the time of writing) has data science as skill #4, with 49 endorsements and the following top skills:
So she should definitely be in the top 10, yet she does not show up when doing a Data Science search using my LinkedIn account, although she is a second-degree connection. If you add her to the list, now Python will pop up as a data science-related skill, with a low weight, but above R or SQL.
Some of the top data scientists in the world and their areas of interest are discussed in this section. The list is summarized in Figure 7-7.
The 10 pioneering data scientists listed here were identified as top data scientists based on their LinkedIn profiles. For each of them, I computed the number of LinkedIn endorsements for the top four data science–related skills: analytics, big data, data mining, and machine learning (these skills were identified as most strongly linked to data science, as discussed in the previous section). Then I normalized the counts so that each is expressed as a ratio between 0 and 1, and for each individual the total aggregated count over the four skills is 100 percent. This approach makes the classification easier.
Note that the correlation between machine learning and analytics is very negative (−0.82). Likewise, the correlation between big data and data mining is very negative (−0.80). All other cross-skill correlations are negligible. Other notable items include the following:
The big data/machine learning combo exhibits the strongest cluster structure among the six potential scatterplots. Milind Bhandarkar (Pivotal's Chief Scientist), and to a lesser extent Eric Colson (former VP of Data Science and Engineering at Netflix), are outliers, both strong in big data.
IS THIS BIG DATA ANALYSIS?
Yes and no. Yes, because I extracted what I wanted out of terabytes of LinkedIn data, leveraging my expertise to minimize the amount of work and data processing required. No, because it did not involve massive data transfers; the information was well organized and easy to access efficiently. After all, you could say it's tiny data, with just 10 observations and four variables. But that 10 × 4 table is a summary table. Identifying the data scientists with the most endorsements on LinkedIn isn't easy unless you have domain expertise.
I performed what I would call “manual clustering.” You could say that my analysis is light analytics. How much better (or worse!) can you do using heavy analytics: by extracting far more data from LinkedIn (200 people selected out of 5,000, with 10 metrics), and applying a real (not manual) clustering algorithm? Which metric would you use to assess the lift created by heavy analytics as opposed to light analytics?
I compared the four-skill mix of each of these 10 data scientists with a generic data science skill mix (Data Science = 0.24 * Data Mining + 0.15 * Machine Learning + 0.14 * Analytics + 0.11 * Big Data). In short, I computed 10 correlations (one per data scientist) to determine who best represents data science. The results are shown in Figure 7-8.
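The purity computation can be sketched as a Pearson correlation between the generic mix and each person's normalized mix. The generic weights come from the text; the two individual skill mixes below are hypothetical, chosen only to contrast a balanced profile with a big-data-heavy one:

```python
import math

def pearson(u, v):
    """Plain Pearson correlation between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

# Generic data science mix over (data mining, machine learning,
# analytics, big data), taken from the text:
generic = [0.24, 0.15, 0.14, 0.11]

# Hypothetical normalized skill mixes for two people:
balanced = [0.40, 0.25, 0.20, 0.15]   # broadly balanced profile
hadoop   = [0.05, 0.10, 0.05, 0.80]   # concentrated in big data

score_balanced = pearson(generic, balanced)
score_hadoop = pearson(generic, hadoop)
```

The balanced profile correlates strongly with the generic mix (a high "purity" score), while the big-data-heavy profile correlates negatively, mirroring the Dean Abbott versus Milind Bhandarkar contrast discussed next.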
Dean Abbott is closest to the "average" (which I define as the "purest"), whereas Milind Bhandarkar (a big data, Hadoop guy) is farthest from the "center." Despite repeated claims (by myself and others) that I am a pure data scientist, I score only 0.43. (Sure, I'm also something of a product/marketing/finance/entrepreneur guy, not just a data scientist, but these extra skills were excluded from this experiment.) Surprisingly, Kirk Borne, known as an astrophysicist, scores high on the data science purity index. So does Gregory Piatetsky-Shapiro, who is known as a data miner.
I looked at more than 10,000 data scientists in my network of LinkedIn connections and found that most data scientists (including me) don't use the title “data scientist” to describe their job. So what are the most popular job titles used for data scientists? Below is a list of many of the job titles I found for “director” and “chief” positions. But keep in mind that, for every director, there are multiple junior and mid-level data scientists working on the same team and, thus, they would have similar job titles (minus “chief” or “director”). Also remember that many data scientists are consultants, principals, founders, CEOs, professors, students, software engineers, and so on.
Chiefs:
Directors:
This section provides data about salaries and the number of open positions for various big data skills in several major cities worldwide.
CROSS-REFERENCE Chapter 3, “Becoming a Data Scientist,” presents a brief overview of salary surveys along with a link to comprehensive information (http://bit.ly/19vA0n1). Sample job descriptions and resumes, as well as companies hiring data scientists, are provided in Chapter 8, “Data Science Resources.”
The following data was obtained in December 2013 from the job search website Indeed.com. (You can find the search results at http://bit.ly/1dmCouo.) Hadoop, a fast-growing data science skill listed in many job ads, is compared with other IT skills, including big data, Python, Java, SQL, and others, in different locales. Note that in these search results, an exact match is shown in quotes (for example, “data science”) whereas a broad match does not contain quotes. The discrepancy between the results for “data science” and data science is huge because data scientists can have hundreds of different job titles, as seen in the previous section. The results for “big data” are 15 times smaller than the results for big data. Also note that San Francisco is the hub for big data, Hadoop, data science, and so on. But location #2 could be London, where there are more jobs and better salaries than in New York City.
The information presented here shows you what tools are available to help you find numerous job openings, where large numbers of jobs are located geographically, and the general salary overview for different data science skills. The numbers in parentheses represent the number of openings listed for each salary.
Hadoop – San Francisco, CA
Hadoop – Seattle, WA
Hadoop – New York, NY
Hadoop – Chicago, IL
Hadoop – Los Angeles, CA
Hadoop – London, UK
“Big Data” – San Francisco, CA
Data Science – San Francisco, CA
“Data Science” – San Francisco, CA
Python – San Francisco, CA
SQL – San Francisco, CA
Java – San Francisco, CA
Perl – San Francisco, CA
Statistician – San Francisco, CA
“Data Mining” – San Francisco, CA
“Data Miner” – San Francisco, CA
A great source for salary surveys, some broken down by experience level, can be found at http://bit.ly/19vA0n1.
You can use the same Indeed.com website, modifying the search keyword and location in the search box, to find the number of positions and the salary breakdown in any locale, for any data science-related occupation. (See the previous sections to identify potential job titles.) Many positions can be found on LinkedIn and AnalyticTalent, but competition is fierce (hundreds of applicants per opening) for jobs at companies such as Google, Twitter, Netflix, PayPal, eBay, LinkedIn, Facebook, Yahoo, and Apple (typically companies involved in data science since its beginnings), as well as Intel, Pivotal, IBM, Microsoft, Amazon.com, and a few others. Lesser-known companies, or companies located in less popular cities (for instance, banks or insurance companies in Ohio), attract far fewer candidates: fewer than 50 per job ad. Startups are also a good source of jobs.
Other sources of career information include Glassdoor.com and Wetfeet.com.
This chapter focused on helping you get employed as a data scientist. It discussed more than 90 job interview questions, tests to assess your analytic and visual skills, guidance on transitioning from statistician to data scientist, top data scientists and data science–related skills, typical job titles for data scientists, and salary surveys. It also included a few small data science project examples.
In the next and final chapter, you can find several resources useful for current and aspiring data scientists: conference listings, sample resumes and job ads, data science books, definitions, data sets, organizations, popular websites, companies with many data scientists, and so on.