CHAPTER
8

Data Science Resources

The previous chapter focused on helping you get employed as a data scientist. In this chapter, you will find numerous resources that will be of value to you in your professional life and your career-building efforts.

Professional Resources

This section provides information on resources to help you in your data science career, including data sets, books, conferences, organizations, websites, and important definitions.

CROSS-REFERENCE Chapter 3, “Becoming a Data Scientist,” covers training programs, courses, and certifications.

See the sections History and Pioneers in Chapter 1, and The Big Data Ecosystem in Chapter 2 for details on vendors.

See the section Taxonomy of a Data Scientist in Chapter 7 for information on today's top data scientists.

Data Sets

Data Science Central has several data sets available at http://bit.ly/W2HTJU, including the following:

  • Source code and data for your big data keyword correlation API (also see the section Source Code for Keyword Correlation API in Chapter 5).
  • Great statistical analysis: forecasting meteorite hits (also see the section Forecasting Meteorite Hits in Chapter 6).
  • 53.5-billion-clicks data set available for benchmarking and testing.
  • More than 5,000,000 financial, economic, and social data sets.
  • New pattern to predict stock prices that multiplies returns by a factor of 5 (stock market data, S&P 500; see the section Stock Market in Chapter 6).
  • 3.5 billion web pages: the graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages.
  • Another large data set: 250 million data points. This is the full resolution GDELT event data set running from January 1, 1979, through March 31, 2013, and containing all data fields for each event record.
  • 125 years of public health data available for download.

You can find additional data sets at the Harvard University Data Science website at http://cs109.org/resources.php. I was particularly interested in their Linked Data resources at http://linkeddata.org/. Information on Harvard's data science course featuring these resources can be found at http://bit.ly/1hU8O5l.

KDnuggets is also a great resource and can be accessed at http://www.kdnuggets.com/datasets/index.html and http://bit.ly/18U6fNw. Additional resources include http://data.gov.uk/ and similar initiatives in the United States (see http://onforb.es/1m0W8cU).

Books

Books useful for data scientists can be broken down into a few categories: visualization, big data/Hadoop, statistics/machine learning, pure data science, business analytics, data science for decision makers (recruiting and managing projects), and a few others.

Data Science Central lists 100+ books at http://bit.ly/179em2h and http://www.analyticbridge.com/group/books.

Following are some of the titles and references listed on these two web pages. These titles were recommended and acclaimed in data science circles such as KDnuggets, and are popular in the Data Science Central community. Details (book description, authors, date, and so on) can be found on the referenced web pages, along with direct links to resellers (Amazon.com in many cases) to allow you to find and buy the book quickly. Some of the listed titles are journals, and some are bundles (for instance, five books on data visualization), and quite a few are available for free as PDF documents.

These books are general, rather than being specialized for a specific industry or problem. A more comprehensive list, including specialized books, is available on the above web pages.

Business

  • Data Science for Business, O'Reilly, 2013
  • Big Data Computing, CRC Press, 2013
  • Past, Present, and Future of Statistical Science, CRC Press, 2014
  • The Field Guide to Data Science, free PDF document, by Booz Allen Hamilton, 2013
  • Implementing Analytics, Elsevier, 2013
  • Automate This: How Algorithms Came to Rule Our World, Portfolio Hardcover, 2012
  • Analyzing the Analyzers, O'Reilly, 2013
  • Business Analytics: A Practitioner's Guide, Springer, 2013
  • A Practitioner's Guide to Business Analytics, McGraw-Hill, 2013
  • Delivering Business Analytics: Practical Guidelines for Best Practice, Wiley, 2013
  • Building Data Science Teams, O'Reilly, 2011

Technical

  • Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, Academic Press, 2012
  • Data Mining, Wiley-IEEE Press, 2011
  • Encyclopedia of Machine Learning, Springer, 2010
  • Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media, O'Reilly, 2013
  • Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan & Claypool Publishers, 2010
  • Causality: Models, Reasoning, and Inference, Cambridge University Press, 2000
  • Mining of Massive Datasets, download free copy at http://infolab.stanford.edu/~ullman/mmds/book.pdf
  • Applied Data Science, Columbia University course
  • Forecasting: Principles and Practice, 2013, download free copy at https://www.otexts.org/fpp/
  • Alternative Methods of Regression, John Wiley & Sons, 1993

Bundles

Programming Tools

  • Big Data Analytics with R and Hadoop, Packt Publishing, 2013
  • Practical Data Science with R, book in progress, see http://www.manning.com/zumel/.
  • Predictive Analytics: Microsoft Excel, Que Publishing, 2012
  • Data Analysis with Open Source Tools, O'Reilly, 2010
  • Data Mining: Discovering and Visualizing Patterns with Python, free download at http://refcardz.dzone.com/refcardz/data-mining-discovering-and
  • Hadoop in Practice, Manning Publications, 2012
  • Hadoop: The Definitive Guide, O'Reilly, 2012

Visualization

  • Visualizing Data, O'Reilly, 2007

Journals

Conferences and Organizations

There are basically three types of organizations that are routinely involved in professional conferences: vendors, professional societies, and companies that organize conferences. Following are some of the key conferences and related organizations.

Vendors

SAS organizes the yearly I-cap conference. Pivotal organizes the Data Science Summit in partnership with VentureBeat and Data Science Central (http://venturebeat.com/events/databeat2013/). Other vendors, such as Teradata, Cloudera, Alpine Labs, and Hortonworks, run their own events, including the Hadoop Summit (http://hadoopsummit.org/).

Professional Societies

There are two basic types of data science professional societies:

  • Those, such as the IEEE or the Direct Marketing Association (DMA, see http://thedma.org), whose focus is much broader than data science but that have recently started to organize analytics, big data, and data science events.
  • Societies more focused on analytics, such as INFORMS (operations research), the American Statistical Association, the Digital Analytics Association, and the International Institute for Analytics.

Conference Organizers

A few of the most active conference organizers include:

Websites

Data Science Central has put together a list of websites related to analytics, data science, or big data, based on input from Data Science Central members. (Each of these top domains was cited by at least four members.) The list includes vendors, publishers, universities, organizations, and personal blogs from well-known data scientists. Some are pure data science sites, whereas others are more general but still tech-oriented, with a strong emphasis on broad data issues or regular data science content.

Since such a popular list is constantly evolving, it is available on the web at http://bit.ly/1ghDR7K so that you can always get the most current list. You might want to add your suggestions as well!

Definitions

Here are several selected terms that you need to understand and will likely use in your career. You can visit http://bit.ly/18UcD7c to find more details on them and names of the contributors to these definitions.

  • Adjusted R2 (R-square): The method preferred by statisticians for determining which variables to include in a model. It is a modified version of R2 that penalizes each new variable on the basis of how many variables have already been admitted. By construction, R2 always increases as you add new variables, which results in models that over-fit the data and have poor predictive ability. Adjusted R2 yields more parsimonious models that admit new variables only if the improvement in fit is larger than the penalty, which better serves the ultimate goal of out-of-sample prediction (see the formula following this list).
  • Cluster analysis: Methods to assign a set of objects into groups. These groups are called clusters, and objects in a cluster are more similar to each other than to those in other clusters. Well-known algorithms are hierarchical clustering, k-means, fuzzy clustering, and supervised clustering.
  • Cross-validation: A general, computationally intensive approach used to estimate the accuracy of statistical models. The idea is to split the data into N subsets, set one subset aside, estimate the parameters of the model from the remaining N-1 subsets, and use the retained subset to estimate the error of the model. The process is repeated N times, with each of the N subsets used once as the validation set, and the N error values are then combined into a final estimate of the model error (a Python sketch follows this list).
  • Decision trees: A tree of questions to guide an end user to a conclusion based on values from a single vector of data. The classic example is a medical diagnosis based on a set of symptoms for a particular patient. A common problem in data science is to automatically or semi-automatically generate decision trees based on large sets of data coupled to known conclusions. Example algorithms are CART and ID3.
  • Design of experiments: Also called experimental design. It is a methodology to sample, group observations, and test statistical models to detect root causes or influential predictive factors.
  • Exploratory Data Analysis: Also called EDA. The first step in all statistical analyses after data has been gathered: examining interactions, producing visualizations, detecting outliers, and summarizing the data using a data dictionary.
  • Factor analysis: Used as a variable reduction technique to identify groups of clustered variables.
  • Feature selection: A feature is a variable, and feature selection is about detecting, out of trillions of potential feature combinations, those that have strong predictive power and robustness (that is, are not sensitive to noise).
  • General Linear Model: General (or Generalized) Linear Models (GLM), in contrast to linear models, allow you to describe both additive and non-additive relationships between a dependent variable and N independent variables. The independent variables in GLM may be continuous as well as discrete. (The dependent variable is often named response, and independent variables are named factors and covariates, depending on whether they are controlled or not.)
  • Goodness of fit: The degree to which the predicted values created by a model minimize errors in cross-validation tests. However, over-fitting the data can be dangerous because it results in a model that will have no predictive power for fresh data. True goodness of fit is determined by how the model fits new data, for instance, its predictive ability.
  • Hadoop: Open source framework that supports large-scale data analysis by allowing you to decompose questions into discrete chunks that can be executed independently, close to slices of the data in question, and ultimately reassembled into an answer to the question posed. It is a file management system more than a traditional database framework, though some SQL layers have been built on top of it.
  • Hidden decision trees: Methodology designed to score transactional data. Blends linear and nonlinear classifiers, builds and blends multiple small decision trees (the nonlinear classifier) implicitly, and eventually merges and recalibrates two scores to produce a unique score. Fast and efficient but requires expertise in feature selection, though the process can be automated using the fast feature selection algorithm described in this book.
  • K-means: Popular clustering algorithm that, for a given (a priori) K, finds K clusters by iteratively moving cluster centers to their clusters' centers of gravity and updating the cluster assignments (a sketch follows this list).
  • Logistic regression: Regression used with binary data when you want to model the probability that a specified outcome will occur. Also used to describe a regression in which the response is a probability (a sketch follows this list).
  • Machine learning: Set of techniques, usually described as algorithms, to classify data based on training sets. The training sets constitute the learning part, and much of the discussion is about designing automated or semi-automated learning systems.
  • Mahout: Apache Mahout is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily on the areas of collaborative filtering, clustering, and classification, often leveraging, but not limited to, the Hadoop platform.
  • MapReduce: Model for processing large amounts of data efficiently. The original problem is “mapped” to smaller problems (which may themselves become “original” problems). The smaller problems are processed in parallel, and their results are combined, or “reduced,” into solutions to the original problems (a toy example follows this list).
  • Monte Carlo simulations: Computing expectations and probabilities in models of random phenomena using many randomly sampled values. Akin to estimating the probability of winning a given roulette bet (say, black) by placing it repeatedly and counting the success ratio. Useful in complex models characterized by uncertainty (a simulation sketch follows this list).
  • Natural Language Processing (NLP): A set of techniques to automatically process text to extract insights, for instance, sentiment analysis, or to automatically produce taxonomies or abstracts. Evolved from word counts (bags of words) to more elaborate text mining techniques.
  • NoSQL: “Not only SQL” is a group of database management systems. Data is not stored in tables like a relational database and is not based on the mathematical relationship between tables. It is a way of storing and retrieving unstructured data quickly.
  • Multidimensional scaling: Reduces the dimension of a space by projecting an N × N similarity matrix (N = number of observations) onto a two-dimensional visual representation. A classic example is producing a geographic map of cities when the only data available is the travel time between each pair of cities.
  • Naive Bayes: A simple classification method for categorical data, based on Bayes' theorem. It is fast and easy to implement, but it assumes that the variables (also called features) are independent. In practice this is rarely true, and algorithms based on Naive Bayes (for example, basic spam detectors) can perform poorly when the features are strongly correlated.
  • Nonparametric statistics: Set of statistical techniques that process data without making assumptions about statistical observations. Also known as data-driven, versus model-driven, statistics.
  • Pig: Pig is a scripting interface to Hadoop, meaning a lack of MapReduce programming experience won't hold you back. It's also known for processing a large variety of different data types.
  • Predictive modeling: Set of techniques based on statistical models to make predictions (stock market, fraud risk, and odds for a user to convert into a sale), usually with confidence intervals for predicted values.
  • Sensitivity analysis: Process used to determine the sensitivity of a predictive model to noise, missing data, outliers, and other anomalies in the model predictors.
  • Six Sigma: Set of tools and strategies for process improvement, originally developed by Motorola in 1986. Six Sigma seeks to improve the quality of process outputs by identifying and removing the causes of defects (errors) and minimizing variability in manufacturing and business processes.
  • Step-wise regression: Variable selection process for multivariate regression. In forward step-wise selection, a seed variable is selected, and each additional variable is entered into the model but kept only if it significantly improves goodness of fit (as measured by the increase in R2). Backward selection starts with all variables and removes them one by one until removing another would decrease R2 by a nontrivial amount. Two deficiencies of this method are that the choice of seed disproportionately affects which variables are kept, and that the decision is made using R2 rather than adjusted R2.
  • Supervised clustering: Also known as discriminant analysis. Consists of classifying new data points when you already have a set (the training set) of preclassified observations, that is, observations with a known cluster label.
  • Time series: A set of (t, x) values where x is usually a scalar (though could be a vector) and the t values are usually sampled at regular intervals (though some time series are irregularly sampled). In the case of regularly sampled time series, the t is usually dropped from the actual data and replaced with just a t0 (start time) and delta-t that apply to the whole series.
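
To make the adjusted R2 definition concrete, here is the standard formula, where n is the number of observations and p the number of predictors (the notation is ours; the definition above does not spell out the penalty):

\[ R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \]

The penalty factor (n - 1)/(n - p - 1) grows with p, so a new variable improves the adjusted value only if it raises R2 by more than the penalty.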
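
The cross-validation procedure described above fits in a few lines of Python. This is a minimal sketch rather than production code: it assumes NumPy is available, uses ordinary least squares as the model being validated, and the function names are ours.

import numpy as np

def cross_validate(X, y, n_folds=5, seed=0):
    # Estimate mean squared prediction error by N-fold cross-validation.
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))          # shuffle before splitting
    folds = np.array_split(indices, n_folds)   # the N subsets
    errors = []
    for k in range(n_folds):
        test_idx = folds[k]                    # the retained subset
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Fit ordinary least squares on the remaining N-1 subsets.
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        pred = X[test_idx] @ beta
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return np.mean(errors)                     # combined estimate of model error

# Toy usage: y = 2x + noise, with an intercept column in X.
x = np.linspace(0, 1, 100)
X = np.column_stack([np.ones_like(x), x])
y = 2 * x + np.random.default_rng(1).normal(scale=0.1, size=100)
print(cross_validate(X, y))

Shuffling before splitting matters when the rows are ordered; for time series, use splits that respect temporal order instead.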
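
The k-means iteration (assign points to the nearest center, then move each center to its cluster's center of gravity) can be sketched the same way. Assumptions: NumPy, Euclidean distance, and no handling of edge cases such as empty clusters.

import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct randomly chosen points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to its cluster's center of gravity.
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break                              # assignments have stabilized
        centers = new_centers
    return labels, centers

# Toy usage: two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
labels, centers = kmeans(pts, k=2)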
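
For logistic regression, the sketch below fits the coefficients by plain gradient ascent on the log-likelihood. Assumptions: NumPy, a fixed learning rate, and no regularization; in practice you would use a statistics package (for example, glm in R or statsmodels in Python).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                 # modeled probability of the outcome
        beta += lr * X.T @ (y - p) / len(y)   # gradient ascent on the log-likelihood
    return beta

# Toy usage: the probability that y = 1 increases with x.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])
print(fit_logistic(X, y))                     # intercept near 0, positive slope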
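
The MapReduce pattern is easiest to see in the classic word-count example. The sketch below simulates the map, shuffle, and reduce phases in a single Python process; a real Hadoop job distributes the same steps across machines.

from collections import defaultdict

def map_phase(document):
    # "Map": emit a (word, 1) pair for every word in one chunk of input.
    return [(word, 1) for word in document.split()]

def reduce_phase(word, counts):
    # "Reduce": combine the partial results for one key.
    return word, sum(counts)

documents = ["big data big plans", "data beats opinions"]

# Shuffle: group the mapped pairs by key, as the framework does between phases.
grouped = defaultdict(list)
for doc in documents:                 # each document could be mapped in parallel
    for word, count in map_phase(doc):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)   # {'big': 2, 'data': 2, 'plans': 1, 'beats': 1, 'opinions': 1}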
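
Finally, the roulette example from the Monte Carlo definition can be simulated directly. The sketch assumes a European wheel (18 black pockets out of 37) and arbitrarily labels pockets 1 through 18 as black.

import numpy as np

rng = np.random.default_rng(42)
n_spins = 1_000_000
spins = rng.integers(0, 37, size=n_spins)      # pocket indices 0..36
wins = np.count_nonzero((spins >= 1) & (spins <= 18))
print(wins / n_spins)                          # exact probability is 18/37, about 0.4865

With a million spins, the estimate typically lands within about 0.001 of the exact value, illustrating how repeated sampling converges on the true probability.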

Career-Building Resources

In this section you can find information on the diverse companies that employ and routinely hire data scientists. This section also includes sample resumes and job ads, which can be a gold mine for identifying the hot keywords and skills mentioned everywhere in the data science world at this time (for example, R, Python, Hadoop, NoSQL, SQL, predictive modeling, machine learning, and so on).

Companies Employing Data Scientists

An enormous variety and number of organizations routinely hire data scientists, though each may use a different job title in its advertisements and organization. A few companies that you might not expect to be on such a list include Walmart, PwC, Electronic Arts, Boeing, and Starbucks. Companies at the top of the list (IBM, Microsoft) have a lower proportion of data scientists in their workforce than others such as Facebook, Google, FICO, eBay, or LinkedIn. Traditional industries, such as manufacturing, tend to use the title “operations research analyst.”

Most jobs, particularly the senior-level roles, are still concentrated in the United States (New York City and the San Francisco Bay Area), but some places are catching up quickly, such as Singapore, Spain, Ireland, and London. Following is a list of the companies employing the largest numbers of data scientists. The list is based on my own 10,000+ LinkedIn connections, broken down by company and ordered by the number of connections. The top 20 companies represent approximately 10 percent of all data scientist positions, but the distribution has a long and unusually heavy tail. You can find a more comprehensive list of 6,000+ companies at http://bit.ly/19vRlNV.

  • Microsoft
  • IBM
  • Amazon.com
  • SAS
  • Google
  • Accenture
  • Oracle
  • LinkedIn
  • FICO
  • Bank of America
  • Citi
  • Tata Consultancy Services
  • Facebook
  • Cognizant Technology Solutions
  • Wells Fargo
  • Capgemini
  • eBay
  • Apple
  • Hewlett-Packard
  • EMC
  • Pivotal

Sample Data Science Job Ads

This chapter would not be complete without information on what hiring companies are currently looking for in the data science arena, their requirements, and other useful details. Consider the sample (actual and recent) job ads found at http://bit.ly/1hVAmr7. The skills most frequently listed are Python, Linux, UNIX, MySQL, MapReduce, Hadoop, Matlab, SAS, Java, R, SPSS, Hive, Pig, Scala, Ruby, Cassandra, SQL Server, and NoSQL.

CROSS-REFERENCE See Chapter 7, “Launching Your New Data Science Career,” for more information on how and where to conduct job searches for different job titles, levels, and skills.

Sample Resumes

The following sample resume extracts are from actual data science practitioners who agreed to be featured in this book. In order to allow these professionals to delete or update their resumes, I've made the resumes accessible on the web at http://bit.ly/1j4PNuP. You can find more resumes and profiles by doing a search on LinkedIn with the keyword data science or related keywords, or by browsing Data Science Central member profiles.

Included in this list are people from different locales and backgrounds, in an attempt to cover various aspects of data science. The emphasis is on providing a well-balanced mix of analytic professionals: junior and senior, with big-company experience, startup experience, or both; top stars as well as people with average resumes (sometimes the most faithful employees); and people from corporate, consulting, and academic backgrounds. The purpose is to help you find one or more you can relate to. I also added my own resume to provide an example containing patents and classic big data science work such as credit card fraud detection and digital analytics.

Typical skills mentioned in these resumes include the programming languages R, Python, and Matlab; MongoDB, SQL, and MySQL; statistics and machine learning (k-NN, decision trees, neural networks, linear/logistic regression); and Java, JavaScript, Tableau, Excel, recommendation engines, and Google Analytics. Of course, no single resume lists all of these skills; the most commonly listed are R and Python, each appearing in 50 percent of the resumes.

You should check these resumes to see career progression (lateral or vertical) and the degrees, ongoing training, and certifications these people have.

By comparing these resumes with the job ads presented earlier, it seems that Human Resources departments are sometimes looking for a unicorn: a professional with a skill mix that does not exist. I encourage employers to seek out and hire people with strong potential and train them, rather than looking for the rare and expensive unicorn, who often turns out not to be the best fit (and may only be happy running his or her own business).

NOTE A lot of discussion continues in the data science industry about the ideal team structure for data science projects: is it better to have one or two “unicorns” who can do it all, or a more diverse team? In reality, the diverse team structure is most common because it is simply too difficult to find one individual with all the required skill sets. But Human Resources should try, if possible, to hire someone with deep domain expertise, business acumen, coding experience (production code, unless the position is for prototyping algorithms), and an analytic background (real exposure to statistics, big data, and engineering/computer science), and have that person learn statistics, Hadoop (as a user rather than an architect), or core data science and computer science techniques.

If a company can find everything it wants (minus core data science, which a qualified hire can easily learn) in one person, it avoids competence silos and their drawbacks. But be aware that such skilled, polyvalent individuals may not stay as long as expected.

Summary

This chapter mentioned a number of useful resources for data scientists. Resources for practicing data scientists included data sets, data science books and journals, conferences, organizations, popular websites, and definitions. Career-building resources included information on companies with many data scientists, and sample job ads and resumes.
