We discussed the typical life cycle of a data science project in Chapter 1, Big Data Analytics at a 10,000-Foot View. This chapter takes a closer look at the machine learning techniques used in data science with Spark and Hadoop.
Data science is all about extracting deep meaning from data and creating data products. This requires both methods, such as statistics and machine learning algorithms, and tools for data collection and data cleansing. Once the data is collected and cleansed, it is analyzed using exploratory analytics to find patterns and build models, with the aim of extracting deep meaning or creating a data product.
So, let's understand how these patterns and models are created. This chapter is divided into the following subtopics:
Machine learning is the science of making machines work without programming predefined rules. Let's go through a simple example of how a program is written with a regular approach and with a machine learning approach. For example, if you are developing a spam filter with the regular approach, you need to identify all possible parameters at design time and hardcode them within the program as follows:
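import sys

# Spam phrases identified at design time and hardcoded into the program
spam_words = ("No investment", "Why pay more?", "You are a winner!", "Free quote")

for line in sys.stdin:
    # Flag the line if it contains any of the known spam phrases
    if any(phrase in line for phrase in spam_words):
        print("Spam Found")
    else:
        # process_lines() stands for the program's normal handling, defined elsewhere
        process_lines()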
In machine learning, the computer instead learns from the data we provide and decides for itself which words indicate spam. Machine learning is similar to human learning. Let's understand how humans learn.
Humans learn by doing a task over and over again, which is known as practice. With practice they gain experience and get better at the task. Humans are considered to have learned something when they can repeat a task with some expected level of accuracy. However, human learning is not scalable as it has to consider a variety of things.
In machine learning, you typically provide training data with features, such as the types of words in an e-mail, along with output variables, such as spam or ham. Once this data is fed to a machine learning algorithm, such as classification or regression, the algorithm learns a model of the correlation between the features and the output variables. You can then predict whether an e-mail is spam or ham by providing input e-mails, called test data, to the model. You can refine the model by providing more and more training data to improve accuracy. You can see a spam detection example with machine learning in the next section.
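As a minimal sketch of this workflow, the following plain Python 3 program trains a small naive Bayes classifier on labelled messages and then predicts labels for unseen ones. Naive Bayes is only one possible classification algorithm, chosen here for brevity. The training messages and the names TRAINING_DATA, tokenize, train, and predict are hypothetical and exist only for illustration; they are not taken from Spark or any other library.

```python
import math
from collections import Counter

# Hypothetical toy training data: (message, label) pairs.
TRAINING_DATA = [
    ("you are a winner claim your free prize", "spam"),
    ("free quote no investment needed", "spam"),
    ("why pay more get a free quote today", "spam"),
    ("meeting moved to monday morning", "ham"),
    ("please review the attached project report", "ham"),
    ("lunch on monday with the project team", "ham"),
]


def tokenize(text):
    """Features are simply the lowercase words of the message."""
    return text.lower().split()


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}          # label -> Counter of words
    label_counts = Counter()  # label -> number of messages
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior: how common the label is in the training data.
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Laplace (add-one) smoothing keeps unseen words from zeroing the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_DATA)
print(predict(model, "claim your free prize"))
print(predict(model, "project meeting on monday"))
```

Unlike the hardcoded filter, no spam phrases appear in the program logic; the word statistics come entirely from the labelled examples. Refining the model amounts to adding more labelled pairs to the training data and calling train again.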
The advantages of machine learning are as follows:
The disadvantages of machine learning are as follows: