Chapter 1

Introducing How Machines Learn

IN THIS CHAPTER

check Defining the dream of AI, and comparing AI to machine learning

check Understanding the engineering portion of AI and machine learning

check Considering how statistics and big data work together in machine learning

check Defining the role of algorithms in machine learning

check Determining how training works with algorithms in machine learning

“A breakthrough in machine learning would be worth ten Microsofts.”

— BILL GATES

Artificial Intelligence (AI) is a huge topic today, and it’s getting bigger all the time thanks to the success of technologies such as Siri (www.apple.com/ios/siri). Talking to your smartphone is both fun and helpful for finding out things like the location of the best sushi restaurant in town or how to get to the concert hall. As you talk to your smartphone, it learns more about the way you talk and makes fewer mistakes in understanding your requests. The capability of your smartphone to learn and interpret your particular way of speaking is an example of AI at work, and part of the technology that makes it happen is machine learning. You likely use machine learning and AI all over the place today without really thinking about it. Recommender systems, such as those found on Amazon, are another example: they help you make purchases based on criteria such as previous product purchases or products that complement a current choice. The use of both AI and machine learning will only increase with time.

In this chapter, you delve into AI and discover what it means from several perspectives, including how it affects you as a consumer and as a scientist or engineer. You also discover that AI doesn’t equal machine learning, even though the media often confuse the two; the technologies are related, but they are definitely distinct.

You will also understand the fuel that powers both AI and machine learning — big data. Algorithms, step-by-step procedures often grounded in statistics, turn big data into information and, eventually, insight. Through this process, you will be amazed by how AI and machine learning help computers excel at tasks that used to be done only by humans.

Getting the Real Story about AI

For many years, people understood AI based on Hollywood. Robots enhanced human abilities in TV shows like The Jetsons or Knight Rider, and in movies like Star Wars and Star Trek. Recent developments like powerful computers that can fit in your pocket and cheap storage to collect massive amounts of data have moved reality closer to the on-screen fiction.

This section separates hype from reality, and explores a few actual applications in machine learning and AI.

Moving beyond the hype

As any technology becomes bigger, so does the hype, and AI certainly has a lot of hype surrounding it. For one thing, some people have decided to engage in fear mongering rather than science. Killer robots, such as those found in the film The Terminator, really aren’t going to be the next big thing. Your first real experience with an android AI is more likely to be in the form of a health-care assistant (www.good.is/articles/robots-elder-care-pepper-exoskeletons-japan) or possibly as a coworker (www.computerworld.com/article/2990849/robotics/meet-the-virtual-woman-who-may-take-your-job.html). The reality is that you already interact with AI and machine learning in far more mundane ways. Part of the reason you need to read this chapter is to get past the hype and discover what AI can do for you today.

remember You may also have heard machine learning and AI used interchangeably. AI includes machine learning, but machine learning doesn’t fully define AI. This chapter helps you understand the relationship between machine learning and AI.

Machine learning and AI both have strong engineering components. That is, you can quantify both technologies precisely based on theory (substantiated and tested explanations) rather than simply hypothesis (a suggested explanation for a phenomenon). In addition, both have strong science components, through which people test concepts and create new ideas about how to express the thought process. Finally, machine learning also has an artistic component, and this is where a talented scientist can excel. In some cases, AI and machine learning both seemingly defy logic, and only the true artist can make them work as expected.

Dreaming of electric sheep

Androids (a specialized kind of robot that looks and acts like a human, such as Data in Star Trek) and some types of humanoid robots (a kind of robot that has human characteristics but is easily distinguished from a human, such as C-3PO in Star Wars) have become the poster children for AI. They present computers in a form that people can anthropomorphize (that is, treat as human). In fact, it’s entirely possible that one day you won’t be able to distinguish between human and artificial life with ease. Science fiction authors, such as Philip K. Dick, have long predicted such an occurrence, and it seems all too possible today. The story Do Androids Dream of Electric Sheep? discusses the whole concept of more real than real. The idea appears as part of the plot in the movie Blade Runner (www.warnerbros.com/blade-runner). The sections that follow help you understand how close technology currently gets to the ideals presented by science fiction authors and the movies.

technicalstuff The current state of the art is lifelike, but you can easily tell that you’re talking to an android. Viewing videos online can help you understand that androids that are indistinguishable from humans are nowhere near any sort of reality today. Check out the Japanese robots at www.youtube.com/watch?v=MaTfzYDZG8c and www.nbcnews.com/tech/innovation/humanoid-robot-starts-work-japanese-department-store-n345526. One of the more lifelike examples is Amelia (https://vimeo.com/166359613). Her story appears in Computerworld at www.computerworld.com/article/2990849/robotics/meet-the-virtual-woman-who-may-take-your-job.html. The bottom line is that technology is only beginning to reach a stage where people may eventually be able to create lifelike robots and androids, but such machines don’t exist today.

Understanding the history of AI and machine learning

There is a reason, other than anthropomorphization, that humans see the ultimate AI as one that is contained within some type of android. Ever since the ancient Greeks, humans have discussed the possibility of placing a mind inside a mechanical body. One such myth is that of a mechanical man called Talos (www.ancient-wisdom.com/greekautomata.htm). The fact that the ancient Greeks had complex mechanical devices, only one of which still exists (read about the Antikythera mechanism at www.ancient-wisdom.com/antikythera.htm), makes it quite likely that their dreams were built on more than just fantasy. Throughout the centuries, people have discussed mechanical persons capable of thought (such as Rabbi Judah Loew's Golem, www.nytimes.com/2009/05/11/world/europe/11golem.html).

AI is built on the hypothesis that mechanizing thought is possible. During the first millennium, Greek, Indian, and Chinese philosophers all worked on ways to perform this task. As early as the seventeenth century, Gottfried Leibniz, Thomas Hobbes, and René Descartes discussed the potential for rationalizing all thought as simply math symbols. Of course, the complexity of the problem eluded them, and still eludes us today. The point is that the vision for AI has been around for an incredibly long time, but the implementation of AI is relatively new.

The true birth of AI as we know it today began with Alan Turing’s publication of “Computing Machinery and Intelligence” in 1950. In this paper, Turing explored the idea of how to determine whether machines can think. The paper introduced the Imitation Game, which involves three players: Player A is a computer and Player B is a human, and each must convince Player C (a human who can’t see either Player A or Player B) that it is human. If Player C can’t consistently determine who is human and who isn’t, the computer wins.

A continuing problem with AI is too much optimism. The problem that scientists are trying to solve with AI is incredibly complex. However, the early optimism of the 1950s and 1960s led scientists to believe that the world would produce intelligent machines in as little as 20 years. After all, machines were doing all sorts of amazing things, such as playing complex games. That prediction has yet to come true; instead, AI currently has its greatest success in narrower areas such as logistics, data mining, and medical diagnosis.

Exploring what machine learning can do for AI

Machine learning relies on algorithms to analyze huge data sets. Currently, machine learning can’t provide the sort of AI that the movies present. Even the best algorithms can’t think, feel, present any form of self-awareness, or exercise free will. What machine learning can do is perform predictive analytics far faster than any human can. As a result, machine learning can help humans work more efficiently. The current state of AI, then, is one of performing analysis, but humans must still consider the implications of that analysis — making the required moral and ethical decisions. The “Considering the relationship between AI and machine learning” section later in this chapter delves more deeply into precisely how machine learning contributes to AI as a whole. The essence of the matter is that machine learning provides just the learning part of AI, and that part is nowhere near ready to create an AI of the sort you see in films.

remember The main point of confusion between learning and intelligence is that people assume that simply because a machine gets better at its job (learning), it’s also aware (intelligence). Nothing supports this view of machine learning. The same phenomenon occurs when people assume that a computer is purposely causing problems for them. The computer can’t (currently) experience emotions of any kind, and therefore acts only on the input provided and the instructions contained within an application for processing that input. A true AI will eventually occur when computers can finally emulate the clever combination used by nature:

  • Genetics: Slow learning from one generation to the next
  • Teaching: Fast learning from organized sources
  • Exploration: Spontaneous learning through media and interactions with others

Considering the goals of machine learning

At present, AI is based on machine learning, and machine learning is essentially different from statistics. Yes, machine learning has a statistical basis, but it makes some different assumptions than statistics do because the goals are different. Table 1-1 lists some features to consider when comparing AI and machine learning to statistics.

TABLE 1-1 Comparing Machine Learning to Statistics

Technique | Machine Learning | Statistics
Data handling | Works with big data in the form of networks and graphs; raw data from sensors or web text is split into training and test data. | Models are used to create predictive power on small samples.
Data input | The data is sampled, randomized, and transformed to maximize accuracy scoring in the prediction of out-of-sample (or completely new) examples. | Parameters interpret real-world phenomena and place a stress on magnitude.
Result | Probability is taken into account for comparing what could be the best guess or decision. | The output captures the variability and uncertainty of parameters.
Assumptions | The scientist learns from the data. | The scientist assumes a certain output and tries to prove it.
Distribution | The distribution is unknown or ignored before learning from data. | The scientist assumes a well-defined distribution.
Fitting | The scientist creates a best-fit, but generalizable, model. | The result is fit to the present data distribution.
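
The table’s first row mentions splitting raw data into training and test sets. Here’s a minimal sketch of that split, assuming Python with scikit-learn installed; the toy feature and label arrays are made up purely for illustration.

  # A minimal sketch of splitting data into training and test sets,
  # assuming scikit-learn is installed. The toy arrays are hypothetical.
  from sklearn.model_selection import train_test_split

  X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]  # feature values
  y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]                       # class labels

  # Hold out 30 percent of the examples for out-of-sample testing.
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.3, random_state=42)
  print(len(X_train), len(X_test))  # 7 3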

Defining machine learning limits based on hardware

Huge data sets require huge amounts of memory. Unfortunately, the requirements don’t end there. When you have huge amounts of data and memory, you must also have processors with multiple cores and high speeds. One of the problems that scientists are striving to solve is how to use existing hardware more efficiently. In some cases, waiting for days to obtain a result to a machine learning problem simply isn’t possible. The scientists who want to know the answer need it quickly, even if the result isn’t quite right. With this in mind, investments in better hardware also require investments in better science. This book considers some of the following issues as part of making your machine learning experience better:

  • Obtaining a useful result: As you work through the book, you discover that you need to obtain a useful result first, before you can refine it. In addition, sometimes tuning an algorithm goes too far, and the result becomes quite fragile (and possibly useless outside a specific data set).
  • Asking the right question: Many people get frustrated in trying to obtain an answer from machine learning because they keep tuning their algorithm without asking a different question. To use hardware efficiently, sometimes you must step back and review the question you’re asking. The question might be wrong, which means that even the best hardware will never find the answer.
  • Relying on intuition too heavily: All machine learning questions begin as a hypothesis. A scientist uses intuition to create a starting point for discovering the answer to a question. Failure is more common than success when working through a machine learning experience. Your intuition adds the art to the machine learning experience, but sometimes intuition is wrong and you have to revisit your assumptions.

technicalstuff When you begin to realize the importance of environment to machine learning, you can also begin to understand the need for the right hardware, in the right balance, to obtain a desired result. The current state-of-the-art systems actually rely on graphical processing units (GPUs) to perform machine learning tasks. Relying on GPUs does speed the machine learning process considerably. A full discussion of using GPUs is outside the scope of this book, but you can read more about the topic at https://devblogs.nvidia.com/parallelforall/bidmach-machine-learning-limit-gpus.

Overcoming AI fantasies

As with many other technologies, AI and machine learning both have their fantasy or fad uses. For example, some people are using machine learning to create Picasso-like art from photos. You can read all about it at www.washingtonpost.com/news/innovations/wp/2015/08/31/this-algorithm-can-create-a-new-van-gogh-or-picasso-in-just-an-hour. As the article points out, the computer can copy only an existing style at this stage — not create an entirely new style of its own. The following sections discuss AI and machine learning fantasies of various sorts.

Discovering the fad uses of AI and machine learning

AI is entering an era of innovation that you used to read about only in science fiction. It can be hard to determine whether a particular AI use is real or simply the dream child of a determined scientist. For example, The Six Million Dollar Man (https://en.wikipedia.org/wiki/The_Six_Million_Dollar_Man) is a television series that looked fanciful at one time. When it was introduced, no one actually thought that real-world bionics would ever appear. However, Hugh Herr has other ideas — bionic legs really are possible now (www.smithsonianmag.com/innovation/future-robotic-legs-180953040). Of course, they aren’t available for everyone yet; the technology is only now becoming useful. Muddying the waters is another television series, The Six Billion Dollar Man (www.cinemablend.com/new/Mark-Wahlberg-Six-Billion-Dollar-Man-Just-Made-Big-Change-91947.html). The fact is that AI and machine learning will both present opportunities to create some amazing technologies and that we’re already at the stage of creating those technologies, but you still need to take what you hear with a huge grain of salt.

remember To make the future uses of AI and machine learning match the concepts that science fiction has presented over the years, real-world programmers, data scientists, and other stakeholders need to create tools, most of which are still rudimentary today. Nothing happens by magic, even though it may look like magic when you don’t know what’s happening behind the scenes. For the fad uses of AI and machine learning to become real-world uses, these people must continue building tools that may be hard to imagine at this point.

Considering the true uses of AI and machine learning

You find AI and machine learning used in a great many applications today. The only problem is that the technology works so well that you don’t know that it even exists. In fact, you might be surprised to find that many devices in your home already make use of both technologies. Both technologies definitely appear in your car and most especially in the workplace. In fact, the uses for both AI and machine learning number in the millions — all safely out of sight even when they’re quite dramatic in nature.

Here are just a few of the ways in which you might see AI used:

  • Fraud detection: You get a call from your credit card company asking whether you made a particular purchase. The credit card company isn’t being nosy; it’s simply alerting you to the fact that someone else could be making a purchase using your card. The AI embedded within the credit card company’s code detected an unfamiliar spending pattern and alerted someone to it.
  • Resource scheduling: Many organizations need to schedule the use of resources efficiently. For example, a hospital may have to determine where to put a patient based on the patient’s needs, availability of skilled experts, and the amount of time the doctor expects the patient to be in the hospital.
  • Complex analysis: Humans often need help with complex analysis because there are literally too many factors to consider. For example, the same set of symptoms could indicate more than one problem. A doctor or other expert might need help making a diagnosis in a timely manner to save a patient’s life.
  • Automation: Any form of automation can benefit from the addition of AI to handle unexpected changes or events. A problem with some types of automation today is that an unexpected event, such as an object in the wrong place, can actually cause the automation to stop. Adding AI to the automation can allow the automation to handle unexpected events and continue as if nothing happened.
  • Customer service: The customer service line you call today may not even have a human behind it. The automation is good enough to follow scripts and use various resources to handle the vast majority of your questions. With good voice inflection (provided by AI as well), you may not even be able to tell that you’re talking with a computer.
  • Safety systems: Many of the safety systems found in machines of various sorts today rely on AI to take over the vehicle in a time of crisis. For example, many automatic braking systems rely on AI to stop the car based on all the inputs that a vehicle can provide, such as the direction of a skid.
  • Machine efficiency: AI can help control a machine in such a manner as to obtain maximum efficiency. The AI controls the use of resources so that the system doesn’t overshoot speed or other goals. Every ounce of power is used precisely as needed to provide the desired services.

This list doesn’t even begin to scratch the surface. You can find AI used in many other ways. However, it’s also useful to view uses of machine learning outside the normal realm that many consider the domain of AI. Here are a few uses for machine learning that you might not associate with an AI:

  • Access control: In many cases, access control is a yes or no proposition. An employee smart card grants access to a resource in much the same way that people have used keys for centuries. Some locks do offer the capability to set times and dates that access is allowed, but the coarse-grained control doesn’t really answer every need. By using machine learning, you can determine whether an employee should gain access to a resource based on role and need. For example, an employee can gain access to a training room when the training reflects an employee role.
  • Animal protection: The ocean might seem large enough for animals and ships to coexist without problems. Unfortunately, many animals get hit by ships each year. A machine learning algorithm could allow ships to avoid animals by learning the sounds and characteristics of both the animal and the ship.
  • Predicting wait times: Most people don’t like waiting when they have no idea of how long the wait will be. Machine learning allows an application to determine waiting times based on staffing levels, staffing load, complexity of the problems the staff is trying to solve, availability of resources, and so on.

Being useful; being mundane

Even though the movies make it sound like AI is going to make a huge splash, and you do sometimes see some incredible uses for AI in real life, the fact of the matter is that most uses for AI are mundane, even boring. For example, a recent article details how Verizon uses AI to analyze security breach data (www.computerworld.com/article/3001832/data-analytics/how-verizon-analyzes-security-breach-data-with-r.html). The act of performing this analysis is dull when compared to other sorts of AI activities, but the benefits are that Verizon saves money performing the analysis, and the results are better as well.

In addition, Python developers have a huge array of libraries available to make machine learning easy. In fact, Kaggle (www.kaggle.com/competitions) provides competitions to allow developers to hone their machine learning skills in creating practical applications. The results of these competitions often appear later as part of products that people actually use. Additionally, the developer community is particularly busy creating new libraries to make complex data science and machine learning applications easier to program (see www.kdnuggets.com/2015/06/top-20-python-machine-learning-open-source-projects.html for the top 20 Python libraries in use today).

Considering the relationship between AI and machine learning

Machine learning is only part of what a system requires to become an AI. The machine learning portion of the picture enables an AI to perform these tasks:

  • Adapt to new circumstances that the original developer didn’t envision.
  • Detect patterns in all sorts of data sources.
  • Create new behaviors based on the recognized patterns.
  • Make decisions based on the success or failure of these behaviors.

The use of algorithms to manipulate data is the centerpiece of machine learning. To prove successful, a machine learning session must use an appropriate algorithm to achieve a desired result. In addition, the data must lend itself to analysis using the desired algorithm, or it requires careful preparation by scientists.

AI encompasses many other disciplines to simulate the thought process successfully. In addition to machine learning, AI normally includes

  • Natural language processing: The act of allowing language input and putting it into a form that a computer can use
  • Natural language understanding: The act of deciphering the language in order to act upon the meaning it provides
  • Knowledge representation: The ability to store information in a form that makes fast access possible
  • Planning (in the form of goal seeking): The ability to use stored information to draw conclusions in near real time (almost at the moment it happens, but with a slight delay, sometimes so short that a human won’t notice it, although the computer can)
  • Robotics: The ability to act upon requests from a user in some physical form

technicalstuff In fact, you might be surprised to find that the number of disciplines required to create an AI is huge. Consequently, this book exposes you to only a portion of what an AI contains. However, even the machine learning portion of the picture can become complex because understanding the world through the data inputs that a computer receives is a complex task. Just think about all the decisions that you constantly make without thinking about them. For example, just the concept of seeing something and knowing whether you can interact successfully with it can become a complex task.

Considering AI and machine learning specifications

As scientists continue to work with a technology and turn hypotheses into theories, the technology becomes related more to engineering (where theories are implemented) than science (where theories are created). As the rules governing a technology become clearer, groups of experts work together to define these rules in written form. The result is specifications (a group of rules that everyone agrees upon).

Eventually, implementations of the specifications become standards that a governing body, such as the IEEE (Institute of Electrical and Electronics Engineers) or a combination of the ISO/IEC (International Organization for Standardization/International Electrotechnical Commission), manages. AI and machine learning have both been around long enough to create specifications, but you currently won’t find any standards for either technology.

The basis for machine learning is math. Algorithms determine how to interpret big data in specific ways. The math basics for machine learning appear in Book 8, Chapter 2. You discover that algorithms process input data in specific ways and create predictable outputs based on the data patterns. What isn’t predictable is the data itself. The reason you need AI and machine learning is to decipher the data in such a manner that you can see the patterns in it and make sense of them.

You see the specifications detailed in Book 8, Chapter 4 in the form of algorithms used to perform specific tasks. When you get to Book 9, you begin to see the reason that everyone agrees to specific sets of rules governing the use of algorithms to perform tasks. The point is to use an algorithm that will best suit the data you have in hand to achieve the specific goals you’ve created. Professionals implement algorithms using languages that work best for the task. Machine learning relies on Python and R, and to some extent MATLAB, Java, Julia, and C++. (See the discussion at www.quora.com/What-is-the-best-language-to-use-while-learning-machine-learning-for-the-first-time for details.)

Defining the divide between art and engineering

The reason that AI and machine learning are both sciences and not engineering disciplines is that both require some level of art to achieve good results. The artistic element of machine learning takes many forms. For example, you must consider how the data is used. Some data acts as a baseline that trains an algorithm to achieve specific results. The remaining data provides the output used to understand the underlying patterns. No specific rules governing the balancing of data exist; the scientists working with the data must discover whether a specific balance produces optimal output.

remember Cleaning the data also lends a certain amount of artistic quality to the result. The manner in which a scientist prepares the data for use is important. Some tasks, such as removing duplicate records, occur regularly. However, a scientist may also choose to filter the data in some ways or look at only a subset of the data. As a result, the cleaned data set used by one scientist for machine learning tasks may not precisely match the cleaned data set used by another.

You can also tune the algorithms in certain ways or refine how the algorithm works. Again, the idea is to create output that truly exposes the desired patterns so that you can make sense of the data. For example, when viewing a picture, a robot may have to determine which elements of the picture it can interact with and which elements it can’t. The answer to that question is important if the robot must avoid some elements to keep on track or to achieve specific goals.

When working in a machine learning environment, you also have the problem of input data to consider. For example, the microphone found in one smartphone won’t produce precisely the same input data that a microphone in another smartphone will. The characteristics of the microphones differ, yet the result of interpreting the vocal commands provided by the user must remain the same. Likewise, environmental noise changes the input quality of the vocal command, and the smartphone can experience certain forms of electromagnetic interference. Clearly, the variables that a designer faces when creating a machine learning environment are both large and complex.

The art behind the engineering is an essential part of machine learning. The experience that a scientist gains in working through data problems is essential because it provides the means for the scientist to add values that make the algorithm work better. A finely tuned algorithm can make the difference between a robot successfully threading a path through obstacles and hitting every one of them.

Learning in the Age of Big Data

Computers manage data through applications that perform tasks using algorithms of various sorts. A simple definition of an algorithm is a systematic set of operations to perform on a given data set — essentially a procedure. The four basic data operations are create, read, update, and delete (CRUD). This set of operations may not seem complex, but performing these essential tasks is the basis of everything you do with a computer. As the data set becomes larger, the computer can use the algorithms found in an application to perform more work. The use of immense data sets, known as big data, enables a computer to perform work based on pattern recognition in a nondeterministic manner. In short, to create a computer setup that can learn, you need a data set large enough for the algorithms to find patterns in it, and the pattern recognition needs to work from a simple subset to make predictions (statistical analysis) about the data set as a whole.
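
To make CRUD concrete, here’s a minimal sketch in Python that performs all four operations on an in-memory dictionary standing in for a real data store; the record key and fields are hypothetical.

  # The four basic data operations (CRUD) on a simple in-memory store.
  # A dictionary stands in for a real database table here.
  records = {}

  records["cust1"] = {"name": "Ann", "total": 100}  # Create
  print(records["cust1"]["name"])                   # Read
  records["cust1"]["total"] = 150                   # Update
  del records["cust1"]                              # Delete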

Big data exists in many places today. Obvious sources are online databases, such as those created by vendors to track consumer purchases. However, you find many non-obvious data sources, too, and often these non-obvious sources provide the greatest resources for doing something interesting. Finding appropriate sources of big data lets you create machine learning scenarios in which a machine can learn in a specified manner and produce a desired result.

Statistics, one of the methods of machine learning that you consider in this book, is a method of describing problems using math. By combining big data with statistics, you can create a machine learning environment in which the machine considers the probability of any given event. However, saying that statistics is the only machine learning method is incorrect. This chapter also introduces you to the other forms of machine learning currently in place.

Algorithms determine how a machine interprets big data. The algorithm used to perform machine learning affects the outcome of the learning process and, therefore, the results you get. This chapter helps you understand the five main techniques for using algorithms in machine learning.

Before an algorithm can do much in the way of machine learning, you must train it. The training process modifies how the algorithm views big data. The final section of this chapter helps you understand that training is actually using a subset of the data as a method for creating the patterns that the algorithm needs to recognize specific cases from the more general cases that you provide as part of the training.

Defining big data

Big data is substantially different from being just a large database. Yes, big data implies lots of data, but it also includes the idea of complexity and depth. A big data source describes something in enough detail that you can begin working with that data to solve problems for which general programming proves inadequate. For example, consider Google’s self-driving cars. The car must consider not only the mechanics of the car’s hardware and its position within space but also the effects of human decisions, road conditions, environmental conditions, and other vehicles on the road. The data source contains many variables — all of which affect the vehicle in some way. Traditional programming might be able to crunch all the numbers, but not in real time. You don’t want the car to crash into a wall and have the computer finally decide five minutes later that the car is going to crash into a wall. The processing must prove timely so that the car can avoid the wall.

The acquisition of big data can also prove daunting. The sheer bulk of the data set isn’t the only problem to consider — also essential is to consider how the data set is stored and transferred so that the system can process it. In most cases, developers try to store the data set in memory to allow fast processing. Using a hard drive to store the data would prove too costly, time-wise.

remember When thinking about big data, you also consider anonymity. Big data presents privacy concerns. However, because of the way machine learning works, knowing specifics about individuals isn’t particularly helpful anyway. Machine learning is all about determining patterns — analyzing training data in such a manner that the trained algorithm can perform tasks that the developer didn’t originally program it to do. Personal data has no place in such an environment.

Finally, big data is so large that humans can’t reasonably visualize it without help. Part of what defines big data as big is the fact that a human can learn something from it, but the sheer magnitude of the data set makes recognizing the patterns impossible without help (or at least makes it take a really long time). Machine learning helps humans make sense of, and use, big data.

Considering the sources of big data

Before you can use big data for a machine learning application, you need a source for big data. Of course, the first thing that most developers think about is the huge, corporate-owned database, which could contain interesting information, but it’s just one source. The fact of the matter is that your corporate databases might not even contain particularly useful data for a specific need. The following sections describe locations you can use to obtain additional big data.

Building a new data source

To create viable sources of big data for specific needs, you might find that you actually need to create a new data source. Developers built existing data sources around the needs of the client-server architecture in many cases, and these sources may not work well for machine learning scenarios because they lack the required depth (being optimized to save space on hard drives does have disadvantages). In addition, as you become more adept in using machine learning, you find that you ask questions that standard corporate databases can’t answer. With this in mind, the following sections describe some interesting new sources for big data.

OBTAINING DATA FROM PUBLIC SOURCES

Governments, universities, nonprofit organizations, and other entities often maintain publicly available databases that you can use alone or combined with other databases to create big data for machine learning. For example, you can combine several geographic information systems (GIS) to help create the big data required to make decisions such as where to put new stores or factories. The machine learning algorithm can take all sorts of information into account — everything from the amount of taxes you have to pay to the elevation of the land (which can contribute to making your store easier to see).

The best part about using public data is that it’s usually free, even for commercial use (or you pay a nominal fee for it). In addition, many of the organizations that create these sources maintain them in nearly perfect condition because the organization has a mandate, uses the data to attract income, or uses the data internally. When obtaining public source data, you need to consider a number of issues to ensure that you actually get something useful. Here are some of the criteria you should think about when making a decision:

  • The cost, if any, of using the data source
  • The formatting of the data source
  • Access to the data source (which means having the proper infrastructure in place, such as an Internet connection when using Twitter data)
  • Permission to use the data source (some data sources are copyrighted)
  • Potential issues in cleaning the data to make it useful for machine learning

OBTAINING DATA FROM PRIVATE SOURCES

You can obtain data from private organizations such as Amazon and Google, both of which maintain immense databases that contain all sorts of useful information. In this case, you should expect to pay for access to the data, especially when used in a commercial setting. You may not be allowed to download the data to your personal servers, so that restriction may affect how you use the data in a machine learning environment. For example, some algorithms work slower with data that they must access in small pieces.

The biggest advantage of using data from a private source is that you can expect better consistency. The data is likely cleaner than from a public source. In addition, you usually have access to a larger database with a greater variety of data types. Of course, it all depends on where you get the data.

CREATING NEW DATA FROM EXISTING DATA

Your existing data may not work well for machine learning scenarios, but that doesn’t keep you from creating a new data source using the old data as a starting point. For example, you might find that you have a customer database that contains all the customer orders, but the data isn’t useful for machine learning because it lacks the tags required to group the data into specific types. One of the new job types that you can expect to see involves massaging data to make it better suited for machine learning — including the addition of specific information types, such as tags.

remember Machine learning will have a significant effect on your business. The article at www.computerworld.com/article/3007053/big-data/how-machine-learning-will-affect-your-business.html describes some of the ways in which you can expect machine learning to change how you do business. One of the points in this article is that machine learning typically works on 80 percent of the data. In 20 percent of the cases, you still need humans to take over the job of deciding just how to react to the data and then act upon it. The point is that machine learning saves money by taking over repetitious tasks that humans don’t really want to do in the first place (making them inefficient). However, machine learning doesn’t get rid of the need for humans completely, and it creates the need for new types of jobs that are a bit more interesting than the ones that machine learning has taken over. Also important to consider is that you need more humans at the outset until the modifications they make train the algorithm to understand what sorts of changes to make to the data.

Using existing data sources

Your organization has data hidden in all sorts of places. The problem is in recognizing the data as data. For example, you may have sensors on an assembly line that track how products move through the assembly process and ensure that the assembly line remains efficient. Those same sensors can potentially feed information into a machine learning scenario because they could provide inputs on how product movement affects customer satisfaction or the price you pay for postage. The idea is to discover how to create mashups that present existing data as a new kind of data that lets you do more to make your organization work well.

remember Big data can come from any source, even your email. A recent article discusses how Google uses your email to create a list of potential responses for new emails. (See the article at www.semrush.com/blog/deep-learning-an-upcoming-gmail-feature-that-will-answer-your-emails-for-you.) Instead of having to respond to every email individually, you can simply select a canned response at the bottom of the page. This sort of automation isn’t possible without the original email data source. Looking for big data in specific locations will blind you to the big data sitting in common places that most people don’t think about as data sources. Tomorrow’s applications will rely on these alternative data sources, but to create these applications, you must begin seeing the data hidden in plain view today.

Some of these applications already exist, and you’re completely unaware of them. The video at www.research.microsoft.com/apps/video/default.aspx?id=256288 makes the presence of these kinds of applications more apparent. By the time you complete the video, you begin to understand that many uses of machine learning are already in place and users already take them for granted (or have no idea that the application is even present).

Locating test data sources

As you progress through Book 8, you discover the need to teach whichever algorithm you’re using (don’t worry about specific algorithms; you see a number of them in Book 9) how to recognize various kinds of data and then to do something interesting with it. This training process ensures that the algorithm reacts correctly to the data it receives after the training is over. Of course, you also need to test the algorithm to determine whether the training is a success. In many cases, Book 8 helps you discover ways to break a data source into training and testing data components in order to achieve the desired result. Then, after training and testing, the algorithm can work with new data in real time to perform the tasks that you verified it can perform.

In some cases, you might not have enough data at the outset for both training (the essential initial test) and testing. When this happens, you might need to create a test setup to generate more data, rely on data generated in real time, or create the test data source artificially. You can also use similar data from existing sources, such as a public or private database. The point is that you need both training and testing data that will produce a known result before you unleash your algorithm into the real world of working with uncertain data.
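
When you need to create the test data source artificially, a library can generate one for you. Here’s a minimal sketch, assuming Python with scikit-learn installed; every parameter value shown is an arbitrary choice for illustration.

  # Generating an artificial data source, then splitting it for training
  # and testing. Assumes scikit-learn; all parameter values are arbitrary.
  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=1000, n_features=20,
                             n_informative=5, n_classes=2, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.25, random_state=0)
  print(X_train.shape, X_test.shape)  # (750, 20) (250, 20)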

Specifying the role of statistics in machine learning

Some sites online would have you believe that statistics and machine learning are two completely different technologies. For example, when you read Statistics vs. Machine Learning, fight! (http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/), you get the idea that the two technologies are not only different, but downright hostile toward each other. The fact is that statistics and machine learning have a lot in common and that statistics represents one of the five tribes (schools of thought) that make machine learning feasible. The five tribes are

  • Symbolists: The origin of this tribe is in logic and philosophy. This group relies on inverse deduction to solve problems.
  • Connectionists: The origin of this tribe is in neuroscience. This group relies on backpropagation to solve problems.
  • Evolutionaries: The origin of this tribe is in evolutionary biology. This group relies on genetic programming to solve problems.
  • Bayesians: The origin of this tribe is in statistics. This group relies on probabilistic inference to solve problems.
  • Analogizers: The origin of this tribe is in psychology. This group relies on kernel machines to solve problems.

The ultimate goal of machine learning is to combine the technologies and strategies embraced by the five tribes to create a single algorithm (the master algorithm) that can learn anything. Of course, achieving that goal is a long way off. Even so, scientists such as Pedro Domingos (homes.cs.washington.edu/~pedrod/) are currently working toward that goal.

Book 9 follows the Bayesian tribe strategy, for the most part, in that you solve most problems using some form of statistical analysis. You do see strategies embraced by other tribes described, but the main reason you begin with statistics is that the technology is already well established and understood. In fact, many elements of statistics qualify more as engineering (in which theories are implemented) than science (in which theories are created). The next section of the chapter delves deeper into the five tribes by viewing the kinds of algorithms each tribe uses. Understanding the role of algorithms in machine learning is essential to defining how machine learning works.

Understanding the role of algorithms

Everything in machine learning revolves around algorithms. An algorithm is a procedure or formula used to solve a problem. The problem domain affects the kind of algorithm needed, but the basic premise is always the same — to solve some sort of problem, such as driving a car or playing dominoes. In the first case, the problems are complex and many, but the ultimate problem is one of getting a passenger from one place to another without crashing the car. Likewise, the goal of playing dominoes is to win. The following sections discuss algorithms in more detail.

Defining what algorithms do

An algorithm is a kind of container. It provides a box for storing a method to solve a particular kind of problem. Algorithms process data through a series of well-defined states. The states need not be deterministic, but they are defined nonetheless. The goal is to create an output that solves a problem. In some cases, the algorithm receives inputs that help define the output, but the focus is always on the output.

Algorithms must express the transitions between states using a well-defined and formal language that the computer can understand. In processing the data and solving the problem, the algorithm defines, refines, and executes a function. The function is always specific to the kind of problem being addressed by the algorithm.

Considering the five main techniques

As described in the previous section, each of the five tribes has a different technique and strategy for solving problems that result in unique algorithms. Combining these algorithms should lead eventually to the master algorithm that will be able to solve any given problem. The following sections provide an overview of the five main algorithmic techniques.

SYMBOLIC REASONING

The term inverse deduction commonly appears as induction. In symbolic reasoning, deduction expands the realm of human knowledge, while induction raises the level of human knowledge. Induction commonly opens new fields of exploration, while deduction explores those fields. However, the most important consideration is that induction is the science portion of this type of reasoning, while deduction is the engineering. The two strategies work hand in hand to solve problems by first opening a field of potential exploration to solve the problem and then exploring that field to determine whether it does, in fact, solve it.

As an example of this strategy, deduction would say that if a tree is green and green trees are alive, the tree must be alive. When thinking about induction, you would say that the tree is green and that the tree is also alive; therefore, green trees are alive. Induction provides the answer to what knowledge is missing given a known input and output.
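
Here’s the green-tree example as a minimal Python sketch (the rule and fact structures are made up purely for illustration): deduction applies a known rule to a fact, whereas induction proposes the rule from observed examples.

  # Deduction: apply the known rule "green trees are alive" to a fact.
  rule = {"if": "green tree", "then": "alive"}
  fact = "green tree"
  if fact == rule["if"]:
      print("Deduced:", rule["then"])          # Deduced: alive

  # Induction: propose the rule from repeated observations.
  observations = [("green tree", "alive"), ("green tree", "alive")]
  if all(state == "alive" for thing, state in observations):
      print("Induced rule: green trees are alive")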

CONNECTIONS MODELLED ON THE BRAIN’S NEURONS

The connectionists are perhaps the most famous of the five tribes. This tribe strives to reproduce the brain’s functions using silicon instead of neurons. Essentially, each of the neurons (created as an algorithm that models the real-world counterpart) solves a small piece of the problem, and the use of many neurons in parallel solves the problem as a whole.

The use of backpropagation, or backward propagation of errors, seeks to determine the conditions under which errors are removed from networks built to resemble human neurons by changing the weights (how much a particular input figures into the result) and biases (offsets that shift the point at which a neuron activates) of the network. The goal is to continue changing the weights and biases until the actual output matches the target output. At this point, the artificial neuron fires and passes its solution along to the next neuron in line. The solution created by just one neuron is only part of the whole solution. Each neuron passes information to the next neuron in line until the group of neurons creates a final output.
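
Here’s a toy illustration of the weight-and-bias updates just described: a single linear neuron in Python, nudged toward a target output by repeated error correction. It’s a sketch of the idea, not full multilayer backpropagation, and the input, target, and learning rate are all made-up values.

  # A single artificial neuron learning by error correction: the weight
  # and bias change a little on each pass until the output nears the
  # target. A toy sketch, not full multilayer backpropagation.
  w, b = 0.0, 0.0          # initial weight and bias
  x, target = 2.0, 1.0     # one input and its desired output
  rate = 0.1               # learning rate (arbitrary)

  for step in range(50):
      output = w * x + b               # the neuron's prediction
      error = output - target          # how far off it is
      w -= rate * error * x            # gradient step for the weight
      b -= rate * error                # gradient step for the bias

  print(round(output, 3))  # very close to 1.0 after training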

EVOLUTIONARY ALGORITHMS THAT TEST VARIATION

The evolutionaries rely on the principles of evolution to solve problems. In other words, this strategy is based on the survival of the fittest (removing any solutions that don’t match the desired output). A fitness function determines the viability of each function in solving a problem.

Using a tree structure, the solution method looks for the best solution based on function output. The winner of each level of evolution gets to build the next-level functions. The idea is that the next level will get closer to solving the problem but may not solve it completely, which means that another level is needed. This particular tribe relies heavily on recursion and languages that strongly support recursion to solve problems. An interesting output of this strategy has been algorithms that evolve: One generation of algorithms actually builds the next generation.
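
The following minimal Python sketch shows the strategy in action: a fitness function scores random candidates, the fittest survive each generation, and the survivors produce mutated offspring. The fitness function, population size, and mutation size are all arbitrary assumptions for illustration.

  # A toy evolutionary search: keep the fittest candidates, mutate them,
  # and repeat. The fitness function and all constants are made up.
  import random

  def fitness(x):
      return -abs(x - 42)              # best when x is exactly 42

  population = [random.uniform(0, 100) for _ in range(20)]
  for generation in range(100):
      population.sort(key=fitness, reverse=True)
      survivors = population[:5]                   # survival of the fittest
      population = [s + random.gauss(0, 1)         # mutated offspring
                    for s in survivors for _ in range(4)]

  print(round(max(population, key=fitness), 1))    # near 42.0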

BAYESIAN INFERENCE

The Bayesians use various statistical methods to solve problems. Given that statistical methods can create more than one apparently correct solution, the choice of a function becomes one of determining which function has the highest probability of succeeding. For example, when using these techniques, you can accept a set of symptoms as input and decide the probability that a particular disease will result from the symptoms as output. Given that multiple diseases have the same symptoms, the probability is important because a user will see cases in which a lower-probability output is actually the correct output for a given circumstance.

Ultimately, this tribe supports the idea of never quite trusting any hypothesis (a result that someone has given you) completely without seeing the evidence used to make it (the input the other person used to make the hypothesis). Analyzing the evidence proves or disproves the hypothesis that it supports. Consequently, it isn’t possible to determine which disease someone has until you test all the symptoms.

technicalstuff One of the most recognizable outputs from this tribe is the spam filter used in many popular email applications.
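
Here’s the symptom-and-disease idea as a worked Bayes’ rule calculation in Python; every probability in it is invented for illustration.

  # Bayes' rule: P(disease | symptom) =
  #     P(symptom | disease) * P(disease) / P(symptom)
  # All probabilities below are invented for illustration.
  p_disease = 0.01               # prior: 1 percent of patients have it
  p_symptom_given_disease = 0.9  # most patients with it show the symptom
  p_symptom = 0.08               # overall rate of the symptom

  p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
  print(p_disease_given_symptom)  # about 0.11, so still fairly unlikely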

SYSTEMS THAT LEARN BY ANALOGY

The analogizers use kernel machines to recognize patterns in data. By recognizing the pattern of one set of inputs and comparing it to the pattern of a known output, you can create a problem solution. The goal is to use similarity to determine the best solution to a problem. It’s the kind of reasoning that says that because a particular solution worked in a given circumstance at some previous time, using that solution for a similar set of circumstances should also work. One of the most recognizable outputs from this tribe is recommender systems. For example, when you get on Amazon and buy a product, the recommender system comes up with other, related products that you might also want to buy.
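
Here’s the reasoning-by-similarity idea in a minimal Python sketch. It uses simple Jaccard set similarity rather than a true kernel machine, and the customers and products are hypothetical.

  # Recommending by analogy: find the customer most similar to you and
  # suggest an item from that customer's history. Data is hypothetical.
  purchases = {
      "you": {"tent", "sleeping bag"},
      "ann": {"tent", "sleeping bag", "camp stove"},
      "bob": {"laptop", "monitor"},
  }

  def similarity(a, b):
      # Jaccard similarity: shared items over total distinct items.
      return len(a & b) / len(a | b)

  others = [name for name in purchases if name != "you"]
  nearest = max(others,
                key=lambda n: similarity(purchases["you"], purchases[n]))
  print(purchases[nearest] - purchases["you"])   # {'camp stove'}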

Defining what training means

Many people are somewhat used to the idea that applications start with a function, accept data as input, and then provide a result. For example, a programmer might create a function called Add() that accepts two values as input, such as 1 and 2. The result of Add() is 3. The output of this process is a value. In the past, writing a program meant understanding the function used to manipulate data to create a given result with certain inputs.

Machine learning turns this process around. In this case, you know that you have inputs, such as 1 and 2. You also know that the desired result is 3. However, you don’t know what function to apply to create the desired result. Training provides a learner algorithm with all sorts of examples of the desired inputs and results expected from those inputs. The learner then uses this input to create a function. In other words, training is the process whereby the learner algorithm maps a flexible function to the data. The output is typically the probability of a certain class or a numeric value.
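
The following sketch makes this reversal concrete, assuming Python with scikit-learn: the learner receives only input pairs and the desired results (their sums) and must map a function to them. Nothing in the code tells the model that the function is addition.

  # The learner sees inputs and desired outputs, never the Add() function
  # itself, and maps a flexible function to the examples.
  from sklearn.linear_model import LinearRegression

  X = [[1, 2], [2, 2], [3, 5], [4, 1], [5, 5]]   # pairs of inputs
  y = [3, 4, 8, 5, 10]                           # desired results (sums)

  model = LinearRegression().fit(X, y)
  print(round(model.predict([[6, 7]])[0]))       # 13: it learned to add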

remember A single learner algorithm can learn many different things, but not every algorithm is suited to every task. Some algorithms are general enough that they can play chess, recognize faces on Facebook, and diagnose cancer in patients. An algorithm reduces the data inputs and the expected results of those inputs to a function in every case, but the function is specific to the kind of task you want the algorithm to perform.

The secret to machine learning is generalization. The goal is to generalize the output function so that it works on data beyond the training set. For example, consider a spam filter. Your dictionary contains 100,000 words (actually a small dictionary). A limited training data set of 4,000 or 5,000 word combinations must create a generalized function that can then find spam in the 2^100,000 combinations that the function will see when working with actual data.

When viewed from this perspective, training might seem impossible and learning even worse. However, to create this generalized function, the learner algorithm relies on just three components (a short sketch of the evaluation step appears after this list):

  • Representation: The learner algorithm creates a model, which is a function that will produce a given result for specific inputs. The representation is a set of models that a learner algorithm can learn. In other words, the learner algorithm must create a model that will produce the desired results from the input data. If the learner algorithm can’t perform this task, it can’t learn from the data, and the data is outside the hypothesis space of the learner algorithm. Part of the representation is to discover which features (data elements within the data source) to use for the learning process.
  • Evaluation: The learner can create more than one model. However, it doesn’t know the difference between good and bad models. An evaluation function determines which of the models works best in creating a desired result from a set of inputs. The evaluation function scores the models because more than one model could provide the required results.
  • Optimization: At some point, the training process produces a set of models that can generally output the right result for a given set of inputs. At this point, the training process searches through these models to determine which one works best. The best model is then output as the result of the training process.
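
Here’s a minimal sketch of the evaluation step, assuming Python with scikit-learn: two candidate models are scored on the same data, and the training process would then keep whichever one scores better. The data set and the choice of models are arbitrary.

  # Evaluation in miniature: score two candidate models on the same data
  # and keep the better one. The data set and models are arbitrary.
  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.model_selection import cross_val_score

  X, y = make_classification(n_samples=300, random_state=1)

  for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier()):
      score = cross_val_score(model, X, y, cv=5).mean()
      print(type(model).__name__, round(score, 3))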

Much of Book 8 and Book 9 focuses on representation. For example, in Book 9, Chapter 2 you discover how to work with the k-Nearest Neighbor (KNN) algorithm. However, the training process is more involved than simply choosing a representation. All three steps come into play when performing the training process. Fortunately, you can start by focusing on representation and allow the various libraries discussed in Book 9 to do the rest of the work for you.
