Preface

 

"He who defends everything, defends nothing."

 
 --Frederick the Great

Machine learning is a very broad topic. The following quote sums it up nicely: "The first problem facing you is the bewildering variety of learning algorithms available. Which one to use? There are literally thousands available, and hundreds more are published each year." (Domingos, P., 2012). It would therefore be irresponsible to try to cover everything in the chapters that follow because, to paraphrase Frederick the Great, we would achieve nothing.

With this constraint in mind, I hope to provide a solid foundation of algorithms and business considerations that will allow the reader to walk away and, first, take on any machine learning task with confidence, and second, be able to work through other algorithms and topics on their own. Essentially, if this book significantly helps you to help yourself, then I would consider it a victory. Don't think of this book as a destination but rather as a path to self-discovery.

The world of R can be as bewildering as the world of machine learning! There is a seemingly endless number of R packages, along with a plethora of blogs, websites, discussions, and papers of varying quality and complexity from the community that supports R. This is a great reservoir of information and probably R's greatest strength, but I've always believed that an entity's greatest strength can also be its greatest weakness. R's vast community of knowledge can quickly overwhelm and/or sidetrack you and your efforts. Show me a problem, give me ten different R programmers, and I'll show you ten different ways to write the code that solves it. As I've written each chapter, I've endeavored to capture the critical elements that can assist you in using R to understand, prepare, and model the data. I am no R programming expert by any stretch of the imagination, but again, I like to think that I can provide a solid foundation herein.

Another thing that lit a fire under me to write this book was an incident that happened in the hallways of a former employer a couple of years ago. My team had an IT contractor supporting the management of our databases. As we were walking and chatting about big data and the like, he mentioned that he had bought a book about machine learning with R and another about machine learning with Python. He stated that he could do all the programming, but all of the statistics made absolutely no sense to him. I have kept this conversation at the back of my mind throughout the writing process. It has been a challenging task to balance the technical and theoretical with the practical. One could turn the theory in each chapter into a book of its own, and probably someone has. I used a heuristic of sorts to decide whether a formula or technical aspect was in scope: would it help me or the reader in discussions with team members and business leaders? If I felt it might, I strove to provide the necessary details.

I also made a conscious effort to keep the datasets used in the practical exercises large enough to be interesting but small enough to allow you to gain insight without becoming overwhelmed. This book is not about big data, but make no mistake about it, the methods and concepts that we will discuss can be scaled to big data.

In short, this book will appeal to a broad group of individuals, from IT experts seeking to understand and interpret machine learning algorithms to statistical gurus desiring to incorporate the power of R into their analysis. However, even those that are well-versed in both IT and statistics—experts if you will—should be able to pick up quite a few tips and tricks to assist them in their efforts.

Machine learning defined

Machine learning is everywhere! It is used in web search, spam filters, recommendation engines, medical diagnostics, ad placement, fraud detection, credit scoring, and I fear in these autonomous cars that I hear so much about. The roads are dangerous enough now; the idea of cars with artificial intelligence, requiring CTRL + ALT + DEL every 100 miles, aimlessly roaming the highways and byways is just too terrifying to contemplate. But, I digress.

It is always important to properly define what one is talking about, and machine learning is no different. The website machinelearningmastery.com has a full page dedicated to this question, which provides some excellent background material. It also offers a succinct one-liner that is worth adopting as an operational definition: machine learning is the training of a model from data that generalizes a decision against a performance measure.

With this definition in mind, we will require a few things in order to perform machine learning. The first is that we have the data. The second is that a pattern actually exists, which is to say that with known input values from our training data, we can make a prediction or decision on data that we did not use to train the model. This is the generalization in machine learning. The third is some sort of performance measure to see how well we are learning/generalizing, for example, the mean squared error, accuracy, and others. We will look at a number of performance measures throughout the book.
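To make these three ingredients concrete, here is a minimal sketch in base R; the built-in mtcars data and the two predictors are my own choices for illustration, not one of the book's exercises. We learn a simple linear model from a training portion of the data, generalize to the held-out rows, and score the result with the mean squared error:

set.seed(123)                            # reproducible train/test split
train_rows <- sample(nrow(mtcars), 22)   # roughly 70% of the 32 rows for training
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

fit  <- lm(mpg ~ wt + hp, data = train)  # learn a model from the training data
pred <- predict(fit, newdata = test)     # generalize to rows the model never saw

mse <- mean((test$mpg - pred)^2)         # the performance measure
mse

The smaller the mean squared error on the held-out rows, the better the model has generalized beyond the data it was trained on.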

One of the things that I find interesting in the world of machine learning is the change in the language used to describe the data and the process. As such, I can't help but include this snippet from the philosopher, George Carlin:

 

"I wasn't notified of this. No one asked me if I agreed with it. It just happened. Toilet paper became bathroom tissue. Sneakers became running shoes. False teeth became dental appliances. Medicine became medication. Information became directory assistance. The dump became the landfill. Car crashes became automobile accidents. Partly cloudy became partly sunny. Motels became motor lodges. House trailers became mobile homes. Used cars became previously owned transportation. Room service became guest-room dining, and constipation became occasional irregularity.

 
 --Philosopher and Comedian, George Carlin

I cut my teeth on datasets that had dependent and independent variables. I would build a model with the goal of trying to find the best fit. Now, I have labeled instances and input features that require engineering, which will become the feature space that I use to learn a model. Where I once looked at my model parameters, I now look at weights.

The bottom line is that I still use these terms interchangeably and probably always will. Machine learning purists may curse me, but I don't believe I have caused any harm to life or limb.
