Chapter 5. Regression and Classification

In the previous chapter, we got familiar with supervised and unsupervised learning. Another standard taxonomy of machine learning methods is based on whether the label comes from a continuous or a discrete space. Even if the discrete labels are ordered, there is a significant difference between the two cases, particularly in how the goodness-of-fit metrics are evaluated.

In this chapter, we will cover the following topics:

  • Learning about the origin of the word regression
  • Learning metrics for evaluating the goodness of fit in continuous and discrete space
  • Discussing how to write simple code in Scala for linear and logistic regression
  • Learning about advanced concepts such as regularization, multiclass predictions, and heteroscedasticity
  • Discussing an example of MLlib application for regression tree analysis
  • Learning about the different ways of evaluating classification models

What does regression stand for?

While the word classification is intuitively clear, the word regression does not seem to imply a predictor of a continuous label. According to the Webster dictionary, regression is:

"a return to a former or less developed state."

It also mentions a special definition for statistics: a measure of the relation between the mean value of one variable (for example, output) and corresponding values of other variables (for example, time and cost), which is the meaning in use these days. Historically, however, the regression coefficient was meant to signify the heritability of certain characteristics, such as weight and size, from one generation to another, with a hint of planned gene selection, including in humans (http://www.amstat.org/publications/jse/v9n3/stanton.html).

More specifically, in 1875, Galton, a cousin of Charles Darwin and an accomplished 19th-century scientist in his own right, who was also widely criticized for the promotion of eugenics, distributed packets of sweet pea seeds to seven friends. Each friend received seeds of uniform weight, but with substantial variation across the seven packets. Galton's friends were supposed to harvest the next generation of seeds and ship them back to him. Galton then analyzed the statistical properties of the seeds within each group, and one of the analyses was to plot the regression line, which always appeared to have a slope less than one. The specific number cited was 0.33 (Galton, F. (1894), Natural Inheritance (5th ed.), New York: Macmillan and Company), as opposed to either 0, in the case of no correlation and no inheritance, or 1, in the case of total replication of the parent's characteristics in the descendants. We will discuss why the coefficient of the regression line should always be less than 1 in the presence of noise in the data, even if the underlying relationship is perfect. Beyond the discussion and details, though, the origin of the term regression is partly due to the planned breeding of plants and humans. Of course, Galton did not have access to PCA, Scala, or any other computing machinery at the time, which might have shed more light on the differences between correlation and the slope of the regression line.
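The effect of noise on the slope can be illustrated with a small simulation. The following is a minimal sketch, not Galton's actual analysis: it assumes a hypothetical model in which a latent trait is passed on unchanged, while both the parent and the offspring measurements carry independent Gaussian noise of the same standard deviation. The names RegressionDilution, simulate, and slope are introduced here for illustration only. Under these assumptions, the least-squares slope of offspring on parent should approach 1 / (1 + noiseSd^2), which is below 1 whenever noise is present.

```scala
import scala.util.Random

object RegressionDilution {

  /** Ordinary least-squares slope of y regressed on x: cov(x, y) / var(x). */
  def slope(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.size == y.size && x.nonEmpty, "x and y must be non-empty and of equal length")
    val mx = x.sum / x.size
    val my = y.sum / y.size
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val varX = x.map(a => (a - mx) * (a - mx)).sum
    cov / varX
  }

  /** Assumed model: a latent trait with unit variance is inherited perfectly,
    * but each observed measurement adds independent Gaussian noise.
    * Returns (parent measurements, offspring measurements).
    */
  def simulate(n: Int, noiseSd: Double, seed: Long): (Seq[Double], Seq[Double]) = {
    val rnd = new Random(seed)
    val latent = Seq.fill(n)(rnd.nextGaussian())
    val parent = latent.map(_ + noiseSd * rnd.nextGaussian())
    val child = latent.map(_ + noiseSd * rnd.nextGaussian())
    (parent, child)
  }

  def main(args: Array[String]): Unit = {
    for (noiseSd <- Seq(0.0, 0.5, 1.0, 2.0)) {
      val (parent, child) = simulate(100000, noiseSd, seed = 42L)
      // Under the assumed model: cov = 1, var(parent) = 1 + noiseSd^2
      val expected = 1.0 / (1.0 + noiseSd * noiseSd)
      println(f"noise sd = $noiseSd%.1f  fitted slope = ${slope(parent, child)}%.3f  theoretical = $expected%.3f")
    }
  }
}
```

Running RegressionDilution.main prints the fitted slope next to the theoretical value for several noise levels. With zero noise the slope is 1, and it shrinks toward 0 as the noise grows, even though the latent trait is inherited perfectly in this model.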
