Solving the business question

What are we trying to do with regression? If your business question involves predicting probabilities or producing a score, then regression is a great place to start; business problems that require scoring are also known as regression problems. In this example, we have scored the likelihood that an individual earns above or below fifty thousand dollars per annum.

The main objective is to create a model that we can use on other data, too. The output is a set of results, but it is also an equation that describes the relationship between a number of predictor variables and the response variable.

What do the terms mean?

For example, you could try to estimate the probability that a given person earns above or below fifty thousand dollars. Two terms come up immediately:

  • Error: The difference between the predicted value and the true value
  • Residuals: The difference between the actual values of the variable you're predicting and the predicted values from your regression – y - ŷ

For most regressions, we ideally want the residuals to look like a normal distribution when plotted. Normally distributed residuals indicate three things: the mean of the difference between our predictions and the actual values is close to 0 (good); when we miss, we miss both short and long of the actual value; and the likelihood of a miss shrinks as its distance from the actual value grows.
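As a minimal sketch of this check, assuming we already have arrays of actual and predicted values (the data and variable names below are invented for illustration), we can plot a histogram of the residuals and confirm that its mean sits near zero:

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical actual and predicted values
    y_true = np.array([3.1, 4.8, 2.9, 5.4, 4.1, 3.7])
    y_pred = np.array([3.0, 5.0, 3.2, 5.1, 4.0, 3.9])

    residuals = y_true - y_pred  # y - y-hat, as defined above

    # A roughly bell-shaped histogram centred near zero suggests the
    # model's misses are unbiased
    plt.hist(residuals, bins=10)
    plt.xlabel("Residual (y - y-hat)")
    plt.ylabel("Count")
    plt.show()

    print("Mean residual:", residuals.mean())  # close to 0 is good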

Think of it like a dartboard. A good model is going to hit the bullseye some of the time (but not every time). When it doesn't hit the bullseye, it misses in all of the other segments evenly (not just piling up in the 16 segment), and its misses land closer to the bullseye rather than out on the edges of the dartboard.

Coefficient of determination/R-squared–how well the model fits the data:

  • The proportion of the variation explained by the model
  • 1 is a perfect fit

The term error here represents the difference between the predicted value and the true value. Because this difference can be negative in some cases, the absolute value or the square of the difference is usually computed to capture the total magnitude of error across all instances. Error metrics measure the predictive performance of a regression model in terms of the mean deviation of its predictions from the true values: lower error values mean the model is more accurate in making predictions, and an overall error metric of 0 means that the model fits the data perfectly.
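Here is a small sketch of computing these two error metrics directly, assuming made-up arrays of true and predicted values:

    import numpy as np

    y_true = np.array([52.0, 41.5, 60.2, 38.9])
    y_pred = np.array([50.0, 43.0, 58.0, 40.0])

    errors = y_pred - y_true      # raw errors can be negative
    mae = np.abs(errors).mean()   # mean absolute error
    mse = (errors ** 2).mean()    # mean squared error

    print(f"MAE: {mae:.3f}")      # 0 would mean a perfect fit
    print(f"MSE: {mse:.3f}")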

We can then pass a single feature vector to our trained model and it will return an expected label. You can view this part of the slide in two ways. The first is as a single feature vector – for example, sepal width/length and petal width/length – where the output will be the name of an iris plant. The second is as the leftover part of our data, usually 20% of it, which is used to determine how effective our model is: the trained model guesses labels that we already know, which tells us whether the model is good or bad.
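The sketch below illustrates both views using scikit-learn's iris dataset; the 80/20 split matches the description above, while the choice of logistic regression as the model is an assumption for illustration:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # Hold back 20% of the rows; we know their true labels, so we can
    # check how well the trained model guesses them
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # A single feature vector: sepal length/width, petal length/width
    single_flower = [[5.1, 3.5, 1.4, 0.2]]
    print("Predicted class:", model.predict(single_flower))

    # Scoring the held-back 20% tells us whether the model is good or bad
    print("Held-out accuracy:", model.score(X_test, y_test))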

The coefficient of determination, also known as R-squared, is another standard way of measuring how well the model fits the data. It can be interpreted as the proportion of variation explained by the model. A higher proportion is better in this case, where 1 indicates a perfect fit.

Another measure that's useful for these continuous models is Root Mean Square Deviation, or Root Mean Square Error – in this case, we take the square root of the MSE. This brings the error back to the scale of the Y-axis, so it measures the average error in the same units as the value we are predicting, which makes it much easier to interpret.
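As a quick sketch, RMSE is simply the square root of the MSE computed earlier; the values below are illustrative:

    import numpy as np

    y_true = np.array([52.0, 41.5, 60.2, 38.9])
    y_pred = np.array([50.0, 43.0, 58.0, 40.0])

    mse = ((y_pred - y_true) ** 2).mean()
    rmse = np.sqrt(mse)

    print(f"MSE:  {mse:.3f}")   # in squared units of y
    print(f"RMSE: {rmse:.3f}")  # in the same units as y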

Understanding the performance of the result

The p-value helps you interpret the result. It tests the null hypothesis that the coefficient is equal to zero – in other words, that there is effectively no difference between the items that you are testing.

A low p-value, usually denoted as < 0.05 (five percent), indicates that you can reject the null hypothesis. In other words, the predictor is having an effect on the item that you are predicting, which is also known as the response variable.

A predictor that has a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the item that you are predicting. Conversely, a larger p-value is said to be insignificant: changes in the predictor are not associated with changes in the response, so the predictor has no meaningful effect on the item that you are predicting.
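One way to see these p-values in practice is an ordinary least squares fit with the statsmodels library; the synthetic data below is an assumption, built so that only the first predictor actually drives the response:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)            # a predictor that matters
    x2 = rng.normal(size=100)            # a predictor that does not
    y = 2.0 * x1 + rng.normal(size=100)  # response depends only on x1

    X = sm.add_constant(np.column_stack([x1, x2]))
    results = sm.OLS(y, X).fit()

    # Each p-value tests the null hypothesis that the coefficient is
    # zero; expect a tiny value for x1 and a large one for x2
    print(results.pvalues)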

The first useful measure applies when we have a continuous model – for example, trying to predict a runner's average pace in a race, which is continuous rather than discrete. We end up with a predicted pace or finish time that we can check against when the runner finishes the race.

In this case, we can take this and all of the other races that the runner has run and calculate the difference between each time we predicted and the actual time the runner achieved. If we square these differences and average them, we get the Mean Square Error, which is a measure of how good our continuous model is – a zero MSE represents a perfect model where every prediction we made about the runner matches exactly what the runner achieved.
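As a worked sketch of the runner example, with invented finish times in minutes:

    import numpy as np

    predicted_times = np.array([182.0, 175.5, 190.0, 178.0])
    actual_times = np.array([180.0, 177.0, 188.5, 181.0])

    differences = predicted_times - actual_times
    mse = (differences ** 2).mean()  # 0 means every prediction was exact

    print(f"MSE across races: {mse:.2f} minutes squared")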

A great measure for the accuracy of our model is an extension of something that we looked at in the previous module when we considered correlation – we ended up with a correlation coefficient called R that gave us a measure between -1 and 1. R² is generally used to show whether our continuous model is a good fit – it yields a measure between 0 and 1, with 1 being a perfect fit.

The better the linear regression (on the right) fits the data in comparison to the simple average (on the left graph), the closer the value is to 1. The areas of the blue squares represent the squared residuals with respect to the continuous model. The areas of the red squares represent the squared residuals with respect to the average value (mean).
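The sketch below ties R² to that picture: one minus the ratio of the model's squared residuals (the blue squares) to the squared residuals around the mean (the red squares). The data is illustrative:

    import numpy as np
    from sklearn.metrics import r2_score

    y_true = np.array([3.0, 4.5, 6.1, 7.9, 9.8])
    y_pred = np.array([3.2, 4.4, 6.0, 8.1, 9.5])  # from some fitted line

    ss_res = ((y_true - y_pred) ** 2).sum()          # model residuals
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # residuals vs mean

    print("R-squared (by hand):", 1 - ss_res / ss_tot)
    print("R-squared (sklearn):", r2_score(y_true, y_pred))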

If we predict the iris classes, we should be able to see that we got some of them right.

You can see here that, out of 30 iris data points, we predicted 19 of them accurately. We need to consider these 19, but also the 11 we got wrong – and how and why we got them wrong – to understand the good parts and the bad parts of our model. The counts along the diagonal in the center of the matrix are the 19 we got right.
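A minimal sketch of producing this kind of confusion matrix for a 30-point iris test set follows; the split and the choice of a decision tree are assumptions for illustration:

    from sklearn.datasets import load_iris
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0  # 30 of 150 rows held back
    )

    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Counts on the diagonal are correct predictions; off-diagonal
    # counts show which classes the model confused
    print(confusion_matrix(y_test, y_pred))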

This is somewhat harder to picture because this is a many-class problem, but as long as each class boils down to a single state, we can look at whether the model is viable for making predictions from the adult dataset.

Next steps

It is good governance to carry out continual evaluation of the data model. Ongoing testing and experimentation are essential for good business decisions based on machine learning. It may seem as if the analytical process is never finished. Why is this the case?

Andrew Grove wrote a book called Only the Paranoid Survive, which documented how Intel survived many changes and upsets in the computing industry throughout its history. Grove suggested that businesses are affected by six forces, both internal and external:

  • Existing competition
  • Complementary businesses
  • Customers
  • Potential customers
  • Possibility of alternative ways of achieving the same end
  • Suppliers

Grove proposed that if these forces stay in balance, the company will steer a steady course. It's important to note that these forces are highly visible in terms of the data that the company receives. This data could come from websites, customer contacts, external competitive information, stock market APIs, and so on. Data changes over time, and the internal and external forces can express themselves through these changes. This is why it's important to keep evaluating our models.

Within the CRISP-DM framework, evaluation is a key part of the process. It assesses the efficiency and validity of the model in preparation for deployment, lays out the steps required and the instructions for carrying them out, and includes a monitoring and maintenance plan that summarizes the strategy for an ongoing review of the model's performance, so that any decline over time is detected. Note that this is a cycle, not a finished process. Once the models are put into production, they need to be continually checked against the data and against the original business question that they were supposed to answer. The model could be running in Azure ML, but its actual output and results may not be performing well against what it is intended to do.
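As a hedged sketch of what such a monitoring check might look like in code (the baseline value, threshold, and function name are all illustrative assumptions, not part of CRISP-DM itself):

    from sklearn.metrics import accuracy_score

    BASELINE_ACCURACY = 0.90  # recorded when the model was deployed
    ALERT_THRESHOLD = 0.05    # acceptable drop before we investigate

    def check_model_health(model, X_fresh, y_fresh):
        """Compare accuracy on fresh labelled data to the baseline."""
        current = accuracy_score(y_fresh, model.predict(X_fresh))
        if BASELINE_ACCURACY - current > ALERT_THRESHOLD:
            print(f"ALERT: accuracy fell to {current:.2f}; review the model")
        else:
            print(f"OK: accuracy {current:.2f} is within tolerance")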

With all machine learning, it's important to prove the model's worth over a series of results. It's important to look at the larger pattern of results, rather than simply any given specific result.
