The data must be selected carefully so that it captures most of the system's dynamics. We want the neural network to forecast future weather based on current and past weather data, but which variables should we choose? Consulting a domain expert can be very helpful in understanding the relationships between variables.
When defining a problem for a neural network, there are usually one or more predefined target variables: predict the temperature, forecast precipitation, measure insolation, and so on. In some cases, however, one wants to model all the variables and therefore needs to find how they relate to one another. Candidate relationships can be screened with statistical tools, of which the Pearson cross-correlation is the most widely used, although correlation by itself indicates association rather than causation:

$$C_{x,y} = \frac{E[X \cdot Y] - E[X]\,E[Y]}{\sigma_X\,\sigma_Y}$$
Here, E[X·Y] is the mean of the product of variables X and Y; E[X] and E[Y] are the means of X and Y, respectively; σX and σY are the standard deviations of X and Y, respectively; and finally Cx,y is the Pearson coefficient of X related to Y, whose values lie between -1 and 1. This coefficient shows how strongly a variable X is linearly correlated with a variable Y. Values near 0 denote weak or no correlation, while values close to -1 or 1 denote strong negative or positive correlation, respectively. Graphically, it can be seen in a scatter plot, as shown below:
In the chart on the left, the correlation between the previous day's precipitation (indicated as [-1]) and the maximum temperature is -0.202, which is a weak negative correlation. In the middle chart, the correlation between insolation and maximum temperature is 0.376, which is a moderate correlation but not a strong one; one can see a slight positive trend. An example of strong positive correlation is shown in the chart on the right, which is between the previous day's maximum temperature and the maximum temperature of the current day. This correlation is 0.793, and the dots form a thinner cloud that makes the trend visible.
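To make the coefficient and the [-1] notation concrete, here is a minimal, self-contained Java sketch of the Pearson formula described above, together with a lagged variant that pairs a variable's past value with another variable's current value. The class name PearsonSketch, its helper methods, and the sample values are hypothetical and are not part of the book's code; the sketch assumes the population standard deviation (division by n), which is what makes the formula's numerator and denominator consistent.

```java
import java.util.Arrays;

// Illustrative sketch only: names and sample data are assumptions, not the book's classes.
class PearsonSketch {

    // Arithmetic mean of an array.
    static double mean(double[] v) {
        double sum = 0.0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    // Population standard deviation (divides by n, not n - 1).
    static double stdev(double[] v) {
        double m = mean(v);
        double squares = 0.0;
        for (double d : v) {
            squares += (d - m) * (d - m);
        }
        return Math.sqrt(squares / v.length);
    }

    // Pearson coefficient: (E[X*Y] - E[X]*E[Y]) / (sigmaX * sigmaY).
    // Yields NaN if either series is constant (zero standard deviation).
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Series must have the same length");
        }
        double[] xy = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            xy[i] = x[i] * y[i];
        }
        return (mean(xy) - mean(x) * mean(y)) / (stdev(x) * stdev(y));
    }

    // Correlation between x delayed by `lag` samples and y,
    // that is, x[t - lag] is paired with y[t].
    static double laggedPearson(double[] x, double[] y, int lag) {
        double[] xPast = Arrays.copyOfRange(x, 0, x.length - lag);
        double[] yNow = Arrays.copyOfRange(y, lag, y.length);
        return pearson(xPast, yNow);
    }

    public static void main(String[] args) {
        // Made-up daily maximum temperatures, for illustration only.
        double[] maxTemp = {24.1, 25.3, 26.0, 23.8, 22.5, 24.9, 27.2, 26.4};
        // Correlation of yesterday's maximum temperature with today's.
        System.out.println(laggedPearson(maxTemp, maxTemp, 1));
    }
}
```

With a lag of 1, each day's value is compared against the previous day's value, which corresponds to the [-1] notation used in the charts.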
We are going to use correlation to choose the most appropriate inputs for our neural network. First, we need to code a method called correlation in the class DataSet. Note that operations such as mean and standard deviation are implemented in our class ArrayOperations:
public double correlation(int colx, int coly) {
    // Extract the two columns to be compared
    double[] arrx = ArrayOperations.getColumn(data, colx);
    double[] arry = ArrayOperations.getColumn(data, coly);
    // E[X.Y]: mean of the element-wise product
    double[] arrxy = ArrayOperations.elementWiseProduct(arrx, arry);
    double meanxy = ArrayOperations.mean(arrxy);
    double meanx = ArrayOperations.mean(arrx);
    double meany = ArrayOperations.mean(arry);
    double stdx = ArrayOperations.stdev(arrx);
    double stdy = ArrayOperations.stdev(arry);
    // Pearson coefficient: covariance divided by the product of the standard deviations
    return (meanxy - meanx * meany) / (stdx * stdy);
}
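As a rough illustration of how correlation might guide input selection, the following self-contained sketch ranks every candidate column of a data matrix by the absolute value of its Pearson coefficient against a target column. The class InputRankingSketch, the method rankInputs, and the sample matrix are hypothetical and do not belong to the book's DataSet class; the sketch assumes population statistics and a rectangular, row-major double[][] matrix.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: names and sample data are assumptions, not the book's classes.
class InputRankingSketch {

    // Copies one column out of a row-major matrix.
    static double[] column(double[][] data, int col) {
        double[] out = new double[data.length];
        for (int r = 0; r < data.length; r++) {
            out[r] = data[r][col];
        }
        return out;
    }

    // Pearson coefficient from running sums, using population statistics.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxy += x[i] * y[i];
            sxx += x[i] * x[i];
            syy += y[i] * y[i];
        }
        double mx = sx / n;
        double my = sy / n;
        double cov = sxy / n - mx * my;
        double stdx = Math.sqrt(sxx / n - mx * mx);
        double stdy = Math.sqrt(syy / n - my * my);
        return cov / (stdx * stdy);
    }

    // Returns the indices of all non-target columns, ordered from the
    // strongest to the weakest absolute correlation with the target column.
    static List<Integer> rankInputs(double[][] data, int targetCol) {
        double[] target = column(data, targetCol);
        int cols = data[0].length;
        double[] score = new double[cols];
        List<Integer> candidates = new ArrayList<>();
        for (int c = 0; c < cols; c++) {
            if (c == targetCol) {
                continue;
            }
            score[c] = Math.abs(pearson(column(data, c), target));
            candidates.add(c);
        }
        candidates.sort(Comparator.comparingDouble((Integer c) -> -score[c]));
        return candidates;
    }

    public static void main(String[] args) {
        // Made-up matrix: columns 0 and 1 are candidate inputs, column 2 is the target.
        double[][] data = {
            {1, 5, 2},
            {2, 1, 4},
            {3, 4, 6},
            {4, 2, 8},
            {5, 3, 10}
        };
        System.out.println(rankInputs(data, 2));
    }
}
```

The absolute value is used because a strongly negative correlation is as informative for the network as a strongly positive one. A ranking like this is only a first filter, since the Pearson coefficient measures linear association and can miss nonlinear dependencies.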
We will not delve too deeply into statistics in this book, so we recommend a number of references if the reader is interested in more details on this topic.