Choosing input and output variables

The data must be selected carefully so that it captures most of the system's dynamics. We want the neural network to forecast future weather from current and past weather data, but which variables should we choose? An expert opinion on the subject can be very helpful in understanding the relationships between variables.

Tip

For time series variables, one can derive new variables from historical data. That is, for a given date, one may consider that date's values together with the data collected (and/or summarized) from past dates, thereby extending the number of variables.
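The idea of lagged variables can be sketched in a few lines of Java. The class LagFeatures and the method withLags are hypothetical names introduced only for this illustration, and the temperature values are made up; this is a minimal sketch, not the book's implementation.

```java
import java.util.Arrays;

class LagFeatures {

    /**
     * Builds a matrix in which each row holds the current value of the
     * series followed by the values of the previous `lags` time steps.
     * Row i corresponds to time step (i + lags) of the original series,
     * because the first `lags` steps do not have enough history.
     */
    static double[][] withLags(double[] series, int lags) {
        if (lags < 0 || lags >= series.length) {
            throw new IllegalArgumentException("lags must be in [0, series.length)");
        }
        int rows = series.length - lags;
        double[][] out = new double[rows][lags + 1];
        for (int i = 0; i < rows; i++) {
            int t = i + lags;                 // index in the original series
            for (int k = 0; k <= lags; k++) {
                out[i][k] = series[t - k];    // k = 0 is today, k = 1 is [-1], ...
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] maxTemp = {21.0, 23.5, 22.0, 25.0, 24.5}; // made-up values
        for (double[] row : withLags(maxTemp, 2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Each additional lag adds one column and removes one usable row, so the number of lags trades extra inputs against fewer training records.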

When defining a problem for a neural network, there are usually one or more predefined target variables: predict the temperature, forecast precipitation, measure insolation, and so on. In some cases, however, one wants to model all the variables and therefore needs to find the relationships between them. Correlation measures linear association and does not by itself establish causation, but it is a practical screening tool; the most widely used measure is the Pearson correlation coefficient:

\[
C_{x,y} = \frac{E[X \cdot Y] - E[X]\,E[Y]}{\sigma_X \, \sigma_Y}
\]

Here, E[X·Y] is the mean of the product of variables X and Y; E[X] and E[Y] are the means of X and Y, respectively; σX and σY are the standard deviations of X and Y, respectively; and Cx,y is the Pearson coefficient of X with respect to Y, whose values lie between -1 and 1. This coefficient shows how strongly a variable X is linearly correlated with a variable Y. Values near 0 denote weak or no correlation, while values close to -1 or 1 denote strong negative or positive correlation, respectively. Graphically, this can be seen in a scatter plot, as shown below:

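[Figure: scatter plots of maximum temperature against the previous day's precipitation (left), insolation (middle), and the previous day's maximum temperature (right)]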

In the chart on the left, the correlation between the previous day's precipitation (indicated as [-1]) and the maximum temperature is -0.202, a weak negative correlation. In the middle chart, the correlation between insolation and maximum temperature is 0.376, a moderate correlation but not a strong one; one can see a slight positive trend. An example of strong positive correlation is shown in the chart on the right, between the previous day's maximum temperature and the maximum temperature of the day. This correlation is 0.793, and the narrower cloud of dots makes the trend clearly visible.
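To make the coefficient concrete, here is a small worked calculation on made-up data (not taken from the weather series). It assumes the standard deviations are population values, that is, dividing by n, so that they are consistent with the means used elsewhere in the formula.

```latex
\[
\begin{aligned}
X &= (1, 2, 3), \qquad Y = (1, 3, 2) \\
E[X] &= 2, \qquad E[Y] = 2, \qquad E[X \cdot Y] = \frac{1 + 6 + 6}{3} = \frac{13}{3} \\
\sigma_X &= \sqrt{\frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3}} = \sqrt{\tfrac{2}{3}},
\qquad \sigma_Y = \sqrt{\tfrac{2}{3}} \\
C_{x,y} &= \frac{E[X \cdot Y] - E[X]\,E[Y]}{\sigma_X \, \sigma_Y}
         = \frac{\tfrac{13}{3} - 4}{\tfrac{2}{3}} = 0.5
\end{aligned}
\]
```

A value of 0.5 reflects that Y rises with X overall but not monotonically.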

We are going to use correlation to choose the most appropriate inputs for our neural network. First, we need to code a method in the class DataSet, called correlation. Please note that operations such as mean and standard deviation are implemented in our class ArrayOperations:

// Pearson correlation between two columns of the data set
public double correlation(int colx, int coly) {
  double[] arrx = ArrayOperations.getColumn(data, colx);
  double[] arry = ArrayOperations.getColumn(data, coly);
  // E[X.Y]: mean of the element-wise product
  double[] arrxy = ArrayOperations.elementWiseProduct(arrx, arry);
  double meanxy = ArrayOperations.mean(arrxy);
  double meanx = ArrayOperations.mean(arrx);
  double meany = ArrayOperations.mean(arry);
  double stdx = ArrayOperations.stdev(arrx);
  double stdy = ArrayOperations.stdev(arry);
  // covariance E[X.Y] - E[X]E[Y], divided by the product of the standard deviations
  return (meanxy - meanx * meany) / (stdx * stdy);
}
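As a rough illustration of how such a method could be used to choose inputs, the following self-contained sketch ranks candidate columns by the absolute value of their correlation with a target column. The class InputRanking and its helper methods are hypothetical stand-ins written for this example; they are not the book's DataSet or ArrayOperations classes, and the data is made up. The sketch assumes a population standard deviation (dividing by n) and does not guard against constant columns, for which the division yields NaN.

```java
import java.util.Arrays;
import java.util.Comparator;

class InputRanking {

    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) sum += x;
        return sum / v.length;
    }

    /** Population standard deviation (divides by n), consistent with E[.] in the formula. */
    static double stdev(double[] v) {
        double m = mean(v);
        double sq = 0.0;
        for (double x : v) sq += (x - m) * (x - m);
        return Math.sqrt(sq / v.length);
    }

    static double[] column(double[][] data, int col) {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) out[i] = data[i][col];
        return out;
    }

    /** Pearson coefficient: (E[XY] - E[X]E[Y]) / (stdev(X) * stdev(Y)). */
    static double correlation(double[][] data, int colx, int coly) {
        double[] x = column(data, colx);
        double[] y = column(data, coly);
        double[] xy = new double[x.length];
        for (int i = 0; i < x.length; i++) xy[i] = x[i] * y[i];
        return (mean(xy) - mean(x) * mean(y)) / (stdev(x) * stdev(y));
    }

    /** Returns the candidate column indices sorted by decreasing |correlation| with the target. */
    static Integer[] rankInputs(double[][] data, int[] candidates, int target) {
        Integer[] order = new Integer[candidates.length];
        for (int i = 0; i < candidates.length; i++) order[i] = candidates[i];
        Arrays.sort(order, Comparator.comparingDouble(
                (Integer c) -> -Math.abs(correlation(data, c, target))));
        return order;
    }

    public static void main(String[] args) {
        // Made-up rows: {candidate A, candidate B, target}
        double[][] data = {
            {1.0, 5.0, 1.0},
            {2.0, 1.0, 3.0},
            {3.0, 4.0, 2.0},
            {4.0, 2.0, 5.0},
            {5.0, 3.0, 4.0}
        };
        for (int c : rankInputs(data, new int[] {0, 1}, 2)) {
            System.out.printf("column %d: r = %.3f%n", c, correlation(data, c, 2));
        }
    }
}
```

Ranking by absolute value treats strong negative correlations as being as informative as strong positive ones, which is usually what is wanted when selecting inputs.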

We will not delve too deeply into statistics in this book, so we recommend a number of references if the reader is interested in more details on this topic.
