Input selection

One of the key tasks in designing a neural network application is selecting appropriate inputs. In the unsupervised case, one wishes to use only the relevant variables in which the neural network will find patterns. In the supervised case, the network must map inputs to outputs, so one needs to choose only the input variables that have some influence on the output.

Data correlation

One strategy that helps in selecting good inputs in the supervised case is the correlation between data series. The correlation between two data series is a measure of how strongly one sequence varies with, or influences, the other. Suppose that we have a dataset containing a number of data series, from which we choose one to be the output. We then need to select the inputs from the remaining variables.

We evaluate the influence of one variable at a time on the output in order to decide whether or not to include it as an input. The Pearson correlation coefficient is one of the most widely used measures:

$$ r_{xy} = \frac{S_{x(k)y(k)}}{S_{x(k)}\, S_{y(k)}} $$

Where $S_{x(k)y(k)}$ denotes the covariance between the x and y variables, and $S_{x(k)}$ and $S_{y(k)}$ denote their standard deviations:

$$ S_{x(k)y(k)} = E\left[\left(x(k) - E[x]\right)\left(y(k) - E[y]\right)\right] $$

The correlation takes values from -1 to 1, where values close to +1 indicate a positive correlation, values near -1 indicate a negative correlation, and values near 0 indicate no correlation at all.
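As a minimal sketch of this computation, the following Python snippet (using NumPy; the series values are made up for illustration) evaluates the covariance and the Pearson coefficient exactly as defined above:

import numpy as np

# Two illustrative data series
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Covariance between x and y
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson coefficient: covariance divided by the product of the standard deviations
r_xy = cov_xy / (x.std() * y.std())

print(r_xy)                      # close to +1 for these strongly related series
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in computation gives the same value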

To exemplify, let's see the following three charts of the two variables X and Y:

[Chart: X and Y plotted together in three cases, showing negative correlation (-0.8), positive correlation (+0.7), and no correlation (-0.1)]

In the first chart, on the left, one can see that as one variable decreases, the other increases (corr.: -0.8). The middle chart shows the two variables varying in the same direction, hence a positive correlation (corr.: +0.7). The third chart, on the right, shows a case where there is no apparent correlation between the variables (corr.: -0.1).

There is no fixed rule as to which correlation value should be taken as a threshold; it depends on the application. While absolute correlation values greater than 0.5 may be suitable for one application, in others, variables with correlations near 0.2 may still add a significant contribution.
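As a rough sketch of this selection step, the snippet below ranks candidate inputs by their absolute correlation with the chosen output; the variable names, data, and the 0.3 threshold are arbitrary assumptions for illustration only:

import numpy as np

np.random.seed(0)

# Illustrative dataset: each entry is a candidate input series (hypothetical names)
candidates = {
    "temperature": np.random.randn(200),
    "humidity":    np.random.randn(200),
    "wind_speed":  np.random.randn(200),
}
# The chosen output depends mostly on one of the candidates
output = 0.8 * candidates["temperature"] + 0.2 * np.random.randn(200)

threshold = 0.3   # application-dependent; there is no universal rule
selected = []
for name, series in candidates.items():
    r = np.corrcoef(series, output)[0, 1]
    print(f"{name}: r = {r:+.2f}")
    if abs(r) > threshold:
        selected.append(name)

print("selected inputs:", selected)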

Dimensionality reduction

Another interesting point regards the removal of redundant data. This is sometimes desired when there is a lot of available data, in both unsupervised and supervised learning. To exemplify, let's look at the following chart of two variables:

[Chart: the X and Y variables plotted together, following almost the same shape]

It can be seen that both the X and Y variables follow the same shape, which can be interpreted as redundancy: because of the high positive correlation, the two variables carry almost the same information. In such cases, one can apply a technique called Principal Component Analysis (PCA), which is a good approach for dealing with them.

The result of PCA is a new variable summarizing the previous two (or more) variables. Basically, the mean is subtracted from the original data series, and the result is then multiplied by the transposed eigenvectors of the covariance matrix:

$$ S = \begin{bmatrix} S_{XX} & S_{XY} \\ S_{XY} & S_{YY} \end{bmatrix} $$

Where $S_{XY}$ denotes the covariance between the variables X and Y, and the diagonal entries $S_{XX}$ and $S_{YY}$ are their variances.

The derived new data will then be as follows (E denotes the matrix whose columns are the eigenvectors of S):

$$ \begin{bmatrix} z_{1}(k) \\ z_{2}(k) \end{bmatrix} = E^{T} \begin{bmatrix} x(k) - \bar{x} \\ y(k) - \bar{y} \end{bmatrix} $$

Let's now see how the new variable looks in a chart, compared to the original variables:

[Chart: the new PCA-derived variable plotted alongside the original X and Y variables]
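A minimal sketch of this PCA procedure in Python with NumPy follows; the two correlated series are synthetic and serve only to illustrate the mean subtraction and the projection onto the eigenvectors:

import numpy as np

np.random.seed(0)

# Two highly correlated, illustrative data series
x = np.linspace(0, 10, 100) + 0.3 * np.random.randn(100)
y = 2.0 * x + 1.0 + 0.3 * np.random.randn(100)

data = np.column_stack([x, y])         # shape (N, 2)
centered = data - data.mean(axis=0)    # subtract the mean of each series

# Covariance matrix S and its eigenvectors E (columns of E)
S = np.cov(centered, rowvar=False)
eigenvalues, E = np.linalg.eigh(S)     # eigh: S is symmetric

# Multiply the centered data by the transposed eigenvectors
z = centered @ E                       # equivalent to (E.T @ centered.T).T

# The component with the largest eigenvalue summarizes the two original variables
new_variable = z[:, np.argmax(eigenvalues)]
print(new_variable[:5])

In practice, a library routine such as sklearn.decomposition.PCA performs the same centering and projection.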

Data filtering

Noisy and bad data are also a source of problems in neural network applications, which is why data may need to be filtered. One common filtering technique is to exclude records that fall outside the usual range of a variable. For example, if temperature values are expected to lie between -40 and 40, a value such as 50 would be considered an outlier and could be taken out.
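A quick sketch of such a range filter in Python with NumPy; the temperature bounds follow the example above, and the data values are made up:

import numpy as np

temperature = np.array([12.5, 25.0, 50.0, -10.0, 38.5, -45.0])

# Keep only the records inside the expected range [-40, 40]
mask = (temperature >= -40) & (temperature <= 40)
filtered = temperature[mask]

print(filtered)   # 50.0 and -45.0 are dropped as outliers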

The three-sigma rule is a simple and effective filtering criterion. It consists of discarding the values that lie more than three standard deviations away from the mean:

$$ \left| x(k) - E[x] \right| > 3\,\sigma_{x} $$
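A minimal sketch of the three-sigma filter in Python with NumPy; the series values are arbitrary and only meant to show one extreme record being removed:

import numpy as np

x = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 10.3, 9.7, 10.0,
              9.6, 10.4, 9.9, 10.2, 10.1, 9.8, 100.0])

mean = x.mean()
sigma = x.std()

# Keep only the values within three standard deviations of the mean
mask = np.abs(x - mean) <= 3 * sigma
filtered = x[mask]

print(filtered)   # the extreme value 100.0 is filtered out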