Raw data collected from a data source usually has heterogeneous characteristics, such as different ranges, sampling rates, and categories. Some variables result from direct measurements, while others are summarized or even calculated. Preprocessing means adapting these variable values to a range that neural networks can handle properly.
Regarding weather variables, let's take a look at their range, sampling, and type:
| Variable | Unit | Range | Sampling | Type |
|---|---|---|---|---|
| Mean temperature | ºC | 10.86 – 29.25 | Hourly | Average of hourly measurements |
| Precipitation | mm | 0 – 161.20 | Daily | Accumulation of daily rain |
| Insolation | hours | 0 – 10.40 | Daily | Count of hours receiving sun radiation |
| Mean humidity | % | 45.00 – 96.00 | Hourly | Average of hourly measurements |
| Mean wind speed | km/h | 0.00 – 3.27 | Hourly | Average of hourly measurements |
Except for insolation and precipitation, the variables are all measured directly and share the same sampling. If we wanted, for example, to work with an hourly dataset, we would have to preprocess all the variables to the same sample rate. Three of the variables are summarized as daily averages; we could use the hourly measurements instead, but the value ranges would then certainly be wider.
Normalization is the process of bringing all variables into the same data range, usually with smaller values, between 0 and 1 or between -1 and 1. This helps the neural network keep values within the operating zone of activation functions such as the sigmoid or hyperbolic tangent:
Values that are too high or too low drive neurons into the saturated regions of these activation functions, where the derivative is very small, near zero, so learning stalls for those neurons. In this book, we implemented two modes of normalization: min-max and z-score.
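To make the saturation effect concrete, here is a minimal standalone sketch (a hypothetical helper, not part of the book's code base) that evaluates the sigmoid derivative at a centered input and at a raw, unnormalized one:

```java
// Toy sketch (not the book's code): shows how the sigmoid derivative
// collapses toward zero for inputs of large magnitude.
public class SigmoidSaturation {
    public static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }
    // the sigmoid derivative expressed through its own output: s * (1 - s)
    public static double sigmoidDerivative(double x) {
        double s = sigmoid(x);
        return s * (1.0 - s);
    }
    public static void main(String[] args) {
        // at 0 the derivative is at its maximum value, 0.25
        System.out.println(sigmoidDerivative(0.0));   // 0.25
        // an unnormalized input, e.g. a raw precipitation of 150 mm,
        // saturates the neuron: the derivative is effectively zero
        System.out.println(sigmoidDerivative(150.0));
    }
}
```

With the derivative that close to zero, backpropagation's weight updates for the neuron become negligible, which is exactly what normalization prevents.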
The min-max normalization maps values from a predefined range of the dataset onto the normalized range:

Xnorm = Nmin + ((X - Xmin) / (Xmax - Xmin)) * (Nmax - Nmin)

Here, Nmin and Nmax are the normalized minimum and maximum limits respectively, Xmin and Xmax are the variable X's minimum and maximum limits respectively, X is the original value, and Xnorm is the normalized value. If we want the normalization to be between 0 and 1, for example, the equation simplifies to the following:

Xnorm = (X - Xmin) / (Xmax - Xmin)
By applying the normalization, a new normalized dataset is produced and fed to the neural network. One should also take into account that a neural network fed with normalized values will be trained to produce normalized values at the output, so the inverse (denormalization) process becomes necessary as well:

X = Xmin + ((Xnorm - Nmin) / (Nmax - Nmin)) * (Xmax - Xmin)

or, for normalization between 0 and 1:

X = Xmin + Xnorm * (Xmax - Xmin)
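The two mappings can be sketched as a pair of standalone helper methods (a simplified illustration with hypothetical names, not the book's `DataNormalization` implementation):

```java
// Toy sketch of min-max normalization and denormalization, independent of
// the book's DataNormalization class (names here are illustrative only).
public class MinMaxSketch {
    // maps x from [xMin, xMax] onto [nMin, nMax]
    public static double normalize(double x, double xMin, double xMax,
                                   double nMin, double nMax) {
        return nMin + ((x - xMin) / (xMax - xMin)) * (nMax - nMin);
    }
    // inverse mapping: recovers the original value from the normalized one
    public static double denormalize(double xNorm, double xMin, double xMax,
                                     double nMin, double nMax) {
        return xMin + ((xNorm - nMin) / (nMax - nMin)) * (xMax - xMin);
    }
    public static void main(String[] args) {
        // mean temperature range 10.86–29.25 ºC mapped into [-1, 1]
        double n = normalize(25.0, 10.86, 29.25, -1.0, 1.0);
        System.out.println(n);
        // the round trip recovers the original value
        System.out.println(denormalize(n, 10.86, 29.25, -1.0, 1.0));
    }
}
```

Applying `normalize` column by column, with each column's own minimum and maximum, gives exactly the per-column behavior implemented later in `DataNormalization`.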
Another mode of normalization is the z-score, which takes into account the mean and standard deviation:

Xnorm = S * (X - E[X]) / sX

Here, S is a scaling constant, E[X] is the mean of X, and sX is the standard deviation of X. The main difference in this normalization mode is that no limits are defined for the range of the variables; however, all variables will end up on the same range, centered on zero, with a standard deviation equal to the scaling constant S.
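As a quick illustration (a standalone sketch with hypothetical names, assuming the sample standard deviation with n-1 in the denominator), z-score normalization of one column could look like this:

```java
// Toy sketch of z-score normalization for a single column (illustrative
// names; assumes the sample standard deviation, n-1 in the denominator).
public class ZScoreSketch {
    public static double mean(double[] x) {
        double sum = 0;
        for (double v : x) sum += v;
        return sum / x.length;
    }
    public static double stdev(double[] x) {
        double m = mean(x), sum = 0;
        for (double v : x) sum += (v - m) * (v - m);
        return Math.sqrt(sum / (x.length - 1));
    }
    // Xnorm = S * (X - E[X]) / sX, applied element-wise
    public static double[] zscore(double[] x, double scale) {
        double m = mean(x), sd = stdev(x);
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++)
            out[i] = scale * (x[i] - m) / sd;
        return out;
    }
    public static void main(String[] args) {
        double[] temps = {22.1, 25.3, 27.8, 24.0, 26.4};
        // the result is centered on zero with stdev equal to the scale
        for (double v : zscore(temps, 1.0)) System.out.println(v);
    }
}
```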
The figure below shows what both normalization modes do with the data:
A class called `DataNormalization` is implemented to handle the normalization of data. Since normalization considers the statistical properties of the data, we need to store this statistical information in a `DataNormalization` object:
```java
public class DataNormalization {
    // ENUM normalization types
    public enum NormalizationTypes { MIN_MAX, ZSCORE }
    // normalization type
    public NormalizationTypes TYPE;
    // statistical properties of the data
    private double[] minValues;
    private double[] maxValues;
    private double[] meanValues;
    private double[] stdValues;
    // normalization properties
    private double scaleNorm = 1.0;
    private double minNorm = -1.0;
    //…
    // constructor for min-max norm
    public DataNormalization(double[][] data, double _minNorm, double _maxNorm) {
        this.TYPE = NormalizationTypes.MIN_MAX;
        this.minNorm = _minNorm;
        this.scaleNorm = _maxNorm - _minNorm;
        calculateReference(data);
    }
    // constructor for z-score norm
    public DataNormalization(double[][] data, double _zscale) {
        this.TYPE = NormalizationTypes.ZSCORE;
        this.scaleNorm = _zscale;
        calculateReference(data);
    }
    // calculation of statistical properties
    private void calculateReference(double[][] data) {
        minValues = ArrayOperations.min(data);
        maxValues = ArrayOperations.max(data);
        meanValues = ArrayOperations.mean(data);
        stdValues = ArrayOperations.stdev(data);
    }
    //…
}
```
The normalization procedure is performed in a method called `normalize`, which has a denormalization counterpart called `denormalize`:
```java
public double[][] normalize(double[][] data) {
    int rows = data.length;
    int cols = data[0].length;
    //…
    double[][] normalizedData = new double[rows][cols];
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            switch (TYPE) {
                case MIN_MAX:
                    normalizedData[i][j] = (minNorm)
                        + ((data[i][j] - minValues[j])
                            / (maxValues[j] - minValues[j]))
                        * (scaleNorm);
                    break;
                case ZSCORE:
                    normalizedData[i][j] = scaleNorm
                        * (data[i][j] - meanValues[j]) / stdValues[j];
                    break;
            }
        }
    }
    return normalizedData;
}
```
The already implemented `NeuralDataSet`, `NeuralInputData`, and `NeuralOutputData` will now have `DataNormalization` objects to handle normalization operations. In the `NeuralDataSet` class, we've added objects for input and output data normalization:
```java
public DataNormalization inputNorm;
public DataNormalization outputNorm;
// z-score normalization
public void setNormalization(double _scaleNorm) {
    inputNorm = new DataNormalization(_scaleNorm);
    inputData.setNormalization(inputNorm);
    outputNorm = new DataNormalization(_scaleNorm);
    outputData.setNormalization(outputNorm);
}
// min-max normalization
public void setNormalization(double _minNorm, double _maxNorm) {
    inputNorm = new DataNormalization(_minNorm, _maxNorm);
    inputData.setNormalization(inputNorm);
    outputNorm = new DataNormalization(_minNorm, _maxNorm);
    outputData.setNormalization(outputNorm);
}
```
`NeuralInputData` and `NeuralOutputData` will now have `normdata` properties to store the normalized data. The methods that retrieve data from these classes take a Boolean parameter, `isNorm`, to indicate whether the value to be retrieved should be normalized or not.
Considering that `NeuralInputData` provides the neural network with input data, this class only performs normalization before feeding data into the neural network. The method `setNormalization` is implemented in this class to that end:
```java
public ArrayList<ArrayList<Double>> normdata;
public DataNormalization norm;
public void setNormalization(DataNormalization dn) {
    // getting the original data into a Java matrix
    double[][] origData = ArrayOperations.arrayListToDoubleMatrix(data);
    // perform normalization
    double[][] normData = dn.normalize(origData);
    normdata = new ArrayList<>();
    // store the normalized values into the ArrayList normdata
    for (int i = 0; i < normData.length; i++) {
        normdata.add(new ArrayList<Double>());
        for (int j = 0; j < normData[0].length; j++) {
            normdata.get(i).add(normData[i][j]);
        }
    }
}
```
In `NeuralOutputData`, there are two datasets, one for the target and one for the neural network output. The target dataset is normalized to provide the training algorithm with normalized values. The neural output dataset, however, holds the output produced by the neural network, which comes out in normalized form; we therefore need to perform denormalization after setting the neural network output dataset:
```java
public ArrayList<ArrayList<Double>> normTargetData;
public ArrayList<ArrayList<Double>> normNeuralData;
public void setNeuralData(double[][] _data, boolean isNorm) {
    if (isNorm) { // if it is normalized
        this.normNeuralData = new ArrayList<>();
        for (int i = 0; i < numberOfRecords; i++) {
            this.normNeuralData.add(new ArrayList<Double>());
            //… save in the normNeuralData
            for (int j = 0; j < numberOfOutputs; j++) {
                this.normNeuralData.get(i).add(_data[i][j]);
            }
        }
        double[][] deNorm = norm.denormalize(_data);
        for (int i = 0; i < numberOfRecords; i++)
            for (int j = 0; j < numberOfOutputs; j++)
                // then in neuralData
                this.neuralData.get(i).set(j, deNorm[i][j]);
    } else {
        setNeuralData(_data);
    }
}
```
Finally, the `LearningAlgorithm` class needs to include the normalization property:

```java
protected boolean normalization = false;
```
Now, during training, every call to the `NeuralDataSet` methods that retrieve or write data should pass the normalization property in the parameter `isNorm`, as in the `forward` method of the `Backpropagation` class:

```java
@Override
public void forward() {
    for (int i = 0; i < trainingDataSet.numberOfRecords; i++) {
        neuralNet.setInputs(trainingDataSet.getInputRecord(i, normalization));
        neuralNet.calc();
        trainingDataSet.setNeuralOutput(i, neuralNet.getOutputs(),
            normalization);
        //…
    }
}
```
In Java, we are going to use the package `edu.packt.neuralnet.chart` to plot some charts and visualize data. We also downloaded historical meteorological data from INMET, the Brazilian Institute of Meteorology. We collected data from several cities so that a variety of climates would be included in our weather forecasting case.
In this example, we wanted to collect data from a variety of places in order to test the neural network's capacity to forecast it. Since we downloaded it from the INMET website, which covers only Brazilian territory, only Brazilian cities are included; however, Brazil is a vast territory with a great variety of climates. Below is a list of the places we collected data from:
| # | City Name | Latitude | Longitude | Altitude | Climate Type |
|---|---|---|---|---|---|
| 1 | Cruzeiro do Sul | 7º37'S | 72º40'W | 170 m | Tropical Rainforest |
| 2 | Picos | 7º04'S | 41º28'W | 208 m | Semi-arid |
| 3 | Campos do Jordão | 22º45'S | 45º36'W | 1642 m | Subtropical Highland |
| 4 | Porto Alegre | 30º01'S | 51º13'W | 48 m | Subtropical Humid |
The location of these four cities is indicated on the map below:
The weather data collected is from January 2010 until November 2016 and is saved in the data folder with the name corresponding to the city.
The data collected from the INMET website includes these variables:
For each city, we are going to build a neural network to forecast the weather based on the past. But first, we need to point out two important facts:
To overcome the first issue, we may derive a new column from the date to indicate the solar noon angle, that is, the angle at which the sun's rays reach the surface when the sun is at its highest point in the sky (noon). The greater this angle, the more intense and warming the solar radiation; when this angle is small, the surface receives only a small fraction of the solar radiation:
The solar noon angle is calculated by the following formula, implemented in Java in the class `WeatherExample`, which will be used throughout this chapter:

```java
public double calcSolarNoonAngle(double date, double latitude) {
    return 90 - Math.abs(-23.44
        * Math.cos((2 * Math.PI / 365.25) * (date + 8.5)) - latitude);
}
public void addSolarNoonAngle(TimeSeries ts, double latitude) { // add column
    double[] sna = new double[ts.numberOfRecords];
    for (int i = 0; i < ts.numberOfRecords; i++)
        sna[i] = calcSolarNoonAngle(
            ts.data.get(i).get(ts.getIndexColumn()), latitude);
    ts.addColumn(sna, "NoonAngle");
}
```
In the class `WeatherExample`, let's place a method called `makeDelays`, which will later be called from the main method. The delays are made on a given `TimeSeries`, up to a given number, for all columns of the time series except the index column:

```java
public void makeDelays(TimeSeries ts, int maxdelays) {
    for (int i = 0; i < ts.numberOfColumns; i++)
        if (i != ts.getIndexColumn())
            for (int j = 1; j <= maxdelays; j++)
                ts.shift(i, -j);
}
```
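To illustrate what shifting a column by a delay produces, here is a toy sketch (not the book's `TimeSeries.shift` implementation; its exact semantics are assumed here): row i of a delayed column holds the value observed a fixed number of days earlier, and the first rows, which have no past value, become NaN:

```java
import java.util.Arrays;

// Toy sketch of a delayed (shifted) column. The leading rows have no
// earlier observation and are filled with NaN; in the book's workflow
// these rows are removed afterwards by dropNaN.
public class DelaySketch {
    public static double[] delay(double[] column, int delay) {
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++)
            out[i] = (i - delay >= 0) ? column[i - delay] : Double.NaN;
        return out;
    }
    public static void main(String[] args) {
        double[] maxTemp = {30.1, 31.4, 29.8, 28.5};
        // corresponds to a column such as MaxTemp__1
        System.out.println(Arrays.toString(delay(maxTemp, 1)));
    }
}
```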
In the `WeatherExample` class, we are going to add four `TimeSeries` properties and four `NeuralNet` properties, one for each case:

```java
public class WeatherExample {
    TimeSeries cruzeirodosul;
    TimeSeries picos;
    TimeSeries camposdojordao;
    TimeSeries portoalegre;
    NeuralNet nncruzeirosul;
    NeuralNet nnpicos;
    NeuralNet nncamposjordao;
    NeuralNet nnportoalegre;
    //…
}
```
In the `main` method, we load data into each of them and delay the columns by up to three days:

```java
public static void main(String[] args) {
    WeatherExample we = new WeatherExample();
    // load weather data
    we.cruzeirodosul = new TimeSeries(LoadCsv.getDataSet("data",
        "cruzeirodosul2010daily.txt", true, ";"));
    we.cruzeirodosul.setIndexColumn(0);
    we.makeDelays(we.cruzeirodosul, 3);
    we.picos = new TimeSeries(LoadCsv.getDataSet("data",
        "picos2010daily.txt", true, ";"));
    we.picos.setIndexColumn(0);
    we.makeDelays(we.picos, 3);
    we.camposdojordao = new TimeSeries(LoadCsv.getDataSet("data",
        "camposdojordao2010daily.txt", true, ";"));
    we.camposdojordao.setIndexColumn(0);
    we.makeDelays(we.camposdojordao, 3);
    we.portoalegre = new TimeSeries(LoadCsv.getDataSet("data",
        "portoalegre2010daily.txt", true, ";"));
    we.portoalegre.setIndexColumn(0);
    we.makeDelays(we.portoalegre, 3);
    //…
```
After loading, we need to remove the NaNs, so we call the method `dropNaN` on each time series object:

```java
//…
we.cruzeirodosul.dropNaN();
we.camposdojordao.dropNaN();
we.picos.dropNaN();
we.portoalegre.dropNaN();
//…
```
To save time and effort for future executions, let's save these datasets:
```java
we.cruzeirodosul.save("data", "cruzeirodosul2010daily_delays_clean.txt", ";");
//…
we.portoalegre.save("data", "portoalegre2010daily_delays_clean.txt", ";");
```
Now, for all time series, each column has three delays, and we want the neural network to forecast the maximum and minimum temperature of the next day. We can forecast the future only from the present and the past, so for inputs we must rely on the delayed data (from 1 to 3 days before), and for outputs we may use the current temperature values. Each column in the time series dataset is indicated by an index, where zero is the index of the date. Since some of the datasets had missing data in certain columns, the index of a given column may vary between datasets; however, the indexes of the output variables are the same across all datasets (indexes 2 and 3).
We are interested in finding patterns between the delayed data and the current maximum and minimum temperature, so we perform a cross-correlation analysis combining all output and potential input variables, and select the variables whose absolute correlation reaches at least a minimum threshold. To that end, we write a method `correlationAnalysis` taking the minimum absolute correlation as an argument. To save space, we have trimmed the code here:
```java
public void correlationAnalysis(double minAbsCorr) {
    // indexes of output variables (max. and min. temperature)
    int[][] outputs = {
        {2, 3},  // cruzeiro do sul
        {2, 3},  // picos
        {2, 3},  // campos do jordao
        {2, 3}}; // porto alegre
    int[][] potentialInputs = { // indexes of input variables (delayed)
        {10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,
         30,31,32,33,34,38,39,40}, // cruzeiro do sul
        //… and all others
    };
    ArrayList<ArrayList<ArrayList<Double>>> chosenInputs = new ArrayList<>();
    TimeSeries[] tscollect = {this.cruzeirodosul, this.picos,
        this.camposdojordao, this.portoalegre};
    double[][][] correlation = new double[4][][];
    for (int i = 0; i < 4; i++) {
        chosenInputs.add(new ArrayList<ArrayList<Double>>());
        correlation[i] =
            new double[outputs[i].length][potentialInputs[i].length];
        for (int j = 0; j < outputs[i].length; j++) {
            chosenInputs.get(i).add(new ArrayList<Double>());
            for (int k = 0; k < potentialInputs[i].length; k++) {
                correlation[i][j][k] = tscollect[i].correlation(
                    outputs[i][j], potentialInputs[i][k]);
                // if the absolute correlation is above the threshold
                if (Math.abs(correlation[i][j][k]) > minAbsCorr) {
                    // it is added to the chosen inputs
                    chosenInputs.get(i).get(j).add(correlation[i][j][k]);
                    // and we see the plot
                    tscollect[i].getScatterChart(
                        "Correlation " + String.valueOf(correlation[i][j][k]),
                        outputs[i][j], potentialInputs[i][k],
                        Color.BLACK).setVisible(true);
                }
            }
        }
    }
}
```
By running this analysis, we obtain the following result for Cruzeiro do Sul (the variables whose absolute correlation exceeds the threshold are chosen as neural network inputs):
Correlation Analysis for data from Cruzeiro do Sul

Correlations with the output variable MaxTemp:

```
NoonAngle: 0.0312808
Precipitation__1: -0.115547
Precipitation__2: -0.038969
Precipitation__3: -0.062173
MaxTemp__1: 0.497057
MaxTemp__2: 0.252831
MaxTemp__3: 0.159098
MinTemp__1: -0.033339
MinTemp__2: -0.123063
MinTemp__3: -0.125282
Insolation__1: 0.395741
Insolation__2: 0.197949
Insolation__3: 0.134345
Evaporation__1: 0.21548
Evaporation__2: 0.161384
Evaporation__3: 0.199385
AvgTemp__1: 0.432280
AvgTemp__2: 0.152103
AvgTemp__3: 0.060368
AvgHumidity__1: -0.415812
AvgHumidity__2: -0.265189
AvgHumidity__3: -0.214624
WindSpeed__1: -0.166418
WindSpeed__2: -0.056825
WindSpeed__3: -0.001660
NoonAngle__1: 0.0284473
NoonAngle__2: 0.0256710
NoonAngle__3: 0.0227864
```

Correlations with the output variable MinTemp:

```
NoonAngle: 0.346545
Precipitation__1: 0.012696
Precipitation__2: 0.063303
Precipitation__3: 0.112842
MaxTemp__1: 0.311005
MaxTemp__2: 0.244364
MaxTemp__3: 0.123838
MinTemp__1: 0.757647
MinTemp__2: 0.567563
MinTemp__3: 0.429669
Insolation__1: -0.10192
Insolation__2: -0.101146
Insolation__3: -0.151896
Evaporation__1: -0.115236
Evaporation__2: -0.160718
Evaporation__3: -0.160536
AvgTemp__1: 0.633741
AvgTemp__2: 0.487609
AvgTemp__3: 0.312645
AvgHumidity__1: 0.151009
AvgHumidity__2: 0.155019
AvgHumidity__3: 0.177833
WindSpeed__1: -0.198555
WindSpeed__2: -0.227227
WindSpeed__3: -0.185377
NoonAngle__1: 0.353834
NoonAngle__2: 0.360943
NoonAngle__3: 0.367953
```
The scatter plots show how this data is related:
On the left, there is a fair correlation between the previous day's maximum temperature and the current one; in the center, a strong correlation between the previous day's minimum temperature and the current one; and on the right, a weak correlation between the `NoonAngle` of 3 days before and the current minimum temperature. By running this analysis for all the other cities, we determine the inputs for the other neural networks:
| Cruzeiro do Sul | Picos | Campos do Jordão | Porto Alegre |
|---|---|---|---|
| NoonAngle, MaxTemp__1, MinTemp__1, MinTemp__2, MinTemp__3, Insolation__1, AvgTemp__1, AvgTemp__2, AvgHumidity__1, NoonAngle__1, NoonAngle__2, NoonAngle__3 | MaxTemp, MaxTemp__1, MaxTemp__2, MaxTemp__3, MinTemp__1, MinTemp__2, MinTemp__3, Insolation__1, Insolation__2, Evaporation__1, Evaporation__2, Evaporation__3, AvgTemp__1, AvgTemp__2, AvgTemp__3, AvgHumidity__1, AvgHumidity__2, AvgHumidity__3 | NoonAngle, MaxTemp__1, MaxTemp__2, MaxTemp__3, MinTemp__1, MinTemp__2, MinTemp__3, Evaporation__1, AvgTemp__1, AvgTemp__2, AvgTemp__3, AvgHumidity__1, NoonAngle__1, NoonAngle__2, NoonAngle__3 | MaxTemp, NoonAngle, MaxTemp__1, MaxTemp__2, MaxTemp__3, MinTemp__1, MinTemp__2, MinTemp__3, Insolation__1, Insolation__2, Insolation__3, Evaporation__1, Evaporation__2, Evaporation__3, AvgTemp__1, AvgTemp__2, AvgTemp__3, AvgHumidity__1, AvgHumidity__2, NoonAngle__1, NoonAngle__2, NoonAngle__3 |
We are using four neural networks to forecast the minimum and maximum temperatures. Initially, each will have two hidden layers, with 20 and 10 neurons respectively, using hyperbolic tangent and sigmoid activation functions, and we will apply min-max normalization. The following method in the class `WeatherExample` creates the neural networks with this configuration:

```java
public void createNNs() {
    // fill a vector with the indexes of input and output columns
    int[] inputColumnsCS = {10, 14, 17, 18, 19, 20, 26, 27, 29, 38, 39, 40};
    int[] outputColumnsCS = {2, 3};
    // this static method shuffles and splits the dataset
    NeuralDataSet[] nnttCS = NeuralDataSet.randomSeparateTrainTest(
        this.cruzeirodosul, inputColumnsCS, outputColumnsCS, 0.7);
    // setting normalization
    DataNormalization.setNormalization(nnttCS, -1.0, 1.0);
    this.trainDataCS = nnttCS[0]; // 70% for training
    this.testDataCS = nnttCS[1];  // rest for testing
    // set up neural net parameters:
    this.nncruzeirosul = new NeuralNet(
        inputColumnsCS.length, outputColumnsCS.length,
        new int[]{20, 10},
        new IActivationFunction[]{new HyperTan(1.0), new Sigmoid(1.0)},
        new Linear(),
        new UniformInitialization(-1.0, 1.0));
    //…
}
```
In Chapter 2, Getting Neural Networks to Learn, we saw that a neural network should be tested to verify its learning, so we divide the dataset into training and testing subsets. Usually, about 50–80% of the original filtered dataset is used for training, and the remaining fraction for testing.
A static method `randomSeparateTrainTest` in the class `NeuralDataSet` separates the dataset into these two subsets. In order to ensure maximum generalization, the records of the dataset are shuffled into random positions, as shown in the following figure:

The records may originally be sequential, as in a weather time series; by shuffling them into random positions, the training and testing sets will both contain records from all periods.
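The shuffle-then-split idea can be sketched independently of the book's `NeuralDataSet` class (hypothetical names; the real `randomSeparateTrainTest` works on dataset objects rather than bare record indexes):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Toy sketch of shuffle-then-split: record indexes are shuffled so that
// both subsets contain records from all periods of a sequential series.
public class TrainTestSplitSketch {
    public static List<List<Integer>> split(int numRecords,
            double trainFraction, long seed) {
        List<Integer> indexes = new ArrayList<>();
        for (int i = 0; i < numRecords; i++) indexes.add(i);
        Collections.shuffle(indexes, new Random(seed)); // break temporal order
        int cut = (int) Math.round(numRecords * trainFraction);
        List<List<Integer>> sets = new ArrayList<>();
        sets.add(new ArrayList<>(indexes.subList(0, cut)));          // training
        sets.add(new ArrayList<>(indexes.subList(cut, numRecords))); // testing
        return sets;
    }
    public static void main(String[] args) {
        List<List<Integer>> sets = split(10, 0.7, 42L);
        System.out.println("train: " + sets.get(0));
        System.out.println("test:  " + sets.get(1));
    }
}
```

Note that shuffling a time series is only safe here because each record already carries its own delayed inputs; the temporal dependency is encoded inside the record itself.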
The neural network will be trained using the basic backpropagation algorithm. The following is a code sample for the Cruzeiro do Sul dataset:
```java
Backpropagation bpCS = new Backpropagation(we.nncruzeirosul,
    we.trainDataCS, LearningAlgorithm.LearningMode.BATCH);
bpCS.setTestingDataSet(we.testDataCS);
bpCS.setLearningRate(0.3);
bpCS.setMaxEpochs(1000);
bpCS.setMinOverallError(0.01); // normalized error
bpCS.printTraining = true;
bpCS.setMomentumRate(0.3);
try {
    bpCS.forward();
    bpCS.train();
    System.out.println("Overall Error:"
        + String.valueOf(bpCS.getOverallGeneralError()));
    System.out.println("Testing Error:"
        + String.valueOf(bpCS.getTestingOverallGeneralError()));
    System.out.println("Min Overall Error:"
        + String.valueOf(bpCS.getMinOverallError()));
    System.out.println("Epochs of training:"
        + String.valueOf(bpCS.getEpoch()));
} catch (NeuralException ne) {
}
```
Using the JFreeChart framework, we can plot the error evolution for the training and testing datasets. There is a new method in the class `LearningAlgorithm` called `showErrorEvolution`, which is inherited and overridden by `Backpropagation`. To see the chart, just call it as in the example:

```java
// plot list of errors by epoch
bpCS.showErrorEvolution();
```
This will show a plot like the one shown in the following figure:
Using this same facility, it is easy to visualize and compare the neural network output. First, let's transform the neural network output into vector form and add it to our dataset using the method `addColumn`, naming the new columns `NeuralMaxTemp` and `NeuralMinTemp`:

```java
String[] neuralOutputs = {"NeuralMaxTemp", "NeuralMinTemp"};
we.cruzeirodosul.addColumn(we.fullDataCS.getIthNeuralOutput(0),
    neuralOutputs[0]);
we.cruzeirodosul.addColumn(we.fullDataCS.getIthNeuralOutput(1),
    neuralOutputs[1]);
String[] comparison = {"MaxTemp", "NeuralMaxTemp"};
Paint[] comp_color = {Color.BLUE, Color.RED};
final double minDate = 41200.0;
final double maxDate = 41300.0;
```
The class `TimeSeries` has a method called `getTimePlot`, which plots variables over a specified range:

```java
ChartFrame viewChart = we.cruzeirodosul.getTimePlot("Comparison",
    comparison, comp_color, minDate, maxDate);
```