Preprocessing

Raw data collected from a data source usually presents particularities such as differing ranges, sampling rates, and categories. Some variables result from measurements, while others are summarized or even calculated. Preprocessing means adapting these variable values to a range that neural networks can handle properly.

Regarding weather variables, let's take a look at their range, sampling, and type:

Variable         | Unit  | Range         | Sampling | Type
-----------------|-------|---------------|----------|---------------------------------------
Mean temperature | ºC    | 10.86 – 29.25 | Hourly   | Average of hourly measurements
Precipitation    | mm    | 0 – 161.20    | Daily    | Accumulation of daily rain
Insolation       | hours | 0 – 10.40     | Daily    | Count of hours receiving sun radiation
Mean humidity    | %     | 45.00 – 96.00 | Hourly   | Average of hourly measurements
Mean wind speed  | km/h  | 0.00 – 3.27   | Hourly   | Average of hourly measurements

Except for insolation and precipitation, the variables are all measured and share the same sampling. If we wanted, for example, to use an hourly dataset, we would have to preprocess all the variables to the same sample rate. Three of the variables are summarized as daily averages of hourly measurements; we could instead use the hourly measurements themselves, but the range would certainly be larger.

Normalization

Normalization is the process of getting all variables into the same data range, usually with smaller values, between 0 and 1 or -1 and 1. This helps the neural network keep values within the responsive zone of activation functions such as the sigmoid or hyperbolic tangent:

(Figure: the sigmoid and hyperbolic tangent activation functions)

Values that are too high or too low may drive neurons into producing outputs in the saturated regions of the activation functions, where the derivative is very small, near zero. In this book, we implemented two modes of normalization: min-max and z-score.
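To see why saturation matters, consider the sigmoid's derivative, which can be written as σ'(x) = σ(x)(1 - σ(x)). The following standalone sketch (not part of the book's code) shows how quickly the derivative collapses once inputs drift away from zero:

```java
public class SaturationDemo {
    // logistic sigmoid activation function
    public static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }
    // its derivative, expressed through the sigmoid's own output
    public static double sigmoidDerivative(double x) {
        double s = sigmoid(x);
        return s * (1.0 - s);
    }
    public static void main(String[] args) {
        // near zero the derivative is at its maximum (0.25)...
        System.out.println(sigmoidDerivative(0.0));  // 0.25
        // ...but for a large input it practically vanishes
        System.out.println(sigmoidDerivative(10.0)); // about 4.5e-5
    }
}
```

A neuron fed with raw values in the hundreds would sit permanently in that near-zero-derivative region, which is exactly what normalization is meant to avoid.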

The min-max normalization should consider a predefined range of the dataset. It is performed right away:

Xnorm = Nmin + ((X - Xmin) / (Xmax - Xmin)) * (Nmax - Nmin)

Here, Nmin and Nmax are the normalized minimum and maximum limits respectively, Xmin and Xmax are the variable X's minimum and maximum limits respectively, X is the original value, and Xnorm is the normalized value. If we want the normalization to be between 0 and 1, for example, the equation is simplified to the following:

Xnorm = (X - Xmin) / (Xmax - Xmin)

By applying the normalization, a new normalized dataset is produced and is fed to the neural network. One should also take into account that a neural network fed with normalized values will be trained to produce normalized values on the output, so the inverse (denormalization) process becomes necessary as well:

X = Xmin + ((Xnorm - Nmin) / (Nmax - Nmin)) * (Xmax - Xmin)

Or

X = Xmin + Xnorm * (Xmax - Xmin)

for the normalization between 0 and 1.

Another mode of normalization is the z-score, which takes into account the mean and standard deviation:

Xnorm = S * (X - E[X]) / sX

Here, S is a scaling constant, E[X] is the mean of X, and sX is the standard deviation of X. The main difference in this normalization mode is that there is no predefined limit on the range of the normalized variables; however, all variables will be on the same scale, centered on zero, with standard deviation equal to the scaling constant S.
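Both modes can be sketched standalone, independently of the book's DataNormalization class (the class and method names below are ours; note that we use the sample standard deviation here, while the book's ArrayOperations.stdev may compute it slightly differently):

```java
import java.util.Arrays;

public class NormalizationSketch {
    // min-max normalization of a vector onto [normMin, normMax]
    public static double[] minMax(double[] x, double normMin, double normMax) {
        double min = Arrays.stream(x).min().getAsDouble();
        double max = Arrays.stream(x).max().getAsDouble();
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++)
            out[i] = normMin + ((x[i] - min) / (max - min)) * (normMax - normMin);
        return out;
    }
    // z-score normalization scaled by the constant s
    public static double[] zScore(double[] x, double s) {
        double mean = Arrays.stream(x).average().getAsDouble();
        double var = 0.0;
        for (double v : x) var += (v - mean) * (v - mean);
        double std = Math.sqrt(var / (x.length - 1)); // sample standard deviation
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++)
            out[i] = s * (x[i] - mean) / std;
        return out;
    }
}
```

For the series {10, 20, 30}, min-max onto [0, 1] yields {0, 0.5, 1}, while the z-score with S = 1 yields {-1, 0, 1}: same information, different ranges.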

The figure below shows what both normalization modes do with the data:

(Figure: the effect of min-max and z-score normalization on the data)

A class called DataNormalization is implemented to handle the normalization of data. Since normalization considers the statistical properties of the data, we need to store this statistical information in a DataNormalization object:

public class DataNormalization {
  //ENUM normalization types
  public enum NormalizationTypes { MIN_MAX, ZSCORE }
  // normalization type
  public NormalizationTypes TYPE;
  //statistical properties of the data
  private double[] minValues;
  private double[] maxValues;
  private double[] meanValues;
  private double[] stdValues;
  //normalization properties
  private double scaleNorm=1.0;        
  private double minNorm=-1.0;
//…
  //constructor for min-max norm
  public DataNormalization(double[][] data,double _minNorm, double _maxNorm){
    this.TYPE=NormalizationTypes.MIN_MAX;
    this.minNorm=_minNorm;
    this.scaleNorm=_maxNorm-_minNorm;
    calculateReference(data);
  }
  //constructor for z-score norm        
  public DataNormalization(double[][] data,double _zscale){
    this.TYPE=NormalizationTypes.ZSCORE;
    this.scaleNorm=_zscale;
    calculateReference(data);
  }
  //calculation of statistical properties
  private void calculateReference(double[][] data){
    minValues=ArrayOperations.min(data);
    maxValues=ArrayOperations.max(data);
    meanValues=ArrayOperations.mean(data);
    stdValues=ArrayOperations.stdev(data);
  }
//…
}

The normalization procedure is performed on a method called normalize, which has a denormalization counterpart called denormalize:

public double[][] normalize( double[][] data ) {
  int rows = data.length;
  int cols = data[0].length;
  //…
  double[][] normalizedData = new double[rows][cols];
  for(int i=0;i<rows;i++){
    for(int j=0;j<cols;j++){
      switch (TYPE){
        case MIN_MAX:
          normalizedData[i][j]=(minNorm) + ((data[i][j] - minValues[j]) / ( maxValues[j] - minValues[j] )) * (scaleNorm);
          break;
        case ZSCORE:
          normalizedData[i][j]=scaleNorm * (data[i][j] - meanValues[j]) / stdValues[j];
          break;
      }
    }
  }
  return normalizedData;
}
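The denormalize counterpart is not reproduced here. Inverting the two formulas above gives per-value expressions that a sketch of it would rely on (the class and method names below are ours, not the book's DataNormalization API):

```java
public class DenormalizationSketch {
    // inverse of min-max: X = Xmin + ((Xnorm - Nmin) / (Nmax - Nmin)) * (Xmax - Xmin)
    public static double denormMinMax(double xNorm, double xMin, double xMax,
                                      double nMin, double nMax) {
        return xMin + ((xNorm - nMin) / (nMax - nMin)) * (xMax - xMin);
    }
    // inverse of z-score: X = E[X] + Xnorm * sX / S
    public static double denormZScore(double xNorm, double mean, double std, double s) {
        return mean + xNorm * std / s;
    }
}
```

Applying normalization followed by its inverse should return the original value, which is a convenient property to unit-test.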

Adapting NeuralDataSet to handle normalization

The already implemented NeuralDataSet, NeuralInputData, and NeuralOutputData will now have DataNormalization objects to handle normalization operations. In the NeuralDataSet class, we've added objects for input and output data normalization:

 public DataNormalization inputNorm;
 public DataNormalization outputNorm;
 //zscore normalization
 public void setNormalization(double _scaleNorm){
   inputNorm = new DataNormalization(_scaleNorm);
   inputData.setNormalization(inputNorm);
   outputNorm = new DataNormalization(_scaleNorm);
   outputData.setNormalization(outputNorm);
 }
 //min-max normalization
 public void setNormalization(double _minNorm,double _maxNorm){
   inputNorm = new DataNormalization(_minNorm,_maxNorm);
   inputData.setNormalization(inputNorm);
   outputNorm = new DataNormalization(_minNorm,_maxNorm);
   outputData.setNormalization(outputNorm);
 }

NeuralInputData and NeuralOutputData will now have normdata properties to store the normalized data. The methods to retrieve data from these classes will have a Boolean parameter, isNorm, to indicate whether the value to be retrieved should be normalized or not.

Considering that NeuralInputData will provide the neural network with input data, this class will only perform normalization before feeding data into the neural network. The method setNormalization is implemented in this class to that end:

 public ArrayList<ArrayList<Double>> normdata;
 public DataNormalization norm; 
 public void setNormalization(DataNormalization dn){
    //getting the original data into java matrix
   double[][] origData = ArrayOperations.arrayListToDoubleMatrix(data);
   //perform normalization
   double[][] normData = dn.normalize(origData);
   normdata=new ArrayList<>();
   //store the normalized values into ArrayList normdata
   for(int i=0;i<normData.length;i++){
     normdata.add(new ArrayList<Double>());
     for(int j=0;j<normData[0].length;j++){
       normdata.get(i).add(normData[i][j]);
     }
  }
}

In NeuralOutputData, there are two datasets: one for the target and one for the neural network output. The target dataset is normalized to provide the training algorithm with normalized values. The neural output dataset, however, is produced by the neural network, so it arrives already in the normalized domain; we therefore need to perform denormalization after setting the neural network output dataset:

 public ArrayList<ArrayList<Double>> normTargetData;
 public ArrayList<ArrayList<Double>> normNeuralData;
 public void setNeuralData(double[][] _data,boolean isNorm){
   if(isNorm){ //if is normalized
     this.normNeuralData=new ArrayList<>();
     for(int i=0;i<numberOfRecords;i++){
       this.normNeuralData.add(new ArrayList<Double>());
       //… save in the normNeuralData
       for(int j=0;j<numberOfOutputs;j++){
         this.normNeuralData.get(i).add(_data[i][j]);
       }
     }
     double[][] deNorm = norm.denormalize(_data);
     for(int i=0;i<numberOfRecords;i++)
       for(int j=0;j<numberOfOutputs;j++) //then in neuralData
          this.neuralData.get(i).set(j,deNorm[i][j]);
   }
   else setNeuralData(_data);
 }

Adapting the learning algorithm to normalization

Finally, the LearningAlgorithm class needs to include the normalization property:

protected boolean normalization=false;

Now during the training, on every call to the NeuralDataSet methods that retrieve or write data, the normalization property should be passed in the parameter isNorm, as in the method forward of the class Backpropagation:

@Override
public void forward(){
  for(int i=0;i<trainingDataSet.numberOfRecords;i++){
    neuralNet.setInputs(trainingDataSet.getInputRecord(i,normalization));
    neuralNet.calc();
    trainingDataSet.setNeuralOutput(i, neuralNet.getOutputs(), normalization);
    //…
  }
}

Java implementation of weather forecasting

In Java, we are going to use the package edu.packt.neuralnet.chart to plot charts and visualize data. We also downloaded historical meteorological data from INMET, the Brazilian Institute of Meteorology, covering several cities, so that a variety of climates is included in our weather forecasting case.

Tip

In order to run the training expeditiously, we have selected a small period (5 years), which has more than 2,000 samples.

Collecting weather data

In this example, we wanted to collect a variety of data from different places to attest to the capacity of the neural network to forecast weather. Since we downloaded the data from the INMET website, which covers only Brazilian territory, only Brazilian cities are included. However, Brazil is a vast territory with a great variety of climates. Below is a list of the places we collected data from:

# | City Name        | Latitude | Longitude | Altitude | Climate Type
--|------------------|----------|-----------|----------|---------------------
1 | Cruzeiro do Sul  | 7º37'S   | 72º40'W   | 170 m    | Tropical Rainforest
2 | Picos            | 7º04'S   | 41º28'W   | 208 m    | Semi-arid
3 | Campos do Jordão | 22º45'S  | 45º36'W   | 1642 m   | Subtropical Highland
4 | Porto Alegre     | 30º01'S  | 51º13'W   | 48 m     | Subtropical Humid

The location of these four cities is indicated on the map below:

(Figure: map of Brazil indicating the locations of the four cities)

Source: Wikipedia, user NordNordWest using United States National Imagery and Mapping Agency data, World Data Base II data

The weather data collected is from January 2010 until November 2016 and is saved in the data folder with the name corresponding to the city.

The data collected from the INMET website includes these variables:

  • Precipitation (mm)
  • Max. temperature (ºC)
  • Min. temperature (ºC)
  • Insolation (sunny hours)
  • Evaporation (mm)
  • Avg. temperature (ºC)
  • Avg. humidity (%)
  • Avg. wind speed (mph)
  • Date (converted into Excel number format)
  • Position of the station (latitude, longitude, and altitude)

For each city, we are going to build a neural network to forecast the weather based on the past. But first, we need to point out two important facts:

  • Cities located in high latitudes experience high weather variations due to the seasons; that is, the weather will be dependent on the date
  • The weather is a very dynamic system whose variables are influenced by past values

To overcome the first issue, we may derive a new column from the date to indicate the solar noon angle, which is the angle at which the solar rays reach the surface of the city when the sun is at its highest point in the sky (noon). The greater this angle, the more intense and warm the solar radiation; when this angle is small, the surface receives only a small fraction of the solar radiation:

(Figure: the solar noon angle of sun rays reaching the surface)

The solar noon angle is calculated by the following formula, whose Java implementation in the class WeatherExample will be used in this chapter:

angle = 90 - |-23.44 * cos((2π / 365.25) * (date + 8.5)) - latitude|
public double calcSolarNoonAngle(double date,double latitude){
  return 90-Math.abs(-23.44*Math.cos((2*Math.PI/365.25)*(date+8.5))-latitude);
}
public void addSolarNoonAngle(TimeSeries ts,double latitude){// to add column
  double[] sna = new double[ts.numberOfRecords];
  for(int i=0;i<ts.numberOfRecords;i++)
    sna[i]=calcSolarNoonAngle(
               ts.data.get(i).get(ts.getIndexColumn()), latitude);
  ts.addColumn(sna, "NoonAngle");
}
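The method can be exercised on its own. The snippet below duplicates the formula in a standalone class (the class name and the date values are ours; since the formula is periodic over 365.25 days, any day count works in place of a real Excel-format date):

```java
public class SolarNoonAngleDemo {
    // same formula as WeatherExample.calcSolarNoonAngle:
    // 90 degrees minus the absolute difference between the solar
    // declination (approximated by the cosine term) and the latitude
    public static double calcSolarNoonAngle(double date, double latitude) {
        return 90 - Math.abs(-23.44 * Math.cos((2 * Math.PI / 365.25) * (date + 8.5)) - latitude);
    }
}
```

Because the declination term stays within ±23.44º, the result is bounded above by 90º (the sun directly overhead) and oscillates with the seasons for any fixed latitude.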

Delaying variables

In the class WeatherExample, let's place a method called makeDelays, which will later be called from the main method. The delays will be made on a given TimeSeries and up to a given number for all columns of the time series except that of the index column:

public void makeDelays(TimeSeries ts,int maxdelays){
  for(int i=0;i<ts.numberOfColumns;i++)
    if(i!=ts.getIndexColumn())
      for(int j=1;j<=maxdelays;j++)
        ts.shift(i, -j);
}

Tip

Be careful not to call this method multiple times; it may delay the same column over and over again.
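The book does not list the body of shift, but the effect of delaying a single column can be sketched standalone (the class and method names below are ours, and we assume positions with no past value are filled with NaN, consistent with the dropNaN cleanup used on these datasets):

```java
import java.util.Arrays;

public class DelaySketch {
    // returns a copy of the series delayed by 'lag' steps;
    // the first 'lag' entries have no past value and become NaN
    public static double[] delay(double[] series, int lag) {
        double[] out = new double[series.length];
        Arrays.fill(out, 0, Math.min(lag, series.length), Double.NaN);
        for (int i = lag; i < series.length; i++)
            out[i] = series[i - lag];
        return out;
    }
}
```

Delaying {1, 2, 3, 4} by one step gives {NaN, 1, 2, 3}: row i now carries the value observed at i - 1, which is exactly what lets the network see "yesterday" as an input.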

Loading the data and beginning to play!

In the WeatherExample class, we are going to add four TimeSeries properties and four NeuralNet properties for each case:

public class WeatherExample {

    TimeSeries cruzeirodosul;
    TimeSeries picos;
    TimeSeries camposdojordao;
    TimeSeries portoalegre;
    
    NeuralNet nncruzeirosul;
    NeuralNet nnpicos;
    NeuralNet nncamposjordao;
    NeuralNet nnportoalegre;
//…
}

In the main method, we load data to each of them and delay the columns up to three days before:

public static void main(String[] args) {
  WeatherExample we = new WeatherExample();
  //load weather data
  we.cruzeirodosul = new TimeSeries(LoadCsv.getDataSet("data", "cruzeirodosul2010daily.txt", true, ";"));
  we.cruzeirodosul.setIndexColumn(0);
  we.makeDelays(we.cruzeirodosul, 3);
        
  we.picos = new TimeSeries(LoadCsv.getDataSet("data", "picos2010daily.txt", true, ";"));
  we.picos.setIndexColumn(0);
  we.makeDelays(we.picos, 3);
       
  we.camposdojordao = new TimeSeries(LoadCsv.getDataSet("data", "camposdojordao2010daily.txt", true, ";"));
  we.camposdojordao.setIndexColumn(0);
  we.makeDelays(we.camposdojordao, 3);
        
  we.portoalegre = new TimeSeries(LoadCsv.getDataSet("data", "portoalegre2010daily.txt", true, ";"));
  we.portoalegre.setIndexColumn(0);
  we.makeDelays(we.portoalegre, 3);
//…

Tip

This piece of code can take a couple of minutes to execute, given that each file may have more than 2,000 rows.

After loading, we need to remove the NaNs, so we call the method dropNaN from each time series object:

  //…
  we.cruzeirodosul.dropNaN();
  we.camposdojordao.dropNaN();
  we.picos.dropNaN();
  we.portoalegre.dropNaN();
  //…

To save time and effort for future executions, let's save these datasets:

we.cruzeirodosul.save("data","cruzeirodosul2010daily_delays_clean.txt",";");
//…
we.portoalegre.save("data","portoalegre2010daily_delays_clean.txt",";");

Now, for all time series, each column has three delays, and we want the neural network to forecast the maximum and minimum temperature of the next day. We can forecast the future only from the present and the past, so for inputs we must rely on the delayed data (delayed by 1 to 3 days), and for outputs we may consider the current temperature values. Each column in the time series dataset is indicated by an index, where zero is the index of the date. Since some of the datasets had missing data in certain columns, the index of a given column may vary between datasets. However, the indexes of the output variables are the same across all datasets (indexes 2 and 3).

Let's perform a correlation analysis

We are interested in finding patterns between the delayed data and the current maximum and minimum temperature, so we perform a cross-correlation analysis combining all output and potential input variables, and select the variables whose absolute correlation exceeds a minimum threshold. We write a method correlationAnalysis taking the minimum absolute correlation as its argument. To save space, the code is trimmed here:

public void correlationAnalysis(double minAbsCorr){
  //indexes of output variables (max. and min. temperature) 
  int[][] outputs = { 
            {2,3}, //cruzeiro do sul
            {2,3}, //picos
            {2,3}, //campos do jordao
            {2,3}}; //porto alegre
  int[][] potentialInputs = { //indexes of input variables (delayed)
            {10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,38,39,40}, //cruzeiro do sul
            //… and all others
        };
  ArrayList<ArrayList<ArrayList<Double>>> chosenInputs = new ArrayList<>();
  TimeSeries[] tscollect = {this.cruzeirodosul,this.picos,this.camposdojordao,this.portoalegre};
  double[][][] correlation = new double[4][][];
  for(int i=0;i<4;i++){
    chosenInputs.add(new ArrayList<ArrayList<Double>>());
    correlation[i]=new double[outputs[i].length][potentialInputs[i].length];
    for(int j=0;j<outputs[i].length;j++){
      chosenInputs.get(i).add(new ArrayList<Double>());
      for(int k=0;k<potentialInputs[i].length;k++){
        correlation[i][j][k]=tscollect[i].correlation(outputs[i][j], potentialInputs[i][k]);
        //if the absolute correlation is above the threshold
        if(Math.abs(correlation[i][j][k])>minAbsCorr){
          //it is added to the chosen inputs
          chosenInputs.get(i).get(j).add(correlation[i][j][k]);
          //and we see the plot
          tscollect[i].getScatterChart("Correlation "+String.valueOf(correlation[i][j][k]), outputs[i][j], potentialInputs[i][k], Color.BLACK).setVisible(true);
        }
      }
    }
  }
}

By running this analysis, we receive the following result for Cruzeiro do Sul (the variables whose absolute correlation is above the threshold are chosen as neural network inputs):

Correlation Analysis for data from Cruzeiro do Sul

Variable         | Correlation with MaxTemp | Correlation with MinTemp
-----------------|--------------------------|-------------------------
NoonAngle        | 0.0312808                | 0.346545
Precipitation__1 | -0.115547                | 0.012696
Precipitation__2 | -0.038969                | 0.063303
Precipitation__3 | -0.062173                | 0.112842
MaxTemp__1       | 0.497057                 | 0.311005
MaxTemp__2       | 0.252831                 | 0.244364
MaxTemp__3       | 0.159098                 | 0.123838
MinTemp__1       | -0.033339                | 0.757647
MinTemp__2       | -0.123063                | 0.567563
MinTemp__3       | -0.125282                | 0.429669
Insolation__1    | 0.395741                 | -0.10192
Insolation__2    | 0.197949                 | -0.101146
Insolation__3    | 0.134345                 | -0.151896
Evaporation__1   | 0.21548                  | -0.115236
Evaporation__2   | 0.161384                 | -0.160718
Evaporation__3   | 0.199385                 | -0.160536
AvgTemp__1       | 0.432280                 | 0.633741
AvgTemp__2       | 0.152103                 | 0.487609
AvgTemp__3       | 0.060368                 | 0.312645
AvgHumidity__1   | -0.415812                | 0.151009
AvgHumidity__2   | -0.265189                | 0.155019
AvgHumidity__3   | -0.214624                | 0.177833
WindSpeed__1     | -0.166418                | -0.198555
WindSpeed__2     | -0.056825                | -0.227227
WindSpeed__3     | -0.001660                | -0.185377
NoonAngle__1     | 0.0284473                | 0.353834
NoonAngle__2     | 0.0256710                | 0.360943
NoonAngle__3     | 0.0227864                | 0.367953

The scatter plots show how this data is related:

(Figure: scatter plots of selected inputs against the current maximum and minimum temperatures)

On the left, there is a fair correlation between the last day's maximum temperature and the current one; in the center, a strong correlation between the last day's minimum temperature and the current one; and on the right, a weak correlation between the NoonAngle of 3 days before and the current minimum temperature. By running this analysis for all the other cities, we determine the inputs for the other neural networks:

Cruzeiro do Sul: NoonAngle, MaxTemp__1, MinTemp__1, MinTemp__2, MinTemp__3, Insolation__1, AvgTemp__1, AvgTemp__2, AvgHumidity__1, NoonAngle__1, NoonAngle__2, NoonAngle__3

Picos: MaxTemp, MaxTemp__1, MaxTemp__2, MaxTemp__3, MinTemp__1, MinTemp__2, MinTemp__3, Insolation__1, Insolation__2, Evaporation__1, Evaporation__2, Evaporation__3, AvgTemp__1, AvgTemp__2, AvgTemp__3, AvgHumidity__1, AvgHumidity__2, AvgHumidity__3

Campos do Jordão: NoonAngle, MaxTemp__1, MaxTemp__2, MaxTemp__3, MinTemp__1, MinTemp__2, MinTemp__3, Evaporation__1, AvgTemp__1, AvgTemp__2, AvgTemp__3, AvgHumidity__1, NoonAngle__1, NoonAngle__2, NoonAngle__3

Porto Alegre: MaxTemp, NoonAngle, MaxTemp__1, MaxTemp__2, MaxTemp__3, MinTemp__1, MinTemp__2, MinTemp__3, Insolation__1, Insolation__2, Insolation__3, Evaporation__1, Evaporation__2, Evaporation__3, AvgTemp__1, AvgTemp__2, AvgTemp__3, AvgHumidity__1, AvgHumidity__2, NoonAngle__1, NoonAngle__2, NoonAngle__3

Creating neural networks

We are using four neural networks to forecast the minimum and maximum temperatures. Initially, each will have two hidden layers with 20 and 10 neurons, using hyperbolic tangent and sigmoid activation functions respectively, and we will apply min-max normalization. The following method in the class WeatherExample creates the neural networks with this configuration:

public void createNNs(){
 //fill a vector with the indexes of input and output columns
 int[] inputColumnsCS = {10,14,17,18,19,20,26,27,29,38,39,40};
 int[] outputColumnsCS = {2,3};
 //this static method hashes the dataset
 NeuralDataSet[] nnttCS = NeuralDataSet.randomSeparateTrainTest(this.cruzeirodosul, inputColumnsCS, outputColumnsCS, 0.7);
 //setting normalization
 DataNormalization.setNormalization(nnttCS, -1.0, 1.0);

 this.trainDataCS = nnttCS[0]; // 70% for training 
 this.testDataCS = nnttCS[1]; // rest for test
        
 //setup neural net parameters:
 this.nncruzeirosul = new NeuralNet( inputColumnsCS.length, outputColumnsCS.length, new int[]{20,10} 
    , new IActivationFunction[] {new HyperTan(1.0),new Sigmoid(1.0)}
    , new Linear()
    , new UniformInitialization(-1.0, 1.0) );
//…
}

Training and test

In Chapter 2, Getting Neural Networks to Learn, we saw that a neural network should be tested to verify its learning, so we divide the dataset into training and testing subsets. Usually, about 50–80% of the original filtered dataset is used for training, and the remaining fraction is used for testing.

A static method randomSeparateTrainTest in the class NeuralDataSet separates the dataset into these two subsets. In order to ensure maximum generalization, the records of this dataset are hashed, as shown in the following figure:

(Figure: records are shuffled into random positions before being split into training and testing sets)

The records may be originally sequential, as in weather time series; if we hash them in random positions, the training and testing sets will contain records from all periods.
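The hashing-and-splitting idea can be sketched standalone (the class and method names below are ours, not the book's NeuralDataSet.randomSeparateTrainTest, whose internals are not listed here):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TrainTestSplitSketch {
    // shuffles the record indexes and splits them into train/test partitions
    public static int[][] split(int numberOfRecords, double trainFraction, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < numberOfRecords; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed)); // "hash" the record order
        int nTrain = (int) Math.round(numberOfRecords * trainFraction);
        int[] train = new int[nTrain];
        int[] test = new int[numberOfRecords - nTrain];
        for (int i = 0; i < nTrain; i++) train[i] = idx.get(i);
        for (int i = nTrain; i < numberOfRecords; i++) test[i - nTrain] = idx.get(i);
        return new int[][]{train, test};
    }
}
```

With a 0.7 fraction, 70% of the shuffled indexes go to training and 30% to testing, and every record lands in exactly one of the two subsets.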

Training the neural network

The neural network will be trained using the basic backpropagation algorithm. The following is a code sample for the dataset Cruzeiro do Sul:

 Backpropagation bpCS = new Backpropagation(we.nncruzeirosul
                ,we.trainDataCS
                ,LearningAlgorithm.LearningMode.BATCH);
 bpCS.setTestingDataSet(we.testDataCS);
 bpCS.setLearningRate(0.3);
 bpCS.setMaxEpochs(1000);
 bpCS.setMinOverallError(0.01); //normalized error
 bpCS.printTraining = true;
 bpCS.setMomentumRate( 0.3 );
        
 try{
   bpCS.forward();
   bpCS.train();

   System.out.println("Overall Error:"      + String.valueOf(bpCS.getOverallGeneralError()));
   System.out.println("Testing Error:"      + String.valueOf(bpCS.getTestingOverallGeneralError()));
   System.out.println("Min Overall Error:"  + String.valueOf(bpCS.getMinOverallError()));
   System.out.println("Epochs of training:" + String.valueOf(bpCS.getEpoch()));
 }
 catch(NeuralException ne){ }

Plotting the error

Using the JFreeChart framework, we can plot the error evolution for the training and testing datasets. There is a new method in the class LearningAlgorithm called showErrorEvolution, which is inherited and overridden by Backpropagation. To see the chart, just call it as in the example:

//plot list of errors by epoch 
bpCS.showErrorEvolution();

This will show a plot like the one shown in the following figure:

(Figure: error evolution for the training and testing datasets over the epochs)

Viewing the neural network output

Using this same facility, it is easy to view and compare the neural network output. First, let's transform the neural network outputs into vector form and add them to our dataset using the method addColumn, naming them NeuralMaxTemp and NeuralMinTemp:

 String[] neuralOutputs = { "NeuralMaxTemp", "NeuralMinTemp"};
 we.cruzeirodosul.addColumn(we.fullDataCS.getIthNeuralOutput(0), neuralOutputs[0]);
 we.cruzeirodosul.addColumn(we.fullDataCS.getIthNeuralOutput(1), neuralOutputs[1]);
 String[] comparison = {"MaxTemp","NeuralMaxTemp"};
 Paint[] comp_color = {Color.BLUE, Color.RED};
        
 final double minDate = 41200.0;
 final double maxDate = 41300.0;

The class TimeSeries has a method called getTimePlot, which is used to plot variables over a specified range:

ChartFrame viewChart = we.cruzeirodosul.getTimePlot("Comparison", comparison, comp_color, minDate, maxDate);
(Figure: time plot comparing MaxTemp and NeuralMaxTemp over the selected date range)