Loading/selecting data

First, we need to load raw data into our Java environment. Data can be stored in a variety of sources, from text files to structured database systems. One basic and widely used format is CSV (Comma-Separated Values). In addition, we'll need to transform this data and perform selection before presenting it to the neural network.

Building auxiliary classes

To deal with these tasks, we need some auxiliary classes in the package edu.packt.neuralnet.data. The first will be LoadCsv to read CSV files:

public class LoadCsv {
   //Path and file name separated for compatibility
  private String PATH; 
  private String FILE_NAME;
  private double[][] dataMatrix; 
  private boolean columnsInFirstRow=false;
  private String separator = ",";
  private String fullFilePath;
  private String[] columnNames;
  final double missingValue=Double.NaN;

  //Constructors
  public LoadCsv(String path,String fileName)
  //…
  public LoadCsv(String fileName,boolean _columnsInFirstRow,String _separator)
  //…

  //Method to load data from file returning a matrix
  public double[][] getDataMatrix(String fullPath,boolean _columnsInFirstRow,String _separator)
  //…
  
  //Static method for calls without instantiating LoadCsv object
  public static double[][] getData(String fullPath,boolean _columnsInFirstRow,String _separator)
  //…

  //Method for saving data into csv file
  public void save()
  //…

  //…
}

Tip

To save space here, we are not showing the full code. For more details and the full list of methods, please refer to the code and documentation in Appendix C.
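As a rough illustration of what a loader of this kind might do internally, the sketch below parses CSV lines into a double matrix and maps empty or non-numeric cells to Double.NaN. The class name CsvSketch and the method parse are hypothetical and are not the book's LoadCsv implementation (see Appendix C for that); the sketch takes lines that have already been read, so it needs no file access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not the book's LoadCsv: parses CSV lines that were already read.
class CsvSketch {
    // Column names found in the first row, or null when there is no header row
    String[] columnNames;

    double[][] parse(List<String> lines, boolean columnsInFirstRow, String separator) {
        List<double[]> rows = new ArrayList<>();
        // Quote the separator so that characters such as "|" or "." are taken literally
        String quotedSep = Pattern.quote(separator);
        int start = 0;
        if (columnsInFirstRow && !lines.isEmpty()) {
            columnNames = lines.get(0).split(quotedSep, -1);
            start = 1;
        }
        for (int i = start; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.trim().isEmpty()) continue; // skip blank lines
            String[] cells = line.split(quotedSep, -1); // -1 keeps trailing empty cells
            double[] row = new double[cells.length];
            for (int j = 0; j < cells.length; j++) {
                row[j] = toDouble(cells[j]);
            }
            rows.add(row);
        }
        return rows.toArray(new double[0][]);
    }

    // Empty or non-numeric cells become NaN, mirroring the missingValue field above
    private static double toDouble(String cell) {
        try {
            return Double.parseDouble(cell.trim());
        } catch (NumberFormatException e) {
            return Double.NaN;
        }
    }
}
```

Calling parse with the lines "date,temp", "1,20.5" and "2," (header flag set, separator ",") would give a 2x2 matrix whose last cell is NaN. This sketch does not handle quoted fields that contain the separator.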

We are also creating a class to store the raw data loaded from CSV into a structure containing not only the data but the information on this data, such as column names. This class will be called DataSet, inside the same package:

public class DataSet {
  //column names list
  public ArrayList<String> columns;
  //data matrix 
  public ArrayList<ArrayList<Double>> data;
  
  public int numberOfColumns;
  public int numberOfRecords;

  //creating from Java matrix
  public DataSet(double[][] _data,String[] _columns){
    numberOfRecords=_data.length;
    numberOfColumns=_data[0].length;
    columns = new ArrayList<>();
    for(int i=0;i<numberOfColumns;i++){
      //…
      columns.add(_columns[i]);
      //…
    }
    data = new ArrayList<>();
    for(int i=0;i<numberOfRecords;i++){
      data.add(new ArrayList<Double>());
      for(int j=0;j<numberOfColumns;j++){
        data.get(i).add(_data[i][j]);
      }
    }        
  }
  
  //creating from csv file  
  public DataSet(String filename,boolean columnsInFirstRow,String separator){
    LoadCsv lcsv = new LoadCsv(filename,columnsInFirstRow,separator);
    double[][] _data= lcsv.getDataMatrix(filename, columnsInFirstRow, separator);
    numberOfRecords=_data.length;
    numberOfColumns=_data[0].length;
    columns = new ArrayList<>();
    if(columnsInFirstRow){
       String[] columnNames = lcsv.getColumnNames();
       for(int i=0;i<numberOfColumns;i++){
         columns.add(columnNames[i]);
       }
    }
    else{ //default column names: Column0, Column1, etc.
      for(int i=0;i<numberOfColumns;i++){ 
        columns.add("Column"+String.valueOf(i));
      }
    }
    data = new ArrayList<>();
    for(int i=0;i<numberOfRecords;i++){
      data.add(new ArrayList<Double>());
      for(int j=0;j<numberOfColumns;j++){
        data.get(i).add(_data[i][j]);
      }
    } 
  }
  //…
  //method for adding new column
  public void addColumn(double[] _data,String name)
  //…
  //method for appending new data, number of columns must correspond 
  public void appendData(double[][] _data)
  //…
  //getting all data
  public double[][] getData(){
    return ArrayOperations.arrayListToDoubleMatrix(data);
  }
  //getting data from specific columns 
  public double[][] getData(int[] columns){
    return ArrayOperations.getMultipleColumns(getData(), columns);
  }
  //getting data from one column
  public double[] getData(int col){
    return ArrayOperations.getColumn(getData(), col);
  }
  //method for saving the data in a csv file
  public void save(String filename,String separator)
  //…
}

In Chapter 4, Self-Organizing Maps, we've created a class ArrayOperations in the package edu.packt.neuralnet.math to handle operations involving arrays of data. This class has a large number of static methods and it would be impractical to depict all of them here; however, information can be found in Appendix C.
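To give an idea of what the three helpers called from DataSet might look like, here is a minimal sketch. The method names follow the calls in the listing above, but the bodies and the class name ArrayOperationsSketch are assumptions made for illustration; the actual implementations are the ones documented in Appendix C.

```java
import java.util.ArrayList;

// Hypothetical stand-ins for three ArrayOperations helpers (not the book's code).
class ArrayOperationsSketch {
    // Copies a list of rows into a primitive matrix
    static double[][] arrayListToDoubleMatrix(ArrayList<ArrayList<Double>> list) {
        double[][] m = new double[list.size()][];
        for (int i = 0; i < list.size(); i++) {
            ArrayList<Double> row = list.get(i);
            m[i] = new double[row.size()];
            for (int j = 0; j < row.size(); j++) {
                m[i][j] = row.get(j); // auto-unboxing Double -> double
            }
        }
        return m;
    }

    // Extracts one column as a vector
    static double[] getColumn(double[][] m, int col) {
        double[] out = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            out[i] = m[i][col];
        }
        return out;
    }

    // Extracts several columns, in the order given by cols
    static double[][] getMultipleColumns(double[][] m, int[] cols) {
        double[][] out = new double[m.length][cols.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < cols.length; j++) {
                out[i][j] = m[i][cols[j]];
            }
        }
        return out;
    }
}
```

Note that converting the whole list to a matrix on every getData call copies all the data, which is simple but may be costly for large datasets.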

Getting a dataset from a CSV file

To make the task easier, we've implemented a static method in the class LoadCsv to load a CSV file and automatically convert it into the structure of a DataSet object:

public static DataSet getDataSet(String fullPath,boolean _columnsInFirstRow, String _separator){
  LoadCsv lcsv = new LoadCsv(fullPath,_columnsInFirstRow,_separator);
  lcsv.columnsInFirstRow=_columnsInFirstRow;
  lcsv.separator=_separator;
  try{
    lcsv.dataMatrix=lcsv.csvData2Matrix(fullPath);
    System.out.println("File "+fullPath+" loaded!");
  }
  catch(IOException ioe){
    System.err.println("Error while loading CSV file. Details: " + ioe.getMessage());
  }
  return new DataSet(lcsv.dataMatrix, lcsv.columnNames);
}

Building time series

A time series structure is essential for all problems involving time dimensions or domains, such as forecasting and prediction. A class called TimeSeries implements some time-related attributes such as time column and delay. Let's take a look at the structure of this class:

public class TimeSeries extends DataSet {
  //index of the column containing time information
  private int indexTimeColumn;
   
  public TimeSeries(double[][] _data,String[] _columns){
    super(_data,_columns); //just a call to superclass constructor
  }
  public TimeSeries(String path, String filename){
    super(path,filename);
  }
  public TimeSeries(DataSet ds){
    super(ds.getData(),ds.getColumns());
  }
  public void setIndexColumn(int col){
    this.indexTimeColumn=col;
    this.sortBy(indexTimeColumn);
  }
  //…
}

In time series, one frequent operation is the delay or shift of values. For example, we may want the processing to include not the current daily temperature but its two past values. Treating temperature as a time series with a time column (date), we must shift the values by the desired number of cycles (one and two, in this example):

[Figure: Building time series]

Tip

We have used Microsoft Excel® to convert the datetime values into real values. Working with numerical values is always preferred to working with structures such as dates or categories. So in this chapter, we are using numerical values to represent date.

While working with time series, one should pay attention to two points:

  • There may be missing values or no measurements on a specific point of time; this can generate NaNs in the Java matrix.
  • Shifting a column over one time period, for example, is not the same as getting the value of the previous row. That's why it is important to choose one column to be the time reference.

In the ArrayOperations class, we implemented a method shiftColumn that shifts a column of a matrix using a time column as the reference. The TimeSeries class wraps it in a method of the same name, which the shift method then uses to add the shifted values as a new column:

public double[] shiftColumn(int col,int shift){
  double[][] _data = ArrayOperations.arrayListToDoubleMatrix(data);
  return ArrayOperations.shiftColumn(_data, indexTimeColumn, shift, col);
}
public void shift(int col,int shift){
  String colName = columns.get(col);
  if(shift>0) colName=colName+"_"+String.valueOf(shift);
  else colName=colName+"__"+String.valueOf(-shift);
  addColumn(shiftColumn(col,shift),colName);
}
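The difference between shifting by time and taking the previous row can be made concrete with a small sketch. The code below is one possible way to shift a column relative to a time column; it is not the book's ArrayOperations.shiftColumn, although it borrows the same parameter order. It assumes that a positive shift looks back that many time units, that time values are whole numbers (so they can be matched exactly), and that a missing time point yields NaN.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical time-aware shift; an illustration, not the book's implementation.
class TimeShiftSketch {
    static double[] shiftColumn(double[][] data, int timeCol, int shift, int col) {
        // Index every row by its time value
        Map<Double, Integer> rowByTime = new HashMap<>();
        for (int i = 0; i < data.length; i++) {
            rowByTime.put(data[i][timeCol], i);
        }
        double[] shifted = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            // Look for the row measured exactly `shift` time units earlier
            Integer source = rowByTime.get(data[i][timeCol] - shift);
            shifted[i] = (source == null) ? Double.NaN : data[source][col];
        }
        return shifted;
    }
}
```

With rows for days 1, 2, and 4 (day 3 missing), a shift of one gives NaN for day 4, because there is no measurement for day 3; simply copying the previous row would have wrongly used the value from day 2.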

Dropping NaNs

NaNs are undesired values often present after loading or transforming data. They are undesired because we cannot operate on them: arithmetic involving a NaN generally yields NaN. If we feed a NaN value into a neural network, it propagates through the weighted sums, so the output will be NaN as well and the computation is wasted. That's why it is better to drop them. In the class DataSet, we've implemented two methods to drop NaNs: one substitutes a given value for each NaN, and the other drops the entire row if it has at least one missing value, as shown in the following figure:

[Figure: Dropping NaNs]

// dropping with a substituting value
public void dropNaN(double substvalue)
//…
// dropping the entire row
public void dropNaN()
//…
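Since the bodies of these two methods are elided, the following is only a sketch of how they could work on the same list-of-rows structure that DataSet uses. The class DropNaNSketch and its static signatures are hypothetical; a real DataSet method would also need to update numberOfRecords after removing rows.

```java
import java.util.ArrayList;

// Hypothetical versions of the two dropNaN variants, written as static helpers.
class DropNaNSketch {
    // Replaces every NaN with substValue, in place
    static void dropNaN(ArrayList<ArrayList<Double>> data, double substValue) {
        for (ArrayList<Double> row : data) {
            for (int j = 0; j < row.size(); j++) {
                if (row.get(j).isNaN()) {
                    row.set(j, substValue);
                }
            }
        }
    }

    // Removes every row that contains at least one NaN
    static void dropNaN(ArrayList<ArrayList<Double>> data) {
        data.removeIf(row -> row.stream().anyMatch(v -> v.isNaN()));
    }
}
```

Substitution keeps the number of records but injects artificial values, while row removal keeps only real measurements at the cost of a shorter dataset; which one is preferable depends on how many values are missing.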

Getting weather data

Now that we have the tools to get the data, let's find some datasets on the Internet. In this chapter, we are going to use the data from the Brazilian Institute of Meteorology (INMET: http://www.inmet.gov.br/ in Portuguese), which is freely available on the Internet; we have the rights to use it in this book. However, the reader may use any free weather database on the Internet while developing applications.

Some examples from English language sources are listed below:

Weather variables

Any weather database will have almost the same variables:

  • Temperature (°C)
  • Humidity (%)
  • Pressure (mbar)
  • Wind speed (m/s)
  • Wind direction (°)
  • Precipitation (mm)
  • Sunny hours (h)
  • Sun energy (W/m²)

This data is usually collected from meteorological stations, satellites, or radar, on an hourly or daily basis.

Tip

Depending on the collection frequency, some variables may be summarized with average, minimum, or maximum values.

Data units may also vary from source to source; that's why units should always be observed.
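As a small illustration of both points, the sketch below summarizes a set of readings (for example, the hourly temperatures of one day) into average, minimum, and maximum values, skipping NaNs, and converts a wind speed from km/h to m/s. The class DailySummarySketch and both method names are hypothetical and are not part of the book's code.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;

// Hypothetical helpers for summarizing readings and converting units.
class DailySummarySketch {
    // Returns {average, minimum, maximum} of the non-NaN readings
    static double[] summarize(double[] readings) {
        DoubleSummaryStatistics stats = Arrays.stream(readings)
                .filter(v -> !Double.isNaN(v)) // ignore missing measurements
                .summaryStatistics();
        if (stats.getCount() == 0) {
            // No valid reading at all: report the summary itself as missing
            return new double[]{Double.NaN, Double.NaN, Double.NaN};
        }
        return new double[]{stats.getAverage(), stats.getMin(), stats.getMax()};
    }

    // Converts a wind speed from km/h to m/s (1 m/s = 3.6 km/h)
    static double kmhToMs(double kmh) {
        return kmh / 3.6;
    }
}
```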
