First, we need to load raw data into our Java environment. Data can be stored in a variety of sources, from plain text files to structured database systems. One basic and widely used format is CSV (Comma-Separated Values). In addition, we'll need to transform this data and perform selection before presenting it to the neural network.
To handle these tasks, we need some auxiliary classes in the package edu.packt.neuralnet.data. The first is LoadCsv, which reads CSV files:
public class LoadCsv {
    //Path and file name separated for compatibility
    private String PATH;
    private String FILE_NAME;
    private double[][] dataMatrix;
    private boolean columnsInFirstRow=false;
    private String separator = ",";
    private String fullFilePath;
    private String[] columnNames;
    final double missingValue=Double.NaN;
    //Constructors
    public LoadCsv(String path,String fileName)
    //…
    public LoadCsv(String fileName,boolean _columnsInFirstRow,String _separator)
    //…
    //Method to load data from file returning a matrix
    public double[][] getDataMatrix(String fullPath,boolean _columnsInFirstRow,String _separator)
    //…
    //Static method for calls without instantiating LoadCsv object
    public static double[][] getData(String fullPath,boolean _columnsInFirstRow,String _separator)
    //…
    //Method for saving data into csv file
    public void save()
    //…
    //…
}
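The core of any CSV loader is turning a line of text into numeric values, with unparseable fields mapped to the missing-value marker (Double.NaN, as in the missingValue field above). The following minimal sketch illustrates this step; the class and method names are our own for illustration and are not part of the book's code:

```java
public class CsvParseSketch {
    // Parse one CSV line into doubles; fields that cannot be
    // parsed as numbers become NaN, the missing-value marker
    public static double[] parseLine(String line, String separator) {
        String[] fields = line.split(separator, -1);
        double[] row = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            try {
                row[i] = Double.parseDouble(fields[i].trim());
            } catch (NumberFormatException e) {
                row[i] = Double.NaN; // mirrors LoadCsv's missingValue convention
            }
        }
        return row;
    }
}
```

Note that split with the limit -1 preserves trailing empty fields, so a line ending in a separator still yields a full-width row (filled with NaN where empty).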
We are also creating a class to store the raw data loaded from CSV in a structure containing not only the data but also information about it, such as column names. This class will be called DataSet and lives in the same package:
public class DataSet {
    //column names list
    public ArrayList<String> columns;
    //data matrix
    public ArrayList<ArrayList<Double>> data;
    public int numberOfColumns;
    public int numberOfRecords;
    //creating from Java matrix
    public DataSet(double[][] _data,String[] _columns){
        numberOfRecords=_data.length;
        numberOfColumns=_data[0].length;
        columns = new ArrayList<>();
        for(int i=0;i<numberOfColumns;i++){
            //…
            columns.add(_columns[i]);
            //…
        }
        data = new ArrayList<>();
        for(int i=0;i<numberOfRecords;i++){
            data.add(new ArrayList<Double>());
            for(int j=0;j<numberOfColumns;j++){
                data.get(i).add(_data[i][j]);
            }
        }
    }
    //creating from csv file
    public DataSet(String filename,boolean columnsInFirstRow,String separator){
        LoadCsv lcsv = new LoadCsv(filename,columnsInFirstRow,separator);
        double[][] _data = lcsv.getDataMatrix(filename, columnsInFirstRow, separator);
        numberOfRecords=_data.length;
        numberOfColumns=_data[0].length;
        columns = new ArrayList<>();
        if(columnsInFirstRow){
            String[] columnNames = lcsv.getColumnNames();
            for(int i=0;i<numberOfColumns;i++){
                columns.add(columnNames[i]);
            }
        }
        else{
            //default column names: Column0, Column1, etc.
            for(int i=0;i<numberOfColumns;i++){
                columns.add("Column"+String.valueOf(i));
            }
        }
        data = new ArrayList<>();
        for(int i=0;i<numberOfRecords;i++){
            data.add(new ArrayList<Double>());
            for(int j=0;j<numberOfColumns;j++){
                data.get(i).add(_data[i][j]);
            }
        }
    }
    //…
    //method for adding new column
    public void addColumn(double[] _data,String name)
    //…
    //method for appending new data, number of columns must correspond
    public void appendData(double[][] _data)
    //…
    //getting all data
    public double[][] getData(){
        return ArrayOperations.arrayListToDoubleMatrix(data);
    }
    //getting data from specific columns
    public double[][] getData(int[] columns){
        return ArrayOperations.getMultipleColumns(getData(), columns);
    }
    //getting data from one column
    public double[] getData(int col){
        return ArrayOperations.getColumn(getData(), col);
    }
    //method for saving the data in a csv file
    public void save(String filename,String separator)
    //…
}
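The conversion that both DataSet constructors perform, copying a plain double[][] into a nested ArrayList, can be isolated into a small sketch. The class and method names below are illustrative only:

```java
import java.util.ArrayList;

public class MatrixSketch {
    // Copy a double[][] into the nested ArrayList layout
    // that DataSet uses for its data field
    public static ArrayList<ArrayList<Double>> toNestedList(double[][] m) {
        ArrayList<ArrayList<Double>> out = new ArrayList<>();
        for (double[] row : m) {
            ArrayList<Double> r = new ArrayList<>();
            for (double v : row) {
                r.add(v);
            }
            out.add(r);
        }
        return out;
    }
}
```

The nested-list layout trades some memory overhead for the ability to append rows and columns dynamically, which is why DataSet offers addColumn and appendData on top of it.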
In Chapter 4, Self-Organizing Maps, we created the class ArrayOperations in the package edu.packt.neuralnet.math to handle operations involving arrays of data. This class has a large number of static methods, and it would be impractical to cover all of them here; more information can be found in Appendix C.
To make the task easier, we've implemented a static method in the LoadCsv class that loads a CSV file and automatically converts it into a DataSet object:
public static DataSet getDataSet(String fullPath,boolean _columnsInFirstRow,String _separator){
    LoadCsv lcsv = new LoadCsv(fullPath,_columnsInFirstRow,_separator);
    lcsv.columnsInFirstRow=_columnsInFirstRow;
    lcsv.separator=_separator;
    try{
        lcsv.dataMatrix=lcsv.csvData2Matrix(fullPath);
        System.out.println("File "+fullPath+" loaded!");
    }
    catch(IOException ioe){
        System.err.println("Error while loading CSV file. Details: " + ioe.getMessage());
    }
    return new DataSet(lcsv.dataMatrix, lcsv.columnNames);
}
A time series structure is essential for all problems involving a time dimension or domain, such as forecasting and prediction. A class called TimeSeries implements some time-related attributes, such as the time column and the delay. Let's take a look at the structure of this class:
public class TimeSeries extends DataSet {
    //index of the column containing time information
    private int indexTimeColumn;
    public TimeSeries(double[][] _data,String[] _columns){
        super(_data,_columns); //just a call to superclass constructor
    }
    public TimeSeries(String path, String filename){
        super(path,filename);
    }
    public TimeSeries(DataSet ds){
        super(ds.getData(),ds.getColumns());
    }
    public void setIndexColumn(int col){
        this.indexTimeColumn=col;
        this.sortBy(indexTimeColumn);
    }
    //…
}
In time series, one frequent operation is the delay, or shift, of values. For example, suppose we want to include in the processing not the current daily temperature but the two past values. Considering temperature as a time series with a time column (date), we must shift the values by the desired number of cycles (one and two, in this example).
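The essence of the shift operation can be sketched on a plain array: each value moves forward by the given number of positions, and the vacated slots at the start (for which no past value exists) are filled with NaN. The class below is our own illustration, not the book's ArrayOperations implementation, and it assumes the rows are already sorted by the time column:

```java
import java.util.Arrays;

public class ShiftSketch {
    // Shift a series forward by 'shift' positions (delay);
    // positions with no past value available become NaN
    public static double[] shift(double[] series, int shift) {
        double[] out = new double[series.length];
        Arrays.fill(out, Double.NaN);
        for (int i = 0; i < series.length; i++) {
            int j = i + shift;
            if (j >= 0 && j < series.length) {
                out[j] = series[i];
            }
        }
        return out;
    }
}
```

For a daily temperature series {20, 21, 23, 22}, a shift of 1 yields {NaN, 20, 21, 23}: each day's row now carries the previous day's value, ready to serve as a delayed input.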
While working with time series, one should pay attention to two points:
In the ArrayOperations class, we implemented a method shiftColumn to shift a column of a matrix, taking a time column as reference. This method is called from a method of the same name in the TimeSeries class, which is then used in the shift method:
public double[] shiftColumn(int col,int shift){
    double[][] _data = ArrayOperations.arrayListToDoubleMatrix(data);
    return ArrayOperations.shiftColumn(_data, indexTimeColumn, shift, col);
}

public void shift(int col,int shift){
    String colName = columns.get(col);
    if(shift>0)
        colName=colName+"_"+String.valueOf(shift);
    else
        colName=colName+"__"+String.valueOf(-shift);
    addColumn(shiftColumn(col,shift),colName);
}
NaNs are undesired values often present after loading or transforming data. They are undesired because we cannot operate on them: if we feed a NaN value into a neural network, the output will certainly be NaN, only consuming more computational power. That's why it is better to drop them. In the DataSet class, we've implemented two methods to drop NaNs: one substitutes a given value for them, and the other drops an entire row if it has at least one missing value, as shown in the following code:
// dropping with a substituting value
public void dropNaN(double substvalue)
//…
// dropping the entire row
public void dropNaN()
//…
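The row-dropping variant amounts to filtering out every row that contains at least one NaN. A minimal self-contained sketch of that logic (with names of our own choosing, not the book's implementation) could look like this:

```java
import java.util.ArrayList;
import java.util.List;

public class DropNaNSketch {
    // Keep only the rows of the matrix that contain no NaN values
    public static double[][] dropNaNRows(double[][] data) {
        List<double[]> kept = new ArrayList<>();
        for (double[] row : data) {
            boolean hasNaN = false;
            for (double v : row) {
                if (Double.isNaN(v)) { hasNaN = true; break; }
            }
            if (!hasNaN) {
                kept.add(row);
            }
        }
        return kept.toArray(new double[0][]);
    }
}
```

Dropping rows is safer when records are plentiful, while substituting a value (for example, a column mean) preserves the record count; which variant to use depends on how much data the application can afford to lose.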
Now that we have the tools to get the data, let's find some datasets on the Internet. In this chapter, we are going to use data from the Brazilian Institute of Meteorology (INMET: http://www.inmet.gov.br/, in Portuguese), which is freely available on the Internet and which we have the rights to use in this book. However, the reader may use any free weather database on the Internet while developing applications.
Some examples from English language sources are listed below: