Humidity forecast using RNN

As the first use case of RNNs, we see how we can train and predict an RNN using the trainr() function. Our purpose is to forecast the humidity of a certain location as a function of the day. The input file contains daily weather observations from multiple Australian weather stations. These observations are obtained from the Australian Commonwealth Bureau of Meteorology and are subsequently processed to create a relatively large sample dataset for illustrating analytics, data mining, and data science using R and the rattle.data package. The weatherAUS dataset is regularly updated and updates of this package usually correspond to updates to this dataset. The data is updated from the Bureau of Meteorology website. The locationsAUS dataset records the location of each weather station. The source dataset is copyrighted by the Australian Commonwealth Bureau of Meteorology and is used with permission.

A CSV version of this dataset is available at the following link:
https://rattle.togaware.com/weatherAUS.csv

The weatherAUS dataset is a dataframe containing over 140,000 daily observations from over 45 Australian weather stations. This dataset contains the following variables:

  • Date: The date of observation (a Date object).
  • Location: The common name of the location of the weather station.
  • MinTemp: The minimum temperature in degrees celsius.
  • MaxTemp: The maximum temperature in degrees celsius.
  • Rainfall: The amount of rainfall recorded for the day in mm.
  • Evaporation: The so-called class a pan evaporation (mm) in the 24 hours to 9 a.m.
  • Sunshine: The number of hours of bright sunshine in the day.
  • WindGustDir: The direction of the strongest wind gust in the 24 hours to midnight.
  • WindGustSpeed: The speed (km/h) of the strongest wind gust in the 24 hours to midnight.
  • Temp9am: Temperature (degrees C) at 9 a.m.
  • RelHumid9am: Relative humidity (percent) at 9 a.m.
  • Cloud9am: Fraction of the sky obscured by clouds at 9 a.m. This is measured in oktas, which are a unit of eighths. It records how many eighths of the sky are obscured by cloud. A zero measure indicates completely clear sky whilst an 8 indicates that it is completely overcast.
  • WindSpeed9am: Wind speed (km/hr) averaged over 10 minutes prior to 9 a.m. 6 weatherAUS.
  • Pressure9am: Atmospheric pressure (hpa) reduced to mean sea level at 9 a.m.
  • Temp3pm: Temperature (degrees C) at 3 p.m.
  • RelHumid3pm: Relative humidity (percent) at 3 p.m.
  • Cloud3pm: Fraction of sky obscured by cloud (in oktas: eighths) at 3 p.m.
  • WindSpeed3pm: Wind speed (km/hr) averaged over 10 minutes prior to 3 p.m.
  • Pressure3pm: Atmospheric pressure (hpa) reduced to mean sea level at 3 p.m.
  • ChangeTemp: Change in temperature.
  • ChangeTempDir: Direction of change in temperature.
  • ChangeTempMag: Magnitude of change in temperature.
  • ChangeWindDirect: Direction of wind change.
  • MaxWindPeriod: Period of maximum wind.
  • RainToday: Integer 1 if precipitation (mm) in the 24 hours to 9 a.m. exceeds 1 mm, and 0 otherwise.
  • TempRange: Difference between minimum and maximum temperatures (degrees C) in the 24 hours to 9 a.m.
  • PressureChange: Change in pressure.
  • RISK_MM: The amount of rain. A kind of measure of the risk.
  • RainTomorrow: The target variable. Will it rain tomorrow?

In our case, we will use only two of the many variables contained in it:

  • Date: The date of observation (a Date object)
  • RelHumid9am: Relative humidity (percent) at 9 a.m

As said previously, the objective of this example is to forecast the humidity of a certain location as a function of the day. Here is the code that we will use in this example:

##########################################################
### Chapter 6 - Introduction to RNNs - using R ##########
########## Humidity forecasting with RNNs#################
##########################################################

library("rattle.data")

library("rnn")

data(weatherAUS)
View(weatherAUS)

#extract only 1 and 14 clumn and first 3040 rows (Albury location)
data=weatherAUS[1:3040,c(1,14)]
summary(data)

data_cleaned <- na.omit(data)
data_used=data_cleaned[1:3000]

x=data_cleaned[,1]
y=data_cleaned[,2]

head(x)
head(y)

X=matrix(x, nrow = 30)
Y=matrix(y, nrow = 30)

# Standardize in the interval 0 - 1
Yscaled = (Y - min(Y)) / (max(Y) - min(Y))
Y=t(Yscaled)

train=1:70
test=71:100

model <- trainr(Y = Y[train,],
X = Y[train,],
learningrate = 0.05,
hidden_dim = 16,
numepochs = 1000)

plot(colMeans(model$error),type='l',xlab='epoch',ylab='errors')

Yp <- predictr(model, Y[test,])

plot(as.vector(t(Y[test,])), col = 'red', type='l',
main = "Actual vs Predicted Humidity: testing set",
ylab = "Y,Yp")
lines(as.vector(t(Yp)), type = 'l', col = 'black')
legend("bottomright", c("Predicted", "Actual"),
col = c("red","black"),
lty = c(1,1), lwd = c(1,1))

############################################################

We begin analyzing the code line by line, explaining in detail all the features applied to capture the results:

library("rattle.data")
library("rnn")

The first two lines of the initial code are used to load the libraries needed to run the analysis.

Remember that to install a library that is not present in the initial distribution of R, you must use the install.package function. This is the main function to install packages. It takes a vector of names and a destination library, downloads the packages from the repositories and installs them. This function should be used only once and not every time you run the code.

The rattle.data library contains the datasets used as default examples by the rattle package. The datasets themselves can be used independently of the rattle package to illustrate analytics, data mining, and data science tasks.

The rnn library contains several functions for implementing an RNN in R:

data(weatherAUS)
View(weatherAUS)

With this command, we upload the dataset named weatherAUS, as mentioned, contained in the rattle.data library. In the second line, the view function is used to invoke a spreadsheet-style data viewer on the dataframe object, as shown in the following figure:

Returning to the code, as before, we use only two variables. In addition, the dataset contains data from different locations in Australia. We will limit our study to the first location (Albury):

data=weatherAUS[1:3040,c(1,14)]

Let's get a preliminary data analysis using the summary() function:

> summary(data)
Date Humidity9am
Min. :2008-12-01 Min. : 18.00
1st Qu.:2010-12-30 1st Qu.: 61.00
Median :2013-04-27 Median : 76.00
Mean :2013-03-22 Mean : 74.07
3rd Qu.:2015-05-27 3rd Qu.: 88.00
Max. :2017-06-25 Max. :100.00
NA's :9

The summary() function returns a set of statistics for each variable. In particular, it is useful to highlight the result provided for the Humidity9am variable; this represents our target. For this variable, nine cases of missing value were detected. To remove the missing values, we will use the na.omit() function; it drops any rows with missing values and forgets them forever:

data_cleaned <- na.omit(data) 
data_used=data_cleaned[1:3000]

With the second line of code, we limit our analysis to the first 3000 observations. Now we must set the input and the output data to the format required by the trainr() function:

x=data_cleaned[,1]
y=data_cleaned[,2]

In this way, x will represent our input and y our target:

X=matrix(x, nrow = 30)
Y=matrix(y, nrow = 30)

With this piece of code we construct a matrix of 30 lines and 100 columns with the data available. Recall is a size setting required for the function we will use for model building. We can now standardize this:

Yscaled = (Y - min(Y)) / (max(Y) - min(Y))
Y=t(Yscaled)

For this example, we have used the min-max method (usually called feature scaling) to get all the scaled data in the range [0,1]. The formula for this is as follows:

 

During the normalization, we must calculate the minimum and maximum values of each database column. Then we transpose the matrix obtained:

train=1:70
test=71:100

In these lines of code, the dataset is split into 70:30, with the intention of using 70 percent of the data at our disposal to train the network and the remaining 30 percent to test the network. Now is the time to build and train the model:

model <- trainr(Y = Y[train,],
X = Y[train,],
learningrate = 0.05,
hidden_dim = 16,
numepochs = 1000)

The trainr() function trains an RNN in R environment. We have used 16 neurons in the hidden layer and the number of epochs is 1,000. The trainr() function takes a few minutes as the training happens based on X and Y. Here are the last 10 Trained epoch results as displayed on the R prompt:

Trained epoch: 990 - Learning rate: 0.05
Epoch error: 0.382192317958489
Trained epoch: 991 - Learning rate: 0.05
Epoch error: 0.376313106021699
Trained epoch: 992 - Learning rate: 0.05
Epoch error: 0.380178990096884
Trained epoch: 993 - Learning rate: 0.05
Epoch error: 0.379260612039631
Trained epoch: 994 - Learning rate: 0.05
Epoch error: 0.380475314573825
Trained epoch: 995 - Learning rate: 0.05
Epoch error: 0.38169633378182
Trained epoch: 996 - Learning rate: 0.05
Epoch error: 0.373951666567461
Trained epoch: 997 - Learning rate: 0.05
Epoch error: 0.374880624458934
Trained epoch: 998 - Learning rate: 0.05
Epoch error: 0.384185799764121
Trained epoch: 999 - Learning rate: 0.05
Epoch error: 0.381408598560978
Trained epoch: 1000 - Learning rate: 0.05
Epoch error: 0.375245688144538

We can see the evolution of the algorithm by charting the error made by the algorithm to subsequent epochs:

plot(colMeans(model$error),type='l',xlab='epoch',ylab='errors')

This graph shows the epoch versus error:

We finally have the network trained and ready for use; now we can use it to make our predictions. Remember, we've set aside 30 percent of the available data to test the network. It's time to use it:

Yp <- predictr(model, Y[test,])

Finally, to compare the results, let's plot a graph showing the moisture content in the test set and the predicted results in order:

plot(as.vector(t(Y[test,])), col = 'red', type='l', 
main = "Actual vs Predicted Humidity: testing set",
ylab = "Y,Yp")
lines(as.vector(t(Yp)), type = 'l', col = 'black')
legend("bottomright", c("Predicted", "Actual"),
col = c("red","black"),
lty = c(1,1), lwd = c(1,1))

The following figure shows the actual values and predicted values:

From the analysis of the figure, it is possible to note one thing: the data is adapted to a good approximation to indicate that the model is able to predict the humidity conditions with good performance.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset