Data preparation

Next, we prepare the dataset in the form that our LSTM network expects. First, we read the input dataset:

import pandas as pd
df = pd.read_csv('data/btc.csv')

Then we display a few rows of the dataset:

df.head()

The preceding code generates the following output:

As shown in the preceding data frame, the Close column represents the closing price of Bitcoin. We need only the Close column to make predictions, so we take that particular column alone:

data = df['Close'].values

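Next, we standardize the data so that it has zero mean and unit variance:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# StandardScaler expects a 2D array, so reshape the prices into a single column
data = scaler.fit_transform(data.reshape(-1, 1))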
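
To make the scaling step concrete, the following minimal sketch reproduces the same arithmetic in plain Python. It assumes the population standard deviation (dividing by n), which is what StandardScaler uses by default. The prices list is made up for illustration and is not taken from the dataset.

```python
from statistics import fmean, pstdev

# Hypothetical closing prices, used only for illustration.
prices = [6500.0, 6700.0, 6400.0, 7100.0, 6900.0]

mean = fmean(prices)
std = pstdev(prices)  # population standard deviation (divides by n)

# Standardize: subtract the mean and divide by the standard deviation.
scaled = [(p - mean) / std for p in prices]

# Invert the scaling to recover the original prices.
restored = [z * std + mean for z in scaled]

print([round(z, 3) for z in scaled])
```

With scikit-learn, the last step corresponds to calling scaler.inverse_transform(), which is useful later for converting predictions back into actual prices.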

We then plot the scaled data to observe how the Bitcoin price changes over time. Because the prices are standardized, the values on the y axis are small numbers centered around zero rather than the actual prices:

import matplotlib.pyplot as plt
plt.plot(data)
plt.xlabel('Days')
plt.ylabel('Price')
plt.grid()
plt.show()

The following plot is generated:

Now, we define a function called get_data(), which generates the inputs and targets. It takes data and window_size as inputs and returns the input sequences along with their corresponding target values.

What is the window size here? Each input consists of window_size consecutive values, and its target y is the value that immediately follows that window. For instance, as shown in the following table, with window_size equal to 1, each y value is just one time step ahead of the corresponding x value:

x       y
0.13    0.56
0.56    0.11
0.11    0.40
0.40    0.63
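
The pairing in the table can be reproduced with a short sketch. It assumes the table was built from the single series 0.13, 0.56, 0.11, 0.40, 0.63, which is what the repeated values suggest, and it covers only the window_size equal to 1 case.

```python
# Series assumed to underlie the table above.
series = [0.13, 0.56, 0.11, 0.40, 0.63]

# With a window of one step, each x is paired with the value that follows it.
x = series[:-1]
y = series[1:]

for x_value, y_value in zip(x, y):
    print(x_value, y_value)
```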

The get_data() function is defined as follows:

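def get_data(data, window_size):
    X = []
    y = []

    i = 0

    # slide the window forward while a value remains to serve as the target
    while (i + window_size) <= len(data) - 1:
        X.append(data[i:i+window_size])  # window_size consecutive values
        y.append(data[i+window_size])    # the value right after the window

        i += 1
    assert len(X) == len(y)
    return X, y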

We choose window_size as 7 and generate the input and output:

X, y = get_data(data, window_size = 7)
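
To see what a window of 7 produces, the following sketch applies the same windowing logic to a toy series of 10 integers. The helper name get_windows and the toy data are hypothetical, introduced here only for illustration; the helper is meant to behave like get_data() but is written with list comprehensions.

```python
def get_windows(data, window_size):
    """Return (X, y): each X[i] holds window_size values and y[i] follows it."""
    n_samples = len(data) - window_size
    X = [data[i:i + window_size] for i in range(n_samples)]
    y = [data[i + window_size] for i in range(n_samples)]
    return X, y

# Toy series standing in for the scaled prices.
toy = list(range(10))
X_toy, y_toy = get_windows(toy, window_size=7)

for window, target in zip(X_toy, y_toy):
    print(window, "->", target)
```

A series of n points yields n - window_size samples, so the 10 toy points give only 3 input windows.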

We use the first 1,000 samples as the training set and the remaining samples as the test set:

import numpy as np

# train set
X_train = np.array(X[:1000])
y_train = np.array(y[:1000])

# test set
X_test = np.array(X[1000:])
y_test = np.array(y[1000:])

The shape of X_train is shown as follows:

X_train.shape

(1000, 7, 1)

What does the preceding shape mean? It corresponds to (sample_size, time_steps, features), which is exactly the input format that the LSTM network requires:

  • 1000 is the number of data points (sample_size)
  • 7 is the window size (time_steps)
  • 1 is the dimension of our dataset (features)
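
The three dimensions can be checked on a small example. This is a sketch under assumptions: the array below is a hypothetical stand-in for the scaled prices, with 20 values in a single column to match the (n, 1) shape that fit_transform returns. The trailing 1 in the final shape comes from that single column.

```python
import numpy as np

# Hypothetical stand-in for the scaled prices: 20 values in one column.
data = np.arange(20, dtype=float).reshape(-1, 1)

window_size = 7
n_samples = len(data) - window_size

# Stack the windows: each one is a (7, 1) slice of the column.
X = np.array([data[i:i + window_size] for i in range(n_samples)])
y = np.array([data[i + window_size] for i in range(n_samples)])

print(X.shape)  # (13, 7, 1)
print(y.shape)  # (13, 1)
```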
