Summary

In this chapter, we implemented a complete ML pipeline, from collecting historical data, to transforming it into a format suitable for testing hypotheses, to training ML models and running predictions on live data, with the possibility of evaluating many different models and selecting the best one.

The test results showed that, as in the original dataset, about 600,000 minutes out of 2.4 million can be classified as an increasing price (the close price was higher than the open price); the dataset can therefore be considered imbalanced. Although random forests usually perform well on imbalanced datasets, an area under the ROC curve of 0.74 is not the best result. As we need fewer false positives (fewer cases where we trigger a purchase and the price then drops), we might consider a model that punishes such errors more strictly.
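
One way to punish false positives more strictly is to attach per-row instance weights and pass them to the classifier. The following is a minimal sketch, assuming a DataFrame with label and features columns (the column names and the weight value are illustrative, not this chapter's exact schema):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.ml.classification.RandomForestClassifier

// Give the "price did not increase" class a higher instance weight, so that
// misclassifying those rows (a false positive) costs more during training.
def withClassWeights(df: DataFrame, negativeWeight: Double = 3.0): DataFrame =
  df.withColumn("weight", when(col("label") === 1.0, 1.0).otherwise(negativeWeight))

// setWeightCol on RandomForestClassifier requires Spark 3.x; on older versions,
// LogisticRegression accepts a weight column in the same way.
val weightedRf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setWeightCol("weight")
```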

Although the results achieved by the classifiers can't be used for profitable trading, they provide a foundation on top of which new approaches can be tested relatively quickly. Some possible directions for further development are listed here:

  • Implement the pipeline discussed at the beginning of the chapter: split your time series data into several clusters and train a regression model/classifier for each of them; then classify recent data into one of the clusters and use the prediction model trained for that cluster. Since ML is, by definition, about deriving patterns from data, there might not be a single pattern that fits the whole history of Bitcoin; that's why we need to understand that a market can be in different phases, and each phase has its own pattern (a cluster-then-predict sketch follows this list).
  • One of the major challenges with Bitcoin price prediction might be that the training (historical) data doesn't belong to the same distribution as the test data when the data is split randomly into train-test sets. As the patterns in price changed between 2013 and 2016, they might belong to completely different distributions. This might require a manual inspection of the data and some infographics; it is likely that someone has already done this research (a chronological split sketch follows this list).
  • One of the main things to try would be to train two one-versus-all classifiers: one trained to predict when the price rises by more than $20, for example, and another to predict when the price drops by more than $20; it then makes sense to take long or short positions, respectively (see the labeling sketch after this list).
  • Maybe predicting the delta of the next minute isn't what we need; we'd rather predict the average price. As the open price can be much higher than the last minute's close price, and the close price of the next minute can be slightly lower than its open but still higher than the current price, that would still make for a profitable trade. So exactly how to label the data is also an open question.
  • Try different time-series window sizes (even 50 minutes might suit) using the ARIMA time series prediction model, as it is one of the most widely used algorithms. Then try to predict the price change not for the next minute, but for the following 2-3 minutes. Additionally, try incorporating trading volume as well.
  • Label the data as a price increase if the price was higher by $20 during at least one of the three following minutes, so that we can make a profit from the trade (the labeling sketch after this list covers this variant too).
  • Currently, the Scheduler isn't synchronized with Cryptocompare minutes. This means we can get data about the minute interval 12:00:00-12:00:59 at any point during the following minute, at 12:01:00 or as late as 12:01:59. In the latter case, it doesn't make sense to trade, as we would be making a prediction based on data that is already old (see the boundary-aligned scheduler sketch after this list).
  • Instead of making a prediction every minute on older data to accumulate prediction results for the actor, it's better to take the maximum available HistoMinute data (seven days), split it into time series data using the Scala script that was used for the historical data, and predict for seven days' worth of data. Running this as a scheduled job once a day should reduce the load on the DB and the PredictionActor.
  • Compared to usual datasets, where the order of rows doesn't matter much, Bitcoin historical data rows are sorted in ascending order of date, which means that:
    • The latest data might be more relevant to today's price, and less can be more: taking a smaller subset of the data might give better performance
    • The way the data is subsampled (split into train-test sets) can matter
    • Finally, try an LSTM network for even better predictive accuracy (see Chapter 10 for some clues)
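
The cluster-then-predict idea from the first bullet could look roughly like the following sketch, assuming a prepared Spark DataFrame with features and label columns (names chosen for illustration, not taken from this chapter's code):

```scala
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.classification.{RandomForestClassifier, RandomForestClassificationModel}
import org.apache.spark.sql.DataFrame

// Split the training data into k "market phases" and train one classifier per phase.
def trainPerClusterModels(train: DataFrame, k: Int = 4)
    : (KMeansModel, Map[Int, RandomForestClassificationModel]) = {
  val clusterer = new KMeans()
    .setK(k).setFeaturesCol("features").setPredictionCol("cluster")
    .fit(train)
  val clustered = clusterer.transform(train)

  val models = (0 until k).map { c =>
    val phase = clustered.filter(clustered("cluster") === c)
    c -> new RandomForestClassifier()
      .setLabelCol("label").setFeaturesCol("features")
      .fit(phase)
  }.toMap

  (clusterer, models)
}

// At prediction time, assign the recent window to a cluster and reuse that cluster's model.
def predictWithPhase(recent: DataFrame,
                     clusterer: KMeansModel,
                     models: Map[Int, RandomForestClassificationModel]): DataFrame = {
  val cluster = clusterer.transform(recent).select("cluster").head.getInt(0)
  models(cluster).transform(recent)
}
```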
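
For the distribution-shift and subsampling points, a chronological split is a simple alternative to a random one. This sketch assumes a numeric (epoch) timestamp column:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Split by time instead of randomly: train on the earlier fraction of the data
// and test on the most recent rows, which respects the ordering of the price series.
def timeOrderedSplit(df: DataFrame, trainFraction: Double = 0.8): (DataFrame, DataFrame) = {
  val cutoff = df.stat.approxQuantile("timestamp", Array(trainFraction), 0.001).head
  (df.filter(col("timestamp") <= cutoff), df.filter(col("timestamp") > cutoff))
}
```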
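
Labeling for the two one-versus-all classifiers, including the "higher by $20 within three minutes" variant, could be prepared along these lines (the timestamp and close column names are assumptions about the historical data schema):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, min, when}

// Label "long" when the close price rises by at least $20 within the next `horizon`
// minutes, and "short" when it drops by at least $20 in the same window.
// Note: a window without partitionBy pulls the data onto one partition; fine for a sketch.
def addDirectionalLabels(df: DataFrame, delta: Double = 20.0, horizon: Int = 3): DataFrame = {
  val next = Window.orderBy("timestamp").rowsBetween(1, horizon)
  df.withColumn("labelLong",  when(max(col("close")).over(next) >= col("close") + delta, 1.0).otherwise(0.0))
    .withColumn("labelShort", when(min(col("close")).over(next) <= col("close") - delta, 1.0).otherwise(0.0))
}
```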
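
Synchronizing the Scheduler with Cryptocompare minute boundaries can be approximated with the classic Akka scheduler API, as in this sketch (the receiver actor and message are placeholders, not the project's actual names):

```scala
import akka.actor.{ActorRef, ActorSystem}
import scala.concurrent.duration._

// Delay the first tick until the start of the next minute, then fire every minute,
// so predictions are made right after a Cryptocompare minute closes.
def scheduleOnMinuteBoundary(system: ActorSystem, receiver: ActorRef, message: Any): Unit = {
  import system.dispatcher
  val millisIntoMinute = System.currentTimeMillis() % 60000L
  val delayToNextMinute = (60000L - millisIntoMinute).millis
  system.scheduler.schedule(delayToNextMinute, 1.minute, receiver, message)
}
```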

Understanding variations in genome sequences assists us in identifying people who are predisposed to common diseases, solving rare diseases, and finding the corresponding group of individuals within a larger population. Although classical ML techniques allow researchers to identify groups (clusters) of related variables, the accuracy and effectiveness of these methods diminish for large and high-dimensional datasets such as the whole human genome. On the other hand, deep neural network architectures (the core of deep learning) can better exploit large-scale datasets to build complex models.

In the next chapter, we will see how to apply the K-means algorithm to large-scale genomic data from the 1000 Genomes Project, aiming at clustering genotypic variants at the population scale. Then we'll train an H2O-based deep learning model to predict geographic ethnicity. Finally, a Spark-based random forest will be used to enhance the predictive accuracy.
