Verushen Coopoo, Application Engineer
Opti-Num Solutions (2019)
In this script, we use MATLAB’s machine learning capabilities to build an ensemble of regression trees that forecasts the 10-day volatility of Bitcoin. Working in a single environment, you can download up-to-date Bitcoin data, rapidly prototype a machine learning model, and then test and evaluate its effectiveness. MATLAB increases your productivity by giving you all the tools you need to get data, train models and test them.
We apply machine learning techniques, specifically an ensemble of regression trees, to forecast the 10-day volatility of Bitcoin. This example, adapted from a video by Kawee Numpacharoen, covers the full machine learning workflow: data acquisition and preprocessing, followed by model building and validation. In addition, feature selection is used to see which features contribute most significantly to the model. We supply the code in the form of this live script so that you can get started right away.
- Step 1: Get raw data
- Step 2: Preprocess data
- Step 3: Build machine learning model
- Step 4: Evaluate performance of model
STEP 1: Get raw data
Machine learning algorithms need data, and lots of it, in order to be trained. Quandl is a platform that provides financial, economic and alternative datasets. We will use it to get some Bitcoin data.
Connect to Quandl
To get the data, you will need your Quandl API key. See Quandl’s documentation for a guide on how to access it.
Load credential and connect to Quandl.
Set symbols for retrieving data. These are the symbols we have chosen to use as input data in this example.
Retrieve data from Quandl for the past 2000 days.
Pull the data for the first symbol, Bitcoin Market Price, and store it in the first element of d.
Initialise a timetable object to store the data – an improvement on cell arrays.
Loop through every symbol, store the symbol data temporarily in d, and then add it to the timetable T.
Create new variable names.
Save data in table format.
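The assembly steps above might be sketched as follows. Because the Quandl connection needs Datafeed Toolbox and a personal API key, this sketch replaces the download with synthetic per-symbol timetables. The symbol codes other than MKPRU, the synthetic values and the file name are assumptions made for illustration, not the contents of the original script.

```matlab
% Sketch only: synthetic data stands in for the Quandl download.
% Symbol codes other than MKPRU are assumptions for illustration.
symbols = {'MKPRU', 'NTRBL', 'AVBLS'};
nDays   = 2000;
dates   = datetime(2019,11,13) - caldays(nDays-1:-1:0)';

rng(0);                                   % reproducible synthetic series
d = cell(numel(symbols), 1);
for i = 1:numel(symbols)
    % In the real script, this is where the data for symbols{i} would be
    % retrieved from Quandl; here we generate a positive random walk.
    values = 100 * exp(cumsum(0.01 * randn(nDays, 1)));
    d{i}   = timetable(values, 'RowTimes', dates, ...
                       'VariableNames', symbols(i));
end

% Start the timetable with the first symbol, then join the rest by date.
T = d{1};
for i = 2:numel(symbols)
    T = synchronize(T, d{i}, 'intersection');
end

% Save the data (hypothetical file name, written to the temp folder).
save(fullfile(tempdir, 'bitcoinData.mat'), 'T');
```

Naming each column after its symbol when it is created keeps the joined timetable readable without a separate renaming pass.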
STEP 2: Preprocess data
Now that we have the data representing our various Bitcoin symbols, let’s generate some indicators which we will use as features to train our machine learning model.
Generate technical indicators
Generate n-day momentum indicators and remove the original factors, except MKPRU (Bitcoin market price in USD).
Loop over each symbol and each momentum indicator.
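A minimal sketch of this loop is shown below. It assumes that n-day momentum means the difference between the current value and the value n days earlier, and that the windows are 5, 10 and 20 days; the original script may define both differently. The data is synthetic and the second factor name is illustrative.

```matlab
% Sketch only: synthetic data, assumed window lengths and momentum definition.
rng(1);
nDays = 200;
Time  = datetime(2019,1,1) + caldays(0:nDays-1)';
T = timetable(Time, ...
    100 * exp(cumsum(0.02 * randn(nDays,1))), ...   % stand-in for market price
    1000 + cumsum(randn(nDays,1)), ...              % stand-in for a second factor
    'VariableNames', {'MKPRU', 'NTRBL'});

windows = [5 10 20];                     % assumed n-day windows
factors = T.Properties.VariableNames;

for i = 1:numel(factors)
    x = T.(factors{i});
    for n = windows
        % n-day momentum, taken here as x(t) - x(t-n); first n rows are NaN.
        mom = [NaN(n,1); x(n+1:end) - x(1:end-n)];
        T.(sprintf('%s_mom%d', factors{i}, n)) = mom;
    end
end

% Drop the original factors, keeping only MKPRU.
T = removevars(T, setdiff(factors, {'MKPRU'}));
```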
Do the same to get n-day returns.
The n-day returns are calculated with a custom local function.
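One way such a helper could look is sketched here, assuming simple (not logarithmic) returns. The name ndayReturn is hypothetical. It is written as an anonymous function so that the snippet runs on its own; in a live script it could equally be a local function at the end of the file.

```matlab
% Hypothetical helper: n-day simple return, x(t)/x(t-n) - 1, NaN-padded
% so that the output has the same length as the input column vector.
ndayReturn = @(x, n) [NaN(n,1); x(n+1:end) ./ x(1:end-n) - 1];

price = [100; 102; 101; 105; 110; 108];   % small illustrative series
r2 = ndayReturn(price, 2);                % 2-day returns
disp(r2');
```

The first n entries are NaN because no value from n days earlier exists for them; those rows need to be removed before training.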
Do the same to get n-day volatility.
Generate response variable (Y)
Set the response variable to be 10-day volatility.
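As a rough sketch, n-day volatility can be taken as the standard deviation of the last n daily log returns. That definition, the helper name ndayVol and the synthetic prices are assumptions; the original script may scale or annualise the figure, or shift the response forward in time.

```matlab
% Sketch only: synthetic prices; the volatility definition is an assumption.
rng(2);
price = 100 * exp(cumsum(0.03 * randn(300,1)));

dailyRet = [NaN; diff(log(price))];       % daily log returns

% n-day volatility: standard deviation of the last n daily returns.
% 'Endpoints','fill' leaves NaN where the trailing window is incomplete.
ndayVol = @(r, n) movstd(r, [n-1 0], 'Endpoints', 'fill');

vol10 = ndayVol(dailyRet, 10);

% Response variable: the 10-day volatility.
Y = vol10;
```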
Split data into training and testing datasets
Remove variables that we are planning not to use. We remove the time variable from T as this is not relevant in the construction of the machine learning model.
Split data into:
- data_train: training data
- data_test: testing data
Let’s use one year’s worth of data as the testing set.
Extract training and testing data.
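The split could be sketched as below, assuming daily observations so that one year corresponds to the last 365 rows. The data and variable names are synthetic placeholders.

```matlab
% Sketch only: synthetic timetable; variable names are illustrative.
rng(3);
nDays = 1000;
Time  = datetime(2019,11,13) - caldays(nDays-1:-1:0)';
T     = timetable(Time, randn(nDays,1), rand(nDays,1), ...
                  'VariableNames', {'Feature1', 'Y'});

% Drop rows with missing values (for example from NaN-padded indicators).
T = rmmissing(T);

% Convert to a table without the row times so that time is not a predictor.
data = timetable2table(T, 'ConvertRowTimes', false);

% Hold out the last 365 observations (one year of daily data) for testing.
nTest      = 365;
data_train = data(1:end-nTest, :);
data_test  = data(end-nTest+1:end, :);
```

Splitting by position rather than at random keeps the test set strictly later in time than the training set, which matters for a forecasting problem.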
Preview training data
Let’s have a look at what the volatility (the response variable) looks like for the training data.
We will calculate the mean and standard deviation of the volatility too.
Plot the graph:
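A possible version of this preview, using a synthetic stand-in for the training response, is sketched here. The reference lines for the mean and one standard deviation either side are an illustrative choice, and yline requires R2018b or later.

```matlab
% Sketch only: synthetic stand-in for the training response.
rng(4);
Y = abs(0.5 + 0.2 * randn(500,1));

mu    = mean(Y);
sigma = std(Y);

figure;
plot(Y);
hold on;
yline(mu, '--', 'Mean');
yline(mu + sigma, ':', '+1 std');
yline(mu - sigma, ':', '-1 std');
hold off;
xlabel('Observation');
ylabel('10-day volatility');
title('Training data: response variable');
```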
STEP 3: Build machine learning model
Now that we have all of this data, what is the easiest way to train a model on it? Using the Regression Learner App (check out this video for a reminder on how to use it and export functions from it), we prototyped a function which trains an ensemble of regression trees so that we can forecast the volatility. This function is at the bottom of the live script when you download it. In addition, we edited the function so that it optimises the hyperparameters available for training. This function is called in the line of code below.
Predict the volatility using the test data
Use the trainRegressionModel function that we exported from the Regression Learner App to train the model that will predict the 10-day volatility on the testing dataset (out-of-sample).
View one of the trees from the ensemble. It appears very crowded as we are dealing with many predictors.
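The sketch below is not the exported trainRegressionModel function itself. It is a reduced illustration of what such a function does, assuming a boosted ensemble built with fitrensemble from Statistics and Machine Learning Toolbox. The data, predictor names and settings are hypothetical.

```matlab
% Sketch only: requires Statistics and Machine Learning Toolbox.
% Data, predictor names and settings are assumptions for illustration.
rng(5);
n = 400;
predictorNames = {'P1', 'P2', 'P3', 'P4'};
data_train   = array2table(randn(n, 4), 'VariableNames', predictorNames);
data_train.Y = 2*data_train.P1 - data_train.P3.^2 + 0.1*randn(n, 1);

% Boosted ensemble of regression trees with fixed settings. To search the
% hyperparameters instead, fitrensemble accepts
% 'OptimizeHyperparameters', 'auto' (this takes noticeably longer).
ensemble = fitrensemble(data_train(:, predictorNames), data_train.Y, ...
    'Method', 'LSBoost', ...
    'NumLearningCycles', 50, ...
    'Learners', templateTree('MinLeafSize', 8));

% Mimic the structure exported by the Regression Learner App: a struct
% holding the model and a function handle that predicts from a table.
trainedModel.RegressionEnsemble = ensemble;
trainedModel.predictFcn = @(tbl) predict(ensemble, tbl(:, predictorNames));

% View the first tree of the ensemble as a graph.
view(ensemble.Trained{1}, 'Mode', 'graph');
```

Wrapping the prediction in a function handle that selects columns by name means the same call works on any table containing those predictors, including the test set.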
STEP 4: Evaluate performance of model
Using the model we have trained, let’s make a prediction in order to forecast the 10-day volatility of Bitcoin.
We pass the testing data as input to the predictFcn function of the trainedModel structure.
Compare the predicted volatility with actual volatility
Plot the predicted volatility vs. actual volatility.
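The root-mean-square error (RMSE) measures how accurate the model’s predictions are: it is the square root of the mean squared difference between the predicted and actual volatility, so lower values are better.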
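The comparison and the RMSE calculation can be sketched as follows. Synthetic vectors stand in for the actual test response and for the output of predictFcn, and the variable names are illustrative.

```matlab
% Sketch only: synthetic values stand in for the model output.
rng(6);
actual    = abs(0.6 + 0.2 * randn(365, 1));   % stand-in for the test response
predicted = actual + 0.1 * randn(365, 1);     % stand-in for predictFcn output

% Root-mean-square error between predicted and actual volatility.
testRMSE = sqrt(mean((predicted - actual).^2));
fprintf('RMSE: %.4f\n', testRMSE);

figure;
plot(actual);
hold on;
plot(predicted);
hold off;
legend('Actual', 'Predicted');
xlabel('Test observation');
ylabel('10-day volatility');
```

Because the RMSE is in the same units as the response, it can be read directly against the typical level of the volatility series.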
The RMSE of approximately 15% is suitable for a first attempt, but the modelling process can be improved.
When dealing with several features in a machine learning exercise, a recurring question is “How important are these features to my model?” MATLAB has several feature selection algorithms that address this question. We use RReliefF, a feature selection algorithm designed for regression with a numerical response, as in our example.
The RReliefF algorithm returns two outputs: idx, which lists the predictor indices in order of importance, and weights, which holds the importance weight of each predictor.
Let’s plot the results to see what the most important features are.
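A minimal sketch of this step is shown below, using the relieff function from Statistics and Machine Learning Toolbox, which performs RReliefF when the response is numeric. The data is synthetic, and the choice of 10 nearest neighbours and the predictor names are assumptions. With this construction, predictor P2 should come out on top.

```matlab
% Sketch only: requires Statistics and Machine Learning Toolbox.
rng(7);
n = 300;
X = randn(n, 5);
y = 3*X(:,2) + 0.1*randn(n, 1);         % only predictor 2 drives the response
predictorNames = {'P1', 'P2', 'P3', 'P4', 'P5'};

k = 10;                                  % number of nearest neighbours (assumed)
[idx, weights] = relieff(X, y, k);       % numeric y => RReliefF (regression)

% idx lists predictors from most to least important;
% weights(j) is the weight of predictor j, in the original column order.
figure;
bar(weights(idx));
xticks(1:numel(idx));
xticklabels(predictorNames(idx));
ylabel('Predictor importance weight');
```

Indexing weights with idx sorts the bars from most to least important, which makes a threshold for discarding weak features easy to read off the plot.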
The graph above ranks all 60+ features by a weight that describes their importance. The three most important features are all momentum indicators, specifically those for the number of transactions per block and the average block size. This suggests that, if we were to retrain the model, we could potentially obtain more accurate results by keeping these features and dropping those below a certain threshold of importance.
Why don’t you download the code and experiment with only including the features which are weighted as important? You could start by determining the minimum number of features required to replicate this model, and then find out how many features are actually required to improve it! Sometimes, including too many features can degrade the quality of a machine learning model.
In this example, we have used MATLAB and demonstrated the machine learning workflow: data import, preprocessing and model building. We tested our model, and used feature selection as a possible way to improve it. We show that MATLAB is one cohesive environment for your machine learning experience – from data collection, to model building and model validation.
Numpacharoen, K. Machine Learning Applications in Risk Management: Forecasting Bitcoin Volatility Using the Regression Learner App. [Online]. https://www.mathworks.com/videos/machine-learning-applications-in-risk-management-forecasting-bitcoin-volatility-using-the-regression-learner-app-1536231842528.html Accessed 13/11/2019
Meet the Author
Verushen is a financial Application Engineer at Opti-Num Solutions. He is passionate about using statistical and probabilistic methods to unearth insights from data with the goal of benefitting society. Verushen’s intrigue for technology and human biology led him to pursue a BEngSc in Biomedical Engineering and BScEng in Electrical Engineering from the University of the Witwatersrand. For his Honours project, Verushen co-developed an eye-tracking system which allowed a user to move and click a mouse cursor by moving their pupil.