Building an LSTM recurrent neural network for predicting stock market prices, applied to each of the following time series datasets:
- Daily prices of natural gas, from January 1997 to 2018. link
- Daily prices of AMD stock, from 2009 to 2018. link
- Daily prices of Google stock, from 2009 to 2018. link
There are two copies of every notebook in this repository. The small version contains the same content as the large one, except for the figures that make the large notebooks so big; these figures show the performance (real prices vs. predictions) of every model on the testing data. If you want to see these figures, they are available in the large versions of the notebooks. For browsing, open the small version: the large copies are too big to be viewed on GitHub or nbviewer, so you have to download them and open them with Jupyter Notebook.
Important Note: If the GitHub viewer can't load the Jupyter Notebooks, you can use nbviewer to view them without downloading them. Just copy the link of the notebook you want to view, go to nbviewer, paste the link there, and click Go!
- Extracting samples (sequences of days whose prices we want to predict) from the last part of the data, to be used later for testing the resulting model. The rest of the data is then used for training and testing the model.
- Scaling the data in the range (0,1).
Note: we need to scale because fitting the model on unscaled data with a wide range of values can slow down the learning and convergence of the network, and in some cases prevents the network from effectively learning the problem.
- Generating the input and output sequences that will be fed to the Keras LSTM model. A randomly generated list of lengths (each within the range [5, 100]) defines the windows (lists of consecutive days) that the model uses to predict the target (next) day.
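The scaling and sequence-generation steps above can be sketched as follows. This is a minimal illustration with made-up helper names and a stand-in price array, not the notebooks' exact code:

```python
import numpy as np

def scale_series(prices):
    """Min-max scale a 1-D price array into the range (0, 1)."""
    lo, hi = prices.min(), prices.max()
    return (prices - lo) / (hi - lo)

def make_sequences(series, min_len=5, max_len=100, n_windows=1000, seed=0):
    """Cut (window, next-day target) pairs with random window lengths.

    Each window is a run of consecutive days; the target is the day
    immediately after the window.
    """
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for _ in range(n_windows):
        length = int(rng.integers(min_len, max_len + 1))
        start = int(rng.integers(0, len(series) - length))
        xs.append(series[start:start + length])
        ys.append(series[start + length])
    return xs, np.array(ys)

prices = np.linspace(10.0, 20.0, 300)   # stand-in for real daily prices
scaled = scale_series(prices)
X, y = make_sequences(scaled)
```

Because the window lengths vary, the resulting inputs are ragged; in practice they would be padded or fed to the LSTM in per-length batches.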
Six models are going to be trained and tested in order to compare the different results and pick out what works best. Moreover, two methods of splitting the data will be used, as follows:
- Data sequences will be split into training and testing sets (no validation set)
- Data sequences will be split into training, testing and validation sets.
Note: This results in 12 models: the first 6 use the first splitting method and the second 6 use the second splitting method.
Concretely, these are the characteristics that we are going to build the models on:
Models trained on data that is split into training and testing sets (no validation)
- Model 1: Save 100 input samples from the last part of the data. (70% training data, 30% testing data)
- Model 2: Save 100 input samples. (50% training, 50% testing)
- Model 3: Save 1000 input samples (70% training, 30% testing)
- Model 4: Save 1000 input samples (50% training, 50% testing)
- Model 5: Save 2000 input samples (70% training, 30% testing)
- Model 6: Save 2000 input samples (50% training, 50% testing)
Models trained on data that is split into training, testing and validation sets
- Model 1: Save 100 input samples from the last part of the data. (70% training data, 15% validation data, 15% testing data)
- Model 2: Save 100 input samples. (50% training, 25% validation, 25% testing)
- Model 3: Save 1000 input samples (70% training, 15% validation, 15% testing)
- Model 4: Save 1000 input samples (50% training, 25% validation, 25% testing)
- Model 5: Save 2000 input samples (70% training, 15% validation, 15% testing)
- Model 6: Save 2000 input samples (50% training, 25% validation, 25% testing)
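The two splitting methods above can be sketched as follows. The helper names and fractions are illustrative; the second helper assumes (as the percentages in the list suggest) that the non-training remainder is halved into validation and testing sets:

```python
def split_train_test(sequences, train_frac=0.7):
    """Method 1: split into training and testing sets (no validation)."""
    cut = int(len(sequences) * train_frac)
    return sequences[:cut], sequences[cut:]

def split_train_val_test(sequences, train_frac=0.5):
    """Method 2: split into training, validation and testing sets.

    The part not used for training is divided equally between
    validation and testing.
    """
    cut = int(len(sequences) * train_frac)
    rest = sequences[cut:]
    mid = len(rest) // 2
    return sequences[:cut], rest[:mid], rest[mid:]

data = list(range(100))
train, test = split_train_test(data, 0.7)        # 70 / 30
tr, val, te = split_train_val_test(data, 0.5)    # 50 / 25 / 25
```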
- Fit the model on the training data.
- Evaluate the model on the testing data and plot predictions with the real prices.
- Evaluate the model on the last 100, 1000 and 2000 days, respectively, and plot the model's predictions for these sequences of days against the real prices.
- Visually: Compare the performance of each model (measured using MSE) and identify the model with the best value using barplots.
- Tabularly: Evaluate the performance of the models, save the results in Pandas dataframes, and store them as CSV files.
Note: the evaluation will be on the testing data and on the samples extracted from the last part of the data (the last 100, 1000 and 2000 days).
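The evaluation and bookkeeping above can be sketched as follows. The model names, price arrays and scores are illustrative stand-ins, not the notebooks' actual results; in the notebooks each score comes from a trained model's predictions:

```python
import numpy as np
import pandas as pd

def mse(real, predicted):
    """Mean squared error between real prices and a model's predictions."""
    real, predicted = np.asarray(real), np.asarray(predicted)
    return float(np.mean((real - predicted) ** 2))

# Hypothetical scores for two models on the same (tiny) test slice.
scores = {
    "Model 1": mse([10.0, 11.0, 12.0], [10.0, 10.0, 14.0]),   # (0 + 1 + 4) / 3
    "Model 2": mse([10.0, 11.0, 12.0], [10.5, 11.5, 12.5]),   # 0.25
}

# Store the comparison as a dataframe and a CSV file, as the notebooks do.
results = pd.DataFrame({"model": list(scores), "mse_testing": list(scores.values())})
results.to_csv("results.csv", index=False)
best = results.loc[results["mse_testing"].idxmin(), "model"]
```

Lower MSE is better, so `idxmin` picks the best-performing model.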
Data Plot:
Performance of models that have been trained on data that is split into training and testing sets (no validation)
On Testing Data:
On Last 100 Days:
Note: Here all models can be taken into consideration when testing on the last 100 days, because none of them has seen the last 100 days in either training or testing.
On Last 1000 Days:
Note: When testing on the last 1000 days, we should primarily consider the models from which the last 1000 days were extracted (here, Model 3 & Model 4). Since Model 5 & 6 had the last 2000 days extracted, they haven't seen the last 1000 days either, so they can be compared as well. Model 1 & 2 cannot be taken into consideration because they used 900 of the last 1000 days in either training or testing, which means they have seen these days before.
On Last 2000 Days:
Note: Here we will only consider the models from which the last 2000 days were extracted (Model 5 & 6); the other models can't be compared against them because they have seen this data before.
Performance of models that have been trained on data that is split into training, testing and validation sets.
On Testing Data:
On Last 100 Days:
Note: Remember, here all the models are taken into consideration when comparing performance because none of them has seen the last 100 days.
On Last 1000 Days:
Note: Remember, only Models 3, 4, 5 & 6 are taken into consideration when comparing performance because none of them has seen the last 1000 days.
On Last 2000 Days:
Note: Remember, only Models 5 & 6 are taken into consideration when comparing performance because neither of them has seen the last 2000 days.
Using Natural Gas Prices Dataset
Data Plot:
Performance of models that have been trained on data that is split into training and testing sets (no validation)
On Testing Data:
On Last 100 Days:
Note: Remember, here all the models are taken into consideration when comparing performance because none of them has seen the last 100 days.
On Last 1000 Days:
Note: Remember, only Models 3, 4, 5 & 6 are taken into consideration when comparing performance because none of them has seen the last 1000 days.
On Last 2000 Days:
Note: Remember, only Models 5 & 6 are taken into consideration when comparing performance because neither of them has seen the last 2000 days.
Performance of models that have been trained on data that is split into training, testing and validation sets.
On Testing Data:
On Last 100 Days:
Note: Remember, here all the models are taken into consideration when comparing performance because none of them has seen the last 100 days.
On Last 1000 Days:
Note: Remember, only Models 3, 4, 5 & 6 are taken into consideration when comparing performance because none of them has seen the last 1000 days.
On Last 2000 Days:
Note: Remember, only Models 5 & 6 are taken into consideration when comparing performance because neither of them has seen the last 2000 days.
Using AMD Stock Market Prices Dataset
Data Plot:
Performance of models that have been trained on data that is split into training and testing sets (no validation)
On Testing Data:
On Last 100 Days:
Note: Remember, here all the models are taken into consideration when comparing performance because none of them has seen the last 100 days.
On Last 1000 Days:
Note: Remember, only Models 3, 4, 5 & 6 are taken into consideration when comparing performance because none of them has seen the last 1000 days.
On Last 2000 Days:
Note: Remember, only Models 5 & 6 are taken into consideration when comparing performance because neither of them has seen the last 2000 days.
Performance of models that have been trained on data that is split into training, testing and validation sets.
On Testing Data:
On Last 100 Days:
Note: Remember, here all the models are taken into consideration when comparing performance because none of them has seen the last 100 days.
On Last 1000 Days:
Note: Remember, only Models 3, 4, 5 & 6 are taken into consideration when comparing performance because none of them has seen the last 1000 days.
On Last 2000 Days:
Note: Remember, only Models 5 & 6 are taken into consideration when comparing performance because neither of them has seen the last 2000 days.
Using Google Stock Market Prices Dataset
What's in the files?
- The code for all the models, of course.
- The models as .h5 files. They can be restored as follows:
from keras.models import load_model
model = load_model('my_model.h5')
- The resulting tables as CSV files.
Contributions are welcome!
Thank you!