Stock-Price-Predictor

Building an LSTM Recurrent Neural Network for predicting stock market prices, applied to each of the following time series datasets:

  • Daily prices of natural gas, from January 1997 to 2018. link
  • Daily prices of AMD stock, from 2009 to 2018. link
  • Daily prices of Google stock, from 2009 to 2018. link

Getting Started

How to view the files

There are two copies of every notebook in this repository. The small version contains the same content as the large one, except for the figures that make the large notebooks so big; these figures show the performance (real prices vs. predictions) of every model on the testing data. If you want to see those figures, they are available in the large version of each notebook. For browsing, open the small version: the large copies are too big to be rendered by GitHub or nbviewer, so you would have to download them and open them locally in Jupyter Notebook.

Important Note: If the GitHub viewer can't load the Jupyter Notebooks, you can use nbviewer to view them without having to download them. Just copy the link to the notebook you want to view, go to nbviewer, paste the link there, and click Go!

Here are the steps for training and testing the Stock Price Predictor:

Data Preprocessing

  • Extracting samples (sequences of days whose prices we want to predict) from the last part of the data, to be used later for testing the resulting models. The rest of the data is then used for training and testing.
  • Scaling the data to the range (0, 1).
    Note: we need to scale because fitting the model on unscaled data with a wide range of values can slow down learning and convergence, and in some cases prevent the network from learning the problem effectively.
  • Generating the input and output sequences that are fed to the Keras LSTM model. A randomly generated list of window lengths (drawn from a range whose bounds lie in [5, 100]) determines how many consecutive days the model uses to predict the target (next) day. A sketch of this step follows the list.
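
As a rough illustration, here is a minimal sketch of the scaling and windowing steps in Python. The helper name make_sequences and its parameters are hypothetical, not taken from the notebooks:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    def make_sequences(prices, n_windows=500, min_len=5, max_len=100, seed=0):
        """Scale prices to (0, 1), then sample (window, next-day) pairs
        with randomly chosen window lengths."""
        rng = np.random.default_rng(seed)

        # Scale to (0, 1) so large raw prices don't slow down learning.
        scaler = MinMaxScaler(feature_range=(0, 1))
        scaled = scaler.fit_transform(
            np.asarray(prices, dtype=float).reshape(-1, 1)).ravel()

        X, y = [], []
        for _ in range(n_windows):
            length = int(rng.integers(min_len, max_len + 1))   # random window length
            start = int(rng.integers(0, len(scaled) - length))  # random position
            X.append(scaled[start:start + length])  # consecutive days (input)
            y.append(scaled[start + length])        # the next day (target)
        return X, np.array(y), scaler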

Training and Testing the Predictor

Six models are trained and tested in order to compare the different results and pick out the best one. Moreover, two methods of splitting the data are used, as follows:

  1. Data sequences are split into training and testing sets (no validation set).
  2. Data sequences are split into training, validation, and testing sets.

Note: This results in 12 models in total; the first 6 use the first splitting method and the remaining 6 use the second.


Concretely, these are the configurations the models are built on (a sketch of the splitting logic follows the two lists below):


Models trained on data split into training and testing sets (no validation)

  • Model 1: Save 100 input samples from the last part of the data. (70% training, 30% testing)
  • Model 2: Save 100 input samples. (50% training, 50% testing)
  • Model 3: Save 1000 input samples. (70% training, 30% testing)
  • Model 4: Save 1000 input samples. (50% training, 50% testing)
  • Model 5: Save 2000 input samples. (70% training, 30% testing)
  • Model 6: Save 2000 input samples. (50% training, 50% testing)


Models trained on data split into training, validation, and testing sets

  • Model 1: Save 100 input samples from the last part of the data. (70% training, 15% validation, 15% testing)
  • Model 2: Save 100 input samples. (50% training, 25% validation, 25% testing)
  • Model 3: Save 1000 input samples. (70% training, 15% validation, 15% testing)
  • Model 4: Save 1000 input samples. (50% training, 25% validation, 25% testing)
  • Model 5: Save 2000 input samples. (70% training, 15% validation, 15% testing)
  • Model 6: Save 2000 input samples. (50% training, 25% validation, 25% testing)
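
For concreteness, here is a minimal sketch of the two splitting schemes. The helper split_sequences and its arguments are hypothetical:

    def split_sequences(X, y, train_frac, val_frac=0.0):
        """Split (input, target) pairs into train/val/test by fraction.

        train_frac=0.7, val_frac=0.0  -> 70% train / 30% test (method 1)
        train_frac=0.7, val_frac=0.15 -> 70% train / 15% val / 15% test (method 2)
        """
        n_train = int(len(X) * train_frac)
        n_val = int(len(X) * val_frac)
        train = (X[:n_train], y[:n_train])
        val = (X[n_train:n_train + n_val], y[n_train:n_train + n_val])
        test = (X[n_train + n_val:], y[n_train + n_val:])
        return train, val, test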

Training and testing steps (applied to each model; a sketch follows the list):

  1. Fit the model on the training data.
  2. Evaluate the model on the testing data and plot its predictions against the real prices.
  3. Evaluate the model on the last 100, 1000, and 2000 days, respectively, and plot its predictions for these sequences of days against the real prices.
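
A minimal sketch of how such a model might be fit and evaluated with Keras. The layer sizes, the fixed window_len, and the toy data are assumptions, not the notebooks' exact settings:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    window_len = 30  # assumed fixed length; variable-length windows need padding

    # Toy stand-in data shaped like the real inputs: (samples, window_len, 1)
    rng = np.random.default_rng(0)
    X_train = rng.random((500, window_len, 1)); y_train = rng.random(500)
    X_test = rng.random((100, window_len, 1)); y_test = rng.random(100)

    model = Sequential([
        LSTM(32, input_shape=(window_len, 1)),  # one feature per day: the price
        Dense(1),                               # predict the next (scaled) price
    ])
    model.compile(optimizer='adam', loss='mse')

    model.fit(X_train, y_train, epochs=5, batch_size=32, verbose=0)
    mse = model.evaluate(X_test, y_test, verbose=0)  # MSE on the testing data
    preds = model.predict(X_test)  # plot these against the real prices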

Compare Models

  1. Visually: Compare the performance of the models (measured using MSE) and identify the model with the best value using bar plots.

  2. Tabularly: Evaluate the performance of the models, save the results in Pandas DataFrames, and store them as CSV files.

Note: the evaluation is done on the testing data and on the samples that were extracted from the last part of the data (the last 100, 1000, and 2000 days). A sketch of this comparison step follows.
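
A minimal sketch of the comparison step in Python; the MSE values below are placeholders, not results from this repository:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Placeholder MSE values; the real numbers come from evaluating each model.
    results = pd.DataFrame(
        {'test_mse': [0.0012, 0.0015, 0.0009, 0.0011, 0.0010, 0.0013]},
        index=['Model %d' % i for i in range(1, 7)],
    )

    results.to_csv('model_comparison.csv')  # tabular comparison as a CSV file
    ax = results['test_mse'].plot.bar()     # visual comparison as a bar plot
    ax.set_ylabel('MSE')
    plt.tight_layout()
    plt.savefig('model_comparison.png')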

Results

Natural Gas Prices Dataset

Data Plot:

[Figure: natural gas daily prices]

Performance of models trained on data split into training and testing sets (no validation)

On Testing Data:

[Figure: performance on testing data]

On Last 100 Days:

[Figure: performance on last 100 days]

Note: All models can be taken into consideration when testing on the last 100 days, because none of them has seen the last 100 days in either training or testing.

On Last 1000 Days:

[Figure: performance on last 1000 days]

Note: When testing on the last 1000 days, we should only consider the models from which at least the last 1000 days were extracted beforehand. Those are Model 3 and Model 4, and also Model 5 and Model 6, since the last 2000 days (which include the last 1000) were extracted from their data. Model 1 and Model 2 cannot be taken into consideration, because they used 900 of the last 1000 days in either training or testing, which means they have seen those days before.

On Last 2000 Days:

[Figure: performance on last 2000 days]

Note: Here we only consider the models from which the last 2000 days were extracted (Model 5 and Model 6); we cannot compare them against the other models, because those have seen this data before.

Performance of models trained on data split into training, validation, and testing sets

On Testing Data:

[Figure: performance on testing data]

On Last 100 Days:

[Figure: performance on last 100 days]

Note: All models are taken into consideration here, because none of them has seen the last 100 days.

On Last 1000 Days:

[Figure: performance on last 1000 days]

Note: Only Models 3, 4, 5, and 6 are taken into consideration, because they have not seen the last 1000 days.

On Last 2000 Days:

[Figure: performance on last 2000 days]

Note: Only Models 5 and 6 are taken into consideration, because they have not seen the last 2000 days.

Effect of Using a Validation Set on Performance

Using the Natural Gas Prices Dataset:

[Figure: effect of using a validation set]

AMD Stock Market Prices Dataset

Data Plot:

[Figure: AMD daily stock prices]

Performance of models trained on data split into training and testing sets (no validation)

On Testing Data:

[Figure: performance on testing data]

On Last 100 Days:

[Figure: performance on last 100 days]

Note: All models are taken into consideration here, because none of them has seen the last 100 days.

On Last 1000 Days:

[Figure: performance on last 1000 days]

Note: Only Models 3, 4, 5, and 6 are taken into consideration, because they have not seen the last 1000 days.

On Last 2000 Days:

[Figure: performance on last 2000 days]

Note: Only Models 5 and 6 are taken into consideration, because they have not seen the last 2000 days.

Performance of models trained on data split into training, validation, and testing sets

On Testing Data:

[Figure: performance on testing data]

On Last 100 Days:

[Figure: performance on last 100 days]

Note: All models are taken into consideration here, because none of them has seen the last 100 days.

On Last 1000 Days:

[Figure: performance on last 1000 days]

Note: Only Models 3, 4, 5, and 6 are taken into consideration, because they have not seen the last 1000 days.

On Last 2000 Days:

[Figure: performance on last 2000 days]

Note: Only Models 5 and 6 are taken into consideration, because they have not seen the last 2000 days.

Effect of Using a Validation Set on Performance

Using the AMD Stock Market Prices Dataset:

[Figure: effect of using a validation set]

Google Stock Market Prices Dataset

Data Plot:

[Figure: Google daily stock prices]

Performance of models trained on data split into training and testing sets (no validation)

On Testing Data:

[Figure: performance on testing data]

On Last 100 Days:

[Figure: performance on last 100 days]

Note: All models are taken into consideration here, because none of them has seen the last 100 days.

On Last 1000 Days:

[Figure: performance on last 1000 days]

Note: Only Models 3, 4, 5, and 6 are taken into consideration, because they have not seen the last 1000 days.

On Last 2000 Days:

[Figure: performance on last 2000 days]

Note: Only Models 5 and 6 are taken into consideration, because they have not seen the last 2000 days.

Performance of models trained on data split into training, validation, and testing sets

On Testing Data:

[Figure: performance on testing data]

On Last 100 Days:

[Figure: performance on last 100 days]

Note: All models are taken into consideration here, because none of them has seen the last 100 days.

On Last 1000 Days:

[Figure: performance on last 1000 days]

Note: Only Models 3, 4, 5, and 6 are taken into consideration, because they have not seen the last 1000 days.

On Last 2000 Days:

[Figure: performance on last 2000 days]

Note: Only Models 5 and 6 are taken into consideration, because they have not seen the last 2000 days.

Effect of Using a Validation Set on Performance

Using the Google Stock Market Prices Dataset:

[Figure: effect of using a validation set]

Notes

What's in the files?

  1. The code for all of the models, of course.
  2. The models as .h5 files. They can be restored as follows:
    from keras.models import load_model
    model = load_model('my_model.h5')
  3. The resulting tables as CSV files.

Contributions are welcome!
Thank you!