Neural Networks and optimizers from scratch

Motivation

The aim of this project is to consolidate my understanding of neural networks, and to refine my internal representation of a neural network as a computation graph.

I wanted to gain intuition about how and why different optimizers converge and behave the way they do. Therefore, I implemented a number of optimizers from scratch, based on the papers in which they were published.

Project

In this IPython notebook, I wrote a neural network using an object-oriented approach and tested it on the MNIST dataset. The optimizers are contained in this script.

For the tests, the network architecture was two linear layers with ReLU activations, followed by an output layer feeding a softmax function. The Layer and Model objects can handle an arbitrary number of layers, each with a different number of units.
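As a rough illustration of the architecture described above, the forward pass could look like the sketch below. It is not the notebook's actual code: the constructor arguments, the hidden sizes (128 and 64), the initialisation scheme and the random input batch are all assumptions made for the example; only the names Layer and Model come from this README.

```python
import numpy as np


class Layer:
    """Hypothetical fully connected layer: z = x @ W + b, optionally followed by ReLU."""

    def __init__(self, n_in, n_out, activation=None, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        # He-style initialisation, a common choice for ReLU layers (assumption)
        self.W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        self.b = np.zeros(n_out)
        self.activation = activation

    def forward(self, x):
        z = x @ self.W + self.b
        if self.activation == "relu":
            return np.maximum(z, 0.0)
        return z


def softmax(z):
    # Subtract the row-wise max for numerical stability before exponentiating
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


class Model:
    """Hypothetical container that chains any number of layers, then applies softmax."""

    def __init__(self, layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return softmax(x)


rng = np.random.default_rng(0)
model = Model([
    Layer(784, 128, "relu", rng),  # 784 = 28 * 28 flattened MNIST pixels
    Layer(128, 64, "relu", rng),
    Layer(64, 10, None, rng),      # 10 digit classes, raw scores passed to softmax
])
batch = rng.random((32, 784))      # random stand-in for a minibatch of images
probs = model.forward(batch)       # shape (32, 10); each row sums to 1
```

Because Model only loops over a list, adding or removing layers, or changing their widths, needs no other change.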

Optimizers

The optimizers I have implemented in this notebook include (so far):

  1. Minibatch Gradient Descent (Vanilla)
  2. SGD with Momentum
  3. Nesterov Momentum (or Nesterov Accelerated Gradient)
  4. Adagrad
  5. RMSprop
  6. Adam
  7. Nadam
  8. Adadelta
  9. Adamax
  10. QHAdam
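To make two of the update rules in this list concrete, here is a minimal sketch of SGD with momentum and Adam in their commonly stated textbook forms. It is not the project's implementation: the function names, the hyperparameter values and the toy quadratic loss (used in place of backpropagated gradients) are assumptions chosen so the example runs on its own.

```python
import numpy as np


def grad(w):
    # Gradient of the toy loss f(w) = 0.5 * ||w||^2 (assumption: stands in for backprop)
    return w


def sgd_momentum(w, steps=200, lr=0.1, beta=0.9):
    """Classical momentum: the velocity accumulates a decaying sum of past gradients."""
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w = w + v
    return w


def adam(w, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: per-parameter step sizes from bias-corrected first and second moments."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g ** 2   # second moment (mean of squared gradients)
        m_hat = m / (1 - beta1 ** t)           # bias correction for zero initialisation
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w


w0 = np.array([1.0, -2.0])
w_sgdm = sgd_momentum(w0)
w_adam = adam(w0)
```

Both runs move the parameters toward the minimum at the origin. The main structural difference is that Adam divides each step by a running estimate of the gradient magnitude, while momentum scales the step directly with the gradient.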

Decaying Momentum (Demon) can be applied to any optimizer that inherits from the Adam class or the SGDM class, and decoupled weight decay can be applied to any optimizer that inherits from the Adam class. This can result in optimizers such as DemonQHAdamW or DemonNesterov.
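The two add-ons can be sketched as follows. This is a hedged illustration, not the code in this repository: `demon_beta` and `adamw_step` are hypothetical names, the schedule follows the formula given in the Demon paper, and the weight decay is applied in the style used by common AdamW implementations (shrinking the weights directly rather than adding an L2 term to the gradient).

```python
import numpy as np


def demon_beta(t, total_steps, beta_init=0.9):
    """Demon schedule: momentum decays from beta_init at t = 0 to 0 at t = total_steps."""
    frac = 1.0 - t / total_steps
    return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)


def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One Adam step with decoupled weight decay (hypothetical helper)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: shrink the weights directly, outside the adaptive scaling
    w = w - lr * weight_decay * w
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


# Momentum coefficient at a few points of a 100-step run
schedule = [demon_beta(t, 100) for t in (0, 50, 100)]

# With a zero gradient, only the decay term moves the weights
w, m, v = adamw_step(np.array([1.0]), np.array([0.0]),
                     np.zeros(1), np.zeros(1), t=1, lr=0.1, weight_decay=0.5)
```

Keeping the decay outside the division by the second-moment estimate is what distinguishes it from plain L2 regularisation, where the penalty would be rescaled along with the gradient.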

The graph below shows training loss over epochs for a few select optimizers: img

This one shows validation accuracy over epochs: img

QHAdamW achieved the lowest training loss, while Nesterov achieved the best validation accuracy on this task.

SGD with momentum and Nesterov momentum may be 'simpler' gradient descent algorithms, but they converge quite well over epochs.

From my previous tests, however, these momentum optimizers are quite sensitive to the learning rate, unlike the algorithms in the "Adam's family".

To-do

  • Perhaps convert optimizers to separate objects for easier handling of arguments / optional parameters
  • Convolutional layer and pooling from scratch, to test with CIFAR10 dataset