# toy-BERT

A toy implementation of BERT, built from scratch.

Caveats:

- It is trained on a very small dataset.
- It does not include the Next Sentence Prediction (NSP) objective from the original paper in pretraining, though it would be straightforward to add (see the sketch after this list).
- I have not adhered to best practices everywhere (the code is below average in a couple of places) because this is a toy project built under time constraints.
- It has been trained for relatively few iterations; I hope to eventually train it on a larger corpus for more iterations.
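
For reference, NSP is just a binary classifier on top of the pooled `[CLS]` representation, trained on sentence pairs where the second sentence is either the true next sentence or a randomly sampled one. Here is a minimal PyTorch sketch; the names `hidden_size` and `pooled_cls` are assumptions about the surrounding model, not this repo's actual API:

```python
import torch
import torch.nn as nn

class NSPHead(nn.Module):
    """Binary classifier over the pooled [CLS] representation:
    predicts whether sentence B actually follows sentence A.

    Note: hidden_size and the pooled [CLS] input are assumed to exist
    in the host model; this is an illustrative sketch only.
    """
    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)  # is-next vs. not-next

    def forward(self, pooled_cls: torch.Tensor) -> torch.Tensor:
        # pooled_cls: (batch, hidden_size), e.g. the [CLS] token's final
        # hidden state passed through a tanh pooler, as in the paper.
        return self.classifier(pooled_cls)

# During pretraining, the NSP loss is simply added to the MLM loss:
#   nsp_logits = nsp_head(pooled_cls)                               # (batch, 2)
#   nsp_loss = nn.functional.cross_entropy(nsp_logits, is_next)    # is_next in {0, 1}
#   loss = mlm_loss + nsp_loss
```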

Enjoy!