A toy implementation of BERT I built from scratch.
Caveats:
- It is trained on a very small dataset.
- It does not include NSP (Next Sentence Prediction) in pretraining as described in the paper, though that would be straightforward to add (see the sketch after this list).
- I have not adhered to best practices everywhere (below average in a couple of places), since this is a toy project built under time constraints.
- It has been trained for relatively few iterations; I hope to run it on a larger corpus for more iterations.
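
For reference, NSP amounts to a small binary classification head on the final hidden state of the [CLS] token, trained jointly with the MLM loss. Below is a minimal PyTorch sketch of one way to add it; it assumes the encoder returns hidden states of shape (batch, seq_len, hidden_size), and the names `NSPHead`, `encoder_output`, and `is_next_labels` are hypothetical placeholders, not part of this repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NSPHead(nn.Module):
    """Binary classifier (IsNext vs. NotNext) over the [CLS] hidden state."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size); [CLS] is position 0
        cls_output = hidden_states[:, 0]
        return self.classifier(cls_output)  # (batch, 2)


# Sketch of how the loss would combine with MLM during pretraining
# (encoder_output, mlm_loss, and is_next_labels are assumed to come
# from the existing training loop):
#
#   nsp_head = NSPHead(hidden_size=768)
#   nsp_logits = nsp_head(encoder_output)
#   nsp_loss = F.cross_entropy(nsp_logits, is_next_labels)
#   total_loss = mlm_loss + nsp_loss
```

The batching code would also need to pair each sequence with a real next sentence 50% of the time and a random one otherwise, labeling them accordingly, as in the original paper.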
Enjoy!