LayerNorm GRU

Table of contents

- Introduction and environment
- Why we need LayerNorm
- What is LayerNorm in GRU
- How does it improve our model
- References

Why we need LayerNorm

Activation functions such as tanh and sigmoid have saturation regions, as shown by their first derivatives.

Figure: sigmoid and hyperbolic tangent (tanh) with their first derivatives.

For inputs outside (-4, +4), the first derivatives are very close to zero, so gradients flowing through these activations also shrink toward zero, leading to the vanishing gradient problem.
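As a quick numeric check (a minimal sketch in plain PyTorch, not code from this repository), evaluating sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)) and tanh'(x) = 1 - tanh(x)^2 shows how small the derivatives already are at |x| = 4:

```python
import torch

x = torch.tensor([0.0, 2.0, 4.0, 6.0])

sig = torch.sigmoid(x)
tanh = torch.tanh(x)

# First derivatives: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)),
#                    tanh'(x)    = 1 - tanh(x)^2
d_sig = sig * (1 - sig)
d_tanh = 1 - tanh ** 2

for xi, ds, dt in zip(x.tolist(), d_sig.tolist(), d_tanh.tolist()):
    print(f"x={xi:4.1f}  sigmoid'={ds:.5f}  tanh'={dt:.5f}")

# At x = 4 the derivatives are roughly 0.018 (sigmoid) and 0.0013 (tanh),
# so gradients passing through these activations are strongly attenuated.
```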

What is LayerNorm in GRU

The structure of a GRU cell contains two sigmoid gates (reset and update) and one tanh activation (the candidate state). The following shows the mathematical equations for the original GRU and the LayerNorm GRU.
Figure: equations of the original GRU and the LayerNorm GRU.
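The equations above are embedded as images in the repository. As a reconstruction from the standard GRU formulation, with layer normalization (LN, carrying its own learnable gain and bias) applied to each gate pre-activation, they can be written as follows; the exact placement of LN and of the bias terms in this repository may differ slightly.

```latex
% Original GRU cell
\begin{aligned}
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
\tilde{h}_t &= \tanh\left(W_h x_t + r_t \odot \left(U_h h_{t-1}\right) + b_h\right) \\
h_t &= \left(1 - z_t\right) \odot \tilde{h}_t + z_t \odot h_{t-1}
\end{aligned}

% LayerNorm GRU cell: each pre-activation is normalized before the nonlinearity
\begin{aligned}
r_t &= \sigma\left(\mathrm{LN}_r\left(W_r x_t + U_r h_{t-1}\right)\right) \\
z_t &= \sigma\left(\mathrm{LN}_z\left(W_z x_t + U_z h_{t-1}\right)\right) \\
\tilde{h}_t &= \tanh\left(\mathrm{LN}_h\left(W_h x_t + r_t \odot \left(U_h h_{t-1}\right)\right)\right) \\
h_t &= \left(1 - z_t\right) \odot \tilde{h}_t + z_t \odot h_{t-1}
\end{aligned}
```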

For more insight, we simulate two extreme data distributions and show the effect of LayerNorm before and after applying it.

Figure: data distributions before LayerNorm and after LayerNorm.

After passing them through LayerNorm, the new distributions lie inside (-4, +4), the effective working range of these activation functions.
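Below is a minimal sketch of how such a cell can be written in PyTorch, following the equations above. This is an illustrative reimplementation, not the code shipped in this repository; the class name, the fused 3×hidden projections, and the per-gate LayerNorm placement are assumptions.

```python
import torch
import torch.nn as nn


class LayerNormGRUCell(nn.Module):
    """Illustrative GRU cell with LayerNorm applied to each gate pre-activation."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.hidden_size = hidden_size
        # Fused input-to-hidden and hidden-to-hidden projections for r, z and the candidate.
        # Linear biases are omitted: each LayerNorm below has its own learnable shift.
        self.x2h = nn.Linear(input_size, 3 * hidden_size, bias=False)
        self.h2h = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
        # One LayerNorm per pre-activation.
        self.ln_r = nn.LayerNorm(hidden_size)
        self.ln_z = nn.LayerNorm(hidden_size)
        self.ln_n = nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # Split the stacked projections into reset, update and candidate parts.
        x_r, x_z, x_n = self.x2h(x).chunk(3, dim=-1)
        h_r, h_z, h_n = self.h2h(h).chunk(3, dim=-1)

        r = torch.sigmoid(self.ln_r(x_r + h_r))   # reset gate
        z = torch.sigmoid(self.ln_z(x_z + h_z))   # update gate
        n = torch.tanh(self.ln_n(x_n + r * h_n))  # candidate hidden state
        return (1.0 - z) * n + z * h              # new hidden state


# Usage: roll the cell over a sequence of shape (seq_len, batch, input_size).
cell = LayerNormGRUCell(input_size=8, hidden_size=16)
seq = torch.randn(5, 4, 8)
h = torch.zeros(4, 16)
for x_t in seq:
    h = cell(x_t, h)
print(h.shape)  # torch.Size([4, 16])
```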

How does it improve our model

The result from one of my GRU models in BCI (brain-computer interface).

References

Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer normalization." arXiv preprint arXiv:1607.06450 (2016)

About

Implement layer normalization GRU in PyTorch.
