Why not detach the hidden state of GRU from the computational graph? #161

Open
MejiroSilence opened this issue Dec 19, 2024 · 0 comments

@MejiroSilence

In RNNs, gradients accumulate over time steps. If the sequence is long, gradients can become very large (exploding gradients) or very small (vanishing gradients), leading to unstable training or difficulty in convergence. Detaching the hidden state can limit gradient propagation within each time step, preventing gradient accumulation over the entire sequence, thus mitigating exploding/vanishing gradient problems.
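Below is a minimal sketch of what this would look like in PyTorch, assuming a standard `nn.GRU` trained with truncated BPTT: the hidden state is carried across chunks of the sequence but detached from the graph so gradients only span the current chunk. The module names, dimensions, and dummy data here are illustrative, not taken from this repository.

```python
import torch
import torch.nn as nn

# Hypothetical GRU + linear head; sizes chosen only for illustration.
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
optimizer = torch.optim.Adam(
    list(gru.parameters()) + list(head.parameters()), lr=1e-3
)

seq = torch.randn(4, 100, 8)      # (batch, time, features), dummy data
target = torch.randn(4, 100, 1)
chunk_len = 20
hidden = None                     # nn.GRU uses zeros when hidden is None

for start in range(0, seq.size(1), chunk_len):
    x = seq[:, start:start + chunk_len]
    y = target[:, start:start + chunk_len]

    out, hidden = gru(x, hidden)  # carry the hidden state forward...
    hidden = hidden.detach()      # ...but cut the graph so backprop stops here

    loss = nn.functional.mse_loss(head(out), y)
    optimizer.zero_grad()
    loss.backward()               # gradients only flow through the current chunk
    optimizer.step()
```

Without the `detach()`, the second chunk's `backward()` would try to backpropagate through the previous chunk's (already freed) graph, and gradients would otherwise accumulate across the whole sequence.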
