```python
loss = tf.reduce_mean(-log_likelihood)
```
Here `log_likelihood` is the unnormalized score for each sample, and its magnitude depends on the number of timesteps in the batch. For example:

- Batch 1: max sequence length (timesteps) = 100
- Batch 2: max sequence length (timesteps) = 1000

The scale of the loss values will be considerably different for the two batches.
Shouldn't the loss be:

```python
loss = tf.reduce_mean(-log_likelihood / tf.cast(tf.shape(self.logits)[1], tf.float32))
```

`tf.shape(self.logits)[1]` is the max number of timesteps (sequence length) for that batch; the cast to `tf.float32` is needed because `tf.shape` returns int32 and dividing a float tensor by it would otherwise fail. This makes the loss independent of the sequence length.
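For concreteness, here is a minimal TF 1.x sketch of both variants. It assumes `tf.contrib.crf.crf_log_likelihood`; the placeholder names and shapes (`num_tags`, `logits`, `labels`, `sequence_lengths`) are illustrative, not taken from this repo:

```python
import tensorflow as tf  # TF 1.x

num_tags = 10  # illustrative tag-set size

# [batch, max_time, num_tags] unary scores; max_time varies per batch
logits = tf.placeholder(tf.float32, [None, None, num_tags])
labels = tf.placeholder(tf.int32, [None, None])      # [batch, max_time]
sequence_lengths = tf.placeholder(tf.int32, [None])  # [batch]

log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
    logits, labels, sequence_lengths)

# Unnormalized: each sample's log-likelihood is a sum over timesteps,
# so a batch padded to 1000 steps yields a much larger loss than one
# padded to 100 steps.
loss_unnormalized = tf.reduce_mean(-log_likelihood)

# Normalized by the batch's max timesteps (tf.shape returns int32,
# hence the cast before dividing the float tensor).
max_time = tf.cast(tf.shape(logits)[1], tf.float32)
loss_normalized = tf.reduce_mean(-log_likelihood / max_time)
```

A per-sample variant would divide by `tf.cast(sequence_lengths, tf.float32)` instead, which also corrects for padding differences between samples within the same batch.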