Where is Expectation #5
That would give a more precise loss value; however, I'm afraid that applying this might consume too much training time. I quickly wrote some sample code for your idea:

```python
# Draw n_samples noise words per negative term, so that the expectation in the
# objective is estimated by an empirical mean rather than a single draw.
if self.weights is not None:
    nwords = t.multinomial(self.weights, batch_size * context_size * self.n_negs * self.n_samples, replacement=True).view(batch_size, -1)
else:
    nwords = FT(batch_size, context_size * self.n_negs * self.n_samples).uniform_(0, self.vocab_size - 1).long()
ivectors = self.embedding.forward_i(iword).unsqueeze(2)
ovectors = self.embedding.forward_o(owords)
nvectors = self.embedding.forward_o(nwords).neg()
oloss = t.bmm(ovectors, ivectors).squeeze().sigmoid().log().mean(1)
# mean(3): empirical mean over the n_samples draws; sum(2): sum over the
# n_negs negative terms; mean(1): average over the context positions.
nloss = t.bmm(nvectors, ivectors).squeeze().sigmoid().log().view(-1, context_size, self.n_negs, self.n_samples).mean(3).sum(2).mean(1)
```
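For readers following along, here is a self-contained sketch of the idea under discussion, independent of the repo's model class: the bracketed term of the objective is approximated by averaging over `n_samples` draws from the noise distribution instead of using a single draw per negative. The names `in_vec`, `out_emb`, and `noise_dist` are placeholders invented for this illustration, not the repo's actual tensors.

```python
import torch

torch.manual_seed(0)

vocab_size, dim = 1000, 50
n_negs, n_samples = 5, 20   # negatives per (w_I, w_O) pair, and draws per negative

# Placeholders standing in for the model's embeddings (assumption, not repo code).
in_vec = torch.randn(dim) * 0.01                 # input vector v_{w_I}
out_emb = torch.randn(vocab_size, dim) * 0.01    # output embeddings v'_w
noise_dist = torch.ones(vocab_size)              # unigram^0.75 in practice; uniform here for brevity

# Draw n_negs * n_samples noise words and score each against the input vector.
nwords = torch.multinomial(noise_dist, n_negs * n_samples, replacement=True)
scores = out_emb[nwords] @ in_vec                # shape: (n_negs * n_samples,)
log_sig = (-scores).sigmoid().log()              # log sigma(-v'_{w_i} . v_{w_I})

# Average over the n_samples draws to get an empirical mean per negative term
# (a Monte Carlo estimate of the expectation), then sum the n_negs means.
neg_term = log_sig.view(n_negs, n_samples).mean(dim=1).sum()
print(neg_term)
```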
I completely agree with you that this will make training slower. My colleague and I concluded that your approach also leads to convergence. But I wish there were some notes about this formula and its implementation in the README or in the code. Thank you.
If you have a good idea for the implementation, please go ahead and open a PR. Otherwise, I'll close this issue. Thank you.
Thanks for publishing this repo. I have found it very useful in improving the performance of my own word2vec implementation. On this particular issue, I note that […] is equivalent to minimising […].

Given that all of the vectors start out 'small', and do not become excessively large in the course of training, numerical stability does not seem to be an issue. (Indeed, if there were some problem with stability, I suspect it might arise anyway, since each function is computed successively, rather than using some numerically-stable compound implementation in the manner of […].) The same argument should apply to the 'improved' computation. Ultimately, the order of application of […].

I have tried the above simplification in my own implementation. It seems to work, in the sense that it converges and produces a sensible-looking result, although I have not done any testing to check that it produces the same embeddings (all else being equal). However, the speed-up is hardly worth it: about 5% for small vectors, e.g. around 25 elements, but with much smaller relative benefits for larger vectors, since the matrix multiplications dominate the computation time. The advantage of retaining the […].

Incidentally, as far as I can tell from the original word2vec code (https://github.com/dav/word2vec/blob/9b8b58001ba5d58babe1c62327a8501b62cd6179/src/word2vec.c#L529), they use a fixed number of negative samples (just five by default), and it looks like they compute the […].
@msummerfield Thanks for the detailed feedback! Awesome.
In this formula we have an expectation over $w_i$. That means for each pair $(w_I, w_O)$ we should calculate this expectation. But as far as I can see in your code, you sample n_negs negative samples for each pair $(w_I, w_O)$. Wouldn't it be more correct to draw $N$ samples of $w_i$ for each of the n_negs negative terms, take the empirical mean of the expression in square brackets, and only then accumulate the n_negs means?
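For reference, the formula being discussed is presumably the negative-sampling objective from Mikolov et al. (2013), where the bracketed term contains the expectation in question and $k$ corresponds to n_negs in the code:

$$
\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]
$$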