Glove method - lemmatization #8

Open
MarekKeskyll opened this issue May 29, 2021 · 1 comment
Comments

@MarekKeskyll

Hi!
Questions:

  1. Why didn't you use lemmatization when processing your documents? Is there a reason behind that?
  2. Why did you use this pre-trained GloVe model (and this number of dimensions)?
  3. Can you validate the results somehow?
@4OH4
Owner

4OH4 commented Jun 1, 2021

Hi there,

Good question on the use of lemmatization. I did use it for the TF-IDF model, but not for GloVe. I think it's more important for TF-IDF, where inflected forms of the same word need to be merged to get accurate word counts. I don't remember why I did not use it with the GloVe model (the example is based on some project work that I did, but borrows heavily from the documentation). I expect that I tried it and found that it did not offer a performance benefit for the particular use case I was looking at.
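To illustrate why lemmatization matters for count-based models such as TF-IDF, here is a minimal sketch. The `TOY_LEMMAS` table and the `term_counts` helper are hypothetical and hand-written for this example; a real pipeline would use a proper lemmatizer, and this is not necessarily how the repository does it.

```python
from collections import Counter

# Hypothetical, hand-written lemma table for illustration only.
# A real pipeline would use a proper lemmatizer instead.
TOY_LEMMAS = {"cats": "cat", "running": "run", "ran": "run", "runs": "run"}


def term_counts(tokens, lemmatize=False):
    """Count terms, optionally mapping each token to its lemma first."""
    if lemmatize:
        tokens = [TOY_LEMMAS.get(t, t) for t in tokens]
    return Counter(tokens)


tokens = "the cat runs while the cats ran".split()
raw = term_counts(tokens)                        # "cat" and "cats" counted separately
lemmatized = term_counts(tokens, lemmatize=True)  # both counted under "cat"
```

Without lemmatization, the counts for one underlying word are split across its surface forms, which distorts the term frequencies that TF-IDF relies on.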

The glove-wiki-gigaword-50 model (50-dimensional vectors) is the smallest of the Gensim models trained on Wikipedia, in terms of model complexity. I was originally looking at near real-time processing of high volumes of data, so compute requirements and latency were an issue. This model is the fastest to run. You would expect some accuracy benefit from moving to a more complex model, although it may be small and would come at the cost of significantly higher memory requirements.
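For reference, a rough sketch of loading this model through Gensim's downloader and comparing two token lists. The helpers `load_glove`, `mean_vector`, and `cosine` are hypothetical names, and averaging word vectors is only a simple stand-in for illustration; it is an assumption, not necessarily the similarity method used in the repository.

```python
import math


def load_glove(name="glove-wiki-gigaword-50"):
    """Download (on first use) and load pre-trained vectors via Gensim."""
    import gensim.downloader as api  # lazy import; requires gensim and network access
    return api.load(name)


def mean_vector(tokens, vectors):
    """Average the vectors of the tokens found in the vocabulary.

    `vectors` can be any mapping from token to a sequence of floats,
    such as the object returned by load_glove(). Returns None if no
    token is in the vocabulary.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Usage would be along the lines of `kv = load_glove()` followed by `cosine(mean_vector(doc_a, kv), mean_vector(doc_b, kv))`, where `doc_a` and `doc_b` are lists of lowercase tokens. Swapping in a larger model name trades speed and memory for potentially better accuracy.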

For my application, I was comparing against human operators who were conducting information retrieval tasks. Our metric was (something like) how often the most similar document appears in the top-1, top-3, or top-10 positions. That is quite a hard route for validation though, and can be quite expensive. Perhaps validation against an equivalent gold-standard automated technique would be better? TF-IDF is an established, widely used technique, so it is a reasonable baseline against which to compare more advanced techniques such as GloVe, to see if they offer an improvement.
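A top-k metric of the kind described above could be sketched as follows. The function name `top_k_hit_rate`, the data layout, and the example rankings are all made up for illustration; they are not results from the project.

```python
def top_k_hit_rate(rankings, relevant, k):
    """Fraction of queries whose reference document appears in the top k.

    rankings: dict mapping query id -> list of document ids, best match first
    relevant: dict mapping query id -> the document id judged most similar
    """
    if not rankings:
        return 0.0
    hits = sum(1 for q, ranked in rankings.items() if relevant[q] in ranked[:k])
    return hits / len(rankings)


# Made-up reference judgements and rankings from two hypothetical methods.
relevant = {"q1": "d1", "q2": "d5", "q3": "d9"}
baseline = {  # e.g. a TF-IDF ranker
    "q1": ["d1", "d2", "d3"],
    "q2": ["d4", "d5", "d6"],
    "q3": ["d7", "d8", "d2"],
}
candidate = {  # e.g. an embedding-based ranker
    "q1": ["d2", "d1", "d3"],
    "q2": ["d5", "d4", "d6"],
    "q3": ["d9", "d8", "d7"],
}

for k in (1, 3):
    print(k, top_k_hit_rate(baseline, relevant, k), top_k_hit_rate(candidate, relevant, k))
```

Running both methods over the same queries and reference judgements gives a direct comparison at each cutoff, so the baseline and the candidate can be judged on the same footing.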
