While building a Corpus using the litstudy.build_corpus() method, I found that min_docs and max_docs_ratio are not working as expected.
For example, when forcing outliers to be kept in the Corpus by setting min_docs=1 and max_docs_ratio=1, the outliers are still being removed. The following example shows a situation in which no filter should be applied (except smart stemming and stopwords):
Please keep in mind that this is not very easy to test. You need a very specific word that is not a STOPWORD and that is very frequent across a reasonable number of papers. In my case, I've been reviewing papers about "Curtailment in Power Systems", so I've managed to get a list of about 1000 papers that contain the word curtailment in the abstract, and that is the curtailment_docs that I'm working with.
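If a real collection like this is not at hand, a synthetic one may be enough to exercise the filters. The sketch below is only an illustration using the standard library: `make_synthetic_docs` and `document_frequency` are hypothetical helpers (not part of litstudy), and the filler words are made up.

```python
from collections import Counter


def make_synthetic_docs(num_docs, frequent_word, filler_words):
    """Build token lists in which `frequent_word` occurs in every document."""
    docs = []
    for i in range(num_docs):
        # Rotate through the filler words so each one appears in only some documents.
        filler = filler_words[i % len(filler_words)]
        # `topic{i}` is unique per document, so it acts as a rare token.
        docs.append([frequent_word, filler, f"topic{i}"])
    return docs


def document_frequency(docs):
    """Count, for each token, the number of documents that contain it."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))
    return counts


docs = make_synthetic_docs(1000, "curtailment", ["wind", "solar", "grid", "storage"])
dfs = document_frequency(docs)

# Fraction of documents containing the frequent word.
ratio = dfs["curtailment"] / len(docs)
```

Here "curtailment" sits in every document, so any filter with a document-ratio ceiling below 1 would be expected to drop it, while a ceiling of 1 should keep it.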
Interesting problem. I'm not sure what is causing it, but I'll look into this. The lack of proper tests for build_corpus and Corpus does not help, unfortunately :-(. Now might be the time to invest in those.
Looking at the code, do you have any idea what the problem could be? The only thing that looks suspicious to me is the call to filter_extremes.
Indeed. It seems that dic.filter_extremes(keep_n=max_tokens) provides functionality similar to preprocess_outliers(), so even if the preprocess_outliers() filter behaves as expected (which I believe it does), the later call to filter_extremes() overrides the desired behavior.
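One plausible reading of this, assuming the dictionary is a gensim Dictionary and the call really passes only keep_n: gensim's filter_extremes() defaults to no_below=5 and no_above=0.5, so a second, stricter document-frequency filter would run regardless of min_docs and max_docs_ratio. The standard-library sketch below imitates that two-stage behavior; `filter_by_doc_freq` is a hypothetical stand-in, not litstudy or gensim code, and the frequencies are made up.

```python
def filter_by_doc_freq(dfs, num_docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    max_count = int(no_above * num_docs)
    return {tok: n for tok, n in dfs.items() if no_below <= n <= max_count}


# Toy document frequencies for a collection of 1000 documents.
num_docs = 1000
dfs = {"curtailment": 1000, "wind": 400, "storage": 40, "rare": 1}

# Stage 1: the settings from the report (min_docs=1, max_docs_ratio=1)
# keep every token.
stage1 = filter_by_doc_freq(dfs, num_docs, no_below=1, no_above=1.0)

# Stage 2: a second pass with gensim's documented defaults
# (no_below=5, no_above=0.5), which apply when only keep_n is given.
stage2 = filter_by_doc_freq(stage1, num_docs, no_below=5, no_above=0.5)
```

In this toy run, stage 1 keeps all four tokens, while stage 2 drops both "curtailment" (in more than half of the documents) and "rare" (in fewer than five), which matches the symptom described in the report.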
I think it would be better to keep only filter_extremes() and pass min_docs, max_docs_ratio and max_tokens through to that method. I've checked the documentation and it might work:
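A minimal sketch of what such a single-pass filter would do, written with the standard library only. `filter_tokens` is a hypothetical helper; the mapping of min_docs, max_docs_ratio and max_tokens onto gensim's no_below, no_above and keep_n is an assumption about how the fix could look, not litstudy's actual code.

```python
def filter_tokens(dfs, num_docs, min_docs, max_docs_ratio, max_tokens):
    """Apply all three limits in one pass.

    With gensim, the assumed equivalent call would be:
        dic.filter_extremes(no_below=min_docs,
                            no_above=max_docs_ratio,
                            keep_n=max_tokens)
    """
    max_count = int(max_docs_ratio * num_docs)
    kept = [(tok, n) for tok, n in dfs.items() if min_docs <= n <= max_count]
    # Keep only the `max_tokens` most frequent survivors.
    kept.sort(key=lambda item: item[1], reverse=True)
    return dict(kept[:max_tokens])


# Toy document frequencies for a collection of 1000 documents.
num_docs = 1000
dfs = {"curtailment": 1000, "wind": 400, "storage": 40, "rare": 1}

kept = filter_tokens(dfs, num_docs, min_docs=1, max_docs_ratio=1.0, max_tokens=3)
```

With min_docs=1 and max_docs_ratio=1, "curtailment" survives, and only the token cap removes anything (here the least frequent token, "rare").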
Expected behavior
After performing a "dumb filter" on my database, prior to building the Corpus:
I was expecting to see 'curtailment' as a "forced outlier".
But it gives me:
False