Orange uses the following formula for IDF: `math.log10(number_of_docs/number_of_docs_with_word)`. With this formula, a word that appears in all documents gets an IDF of 0, so all of its TF-IDF values are 0 and subsequent preprocessors remove it. To avoid this, one can use Smooth IDF, which uses `math.log10(1 + number_of_docs/number_of_docs_with_word)`.
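A minimal sketch of the two formulas exactly as written above, using only the standard library. The function names `idf` and `smooth_idf` are illustrative and are not Orange's API.

```python
import math

def idf(number_of_docs, number_of_docs_with_word):
    # Plain IDF as described above: log10(N / df).
    return math.log10(number_of_docs / number_of_docs_with_word)

def smooth_idf(number_of_docs, number_of_docs_with_word):
    # Smooth IDF as described above: log10(1 + N / df).
    return math.log10(1 + number_of_docs / number_of_docs_with_word)

# A word that occurs in every one of 4 documents.
n_docs, n_docs_with_word = 4, 4
print(idf(n_docs, n_docs_with_word))         # 0.0, so every TF-IDF value for this word is 0
print(smooth_idf(n_docs, n_docs_with_word))  # log10(2), about 0.301
```

With plain IDF the column for such a word is all zeros, which is why later preprocessors drop it; the smooth variant keeps it non-zero.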
Why is this a problem? This is not the same as in scikit-learn.
a) IDF is `math.log10(number_of_docs/(number_of_docs_with_word + 1))`
b) Smooth is `math.log(1 + (number_of_docs + 1) / (number_of_docs_with_word + 1))`
c) scikit-learn uses the natural log, while we use log10 (not a big issue, since the values differ only by a constant factor, but still)
d) When computing TF-IDF, TF is not normalized by document length, which is also standard practice.
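For comparison, the scikit-learn documentation for `TfidfTransformer` gives the IDF as `ln(n / df) + 1` without smoothing and `ln((1 + n) / (1 + df)) + 1` with `smooth_idf=True`. The sketch below reimplements those documented formulas by hand rather than importing scikit-learn, so the function names are assumptions for illustration only. It also checks point c): log10 and natural-log IDF differ by the constant factor `ln(10)`.

```python
import math

def sklearn_style_idf(n_docs, df, smooth=True):
    # Hand-written version of the formulas in the scikit-learn docs;
    # this is not a call into scikit-learn itself.
    if smooth:
        return math.log((1 + n_docs) / (1 + df)) + 1
    return math.log(n_docs / df) + 1

def log10_idf(n_docs, df):
    # The log10 formula quoted at the top of this issue.
    return math.log10(n_docs / df)

n_docs, df = 100, 5

# Point c): natural-log IDF is log10 IDF multiplied by ln(10).
ratio = math.log(n_docs / df) / log10_idf(n_docs, df)
print(ratio, math.log(10))

# A word present in all documents keeps a non-zero weight of 1.0
# in both scikit-learn variants because of the trailing "+ 1".
print(sklearn_style_idf(4, 4, smooth=False))
print(sklearn_style_idf(4, 4, smooth=True))
```

The trailing `+ 1` is what keeps words that occur in every document from being zeroed out, which is the behaviour the log10 formula lacks.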
We should probably use scikit here. This would, of course, affect teaching materials.