BOW not showing attributes in sparse #1063

Open
ajdapretnar opened this issue Jun 17, 2024 · 0 comments

Describe the bug
This bug is a bit tricky to describe. There are two underlying issues:

  • Bag of Words can return features that are 0 for every document.
  • Data Table does not show features that are all nan when the data is sparse (probably it can't, or maybe word=nan).

To Reproduce
Say we have the following documents:

Document                   cat  dog  sleep
Cat sleeps.                1    0    1
Dog sleeps                 0    1    1
Cat sleeps, dog sleeps.    1    2    1

When computing TF-IDF for "sleep", the IDF is 0, because the word occurs in all three documents and log10(3 / 3) = 0. There are three ways of computing IDF:

  1. math.log10(number_of_docs / number_of_docs_with_word) (how we do it)
  2. math.log10(1 + number_of_docs / number_of_docs_with_word) (how we do it with Smooth IDF)
  3. math.log10(number_of_docs / (number_of_docs_with_word + 1)) (how it is recommended)
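For illustration, here is a minimal sketch that evaluates the three variants listed above for "sleep" in the example corpus (3 documents, all 3 containing the word). The function names are hypothetical and only mirror the formulas as written in this issue; this is not the add-on's actual code.

```python
import math


def idf_plain(number_of_docs, number_of_docs_with_word):
    # Variant 1: plain IDF (hypothetical helper name)
    return math.log10(number_of_docs / number_of_docs_with_word)


def idf_smooth(number_of_docs, number_of_docs_with_word):
    # Variant 2: the "Smooth IDF" formula as written in this issue
    return math.log10(1 + number_of_docs / number_of_docs_with_word)


def idf_df_plus_one(number_of_docs, number_of_docs_with_word):
    # Variant 3: add 1 to the document frequency in the denominator
    return math.log10(number_of_docs / (number_of_docs_with_word + 1))


# "sleep" occurs in all 3 of the 3 example documents
number_of_docs, docs_with_sleep = 3, 3
for name, func in [
    ("plain", idf_plain),
    ("smooth", idf_smooth),
    ("df + 1", idf_df_plus_one),
]:
    print(f"{name}: {func(number_of_docs, docs_with_sleep):.4f}")
```

For a word present in every document, variant 1 gives exactly 0, variant 2 gives log10(2), and variant 3 gives a negative value, since the ratio 3 / 4 is below 1.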

How scikit-learn does it: idf(t) = log [ n / df(t) ] + 1, or idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1 if smooth_idf=True. Here log is the natural logarithm.
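As a rough check, the scikit-learn formulas quoted above can be written out with only the standard library. The helper name `idf_sklearn_style` is hypothetical; it reproduces the formulas as quoted, not scikit-learn's internals (which also apply normalization to the final TF-IDF vectors by default).

```python
import math


def idf_sklearn_style(n, df, smooth=True):
    """IDF following the scikit-learn formulas quoted in this issue.

    n is the number of documents, df the number of documents
    containing the term. Uses the natural logarithm.
    """
    if smooth:
        return math.log((1 + n) / (1 + df)) + 1
    return math.log(n / df) + 1


# "sleep" appears in all 3 documents; "cat" appears in 2 of 3
for word, df in [("sleep", 3), ("cat", 2)]:
    print(
        word,
        round(idf_sklearn_style(3, df, smooth=False), 4),
        round(idf_sklearn_style(3, df, smooth=True), 4),
    )
```

Because of the trailing + 1, a word that occurs in every document gets an IDF of 1 rather than 0, so its TF-IDF feature is not zeroed out.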

To reproduce: Create Corpus (with the documents above) → Bag of Words (TF-IDF) → Data Table.

Expected behavior

  1. Resolve the nan result in Bag of Words. Why do we not use scikit-learn? Should we reconsider how we compute IDF?
  2. Resolve the display of nan in sparse data.

Orange version:
3.37.0

Text add-on version:
1.7.0
