-
-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Score documents: fix word preprocessing #707
Score documents: fix word preprocessing #707
Conversation
ffc33fe
to
7318613
Compare
callback((i + 1) / len(pps)) | ||
return words | ||
return [w[0] for w in words_c.tokens] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no guarantee that there is a word in tokens.
E.g. using Remove urls
may result in an empty list of tokens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
7318613
to
e691a16
Compare
callback((i + 1) / len(pps)) | ||
return words | ||
return [w[0] for w in words_c.tokens if len(w) > 0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may result in an empty list, which is could probably cause some unexpected behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be ok now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
len(w) > 0
can be just len(w)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, but I usually use len(w) > 0
because it is more readable. I can change to len(w)
anyway
e691a16
to
90a054a
Compare
Codecov Report
@@ Coverage Diff @@
## master #707 +/- ##
==========================================
- Coverage 73.90% 73.88% -0.02%
==========================================
Files 72 72
Lines 9427 9429 +2
Branches 1286 1287 +1
==========================================
Hits 6967 6967
- Misses 2218 2219 +1
- Partials 242 243 +1 |
90a054a
to
df9e115
Compare
Issue
The implementation of word preprocessing use protected method and does not work for all preprocessors
Description of changes
Implements word preprocessing via converting word list to Corpus. We should enhance preprocessors with a method that works on a list of strings in the feature, but we need more use cases like this. With only one use case it does not make sense to make such a big change in preprocessors.
Includes