Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Score documents: fix word preprocessing #707

Merged
merged 1 commit into from
Aug 27, 2021

Conversation

PrimozGodec
Copy link
Collaborator

Issue

The implementation of word preprocessing use protected method and does not work for all preprocessors

Description of changes

Implements word preprocessing via converting word list to Corpus. We should enhance preprocessors with a method that works on a list of strings in the feature, but we need more use cases like this. With only one use case it does not make sense to make such a big change in preprocessors.

Includes
  • Code changes
  • Tests
  • Documentation

callback((i + 1) / len(pps))
return words
return [w[0] for w in words_c.tokens]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no guarantee that there is a word in tokens.
E.g. using Remove urls may result in an empty list of tokens.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

callback((i + 1) / len(pps))
return words
return [w[0] for w in words_c.tokens if len(w) > 0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may result in an empty list, which is could probably cause some unexpected behaviour.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be ok now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len(w) > 0 can be just len(w)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, but I usually use len(w) > 0 because it is more readable. I can change to len(w) anyway

@codecov-commenter
Copy link

Codecov Report

Merging #707 (7318613) into master (96c295c) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head 7318613 differs from pull request most recent head 90a054a. Consider uploading reports for the commit 90a054a to get more accurate results

@@            Coverage Diff             @@
##           master     #707      +/-   ##
==========================================
- Coverage   73.90%   73.88%   -0.02%     
==========================================
  Files          72       72              
  Lines        9427     9429       +2     
  Branches     1286     1287       +1     
==========================================
  Hits         6967     6967              
- Misses       2218     2219       +1     
- Partials      242      243       +1     

@VesnaT VesnaT merged commit 7a5f5fe into biolab:master Aug 27, 2021
@PrimozGodec PrimozGodec deleted the score-documents-preprocessors branch March 29, 2023 10:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants