Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WordCloud filters words regardless of passed regex #266

Open
jasperjkearney opened this issue May 18, 2017 · 12 comments
Open

WordCloud filters words regardless of passed regex #266

jasperjkearney opened this issue May 18, 2017 · 12 comments

Comments

@jasperjkearney
Copy link

If the optional regexp argument is passed to a WordCloud class, stopwords, '\s', and numbers are still removed from the word cloud's words.
This behaviour (not so much the removal of stopwords and possessives but more numbers) seems unexpected to me, but this is up for contention.
Happy to create pull request.

@amueller
Copy link
Owner

Hey. There is a (stalled) PR here: #215

I think it would be cool if we could allow this without adding yet another option. So the question is if it's feasible to reproduce the current behavior by modifying the regexp and removing the hard-coded isnumber.

Any input welcome!
stopwords and normalize_plurals are independent of the regexp argument, and you can control them via the parameters.

@jasperjkearney
Copy link
Author

jasperjkearney commented May 18, 2017

My apologies, I missed stopwords and normalize_plurals.

\w[^\d\W']+ seems to work.

@amueller
Copy link
Owner

amueller commented May 18, 2017

That excludes 90s and hax0r which are included with the current definition, right?

@jasperjkearney
Copy link
Author

Yes, my mistake, I should run tests. I'll work on a better regex.

@amueller
Copy link
Owner

cool thanks :)

@jasperjkearney
Copy link
Author

I've been struggling for a while with this and can't find a regex that works in the same way as the current one whilst not matching strings that are entirely numbers.
A possible solution could be to check if a custom regex has been passed before filtering numbers from the matches, but that is kind of unwieldy, what do you think?

@amueller
Copy link
Owner

hm... I guess we do need to add another boolean option...

@jasperjkearney
Copy link
Author

Maybe it would be worth negating these two lines if a custom regex is used:

        # remove 's
        words = [word[:-2] if word.lower().endswith("'s") else word
                 for word in words]
        # remove numbers
        words = [word for word in words if not word.isdigit()]

That makes the most sense to me anyway but would slightly change the functionality.

@amueller
Copy link
Owner

but a custom regex doesn't necessarily mean that people don't want plurals removed.

@jasperjkearney
Copy link
Author

jasperjkearney commented May 19, 2017

I guess, although this whole issue caught my attention because the class was not respecting my custom regex and I had to find the normalise plurals flag.
So the boolean option would be something like remove_numbers?

(Also it does not in fact remove plurals but possesives, which was also a little confusing)

@amueller
Copy link
Owner

It removes plurals and possessives, so it might be a misnomer. We can rename it to remove_trailing_s or something like that?
I understand that this could be confusing, but if you overwrite the behavior for removing numbers, it's impossible to recreate the default behavior with a different regex and there is strong coupling of the arguments.
And yeah remove_numbers sounds fine. Sorry for the slow reply, I was on vacation.

@jasperjkearney
Copy link
Author

jasperjkearney commented May 29, 2017

I understand what you mean.
Will try to work on this later in the week.
P.S. hope you enjoyed your vacation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants