WordCloud filters words regardless of passed regex #266

jasperjkearney · 2017-05-18T17:49:25Z

If the optional regexp argument is passed to the WordCloud class, stopwords, 's, and numbers are still removed from the word cloud's words.
This behaviour (the removal of numbers more than that of stopwords and possessives) seems unexpected to me, but this is up for debate.
Happy to create a pull request.

amueller · 2017-05-18T17:54:46Z

Hey. There is a (stalled) PR here: #215

I think it would be cool if we could allow this without adding yet another option. So the question is whether it's feasible to reproduce the current behavior by modifying the regexp and removing the hard-coded isdigit check.

Any input welcome!
stopwords and normalize_plurals are independent of the regexp argument, and you can control them via the parameters.

jasperjkearney · 2017-05-18T18:35:27Z

My apologies, I missed stopwords and normalize_plurals.

\w[^\d\W']+ seems to work.

amueller · 2017-05-18T18:40:39Z

That excludes 90s and hax0r which are included with the current definition, right?
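For reference, the difference between the two patterns can be checked with a few lines of Python. This is only a sketch: it assumes the default pattern at the time was `\w[\w']+` (not stated in this thread), and the sample text and the `tokenize` helper are made up for illustration. It calls `re.findall` directly rather than going through WordCloud.

```python
import re

# Assumption: the library's default token pattern at the time of this thread.
DEFAULT_PATTERN = r"\w[\w']+"
# The pattern proposed in the comment above.
PROPOSED_PATTERN = r"\w[^\d\W']+"


def tokenize(pattern, text):
    """Return every non-overlapping match of pattern in text (hypothetical helper)."""
    return re.findall(pattern, text)


sample = "90s hax0r hello 2017 it's"

default_tokens = tokenize(DEFAULT_PATTERN, sample)
proposed_tokens = tokenize(PROPOSED_PATTERN, sample)

print(default_tokens)   # ['90s', 'hax0r', 'hello', '2017', "it's"]
print(proposed_tokens)  # ['0s', 'hax', '0r', 'hello', 'it']
```

The proposed pattern does drop 2017, but it also splits 90s and hax0r into fragments, because after the first character it refuses any digit.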

jasperjkearney · 2017-05-18T18:43:12Z

Yes, my mistake, I should run tests. I'll work on a better regex.

amueller · 2017-05-18T18:43:28Z

cool thanks :)

jasperjkearney · 2017-05-19T01:08:10Z

I've been struggling for a while with this and can't find a regex that works in the same way as the current one whilst not matching strings that are entirely numbers.
A possible solution could be to check whether a custom regex has been passed before filtering numbers from the matches, but that is kind of unwieldy. What do you think?

amueller · 2017-05-19T01:40:37Z

hm... I guess we do need to add another boolean option...

jasperjkearney · 2017-05-19T01:43:55Z

Maybe it would be worth skipping these two steps if a custom regex is used:

        # remove 's
        words = [word[:-2] if word.lower().endswith("'s") else word
                 for word in words]
        # remove numbers
        words = [word for word in words if not word.isdigit()]

That makes the most sense to me anyway but would slightly change the functionality.

amueller · 2017-05-19T01:50:55Z

but a custom regex doesn't necessarily mean that people don't want plurals removed.

jasperjkearney · 2017-05-19T01:56:47Z

I guess, although this whole issue caught my attention because the class was not respecting my custom regex and I had to go looking for the normalize_plurals flag.
So the boolean option would be something like remove_numbers?

(Also, it does not in fact remove plurals but possessives, which was also a little confusing.)

amueller · 2017-05-29T16:05:53Z

It removes plurals and possessives, so it might be a misnomer. We can rename it to remove_trailing_s or something like that?
I understand that this could be confusing, but if passing a custom regex overrides the number removal, it becomes impossible to recreate the default behavior with a different regex, and the arguments become strongly coupled.
And yeah remove_numbers sounds fine. Sorry for the slow reply, I was on vacation.
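A rough sketch of what such a flag could look like, based on the two steps quoted earlier in the thread. This is not the library's API: `filter_words` and `strip_possessives` are hypothetical names used only for illustration, and `remove_numbers` is just the name suggested above. Both filters stay independent of whichever regex produced the words.

```python
def filter_words(words, remove_numbers=True, strip_possessives=True):
    """Post-process regex matches (hypothetical helper, not part of wordcloud).

    Both steps default to on, so the behavior described in this thread
    is unchanged unless a caller opts out.
    """
    if strip_possessives:
        # Drop a trailing 's, as in the quoted snippet.
        words = [w[:-2] if w.lower().endswith("'s") else w for w in words]
    if remove_numbers:
        # Drop tokens made up entirely of digits; mixed tokens like 90s stay.
        words = [w for w in words if not w.isdigit()]
    return words


matches = ["90s", "2017", "Anna's", "hax0r"]

print(filter_words(matches))                        # ['90s', 'Anna', 'hax0r']
print(filter_words(matches, remove_numbers=False))  # ['90s', '2017', 'Anna', 'hax0r']
```

Keeping the flag separate from the regexp argument avoids the coupling mentioned above: a custom regex changes what is matched, and the flag alone decides whether pure numbers are dropped afterwards.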

jasperjkearney · 2017-05-29T16:09:16Z

I understand what you mean.
Will try to work on this later in the week.
P.S. hope you enjoyed your vacation

amueller added the Feature Request label Apr 17, 2018
