Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalizer mishandles "X%.", returns "X %." #196

Open
ChanceNCounter opened this issue May 17, 2021 · 7 comments
Open

Normalizer mishandles "X%.", returns "X %." #196

ChanceNCounter opened this issue May 17, 2021 · 7 comments
Labels
bug Something isn't working multi_lang relates to several languages

Comments

@ChanceNCounter
Copy link
Contributor

normalize("Set Volume to 50%.") -> "Set Volume to 50 %."

This is bad. It should probably, at worst, return "Set Volume to 50 % ."

@ChanceNCounter ChanceNCounter added bug Something isn't working multi_lang relates to several languages labels May 17, 2021
@Badboy-16
Copy link

Hi @ChanceNCounter
I would like to work on this issue. As this would be my first contribution to this project, I'll complete the steps required to become a contributor and submit a PR shortly. :)

@ChanceNCounter
Copy link
Contributor Author

Sounds good! I think it should ideally maintain the percentage as such, meaning that when the normalized phrase is passed to a tokenizer, one of the tokens should be "50%". But that's my opinion.

In the long run, the oddness of the current behavior aside, there might be a design choice to be made here: @krisgesling, what are your thoughts on the extractors and percentages?

@krisgesling
Copy link
Contributor

Yeah agreed - the % is inherently tied to the number eg it's not the same as "50 apples", if anything it's closer to "0.5".

Thanks for digging into this @Badboy-16 :)

@JarbasAl
Copy link
Collaborator

since the point of normalize was making intent parsing etc easier, this just makes it harder to detect numbers or percentages, eg, a voc file containing "percent" and "%" will no longer match in adapt, any downstream that is depending on tokens being number words might also suddenly fail

this change was intentionally part of normalization process

@ChanceNCounter
Copy link
Contributor Author

this change was intentionally part of normalization process

Okay but the current state of affairs is unacceptable.

@JarbasAl
Copy link
Collaborator

then normalize the symbol into a word

@ChanceNCounter
Copy link
Contributor Author

I think we might be talking about different things here. The periods in the issue title are literal.

The normalizer handles "5%" correctly. It mishandles "5%.", returning "5 %."

"%." is nothing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working multi_lang relates to several languages
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants