AggressiveTokenizer uses \W which allows the "_" underscore character. However, ALL other tokenizers REMOVE underscore characters
#523
Comments
P.S. I also see that |
I added See #531 |
Thanks @Hugo-ter-Doest. I also might suggest using a modified version of this regex https://github.com/regexhq/punctuation-regex/blob/master/index.js#L12 |
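As a rough illustration of punctuation-driven splitting (this is not the regex linked above, only a sketch under my own assumptions), Unicode property escapes can treat every punctuation and symbol character as a separator. The underscore falls in the connector punctuation category, so `\p{P}` covers it. The function name `splitOnUnicodePunctuation` is hypothetical and is not part of any library.

```javascript
// Hypothetical helper: split on Unicode punctuation, symbols, and whitespace.
// Needs the `u` flag and a runtime that supports Unicode property escapes.
function splitOnUnicodePunctuation(text) {
  return text
    .split(/[\p{P}\p{S}\s]+/u)
    .filter((token) => token.length > 0); // drop empty edge tokens
}

// "_" is connector punctuation, so it separates tokens here,
// while accented letters stay inside their words.
console.log(splitOnUnicodePunctuation("café_bar, ok!"));
```

One trade-off of this approach is that apostrophes are punctuation too, so a contraction such as "don't" is split into two tokens.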
@Hugo-ter-Doest, if you can handle this I would gladly tip you. Just email me at [email protected] with your PayPal address. I'm writing a very advanced classifier that handles every edge case, and I need tokenization to adhere strictly to each language's use of punctuation. |
Also, I'm on the latest version, v2.1.2, and I still see underscores when using the English AggressiveTokenizer. |
Can you give me some examples? I don't see underscores anymore. Also, the test I added to the spec is working fine:
|
It doesn't handle line-breaks |
For example:
On Wed Aug number_rponnmzvyi number_rponnmzvyi at number_rponnmzvyi : number_rponnmzvyi Ulises Ponce wrote:
> Hi!
>
> Is there a command to insert the signature using a combination of keys and not
> to have sent the mail to insert it then?
I simply put it (them) into my (nmh) component files (components,
replcomps, forwcomps and so on). That way you get them when you are
editing your message. Also, by using comps files for specific
folders you can alter your .sig per folder (and other tricks). See
the docs for (n)mh for all the details.
There might (must?) also be a way to get sedit to do it, but I've
been using gvim as my exmh message editor for a long time now. I
load it with a command that loads some email-specific settings, eg,
to "syntax" colour-highlight the headers and quoted parts of an
email)... it would be possible to map some (vim) keys that would add
a sig (or even give a selection of sigs to choose from).
And there are all sorts of ways to have randomly-chosen sigs...
somewhere at url_clbkqiipty .. ok, here we go:
url_clbkqiipty
(Warning... it's old, May number_rponnmzvyi ).
> Regards,
> Ulises
Hope this helps.
Cheers
Tony
_______________________________________________
Exmh-users mailing list
email_qmonrtrsxz
url_clbkqiipty |
In the meantime, I've implemented https://github.com/yoshuawuyts/newline-remove |
Did you try segmenting the text into sentences first and then tokenizing each sentence into words? |
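A minimal sketch of that two-step idea, assuming a deliberately naive segmenter that breaks on line breaks and on whitespace following ".", "!" or "?". All three function names are hypothetical and none of this is the library's API. It is only meant to show that separator lines such as "____" disappear once each segment is tokenized on its own.

```javascript
// Step 1 (naive assumption): a sentence ends at a line break, or at
// whitespace that follows ".", "!" or "?".
function segmentSentences(text) {
  return text
    .split(/(?<=[.!?])\s+|\n+/)
    .map((segment) => segment.trim())
    .filter((segment) => segment.length > 0);
}

// Step 2: split each sentence on runs of non-word characters OR underscores.
function tokenizeWords(sentence) {
  return sentence
    .toLowerCase()
    .split(/[\W_]+/)
    .filter((token) => token.length > 0);
}

// Combine both steps; segments that yield no tokens are dropped.
function tokenizeBySentence(text) {
  return segmentSentences(text)
    .map(tokenizeWords)
    .filter((tokens) => tokens.length > 0);
}

const mail = "Hope this helps.\nCheers\nTony\n____\nExmh-users mailing list";
console.log(tokenizeBySentence(mail));
```

A real sentence segmenter would also need to handle abbreviations and quoted text, which this sketch ignores.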
See my comment on the other issue - the v2.1.2 published version does not match up to what's on GitHub |
You're right. I fixed it now with a new patch. |
Thank you! |
cc @Hugo-ter-Doest
e.g. I'm left with tokens like "____________________________"
Pretty sure this affects all locales of stemming/tokenization as well
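To make the report concrete, here is a small sketch in plain JavaScript (not the library's actual source) of why this happens. `\w` is `[A-Za-z0-9_]`, so `\W` never matches an underscore and a split on `\W+` keeps underscore runs as tokens. The two helper names are hypothetical, and adding `_` to the character class is just one possible fix.

```javascript
// Hypothetical helper: split on runs of non-word characters only.
// "_" counts as a word character, so underscore runs survive as tokens.
function splitOnNonWord(text) {
  return text.split(/\W+/).filter((token) => token.length > 0);
}

// Hypothetical helper: treat "_" as a separator as well.
function splitOnNonWordOrUnderscore(text) {
  return text.split(/[\W_]+/).filter((token) => token.length > 0);
}

const sample = "Tony\n____________________________\nExmh-users mailing list";

console.log(splitOnNonWord(sample));
// keeps the "____________________________" token

console.log(splitOnNonWordOrUnderscore(sample));
// → [ 'Tony', 'Exmh', 'users', 'mailing', 'list' ]
```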