Add remove emoji #111

Closed
rzsgrt opened this issue Jul 21, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

@rzsgrt

rzsgrt commented Jul 21, 2020

Hi, do you think it would be useful to have a remove-emoticon function in preprocessing? It's on top of this one. I would be happy to help with it.

@jbesomi
Owner

jbesomi commented Jul 22, 2020

Hey @rezasugiarto, thank you for opening an issue! 🎉

It sounds like a good idea. Can you please provide us with an example with a sample code? We will need to make sure the solution is fast enough and correct.

If you haven't done so yet, please read CONTRIBUTING.md.

Also, you might find this python toolkit useful: python-ftfy.

Regards,

@polvoazul

I have code that does this! It is not very fast, but it works. I also have code that tokenizes emails and telephone numbers. Would you be interested in a PR to include these?

@rzsgrt
Author

rzsgrt commented Jul 31, 2020

Hi, sorry, I haven't done it yet. Maybe we can collaborate on it, @polvoazul?

@jbesomi
Owner

jbesomi commented Jul 31, 2020

Hey @polvoazul, thank you for your comment.

Can you give us further information regarding your code? Yes, it might be useful.

Regarding tokenizing emails and telephone numbers, for now we are simply using a regular expression, but we are considering switching to spaCy (#131). What is your solution, and how does it work?
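
To make the regular-expression approach concrete, here is a minimal sketch of what such a replacement could look like. The patterns, the placeholder tokens, and the function name `replace_contacts` are assumptions for illustration only; they are not the patterns the library actually uses, and they will miss many valid formats.

```python
import re

# Hypothetical, deliberately simple patterns (assumptions, not the library's own).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# A digit, then at least six digits/separators, then a closing digit.
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def replace_contacts(text: str) -> str:
    """Replace emails and phone-like digit runs with placeholder tokens."""
    # Emails first, so digits inside an address are not treated as a phone number.
    text = EMAIL_PATTERN.sub("EMAIL", text)
    return PHONE_PATTERN.sub("PHONE", text)


print(replace_contacts("mail jane.doe@example.com or call +1 555-123-4567"))
```

A regex like this is fast and dependency-free, but it cannot use context the way a trained tokenizer can, which is the trade-off behind the spaCy discussion.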

@rzsgrt
Author

rzsgrt commented Jul 31, 2020

Hi guys

This is a snippet from what I do in my other project:

import emoji

def remove_emoji(text: str) -> str:
    # Delete every match of the emoji package's compiled regex.
    return emoji.get_emoji_regexp().sub("", text)


What do you guys think?

@henrifroese
Collaborator

I believe adding the emoji package as a new dependency might be unnecessary. Your sample code only uses the regex returned by emoji.get_emoji_regexp(). Why not inspect that regex locally once and copy it into our code, so the library doesn't need to import the package? That should work.
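
A rough sketch of that dependency-free idea, using only the standard `re` module. The Unicode ranges below are a hand-picked, non-exhaustive assumption and are not the regex generated by the emoji package, which covers many more sequences. As far as I know, `get_emoji_regexp()` was also removed in emoji 2.0, which is one more reason to keep a local pattern.

```python
import re

# Hypothetical, non-exhaustive set of emoji code point ranges (an assumption,
# not the pattern produced by the emoji package).
_EMOJI_RANGES = (
    "\U0001F300-\U0001F5FF"  # symbols and pictographs (includes skin tones)
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
)

# An emoji, optionally followed by variation selector-16 or a zero-width
# joiner, repeated so that joined sequences are removed as one match.
EMOJI_PATTERN = re.compile(f"(?:[{_EMOJI_RANGES}][\uFE0F\u200D]*)+")


def remove_emoji(text: str) -> str:
    """Remove characters matched by the local EMOJI_PATTERN."""
    return EMOJI_PATTERN.sub("", text)


print(remove_emoji("thank you for opening an issue! \U0001F389"))
```

The pattern is compiled once at import time, so calls to `remove_emoji` only pay for the substitution. The downside of a copied pattern is that it has to be refreshed by hand when new emoji are added to Unicode.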

@rzsgrt
Author

rzsgrt commented Aug 5, 2020

Ohh, I think I missed it. That would be perfect, I guess.

rzsgrt closed this as completed Apr 23, 2022

4 participants