tokenize phrasal-verbs #473
Comments
IMHO that is not what tokenization is meant for. Tokenization splits a text into words (and punctuation, if necessary), and "take off" consists of two words. Combining them into a phrasal verb requires partial parsing or chunking. |
@Hugo-ter-Doest |
It's not in natural yet, but I'm working on it so that I can use it for named entity recognition. You can preview the CYK and Earley parsers here in this branch: the parsers are in lib/natural/parsers. Feel free to use them already, but they may still change. |
cool, I'll have a look. |
You could tokenize your sentence, tag each token's part of speech, and then find patterns. For example, VERB + DET or VERB + PREPOSITION. I use that to find noun phrases (JJ|NN+). |
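A minimal sketch of that idea, assuming the tokens have already been tagged. The tags below are written by hand in Penn Treebank style, and `mergeVerbParticle` and the `{ token, tag }` shape are hypothetical names for illustration, not part of Natural's API:

```javascript
// Hand-tagged input; in practice the tags would come from a POS tagger.
const tagged = [
  { token: "The", tag: "DT" },
  { token: "flight", tag: "NN" },
  { token: "take", tag: "VB" },
  { token: "off", tag: "RP" },
  { token: "at", tag: "IN" },
  { token: "three", tag: "CD" },
  { token: "o'clock", tag: "NN" },
];

// Merge a verb with an immediately following particle or preposition.
function mergeVerbParticle(taggedTokens, particleTags = new Set(["RP", "IN"])) {
  const out = [];
  let i = 0;
  while (i < taggedTokens.length) {
    const cur = taggedTokens[i];
    const next = taggedTokens[i + 1];
    if (cur.tag.startsWith("VB") && next && particleTags.has(next.tag)) {
      out.push(cur.token + " " + next.token); // e.g. "take off"
      i += 2; // skip the particle that was just merged
    } else {
      out.push(cur.token);
      i += 1;
    }
  }
  return out;
}

const merged = mergeVerbParticle(tagged);
console.log(merged);
```

A pure tag pattern like this will also merge ordinary verb + preposition pairs such as "look at", so it would probably need a list of known phrasal verbs to filter the candidates.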
@Hugo-ter-Doest Do you have a set timeline as to when you would be able to integrate the code into Natural's codebase? |
For now, you can implement that with some sort of pattern matching (as spaCy does): walk the array of tokens and find whatever patterns you are looking for (e.g. NOUN followed by PREP, or any number of NOUNs/ADJs followed by PREP, etc). You can look at spaCy's matcher code (Python) and port it to Node and Natural's token structure: https://github.com/explosion/spaCy/tree/master/spacy/matcher |
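Walking the token array for such patterns could look roughly like the sketch below. It is not a port of spaCy's matcher and not Natural's API: `matchAt`, `findMatches`, the `{ tags, oneOrMore }` pattern format and the tag names are all assumptions made up for this example.

```javascript
// Try to match pattern[p..] starting at tokens[i].
// Returns the end index (exclusive) of the match, or -1 if there is none.
function matchAt(tokens, i, pattern, p) {
  if (p === pattern.length) return i; // whole pattern consumed
  const step = pattern[p];
  if (i >= tokens.length || !step.tags.includes(tokens[i].tag)) return -1;
  if (step.oneOrMore) {
    // Greedy: first try to consume another token with the same step.
    const longer = matchAt(tokens, i + 1, pattern, p);
    if (longer !== -1) return longer;
  }
  return matchAt(tokens, i + 1, pattern, p + 1);
}

// Collect all non-overlapping matches, scanning left to right.
function findMatches(tokens, pattern) {
  if (pattern.length === 0) return [];
  const matches = [];
  let i = 0;
  while (i < tokens.length) {
    const end = matchAt(tokens, i, pattern, 0);
    if (end === -1) {
      i += 1;
      continue;
    }
    matches.push(tokens.slice(i, end).map((t) => t.token).join(" "));
    i = end;
  }
  return matches;
}

// "As many NOUNs/ADJs followed by PREP", on hand-tagged tokens.
const sentence = [
  { token: "the", tag: "DET" },
  { token: "big", tag: "ADJ" },
  { token: "red", tag: "ADJ" },
  { token: "house", tag: "NOUN" },
  { token: "on", tag: "PREP" },
  { token: "the", tag: "DET" },
  { token: "hill", tag: "NOUN" },
];
const pattern = [{ tags: ["ADJ", "NOUN"], oneOrMore: true }, { tags: ["PREP"] }];
const found = findMatches(sentence, pattern);
console.log(found);
```

The same function should cover the phrasal-verb case with a pattern such as a verb tag followed by a particle tag, given a tagger that produces those tags.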
Is there a way to tokenize a sentence that takes phrasal verbs into consideration?
Example:
"The flight take off at three o'clock"
Output should be:
[the, flight, take off, at, three, o'clock]
"take off" should be tokenized as a single token.