tokenize phrasal-verbs #473

ayman-ibrahim · 2018-11-14T21:56:30Z

Is there a way to tokenize a sentence while taking phrasal verbs into consideration?
Example:

"The flight take off at three o'clock"

output should be:
[the, flight, take off, at, three, o'clock]

"take off" should be tokenized as one word.

Hugo-ter-Doest · 2018-11-14T22:11:18Z

Imho that is not what tokenization is meant for. Tokenization splits a text into words (and punctuation, if necessary), and "take off" consists of two words. Combining them into a phrasal verb requires partial parsing or chunking.

ayman-ibrahim · 2018-11-14T22:16:05Z

@Hugo-ter-Doest
Ok, do you know if there's a way to combine phrasal verbs in the natural library?

Hugo-ter-Doest · 2018-11-14T22:23:05Z

It's not in natural yet, but I'm working on it so that I can use it for named entity recognition. You can preview the CYK and Earley parsers in this branch:
https://github.com/Hugo-ter-Doest/natural/tree/NER/

The parsers are in lib/natural/parsers.
A chunker based on the Earley parser is in lib/natural/NER.

Feel free to use it already, but it may still change.

ayman-ibrahim · 2018-11-14T22:26:03Z

cool, I'll have a look.
Thanks.

lazharichir · 2018-11-20T22:37:09Z

You could tokenize your sentence, tag each token's part of speech, and then find patterns. For example, VERB + DET or VERB + PREPOSITION. I use that to find noun phrases (JJ|NN+).
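
A minimal sketch of that idea in plain JavaScript, applied to the example sentence from the question. It does not use natural's API: `TOY_LEXICON`, `tokenizeSimple`, `tagTokens` and `mergeVerbParticle` are hypothetical names, the lexicon is hand-written for this one sentence, and tagging "off" as a particle (`PART`) rather than a preposition is an assumption. A real pipeline would use a trained part-of-speech tagger.

```javascript
// Toy part-of-speech lexicon (assumption, for illustration only).
const TOY_LEXICON = {
  the: 'DET',
  flight: 'NOUN',
  take: 'VERB',
  off: 'PART',
  at: 'PREP',
  three: 'NUM',
  "o'clock": 'NOUN',
};

// Naive tokenizer: lowercase and split on whitespace.
function tokenizeSimple(sentence) {
  return sentence.toLowerCase().split(/\s+/).filter(Boolean);
}

// Look each token up in the toy lexicon; unknown words get 'UNK'.
function tagTokens(tokens) {
  return tokens.map((token) => ({ token, tag: TOY_LEXICON[token] || 'UNK' }));
}

// Merge every VERB that is immediately followed by a PART into one token.
function mergeVerbParticle(tagged) {
  const merged = [];
  for (let i = 0; i < tagged.length; i++) {
    const current = tagged[i];
    const next = tagged[i + 1];
    if (next && current.tag === 'VERB' && next.tag === 'PART') {
      merged.push(`${current.token} ${next.token}`);
      i++; // skip the particle, it is already part of the merged token
    } else {
      merged.push(current.token);
    }
  }
  return merged;
}

const phrasalTokens = mergeVerbParticle(
  tagTokens(tokenizeSimple("The flight take off at three o'clock"))
);
console.log(phrasalTokens);
```

Merging only adjacent pairs is a simplification: it will miss separated phrasal verbs such as "take the jacket off", and a plain VERB + PREPOSITION rule would also merge ordinary verb-preposition sequences that are not phrasal verbs.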

privateOmega · 2019-01-07T11:03:22Z

@Hugo-ter-Doest Do you have a set timeline as to when you would be able to integrate the code into Natural's codebase?

lazharichir · 2019-04-11T09:14:38Z

For now, you can implement that with some form of pattern matching (as spaCy's matcher does): walk the array of tokens and look for whatever patterns you need (e.g. NOUN followed by PREP, or a run of NOUNs/ADJs followed by PREP, etc.).

You can look at spaCy's code (Python) and port it to Node and Natural's token structure: https://github.com/explosion/spaCy/tree/master/spacy/matcher
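
As a rough starting point, walking the token array could look like the sketch below. This is not a port of spaCy's matcher and not part of natural: `matchAt`, `findMatches`, the `{ tags, repeat }` pattern shape and the hand-tagged sample input are all assumptions made for illustration. The noun-phrase pattern reads the earlier "(JJ|NN+)" as "one or more adjective or noun tags", which is an interpretation.

```javascript
// A pattern is a list of steps. Each step accepts a set of tags;
// `repeat: true` means "one or more", otherwise exactly one token.
// Matching is greedy with no backtracking, so adjacent steps should
// not accept the same tags.
function matchAt(tagged, start, pattern) {
  let pos = start;
  for (const step of pattern) {
    let count = 0;
    while (pos < tagged.length && step.tags.includes(tagged[pos].tag)) {
      pos++;
      count++;
      if (!step.repeat) break;
    }
    if (count === 0) return -1; // this step matched nothing
  }
  return pos; // index just past the match
}

// Scan left to right and collect non-overlapping matches as strings.
function findMatches(tagged, pattern) {
  const matches = [];
  let i = 0;
  while (i < tagged.length) {
    const end = matchAt(tagged, i, pattern);
    if (end > i) {
      matches.push(tagged.slice(i, end).map((t) => t.token).join(' '));
      i = end;
    } else {
      i++;
    }
  }
  return matches;
}

// Hand-tagged sample input (Penn Treebank style tags, assumed here).
const sampleTagged = [
  { token: 'the', tag: 'DT' },
  { token: 'quick', tag: 'JJ' },
  { token: 'brown', tag: 'JJ' },
  { token: 'fox', tag: 'NN' },
  { token: 'jumps', tag: 'VBZ' },
  { token: 'over', tag: 'IN' },
  { token: 'the', tag: 'DT' },
  { token: 'lazy', tag: 'JJ' },
  { token: 'dog', tag: 'NN' },
];

const nounPhrases = findMatches(sampleTagged, [{ tags: ['JJ', 'NN'], repeat: true }]);
const verbPrep = findMatches(sampleTagged, [{ tags: ['VBZ'] }, { tags: ['IN'] }]);
console.log(nounPhrases, verbPrep);
```

Keeping patterns as plain data makes it easy to add new ones (for example a verb followed by a particle) without touching the matching loop.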

Hugo-ter-Doest mentioned this issue Dec 7, 2018

Phrase tagging aka Chunking with Natural #446

Open

Hugo-ter-Doest added the Feature Request label Dec 8, 2018
