Create a linguistic parser #406

atimmer · 2019-11-20T16:28:24Z

Explanation

A TextContainer object should have a getTree method that returns a tree built from its text. This tree should be generated by a linguistic parser that knows how to split a text into sentences and words. The tree returned by getTree should consist of Sentence objects, each of which contains Word objects.

The Sentence object should contain the content of the sentence and its indexes relative to the text container.
The Word object should contain the content of the word and its indexes relative to the text container.

Better suggestions for the name of the linguistic parser are welcome. The linguistic parser should build on the code we already have, so the existing sentence parser can be reused. I've created an issue to track the removal of the HTML-specific code from the sentence parser.
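To make the described tree concrete, here is a minimal sketch of what the Sentence and Word objects could look like. The field names (`text`, `startIndex`, `endIndex`, `words`) and the helper `matchesSource` are assumptions made for illustration, as is the choice of an exclusive end index; none of them come from the existing code.

```typescript
// Hypothetical shapes for the tree nodes; field names are assumptions.
interface Word {
  text: string;       // content of the word
  startIndex: number; // inclusive offset within the container text
  endIndex: number;   // exclusive offset within the container text
}

interface Sentence {
  text: string;       // content of the sentence
  startIndex: number; // inclusive offset within the container text
  endIndex: number;   // exclusive offset within the container text
  words: Word[];      // the words found inside this sentence
}

// Hypothetical helper: a node is consistent when its indexes point back at its own content.
function matchesSource(
  containerText: string,
  node: { text: string; startIndex: number; endIndex: number }
): boolean {
  return containerText.slice(node.startIndex, node.endIndex) === node.text;
}

const sampleText = "Hello world. Bye.";

const firstSentence: Sentence = {
  text: "Hello world.",
  startIndex: 0,
  endIndex: 12,
  words: [
    { text: "Hello", startIndex: 0, endIndex: 5 },
    { text: "world", startIndex: 6, endIndex: 11 },
  ],
};
```

Keeping word indexes relative to the text container, rather than to the enclosing sentence, means every node can be mapped back to the original text with a single `slice` call.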

Tasks

Create a getTree method on the TextContainer class:
- Create a tokenizer for sentences based on the current tokenizer.
- Create a tokenizer for words based on the current tokenizer.
- Call the tokenizers in the getTree method.
- Return Sentence objects; each Sentence object contains Word objects.
Create unit tests (copy from the existing tokenizer):
- Create unit tests covering issue #402 (SentenceTokenizer incorrectly processes punctuation marks within words).
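The tasks above could be sketched roughly as follows. This is not the existing tokenizer: the splitting rules are deliberately naive stand-ins, and every name here (`Span`, `tokenizeSentences`, `tokenizeWords`, `pushTrimmed`, `EDGE_PUNCTUATION`) is an assumption for illustration. The one rule worth noting is that a sentence only ends at `.`, `!` or `?` when whitespace or the end of the text follows, which is one possible way to avoid splitting on punctuation inside a word.

```typescript
// Hypothetical node shapes; names are assumptions, not the project's API.
interface Span { text: string; startIndex: number; endIndex: number; }
type Word = Span;
interface Sentence extends Span { words: Word[]; }

// Push text[start, end) without surrounding whitespace, skipping empty results.
function pushTrimmed(text: string, start: number, end: number, out: Span[]): void {
  while (start < end && /\s/.test(text.charAt(start))) start++;
  while (end > start && /\s/.test(text.charAt(end - 1))) end--;
  if (end > start) {
    out.push({ text: text.slice(start, end), startIndex: start, endIndex: end });
  }
}

// Naive sentence tokenizer: ., ! or ? ends a sentence only when followed by
// whitespace or the end of the text, so "yoast.com" is not split.
function tokenizeSentences(text: string): Span[] {
  const spans: Span[] = [];
  const boundary = /[.!?]+(?=\s|$)/g;
  let start = 0;
  let match: RegExpExecArray | null;
  while ((match = boundary.exec(text)) !== null) {
    const end = match.index + match[0].length;
    pushTrimmed(text, start, end, spans);
    start = end;
  }
  pushTrimmed(text, start, text.length, spans); // trailing text without end punctuation
  return spans;
}

// Punctuation stripped from the edges of a word; punctuation inside a word is kept.
const EDGE_PUNCTUATION = ".,!?;:\"'()[]";

// Naive word tokenizer: whitespace-separated chunks with edge punctuation removed.
function tokenizeWords(sentence: Span): Word[] {
  const words: Word[] = [];
  const chunk = /\S+/g;
  let match: RegExpExecArray | null;
  while ((match = chunk.exec(sentence.text)) !== null) {
    let start = match.index;
    let end = start + match[0].length;
    while (start < end && EDGE_PUNCTUATION.indexOf(sentence.text.charAt(start)) !== -1) start++;
    while (end > start && EDGE_PUNCTUATION.indexOf(sentence.text.charAt(end - 1)) !== -1) end--;
    if (end > start) {
      words.push({
        text: sentence.text.slice(start, end),
        // Offsets are shifted so they stay relative to the whole container text.
        startIndex: sentence.startIndex + start,
        endIndex: sentence.startIndex + end,
      });
    }
  }
  return words;
}

class TextContainer {
  private readonly text: string;

  constructor(text: string) {
    this.text = text;
  }

  // Returns Sentence objects, each containing its Word objects.
  getTree(): Sentence[] {
    return tokenizeSentences(this.text).map((span) => ({
      text: span.text,
      startIndex: span.startIndex,
      endIndex: span.endIndex,
      words: tokenizeWords(span),
    }));
  }
}

const sampleInput = "Visit yoast.com today. It's free!";
const sampleTree = new TextContainer(sampleInput).getTree();
```

A unit test in the spirit of #402 could then assert that `sampleTree` has two sentences and that `yoast.com` and `It's` each come back as a single word. Real text needs more than this (abbreviations, quotes, languages without whitespace between words), which is where reusing the existing sentence tokenizer comes in.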

Technical decisions

The linguistic parser can use the current sentence tokenizer.

Feedback?

maartenleenders · 2019-12-12T10:15:16Z

I've pushed my work so far to 406-create-linguistic-parser (there isn't much yet).

Some clarifications I've gotten during my assignment:

The issue includes building the linguistic parser helper functions.
The issue includes building the Word and Sentence objects.

Remaining tasks added to the issue ☝️

manuelaugustin · 2020-01-22T10:50:52Z

Closed in favor of the following Jira issues: https://yoast.atlassian.net/browse/LIN-80?atlOrigin=eyJpIjoiNjhlOWIzYzgxNjZlNDJkNmIwMzgwMmQ3OTI4MDRhNTQiLCJwIjoiaiJ9 & https://yoast.atlassian.net/browse/LIN-164?atlOrigin=eyJpIjoiZTc5NzdhOGJlMTc0NGNjZWIyODA3NDdhMmUzMjAzMzYiLCJwIjoiaiJ9

atimmer added this to the StructuredTree milestone Nov 20, 2019

manuelaugustin added the 8 Story points label Nov 21, 2019

atimmer mentioned this issue Nov 22, 2019 in "Delete the current sentence & word tokenizers/parsers #405" (Open)

manuelaugustin added the component: parse tree label Dec 4, 2019

manuelaugustin assigned maartenleenders Dec 6, 2019

maartenleenders removed their assignment Dec 12, 2019

manuelaugustin self-assigned this Jan 16, 2020

manuelaugustin mentioned this issue Jan 21, 2020 in "LIN-80 Create sentence parser #459" (Merged, 3 tasks)

manuelaugustin removed their assignment Jan 22, 2020

igorschoester self-assigned this Jan 22, 2020

manuelaugustin closed this as completed Jan 22, 2020
