duplicate matches in one sentence #300
whoa! thank you Robin. That certainly looks like a bug! I'll look at fixing it this week. Same thing for markdown and latex output, too. 😬
Looking at the code that generates the html output: while parsing the input string, it needs to remember the position (character offset) where each match occurred in the original text, so that duplicate matches can be told apart.
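That position-tracking idea can be sketched with the offset argument that `String.prototype.replace` passes to its callback (a toy sketch, not the library's actual implementation):

```js
// Toy sketch: record each match's character offset so that two identical
// matches in one sentence can be told apart later.
const input = 'the [[cat]] sat beside the [[cat]]'
const links = []
const text = input.replace(/\[\[(.+?)\]\]/g, (match, page, offset) => {
  // the third callback argument is the match's offset in the input
  links.push({ page, offset })
  return page
})
console.log(text)  // 'the cat sat beside the cat'
console.log(links) // [ { page: 'cat', offset: 4 }, { page: 'cat', offset: 27 } ]
```

The caveat is that these offsets index into the original string, so any later pass that rewrites the text would have to keep them in sync.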
right. I had avoided using character offsets, because the text keeps changing as it gets parsed. But yeah, it's clear that some design changes need to come. I'd like to separate the html/markdown/latex outputs into separate projects altogether. My initial goal with this library was just to get json out of a page, but it's become obvious that rendering stuff is a big part of the project, and needs to be done properly. You're free to pursue any solution for this. I'll merge anything!
Yep, it looks complicated. If continuing with this approach, the following steps would need to fix all the offsets generated by previous steps whenever they change the text. An alternative approach would be to parse the input string into an abstract syntax tree, and then to define transformations from the AST to the other data formats (text, json, html, markdown, etc).

Checking for duplicates and writing down which n-th occurrence it is will probably not work, because there are too many edge cases with the other transformation steps. To implement that correctly you'd need access to the already completely transformed string before the current match.

Not rendering seems better than rendering the wrong things, but the same condition holds: to implement this correctly you'd need access to the already completely transformed string before the match. Otherwise there will be over-blocking (not rendering things that are actually fine currently), which might still be better than the current situation.

Counter-example: another issue that just came to my mind while thinking about the current order of transformations (links, bold, italic):

```js
wtf("[[Page 1|'''link''']]").sentences(0).html()
```

Produces:

```html
<span class="sentence"><b>link</b></span>
```

Instead of:

```html
<span class="sentence"><a class="link" href="./Page_1"><b>link</b></a></span>
```

because the link parser trips over the quote markup inside the display text.
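The n-th-occurrence hazard described above can be reproduced in a toy form (an illustrative sketch, not wtf_wikipedia's actual code): a plain text-based replace always hits the first occurrence, even when the parser meant the second.

```js
// Toy sketch of the duplicate-match hazard: String.replace with a plain
// string argument replaces the FIRST occurrence, regardless of which
// occurrence the earlier parsing pass actually matched.
const sentence = 'a plain link and a [[link]]'
// pretend an earlier pass stripped the brackets and noted that "link"
// (the second occurrence) should become an anchor:
const stripped = sentence.replace('[[link]]', 'link')
const html = stripped.replace('link', '<a class="link" href="./Link">link</a>')
console.log(html)
// 'a plain <a class="link" href="./Link">link</a> and a link'  ← wrong occurrence
```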
yeah, thanks Robin. I understand your point. I guess since my motivation for this library has been to get data out of wikipedia, and it's doing that fine in this case, I can't see myself dramatically re-writing things in such a way. If there were cases where we couldn't parse the document without a proper AST interpreter, I'd probably be more keen.

Maybe I should review the concepts and state-of-the-art tooling, but I feel like if you generated an AST, you're still just doing all sorts of look-ahead/look-behind checks, loops, nested conditions, etc. I mean, for parsing a table, you still have to know where it begins and ends.

Maybe I gotta stew on this issue a little more. The real problem is having two identical things within a sentence. Maybe we should operate on a smaller unit. Maybe we should parse individual words, which would perhaps resolve this. Maybe having two identical links in one sentence is so niche that we can ignore it. Maybe html output is something we should have an asterisk on, admitting it can't be done very well. There have been other clever proposals to add some token structured data back into the string, like in #201.

cheers
created another issue for the counter-example above.
Speaking of parsing on a word level, the parsing on the sentence level isn't optimal either. Here is a new bug for you :P

```js
const parsed = wtf("'''A normal sentence. And a second sentence.'''")
parsed.sentences(0).html()
parsed.sentences(1).html()
```

Resulting in:

```html
<span class="sentence">'''A normal sentence.</span>
<span class="sentence">And a second sentence.'''</span>
```

Expected (or similar):

```html
<span class="sentence"><b>A normal sentence.</b></span>
<span class="sentence"><b>And a second sentence.</b></span>
```
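As a thought experiment, a helper that re-balances `'''` markers at sentence boundaries could patch this particular symptom (a hypothetical sketch under my own assumptions, not an existing wtf_wikipedia function):

```js
// Hypothetical helper (not part of wtf_wikipedia): make each sentence
// self-contained by closing and re-opening unbalanced ''' bold markers
// at sentence boundaries.
function balanceBold(sentences) {
  let open = false
  return sentences.map((s) => {
    let out = (open ? "'''" : '') + s
    // count ''' markers, including the one we may have just prepended
    const marks = (s.match(/'''/g) || []).length + (open ? 1 : 0)
    if (marks % 2 === 1) {
      out += "'''" // bold is still open at the sentence boundary
      open = true
    } else {
      open = false
    }
    return out
  })
}

console.log(balanceBold(["'''A normal sentence.", "And a second sentence.'''"]))
// [ "'''A normal sentence.'''", "'''And a second sentence.'''" ]
```

Each rebalanced sentence can then be run through the normal bold/italic rendering on its own.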
Token structure proposal: generating a token structure (syntactic analysis) is the first step of parsing, so it can't harm, but without a semantic analysis it won't help much with nested structures, I think. I haven't looked at issue #201 in detail, but there the tokens are concatenated into the text and need to be str_replaced, instead of living in a separate data structure. That might cause problems when the following steps need to operate inside the text of the tokens, or if they don't detect the tokens correctly and interpret them as text or interpret text as tokens (the last point is very unlikely).

Just for fun, because I find this problem interesting and I like to program things, I hacked together a little parser that generates an AST data structure on a sentence level only (so it only supports anchors, bolds and italics). It's far from perfect, doesn't support all the features that your parser offers, should be optimized and will probably fail horribly with invalid input data, but it avoids a lot of the potential parsing issues that happen with the ordered-replace parsing, and it has a lot of examples as unit tests.

It basically boils down to this regexp to tokenize the string:

```js
/('''''|''''|'''|''|<\/?[bi]>|\[\[|\]\]|\|)/g
```

And the following grammar (EBNF):

```
Sentence      = Token, { Token } ;
Token         = Text | Link | BoldAndItalic | Bold | Italic ;
Link          = "[[", { Token }, [ "|", { Token } ], "]]" ;
BoldAndItalic = "'''''", { Token }, "'''''" ;
Bold          = BoldTag | BoldQuote ;
BoldTag       = "<b>", { Token }, "</b>" ;
BoldQuote     = "'''", { Token }, "'''" ;
Italic        = ItalicTag | ItalicQuote ;
ItalicTag     = "<i>", { Token }, "</i>" ;
ItalicQuote   = "''", { Token }, "''" ;
Text          = ? all text tokens from tokenization step ? ;
```

Example:

```js
const { compile, parse } = require('./parser')

const input = "pre [[link|pre ''italic'' post]] post"
const tokens = compile(input)
const ast = parse(tokens)

console.log(tokens)
console.log(ast)
console.log(ast.toString())
console.log(ast.toText())
console.log(ast.toHTML())
console.log(ast.toMarkdown())
```

Output:
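The tokenization step can be sketched directly from the regexp above: `String.split` keeps the delimiters when the pattern has a capturing group. (The function name here is my own, not necessarily the `compile` from the linked parser.)

```js
// Tokenizer sketch for the grammar above. Splitting on a capturing
// group keeps the delimiter tokens in the result; empty pieces between
// adjacent delimiters are filtered out.
const TOKEN_RE = /('''''|''''|'''|''|<\/?[bi]>|\[\[|\]\]|\|)/g

function tokenize(input) {
  return input.split(TOKEN_RE).filter((piece) => piece !== '')
}

console.log(tokenize("pre [[link|''italic'']] post"))
// [ 'pre ', '[[', 'link', '|', "''", 'italic', "''", ']]', ' post' ]
```

A recursive-descent parser can then consume this flat token list according to the grammar, producing the nested AST.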
hey cool!