Refactor: separate parsing and output generation via events #11

tomtau · 2024-06-03T13:26:12Z

tomtau · 2024-06-15T01:05:58Z

For the typed API also:

rules should be somehow configurable in case someone wants to derive two parsers in the same module

Tartasprint · 2024-10-18T22:33:31Z

I glanced over quick-xml's event API, since it was mentioned in pest-parser/pest#885 (reply in thread), and it looks nice!

A question I have is what kinds of events users are intended to receive.
I would guess non-silent rules and node tags (but looking at the pest3 grammar of grammars, it seems they were abandoned).

I'll now try to see whether that choice can create ambiguity for the event listener.
A dumb example is a rule containing a choice whose two sides share a common prefix, such as:

dumb_choice = other_rule - "a" | other_rule - "b"

When parsing dumb_choice, the listener will receive an other_rule event, but won't know whether it is starting the left side or the right side of the choice expression. Yet it doesn't matter that much, since that prefix could be factored out of the choice.

A thing I noticed in quick-xml is that for rules that are nested there is a start event and a stop event. So for rules like:

operator = "+" | "-" | "/" | "*"

isolated events named operator would be fired, whereas rules like:

expression = "[" ~ [..omitted..] ~ operator ~ [..omitted..] ~ "]"

would fire a sequence similar to "start expression ....operator....end expression".
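
To make the two event shapes concrete, here is a minimal hand-written Rust sketch. It is not pest3's API: the `Event` enum, the `operator` and `expression` functions, and the simplified grammar `expression = "[" ~ operator ~ "]"` (with the omitted parts dropped) are all assumptions for illustration.

```rust
/// Hypothetical event type: leaf rules fire one isolated event,
/// rules containing other rules fire a start/end pair.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Start(&'static str),
    End(&'static str),
    Leaf(&'static str),
}

/// operator = "+" | "-" | "/" | "*"
/// Emits a single isolated event and returns the remaining input.
fn operator<'a>(input: &'a str, out: &mut Vec<Event>) -> Option<&'a str> {
    let rest = ["+", "-", "/", "*"]
        .iter()
        .find_map(|op| input.strip_prefix(*op))?;
    out.push(Event::Leaf("operator"));
    Some(rest)
}

/// Simplified stand-in: expression = "[" ~ operator ~ "]"
/// Emits "start expression ... operator ... end expression".
fn expression<'a>(input: &'a str, out: &mut Vec<Event>) -> Option<&'a str> {
    let checkpoint = out.len();
    out.push(Event::Start("expression"));
    let result = input
        .strip_prefix('[')
        .and_then(|rest| operator(rest, out))
        .and_then(|rest| rest.strip_prefix(']'));
    match result {
        Some(rest) => {
            out.push(Event::End("expression"));
            Some(rest)
        }
        None => {
            // Drop events recorded during the failed attempt.
            out.truncate(checkpoint);
            None
        }
    }
}
```

In this sketch, events from a failed attempt are rolled back, so the listener only ever sees events for rules that matched.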

But what about a recursive rule like:

separated = "0".."9" - ("," - separated)?

Here applying the previous suggestion looks horrible to me. Also, the behaviour is not well defined, since the rule sometimes has element nesting in it and sometimes not.

I will continue this analysis, and the questions about which decisions have been made, later.

tomtau · 2024-10-19T01:02:56Z

> I glanced over quick-xml's event API, since it was mentioned in pest-parser/pest#885 (reply in thread), and it looks nice!

Yes, it's the closest to what I had in mind. There's also PEGTL:
https://github.com/taocpp/PEGTL/blob/main/doc/Contrib-and-Examples.md#examples
https://github.com/taocpp/PEGTL/tree/main/src/example/pegtl

which looks interesting, but it's a bit different.

> A question I have is what kinds of events users are intended to receive.

Probably rule-level events if the rule matched (but I guess for a debugger etc., it'd probably need other events)

> I would guess non-silent rules and node tags (but looking at the pest3 grammar of grammars, it seems they were abandoned).

It wasn't clear whether to include those, but some thoughts:

- With meta-rules without parameters, there is an overlap with silent rules, but I guess the distinction can be that meta-rules will be expanded in the parse tree and won't produce events on their own, while silent rules will produce events (which the event processor can ignore and leave out of the AST if it chooses to; that's up to each use case's implementation).
- Tags aren't included yet, because it wasn't clear whether they'd be needed with the current typed AST. But @oovm had some more ideas for them in "Restoration of the pest3 work effort 🙌" (pest#885 (comment)), and I can imagine that e.g. group tags could be useful for user events. So I assume we can add tags to pest3's meta-grammar?

> When parsing dumb_choice, the listener will receive an other_rule event, but won't know whether it is starting the left side or the right side of the choice expression. Yet it doesn't matter that much, since that prefix could be factored out of the choice.

I guess for unlabelled choice branches, the event could include the branch index?
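
As a rough illustration of the branch-index idea, here is a small Rust sketch. Everything in it is hypothetical: the `Event` type, the toy definition `other_rule = "x"`, and the hand-written `dumb_choice` matcher are assumptions made only so the example is self-contained.

```rust
/// Hypothetical events: a matched rule, or a note about which
/// branch of an unlabelled choice was taken.
#[derive(Debug, PartialEq)]
enum Event {
    Rule(&'static str),
    Branch { rule: &'static str, index: usize },
}

/// Toy stand-in: other_rule = "x"
fn other_rule(input: &str) -> Option<&str> {
    input.strip_prefix("x")
}

/// dumb_choice = other_rule - "a" | other_rule - "b"
/// Branches are tried in order; events are only produced for the
/// branch that matched, tagged with its zero-based index.
fn dumb_choice(input: &str) -> Option<Vec<Event>> {
    for (index, suffix) in ["a", "b"].iter().enumerate() {
        let matched = other_rule(input).and_then(|rest| rest.strip_prefix(*suffix));
        if matched.is_some() {
            return Some(vec![
                Event::Branch { rule: "dumb_choice", index },
                Event::Rule("other_rule"),
            ]);
        }
    }
    None
}
```

With the branch event ahead of `other_rule`, the listener knows which side of the choice it is in before the shared prefix arrives. That ordering is only possible here because the sketch buffers events until a branch succeeds.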

> Here applying the previous suggestion looks horrible to me. Also, the behaviour is not well defined, since the rule sometimes has element nesting in it and sometimes not.

Why does it look horrible? In XML, one may also have nested recursive tags and the events will be fired in that order (I think?).

Anyway, this is all open to discussion and implementation on what would make most sense (I haven't thought about it in detail; a quick-xml-like API looked like it could work, but there may be cases that didn't occur to me where it's not nice).

Tartasprint · 2024-10-19T16:15:30Z

> Yes, it's the closest to what I had in mind. There's also PEGTL:

Reading their Getting Started page it looks like they have a tracer, which could be similar to events. But I don't remember enough about C++ to understand what is going on in there.

> Probably rule-level events if the rule matched (but I guess for a debugger etc., it'd probably need other events)

Agree. Maybe make a low-level tracer for debugging like in PEGTL. Maybe even such a tracer could be the main event generator.

I don't think an event generator needs to handle multiple listeners at the same time, so usage could look like:

// For a regular user:
for event in event_generator.clone() {
    ...
}

// For low-level stuff:
for event in event_generator.clone().low_level() {
    ...
}

// For a step-by-step VM:
let event = event_generator.next();
let low_level = event_generator.new_low_level();

Doing things this way would basically turn the parser into an iterable pest-VM: the regular iteration yields high-level events, and the low-level iteration yields more details, like every attempt/failure.
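
A minimal Rust sketch of that shape, under heavy assumptions: `EventGenerator`, `LowLevelEvent`, `RuleMatched`, and `sample_generator` are made-up names, and the "parser" is replaced by a pre-recorded trace so the example stays self-contained. Only `low_level()` from the usage above is modelled.

```rust
/// Fine-grained trace entries, as a debugger might want them.
#[derive(Debug, Clone, PartialEq)]
enum LowLevelEvent {
    Attempt(&'static str),
    Failure(&'static str),
    Match(&'static str),
}

/// High-level event: a rule matched.
#[derive(Debug, Clone, PartialEq)]
struct RuleMatched(&'static str);

/// Stand-in for the parser: a pre-recorded trace plus a cursor.
#[derive(Clone)]
struct EventGenerator {
    trace: Vec<LowLevelEvent>,
    pos: usize,
}

impl EventGenerator {
    /// Consume the generator and yield every remaining low-level entry.
    fn low_level(self) -> impl Iterator<Item = LowLevelEvent> {
        let EventGenerator { trace, pos } = self;
        trace.into_iter().skip(pos)
    }
}

/// Regular iteration only surfaces successful rule matches.
impl Iterator for EventGenerator {
    type Item = RuleMatched;

    fn next(&mut self) -> Option<RuleMatched> {
        while let Some(entry) = self.trace.get(self.pos).cloned() {
            self.pos += 1;
            if let LowLevelEvent::Match(rule) = entry {
                return Some(RuleMatched(rule));
            }
        }
        None
    }
}

/// Hypothetical trace: a failed attempt, then a successful parse.
fn sample_generator() -> EventGenerator {
    EventGenerator {
        trace: vec![
            LowLevelEvent::Attempt("operator"),
            LowLevelEvent::Failure("operator"),
            LowLevelEvent::Attempt("expression"),
            LowLevelEvent::Attempt("operator"),
            LowLevelEvent::Match("operator"),
            LowLevelEvent::Match("expression"),
        ],
        pos: 0,
    }
}
```

Because the high-level iterator is just a filter over the low-level trace, the low-level tracer can serve as the main event generator, and switching views mid-parse (the step-by-step case) only needs the shared cursor.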

> So I assume we can add tags to pest3's meta-grammar?

I am not sure about that. Indeed, it would be useful to add such tags for generating events, but if the event system is done right, maybe they would just be boilerplate? The examples given by oovm show tags indicating the start/end of rules (from what I understood; I may be wrong). I hope pest users won't need to add those things.

> the event could include the branch index?

The branch tags could be useful though, to make this more user-friendly.

> Why does it look horrible?

Rewriting that rule like this:

separated = "0".."9" - (separation - separated)?
separation = ","

is horrible because, to the listener, the events will look like:

start separated
"9"
separation
start separated
"8"
separation
start separated
....
"1"
stop separated
stop separated
...
stop separated

It would be nicer if it looked like:

start separated
"9"
separation
"8"
separation
...
"1"
stop separated

How to get there? I don't know 😅.

Would it be worth my experimenting now with a low-level "tracer"? From there, if it goes well, it would be possible to experiment with higher-level events.

tomtau · 2025-01-19T00:27:55Z

@Tartasprint @TheVeryDarkness one possible alternative output format would be to process the parsing events into cstree. That could be a concrete use case to motivate this refactoring: one could choose between outputting the current typed API and cstree.
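
To sketch what "choosing the output" could mean, here is a small Rust example that builds an untyped tree from a start/end event stream. It does not use cstree's actual API; the `Event` and `Tree` types and the `build_tree` function are hypothetical stand-ins, and a typed-API consumer would be a second function over the same events.

```rust
/// Hypothetical parse events, in the quick-xml-like style discussed above.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Start(&'static str),
    End,
    Token(&'static str),
}

/// A minimal untyped tree, standing in for a concrete syntax tree.
#[derive(Debug, PartialEq)]
enum Tree {
    Node { rule: &'static str, children: Vec<Tree> },
    Token(&'static str),
}

/// One possible event consumer: fold the stream into a tree.
/// Returns None if the stream is unbalanced or has no root node.
fn build_tree(events: impl IntoIterator<Item = Event>) -> Option<Tree> {
    // Stack of currently open nodes and the children collected so far.
    let mut stack: Vec<(&'static str, Vec<Tree>)> = Vec::new();
    for event in events {
        match event {
            Event::Start(rule) => stack.push((rule, Vec::new())),
            Event::Token(text) => stack.last_mut()?.1.push(Tree::Token(text)),
            Event::End => {
                let (rule, children) = stack.pop()?;
                let node = Tree::Node { rule, children };
                match stack.last_mut() {
                    Some((_, siblings)) => siblings.push(node),
                    // Closing the outermost node finishes the tree.
                    None => return Some(node),
                }
            }
        }
    }
    None
}
```

The parser side never sees `Tree`, so swapping in a different consumer would not touch the parsing code, which is the separation this issue is about.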

@Tartasprint for that event "separation" vs "separated", maybe it could pass some kind of call depth/trace?
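
For example, a listener-side adapter could track the open-rule stack and collapse a rule that is nested directly inside itself. The Rust sketch below is only an illustration under assumed names (`Event`, `flatten`); if the generator passed call depth or a trace directly, this bookkeeping could live there instead.

```rust
/// Hypothetical events: start/stop pairs for nesting rules,
/// `Leaf` for literals and isolated rule events.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Start(&'static str),
    Stop(&'static str),
    Leaf(&'static str),
}

/// Collapse start/stop pairs of a rule that is opened directly
/// inside another instance of the same rule.
fn flatten(events: impl IntoIterator<Item = Event>) -> Vec<Event> {
    // Open rules, each with a flag: was its Start passed through?
    let mut open: Vec<(&'static str, bool)> = Vec::new();
    let mut out = Vec::new();
    for event in events {
        match event {
            Event::Start(rule) => {
                let nested_in_itself = open.last().map_or(false, |(top, _)| *top == rule);
                if !nested_in_itself {
                    out.push(Event::Start(rule));
                }
                open.push((rule, !nested_in_itself));
            }
            Event::Stop(rule) => {
                // Only emit the Stop if the matching Start was emitted.
                if let Some((_, emitted)) = open.pop() {
                    if emitted {
                        out.push(Event::Stop(rule));
                    }
                }
            }
            leaf => out.push(leaf),
        }
    }
    out
}
```

Fed the nested `separated` trace from the previous comment, this yields a single `start separated ... stop separated` pair with the digits and `separation` events in between.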

tomtau added this to pest3 near-term Jun 3, 2024

tomtau mentioned this issue Jun 3, 2024

Benchmark results #10

Closed

tomtau moved this to Todo in pest3 near-term Jun 3, 2024

tomtau added the "help wanted" (Extra attention is needed) label Jun 3, 2024
