-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement Pratt parsing #614
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this looks great. You put a lot of work into this and I see care is taken in cases like the handling of infinite loops and such
src/combinator/precedence.rs
Outdated
#[doc(alias = "shunting yard")] | ||
#[doc(alias = "precedence climbing")] | ||
#[inline(always)] | ||
pub fn precedence<I, ParseOperand, Operators, Operand: 'static, E>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like names to make their role in the grammar clear. I'm trying to decide how this does and if there is anything better. We use separated
for this kind of thing but unsure how to tie it in without it being overly verbose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe expression
? Similar to how combine
names this parser https://docs.rs/combine-language/latest/combine_language/fn.expression_parser.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh interesting, I wasn't aware of that crate. We should at least add expression_parser
to our aliases.
We likely should pick one and deal with it and rename it if a better name comes up. expression
seems just as good as any other. It fits with how this would be used in a language grammar.
src/combinator/precedence.rs
Outdated
}; | ||
|
||
/// An adapter for the [`Parser`] trait to enable its use in the [`precedence`] parser. | ||
pub trait PrecedenceParserExt<I, E> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our rule of thumb is grammar-level concepts are standalone functions and value processing are trait functions. These seem grammar-level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing! Do you suggest making prefix
, infix
.. standalone functions? I followed the concern in the attached issue to make <parser>.prefix(...)
work:
So I assume we'd go with
let calc = pratt( digits1.map(Expr::Int), ( '-'.prefix(Right(1), |r| unary(r, Op::Neg)); '+'.infix(Left(0), |l, r| binary(l, r, Op::Add)); '!'.prefix(Right(3), |r| unary(r, Op::Fact)); ) );
These prefix
, infix
, postfix
are strange beasts in this implementation. They are both grammar-level constructs and value processing functions because the data processing is required by the parser to advance further.
I though about the interface like prefix(0, "-").map(|(a, b)| a - b)
. It looks clean but what should be the default operator
when we don't call map
. For a unary
it is easy -- just return the argument, but for a binary
it is unclear how to merge operands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha! I previously looked at this and viewed the associativity as not grammar level and now looked at it and viewed it as grammar level. Normally trait methods are reserved for purely operating on the data, like map
. This is actually changing how we parse the results. The one weird middle ground we have is cut_err
. Unsure if this should be treated like that or not.
I also feel like a bunch of free functions floating around might not be the easiest for discovery and use.
I also suspect the ultimate answer will be dependent on the answer to other questions as I alluded to the interplay of my comments in #614 (comment)
src/combinator/precedence.rs
Outdated
#[doc(alias = "shunting yard")] | ||
#[doc(alias = "precedence climbing")] | ||
#[inline(always)] | ||
pub fn precedence<I, ParseOperand, Operators, Operand: 'static, E>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The list of parsers will need to be updated
src/combinator/precedence.rs
Outdated
#[doc(alias = "shunting yard")] | ||
#[doc(alias = "precedence climbing")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know if aliases with spaces works
src/combinator/precedence.rs
Outdated
/// Type-erased unary predicate that folds an expression into a new expression. | ||
/// Useful for supporting not only closures but also arbitrary types as operator predicates within the [`precedence`] parser. | ||
pub trait UnaryOp<O> { | ||
/// Invokes the [`UnaryOp`] predicate. | ||
fn fold_unary(&mut self, o: O) -> O; | ||
} | ||
/// Type-erased binary predicate that folds two expressions into a new expression similar to | ||
/// [`UnaryOp`] within the [`precedence`] parser. | ||
pub trait BinaryOp<O> { | ||
/// Invokes the [`BinaryOp`] predicate. | ||
fn fold_binary(&mut self, lhs: O, rhs: O) -> O; | ||
} | ||
|
||
impl<O, F> UnaryOp<O> for F | ||
where | ||
F: Fn(O) -> O, | ||
{ | ||
#[inline(always)] | ||
fn fold_unary(&mut self, o: O) -> O { | ||
(self)(o) | ||
} | ||
} | ||
impl<O, F> BinaryOp<O> for F | ||
where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to prefer that impls be closely organized with their trait
src/combinator/mod.rs
Outdated
@@ -174,6 +175,7 @@ pub use self::core::*; | |||
pub use self::debug::*; | |||
pub use self::multi::*; | |||
pub use self::parser::*; | |||
pub use self::precedence::*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are dumping a lot of stray types into combinator
. The single-line summaries should make it very easy to tell they are related to precedence
(maybe be the first word) and somehow help the user know what, if any, they should care about
src/combinator/precedence.rs
Outdated
} | ||
} | ||
|
||
macro_rules! impl_parser_for_tuple { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we find the right "mode" by falling through an alt
of parsers for that mode and then get the weight and fold function from the first success.
Looking at the example
precedence(
digit1.try_map(|d: &str| d.parse::<i32>()),
(
"-".value(2).prefix(|x| -1 * x),
"+".value(2).prefix(|x| x),
"!".value(2).postfix(|x| factorial(x)),
"+".value(0).infix(|a, b| a + b),
"-".value(0).infix(|a, b| a + b),
"*".value(1).infix(|a, b| a * b),
"/".value(1).infix(|a, b| a / b),
),
)
.parse_next(i)
the alt seems unfortunate for performance reasons. dispatch
can be used in this case and I suspect a lot of other cases. It would be a big help to find a way to set this up so dispatch
can be used as well, e.g.
precedence(
digit1.try_map(|d: &str| d.parse::<i32>()),
dispatch!{any;
'-' => prefix(2, |x| -1 * x),
'+' => prefix(2, |x| x),
'!' => postfix(2, |x| factorial(x)),
'+' => infix(0, |a, b| a + b),
'-' => infix(0, |a, b| a + b),
'*' => infix(1, |a, b| a * b),
'/' => infix(1, |a, b| a / b),
_ => fail,
),
)
.parse_next(i)
(note: I might give other suggestions that run counter to this, we'll need to weigh out each one and figure out how we want to balance the different interests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea. The implementation in nom
rust-bakery/nom#1362 uses 3 parsers one for each affix in the function signature. It has an advantage that you can use dispatch
and there is no prefix
, postfix
.. functions. The downside is that from just the function signature it's hard to remember which parser is which. And also fail
is used if the parser is not provided.
Maybe abuilder
pattern would be ideal e.g:
precedence(digit1.parse_to::<i32>())
.prefix(dispatch!{any;
"+" => empty.map(|_| move |a, b| a + b)
"-" => empty.map(|_| move |a, b| a + b)
})
.postfix(dispatch!{any; ...})
.infix(dispatch!{any; ...})
.build()
.parse_next(i)
Methods are similar to repeat().fold
. This approach would eliminate stray functions, reduce the performance impact of iterating over each parser for each affix, and allow users to use dispatch or any custom logic they want.
I’m not sure yet how this would interact with binding power. I'm going to research whether we can drop the explicit index argument and rely on array indexing, how binding power interact between affixes .etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came across another idea. Instead of dispathing by the parser type maybe we could dispatch by the input type:
enum Input<'i, I>{
Prefix(&'i mut I),
Postfix(&'i mut I),
Infix(&'i mut I)
}
then it would be possible to (I'm not sure yet if Deref would work or not)
dispatch!{any;
'-' => prefix(2, |x| -1 * x),
'+' => prefix(2, |x| x),
'!' => postfix(2, |x| factorial(x)),
'+' => infix(0, |a, b| a + b),
'-' => infix(0, |a, b| a + b),
'*' => infix(1, |a, b| a * b),
'/' => infix(1, |a, b| a / b),
_ => fail,
)
then prefix
parsers would do
if let Prefix() = input {
...
} else {
fail
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice because we only run the needed parsers (less alt
, no dispatch
just to get an error).
I agree about the opacity of parameter order. Its also not great if a parameter is unused (e.g. no postfix).
If we go with the "builder" approach, we are again violating the API design guidelines as discussed in #614 (comment). However, it does make it nice for naming the parameters and only requiring what is needed.
Maybe one way of looking at this is the error reporting. What happens if I use an operator in the wrong location?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite getting the input type and dealing with custom input types in the middle of a parser adds a lot of complexity.
src/combinator/precedence.rs
Outdated
#[doc(alias = "shunting yard")] | ||
#[doc(alias = "precedence climbing")] | ||
#[inline(always)] | ||
pub fn precedence<I, ParseOperand, Operators, Operand: 'static, E>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Organizationally, O prefer the "top level" thing going first and then branching out from there. In this case, precedence
is core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Streaming support
This looks to be agnostic of streaming support like separated
is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hide behind a feature flagunstable-pratt
If this is going to start off unstable, then its fine noting most of my feedback in the "tracking" issue and not resolving all of it here
src/combinator/precedence.rs
Outdated
// recursive function | ||
fn precedence_impl<I, ParseOperand, Operators, Operand: 'static, E>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way we can do this without recursion? Parser authors are generally cautious around recursion due to blowing up the stack.
Hmm, seems all of the other ones use recursion
src/combinator/precedence.rs
Outdated
} | ||
|
||
// recursive function | ||
fn precedence_impl<I, ParseOperand, Operators, Operand: 'static, E>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From #614 (comment)
While looking at it I found a potential error. My current implementation is unsound. Consider the 2 + 4 + 7. In the current parser it evaluates as (4 + 7) + 2 instead of (2 + 4) + 7 when operators have equal binding power . That is why the algorithm uses associativity. While + or * are commutative someone might want to parse function calls for example
Co-authored-by: Ed Page <[email protected]>
This feature was an overengineering based on suggestion "Why make our own trait" in winnow-rs#614 (comment)
works without it
…d be - based on review "Why allow non_snake_case?" in winnow-rs#614 (comment) - remove `allow_unused` based on "Whats getting unused?" winnow-rs#614 (comment)
until we find a satisfactory api based on winnow-rs#614 (comment) > "We are dumping a lot of stray types into combinator. The single-line summaries should make it very easy to tell they are related to precedence"
based on "Organizationally, O prefer the "top level" thing going first and then branching out from there. In this case, precedence is core." winnow-rs#614 (comment)
the api has an unsound problem. The `Parser` trait is implemented on the `&Operator` but inside `parse_next` a mutable ref and `ReffCell::borrow_mut` are used which can lead to potential problems. We can return to the API later. But for now lets keep only the essential algorithm and pass affix parsers as 3 separate entities Also add left_binding_power and right_binding_power to the operators based on winnow-rs#614 (comment)
I will write the documentation later
This feature was an overengineering based on suggestion "Why make our own trait" in winnow-rs#614 (comment)
…d be - based on review "Why allow non_snake_case?" in winnow-rs#614 (comment) - remove `allow_unused` based on "Whats getting unused?" winnow-rs#614 (comment)
until we find a satisfactory api based on winnow-rs#614 (comment) > "We are dumping a lot of stray types into combinator. The single-line summaries should make it very easy to tell they are related to precedence"
based on "Organizationally, O prefer the "top level" thing going first and then branching out from there. In this case, precedence is core." winnow-rs#614 (comment)
the api has an unsound problem. The `Parser` trait is implemented on the `&Operator` but inside `parse_next` a mutable ref and `ReffCell::borrow_mut` are used which can lead to potential problems. We can return to the API later. But for now lets keep only the essential algorithm and pass affix parsers as 3 separate entities Also add left_binding_power and right_binding_power to the operators based on winnow-rs#614 (comment)
- require explicit `trace` for operators - fix associativity handling for infix operators: `1 + 2 + 3` should be `(1 + 2) + 3` and not `1 + (2 + 3)`
until we find a satisfactory api based on winnow-rs#614 (comment) > "We are dumping a lot of stray types into combinator. The single-line summaries should make it very easy to tell they are related to precedence"
Another difference from the #618 algorithm is that this recursive one backtracks when the left binding power of an operand is less than the current right binding power. It could affect the performance if parsing the operators is expensive.
The #618 always stacks all parsed operators into a vector. |
Operator callbacks should be able to return fn ternary(cond: Expr) -> ... {
move |i: &mut Input| {
let (a, b) = separated_pair(pratt_parser , ':' , pratt_parser).parse_next(i)?;
Ok(Expr::Ternary(cond, a, b)
}
}
fn pratt_parser() -> ... {
precedence(
...,
trace("postfix", '?'.value((5, ternary))),
...
)
} |
src/combinator/precedence.rs
Outdated
trace( | ||
"prefix", | ||
dispatch! {any; | ||
'+' => empty.value((9, (|_: &mut _, a| Ok(a)) as _)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to ask, why is as _
needed everywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compiler don't want to cast {closure34234234}
into function pointer fn(&mut I, O) -> O
automatically:(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need it to be a function pointer rather than a dyn
or an impl
?
I assume #618 (comment) is related.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it is related. &dyn
also requires casting but with an additional &
. I've switched to function pointers recently because I had trouble convincing the compiler that the lifetime of &mut I
in the parser definition Parser<I, (usize, &dyn Fn(&mut I, Operand) -> PResult<Operand, E>), E>,
is not related to the lifetime of the &dyn Fn
, but it seems not always the case. So function pointers are actually easier and more type inference friendly. I doubt that the closure's variable capture is worth having dynamically dispatched &dyn
callbacks in this context.
It cannot be impl
because the return types of the dispatch
are different closures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we provided an infix
function, the compiler would do the coercion for us, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just checked. Yes this simple guidance works
fn infix<I, O>(
a: Assoc,
f: fn(&mut I, O, O) -> PResult<O>,
) -> (
Assoc,
fn(&mut I, O, O) -> PResult<O>,
) {
(a, f)
}
...
"*".value(infix(Assoc::Right(28), |_: &mut _, a, b| Ok(Expr::Pow(Box::new(a), Box::new(b)))))
|i: &mut &'i str, a| Ok(Expr::PreIncr(Box::new_in(a, bump))) And it seems it is not possible to use in type `&'f (dyn Fn(&'i mut I, Operand) -> Result<Operand, ErrMode<E>> + 'f)`, reference has a longer lifetime than the data it references I will try to make |
- builder pattern - `Prefix` `Postfix` `Infix` struct tuples
Updated with the new API.
use Infix::*;
expression(0, digit1.parse_to::<i32>())
.prefix(dispatch! {any;
'+' => Prefix(12, |_, a| Ok(a)),
'-' => Prefix(12, |_, a: i32| Ok(-a)),
_ => fail
})
.infix(dispatch! {any;
'+' => Left(5, |_, a, b| Ok(a + b)),
'-' => Left(5, |_, a, b| Ok(a - b)),
'*' => Left(7, |_, a, b| Ok(a * b)),
'/' => Left(7, |_, a, b| Ok(a / b)),
'%' => Left(7, |_, a, b| Ok(a % b)),
'^' => Right(9, |_, a, b| Ok(a ^ b)),
_ => fail
})
.parse_next(i) |
src/combinator/expression.rs
Outdated
#[doc(alias = "precedence_climbing")] | ||
#[inline(always)] | ||
pub fn expression<I, ParseOperand, O, E>( | ||
start_power: i64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should start_power
also be a "builder" method?
- Only used in very specific cases
- Can better focus documentation on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, would referring to it as precedence
or precedence_level
be a more approachable user-facing term?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start_precedence_level
or min_precedence_level
, allowed_precedence
? This is needed for if the binding power of the operator < start_precedence
the parser stops parsing. This is exposed to allow parsing a subset of an expression such as a ? b : c, 1 + 2
(parsing stops at ,
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a user sets it, is it the same precedence level they set for their operator? If so, I'd use current_
as the prefix
src/combinator/expression.rs
Outdated
#[inline(always)] | ||
pub fn prefix<NewParsePrefix>( | ||
self, | ||
parser: NewParsePrefix, | ||
) -> Expression<I, O, ParseOperand, NewParsePrefix, Post, Pix, E> | ||
where | ||
NewParsePrefix: Parser<I, Prefix<I, O, E>, E>, | ||
{ | ||
Expression { | ||
start_power: self.start_power, | ||
parse_operand: self.parse_operand, | ||
parse_prefix: parser, | ||
parse_postfix: self.parse_postfix, | ||
parse_infix: self.parse_infix, | ||
i: Default::default(), | ||
o: Default::default(), | ||
e: Default::default(), | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these, how come we don't do
#[inline(always)] | |
pub fn prefix<NewParsePrefix>( | |
self, | |
parser: NewParsePrefix, | |
) -> Expression<I, O, ParseOperand, NewParsePrefix, Post, Pix, E> | |
where | |
NewParsePrefix: Parser<I, Prefix<I, O, E>, E>, | |
{ | |
Expression { | |
start_power: self.start_power, | |
parse_operand: self.parse_operand, | |
parse_prefix: parser, | |
parse_postfix: self.parse_postfix, | |
parse_infix: self.parse_infix, | |
i: Default::default(), | |
o: Default::default(), | |
e: Default::default(), | |
} | |
} | |
#[inline(always)] | |
pub fn prefix<NewParsePrefix>( | |
mut self, | |
parser: NewParsePrefix, | |
) -> Expression<I, O, ParseOperand, NewParsePrefix, Post, Pix, E> | |
where | |
NewParsePrefix: Parser<I, Prefix<I, O, E>, E>, | |
{ | |
self.parse_prefix = parser; | |
self | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Different types of the Expression
due to the different type of the new parser
Closes: #131
This is the initial implementation of the
operator precedence
parser based on the Pratt algorithm https://en.wikipedia.org/wiki/Operator-precedence_parser#Pratt_parsing.The interface:
Any parser that returns a number can be considered as operator with its
binding power
. To convert the parser intoprefix
,infix
orpostfix
, we use the extension trait on top of theimpl Parser
.Implementation details
Binding power, associativity, Pratt and Dijkstra
The Pratt algorithm only uses theEDIT: I am wrong here. We need to use associativity forbinding power
.Left
andright
associativity are used in Dijkstra'sShunting Yard
algorithm. (It seems likechumsky
incorrectly names its implementation aspratt
while usingshunting yard
internally).infix
operators to determine the order of the operators with equal binding power.I named the combinator
precedence
with aliases topratt
,shunting yard
,precedence climbing
,separated
. If any of the names are confusing, we may consider better alternatives.The implementation is partially inspired by https://gist.github.com/ilonachan/3d92577265846e5327a3011f7aa30770.
But it doesn't use any
Rc<...>
allocations for closures; instead, it uses uses& RefCell<& dyn ...>
references.Operators from the tuple are statically dispatched by the requested affix so
prefix.as_postfix()
will returnfail
whileprefix.as_prefix()
with return the actual parser. This may have a performance issue because it iterates through the whole tuple each time the parser is requested for any affix:See the trace:
(I tried to make a some sort of a linked list where each parser holds an index to the next one with the same affix but, it quickly became a nightmare of lifetimes, complexity, and compilation errors so it was rejected).
Performance
A quick criterion benchmark with a simple input &str
1-2*4+12-561-5-6*6/9-3+1*-2*4-758*3
shows:TODO
Implement the api for Vec<..> and &[...]unstable-pratt
EDIT:
Things to consider
i64
but may overflow due to operations likeprecedence + 1
andprecedence - 1
. Consider replacing+
and-
with checks against theAssoc
enum instead.