Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement take_until_parser_matches #469

Closed

Conversation

tomalexander
Copy link

Implement a macro that will consume input until the child parser matches

@tomalexander
Copy link
Author

Hey, sorry that took so long. Ran into an old friend I hadn't talked to in over half a decade and I've been recovering from that hangover.

@coveralls
Copy link

coveralls commented Mar 20, 2017

Coverage Status

Coverage increased (+0.1%) to 82.845% when pulling 803e17c on tomalexander:take_until_parser_matches into c5090ef on Geal:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 87.96% when pulling f387919 on tomalexander:take_until_parser_matches into db94701 on Geal:master.

@chengsun
Copy link
Contributor

chengsun commented Mar 23, 2017

This looks very useful, but a bit broken as implemented. Specifically, if the child parser never matches, take_until_parser_matches will not return an Error or Incomplete. For example:

named!(end <&[u8],&[u8]>, tag!("bbb"));
named!(before_bbb, take_until_parser_matches!(end));

assert_eq!(before_bbb(&b"ab"[..]), Incomplete(Needed::Size(3)));
// assert fails: actual value is Done(&b"b"[..], &b"a"[..])

Also, maybe we could also have a take_until_parser_matches_and_consume variant?

@tomalexander
Copy link
Author

Thanks for taking a look! That behavior was my intent, but I just checked and you're right. The other take_until* macros throw an error if theres no match. I'll revise (and add more tests for that)

@Geal
Copy link
Collaborator

Geal commented Jul 17, 2017

any news on this?

@tomalexander
Copy link
Author

tomalexander commented Jul 17, 2017 via email

@Geal Geal force-pushed the master branch 3 times, most recently from 6c99077 to e7ca818 Compare May 14, 2018 12:06
@Geal Geal force-pushed the master branch 2 times, most recently from fbd7d65 to cfc19f1 Compare May 9, 2019 11:10
@tomalexander tomalexander force-pushed the take_until_parser_matches branch 2 times, most recently from b9a827d to 7d05a21 Compare March 8, 2020 03:37
@tomalexander
Copy link
Author

Hey, back from the dead. I'm loving the switch to functions in nom version 5! I've updated the PR the latest master, rewrote it as a function instead of a macro, and fixed the style issue chengsun raised about it not returning an error if the parser does not find anything to match.

I haven't implemented take_until_parser_matches_and_consume because I think that would be the same as just a simple terminated(take_until_parser_matches(foo), foo) but if you'd still like it in, I can add it.

@tomalexander
Copy link
Author

I thought about it some more and I decided to add in a take_until_parser_matches_and_consume because I realized its a performance thing. As in, terminated(take_until_parser_matches(foo), foo) still has to run the parser foo one additional time, even though we just successfully ran it in take_until_parser_matches. I'm still using the same ErrorKind though, should I be making a distinct one for take_until_parser_matches_and_consume ?

@Geal Geal added this to the 6.0 milestone Apr 6, 2020
@tomalexander
Copy link
Author

tomalexander commented Apr 12, 2020

Hey, I just made a small design change, so I want to call it out here to make sure that the project is aware of it and agree with it. Namely, I changed it to attempt to use its inner parser on the empty string after the full input has been captured.

Previously, take_until_parser_matches would apply its inner parser on all substrings until the end of the input, but not including the end of the input. So for example, if the input was "foo" it would test f("foo"), f("oo") and f("o") but not f(""). This meant that it could never capture the full input.

An example use-case where this change is helpful is trying to read until the end of the input, excluding any trailing whitespace. You could write this parser as:

assert_eq!(
    Ok(("\n", "foo bar baz")),
    take_until_parser_matches::<_, _, _, (_, ErrorKind)>(all_consuming(multispace0))(
        "foo bar baz\n"
    )
);

but if there was no trailing whitespace, even though multispace0 accepts an empty input, this would fail:

assert_eq!(
    Ok(("", "foo bar baz")),
    take_until_parser_matches::<_, _, _, (_, ErrorKind)>(all_consuming(multispace0))(
        "foo bar baz"
    )
);

After this patch, both code blocks succeed.

Also I see I need to rebase my branch on the latest changes. I'll get to that now.

@tomalexander tomalexander force-pushed the take_until_parser_matches branch from 0b00048 to 74e1603 Compare April 12, 2020 21:35
@tomalexander
Copy link
Author

I think its important to point out the difference between this and many_till since the two are very similar parsers.

many_till works by running the first parser and then checking the second parser. Then it runs the first parser again, and then checks the second parser. This repeats until the second parser has a result.

take_until_parser_matches instead iterates forward checking the second parser at each step, returning everything from before the match.

The difference is useful when the first parser is aggressive/insatiable (I want to say 'greedy' but I know in parsing thats the opposite). So, for example, lets say you had a string:
"ababababefg" and you wanted to match as many "ab"s as possible, followed by as many "a"s as possible, as long as its all before a "befg". With many_till you might try to write this as:

let x: IResult<&str, (Vec<&str>, &str)> =
            many_till(alt((tag("ab"), tag("a"))), tag("befg"))("ababababefg");
        assert_eq!(x, Ok(("", (vec!["ab", "ab", "ab", "a"], "befg"))));

But that won't work because the first parser will eat the "b" in "befg", causing the second parser to never match.

Whereas take_until_parser_matches is only attempting to match the end/2nd parser so you could write:

        let x: IResult<&str, (Vec<&str>, &str)> = tuple((
            map_parser(
                take_until_parser_matches(tag("befg")),
                many1(alt((tag("ab"), tag("a")))),
            ),
            tag("befg"),
        ))("ababababefg");
        assert_eq!(x, Ok(("", (vec!["ab", "ab", "ab", "a"], "befg"))));

and it would work.

These are obviously contrived examples, but I think it would be particularly useful for parsing org-mode headlines. The headlines support "tags" at the end of the line, so you could have a headline like:

*** Handling segfaults :thisIsATag:andSoIsThis:

but if the ending isn't a valid tag (for example, having spaces or not having a closing colon) then it gets treated as part of the headline

*** Handling segfaults :this isnt a tag:
*** Handling segfaults :thisAlsoIsntATag

so with take_until_parser_matches you could do:

let (i, (headline, tags)) = tuple((take_until_parser_matches(alt((tag_parser, line_ending))), opt(tag_parser)))(i)?;

but with many_till I don't really see a solution other than:

many_till(anychar, alt((tag_parser, line_ending)))

leaving you with an unfortunate vector of single characters instead of a nice slice.

@tomalexander tomalexander changed the title Implement the take_until_parser_matches macro Implement the take_until_parser_matches Jul 11, 2020
@tomalexander tomalexander force-pushed the take_until_parser_matches branch from 74e1603 to 9aed9e4 Compare July 12, 2020 00:05
@tomalexander tomalexander changed the title Implement the take_until_parser_matches Implement take_until_parser_matches Jul 12, 2020
@NickNick
Copy link

NickNick commented Dec 4, 2020

Hi! I was taking this for a spin, but got a panic when the 'take_split' tried to index a string between code points:

thread 'test_nom_token_recipe' panicked at 'byte index 10 is not a char boundary; it is inside 'ö' (bytes 9..11) of ` frozen rösti rounds to oven`', /home/nick/.cargo/registry/src/github.com-1ecc6299db9ec823/nom-6.0.1/src/traits.rs:504:7

This happens here:

        for ind in 0..=input_length {
            let (remaining, taken) = i.take_split(ind);
            match f.parse(remaining) {
                Err(_) => (),

I changed it into this (added InputIter, use iter_indices):

    Input: InputTake + InputIter + InputLength + Clone,
    F: Parser<Input, O, Error>,
    Error: ParseError<Input>,
{
    move |input: Input| {
        let i = input;
        for (ind, _) in i.iter_indices() {
            let (remaining, taken) = i.take_split(ind);

Which seems to work for me, but please do check it. :)

Of course, thanks for writing it in the first place! Couldn't have done it myself. 👍

@tomalexander
Copy link
Author

Hey thanks for the report and the suggested change! I'll try to look at it this weekend (I'm taking vacation days this week to work on a side project so I don't want get sidetracked during the week)

@tomalexander
Copy link
Author

Thanks! I like the change. The only change I needed to add was for checking the parser after the input has been exhausted.

The take_until_parser_matches parser iterates through the input
attempting to apply the inner parser to it. Upon successfully applying
the inner parser, it will consume and return the input up until the
inner parser's match while not consuming the contents of the inner
parser.
Previously, take_until_parser_matches would apply its inner parser on all substrings until the end of the input, but not including the end of the input. So for example, if the input was `"foo"` it would test `f("foo")`, `f("oo")` and `f("o")` but not `f("")`. This meant that it could never capture the full input.

An example use-case where this change is helpful is trying to read until the end of the input, excluding any trailing whitespace. You could write this parser as:

```python
assert_eq!(
    Ok(("\n", "foo bar baz")),
    take_until_parser_matches::<_, _, _, (_, ErrorKind)>(all_consuming(multispace0))(
        "foo bar baz\n"
    )
);
```

but if there was no trailing whitespace, even though `multispace0` accepts an empty input, this would fail:

```python
assert_eq!(
    Ok(("", "foo bar baz")),
    take_until_parser_matches::<_, _, _, (_, ErrorKind)>(all_consuming(multispace0))(
        "foo bar baz"
    )
);
```

After this patch, both code blocks succeed.
- Migrated to the Parser<> trait
- Changed Fn -> FnMut as required by the Parser trait
- separated_list -> separated_list1
…ed by @NickNick on github.

This avoids the issue of splitting an index of a string between unicode codepoints.
@tomalexander tomalexander force-pushed the take_until_parser_matches branch from 1b7896c to 9708fb9 Compare December 13, 2020 22:09
@meritozh
Copy link

this is very useful. any process on this PR?

@tomalexander
Copy link
Author

Thanks! Last I heard from @Geal he was very busy so he didn't have a lot of time to spend on PRs. AFAIK all thats remaining is code review and merging.

@Geal
Copy link
Collaborator

Geal commented May 15, 2021

Yup, still pretty busy, but I intend to do a pass on PRs soon

@tomalexander
Copy link
Author

Thats wonderful news, thanks!

@meritozh
Copy link

@Geal hi,this PR not merged into 6.2?Or we delay it to 7.0?

@homersimpsons
Copy link
Contributor

@Geal I guess this PR could be closed as it seems to be a duplicate of take_till

@meritozh
Copy link

@homersimpsons take_till only accept InputTakeAtPosition which mostly is &[u8] or &str, take_until_parser_matches accept Parser.

@chotchki
Copy link
Contributor

I just wanted to add that this pull request works great for my use case.

I'm trying to parse Postgres compatible sql constant strings. There is a variant that auto joins separated strings together ONLY if they are separated by a newline. The take_until_parser_matches_and_consume function is perfect for this.

@Stargateur
Copy link
Contributor

See my comment I don't think this is a good addition to nom

@Geal Geal modified the milestones: 6.0, 8.0 Jan 1, 2023
@tomalexander
Copy link
Author

Good enough for me, closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants