update streaming input and input type columns #48

Open
wants to merge 1 commit into
base: main
22 changes: 11 additions & 11 deletions README.md
@@ -2,17 +2,17 @@

This repo tries to assess Rust parsing performance.

| crate | parser type | action code | integration | input type | precedence climbing | parameterized rules | streaming input |
|-----------|-------------|-------------|--------------------|------------------------|---------------------|---------------------|-----------------|
| [chumsky] | combinators | in source | library | `&str` | ? | ? | ? |
| [combine] | combinators | in source | library | `&str` | ? | ? | ? |
| [lalrpop] | LR(1) | in grammar | build script | `&str` | No | Yes | No |
| [nom] | combinators | in source | library | `&[u8]`, custom | No | Yes | Yes |
| [peg] | PEG | in grammar | proc macro (block) | `&str`, `&[T]`, custom | Yes | Yes | No |
| [pest] | PEG | external | proc macro (file) | `&str` | Yes | No | No |
| [pom] | combinators | in source | library | `&str` | ? | ? | ? |
| [winnow] | combinators | in source | library | `&str`, `&[T]`, custom | No | Yes | Yes |
| [yap] | combinators | in source | library | `&str`, `&[T]`, custom | No | Yes | ? |
| crate | parser type | action code | integration | input type | precedence climbing | parameterized rules | streaming input |
|-----------|-------------|-------------|--------------------|-------------------------|---------------------|---------------------|-------------------------------------------------------------------------|
| [chumsky] | combinators | in source | library | `&str`, `&[T]`, custom | ? | ? | [Yes](https://docs.rs/chumsky/latest/chumsky/stream/struct.Stream.html) |
Collaborator

"streaming input" means "can it handle operating on a partial/incomplete input"

  • For chumsky, that looks to be the equivalent of our Stream trait. I'm not seeing anything about partial / incomplete input
  • For combine, I think this is the more proper link but ... there is no documentation on the topic
  • For nom, I guess that is the best that can be done? There really isn't a good resource on it
  • For winnow, the link should go to https://docs.rs/winnow/latest/winnow/stream/struct.Partial.html

Author

I think it is sufficient for a parser to be able to make progress with partial input to support "streaming input".
As an extreme example:

  • Assume you get 1 new token / minute
  • It takes 1 minute to process a token.
  • You parse 100 tokens.
    A streaming parser will then finish at minute 101.
    A non-streaming parser has to wait 100 minutes for all the tokens and then spend another 100 minutes processing them, so it finishes at minute 200.
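
The arithmetic above can be checked with a small model. This is only a sketch: the function names are made up for illustration, and it assumes the first token arrives at minute 1 and that the parser handles one token at a time.

```rust
/// Minute at which a parser that can start on partial input finishes,
/// assuming token `i` arrives at minute `i * arrival` and costs `work` minutes.
fn streaming_finish(tokens: u64, arrival: u64, work: u64) -> u64 {
    let mut free_at: u64 = 0; // minute at which the parser is next idle
    for i in 1..=tokens {
        let arrives = i * arrival;
        // Start as soon as both the token and the parser are available.
        free_at = free_at.max(arrives) + work;
    }
    free_at
}

/// Minute at which a parser finishes if it must wait for the full input first.
fn batch_finish(tokens: u64, arrival: u64, work: u64) -> u64 {
    tokens * arrival + tokens * work
}
```

With 100 tokens, one minute between arrivals, and one minute of work each, `streaming_finish(100, 1, 1)` gives 101 and `batch_finish(100, 1, 1)` gives 200.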

It seems there are two ways of doing this.

  1. External partial state (what nom and winnow use with their Incomplete error variants), where a parser takes a partial input and maybe a partial state and returns the new partial or complete state.
  2. Internal partial state (what some of the others use), where a parser repeatedly takes partial input and holds its partial state internally, returning only once some part of the final state has been produced.
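
To make the two shapes concrete, here is a minimal std-only sketch. It does not use the nom, winnow, or chumsky APIs; `Outcome`, `number_external`, and `number_internal` are hypothetical names, and the toy grammar (ASCII digits terminated by `;`) is an assumption chosen for brevity.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    /// A number was parsed; the `usize` is how many bytes were consumed.
    Done(u32, usize),
    /// The buffer ended before a decision could be made; retry with more data.
    Incomplete,
    /// The input can never match.
    Error,
}

/// External partial state: the caller owns the buffer and calls again
/// with a longer buffer whenever `Incomplete` comes back.
fn number_external(buf: &[u8]) -> Outcome {
    let mut value: u32 = 0;
    for (i, &b) in buf.iter().enumerate() {
        match b {
            b'0'..=b'9' => value = value * 10 + u32::from(b - b'0'),
            b';' if i > 0 => return Outcome::Done(value, i + 1),
            _ => return Outcome::Error,
        }
    }
    Outcome::Incomplete
}

/// Internal partial state: the parser pulls tokens itself and only returns
/// once it has a result. The iterator is free to block while waiting.
fn number_internal(tokens: &mut impl Iterator<Item = u8>) -> Option<u32> {
    let mut value: u32 = 0;
    let mut seen_digit = false;
    for b in tokens {
        match b {
            b'0'..=b'9' => {
                value = value * 10 + u32::from(b - b'0');
                seen_digit = true;
            }
            b';' if seen_digit => return Some(value),
            _ => return None,
        }
    }
    None // the source ended before the terminator arrived
}
```

In the first form the caller decides what to do while waiting for more input; in the second the wait is hidden inside the iterator, and the parser cannot tell whether it happened.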

If you can parse a token iterator you can parse a stream. You write your input stream as an iterator.

For chumsky, you can use an iterator over the stream of input using the from_iter method.

I'm not super familiar with combine, but the link you posted seems to be a newtype to signal a certain behavior when reaching the end of input. This is not necessary for parsing a stream because you don't need to reach the end of the stream until you have all the tokens. Write your stream to block on or await available tokens and your parser doesn't need to know.
For example, in the yap_streaming fizzbuzz example, new tokens can take an unbounded amount of time to arrive, but the parser can process all the tokens it has received so far without ever knowing that it waited for input.
I think the combine link is correct because the options available at that page seem to be how one would handle different kinds of streams.
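
A rough illustration of the "block for available tokens" idea, using only the standard library rather than any of the crates in the table. `sum_numbers` and `sum_from_slow_producer` are hypothetical names, and the 5 ms delay merely stands in for a slow input source.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Sums decimal numbers separated by non-digit bytes, pulled from any byte iterator.
/// The parser has no idea whether `next()` returned instantly or blocked.
fn sum_numbers(bytes: impl Iterator<Item = u8>) -> u64 {
    let mut total = 0u64;
    let mut current = 0u64;
    for b in bytes {
        if b.is_ascii_digit() {
            current = current * 10 + u64::from(b - b'0');
        } else {
            total += current;
            current = 0;
        }
    }
    total + current
}

/// Feeds the parser from another thread, one byte at a time.
fn sum_from_slow_producer() -> u64 {
    let (tx, rx) = mpsc::channel::<u8>();
    let producer = thread::spawn(move || {
        for &b in b"1 20 300" {
            thread::sleep(Duration::from_millis(5)); // simulated latency
            tx.send(b).expect("receiver is alive");
        }
        // Dropping `tx` here ends the stream on the receiving side.
    });
    // `into_iter()` on a `Receiver` blocks in `next()` until a byte arrives
    // or the sender is dropped.
    let total = sum_numbers(rx.into_iter());
    producer.join().expect("producer thread panicked");
    total
}
```

The parser makes progress on each number as its bytes arrive, but the calling thread is blocked for the whole duration.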

I have no problem changing the winnow link.

Collaborator

> For chumsky, you can use an iterator over the stream of input using the from_iter method.

While that does use an iterator, without seeing an example showing this use case, I question how it would work. For example, how do you handle the end span?

> I think the combine link is correct because the options available at that page seem to be how one would handle different kinds of streams.

This all seems like very hand-wavy guessing as to how it's supposed to work, and without verified examples, who knows whether all of the practical aspects are taken care of.

And examples only help in calling attention to it, not in fully resolving it. For example, I commented in the issue about IO error handling for yap_streaming, but blocking in the parser could also have serious ramifications for an application.

I am curious, how do you know when you can stop keeping state for backtracking? Is a marker made for the outermost backtracking and, as you unwind past it, you free it, allowing the buffer to be reused?

Author

> While that does use an iterator, without seeing an example showing this use case, I question how it would work. For example, how do you handle the end span?

If I have time I'll see if I can set up an example. Maybe I'll prove myself wrong, though I don't see what would prevent parsing a Reader::bytes() iterator.
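
One practical wrinkle with `Read::bytes()` is that it yields `io::Result<u8>`, so something has to decide what happens to an I/O error mid-parse. The following is a sketch of one possible adapter, not what chumsky or yap_streaming do; `ByteSource` and `count_lines` are hypothetical names.

```rust
use std::io::{self, Read};

/// Turns `Read::bytes()` into a plain `u8` iterator. It stops at the first
/// I/O error and keeps that error so the caller can inspect it afterwards.
struct ByteSource<R: Read> {
    bytes: io::Bytes<R>,
    error: Option<io::Error>,
}

impl<R: Read> ByteSource<R> {
    fn new(reader: R) -> Self {
        Self { bytes: reader.bytes(), error: None }
    }

    /// Returns the stored error, if the stream ended because of one.
    fn take_error(&mut self) -> Option<io::Error> {
        self.error.take()
    }
}

impl<R: Read> Iterator for ByteSource<R> {
    type Item = u8;

    fn next(&mut self) -> Option<u8> {
        if self.error.is_some() {
            return None; // stay finished after a failure
        }
        match self.bytes.next()? {
            Ok(b) => Some(b),
            Err(e) => {
                self.error = Some(e);
                None // to the parser this looks like end of input
            }
        }
    }
}

/// Stand-in "parser": counts newline bytes, then reports any I/O failure.
fn count_lines<R: Read>(reader: R) -> io::Result<usize> {
    let mut src = ByteSource::new(reader);
    let n = src.by_ref().filter(|&b| b == b'\n').count();
    match src.take_error() {
        Some(e) => Err(e),
        None => Ok(n),
    }
}
```

Note that the parser itself sees an I/O failure as ordinary end of input; only the caller, by checking `take_error`, can tell the two apart.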

I'm not sure what you mean by handling the end span. Maybe it's related to how chumsky handles the None case here with the eoi span?

    pub(crate) fn next(&mut self) -> (usize, S, Option<I>) {
        match self.pull_until(self.offset).cloned() {
            Some((out, span)) => {
                self.offset += 1;
                (self.offset - 1, span, Some(out))
            }
            None => (self.offset, self.eoi.clone(), None),
        }
    }

> This all seems like very hand-wavy guessing as to how it's supposed to work, and without verified examples, who knows whether all of the practical aspects are taken care of.

I'll file an issue on combine after we agree on a definition for "streaming input". We might already? Just double-checking.


> And examples only help in calling attention to it, not in fully resolving it. For example, I commented in the issue about IO error handling for yap_streaming, but blocking in the parser could also have serious ramifications for an application.

I'll comment there on handling IO and blocking. I'm not sure what the comment on examples fully means.


> I am curious, how do you know when you can stop keeping state for backtracking? Is a marker made for the outermost backtracking and, as you unwind past it, you free it, allowing the buffer to be reused?

I can't speak to how chumsky does it. In yap_streaming, backtracking can only occur through a TokenLocation, so creating one adds the current offset to a list, and dropping it removes that offset from the list. Items are only copied to the buffer if a TokenLocation exists that might need them when a reset occurs. Items are only dropped from the buffer if the oldest TokenLocation in the list is younger than they are.
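
The bookkeeping described above could look roughly like the following. This is a simplified sketch and not yap_streaming's actual code: `Buffered`, `checkpoint`, `reset`, and `release` are hypothetical names, and an explicit `release` call stands in for dropping a TokenLocation.

```rust
use std::collections::VecDeque;

struct Buffered<I: Iterator> {
    source: I,
    buffer: VecDeque<I::Item>, // items kept for possible replay
    base: usize,               // stream offset of buffer[0] (when non-empty)
    cursor: usize,             // stream offset of the next item to yield
    checkpoints: Vec<usize>,   // offsets of live checkpoints
}

impl<I: Iterator> Buffered<I>
where
    I::Item: Clone,
{
    fn new(source: I) -> Self {
        Self { source, buffer: VecDeque::new(), base: 0, cursor: 0, checkpoints: Vec::new() }
    }

    /// Records the current offset so the parser can backtrack to it later.
    fn checkpoint(&mut self) -> usize {
        self.checkpoints.push(self.cursor);
        self.cursor
    }

    /// Rewinds to a live checkpoint.
    fn reset(&mut self, offset: usize) {
        assert!(self.checkpoints.contains(&offset), "checkpoint is not live");
        self.cursor = offset;
    }

    /// Forgets a checkpoint and frees items nothing can rewind to anymore.
    fn release(&mut self, offset: usize) {
        if let Some(pos) = self.checkpoints.iter().position(|&o| o == offset) {
            self.checkpoints.swap_remove(pos);
        }
        self.trim();
    }

    /// Number of items currently held for replay.
    fn buffered(&self) -> usize {
        self.buffer.len()
    }

    /// Drops buffered items older than both the oldest checkpoint and the cursor.
    fn trim(&mut self) {
        let oldest = self.checkpoints.iter().copied().min().unwrap_or(self.cursor);
        let keep_from = oldest.min(self.cursor);
        while self.base < keep_from && self.buffer.pop_front().is_some() {
            self.base += 1;
        }
    }
}

impl<I: Iterator> Iterator for Buffered<I>
where
    I::Item: Clone,
{
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        let idx = self.cursor - self.base;
        let item = if idx < self.buffer.len() {
            self.buffer[idx].clone() // replay after a reset
        } else {
            let item = self.source.next()?;
            // Only pay for buffering while something might rewind.
            if !self.checkpoints.is_empty() {
                if self.buffer.is_empty() {
                    self.base = self.cursor;
                }
                self.buffer.push_back(item.clone());
            }
            item
        };
        self.cursor += 1;
        self.trim();
        Some(item)
    }
}
```

With no live checkpoint nothing is buffered at all, which is what keeps memory bounded on a long stream as long as the parser releases its checkpoints.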

| [combine] | combinators | in source | library | `&str`, `&[T]`, custom | ? | ? | [Yes](https://docs.rs/combine/latest/combine/stream/index.html) |
| [lalrpop] | LR(1) | in grammar | build script | `&str` | No | Yes | No |
| [nom] | combinators | in source | library | `&str`, `&[u8]`, custom | No | Yes | [Yes](https://docs.rs/nom/latest/nom/bytes/streaming/index.html) |
| [peg] | PEG | in grammar | proc macro (block) | `&str`, `&[T]`, custom | Yes | Yes | No |
| [pest] | PEG | external | proc macro (file) | `&str` | Yes | No | No |
| [pom] | combinators | in source | library | `&str` | ? | ? | No |
| [winnow] | combinators | in source | library | `&str`, `&[T]`, custom | No | Yes | [Yes](https://docs.rs/winnow/latest/winnow/stream/index.html) |
| [yap] | combinators | in source | library | `&str`, `&[T]`, custom | No | Yes | [Yes](https://docs.rs/yap_streaming/) |

# Results
