Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the requirement for std::io::Seek for PacketReader etc. #12

Open
TonalidadeHidrica opened this issue Nov 7, 2020 · 10 comments · May be fixed by #18 or #36
Open

Remove the requirement for std::io::Seek for PacketReader etc. #12

TonalidadeHidrica opened this issue Nov 7, 2020 · 10 comments · May be fixed by #18 or #36

Comments

@TonalidadeHidrica
Copy link
Contributor

Currently, readers like PacketReader requires std::io::Seek. But the parsing process of Ogg container is actually streamable. Therefore, it may be more useful to remove those constraints for most of functions, except for seek_bytes and seek_absgp, which truly demand Seek.

@est31
Copy link
Member

est31 commented Nov 7, 2020

Seek is required so that an additional layer of buffering can be avoided. Basically, sometimes there can be crap between ogg pages, so the ogg reader reads chunks of data to read until the next header. Once it found the header, it knows its exact position so it can seek there and further code doesn't need to handle partially parse chunks.

That's why Seek is required.

@TonalidadeHidrica
Copy link
Contributor Author

TonalidadeHidrica commented Nov 7, 2020

So, is seek required when returning to beginning of the header that was found while recovering from corrupt? Then, isn't it enough to store merely thirty-or-something page header buffer? Maybe I am just misunderstanding your explanation. Could you show me the code for the processed you described?

@est31
Copy link
Member

est31 commented Nov 7, 2020

@TonalidadeHidrica

ogg/src/reading.rs

Lines 752 to 773 in 1da041d

/// Reads until the new page header, and then returns the page header array.
///
/// If no new page header is immediately found, it performs a "recapture",
/// meaning it searches for the capture pattern, and if it finds it, it
/// reads the complete first 27 bytes of the header, and returns them.
///
/// Ok(None) is returned if the stream has ended without an uncompleted page
/// or non page data after the last page (if any) present.
fn read_until_pg_header(&mut self) -> Result<Option<[u8; 27]>, OggReadError> {
let mut r = UntilPageHeaderReader::new();
use self::UntilPageHeaderResult::*;
let mut res = try!(r.do_read(&mut self.rdr));
loop {
res = match res {
Eof => return Ok(None),
Found => break,
ReadNeeded => try!(r.do_read(&mut self.rdr)),
SeekNeeded => try!(r.do_seek(&mut self.rdr))
}
}
Ok(Some(r.into_header()))
}

@TonalidadeHidrica
Copy link
Contributor Author

After reading the code for n hours, I finally understand why Seek is needed. When searching for header, if there wasn't a magic within the first 27 bytes, it reads the succeeding output in a (maximum of) 1024-byte chunk each. This is repeated until OggS magic is found. When it was found in the midst of the buffer read, the pointer of underlying reader should be moved back so that the subsequent process can read the content of the page succeeding right after the page header, and that's why the reader has to be Seeked, right?

But I think this can be done by reading byte-by-byte from the reader while matching against the magic pattern. Then it no longer require Seek. Additionally, I doubt the current algorithm enables us to avoid "an additional layer of buffering", because it simply discards the bytes it obtained from the reader rather than passing them to chunk reader. On the contrary, if some buffered reader is provided to the ogg reader, it even results in the data buffered twice.

What do you think? If my idea seems to be good, I'll refactor the UntilPageHeaderReader.

@est31
Copy link
Member

est31 commented Nov 12, 2020

When it was found in the midst of the buffer read, the pointer of underlying reader should be moved back so that the subsequent process can read the content of the page succeeding right after the page header, and that's why the reader has to be Seeked, right?

Yup that's correct.

But I think this can be done by reading byte-by-byte from the reader while matching against the magic pattern. Then it no longer require Seek.

That would work from a functionality standpoint, but note that it would then call the read function many many times. If the implementation calls a syscall inside, it would be quite slow from a performance standpoint. This can be fixed by using a BufReader or requiring the BufRead trait. That's what I meant by additional level of buffering.

@TonalidadeHidrica
Copy link
Contributor Author

I know that there is a concern of read function called over and over again, but I think it should be the user's responsibility to avoid it by wrapping the reader by a BufRead-ish reader. We don't have to reinvent the wheel, why not delegate it to an existing one?

By the way, when it comes to adopt my idea, which do you think is better: require BufRead instead of Read, or write an extra caution on the document? The former enables us to make sure that the user will never write an inefficient code, while the latter provides with more flexibility. I don't think wrapping with BufReader internally is not an good idea neither.

@est31
Copy link
Member

est31 commented Nov 12, 2020

require BufRead instead of Read, or write an extra caution on the document?

That'd be the best option imo. I'm not that convinced that the Seek requirement is that bad tho.

@TonalidadeHidrica
Copy link
Contributor Author

TonalidadeHidrica commented Nov 12, 2020

I'm wondering which of those two:

  • require BufRead instead of Read, or
  • write an extra caution on the document

is better, and I want your opinion. (Sorry for my confusing text.)

I think that it'd be better to remove Seek requirements because of RustAudio/rodio#333 . When decoding an streamed audio data via internet, as the author of the issue says, it's hard to implement Seek on the reader itself. Removing the bound makes the library more flexible.

@florian1345
Copy link

I have recently run into this issue myself. Do I understand correctly that seeking can only occur in UntilPageHeaderReader::do_read up to ~1024 (or whatever the length of the buffer) bytes backwards? If so, I would suggest a solution that would preserve the behavior of all projects that currently use this crate.

  • Add a trait that requires a method to read into a newly allocated buffer of 1024 and return any AsRef<[u8]> (which I will call read_revertable) as well as a method to seek backward at most 1024 bytes (which I will call revert).
  • Blanket-implement that trait for all types implementing Seek as doing what UntilPageHeaderReader::do_read does now: allocate a new [u8; 1024], read into it, and return it. As for revert, just use Seek.
  • Implement the new trait for a wrapper around types implementing only Read that, on read_revertable, reads into its own buffer and returns only a reference to it. When it is asked to revert, it remembers to output bytes from the stored buffer first before querying the wrapped Read. Branch prediction should keep performance loss at a minimum in the common case when no seeking is required.
  • Replace the allocation of a buffer and the reading done in UntilPageHeaderReader::do_read by a call to read_revertable and the seeking by a call to revert.

Unless I am missing something, this should still yield the same behavior for types implementing Seek, but allow users to efficiently use the decoder with types that do not implement Seek with an explicit opt-in. It also only buffers whenever it is necessary, so it should be more light-weight than using a BufReader.

An alternative fix would be to expose the size of the internal buffer to allow users to implement a custom buffering Seek wrapper without having to cache the entire file or rely on implementation details of this crate that may be subject to change.

I might be missing something here, but if any of these ideas sound good, I can attempt an implementation myself.

@est31
Copy link
Member

est31 commented Aug 2, 2022

@florian1345 that proposal sounds good! You are welcome to file a PR :).

florian1345 added a commit to florian1345/ogg that referenced this issue Aug 2, 2022
Implemented my proposal to resolve RustAudio#12 .
@florian1345 florian1345 linked a pull request Aug 2, 2022 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants