bgzip support #18

Open
slowkow opened this issue Dec 3, 2015 · 6 comments

Comments

@slowkow

slowkow commented Dec 3, 2015

Would it be possible to support files compressed with bgzip? Here's the link to source code. This would be very valuable for bioinformaticians.

Right now, here's what I get:

zindex test3.gz -v --regex '\trs([0-9]+)' --skip-first 5 --numeric --unique

Opening database test3.gz.zindex in read-write mode
Building index, generating a checkpoint every 32.00 MiB
Indexing...
Progress: 18 bytes of 129.16 MiB (0.00%)
Index reading complete
Flushing
Done
Closing database

It works after I convert from bgzip to gzip:

zcat test3.gz | gzip > test4.gz
zindex test4.gz -v --regex '\trs([0-9]+)' --skip-first 5 --numeric --unique

Warning: Rebuilding existing index test4.gz.zindex
Opening database test4.gz.zindex in read-write mode
Building index, generating a checkpoint every 32.00 MiB
Indexing...
Progress: 10 bytes of 123.81 MiB (0.00%)
Progress: 85.41 MiB of 123.81 MiB (68.98%)
Index reading complete
Flushing
Done
Closing database
@mattgodbolt
Owner

I'd happily accept a patch to support this file format, but without clear documentation on what the file format is, plus a good way to "fast forward" and store partial decompression information, it may be very difficult.

@schelhorn

schelhorn commented Dec 15, 2016

I'd value support for this as well; the BGZF file format is gunzip compatible and the specs are here. The tabix index is published here.

@mattgodbolt
Owner

Thanks for the +1. I'll see what I can do. Time for zindex/zq is seriously limited at the moment.

@lonphan

lonphan commented May 31, 2017

+1 for bgzip.

@mattgodbolt
Owner

Just trying to understand this a bit more. It seems like:

  • BGZF is really a sequence of compressed gzip blocks, each with extra information. The blocks are concatenated, which means the compression state is not required at each block boundary (zindex was specifically written to avoid having to do this on the source file).
  • tabix is an indexing system that understands the BGZF file format; it can index such a file and then offer random access to its blocks.

I'm not quite sure how zindex would fit into this. Perhaps someone here can share an example file and the kind of queries they would want to run?

At the very least, zindex should support concatenated gzip files (which are spec-compliant), even if it doesn't use the tabix format in any way. There might then be an option to drop the compression buffers from the zindex indices, which would make them smaller.
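
To make the block structure described above concrete, here is a rough Python sketch (not zindex or tabix code). It assumes the BGZF layout in which each block is a complete gzip member whose header "extra" field holds a `BC` subfield with `BSIZE`, the total block length minus one. The helper names `make_bgzf_block` and `iter_bgzf_blocks` and the sample rows are made up for illustration.

```python
import gzip
import struct
import zlib


def make_bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF-style block (hypothetical helper, small payloads only)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no wrapper
    deflated = comp.compress(payload) + comp.flush()
    bsize = 12 + 6 + len(deflated) + 8 - 1  # total block length minus one
    # ID1, ID2, CM, FLG (FEXTRA set), MTIME, XFL, OS, XLEN
    header = struct.pack("<4BI2BH", 31, 139, 8, 4, 0, 0, 255, 6)
    # Subfield 'B', 'C', length 2, then BSIZE
    extra = struct.pack("<2BHH", 66, 67, 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + deflated + trailer


def iter_bgzf_blocks(data: bytes):
    """Yield (offset, payload) per block, using BSIZE to find each boundary."""
    offset = 0
    while offset < len(data):
        xlen = struct.unpack_from("<H", data, offset + 10)[0]
        extra = data[offset + 12 : offset + 12 + xlen]
        bsize = None
        pos = 0
        while pos + 4 <= len(extra):
            si1, si2, slen = struct.unpack_from("<2BH", extra, pos)
            if (si1, si2) == (66, 67) and slen == 2:
                bsize = struct.unpack_from("<H", extra, pos + 4)[0]
            pos += 4 + slen
        if bsize is None:
            raise ValueError("no BC subfield: not a BGZF block")
        block = data[offset : offset + bsize + 1]
        # Each block is a full gzip member, so a fresh decoder is enough:
        # no compression state from earlier blocks is needed.
        yield offset, zlib.decompress(block, wbits=31)
        offset += bsize + 1


rows = [b"chr1\t100\trs1\n", b"chr1\t200\trs2\n"]  # made-up sample data
data = b"".join(make_bgzf_block(r) for r in rows)
found = list(iter_bgzf_blocks(data))

# An ordinary multi-member gzip reader sees one continuous stream.
whole = gzip.decompress(data)
```

Because every boundary is recoverable from the headers alone, an index over such a file would only need to remember block offsets rather than saved decompressor windows, which is the size saving suggested above.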

mattgodbolt added a commit that referenced this issue Jun 9, 2017
@mattgodbolt
Owner

OK: I now support what I believe is the bgzip format, though without understanding any of its tables etc. As bgzip is just concatenated gzip files (with extra trailer info), it should "just work". @slowkow and/or @schelhorn, can you give it a go please? Again, this doesn't use or understand the tabix part.
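
For readers wondering what handling "concatenated gzip files" involves, here is a minimal Python sketch of the general idea. It is not the code from the referenced commit, which this thread does not show. It relies on the fact that a single `zlib.decompressobj` stops at the end of one gzip member and leaves the rest of the input in `unused_data`. The function name `decompress_all_members` and the sample data are hypothetical.

```python
import gzip
import zlib


def decompress_all_members(stream: bytes) -> bytes:
    """Decode every gzip member in a concatenated stream (hypothetical helper)."""
    pieces = []
    remaining = stream
    while remaining:
        # wbits=31 selects the gzip wrapper; one object decodes one member.
        decoder = zlib.decompressobj(wbits=31)
        pieces.append(decoder.decompress(remaining))
        if not decoder.eof:
            raise ValueError("truncated gzip member")
        # Whatever follows the member's trailer belongs to the next member.
        remaining = decoder.unused_data
    return b"".join(pieces)


# Made-up sample: three members, the last one empty, like an end-of-file marker.
parts = [b"header line\n", b"chr1\t100\trs123\n", b""]
stream = b"".join(gzip.compress(p) for p in parts)

# A decoder that is not restarted sees only the first member.
single = zlib.decompressobj(wbits=31)
first_only = single.decompress(stream)

everything = decompress_all_members(stream)
```

A reader that never restarts after the first member would report only a few bytes of output, which would be consistent with the "18 bytes of 129.16 MiB" progress line in the original report, though the thread does not confirm that this was the cause.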
