-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update to use htsFile in synced reader (for #1862) #1868
Conversation
htslib/synced_bcf_reader.h
Outdated
* With fname as NULL, index file must be present in default location with | ||
* default name, as file_ptr->fn doesn't contain index information |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels problematic. I can see that as bcf_sr_add_reader
is public API then we can't easily change it (other than to make a new API), but bcf_sr_add_hreader
is new so we probably should limit it straight off the bat.
I'd prefer the index filename to be an extra argument. If NULL then it has the same caveats as above (as called by bcf_sr_add_reader
), but if non-NULL then it specifies the index filename. This means people explicitly using this API can correctly handle differing cases of filenames and indices.
Note this is complicated in VCF too due to differing conventions of indices and many index variants in use. Eg VCF.gz may well be TBI, BAI or CSI. (I've no idea why TBI exists for standard formats as it's not needed and just muddies the waters with formats within formats.)
We also won't have a tabix index for BCF, so tbx_index_load is inappropriate there. Or is it just an historically misnamed function? (Quite possible in this twisty maze of indices all alike!)
Edit: apparently so. I don't understand what's going on in tabix index land, but the tbx component of this function name appears to be a total irrelevance regarding index format, but it is creating an tabix structure. I don't understand what the point is of a tabix structure for a non-tabix index. @pd3 care to elucidate what's going on here? Why is tbx_t relevant for CSI indices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having said that, bcftools uses bcf_sr_add_reader
extensively (44 times), and apparently the inability to specify the index filename has never been a problem for it. I assume just hunting around for .csi, .bai and .tbi is just widely understood behaviour. It does handle other variants like foo.bcf
and foo.csi
(Picard style naming), but I'm not sure where that happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think the filename handling is bugged anyway.
bcftools view foo.bcf
works with either foo.bcf.csi or foo.csi.
However bcftools view foo.vcf.gz
looks for foo.vcf.gz.csi and foo.vcf.csi. It doesn't work with the filename suffix absent (ie foo.gz) and foo.csi. However GATK appears to generate indices called foo.vcf.gz.tbi, so is different to how Picard handles BAM indices. Nice to have consistency ;-) So I suspect looking for the index in two levels of suffix is a misfeature on VCF and BCF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed fname to idx name that index file can be explicitly passed.
if index path is NULL, tries to get index file from default path with default name, as earlier.
@@ -274,7 +290,7 @@ int bcf_sr_add_reader(bcf_srs_t *files, const char *fname) | |||
BGZF *bgzf = hts_get_bgzfp(reader->file); | |||
if ( bgzf && bgzf_check_EOF(bgzf) == 0 ) { | |||
files->errnum = no_eof; | |||
hts_log_warning("No BGZF EOF marker; file '%s' may be truncated", fname); | |||
hts_log_warning("No BGZF EOF marker; file '%s' may be truncated", file_ptr->fn); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use fname in preference to file_ptr->fn?
Or maybe if fname is optionally NULL, do something like this at the start:
if (!fname)
fname = file_ptr->fn;
That makes the warning logs consistent and also removes the ?: operator later on which is code duplicated 3 times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file_ptr->fn gives the proper name of file, fname may have filename as well as index information.
With fname changed to idxname, file name has to be retrieved using this itself.
Also the ?: stuff removed as fname changed to idxname.
@@ -133,13 +133,15 @@ int main(int argc, char *argv[]) | |||
{"targets",required_argument,NULL,'t'}, | |||
{"no-index",no_argument,NULL,1000}, | |||
{"args",no_argument,NULL,1001}, | |||
{"usefptr",no_argument,NULL,'u'}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a line to the usage statement too.
Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
The PR seems to work fine from what I can tell. The additions to the test harness help this validation. Thanks. I added a few minor comments to address, but overall it solves the issue. |
This is like bcf_sr_add_reader but the caller supplies an existing open htsFile instead of a filename. Fixes samtools#1862
Fixes #1862
Added bcf_sr_add_hreader to add already opened htsfile to reader list.
The file added can be set to close along with reader or closed explicitly by user. Aux data in synced_bcf_reader.c is updated to hold the file closure status.
When the file name is passed along with file pointer, the index can be anywhere and mentioned using ##idx## method with file name. As file name in htsfile does not maintain index information, the index has to be present in default location with default name (i.e. same location as file with file name.) when no filename is passed to this method.
bcf_sr_add_reader updated to open the file and passes the file to bcf_sr_add_hreader with autoclose enabled.
Test utility and script updated to use the new interface. With -u argument, it uses the new interface where file pointer is made and used.