Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update to use htsFile in synced reader (for #1862) #1868

Merged
merged 1 commit into from
Jan 16, 2025

Conversation

vasudeva8
Copy link
Contributor

Fixes #1862
Added bcf_sr_add_hreader to add already opened htsfile to reader list.
The file added can be set to close along with reader or closed explicitly by user. Aux data in synced_bcf_reader.c is updated to hold the file closure status.
When the file name is passed along with file pointer, the index can be anywhere and mentioned using ##idx## method with file name. As file name in htsfile does not maintain index information, the index has to be present in default location with default name (i.e. same location as file with file name.) when no filename is passed to this method.

bcf_sr_add_reader updated to open the file and passes the file to bcf_sr_add_hreader with autoclose enabled.

Test utility and script updated to use the new interface. With -u argument, it uses the new interface where file pointer is made and used.

Comment on lines 253 to 254
* With fname as NULL, index file must be present in default location with
* default name, as file_ptr->fn doesn't contain index information
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels problematic. I can see that as bcf_sr_add_reader is public API then we can't easily change it (other than to make a new API), but bcf_sr_add_hreader is new so we probably should limit it straight off the bat.

I'd prefer the index filename to be an extra argument. If NULL then it has the same caveats as above (as called by bcf_sr_add_reader), but if non-NULL then it specifies the index filename. This means people explicitly using this API can correctly handle differing cases of filenames and indices.

Note this is complicated in VCF too due to differing conventions of indices and many index variants in use. Eg VCF.gz may well be TBI, BAI or CSI. (I've no idea why TBI exists for standard formats as it's not needed and just muddies the waters with formats within formats.)

We also won't have a tabix index for BCF, so tbx_index_load is inappropriate there. Or is it just an historically misnamed function? (Quite possible in this twisty maze of indices all alike!)

Edit: apparently so. I don't understand what's going on in tabix index land, but the tbx component of this function name appears to be a total irrelevance regarding index format, but it is creating an tabix structure. I don't understand what the point is of a tabix structure for a non-tabix index. @pd3 care to elucidate what's going on here? Why is tbx_t relevant for CSI indices.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having said that, bcftools uses bcf_sr_add_reader extensively (44 times), and apparently the inability to specify the index filename has never been a problem for it. I assume just hunting around for .csi, .bai and .tbi is just widely understood behaviour. It does handle other variants like foo.bcf and foo.csi (Picard style naming), but I'm not sure where that happens.

Copy link
Contributor

@jkbonfield jkbonfield Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think the filename handling is bugged anyway.

bcftools view foo.bcf works with either foo.bcf.csi or foo.csi.

However bcftools view foo.vcf.gz looks for foo.vcf.gz.csi and foo.vcf.csi. It doesn't work with the filename suffix absent (ie foo.gz) and foo.csi. However GATK appears to generate indices called foo.vcf.gz.tbi, so is different to how Picard handles BAM indices. Nice to have consistency ;-) So I suspect looking for the index in two levels of suffix is a misfeature on VCF and BCF.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed fname to idx name that index file can be explicitly passed.
if index path is NULL, tries to get index file from default path with default name, as earlier.

@@ -274,7 +290,7 @@ int bcf_sr_add_reader(bcf_srs_t *files, const char *fname)
BGZF *bgzf = hts_get_bgzfp(reader->file);
if ( bgzf && bgzf_check_EOF(bgzf) == 0 ) {
files->errnum = no_eof;
hts_log_warning("No BGZF EOF marker; file '%s' may be truncated", fname);
hts_log_warning("No BGZF EOF marker; file '%s' may be truncated", file_ptr->fn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use fname in preference to file_ptr->fn?

Or maybe if fname is optionally NULL, do something like this at the start:

if (!fname)
    fname = file_ptr->fn;

That makes the warning logs consistent and also removes the ?: operator later on which is code duplicated 3 times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file_ptr->fn gives the proper name of file, fname may have filename as well as index information.
With fname changed to idxname, file name has to be retrieved using this itself.
Also the ?: stuff removed as fname changed to idxname.

@@ -133,13 +133,15 @@ int main(int argc, char *argv[])
{"targets",required_argument,NULL,'t'},
{"no-index",no_argument,NULL,1000},
{"args",no_argument,NULL,1001},
{"usefptr",no_argument,NULL,'u'},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a line to the usage statement too.
Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

@jkbonfield
Copy link
Contributor

The PR seems to work fine from what I can tell. The additions to the test harness help this validation. Thanks.

I added a few minor comments to address, but overall it solves the issue.

This is like bcf_sr_add_reader but the caller supplies an existing
open htsFile instead of a filename.

Fixes samtools#1862
@jkbonfield jkbonfield merged commit dac8f29 into samtools:develop Jan 16, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: request, make synced_bcf_reader::bcf_sr_add_reader equivalent that accepts htsFile
2 participants