Various spec violations #24
Comments
Let's split this up then. Hotfix PRs addressing errors take priority. Hotfix PRs would be against the master branch and would embrace the current design flaws. Also, PRs addressing inconsistencies could be hotfixes too provided they don't alter the existing public API. My preference is to address all known errors before making breaking changes (v0.3).
It'd be great to get rid of that
The direction I was thinking is a two-stage approach, which would facilitate error recovery (https://github.com/BioJulia/XAM.jl/projects/4#card-42156212). The first stage would fill the record by copying in the current line. The second stage would then attempt to index the copied data and return a truthy value on success. I'm not sure what the performance implications of the two-stage approach would be, since it passes over the line twice. I'd be keen to hear your ideas about error recovery.
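To make the two-stage idea concrete, here is a minimal sketch in Python (chosen only for illustration; it is not XAM.jl code). The names `Record`, `fill`, and `index` are hypothetical, and the validity rule (at least 11 non-empty tab-separated fields, matching the 11 mandatory SAM columns) is a simplified stand-in for real specification checks.

```python
from dataclasses import dataclass, field

MANDATORY_FIELDS = 11  # SAM alignment lines carry 11 mandatory tab-separated fields


@dataclass
class Record:  # hypothetical container, not XAM.jl's record type
    data: bytearray = field(default_factory=bytearray)
    fields: list = field(default_factory=list)  # (start, stop) byte span per field
    indexed: bool = False


def fill(record, line):
    """Stage 1: copy the raw line into the record without interpreting it."""
    record.data[:] = line.rstrip(b"\r\n")
    record.fields.clear()
    record.indexed = False


def index(record):
    """Stage 2: locate field boundaries in the copied data; return True on success."""
    spans = []
    start = 0
    for pos, byte in enumerate(record.data):
        if byte == 0x09:  # tab separates fields
            spans.append((start, pos))
            start = pos + 1
    spans.append((start, len(record.data)))
    mandatory = spans[:MANDATORY_FIELDS]
    if len(mandatory) < MANDATORY_FIELDS or any(a == b for a, b in mandatory):
        return False  # the raw bytes stay in record.data for diagnosis
    record.fields = spans
    record.indexed = True
    return True
```

Because `fill` never fails on malformed content, the raw bytes remain available after `index` returns `False`, which is the error-recovery property the two-stage scheme is after. The cost is the second pass over the line.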
I agree with rewriting the memory layout -- that's been bugging me. I also agree with your proposed way of consolidating the accessor and has functions. I think we should use
PRs with developmental or breaking changes would be against the develop branch.
Right, makes sense. I'll begin with that
I think we should enforce specification-compliant BAM/SAM files, in the sense that, if the user provides a noncompliant file, we ought to error. Silent errors are the worst, especially in a scientific setting. I'm not sure there's a lot of sense in allowing bad records to be used. In particular, I have rather strong opinions about #23: this is an error in HiSat2, and should be fixed there. If we begin correcting errors created by bad writers, we have a never-ending problem on our hands.
Nonetheless, having the internal implementation be first decompressing the data, then copying a single record to a
I'm quite ambivalent about how to handle missing data. I think there are two good options:
I mostly agree. I think as a default we should expect specification-compliant BAM/SAM files and error when there is non-compliance. But, I think we only need to enforce this expectation at the line-level, not holistically, so that when it comes to iteration, users can make a judgement call and opt to skip errors.
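As a rough illustration of line-level enforcement with an opt-in skip, here is a sketch in Python (used only for illustration; none of these names exist in XAM.jl). `parse_line`, `records`, and the `on_error` parameter are hypothetical, and the only compliance check is a field count standing in for real validation.

```python
def parse_line(line):
    """Hypothetical line-level check: raise if the line is not a plausible SAM record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:  # SAM alignment lines have 11 mandatory fields
        raise ValueError(f"expected at least 11 fields, found {len(fields)}")
    return fields


def records(lines, on_error="throw"):
    """Yield parsed records; the default is to error on the first noncompliant line."""
    if on_error not in ("throw", "skip"):
        raise ValueError("on_error must be 'throw' or 'skip'")
    for number, line in enumerate(lines, start=1):
        if line.startswith("@"):
            continue  # header lines are not alignment records
        try:
            record = parse_line(line)
        except ValueError as err:
            if on_error == "throw":
                raise ValueError(f"line {number}: {err}") from err
            continue  # the caller explicitly opted in to skipping bad lines
        yield record
```

With this shape, erroring stays the default and skipping is a deliberate choice made by the caller at iteration time, for example `records(lines, on_error="skip")`.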
I agree we should error.
Yes, a poor choice of words on my part. The record would probably be unusable as well as indicative of other issues relating to the quality of the file. What I was thinking is that the data should be accessible for diagnosis and easy to acquire. For example, after an error, we should be able to seek position, then read erroneous data into
I agree with that too and have no intention of merging the workaround. For me, the issue was that XAM, in its current state, was not able to work with or around an error that was judged trivial.
In the current release, and certainly with SAM, indexing of the data occurs as the line/record is determined. In a sense, the record already has the indexes before the raw data gets copied into it. Also, we do not explicitly check for successful indexing; we assume indexing was successful if the record fills without error. I'd need to look at BAM again to see what it does with records, but the parsing of the BAM header is the same as SAM. In the context of skipping errors, I was proposing to have both index-then-fill and fill-then-index schemes. Though perhaps a cleaner approach would be to have Automa seek the end of the line/record if it falls into an invalid state and then return a flag for error handling, whether that be a throw or a skip. With this approach, the pointer is positioned appropriately for the next line/record, and there is only a single pass over the data.
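The single-pass alternative can be sketched as follows. This is a hand-written Python loop for illustration only, not an Automa-generated machine; `scan_record` is a hypothetical name, and treating any byte outside printable ASCII (other than tab) as the invalid state is an assumed stand-in for the real grammar.

```python
TAB, NEWLINE = 0x09, 0x0A


def scan_record(buf, pos=0):
    """Scan one line starting at pos; return (ok, spans, next_pos).

    On entering an invalid state the scanner stops indexing but keeps
    advancing to the end of the line, so next_pos always points at the
    start of the following line.
    """
    spans = []
    start = pos
    ok = True
    i = pos
    n = len(buf)
    while i < n and buf[i] != NEWLINE:
        byte = buf[i]
        if ok:
            if byte == TAB:
                spans.append((start, i))
                start = i + 1
            elif not 0x20 <= byte <= 0x7E:
                ok = False  # invalid state: give up indexing, keep seeking line end
        i += 1
    if ok:
        spans.append((start, i))
        ok = len(spans) >= 11  # 11 mandatory SAM fields
    next_pos = min(i + 1, n)
    return ok, (spans if ok else []), next_pos
```

The caller decides what the flag means: raise when `ok` is false, or continue from `next_pos` to skip the record. Either way each byte is visited once.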
I'd forgotten about
Dear @CiaranOMara
I've been working a bit with XAM.jl. Great package, but I've found quite a number of small errors and mistakes I think should be fixed. Making one PR for each would be overwhelming, especially since fixing some of them has implications that are not straightforward. So I'd like to discuss it a bit here:
List of errors/inconsistencies
Proposals/enhancements
I've implemented many of these changes at https://github.com/jakobnissen/XAM.jl, but that "fork" is stale, and I'd rather merge the fixes into the original XAM.jl.