Rehaul BAM and SAM accessor functions #28

jakobnissen · 2020-08-07T13:45:07Z

Changes: (updated as I go along):

Changed the "has" functions, e.g. hasposition, to always return a Bool
Accessor functions now return what they're supposed to (see Various spec violations #24)
There are no more unfilled BAM records, so isfilled and hasflag have been removed
Changed printing of records in the REPL

Things to mull over

Should we even have has-functions, or just return a specific sentinel value when the record doesn't have the information available, e.g. nothing? The latter may be neater.
Which sentinel value? I'm leaning towards nothing (because it errors so quickly if you don't handle it correctly, which leads to fewer bugs for the end user), but if you want missing, that's fine as well.
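
To make the two options concrete, here is a minimal sketch of both styles. `ToyRecord`, `hasposition`, and `position_or_nothing` are hypothetical names used only for illustration and are not taken from the package.

```julia
# Hypothetical toy record; a stored position of 0 stands for "not available".
struct ToyRecord
    pos::Int
end

# Style 1: a has-function that always returns a Bool.
hasposition(rec::ToyRecord) = rec.pos != 0

# Style 2: one accessor that returns the sentinel `nothing` when the value is absent.
position_or_nothing(rec::ToyRecord) = rec.pos == 0 ? nothing : rec.pos

unmapped = ToyRecord(0)
pos = position_or_nothing(unmapped)

# Forgetting to handle `nothing` fails at the first arithmetic use
# instead of silently propagating through later code.
failed_fast = try
    pos + 1
    false
catch err
    err isa MethodError
end
```

With the sentinel style the caller checks `pos === nothing` once, rather than calling a has-function and then the accessor.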

Closes https://github.com/BioJulia/BioAlignments.jl/pull/33/files.

jakobnissen · 2020-08-07T14:27:57Z

Changed printing of records:
Before, empty record:

XAM.BAM.Record:
      template name: nothing
               flag: 4
       reference ID: nothing
           position: nothing
    mapping quality: nothing
              CIGAR: 
  next reference ID: nothing
      next position: nothing
    template length: nothing
           sequence: nothing
       base quality: nothing
     auxiliary data:

After, empty record:

XAM.BAM.Record:
      template name: 
               flag: 0x0004
       reference ID: 
           position: 
    mapping quality: 
              CIGAR: 
  next reference ID: 
      next position: 
    template length: 
           sequence: 
       base quality: 
     auxiliary data:

Before, filled record:

XAM.BAM.Record:
      template name: HWI-1KL120:88:D0LRBACXX:1:1101:2205:2204
               flag: 83
       reference ID: 1
           position: 132616
    mapping quality: 60
              CIGAR: 101M
  next reference ID: 1
      next position: 132491
    template length: 226
           sequence: GGTCCCACCTTGTCCTCCTCCTACACATACTCGGATGCTTCCTCCTCAACCTTGGCACCCACCTCCTTCTTACTGGGCCCAGGAGCCTTCAATGCCCAGGA
       base quality: UInt8[0x21, 0x21, 0x21, 0x1f, 0x1e, 0x22, 0x1d, 0x1e, 0x1e, 0x1e  …  0x27, 0x25, 0x25, 0x25, 0x25, 0x25, 0x25, 0x22, 0x22, 0x22]
     auxiliary data: XT=U NM=3 SM=37 AM=37 X0=1 X1=0 XM=3 XO=0 XG=0 MD=4A4C82A8

After, filled record:

XAM.BAM.Record:
      template name: "HWI-1KL120:88:D0LRBACXX:1:1101:2205:2204"
               flag: 0x0053
       reference ID: 1
           position: 132616
    mapping quality: 0x3c
              CIGAR: "101M"
  next reference ID: 1
      next position: 132491
    template length: 226
           sequence: GGTCCCACCTTGTCCTCCTCCTACACAT…CTGGGCCCAGGAGCCTTCAATGCCCAGGA
       base quality: ▆▆▆▆▆▆▄▆▆▆▆▆▆▇▇▇▇▇▇▇▆▆▄▆▄▄▆▆…▇▇██████████████▇▇▇▇▇▇▇▇▇▇▆▆▆
     auxiliary data: XT='U' NM=0x03 SM=0x25 AM=0x25 X0=0x01 X1=0x00 XM=0x03 XO=0x00 XG=0x00 MD="4A4C82A8"

CiaranOMara · 2020-08-08T15:23:19Z

src/common.jl

+function quality_string(quals::Vector{UInt8})
+    characters = Vector{Char}(undef, length(quals))
+    for i in eachindex(quals)
+        @inbounds qual = quals[i]
+        if qual < 10
+            char = ' '
+        elseif qual < 15
+            char = '▁'
+        elseif qual < 20
+            char = '▂'
+        elseif qual < 25
+            char = '▃'
+        elseif qual < 30
+            char = '▄'
+        elseif qual < 35
+            char = '▆'
+        elseif qual < 40
+            char = '▇'
+        elseif qual < 255
+            char = '█'
+        else
+            char = '?'
+        end
+        @inbounds characters[i] = char
+    end
+    return join(characters)
+end


What about factoring out the kernel?

function quality_char(qual::UInt8)
    if qual < 10
        return ' '
    end
    if qual < 15
        return '▁'
    end
    if qual < 20
        return '▂'
    end
    if qual < 25
        return '▃'
    end
    if qual < 30
        return '▄'
    end
    if qual < 35
        return '▆'
    end
    if qual < 40
        return '▇'
    end
    if qual < 255
        return '█'
    end
    return '?'
end

function quality_string(quals::Vector{UInt8})
    return String(quality_char.(quals))
end

CiaranOMara · 2020-08-13T08:22:57Z

Should we even have has-functions, or just return a specific sentinel value when the record doesn't have the information available, e.g. nothing? The latter may be neater.

I think the has and is functions that provide an interpretation of the data/record are useful; the clear case being the flag field. However, the has functions are not necessary for checking the presence of a value in a mandatory field.
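
As a rough illustration of why interpretation functions are useful for the flag field, the sketch below decodes a few bits of the flag values shown in the printed records. The bit values follow the SAM specification; the function and constant names are hypothetical and may not match the package.

```julia
# Flag bits as defined in the SAM specification.
const FLAG_PAIRED  = 0x0001   # template has multiple segments
const FLAG_UNMAP   = 0x0004   # segment unmapped
const FLAG_REVERSE = 0x0010   # sequence is reverse complemented

# Hypothetical "is" functions: each interprets one bit and returns a Bool.
ispaired(flag::UInt16)  = (flag & FLAG_PAIRED) != 0
ismapped(flag::UInt16)  = (flag & FLAG_UNMAP) == 0
isreverse(flag::UInt16) = (flag & FLAG_REVERSE) != 0

filled_flag = 0x0053   # flag of the filled record above (83 in decimal)
empty_flag  = 0x0004   # flag of the empty record above
```

Here the caller never has to remember which bit means what, which is the kind of interpretation a raw accessor cannot provide.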

Would it be accurate to say that we're attempting to balance out the issue of consistency between SAM and BAM: whether it is useful to have consistency at the data-level or whether it is sufficient to have consistency at the print-level?
If we're to have better consistency between SAM and BAM, some of the default null values we encounter need to be translated. If these translations occur at the data-level, they can introduce a form of double-checking (#26), which I'm fine with if we're clear that we want consistency between SAM and BAM at the data-level where possible.

In any case, I generally think accessors should assume the presence of a value.

Also, once the raw byte data of a record has been interpreted, I'm not wedded to the idea that the record must maintain an exact byte structure of the original.

Which sentinel value? I'm leaning towards nothing (because it errors so quickly if you don't handle it correctly, which leads to fewer bugs for the end user), but if you want missing, that's fine as well.

I would now lean towards nothing too -- the use of missing was annoying when chaining field comparisons.
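
A small sketch of the annoyance with chained comparisons, using only Base Julia behaviour; the variable names are illustrative.

```julia
# `missing` propagates through comparisons (three-valued logic),
# so the result is `missing` rather than `false`.
cmp = (missing == 5)

# Using that result as a condition throws a TypeError.
threw = try
    if missing == 5
        true
    end
    false
catch err
    err isa TypeError
end

# `nothing` compares like an ordinary value, so chained checks stay Boolean.
pos = nothing
mapq = 0x3c
ok = pos !== nothing && mapq >= 0x1e
```

With `missing`, every comparison in a chain needs `ismissing` or `isequal` guards; with `nothing`, a single `!== nothing` check is enough.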

BTW, I like that base quality print out.

jakobnissen · 2020-08-13T08:24:45Z

I think it's useful to think in terms of three levels:

The data level, i.e. how does a record struct look: Here, we are locked to the BAM and SAM specs and can't really do much.
The functional interface (API level) of the package. Here, I mean what a user of XAM.jl experiences when they try to do something simple, e.g. "Extract the base qualities in Phred-33 encoding". What we basically want here is to make it as easy as possible to work with, and not necessarily tie the interface too closely to the data level. I.e. the fact that a missing quality is encoded as 0xff should not really matter for the user.
The print level. This is the least important one and just REPL cosmetics.
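
A minimal sketch of what separating the API level from the data level could look like for base qualities. `base_quality` and `phred33_string` are hypothetical names, and treating an all-0xff vector as "absent" is an assumption based on the encoding mentioned above.

```julia
# Hypothetical accessor; `raw` stands in for the quality bytes stored in a record.
function base_quality(raw::Vector{UInt8})
    # Assumption: a vector consisting only of 0xff means the qualities are absent.
    # (An empty vector is also reported as absent, since `all` is true for it.)
    all(==(0xff), raw) && return nothing
    return raw
end

# Hypothetical helper: render raw Phred scores as a Phred-33 encoded string.
phred33_string(quals::Vector{UInt8}) = String(Char.(quals .+ 0x21))

absent  = base_quality([0xff, 0xff, 0xff])
present = base_quality([0x21, 0x1f])
encoded = phred33_string([0x21, 0x1f])
```

The user only ever sees `nothing` or usable scores; the 0xff convention stays inside the accessor.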

The issue with the API level is that there may be a conflict between having an easy API and having a performant one. In some circumstances it may be too difficult to NOT expose some of the underlying data, even though it shouldn't be relevant at the API level.

Actually I think the best thing may be to just make a new branch and play around a bit with it. Then I'll try to use the package in some small test code and get a feel for it.

CiaranOMara · 2020-08-13T08:25:33Z

You present the levels well.

I wonder whether we gain anything by not locking ourselves into the spec at the data-level?

To elaborate, there are the raw data files as per BAM and SAM specs, and our interpretation of the raw data which we keep in our struct. We could translate the raw data (null defaults) as they are loaded into our struct and then again if they are written to file.

If we can rule out translation at that level, then the remaining spot for translation is in the accessors and setters.

CiaranOMara · 2020-08-13T08:25:51Z

I don't think there is anything to gain by pushing the translation to the bottom layer. It increases complexity and introduces a performance penalty, in that fields would be translated unnecessarily for records that would be discarded after a first check of the flag field. It makes sense to defer translation until the field data is actually required. I'm not particularly bothered by the possibility of repeated translation with multiple field access, as these fields in my experience are single-use, wouldn't you agree @jakobnissen?
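
The deferred-translation idea can be sketched as follows. `RawRecord` and `position_1based` are hypothetical; the 0-based position with -1 meaning "absent" is an assumption taken from the BAM specification, not from this package's code.

```julia
# Hypothetical record that keeps fields exactly as stored in the file.
struct RawRecord
    flag::UInt16
    pos::Int32      # 0-based; -1 means absent (assumption from the BAM spec)
end

# Cheap check on the untranslated flag field.
ismapped(rec::RawRecord) = (rec.flag & 0x0004) == 0

# Translation happens only here, when the field is actually requested.
position_1based(rec::RawRecord) = rec.pos == -1 ? nothing : Int(rec.pos) + 1

records = [RawRecord(0x0004, -1), RawRecord(0x0053, 132615)]

# Filter on the flag first; discarded records never pay for translation.
positions = [position_1based(r) for r in records if ismapped(r)]
```

Calling `position_1based` twice on the same record repeats the translation, which is the cost accepted above on the grounds that these fields are usually read once.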

jakobnissen · 2020-08-13T08:27:06Z

Yes, I agree. There's also something rather nice about having the layout of a BAM struct be exactly equal to its representation in a file.
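
For illustration, a struct mirroring the fixed-length part of a BAM alignment record might look like the sketch below. The field order and widths are taken from the SAM/BAM specification as an assumption; the struct name and field names are hypothetical and the package's actual type may differ.

```julia
# Sketch of the fixed-length fields of a BAM alignment record, in file order.
struct FixedFields
    block_size::UInt32
    refid::Int32
    pos::Int32
    l_read_name::UInt8
    mapq::UInt8
    bin::UInt16
    n_cigar_op::UInt16
    flag::UInt16
    l_seq::UInt32
    next_refid::Int32
    next_pos::Int32
    tlen::Int32
end

# With this field order every field is naturally aligned, so the struct
# occupies exactly the sum of its field sizes.
nbytes = sizeof(FixedFields)

# Byte offset of the flag field (the 8th field) within the struct.
flag_offset = Int(fieldoffset(FixedFields, 8))
```

Because the in-memory layout and the on-disk layout coincide under these assumptions, no per-field conversion is needed when a record is loaded, apart from byte order on big-endian machines.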

CiaranOMara and others added 4 commits July 8, 2020 23:30

Improve BAM.quality performance

cf168ac

Closes https://github.com/BioJulia/BioAlignments.jl/pull/33/files.

Make BAM record layout match BAM specs

a6fff2b

Change BAM accessor functions

cfd4149

Improve printing of BAM records

198e031

CiaranOMara reviewed Aug 8, 2020


CiaranOMara force-pushed the develop branch from 20dbf80 to dde235f Compare October 19, 2020 14:38

jakobnissen added 5 commits February 28, 2021 13:13

Switch to CodecBGZF

fc16f7d

Merge branch 'codecbgzf' into rehaul

8054407

Disable cross tests

fb886c0

Add quality iter

3bc38ee

Add no-copy AUX data

8c219c2

CiaranOMara force-pushed the develop branch from 46b6e36 to d6502f0 Compare April 28, 2021 01:59

CiaranOMara force-pushed the develop branch from 2bae036 to ccc9112 Compare June 29, 2021 03:26

CiaranOMara force-pushed the develop branch from ccc9112 to 18dae0d Compare July 6, 2021 04:47

CiaranOMara force-pushed the develop branch from 2782c2d to 5e7973c Compare March 22, 2022 11:56

CiaranOMara force-pushed the develop branch from 5e7973c to 52d0ce2 Compare July 2, 2022 06:38

CiaranOMara force-pushed the develop branch from 481d120 to 9a9f2c1 Compare February 28, 2023 01:58
