Speed up kputll. #1805

jkbonfield · 2024-07-09T17:09:35Z

The kputuw function is considerably faster than kputll as it encodes two digits at a time and also uses __builtin_clz. This changes kputll to use the same two-digits-at-a-time trick. I have a __builtin_clzll variant too, but with longer numbers the digit count isn't the main bottleneck, and we fall back to kputuw for small numbers. This avoids complicating the code with builtin checks and alternate versions.
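For readers unfamiliar with the trick, here is a minimal sketch of two-digits-at-a-time formatting. It is not the code in this PR: the name format_ll_2digits, the plain char buffer (instead of a kstring_t) and the temporary-buffer approach are all assumptions made purely for illustration. The idea is that each division by 100 yields a value in 0..99, which indexes a 200-byte table of pre-rendered digit pairs, halving the number of divisions.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT the htslib implementation.
 * Formats v as decimal into out (which must hold at least 21 bytes)
 * and returns the number of characters written, excluding the NUL. */
static size_t format_ll_2digits(long long v, char *out)
{
    /* "00", "01", ..., "99" laid out back to back. */
    static const char pairs[] =
        "00010203040506070809"
        "10111213141516171819"
        "20212223242526272829"
        "30313233343536373839"
        "40414243444546474849"
        "50515253545556575859"
        "60616263646566676869"
        "70717273747576777879"
        "80818283848586878889"
        "90919293949596979899";

    char tmp[20];                 /* enough for the 19 digits of |LLONG_MIN| */
    char *p = tmp + sizeof tmp;   /* fill from the end, least significant first */
    size_t n = 0;

    /* Negate in unsigned arithmetic so that LLONG_MIN does not overflow. */
    unsigned long long x = v < 0 ? 0ULL - (unsigned long long)v
                                 : (unsigned long long)v;

    /* One division per two output digits. */
    while (x >= 100) {
        unsigned r = (unsigned)(x % 100);
        x /= 100;
        p -= 2;
        memcpy(p, &pairs[2 * r], 2);
    }

    /* One or two leading digits remain. */
    if (x >= 10) {
        p -= 2;
        memcpy(p, &pairs[2 * x], 2);
    } else {
        *--p = (char)('0' + x);
    }

    if (v < 0)
        out[n++] = '-';

    size_t len = (size_t)(tmp + sizeof tmp - p);
    memcpy(out + n, p, len);
    n += len;
    out[n] = '\0';
    return n;
}
```

Calling format_ll_2digits(-1234567, buf) with a 32-byte buf should leave "-1234567" in buf. Using __builtin_clz to estimate the digit count up front, as kputuw does, would let the digits be written directly into the destination without the temporary buffer; this sketch leaves that out to stay portable.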

An alternative, purely for sam_format1_append, would be something like:

static inline int kputll_fast(long long c, kstring_t *s) {
    return c <= INT_MAX && c >= INT_MIN ? kputw(c, s) : kputll(c, s);
}
#define kputll kputll_fast

This works as BAM/CRAM only support 32-bit numbers for POS, PNEXT and TLEN anyway, so ll vs w is an irrelevant distinction. However, I chose to modify the header file instead so that other callers benefit too.

Overall, compressed BAM to uncompressed SAM conversion is about 5% quicker (tested on 10 million short-read seqs; the gain will be minimal on long seqs). This figure includes decode time and other functions too. The sam_format1_append component alone is about 15-25% quicker, depending on compiler and version.

jkbonfield mentioned this pull request Jul 11, 2024

Optimise samtools depth histogram incrementing code. samtools/samtools#2078

Merged

daviesrob assigned whitwham Jul 11, 2024

jkbonfield force-pushed the kputll branch from 16fa935 to 6b9d7f1 Compare July 11, 2024 13:11

whitwham merged commit 19a27e9 into samtools:develop Jul 16, 2024
9 checks passed
