Document performance considerations? #125

alexlenail · 2022-05-03T16:19:51Z

I'd like to use pyBigWig to collect values at many intervals from many bigwigs, and I'd love to know what's performant.

Is there overhead to opening a bigWig file with pyBigWig? That is, what is the runtime difference between:

with pyBigWig.open(bigwig_file) as bw:
    for chrom, start, stop in intervals:
        bw.values(chrom, start, stop)

and

for chrom, start, stop in intervals:
    with pyBigWig.open(bigwig_file) as bw:
        bw.values(chrom, start, stop)

If the former is optimal, is there any advantage to the intervals being sorted?
Do you know the relative performance of pyBigWig entries() queries on bigBed files versus tabix queries on gzipped BED files?
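One way to answer the open-once versus open-per-query question for a given file is to time both patterns directly. The sketch below is a minimal harness, not a benchmark result: `time_open_once` and `time_open_per_query` are hypothetical helper names, and `opener` is assumed to be a callable such as `pyBigWig.open` that returns an object with `values()` and `close()` methods.

```python
import time


def time_open_once(opener, path, intervals):
    """Open the file once, query every interval, return elapsed seconds."""
    t0 = time.perf_counter()
    bw = opener(path)
    try:
        for chrom, start, stop in intervals:
            bw.values(chrom, start, stop)
    finally:
        bw.close()
    return time.perf_counter() - t0


def time_open_per_query(opener, path, intervals):
    """Re-open the file for every interval, return elapsed seconds."""
    t0 = time.perf_counter()
    for chrom, start, stop in intervals:
        bw = opener(path)
        try:
            bw.values(chrom, start, stop)
        finally:
            bw.close()
    return time.perf_counter() - t0
```

Calling `time_open_once(pyBigWig.open, bigwig_file, intervals)` and then the same with `sorted(intervals)` would also give a rough measure of whether sorting matters for a particular file; the thread itself does not settle either question.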


gokceneraslan · 2022-05-07T06:21:49Z

I think a vectorized version of bw.values would be much better, e.g.

bw.values(np.array([chrom]*3), np.array([79250, 86700, 87277]), np.array([80250, 87700, 88277]), numpy=True)

which would return a list of numpy arrays without iterating over the intervals in a loop. But I guess this is not implemented yet.
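Until something like that exists, the call shape proposed above can be approximated with a thin wrapper. This is only a sketch: `values_many` is a hypothetical helper, not part of pyBigWig, and it still loops in Python internally, so it offers convenience rather than a speedup. Extra keyword arguments (for example `numpy=True`, assuming pyBigWig was built with numpy support) are passed through to `bw.values` unchanged.

```python
def values_many(bw, chroms, starts, ends, **kwargs):
    """Return one result per interval by calling bw.values() repeatedly.

    bw is assumed to be an open pyBigWig handle (or anything with a
    compatible values(chrom, start, end, **kwargs) method).
    """
    if not (len(chroms) == len(starts) == len(ends)):
        raise ValueError("chroms, starts and ends must have the same length")
    # int() converts numpy integer scalars to plain Python ints.
    return [
        bw.values(chrom, int(start), int(end), **kwargs)
        for chrom, start, end in zip(chroms, starts, ends)
    ]
```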

alexlenail · 2022-05-10T02:21:59Z

@dpryan79 what is the fastest way to get arrays of values from a bigwig file for each of many genomic intervals (i.e. entries in a bed file)?

BradBalderson · 2024-04-04T23:52:02Z

For others who land here: I found that a better solution for the task described above was to use the bigWigAverageOverBed tool from UCSC.

BradBalderson · 2024-04-04T23:53:20Z

$ ./bigWigAverageOverBed

bigWigAverageOverBed v2 - Compute average score of big wig over each bed, which may have introns.
usage:
   bigWigAverageOverBed in.bw in.bed out.tab
The output columns are:
   name - name field from bed, which should be unique
   size - size of bed (sum of exon sizes
   covered - # bases within exons covered by bigWig
   sum - sum of values over all bases covered
   mean0 - average over bases with non-covered bases counting as zeroes
   mean - average over just covered bases
Options:
   -stats=stats.ra - Output a collection of overall statistics to stat.ra file
   -bedOut=out.bed - Make output bed that is echo of input bed but with mean column appended
   -sampleAroundCenter=N - Take sample at region N bases wide centered around bed item, rather
                     than the usual sample in the bed item.
   -minMax - include two additional columns containing the min and max observed in the area.
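To pull the results back into Python, the output file can be read with the standard library. The sketch below follows the six columns listed in the help text above; it assumes the file is tab-separated with no header row, and `BedAverage` and `read_average_over_bed` are hypothetical names introduced here. Any columns past the sixth (for example those added by `-minMax`) are ignored.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class BedAverage(NamedTuple):
    name: str
    size: int
    covered: int
    total: float  # the "sum" column; renamed to avoid shadowing the builtin
    mean0: float
    mean: float


def read_average_over_bed(lines: Iterable[str]) -> Iterator[BedAverage]:
    """Parse rows of bigWigAverageOverBed output (assumed tab-separated)."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        name, size, covered, total, mean0, mean = row[:6]
        yield BedAverage(
            name, int(size), int(covered),
            float(total), float(mean0), float(mean),
        )
```

With a real output file this would be used as `with open("out.tab") as fh: records = list(read_average_over_bed(fh))`, where `out.tab` is the output path given on the command line.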
