A python extension, written in C, for quick access to 2bit files. The extension uses lib2bit for file access.
You can install the extension directly from github with:
pip install git+https://github.com/dpryan79/py2bit
Basic usage is as follows:
>>> import py2bit
This will work if your working directory is the py2bit source code directory.
>>> tb = py2bit.open("test/foo.2bit")
Note that if you would like to include information about soft-masked bases, you need to manually specify that:
>>> tb = py2bit.open("test/foo.2bit", True)
TwoBit
objects contain a dictionary holding the chromosome/contig lengths, which can be accessed with the chroms()
method.
>>> tb.chroms()
{'chr1': 150L, 'chr2': 100L}
You can directly access a particular chromosome by specifying its name.
>>> tb.chroms('chr1')
150L
The lengths are stored as a "long" integer type, which is why there's an L
suffix. If you specify a nonexistent chromosome then nothing is output.
>>> tb.chroms("foo")
>>>
The following information about and contained within a 2bit file can be accessed with the info()
method:
- file size, in bytes (
file size
) - number of chromosomes/contigs (
nChroms
) - total sequence length, in bases (
sequence length
) - total number of hard-masked (N) bases (
hard-masked length
) - total number of soft-masked (lower case) bases(
soft-masked length
).
Note that soft-masked length
will only be present if open("file.2bit", True)
is used, since handling soft-masking increases memory requirements and decreases perfomance.
>>> tb.info()
{'file size': 161, 'nChroms': 2, 'sequence length': 250, 'hard-masked length': 150, 'soft-masked length': 8}
The sequence of a full or partial chromosome/contig can be fetched with the sequence()
method.
>>> tb.sequence("chr1")
'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'
By default, the whole chromosome/contig is returned. A specific range can also be requested.
>>> tb.sequence("chr1", 24, 74)
NNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATC
The first number is the (0-based) position on the chromosome/contig where the sequence should begin. The second number is the (1-based) position on the chromosome where the sequence should end.
If it was requested during file opening that soft-masking information be stored, then lower case bases may be present. If a nonexistent chromosome/contig is specified then a runtime error occurs.
It's often required to compute the percentage of 1 or more bases in a chromosome. This can be done with the bases()
method.
>>> tb.bases("chr1")
{'A': 0.08, 'C': 0.08, 'T': 0.08666666666666667, 'G': 0.08666666666666667}
This returns a dictionary with bases as keys and the fraction of the sequence composed of them as values. Note that this will not sum to 1 if there are any hard-masked bases (the chromosome is 2/3 N
in this case). One can also request this information over a particular region.
>>> tb.bases("chr1", 24, 74)
{'A': 0.12, 'C': 0.12, 'T': 0.12, 'G': 0.12}
The start and end position are as with the sequence()
method described above.
If integer counts are preferred, then they can instead be returned.
>>> tb.bases("chr1", 24, 74, False)
{'A': 6, 'C': 6, 'T': 6, 'G': 6}
There are two kinds of masking blocks that can be present in 2bit files: hard-masked and soft-masked. Hard-masked blocks are stretches of NNNN, as are commonly found near telomeres and centromeres. Soft-masked blocks are runs of lowercase A/C/T/G, typically indicating repeat elements or low-complexity stretches. In can sometimes be useful to query this information from 2bit files:
>>> tb.hardMaskedBlocks("chr1")
[(0, 50), (100, 150)]
In this (small) example, there are two stretches of hard-masked sequence, from 0 to 50 and again from 100 to 150 (see the note below about coordinates). If you would instead like to query all blocks overlapping with a specific region, you can specify the region bounds:
>>> tb.hardMaskedBlocks("chr1", 75, 101)
[(100, 150)]
If there are no overlapping regions, then an empty list is returned:
>>> tb.hardMaskedBlocks("chr1", 75, 100)
[]
Instead of hardMaskedBlocks()
, one can use softMaskedBlocks()
in an identical manner:
>>> tb = py2bit.open("foo.2bit", storeMasked=True)
>>> tb.softMaskedBlocks("chr1")
[(62, 70)]
As shown, you must specify storeMasked=True
or you will receive a run time error.
A TwoBit
object can be closed with the close()
method.
>>> tb.close()
0-based half-open coordinates are used by this python module. So to access the value for the first base on chr1
, one would specify the starting position as 0
and the end position as 1
. Similarly, bases 100 to 115 would have a start of 99
and an end of 115
. This is simply for the sake of consistency with most other bioinformatics packages.