-
Notifications
You must be signed in to change notification settings - Fork 2
update_bam
Convert all the coordinates, IDs and some attributes in a BAM file. Requires output from fasta_diff.
- old_id - the id of the sequence in old_fasta
- old_start - the start position of the sequence in old_fasta
- old_end - the end position of the sequence in old_fasta
- new_id - the id of the sequence in new_fasta
- new_start - the start position of the sequence in new_fasta
- new_end - the end position of the sequence in new_fasta
output from fasta_diff
Scaffold139 0 46566 NW_012161945.1 0 46566
Scaffold111 0 904887 NW_012161915.1 0 904887
Scaffold90 0 211530 NW_012162413.1 0 211530
Tag | Description | state |
---|---|---|
VN | Format version. | keep |
SO | Sorting order of alignments. | keep |
SN | Reference sequence name. | According to information in the output file from fasta_diff, converted the reference sequence name from old_id to new_id. |
LN | Reference sequence length. | According to information in the output file from fasta_diff, converted the Reference sequence length to new one. |
ID | Format version. | keep |
PN | Program name. | keep |
output from fasta_diff.
Scaffold4140 0 3969 KK249218.1 0 3969
Scaffold4139 2368 8532 JHOM01041610.1 0 6164
Scaffold1688 0 390 KK246853.1 0 390
Scaffold1688 2775 4110 KK246853.1 2775 4110
Scaffold1688 4670 5814 KK246853.1 4670 5814
Scaffold1688 8337 8871 KK246853.1 8337 8871
Scaffold1688 10333 11477 KK246853.1 10333 11477
old header
@HD VN:1.0 SO:coordinate
@SQ SN:Scaffold4140 LN:3969
@SQ SN:Scaffold4139 LN:8532
@SQ SN:Scaffold1688 LN:11477
@PG ID:TopHat VN:2.0.9 CL:/usr/local2/tophat-2.0.9.Linux_x86_64/tophat --mate-inner-dist 50 --library-type fr-unstranded --min-anchor-length 8 --splice-mismatches 0 --min-intron-length 70 --max-intron-length 50000 --min-isoform-fraction 0.15 --max-multihits 20 --min-segment-intron 50 --max-segment-intron 500000 --segment-mismatches 2 --segment-length 20 --num-threads 6 --b2-sensitive Aros01112013-genome Aros_ICD_SFI5MH.WT-66-1_2pA_D191TACXX-6-ID06_D191TACXX-6-ID06_1_sequence.txt Aros_ICD_SFI5MH.WT-66-1_2pA_D191TACXX-6-ID06_D191TACXX-6-ID06_2_sequence.txt
new header
@HD VN:1.0 SO:coordinate
@SQ SN:KK249218.1 LN:3969
@SQ SN:JHOM01041610.1 LN:6164
@SQ SN:KK246853.1 LN:11477
@PG ID:TopHat VN:2.0.9 CL:/usr/local2/tophat-2.0.9.Linux_x86_64/tophat --mate-inner-dist 50 --library-type fr-unstranded --min-anchor-length 8 --splice-mismatches 0 --min-intron-length 70 --max-intron-length 50000 --min-isoform-fraction 0.15 --max-multihits 20 --min-segment-intron 50 --max-segment-intron 500000 --segment-mismatches 2 --segment-length 20 --num-threads 6 --b2-sensitive Aros01112013-genome Aros_ICD_SFI5MH.WT-66-1_2pA_D191TACXX-6-ID06_D191TACXX-6-ID06_1_sequence.txt Aros_ICD_SFI5MH.WT-66-1_2pA_D191TACXX-6-ID06_D191TACXX-6-ID06_2_sequence.txt
The new length of the sequence can be calculated by the following formula.
- new length = new_end - new_start (The sequence only has one data in the output file from fasta_diff. Ex. Scaffold4140 and Scaffold4139)
- new length = maximum of new_end - minimum of new_start (The sequence has multiple data in the output file from fasta_diff. Ex. Scaffold1688)
Col | Field | Type | Brief description | state |
---|---|---|---|---|
1 | QNAME | String | Query template NAME | keep |
2 | FLAG | Int | bitwise FLAG | keep |
3 | RNAME | String | Reference sequence NAME | According to information in the output file from fasta_diff, converted the RNAME from old_id to new_id |
4 | POS | Int | 1-based leftmost mapping POSition | According to information in the output file from fasta_diff, converted the POS to the new one. |
5 | MAPQ | Int | MAPping Quality | keep |
6 | CIGAR | String | CIGAR string | keep |
7 | RNEXT | String | Ref. name of the mate/next read | According to information in the output file from fasta_diff, converted the RNEXT from old_id to new_id |
8 | PNEXT | Int | Position of the mate/next read | According to information in the output file from fasta_diff, converted the PNEXT to the new one. |
9 | TLEN | Int | observed Template LENgth | keep |
10 | SEQ | String | segment SEQuence | keep |
11 | QUAL | String | ASCII of Phred-scaled base QUALity+33 | keep |
12 | TAGs | String | optional fields on a SAM/BAM Alignment | keep |
Tags(col12) won't be updated in version 2.0!
The coordinates(POS and PNEXT) are converted by the following algorithm
- BAM_new_start = BAM_old_start - old_start + new_start
- BAM_old_end = BAM_old_end - old_start + new_start
In the following situation, the alignment will be removed.
- mate/next read name(col7) or reference sequence name(col3) not not found in the output file from fasta_diff
- reference start(col4) and reference end coordinates not contained within old_start to old_end
- Position of the mate/next read(col8) outside the range of match_old_start to match_old_end
If a read not in the situation mentioned above, this read will be written into the bam updated file.(No matter its mate read will be removed or not.)
CASE1: New sequence is 100% match with old sequence Information in the output file from fasta_diff
old_id | old_start | old_end | new_id | new_start | new_end |
---|---|---|---|---|---|
Scaffold1 | 0 | 2968087 | NW_012161901.1 | 0 | 2968087 |
original BAM file
HWI-ST580:259:D191TACXX:6:2212:12148:97597 147 Scaffold1 47512 3 101M = 47314 -299 TTTCTCCCAATGTATCCGGTTATTTGACCTCCGAGGGAGTGACCTATGACGTGGAGGGTCGTTGAATCCAGGGAAATTTCTACCATTCGGTTTAAAAGACC #####A>33DCA>999DDCC@;55A95?B=ADFFFHHHGHHDHIIHGF@GFGGGCIGHGFJJIJFIIIHHIJJJJIHIHEFAJJJJJIHHHHHFFFFFBCB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:2 CC:Z:= CP:i:47512 HI:i:0
updated bam file
HWI-ST580:259:D191TACXX:6:2212:12148:97597 147 NW_012161901.1 47512 3 101M = 47314 -299 TTTCTCCCAATGTATCCGGTTATTTGACCTCCGAGGGAGTGACCTATGACGTGGAGGGTCGTTGAATCCAGGGAAATTTCTACCATTCGGTTTAAAAGACC #####A>33DCA>999DDCC@;55A95?B=ADFFFHHHGHHDHIIHGF@GFGGGCIGHGFJJIJFIIIHHIJJJJIHIHEFAJJJJJIHHHHHFFFFFBCB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:2 CC:A:= CP:i:47512 HI:i:0
CASE2:
new sequence is a substring of the old sequence with 100% match Information in the output file from fasta_diff
old_id | old_start | old_end | new_id | new_start | new_end |
---|---|---|---|---|---|
Scaffold4139 | 2368 | 8532 | JHOM01041610.1 | 0 | 6164 |
original bam file
CCRI0219:155:C2LNBACXX:1:2210:11347:56234 137 Scaffold4139 3375 50 49M * 0 0 GGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAG IJ2CFHGIIIJJDIIIIIIDHIIEIJIGIIJGGIHHE?;BEECCBCD;@ MD:Z:1A5A41 RG:Z:RG1 XG:i:0 NH:i:1 NM:i:2 XM:i:2 XN:i:0 XO:i:0 AS:i:-12 YT:Z:UU
updated bam file
CCRI0219:155:C2LNBACXX:1:2210:11347:56234 137 JHOM01041610.1 1007 50 49M * 0 0 GGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAG IJ2CFHGIIIJJDIIIIIIDHIIEIJIGIIJGGIHHE?;BEECCBCD;@ MD:Z:1A5A41 RG:Z:RG1 XG:i:0 NH:i:1 NM:i:2 XM:i:2 XN:i:0 XO:i:0 AS:i:-12 YT:Z:UU
CASE3:
part of the original sequence was converted into Ns Information in the output file from fasta_diff
old_id | old_start | old_end | new_id | new_start | new_end |
---|---|---|---|---|---|
Scaffold1688 | 0 | 390 | KK246853.1 | 0 | 390 |
Scaffold1688 | 2775 | 4110 | KK246853.1 | 2775 | 4110 |
Scaffold1688 | 4670 | 5814 | KK246853.1 | 4670 | 5814 |
Scaffold1688 | 8337 | 8871 | KK246853.1 | 8337 | 8871 |
Scaffold1688 | 10333 | 11477 | KK246853.1 | 10333 | 11477 |
original bam file
CCRI0219:155:C2LNBACXX:2:1302:21100:79098 163 Scaffold1688 4 50 45M = 88 151 CTGCAGCAAATCGTACCGATATCCGCATCAGGTCTCCAAGGTGAA EFGF+<CGIJJGE<ABFBEGEHFF==D<FCFFFCFFGCGI;.@;A MD:Z:45 RG:Z:RG1 XG:i:0 NH:i:1 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YT:Z:UU
updated bam file
CCRI0219:155:C2LNBACXX:2:1302:21100:79098 163 KK246853.1 4 50 45M = 88 151 CTGCAGCAAATCGTACCGATATCCGCATCAGGTCTCCAAGGTGAA EFGF+<CGIJJGE<ABFBEGEHFF==D<FCFFFCFFGCGI;.@;A MD:Z:45 RG:Z:RG1 XG:i:0 NH:i:1 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YT:Z:UU
- Python 2.7
- Python pysam module (0.9.1.4) installation
- samtools (1.3.1)
- htslib (1.3.1)
- match.tsv: the output from fasta_diff wiki
- Bam file: the Bam file you want to update
Before processing the BAM file, you should build index first! build index
samtools index example.bam
update_bam -a match.tsv example_file/example.bam
INFO Reading alignment data from: match.tsv...
INFO Alignments: 522
INFO Processing bam file: Aros_ID06.bam...
INFO Update tag SN and LN....
INFO Program record identifier TopHat
INFO Standard meaning tag: CC:Z, CP:i will be updated...
INFO Updated Alignments: 13483587
INFO Removed Alignments: 0
- [user file name]_updated.bam
- [user file name]_removed.bam
update_bam -h
usage: update_bam [-h] [-a ALIGNMENT_FILE] [-u UPDATED_POSTFIX]
[-r REMOVED_POSTFIX] [-v]
bam_FILE [bam_FILE ...]
Update the reference sequence name of the alignment and coordinates of a bam file using an alignment file generated by the fasta_diff program.
Updated features are written to a new file with '_updated'(default) appended to the original bam file name.
Feature that can not be updated, due to the id being removed completely or the feature contains regions that
are removed or replaced with Ns, are written to a new file with '_removed'(default) appended to the original bam file name.
Example:
fasta_diff example_file/old.fa example_file/new.fa | update_bam example_file/example.bam
positional arguments:
bam_FILE List one or more bam files to be updated
optional arguments:
-h, --help show this help message and exit
-a ALIGNMENT_FILE, --alignment_file ALIGNMENT_FILE
The alignment file generated by fasta_diff, a TSV file
with 6 columns: old_id, old_start, old_end, new_id,
new_start, new_end (default: STDIN)
-u UPDATED_POSTFIX, --updated_postfix UPDATED_POSTFIX
The filename postfix for updated features (default:
"_updated")
-r REMOVED_POSTFIX, --removed_postfix REMOVED_POSTFIX
The filename postfix for removed features (default:
"_removed")
-v, --version show program's version number and exit