update_bed
dytk2134 edited this page Sep 12, 2018
Update the sequence IDs and coordinates of a BED file using an alignment file generated by the fasta_diff program.
Coordinates are converted as follows:
- bed_new_start = bed_start - match_old_start + match_new_start
- bed_new_end = bed_end - match_old_start + match_new_start
A line in the BED file is removed in either of the following situations:
- the interval from bed_start to bed_end is not contained within match_old_start to match_old_end
- the sequence name is not found in the match.tsv file (the output file from fasta_diff)
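The conversion and the containment check above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function name `convert_interval` is hypothetical, and treating the match boundaries as inclusive is an assumption.

```python
from typing import Optional, Tuple


def convert_interval(
    bed_start: int,
    bed_end: int,
    match_old_start: int,
    match_old_end: int,
    match_new_start: int,
) -> Optional[Tuple[int, int]]:
    """Return (bed_new_start, bed_new_end), or None if the line would be removed."""
    # Assumption: an interval touching the match boundaries still counts as contained.
    if not (match_old_start <= bed_start and bed_end <= match_old_end):
        return None
    # Both coordinates are shifted by the same offset.
    offset = match_new_start - match_old_start
    return bed_start + offset, bed_end + offset


# Values taken from CASE2 on this page.
print(convert_interval(106343, 110442, 2215, 777787, 0))  # (104128, 108227)
print(convert_interval(194, 2394, 2215, 777787, 0))       # None
```

The second call returns `None` because the interval starts before `match_old_start`, which corresponds to the line being written to the removed file.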
The contents of the BED file are handled as follows:
- header lines (remain the same)
- first three required BED fields:
  - chrom (updated)
  - chromStart (updated)
  - chromEnd (updated)
- 9 additional optional BED fields:
  - name (remains the same)
  - score (remains the same)
  - strand (remains the same)
  - thickStart (updated)
  - thickEnd (updated)
  - itemRgb (remains the same)
  - blockCount (remains the same)
  - blockSizes (remains the same)
  - blockStarts (remains the same)
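As a rough sketch of the field handling listed above, the following Python function rewrites one BED line. It is an illustration under stated assumptions rather than the tool's implementation: `update_bed_line` and the `Matches` structure are hypothetical names, header lines are assumed to start with `#`, `track`, or `browser`, fields are assumed to be whitespace-separated, and thickStart/thickEnd are assumed to shift by the same offset as chromStart/chromEnd.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory form of match.tsv:
# old_id -> list of (old_start, old_end, new_id, new_start)
Matches = Dict[str, List[Tuple[int, int, str, int]]]


def update_bed_line(line: str, matches: Matches) -> Optional[str]:
    """Return the updated line, or None if the line would be removed."""
    stripped = line.rstrip("\n")
    # Header lines are passed through unchanged (assumed prefixes).
    if not stripped or stripped.startswith(("#", "track", "browser")):
        return stripped
    fields = stripped.split()
    start, end = int(fields[1]), int(fields[2])
    for old_start, old_end, new_id, new_start in matches.get(fields[0], []):
        if old_start <= start and end <= old_end:
            offset = new_start - old_start
            fields[0] = new_id                      # chrom
            fields[1] = str(start + offset)         # chromStart
            fields[2] = str(end + offset)           # chromEnd
            # thickStart/thickEnd (columns 7 and 8) exist only in longer lines.
            if len(fields) >= 8:
                fields[6] = str(int(fields[6]) + offset)
                fields[7] = str(int(fields[7]) + offset)
            return "\t".join(fields)
    # Unknown sequence name, or interval not contained in any match.
    return None


matches = {"Scaffold500": [(2215, 777787, "KK961993.1", 0)]}
line = "Scaffold500 106343 110442 JUNC00072459 61 - 106343 110442 255,0,0 2 99,92 0,4007"
print(update_bed_line(line, matches))
```

blockSizes and blockStarts need no adjustment here because blockStarts are expressed relative to chromStart.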
update_bed -a match.tsv example_file/example.bed
INFO Reading alignment data from: match.tsv...
INFO Alignments: 61768
INFO Processing Bed file: female_Nvit_RNAseq_alignments_junctions.bed...
INFO Updated lines: 103726
INFO Removed lines: 8
CASE1: 100% match between sequences
- Information in match.tsv
old_id | old_start | old_end | new_id | new_start | new_end |
---|---|---|---|---|---|
Scaffold1 | 0 | 10378279 | KK961494.1 | 0 | 10378279 |
- original Bed file
Scaffold1 124699 125610 JUNC00000001 1 - 124699 125610 255,0,0 2 38,63 0,848
Scaffold1 125687 127004 JUNC00000002 1 - 125687 127004 255,0,0 2 42,59 0,1258
- updated Bed file
KK961494.1 124699 125610 JUNC00000001 1 - 124699 125610 255,0,0 2 38,63 0,848
KK961494.1 125687 127004 JUNC00000002 1 - 125687 127004 255,0,0 2 42,59 0,1258
CASE2: New sequence is a substring of the old sequence with 100% match
- Information in match.tsv
old_id | old_start | old_end | new_id | new_start | new_end |
---|---|---|---|---|---|
Scaffold500 | 2215 | 777787 | KK961993.1 | 0 | 775572 |
- original Bed file
Scaffold500 194 2394 JUNC00072458 1 + 194 2394 255,0,0 2 79,22 0,2178
Scaffold500 106343 110442 JUNC00072459 61 - 106343 110442 255,0,0 2 99,92 0,4007
- updated Bed file
KK961993.1 104128 108227 JUNC00072459 61 - 104128 108227 255,0,0 2 99,92 0,4007
- removed Bed file
Scaffold500 194 2394 JUNC00072458 1 + 194 2394 255,0,0 2 79,22 0,2178
CASE3: Part of the old sequence was converted into Ns
- Information in match.tsv
old_id | old_start | old_end | new_id | new_start | new_end |
---|---|---|---|---|---|
Scaffold423 | 43403 | 44185 | KK961916.1 | 43403 | 44185 |
Scaffold423 | 45136 | 48693 | KK961916.1 | 45136 | 48693 |
- original Bed file
Scaffold423 42315 43335 JUNC00064280 4 - 42315 43335 255,0,0 2 69,81 0,939
Scaffold423 45134 45845 JUNC00064281 7 - 45134 45845 255,0,0 2 87,94 0,617
Scaffold423 45799 46062 JUNC00064282 6 - 45799 46062 255,0,0 2 85,94 0,169
- updated Bed file
KK961916.1 42315 43335 JUNC00064280 4 - 42315 43335 255,0,0 2 69,81 0,939
KK961916.1 45799 46062 JUNC00064282 6 - 45799 46062 255,0,0 2 85,94 0,169
- removed Bed file
Scaffold423 45134 45845 JUNC00064281 7 - 45134 45845 255,0,0 2 87,94 0,617
update_bed -h
usage: update_bed [-h] [-a ALIGNMENT_FILE] [-u UPDATED_POSTFIX]
[-r REMOVED_POSTFIX] [-v]
Bed_FILE [Bed_FILE ...]
Update the sequence id and coordinates of a Bed file using an alignment file generated by the fasta_diff program.
Updated Line are written to a new file with '_updated'(default) appended to the original Bed file name.
Line that can not be updated, due to the id being removed completely or the line contains regions that
are removed or replaced with Ns, are written to a new file with '_removed'(default) appended to the original Bed file name.
Example:
fasta_diff example_file/old.fa example_file/new.fa | update_bed example_file/example.bed
positional arguments:
Bed_FILE List one or more Bed files to be updated
optional arguments:
-h, --help show this help message and exit
-a ALIGNMENT_FILE, --alignment_file ALIGNMENT_FILE
The alignment file generated by fasta_diff, a TSV file
with 6 columns: old_id, old_start, old_end, new_id,
new_start, new_end (default: STDIN)
-u UPDATED_POSTFIX, --updated_postfix UPDATED_POSTFIX
The filename postfix for updated features (default:
"_updated")
-r REMOVED_POSTFIX, --removed_postfix REMOVED_POSTFIX
The filename postfix for removed features (default:
"_removed")
-v, --version show program's version number and exit
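The six-column alignment file described under `-a` could be loaded into memory along these lines. This is a sketch only: `read_alignments` is a hypothetical helper, and skipping rows whose coordinate column is not an integer (for example a header row, if one is present) is an assumption about the file, not documented behaviour.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO, Tuple

# (old_start, old_end, new_id, new_start, new_end)
Match = Tuple[int, int, str, int, int]


def read_alignments(handle: TextIO) -> Dict[str, List[Match]]:
    """Group the rows of a fasta_diff-style TSV by old sequence id."""
    alignments = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        # Assumption: skip anything that is not a 6-column data row.
        if len(row) != 6 or not row[1].isdigit():
            continue
        old_id, old_start, old_end, new_id, new_start, new_end = row
        alignments[old_id].append(
            (int(old_start), int(old_end), new_id, int(new_start), int(new_end))
        )
    return dict(alignments)


# Rows taken from CASE3 on this page.
example = io.StringIO(
    "Scaffold423\t43403\t44185\tKK961916.1\t43403\t44185\n"
    "Scaffold423\t45136\t48693\tKK961916.1\t45136\t48693\n"
)
alignments = read_alignments(example)
print(len(alignments["Scaffold423"]))  # 2
```

Grouping by `old_id` keeps the lookup for each BED line limited to the matches on its own sequence.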