update_bedgraph
dytk2134 edited this page Sep 12, 2018 · 1 revision
Update the sequence ID and coordinates of a BedGraph file using an alignment file generated by the fasta_diff program.
The coordinates are converted using the following formulas:
- bedGraph_new_start = bedGraph_old_start - match_old_start + match_new_start
- bedGraph_new_end = bedGraph_old_end - match_old_start + match_new_start
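The conversion above can be sketched in a few lines of Python. The function name `convert_interval` is hypothetical and is not taken from the tool's source; the sketch assumes the interval already lies inside the matched region.

```python
def convert_interval(bg_old_start, bg_old_end, match_old_start, match_new_start):
    """Shift a bedGraph interval from old to new coordinates.

    Hypothetical helper for illustration only; it assumes the interval
    lies entirely inside the matched region.
    """
    # Both ends move by the same offset between the old and new match starts.
    offset = match_new_start - match_old_start
    return bg_old_start + offset, bg_old_end + offset


# Values taken from CASE2 on this page: the match starts at 2368 in the
# old sequence and at 0 in the new one.
new_start, new_end = convert_interval(3299, 3324, 2368, 0)
print(new_start, new_end)  # 931 956
```

Because both ends are shifted by the same offset, the interval length is preserved.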
A line in the bedGraph file is removed in either of the following situations:
- the interval from bedGraph_old_start to bedGraph_old_end is not fully contained within a single match_old_start to match_old_end range
- the sequence name is not found in the match.tsv file (output from the fasta_diff program)
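The update and removal rules can be combined into one decision, sketched below. The helper names `build_index` and `update_or_remove` are hypothetical, and the sketch assumes that containment is inclusive at both ends of a match range; the actual program may be organized differently.

```python
def build_index(match_rows):
    """Group match rows by old_id.

    Each row is (old_id, old_start, old_end, new_id, new_start, new_end).
    """
    index = {}
    for old_id, old_start, old_end, new_id, new_start, _new_end in match_rows:
        index.setdefault(old_id, []).append((old_start, old_end, new_id, new_start))
    return index


def update_or_remove(record, index):
    """Return the updated record, or None when the line should be removed."""
    chrom, start, end, value = record
    for old_start, old_end, new_id, new_start in index.get(chrom, []):
        # Assumption: the interval must fit inside one match range.
        if old_start <= start and end <= old_end:
            offset = new_start - old_start
            return (new_id, start + offset, end + offset, value)
    # Unknown sequence name, or interval not inside any match range.
    return None


index = build_index([("Scaffold4139", 2368, 8532, "JHOM01041610.1", 0, 6164)])
print(update_or_remove(("Scaffold4139", 3299, 3324, "1"), index))
print(update_or_remove(("Scaffold4139", 340, 3299, "0"), index))
print(update_or_remove(("Scaffold5211", 2790, 2865, "1"), index))
```

The first call returns an updated record; the other two return `None`, one because the interval starts before the match and one because the sequence name has no match at all.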
chromA chromStartA chromEndA dataValueA
chromB chromStartB chromEndB dataValueB
The chrom, chromStart, and chromEnd columns are updated; dataValue is left unchanged.
update_bedgraph -a match.tsv example_file/example.bedGraph
INFO Reading alignment data from: match.tsv...
INFO Alignments: 7081
INFO Processing BedGraph file: ctrlF-BRN_S18.sorted.BedGraph...
INFO Updated lines: 7072732
INFO Removed lines: 11
CASE1: 100% match
- Information in match.tsv
old_id | old_start | old_end | new_id | new_start | new_end |
---|---|---|---|---|---|
Scaffold1 | 0 | 3368518 | KK245166.1 | 0 | 3368518 |
- original Bedgraph file
Scaffold1 28 35 1
Scaffold1 35 36 2
Scaffold1 36 37 3
- updated Bedgraph file
KK245166.1 28 35 1
KK245166.1 35 36 2
KK245166.1 36 37 3
CASE2: New sequence is a substring of the old sequence with 100% match
- Information in match.tsv
old_id | old_start | old_end | new_id | new_start | new_end |
---|---|---|---|---|---|
Scaffold4139 | 2368 | 8532 | JHOM01041610.1 | 0 | 6164 |
- original Bedgraph file
Scaffold4139 265 337 3
Scaffold4139 337 340 1
Scaffold4139 340 3299 0
Scaffold4139 3299 3324 1
Scaffold4139 3324 3325 4
Scaffold4139 3325 3326 5
- updated Bedgraph file
JHOM01041610.1 931 956 1
JHOM01041610.1 956 957 4
JHOM01041610.1 957 958 5
- removed Bedgraph file
Scaffold4139 265 337 3
Scaffold4139 337 340 1
Scaffold4139 340 3299 0
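As a worked check of CASE2, the following sketch replays the six original lines against the single match row and splits them into updated and removed lists. It is an illustration only and assumes that a line is kept when its interval lies inside the match range, inclusive at both ends.

```python
# Match row and bedGraph lines copied from CASE2.
old_id, old_start, old_end = "Scaffold4139", 2368, 8532
new_id, new_start = "JHOM01041610.1", 0

lines = [
    ("Scaffold4139", 265, 337, 3),
    ("Scaffold4139", 337, 340, 1),
    ("Scaffold4139", 340, 3299, 0),
    ("Scaffold4139", 3299, 3324, 1),
    ("Scaffold4139", 3324, 3325, 4),
    ("Scaffold4139", 3325, 3326, 5),
]

shift = new_start - old_start  # -2368
updated, removed = [], []
for chrom, start, end, value in lines:
    if chrom == old_id and old_start <= start and end <= old_end:
        updated.append((new_id, start + shift, end + shift, value))
    else:
        removed.append((chrom, start, end, value))

for row in updated:
    print("updated", *row)
for row in removed:
    print("removed", *row)
```

The three lines that start before position 2368 fall outside the match and are removed; the remaining three are shifted by -2368.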
CASE3: part of the old sequence was converted into Ns
- Information in match.tsv
old_id | old_start | old_end | new_id | new_start | new_end |
---|---|---|---|---|---|
Scaffold1688 | 0 | 390 | KK246853.1 | 0 | 390 |
Scaffold1688 | 2775 | 4110 | KK246853.1 | 2775 | 4110 |
Scaffold1688 | 4670 | 5814 | KK246853.1 | 4670 | 5814 |
Scaffold1688 | 8337 | 8871 | KK246853.1 | 8337 | 8871 |
Scaffold1688 | 10333 | 11477 | KK246853.1 | 10333 | 11477 |
- original Bedgraph file
Scaffold1688 5735 5738 1
Scaffold1688 5738 5784 0
Scaffold1688 5784 5807 1
Scaffold1688 5807 6909 0
Scaffold1688 6909 6910 1
Scaffold1688 6910 6911 3
- updated Bedgraph file
KK246853.1 5735 5738 1
KK246853.1 5738 5784 0
KK246853.1 5784 5807 1
- removed Bedgraph file
Scaffold1688 5807 6909 0
Scaffold1688 6909 6910 1
Scaffold1688 6910 6911 3
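CASE3 has several match ranges for the same sequence, so each line has to fit inside one of them. The sketch below (the helper `containing_block` is hypothetical, and containment is assumed to be inclusive at both ends) shows why the line 5807 to 6909 is removed even though its start lies inside a match.

```python
# Old-coordinate match ranges for Scaffold1688, copied from CASE3.
blocks = [(0, 390), (2775, 4110), (4670, 5814), (8337, 8871), (10333, 11477)]


def containing_block(start, end):
    """Return the match range that fully contains [start, end], or None."""
    for block_start, block_end in blocks:
        if block_start <= start and end <= block_end:
            return (block_start, block_end)
    return None


for start, end in [(5735, 5738), (5784, 5807), (5807, 6909), (6909, 6910)]:
    print(start, end, containing_block(start, end))
```

The interval 5807 to 6909 starts inside the range 4670 to 5814 but ends past it, in the region replaced with Ns, so no single range contains it.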
CASE4: Sequence ID not found in match.tsv
- original Bedgraph file
Scaffold5211 2790 2865 1
Scaffold5211 2926 2962 1
Scaffold5211 2963 3001 1
- removed Bedgraph file
Scaffold5211 2790 2865 1
Scaffold5211 2926 2962 1
Scaffold5211 2963 3001 1
update_bedgraph -h
usage: update_bedgraph [-h] [-a ALIGNMENT_FILE] [-u UPDATED_POSTFIX]
[-r REMOVED_POSTFIX] [-v]
BedGraph_FILE [BedGraph_FILE ...]
Update the sequence id and coordinates of a BedGraph file using an alignment file generated by the fasta_diff program.
Updated Line are written to a new file with '_updated'(default) appended to the original BedGraph file name.
Line that can not be updated, due to the id being removed completely or the line contains regions that
are removed or replaced with Ns, are written to a new file with '_removed'(default) appended to the original BedGraph file name.
Example:
fasta_diff example_file/old.fa example_file/new.fa | update_bedgraph example_file/example.bedGraph
positional arguments:
BedGraph_FILE List one or more BedGraph files to be updated
optional arguments:
-h, --help show this help message and exit
-a ALIGNMENT_FILE, --alignment_file ALIGNMENT_FILE
The alignment file generated by fasta_diff, a TSV file
with 6 columns: old_id, old_start, old_end, new_id,
new_start, new_end (default: STDIN)
-u UPDATED_POSTFIX, --updated_postfix UPDATED_POSTFIX
The filename postfix for updated features (default:
"_updated")
-r REMOVED_POSTFIX, --removed_postfix REMOVED_POSTFIX
The filename postfix for removed features (default:
"_removed")
-v, --version show program's version number and exit
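The six-column alignment format described in the help text can be read with Python's standard `csv` module, as sketched below. The function `read_alignment` is hypothetical, and the sketch assumes the TSV has no header row. Passing `sys.stdin` as the handle would correspond to the default behaviour, where the alignment is piped in from fasta_diff.

```python
import csv
import io


def read_alignment(handle):
    """Parse rows of old_id, old_start, old_end, new_id, new_start, new_end.

    Hypothetical reader for illustration; assumes tab-separated input
    without a header row.
    """
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        old_id, old_start, old_end, new_id, new_start, new_end = fields
        rows.append(
            (old_id, int(old_start), int(old_end), new_id, int(new_start), int(new_end))
        )
    return rows


# An in-memory stand-in for match.tsv, using the row from CASE1.
sample = io.StringIO("Scaffold1\t0\t3368518\tKK245166.1\t0\t3368518\n")
print(read_alignment(sample))
```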