-
Notifications
You must be signed in to change notification settings - Fork 2
fasta_diff
Compares two very similar FASTA files and outputs coordinate mappings using a multi stage algorithm.
- Stage 1: Find 100% matches
- Stage 2: Find 100% substrings, where the full length of a new sequence can be found as a substring of a old sequence
- Stage 3: Find cases where part of the sequence was converted into Ns
- split new seq by Ns into multiple substrings
- find an old seq with all substrings and also in order
- if found, write an alignment line for every substring
- check if the alignment line of every substring can be joining together
- check if the next base of the old sequence is same as the next base of the new sequence. If so extend both the end coordinate of old and new seuqnece in the 'alignment line'.
- Stage 4: Find cases where a old sequence is split into two or more new sequences
- check if there is a old sequence mapped to two or more new sequences
- check if there is no overlap between the coordinate mappings of new sequences
- write alignment lines for every coordinate mappings
There is a overlap between the coordinate mappings of new1 and new2. Therefore only the old and new3 mapping will be remained.
Coordinate mappings (0-based):
Old 0 9 New 0 9
Old 11 19 New 11 19
- Outputs the 6 columns as tab-separated values: old_id, old_start, old_end, new_id, new_start, new_end
- There are two models in this example: blue (0-based: 3 - 9; 1-based: 4 - 9) and red (0-based: 10 - 14; 1-based: 11 - 14)
- After using the conversion program, the model with coordinates that are not contained within match_old_start to match_old_end will be removed. Therefore, the red one will be removed and the blue one will be remained.
fasta_diff example_file/old.fa example_file/new.fa -o match.tsv -r report.txt
Note that fasta_diff can be piped with another coordinates_conversion program so that the intermediate match.tsv file isn’t saved to disk
fasta_diff /app/data/other_species/diacit/diaci1.1/diaci1.1.fa /app/data/other_species/diacit/NCBI_RefSeq_diaci1.1/scaffold/121845_ref_Diaci_psyllid_genome_assembly_version_1.1_chrUn.fa > match.tsv
INFO Reading alignment data from: <stdin>...
INFO Reading old FASTA file (/app/data/other_species/diacit/diaci1.1/diaci1.1.fa)...
INFO Unique sequences: 163023
INFO Reading new FASTA file (/app/data/other_species/diacit/NCBI_RefSeq_diaci1.1/scaffold/121845_ref_Diaci_psyllid_genome_assembly_version_1.1_chrUn.fa)...
INFO Unique sequences: 161988
INFO Stage 1 - match_identical_sequence:
INFO Matched sequences: 160938 (New : 160938)
INFO Unmatched sequences in old FASTA: 2085
INFO Unmatched sequences in new FASTA: 1050
INFO Stage 2 - match_truncated_sequence:
INFO Matched sequences: 161233 (New : 295)
INFO Unmatched sequences in old FASTA: 1790
INFO Unmatched sequences in new FASTA: 755
INFO Stage 3 - match_split_subsequence:
INFO Matched sequences: 161988 (New : 755)
INFO Unmatched sequences in old FASTA: 1035
INFO Unmatched sequences in new FASTA: 0
INFO Stage 4 - one_to_multiple_match (New : 0)
INFO Matched sequences: 161988 (New : 0)
INFO Unmatched sequences in old FASTA: 1035
INFO Unmatched sequences in new FASTA: 0
old_id | old_start | old_end | new_id | new_start | new_end |
---|---|---|---|---|---|
Scaffold1 | 0 | 3368518 | KK245166.1 | 0 | 3368518 |
Scaffold4139 | 2368 | 8532 | JHOM01041610.1 | 0 | 6164 |
Scaffold1688 | 0 | 390 | KK246853.1 | 0 | 390 |
Scaffold1688 | 2775 | 4110 | KK246853.1 | 2775 | 4110 |
Scaffold1688 | 4670 | 5814 | KK246853.1 | 4670 | 5814 |
If -r [report]
or --report [report]
is given, fasta_diff.py will generate a report for the unmatched sequences in new FASTA file.
#Sequence_ID Sequence_length Stage
#Stage 1 - match_identical_sequence:
JJRW01S0013007.1 93719 Stage 1
JJRW01S0001013.1 399926 Stage 1
JJRW01S0000843.1 319524 Stage 1
#Stage 2 - match_truncated_sequence:
JJRW01S0013007.1 93719 Stage 2
JJRW01S0001013.1 399926 Stage 2
#Stage 3 - match_split_subsequence:
JJRW01S0001013.1 399926 Stage 3
#Stage 4 - one_to_multiple_match:
JJRW01S0001013.1 399926 Stage 4
fasta_diff -h
usage: fasta_diff [-h] [-o [OUT]] [-r REPORT] [-d] [-hc] [-v]
old_fasta new_fasta
Compares two very similar FASTA files and outputs coordinate mappings using a multi stage algorithm:
Stage 1: Find 100% matches
Stage 2: Find 100% substrings, where the full length of a new sequence can be found as a substring of a old sequence
Stage 3: Find cases where part of the sequence was converted into Ns
Outputs the 6 columns as tab-separated values: old_id, old_start, old_end, new_id, new_start, new_end
Originally created to compare the NCBI version to the original reference of Diaphorina citri
Example:
fasta_diff example_file/old.fa example_file/new.fa -o match.tsv -r report.txt
positional arguments:
old_fasta The original FASTA file
new_fasta The new FASTA file
optional arguments:
-h, --help show this help message and exit
-o [OUT], --out [OUT]
The output alignment file (default: STDOUT)
-r REPORT, --report REPORT
Generate a report for the unmatched sequences in new
FASTA file.
-d, --debug If set, partial results are saved in a
*_stage_i_pickle file after each stage, unmatched
sequences after each stage are saved in a
*_stage_i_unmatched file for both FASTA files.
-hc, --header_check If set, confirm the detected mapping by checking the
header of the new sequence for the id of the mapped
old sequence.
-v, --version show program's version number and exit