fasta_diff

Introduction

Compares two very similar FASTA files and outputs coordinate mappings using a multi stage algorithm.

Algorithm:

Stage 1: Find 100% matches

Stage 2: Find 100% substrings, where the full length of a new sequence can be found as a substring of a old sequence

Stage 3: Find cases where part of the sequence was converted into Ns
1. split new seq by Ns into multiple substrings
2. find an old seq with all substrings and also in order
3. if found, write an alignment line for every substring
4. check if the alignment line of every substring can be joining together
5. check if the next base of the old sequence is same as the next base of the new sequence. If so extend both the end coordinate of old and new seuqnece in the 'alignment line'.

Stage 4: Find cases where a old sequence is split into two or more new sequences
1. check if there is a old sequence mapped to two or more new sequences
2. check if there is no overlap between the coordinate mappings of new sequences
3. write alignment lines for every coordinate mappings

There is a overlap between the coordinate mappings of new1 and new2. Therefore only the old and new3 mapping will be remained.

How it works?

Coordinate mappings (0-based):

Old     0       9       New     0       9
Old     11      19      New     11      19

Outputs the 6 columns as tab-separated values: old_id, old_start, old_end, new_id, new_start, new_end
There are two models in this example: blue (0-based: 3 - 9; 1-based: 4 - 9) and red (0-based: 10 - 14; 1-based: 11 - 14)
After using the conversion program, the model with coordinates that are not contained within match_old_start to match_old_end will be removed. Therefore, the red one will be removed and the blue one will be remained.

Usage:

fasta_diff example_file/old.fa example_file/new.fa -o match.tsv -r report.txt

Note that fasta_diff can be piped with another coordinates_conversion program so that the intermediate match.tsv file isn’t saved to disk

Example progress output:

command

fasta_diff /app/data/other_species/diacit/diaci1.1/diaci1.1.fa /app/data/other_species/diacit/NCBI_RefSeq_diaci1.1/scaffold/121845_ref_Diaci_psyllid_genome_assembly_version_1.1_chrUn.fa > match.tsv

output info

 INFO     Reading alignment data from: <stdin>...
 INFO     Reading old FASTA file (/app/data/other_species/diacit/diaci1.1/diaci1.1.fa)...
 INFO       Unique sequences: 163023
 INFO     Reading new FASTA file  (/app/data/other_species/diacit/NCBI_RefSeq_diaci1.1/scaffold/121845_ref_Diaci_psyllid_genome_assembly_version_1.1_chrUn.fa)...
 INFO       Unique sequences: 161988
 INFO     Stage 1 - match_identical_sequence:
 INFO       Matched sequences: 160938 (New : 160938)
 INFO       Unmatched sequences in old FASTA: 2085
 INFO       Unmatched sequences in new FASTA: 1050
 INFO     Stage 2 - match_truncated_sequence:
 INFO       Matched sequences: 161233 (New : 295)
 INFO       Unmatched sequences in old FASTA: 1790
 INFO       Unmatched sequences in new FASTA: 755
 INFO     Stage 3 - match_split_subsequence:
 INFO       Matched sequences: 161988 (New : 755)
 INFO       Unmatched sequences in old FASTA: 1035
 INFO       Unmatched sequences in new FASTA: 0
 INFO     Stage 4 - one_to_multiple_match (New : 0)
 INFO       Matched sequences: 161988 (New : 0)
 INFO       Unmatched sequences in old FASTA: 1035
 INFO       Unmatched sequences in new FASTA: 0

Example output file(match.tsv)

old_id	old_start	old_end	new_id	new_start	new_end
Scaffold1	0	3368518	KK245166.1	0	3368518
Scaffold4139	2368	8532	JHOM01041610.1	0	6164
Scaffold1688	0	390	KK246853.1	0	390
Scaffold1688	2775	4110	KK246853.1	2775	4110
Scaffold1688	4670	5814	KK246853.1	4670	5814

Example report file

If -r [report] or --report [report] is given, fasta_diff.py will generate a report for the unmatched sequences in new FASTA file.

#Sequence_ID    Sequence_length Stage
#Stage 1 - match_identical_sequence:
JJRW01S0013007.1        93719   Stage 1
JJRW01S0001013.1        399926  Stage 1
JJRW01S0000843.1        319524  Stage 1
#Stage 2 - match_truncated_sequence:
JJRW01S0013007.1        93719   Stage 2
JJRW01S0001013.1        399926  Stage 2
#Stage 3 - match_split_subsequence:
JJRW01S0001013.1        399926  Stage 3
#Stage 4 - one_to_multiple_match:
JJRW01S0001013.1        399926  Stage 4

Running the program with –h prints the following help:

fasta_diff -h

usage: fasta_diff [-h] [-o [OUT]] [-r REPORT] [-d] [-hc] [-v]
                  old_fasta new_fasta

Compares two very similar FASTA files and outputs coordinate mappings using a multi stage algorithm:
Stage 1: Find 100% matches
Stage 2: Find 100% substrings, where the full length of a new sequence can be found as a substring of a old sequence
Stage 3: Find cases where part of the sequence was converted into Ns
Outputs the 6 columns as tab-separated values: old_id, old_start, old_end, new_id, new_start, new_end
Originally created to compare the NCBI version to the original reference of Diaphorina citri

Example:
    fasta_diff example_file/old.fa example_file/new.fa -o match.tsv -r report.txt

positional arguments:
  old_fasta             The original FASTA file
  new_fasta             The new FASTA file

optional arguments:
  -h, --help            show this help message and exit
  -o [OUT], --out [OUT]
                        The output alignment file (default: STDOUT)
  -r REPORT, --report REPORT
                        Generate a report for the unmatched sequences in new
                        FASTA file.
  -d, --debug           If set, partial results are saved in a
                        *_stage_i_pickle file after each stage, unmatched
                        sequences after each stage are saved in a
                        *_stage_i_unmatched file for both FASTA files.
  -hc, --header_check   If set, confirm the detected mapping by checking the
                        header of the new sequence for the id of the mapped
                        old sequence.
  -v, --version         show program's version number and exit

Provide feedback

Saved searches

Use saved searches to filter your results more quickly