A versatile script for pulling BLAST results. Specifically intended to be used to pull promoter sequences of a set of proteins and filter those promoter sequences based off a set similarity threshold.

ErillLab/blast_seq_filter

blast_sequence_filter

Overview

A set of protein accessions is passed in, and the script returns either the BLAST hits or the nucleotide sequences of those hits.

How it works

A BLAST search is conducted for each of the protein accessions passed in, and all unique hits are returned by the search_blast() function. Depending on the configuration in the JSON file, the unique hits are either written to a CSV file or passed on through the pipeline.
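The merging of hits from several searches can be sketched as below. This is a minimal illustration, not the repository's search_blast(): the `Hit` tuple, its field names, and the `unique_hits` helper are hypothetical, the BLAST request itself is omitted, and applying the `e_val` and `coverage_min` cutoffs at this stage is an assumption.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical container for one BLAST hit (not the script's own type)."""
    accession: str
    e_value: float
    coverage: float  # fraction of the query covered by the alignment, 0..1


def unique_hits(hits_per_query: Iterable[Iterable[Hit]],
                e_val: float, coverage_min: float) -> List[str]:
    """Merge the hits of several searches, keeping each accession once."""
    seen = set()
    ordered = []
    for hits in hits_per_query:
        for hit in hits:
            # Drop hits that fail either cutoff.
            if hit.e_value > e_val or hit.coverage < coverage_min:
                continue
            # Keep the first occurrence of each accession, in input order.
            if hit.accession not in seen:
                seen.add(hit.accession)
                ordered.append(hit.accession)
    return ordered
```

Keeping the first-seen order makes the result reproducible for a given input order, which is convenient when the hits are written to a CSV file.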

The set of unique protein records from the BLAST searches is used to pull the "most complete" nucleotide record for each protein (get_nuc_rec_from_prot()). The set of nucleotide records is then used to pull either the entire nucleotide sequence or the isolated promoter sequence, based on the parameters in the JSON file (get_nuc_seq()).
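How a promoter window might be cut out around a gene start is sketched below. This is not the code of get_nuc_seq(): the `promoter_region` function, its coordinate convention (0-based, end-exclusive, on the forward strand), and the clamping at the sequence ends are assumptions made for illustration. Only the `upstream_adj` and `downstream_adj` names come from the JSON parameters.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def promoter_region(genome: str, start: int, end: int, strand: int,
                    upstream_adj: int, downstream_adj: int) -> str:
    """Slice a window around the start codon of a gene.

    start/end are 0-based, end-exclusive coordinates of the coding region
    on the forward strand; strand is +1 or -1 (assumed convention).
    """
    if strand == 1:
        # Forward strand: the start codon begins at `start`.
        lo = max(0, start - upstream_adj)
        hi = min(len(genome), start + downstream_adj)
        return genome[lo:hi]
    # Reverse strand: the start codon sits at the `end` side of the feature,
    # so the window is taken there and then reverse-complemented.
    lo = max(0, end - downstream_adj)
    hi = min(len(genome), end + upstream_adj)
    return reverse_complement(genome[lo:hi])
```

With `upstream_adj` set to 250 and `downstream_adj` set to 3, as in the example configuration, the window would cover 250 bases before the start codon plus the start codon itself.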

The returned nucleotide sequences can be filtered to a specific similarity threshold. For example, if the threshold is set to 0.8, no two sequences in the filtered list will be more than 80% similar. Sequences that fall within the threshold similarity of one another are sorted into bins, and the centroid of each bin (or a random member, if get_centroid is false) is added to the filtered list.
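The binning idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's implementation: `difflib.SequenceMatcher` stands in for the pairwise alignments the script uses, the greedy "compare against the first member of each bin" rule is a guess at one simple way to form bins, and all function names are hypothetical. Unlike the behaviour described above, this simplified version does not re-check the chosen centroids against each other.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the script itself uses pairwise alignments."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def bin_sequences(seqs: List[str], max_sim: float) -> List[List[str]]:
    """Greedily place each sequence in the first bin whose seed is similar enough."""
    bins: List[List[str]] = []
    for seq in seqs:
        for members in bins:
            if similarity(seq, members[0]) >= max_sim:
                members.append(seq)
                break
        else:
            # No existing bin matched, so this sequence seeds a new bin.
            bins.append([seq])
    return bins


def centroid(members: List[str]) -> str:
    """Pick the member with the highest mean similarity to the other members."""
    if len(members) == 1:
        return members[0]

    def mean_sim(i: int) -> float:
        others = [m for j, m in enumerate(members) if j != i]
        return sum(similarity(members[i], m) for m in others) / len(others)

    return members[max(range(len(members)), key=mean_sim)]


def filter_sequences(seqs: List[str], max_sim: float) -> List[str]:
    """Return one representative per bin; max_sim == -1 disables filtering."""
    if max_sim == -1:
        return list(seqs)
    return [centroid(members) for members in bin_sequences(seqs, max_sim)]
```

Choosing the centroid needs a similarity score for every pair inside a bin, which is why selecting a random member instead is cheaper for large bins.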

The output sequences are written to the output file (CSV or FASTA).

Usage

The parameters and input sequences are set in a JSON file following the structure below. The JSON file must be placed in the input folder and the filename must be specified in the filter_seq.py file.

Example JSON file structure and definitions:


    {
        
        "entrez_parameters": [{
        
            "request_limit" : 5,
            "sleep_time": 0.5,
            "email" : "[email protected]",
            "api_key" : null
            
        }],
    
        "blast_parameters" : [{
    
        	"tax_limit" : null,
        	"e_val" : 10E-10,
        	"coverage_min": 0.8,
        	"max_hits": 500
    
        }],
    
        "nucleotide_parameters" : [{
    
        	"isolate_promoters": true,
        	"capture_with_n": false,
        	"get_centroid": true,
        	"min_length" : 10,
        	"max_sim" : 0.75,
        	"upstream_adj" : 250,
        	"downstream_adj" : 3
    
        }],
    
        "output_parameters" : [{
    
        	"output_type" : "nuc",
        	"file_type" : "fasta",
        	"file_name" : "rvvAE-A1552_BLAST-p500"
    
        }],
    
        "input_records" : ["WP_000373826.1", "WP_000173586.1", "WP_000778624.1"]
        
            
    }
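Reading a file with this structure could look like the sketch below. It is not taken from filter_seq.py: `parse_config` and `load_config` are hypothetical helpers. The point it illustrates is that each parameter group in the example is a one-element list wrapping a single object, so the first element has to be unpacked. Note also that `10E-10` parses as the number 1e-9.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Parameter groups that appear in the example file, each stored as [ {...} ].
GROUPS = ("entrez_parameters", "blast_parameters",
          "nucleotide_parameters", "output_parameters")


def parse_config(text: str) -> Dict[str, Any]:
    """Parse the JSON text and unwrap the one-element list around each group."""
    raw = json.loads(text)
    config: Dict[str, Any] = {group: raw[group][0] for group in GROUPS}
    config["input_records"] = list(raw["input_records"])
    return config


def load_config(path: str) -> Dict[str, Any]:
    """Read the JSON file at `path` (hypothetical helper) and parse it."""
    return parse_config(Path(path).read_text())
```

After loading, a value such as the similarity threshold would be available as `config["nucleotide_parameters"]["max_sim"]`.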
Parameter definitions

entrez_parameters

| Parameter | Definition |
| --- | --- |
| request_limit | The number of attempts that can be made for each request to the NCBI databases. |
| sleep_time | The delay between requests, in seconds. |
| email | Email address to be associated with the Entrez requests. Should there be issues with your requests, NCBI will contact you before blocking access. |
| api_key | NCBI API key. |

blast_parameters

| Parameter | Definition |
| --- | --- |
| tax_limit | Limits the BLAST search to the specified taxa. Set the value to null if not applicable. Input taxids must be given as a list: "tax_limit":["taxid1", "taxid2", "taxid3"] |
| e_val | E-value cutoff for the BLAST hits that are returned. |
| coverage_min | Minimum coverage for the BLAST hits that are returned. |
| max_hits | Maximum number of hits returned. |

nucleotide_parameters

| Parameter | Definition |
| --- | --- |
| isolate_promoters | If true, returns the promoter region. If false, returns the entire nucleotide sequence. |
| capture_with_n | Whether to return sequences that contain ambiguous nucleotides (N). If true, all sequences are returned. |
| get_centroid | Applies to the filtering process. If true, the centroid of each bin is selected. This is more computationally expensive, as it requires many pairwise alignments. If false, a random member of each bin is selected. This is faster, but not as exact. |
| min_length | Minimum length of the nucleotide sequence returned. |
| max_sim | The similarity threshold used for the filtering process. If set to -1, the results are not filtered. |
| upstream_adj | How far upstream of the ATG start to retrieve the nucleotide sequence. |
| downstream_adj | How far downstream of the ATG start to retrieve the nucleotide sequence. |

output_parameters

| Parameter | Definition |
| --- | --- |
| output_type | nuc: returns the nucleotide sequence. prot: returns the protein record. |
| file_type | CSV or FASTA file type. For protein records, CSV is the default. |
| file_name | Name of the output file. The final output is written to the output folder with the name file_name + DATE-TIME + file_type (e.g. rvvAE-A1552_BLAST-p500_04-07-2020_19-23-26.fasta). |
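The naming pattern for the output file can be reproduced with a short sketch. The `output_filename` helper is hypothetical, and the month-day-year order of the date stamp is an assumption, since the example date 04-07-2020 can be read either way.

```python
from datetime import datetime
from typing import Optional


def output_filename(file_name: str, file_type: str,
                    now: Optional[datetime] = None) -> str:
    """Build file_name + DATE-TIME + file_type, as in the example above."""
    now = now or datetime.now()
    # Assumed layout: month-day-year, then hour-minute-second.
    stamp = now.strftime("%m-%d-%Y_%H-%M-%S")
    return f"{file_name}_{stamp}.{file_type.lower()}"
```

Passing a fixed `now` makes the helper easy to check against the example name, while leaving it unset uses the current time.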
