blast_sequence_filter

Overview

Given a set of protein accessions, the tool returns either the BLAST results or the nucleotide sequences of the BLAST hits.

How it works

A BLAST search is conducted for each protein accession passed in, and the search_blast() function returns all unique hits. Depending on the configuration in the JSON file, the unique hits are either written to a CSV file or passed on through the pipeline.

The set of unique protein records from the BLAST searches is used to pull the "most complete" nucleotide record for each protein (get_nuc_rec_from_prot()). These nucleotide records are then used to pull either the entire nucleotide sequence or the isolated promoter sequence, depending on the parameters in the JSON file (get_nuc_seq()).
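
The promoter isolation step can be pictured as slicing a window around the start codon. The sketch below is only an illustration, not the repository's get_nuc_seq(): `promoter_window` is a hypothetical helper, and it assumes the record is already available as a string, the gene lies on the forward strand, `start` is the 0-based index of the first base of the start codon, and `downstream_adj` is counted from that first base.

```python
def promoter_window(sequence, start, upstream_adj=250, downstream_adj=3):
    """Slice from upstream_adj bases before the start codon to
    downstream_adj bases after its first base (forward strand only)."""
    if not 0 <= start < len(sequence):
        raise ValueError("start must be an index inside the sequence")
    left = max(0, start - upstream_adj)                  # clamp at the 5' end of the record
    right = min(len(sequence), start + downstream_adj)   # clamp at the 3' end of the record
    return sequence[left:right]


# Toy record: 10 upstream bases followed by a gene that starts with ATG.
record = "CCCCCTTTTT" + "ATGAAACCC"
window = promoter_window(record, start=10, upstream_adj=4, downstream_adj=3)
print(window)  # TTTTATG
```

A gene on the reverse strand would need the window mirrored and reverse-complemented, which this sketch leaves out.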

The returned nucleotide sequences can be filtered to a specified similarity threshold. For example, if the threshold is set to 0.8, no two sequences in the filtered list will be more than 80% similar. The filter sorts all sequences within the threshold similarity of one another into bins and adds the centroid of each bin to the filtered list.
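
The binning idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, `difflib.SequenceMatcher` stands in for a real pairwise alignment score, bins are built greedily against the first member of each bin, and the centroid is taken to be the member with the highest total similarity to its bin.

```python
from difflib import SequenceMatcher
import random


def similarity(a, b):
    """Stand-in similarity in [0, 1]; a real pipeline would use pairwise alignment."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def bin_sequences(seqs, max_sim):
    """Greedily place each sequence in the first bin whose seed is similar enough."""
    bins = []
    for seq in seqs:
        for members in bins:
            if similarity(seq, members[0]) >= max_sim:
                members.append(seq)
                break
        else:
            bins.append([seq])  # no bin matched, so this sequence seeds a new one
    return bins


def centroid(members):
    """Member with the highest total similarity to its bin (many pairwise comparisons)."""
    return max(members, key=lambda s: sum(similarity(s, t) for t in members))


def filter_sequences(seqs, max_sim, get_centroid=True, rng=None):
    if max_sim == -1:  # documented sentinel: return the sequences unfiltered
        return list(seqs)
    rng = rng or random.Random()
    pick = centroid if get_centroid else rng.choice  # random pick is cheaper, less exact
    return [pick(members) for members in bin_sequences(seqs, max_sim)]


seqs = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGTAA", "TTTTGGGGCC"]
filtered = filter_sequences(seqs, max_sim=0.8)
print(filtered)  # one representative per bin
```

Because this greedy version compares each sequence only with the seed of a bin, it does not strictly guarantee that every pair of selected representatives falls below the threshold; a stricter variant would re-check the representatives against one another.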

The output sequences are written to the output file (CSV or FASTA).

Usage

The parameters and input sequences are set in a JSON file following the structure below. The JSON file must be placed in the input folder and the filename must be specified in the filter_seq.py file.

Example JSON file structure and definitions:

    {
        "entrez_parameters": [{
            "request_limit": 5,
            "sleep_time": 0.5,
            "email": "[email protected]",
            "api_key": null
        }],

        "blast_parameters": [{
            "tax_limit": null,
            "e_val": 10E-10,
            "coverage_min": 0.8,
            "max_hits": 500
        }],

        "nucleotide_parameters": [{
            "isolate_promoters": true,
            "capture_with_n": false,
            "get_centroid": true,
            "min_length": 10,
            "max_sim": 0.75,
            "upstream_adj": 250,
            "downstream_adj": 3
        }],

        "output_parameters": [{
            "output_type": "nuc",
            "file_type": "fasta",
            "file_name": "rvvAE-A1552_BLAST-p500"
        }],

        "input_records": ["WP_000373826.1", "WP_000173586.1", "WP_000778624.1"]
    }
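
Each parameter group in this structure is a one-element list wrapping an object, so a reader has to unwrap it before use. The sketch below shows one way to do that with the standard `json` module. `load_parameters` is a hypothetical helper rather than a function from the repository, and the email address is a placeholder.

```python
import json

GROUPS = ("entrez_parameters", "blast_parameters",
          "nucleotide_parameters", "output_parameters")


def load_parameters(text):
    """Parse the JSON text and unwrap each one-element parameter list."""
    raw = json.loads(text)
    missing = [g for g in (*GROUPS, "input_records") if g not in raw]
    if missing:
        raise KeyError(f"missing sections: {', '.join(missing)}")
    config = {group: raw[group][0] for group in GROUPS}
    config["input_records"] = list(raw["input_records"])
    return config


# Same structure as the example file, with a placeholder email address.
example = """{
  "entrez_parameters": [{"request_limit": 5, "sleep_time": 0.5,
                         "email": "user@example.org", "api_key": null}],
  "blast_parameters": [{"tax_limit": null, "e_val": 10E-10,
                        "coverage_min": 0.8, "max_hits": 500}],
  "nucleotide_parameters": [{"isolate_promoters": true, "capture_with_n": false,
                             "get_centroid": true, "min_length": 10, "max_sim": 0.75,
                             "upstream_adj": 250, "downstream_adj": 3}],
  "output_parameters": [{"output_type": "nuc", "file_type": "fasta",
                         "file_name": "rvvAE-A1552_BLAST-p500"}],
  "input_records": ["WP_000373826.1", "WP_000173586.1", "WP_000778624.1"]
}"""

config = load_parameters(example)
print(config["blast_parameters"]["e_val"])  # 1e-09
```

Note that `10E-10` parses to 1e-09, so the e-value cutoff in the example is 1e-9 rather than 1e-10.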

| Parameter | Definition |
| --- | --- |
| entrez_parameters | |
| `request_limit` | The number of attempts that can be made for each request to the NCBI databases. |
| `sleep_time` | The delay between requests, in seconds. |
| `email` | Email address to be associated with the Entrez requests. Should there be issues with your requests, NCBI will contact you before blocking access. |
| `api_key` | NCBI API key. |
| blast_parameters | |
| `tax_limit` | Limits the BLAST search to the specified taxa. Set the value to `null` if not applicable. Input taxids must be given as a list: `"tax_limit":["taxid1", "taxid2", "taxid3"]` |
| `e_val` | E-value cutoff for the BLAST hits that are returned. |
| `coverage_min` | Minimum coverage for the BLAST hits that are returned. |
| `max_hits` | Maximum number of hits returned. |
| nucleotide_parameters | |
| `isolate_promoters` | Whether to return only the promoter region. If `false`, the entire nucleotide sequence is returned. |
| `capture_with_n` | Whether to return sequences containing ambiguous nucleotides (N). If `true`, all sequences are returned. |
| `get_centroid` | Applies to the filtering process. If `true`, the centroid of each bin is selected. This is more computationally expensive because it requires many pairwise alignments. If `false`, a random member of each bin is selected. This is faster, but less exact. |
| `min_length` | Minimum length of the nucleotide sequences returned. |
| `max_sim` | The similarity threshold used for the filtering process. If set to `-1`, the results are not filtered. |
| `upstream_adj` | How far upstream of the ATG start to retrieve for the nucleotide sequence. |
| `downstream_adj` | How far downstream of the ATG start to retrieve for the nucleotide sequence. |
| output_parameters | |
| `output_type` | `nuc` returns the nucleotide sequences; `prot` returns the protein records. |
| `file_type` | Output file type, CSV or FASTA. For protein records, CSV is the default. |
| `file_name` | Name of the output file. The final output is written to the `output` folder with the name `file_name + DATE-TIME + file_type` (e.g. `rvvAE-A1552_BLAST-p500_04-07-2020_19-23-26.fasta`). |
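
The `file_name + DATE-TIME + file_type` naming scheme can be reproduced with a short helper. This is a sketch under assumptions: `output_path` is a hypothetical name, and the timestamp is assumed to be month-day-year followed by hour-minute-second, which matches the example name above but is not confirmed by the source (day-month-year would look identical for some dates).

```python
from datetime import datetime


def output_path(file_name, file_type, when=None, folder="output"):
    """Build '<folder>/<file_name>_<DATE-TIME>.<file_type>'."""
    when = when or datetime.now()
    stamp = when.strftime("%m-%d-%Y_%H-%M-%S")  # assumed month-day-year order
    return f"{folder}/{file_name}_{stamp}.{file_type.lower()}"


path = output_path("rvvAE-A1552_BLAST-p500", "fasta",
                   when=datetime(2020, 4, 7, 19, 23, 26))
print(path)  # output/rvvAE-A1552_BLAST-p500_04-07-2020_19-23-26.fasta
```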
