- module load bcl2fastq/2.20
- bcl2fastq -p 20 --create-fastq-for-index-reads -o <outputdir>
**note**: run the above bcl2fastq command in the MiSeq raw reads output folder. Be sure to rename samplesheet.csv first, so that bcl2fastq lumps all reads together and gives us the 4 'Undetermined' fastq files (R1, R2, I1, I2).
cutadapt --pair-adapters \
-j 20 \
-m 1 \
-a CTGTCTCTTATACACATCTCCGAGCCCACGAGAC \
-A CTGTCTCTTATACACATCTGACGCTGCCGACGA \
-o Undetermined_S0_L001_R1_001.fastq.trim.gz \
-p Undetermined_S0_L001_R2_001.fastq.trim.gz \
Undetermined_S0_L001_R1_001.fastq.gz \
Undetermined_S0_L001_R2_001.fastq.gz
**note**: -m 1 discards any zero-length reads; -j 20 sets the number of CPU cores to use
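If you want to launch the trimming step from Python, the command above can be assembled as an argument list. This is only a sketch: the helper name `build_cutadapt_command` and its parameters are made up for illustration, and the output naming simply mirrors the `.fastq.gz` to `.fastq.trim.gz` pattern used above.

```python
import shlex

# Adapter sequences copied from the cutadapt command above.
ADAPTER_R1 = "CTGTCTCTTATACACATCTCCGAGCCCACGAGAC"
ADAPTER_R2 = "CTGTCTCTTATACACATCTGACGCTGCCGACGA"


def build_cutadapt_command(r1, r2, cores=20, min_length=1):
    """Return the cutadapt argument list for one R1/R2 pair (hypothetical helper)."""

    def trimmed(name):
        # Mirror the naming used above: foo.fastq.gz -> foo.fastq.trim.gz
        if not name.endswith(".fastq.gz"):
            raise ValueError(f"expected a .fastq.gz file, got {name!r}")
        return name[: -len(".fastq.gz")] + ".fastq.trim.gz"

    return [
        "cutadapt", "--pair-adapters",
        "-j", str(cores),        # number of CPU cores
        "-m", str(min_length),   # drop reads shorter than this after trimming
        "-a", ADAPTER_R1,
        "-A", ADAPTER_R2,
        "-o", trimmed(r1),
        "-p", trimmed(r2),
        r1, r2,
    ]


cmd = build_cutadapt_command(
    "Undetermined_S0_L001_R1_001.fastq.gz",
    "Undetermined_S0_L001_R2_001.fastq.gz",
)
print(shlex.join(cmd))  # shows the command without running it
```

To actually run it, pass `cmd` to `subprocess.run(cmd, check=True)` with cutadapt on path.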
python3 pipeline.py -c settings.ini
Need to make sure that:
- settings.ini is properly set up, for example:
/scicomp/groups/OID/NCEZID/DFWED/EDLB/projects/CIMS/HMAS_QC_pipeline/M3235_22_024/settings.ini
- mothur_py is installed
- mothur is on path (v1.46.0)
- cutadapt is on path
- oligos file is properly set up, for example:
/scicomp/groups/OID/NCEZID/DFWED/EDLB/projects/CIMS/HMAS_QC_pipeline/M3235_22_024/M3235_22_024.oligos
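The checklist above can be turned into a quick pre-flight check before launching the pipeline. The sketch below is not part of pipeline.py; `check_prerequisites` is a hypothetical helper, and it only checks that the tools and files exist, not that mothur is v1.46.0 or that the files are correctly filled in.

```python
import importlib.util
import shutil
from pathlib import Path


def check_prerequisites(settings_path, oligos_path):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    # Command-line tools that must be on PATH.
    for tool in ("mothur", "cutadapt"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} is not on PATH")

    # Python package that must be importable.
    if importlib.util.find_spec("mothur_py") is None:
        problems.append("mothur_py is not installed")

    # Input files that must exist.
    for label, path in (("settings file", settings_path), ("oligos file", oligos_path)):
        if not Path(path).is_file():
            problems.append(f"{label} not found: {path}")

    return problems
```

For example, `check_prerequisites("settings.ini", "M3235_22_024.oligos")` returns the list of anything that is missing, which you can print before deciding whether to run the pipeline.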
python3 parse_count_table_confusion_matrix.py
-c final.full.count_table (from Mothur QC)
-f final.fasta (from Mothur QC)
-r reference.fasta
-s sample.csv
-o output.file
note:
- Need to have blast loaded: ml ncbi-blast+/LATEST
- This script has been updated to take a single argument, a confusion_matrix.ini config file in which all required arguments are set:
python3 parse_count_table_confusion_matrix.py
-c confusion_matrix.ini
python3 extract_amplicon_from_primersearch_output.py
-s isolate_WGS.fasta (or directory name, which holds an array of fasta files)
-p primers-list-psearch.txt
note:
- extract_amplicon_from_primersearch_output.py will check if the -s argument passed in is a file or a directory. If it's a directory, it will grab all the fasta files in the directory.
- The primer list has to be in a specific format (tab-delimited plain text file) with 3 columns in total: the 1st column is the primer name, the 2nd column is the forward primer sequence, and the 3rd column is the reverse primer sequence. For example: psearch_primer_list
- The script assumes EMBOSS/6.4.0 is already on path (ml EMBOSS/6.4.0 if you're in scicomp space).
- The outputs are saved in the primersearch folder under the working directory where the script is run.
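The two input rules above (file-or-directory for -s, and the 3-column tab-delimited primer list) can be sketched as follows. This is an illustration, not the script's actual code: both function names are hypothetical, and the set of fasta file extensions is an assumption.

```python
import csv
from pathlib import Path

# Assumption: which file extensions count as fasta files.
FASTA_SUFFIXES = {".fasta", ".fa", ".fna"}


def collect_fasta_files(source):
    """Return [source] if it is a file, or every fasta file in it if it is a directory."""
    path = Path(source)
    if path.is_dir():
        return sorted(p for p in path.iterdir() if p.suffix.lower() in FASTA_SUFFIXES)
    if path.is_file():
        return [path]
    raise FileNotFoundError(f"{source} is neither a file nor a directory")


def read_primer_list(path):
    """Parse a 3-column, tab-delimited primer list into (name, forward, reverse) tuples."""
    primers = []
    with open(path, newline="") as handle:
        for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            if not row or all(not field.strip() for field in row):
                continue  # skip blank lines
            if len(row) != 3:
                raise ValueError(f"line {line_no}: expected 3 columns, found {len(row)}")
            name, forward, reverse = (field.strip() for field in row)
            primers.append((name, forward, reverse))
    return primers
```

Running `read_primer_list` on a primer list before calling the extraction script is a cheap way to catch a missing column or a space-delimited file early.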
- Download sequence data files using SRA toolkit
- Installation
wget --output-document sratoolkit.tar.gz https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
- Configure and set it up correctly
- fasterq-dump --split-files SRR_file (Scicomp currently has fasterq-dump installed already, so you can skip the first 2 steps)
- Use shovill to assemble
- ml shovill
- depending on your setup, you might need to switch to your home folder to use shovill (I sometimes got a folder permission error when calling shovill)
- run shovill
shovill -R1 SRR1616822_1.fastq -R2 SRR1616822_2.fastq \
--outdir output_folder --assembler skesa --trim ON --cpus 30
note: the default spades assembler often fails with an out-of-memory error for me
- I have a run_SRA_assembly.py script that automates the whole process if you already have a list of SRA files to download and assemble.
python3 run_SRA_assembly.py
-i sra-list-file
If any SRA fails in the process, the script generates a sra-list-file_fail_to_assemble file, which you can use to run the script again. The sra-list-file is a 2-column (tab-delimited) text file:
sample-1 sra-1
sample-2 sra-2
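For reference, here is a rough sketch of the bookkeeping around that list file. It is not the code of run_SRA_assembly.py; the function names are hypothetical, the commands simply repeat the fasterq-dump and shovill calls shown above, and writing the failure file in the same 2-column format is an assumption based on it being reusable as input.

```python
from pathlib import Path


def read_sra_list(path):
    """Parse the 2-column, tab-delimited list into (sample, sra) pairs."""
    pairs = []
    for line_no, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 tab-delimited columns")
        pairs.append((fields[0].strip(), fields[1].strip()))
    return pairs


def build_commands(sra, outdir, cpus=30):
    """Return the download and assembly argument lists for one SRA accession."""
    download = ["fasterq-dump", "--split-files", sra]
    assemble = [
        "shovill", "-R1", f"{sra}_1.fastq", "-R2", f"{sra}_2.fastq",
        "--outdir", str(outdir), "--assembler", "skesa",
        "--trim", "ON", "--cpus", str(cpus),
    ]
    return download, assemble


def write_failures(list_path, failed_pairs):
    """Write failed entries next to the input list so they can be re-run."""
    fail_path = Path(f"{list_path}_fail_to_assemble")
    fail_path.write_text("".join(f"{sample}\t{sra}\n" for sample, sra in failed_pairs))
    return fail_path
```

A driver would loop over `read_sra_list(...)`, run both commands with `subprocess.run`, collect the pairs whose commands return a non-zero exit code, and pass them to `write_failures`.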
Generate a pairwise difference matrix at all allele sites (primer pairs) for the given sequence fasta files
python3 pairwise_diff_matrix.py
-o output file
-d directory (which holds the fasta files)
-p (optional) oligos file (holds primer info)
-n (optional) numeric flag
-y (optional) diff only flag
note:
- The allele (primer pair) information is in the oligos file. Although the -p argument is optional, you will need to provide a path to an accessible, valid oligos file (the default in the script is my own private oligos file).
- The sequence fasta files need to have the allele (primer) information in their sequence IDs. The script uses that to locate the pair of sequences to compare. This will not be an issue if you run our extract_amplicon_from_primersearch_output.py
- If the -n (numeric flag) argument is turned on, a float value (ex. 0.002) is used instead of a string (ex. 5 / 2461) in the output file. If instead the -y (diff only) argument is turned on, only the difference will be in the output (ex. 5 in this case).
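The three output styles can be illustrated with a small formatter. This is a sketch rather than the script's own code: `format_difference` is a hypothetical name, rounding to 3 decimals is an assumption chosen to match the 0.002 example, and letting the diff-only flag win when both flags are set is also an assumption.

```python
def format_difference(diff, length, numeric=False, diff_only=False):
    """Format one matrix cell for a pair of sequences at one allele site.

    diff   -- number of differing positions
    length -- number of positions compared
    """
    if diff_only:
        # -y: report only the number of differences
        return str(diff)
    if numeric:
        # -n: report the proportion as a float (3 decimals is an assumption)
        return str(round(diff / length, 3))
    # default: report "differences / length" as a string
    return f"{diff} / {length}"


print(format_difference(5, 2461))                  # 5 / 2461
print(format_difference(5, 2461, numeric=True))    # 0.002
print(format_difference(5, 2461, diff_only=True))  # 5
```

The string form keeps both the count and the amplicon length visible, while the numeric and diff-only forms are easier to load into downstream analysis tools.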
- Run create_subset_primers.py to generate a subset of the oligos file. It will generate 2 oligos files (a *_subset_oligos file and a *_remainder_oligos file); we will use the *_remainder_oligos file.
- Set up the conda env for NextFlow:
conda env create -n hmas -f bin/hmas.yaml
(if you have mamba installed, use mamba env create instead for speed)
conda activate hmas
- Run the NextFlow script as:
nextflow run hmas2_sampling_rawreads.nf
--oligo absolute_address_for_that_remainder_oligos file
The outputs are all in the output_sampling_rawreads folder
python3 hmas2_confusion_matrix.py
-i hmas2 QC pipeline output folder
(which contains subfolders for each sample)
-o output confusion_matrix file path
-r common reference file for all those samples
-m the metasheet file for all those samples
(this is usually generated while extracting amplicon sequences)
-p mapping file
(mapping between samples and isolates. A sample might have multiple isolates in it)
-s the path for parse_count_table_confusion_matrix.py script
note:
- The mapping file is a csv file with the header columns Sample, isolate_1, isolate_2, isolate_3. If a sample has more than 3 isolates in it, you can add more columns. If a sample has only one isolate, you can leave the other 2 isolate columns blank.
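To make the mapping file layout concrete, here is a minimal reader for it. It is a sketch only: `read_mapping` is a hypothetical helper, not the parser used by hmas2_confusion_matrix.py, and it assumes a comma-delimited file whose first header column is Sample.

```python
import csv


def read_mapping(path):
    """Return {sample: [isolate, ...]} from the mapping csv, ignoring blank isolate cells."""
    mapping = {}
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)  # Sample, isolate_1, isolate_2, ...
        if not header or header[0].strip().lower() != "sample":
            raise ValueError("first header column should be 'Sample'")
        for row in reader:
            if not row or not row[0].strip():
                continue  # skip blank lines
            # Any number of isolate columns is accepted; empty cells are dropped.
            mapping[row[0].strip()] = [cell.strip() for cell in row[1:] if cell.strip()]
    return mapping
```

With a file containing the header line `Sample,isolate_1,isolate_2,isolate_3` and a row `S1,isoA,,`, the sample S1 maps to a single isolate, while a row with all three cells filled maps to three.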