Deep mutational scanning of H5N1 influenza haemagglutinin using a barcoded lentiviral platform (strain A/American Wigeon/South Carolina/USDA-000345-001/2021)
Study by Dadonaite et al., bioRxiv (2024).
This repository contains the code and data for the analysis. For renderings of key results and an easy-to-interpret summary, see the documentation of the analysis at https://dms-vep.org/Flu_H5_American-Wigeon_South-Carolina_2021-H5N1_DMS
The pipeline also creates a JSON file with the results for visualization with dms-viz, which can be viewed at this link.
Most of the analysis is done by dms-vep-pipeline-3, which was added as a git submodule to this repo via:
git submodule add https://github.com/dms-vep/dms-vep-pipeline-3
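Note that if you clone this repo afresh, the submodule must be initialized too; the standard git commands for this (not specific to this repo) are:
git clone --recurse-submodules <repo-url>   # clone the repo and its submodule together
git submodule update --init --recursive   # or initialize the submodule after a plain clone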
That command added the file .gitmodules and the dms-vep-pipeline-3 submodule, which were then committed to the repo. If you want a specific commit or tag of dms-vep-pipeline-3, or to update to a new commit, check out the desired commit within the submodule:
cd dms-vep-pipeline-3
git checkout <commit>
cd ..
Then, from the top-level directory, add and commit the updated dms-vep-pipeline-3 submodule.
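For example (the commit message here is just illustrative):
git add dms-vep-pipeline-3
git commit -m "update dms-vep-pipeline-3 submodule to <commit>"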
You can also make changes to dms-vep-pipeline-3 itself and commit them back to that repo.
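A rough sketch of that workflow, assuming you have push access to dms-vep-pipeline-3 (the branch name and commit message are illustrative):
cd dms-vep-pipeline-3
git checkout main   # work on a branch rather than a detached HEAD
# ... edit pipeline files ...
git commit -am "describe your change"
git push origin main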
The configuration for the pipeline is in config.yaml and the files in ./data/ referenced therein. To run the pipeline, do:
snakemake -j 8 --software-deployment-method conda -s dms-vep-pipeline-3/Snakefile
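Optionally (a general Snakemake tip, not a required step), you can first preview the jobs that would be executed with a dry run:
snakemake -n -s dms-vep-pipeline-3/Snakefile   # -n performs a dry run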
To run the pipeline on the Hutch cluster via slurm, submit the run_Hutch_cluster.bash script:
sbatch -c 8 run_Hutch_cluster.bash
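The actual script is in this repo; for orientation, such a submission script is generally just a thin wrapper along these lines (a sketch, not the script's verbatim contents):
#!/bin/bash
#SBATCH -c 8
snakemake -j 8 --software-deployment-method conda -s dms-vep-pipeline-3/Snakefile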
Running the full pipeline in this repo requires access to the large FASTQ files with the deep sequencing data. Those FASTQ files are far too large to store in this GitHub repo, so they need to be stored elsewhere on your computing cluster. Their locations on the Fred Hutch computing cluster are specified in ./data/PacBio_runs.csv and ./data/barcode_runs.csv. To run the full pipeline on another cluster, you will need to obtain all of these FASTQ files, download them to your cluster, and update the paths in those two files to point to where you put them.
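For instance, if the Hutch-specific paths share a common prefix, one way to update them in bulk is a substitution like this (both prefixes here are hypothetical; check the actual CSVs):
sed -i 's|/hypothetical/hutch/prefix|/path/on/your/cluster|g' data/PacBio_runs.csv data/barcode_runs.csv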
Alternatively, for most purposes it should be sufficient to re-run the pipeline without going all the way back to the FASTQ files, instead using the variants and barcode counts already extracted from those FASTQs by the original analysis on the Hutch cluster.
Those variant and barcode-count files are much smaller, so they are stored in this repo.
To re-run the pipeline using those files so that the FASTQs are not required, you only need to change the values of prebuilt_variants, prebuilt_geneseq, and use_precomputed_barcode_counts in config.yaml to be as follows:
prebuilt_variants: results/variants/codon_variants.csv # use codon-variant table already in repo
prebuilt_geneseq: results/gene_sequence/codon.fasta # use gene sequence already in repo
...
use_precomputed_barcode_counts: true # use barcode counts already in repo
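As a quick sanity check (a suggestion, not a pipeline requirement), you can confirm the prebuilt files referenced above actually exist before re-running:
ls results/variants/codon_variants.csv results/gene_sequence/codon.fasta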
Input data for the pipeline are in ./data/. The raw PacBio and Illumina sequencing files are available on the SRA under BioProject PRJNA1123200; they were uploaded to the SRA using the scripts and instructions in ./sra_upload/.
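If you do need the raw reads, they can be fetched with the standard SRA Toolkit; the run accession below is a placeholder that must be looked up under the BioProject:
prefetch SRRXXXXXXX   # a run accession from BioProject PRJNA1123200
fasterq-dump SRRXXXXXXX   # convert the downloaded run to FASTQ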
The results of running the pipeline are placed in ./results/. Only some of these results are tracked to save space (see .gitignore).
The pipeline builds HTML documentation in ./docs/, which can be rendered via GitHub Pages. For this repo, nice VitePress documentation was then built to render on GitHub Pages by following the instructions in ./homepage/.
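To preview that documentation locally, a typical VitePress workflow looks like the following sketch (the npm script name is an assumption; check package.json in ./homepage/):
cd homepage
npm install
npm run docs:dev   # script name assumed; serves the site locally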
The description of the mutant library design is contained in ./library_design/.