Genomic-Medicine-Linkoping/hydra-genetics_alignment

 
 


🐍 hydra-genetics/alignment

Snakemake module containing processing steps that should be performed during sequence alignment.


License: GPL-3

💬 Introduction

The module consists of alignment processing steps, such as alignment of .fastq files and duplicate marking of .bam files.

❗ Dependencies

In order to use this module, the following dependencies are required:

  • hydra-genetics
  • pandas
  • python
  • snakemake
  • singularity

🎒 Preparations

Sample and unit data

Input data should be added to samples.tsv and units.tsv. The following information needs to be added to these files:

| File | Column Id | Description |
|---|---|---|
| samples.tsv | sample | unique sample/patient id, one per row |
| units.tsv | sample | same sample/patient id as in samples.tsv |
| units.tsv | type | data type identifier (one letter), can be one of T (Tumor), N (Normal), R (RNA) |
| units.tsv | platform | type of sequencing platform, e.g. NovaSeq |
| units.tsv | machine | specific machine id, e.g. NovaSeq instruments have @Axxxxx |
| units.tsv | flowcell | identifier of the flowcell used |
| units.tsv | lane | flowcell lane number |
| units.tsv | barcode | sequence library barcode/index, connect forward and reverse indices by +, e.g. ATGC+ATGC |
| units.tsv | fastq1/2 | absolute path to forward and reverse reads |
| units.tsv | adapter | adapter sequences to be trimmed, separated by comma |
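
A quick sanity check of units.tsv before starting the workflow can catch missing columns early. The following is a minimal sketch and not part of the module: the helper name `check_units` is hypothetical, it assumes that fastq1 and fastq2 are two separate columns, and it assumes the one-letter type codes are T, N and R.

```python
import csv
import io

# Column names taken from the units.tsv description; fastq1 and fastq2
# are assumed to be two separate columns.
REQUIRED_UNIT_COLUMNS = [
    "sample", "type", "platform", "machine", "flowcell",
    "lane", "barcode", "fastq1", "fastq2", "adapter",
]
# Assumed one-letter codes for Tumor, Normal and RNA.
VALID_TYPES = {"T", "N", "R"}


def check_units(handle):
    """Return a list of problems found in a tab-separated units table."""
    reader = csv.DictReader(handle, delimiter="\t")
    fieldnames = reader.fieldnames or []
    missing = [c for c in REQUIRED_UNIT_COLUMNS if c not in fieldnames]
    if missing:
        # Without the expected columns, row-level checks make no sense.
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["type"] not in VALID_TYPES:
            problems.append(f"line {line_no}: unexpected type {row['type']!r}")
    return problems
```

It could be called as `check_units(open("units.tsv", newline=""))`; an empty list means no problems were found.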

Reference data

You need an indexed reference genome, e.g. reference.fna.

For bwa, the index files are generated by bwa index. The dict file is generated using picard CreateSequenceDictionary, and the fai file is generated using samtools faidx.

| File | Description |
|---|---|
| reference.dict | dictionary file |
| reference.fna.amb | record appearance of N (or other non-ATGC) in the ref fasta |
| reference.fna.ann | record ref sequences, name, length, etc |
| reference.fna.bwt | the Burrows-Wheeler transformed sequence |
| reference.fna.fai | index file |
| reference.fna.pac | packaged sequence (four base pairs encode one byte) |
| reference.fna.sa | suffix array index |
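
To confirm that all of the listed files are in place, a small check along these lines may help. It is only a sketch: the function name `missing_reference_files` is hypothetical, and it assumes the index files sit next to the fasta and follow exactly the naming shown in the table.

```python
from pathlib import Path

# Suffixes appended to the fasta file name, as listed in the table.
INDEX_SUFFIXES = [".amb", ".ann", ".bwt", ".fai", ".pac", ".sa"]


def missing_reference_files(fasta):
    """Return the expected reference files that do not exist on disk."""
    fasta = Path(fasta)
    expected = [fasta]
    expected += [Path(str(fasta) + suffix) for suffix in INDEX_SUFFIXES]
    # The dict file replaces the fasta extension: reference.fna -> reference.dict
    expected.append(fasta.with_suffix(".dict"))
    return [str(path) for path in expected if not path.exists()]
```

Calling `missing_reference_files("reference.fna")` returns an empty list when the fasta, its six index files and reference.dict are all present.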

✅ Testing

The workflow repository contains a small test dataset in .tests/integration, which can be run like so:

$ cd .tests/integration
$ snakemake -s ../../Snakefile -j1 --use-singularity

🚀 Usage

To use this module in your workflow, follow the description in the snakemake docs. Add the module to your Snakefile like so:

module alignment:
    snakefile:
        github(
            "hydra-genetics/alignment",
            path="workflow/Snakefile",
            tag="v0.1.0",
        )
    config:
        config


use rule * from alignment as alignment_*

Compatibility

Latest:

  • prealignment:v0.2.0

See COMPATIBLITY.md file for a complete list of module compatibility.

Input files

| Source | File | Description |
|---|---|---|
| hydra-genetics/prealignment data | prealignment/fastp_pe/{sample}_{flowcell}_{lane}_{type}_fastq1.fastq.gz | trimmed forward reads |
| hydra-genetics/prealignment data | prealignment/fastp_pe/{sample}_{flowcell}_{lane}_{type}_fastq2.fastq.gz | trimmed reverse reads |
| original fastq files | PATH/fastq1.fastq.gz | forward reads retrieved from units.tsv |
| original fastq files | PATH/fastq2.fastq.gz | reverse reads retrieved from units.tsv |
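
The wildcards in the prealignment file names correspond to columns in units.tsv. As an illustration only, the sketch below fills in the pattern for one unit. The helper `trimmed_fastq_paths` is hypothetical, and it assumes the reverse-read file ends in `_fastq2.fastq.gz`, mirroring the forward-read name.

```python
# File name pattern from the input table; {read} is fastq1 or fastq2.
PREALIGNMENT_PATTERN = (
    "prealignment/fastp_pe/{sample}_{flowcell}_{lane}_{type}_{read}.fastq.gz"
)


def trimmed_fastq_paths(unit):
    """Return (forward, reverse) trimmed fastq paths for one units.tsv row."""
    return tuple(
        PREALIGNMENT_PATTERN.format(
            sample=unit["sample"],
            flowcell=unit["flowcell"],
            lane=unit["lane"],
            type=unit["type"],
            read=read,
        )
        for read in ("fastq1", "fastq2")
    )
```

Here `unit` is a dict holding one row of units.tsv, for example as produced by `csv.DictReader`.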

Output files

The following output files should be targeted via another rule:

| File | Description |
|---|---|
| alignment/samtools_merge_bam/{sample}_{type}.bam | aligned data that has been duplicate marked |
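
Since the merged file is named per sample and type, several lanes of the same sample collapse into a single target. A minimal sketch of building the target list from units.tsv rows follows. The function name `merged_bam_targets` is hypothetical; the resulting list could, for instance, be used as the input of a target rule in your own workflow.

```python
def merged_bam_targets(units):
    """Return sorted, de-duplicated merged BAM paths for units.tsv rows."""
    # A set removes duplicates when a sample/type pair spans several lanes.
    targets = {
        f"alignment/samtools_merge_bam/{unit['sample']}_{unit['type']}.bam"
        for unit in units
    }
    return sorted(targets)
```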

🧑‍⚖️ Rule Graph

Align and mark duplicates

[Figure: rule graph]
