From 144367437b3ab3eba244c08de7f519c4f45598ae Mon Sep 17 00:00:00 2001 From: Felix Lenner <52530259+fellen31@users.noreply.github.com> Date: Wed, 3 Apr 2024 13:45:39 +0200 Subject: [PATCH] Update usage docs (#53) update usage docs --------- Co-authored-by: Anders Jemt --- README.md | 2 +- docs/usage.md | 338 +++++++++++++++++++++++++++++--------------------- 2 files changed, 196 insertions(+), 144 deletions(-) diff --git a/README.md b/README.md index abbbc1f7..967f0d16 100644 --- a/README.md +++ b/README.md @@ -64,7 +64,7 @@ Prepare a samplesheet with input data: ``` sample,file,family_id,paternal_id,maternal_id,sex,phenotype HG002,/path/to/HG002.fastq.gz,FAM1,HG003,HG004,1,1 -HG005,/path/to/HG005.fastq.gz,FAM1,HG003,HG004,2,1 +HG005,/path/to/HG005.bam,FAM1,HG003,HG004,2,1 ``` Now, you can run the pipeline using: diff --git a/docs/usage.md b/docs/usage.md index de7e0d99..32a9bf56 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -1,12 +1,162 @@ # genomic-medicine-sweden/skierfe: Usage -## Optional inputs: +## Introduction + +genomic-medicine-sweden/skierfe is a bioinformatics analysis pipeline to analyse long-read data. + +## Prerequisites + +1. Install Nextflow (>=22.10.1) using the instructions [here.](https://nextflow.io/docs/latest/getstarted.html#installation) +2. Install one of the following technologies for full pipeline reproducibility: Docker, Singularity, Podman, Shifter or Charliecloud. + > Almost all nf-core pipelines give you the option to use conda as well. However, some tools used in the skierfe pipeline do not have a conda package so we do not support conda at the moment. + +## Run genomic-medicine-sweden/skierfe with test data + +Before running the pipeline with your data, we recommend running it with the test dataset available in the `assets/test_data` folder provided with the pipeline. 
You do not need to download any of the data: part of it ships with the pipeline, and the rest is fetched automatically when you use the test profile. + +Run the following command, where YOURPROFILE is the package manager you installed on your machine. For example, `-profile test,docker` or `-profile test,singularity`: + +``` +nextflow run genomic-medicine-sweden/skierfe \ + -revision dev -profile test,<YOURPROFILE> \ + --outdir +``` + +> Check [nf-core/configs](https://github.com/nf-core/configs/tree/master/conf) to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use `-profile test,` in your command. This enables the appropriate package manager and sets the appropriate execution settings for your machine. +> NB: The order of profiles is important! They are loaded in sequence, so later profiles can overwrite earlier profiles. + +Running the command creates the following files in your working directory: + +``` +work # Directory containing the Nextflow working files + # Finished results in specified location (defined with --outdir) +.nextflow_log # Log file from Nextflow +# Other Nextflow hidden files, like history of pipeline logs. +``` + +> [!NOTE] +> The default CPU and memory configurations used in skierfe are written keeping the test profile (and dataset, which is tiny) in mind. You should override these values in configs to get it to work on larger datasets. Check the section `custom-configuration` below to know more about how to configure resources for your platform. + +### Updating the pipeline + +The above command downloads the pipeline from GitHub, caches it, and tests it on the test dataset. When you run the command again, it will fetch the pipeline from cache even if a more recent version of the pipeline is available. To make sure that you're running the latest version of the pipeline, update the cached version of the pipeline by including `-latest` in the command. 
+ +## Run genomic-medicine-sweden/skierfe with your data + +Running the pipeline involves three steps: -- Limit SNV calling to regions in BED file (`--bed`) -- If running dipcall, download a BED file with PAR regions ([hg38](https://raw.githubusercontent.com/lh3/dipcall/master/data/hs38.PAR.bed)) -- If running TRGT, download a BED file with tandem repeats ([TRGT](https://github.com/PacificBiosciences/trgt/tree/main/repeats)) matching your reference genome. -- If running SNV annotation, download [VEP cache](https://ftp.ensembl.org/pub/release-110/variation/vep/homo_sapiens_vep_110_GRCh38.tar.gz) and prepare a samplesheet with annotation databases ([`echtvar encode`](https://github.com/brentp/echtvar)): -- If running CNV-calling, expected CN regions for your reference genome can be downloaded from [HiFiCNV GitHub](https://github.com/PacificBiosciences/HiFiCNV/tree/main/data/excluded_regions) +1. Prepare a samplesheet +2. Gather all required references +3. Supply samplesheet and references, and run the command + +## Samplesheet input + +You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. + +```bash +--input '[path to samplesheet file]' +``` + +It has to be a comma-separated file with 7 columns, and a header row as shown in the examples below. +`file` can either be a gzipped FASTQ file or an aligned or unaligned BAM file (BAM files will be converted to FASTQ and aligned again). +`phenotype` is not used at the moment but still required, set it to `1`. If you don't have related samples, set `family_id`, `paternal_id` and `maternal_id` to something of your liking which is not a `sample` name. 
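As a rough illustration of the samplesheet rules described above, a pre-flight check could look like the Python sketch below. It is not part of skierfe: the function name `check_samplesheet` and the error messages are hypothetical, and the pipeline's own validation remains authoritative. The header and the accepted extensions are the ones used in this document's samplesheet examples and field descriptions.

```python
import csv
import io

# Column names as used in this document's samplesheet examples.
EXPECTED_HEADER = ["sample", "file", "family_id", "paternal_id",
                   "maternal_id", "sex", "phenotype"]
# Extensions this document lists for the `file` column.
ALLOWED_SUFFIXES = (".fastq.gz", ".fq.gz", ".bam")


def check_samplesheet(text):
    """Return a list of problems found in samplesheet text (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["header must be: " + ",".join(EXPECTED_HEADER)]
    problems = []
    for number, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(EXPECTED_HEADER):
            problems.append(
                f"line {number}: expected {len(EXPECTED_HEADER)} columns, got {len(row)}"
            )
            continue
        record = dict(zip(EXPECTED_HEADER, row))
        if " " in record["sample"]:
            problems.append(f"line {number}: sample name contains spaces")
        if not record["file"].endswith(ALLOWED_SUFFIXES):
            problems.append(f"line {number}: unsupported file extension")
    return problems
```

Calling `check_samplesheet(open("samplesheet.csv").read())` before launching would return an empty list for a sheet that passes these basic checks; it does not verify that the files exist or that the IDs are consistent.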
+ +```console +sample,file,family_id,paternal_id,maternal_id,sex,phenotype +HG002,/path/to/HG002.fastq.gz,FAM,HG003,HG004,1,1 +HG005,/path/to/HG005.bam,FAM,HG003,HG004,2,1 +``` + +| Fields | Description | +| ------------- | ---------------------------------------------------------------------------------------------------------------------------- | +| `sample` | Custom sample name, cannot contain spaces. | +| `file` | Absolute path to gzipped FASTQ or BAM file. File has to have the extension ".fastq.gz", ".fq.gz" or ".bam". | +| `family_id` | Family ID must be provided and cannot contain spaces. If no family ID is available, use the same ID as the sample. | +| `paternal_id` | Paternal ID must be provided and cannot contain spaces. If no paternal ID is available, use any ID not in sample column. | +| `maternal_id` | Maternal ID must be provided and cannot contain spaces. If no maternal ID is available, use any ID not in sample column. | +| `sex` | Sex (1=male; 2=female). | +| `phenotype` | Affected status of patient (0 = missing; 1=unaffected; 2=affected). | + +An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline. + +The typical command for running the pipeline is as follows: + +```bash +nextflow run genomic-medicine-sweden/skierfe -r dev -profile docker \ + --input samplesheet.csv \ + --preset \ + --outdir \ + --fasta \ + --skip_assembly_wf \ + --skip_repeat_wf \ + --skip_snv_annotation \ + --skip_cnv_calling +``` + +This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles. + +Note that the pipeline will create the following files in your working directory: + +``` +work # Directory containing the Nextflow working files + # Finished results in specified location (defined with --outdir) +.nextflow_log # Log file from Nextflow +# Other Nextflow hidden files, like history of pipeline logs. 
+``` + + + +## Reference files and parameters + +The typical command example above requires no additional files except the reference genome. Skierfe has the ability to skip certain parts of the pipeline by specifying one or more of the following parameters: + +| Parameter | Description | Type | Default | Required | Hidden | +| ---------------------------- | ------------------------------------------ | --------- | ------- | -------- | ------ | +| `skip_qc` | Skip QC | `boolean` | | | | +| `skip_short_variant_calling` | Skip short variant calling | `boolean` | | | | +| `skip_assembly_wf` | Skip assembly and downstream processes | `boolean` | | | | +| `skip_mapping_wf` | Skip read mapping and downstream processes | `boolean` | | | | +| `skip_methylation_wf` | Skip methylation workflow | `boolean` | | | | +| `skip_repeat_wf` | Skip repeat analysis workflow | `boolean` | | | | +| `skip_phasing_wf` | Skip phasing workflow | `boolean` | | | | +| `skip_snv_annotation` | Skip SNV annotation | `boolean` | | | | +| `skip_cnv_calling` | Skip CNV workflow | `boolean` | | | | + +However, certain workflows require additional files: + +If running without `--skip_assembly_wf`, download a BED file with PAR regions ([hg38](https://raw.githubusercontent.com/lh3/dipcall/master/data/hs38.PAR.bed)) + +> [!NOTE] +> Make sure chrY PAR is hard-masked in the reference. + +If running without `--skip_repeat_wf`, download a BED file with tandem repeats ([TRGT](https://github.com/PacificBiosciences/trgt/tree/main/repeats)) matching your reference genome. 
+ +If running without `--skip_snv_annotation`, download [VEP cache](https://ftp.ensembl.org/pub/release-110/variation/vep/homo_sapiens_vep_110_GRCh38.tar.gz) and prepare a samplesheet with annotation databases ([`echtvar encode`](https://github.com/brentp/echtvar)): `snp_dbs.csv` @@ -16,6 +166,8 @@ gnomad,/path/to/gnomad.v3.1.2.echtvar.popmax.v2.zip cadd,/path/to/cadd.v1.6.hg38.zip ``` +If running without `--skip_cnv_calling`, expected CN regions for your reference genome can be downloaded from [HiFiCNV GitHub](https://github.com/PacificBiosciences/HiFiCNV/tree/main/data/excluded_regions) to supply to `--hificnv_xy`, `--hificnv_xx` and `--hificnv_exclude`. + If you want to give more samples to filter variants against, for SVs - prepare a samplesheet with .snf files from Sniffles2: `extra_snfs.csv` @@ -28,6 +180,9 @@ HG01124,/path/to/HG01124_sniffles.snf and for SNVs - prepare a samplesheet with gVCF files from DeepVariant: +> [!NOTE] +> These must have been generated with the same version of the reference genome. + `extra_gvcfs.csv` ``` @@ -37,29 +192,15 @@ HG01124,/path/to/HG01124.g.vcf.gz HG01125,/path/to/HG01125.g.vcf.gz ``` -> **Note** If running dipcall, make sure chrY PAR is hard masked in reference. - ---> +#### Highlighted parameters: -# genomic-medicine-sweden/skierfe pipeline parameters +- You can choose to limit SNV calling to regions in a BED file (`--bed`). -Long-read variant calling pipeline +- By default SNV-calling is split into 13 parallel processes; change this by setting `--parallel_snv` to a different number. -## Workflow skip options +- By default the pipeline does not perform parallel alignment, but this can be enabled by setting `--split_fastq` to split alignment into N reads per process. 
-Options to skip various steps within the workflow - -| Parameter | Description | Type | Default | Required | Hidden | -| ---------------------------- | ------------------------------------------ | --------- | ------- | -------- | ------ | -| `skip_qc` | Skip QC | `boolean` | | | | -| `skip_short_variant_calling` | Skip short variant calling | `boolean` | | | | -| `skip_assembly_wf` | Skip assembly and downstream processes | `boolean` | | | | -| `skip_mapping_wf` | Skip read mapping and downstream processes | `boolean` | | | | -| `skip_methylation_wf` | Skip methylation workflow | `boolean` | | | | -| `skip_repeat_wf` | Skip repeat analysis workflow | `boolean` | | | | -| `skip_phasing_wf` | Skip phasing workflow | `boolean` | | | | -| `skip_snv_annotation` | Skip SNV annotation | `boolean` | | | | -| `skip_cnv_calling` | Skip CNV workflow | `boolean` | | | | +All parameters are listed below: ## Input/output options @@ -154,129 +295,25 @@ Different processes may need extra input files | `validationFailUnrecognisedParams` | Validation of parameters fails when an unrecognised parameter is found.
HelpBy default, when an u | | `validationLenientMode` | Validation of parameters in lenient more.
HelpAllows string values that are parseable as numbers or booleans | -> _Documentation of pipeline parameters is generated automatically from the pipeline schema and can no longer be found in markdown files._ - -## :warning: This is information is incomplete - -## Introduction - - - -## Samplesheet input - -You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below. - -```bash ---input '[path to samplesheet file]' -``` - -### Multiple runs of the same sample - -The `sample` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will concatenate the raw reads before performing any downstream analysis. Below is an example for the same sample sequenced across 3 lanes: - -```csv title="samplesheet.csv" -sample,fastq_1,fastq_2 -CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz -CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz -CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz -``` - -### Full samplesheet - -The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 3 columns to match those defined in the table below. - -A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where `TREATMENT_REP3` has been sequenced twice. 
- -```csv title="samplesheet.csv" -sample,fastq_1,fastq_2 -CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz -CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz -CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz -TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, -TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, -TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, -TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, -``` - -| Column | Description | -| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). | -| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | -| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | - -An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline. - -## Running the pipeline - -The typical command for running the pipeline is as follows: - -```bash -nextflow run genomic-medicine-sweden/skierfe --input ./samplesheet.csv --outdir ./results --genome GRCh37 -profile docker -``` - -This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles. - -Note that the pipeline will create the following files in your working directory: - -```bash -work # Directory containing the nextflow working files - # Finished results in specified location (defined with --outdir) -.nextflow_log # Log file from Nextflow -# Other nextflow hidden files, eg. history of pipeline runs and old logs. 
-``` - -If you wish to repeatedly use the same parameters for multiple runs, rather than specifying each flag in the command, you can specify these in a params file. - -Pipeline settings can be provided in a `yaml` or `json` file via `-params-file `. - -:::warning -Do not use `-c ` to specify parameters as this will result in errors. Custom config files specified with `-c` must only be used for [tuning process resource specifications](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources), other infrastructural tweaks (such as output directories), or module arguments (args). -::: - -The above pipeline run specified with a params file in yaml format: - -```bash -nextflow run genomic-medicine-sweden/skierfe -profile docker -params-file params.yaml -``` - -with `params.yaml` containing: - -```yaml -input: './samplesheet.csv' -outdir: './results/' -genome: 'GRCh37' -<...> -``` - -You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-co.re/launch). - -### Updating the pipeline - -When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline: - -```bash -nextflow pull genomic-medicine-sweden/skierfe -``` - ### Reproducibility It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. +However, there have been no releases of this pipeline. A workaround is to supply a commit ID, e.g. 
`-revision 6ff95ff`, in order to ensure that the same version of the pipeline is being executed. + + To further assist in reproducbility, you can use share and re-use [parameter files](#running-the-pipeline) to repeat pipeline runs with the same settings without having to write out a command with every single parameter. -:::tip -If you wish to share such profile (such as upload as supplementary material for academic publications), make sure to NOT include cluster specific paths to files, nor institutional specific profiles. -::: +> 💡 If you wish to share such profile (such as upload as supplementary material for academic publications), make sure to NOT include cluster specific paths to files, nor institutional specific profiles. ## Core Nextflow arguments -:::note -These options are part of Nextflow and use a _single_ hyphen (pipeline parameters use a double-hyphen). -::: +> **NB:** These options are part of Nextflow and use a _single_ hyphen (pipeline parameters use a double-hyphen). ### `-profile` @@ -284,10 +321,6 @@ Use this parameter to choose a configuration profile. Profiles can give configur Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Apptainer, Conda) - see below. -:::info -We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported. -::: - The pipeline also dynamically loads configurations from [https://github.com/nf-core/configs](https://github.com/nf-core/configs) when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the [nf-core/configs documentation](https://github.com/nf-core/configs#documentation). 
Note that multiple profiles can be loaded, for example: `-profile test,docker` - the order of arguments is important! @@ -310,8 +343,6 @@ If `-profile` is not specified, the pipeline will run locally and expect all sof - A generic configuration profile to be used with [Charliecloud](https://hpc.github.io/charliecloud/) - `apptainer` - A generic configuration profile to be used with [Apptainer](https://apptainer.org/) -- `conda` - - A generic configuration profile to be used with [Conda](https://conda.io/docs/). Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter, Charliecloud, or Apptainer. ### `-resume` @@ -333,9 +364,9 @@ To change the resource requests, please see the [max resources](https://nf-co.re ### Custom Containers -In some cases you may wish to change which container or conda environment a step of the pipeline uses for a particular tool. By default nf-core pipelines use containers and software from the [biocontainers](https://biocontainers.pro/) or [bioconda](https://bioconda.github.io/) projects. However in some cases the pipeline specified version maybe out of date. +In some cases you may wish to change which container a step of the pipeline uses for a particular tool. By default nf-core pipelines use containers and software from the [biocontainers](https://biocontainers.pro/) or [bioconda](https://bioconda.github.io/) projects. However in some cases the pipeline specified version maybe out of date. -To use a different container from the default container or conda environment specified in a pipeline, please see the [updating tool versions](https://nf-co.re/docs/usage/configuration#updating-tool-versions) section of the nf-core website. +To use a different container from the default container specified in a pipeline, please see the [updating tool versions](https://nf-co.re/docs/usage/configuration#updating-tool-versions) section of the nf-core website. 
### Custom Tool Arguments @@ -376,3 +407,24 @@ We recommend adding the following line to your environment to limit this (typica ```bash NXF_OPTS='-Xms1g -Xmx4g' ``` + +## Running the pipeline without internet access + +The pipeline and container images can be downloaded using [nf-core tools](https://nf-co.re/docs/usage/offline). For running offline, you of course have to make all the reference data available locally, and specify `--fasta`, etc., see [above](#reference-files-and-parameters). + +Contrary to the paragraph about [Nextflow](https://nf-co.re/docs/usage/offline#nextflow) on the page linked above, it is not possible to use the "-all" packaged version of Nextflow for this pipeline. The online version of Nextflow is required to support the necessary Nextflow plugins. Download instead the file called just `nextflow`. Nextflow will download its dependencies when it is run. Additionally, you need to download the nf-validation plugin explicitly: + +``` +./nextflow plugin install nf-validation +``` + +Now you can transfer the `nextflow` binary as well as its directory `$HOME/.nextflow` to the system without Internet access, and use it there. It is necessary to use an explicit version of `nf-validation` offline, or Nextflow will check for the most recent version online. Find the version of nf-validation you downloaded in `$HOME/.nextflow/plugins`, then specify this version for `nf-validation` in your configuration file: + +``` +plugins { + // Set the plugin version explicitly, otherwise nextflow will look for the newest version online. + id 'nf-validation@1.1.3' +} +``` + +This should go in your Nextflow configuration file, specified with `-c ` when running the pipeline.
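
The "find the version you downloaded" step above could be scripted roughly as follows. This Python sketch is only an illustration: the helper names are hypothetical, and it assumes that Nextflow stores each plugin in a directory named `<plugin id>-<version>` under the plugins folder, which you should confirm by listing `$HOME/.nextflow/plugins` yourself.

```python
from pathlib import Path


def installed_plugin_versions(plugin_id, plugins_dir):
    """List versions found as `<plugin_id>-<version>` directories.

    The directory naming is an assumption about Nextflow's plugin layout.
    Versions are sorted as plain strings, not as semantic versions.
    """
    prefix = plugin_id + "-"
    root = Path(plugins_dir)
    if not root.is_dir():
        return []
    return sorted(
        entry.name[len(prefix):]
        for entry in root.iterdir()
        if entry.is_dir() and entry.name.startswith(prefix)
    )


def plugins_block(plugin_id, version):
    """Render a `plugins` scope that pins one plugin version."""
    return "plugins {\n    id '" + plugin_id + "@" + version + "'\n}\n"
```

For example, `installed_plugin_versions("nf-validation", Path.home() / ".nextflow" / "plugins")` lists the candidate versions, and `plugins_block("nf-validation", version)` produces text in the same form as the configuration snippet above.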