
Commit

Fix formatting with prettier and black
fellen31 committed Mar 22, 2024
1 parent ee9a159 commit 7db0236
Showing 5 changed files with 109 additions and 97 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -6,7 +6,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

<!-- insertion marker -->
<!-- ## [0.1.0](https://github.com/fellen31/skierfe/releases/tag/0.1.0) - 2024-03-21 -->

### Added

21 changes: 15 additions & 6 deletions README.md
@@ -19,26 +19,31 @@ The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool
## Pipeline summary

##### QC

- Raw read QC ([`FastQC`](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
- Aligned read QC ([`cramino`](https://github.com/wdecoster/cramino))
- Depth information ([`mosdepth`](https://github.com/brentp/mosdepth))

##### Alignment & assembly

- Align reads to reference ([`minimap2`](https://github.com/lh3/minimap2))
- Assemble (trio-binned) haploid genomes (HiFi only) ([`hifiasm`](https://github.com/chhylp123/hifiasm))

##### Variant calling

- Short variant calling & joint genotyping of SNVs ([`deepvariant`](https://github.com/google/deepvariant) + [`GLNexus`](https://github.com/dnanexus-rnd/GLnexus))
- SV calling and joint genotyping ([`sniffles2`](https://github.com/fritzsedlazeck/Sniffles))
- Tandem repeat genotyping ([`TRGT`](https://github.com/PacificBiosciences/trgt/tree/main))
- Assembly-based variant calls (HiFi only) ([`dipcall`](https://github.com/lh3/dipcall))
- CNV calling (HiFi only) ([`HiFiCNV`](https://github.com/PacificBiosciences/HiFiCNV))

##### Phasing and methylation

- Phase and haplotag reads ([`whatshap`](https://github.com/whatshap/whatshap) + [`hiphase`](https://github.com/PacificBiosciences/HiPhase))
- Methylation pileups (Revio/ONT) ([`modkit`](https://github.com/nanoporetech/modkit))

##### Annotation - SNV

1. Annotate variants with database(s) of choice, e.g. [gnomAD](https://gnomad.broadinstitute.org) or [CADD](https://cadd.gs.washington.edu) ([`echtvar`](https://github.com/brentp/echtvar))
2. Annotate variants ([`VEP`](https://github.com/Ensembl/ensembl-vep))

@@ -56,25 +61,29 @@ The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool
1. Prepare a samplesheet with input data (gzipped FASTQ files):

`samplesheet.csv`

```
sample,file,family_id,paternal_id,maternal_id,sex,phenotype
HG002,/path/to/HG002.fastq.gz,FAM1,HG003,HG004,1,1
HG005,/path/to/HG005.fastq.gz,FAM1,HG003,HG004,2,1
```

2. Optional inputs:

- Limit SNV calling to the regions in a BED file (`--bed`)
- If running dipcall, download a BED file with PAR regions ([hg38](https://raw.githubusercontent.com/lh3/dipcall/master/data/hs38.PAR.bed))
- If running TRGT, download a BED file with tandem repeats ([TRGT](https://github.com/PacificBiosciences/trgt/tree/main/repeats)) matching your reference genome.
- If running SNV annotation, download the [VEP cache](https://ftp.ensembl.org/pub/release-110/variation/vep/homo_sapiens_vep_110_GRCh38.tar.gz) and prepare a samplesheet with annotation databases ([`echtvar encode`](https://github.com/brentp/echtvar)), as in `snp_dbs.csv` below:
- If running CNV calling, expected CN regions for your reference genome can be downloaded from the [HiFiCNV GitHub](https://github.com/PacificBiosciences/HiFiCNV/tree/main/data/excluded_regions)

`snp_dbs.csv`

```
sample,file
gnomad,/path/to/gnomad.v3.1.2.echtvar.popmax.v2.zip
cadd,/path/to/cadd.v1.6.hg38.zip
```
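
A minimal launch command tying these files together could look like the sketch below. This is an illustration, not a verbatim command: `--input` and `--outdir` follow standard nf-core conventions, `--bed` is the option described above, and the parameter that takes the annotation-database samplesheet is pipeline-specific, so check the pipeline's parameter documentation for the exact names.

```
# Hedged sketch - verify parameter names against the pipeline docs before use
nextflow run fellen31/skierfe -r dev \
    -profile docker \
    --input samplesheet.csv \
    --bed regions.bed \
    --outdir results
```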

<!---
- If you want additional samples to filter SVs against, prepare a samplesheet with .snf files from Sniffles2:
@@ -115,13 +124,13 @@ HG01125,/path/to/HG01125.g.vcf.gz

To run in an offline environment, download the pipeline using [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use):

```
nf-core download fellen31/skierfe -r dev
```

> - The pipeline comes with config profiles called `docker`, `singularity`, `podman`, `shifter` and `charliecloud`, which instruct the pipeline to use the named tool for software management. For example, `-profile test,docker`.
> - Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
> - If you are using `singularity`, please use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to download images first, before running the pipeline. Setting the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
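
As a sketch, an offline run with a pre-populated Singularity image cache could then look like this (all paths are placeholders):

```
# Assumes the pipeline and its images were fetched beforehand with nf-core download
export NXF_SINGULARITY_CACHEDIR=/path/to/singularity-images
nextflow run /path/to/downloaded/pipeline -profile singularity --input samplesheet.csv --outdir results
```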

> **Warning:**
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
1 change: 0 additions & 1 deletion assets/schema_snfs.json
@@ -19,7 +19,6 @@
"pattern": "^\\S+\\.snf$",
"errorMessage": "SNF file must be provided, cannot contain spaces and must have extension '.snf"
}

},
"required": ["sample", "file"]
}
44 changes: 25 additions & 19 deletions bin/split_bed_chunks.py
@@ -2,58 +2,64 @@

# Released under the MIT license.

# Split regions in BED into n files with approximately equal region sizes.
# A region is never split. 13 is a good number.

import sys
import pandas as pd
import string


def contains_whitespace_other_than_tab(filepath):
    with open(filepath, "r") as file:
        for line_number, line in enumerate(file, start=1):
            for char_number, char in enumerate(line, start=1):
                if char.isspace() and char != "\t" and char != "\n":
                    print(
                        f"Error: File contains whitespace characters other than tab at line {line_number}, position {char_number}."
                    )
                    sys.exit(1)


file_path = sys.argv[1]  # Input BED file (first command-line argument)

contains_whitespace_other_than_tab(file_path)
print("File does not contain whitespace characters other than tab and newline.")

chromosome_data = pd.read_csv(sys.argv[1], names=["chr", "start", "stop"], usecols=range(3), sep="\t")

chromosome_data["size"] = chromosome_data["stop"] - chromosome_data["start"]

# Number of bins
n = int(sys.argv[2])

# Sort chromosome data by size in descending order
sorted_data = chromosome_data.sort_values(by="size", ascending=False)

# Initialize empty bins as lists
bins = [[] for _ in range(n)]

# Allocate chromosomes to bins
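# Greedy balancing: each region, taken largest-first, is placed in the
# currently lightest bin. E.g. region sizes [8, 5, 4, 3] split over 2 bins
# end up as [8, 3] (total 11) and [5, 4] (total 9).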
for index, row in sorted_data.iterrows():
    # Find the bin with the smallest total region size so far
    min_bin = min(range(n), key=lambda i: sum(chrom["size"] for chrom in bins[i]))

    # Place the chromosome data in the selected bin
    bins[min_bin].append(row.to_dict())

# Create a DataFrame to store the results
result_df = pd.DataFrame(
    {
        "bin": [i + 1 for i in range(n) for _ in bins[i]],
        "chr": [chromosome["chr"] for bin_chromosomes in bins for chromosome in bin_chromosomes],
        "start": [int(chromosome["start"]) for bin_chromosomes in bins for chromosome in bin_chromosomes],
        "stop": [int(chromosome["stop"]) for bin_chromosomes in bins for chromosome in bin_chromosomes],
        "size": [int(chromosome["size"]) for bin_chromosomes in bins for chromosome in bin_chromosomes],
    }
)

# Order regions by size (largest first) within each bin
result_df = result_df.sort_values(by=["bin", "size"], ascending=[True, False])

# Write one BED file per bin; grouping by the scalar key "bin" keeps the group
# id a plain integer, so the output files are named 1.bed, 2.bed, ...
for bin_id, group in result_df.groupby("bin"):
    group[["chr", "start", "stop"]].to_csv(f"{bin_id}.bed", index=False, header=False, sep="\t")
