
Commit

Fix formatting with prettier and black
fellen31 committed Mar 22, 2024
1 parent ee9a159 commit 7db0236
Showing 5 changed files with 109 additions and 97 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -6,7 +6,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

<!-- insertion marker -->
<!-- ## [0.1.0](https://github.com/fellen31/skierfe/releases/tag/0.1.0) - 2024-03-21 -->

### Added

21 changes: 15 additions & 6 deletions README.md
@@ -19,26 +19,31 @@ The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool
## Pipeline summary

##### QC

- Raw read QC ([`FastQC`](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
- Aligned read QC ([`cramino`](https://github.com/wdecoster/cramino))
- Depth information ([`mosdepth`](https://github.com/brentp/mosdepth))

##### Alignment & assembly

- Align reads to reference ([`minimap2`](https://github.com/lh3/minimap2))
- Assemble (trio-binned) haploid genomes (HiFi only) ([`hifiasm`](https://github.com/chhylp123/hifiasm))

##### Variant calling

- Short variant calling & joint genotyping of SNVs ([`deepvariant`](https://github.com/google/deepvariant) + [`GLNexus`](https://github.com/dnanexus-rnd/GLnexus))
- SV calling and joint genotyping ([`sniffles2`](https://github.com/fritzsedlazeck/Sniffles))
- Tandem repeat genotyping ([`TRGT`](https://github.com/PacificBiosciences/trgt/tree/main))
- Assembly-based variant calls (HiFi only) ([`dipcall`](https://github.com/lh3/dipcall))
- CNV calling (HiFi only) ([`HiFiCNV`](https://github.com/PacificBiosciences/HiFiCNV))

##### Phasing and methylation

- Phase and haplotag reads ([`whatshap`](https://github.com/whatshap/whatshap) + [`hiphase`](https://github.com/PacificBiosciences/HiPhase))
- Methylation pileups (Revio/ONT) ([`modkit`](https://github.com/nanoporetech/modkit))

##### Annotation - SNV

1. Annotate variants with database(s) of choice, e.g. [gnomAD](https://gnomad.broadinstitute.org) or [CADD](https://cadd.gs.washington.edu) ([`echtvar`](https://github.com/brentp/echtvar))
2. Annotate variants ([`VEP`](https://github.com/Ensembl/ensembl-vep))

@@ -56,25 +61,29 @@ The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool
1. Prepare a samplesheet with input data (gzipped FASTQ files):

`samplesheet.csv`

```
sample,file,family_id,paternal_id,maternal_id,sex,phenotype
HG002,/path/to/HG002.fastq.gz,FAM1,HG003,HG004,1,1
HG005,/path/to/HG005.fastq.gz,FAM1,HG003,HG004,2,1
```

2. Optional inputs:

- Limit SNV calling to the regions in a BED file (`--bed`)
- If running dipcall, download a BED file with PAR regions ([hg38](https://raw.githubusercontent.com/lh3/dipcall/master/data/hs38.PAR.bed))
- If running TRGT, download a BED file with tandem repeats ([TRGT](https://github.com/PacificBiosciences/trgt/tree/main/repeats)) matching your reference genome.
- If running SNV annotation, download the [VEP cache](https://ftp.ensembl.org/pub/release-110/variation/vep/homo_sapiens_vep_110_GRCh38.tar.gz) and prepare a samplesheet with annotation databases ([`echtvar encode`](https://github.com/brentp/echtvar)), as in `snp_dbs.csv` below:
- If running CNV calling, expected CN regions for your reference genome can be downloaded from the [HiFiCNV GitHub](https://github.com/PacificBiosciences/HiFiCNV/tree/main/data/excluded_regions)

`snp_dbs.csv`

```
sample,file
gnomad,/path/to/gnomad.v3.1.2.echtvar.popmax.v2.zip
cadd,/path/to/cadd.v1.6.hg38.zip
```
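
A minimal launch command tying these files together could look like the sketch below. This is an illustration, not a verbatim command: `--input` and `--outdir` follow standard nf-core conventions, `--bed` is the option described above, and the parameter that takes the annotation-database samplesheet is pipeline-specific, so check the pipeline's parameter documentation for the exact names.

```
# Hedged sketch - verify parameter names against the pipeline docs before use
nextflow run fellen31/skierfe -r dev \
    -profile docker \
    --input samplesheet.csv \
    --bed regions.bed \
    --outdir results
```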

<!---
- If you want additional samples to filter SVs against, prepare a samplesheet with .snf files from Sniffles2:
@@ -115,13 +124,13 @@ HG01125,/path/to/HG01125.g.vcf.gz

To run in an offline environment, download the pipeline using [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use):

```
nf-core download fellen31/skierfe -r dev
```

> - The pipeline comes with config profiles called `docker`, `singularity`, `podman`, `shifter` and `charliecloud`, which instruct the pipeline to use the named tool for software management. For example, `-profile test,docker`.
> - Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
> - If you are using `singularity`, please use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to download images first, before running the pipeline. Setting the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
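
As a sketch, an offline run with a pre-populated Singularity image cache could then look like this (all paths are placeholders):

```
# Assumes the pipeline and its images were fetched beforehand with nf-core download
export NXF_SINGULARITY_CACHEDIR=/path/to/singularity-images
nextflow run /path/to/downloaded/pipeline -profile singularity --input samplesheet.csv --outdir results
```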

> **Warning:**
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
1 change: 0 additions & 1 deletion assets/schema_snfs.json
@@ -19,7 +19,6 @@
"pattern": "^\\S+\\.snf$",
"errorMessage": "SNF file must be provided, cannot contain spaces and must have extension '.snf"
}

},
"required": ["sample", "file"]
}
44 changes: 25 additions & 19 deletions bin/split_bed_chunks.py
@@ -2,58 +2,64 @@

# Released under the MIT license.

# Split regions in BED into n files with approximately equal region sizes.
# A region is never split. 13 is a good number.

import sys
import pandas as pd
import string


def contains_whitespace_other_than_tab(filepath):
    with open(filepath, "r") as file:
        for line_number, line in enumerate(file, start=1):
            for char_number, char in enumerate(line, start=1):
                if char.isspace() and char != "\t" and char != "\n":
                    print(
                        f"Error: File contains whitespace characters other than tab at line {line_number}, position {char_number}."
                    )
                    sys.exit(1)


file_path = sys.argv[1]  # Input BED file (first command-line argument)

contains_whitespace_other_than_tab(file_path)
print("File does not contain whitespace characters other than tab and newline.")

chromosome_data = pd.read_csv(sys.argv[1], names=["chr", "start", "stop"], usecols=range(3), sep="\t")

chromosome_data["size"] = chromosome_data["stop"] - chromosome_data["start"]

# Number of bins
n = int(sys.argv[2])

# Sort chromosome data by size in descending order
sorted_data = chromosome_data.sort_values(by="size", ascending=False)

# Initialize empty bins as lists
bins = [[] for _ in range(n)]

# Allocate chromosomes to bins
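# Greedy balancing: each region, taken largest-first, is placed in the
# currently lightest bin. E.g. region sizes [8, 5, 4, 3] split over 2 bins
# end up as [8, 3] (total 11) and [5, 4] (total 9).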
for index, row in sorted_data.iterrows():
    # Find the bin with the smallest total region size so far
    min_bin = min(range(n), key=lambda i: sum(chrom["size"] for chrom in bins[i]))

    # Place the chromosome data in the selected bin
    bins[min_bin].append(row.to_dict())

# Create a DataFrame to store the results
result_df = pd.DataFrame(
    {
        "bin": [i + 1 for i in range(n) for _ in bins[i]],
        "chr": [chromosome["chr"] for bin_chromosomes in bins for chromosome in bin_chromosomes],
        "start": [int(chromosome["start"]) for bin_chromosomes in bins for chromosome in bin_chromosomes],
        "stop": [int(chromosome["stop"]) for bin_chromosomes in bins for chromosome in bin_chromosomes],
        "size": [int(chromosome["size"]) for bin_chromosomes in bins for chromosome in bin_chromosomes],
    }
)

# Order regions by size (largest first) within each bin
result_df = result_df.sort_values(by=["bin", "size"], ascending=[True, False])

# Write one BED file per bin; grouping by the scalar key "bin" keeps the group
# id a plain integer, so the output files are named 1.bed, 2.bed, ...
for bin_id, group in result_df.groupby("bin"):
    group[["chr", "start", "stop"]].to_csv(f"{bin_id}.bed", index=False, header=False, sep="\t")
