1 parent
7b57e3e
commit 84e71bf
Showing 7 changed files with 327 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
# BEDboss
bedboss is a command-line pipeline that standardizes and calculates statistics for genomic interval data, and enters the results into a BEDbase database.
It has 3 components:

1) bedmaker (`bedboss make`); <br>
2) bedqc (`bedboss qc`); <br>
3) bedstat (`bedboss stat`).

You may run all 3 pipelines together, or separately.

The pipelines are primarily intended to be run from the command line, but they are also available as Python functions, so you can call them from your own code.

----
## BEDboss consists of 3 main pipelines:

### bedmaker
bedmaker is a pipeline that converts supported file types into BED and bigBed formats. Currently supported formats:

- bedGraph
- bigBed
- bigWig
- wig

### bedqc
bedqc flags BED files for further evaluation to determine whether they should be included in downstream analysis.
Currently, it flags BED files that are larger than 2G, have over 5 million regions, and/or have a mean region width of less than 10 bp.
These thresholds can be changed through the bedqc function arguments.
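
The flagging rule described above can be sketched in a few lines of Python. This is only an illustration of the three thresholds, not the bedqc implementation: the function names, the flag messages, and the reading of "2G" as 2 GiB are all assumptions.

```python
# Thresholds taken from the description above; the constant names are hypothetical.
MAX_FILE_SIZE_BYTES = 2 * 1024**3   # "larger than 2G" (assumed here to mean 2 GiB)
MAX_REGION_COUNT = 5_000_000        # "over 5 million regions"
MIN_MEAN_REGION_WIDTH = 10          # "mean region width of less than 10 bp"


def summarize_regions(lines):
    """Return (region_count, mean_width) for an iterable of BED-formatted lines."""
    count = 0
    total_width = 0
    for line in lines:
        # Skip blank lines and common header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        total_width += int(fields[2]) - int(fields[1])  # end - start
        count += 1
    mean_width = total_width / count if count else 0.0
    return count, mean_width


def qc_flags(file_size, region_count, mean_width,
             max_file_size=MAX_FILE_SIZE_BYTES,
             max_region_count=MAX_REGION_COUNT,
             min_mean_width=MIN_MEAN_REGION_WIDTH):
    """Return the list of reasons a BED file would be flagged (empty if none)."""
    flags = []
    if file_size > max_file_size:
        flags.append("file size above limit")
    if region_count > max_region_count:
        flags.append("region count above limit")
    if region_count > 0 and mean_width < min_mean_width:
        flags.append("mean region width below limit")
    return flags
```

In practice the file size would come from something like `os.path.getsize(path)` and the lines from the opened BED file; the keyword arguments mirror the idea that the thresholds are adjustable.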

### bedstat

bedstat is a pipeline for obtaining statistics about BED files.

It produces the following BED file statistics:

- **GC content**. The average GC content of the region set.
- **Number of regions**. The total number of regions in the BED file.
- **Median TSS distance**. The median absolute distance to the Transcription Start Sites (TSS).
- **Mean region width**. The average region width of the region set.
- **Exon percentage**. The percentage of the regions in the BED file that are annotated as exon.
- **Intron percentage**. The percentage of the regions in the BED file that are annotated as intron.
- **Promoter prox percentage**. The percentage of the regions in the BED file that are annotated as promoter-prox.
- **Intergenic percentage**. The percentage of the regions in the BED file that are annotated as intergenic.
- **Promoter core percentage**. The percentage of the regions in the BED file that are annotated as promoter-core.
- **5' UTR percentage**. The percentage of the regions in the BED file that are annotated as 5'-UTR.
- **3' UTR percentage**. The percentage of the regions in the BED file that are annotated as 3'-UTR.
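
To make one of these statistics concrete, here is a rough Python sketch of a median absolute TSS distance. It is not how bedstat computes the value (the pipeline relies on R packages for that); measuring from the region midpoint, the input shapes, and the function names are assumptions made for the example.

```python
from bisect import bisect_left
from statistics import median


def nearest_tss_distance(position, sorted_tss):
    """Absolute distance from a position to the closest TSS in a sorted list."""
    i = bisect_left(sorted_tss, position)
    # Only the neighbours just left and right of the insertion point can be closest.
    candidates = sorted_tss[max(i - 1, 0): i + 1]
    return min(abs(position - tss) for tss in candidates)


def median_tss_distance(regions, tss_by_chrom):
    """Median absolute distance from region midpoints to the nearest TSS.

    regions: iterable of (chrom, start, end) tuples.
    tss_by_chrom: dict mapping chromosome name to a sorted list of TSS positions.
    Regions on chromosomes without any TSS annotation are skipped.
    """
    distances = []
    for chrom, start, end in regions:
        tss = tss_by_chrom.get(chrom)
        if not tss:
            continue
        midpoint = (start + end) // 2
        distances.append(nearest_tss_distance(midpoint, tss))
    return median(distances) if distances else None
```

The binary search keeps each lookup logarithmic in the number of TSS positions, which matters once a region set reaches millions of entries.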

# Additional information

## bedmaker

### Additional dependencies

- bedToBigBed: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed
- bigBedToBed: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigBedToBed
- bigWigToBedGraph: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph
- wigToBigWig: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig
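
Before running bedmaker it can help to confirm that these executables are reachable. The following is a minimal sketch using only the Python standard library; it assumes the tools are expected on `PATH` under the names listed above, and the helper names are hypothetical.

```python
import shutil

# Tool names as listed in the dependency section above.
REQUIRED_TOOLS = ("bedToBigBed", "bigBedToBed", "bigWigToBedGraph", "wigToBigWig")


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that could not be found on PATH."""
    return [tool for tool, path in find_tools(tools).items() if path is None]
```

Calling `missing_tools()` returns an empty list when all four converters are installed.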

## bedstat

### Additional dependencies
The regionstat.R script is used to calculate the BED file statistics, so the pipeline also depends on several R packages.

All of them are listed in an R helper script, which you can use to install the required packages:

- `Rscript scripts/installRdeps.R` ([How to install R dependencies](./how_to_install_r_dep.md))
@@ -0,0 +1,7 @@ | ||
# Changelog

This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and follows the [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format.

## [0.1.0a1] - 2023-08-02
### Added
- Initial alpha release
@@ -0,0 +1,45 @@ | ||
# How to create a bedbase config file (for bedstat)

### The bedbase config file is a YAML file with 4 main parts:
- path to output files
- database credentials
- server information
- remote info

The example also includes a `qdrant` section with connection settings.

### Example:
```yaml
path:
  pipeline_output_path: $BEDBOSS_OUTPUT_PATH  # do not change it
  bedstat_dir: bedstat_output
  remote_url_base: null
  bedbuncher_dir: bedbucher_output
  # region2vec: "add/path/here"
  # vec2vec: "add/path/here"
database:
  host: $DB_HOST_URL
  port: $POSTGRES_PORT
  password: $POSTGRES_PASSWORD
  user: $POSTGRES_USER
  name: $POSTGRES_DB
  dialect: postgresql
  driver: psycopg2
server:
  host: 0.0.0.0
  port: 8000
qdrant:
  host: localhost
  port: 6333
  api_key: None
  collection: bedbase
remotes:
  http:
    prefix: https://data.bedbase.org/
    description: HTTP compatible path
  s3:
    prefix: s3://data.bedbase.org/
    description: S3 compatible path
```
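
The `database` section uses `$NAME` placeholders that are filled from environment variables. As a hedged illustration of how those values could combine into a connection string, the sketch below substitutes the placeholders and builds a SQLAlchemy-style URL (`dialect+driver://user:password@host:port/name`). Whether bedbase assembles its connection this way is an assumption, and the helper names are hypothetical.

```python
from string import Template
from urllib.parse import quote_plus


def expand(value, env):
    """Replace $NAME placeholders using `env`; raises KeyError if a name is missing."""
    return Template(str(value)).substitute(env)


def database_url(db, env):
    """Build a SQLAlchemy-style URL from the `database` section of the config."""
    # User and password are percent-encoded so special characters stay valid in a URL.
    user = quote_plus(expand(db["user"], env))
    password = quote_plus(expand(db["password"], env))
    host = expand(db["host"], env)
    port = expand(db["port"], env)
    name = expand(db["name"], env)
    return f"{db['dialect']}+{db['driver']}://{user}:{password}@{host}:{port}/{name}"
```

Passing `os.environ` as `env` would resolve the placeholders from the current shell environment; a missing variable then surfaces as a `KeyError` rather than a silently empty value.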
### Download example bedbase configuration file here: <a href="../bedbase_configuration.yaml" download>Example bedbase configuration file</a>.
@@ -0,0 +1,18 @@ | ||
# How to create a bedbase database

To run bedstat, bedbuncher and bedmbed, we need to create a PostgreSQL database.

We run PostgreSQL in Docker.
If you don't have Docker installed, you can install it with `sudo apt-get update && sudo apt-get install docker-engine -y`.

Now, create a persistent volume to house PostgreSQL data:

```bash
docker volume create postgres-data
```

Then start a PostgreSQL container that stores its data in that volume:

```bash
docker run -d --name bedbase-postgres -p 5432:5432 -e POSTGRES_PASSWORD=bedbasepassword -e POSTGRES_USER=postgres -e POSTGRES_DB=postgres -v postgres-data:/var/lib/postgresql/data postgres:13
```
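
The container may need a moment before it accepts connections. A small standard-library helper such as the one below could be used to wait for the published port (5432 in the command above); the function name and defaults are assumptions, and a successful TCP connection only shows that the port is open, not that PostgreSQL has finished initializing.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # The connection is closed again as soon as it has been established.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For the container started above, `wait_for_port("localhost", 5432)` would be the matching call.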

Now the PostgreSQL container is running, and we can run the pipelines.
@@ -0,0 +1,7 @@ | ||
# How to install R dependencies

1. Install R: https://cran.r-project.org/bin/linux/ubuntu/fullREADME.html
2. Download this script: <a href="../installRdeps.R" download>Install R dependencies</a>
3. Install the dependencies by running this command in your terminal: `Rscript installRdeps.R`
4. Run `bash_requirements_test.sh` to check that everything was installed correctly (located in the test folder: [Bash requirement tests](https://github.com/bedbase/bedboss/blob/68910f5142a95d92c27ef53eafb9c35599af2fbd/test/bash_requirements_test.sh))
@@ -0,0 +1,25 @@ | ||
# Install a package only if it cannot already be loaded.
# Use bioc=TRUE for packages distributed through Bioconductor.
.install_pkg = function(p, bioc=FALSE) {
  if(!require(package = p, character.only=TRUE)) {
    if(bioc) {
      BiocManager::install(pkgs = p)
    } else {
      install.packages(pkgs = p)
    }
  }
}

# CRAN packages (BiocManager must be available before the bioc=TRUE calls below)
.install_pkg("R.utils")
.install_pkg("BiocManager")
.install_pkg("optparse")
.install_pkg("devtools")
# Bioconductor packages
.install_pkg("GenomicRanges", bioc=TRUE)
.install_pkg("GenomicFeatures", bioc=TRUE)
.install_pkg("ensembldb", bioc=TRUE)
.install_pkg("LOLA", bioc=TRUE)
.install_pkg("BSgenome", bioc=TRUE)
# GenomicDistributions is installed from GitHub
if(!require(package = "GenomicDistributions", character.only=TRUE)) {
  devtools::install_github("databio/GenomicDistributions")
}
# GenomicDistributionsData is installed from a source tarball
if(!require(package = "GenomicDistributionsData", character.only=TRUE)) {
  install.packages("http://big.databio.org/GenomicDistributionsData/GenomicDistributionsData_0.0.1.tar.gz", repos=NULL)
}
@@ -0,0 +1,160 @@ | ||
# Usage reference

BEDboss is a command-line tool that bundles 3 pipelines for genomic interval files.

BEDboss includes bedmaker, bedqc, and bedstat. These pipelines can be run using the following subcommands:

- `bedboss all`: Runs all pipelines in order: bedmaker -> bedqc -> bedstat

- `bedboss make`: Creates BED and bigBed files from other types of genomic interval files [bigwig|bedgraph|bed|bigbed|wig]

- `bedboss qc`: Runs quality control on a BED file (works only with BED files)

- `bedboss stat`: Runs statistics for BED and bigBed files.

Here you can see the command-line usage instructions for the main bedboss command and for each subcommand:

## `bedboss --help`
```console
version: 0.1.0
usage: bedboss [-h] [--version] {all,make,qc,stat} ...

Warehouse of pipelines for BED-like files: bedmaker, bedstat, and bedqc.

positional arguments:
  {all,make,qc,stat}
    all               Run all bedboss pipelines and insert data into bedbase
    make              A pipeline to convert bed, bigbed, bigwig or bedgraph
                      files into bed and bigbed formats
    qc                Run quality control on bed file (bedqc)
    stat              A pipeline to read a file in BED format and produce
                      metadata in JSON format.

options:
  -h, --help          show this help message and exit
  --version           show program's version number and exit
```

## `bedboss all --help`
```console
usage: bedboss all [-h] -s SAMPLE_NAME -f INPUT_FILE -t INPUT_TYPE -o
                   OUTPUT_FOLDER -g GENOME [-r RFG_CONFIG]
                   [--chrom-sizes CHROM_SIZES] [-n NARROWPEAK]
                   [--standard-chrom] [--check-qc]
                   [--open-signal-matrix OPEN_SIGNAL_MATRIX] [--ensdb ENSDB]
                   --bedbase-config BEDBASE_CONFIG [-y SAMPLE_YAML]
                   [--no-db-commit] [--just-db-commit]

options:
  -h, --help            show this help message and exit
  -s SAMPLE_NAME, --sample-name SAMPLE_NAME
                        name of the sample used to systematically build the
                        output name
  -f INPUT_FILE, --input-file INPUT_FILE
                        Input file
  -t INPUT_TYPE, --input-type INPUT_TYPE
                        Input type [required] options:
                        (bigwig|bedgraph|bed|bigbed|wig)
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        Output folder
  -g GENOME, --genome GENOME
                        reference genome (assembly)
  -r RFG_CONFIG, --rfg-config RFG_CONFIG
                        file path to the genome config file(refgenie)
  --chrom-sizes CHROM_SIZES
                        a full path to the chrom.sizes required for the
                        bedtobigbed conversion
  -n NARROWPEAK, --narrowpeak NARROWPEAK
                        whether the regions are narrow (transcription factor
                        implies narrow, histone mark implies broad peaks)
  --standard-chrom      Standardize chromosome names. Default: False
  --check-qc            Check quality control before processing data. Default:
                        True
  --open-signal-matrix OPEN_SIGNAL_MATRIX
                        a full path to the openSignalMatrix required for the
                        tissue specificity plots
  --ensdb ENSDB         A full path to the ensdb gtf file required for genomes
                        not in GDdata
  --bedbase-config BEDBASE_CONFIG
                        a path to the bedbase configuration file
  -y SAMPLE_YAML, --sample-yaml SAMPLE_YAML
                        a yaml config file with sample attributes to pass on
                        more metadata into the database
  --no-db-commit        skip the JSON commit to the database
  --just-db-commit      just commit the JSON to the database
```
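
When `bedboss all` is launched from a script, the required options can be assembled into an argument list first. The sketch below only builds and prints the command; it uses the long option names exactly as they appear in the help output above (including `--output_folder` with an underscore), while the function name and the example file paths are hypothetical.

```python
import shlex


def bedboss_all_command(sample_name, input_file, input_type, output_folder,
                        genome, bedbase_config, no_db_commit=False):
    """Assemble the argument list for `bedboss all` from its required options."""
    command = [
        "bedboss", "all",
        "--sample-name", sample_name,
        "--input-file", input_file,
        "--input-type", input_type,
        "--output_folder", output_folder,  # spelled with an underscore in the help text
        "--genome", genome,
        "--bedbase-config", bedbase_config,
    ]
    if no_db_commit:
        command.append("--no-db-commit")
    return command


# Hypothetical sample name and paths, for illustration only.
example = bedboss_all_command(
    sample_name="sample1",
    input_file="data/sample1.bigWig",
    input_type="bigwig",
    output_folder="output",
    genome="hg38",
    bedbase_config="bedbase_configuration.yaml",
    no_db_commit=True,
)
print(shlex.join(example))
```

The resulting list can be handed to `subprocess.run(example, check=True)`; passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.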

## `bedboss make --help`
```console
usage: bedboss make [-h] -f INPUT_FILE [-n NARROWPEAK] -t INPUT_TYPE -g GENOME
                    -r RFG_CONFIG -o OUTPUT_BED --output-bigbed OUTPUT_BIGBED
                    -s SAMPLE_NAME [--chrom-sizes CHROM_SIZES]
                    [--standard-chrom]

options:
  -h, --help            show this help message and exit
  -f INPUT_FILE, --input-file INPUT_FILE
                        path to the input file
  -n NARROWPEAK, --narrowpeak NARROWPEAK
                        whether the regions are narrow (transcription factor
                        implies narrow, histone mark implies broad peaks)
  -t INPUT_TYPE, --input-type INPUT_TYPE
                        a bigwig or a bedgraph file that will be converted
                        into BED format
  -g GENOME, --genome GENOME
                        reference genome
  -r RFG_CONFIG, --rfg-config RFG_CONFIG
                        file path to the genome config file
  -o OUTPUT_BED, --output-bed OUTPUT_BED
                        path to the output BED files
  --output-bigbed OUTPUT_BIGBED
                        path to the folder of output bigBed files
  -s SAMPLE_NAME, --sample-name SAMPLE_NAME
                        name of the sample used to systematically build the
                        output name
  --chrom-sizes CHROM_SIZES
                        a full path to the chrom.sizes required for the
                        bedtobigbed conversion
  --standard-chrom      Standardize chromosome names. Default: False
```

## `bedboss qc --help`
```console
usage: bedboss qc [-h] --bedfile BEDFILE --outfolder OUTFOLDER

options:
  -h, --help            show this help message and exit
  --bedfile BEDFILE     a full path to bed file to process
  --outfolder OUTFOLDER
                        a full path to output log folder.
```

## `bedboss stat --help`
```console
usage: bedboss stat [-h] --bedfile BEDFILE
                    [--open-signal-matrix OPEN_SIGNAL_MATRIX] [--ensdb ENSDB]
                    [--bigbed BIGBED] [--bedbase-config BEDBASE_CONFIG]
                    [-y SAMPLE_YAML] --genome GENOME_ASSEMBLY [--no-db-commit]
                    [--just-db-commit]

options:
  -h, --help            show this help message and exit
  --bedfile BEDFILE     a full path to bed file to process
  --open-signal-matrix OPEN_SIGNAL_MATRIX
                        a full path to the openSignalMatrix required for the
                        tissue specificity plots
  --ensdb ENSDB         a full path to the ensdb gtf file required for genomes
                        not in GDdata
  --bigbed BIGBED       a full path to the bigbed files
  --bedbase-config BEDBASE_CONFIG
                        a path to the bedbase configuration file
  -y SAMPLE_YAML, --sample-yaml SAMPLE_YAML
                        a yaml config file with sample attributes to pass on
                        more metadata into the database
  --genome GENOME_ASSEMBLY
                        genome assembly of the sample
  --no-db-commit        whether the JSON commit to the database should be
                        skipped
  --just-db-commit      whether just to commit the JSON to the database
```
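
The three subcommands can also be chained by hand in the same order that `bedboss all` uses (make, then qc, then stat). The sketch below builds one argument list per stage from the options shown in the help outputs above and runs them in sequence, stopping at the first failure. It assumes that the path given to `--output-bed` is the BED file later passed to `--bedfile`, which the help text does not confirm; the function names are hypothetical.

```python
import subprocess


def staged_commands(sample_name, input_file, input_type, genome, rfg_config,
                    output_bed, output_bigbed, qc_outfolder, bedbase_config):
    """Return the make, qc and stat argument lists, in execution order."""
    make = [
        "bedboss", "make",
        "--input-file", input_file,
        "--input-type", input_type,
        "--genome", genome,
        "--rfg-config", rfg_config,
        "--output-bed", output_bed,
        "--output-bigbed", output_bigbed,
        "--sample-name", sample_name,
    ]
    # Assumption: qc and stat read the BED file written by the make stage.
    qc = ["bedboss", "qc", "--bedfile", output_bed, "--outfolder", qc_outfolder]
    stat = [
        "bedboss", "stat",
        "--bedfile", output_bed,
        "--genome", genome,
        "--bedbase-config", bedbase_config,
    ]
    return [make, qc, stat]


def run_stages(commands, runner=subprocess.run):
    """Run each command in order; check=True makes a failing stage raise."""
    for command in commands:
        runner(command, check=True)
```

Keeping the runner injectable makes it possible to dry-run the sequence (for example by passing a function that only prints each command) before touching any data.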
|