added some deleted docs
khoroshevskyi committed Oct 2, 2023
1 parent 7b57e3e commit 84e71bf
Showing 7 changed files with 327 additions and 0 deletions.
65 changes: 65 additions & 0 deletions docs/README.md
@@ -0,0 +1,65 @@
# BEDboss
bedboss is a command-line pipeline that standardizes and calculates statistics for genomic interval data, and enters the results into a BEDbase database.
It has 3 components:

1. bedmaker (`bedboss make`)
2. bedqc (`bedboss qc`)
3. bedstat (`bedboss stat`)

You may run all 3 pipelines together, or separately.

The pipelines are primarily intended to be run from the command line, but they are also available as Python functions, so users can call them from their own code.

----
## BEDboss consists of 3 main pipelines:

### bedmaker
bedmaker is a pipeline that converts supported file types into BED and bigBed formats. Currently supported formats:

- bedGraph
- bigBed
- bigWig
- wig

### bedqc
bedqc flags BED files for further evaluation to determine whether they should be included in downstream analysis.
Currently, it flags BED files that are larger than 2 GB, have over 5 million regions, and/or have a mean region width of less than 10 bp.
These thresholds can be changed through the bedqc function arguments.
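
The three checks described above can be sketched in plain Python. This is only an illustration of the idea, not the bedqc implementation: the function name `flag_bed_file`, its parameter names, and the reading of the size limit as 2 * 1024**3 bytes are all assumptions.

```python
import os
import tempfile

# Default thresholds from the text; the constant names are hypothetical.
MAX_FILE_SIZE = 2 * 1024**3   # size limit, read here as 2 * 1024**3 bytes (assumption)
MAX_REGION_NUMBER = 5_000_000
MIN_REGION_WIDTH = 10         # bp


def flag_bed_file(path, max_file_size=MAX_FILE_SIZE,
                  max_region_number=MAX_REGION_NUMBER,
                  min_region_width=MIN_REGION_WIDTH):
    """Return a list of reasons to flag a BED file (empty list if none apply)."""
    reasons = []
    if os.path.getsize(path) > max_file_size:
        reasons.append("file size exceeds limit")

    n_regions = 0
    total_width = 0
    with open(path) as handle:
        for line in handle:
            # Skip blank lines and common non-data lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            fields = line.split()
            n_regions += 1
            total_width += int(fields[2]) - int(fields[1])

    if n_regions > max_region_number:
        reasons.append("too many regions")
    if n_regions and total_width / n_regions < min_region_width:
        reasons.append("mean region width below limit")
    return reasons


# Tiny demonstration file with two narrow regions (widths 4 and 6 bp).
with tempfile.NamedTemporaryFile("w", suffix=".bed", delete=False) as tmp:
    tmp.write("chr1\t100\t104\nchr1\t200\t206\n")
demo_reasons = flag_bed_file(tmp.name)
os.remove(tmp.name)
print(demo_reasons)
```

Passing the thresholds as keyword arguments mirrors the statement that they can be changed through function arguments; the actual argument names in bedqc may differ.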

### bedstat

bedstat is a pipeline for obtaining statistics about BED files.

It produces the following BED file statistics:

- **GC content**. The average GC content of the region set.
- **Number of regions**. The total number of regions in the BED file.
- **Median TSS distance**. The median absolute distance to the Transcription Start Sites (TSS).
- **Mean region width**. The average region width of the region set.
- **Exon percentage**. The percentage of the regions in the BED file that are annotated as exon.
- **Intron percentage**. The percentage of the regions in the BED file that are annotated as intron.
- **Promoter prox percentage**. The percentage of the regions in the BED file that are annotated as promoter-prox.
- **Intergenic percentage**. The percentage of the regions in the BED file that are annotated as intergenic.
- **Promoter core percentage**. The percentage of the regions in the BED file that are annotated as promoter-core.
- **5' UTR percentage**. The percentage of the regions in the BED file that are annotated as 5'-UTR.
- **3' UTR percentage**. The percentage of the regions in the BED file that are annotated as 3'-UTR.
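
To make a few of these statistics concrete, here is a minimal Python sketch that computes the number of regions, the mean region width, and the GC content for a toy region set. It is not how bedstat computes them (the pipeline relies on an R script): the function names, the dictionary keys, the toy genome, and the choice to average per-region GC fractions without weighting are assumptions made for illustration.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases in a DNA sequence (case-insensitive)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def summarize_regions(regions, genome):
    """Compute a few region-set statistics.

    regions: list of (chrom, start, end) tuples, 0-based and end-exclusive as in BED.
    genome:  dict mapping chromosome name to sequence (toy stand-in for a real genome).
    """
    widths = [end - start for _, start, end in regions]
    gc_values = [gc_fraction(genome[chrom][start:end]) for chrom, start, end in regions]
    return {
        "number_of_regions": len(regions),
        "mean_region_width": sum(widths) / len(widths),
        # Unweighted mean of per-region GC fractions (an assumption).
        "gc_content": sum(gc_values) / len(gc_values),
    }


toy_genome = {"chr1": "ATGCGCATTAGGCCTA"}
toy_regions = [("chr1", 0, 4), ("chr1", 8, 16)]
stats = summarize_regions(toy_regions, toy_genome)
print(stats)
```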

# Additional information

## bedmaker

### Additional dependencies

- bedToBigBed: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed
- bigBedToBed: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigBedToBed
- bigWigToBedGraph: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph
- wigToBigWig: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig

## bedstat

### Additional dependencies
The `regionstat.R` script is used to calculate the BED file statistics, so the pipeline also depends on several R packages.

All dependencies are listed in the R helper script, which you can use to install the required packages:

- `Rscript scripts/installRdeps.R` ([How to install R dependencies](./how_to_install_r_dep.md))
7 changes: 7 additions & 0 deletions docs/changelog.md
@@ -0,0 +1,7 @@
# Changelog

This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format.

## [0.1.0a1] - 2023-08-02
### Added
- Initial alpha release
45 changes: 45 additions & 0 deletions docs/how_to_bedbase_config.md
@@ -0,0 +1,45 @@
# How to create a bedbase config file (for bedstat)

### The bedbase config file is a YAML file with 4 parts:
- path to output files
- database credentials
- server information
- remote info

### Example:
```yaml
path:
  pipeline_output_path: $BEDBOSS_OUTPUT_PATH # do not change it
  bedstat_dir: bedstat_output
  remote_url_base: null
  bedbuncher_dir: bedbucher_output
  # region2vec: "add/path/here"
  # vec2vec: "add/path/here"
database:
  host: $DB_HOST_URL
  port: $POSTGRES_PORT
  password: $POSTGRES_PASSWORD
  user: $POSTGRES_USER
  name: $POSTGRES_DB
  dialect: postgresql
  driver: psycopg2
server:
  host: 0.0.0.0
  port: 8000
qdrant:
  host: localhost
  port: 6333
  api_key: None
  collection: bedbase
remotes:
  http:
    prefix: https://data.bedbase.org/
    description: HTTP compatible path
  s3:
    prefix: s3://data.bedbase.org/
    description: S3 compatible path
```
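
The `$NAME` placeholders in the example suggest that values such as the database host and port are taken from environment variables. The sketch below shows one way such placeholders can be resolved, using only the Python standard library. How bedboss itself resolves them is not described here, so treat this as an illustration; the environment values are example data.

```python
import os

# A fragment of the config above, kept as plain text.
config_text = """\
database:
  host: $DB_HOST_URL
  port: $POSTGRES_PORT
  user: $POSTGRES_USER
"""

# Example values only; in practice these would be exported in the shell.
os.environ.update({
    "DB_HOST_URL": "localhost",
    "POSTGRES_PORT": "5432",
    "POSTGRES_USER": "postgres",
})

# os.path.expandvars replaces $NAME with the value of the environment variable
# NAME and leaves unknown names untouched.
expanded = os.path.expandvars(config_text)
print(expanded)
```

If a variable is not set, its placeholder stays in the text unchanged, which makes a missing export easy to spot.
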
### Download the example bedbase configuration file here: <a href="../bedbase_configuration.yaml" download>Example bedbase configuration file</a>
18 changes: 18 additions & 0 deletions docs/how_to_create_database.md
@@ -0,0 +1,18 @@
# How to create bedbase database

To run bedstat, bedbuncher, and bedembed, we need to create a PostgreSQL database.

We will run the PostgreSQL database in Docker.
If you don't have Docker installed, you can install it with `sudo apt-get update && sudo apt-get install docker-engine -y`.

Now, create a persistent volume to house the PostgreSQL data, then start a PostgreSQL container that mounts it:

```bash
docker volume create postgres-data
```

```bash
docker run -d --name bedbase-postgres -p 5432:5432 -e POSTGRES_PASSWORD=bedbasepassword -e POSTGRES_USER=postgres -e POSTGRES_DB=postgres -v postgres-data:/var/lib/postgresql/data postgres:13
```

The database container is now running, and we can run the pipelines.
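
The settings passed to `docker run` correspond to the `database` section of the bedbase config file (host, port, user, password, name, dialect, driver). As a rough illustration, the sketch below assembles them into a `dialect+driver://user:password@host:port/name` connection URL, the form commonly used by SQLAlchemy-style tools. Whether bedboss builds its connection this way is an assumption; the dictionary and the `build_db_url` helper are hypothetical, and `localhost` assumes the container runs on the same machine.

```python
from urllib.parse import quote_plus

# Values mirror the `docker run` command above; the dictionary is a stand-in
# for the `database` section of the config.
database = {
    "dialect": "postgresql",
    "driver": "psycopg2",
    "user": "postgres",
    "password": "bedbasepassword",
    "host": "localhost",
    "port": 5432,
    "name": "postgres",
}


def build_db_url(db):
    """Assemble a dialect+driver://user:password@host:port/name connection URL."""
    return (
        f"{db['dialect']}+{db['driver']}://"
        f"{quote_plus(db['user'])}:{quote_plus(db['password'])}"
        f"@{db['host']}:{db['port']}/{db['name']}"
    )


db_url = build_db_url(database)
print(db_url)
```

The user name and password are URL-escaped so that special characters in credentials do not break the URL.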
7 changes: 7 additions & 0 deletions docs/how_to_install_r_dep.md
@@ -0,0 +1,7 @@
# How to install R dependencies

1. Install R: https://cran.r-project.org/bin/linux/ubuntu/fullREADME.html
2. Download this script: <a href="../installRdeps.R" download>Install R dependencies</a>
3. Install dependencies by running this command in your terminal: `Rscript installRdeps.R`
4. Run `bash_requirements_test.sh` to check that everything was installed correctly (located in the test folder:
[Bash requirement tests](https://github.com/bedbase/bedboss/blob/68910f5142a95d92c27ef53eafb9c35599af2fbd/test/bash_requirements_test.sh))
25 changes: 25 additions & 0 deletions docs/installRdeps.R
@@ -0,0 +1,25 @@
.install_pkg = function(p, bioc=FALSE) {
    if(!require(package = p, character.only=TRUE)) {
        if(bioc) {
            BiocManager::install(pkgs = p)
        } else {
            install.packages(pkgs = p)
        }
    }
}

.install_pkg("R.utils")
.install_pkg("BiocManager")
.install_pkg("optparse")
.install_pkg("devtools")
.install_pkg("GenomicRanges", bioc=TRUE)
.install_pkg("GenomicFeatures", bioc=TRUE)
.install_pkg("ensembldb", bioc=TRUE)
.install_pkg("LOLA", bioc=TRUE)
.install_pkg("BSgenome", bioc=TRUE)
if(!require(package = "GenomicDistributions", character.only=TRUE)) {
    devtools::install_github("databio/GenomicDistributions")
}
if(!require(package = "GenomicDistributionsData", character.only=TRUE)) {
    install.packages("http://big.databio.org/GenomicDistributionsData/GenomicDistributionsData_0.0.1.tar.gz", repos=NULL)
}
160 changes: 160 additions & 0 deletions docs/usage.md
@@ -0,0 +1,160 @@
# Usage reference

BEDboss is a command-line warehouse of 3 pipelines for genomic interval files.

BEDboss includes bedmaker, bedqc, and bedstat. These pipelines can be run using the following positional arguments:

- `bedboss all`: Runs all pipelines in order: bedmaker -> bedqc -> bedstat

- `bedboss make`: Creates BED and bigBed files from other types of genomic interval files [bigwig|bedgraph|bed|bigbed|wig]

- `bedboss qc`: Runs quality control for a BED file (works only with BED files)

- `bedboss stat`: Runs statistics for BED and bigBed files.

Here you can see the command-line usage instructions for the main bedboss command and for each subcommand:

## `bedboss --help`
```console
version: 0.1.0
usage: bedboss [-h] [--version] {all,make,qc,stat} ...

Warehouse of pipelines for BED-like files: bedmaker, bedstat, and bedqc.

positional arguments:
{all,make,qc,stat}
all Run all bedboss pipelines and insert data into bedbase
make A pipeline to convert bed, bigbed, bigwig or bedgraph
files into bed and bigbed formats
qc Run quality control on bed file (bedqc)
stat A pipeline to read a file in BED format and produce
metadata in JSON format.

options:
-h, --help show this help message and exit
--version show program's version number and exit
```

## `bedboss all --help`
```console
usage: bedboss all [-h] -s SAMPLE_NAME -f INPUT_FILE -t INPUT_TYPE -o
OUTPUT_FOLDER -g GENOME [-r RFG_CONFIG]
[--chrom-sizes CHROM_SIZES] [-n NARROWPEAK]
[--standard-chrom] [--check-qc]
[--open-signal-matrix OPEN_SIGNAL_MATRIX] [--ensdb ENSDB]
--bedbase-config BEDBASE_CONFIG [-y SAMPLE_YAML]
[--no-db-commit] [--just-db-commit]

options:
-h, --help show this help message and exit
-s SAMPLE_NAME, --sample-name SAMPLE_NAME
name of the sample used to systematically build the
output name
-f INPUT_FILE, --input-file INPUT_FILE
Input file
-t INPUT_TYPE, --input-type INPUT_TYPE
Input type [required] options:
(bigwig|bedgraph|bed|bigbed|wig)
-o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
Output folder
-g GENOME, --genome GENOME
reference genome (assembly)
-r RFG_CONFIG, --rfg-config RFG_CONFIG
file path to the genome config file(refgenie)
--chrom-sizes CHROM_SIZES
a full path to the chrom.sizes required for the
bedtobigbed conversion
-n NARROWPEAK, --narrowpeak NARROWPEAK
whether the regions are narrow (transcription factor
implies narrow, histone mark implies broad peaks)
--standard-chrom Standardize chromosome names. Default: False
--check-qc Check quality control before processing data. Default:
True
--open-signal-matrix OPEN_SIGNAL_MATRIX
a full path to the openSignalMatrix required for the
tissue specificity plots
--ensdb ENSDB A full path to the ensdb gtf file required for genomes
not in GDdata
--bedbase-config BEDBASE_CONFIG
a path to the bedbase configuration file
-y SAMPLE_YAML, --sample-yaml SAMPLE_YAML
a yaml config file with sample attributes to pass on
more metadata into the database
--no-db-commit skip the JSON commit to the database
--just-db-commit just commit the JSON to the database
```
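
As an illustration of the help text above, the following Python sketch assembles a `bedboss all` invocation from the flags the usage line marks as required. Only the flags come from the help output; the sample name, file paths, and genome name are placeholder assumptions.

```python
import shlex

# Placeholder values; only the flag names are taken from the help text.
command = [
    "bedboss", "all",
    "--sample-name", "sample1",
    "--input-file", "data/sample1.bigWig",
    "--input-type", "bigwig",
    "-o", "output/",
    "--genome", "hg38",
    "--bedbase-config", "bedbase_configuration.yaml",
]

# Print the command as it would be typed in a shell.
print(shlex.join(command))

# To actually run it (requires bedboss to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Building the command as a list keeps each argument separate, so paths containing spaces do not need manual quoting.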

## `bedboss make --help`
```console
usage: bedboss make [-h] -f INPUT_FILE [-n NARROWPEAK] -t INPUT_TYPE -g GENOME
-r RFG_CONFIG -o OUTPUT_BED --output-bigbed OUTPUT_BIGBED
-s SAMPLE_NAME [--chrom-sizes CHROM_SIZES]
[--standard-chrom]

options:
-h, --help show this help message and exit
-f INPUT_FILE, --input-file INPUT_FILE
path to the input file
-n NARROWPEAK, --narrowpeak NARROWPEAK
whether the regions are narrow (transcription factor
implies narrow, histone mark implies broad peaks)
-t INPUT_TYPE, --input-type INPUT_TYPE
a bigwig or a bedgraph file that will be converted
into BED format
-g GENOME, --genome GENOME
reference genome
-r RFG_CONFIG, --rfg-config RFG_CONFIG
file path to the genome config file
-o OUTPUT_BED, --output-bed OUTPUT_BED
path to the output BED files
--output-bigbed OUTPUT_BIGBED
path to the folder of output bigBed files
-s SAMPLE_NAME, --sample-name SAMPLE_NAME
name of the sample used to systematically build the
output name
--chrom-sizes CHROM_SIZES
a full path to the chrom.sizes required for the
bedtobigbed conversion
--standard-chrom Standardize chromosome names. Default: False
```

## `bedboss qc --help`
```console
usage: bedboss qc [-h] --bedfile BEDFILE --outfolder OUTFOLDER

options:
-h, --help show this help message and exit
--bedfile BEDFILE a full path to bed file to process
--outfolder OUTFOLDER
a full path to output log folder.
```

## `bedboss stat --help`
```console
usage: bedboss stat [-h] --bedfile BEDFILE
[--open-signal-matrix OPEN_SIGNAL_MATRIX] [--ensdb ENSDB]
[--bigbed BIGBED] [--bedbase-config BEDBASE_CONFIG]
[-y SAMPLE_YAML] --genome GENOME_ASSEMBLY [--no-db-commit]
[--just-db-commit]

options:
-h, --help show this help message and exit
--bedfile BEDFILE a full path to bed file to process
--open-signal-matrix OPEN_SIGNAL_MATRIX
a full path to the openSignalMatrix required for the
tissue specificity plots
--ensdb ENSDB a full path to the ensdb gtf file required for genomes
not in GDdata
--bigbed BIGBED a full path to the bigbed files
--bedbase-config BEDBASE_CONFIG
a path to the bedbase configuration file
-y SAMPLE_YAML, --sample-yaml SAMPLE_YAML
a yaml config file with sample attributes to pass on
more metadata into the database
--genome GENOME_ASSEMBLY
genome assembly of the sample
--no-db-commit whether the JSON commit to the database should be
skipped
--just-db-commit whether just to commit the JSON to the database
```
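
The three subcommands can also be chained by hand in the order bedmaker -> bedqc -> bedstat. The sketch below only builds and prints the three command lines, using flags from the help texts above. All paths and names are placeholders, and it assumes that `-o` for `bedboss make` takes the path of the BED file that the later steps read, which the help text does not state explicitly.

```python
import shlex

bed_path = "output/sample1.bed.gz"  # placeholder; assumed to be written by `make`

steps = [
    # 1. Convert the input file to BED and bigBed.
    ["bedboss", "make",
     "-f", "data/sample1.bigWig", "-t", "bigwig",
     "-g", "hg38", "-r", "genome_config.yaml",
     "-o", bed_path, "--output-bigbed", "output/bigbed/",
     "-s", "sample1"],
    # 2. Run quality control on the resulting BED file.
    ["bedboss", "qc", "--bedfile", bed_path, "--outfolder", "output/qc_logs/"],
    # 3. Compute statistics for the BED file.
    ["bedboss", "stat", "--bedfile", bed_path, "--genome", "hg38",
     "--bedbase-config", "bedbase_configuration.yaml"],
]

for step in steps:
    print(shlex.join(step))
```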
