added some deleted docs

databio · Oct 2, 2023 · 84e71bf · 84e71bf
1 parent 7b57e3e

Showing 7 changed files with 327 additions and 0 deletions.
diff --git a/docs/README.md b/docs/README.md
@@ -0,0 +1,65 @@
+# BEDboss
+bedboss is a command-line pipeline that standardizes and calculates statistics for genomic interval data, and enters the results into a BEDbase database. 
+It has 3 components: 
+
+1) bedmaker (`bedboss make`);<br>
+2) bedqc (`bedboss qc`);<br>
+3) bedstat (`bedboss stat`).
+
+You may run all 3 pipelines together, or separately.
+
+The pipelines are mainly intended to be run from the command line, but they are also available as Python functions, so users can call them from their own code.
+
+----
+## BEDboss consists of 3 main pipelines:
+
+### bedmaker
+bedmaker is a pipeline that converts supported file types into BED and bigBed formats. Currently supported formats:
+
+- bedGraph
+- bigBed
+- bigWig
+- wig
+
+### bedqc
+bedqc flags BED files for further evaluation to determine whether they should be included in downstream analysis.
+Currently, it flags BED files that are larger than 2 GB, have over 5 million regions, and/or have a mean region width of less than 10 bp.
+These thresholds can be changed through the bedqc function arguments.
+
+### bedstat
+
+bedstat is a pipeline for obtaining statistics about BED files.
+
+It produces the following BED file statistics:
+
+- **GC content**. The average GC content of the region set.
+- **Number of regions**. The total number of regions in the BED file.
+- **Median TSS distance**. The median absolute distance to the Transcription Start Sites (TSS).
+- **Mean region width**. The average region width of the region set.
+- **Exon percentage**. The percentage of the regions in the BED file that are annotated as exon.
+- **Intron percentage**. The percentage of the regions in the BED file that are annotated as intron.
+- **Promoter prox percentage**. The percentage of the regions in the BED file that are annotated as promoter-prox.
+- **Intergenic percentage**. The percentage of the regions in the BED file that are annotated as intergenic.
+- **Promoter core percentage**. The percentage of the regions in the BED file that are annotated as promoter-core.
+- **5' UTR percentage**. The percentage of the regions in the BED file that are annotated as 5'-UTR.
+- **3' UTR percentage**. The percentage of the regions in the BED file that are annotated as 3'-UTR.
+
+# Additional information
+
+## bedmaker
+
+### Additional dependencies
+
+- bedToBigBed: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed
+- bigBedToBed: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigBedToBed
+- bigWigToBedGraph: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph
+- wigToBigWig: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig
+
+## bedstat
+
+### Additional dependencies
+The regionstat.R script is used to calculate the BED file statistics, so the pipeline also depends on several R packages.
+
+All of these dependencies are listed in the R helper script, which you can use to install the required packages:
+
+- `Rscript scripts/installRdeps.R` ([How to install R dependencies](./how_to_install_r_dep.md))
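The bedqc checks described in `docs/README.md` above can be illustrated with a minimal Python sketch. The function name `flag_bed_file`, its parameters, and the reading of the 2 GB limit as 2 × 1024³ bytes are assumptions made for illustration; this is not bedboss's actual implementation or API.

```python
import os
import tempfile

# Defaults mirror the thresholds quoted in the README; all names are hypothetical.
MAX_FILE_SIZE = 2 * 1024**3   # assumption: the size limit read as 2 GiB
MAX_REGIONS = 5_000_000
MIN_MEAN_WIDTH = 10           # bp


def flag_bed_file(path, max_file_size=MAX_FILE_SIZE,
                  max_regions=MAX_REGIONS, min_mean_width=MIN_MEAN_WIDTH):
    """Return the reasons a BED file would be flagged (empty list if none)."""
    reasons = []
    if os.path.getsize(path) > max_file_size:
        reasons.append("file size")

    n_regions = 0
    total_width = 0
    with open(path) as handle:
        for line in handle:
            # Skip blank lines and common header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            fields = line.split()
            start, end = int(fields[1]), int(fields[2])
            n_regions += 1
            total_width += end - start

    if n_regions > max_regions:
        reasons.append("region count")
    if n_regions and total_width / n_regions < min_mean_width:
        reasons.append("mean region width")
    return reasons


if __name__ == "__main__":
    # Two short regions (5 bp and 4 bp), so the mean width is below 10 bp.
    with tempfile.NamedTemporaryFile("w", suffix=".bed", delete=False) as tmp:
        tmp.write("chr1\t100\t105\nchr1\t200\t204\n")
    print(flag_bed_file(tmp.name))
    os.remove(tmp.name)
```

Passing the thresholds as keyword arguments mirrors the README's note that they can be changed through function arguments.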
diff --git a/docs/changelog.md b/docs/changelog.md
@@ -0,0 +1,7 @@
+# Changelog
+
+This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format.
+
+## [0.1.0a1] - 2023-08-02
+### Added
+- Initial alpha release
diff --git a/docs/how_to_bedbase_config.md b/docs/how_to_bedbase_config.md
@@ -0,0 +1,45 @@
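+# How to create a bedbase config file (for bedstat)
+
+### The bedbase config file is a YAML file with 5 sections (`path`, `database`, `server`, `qdrant`, `remotes`):
+- `path`: paths to output files
+- `database`: database credentials
+- `server` and `qdrant`: server and Qdrant connection information
+- `remotes`: remote info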
+
+### Example:
+```yaml
+path:
+  pipeline_output_path: $BEDBOSS_OUTPUT_PATH  # do not change it
+  bedstat_dir: bedstat_output
+  remote_url_base: null
+  bedbuncher_dir: bedbucher_output
+  #  region2vec: "add/path/here"
+  #  vec2vec: "add/path/here"
+database:
+  host: $DB_HOST_URL
+  port: $POSTGRES_PORT
+  password: $POSTGRES_PASSWORD
+  user: $POSTGRES_USER
+  name: $POSTGRES_DB
+  dialect: postgresql
+  driver: psycopg2
+server:
+  host: 0.0.0.0
+  port: 8000
+qdrant:
+  host: localhost
+  port: 6333
+  api_key: None
+  collection: bedbase
+remotes:
+  http:
+    prefix: https://data.bedbase.org/
+    description: HTTP compatible path
+  s3:
+    prefix: s3://data.bedbase.org/
+    description: S3 compatible path
+```
+
+### Download an example bedbase configuration file here: <a href="../bedbase_configuration.yaml" download>Example bedbase configuration file</a>
+
+.
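The `database` section of the example config relies on environment variables such as `$POSTGRES_USER`. The following Python sketch shows one way such placeholders could be resolved and combined into a SQLAlchemy-style connection URL (`dialect+driver://user:password@host:port/name`). The helper names and the example environment values are hypothetical, and how bedbase itself resolves these values is not stated in the source.

```python
import os


def expand_database_section(section, env=None):
    """Replace $VARIABLE placeholders in a config section with environment values."""
    env = os.environ if env is None else env
    expanded = {}
    for key, value in section.items():
        if isinstance(value, str) and value.startswith("$"):
            # Unset variables are left untouched so the problem stays visible.
            value = env.get(value[1:], value)
        expanded[key] = value
    return expanded


def database_url(db):
    """Build a SQLAlchemy-style URL from an expanded `database` section."""
    return (f"{db['dialect']}+{db['driver']}://"
            f"{db['user']}:{db['password']}@{db['host']}:{db['port']}/{db['name']}")


# The `database` section of the example config, written as a Python dict.
DATABASE_SECTION = {
    "host": "$DB_HOST_URL",
    "port": "$POSTGRES_PORT",
    "password": "$POSTGRES_PASSWORD",
    "user": "$POSTGRES_USER",
    "name": "$POSTGRES_DB",
    "dialect": "postgresql",
    "driver": "psycopg2",
}

if __name__ == "__main__":
    # Hypothetical values for a local test database.
    example_env = {
        "DB_HOST_URL": "localhost",
        "POSTGRES_PORT": "5432",
        "POSTGRES_PASSWORD": "bedbasepassword",
        "POSTGRES_USER": "postgres",
        "POSTGRES_DB": "postgres",
    }
    print(database_url(expand_database_section(DATABASE_SECTION, example_env)))
```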
diff --git a/docs/how_to_create_database.md b/docs/how_to_create_database.md
@@ -0,0 +1,18 @@
+# How to create a bedbase database
+
+To run bedstat, bedbuncher and bedmbed, we need to create a PostgreSQL database.
+
+Here we initialize the PostgreSQL database in Docker.
+If you don't have Docker installed, you can install it with `sudo apt-get update && sudo apt-get install docker-engine -y`.
+
+Now, create a persistent volume to house PostgreSQL data:
+
+```bash
+docker volume create postgres-data
+```
+
+```bash
+docker run -d --name bedbase-postgres -p 5432:5432 -e POSTGRES_PASSWORD=bedbasepassword -e POSTGRES_USER=postgres -e POSTGRES_DB=postgres -v postgres-data:/var/lib/postgresql/data postgres:13
+```
+
+The PostgreSQL container is now running, and the pipelines can be run.
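Before running the pipelines, it can be useful to confirm that something is listening on the port published by the `docker run` command above. The sketch below uses only the Python standard library. The function name `port_is_open` is hypothetical, and an open port only shows that the port accepts TCP connections, not that the database is ready for queries.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 5432 is the host port mapped by `-p 5432:5432` in the command above.
    print("PostgreSQL port reachable:", port_is_open("localhost", 5432))
```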
diff --git a/docs/how_to_install_r_dep.md b/docs/how_to_install_r_dep.md
@@ -0,0 +1,7 @@
+# How to install R dependencies
+
+1. Install R: https://cran.r-project.org/bin/linux/ubuntu/fullREADME.html
+2. Download this script: <a href="../installRdeps.R" download>Install R dependencies</a>
+3. Install dependencies by running this command in your terminal: `Rscript installRdeps.R`
+4. Run `bash_requirements_test.sh` to check that everything was installed correctly (located in the test folder: 
+[Bash requirement tests](https://github.com/bedbase/bedboss/blob/68910f5142a95d92c27ef53eafb9c35599af2fbd/test/bash_requirements_test.sh)).
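As a quick complement to step 4, the following Python sketch reports which external tools are missing from `PATH`. It covers `Rscript` and the four UCSC utilities that the README lists as bedmaker dependencies. The function name `missing_tools` is hypothetical, and the check is not a substitute for `bash_requirements_test.sh`, whose exact checks are not shown here.

```python
import shutil

# Rscript plus the UCSC utilities listed in the README as bedmaker dependencies.
REQUIRED_TOOLS = [
    "Rscript",
    "bedToBigBed",
    "bigBedToBed",
    "bigWigToBedGraph",
    "wigToBigWig",
]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```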
diff --git a/docs/installRdeps.R b/docs/installRdeps.R
@@ -0,0 +1,25 @@
+.install_pkg = function(p, bioc=FALSE) {  # install package `p` only if it cannot be loaded
+    if(!require(package = p, character.only=TRUE)) {
+        if(bioc) {
+            BiocManager::install(pkgs = p)  # Bioconductor package
+        } else {
+            install.packages(pkgs = p)  # CRAN package
+        }
+    }
+}
+
+.install_pkg("R.utils")
+.install_pkg("BiocManager")
+.install_pkg("optparse")
+.install_pkg("devtools")
+.install_pkg("GenomicRanges", bioc=TRUE)
+.install_pkg("GenomicFeatures", bioc=TRUE)
+.install_pkg("ensembldb", bioc=TRUE)
+.install_pkg("LOLA", bioc=TRUE)
+.install_pkg("BSgenome", bioc=TRUE)
+if(!require(package = "GenomicDistributions", character.only=TRUE)) {
+    devtools::install_github("databio/GenomicDistributions")
+}
+if(!require(package = "GenomicDistributionsData", character.only=TRUE)) {
+    install.packages("http://big.databio.org/GenomicDistributionsData/GenomicDistributionsData_0.0.1.tar.gz", repos=NULL)
+}
diff --git a/docs/usage.md b/docs/usage.md
@@ -0,0 +1,160 @@
+# Usage reference
+
+BEDboss is a command-line warehouse of 3 pipelines for genomic interval files.
+
+BEDboss includes bedmaker, bedqc, and bedstat. These pipelines can be run using the following positional arguments:
+
+- `bedboss all`: Runs all pipelines in order: bedmaker -> bedqc -> bedstat
+
+- `bedboss make`: Creates BED and bigBed files from other types of genomic interval files [bigwig|bedgraph|bed|bigbed|wig]
+
+- `bedboss qc`: Runs quality control on a BED file (works only with BED files)
+
+- `bedboss stat`: Runs statistics for BED and bigBed files.
+
+Here you can see the command-line usage instructions for the main bedboss command and for each subcommand:
+
+## `bedboss --help`
+```console
+version: 0.1.0
+usage: bedboss [-h] [--version] {all,make,qc,stat} ...
+
+Warehouse of pipelines for BED-like files: bedmaker, bedstat, and bedqc.
+
+positional arguments:
+  {all,make,qc,stat}
+    all               Run all bedboss pipelines and insert data into bedbase
+    make              A pipeline to convert bed, bigbed, bigwig or bedgraph
+                      files into bed and bigbed formats
+    qc                Run quality control on bed file (bedqc)
+    stat              A pipeline to read a file in BED format and produce
+                      metadata in JSON format.
+
+options:
+  -h, --help          show this help message and exit
+  --version           show program's version number and exit
+```
+
+## `bedboss all --help`
+```console
+usage: bedboss all [-h] -s SAMPLE_NAME -f INPUT_FILE -t INPUT_TYPE -o
+                   OUTPUT_FOLDER -g GENOME [-r RFG_CONFIG]
+                   [--chrom-sizes CHROM_SIZES] [-n NARROWPEAK]
+                   [--standard-chrom] [--check-qc]
+                   [--open-signal-matrix OPEN_SIGNAL_MATRIX] [--ensdb ENSDB]
+                   --bedbase-config BEDBASE_CONFIG [-y SAMPLE_YAML]
+                   [--no-db-commit] [--just-db-commit]
+
+options:
+  -h, --help            show this help message and exit
+  -s SAMPLE_NAME, --sample-name SAMPLE_NAME
+                        name of the sample used to systematically build the
+                        output name
+  -f INPUT_FILE, --input-file INPUT_FILE
+                        Input file
+  -t INPUT_TYPE, --input-type INPUT_TYPE
+                        Input type [required] options:
+                        (bigwig|bedgraph|bed|bigbed|wig)
+  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
+                        Output folder
+  -g GENOME, --genome GENOME
+                        reference genome (assembly)
+  -r RFG_CONFIG, --rfg-config RFG_CONFIG
+                        file path to the genome config file(refgenie)
+  --chrom-sizes CHROM_SIZES
+                        a full path to the chrom.sizes required for the
+                        bedtobigbed conversion
+  -n NARROWPEAK, --narrowpeak NARROWPEAK
+                        whether the regions are narrow (transcription factor
+                        implies narrow, histone mark implies broad peaks)
+  --standard-chrom      Standardize chromosome names. Default: False
+  --check-qc            Check quality control before processing data. Default:
+                        True
+  --open-signal-matrix OPEN_SIGNAL_MATRIX
+                        a full path to the openSignalMatrix required for the
+                        tissue specificity plots
+  --ensdb ENSDB         A full path to the ensdb gtf file required for genomes
+                        not in GDdata
+  --bedbase-config BEDBASE_CONFIG
+                        a path to the bedbase configuration file
+  -y SAMPLE_YAML, --sample-yaml SAMPLE_YAML
+                        a yaml config file with sample attributes to pass on
+                        more metadata into the database
+  --no-db-commit        skip the JSON commit to the database
+  --just-db-commit      just commit the JSON to the database
+```
+
+## `bedboss make --help`
+```console
+usage: bedboss make [-h] -f INPUT_FILE [-n NARROWPEAK] -t INPUT_TYPE -g GENOME
+                    -r RFG_CONFIG -o OUTPUT_BED --output-bigbed OUTPUT_BIGBED
+                    -s SAMPLE_NAME [--chrom-sizes CHROM_SIZES]
+                    [--standard-chrom]
+
+options:
+  -h, --help            show this help message and exit
+  -f INPUT_FILE, --input-file INPUT_FILE
+                        path to the input file
+  -n NARROWPEAK, --narrowpeak NARROWPEAK
+                        whether the regions are narrow (transcription factor
+                        implies narrow, histone mark implies broad peaks)
+  -t INPUT_TYPE, --input-type INPUT_TYPE
+                        a bigwig or a bedgraph file that will be converted
+                        into BED format
+  -g GENOME, --genome GENOME
+                        reference genome
+  -r RFG_CONFIG, --rfg-config RFG_CONFIG
+                        file path to the genome config file
+  -o OUTPUT_BED, --output-bed OUTPUT_BED
+                        path to the output BED files
+  --output-bigbed OUTPUT_BIGBED
+                        path to the folder of output bigBed files
+  -s SAMPLE_NAME, --sample-name SAMPLE_NAME
+                        name of the sample used to systematically build the
+                        output name
+  --chrom-sizes CHROM_SIZES
+                        a full path to the chrom.sizes required for the
+                        bedtobigbed conversion
+  --standard-chrom      Standardize chromosome names. Default: False
+```
+
+## `bedboss qc --help`
+```console
+usage: bedboss qc [-h] --bedfile BEDFILE --outfolder OUTFOLDER
+
+options:
+  -h, --help            show this help message and exit
+  --bedfile BEDFILE     a full path to bed file to process
+  --outfolder OUTFOLDER
+                        a full path to output log folder.
+```
+
+## `bedboss stat --help`
+```console
+usage: bedboss stat [-h] --bedfile BEDFILE
+                    [--open-signal-matrix OPEN_SIGNAL_MATRIX] [--ensdb ENSDB]
+                    [--bigbed BIGBED] [--bedbase-config BEDBASE_CONFIG]
+                    [-y SAMPLE_YAML] --genome GENOME_ASSEMBLY [--no-db-commit]
+                    [--just-db-commit]
+
+options:
+  -h, --help            show this help message and exit
+  --bedfile BEDFILE     a full path to bed file to process
+  --open-signal-matrix OPEN_SIGNAL_MATRIX
+                        a full path to the openSignalMatrix required for the
+                        tissue specificity plots
+  --ensdb ENSDB         a full path to the ensdb gtf file required for genomes
+                        not in GDdata
+  --bigbed BIGBED       a full path to the bigbed files
+  --bedbase-config BEDBASE_CONFIG
+                        a path to the bedbase configuration file
+  -y SAMPLE_YAML, --sample-yaml SAMPLE_YAML
+                        a yaml config file with sample attributes to pass on
+                        more metadata into the database
+  --genome GENOME_ASSEMBLY
+                        genome assembly of the sample
+  --no-db-commit        whether the JSON commit to the database should be
+                        skipped
+  --just-db-commit      whether just to commit the JSON to the database
+```
+
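The required options of `bedboss all` shown in the help text above can be assembled programmatically. This Python sketch only builds the argument list and does not run bedboss. The function name `build_bedboss_all_command` and the example values (`sample1`, `hg38`, file paths) are hypothetical; the flags themselves are copied from the help output, including the underscore in `--output_folder`.

```python
import shlex


def build_bedboss_all_command(sample_name, input_file, input_type, output_folder,
                              genome, bedbase_config, no_db_commit=False):
    """Build the argument list for `bedboss all` from its required options."""
    cmd = [
        "bedboss", "all",
        "--sample-name", sample_name,
        "--input-file", input_file,
        "--input-type", input_type,
        "--output_folder", output_folder,  # spelled with an underscore in the help text
        "--genome", genome,
        "--bedbase-config", bedbase_config,
    ]
    if no_db_commit:
        cmd.append("--no-db-commit")
    return cmd


if __name__ == "__main__":
    cmd = build_bedboss_all_command(
        sample_name="sample1",
        input_file="sample1.bed",
        input_type="bed",
        output_folder="./output",
        genome="hg38",
        bedbase_config="bedbase_config.yaml",
        no_db_commit=True,
    )
    # Print a shell-quoted command line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```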