From 84e71bf69c2af21c7020de2f010d06c301f0f8d4 Mon Sep 17 00:00:00 2001
From: Khoroshevskyi
Date: Mon, 2 Oct 2023 19:35:21 +0200
Subject: [PATCH] added some deleted docs

---
 docs/README.md                 |  65 ++++++++++++++
 docs/changelog.md              |   7 ++
 docs/how_to_bedbase_config.md  |  45 ++++++++++
 docs/how_to_create_database.md |  18 ++++
 docs/how_to_install_r_dep.md   |   7 ++
 docs/installRdeps.R            |  25 ++++++
 docs/usage.md                  | 160 +++++++++++++++++++++++++++++++++
 7 files changed, 327 insertions(+)
 create mode 100644 docs/README.md
 create mode 100644 docs/changelog.md
 create mode 100644 docs/how_to_bedbase_config.md
 create mode 100644 docs/how_to_create_database.md
 create mode 100644 docs/how_to_install_r_dep.md
 create mode 100644 docs/installRdeps.R
 create mode 100644 docs/usage.md

diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..ed9a7ea
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,65 @@
+# BEDboss
+bedboss is a command-line pipeline that standardizes and calculates statistics for genomic interval data, and enters the results into a BEDbase database.
+It has 3 components:
+
+1) bedmaker (`bedboss make`);
+2) bedqc (`bedboss qc`);
+3) bedstat (`bedboss stat`).
+
+You may run all 3 pipelines together, or separately.
+
+The pipelines are mainly intended to be run from the command line, but they are also available as Python functions, so that users can call them from their own code.
+
+----
+## BEDboss consists of 3 main pipelines:
+
+### bedmaker
+bedmaker is a pipeline that converts supported file types into BED and bigBed formats. Currently supported formats:
+
+- bedGraph
+- bigBed
+- bigWig
+- wig
+
+### bedqc
+bedqc flags BED files for further evaluation to determine whether they should be included in the downstream analysis.
+Currently, it flags BED files that are larger than 2G, have over 5 million regions, and/or have a mean region width of less than 10 bp.
+These thresholds can be changed through the bedqc function arguments.
+
+### bedstat
+
+bedstat is a pipeline for obtaining statistics about BED files.
+
+It produces the following BED file statistics:
+
+- **GC content**. The average GC content of the region set.
+- **Number of regions**. The total number of regions in the BED file.
+- **Median TSS distance**. The median absolute distance to the Transcription Start Sites (TSS).
+- **Mean region width**. The average region width of the region set.
+- **Exon percentage**. The percentage of the regions in the BED file that are annotated as exon.
+- **Intron percentage**. The percentage of the regions in the BED file that are annotated as intron.
+- **Promoter prox percentage**. The percentage of the regions in the BED file that are annotated as promoter-prox.
+- **Intergenic percentage**. The percentage of the regions in the BED file that are annotated as intergenic.
+- **Promoter core percentage**. The percentage of the regions in the BED file that are annotated as promoter-core.
+- **5' UTR percentage**. The percentage of the regions in the BED file that are annotated as 5'-UTR.
+- **3' UTR percentage**. The percentage of the regions in the BED file that are annotated as 3'-UTR.
+
+# Additional information
+
+## bedmaker
+
+### Additional dependencies
+
+- bedToBigBed: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed
+- bigBedToBed: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigBedToBed
+- bigWigToBedGraph: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph
+- wigToBigWig: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig
+
+## bedstat
+
+### Additional dependencies
+The regionstat.R script is used to calculate the BED file statistics, so the pipeline also depends on several R packages.
+
+All dependencies are listed in the R helper script, which you can use to install the required packages:
+
+- Rscript scripts/installRdeps.R [How to install R dependencies](./how_to_install_r_dep.md)
diff --git a/docs/changelog.md b/docs/changelog.md
new file mode 100644
index 0000000..5026ad7
--- /dev/null
+++ b/docs/changelog.md
@@ -0,0 +1,7 @@
+# Changelog
+
+This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format.
+
+## [0.1.0a1] - 2023-08-02
+### Added
+- Initial alpha release
diff --git a/docs/how_to_bedbase_config.md b/docs/how_to_bedbase_config.md
new file mode 100644
index 0000000..0c19ae0
--- /dev/null
+++ b/docs/how_to_bedbase_config.md
@@ -0,0 +1,45 @@
+# How to create bedbase config file (for bedstat)
+
+### The bedbase config file is a YAML file with 4 parts:
+- path to output files
+- database credentials
+- server information
+- remote info
+
+### Example:
+```yaml
+path:
+  pipeline_output_path: $BEDBOSS_OUTPUT_PATH  # do not change it
+  bedstat_dir: bedstat_output
+  remote_url_base: null
+  bedbuncher_dir: bedbucher_output
+  # region2vec: "add/path/here"
+  # vec2vec: "add/path/here"
+database:
+  host: $DB_HOST_URL
+  port: $POSTGRES_PORT
+  password: $POSTGRES_PASSWORD
+  user: $POSTGRES_USER
+  name: $POSTGRES_DB
+  dialect: postgresql
+  driver: psycopg2
+server:
+  host: 0.0.0.0
+  port: 8000
+qdrant:
+  host: localhost
+  port: 6333
+  api_key: None
+  collection: bedbase
+remotes:
+  http:
+    prefix: https://data.bedbase.org/
+    description: HTTP compatible path
+  s3:
+    prefix: s3://data.bedbase.org/
+    description: S3 compatible path
+```
+
+### Download example bedbase configuration file here: Example bedbase configuration file
+
+.
\ No newline at end of file
diff --git a/docs/how_to_create_database.md b/docs/how_to_create_database.md
new file mode 100644
index 0000000..12d2679
--- /dev/null
+++ b/docs/how_to_create_database.md
@@ -0,0 +1,18 @@
+# How to create bedbase database
+
+To run bedstat, bedbuncher and bedmbed we need to create a PostgreSQL database.
+
+We run the PostgreSQL database in Docker.
+If you don't have Docker installed, you can install it with `sudo apt-get update && sudo apt-get install docker-engine -y`.
+
+Now, create a persistent volume to house PostgreSQL data:
+
+```bash
+docker volume create postgres-data
+```
+
+```bash
+docker run -d --name bedbase-postgres -p 5432:5432 -e POSTGRES_PASSWORD=bedbasepassword -e POSTGRES_USER=postgres -e POSTGRES_DB=postgres -v postgres-data:/var/lib/postgresql/data postgres:13
+```
+
+Now the PostgreSQL container is running and we can run the pipelines.
diff --git a/docs/how_to_install_r_dep.md b/docs/how_to_install_r_dep.md
new file mode 100644
index 0000000..2059795
--- /dev/null
+++ b/docs/how_to_install_r_dep.md
@@ -0,0 +1,7 @@
+# How to install R dependencies
+
+1. Install R: https://cran.r-project.org/bin/linux/ubuntu/fullREADME.html
+2. Download this script: Install R dependencies
+3. Install dependencies by running this command in your terminal: ```Rscript installRdeps.R```
+4. Run `bash_requirements_test.sh` to check if everything was installed correctly (located in test folder:
+[Bash requirement tests](https://github.com/bedbase/bedboss/blob/68910f5142a95d92c27ef53eafb9c35599af2fbd/test/bash_requirements_test.sh))
diff --git a/docs/installRdeps.R b/docs/installRdeps.R
new file mode 100644
index 0000000..3cad82f
--- /dev/null
+++ b/docs/installRdeps.R
@@ -0,0 +1,25 @@
+.install_pkg = function(p, bioc=FALSE) {
+    if(!require(package = p, character.only=TRUE)) {
+        if(bioc) {
+            BiocManager::install(pkgs = p)
+        } else {
+            install.packages(pkgs = p)
+        }
+    }
+}
+
+.install_pkg("R.utils")
+.install_pkg("BiocManager")
+.install_pkg("optparse")
+.install_pkg("devtools")
+.install_pkg("GenomicRanges", bioc=TRUE)
+.install_pkg("GenomicFeatures", bioc=TRUE)
+.install_pkg("ensembldb", bioc=TRUE)
+.install_pkg("LOLA", bioc=TRUE)
+.install_pkg("BSgenome", bioc=TRUE)
+if(!require(package = "GenomicDistributions", character.only=TRUE)) {
+    devtools::install_github("databio/GenomicDistributions")
+}
+if(!require(package = "GenomicDistributionsData", character.only=TRUE)) {
+    install.packages("http://big.databio.org/GenomicDistributionsData/GenomicDistributionsData_0.0.1.tar.gz", repos=NULL)
+}
diff --git a/docs/usage.md b/docs/usage.md
new file mode 100644
index 0000000..a457e59
--- /dev/null
+++ b/docs/usage.md
@@ -0,0 +1,160 @@
+# Usage reference
+
+BEDboss is a command-line warehouse of 3 pipelines for genomic interval files.
+
+BEDboss includes bedmaker, bedqc, and bedstat. These pipelines can be run using the following positional arguments:
+
+- `bedboss all`: Runs all pipelines in order: bedmaker -> bedqc -> bedstat
+
+- `bedboss make`: Creates BED and bigBed files from other types of genomic interval files [bigwig|bedgraph|bed|bigbed|wig]
+
+- `bedboss qc`: Runs quality control on a BED file (works only with BED files)
+
+- `bedboss stat`: Runs statistics for BED and bigBed files.
+
+Here you can see the command-line usage instructions for the main bedboss command and for each subcommand:
+
+## `bedboss --help`
+```console
+version: 0.1.0
+usage: bedboss [-h] [--version] {all,make,qc,stat} ...
+
+Warehouse of pipelines for BED-like files: bedmaker, bedstat, and bedqc.
+
+positional arguments:
+  {all,make,qc,stat}
+    all               Run all bedboss pipelines and insert data into bedbase
+    make              A pipeline to convert bed, bigbed, bigwig or bedgraph
+                      files into bed and bigbed formats
+    qc                Run quality control on bed file (bedqc)
+    stat              A pipeline to read a file in BED format and produce
+                      metadata in JSON format.
+
+options:
+  -h, --help          show this help message and exit
+  --version           show program's version number and exit
+```
+
+## `bedboss all --help`
+```console
+usage: bedboss all [-h] -s SAMPLE_NAME -f INPUT_FILE -t INPUT_TYPE -o
+                   OUTPUT_FOLDER -g GENOME [-r RFG_CONFIG]
+                   [--chrom-sizes CHROM_SIZES] [-n NARROWPEAK]
+                   [--standard-chrom] [--check-qc]
+                   [--open-signal-matrix OPEN_SIGNAL_MATRIX] [--ensdb ENSDB]
+                   --bedbase-config BEDBASE_CONFIG [-y SAMPLE_YAML]
+                   [--no-db-commit] [--just-db-commit]
+
+options:
+  -h, --help            show this help message and exit
+  -s SAMPLE_NAME, --sample-name SAMPLE_NAME
+                        name of the sample used to systematically build the
+                        output name
+  -f INPUT_FILE, --input-file INPUT_FILE
+                        Input file
+  -t INPUT_TYPE, --input-type INPUT_TYPE
+                        Input type [required] options:
+                        (bigwig|bedgraph|bed|bigbed|wig)
+  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
+                        Output folder
+  -g GENOME, --genome GENOME
+                        reference genome (assembly)
+  -r RFG_CONFIG, --rfg-config RFG_CONFIG
+                        file path to the genome config file(refgenie)
+  --chrom-sizes CHROM_SIZES
+                        a full path to the chrom.sizes required for the
+                        bedtobigbed conversion
+  -n NARROWPEAK, --narrowpeak NARROWPEAK
+                        whether the regions are narrow (transcription factor
+                        implies narrow, histone mark implies broad peaks)
+  --standard-chrom      Standardize chromosome names. Default: False
+  --check-qc            Check quality control before processing data. Default:
+                        True
+  --open-signal-matrix OPEN_SIGNAL_MATRIX
+                        a full path to the openSignalMatrix required for the
+                        tissue specificity plots
+  --ensdb ENSDB         A full path to the ensdb gtf file required for genomes
+                        not in GDdata
+  --bedbase-config BEDBASE_CONFIG
+                        a path to the bedbase configuration file
+  -y SAMPLE_YAML, --sample-yaml SAMPLE_YAML
+                        a yaml config file with sample attributes to pass on
+                        more metadata into the database
+  --no-db-commit        skip the JSON commit to the database
+  --just-db-commit      just commit the JSON to the database
+```
+
+## `bedboss make --help`
+```console
+usage: bedboss make [-h] -f INPUT_FILE [-n NARROWPEAK] -t INPUT_TYPE -g GENOME
+                    -r RFG_CONFIG -o OUTPUT_BED --output-bigbed OUTPUT_BIGBED
+                    -s SAMPLE_NAME [--chrom-sizes CHROM_SIZES]
+                    [--standard-chrom]
+
+options:
+  -h, --help            show this help message and exit
+  -f INPUT_FILE, --input-file INPUT_FILE
+                        path to the input file
+  -n NARROWPEAK, --narrowpeak NARROWPEAK
+                        whether the regions are narrow (transcription factor
+                        implies narrow, histone mark implies broad peaks)
+  -t INPUT_TYPE, --input-type INPUT_TYPE
+                        a bigwig or a bedgraph file that will be converted
+                        into BED format
+  -g GENOME, --genome GENOME
+                        reference genome
+  -r RFG_CONFIG, --rfg-config RFG_CONFIG
+                        file path to the genome config file
+  -o OUTPUT_BED, --output-bed OUTPUT_BED
+                        path to the output BED files
+  --output-bigbed OUTPUT_BIGBED
+                        path to the folder of output bigBed files
+  -s SAMPLE_NAME, --sample-name SAMPLE_NAME
+                        name of the sample used to systematically build the
+                        output name
+  --chrom-sizes CHROM_SIZES
+                        a full path to the chrom.sizes required for the
+                        bedtobigbed conversion
+  --standard-chrom      Standardize chromosome names. Default: False
+```
+
+## `bedboss qc --help`
+```console
+usage: bedboss qc [-h] --bedfile BEDFILE --outfolder OUTFOLDER
+
+options:
+  -h, --help            show this help message and exit
+  --bedfile BEDFILE     a full path to bed file to process
+  --outfolder OUTFOLDER
+                        a full path to output log folder.
+```
+
+## `bedboss stat --help`
+```console
+usage: bedboss stat [-h] --bedfile BEDFILE
+                    [--open-signal-matrix OPEN_SIGNAL_MATRIX] [--ensdb ENSDB]
+                    [--bigbed BIGBED] [--bedbase-config BEDBASE_CONFIG]
+                    [-y SAMPLE_YAML] --genome GENOME_ASSEMBLY [--no-db-commit]
+                    [--just-db-commit]
+
+options:
+  -h, --help            show this help message and exit
+  --bedfile BEDFILE     a full path to bed file to process
+  --open-signal-matrix OPEN_SIGNAL_MATRIX
+                        a full path to the openSignalMatrix required for the
+                        tissue specificity plots
+  --ensdb ENSDB         a full path to the ensdb gtf file required for genomes
+                        not in GDdata
+  --bigbed BIGBED       a full path to the bigbed files
+  --bedbase-config BEDBASE_CONFIG
+                        a path to the bedbase configuration file
+  -y SAMPLE_YAML, --sample-yaml SAMPLE_YAML
+                        a yaml config file with sample attributes to pass on
+                        more metadata into the database
+  --genome GENOME_ASSEMBLY
+                        genome assembly of the sample
+  --no-db-commit        whether the JSON commit to the database should be
+                        skipped
+  --just-db-commit      whether just to commit the JSON to the database
+```
+
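
Reviewer note (not part of the patch): as an illustration of the `bedboss all` options documented in `docs/usage.md` above, the following Python sketch assembles a command line from the required flags and prints it without running anything. It uses only flags that appear in the help output; the helper name `build_bedboss_all_command` and every sample name and path are hypothetical placeholders, and whether bedboss accepts these exact values is an assumption.

```python
import shlex

# Input types listed in the `bedboss all --help` output.
ALLOWED_INPUT_TYPES = {"bigwig", "bedgraph", "bed", "bigbed", "wig"}


def build_bedboss_all_command(sample_name, input_file, input_type,
                              output_folder, genome, bedbase_config,
                              no_db_commit=False):
    """Return an argument list for `bedboss all` (hypothetical helper)."""
    if input_type not in ALLOWED_INPUT_TYPES:
        raise ValueError(f"unsupported input type: {input_type!r}")
    command = [
        "bedboss", "all",
        "--sample-name", sample_name,
        "--input-file", input_file,
        "--input-type", input_type,
        "--output_folder", output_folder,  # underscore, as in the help text
        "--genome", genome,
        "--bedbase-config", bedbase_config,
    ]
    if no_db_commit:
        # Optional flag: skip the JSON commit to the database.
        command.append("--no-db-commit")
    return command


# Placeholder values for illustration only.
example = build_bedboss_all_command(
    sample_name="sample1",
    input_file="data/sample1.bed",
    input_type="bed",
    output_folder="output",
    genome="hg38",
    bedbase_config="bedbase_config.yaml",
    no_db_commit=True,
)
print(shlex.join(example))
```

The resulting list could be passed to `subprocess.run` once bedboss and its dependencies are installed; building it as a list rather than a single string avoids shell-quoting problems with paths that contain spaces.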