Metrics extraction subfeature raw trimmed align only #45

Open
wants to merge 24 commits into
base: development
24 commits
e183416
fix: add 'ISO-8859-1' fallback decoding
J-81 Mar 31, 2023
b5462ee
ci: allow tests on hotfix branches
J-81 Mar 31, 2023
7cddb47
fix: use loguru based logger
J-81 Mar 31, 2023
f065a01
feat: update documentation for version
J-81 Mar 31, 2023
0fe3c0b
feat: refactor to ensure ISO-8859-1 fallback can be used
J-81 Apr 11, 2023
cdab874
docs: update for 1.3.2 release
J-81 Apr 11, 2023
c1baeb4
feat: added support for data-asset-keys and run-components in updated…
J-81 May 10, 2023
da64da7
feat: version related updates
J-81 May 10, 2023
5bd18cf
feat: join multi-files on ',' instead of ',<SPACE>'
J-81 May 15, 2023
6307508
feat: version 1.3.4 updates
J-81 May 18, 2023
d24ba0f
feat: extraction to complete data sparse table format
J-81 Aug 29, 2023
7e8a393
feat: add isa config
J-81 Aug 29, 2023
3e32fab
Pushing pre-cleaned up code to repo
J-81 Sep 29, 2023
bbd58b9
Pushing pre-cleaned up runner script to repo
J-81 Sep 29, 2023
3f76cc8
Pushing pre-cleaned up runner script to repo
J-81 Sep 29, 2023
8c79ec7
update instructions
J-81 Oct 23, 2023
80720f8
fix: update extraction yaml
J-81 Oct 31, 2023
3400c80
search for aligned multiqc instead of read dist
asaravia-butler Nov 5, 2023
9ace419
feat: version that only extracts and summarizes raw, trimmed, and ali…
asaravia-butler Nov 5, 2023
47f9a53
removing all but raw, trimmed, and align multiqc
asaravia-butler Nov 5, 2023
cc34c6b
adding instructions for modifying root dir
asaravia-butler Nov 5, 2023
1e493ee
fixing unmapped percent extraction
asaravia-butler Nov 8, 2023
6b475e1
fix(metadata-variant): Added alternative casing for library selection
asaravia-butler Nov 8, 2023
ba496cf
copy yaml from assets subdir
asaravia-butler Nov 8, 2023
1 change: 1 addition & 0 deletions .github/workflows/run_pytests.yml
@@ -7,6 +7,7 @@ on:
    branches:
      - main
      - development
      - "*hotfix*"
  pull_request:
    types: [ opened, synchronize]
  # Allows you to run this workflow manually from the Actions tab
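As a rough illustration of which branch names the new `"*hotfix*"` filter would pick up: GitHub's documented filter patterns treat `*` as matching any characters except `/`, and `**` as matching any characters including `/`. The sketch below models only those two wildcards; the helper names `branch_filter_to_regex` and `branch_matches` are hypothetical and not part of this repository or of GitHub Actions.

```python
import re


def branch_filter_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified branch filter into a compiled regex.

    Assumption: only `*` (no `/`) and `**` (anything) are modeled;
    other filter features such as `?`, `+`, `[]` and `!` are ignored.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # `**` may cross `/` boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # `*` stops at `/`
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def branch_matches(pattern: str, branch: str) -> bool:
    """Return True if the whole branch name matches the filter."""
    return branch_filter_to_regex(pattern).fullmatch(branch) is not None


if __name__ == "__main__":
    for name in ("hotfix-1.3.1", "1.3.1-hotfix-encoding", "hotfix/1.3.1"):
        print(name, branch_matches("*hotfix*", name))
```

Under this model a branch named like `hotfix/1.3.1` would not be matched by `"*hotfix*"`; a filter such as `"**hotfix**"` would be needed if slash-separated hotfix branch names are in use.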
29 changes: 29 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,31 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.3.4]

### Changed

- Table updates (associated with updating ISA archive files) now separate multiple files in a field with ',' instead of ', '
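
A consumer of these tables may meet both the old `', '` and the new `','` separators. The following is a minimal sketch of a tolerant reader and a writer that uses the new separator; `split_file_field` and `join_file_field` are hypothetical names, not functions from dp_tools.

```python
def split_file_field(value: str) -> list:
    """Split a multi-file field written with either ',' or ', '."""
    # Stripping each piece makes the old ', ' separator harmless.
    return [name.strip() for name in value.split(",") if name.strip()]


def join_file_field(names: list) -> str:
    """Join file names with the ',' separator described in 1.3.4."""
    return ",".join(names)


if __name__ == "__main__":
    old_style = "sample_R1.fastq.gz, sample_R2.fastq.gz"
    print(join_file_field(split_file_field(old_style)))
```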

## [1.3.3]

### Added

- Support for data asset key sets and run components in updated validation interface (i.e. by 'dpt validation')

## [1.3.2]

### Fixed

- Refactored ISA archive parsing functions; previously the fallback wasn't being used in all calls (specifically the plugin-based ones)

## [1.3.1]

### Fixed

- Parsing of ISA archives that are encoded as 'ISO-8859-1' rather than 'utf-8'
  - Specifically, 'utf-8' is attempted first and 'ISO-8859-1' is used as a fallback
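
The fallback described in this entry can be sketched as follows. This is an illustration of the general technique, not the dp_tools implementation; the function name `decode_with_fallback` is an assumption.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "ISO-8859-1")):
    """Decode bytes with the first encoding that succeeds.

    Returns a (text, encoding_used) tuple. ISO-8859-1 maps every byte
    value to a character, so it never fails when used as the last resort.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"Could not decode data with any of: {encodings}")


if __name__ == "__main__":
    # b"\xe9" is 'e acute' in ISO-8859-1 but an invalid sequence in utf-8.
    print(decode_with_fallback(b"caf\xe9"))
```

Because ISO-8859-1 accepts any byte sequence, a successful fallback decode does not prove the file was really written in that encoding; it only guarantees that parsing can proceed.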

## [1.3.0]

### Added
@@ -170,3 +195,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
[1.2.0]: https://github.com/j-81/dp_tools/compare/1.1.9...1.2.0
[1.2.1]: https://github.com/j-81/dp_tools/compare/1.2.0...1.2.1
[1.3.0]: https://github.com/j-81/dp_tools/compare/1.2.1...1.3.0
[1.3.1]: https://github.com/j-81/dp_tools/compare/1.3.0...1.3.1
[1.3.2]: https://github.com/j-81/dp_tools/compare/1.3.1...1.3.2
[1.3.3]: https://github.com/j-81/dp_tools/compare/1.3.2...1.3.3
[1.3.4]: https://github.com/j-81/dp_tools/compare/1.3.3...1.3.4
60 changes: 60 additions & 0 deletions INSTRUCTIONS.md
@@ -0,0 +1,60 @@
# This document explains usage from Gitpod; however, besides installation, these steps may (untested) also work when run from containers (wrapped in appropriate `singularity` or `docker` invocations)


## Installation

```
cd $REPO_DIRECTORY # e.g. /workspace/dp_tools in gitpod
pip install -e .
```

## Download Relevant MultiQC & ISA archive

> python download_multiqc_from_OSD.py --osd-id <OSD-NNN> --output-dir <OUTPUT_DIR>

### Known Limitations / Issues

* Only supports datasets with `read distribution` MultiQC files (used as a proxy for whether the dataset is actually sequencing transcriptomics)
  * Future: Should rely on parsing metadata from the API

## Copy required configuration files

> bash set_up_config_files.sh <OUTPUT_DIR>

This copies template yaml files from the repository code.

## CD into directory

> cd <OUTPUT_DIR>

## Modify configuration files


### isa_config.yaml

1. Initially, no changes
2. If you encounter an error like: `ValueError: Could not find required column '['Parameter Value[Stranded]', 'Parameter Value[stranded]']' in either ISA sample or assay table.`
   * Comment out or modify the corresponding item in the `Staging: -> General: -> Required Metadata: -> From ISA:` section of the yaml
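
The error above suggests that each required metadata item lists alternative column names and that both the ISA sample and assay tables are searched. A minimal sketch of such a lookup, under that assumption, is shown here; `find_required_column` is a hypothetical helper and not the dp_tools code that raises the error.

```python
def find_required_column(alternatives, sample_columns, assay_columns):
    """Return (table_name, column_name) for the first alternative found.

    Assumption: the sample table is searched before the assay table.
    """
    tables = (("sample", sample_columns), ("assay", assay_columns))
    for table_name, columns in tables:
        for name in alternatives:
            if name in columns:
                return table_name, name
    raise ValueError(
        f"Could not find required column '{alternatives}' "
        "in either ISA sample or assay table."
    )


if __name__ == "__main__":
    wanted = ["Parameter Value[Stranded]", "Parameter Value[stranded]"]
    print(find_required_column(wanted, ["Sample Name"], ["Parameter Value[stranded]"]))
```

If none of the alternative names exist in a given ISA archive, the lookup cannot succeed, which is why the item has to be commented out or given another alternative name.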

### extraction_settings.yaml

1. MUST: change the root search directory (line 2) to the directory containing the MultiQC reports downloaded at the start of this document
1. MAY: disable sections for certain MultiQC reports (not likely to be useful; this will very probably break summarization)
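
The required edit to the root search directory can also be scripted. The sketch below rewrites only that one line as plain text, so comments and layout elsewhere in the file are left alone; `set_root_search_directory` is a hypothetical helper, and it assumes the key appears once, on its own line, as in the template.

```python
import re


def set_root_search_directory(yaml_text: str, new_root: str) -> str:
    """Replace the value of the first 'root search directory' key."""
    pattern = re.compile(r"^([ \t]*root search directory:[ \t]*).*$", re.MULTILINE)
    # A function replacement avoids backslash handling in new_root.
    updated, count = pattern.subn(
        lambda match: f'{match.group(1)}"{new_root}"', yaml_text, count=1
    )
    if count == 0:
        raise ValueError("No 'root search directory' key found")
    return updated


if __name__ == "__main__":
    template = (
        "Extraction Settings:\n"
        '  root search directory: "/workspace/dp_tools/OSD-201"\n'
        "  sections:\n"
    )
    print(set_root_search_directory(template, "/data/OSD-123"))
```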

## Run extract & summarize script

> python ../extract_dataset.py --osd-id <OSD_NNN> # You should still be in the directory with the multiQC outputs & yaml files

Outputs:

1. <OSD_NNN>_metrics.csv # Exhaustive metrics as pulled from multiQC reports
2. <OSD_NNN>_summary.csv # Summarization and derived statistics as generated on the exhaustive metrics table


## Overall Known Limitations

* Currently only supports paired-end sequencing transcriptomics
  * Updating this will require updating both extraction & summarization code

* Certain ISA archives may not work
  * While most missing or off-spec encoded metadata can be addressed by disabling (commenting out) sections in `extraction_settings.yaml`, certain ones like a missing `library layout` (unlikely, but an example) will likely require more significant changes to accommodate.
130 changes: 130 additions & 0 deletions OSD-201/extraction_settings.yaml
@@ -0,0 +1,130 @@
Extraction Settings:
  root search directory: "/workspace/dp_tools/OSD-201"
  sections:
    - name: "raw reads"
      enabled: True
      multiQC:
        from json:
          - "raw_multiqc_report"
          - "raw_multiqc_data"
          - "multiqc_data.json"
        search recursively: False
        logs directory:
          - "00-RawData"
          - "FastQC_Reports"
        logs pattern(s):
          - "*fastqc.zip"
        modules:
          - "fastqc"

    - name: "trimmed reads"
      enabled: True
      multiQC:
        from json:
          - "trimmed_multiqc_report"
          - "trimmed_multiqc_data"
          - "multiqc_data.json"
        search recursively: False
        logs directory:
          - "01-TG_Preproc"
          - "FastQC_Reports"
        logs pattern(s):
          - "*fastqc.zip"
        modules:
          - "fastqc"

    - name: "aligned reads"
      enabled: True
      multiQC:
        from json:
          - "align_multiqc_report"
          - "align_multiqc_data"
          - "multiqc_data.json"
        search recursively: True
        logs directory:
          - "02-STAR_Alignment"
        logs pattern(s):
          - "*Log.final.out"
        modules:
          - "star"

    - name: "rseqc: genebody coverage"
      enabled: True
      multiQC:
        from json:
          - "geneBody_cov_multiqc_report"
          - "geneBody_cov_multiqc_data"
          - "multiqc_data.json"
        search recursively: True
        logs directory:
          - "RSeQC_Analyses"
          - "02_geneBody_coverage"
        logs pattern(s):
          - "*.geneBodyCoverage.txt"
        modules:
          - "rseqc"

    - name: "rseqc: infer experiment"
      enabled: True
      multiQC:
        from json:
          - "infer_exp_multiqc_report"
          - "infer_exp_multiqc_data"
          - "multiqc_data.json"
        search recursively: True
        logs directory:
          - "RSeQC_Analyses"
          - "03_infer_experiment"
        logs pattern(s):
          - "*infer_expt.out"
        modules:
          - "rseqc"

    - name: "rseqc: inner distance"
      enabled: True
      multiQC:
        from json:
          - "inner_dist_multiqc_report"
          - "inner_dist_multiqc_data"
          - "multiqc_data.json"
        search recursively: True
        logs directory:
          - "RSeQC_Analyses"
          - "04_inner_distance"
        logs pattern(s):
          - "*inner_distance.txt"
        modules:
          - "rseqc"

    - name: "rseqc: read distribution"
      enabled: True
      multiQC:
        from json:
          - "read_dist_multiqc_report"
          - "read_dist_multiqc_data"
          - "multiqc_data.json"
        search recursively: True
        logs directory:
          - "RSeQC_Analyses"
          - "05_read_distribution"
        logs pattern(s):
          - "*read_dist.out"
        modules:
          - "rseqc"

    - name: "rsem count"
      enabled: True
      multiQC:
        from json:
          - "RSEM_count_multiqc_report"
          - "RSEM_count_multiqc_data"
          - "multiqc_data.json"
        search recursively: True
        logs directory:
          - "03-RSEM_Counts"
        logs pattern(s):
          - "*.stat"
        modules:
          - "rsem"
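
One way to read the `logs directory`, `logs pattern(s)`, and `search recursively` keys is as a directory path (joined from its parts under the root search directory), a list of glob patterns, and a switch between a flat and a recursive search. The sketch below follows that reading. It assumes these keys are nested under each section's `multiQC` key and that the section has already been loaded into a dict; `find_section_logs` is a hypothetical helper, not the extraction code in this pull request.

```python
from pathlib import Path


def find_section_logs(root, section):
    """Return the sorted log files matched by one section's settings."""
    settings = section["multiQC"]
    # "logs directory" is a list of path parts below the root directory.
    logs_dir = Path(root).joinpath(*settings["logs directory"])
    # rglob descends into subdirectories, glob stays in logs_dir.
    search = logs_dir.rglob if settings["search recursively"] else logs_dir.glob
    found = set()
    for pattern in settings["logs pattern(s)"]:
        found.update(path for path in search(pattern) if path.is_file())
    return sorted(found)


if __name__ == "__main__":
    aligned = {
        "name": "aligned reads",
        "enabled": True,
        "multiQC": {
            "search recursively": True,
            "logs directory": ["02-STAR_Alignment"],
            "logs pattern(s)": ["*Log.final.out"],
        },
    }
    for log_file in find_section_logs("/workspace/dp_tools/OSD-201", aligned):
        print(log_file)
```

With `search recursively: False`, only files directly inside the logs directory are considered, which fits the FastQC sections where all `*fastqc.zip` files sit in a single folder.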
