Merge pull request #115 from ARTbio/week-7-workflows
week-7
drosofff authored Mar 11, 2024
2 parents 82373a5 + f4f5471 commit 8e7d378
Showing 4 changed files with 143 additions and 136 deletions.
30 changes: 3 additions & 27 deletions docs/bulk_RNAseq-IOC/40_exercices_week_06_review.md
@@ -1,30 +1,6 @@
## Issues with Slack ?
## Issues with :wrench: `annotateMyID` ?

## Issues with GitHub ?
- [x] Does everyone have a GitHub ID ?
- [x] Was everyone able to create a readme file and make a pull request to the repository
[ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
- [x] Was everyone able to retrieve the galaxy workflow file (the one that you have
generated during the first online meeting, with an extension .ga) and to add it in
the repository
[ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
## Issues with :wrench: `fgsea` ?

## Data upload in PSILO, then in Galaxy from Psilo
- [x] Did everyone upload the necessary data to their
[PSILO account](https://psilo.sorbonne-universite.fr) ?
- [x] Did everyone succeed in creating direct download links ?
- [x] Did everyone succeed in transferring their PSILO data into a Galaxy history `Input dataset`
in their Galaxy account ?
## Issues with :wrench: `EGSEA` ?

## Issues following the Galaxy training ?

[training to collection operations](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html)

- Check whether `Relabel identifiers` tool is understood

- Check whether `Extract element identifiers` tool is understood. Is the output dataset
from this tool uploaded in the appropriate GitHub folder ?

## Check input datasets histories of the participants

... and their ability to create appropriate collection for the analysis
106 changes: 86 additions & 20 deletions docs/bulk_RNAseq-IOC/41_workflow_intro.md
@@ -1,33 +1,99 @@
# Galaxy Workflows
## Galaxy Workflows

At this point, you should be more familiar with

- importing and manipulating datasets in Galaxy
- using tools in single consecutive steps
- visualising the metadata associated to these steps as well as the results.
- using tools in single, consecutive steps
- visualising the metadata associated with inputs, computational steps, and outputs.

You could arguably point out that all of these actions can be performed (maybe faster) either
in a Linux terminal (for Linux tools), the R environment (for R packages), or in a Python
environment for Python scripts.

However, this is only the tip of the Galaxy.
It can be noted, however, that using several completely separate environments would make
the analysis difficult to understand, compared to reading an analysis in a single Galaxy
history.

Indeed, as you may have noticed, histories can become very complicated with a lot of
datasets whose origin and purpose are not so easy to remember after a while (sooner than
you may believe).
Much worse, if you opt to use multiple environments with command lines, you will
not maintain the links that connect the inputs, the computational tool and the outputs, and
you will have to guess them based on their plausibility. By contrast, in a Galaxy
history, all these links are kept in a database (PostgreSQL) and they can be retrieved
(even years later) by clicking on the Galaxy dataset information icon.

Actually, the best way to preserve an analysis is to get it completely scripted in a
computational workflow.
Having said that, accumulating computational steps in histories is not the strongest
argument in favor of Galaxy.

This is where you find the Galaxy workflows !
You've likely noticed that analysis histories can become quite complex. Numerous
trial-and-error iterations and datasets accumulate, making it difficult to recall their
origins and purposes after a surprisingly short period.

A Galaxy workflow can be extracted from a history or built from scratch using the
Galaxy workflow editor (Menu `workflows`).
Scripting these analyses into
computational workflows offers the most effective solution for preserving them.

A workflow can be replayed at any time to regenerate an analysis. Importantly, workflows can be
exported as a `.ga` file and imported in another Galaxy server. Provided that this new
server has the input data and the tools specified by the workflow, the exact same analysis
will be generated.
**These workflows are the foundation of Galaxy, which streamlines their creation, execution,
and management**.

Take home message: "advanced Galaxy users use workflows to capture their work and make
their computational protocols convincing, transparent and re-usable"
### Building and Sharing Analyses with Galaxy Workflows

In the next and last section, you will test 2 workflows that are available in your
Galaxy server and recapitulate most of the analyses you have performed today.
Galaxy workflows offer a powerful solution for managing complex analyses. You can either:

- [x] Extract a workflow from an existing history:

This captures the steps you've taken in your analysis, making it easy to replicate.

- [x] Build a workflow from scratch using the Galaxy workflow editor:

This allows you to design custom workflows for specific analyses.

- [x] Use a combination of both approaches !

Beginners tend to start with the first approach since it allows them to build a workflow
automatically without interacting too much with the workflow `editor`. However, in practice
this proves difficult, because histories are often cluttered with several trials and
errors or trials and successes, with different parameter values for the same tool.

Thus, a workflow built from a history can be difficult to untangle.

On the other hand, experts in using the workflow editor favor creating workflows from
scratch. This mode requires you to have an analysis plan in mind, whereby workflow
editing is literally akin to the graphic writing of a computer script. Testing this
workflow can be done as it is written, by running it in Galaxy and verifying that the
outputs are valid and conform to what is expected.

In real life, it is often a combination of the two approaches that is implemented: you
can start a workflow from a not too complicated history and correct / develop it later,
first using the editor and then testing it.

Along the same lines, Galaxy masters will also rely on already existing workflows to
avoid reinventing what has already been done and save time. It is also possible to use
a workflow as a tool in another workflow, and thus to build very complex and elaborate
workflows by structuring them as `workflows of workflows`.

The beauty of workflows lies in their reusability. You can:

- [x] Replay a workflow at any time:

Simply run the workflow again to regenerate your analysis, saving time and effort.

- [x] Export workflows as shareable .ga files:

This allows you to export your workflows and import them into other Galaxy servers. As
long as the new server has the required data and tools, the analysis will run identically.
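
As a rough illustration of what such an export contains: a `.ga` file is a JSON document, so it can be inspected with a few lines of Python. The sketch below is only a minimal example under assumptions. The embedded text is a hypothetical, heavily trimmed stand-in for a real export, and the tool identifiers (`example_mapper`, `example_counter`) are invented. It assumes the usual top-level `name` and `steps` keys, which a real file should be checked against.

```python
import json

# Hypothetical, heavily trimmed stand-in for the text of an exported `.ga` file.
# A real export holds many more keys per step; names and tool ids are invented.
GA_TEXT = """
{
  "a_galaxy_workflow": "true",
  "name": "example RNAseq workflow",
  "steps": {
    "0": {"type": "data_collection_input", "label": "fastq pairs", "tool_id": null},
    "1": {"type": "tool", "label": "map reads", "tool_id": "example_mapper"},
    "2": {"type": "tool", "label": null, "tool_id": "example_counter"}
  }
}
"""


def summarize_workflow(ga_text):
    """Return the workflow name and one (index, type, tool_id, label) tuple per step."""
    workflow = json.loads(ga_text)
    steps = workflow.get("steps", {})
    rows = []
    # Step keys are strings ("0", "1", ...), so sort them numerically.
    for index in sorted(steps, key=int):
        step = steps[index]
        rows.append((int(index), step.get("type"), step.get("tool_id"), step.get("label")))
    return workflow.get("name"), rows


name, rows = summarize_workflow(GA_TEXT)
print(name)
for row in rows:
    print(row)
```

With a real export, the text would be read from the `.ga` file instead of the embedded string. A listing like this makes it easy to check that the target server offers every tool the workflow refers to.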

### Workflow reports
Another essential aspect of Galaxy workflows is that their invocations are logged and
accessible in the menu `User` --> `Workflow invocations`

In addition, a minimal default report is automatically generated for each workflow
invocation and gives access to the inputs, the outputs
and the workflow ==in its runtime version==. You can customize and enrich this automated
report using the Galaxy workflow editor.

:warning: Reports cannot yet be considered as a Materials and Methods section for your
scientific manuscripts with computational analyses, but they clearly make this section more
accurate and easier to write ! Moreover, the goal of reports is clearly to generate this
section in a fully automated manner, and Galaxy development is happening at a rapid pace !

### Key Takeaway
Advanced Galaxy users leverage workflows to capture their analyses, ensuring transparency,
reproducibility, and reusability of their computational protocols.
135 changes: 51 additions & 84 deletions docs/bulk_RNAseq-IOC/42_workflow_use_1.md
@@ -1,102 +1,69 @@
# Workflow upload
# A workflow of your use-case

As with data libraries, you can import workflows from shared data that have been pre-set in your Galaxy server for this training session.
The exercise of this week is difficult:

To access these workflows :
You are going to prepare a complete workflow of your analysis.

----
![](images/tool_small.png)

1. Click the menu `Données partagées` (`Shared data`) and select the submenu
`Workflows`. You should see two workflows : `paired-data-STAR-RNAseq` and `paired-data-HISAT2-RNAseq`

2. For each workflow, click on the arrow and select `Import`.


Now, you'll be able to see these workflows in the `Workflow` menu.

----

# Running workflows
Depending on your model organisms, you may not have been able to perform all of the
analyses covered in this training. This is not a problem: you are expected to create a
workflow from what you have actually been able to do.

You need to return to our first galaxy history `Inputs`, to do so :
In order to make a sustainable, reproducible and transparent workflow, you should meet the
following requirements:

----
![](images/tool_small.png)

1. Click the menu `Utilisateur` and select the submenu
`Historiques sauvegardés`.

2. Click on `Inputs`. Its status is now **current history**.

----
## Workflow inputs

## Prepare inputs
The best inputs are:

These workflows use data collections as inputs, one per condition `treat` and `untreat`. Let's create our two data collections !

----
![](images/tool_small.png)

1. Click on the checked box. ![](images/checked-box.png)

2. Select all treated paired-end datasets :
- `GSM461180_1_treat_paired.fastq.gz`
- `GSM461181_1_treat_paired.fastq.gz`
- `GSM461180_2_treat_paired.fastq.gz`
- `GSM461181_2_treat_paired.fastq.gz`
- [x] Completely unprocessed data (i.e. fastq files)
- [x] Preferably accessible through a sustainable URL. If it is not possible, they should
be at least easily accessible (i.e. gathered in a single folder, whose location is
precisely described)
- [x] Reference data (GTF, BED, etc.) should be precisely annotated: date, organisation,
version, etc. Importantly, a **direct** URL to the original reference should be included
- [x] :warning: Unless impossible to do, do not use processed data as inputs of your
workflow. If you think this is impossible to do, **let's discuss it** !
- A lot of good workflows stand on a metadata table, which describes input data, their
names, labels if required, replicate status, etc. This metadata table may be considered
as a genuine dataset which can be used by the workflow to perform some operations.
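
To make the idea of a metadata table more concrete, here is a minimal sketch of such a table and of one operation it could drive, namely grouping samples by condition. Everything in it is hypothetical: the column names (`sample`, `condition`, `replicate`) and the sample names are invented for illustration, and inside Galaxy the same grouping would typically be done with collection tools, not with a script.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-separated metadata table: one row per input sample.
# Column names and sample names are invented for this illustration.
METADATA_TSV = (
    "sample\tcondition\treplicate\n"
    "sampleA\ttreated\t1\n"
    "sampleB\ttreated\t2\n"
    "sampleC\tuntreated\t1\n"
    "sampleD\tuntreated\t2\n"
)


def group_by_condition(tsv_text):
    """Map each condition to the list of its sample names, in table order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row["condition"]].append(row["sample"])
    return dict(groups)


groups = group_by_condition(METADATA_TSV)
for condition, samples in groups.items():
    print(condition, samples)
```

Keeping this information in one table, and not scattered across file names, means that labels and replicate status are stated once and can be reused by every step that needs them.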

3. Then click on the button `Pour toute la sélection...` and `Build List of Dataset Pairs`.

4. Enter a name for your dataset collection. `Name`: Treat data pairs.

5. `Create list`

----
![](images/redo.png)
## Computational steps

Redo a data collection for the untreated datasets.
- [x] Whenever a computational step applies to multiple samples, think "**Collections**"
- [x] A good clue that you should switch to collections is when your workflow contains
the same step twice or more with the same parameters (or almost the same)
- [x] Take the time, for each step, to carefully fill the tool form at the right-hand side
of the workflow editor.
- [x] There are several fields in this tool form that *must* be used to clarify the step:
The `Label` field at the top of the tool form, the `Step Annotation` field, and the
`Configure Output: xxx` fields and their sub-fields `Label`, `Rename dataset` and `Change
datatype`

1. Uncheck the previous datasets.
Experiment with these fields in your workflow !
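
The two checks suggested above (repeated steps hinting at collections, and steps left without a label) can also be run on an exported workflow. The sketch below is only an illustration under assumptions: `STEPS` is a hypothetical mapping shaped like the `steps` section of a `.ga` export, with invented tool identifiers, and the helper `review_steps` is not part of Galaxy.

```python
from collections import Counter

# Hypothetical steps, shaped like the "steps" mapping of an exported workflow.
# Tool ids and labels are invented for this illustration.
STEPS = {
    "0": {"type": "data_input", "tool_id": None, "label": "sample 1 reads"},
    "1": {"type": "data_input", "tool_id": None, "label": "sample 2 reads"},
    "2": {"type": "tool", "tool_id": "example_mapper", "label": "map sample 1"},
    "3": {"type": "tool", "tool_id": "example_mapper", "label": None},
}


def review_steps(steps):
    """Return (tool ids used more than once, indices of tool steps without a label)."""
    tool_steps = {i: s for i, s in steps.items() if s.get("type") == "tool"}
    counts = Counter(s["tool_id"] for s in tool_steps.values())
    # A tool appearing several times is a hint that a collection may be more suitable.
    repeated = sorted(tool for tool, n in counts.items() if n > 1)
    # Tool steps whose `Label` field was left empty.
    unlabeled = sorted(int(i) for i, s in tool_steps.items() if not s.get("label"))
    return repeated, unlabeled


repeated, unlabeled = review_steps(STEPS)
print("tools used more than once:", repeated)
print("tool steps without a label:", unlabeled)
```

A repeated tool is only a hint: the same tool may legitimately appear twice with different parameters, so the result should be read as a list of steps worth a second look.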

2. Select all untreated paired-end datasets :
- `GSM461177_1_untreat_paired.fastq.gz`
- `GSM461178_1_untreat_paired.fastq.gz`
- `GSM461177_2_untreat_paired.fastq.gz`
- `GSM461178_2_untreat_paired.fastq.gz`

3. Then click on the button `Pour toute la sélection...` and `Build List of Dataset Pairs`.

4. Enter a name for your dataset collection. `Name`: Untreat data pairs.
- [x] Workflows **can use parameters** at runtime. If you are interested in this functionality,
let's discuss it !

5. `Create list`

----
## Workflow outputs

You are now the happy owner of two paired dataset collections !

It's time to test the workflows !

----
![](images/tool_small.png)

1. Go to Menu `Workflow`.

2. For the workflow `imported: paired-data-HISAT2-RNAseq`, click on the arrow and then `Run`.

3. `History Options`
- `Send results to a new history`: Yes

4. `1: treated data pairs`: Treat data pairs
- [x] You can hide some output datasets for better readability of the workflow by
unchecking these outputs in the tool items of the workflow.

:warning: By default, all outputs are visible although unchecked. It is only when you
check a first output that the unchecked outputs become hidden.

:warning: Hidden does not mean deleted: all workflow outputs are still there and you can
reveal them in the Galaxy history.

5. `2:GTF`: Drosophila_melanogaster.BDGP6.95.gtf.gz

6. `3: un-treated data pairs`: Untreat data pairs
- [x] Whenever possible, rename your datasets in the workflow using the `Configure Output: xxx`
fields in the tool forms

7. `Run workflow`
## Your objective:

----
Your objective is to generate the complete analysis in a **single** workflow run, with the
minimal number of inputs.

![](images/redo.png)
This way, you can even lose/trash your Galaxy history:
just having the inputs plus the workflow should be enough to regenerate the analysis.

Redo the same for the workflow `imported: paired-data-STAR-RNAseq`.
Consider that it is also a **huge** gain in terms of data storage.
8 changes: 3 additions & 5 deletions mkdocs.yml
@@ -107,12 +107,10 @@ nav:

- Week 7:
- Review on week-6 work: bulk_RNAseq-IOC/40_exercices_week_06_review.md
- Read Mapping overview:
- Galaxy Workflows: bulk_RNAseq-IOC/41_workflow_intro.md
- Galaxy Workflows:
- Introduction: bulk_RNAseq-IOC/41_workflow_intro.md
- Week 7 exercices:
- Workflows part 1: bulk_RNAseq-IOC/42_workflow_use_1.md
- Workflows part 2: bulk_RNAseq-IOC/43_workflow_use_2.md
- Workflows part 2: bulk_RNAseq-IOC/44_workflow_use_3.md
- Build your workflow: bulk_RNAseq-IOC/42_workflow_use_1.md

- Week 8:
- Review on week-7 work: bulk_RNAseq-IOC/50_exercices_week_07_review.md