Merge pull request #115 from ARTbio/week-7-workflows

week-7
ARTbio · Mar 11, 2024 · 8e7d378
2 parents 82373a5 + f4f5471
commit 8e7d378

Showing 4 changed files with 143 additions and 136 deletions.
diff --git a/docs/bulk_RNAseq-IOC/40_exercices_week_06_review.md b/docs/bulk_RNAseq-IOC/40_exercices_week_06_review.md
@@ -1,30 +1,6 @@
-## Issues with Slack ?
+## Issues with :wrench: `annotateMyID` ?
 
-## Issues with GitHub ?
-- [x] Does everyone have a GitHub ID ? 
-- [x] Was everyone able to create a readme file and make a pull request to the repository
-      [ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
-- [x] Was everyone able to retrieve the galaxy workflow file (the one that you have
-      generated during the first online meeting, with an extension .ga) and to add it in
-      the repository
-      [ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
+## Issues with :wrench: `fgsea` ?
 
-## Data upload in PSILO, then in Galaxy from Psilo
-- [x] Did everyone upload the necessary data in its
-      [PSILO account](https://psilo.sorbonne-universite.fr) ?
-- [x] Did everyone succeed to create direct download links ? 
-- [x] Did everyone succeed to transfer its PSILO data into a Galaxy story `Input dataset`
-      in its Galaxy account ?
+## Issues with :wrench: `EGSEA` ?
 
-## Issues following the Galaxy training ?
-
-[training to collection operations](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html)
-
-- Check whether `Relabel identifiers` tool is understood
-
-- Check whether `Extract element identifiers` tool is understood. Is the output dataset
-  from this tool uploaded in the appropriate GitHub folder ?
-
-## Check input datasets histories of the participants
-
-... and their ability to create appropriate collection for the analysis
diff --git a/docs/bulk_RNAseq-IOC/41_workflow_intro.md b/docs/bulk_RNAseq-IOC/41_workflow_intro.md
@@ -1,33 +1,99 @@
-# Galaxy Workflows
+## Galaxy Workflows
 
 At this point, you should be more familiar with
 
 - importing and manipulating datasets in Galaxy
-- using tools in single consecutive steps
-- visualising the metadata associated to these steps as well as the results.
+- using tools in single, consecutive steps
+- visualising the metadata associated with inputs, computational steps, and outputs.
 
+You could reasonably argue that all of these actions can be performed (maybe faster)
+in a Linux terminal (for Linux tools), in the R environment (for R packages), or in a Python
+environment (for Python scripts).
 
-However, this is only the tip of the Galaxy.
+Note, however, that using several completely separate environments would make
+the analysis harder to understand than reading it in a single Galaxy
+history.
 
-Indeed, as you may have noticed, histories can become very complicated with a lot of
-datasets whose origin and purpose is not so easy to remember after a while (shorter that
-you may believe).
+Much worse, if you use multiple command-line environments, you will not maintain
+the links that connect the inputs, the computational tool, and the outputs, and
+you will have to guess them based on their plausibility. In contrast, in a Galaxy
+history, all these links are kept in a database (PostgreSQL) and can be retrieved
+(even years later) by clicking on the information icon of a Galaxy dataset.
 
-Actually, the best way to preserve an analysis is to get it completely scripted in a
-computational workflow.
+That said, accumulating computational steps in histories is not the
+strongest argument in favor of Galaxy.
 
-This is where you find the Galaxy workflows !
+You've likely noticed that analysis histories can become quite complex. Numerous
+trial-and-error iterations and datasets accumulate, making it difficult to recall their
+origins and purposes after a surprisingly short period.
 
-Galaxy workflow can be extracted from an history or built from scratch using the
-Galaxy workflow editor (Menu `worflows`).
+Scripting these analyses into
+computational workflows offers the most effective solution for preserving them.
 
-A workflow can be replayed at any time to regenerate an analysis. Importantly, they can be
-exported as a `.ga` file and imported in another Galaxy server. Provided that this new
-server has the input data and the tools specified by the workflow, the exact same analysis
-will be generated.
+**These workflows are the foundation of Galaxy, which streamlines their creation, execution,
+and management**.
 
-Take home message: "advanced Galaxy users use workflows, to capture their work and make
-convincing, transparent and re-usable their computational protocols"
+### Building and Sharing Analyses with Galaxy Workflows
 
-In the next and last section, you will test 2 workflows that are available in your
-Galaxy server and recapitulate most of the analyses you have performed today.
+Galaxy workflows offer a powerful solution for managing complex analyses. You can either:
+
+- [x] Extract a workflow from an existing history:
+
+    This captures the steps you've taken in your analysis, making it easy to replicate.
+
+- [x] Build a workflow from scratch using the Galaxy workflow editor:
+
+    This allows you to design custom workflows for specific analyses.
+
+- [x] Use a combination of both approaches !
+
+    Beginners tend to start with the first approach, since it builds a workflow
+    automatically without much interaction with the workflow `editor`. In practice, however,
+    this proves difficult, because histories are often cluttered with several trials and
+    errors, or trials and successes, with different parameter values for the same tool.
+
+    Thus, a workflow built from a history can be difficult to untangle.
+
+    On the other hand, experts in using the workflow editor favor creating workflows from
+    scratch. This mode requires you to have an analysis plan in mind: editing the workflow
+    is then akin to writing a computer script graphically. You can test the workflow as
+    you write it, by running it in Galaxy and verifying that the outputs are valid and
+    conform to what is expected.
+
+    In real life, it is often a combination of the two approaches that is implemented: you
+    can start a workflow from a not too complicated history and correct / develop it later
+    in the editor before testing it.
+
+    Along the same lines, Galaxy masters will also rely on already existing workflows to
+    avoid reinventing what has already been done and save time. It is also possible to use 
+    a workflow as a tool in another workflow, and thus to build very complex and elaborate
+    workflows by structuring them as `workflows of workflows`.
+
+The beauty of workflows lies in their reusability. You can:
+
+- [x] Replay a workflow at any time:
+
+    Simply run the workflow again to regenerate your analysis, saving time and effort.
+
+- [x] Export workflows as shareable .ga files:
+
+    This allows you to export your workflows and import them into other Galaxy servers. As
+    long as the new server has the required data and tools, the analysis will run identically.
+
+### Workflow reports
+Another essential aspect of Galaxy workflows is that their invocations are logged and
+accessible in the menu `User` --> `Workflow invocations`.
+
+In addition, a report is automatically generated for each workflow invocation. By default,
+this report is minimal and gives access to the inputs, the outputs,
+and the workflow ==in its runtime version==. You can customize and enrich this automated
+report using the Galaxy workflow editor.
+
+:warning: Reports cannot yet be considered a Materials and Methods section for your
+scientific manuscripts with computational analyses, but they clearly make this section more
+accurate and easier to write ! Moreover, the goal of reports is clearly to generate this
+section in a fully automated manner, and Galaxy development is happening at a rapid pace !
+
+### Key Takeaway
+Advanced Galaxy users leverage workflows to capture their analyses, ensuring transparency,
+reproducibility, and reusability of their computational protocols.
diff --git a/docs/bulk_RNAseq-IOC/42_workflow_use_1.md b/docs/bulk_RNAseq-IOC/42_workflow_use_1.md
@@ -1,102 +1,69 @@
-# Workflow upload
+# A workflow of your use-case
 
-Same as data libraries, you can import workflows, from shared data that has been pre-set in your Galaxy server for this training session.
+The exercise of this week is difficult:
 
-To access these workflows :
+You are going to prepare a complete workflow of your analysis.
 
-----
-  ![](images/tool_small.png)
-
-  1. Click the menu `Données partagées` (`Shared data`) and select the submenu
-  `Workflows`. You should see two workflows : `paired-data-STAR-RNAseq` and `paired-data-HISAT2-RNAseq`
-
-  2. For each workflow, click on the arrow and select `Import`.
-
-
-Now, you'll be able to see these workflows in the `Workflow` menu.
-
-----
-
-# Running workflows
+Depending on your model organisms, you may not have been able to perform all of the
+analyses covered in this training. This is not a problem: you are expected to create a
+workflow from what you have actually been able to do.
 
-You need to return to our first galaxy history `Inputs`, to do so :
+In order to make a sustainable, reproducible and transparent workflow, you should meet the
+following requirements:
 
-----
-  ![](images/tool_small.png)
-
-  1. Click the menu `Utilisateur` and select the submenu
-  `Historiques sauvegardés`.
-
-  2. Click on `Inputs`. Its status is now **current history**. 
-
-----
+## Workflow inputs
 
-## Prepare inputs
+The best inputs are:
 
-These workflows use data collection as inputs, one per condition `treat` and `untreat`. Let's create our two data collections !
-
-----
-  ![](images/tool_small.png)
-
-  1. Click on the checked box. ![](images/checked-box.png)
-
-  2. Select all treated datasets in pair ends :
-      - `GSM461180_1_treat_paired.fastq.gz`
-      - `GSM461181_1_treat_paired.fastq.gz`
-      - `GSM461180_2_treat_paired.fastq.gz`
-      - `GSM461181_2_treat_paired.fastq.gz`
+- [x] Completely unprocessed data (i.e. fastq files)
+- [x] Preferably accessible through a sustainable URL. If it is not possible, they should
+  be at least easily accessible (i.e. gathered in a single folder, whose location is
+  precisely described)
+- [x] Reference data (GTF, BED, etc.) should be precisely annotated: date, organisation,
+  version, etc. Importantly, a **direct** URL to the original reference should be included
+- [x] :warning: Unless impossible to do, do not use processed data as inputs of your
+  workflow. If you think this is impossible to do, **let's discuss it** !
+- A lot of good workflows stand on a metadata table, which describes input data, their
+  names, labels if required, replicate status, etc. This metadata table may be considered
+  as a genuine dataset which can be used by the workflow to perform some operations.
 
-  3. Then click on the button `Pour toute la sélection...` and `Build List of Dataset Pairs`.
-
-  4. Enter a name for your dataset collection. `Name`: Treat data pairs. 
-
-  5. `Create list`
-
-----
-![](images/redo.png)
+## Computational steps
 
-  Redo a data collections for untreated datasets.
+- [x] Whenever a computational step applies to multiple samples, think "**Collections**"
+- [x] A good clue that you should switch to collections is when your workflow contains
+  the same step twice or more with the same parameters (or almost the same)
+- [x] Take the time, for each step, to carefully fill in the tool form on the right-hand side
+  of the workflow editor.
+- [x] There are several fields in this tool form that *must* be used to clarify the step:
+  The `Label` field at the top of the tool form, the `Step Annotation` field, and the
+  `Configure Output: xxx` fields and their sub-fields `Label`, `Rename dataset` and `Change
+  datatype`
 
-  1. Unchecked the previous datasets.
+  Experiment with these fields in your workflow !
 
-  2. Select all untreated datasets in pair ends :
-      - `GSM461177_1_untreat_paired.fastq.gz`
-      - `GSM461178_1_untreat_paired.fastq.gz`
-      - `GSM461177_2_untreat_paired.fastq.gz`
-      - `GSM461178_2_untreat_paired.fastq.gz`
-
-  3. Then click on the button `Pour toute la sélection...` and `Build List of Dataset Pairs`.
-
-  4. Enter a name for your dataset collection. `Name`: Untreat data pairs. 
+- [x] Workflows **can use parameters** at runtime. If you are interested in this functionality,
+  let's discuss it !
 
-  5. `Create list`
-
-----
+## Workflow outputs
 
-You are now the happy owner of two dataset paired collections ! 
-
-It's time to test the worflows !
-
-----
-  ![](images/tool_small.png)
-
-  1. Go to Menu `Workflow`.
-
-  2. For the workflow `imported: paired-data-HISAT2-RNAseq`, click on the arrow and then `Run`.
-
-  3. `History Options`
-      - `Send results to a new history`: Yes
-
-  4. `1: treated data pairs`: Treat data pairs
+- [x] You can hide some output datasets for better readability of the workflow by
+  unchecking these outputs in the tool items of the workflow.
+
+      :warning: By default, all outputs are visible even though they are unchecked. Only when
+      you check a first output do the unchecked outputs become hidden.
+
+      :warning: Hidden does not mean deleted: all workflow outputs are still there and you can
+      reveal them in the Galaxy history.
 
-  5. `2:GTF`: Drosophila_melanogaster.BDGP6.95.gtf.gz
-
-  6. `3: un-treated data pairs`: Untreat data pairs
+- [x] Whenever possible, rename your datasets in the workflow using the `Configure Output: xxx`
+  fields in the tool forms
 
-  7. `Run workflow`
+## Your objective:
 
-----
+Your objective is to generate the complete analysis in a **single** workflow run, with the
+minimal number of inputs.
 
-![](images/redo.png)
+This way, you can even lose/trash your Galaxy history:
+just having the inputs plus the workflow should be enough to regenerate the analysis.
 
-  Redo the same for the workflow `imported: paired-data-STAR-RNAseq`.
+Consider that it is also a **huge** gain in terms of data storage.
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -107,12 +107,10 @@ nav:
 
         - Week 7:
           - Review on week-6 work: bulk_RNAseq-IOC/40_exercices_week_06_review.md
-          - Read Mapping overview:
-            - Galaxy Workflows: bulk_RNAseq-IOC/41_workflow_intro.md
+          - Galaxy Workflows:
+            - Introduction: bulk_RNAseq-IOC/41_workflow_intro.md
           - Week 7 exercices:
-            - Workflows part 1: bulk_RNAseq-IOC/42_workflow_use_1.md
-            - Workflows part 2: bulk_RNAseq-IOC/43_workflow_use_2.md
-            - Workflows part 2: bulk_RNAseq-IOC/44_workflow_use_3.md
+            - Build your workflow: bulk_RNAseq-IOC/42_workflow_use_1.md
 
         - Week 8:
           - Review on week-7 work: bulk_RNAseq-IOC/50_exercices_week_07_review.md