diff --git a/docs/bulk_RNAseq-IOC/40_exercices_week_06_review.md b/docs/bulk_RNAseq-IOC/40_exercices_week_06_review.md
index 031c7145..70deaf77 100644
--- a/docs/bulk_RNAseq-IOC/40_exercices_week_06_review.md
+++ b/docs/bulk_RNAseq-IOC/40_exercices_week_06_review.md
@@ -1,30 +1,6 @@
-## Issues with Slack ?
+## Issues with :wrench: `annotateMyID` ?
 
-## Issues with GitHub ?
-- [x] Does everyone have a GitHub ID ?
-- [x] Was everyone able to create a readme file and make a pull request to the repository
-  [ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
-- [x] Was everyone able to retrieve the galaxy workflow file (the one that you have
-  generated during the first online meeting, with an extension .ga) and to add it in
-  the repository
-  [ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
+## Issues with :wrench: `fgsea` ?
 
-## Data upload in PSILO, then in Galaxy from Psilo
-- [x] Did everyone upload the necessary data in its
-  [PSILO account](https://psilo.sorbonne-universite.fr) ?
-- [x] Did everyone succeed to create direct download links ?
-- [x] Did everyone succeed to transfer its PSILO data into a Galaxy story `Input dataset`
-  in its Galaxy account ?
+## Issues with :wrench: `EGSEA` ?
 
-## Issues following the Galaxy training ?
-
-[training to collection operations](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html)
-
-- Check whether `Relabel identifiers` tool is understood
-
-- Check whether `Extract element identifiers` tool is understood. Is the output dataset
-  from this tool uploaded in the appropriate GitHub folder ?
-
-## Check input datasets histories of the participants
-
-...
-and their ability to create appropriate collection for the analysis
diff --git a/docs/bulk_RNAseq-IOC/41_workflow_intro.md b/docs/bulk_RNAseq-IOC/41_workflow_intro.md
index d06d7aea..7d438f9a 100644
--- a/docs/bulk_RNAseq-IOC/41_workflow_intro.md
+++ b/docs/bulk_RNAseq-IOC/41_workflow_intro.md
@@ -1,33 +1,99 @@
-# Galaxy Workflows
+## Galaxy Workflows
 
 At this point, you should be more familiar with
 
 - importing and manipulating datasets in Galaxy
-- using tools in single consecutive steps
-- visualising the metadata associated to these steps as well as the results.
+- using tools in single, consecutive steps
+- visualising the metadata associated to inputs, computational steps, and outputs.
+
+You could arguably point out that all of these actions can be performed (maybe faster)
+either in a Linux terminal (for Linux tools), in the R environment (for R packages), or
+in a Python environment (for Python scripts).
 
-However, this is only the tip of the Galaxy.
+Note, however, that using several completely separate environments would make the
+analysis difficult to follow, compared to reading it in a single Galaxy history.
 
-Indeed, as you may have noticed, histories can become very complicated with a lot of
-datasets whose origin and purpose is not so easy to remember after a while (shorter that
-you may believe).
+Much worse, if you opt for multiple command-line environments, you will not maintain the
+links that connect the inputs, the computational tools and the outputs, and you will have
+to guess them based on their plausibility. On the contrary, in a Galaxy history, all
+these links are kept in a database (PostgreSQL) and can be retrieved (even years later)
+by clicking on the dataset information icon.
 
-Actually, the best way to preserve an analysis is to get it completely scripted in a
-computational workflow.
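The "links kept in a database" point above can be made concrete with a toy provenance store. This is purely illustrative — the table and column names below are invented and do not reflect Galaxy's actual (much richer) schema — but it shows the kind of input → tool → output links that a history preserves and a pile of ad-hoc command-line outputs does not:

```python
import sqlite3

# Toy provenance store (illustrative only; NOT Galaxy's real schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, tool TEXT)")
con.execute("CREATE TABLE datasets (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE job_io (job_id INT, dataset_id INT, role TEXT)")

# One mapping job linking an input fastq to an output bam.
con.execute("INSERT INTO datasets VALUES (1, 'reads.fastq'), (2, 'aligned.bam')")
con.execute("INSERT INTO jobs VALUES (1, 'hisat2')")
con.execute("INSERT INTO job_io VALUES (1, 1, 'input'), (1, 2, 'output')")

# Years later: which tool and which input produced 'aligned.bam'?
row = con.execute(
    """SELECT j.tool, d_in.name FROM job_io o
       JOIN jobs j ON j.id = o.job_id
       JOIN job_io i ON i.job_id = o.job_id AND i.role = 'input'
       JOIN datasets d_in ON d_in.id = i.dataset_id
       WHERE o.role = 'output' AND o.dataset_id = 2"""
).fetchone()
print(row)  # -> ('hisat2', 'reads.fastq')
```

With command lines alone, answering that question means guessing from file names and dates; a history answers it with a query.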
+Having said that, the accumulation of computational steps in histories is not the
+strongest argument in favor of Galaxy.
 
-This is where you find the Galaxy workflows !
+You've likely noticed that analysis histories can become quite complex. Numerous
+trial-and-error iterations and datasets accumulate, making it difficult to recall their
+origins and purposes after a surprisingly short period.
 
-Galaxy workflow can be extracted from an history or built from scratch using the
-Galaxy workflow editor (Menu `worflows`).
+Scripting these analyses into computational workflows offers the most effective solution
+for preserving them.
 
-A workflow can be replayed at any time to regenerate an analysis. Importantly, they can be
-exported as a `.ga` file and imported in another Galaxy server. Provided that this new
-server has the input data and the tools specified by the workflow, the exact same analysis
-will be generated.
+**These workflows are the foundation of Galaxy, which streamlines their creation,
+execution, and management.**
 
-Take home message: "advanced Galaxy users use workflows, to capture their work and make
-convincing, transparent and re-usable their computational protocols"
+### Building and Sharing Analyses with Galaxy Workflows
 
-In the next and last section, you will test 2 workflows that are available in your
-Galaxy server and recapitulate most of the analyses you have performed today.
\ No newline at end of file
+Galaxy workflows offer a powerful solution for managing complex analyses. You can either:
+
+- [x] Extract a workflow from an existing history:
+
+  This captures the steps you've taken in your analysis, making it easy to replicate.
+
+- [x] Build a workflow from scratch using the Galaxy workflow editor:
+
+  This allows you to design custom workflows for specific analyses.
+
+- [x] Use a combination of both approaches !
+
+  Beginners tend to start with the first approach, since it allows them to build a
+  workflow automatically without interacting too much with the workflow `editor`. In
+  practice, however, this proves difficult, because histories are often cluttered with
+  multiple trials and errors (or trials and successes), with different parameter values
+  for the same tool.
+
+  Thus, a workflow built from a history can be difficult to untangle.
+
+  On the other hand, experienced users of the workflow editor favor creating workflows
+  from scratch. This mode requires you to have an analysis plan in mind: workflow
+  editing then becomes the graphical equivalent of writing a computer script. The
+  workflow can be tested as it is written, by running it in Galaxy and verifying that
+  the outputs are valid and conform to what is expected.
+
+  In real life, a combination of the two approaches is often used: you can start a
+  workflow from a reasonably simple history, then correct and develop it in the editor
+  before testing it.
+
+  Along the same lines, Galaxy masters also rely on already existing workflows, to avoid
+  reinventing what has already been done and to save time. It is even possible to use a
+  workflow as a tool in another workflow, and thus to build very complex and elaborate
+  workflows by structuring them as `workflows of workflows`.
+
+The beauty of workflows lies in their reusability. You can:
+
+- [x] Replay a workflow at any time:
+
+  Simply run the workflow again to regenerate your analysis, saving time and effort.
+
+- [x] Export workflows as shareable `.ga` files:
+
+  This allows you to export your workflows and import them into other Galaxy servers. As
+  long as the new server has the required data and tools, the analysis will run
+  identically.
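The exported `.ga` files mentioned above are plain JSON documents, so a workflow can be inspected outside Galaxy with any JSON tool. A minimal sketch — the workflow dictionary below is a synthetic, stripped-down stand-in (real `.ga` exports carry many more fields, such as `uuid`, annotations, tool versions, and input connections):

```python
import json

# Synthetic, stripped-down stand-in for an exported .ga file (assumption:
# real exports are much richer, but "name" and a "steps" mapping are central).
ga_text = json.dumps({
    "name": "paired-data-RNAseq (example)",
    "steps": {
        "0": {"type": "data_collection_input", "label": "fastq pairs", "tool_id": None},
        "1": {"type": "tool", "label": "mapping", "tool_id": "hisat2"},
        "2": {"type": "tool", "label": "counting", "tool_id": "featurecounts"},
    },
})

# Read the .ga back, as any script would from disk, and list the steps in order.
workflow = json.loads(ga_text)
for step_id in sorted(workflow["steps"], key=int):
    step = workflow["steps"][step_id]
    print(f'{step_id}: {step["type"]:<22} label={step["label"]!r} tool={step["tool_id"]}')
```

Being able to diff and review a `.ga` file as text is one reason workflows version well in a Git repository.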
+
+### Workflow reports
+
+Another essential aspect of Galaxy workflows is that their invocations are logged and
+accessible in the menu `User` --> `Workflow invocations`.
+
+In addition, a minimal default report is automatically generated for each workflow
+invocation, giving access to the inputs, the outputs, and the workflow ==in its runtime
+version==. You can customize and enrich this automated report using the Galaxy workflow
+editor.
+
+:warning: Reports cannot yet be considered a ready-made Material and Methods section for
+your scientific manuscripts with computational analyses, but they clearly make this
+section more accurate and easier to write ! Moreover, the goal of reports is clearly to
+generate this section in a fully automated manner, and Galaxy development is happening
+at a rapid pace !
+
+### Key Takeaway
+
+Advanced Galaxy users leverage workflows to capture their analyses, ensuring transparency,
+reproducibility, and reusability of their computational protocols.
diff --git a/docs/bulk_RNAseq-IOC/42_workflow_use_1.md b/docs/bulk_RNAseq-IOC/42_workflow_use_1.md
index cfb04022..20c4fb7d 100644
--- a/docs/bulk_RNAseq-IOC/42_workflow_use_1.md
+++ b/docs/bulk_RNAseq-IOC/42_workflow_use_1.md
@@ -1,102 +1,69 @@
-# Workflow upload
+# A workflow of your use-case
 
-Same as data libraries, you can import workflows, from shared data that has been pre-set in your Galaxy server for this training session.
+The exercise of this week is difficult:
 
-To access these workflows :
+You are going to prepare a complete workflow of your analysis.
 
-----
-    ![](images/tool_small.png)
-
-    1. Click the menu `Données partagées` (`Shared data`) and select the submenu
-       `Workflows`. You should see two workflows : `paired-data-STAR-RNAseq` and `paired-data-HISAT2-RNAseq`
-
-    2. For each workflow, click on the arrow and select `Import`.
-
-
-Now, you'll be able to see these workflows in the `Workflow` menu.
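Workflow runs like those discussed here can also be triggered programmatically through Galaxy's REST API (`POST /api/workflows/{id}/invocations`). A hedged, stdlib-only sketch that builds — but deliberately never sends — the HTTP request; the server URL, API key, IDs, and payload field names are illustrative assumptions, so check them against your own server's `/api/docs` page:

```python
import json
from urllib import request

def build_invocation_request(server, api_key, workflow_id, inputs, history_name):
    """Build (but do not send) a workflow-invocation request.

    Assumption: endpoint path and payload keys follow recent Galaxy API
    schemas; verify against your server before use.
    """
    payload = {"inputs": inputs, "new_history_name": history_name}
    return request.Request(
        f"{server}/api/workflows/{workflow_id}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical values -- replace with your server, API key, and object IDs.
req = build_invocation_request(
    "https://usegalaxy.example.org",
    "MY_API_KEY",
    "abc123",
    inputs={"0": {"src": "hdca", "id": "treat_collection_id"}},
    history_name="workflow test run",
)
print(req.get_method(), req.full_url)  # the request is built, never sent here
```

Sending results to a fresh history, as the payload above requests, mirrors the `Send results to a new history` option in the web interface.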
-
-----
-
-# Running workflows
 
-You need to return to our first galaxy history `Inputs`, to do so :
+Depending on your model organisms, you may not have been able to perform all of the
+analyses covered in this training. This is not a problem: you are expected to create a
+workflow from what you have actually been able to do.
 
-----
-    ![](images/tool_small.png)
-
-    1. Click the menu `Utilisateur` and select the submenu
-       `Historiques sauvegardés`.
-
-    2. Click on `Inputs`. Its status is now **current history**.
-
-----
+In order to make a sustainable, reproducible and transparent workflow, you should meet
+the following requirements:
 
-## Prepare inputs
+## Workflow inputs
 
-These workflows use data collection as inputs, one per condition `treat` and `untreat`. Let's create our two data collections !
+The best inputs are
 
-----
-    ![](images/tool_small.png)
-
-    1. Click on the checked box. ![](images/checked-box.png)
-
-    2. Select all treated datasets in pair ends :
-       - `GSM461180_1_treat_paired.fastq.gz`
-       - `GSM461181_1_treat_paired.fastq.gz`
-       - `GSM461180_2_treat_paired.fastq.gz`
-       - `GSM461181_2_treat_paired.fastq.gz`
+- [x] Completely unprocessed data (i.e. fastq files)
+- [x] Preferably accessible through a sustainable URL. If that is not possible, they
+  should at least be easily accessible (i.e. gathered in a single folder, whose location
+  is precisely described)
+- [x] Reference data (GTF, bed, etc.) should be precisely annotated: date, organisation,
+  version, etc. Importantly, a **direct** URL to the original reference should be
+  included
+- [x] :warning: Unless impossible to do, do not use processed data as inputs of your
+  workflow. If you think this is impossible to do, **let's discuss it** !
+- Many good workflows rely on a metadata table, which describes the input data: their
+  names, labels if required, replicate status, etc. This metadata table may be considered
+  a genuine dataset, which can be used by the workflow to perform some operations.
 
-    3.
-       Then click on the button `Pour toute la sélection...` and `Build List of Dataset Pairs`.
-
-    4. Enter a name for your dataset collection. `Name`: Treat data pairs.
-
-    5. `Create list`
-
-----
-![](images/redo.png)
+## Computational steps
 
-    Redo a data collections for untreated datasets.
+- [x] Whenever a computational step applies to multiple samples, think "**Collections**"
+- [x] A good clue that you should switch to collections is when your workflow contains
+  the same step, with the same (or almost the same) parameters, twice or more
+- [x] Take the time, for each step, to carefully fill in the tool form at the right-hand
+  side of the workflow editor.
+- [x] There are several fields in this tool form that *must* be used to clarify the step:
+  the `Label` field at the top of the tool form, the `Step Annotation` field, and the
+  `Configure Output: xxx` fields and their sub-fields `Label`, `Rename dataset` and
+  `Change datatype`
 
-    1. Unchecked the previous datasets.
+  Experiment with these fields in your workflow !
 
-    2. Select all untreated datasets in pair ends :
-       - `GSM461177_1_untreat_paired.fastq.gz`
-       - `GSM461178_1_untreat_paired.fastq.gz`
-       - `GSM461177_2_untreat_paired.fastq.gz`
-       - `GSM461178_2_untreat_paired.fastq.gz`
-
-    3. Then click on the button `Pour toute la sélection...` and `Build List of Dataset Pairs`.
-
-    4. Enter a name for your dataset collection. `Name`: Untreat data pairs.
+- [x] Workflows **can use parameters** at runtime. If you are interested in this
+  functionality, let's discuss it !
 
-    5. `Create list`
-
-----
+## Workflow outputs
 
-You are now the happy owner of two dataset paired collections !
-
-It's time to test the worflows !
-
-----
-    ![](images/tool_small.png)
-
-    1. Go to Menu `Workflow`.
-
-    2. For the workflow `imported: paired-data-HISAT2-RNAseq`, click on the arrow and then `Run`.
-
-    3. `History Options`
-       - `Send results to a new history`: Yes
-
-    4.
-       `1: treated data pairs`: Treat data pairs
+- [x] You can hide some output datasets for better readability of the workflow by
+  unchecking these outputs in the tool items of the workflow.
+
+  :warning: By default, all outputs are visible even though unchecked. Only once you
+  check a first output do the unchecked outputs become hidden.
+
+  :warning: Hidden does not mean deleted: all workflow outputs are still there, and you
+  can reveal them in the Galaxy history.
 
-    5. `2:GTF`: Drosophila_melanogaster.BDGP6.95.gtf.gz
-
-    6. `3: un-treated data pairs`: Untreat data pairs
+- [x] Whenever possible, rename your datasets in the workflow using the
+  `Configure Output: xxx` fields in the tool forms
 
-    7. `Run workflow`
+## Your objective
 
-----
+Generate the complete analysis in a **single** workflow run, with the minimal number of
+inputs.
 
-![](images/redo.png)
+This way, you can even lose/trash your Galaxy history: just having the inputs plus the
+workflow should be enough to regenerate the analysis.
 
-    Redo the same for the workflow `imported: paired-data-STAR-RNAseq`.
+Consider that it is also a **huge** gain in terms of data storage.
diff --git a/mkdocs.yml b/mkdocs.yml
index 6f93f3a9..56af72d5 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -107,12 +107,10 @@ nav:
   - Week 7:
     - Review on week-6 work: bulk_RNAseq-IOC/40_exercices_week_06_review.md
-    - Read Mapping overview:
-      - Galaxy Workflows: bulk_RNAseq-IOC/41_workflow_intro.md
+    - Galaxy Workflows:
+      - Introduction: bulk_RNAseq-IOC/41_workflow_intro.md
     - Week 7 exercices:
-      - Workflows part 1: bulk_RNAseq-IOC/42_workflow_use_1.md
-      - Workflows part 2: bulk_RNAseq-IOC/43_workflow_use_2.md
-      - Workflows part 2: bulk_RNAseq-IOC/44_workflow_use_3.md
+      - Build your workflow: bulk_RNAseq-IOC/42_workflow_use_1.md
   - Week 8:
     - Review on week-7 work: bulk_RNAseq-IOC/50_exercices_week_07_review.md