hisat2 align wrapper ht2l extension #3368

gilless429 · 2024-10-29T10:33:34Z

When using the hisat2 wrappers, I run into issues if my reference genome is large. In that case hisat2 indexing produces .ht2l files rather than .ht2 files, while the hisat2 align wrapper uses .glob("*.ht2") (line 29 of the current wrapper.py) and therefore does not pick up any .ht2l files.

It seems to me that also adding to the ht2_files variable any files that match .glob("*.ht2l") would solve the problem.

Another option would be to add a rule parameter specifying whether the genome is large, but in my opinion that is a little too manual for the user, given that there do not seem to be any issues with grabbing all .ht2 and .ht2l files.
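A minimal sketch of that idea, assuming the index sits in its own directory. The function name `find_hisat2_index_files` is hypothetical and this is not the wrapper's actual code; it only illustrates globbing for both extensions.

```python
from pathlib import Path


def find_hisat2_index_files(index_dir):
    """Return the sorted .ht2 and .ht2l files found directly in index_dir.

    Hypothetical helper: "*.ht2" does not match names ending in ".ht2l",
    so both patterns have to be globbed explicitly.
    """
    index_dir = Path(index_dir)
    return sorted(
        path
        for pattern in ("*.ht2", "*.ht2l")
        for path in index_dir.glob(pattern)
    )
```

Any unrelated file in the directory whose name ends in .ht2 or .ht2l would also be picked up, so this only behaves well when the directory holds a single index.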

This happens to me when using the wheat reference genome for example, which is quite large.

fgvieira · 2024-10-29T15:57:43Z

Well, the proper way would be to specify the files you want to use as index, and then infer the path from them.
I've made a PR that should fix it. Can you take a look and see if it would work for you?
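As a rough illustration of inferring the index path from explicitly listed files (not necessarily how the PR does it), the sketch below strips the trailing `.<number>.ht2` or `.<number>.ht2l` from each file name and checks that all files agree on one prefix. The function name `infer_index_prefix` and the assumed file naming are illustrative assumptions.

```python
import re

# Assumed naming scheme: <prefix>.<number>.ht2 or <prefix>.<number>.ht2l
_INDEX_SUFFIX = re.compile(r"\.\d+\.ht2l?$")


def infer_index_prefix(index_files):
    """Return the common index prefix shared by the given index files.

    Hypothetical helper: raises ValueError if a file does not look like a
    HISAT2 index file or if the files do not share exactly one prefix.
    """
    prefixes = set()
    for index_file in index_files:
        name = str(index_file)
        match = _INDEX_SUFFIX.search(name)
        if match is None:
            raise ValueError(f"Not a HISAT2 index file: {name}")
        prefixes.add(name[: match.start()])
    if len(prefixes) != 1:
        raise ValueError(f"Expected one index prefix, found {len(prefixes)}")
    return prefixes.pop()
```

With this approach the extension never has to be guessed by the wrapper, but the rule's input must already name the right files.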

gilless429 · 2024-10-30T09:28:19Z

This works to allow .ht2l index files, but it seems to have the (rather significant, in my opinion) downside of requiring the user to know ahead of time what their index output is going to look like, whereas in automated pipelines the indexing is often done blind and followed immediately by mapping.
Either the user states the index extension up front, or the pipeline has to peek into the index folder before starting the alignment (and that peek has to happen AFTER the indexing is done).
Obviously with wheat (massive genome) it is going to be ht2l 100% of the time, and with something like Arabidopsis thaliana it will always be ht2, but for genomes in the neighborhood of the approximate 4 gigabase threshold (or for users who have no idea this threshold exists), this gets a little more dicey for automation.

The perk of a .glob() type approach is that as long as you put your index files in their own folder, not shared with other index files (which should be done anyway), you can just point snakemake to it, have the wrapper look for either ht2 or ht2l files, and go from there.

fgvieira · 2024-10-30T10:28:28Z

I find globbing a bit dangerous as you never know what other files might be in the same folder.
In your case, what about forcing the index to always be .ht2l?
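One way to do that is the `--large-index` option of `hisat2-build`, which forces a large index even for a small reference. The sketch below only assembles the command line; the function name `build_index_command` and the example paths are hypothetical.

```python
def build_index_command(reference_fasta, index_prefix, force_large=True):
    """Assemble a hisat2-build command as an argument list.

    Hypothetical helper: with force_large=True it adds --large-index so the
    index files are expected to end in .ht2l regardless of genome size.
    """
    cmd = ["hisat2-build"]
    if force_large:
        cmd.append("--large-index")
    cmd += [str(reference_fasta), str(index_prefix)]
    return cmd


# Example with made-up paths; pass the list to subprocess.run to execute it.
command = build_index_command("ref/genome.fa", "index/genome")
```

The downstream alignment rule can then always declare .ht2l files as input, so no globbing is needed.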

Fix issue #3368

### QC

* [x] I confirm that I have followed the [documentation for contributing to `snakemake-wrappers`](https://snakemake-wrappers.readthedocs.io/en/stable/contributing.html).

While the contributions guidelines are more extensive, please particularly ensure that:

* [x] `test.py` was updated to call any added or updated example rules in a `Snakefile`
* [x] `input:` and `output:` file paths in the rules can be chosen arbitrarily
* [x] wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`)
* [x] temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to
* [x] the `meta.yaml` contains a link to the documentation of the respective tool or command under `url:`
* [x] conda environments use a minimal amount of channels and packages, in recommended ordering

## Summary by CodeRabbit

## Release Notes

- **New Features**
  - Introduced support for new index file formats (`.ht2l`) in both alignment and indexing processes.
  - Added a new rule for handling large index files in the HISAT2 alignment workflow.
- **Bug Fixes**
  - Enhanced input handling for index files to improve clarity and maintainability.
- **Documentation**
  - Updated `meta.yaml` to include a description and a link to the HISAT2 manual.
- **Chores**
  - Significant updates to the Conda environment configuration, including version upgrades and removal of unnecessary dependencies.

gilless429 added the enhancement New feature or request label Oct 29, 2024

fgvieira mentioned this issue Oct 29, 2024

fix: add support for ht2l indexes in hisat wrapper #3371

Merged
