hisat2 align wrapper ht2l extension #3368

Open
gilless429 opened this issue Oct 29, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@gilless429

When using the hisat2 wrappers, I run into issues if my reference genome is large. This is because hisat2 indexing creates .ht2l files in this case, not .ht2, and the hisat2 align wrapper uses .glob("*.ht2") (line 29 of the current wrapper.py), so it does not pick up any .ht2l files.

It seems to me that also adding to the ht2_files variable any files that match .glob("*.ht2l") would solve the problem.
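
A minimal sketch of that idea, not the wrapper's actual code: the function name `collect_index_files` and the `index_dir` argument are hypothetical, and only the `ht2_files` name and the two glob patterns come from the description above.

```python
from pathlib import Path


def collect_index_files(index_dir):
    """Collect HISAT2 index files from a folder, small or large format.

    Hypothetical helper: gathers both *.ht2 and *.ht2l files, so the caller
    does not need to know in advance which format the indexing step produced.
    """
    index_dir = Path(index_dir)
    ht2_files = []
    # "*.ht2" does not match "*.ht2l" names, so each pattern is globbed separately.
    for pattern in ("*.ht2", "*.ht2l"):
        ht2_files.extend(index_dir.glob(pattern))
    if not ht2_files:
        raise FileNotFoundError(f"no .ht2 or .ht2l files found in {index_dir}")
    return sorted(ht2_files)
```

This only works cleanly if the folder holds a single index, since every matching file is returned regardless of its prefix.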

Another option would be to add a parameter to the rule specifying whether the genome is large, but in my opinion this is a little too manual for the user, given that there don't seem to be any issues with grabbing all .ht2 and .ht2l files.

This happens to me with the wheat reference genome, for example, which is quite large.

@gilless429 gilless429 added the enhancement New feature or request label Oct 29, 2024
@fgvieira
Collaborator

fgvieira commented Oct 29, 2024

Well, the proper way would be to specify the files you want to use as index, and then infer the path from them.
I've made a PR that should fix it. Can you take a look and see if it would work for you?
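
One way to read "infer the path from them" is sketched below. This is an assumption about the approach, not the code in the PR: the helper name `infer_index_prefix` is hypothetical, and it assumes index files are named `<prefix>.<N>.ht2` or `<prefix>.<N>.ht2l`, with the shared prefix being what gets passed to `hisat2 -x`.

```python
import re

# Matches the trailing ".<N>.ht2" or ".<N>.ht2l" part of an index file name.
_INDEX_SUFFIX = re.compile(r"\.\d+\.ht2l?$")


def infer_index_prefix(index_files):
    """Return the common index prefix of explicitly listed HISAT2 index files.

    Hypothetical helper: raises ValueError if a file does not look like an
    index file or if the files do not all share one prefix.
    """
    prefixes = set()
    for path in index_files:
        name = str(path)
        if not _INDEX_SUFFIX.search(name):
            raise ValueError(f"not a HISAT2 index file: {name}")
        prefixes.add(_INDEX_SUFFIX.sub("", name))
    if len(prefixes) != 1:
        raise ValueError(f"expected exactly one index prefix, got {sorted(prefixes)}")
    return prefixes.pop()
```

For example, `infer_index_prefix(["index/genome.1.ht2l", "index/genome.2.ht2l"])` gives `"index/genome"`, and the same call works unchanged for `.ht2` files.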

@gilless429
Author

This works to allow .ht2l index files, but it seems to have the (rather significant, in my opinion) downside of requiring that the user know ahead of time what their index output is going to look like. In automated pipelines, indexing is often done blind and followed immediately by mapping.
Either the user would have to declare the index format up front, or the pipeline would need to peek into the index folder to see what the files look like before starting alignment (and the peek would need to happen AFTER the indexing is done).
Obviously with wheat (massive genome) it's going to be ht2l 100% of the time, and with something like Arabidopsis thaliana it'll always be ht2, but for genomes close to the approximate 4-gigabase threshold (or for users who have no idea this threshold exists), this gets a little more dicey for automation.

The perk of a .glob() type approach is that as long as you put your index files in their own folder, not shared with other index files (which should be done anyway), you can just point snakemake to it, have the wrapper look for either ht2 or ht2l files, and go from there.

@fgvieira
Collaborator

I find globbing a bit dangerous, as you never know what other files might be in the same folder.
In your case, what about forcing the index to always be .ht2l?
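
As a rough illustration of forcing the large format: `hisat2-build` has a `--large-index` option that builds a large index even for a small reference. The helper below only assembles the command line; its name and arguments are hypothetical, and how the flag would be passed through the snakemake wrapper is not shown here.

```python
def hisat2_build_command(fasta, index_prefix, force_large=True):
    """Build the argument list for a hisat2-build call.

    Hypothetical helper: with force_large=True the index is always written
    as .ht2l files, so downstream rules can rely on a single extension.
    """
    cmd = ["hisat2-build"]
    if force_large:
        # Force a large index regardless of the reference size.
        cmd.append("--large-index")
    cmd += [str(fasta), str(index_prefix)]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; the trade-off is that small genomes also get the large index format.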

johanneskoester pushed a commit that referenced this issue Oct 31, 2024
Fix issue #3368 

### QC

* [x] I confirm that I have followed the [documentation for contributing to `snakemake-wrappers`](https://snakemake-wrappers.readthedocs.io/en/stable/contributing.html).

While the contributions guidelines are more extensive, please particularly ensure that:
* [x] `test.py` was updated to call any added or updated example rules in a `Snakefile`
* [x] `input:` and `output:` file paths in the rules can be chosen arbitrarily
* [x] wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`)
* [x] temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to
* [x] the `meta.yaml` contains a link to the documentation of the respective tool or command under `url:`
* [x] conda environments use a minimal amount of channels and packages, in recommended ordering



## Summary by CodeRabbit

## Release Notes

- **New Features**
  - Introduced support for new index file formats (`.ht2l`) in both alignment and indexing processes.
  - Added a new rule for handling large index files in the HISAT2 alignment workflow.

- **Bug Fixes**
  - Enhanced input handling for index files to improve clarity and maintainability.

- **Documentation**
  - Updated `meta.yaml` to include a description and a link to the HISAT2 manual.

- **Chores**
  - Significant updates to the Conda environment configuration, including version upgrades and removal of unnecessary dependencies.
