Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: adds keeplength option to ensure the correct region is extracted when using primers. #208

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

mikerobeson
Copy link
Collaborator

@mikerobeson mikerobeson commented Nov 2, 2024

Fixes #175. This is now possible as the keeplength option was added to the q2-alignment plugin.

Adds keeplength parameter to make sure original alignment length does not change. Otherwise there is potential for the incorrect region to be extracted from the alignment. This is important to make sure alignment positions can easily be referenced to the original alignment. See the q2-alignment plugin, specifically the mafft-add test cases for examples.

Also adds clearer clarification and a warning within the trim-alignment description when using primers to find and extract amplicon regions.

All of the original trim-alignment tests still pass.

Copy link
Collaborator

@nbokulich nbokulich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @mikerobeson ! I have one small question inline.

Also: could you maybe add a tiny unit test to make sure this is working as intended? Maybe you have a test case where the primer introduced gaps in an alignment and led to incorrect results? Or is it all just very theoretical?

alignment_with_primers, = expand_alignment_action(
alignment=aligned_sequences,
sequences=primers,
addfragments=True,
keeplength=True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @mikerobeson do we want to hardcode this param? Should we make it True by default but expose the option to the user? Is there ever a case when they might want to not keep the length? In this case this would only happen if there are indels in the primer region, right? which we probably don't want to preserve.

Copy link
Collaborator Author

@mikerobeson mikerobeson Nov 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These primer indel cases are explicitly tested in the mafft-add tests by toggling that option on/off. I was thinking I did not need to reproduce that test case here.

The issue is that the final alignment length changes after the primer sequences are added. Often, due to common column gaps in the newly combined alignment being removed. From what I can observe this is regardless if new indels are added within the location of the newly aligned primers. If any of the V3V4, V4, V4V5 primers are used no indels are added to the location where the primers align.

Due to the alignment length change after the primers are added to the alignment, the derived positional values of the primers will differ from the proper positions of the original alignment. Thus, when these derived positional values are used, the actual region extracted is not the region we are after, as they are based on the positions of the new alignment.

I suppose I could add another set of small files based on a few of the actual SILVA alignment sequences. Then I can check the expected vs observed returned alignment length differences. I'd just have to override keeplength=True option and compare. I've done this manually, by using an alignment viewer, and the differences are quite obvious.

Copy link
Collaborator Author

@mikerobeson mikerobeson Nov 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue is the trimming aspect... we are deriving the positional values from the resulting alignment after the primers are added. Then using those values to trim from the original alignment. This won't work in all cases as the resulting alignment length can very likely change. Which is why we need to use keeplength=True. I think that toggling this option makes obvious sense for the general mafft-add function, which is implemented ... but not for our specific use case here: finding alignment position trimming locations. Also, I'd like to think that most primers have already been vetted against a global set of reference sequences. 🤞

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay now I understand! So any length change messes up the positional trimming. I agree, hardcoding sounds necessary in that case.

And agreed, unit tests are not needed if this is explicitly tested in q2-alignment.

Copy link
Collaborator Author

@mikerobeson mikerobeson Nov 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that I can likely fix this by adding an if statement to check if primers were used. If primers are used then this line could be something like:

result = _trim_all_sequences(aligned_sequences_with_primers, trim_positions)

That is, we should be trimming the correct data from that new alignment rather than the original one. Then the primer sequences can removed from that output prior to saving. This would enable us add the ability to toggle the keeplength option. For instance if users would like to concatenate their edits to some master alignment and not worry about alignment length changes.

Thought that might be unnecessarily onerous... 🤔

Copy link
Collaborator Author

@mikerobeson mikerobeson Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nbokulich, I've added explicit unit test for the mafft --keeplength option. I decided to forgo toggling the --p-keeplengh parameter, as it is expected that this should used when trying to find the location of added fragments. That is, the mafft parameter --mapout (currently not accessible within the plugin) enforces the mafft --keeplength parameter if it is not supplied. Thus, I think we are in keeping with the spirit of the tool when it comes to determining alignment positions. :-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm still working on the test cases... I'll let you know when I am done. :-)

Copy link
Collaborator Author

@mikerobeson mikerobeson Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay... I realized that I should actually run my test cases rather than mock them. This PR should be ready to review now. :-)

@mikerobeson
Copy link
Collaborator Author

Appears to be failing on ModuleNotFoundError: No module named 'qiime2.plugins.alignment'. Not sure why.... When I run the tests locally within the q2dev-amplicon environment all the tests pass. This applies for both the 2024.10 and 2025.4 versions of dev.

@nbokulich
Copy link
Collaborator

looks like q2-alignment is missing from the recipe. Try adding this and re-run?

@mikerobeson
Copy link
Collaborator Author

Well, that's embarrassing. That is what the issue was... 🤦

Everything passes. 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants