ENH: dereplicate-sequences expose parameter to disable sequence hash IDs #55

Open
nbokulich opened this issue Aug 20, 2018 · 8 comments
Labels
src:forum From the QIIME 2 Forum.

Comments

@nbokulich
Member

nbokulich commented Aug 20, 2018

Improvement Description
Similar to q2-dada2 and q2-deblur, there should be an option to use the unhashed sequences as their own IDs instead of using a hash ID in dereplicate-sequences.

Current Behavior
Seq hashes are used by default.

Proposed Behavior
Expose a --p-hashed-feature-ids parameter to choose how sequence IDs get handled.

References

  1. forum xref
@nbokulich nbokulich added the src:forum From the QIIME 2 Forum. label Aug 20, 2018
@colinbrislawn
Contributor

Should we request this as a feature of vsearch? Vsearch currently supports:

--relabel string
  Relabel sequences using the prefix string and a ticker
--relabel_md5
--relabel_sha1

Colin

@colinbrislawn
Contributor

colinbrislawn commented Aug 27, 2018

Wait... there are several, nested feature requests here!

  • expose a parameter to control how reads are labeled after derep
    • add setting to use the ID of the first sequence encountered (vsearch's default)
    • add setting to use the sequence of the read as the ID (does any software do this?)

Is this what we want?

>ACTTTTTTG
ACTTTTTTG

Having a sequence with identical ID and sequence seems a little silly to me, but if both dada2 and deblur implement this natively, then I'm comfortable requesting it for vsearch. However, if this is an option within the q2 plugins, maybe we should implement this within Q2-vsearch.
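
To make the two labeling choices from the list above concrete, here is a minimal pure-Python sketch. It does not call vsearch or any QIIME 2 code; the function name `dereplicate` and the mode names `"first_id"` and `"sequence"` are hypothetical and chosen only for illustration.

```python
def dereplicate(records, label="first_id"):
    """Collapse identical sequences and label each unique sequence.

    records: iterable of (read_id, sequence) pairs.
    label:   "first_id" keeps the ID of the first read seen for a sequence;
             "sequence" uses the sequence itself as the ID.
    Both mode names are hypothetical, not vsearch or q2-vsearch options.
    """
    if label not in ("first_id", "sequence"):
        raise ValueError(f"unknown label mode: {label!r}")

    first_seen = {}  # sequence -> ID of the first read carrying it
    counts = {}      # sequence -> number of reads with that sequence
    for read_id, seq in records:
        first_seen.setdefault(seq, read_id)
        counts[seq] = counts.get(seq, 0) + 1

    # dicts keep insertion order, so output follows first appearance
    result = []
    for seq, read_id in first_seen.items():
        feature_id = read_id if label == "first_id" else seq
        result.append((feature_id, seq, counts[seq]))
    return result


reads = [("s1_1", "ACTTTTTTG"), ("s1_2", "GGGCCC"), ("s2_1", "ACTTTTTTG")]
by_first_id = dereplicate(reads, label="first_id")
by_sequence = dereplicate(reads, label="sequence")
```

With `label="sequence"`, every record carries the same string as both ID and sequence, which is exactly the redundancy questioned above.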

Colin

@colinbrislawn
Contributor

Hello @torognes, what do you think about a --relabel_self option in vsearch that relabels FASTA headers so they are identical to their sequences? Like this:

>GCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGT
GCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGT
>CCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTG
CCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTG
>CCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGC
CCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGC

@torognes

torognes commented Sep 2, 2019

Hi @colinbrislawn, yes, that's a feature that should be easy to add to vsearch. I'll add it to issues for vsearch and implement it soon.

@colinbrislawn
Contributor

colinbrislawn commented Sep 2, 2019

Thanks @torognes!

@nbokulich I'll add --p-hashed-feature-ids / --p-no-hashed-feature-ids to match dada2 and deblur.

As far as I can see, the reads will be hashed with sha1, which conflicts with the md5 of dada2...
Should we make an option for other values or keep vsearch consistent with dada2 and deblur?

@nbokulich
Member Author

Thanks @colinbrislawn !

Looks like VSEARCH has both --relabel_md5 and --relabel_sha1 options. So in q2-vsearch instead of a boolean option hashed_feature_ids you could make this a multi-choice string. Something like: hashed_feature_ids = Str % Choices(['md5', 'sha1', 'unhashed'])
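
As a rough illustration of what the three choices would mean for the resulting feature IDs, here is a sketch using only Python's standard `hashlib`. The helper name `feature_id` is hypothetical, and the hashes are computed over the sequence string exactly as given; vsearch may normalize sequences before hashing, so its output is not guaranteed to match this byte for byte.

```python
import hashlib

# The three choices suggested above; the helper below is hypothetical.
HASH_MODES = ("md5", "sha1", "unhashed")


def feature_id(sequence, mode="sha1"):
    """Return the ID a sequence would get under the given mode."""
    if mode == "unhashed":
        # the sequence serves as its own ID
        return sequence
    if mode in ("md5", "sha1"):
        # hex digest of the ASCII-encoded sequence
        return hashlib.new(mode, sequence.encode("ascii")).hexdigest()
    raise ValueError(f"mode must be one of {HASH_MODES}, got {mode!r}")


seq = "ACTTTTTTG"
ids = {mode: feature_id(seq, mode) for mode in HASH_MODES}
```

An md5 hex digest is 32 characters and a sha1 hex digest is 40, so IDs produced under different modes cannot be compared directly across plugins.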

@colinbrislawn
Contributor

So --relabel_self is now in vsearch v2.14.0 and up. All our options are on the table.

Looks like both this issue and #48 can't be closed until the vsearch version is bumped. While we wait for the bump, I'll try to get this PR submitted before the October 18th deadline.
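
With all three relabel flags available, the plugin-side choice could be translated into a vsearch command roughly as follows. This is a sketch only: the function name `derep_command` and the mode names are hypothetical, the command is built but not executed, and the real q2-vsearch invocation may pass additional options.

```python
# Mapping from a hypothetical plugin-level choice to the vsearch flags
# discussed in this thread (--relabel_self needs vsearch >= 2.14.0).
RELABEL_FLAGS = {
    "md5": "--relabel_md5",
    "sha1": "--relabel_sha1",
    "unhashed": "--relabel_self",
}


def derep_command(fasta_in, fasta_out, uc_out, mode="sha1"):
    """Build (but do not run) a vsearch dereplication command line."""
    if mode not in RELABEL_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    return [
        "vsearch",
        "--derep_fulllength", fasta_in,  # input reads
        "--output", fasta_out,           # dereplicated FASTA
        "--uc", uc_out,                  # cluster membership table
        RELABEL_FLAGS[mode],             # how output records are labeled
    ]


cmd = derep_command("seqs.fasta", "derep.fasta", "derep.uc", mode="unhashed")
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once a suitable vsearch version is installed.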

@colinbrislawn
Contributor

It looks like removing the hashes breaks this section:

id_map = {e.metadata['description']: e.metadata['id']
          for e in skbio.io.read(str(dereplicated_sequences),

With just a sample ID, instead of hash + sample ID, this section breaks.

What's the recommended way to build this id_map without hashes?
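
One possible fallback, offered as a sketch rather than a recommended answer: when a header has no description, map the ID to itself. The example below swaps `skbio.io.read` for a tiny stand-in header parser so that it runs without dependencies; `read_fasta_headers` and `build_id_map` are hypothetical names, and the header layout `<new id> <original id>` is assumed from the id/description split in the snippet above.

```python
def read_fasta_headers(lines):
    """Yield (id, description) for each FASTA header line.

    The ID is everything up to the first space; the description is the rest
    (an empty string when the header holds only an ID).
    """
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            seq_id, _, description = header.partition(" ")
            yield seq_id, description


def build_id_map(lines):
    """Map each original read ID to the ID used in the dereplicated file."""
    id_map = {}
    for seq_id, description in read_fasta_headers(lines):
        # Relabeled records carry the original ID in the description;
        # unrelabeled records have no description, so the ID maps to itself.
        original = description if description else seq_id
        id_map[original] = seq_id
    return id_map


hashed = [">abc123 sample1_1", "ACGT", ">def456 sample2_7", "GGCC"]
plain = [">sample1_1", "ACGT", ">sample2_7", "GGCC"]
hashed_map = build_id_map(hashed)
plain_map = build_id_map(plain)
```

Here `hashed_map` maps each original ID to its new label, while `plain_map` is an identity mapping, so downstream code that looks up IDs in `id_map` keeps working in both cases. Whether an identity mapping is the right behavior for the real plugin is still the open question.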
