ENH: `dereplicate-sequences` expose parameter to disable sequence hash IDs #55

nbokulich · 2018-08-20T15:30:54Z

Improvement Description
Similar to q2-dada2 and q2-deblur, there should be an option to use the unhashed sequences as their own IDs instead of using a hash ID in dereplicate-sequences.

Current Behavior
Seq hashes are used by default.

Proposed Behavior
Expose a --p-hashed-feature-ids parameter to choose how sequence IDs get handled.

References

forum xref

colinbrislawn · 2018-08-23T18:04:53Z

Should we request this as a feature of vsearch? Vsearch currently supports:

--relabel string
  Relabel sequences using the prefix string and a ticker
--relabel_md5
--relabel_sha1

Colin

colinbrislawn · 2018-08-27T18:41:35Z

Wait... there are several, nested feature requests here!

expose a parameter to control how reads are labeled after derep
- add setting to use the ID of the first sequence encountered (vsearch's default)
- add setting to use the sequence of the read as the ID (does any software do this?)

Is this what we want?

>ACTTTTTTG
ACTTTTTTG

Having a sequence with identical ID and sequence seems a little silly to me, but if both dada2 and deblur implement this natively, then I'm comfortable requesting it for vsearch. However, if this is an option within the q2 plugins, maybe we should implement this within Q2-vsearch.

Colin

colinbrislawn · 2019-08-30T19:18:35Z

Hello @torognes, what do you think about a --relabel_self option in vsearch that relabels FASTA headers so they are identical to their sequences? Like this:

>GCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGT
GCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGT
>CCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTG
CCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTG
>CCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGC
CCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGC

torognes · 2019-09-02T08:36:40Z

Hi @colinbrislawn, yes, that's a feature that should be easy to add to vsearch. I'll add it to issues for vsearch and implement it soon.

colinbrislawn · 2019-09-02T15:33:35Z

Thanks @torognes!

@nbokulich I'll add --p-hashed-feature-ids / --p-no-hashed-feature-ids to match dada2 and deblur.

As far as I can see, the reads will be hashed with sha1, which conflicts with the md5 of dada2...
Should we make an option for other values or keep vsearch consistent with dada2 and deblur?

nbokulich · 2019-09-02T16:55:44Z

Thanks @colinbrislawn !

Looks like VSEARCH has both --relabel_md5 and --relabel_sha1 options. So in q2-vsearch instead of a boolean option hashed_feature_ids you could make this a multi-choice string. Something like: hashed_feature_ids = Str % Choices(['md5', 'sha1', 'unhashed'])
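
For illustration, the three choices could map onto feature IDs roughly as follows. This is a minimal pure-Python sketch, not the q2-vsearch implementation: the helper name `relabel` is hypothetical, the mode names are taken from the suggestion above, and it assumes the hash is simply the hex digest of the sequence string as written (any case normalization vsearch may apply is not modeled here).

```python
import hashlib


def relabel(sequence: str, mode: str = "sha1") -> str:
    """Return a feature ID for `sequence` (hypothetical helper, illustration only)."""
    if mode == "unhashed":
        # The sequence is used as its own ID.
        return sequence
    if mode in ("md5", "sha1"):
        # Hex digest of the sequence bytes, using the chosen algorithm.
        return hashlib.new(mode, sequence.encode("ascii")).hexdigest()
    raise ValueError(f"unknown mode: {mode!r}")


seq = "ACTTTTTTG"
for mode in ("md5", "sha1", "unhashed"):
    print(mode, relabel(seq, mode))
```

A single string-valued parameter keeps all three behaviours behind one option, which is why it may be preferable to a boolean here.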

colinbrislawn · 2019-09-20T20:04:24Z

So --relabel_self is now in vsearch v2.14.0 and up. All our options are on the table.

Looks like both this issue and #48 can't be closed until the vsearch version is bumped. While we wait for the bump, I'll try to get this PR submitted before the October 18th deadline.

colinbrislawn · 2022-09-13T22:02:32Z

It looks like removing the hashes breaks this section:

 id_map = {e.metadata['description']: e.metadata['id']
              for e in skbio.io.read(str(dereplicated_sequences),

With just a sample ID, instead of hash + sample ID, this section breaks.

What's the recommended way to build this id_map without hashes?
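
One way to see why the lookup fails, as a minimal sketch under stated assumptions: it supposes the relabeled headers look like `>hash sample_id` (hash plus sample ID, as described above) and that the FASTA reader treats the first whitespace-delimited token as `id` and the remainder as `description`. The helper `split_header` and the example headers are hypothetical and only imitate that split; the sketch does not call skbio.

```python
def split_header(header: str) -> dict:
    """Split a FASTA header into id and description (hypothetical helper)."""
    text = header.lstrip(">").strip()
    parts = text.split(None, 1)  # split once on any whitespace
    return {
        "id": parts[0] if parts else "",
        "description": parts[1] if len(parts) > 1 else "",
    }


# Made-up headers: one with a hash label plus sample ID, two with sample ID only.
hashed = [split_header(">da39a3ee sample1_1")]
plain = [split_header(">sample1_1"), split_header(">sample1_2")]

# Same shape as the id_map comprehension: description -> id.
hashed_map = {e["description"]: e["id"] for e in hashed}
plain_map = {e["description"]: e["id"] for e in plain}

print(hashed_map)
print(plain_map)
```

Under these assumptions, every unhashed record has an empty description, so a dict keyed on the description collapses to a single entry.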

nbokulich added the src:forum From the QIIME 2 Forum. label Aug 20, 2018

torognes mentioned this issue Sep 2, 2019

Add option relabel_self to use sequence itself as a label in FASTA/FASTQ torognes/vsearch#384

Closed

colinbrislawn mentioned this issue Sep 13, 2022

imp: adds more options to dereplicate_sequences() #86

Merged
