Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

demux: should golay-error-correction be False by default? #102

Open
nbokulich opened this issue May 26, 2019 · 9 comments
Open

demux: should golay-error-correction be False by default? #102

nbokulich opened this issue May 26, 2019 · 9 comments
Assignees

Comments

@nbokulich
Copy link
Member

nbokulich commented May 26, 2019

Proposed Behavior
I think yes, default should be False, but I see both sides of the argument:

  1. True: the emp-* methods are designed for EMP, after all. EMP uses golay barcodes (correct?) and hence anyone truly following that protocol would have golay (by default).
  2. False: many different primers and barcoding schemes (i.e., non-EMP) are compatible with the "EMP"-style method and I think many novice users will stumble into this, as demonstrated on the forum. Setting this to default will allow users to decide what they do.
@ChrisKeefe ChrisKeefe changed the title demux: should golay-error-correction should be False by default? demux: should golay-error-correction be False by default? Jun 25, 2019
@ebolyen
Copy link
Member

ebolyen commented Jun 25, 2019

Should it have a default? Maybe this is a required argument, not sure if that's actually better or not as False still demultiplexes fine.

@ChrisKeefe
Copy link
Collaborator

Setting the default to false could prevent many ongoing users from using the feature. They likely are not aware that it exists, and have less reason than new users to re-read documentation or tutorials.

Do we have a sense of how many emp-* users use non-standard (i.e. non-golay) barcodes? If it's a small minority, I like the current default; those users will soon have many forum responses to help them troubleshoot, and all standard-EMP barcode users will get the benefit of golay correction without the hassle of a new required parameter.

If it's a significant number, making this a required argument might be a good idea. It would be a minor inconvenience to all users, but would require them to better understand their lab protocols - not entirely a bad thing.

@nbokulich
Copy link
Member Author

not sure if that's actually better or not as False still demultiplexes fine.

And False would be consistent with the functionality prior to the latest release.

Do we have a sense of how many emp-* users use non-standard (i.e. non-golay) barcodes?

I think it's a great idea to hold a forum poll! There are some other examples out there for how to set up these polls... one of them actually concerns demux needs, so may already answer this question.

But I would bet many users have non-Golay barcodes. The "EMP" format was adopted and adapted by many other projects, so there are a few "traditions" floating out there that look like EMP but have different barcode schemes.

those users will soon have many forum responses to help them troubleshoot

I think that is overly optimistic. The most common questions on the forum are still asked frequently, perhaps because new users are not aware of the search functionality. In the case of this problem it is unlikely to go away because demux can fail for many reasons and whether golay barcodes are the problem is difficult to ascertain (perhaps we should implement a golay barcode validation method that users can run on their barcodes to check)

all standard-EMP barcode users will get the benefit of golay correction without the hassle of a new required parameter.

We can take this as a "straw poll": to my reckoning, few users asked about the lack of golay error correction before it was implemented. But now many users are complaining about the mysterious errors they are receiving (evidently because their barcodes are non-golay).

@thermokarst
Copy link
Contributor

👍 Thanks all. I like the idea of dropping the default, that seems like a simple way forward, but will break existing workflows for users. Something to think about.

Another option:
Technically, EMP protocol specifies Golay barcodes, anything else is just a "psuedo-EMP" approach. How about we take a two-prong approach: a) Tighten up our requirements for emp-single and emp-paired so that these methods honor the EMP protocol in a stricter sense and b) expose a new method (or two, depends on how far TypeMatch can take us) for "barcodes in a separate file AKA psuedo EMP" users. So, a) would see emp-* remove the parameter for golay-correction --- it would be True all the time, while in b) error-correction would either be False all the time, or, could be controlled via parameter (I haven't thought too hard about that yet). All the same base demux functionality could be shared behind the scenes --- this is really just about changing the user-facing API here. "Don't have EMP barcodes? Then don't use the EMP demux method(s)."

Thoughts? I have had a few side-line discussions with @wasade, @ebolyen, & @nbokulich, and will attempt to pull in some discussion points:

  • Whatever we do in the tutorials, people are likely going to copy in their own analysis. We could still wind up with just as many people accidentally running non-golay barcodes through the emp-* methods. My rebuttal here is that we can beef up the validation of these methods with this proposed refactor --- we can just outright tell users in this case: "you don't have golay/emp barcodes."
  • We might be able to layer on additional EMP-edness validation in a) above, such as pulling in a known set of 5' primers, perhaps mined from something like Qiita
  • How does this proposal fit into the larger discussion on barcode schemes? EMP, Hamming, DI, none. @wasade brought up the point that maybe there are opportunities to expose "tiers" of functionality in a plugin.

I am sure I missed or misunderstood some of these side discussions summarized above, please correct me!

Thanks!

PS - a brief summary of options that I see:

  1. Leave everything as-is, but update the "no reads mapped to samples" exception to recommend that users adjust the Golay setting as necessary.
  2. Drop the default Golay setting --- force users to pick an option manually each time.
  3. Change the default back to "False"
  4. The proposal above for stricter EMP and a new method or two for non-golay barcodes (but otherwise "looks" like EMP).
  5. ???

@wasade
Copy link
Member

wasade commented Jun 27, 2019

Thanks, @thermokarst! A slight addendum, the mining of Qiita would be for looking at the first few observed nt off the 5' end of the read for EMP (frequently TAC). And, could be really nice for differentiating dual/single indexing or just overload method names. E.g.,

single-index-emp-single-end
single-index-emp-paired-end
single-index-exact-match
single-index-hamming
single-index-variable
single-index-golay   # note, I *think* if 12nt then this would be the same as EMP
dual-index-hamming
dual-index-exact-match
dual-index-golay  # i dont think this is in use

@nbokulich
Copy link
Member Author

Just a thought, to tie together two ideas of @thermokarst :

  1. error-correction would either be False all the time, or, could be controlled via parameter (I haven't thought too hard about that yet).
  2. How does this proposal fit into the larger discussion on barcode schemes? EMP, Hamming, DI, none. @wasade brought up the point that maybe there are opportunities to expose "tiers" of functionality in a plugin.

Maybe make a generalized psuedo-emp method that can be expanded to accommodate all of these options, including EMP barcodes. The emp-* methods would hardcode the golay parameter, plus possibly provide EMP-specific validation as discussed above, but otherwise pseudo-emp (we really need a better name!) could theoretically process emp barcodes.

This will serve a few purposes:

  1. we keep EMP separate and sacred (and add neat validation)
  2. we supply a swiss army knife method for demuxing any type of multiplexed data with separate barcode reads
  3. the emp-* validation can diverge and expand beyond just golay barcodes. So we can imagine a future where folks use non-EMP barcodes, primers, etc, but the barcodes happen to be golay barcodes that are not on the EMP list (is that a thing?).

Proposed new names (I am just scratching my head here):
index-reads
from-index
fastq-barcodes
fastq-index

@ebolyen
Copy link
Member

ebolyen commented Jun 27, 2019

I am very much in support of having separate EMP only actions, with stricter validation, and then some number of catch-alls for other index-based demultiplexing. With respect to tutorials, I really like this idea:

My rebuttal here is that we can beef up the validation of these methods with this proposed refactor --- we can just outright tell users in this case: "you don't have golay/emp barcodes."

We could say: use pseudo-emp (better name pending) instead.


We can probably use TypeMap to make less actions (e.g. the single vs paired split becomes less interesting). However, we're lacking a way to creating a Union view-type for the python type annotation, so we'd have to use this hack in the meanwhile.

Maybe a dumb question, but do dual-index schemes have 4 files in total? If so, we'd need a new format for that, which means we could similarly disambiguate the view-types and overload the action name like for single/paired.

As for hamming/variable/golay/none these all seem like things that could work as parameters just fine (required in this case I would argue), since they just change the specific treatment of the index file(s), not the general structure of things.


For names, I like something along the lines of from-index if we were to overload all the things, otherwise @wasade's names are very clear.

@mestaki
Copy link

mestaki commented May 18, 2020

Just something I noticed that may belong here for a mini-discussion regarding default values.

https://forum.qiime2.org/t/questions-about-import-paired-end-sequencing-data/12531/3

False (according to the method defaults). The mapping barcodes must be reverse-complemented to be considered valid golay barcodes.

So if I'm reading this correctly, in order for Golay error-correction to be valid the users need to be including both rev-comp option if they have true EMP daata?

  --p-rev-comp-barcodes
  --p-rev-comp-mapping-barcodes

But this isn't set by default, and shouldn't this plugin technically be ready to go for EMP data with defaults?

If ^ is true then I would propose that these be set TRUE by default. If not true and I'm misreading then just ignore this!

@thermokarst
Copy link
Contributor

But this isn't set by default, and shouldn't this plugin technically be ready to go for EMP data with defaults?

Great question - I would need to go look up the EMP to be more informed. Pinging @gregcaporaso - do you recall why the param defaults are what they are in this plugin?

@gregcaporaso gregcaporaso self-assigned this Sep 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants