-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Properly treat blank ancestral alleles #963
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #963 +/- ##
==========================================
+ Coverage 93.13% 93.15% +0.01%
==========================================
Files 18 18
Lines 6354 6368 +14
Branches 1136 1142 +6
==========================================
+ Hits 5918 5932 +14
Misses 296 296
Partials 140 140
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @hyanwong
I've thought about it more and think that unknown should be explicitly N
or another character as that is usually the case in FASTA etc. ""
would probably only get into the ancestral allele as a mistake right?
I don't think we should assign any meaning to the empty string - as @benjeffery "N" is a much better signal for unknown from the FASTA. |
The problem with "N" is surely that it could match "N" in the actual VCF? It would require extra hardcoded logic not to match "N" against existing alleles. It seemed a bit neater to me to pick something that explicitly isn't allowed as an allele in a VCF file. I guess it's fine hardcoding a specific string like Note that inference will find a different ancestral allele from any what we set here. |
Just looking at how to do this. What alleles strings do we want to treat as "unknown ancestral state"? I assume If we want " |
I changed the behaviour to using (and documenting) |
121b11d
to
76a155f
Compare
I think this is as you wanted now @jeromekelleher ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just needs a bit of simplification
tsinfer/formats.py
Outdated
except IndexError: | ||
unknown_alleles[allele] += 1 | ||
converted[i] = allele_index | ||
if not (allele in {"", "N", "n"}): # All these must represent unknown |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't treat empty string as legal input here as it has a defined meaning in VCZ
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now done. See comment below about changing the name to "ancestral_state", to match what we used previously.
d4c8f49
to
fc16222
Compare
As discussed with @jeromekelleher: we warn on unknown ancestral states, unless they are all "N" or "n", in which case we merely output info. Note that after discussion with @benjeffery, this also changes the parameter name to "ancestral_state" (to match tskit, and so that it is clearer that it differs from the index (which is known as the "ancestral allele"). It also fixes #927. I have tested it on a 200Mb 1000 genomes VCF, and the check does not noticeably increase the construction time of the VariantData object (which in this case was ~0.7 secs) |
Also note that I duplicated the |
fc16222
to
15c500a
Compare
…known" state Also document the class
15c500a
to
15705f9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
There was a bug in the previous code: since "" is the fill-value for alleles, if an ancestral allele was "", it wasn't treated as missing.
This PR fixes that bug, but also fixes #962: the string
"""N" is treated as meaning "deliberately set the ancestral allele as unknown".This is OK because "" is not a valid VCZ allele string. Therefore if the only ancestral alleles that don't match are "", then it's confusing to issue a warning, because setting unknown alleles is a deliberate user choice.