Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC]: Add functions guess_parse and guess_alphabet #292

Merged
merged 7 commits into from
Oct 22, 2024

Conversation

jakobnissen
Copy link
Member

This PR creates the functions guess_parse and guess_alphabet, which infers an appropriate alphabet for the sequence:

julia> guess_parse("UAUHVCG")
7nt RNA Sequence:
UAUHVCG

julia> guess_parse("LVVWKREFVL")
10aa Amino Acid Sequence:
LVVWKREFVL

Notes for reviewers

  • This PR does not implement a good API for recoverable parsing, to be used in libraries. That is, it's not a stab at Add tryparse(::BioSequence, s::AbstractString) or similar #224 . Rather, it's intended for interactive REPL work
  • The name could use some bikeshedding! 😄 Ideally, the name
    • Is short. It'll be used in the REPL, after all
    • Makes it clear that this function is a heuristic / loose / guessing function
  • Should we have a macro for this? guess"TAGTGCA" or whatever?

Closes #268
Does not close the similar #224

@codecov
Copy link

codecov bot commented Oct 16, 2023

Codecov Report

Attention: Patch coverage is 95.65217% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.73%. Comparing base (95d9218) to head (a9a233e).
Report is 10 commits behind head on master.

Files with missing lines Patch % Lines
src/alphabet.jl 94.11% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #292      +/-   ##
==========================================
- Coverage   90.87%   90.73%   -0.14%     
==========================================
  Files          31       31              
  Lines        2400     2419      +19     
==========================================
+ Hits         2181     2195      +14     
- Misses        219      224       +5     
Flag Coverage Δ
unittests 90.73% <95.65%> (-0.14%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jakobnissen jakobnissen marked this pull request as ready for review October 20, 2023 12:23
@kescobo
Copy link
Member

kescobo commented Jan 17, 2024

In terms of name, if the goal is for easy typing, I think underscores are a no no (for me personally). guess or guessparse seem ok.

I do like guess"AATTCC", but if you're copying a sequence, you probably know if it's aa or dna or whatever - I'd expect to us this more in a loop or something, where the macro form is less useful.

Copy link
Member

@kescobo kescobo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dig it! The bit shifting stuff is inscrutable to me, but I like the tests, and the caveats are sufficiently documented IMO

@kescobo kescobo self-requested a review January 17, 2024 20:22
Copy link
Member

@kescobo kescobo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meant to approve

@cjprybol
Copy link
Contributor

Like Kevin, I can't say I feel like I'm much help on the bit-code, but the rest of this looks great and I'm very excited about this functionality. I don't know if I'm the only one, but it feels like so few bioinformatics tools do any pre-processing validation of fasta files on their own. The number of hours I've wasted debugging code when someone throws a protein fasta into a collection of DNA fastas and uses .fasta for both instead of .fna & .faa extensions 😅

@jakobnissen
Copy link
Member Author

Thanks for your inputs! I'd like to merge soon.
I'm still a bit torn on the name though. I agree that guess_parse is awkward. But I also don't like that neither the name nor the argument says anything about what it parses into.
Maybe bioseq?

It's possible we might want a real API at some point to detect compatible
alphabets for a given input, but it's not trivial:
* What do we do about user-defined alphabets?
* Can we accept parsing the whole sequence twice - once to detect the alphabet,
  and once to construct the sequence?
* Might there be some downstream problems caused by giving the users functions
  to create type instability in their packages?
@jakobnissen jakobnissen merged commit b47c3d6 into BioJulia:master Oct 22, 2024
22 checks passed
@jakobnissen jakobnissen deleted the guessparse2 branch October 22, 2024 12:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Record type inference
3 participants