[RFC]: Add functions `guess_parse` and `guess_alphabet` #292

jakobnissen · 2023-10-16T10:20:37Z

This PR adds the functions `guess_parse` and `guess_alphabet`, which infer an appropriate alphabet for a sequence:

julia> guess_parse("UAUHVCG")
7nt RNA Sequence:
UAUHVCG

julia> guess_parse("LVVWKREFVL")
10aa Amino Acid Sequence:
LVVWKREFVL
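
As a rough illustration of the idea behind the two examples above, here is a minimal sketch in plain Julia. It is not the code in this PR: the name `guess_kind`, the three character sets, and the order in which they are tried (DNA, then RNA, then amino acids) are all assumptions made for the example, and the real implementation works on bytes rather than on `Set{Char}` lookups.

```julia
# Hypothetical sketch, not the PR's implementation.
# Character sets follow the usual IUPAC codes plus a gap symbol.
const DNA_CHARS = Set("ACGTMRSVWYHKDBN-")
const RNA_CHARS = Set("ACGUMRSVWYHKDBN-")
const AA_CHARS  = Set("ARNDCQEGHILKMFPSTWYVOUBJZX*-")

# Return the first label whose character set contains every symbol of `s`,
# or `nothing` if no candidate set fits. The candidate order is an assumption.
function guess_kind(s::AbstractString)
    for (label, chars) in ((:DNA, DNA_CHARS), (:RNA, RNA_CHARS), (:AA, AA_CHARS))
        all(c -> uppercase(c) in chars, s) && return label
    end
    return nothing
end

guess_kind("UAUHVCG")     # :RNA, because U rules out DNA
guess_kind("LVVWKREFVL")  # :AA, because L fits neither nucleotide set
```

Because the candidates are tried in a fixed order, an ambiguous input such as "ACGT" is reported as the first match, which is why a function like this is a heuristic rather than a validator.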

Notes for reviewers

- This PR does not implement a good API for recoverable parsing that libraries could use. That is, it's not a stab at #224 (Add `tryparse(::BioSequence, s::AbstractString)` or similar). Rather, it's intended for interactive REPL work.
- The name could use some bikeshedding! 😄 Ideally, the name
  - Is short. It'll be used in the REPL, after all
  - Makes it clear that this function is a heuristic / loose / guessing function
- Should we have a macro for this? `guess"TAGTGCA"` or whatever?

Closes #268
Does not close the similar #224

codecov · 2023-10-16T11:19:36Z

Codecov Report

Attention: Patch coverage is 95.65217% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.73%. Comparing base (95d9218) to head (a9a233e).
Report is 10 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/alphabet.jl | 94.11% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #292      +/-   ##
==========================================
- Coverage   90.87%   90.73%   -0.14%     
==========================================
  Files          31       31              
  Lines        2400     2419      +19     
==========================================
+ Hits         2181     2195      +14     
- Misses        219      224       +5

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `90.73% <95.65%> (-0.14%)` | ⬇️ |


kescobo · 2024-01-17T20:20:05Z

In terms of name, if the goal is easy typing, I think underscores are a no-no (for me personally). `guess` or `guessparse` seem ok.

I do like `guess"AATTCC"`, but if you're copying a sequence, you probably know if it's aa or dna or whatever. I'd expect to use this more in a loop or something, where the macro form is less useful.

kescobo

I dig it! The bit shifting stuff is inscrutable to me, but I like the tests, and the caveats are sufficiently documented IMO

kescobo

Meant to approve

cjprybol · 2024-01-20T10:40:39Z

Like Kevin, I can't say I feel like I'm much help on the bit-code, but the rest of this looks great and I'm very excited about this functionality. I don't know if I'm the only one, but it feels like so few bioinformatics tools do any pre-processing validation of fasta files on their own. The number of hours I've wasted debugging code when someone throws a protein fasta into a collection of DNA fastas and uses .fasta for both instead of .fna & .faa extensions 😅

jakobnissen · 2024-01-21T16:16:05Z

Thanks for your inputs! I'd like to merge soon.
I'm still a bit torn on the name though. I agree that guess_parse is awkward. But I also don't like that neither the name nor the argument says anything about what it parses into.
Maybe bioseq?

It's possible we might want a real API at some point to detect compatible alphabets for a given input, but it's not trivial:
- What do we do about user-defined alphabets?
- Can we accept parsing the whole sequence twice - once to detect the alphabet, and once to construct the sequence?
- Might there be some downstream problems caused by giving the users functions to create type instability in their packages?
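
To make the type-instability concern concrete, here is a small sketch in plain Julia. The types `ToyDNA` and `ToyAA` and the functions `toy_guess` and `count_symbols` are hypothetical stand-ins invented for this example; they are not part of BioSequences.jl.

```julia
# Hypothetical stand-in types; the real sequence types live in BioSequences.jl.
struct ToyDNA; data::String; end
struct ToyAA;  data::String; end

# The concrete return type depends on the *value* of `s`, not on its type,
# so callers cannot know statically which of the two types they will get.
toy_guess(s::String) = all(in("ACGT"), s) ? ToyDNA(s) : ToyAA(s)

# A function barrier: dispatch happens once at the call, and each method
# body is compiled for a single concrete type.
count_symbols(seq::ToyDNA) = length(seq.data)
count_symbols(seq::ToyAA)  = length(seq.data)

count_symbols(toy_guess("ACGT"))  # dispatches to the ToyDNA method
count_symbols(toy_guess("LVVW"))  # dispatches to the ToyAA method
```

Running `@code_warntype toy_guess("ACGT")` at the REPL should show a `Union` return type. That is harmless for interactive use, but inside a package's hot loop it can force dynamic dispatch unless the result is passed through a function barrier like `count_symbols`.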

jakobnissen requested review from kescobo and CiaranOMara October 16, 2023 10:20

jakobnissen marked this pull request as ready for review October 20, 2023 12:23

jakobnissen force-pushed the guessparse2 branch from 8ced555 to ee245ee Compare January 15, 2024 11:40

kescobo reviewed Jan 17, 2024

kescobo self-requested a review January 17, 2024 20:22

kescobo approved these changes Jan 17, 2024

cjprybol approved these changes Jan 20, 2024

jakobnissen added 7 commits October 22, 2024 14:14

Fixup: Tests

9b11e98

Fixup: Test with AbstractString

eca93d3

Fixup: Do not copy AbstractString

bf018b9

Rename guess_parse to bioseq

40c8bba

Fix typo

0ac2d93

Update changelog etc

a9a233e

jakobnissen force-pushed the guessparse2 branch from 3ae5494 to a9a233e Compare October 22, 2024 12:17

jakobnissen merged commit b47c3d6 into BioJulia:master Oct 22, 2024
22 checks passed

jakobnissen deleted the guessparse2 branch October 22, 2024 12:33
