Submitter Names lack spaces in dataformat tsv virus-genome compared to genbank file serialization #336

Open
corneliusroemer opened this issue Mar 21, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@corneliusroemer

Describe the bug
The Submitter Names field is serialized as LAST NAME,FIRST NAME INITIALS,LAST NAME,FIRST NAME INITIALS... This format is not very robust: the same comma separates the parts of one name and one individual from the next, so individuals are not clearly delimited. Is this on purpose? If so, why?

When I look up the original GenBank file for a sequence, there is a space after the initials, before the next last name.

Compare the output of

    datasets download virus genome taxon 186538 --no-progressbar --filename results/ncbi_dataset.zip
    dataformat tsv virus-genome --package results/ncbi_dataset.zip --fields submitter-names

for a record such as OR084927 with what is shown in the corresponding .gb file.

CLI output: Kinganda-Lusamaki,E.,Whitmer,S.,Lokilo-Lofiko,E.,Amuri-Aziza,A.,Muyembe-Mawete,F.,Makangara-Cigolo,J.C.,...
GenBank file: Kinganda-Lusamaki,E., Whitmer,S., Lokilo-Lofiko,E., Amuri-Aziza,A., Muyembe-Mawete,F., Makangara-Cigolo,J.C.,

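Note that the GenBank file separates individuals with a space after the comma, which is prudent. Without it, the only way to recover individuals is to pair up the comma-separated tokens (last name, then initials) and hope that the parity holds across long strings. A single entry with a missing or extra token would shift every pair after it.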
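To make the parity concern concrete, here is a minimal Python sketch of how a consumer might split each format. The function names are my own, and the last input string uses made-up names (including a hypothetical group author without initials) purely for illustration; I have not checked whether such entries actually occur in the data.

```python
def split_genbank_style(names: str) -> list[str]:
    """Split 'Last,F.I., Last,F.I.,' on the ', ' that separates individuals."""
    parts = (p.strip().rstrip(",") for p in names.split(", "))
    return [p for p in parts if p]


def split_compact_style(names: str) -> list[str]:
    """Split 'Last,F.I.,Last,F.I.' by pairing consecutive tokens.

    Assumes every individual contributes exactly two tokens
    (last name, initials); this is the parity assumption.
    """
    tokens = [t for t in names.split(",") if t]
    if len(tokens) % 2:
        raise ValueError("odd token count: cannot pair last names with initials")
    return [f"{tokens[i]},{tokens[i + 1]}" for i in range(0, len(tokens), 2)]


# First three names from the issue, in both serializations.
cli_output = "Kinganda-Lusamaki,E.,Whitmer,S.,Lokilo-Lofiko,E."
genbank_output = "Kinganda-Lusamaki,E., Whitmer,S., Lokilo-Lofiko,E.,"

# Both parsers agree as long as the parity assumption holds.
assert split_compact_style(cli_output) == split_genbank_style(genbank_output)

# Hypothetical input: two single-token entries keep the token count even,
# so pairing does not fail but silently produces wrong individuals.
hypothetical = "Smith,J.,GroupX,Doe,A.,Roe,B.,GroupY"
mispaired = split_compact_style(hypothetical)
# mispaired == ["Smith,J.", "GroupX,Doe", "A.,Roe", "B.,GroupY"]
```

With the space-delimited form, an entry that lacks initials stays isolated and the remaining names still split correctly, which is why the GenBank separator is the safer serialization.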

@corneliusroemer added the bug label on Mar 21, 2024
@olearyna
Contributor

Hi corneliusroemer,

Thanks, we'll look into it.

Nuala
