--add-relatives outputs unexpected sequence ID #99

glajoie1 · 2021-04-16T15:41:02Z

Hello,

I have been using SINA on a 16S sequences fasta file with the following command-line to obtain an alignment that included neighbour sequences, as in the online ACT implementation for small sequence sets. The reference database was downloaded from Silva.

sina -i ~/asv_ps20.fa -r ~/SILVA_138.1_SSURef_NR99_12_06_20_opt.arb -o aligned.fasta.gz -o aligned.csv --add-relatives=15

In the output alignment file, I was expecting the 'relatives' sequences to correspond to the reference sequences identified in the align_filter_slv column of the output (e.g. JF769553.1, KJ855315.1), but I am instead getting sequence IDs that are not retrievable in the Silva reference database (e.g. GYJUndar, UncCy339). The same thing happens when I add the '--search' flag.

Is there a way to get the sequences identified in the align_filter_slv column in the alignment file with the query sequence? (Or get information on name matching if this is a formatting issue?)

Thank you very much for your software!

epruesse · 2021-04-17T05:59:43Z

Those are the "ARB names". Each sequence in ARB has a couple of metadata fields: "acc" holds the accession number and "name" holds the name that you are seeing. It's an ID generated from the sequence description ("UncCy339" will be something uncultured) such that it's unique for accession + start position (to account for genomes with multiple 16S).

In theory, you should be able to export the accession into the CSV using -f acc. In practice, that doesn't seem to be working. I'll mark this as a bug. Also, the accession should always be listed in the CSV, I think.

glajoie1 · 2021-04-19T15:31:40Z

Ok - thank you for the information. The accession was not listed in the csv, so I generated a mapping file of the arb names to the silva accession numbers and taxonomy through the arb software using the SILVA_138.1_SSURef_NR99_12_06_20_opt.arb database.
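For anyone taking the same route, a minimal Python sketch of applying such a mapping to the alignment headers follows. It assumes a tab-separated export with the columns name, acc and start; that layout, the function names and the relabelled header format are all hypothetical, not something SINA or ARB produces by default.

```python
import csv

def load_name_map(handle):
    """Read a tab-separated mapping with a header row: name, acc, start.

    The column layout is an assumption about how the mapping was exported.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["name"]: (row["acc"], row["start"]) for row in reader}

def relabel_fasta(lines, name_map):
    """Yield FASTA lines, replacing known ARB names in headers.

    Headers whose ID is not in the mapping (e.g. the query sequences)
    are passed through unchanged. Trailing newlines are stripped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            rest = parts[1] if len(parts) > 1 else ""
            if seq_id in name_map:
                acc, start = name_map[seq_id]
                # Keep the ARB name in the description so nothing is lost.
                line = f">{acc}_{start} arb_name={seq_id}"
                if rest:
                    line += f" {rest}"
        yield line
```

Since the command above writes aligned.fasta.gz, the lines could be read with gzip.open(path, "rt") from the standard library and passed straight to relabel_fasta.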

epruesse · 2021-04-20T18:29:44Z

Just be aware you might get duplicates on the acc alone. In SILVA, acc + start uniquely identify an SSU/LSU sequence, with start being the position of the sequence's first base within the sequence of its accession number.
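A quick way to check a mapping for this is sketched below. It assumes the mapping has already been loaded as (name, acc, start) tuples; the function name and the tuple layout are hypothetical.

```python
from collections import defaultdict

def find_ambiguous_accessions(rows):
    """Return accessions that occur more than once in the mapping.

    rows: iterable of (name, acc, start) tuples (assumed layout).
    The result maps each repeated acc to a sorted list of (start, name),
    showing which entries need acc + start to be told apart.
    """
    by_acc = defaultdict(list)
    for name, acc, start in rows:
        by_acc[acc].append((int(start), name))
    return {acc: sorted(hits) for acc, hits in by_acc.items() if len(hits) > 1}
```

An empty result means acc alone is enough as a key for that particular mapping; otherwise the listed entries should be keyed on the (acc, start) pair.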

epruesse added the bug label Apr 17, 2021

epruesse added this to the 1.7.3 milestone Sep 15, 2021
