Mapping scientific and common names
Gavin Huttley edited this page Oct 17, 2019 · 1 revision
ensembldb relies on mapping "species" names to common names to simplify creating Genome and Compara instances. The following script is one way of exporting this content from the Ensembl database.
The "common" names this script produces are not always what you want. For instance, there are multiple members of the Drosophila genus, making the common name "fruitfly" ambiguous. Accordingly, the species.tsv file distributed with ensembldb3 is an edited version of the script's output.
import os
from collections import defaultdict

import sqlalchemy as sql

from ensembldb3 import HostAccount, Compara

# ENSEMBL_ACCOUNT is split on whitespace to give the HostAccount arguments
account = HostAccount(*os.environ['ENSEMBL_ACCOUNT'].split())
compara = Compara(['human', 'mouse', 'dog', 'platypus'], release=85,
                  account=account)

# join each genome record to its NCBI taxon names via taxon_id
gen_db = compara.ComparaDb.get_table('genome_db')
ncbi_db = compara.ComparaDb.get_table('ncbi_taxa_name')
joined = gen_db.outerjoin(ncbi_db, gen_db.c.taxon_id==ncbi_db.c.taxon_id)

# keep only the Ensembl alias (common name) and scientific name records
query = sql.select([joined], use_labels=True,
                   whereclause=sql.or_(ncbi_db.c.name_class=='ensembl alias name',
                                       ncbi_db.c.name_class=='scientific name'))
recs = query.execute().fetchall()

# mapping[database name] -> {name class: name}
mapping = defaultdict(dict)
for r in recs:
    names = {r['ncbi_taxa_name_name_class']: r['ncbi_taxa_name_name']}
    mapping[r['genome_db_name']].update(names)

rows = []
for db in mapping:
    # derive a species name from the database name,
    # e.g. 'homo_sapiens' -> 'Homo sapiens'
    sci = db.split('_')
    sci[0] = sci[0].capitalize()
    sci = ' '.join(sci)
    # record the NCBI scientific name as a synonym only when it differs
    db_sci = mapping[db]['scientific name']
    syn = '' if db_sci.lower() == sci.lower() else db_sci
    row = [sci, mapping[db]['ensembl alias name'], syn]
    rows.append(row)

# write sorted, tab-delimited rows: species name, common name, synonym
rows = sorted(rows)
rows = ['\t'.join(r) for r in rows]
with open('species.tsv', 'wt') as out:
    out.write('\n'.join(rows))
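
Before hand-editing the exported file, it can help to list which common names are shared by more than one species, as in the "fruitfly" case mentioned above. The following is a minimal sketch, not part of ensembldb3: the function name `ambiguous_common_names` and the example rows are hypothetical, and it assumes the three-column layout (species name, common name, synonym) that the script above writes.

```python
import csv
import io
from collections import defaultdict


def ambiguous_common_names(lines):
    """Return {common name: [species names]} for common names used by
    more than one species. `lines` is any iterable of tab-delimited text
    lines, such as an open species.tsv file."""
    by_common = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        species, common = row[0], row[1]
        # compare common names case-insensitively
        by_common[common.strip().lower()].append(species)
    return {name: sorted(spp) for name, spp in by_common.items() if len(spp) > 1}


# Made-up rows for illustration only; they are not taken from the real export
example = io.StringIO(
    "Drosophila melanogaster\tfruitfly\t\n"
    "Drosophila simulans\tfruitfly\t\n"
    "Homo sapiens\thuman\t\n"
)
print(ambiguous_common_names(example))
```

To check a real export, pass an open file instead of the in-memory example, for instance `with open('species.tsv', newline='') as f: print(ambiguous_common_names(f))`. Any name reported is a candidate for manual editing.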