Provenance for some CbP relationships is lost during condense() operation #28
Comments
Hetionet uses DrugBank version 4.2 as processed in
In the case of Haloperidol-binds-DRD2, I think these actions are coming from ChEMBL, not DrugBank. Notice the following edge property:
If you go to either of the ChEMBL URLs above, you'll see the following table, which does contain "inverse agonist" (link).
For reference, we combined multiple sources of Compound-binds-Gene relationships in the
Okay, so I will interpret an edge with many actions as all of them being separately true, even if there are some conflicts. I was looking through the Jupyter notebook, and it seems that the actions and sources lists are generated using the following code block:

def condense(df):
    """Combine gene-compound relationships"""
    row = pandas.Series()
    row['sources'] = set(itertools.chain.from_iterable(df.sources))
    row['pubmed_ids'] = set(itertools.chain.from_iterable(df.pubmed_ids))
    row['actions'] = set(itertools.chain.from_iterable(df.actions))
    row['affinity_nM'] = df.affinity_nM.mean(skipna=True)
    row['license'] = get_license(row['sources'])
    row['urls'] = set(itertools.chain.from_iterable(df.urls))
    return row

Each property is flattened into its own set across all rows, so the information about which action comes from which source is not maintained. As far as I know, the Neo4j schema is a bit limiting when it comes to having JSON/dictionary objects as property values, but it would be nice to be able to figure out from the final data what the provenance for each relationship was. Maybe an appropriate data structure would be parallel lists, at the cost of being a bit repetitive.
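To make the parallel-lists suggestion concrete, here is a minimal sketch in plain Python. It is not the notebook's code: the function name `parallel_provenance`, the `action_sources` property, and the sample records are all hypothetical, and the DrugBank/ChEMBL split is only an illustration of what this thread describes.

```python
def parallel_provenance(records):
    """Flatten per-source records into two index-aligned lists.

    `records` is an iterable of dicts with 'sources' and 'actions' keys,
    loosely mirroring the rows that condense() receives (assumed layout).
    """
    pairs = set()
    for record in records:
        for source in record["sources"]:
            for action in record["actions"]:
                pairs.add((action, source))
    ordered = sorted(pairs)  # deterministic order keeps the lists aligned
    return {
        "actions": [action for action, _ in ordered],
        "action_sources": [source for _, source in ordered],
    }


# Illustrative records only, not actual Hetionet data.
records = [
    {"sources": ["DrugBank"], "actions": ["antagonist"]},
    {"sources": ["ChEMBL"], "actions": ["inverse agonist"]},
]
edge = parallel_provenance(records)
for action, source in zip(edge["actions"], edge["action_sources"]):
    print(f"{action} <- {source}")
```

Both lists hold only strings, so each could be stored as an ordinary array property. The cost is the repetition mentioned above: an action reported by two sources appears twice.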
Referencing https://stackoverflow.com/a/38026494/4651668. I don't remember whether this limitation influenced how I encoded these properties. "parallel lists" or json-encoded text could be a sufficient workaround.
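The JSON-encoded text workaround could look roughly like the sketch below, which stores a source-to-actions mapping as a single string property. The function name `encode_action_provenance` and the sample records are hypothetical; nothing here reflects how Hetionet actually encodes its properties.

```python
import json


def encode_action_provenance(records):
    """Serialize {source: [actions]} to a JSON string.

    The result is a plain string, so it fits in a property that cannot
    hold a nested map.
    """
    by_source = {}
    for record in records:
        for source in record["sources"]:
            by_source.setdefault(source, set()).update(record["actions"])
    # Sets are not JSON-serializable, so convert each to a sorted list.
    serializable = {source: sorted(actions) for source, actions in by_source.items()}
    return json.dumps(serializable, sort_keys=True)


# Illustrative records only, not actual Hetionet data.
records = [
    {"sources": ["DrugBank"], "actions": ["antagonist"]},
    {"sources": ["ChEMBL"], "actions": ["inverse agonist"]},
]
encoded = encode_action_provenance(records)
decoded = json.loads(encoded)  # consumers decode the string on read
```

The trade-off is that the string is opaque to Cypher, so filtering on a particular source would have to happen client-side after decoding.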
Yeah. Good lesson for the future. If need be, we could potentially create some sort of mapping from Neo4j relationship ID to full provenance info for CbP edges. Not as good as having it in the database, but hopefully an acceptable workaround?
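One way such an external mapping might be shipped is as a JSON Lines sidecar keyed by relationship ID, sketched below. The helper names, the ID `12345`, and the provenance payload are all made up for illustration. Since Neo4j internal IDs are not guaranteed to stay stable across database rebuilds, keying on the compound and gene identifiers instead may be more robust.

```python
import json


def dump_sidecar(provenance_by_rel_id):
    """Render {relationship_id: provenance} as JSON Lines text."""
    lines = []
    for rel_id in sorted(provenance_by_rel_id):
        entry = {"rel_id": rel_id, "provenance": provenance_by_rel_id[rel_id]}
        lines.append(json.dumps(entry, sort_keys=True))
    return "\n".join(lines)


def load_sidecar(text):
    """Parse JSON Lines text back into a lookup dict."""
    lookup = {}
    for line in text.splitlines():
        if line.strip():
            entry = json.loads(line)
            lookup[entry["rel_id"]] = entry["provenance"]
    return lookup


# Hypothetical relationship ID and payload, for illustration only.
sidecar_text = dump_sidecar(
    {12345: {"DrugBank": ["antagonist"], "ChEMBL": ["inverse agonist"]}}
)
lookup = load_sidecar(sidecar_text)
print(lookup[12345]["ChEMBL"])
```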
I'm going through metadata in the compound-binds-gene relationships, and taking a specific look at the actions lists. In many examples, there are several actions, such as with drugbank:DB00502 binds ncbigene:1813. In the JSON GZ export, there are two actions listed: ['antagonist', 'inverse agonist']. I made a query to the Neo4j instance to confirm this is also true there.
However, on DrugBank I could only find the antagonist label. Is it the case that the DrugBank source data that gets parsed and converted in Hetionet contains extra information that doesn't make it to the web page I linked? If so, do you have any idea how they pick which of the many actions gets displayed?