Add identifiers to all FamPlex entries #87
- This adds identifiers and descriptions to all entities.
- For now, the old files will stay until there's a good reason to commit to this new format.
- TODO: The check_references.py script still needs to be rewritten for the new format.
During this process I noticed I was writing the identifiers and names backwards. That's why we test!
[skip ci]
Still can't run interpro, reactome, or signor successfully
Wow, that's a serious PR! I like a lot of things about this, a couple of points to discuss:
I agree this is a problem, it wouldn't be a bad idea to do an audit on what's using FamPlex and where. For example, it might make sense to import a FamPlex package that already is a "famplex_client" for use in INDRA and Gilda.
Martin always joked that a good way to determine how much a web service is used is to turn it off and see how long it takes for someone to email and complain. Maybe that strategy would work here? We could make a tag/release before making any big changes so users could refer to the repository by that hash in the meantime.
I've spent an inordinate amount of time (several sessions spanning several hours of digging through the rabbit hole that is setuptools, distutils, etc.) on this and never come up with a solution. I would love to solve the problem the way you propose! Maybe you have some ideas, or at least a fresh pair of eyes.
Good point. Again, I wonder how many users are currently depending on resolving through identifiers. Maybe we could get away with switching the current entry to something different.
Okay! Will revert that.
Sounds like this will be an ongoing issue... Currently the OBO isn't compatible with OLS. Overall, do you think we should split this PR into a couple of smaller ones? Which ones do you, @bgyori and @johnbachman, think are the lowest-hanging fruit that we can address first?
Closes #86
This PR adds a script, `famplex/reorganize_old_resources.py`, that migrates the old resource TSVs into the `famplex/resources/` folder. It moves the code in the `exports/` folder to `famplex/export/` and refactors code necessary to cope with the new format changes.
Identifiers:
`\d{6}`
`entities.csv` in a label-sorted order, but rather to use the identifiers to sort.
Format changes:
- All CSV files were switched to TSV. I'm not dogmatic about this, but it makes it much easier to deal with funny characters. After reading the README one more time, I think it's better just to stick with CSV.
- A `descriptions.csv` file was introduced for now, but it will be jointly upgraded with `entities.csv` to `famplex/resources/entities.tsv`, which has the identifier, name, description, and references.
- `grounding_map.csv` has been upgraded to `famplex/resources/grounding_map.tsv` and will no longer be a wide CSV, but a tall and skinny TSV with four columns: text, database, database identifier, and name. This means one text reference may appear on many lines.
- `equivalence.csv` has been upgraded to `famplex/resources/equivalence.tsv` and now has both labels and identifiers.
- `relations.csv` has been upgraded to `famplex/resources/relations.tsv` and now has both identifiers and labels for subject and object.
Features:
- `check_references.py` script now checks both that the HGNC identifier is valid and that the associated names are up to date, using `indra.databases.hgnc_client`.
- `check_references.py` script now checks for text in the grounding map that has multiple groundings to different entities in the same database. After some investigation, I suppose this sort of makes sense because one textual reference could be disambiguated to multiple entities. @steppi and @bgyori, you should take a look at this, and consider how the grounding map in this repository fits into your current entity disambiguation work, and whether it even makes sense to store this kind of information here (in the famplex repo) anymore. Here's the current (short) output:
Exporter changes:
`[Typedef]` stanza). Added identifiers where possible and used OBO syntax to keep showing names when available. Added description and references for each entity when available. Renamed root entity to "famplex".
To Do and Questions:
- Is `export/hgnc_ids.py` still necessary?
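To make the tall grounding map format and the multiple-grounding check described above more concrete, here is a minimal sketch. It is not the code in `check_references.py`. The function names, the header row, the alternating database/identifier layout of the old wide rows, and the example data are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def wide_to_tall(wide_row):
    """Split one wide row (text, db1, id1, db2, id2, ...) into (text, db, id) triples.

    The alternating database/identifier layout is an assumption about the
    old wide CSV; empty trailing cells are skipped.
    """
    text, rest = wide_row[0], wide_row[1:]
    pairs = zip(rest[0::2], rest[1::2])
    return [(text, db, db_id) for db, db_id in pairs if db and db_id]


def find_ambiguous_groundings(tall_rows):
    """Return {(text, database): sorted identifiers} for every text that is
    grounded to more than one identifier within the same database."""
    groundings = defaultdict(set)
    for text, database, identifier, _name in tall_rows:
        groundings[(text, database)].add(identifier)
    return {key: sorted(ids) for key, ids in groundings.items() if len(ids) > 1}


# Made-up example data, not taken from the actual grounding map.
EXAMPLE_TSV = (
    "text\tdatabase\tidentifier\tname\n"
    "foo\tDB1\t1\tFoo one\n"
    "foo\tDB1\t2\tFoo two\n"
    "foo\tDB2\tX\tFoo x\n"
    "bar\tDB1\t3\tBar\n"
)

reader = csv.reader(io.StringIO(EXAMPLE_TSV), delimiter="\t")
next(reader)  # skip the (assumed) header row
ambiguous = find_ambiguous_groundings(reader)
print(ambiguous)
```

In the made-up data, only "foo" in "DB1" is reported, since it is the only text with two different identifiers in the same database. A text that appears once per database, like "foo" in "DB2", is not flagged.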