Define input/output format #6

Open
cmungall opened this issue Apr 23, 2018 · 7 comments
Comments

@cmungall
Member

cmungall commented Apr 23, 2018

I will write proposals as individual comments in this ticket

@cmungall
Member Author

RDF

Pros:

  • already a standard

Cons:

  • multiple ways of handling reification
  • even if we bless one, still awkward to handle

recommendation: The format MUST have a defined mapping to the RDF model, but there MAY be another serialization format

@cmungall
Member Author

GraphML

Pros:

  • standard
  • allows for property graphs

Cons:

  • no standard way of defining which node or edge properties are accepted

@cmungall
Member Author

cmungall commented Apr 23, 2018

Translator TSV

Spec (loosely defined):

  • TSV or CSV
  • Nodes and Edges as separate files
  • Multivalued columns separated by |
  • Must be readily loadable into a Python (pandas) DataFrame, i.e. the first line should be a header
  • Headers must come from Translator spec http://bit.ly/tr-kg-standard
  • E.g.
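
A minimal sketch of reading such a pair of files, using only the Python standard library. The column names (`id`, `name`, `category`, `subject`, `predicate`, `object`, `publications`), the `EX:` identifiers, and the choice of which columns are multivalued are hypothetical placeholders, not taken from the Translator spec.

```python
import csv
import io

# Hypothetical file contents; real headers would come from the Translator spec.
NODES_TSV = "id\tname\tcategory\nEX:1\tgene one\tgene|protein\nEX:2\tdisease two\tdisease\n"
EDGES_TSV = "subject\tpredicate\tobject\tpublications\nEX:1\trelated_to\tEX:2\tEX:p1|EX:p2\n"

def read_table(text, multivalued=()):
    """Parse TSV text with a header row; split the named columns on '|'."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        for col in multivalued:
            # An empty cell becomes an empty list rather than [''].
            row[col] = row[col].split("|") if row[col] else []
        rows.append(row)
    return rows

nodes = read_table(NODES_TSV, multivalued=("category",))
edges = read_table(EDGES_TSV, multivalued=("publications",))
```

Because the first line is a header, the same files should also load with `pandas.read_csv(path, sep="\t")`; splitting the `|`-separated columns would still be a separate step.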

@cmungall
Member Author

Translator JSON

Follow KB standard

Structurally almost identical to the TSV above: the doc would have a section for nodes and a section for edges

Cons:

  • larger I/O or disk footprint than TSV for little gain
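
To make the size comparison concrete, here is a rough sketch that serializes the same tiny graph both ways. The `nodes`/`edges` key names and the records are assumptions for illustration; the KB standard's actual field names are not given in this thread.

```python
import json

# Hypothetical records; field names are placeholders, not from the KB standard.
nodes = [{"id": "EX:1", "name": "gene one"}, {"id": "EX:2", "name": "disease two"}]
edges = [{"subject": "EX:1", "predicate": "related_to", "object": "EX:2"}]

def to_tsv(rows):
    """Serialize a list of flat dicts as TSV with a single header line."""
    header = list(rows[0])
    lines = ["\t".join(header)]
    lines += ["\t".join(str(r[h]) for h in header) for r in rows]
    return "\n".join(lines) + "\n"

# One JSON document with a section for nodes and a section for edges.
json_doc = json.dumps({"nodes": nodes, "edges": edges})
tsv_docs = to_tsv(nodes) + to_tsv(edges)

# JSON repeats every key in every record; TSV states them once in the header.
json_bytes = len(json_doc.encode("utf-8"))
tsv_bytes = len(tsv_docs.encode("utf-8"))
```

On this toy input the JSON form is larger than the two TSV tables combined. How much that matters at real scale would need measuring on actual data.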

@RichardBruskiewich
Collaborator

I've expressed this reservation before (to you, Chris), but I wonder whether defining nodes and edges alone suffices for knowledge graph representation. In effect, we need to annotate statements: annotating only the subject node of a statement is not enough, and annotating only the predicate doesn't help either. It seems necessary to treat each statement as a reified node and then hang everything off it: subject, predicate, object, evidence, provenance, etc.
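
One way to picture this, as a sketch only: the statement gets its own identifier and every annotation hangs off it, using the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`). The `EX:` identifiers and the `EX:evidence` / `EX:provided_by` properties are made up for illustration.

```python
RDF = "rdf:"  # stands in for http://www.w3.org/1999/02/22-rdf-syntax-ns#

def reify(statement_id, subject, predicate, obj, **annotations):
    """Return triples describing one statement as a reified node."""
    triples = [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
    ]
    # Evidence, provenance, etc. attach to the statement, not to its parts.
    for prop, value in annotations.items():
        triples.append((statement_id, "EX:" + prop, value))
    return triples

triples = reify(
    "EX:stmt1", "EX:gene1", "EX:related_to", "EX:disease2",
    evidence="EX:evidence1", provided_by="EX:source1",
)
```

Every triple has the statement node as its subject, so the same shape would also fit a table keyed by statement id.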

@yy20716

yy20716 commented Apr 23, 2018

Chris, I wonder if we could also briefly mention GraphQL and TinkerPop. The data models used in these languages/systems are also based on, or similar to, the property graph, so their pros and cons are very similar to those of GraphML. Some additional pros and cons:

  • One advantage of models based on property graphs and their variations is that they may not suffer from the reification problem in RDF. The data can be presented and formatted more compactly than in RDF.
  • However, property graphs may not be a good choice when we need to integrate or link multiple different datasets, which can happen in this project. For example, most property graph models do not use IRIs or typed literals to describe their entities, so it can be challenging to handle entities that carry the same labels or values in different datasets.
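
A small sketch of the integration problem described in the second point, with made-up data: two datasets reuse the same bare identifier for different things, and a naive merge silently loses one of them. Qualifying each id with a per-dataset prefix (the `A` and `B` prefixes here are hypothetical) keeps them apart.

```python
# Two hypothetical datasets that both use the bare identifier "1234".
dataset_a = {"1234": {"name": "gene from source A"}}
dataset_b = {"1234": {"name": "disease from source B"}}

def merge_bare(*datasets):
    """Merge on bare ids: later datasets silently overwrite earlier ones."""
    merged = {}
    for ds in datasets:
        merged.update(ds)
    return merged

def merge_namespaced(**datasets):
    """Merge after qualifying each id with its dataset's prefix."""
    merged = {}
    for prefix, ds in datasets.items():
        for local_id, props in ds.items():
            merged[f"{prefix}:{local_id}"] = props
    return merged

collided = merge_bare(dataset_a, dataset_b)
separate = merge_namespaced(A=dataset_a, B=dataset_b)
```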

@cmungall
Copy link
Member Author

@yy20716 we will need to be careful about how we map CURIEs to IRIs. If the exchange format is not an RDF serialization (which already has precisely defined mechanisms) we will need to embed a prefix mapping in the exchange file or have a standard one.
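
As a sketch of the embedded-mapping option: the exchange file carries a prefix map and each CURIE is expanded against it. The `EX` prefix and its namespace are placeholders; only the `rdf` namespace IRI below is a real one. Failing loudly on an unknown prefix is an assumption made here, not something decided in this thread.

```python
PREFIX_MAP = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "EX": "http://example.org/",  # placeholder namespace
}

def expand_curie(curie, prefix_map=PREFIX_MAP):
    """Expand 'prefix:local' to a full IRI using the given prefix map."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefix_map:
        raise ValueError(f"cannot expand {curie!r}: unknown or missing prefix")
    return prefix_map[prefix] + local

def contract_iri(iri, prefix_map=PREFIX_MAP):
    """Contract an IRI to a CURIE, preferring the longest matching namespace."""
    matches = [(ns, p) for p, ns in prefix_map.items() if iri.startswith(ns)]
    if not matches:
        return iri  # no known prefix; leave the IRI as-is
    ns, prefix = max(matches, key=lambda m: len(m[0]))
    return f"{prefix}:{iri[len(ns):]}"
```

With a single standard map instead of an embedded one, the functions stay the same and only the source of `PREFIX_MAP` changes.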

5 participants