Choose a format for publishing harmonized datasets #15

gaurav · 2020-09-17T21:56:04Z

We currently have an initial, incomplete set of harmonized data from some IDC example datasets (#4), and need to choose a format to store and publish this data in a way that maintains provenance information. Here are some possible formats:

Please feel free to add any other formats we should look at!

This issue covers comparing these formats for our purposes and translating the incomplete harmonized data into the chosen format as a potential exemplar.

fedorov · 2020-10-01T15:40:57Z

Before we answer this question: do you know where the resulting files will live within the CRDC? What resource is going to be responsible for supporting users in querying the data stored in those files?

balhoff · 2020-10-02T17:21:02Z

I think the answer to that lies in the ultimate architecture for CDA. Definitely the querying part; not sure about storage of the files.

fedorov · 2020-10-02T18:16:41Z

Since CDA was mentioned, I'm adding @DavidPotCanuck, who is one of the leads for CDA, in case he is interested in participating in this discussion.

gaurav · 2020-11-12T02:42:01Z

I've made a preliminary conversion of the ISPY1 patient clinical subset to PFB by using a JSON file that records the mappings. You can download the PFB file, or have a look at its internal representation as extracted with the PyPFB tool: the Avro schema, PFB metadata node, and data.

The good news is that PFB can be used to store textual, numerical and enumerated data, along with a reference to the caDSR CDE used for the mapping (from which we could map the harmonized values to NCIt concept identifiers).

The bad news is that Avro field names and enum values are both restricted to the regex [A-Za-z_][A-Za-z0-9_]*. The PFB developers have come up with a simple representation for inserting Unicode characters (using e.g. _00A9_ to represent U+00A9 = ©), but PyPFB doesn't consistently translate those names into and out of JSON at the moment. We could probably fix this without too much trouble, assuming that this is a bug in PyPFB and not a misinterpretation on my part.

PFB also doesn't currently have a facility for storing verbatim values, so we lose the actual values recorded in the original data. We might be able to propose a change to PFB to record this information if it is important to us. Another possibility is to store all the mapping information in another part of the Avro file, leave the verbatim values as-is, and only convert them into harmonized values on export.
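To make the naming restriction concrete, here is a minimal sketch of the kind of escaping described above. It is not PyPFB's implementation, and the helper names `encode_name` and `decode_name` are hypothetical. I'm assuming every character that the Avro regex disallows (including a leading digit) is written as `_XXXX_`, with `XXXX` being the uppercase hex code point padded to at least four digits. PFB's actual rules may differ in the details.

```python
import re

# Avro names must match this pattern (quoted from the comment above).
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Assumed escape form: underscore, 4-6 uppercase hex digits, underscore.
ESCAPED = re.compile(r"_([0-9A-F]{4,6})_")


def encode_name(name: str) -> str:
    """Escape each character Avro would reject as _XXXX_ (hex code point)."""
    out = []
    for i, ch in enumerate(name):
        allowed = ch.isascii() and (
            ch.isalpha() or ch == "_" or (ch.isdigit() and i > 0)
        )
        out.append(ch if allowed else f"_{ord(ch):04X}_")
    return "".join(out)


def decode_name(encoded: str) -> str:
    """Reverse encode_name by turning _XXXX_ escapes back into characters."""

    def unescape(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        # Leave the text untouched if it is not a valid Unicode code point.
        return chr(code_point) if code_point <= 0x10FFFF else match.group(0)

    return ESCAPED.sub(unescape, encoded)


if __name__ == "__main__":
    for original in ["©", "ER status", "HER2/neu"]:
        encoded = encode_name(original)
        print(original, "->", encoded, "->", decode_name(encoded))
```

One caveat with this scheme: a name that already contains something shaped like an escape (say, a literal `_00A9_`) will not round-trip, because decoding cannot tell it apart from an encoded character. Any fix to PyPFB would need to decide how to handle that case.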

I'm going to leave this conversion as-is for now and move on to trying to convert the same dataset into the CEDAR instance format so we can do a side-by-side comparison between these two formats.

gaurav · 2021-07-06T22:39:29Z

We'll have some example output files in cancerDHC/example-data#10 -- we can use those to figure out which format works best for our needs.

gaurav · 2021-10-01T05:46:16Z

For our immediate needs, the LinkML instance format (in YAML) seems to be a good representation, and has been developed into some exemplars by the CCDH Data Model Harmonization team as part of the CCDH Pilot. Future formats will probably be supported by adding generators to LinkML, so that data from any LinkML model can be converted into that format (e.g. see the issue tracking an Avro/PFB generator for LinkML). Given that, I think we can close this issue until specific use-cases emerge from CDA and the CRDC nodes.

gaurav added the "in progress" label (Somebody is working on this right now) Sep 17, 2020

gaurav self-assigned this Sep 17, 2020

gaurav added this to the Phase 2 - Quarter 3 milestone Sep 17, 2020

gaurav mentioned this issue Sep 17, 2020

Build a workflow for integrating example IDC data against standardized vocabularies #4

Closed

3 tasks

balhoff removed this from the Phase 2 - Quarter 3 (2020) milestone Apr 6, 2021

balhoff mentioned this issue Apr 6, 2021

specify formats for data adhering to CRDC-H model #28

Closed

gaurav added this to the Phase 3 - Quarter 3 (2021) milestone Aug 16, 2021

gaurav closed this as completed Oct 1, 2021
