[Synapse Datasets + Collections] Define and incorporate schema for Synapse Datasets #136

Open
aclayton555 opened this issue Aug 30, 2024 · 8 comments

@aclayton555

Emerges from exploratory and feasibility analysis in: mc2-center/mc2-center-dcc#71

This ticket should track efforts to develop and implement a schema for annotating Synapse Datasets curated as part of the proposed MC2 Center workflow (note that this is different from the existing 'Datasets' component in the MC2 Center data model). This is the first of several steps, which may be tracked in separate tickets as this work progresses:

  1. Define the schema (there have been ongoing discussions on this among the data managers group at Sage, but no resolution)
  2. Implement the schema as JSON (this will be applicable and complementary to ongoing efforts in NF)
  3. Explore and incorporate automation (again, pull in efforts from NF)
  4. Longer term: decide how this will look on the portal and what the expected user experience will be
@aclayton555
Author

aclayton555 commented Sep 4, 2024

24-9: @aditya-nath-sage will pick this up and chat with @jaybee84 about aligning approaches with NF.

Target output for this sprint: design doc for how to implement datasets across MC2 and NF (much of this captured in linked ticket above), including tentative annotation process. (will this leverage schematic or the Synapse API?)

Additional info on Synapse Datasets: https://help.synapse.org/docs/Datasets.2611281979.html

@aditya-nath-sage

aditya-nath-sage commented Sep 4, 2024

Goal is to create a design document for how MC2 and NF will want to handle this issue. Example design doc: https://docs.google.com/document/d/1dF1-FjGSdO3nkKArEsrnjnWFLeOV78MlvGZvM8smJVk/edit?pli=1#heading=h.47emx3tcx2wj

@aclayton555
Author

24-9 Close-Out: Currently working with the DM group (at Sage) to create an org-wide Dataset schema. This may take some time to reach consensus, but we can prioritize incorporating a placeholder model now and swap in the finalized schema later. Check on this mid-sprint in 24-10.
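
To make the placeholder idea concrete, here is a minimal sketch of what a stand-in schema and a required-field check could look like. Everything in it is an assumption for illustration: the property names (`title`, `description`), the `$id`, and the helper `missing_required` are hypothetical and do not reflect the schema under discussion in the DM group.

```python
import json

# Hypothetical placeholder schema. The property names and $id below are
# illustrative only and are NOT the agreed org-wide Dataset schema.
PLACEHOLDER_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$id": "example.placeholder-dataset-0.0.1",  # hypothetical identifier
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "description": {"type": "string"},
    },
    "required": ["title"],
}


def missing_required(annotations, schema=PLACEHOLDER_SCHEMA):
    """Return the required property names that are absent from the annotations."""
    return [name for name in schema.get("required", []) if name not in annotations]


if __name__ == "__main__":
    # Serialize the placeholder so it could be saved as a .json file.
    print(json.dumps(PLACEHOLDER_SCHEMA, indent=2))
    print(missing_required({"description": "no title yet"}))
```

Once the finalized schema exists, only the dictionary would need to be replaced; the check itself does not depend on specific property names.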

@aclayton555
Author

@aditya-nath-sage let's touch base on this during our check-in tomorrow!

@aclayton555
Author

Aditya and Orion to meet to align on this. In the meantime, @aditya-nath-sage to review ongoing design doc

Establish an end-of-year goal for this effort.

@aclayton555
Author

24-10: Orion has a rough script for binding a schema to entities in Synapse. Need to understand how the schematic outputs will work here and what the schema looks like.
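
For context on what a bind step involves, below is a rough sketch that only builds the request, without calling Synapse. It assumes the Synapse REST endpoint for binding is `PUT /entity/{id}/schema/binding` with a body containing `entityId` and `schema$id`; that is my understanding of the API and should be checked against the Synapse REST docs. The helper name `build_bind_request` and the example Ids are hypothetical, and this is not necessarily how Orion's script works.

```python
import json


def build_bind_request(entity_id, schema_id):
    """Build the path and JSON body for binding a registered schema to an entity.

    Assumes the endpoint PUT /entity/{id}/schema/binding and the body keys
    'entityId' and 'schema$id'; verify both against the Synapse REST docs.
    """
    path = f"/entity/{entity_id}/schema/binding"
    body = json.dumps({"entityId": entity_id, "schema$id": schema_id})
    return path, body


if __name__ == "__main__":
    # Both values are made-up placeholders.
    path, body = build_bind_request("syn0000001", "example.org-placeholder.dataset")
    print(path)
    print(body)
```

With a logged-in `synapseclient.Synapse` object `syn`, the pair could then be sent with `syn.restPUT(path, body)`.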

@aclayton555
Author

24-11/12 Scope: Start working on this. Think about how we surface datasets and collections that are on Synapse, and how these connect to publications via queryable metadata. Good to take stock of how many Datasets currently exist. Goal for end of sprint is a prelim design doc.

Another thought: for the record-based datasets we have, how can we generate and surface a collection of related datasets? Some limitations here, as Synapse Collections currently only consolidate Dataset entities. One possibility is to generate entities from records, create a Dataset from these, then create a Collection.

@Bankso
Contributor

Bankso commented Nov 2, 2024

Rough draft of a schema bind script: https://github.com/mc2-center/mc2-center-dcc/blob/add-utils-11-24/utils/synapse_json_schema_bind.py

Rough draft of a script to convert Synapse table info to annotations: https://github.com/mc2-center/mc2-center-dcc/blob/add-utils-11-24/utils/table_to_annotations.py

Script for creating a Synapse Dataset and adding entities from a folder: https://github.com/mc2-center/mc2-center-dcc/blob/add-utils-11-24/utils/build_datasets.py

  • I imagine we could repurpose this to create Synapse Datasets using dataset link entities. To repurpose, we would want to add code (or write a separate script) where we 1) extract Dataset View table entries, likely on a project-by-project basis; 2) convert the Dataset link into a Synapse link entity (effectively a file) and store it in the 'datasets' folder; 3) add the remaining Dataset View annotations to the link entity as Synapse annotations; 4) use 'build_datasets.py' to add link entities to Datasets. Step 4 could be modified to select for dataset link entities that have the same 'Study Key' or 'PublicationView Key', so only related dataset link entities are combined into a Synapse Dataset.
  • We should also consider how this relates to the 'DatasetView_id' attribute. Currently, we mint Synapse Ids to serve as entries in this field, by creating and immediately deleting folders. If we create link entities as part of our curation workflow, we could remove the minting step and just use the link Synapse Id instead.
  • Note that, when converting previously curated Dataset View entries to link entities, the entityId and DatasetView_id will end up being different, which may cause weirdness.
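
The row-handling part of the steps above (splitting a row into link target plus annotations, then grouping by 'Study Key') can be sketched without touching Synapse. This is an illustration under assumptions, not the logic in 'build_datasets.py': the column name `Dataset Url`, the helper names, and the example Ids and URLs are hypothetical, and each row is assumed to carry a single 'Study Key' value.

```python
from collections import defaultdict

# Hypothetical name of the column holding the dataset link; the real
# Dataset View column may be named differently.
LINK_COLUMN = "Dataset Url"


def plan_link_entity(row):
    """Split one Dataset View row into a link target and its remaining annotations."""
    annotations = {
        key: value
        for key, value in row.items()
        if key != LINK_COLUMN and value not in (None, "")
    }
    return {"target": row[LINK_COLUMN], "annotations": annotations}


def group_rows(rows, key="Study Key"):
    """Group DatasetView_id values by a shared key; each group maps to one Dataset."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["DatasetView_id"])
    return dict(groups)


if __name__ == "__main__":
    # Made-up rows standing in for Dataset View table entries.
    rows = [
        {"DatasetView_id": "syn0000001", "Study Key": "study_a", "Dataset Url": "https://example.org/d1"},
        {"DatasetView_id": "syn0000002", "Study Key": "study_a", "Dataset Url": "https://example.org/d2"},
        {"DatasetView_id": "syn0000003", "Study Key": "study_b", "Dataset Url": "https://example.org/d3"},
    ]
    print(plan_link_entity(rows[0]))
    print(group_rows(rows))
```

Passing `key="PublicationView Key"` would group by publication instead. Since the new link entity Ids will differ from existing DatasetView_id values, keeping the DatasetView_id as an annotation on the link entity (as `plan_link_entity` does) is one way to preserve a crosswalk between the two.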
