Data model evolution planning #115

Closed
aclayton555 opened this issue Jul 8, 2024 · 6 comments


aclayton555 commented Jul 8, 2024

To be performed in July 2024. The outcome of this ticket should be a relatively comprehensive plan, with related tickets, to coordinate work for August 2024 and onward.

Cover:

SCOPE: what needs to be done (e.g. data model changes to enhance linkages)
OUTPUTS: expected outputs (e.g. RFCs and deployment schedule)
DEPENDENCIES AND IMPACT: what, if any, downstream dependencies or required changes exist (e.g. data model changes that impact syncing scripts, especially since Verena will be OOO).
RESOURCES: how and who will perform this work.

Relates to #97 and #56

@Bankso
Contributor

Bankso commented Aug 30, 2024

High level summary of work completed for this:

  • Model and Individual v0 schemas created
  • Biospecimen data model updated/expanded
  • GeoMx model expanded/updated and split into level-specific folders, including the GeoMx config schema
  • Study and FileView schemas created and implemented
  • DUO code attribute added and integrated (not yet in use, currently planning for implementation with GovInn)
  • Dataset Sharing Plan schema created and implemented
  • 10X Visium model adapted and implemented
  • Implementation of the <component>_id and <component> Key reference system
  • Separated attributes and valid values from shared into model-specific folders
  • Initial provisioning of folders and CSVs for sequencing model

See additional info added in #116

@aclayton555
Author

24-7/8 close-out: have made a lot of changes in this refactor, and really want another set of eyes on it to make sure it makes sense. Okay to wait until October for a deep dive.

Priority however is to at least close out the CDS attribute mapping. Want to have this complete by site visit.

Set up meeting with Aditi, Orion, Aditya asap to push through this. Maybe bring Jess in.

@aclayton555
Author

24-9: Working session scheduled for Sept 11 at 9am PT.

Outcome of that meeting will be CDS mapping (priority for site visit). Toward end of that discussion, think about timelines/phased approach for releases (may not want to do this all at once since there will be some major changes).

@aclayton555
Author

aclayton555 commented Sep 11, 2024

Notes from Sept 11 working session:

  • [Component]_id attribute is used both for Upsert and as a primary/foreign key unique identifier
  • Have a shared attribute table that defines set of shared attributes across components - would be nice to have this in HTAN, in addition to modularized approach
  • "Study" schema is intended to be flexible, but brings in CDS template attributes. These are flagged under "Source"
  • Model bifurcates into "resource" type schemas (e.g. dataset, grant) and the newer experimental information type schemas.
  • "Models" refers to experimental model systems (e.g. Zebrafish, cell lines), whereas "Individuals" refers to human participants. For "Models," this is a proposal for a schema driven by use cases that Orion has encountered (@Bankso to consolidate documenting these). For "Individuals," this pulls in a lot of the CDS attributes
  • Suggested addition to "Models" (re: "Model Method"), see: [Models] Option for "Model Method" to capture protocol or publication #143
  • Intended redundancy in attribute naming (i.e. component prefixes on attributes in each schema) because we don't know to what level contributors will want to annotate resources. However, there are opportunities to reduce redundancies in things like the Shared attributes, but note that mappings are established to maintain harmonization of these terms (e.g. Assay Type) across different schemas.
  • Biospecimen (parent vs. child) - @Bankso wants to think about and document how to do this, as we have already encountered issues with this in light sheet microscopy. Jess notes that AMP-AIM is also thinking about this. HTAN has outlined this in their ID provenance structure: https://docs.humantumoratlas.org/data_model/identifiers/
  • New "FileView" schema incorporates DUO codes (this is also a shared attribute with Dataset and Study). Ongoing conversations with GovInn about inferred annotations from DUO codes (i.e. a certain DUO code will specify certain access restrictions). Also thinking about how to leverage the FileView level to capture longitudinal data and time course information, as represented in different contributed files.
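
To make the first note concrete, here is a minimal sketch of how a [Component]_id attribute could act both as an upsert key and as the target of a key reference from another component. It is an illustration only, not the project's actual tooling: the attribute names ("Study_id", "Study Key", "Biospecimen_id"), the sample values, and the helper functions upsert and missing_references are all hypothetical, chosen to follow the <component>_id and <component> Key naming mentioned earlier in this thread.

```python
from typing import Dict, List

Row = Dict[str, str]


def upsert(table: Dict[str, Row], rows: List[Row], id_attr: str) -> Dict[str, Row]:
    """Insert new rows and update existing ones, keyed on id_attr (hypothetical helper)."""
    for row in rows:
        key = row[id_attr]
        # Merge onto any existing row so attributes absent from the update are kept.
        table[key] = {**table.get(key, {}), **row}
    return table


def missing_references(children: List[Row], key_attr: str, parents: Dict[str, Row]) -> List[str]:
    """Return the key_attr values that do not match any parent id (hypothetical helper)."""
    return sorted({row[key_attr] for row in children if row[key_attr] not in parents})


# Assumed example data: "Study_id" is the primary key of the Study component.
studies: Dict[str, Row] = {}
upsert(studies, [{"Study_id": "study-1", "Study Name": "Pilot"}], "Study_id")
upsert(
    studies,
    [
        {"Study_id": "study-1", "Study Name": "Pilot (revised)"},  # updates the existing row
        {"Study_id": "study-2", "Study Name": "Follow-up"},  # inserts a new row
    ],
    "Study_id",
)

# Assumed example data: "Study Key" on a biospecimen points at a "Study_id".
biospecimens = [
    {"Biospecimen_id": "bs-1", "Study Key": "study-1"},
    {"Biospecimen_id": "bs-2", "Study Key": "study-9"},
]
unresolved = missing_references(biospecimens, "Study Key", studies)
```

Under these assumptions, the same identifier decides whether a submitted row updates or creates a record, and a dangling "Study Key" such as "study-9" ends up in unresolved so it can be reported before syncing.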

Next steps:

  • Overall, CDS attribute mapping on track.
  • Welcome team input into the overall design and complexity. Want to have this as a solid foundation to build on. RFC process and direct contributor engagement will be critical to help inform required vs non-required.
  • In our documentation, want to make it clear (and provide examples!) how contributors should expect to engage with templates (i.e. some vs all). The ongoing data sharing pilots can provide examples for this - see [Q4 2024] [Contributor-Facing Documentation] Schema Updates + Clarity on how to engage with schema + examples #142

@aclayton555
Author

24-9 Close-out: This work will continue on into the next sprint. On track, but check in mid sprint.

@aclayton555
Author

24-10: Okay to close. Next phase of work to continue in testing and in docs: #142
