Lineage / provenance representation #738

Open
benjelloun opened this issue Sep 11, 2024 · 2 comments
Comments

@benjelloun
Contributor

Define a mechanism to describe lineage of data / provenance information.

This mechanism should support multiple levels of granularity:

  • Dataset level
  • RecordSet level
  • Field level
  • Row / data value level

We would ideally reuse existing vocabularies such as PROV-O.
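To make the four granularity levels concrete, here is a rough sketch of what lineage statements might look like if PROV-O terms were reused. It is only an illustration: `prov:Entity` and `prov:wasDerivedFrom` are real PROV-O terms, but the `ex:` namespace, the `ex:granularity` property, the identifiers, and the `derived_from` helper are all hypothetical, and nothing here is part of the Croissant spec.

```python
import json

PROV = "http://www.w3.org/ns/prov#"       # PROV-O namespace
EX = "https://example.org/lineage#"       # hypothetical namespace, for illustration only


def derived_from(node_id, source_id, level):
    """Build a JSON-LD node saying that node_id was derived from source_id."""
    return {
        "@id": node_id,
        "@type": "prov:Entity",
        "ex:granularity": level,  # hypothetical property, not a PROV-O term
        "prov:wasDerivedFrom": {"@id": source_id},
    }


# One statement per granularity level listed above (all identifiers are made up).
lineage = {
    "@context": {"prov": PROV, "ex": EX},
    "@graph": [
        derived_from("clean-dataset", "raw-dataset", "dataset"),
        derived_from("clean-dataset/reviews", "raw-dataset/reviews", "recordset"),
        derived_from("clean-dataset/reviews/text", "raw-dataset/reviews/body", "field"),
        derived_from("clean-dataset/reviews/text#row=17", "raw-dataset/reviews/body#row=42", "value"),
    ],
}

print(json.dumps(lineage, indent=2))
```

Using the same property at every level keeps the sketch small; a real design might need different properties or a `prov:Activity` node to describe the transformation itself.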

@benjelloun benjelloun converted this from a draft issue Sep 11, 2024
@goeffthomas
Contributor

We don't do this today, but it's related to a feature I'd love to see on Kaggle, where datasets link to each other (or to models) to show which datasets are cleaned/etc. versions of others, or which models were trained on which datasets. If there is a separate work stream for this issue, please add me to any meetings/docs/etc.! I don't have the capacity to lead the effort right now, but I'm very interested in participating.

@wumpus

wumpus commented Nov 3, 2024

Last Wednesday in the Croissant WG meeting I pitched exactly this idea -- I want to start with a croissant being able to refer to other croissants. For example, I would make croissants for all 105 Common Crawl crawls (see #762), and people publishing cleaned ML training sets would write croissants that pointed at CC crawl croissants.

Apparently DDI-CDI has this kind of thing built in. DDI-CDI was the subject of last week's presentation at the Croissant meeting; the talk was by Arofan Gregory, and the slides are at https://docs.google.com/presentation/d/1-9sg1X8siHZCa4Zh_7vIOzreEz0-3d-d/edit?usp=sharing&ouid=104300626440986451054&rtpof=true&sd=true
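
As a rough sketch of the "croissants pointing at other croissants" idea, the snippet below walks such references to list everything a dataset was ultimately built from. Everything in it is an assumption for illustration: the example.org URLs, the in-memory `CROISSANTS` registry, the `upstream` helper, and the use of `prov:wasDerivedFrom` as the linking property. Croissant does not currently define such a link.

```python
from typing import Dict, List

BASE = "https://example.org/croissants/"  # hypothetical location of croissant files

# Hypothetical registry: croissant URL -> the croissants it was derived from.
CROISSANTS: Dict[str, dict] = {
    BASE + "crawl-a.json": {"prov:wasDerivedFrom": []},
    BASE + "crawl-b.json": {"prov:wasDerivedFrom": []},
    BASE + "cleaned-text.json": {
        "prov:wasDerivedFrom": [BASE + "crawl-a.json", BASE + "crawl-b.json"]
    },
    BASE + "dedup-text.json": {"prov:wasDerivedFrom": [BASE + "cleaned-text.json"]},
}


def upstream(url: str, registry: Dict[str, dict]) -> List[str]:
    """Return every transitive source of `url`, nearest first, without duplicates."""
    seen: List[str] = []
    queue = list(registry[url]["prov:wasDerivedFrom"])
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.append(current)
        # Unknown croissants are treated as having no recorded sources.
        queue.extend(registry.get(current, {}).get("prov:wasDerivedFrom", []))
    return seen


sources = upstream(BASE + "dedup-text.json", CROISSANTS)
for source in sources:
    print(source)
```

In practice each entry would be fetched and parsed from its URL instead of a dictionary, but the traversal would look much the same.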
