Lineage / provenance representation #738
Comments
We don't do this today, but it is related to a feature I'd love to see on Kaggle: datasets linking to each other (or to models) by showing which datasets are cleaned (etc.) versions of others, or which models were trained using which datasets. If we have a separate work stream to address this issue, please add me to any meetings, docs, etc.! I don't have the capacity to lead the effort right now, but I'm very interested in participating.
Last Wednesday in the Croissant WG meeting I pitched exactly this idea: I want to start with a croissant being able to refer to other croissants. For example, I would make croissants for all 105 Common Crawl crawls (see #762), and people publishing cleaned ML training sets would write croissants that point at the CC crawl croissants. Apparently DDI-CDI has this kind of thing built into it; that was the subject of last week's presentation at the Croissant meeting. The talk was by Arofan Gregory, and the slides are at https://docs.google.com/presentation/d/1-9sg1X8siHZCa4Zh_7vIOzreEz0-3d-d/edit?usp=sharing&ouid=104300626440986451054&rtpof=true&sd=true
Define a mechanism to describe lineage of data / provenance information.
This mechanism should support multiple levels of granularity:
We would ideally reuse existing vocabularies such as PROV-O.
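To make the proposal concrete, here is a rough sketch of what dataset-level lineage might look like if it reused PROV-O's `prov:wasDerivedFrom` property inside a JSON-LD dataset description. This is only an illustration, not an agreed Croissant design: the helper `dataset_with_lineage`, the dataset name, and the `example.org` URLs are all hypothetical, and finer levels of granularity are not covered.

```python
import json

# Real vocabulary namespaces; everything else below is a hypothetical example.
PROV = "http://www.w3.org/ns/prov#"
SCHEMA = "https://schema.org/"


def dataset_with_lineage(name, url, derived_from):
    """Build a JSON-LD dataset description that points at its source datasets.

    Assumption: lineage is expressed with prov:wasDerivedFrom, one entry
    per upstream dataset, each identified by its URL.
    """
    return {
        "@context": {"sc": SCHEMA, "prov": PROV},
        "@type": "sc:Dataset",
        "@id": url,
        "sc:name": name,
        "prov:wasDerivedFrom": [{"@id": src} for src in derived_from],
    }


# Hypothetical cleaned training set derived from one (hypothetical) crawl description.
cleaned = dataset_with_lineage(
    name="cleaned-training-set",
    url="https://example.org/datasets/cleaned-training-set",
    derived_from=["https://example.org/croissant/common-crawl/crawl-a"],
)

print(json.dumps(cleaned, indent=2))
```

Keeping the upstream references as plain `@id` links means a consumer could follow them to the source descriptions without the derived dataset having to copy their metadata.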