Lineage / provenance representation #738

benjelloun · 2024-09-11T16:01:27Z

Define a mechanism to describe lineage of data / provenance information.

This mechanism should support multiple levels of granularity:

Dataset level
RecordSet level
Field level
Row / data value level

We would ideally reuse existing vocabularies such as PROV-O.
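To make the four granularity levels concrete, here is a minimal sketch of how PROV-O's `prov:wasDerivedFrom` property might be attached at each level in a JSON-LD document. This is an illustration under assumptions, not a proposed Croissant design: the `example.org` identifiers, the `#row=` fragment convention, and the choice to reuse `cr:RecordSet` and `cr:Field` as node types are all hypothetical. Only the `prov:` namespace and the `prov:wasDerivedFrom` and `prov:Entity` terms come from PROV-O itself.

```python
import json

# PROV-O namespace (W3C). The other prefixes are the usual schema.org and
# Croissant prefixes; using them together like this is an assumption.
CONTEXT = {
    "prov": "http://www.w3.org/ns/prov#",
    "sc": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
}


def derived_from(node_id, node_type, source_id):
    """Return a JSON-LD node stating that node_id was derived from source_id."""
    return {
        "@id": node_id,
        "@type": node_type,
        "prov:wasDerivedFrom": {"@id": source_id},
    }


def build_lineage_example():
    """Build one provenance statement per granularity level (hypothetical IDs)."""
    raw = "https://example.org/raw"  # hypothetical upstream dataset
    graph = [
        # Dataset level
        derived_from("cleaned", "sc:Dataset", raw),
        # RecordSet level
        derived_from("cleaned/reviews", "cr:RecordSet", raw + "/reviews"),
        # Field level
        derived_from("cleaned/reviews/text", "cr:Field", raw + "/reviews/body"),
        # Row / data value level (the #row= fragment is an invented convention)
        derived_from("cleaned/reviews#row=0", "prov:Entity", raw + "/reviews#row=17"),
    ]
    return {"@context": CONTEXT, "@graph": graph}


doc = build_lineage_example()
print(json.dumps(doc, indent=2))
```

The same property is used at every level here, so a consumer would only need one traversal rule. Whether row-level provenance should live inline like this or in a separate sidecar file is an open design question that this sketch does not settle.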

goeffthomas · 2024-10-02T17:42:21Z

We don't do this today, but it relates to a feature I'd love to see on Kaggle: datasets linking to each other (or to models), showing which datasets are cleaned or otherwise processed versions of others, or which models were trained on which datasets. If there is a separate work stream for this issue, please add me to any meetings, docs, etc. I don't have the capacity to lead the effort right now, but I'm very interested in participating.

wumpus · 2024-11-03T00:39:21Z

Last Wednesday in the Croissant WG meeting I pitched exactly this idea -- I want to start with a croissant being able to refer to other croissants. For example, I would make croissants for all 105 Common Crawl crawls (see #762), and people publishing cleaned ML training sets would write croissants that pointed at CC crawl croissants.
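As a rough sketch of what "a croissant referring to other croissants" could enable, the snippet below walks dataset-to-dataset links to list everything a training set was ultimately built from. It assumes each metadata document carries a `prov:wasDerivedFrom` list of `@id` references. That property choice, the in-memory `catalog` dict, and all `example.org` URLs are hypothetical stand-ins, not part of the current Croissant format.

```python
def upstream_datasets(start, catalog):
    """Return all datasets reachable from `start` via prov:wasDerivedFrom links.

    `catalog` maps a dataset URL to its (already parsed) metadata dict.
    Datasets missing from the catalog are still reported but not expanded.
    """
    seen = []
    stack = [start]
    while stack:
        current = stack.pop()
        metadata = catalog.get(current, {})
        for ref in metadata.get("prov:wasDerivedFrom", []):
            source = ref["@id"]
            # Skip anything already visited so cyclic links cannot loop forever.
            if source != start and source not in seen:
                seen.append(source)
                stack.append(source)
    return seen


# Hypothetical catalog: two crawl snapshots, a cleaned set, and a training set.
CRAWL_A = "https://example.org/crawl-a"
CRAWL_B = "https://example.org/crawl-b"
CLEANED = "https://example.org/cleaned"
TRAIN = "https://example.org/train-set"

catalog = {
    CRAWL_A: {"name": "crawl A"},
    CRAWL_B: {"name": "crawl B"},
    CLEANED: {
        "name": "cleaned text",
        "prov:wasDerivedFrom": [{"@id": CRAWL_A}, {"@id": CRAWL_B}],
    },
    TRAIN: {
        "name": "training set",
        "prov:wasDerivedFrom": [{"@id": CLEANED}],
    },
}

ancestors = upstream_datasets(TRAIN, catalog)
print(sorted(ancestors))
```

In a real setting the catalog lookup would be replaced by fetching and parsing the referenced metadata file. The traversal logic would stay the same.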

Apparently DDI-CDI has this kind of capability built in. That was the topic of last week's presentation at the Croissant meeting, given by Arofan Gregory; the slides are at https://docs.google.com/presentation/d/1-9sg1X8siHZCa4Zh_7vIOzreEz0-3d-d/edit?usp=sharing&ouid=104300626440986451054&rtpof=true&sd=true

benjelloun added this to New Croissant features Sep 11, 2024

benjelloun converted this from a draft issue Sep 11, 2024

wumpus mentioned this issue Nov 3, 2024: Croissant vocabulary for crawled datasets #762 (Open, 3 tasks)
