Pipeline that Starts with Previously Segmented Media #63

faberf · 2024-04-25T07:59:16Z

Consider the use case where a large catalogue has already been segmented and some features have been extracted. Now, an additional feature needs to be extracted and connected to the existing segments. Currently, there is no practical way to do this (AFAIK), and the entire pipeline needs to be rerun.

I propose implementing an operator that retrieves segments that have been persisted, along with their source attributes. This operator would be the initial operator in the extraction pipeline for new features. I am not sure whether retrieval at indexing time is meant to work with the existing querying system or whether some problems will arise here. Also, current pipeline configs require the enumerators to come first, so work is needed there as well.
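To make the proposal concrete, here is a minimal Python sketch of a pipeline whose first operator emits previously persisted segments instead of enumerating files. Everything in it is an assumption for illustration: `Segment`, `persisted_segments`, `extract_feature` and the in-memory `store` are hypothetical names, not part of the existing engine, and a real operator would read from the persistence layer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Segment:
    """Hypothetical persisted segment with its source attributes."""
    segment_id: str
    source_path: str  # attribute of the source the segment was cut from
    start_ms: int
    end_ms: int
    features: Dict[str, object] = field(default_factory=dict)


def persisted_segments(store: List[Segment]) -> Iterator[Segment]:
    """Initial operator: emit segments persisted by an earlier run."""
    yield from store


def extract_feature(
    segments: Iterable[Segment],
    name: str,
    extractor: Callable[[Segment], object],
) -> Iterator[Segment]:
    """Attach one new feature to each segment, keeping existing features."""
    for segment in segments:
        if name not in segment.features:
            segment.features[name] = extractor(segment)
        yield segment


# In-memory stand-in for the persistence layer (assumption).
store = [
    Segment("seg-1", "videos/a.mp4", 0, 4000, {"average_color": "red"}),
    Segment("seg-2", "videos/a.mp4", 4000, 9000, {"average_color": "blue"}),
]

# The pipeline starts with the retrieval operator; no enumerator or segmenter runs.
processed = list(
    extract_feature(
        persisted_segments(store),
        "duration_ms",
        lambda s: s.end_ms - s.start_ms,
    )
)
```

In this sketch the segment ids and the previously extracted features pass through unchanged, and only the new feature is computed.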

sauterl · 2024-04-25T08:57:15Z

This is an interesting question, which I think we should address.
I see multiple use cases that could be tackled in one go:

- As described originally, adding a new feature for all retrievables.
- Updating an existing feature on all or some retrievables.
- With more verbose extraction logging, a mechanism for recovering a partially successful extraction, e.g. resuming an extraction on all fields for some of the sources.

faberf · 2024-04-26T07:52:25Z

Another idea:
Create a source that emits segments persisted in a previous run as retrievables, together with special content elements that describe which content elements are missing. Then, implement a special decoder that takes enumerated files and these retrieved retrievables (together with the gaps) and attempts to fill all the gaps.
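A rough Python sketch of this idea, under stated assumptions: `Retrievable`, `emit_with_gaps`, `fill_gaps` and the `decoders` mapping are hypothetical names invented here, and the gap description is modelled as a plain set of missing content kinds rather than as special content elements.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, Set


@dataclass
class Retrievable:
    """Hypothetical persisted segment plus a description of its gaps."""
    retrievable_id: str
    source_path: str
    content: Dict[str, str] = field(default_factory=dict)
    missing: Set[str] = field(default_factory=set)


def emit_with_gaps(
    persisted: Iterable[Retrievable], required: Set[str]
) -> Iterator[Retrievable]:
    """Source: emit persisted retrievables and record which content is missing."""
    for retrievable in persisted:
        retrievable.missing = set(required) - set(retrievable.content)
        yield retrievable


def fill_gaps(
    retrievables: Iterable[Retrievable],
    enumerated_files: Iterable[str],
    decoders: Dict[str, Callable[[str], str]],
) -> Iterator[Retrievable]:
    """Decoder: fill each gap when the source file was enumerated and a decoder exists."""
    available = set(enumerated_files)
    for retrievable in retrievables:
        if retrievable.source_path in available:
            for kind in sorted(retrievable.missing):
                decoder = decoders.get(kind)
                if decoder is not None:
                    retrievable.content[kind] = decoder(retrievable.source_path)
            # Whatever is now present is no longer a gap.
            retrievable.missing -= set(retrievable.content)
        yield retrievable


persisted = [
    Retrievable("r1", "a.mp4", {"image": "frame-of-a"}),
    Retrievable("r2", "b.mp4", {"image": "frame-of-b", "audio": "audio-of-b"}),
    Retrievable("r3", "c.mp4", {"image": "frame-of-c"}),
]
decoders = {"audio": lambda path: f"audio-of-{path}"}

# Only a.mp4 and b.mp4 are enumerated; c.mp4 is not available in this run.
result = list(
    fill_gaps(
        emit_with_gaps(persisted, {"image", "audio"}),
        ["a.mp4", "b.mp4"],
        decoders,
    )
)
```

Retrievables whose source file is not enumerated keep their gaps, so a later run could pick them up again.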

faberf · 2024-06-24T09:24:49Z

I have an idea for solving this issue which also addresses the problem of restarting failed ingestions.

- Include an option in the schema to configure the version of a pipeline config.
- In the backend, there is a one-to-many mapping from source metadata to versioned pipelines.
- The semantics are: source S has been fully processed by pipeline P1 version V1, pipeline P2 version V2, and so on.
- Augment the enumerator to skip files that match given metadata.
- Augment the sink to properly tag the source as completed (relative to the given pipeline and version).
- All the tagging and checking logic should be reusable, to make it easy to develop new enumerators and sinks.

@ppanopticon @lucaro What changes would you make to this concept?

EDIT: I just realized this actually does not address the issue, as everything would be resegmented upon version update.
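The tagging and skipping logic described in this comment could look roughly like the Python sketch below. All names (`CompletionRegistry`, `enumerate_pending`, `completing_sink`) are hypothetical, and the registry is an in-memory stand-in for the backend mapping. The last two lines also show the limitation from the EDIT: after a version bump nothing is skipped, so every source goes through the full pipeline again.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple

PipelineVersion = Tuple[str, str]  # (pipeline name, version)


class CompletionRegistry:
    """Reusable one-to-many mapping: source -> completed (pipeline, version) pairs."""

    def __init__(self) -> None:
        self._completed: Dict[str, Set[PipelineVersion]] = defaultdict(set)

    def mark_completed(self, source: str, pipeline: str, version: str) -> None:
        self._completed[source].add((pipeline, version))

    def is_completed(self, source: str, pipeline: str, version: str) -> bool:
        return (pipeline, version) in self._completed.get(source, set())


def enumerate_pending(
    paths: Iterable[str], registry: CompletionRegistry, pipeline: str, version: str
) -> Iterator[str]:
    """Enumerator: skip sources already fully processed by this pipeline version."""
    for path in paths:
        if not registry.is_completed(path, pipeline, version):
            yield path


def completing_sink(
    paths: Iterable[str], registry: CompletionRegistry, pipeline: str, version: str
) -> None:
    """Sink: tag every source that reaches the end of the pipeline as completed."""
    for path in paths:
        registry.mark_completed(path, pipeline, version)


registry = CompletionRegistry()
files = ["a.mp4", "b.mp4"]

# First run of P1/V1 processes a.mp4 only (e.g. the run was interrupted).
completing_sink(["a.mp4"], registry, "P1", "V1")

# Restarting P1/V1 resumes with the files that were not completed.
pending_same_version = list(enumerate_pending(files, registry, "P1", "V1"))

# Bumping the version makes every source pending again.
pending_new_version = list(enumerate_pending(files, registry, "P1", "V2"))
```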

lucaro · 2024-06-24T10:40:45Z

I guess there are fundamentally only two (types of) mechanisms needed: an enumerator that checks, for every source, whether it is already known and, if so, emits the relevant retrievable with its existing id without persisting it anew; and one (or possibly multiple) segmenters that look up the existing segment boundaries for an existing retrievable and emit the same retrievables with the same ids and content again. Any versioning you might want to do of pipelines is, in my view, completely independent from these mechanisms.
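The two mechanisms can be sketched in Python as follows. This is only an illustration under assumptions: `SourceRetrievable`, `SegmentRetrievable`, `enumerate_sources`, `lookup_segments` and the dictionaries standing in for the persistence layer are all hypothetical, and content is omitted to keep the example short.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class SourceRetrievable:
    retrievable_id: str
    path: str
    already_persisted: bool  # True means: do not persist this source anew


@dataclass(frozen=True)
class SegmentRetrievable:
    retrievable_id: str
    source_id: str
    start_ms: int
    end_ms: int


def enumerate_sources(
    paths: Iterable[str], known_sources: Dict[str, str]
) -> Iterator[SourceRetrievable]:
    """Enumerator: reuse the existing id for known sources, mint one otherwise."""
    for path in paths:
        existing_id = known_sources.get(path)
        if existing_id is not None:
            yield SourceRetrievable(existing_id, path, already_persisted=True)
        else:
            yield SourceRetrievable(str(uuid.uuid4()), path, already_persisted=False)


def lookup_segments(
    sources: Iterable[SourceRetrievable],
    stored_segments: Dict[str, List[SegmentRetrievable]],
) -> Iterator[SegmentRetrievable]:
    """Segmenter: re-emit stored segments (same ids and boundaries) for known sources.

    Unknown sources would need a regular segmenter; that path is left out here.
    """
    for source in sources:
        if source.already_persisted:
            yield from stored_segments.get(source.retrievable_id, [])


# In-memory stand-ins for what a previous run persisted (assumption).
known_sources = {"a.mp4": "src-1"}
stored_segments = {
    "src-1": [
        SegmentRetrievable("seg-1", "src-1", 0, 4000),
        SegmentRetrievable("seg-2", "src-1", 4000, 9000),
    ]
}

sources = list(enumerate_sources(["a.mp4", "new.mp4"], known_sources))
segments = list(lookup_segments(sources, stored_segments))
```

Downstream extractors then see the same segment ids as in the original run, so a newly extracted feature can be attached to the existing segments.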

ppanopticon assigned faberf Jun 17, 2024

ppanopticon added the enhancement New feature or request label Jun 17, 2024

ppanopticon added this to the Release Candidate #2 milestone Aug 26, 2024
