Pipeline that Starts with Previously Segmented Media #63

Open
faberf opened this issue Apr 25, 2024 · 4 comments

@faberf
Contributor

faberf commented Apr 25, 2024

Consider the use case where a large catalogue has already been segmented and some features have been extracted. Now an additional feature needs to be extracted and connected to the existing segments. Currently there is no practical way to do this (AFAIK), so the entire pipeline has to be rerun.

I propose implementing an operator that retrieves persisted segments along with their source attributes. This operator would be the first operator in the extraction pipeline for new features. I am not sure whether retrieval at indexing time is meant to work with the existing querying system, or whether problems will arise here. Also, current pipeline configs require the enumerators to come first, so work is needed there as well.
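To make the proposal concrete, here is a minimal sketch of such an operator. It is not taken from the project: `Segment`, `SegmentStore`, and `extract_new_feature` are hypothetical names, and the store is an in-memory stand-in for whatever persistence layer is actually used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List


@dataclass
class Segment:
    """A previously persisted segment with the attributes of its source (hypothetical)."""
    segment_id: str
    source_attributes: Dict[str, str]
    start_s: float
    end_s: float


@dataclass
class SegmentStore:
    """In-memory stand-in for the persistence layer (assumption, not a real API)."""
    segments: List[Segment] = field(default_factory=list)
    features: Dict[str, Dict[str, object]] = field(default_factory=dict)

    def persisted_segments(self) -> Iterator[Segment]:
        # Acts as the first operator of the pipeline: no enumeration, no re-segmentation.
        yield from self.segments

    def attach(self, feature: str, segment_id: str, value: object) -> None:
        # Connect the new feature value to the existing segment id.
        self.features.setdefault(feature, {})[segment_id] = value


def extract_new_feature(
    store: SegmentStore, feature: str, extractor: Callable[[Segment], object]
) -> int:
    """Run one extractor over all persisted segments and return how many were processed."""
    processed = 0
    for segment in store.persisted_segments():
        store.attach(feature, segment.segment_id, extractor(segment))
        processed += 1
    return processed


store = SegmentStore(
    segments=[
        Segment("seg-1", {"path": "a.mp4"}, 0.0, 5.0),
        Segment("seg-2", {"path": "a.mp4"}, 5.0, 12.5),
    ]
)
processed = extract_new_feature(store, "duration_s", lambda s: s.end_s - s.start_s)
```

The point of the sketch is only that the new feature is keyed by the existing segment ids, so nothing upstream has to be rerun.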

@sauterl
Contributor

sauterl commented Apr 25, 2024

This is an interesting question, which I think we should address.
I see multiple use cases that could be tackled in one go:

  • As described originally, the addition of a new feature for all retrievables
  • Updating an existing feature on all or some retrievables
  • With more verbose extraction logging, a mechanism for recovering a partially successful extraction, e.g. resuming an extraction on all fields for some of the sources.

@faberf
Contributor Author

faberf commented Apr 26, 2024

Another idea:
Create a source that emits segments persisted in a previous run as retrievables, together with special content elements that describe which content elements are missing. Then implement a special decoder that takes the enumerated files and these retrieved retrievables (together with the gaps) and attempts to fill all the gaps.
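A rough sketch of this idea, under the assumption that a gap can be represented as a marker naming the missing content type. `Gap`, `RetrievedSegment`, and `fill_gaps` are hypothetical names, and the decoders here are plain functions rather than real pipeline operators.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Gap:
    """Special content element that names a missing content type (hypothetical)."""
    content_type: str


@dataclass
class RetrievedSegment:
    """A segment persisted in a previous run, re-emitted with its gaps."""
    segment_id: str
    source_path: str
    content: Dict[str, str] = field(default_factory=dict)
    gaps: List[Gap] = field(default_factory=list)


def fill_gaps(
    segment: RetrievedSegment,
    enumerated_files: Set[str],
    decoders: Dict[str, Callable[[str], str]],
) -> List[Gap]:
    """Try to fill every gap of the segment and return the gaps that remain open."""
    if segment.source_path not in enumerated_files:
        # The source file was not enumerated, so nothing can be decoded.
        return list(segment.gaps)
    remaining: List[Gap] = []
    for gap in segment.gaps:
        decode = decoders.get(gap.content_type)
        if decode is None:
            remaining.append(gap)  # no decoder available for this content type
        else:
            segment.content[gap.content_type] = decode(segment.source_path)
    segment.gaps = remaining
    return remaining


segment = RetrievedSegment(
    "seg-1", "a.mp4", content={"image": "frame-0"}, gaps=[Gap("audio"), Gap("text")]
)
still_missing = fill_gaps(segment, {"a.mp4"}, {"audio": lambda path: f"audio of {path}"})
```

In this example the audio gap is filled and the text gap stays open, because no decoder for it was supplied.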

@ppanopticon ppanopticon added the enhancement New feature or request label Jun 17, 2024
@faberf
Contributor Author

faberf commented Jun 24, 2024

I have an idea for solving this issue which also addresses the problem of restarting failed ingestions.

  • Include an option in the schema to configure the version of a pipeline config.
  • In the backend, keep a one-to-many mapping from source metadata to versioned pipelines.
  • The semantics are: source S has been fully processed by pipeline P1 version V1, by pipeline P2 version V2, and so on.
  • Augment the enumerator to skip files that match given metadata.
  • Augment the sink to properly tag the source as completed (relative to the given pipeline and version).
  • All the tagging and checking logic should be reusable, to make it easy to develop new enumerators and sinks.

@ppanopticon @lucaro What changes would you make to this concept?

EDIT: I just realized this actually does not address the issue, as everything would be resegmented upon version update.
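For illustration, the tagging and checking logic from the list above could look roughly like this. `CompletionRegistry` and `enumerate_pending` are hypothetical names, and the registry is an in-memory stand-in for the backend mapping. As the EDIT says, this only decides whether a source is skipped; it does nothing to avoid re-segmentation after a version update.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Iterator, Set, Tuple

PipelineVersion = Tuple[str, str]  # (pipeline name, version)


class CompletionRegistry:
    """One-to-many mapping from a source to the versioned pipelines that completed it."""

    def __init__(self) -> None:
        self._done: DefaultDict[str, Set[PipelineVersion]] = defaultdict(set)

    def mark_completed(self, source_id: str, pipeline: str, version: str) -> None:
        # Called by the sink once a source has been fully processed.
        self._done[source_id].add((pipeline, version))

    def is_completed(self, source_id: str, pipeline: str, version: str) -> bool:
        return (pipeline, version) in self._done.get(source_id, set())


def enumerate_pending(
    sources: Iterable[str], registry: CompletionRegistry, pipeline: str, version: str
) -> Iterator[str]:
    """Enumerator that skips sources already completed by this pipeline and version."""
    for source_id in sources:
        if not registry.is_completed(source_id, pipeline, version):
            yield source_id


registry = CompletionRegistry()
registry.mark_completed("a.mp4", "P1", "V1")

pending_same_version = list(enumerate_pending(["a.mp4", "b.mp4"], registry, "P1", "V1"))
pending_new_version = list(enumerate_pending(["a.mp4", "b.mp4"], registry, "P1", "V2"))
```

Note that with version "V2" every source is pending again, which is exactly the limitation pointed out in the EDIT.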

@lucaro
Member

lucaro commented Jun 24, 2024

I guess there are fundamentally only two (types of) mechanisms needed. The first is an enumerator that checks for every source whether it is already known and, if so, emits the relevant retrievable with its existing id without persisting it anew. The second is one or more segmenters that look up the existing segment boundaries for an existing retrievable and emit the same retrievables with the same ids and content again. Any versioning of pipelines you might want to do is, in my view, completely independent of these mechanisms.
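A minimal sketch of these two mechanisms, assuming an in-memory catalogue in place of the real persistence layer. `Catalogue`, `enumerate_source`, and `segment_source` are hypothetical names, and segments are reduced to an id plus start and end times.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    segment_id: str
    start_s: float
    end_s: float


@dataclass
class Catalogue:
    """In-memory stand-in for what earlier runs have persisted (assumption)."""
    source_ids: Dict[str, str] = field(default_factory=dict)         # path -> source id
    segments: Dict[str, List[Segment]] = field(default_factory=dict)  # source id -> segments
    source_writes: int = 0  # counts how often a source was persisted


def enumerate_source(catalogue: Catalogue, path: str) -> str:
    """Emit the existing id for a known source; persist only unknown sources."""
    known = catalogue.source_ids.get(path)
    if known is not None:
        return known  # already known: reuse the id, do not persist anew
    new_id = str(uuid.uuid4())
    catalogue.source_ids[path] = new_id
    catalogue.source_writes += 1
    return new_id


def segment_source(
    catalogue: Catalogue,
    source_id: str,
    compute_boundaries: Callable[[], List[Tuple[float, float]]],
) -> List[Segment]:
    """Emit stored segments with their ids if present; otherwise segment and store."""
    existing = catalogue.segments.get(source_id)
    if existing is not None:
        return existing
    created = [Segment(str(uuid.uuid4()), start, end) for start, end in compute_boundaries()]
    catalogue.segments[source_id] = created
    return created


catalogue = Catalogue()
first_id = enumerate_source(catalogue, "a.mp4")
first_segments = segment_source(catalogue, first_id, lambda: [(0.0, 5.0), (5.0, 9.0)])

# Second run over the same file: same ids, nothing persisted or segmented again.
second_id = enumerate_source(catalogue, "a.mp4")
second_segments = segment_source(catalogue, second_id, lambda: [])
```

Downstream extractors in the second run see the same retrievable ids as in the first, so a new feature can be attached to the existing segments.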

@ppanopticon ppanopticon added this to the Release Candidate #2 milestone Aug 26, 2024