Push Solr Indexed Flavors into a Frictionless data package #14

Open
DiegoPino opened this issue Dec 7, 2020 · 3 comments

@DiegoPino (Member)

Why?

Our Strawberryfield data source is totally virtual. During a processing chain we use local key/value storage to allow Search API to fetch the recently ingested data. But for a longer/complete reindex we want to have that data in a more stable place, especially for longer-running/expensive operations like HOCR.

The logic we want is that, after a Processor's output has been tracked, we push the data into a (new or existing) Frictionless data package file managed by us. The idea is: if the file exists and the content for a certain Flavor ID is already inside, we update it; if not, we create the package and add it.

The Flavor Data source can then always try to fetch from the less expensive Key/Value store first and, if not found there, see if the Node itself has one of the packages corresponding to the same FLV ID.

Flavors indexed into Solr have this ID pattern (Flavor ID):
"ss_search_api_id":"strawberryfield_flavor_datasource/2017:1:en:1d9ae1cd-b3d0-477c-8061-313bb1bc9273:ocr",
Which means:
strawberryfield_flavor_datasource => the data source
2017 => the Node ID
1 => the sequence (remember this is one Node to many files to many sequences)
en => the language
1d9ae1cd-b3d0-477c-8061-313bb1bc9273 => the File UUID that was processed
ocr => the Plugin type that generated this
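
For reference, a minimal sketch (plain PHP, not the actual Strawberryfield code; the array keys are just illustrative) of splitting such an ID back into its parts:

```php
<?php
// Rough sketch only: split a Flavor ID such as
// "strawberryfield_flavor_datasource/2017:1:en:1d9ae1cd-b3d0-477c-8061-313bb1bc9273:ocr"
// back into its components.
function sbf_parse_flavor_id(string $flavor_id): array {
  [$datasource, $item_id] = explode('/', $flavor_id, 2);
  [$node_id, $sequence, $language, $file_uuid, $processor] = explode(':', $item_id, 5);
  return [
    'datasource' => $datasource,      // strawberryfield_flavor_datasource
    'node_id'    => (int) $node_id,   // 2017
    'sequence'   => (int) $sequence,  // 1
    'language'   => $language,        // en
    'file_uuid'  => $file_uuid,       // 1d9ae1cd-b3d0-477c-8061-313bb1bc9273
    'processor'  => $processor,       // ocr
  ];
}
```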

Depending on how well I can deal with issue esmero/strawberryfield#115, we may want to have many Frictionless Data Packages or a single one.

The operation would be (pseudo/buggy code; a rough sketch follows the list):

  • Post processor (flavor) output is tracked into the Index // already do this
  • Post processor checks if the Node (source) already has a datapackage for that FLV type (e.g. ocr)
  • If yes, it checks the manifest.json of the ZIP; if the same Flavor ID is already there, it replaces it
  • If not, it creates the Datapackage, initializes it, adds the first Postprocessor output and attaches it to the Node
  • This happens for every sequence, etc.
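
A very rough sketch of the "update or create" step, assuming the Frictionless descriptor is stored inside the ZIP as datapackage.json (the list above calls it manifest.json) with one resource per Flavor ID; function names, paths and descriptor layout here are hypothetical, not Strawberryfield's actual API:

```php
<?php
// Rough sketch only: update (or create) the Frictionless data package ZIP
// for one Flavor ID. The descriptor layout and file naming are assumptions,
// not the actual Strawberryfield implementation.
function sbf_flavor_push_to_datapackage(string $zip_path, string $flavor_id, string $payload): void {
  $zip = new \ZipArchive();
  // ZipArchive::CREATE makes the ZIP if it does not exist yet.
  if ($zip->open($zip_path, \ZipArchive::CREATE) !== TRUE) {
    throw new \RuntimeException("Cannot open or create $zip_path");
  }
  // Load the Frictionless descriptor (datapackage.json) or start a new one.
  $descriptor_json = $zip->getFromName('datapackage.json');
  $descriptor = $descriptor_json !== FALSE
    ? json_decode($descriptor_json, TRUE)
    : ['name' => 'sbf-flavors', 'resources' => []];
  // One resource per Flavor ID; the data file inside the ZIP is named after it.
  $resource_path = 'data/' . str_replace([':', '/'], '_', $flavor_id) . '.json';
  $resource = ['name' => $flavor_id, 'path' => $resource_path, 'format' => 'json'];
  $existing = array_search($flavor_id, array_column($descriptor['resources'], 'name'), TRUE);
  if ($existing !== FALSE) {
    $descriptor['resources'][$existing] = $resource; // Flavor already there: replace it.
  }
  else {
    $descriptor['resources'][] = $resource;          // New Flavor: add it.
  }
  // Write (or overwrite) both the payload and the descriptor back into the ZIP.
  $zip->addFromString($resource_path, $payload);
  $zip->addFromString('datapackage.json', json_encode($descriptor, JSON_PRETTY_PRINT));
  $zip->close();
}
```

Keeping one resource per Flavor ID makes the "replace if already there" check a simple lookup in the descriptor.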

On reindexing/indexing/update from Search API:

  • We get a Flavor ID. // already do this
  • We check if the pattern makes sense and validate the data. // already do this
  • We check if the ID is in the key/value store. // already do this
  • If yes -> great, we add the data again. // already do this
  • If no -> we check if the Node has the datapackage and it contains the Flavor ID; if so, we fetch the data, rebuild the needed data structure for Search API (because it is more than just the HOCR) and pass that back to Search API (see the sketch after this list).
  • If neither, it means the data does not exist anymore: processing was deleted or the original files are gone, and the Solr document is removed.
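
A minimal sketch of that lookup order (key/value first, data package second, NULL meaning the Solr document should be dropped); the store and the ZIP path are passed in as plain values to keep the sketch self-contained, since the real services are not modeled here:

```php
<?php
// Rough sketch of the lookup order: key/value first, data package second.
// $keyvalue stands in for whatever store the processor wrote to; $zip_path
// is the Node's data package (both passed in to keep the sketch runnable).
function sbf_flavor_fetch(string $flavor_id, array $keyvalue, ?string $zip_path): ?array {
  // 1. Cheap path: the key/value store written right after processing.
  if (isset($keyvalue[$flavor_id])) {
    return $keyvalue[$flavor_id];
  }
  // 2. Fallback: the Frictionless data package attached to the Node.
  if ($zip_path !== NULL && file_exists($zip_path)) {
    $zip = new \ZipArchive();
    if ($zip->open($zip_path) === TRUE) {
      $resource_path = 'data/' . str_replace([':', '/'], '_', $flavor_id) . '.json';
      $payload = $zip->getFromName($resource_path);
      $zip->close();
      if ($payload !== FALSE) {
        // Here we would rebuild the structure Search API expects
        // (more than just the HOCR) before handing it back.
        return json_decode($payload, TRUE);
      }
    }
  }
  // 3. Neither found: the caller should remove the Solr document.
  return NULL;
}
```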

@giancarlobi ideas/thoughts?

DiegoPino self-assigned this Dec 7, 2020
DiegoPino added the Datapackage / Frictionless, enhancement (New feature or request) and Solr Indexing labels Dec 7, 2020
@giancarlobi (Contributor)

@DiegoPino some "philosophical" thoughts about this, with the premise that I agree with all you wrote and that I probably still have to analyze the single steps more deeply.
Archipelago also has to satisfy long-term preservation requirements, I think, and what piece of our architecture is the most reliable place to preserve data? Obviously the storage (filesystem/S3/...), which can be mirrored/backed up, with all those tricks that make storage "forever, without data loss".
What are the less reliable pieces of our architecture? For many reasons, I think the MySQL DB (Drupal) and the Solr data (in addition to the servers themselves, but that is another topic) are the pieces that can lose data, and they are more complex to back up (back up where? to storage, so...).
Conclusion: we have to be ready to restore MySQL and Solr from the data in storage.
This involves the data package and all you wrote above: we need to store the SBF JSONs, the flavor data and the Solr docs (or something that allows us to reindex Solr) into a data package that will allow us to rebuild the whole Archipelago if something goes wrong.
I know, this is the idea, not code, and I know that we need code to make this work, so take this as a rainy day thought.

@DiegoPino (Member, Author)

@giancarlobi I totally agree. Interestingly enough, we already have almost all the (data) pieces; we just need to code it, and since we are building AMI this may be a good moment to start doing that. PS: later this week I will copy your thoughts and this post into their own ISSUES to complement what is missing:

Original Data:

  • We have: the DOStore folder in the persistent storage, with every file keyed by its checksum. These are simple full dumps of the Node and the pure JSON data.

  • We are missing:

    • Right now we are only depositing on Entity Insert events. Having the same happen on Update (Save) events is easy, but it needs to be done. We need to keep track of every change to an object.
    • We can traverse every folder and reinsert, but there is something tricky: the top-level JSON keys referring to files store File Entity IDs (numeric). In case of a full restore those are not very useful (we cannot ask Drupal to respect them), so we need to use the as:document, etc. structures that we have there to recreate the File Entities from storage using the given UUIDs, and then replace the existing IDs in the JSON with the newly created ones (see the sketch after this list). It is not complex, but it adds some overhead and the logic needs to be perfect. We may also want to do fixity checks while doing this. Eventually we can also use Frictionless Data packages to generate a ZIP with the metadata JSON and all the corresponding files, so Objects can be shared between repositories or put into cheap long-term storage.
  • We really want to make ID-to-UUID-to-ID mapping (and back) a service that can be used for anything. Entity-to-Entity relationships are the most complex topic in Drupal.

  • We may want to have time-based restores, and also Item-level restores, all via the UI.
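
A very rough sketch of that ID-swapping step during a restore, assuming a UUID-to-new-ID map built while the File Entities are recreated; the as:* entry keys and the top-level file keys below are simplified/illustrative, not the exact SBF JSON structure:

```php
<?php
// Rough sketch: after File Entities are recreated from storage by UUID,
// replace the old numeric File Entity IDs in the Node's JSON with the new
// ones. $uuid_to_new_id would be built while recreating the entities.
function sbf_remap_file_ids(array $json, array $uuid_to_new_id): array {
  $old_to_new = [];
  // Walk the as:* structures to learn which old numeric ID maps to which UUID.
  foreach (['as:document', 'as:image', 'as:audio', 'as:video'] as $as_key) {
    foreach ($json[$as_key] ?? [] as $entry) {
      if (isset($entry['dr:fid'], $entry['dr:uuid'], $uuid_to_new_id[$entry['dr:uuid']])) {
        $old_to_new[$entry['dr:fid']] = $uuid_to_new_id[$entry['dr:uuid']];
      }
    }
  }
  // Replace the numeric IDs in the top-level file keys.
  foreach (['documents', 'images', 'audios', 'videos'] as $file_key) {
    foreach ($json[$file_key] ?? [] as $i => $old_id) {
      if (isset($old_to_new[$old_id])) {
        $json[$file_key][$i] = $old_to_new[$old_id];
      }
    }
  }
  return $json;
}
```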

Solr: if we have data packages, reindexing is a breeze. It is also important that we have kill switches for the SBR processors; we do not want to re-process data when doing a full restore.

@alliomeria @dmer any thoughts on this? All this looks like code we can get rolling quite fast, but then again our hands are quite full. Should we add this to the roadmap in a more concrete fashion?

@dmer commented Dec 8, 2020

@DiegoPino I really like the ideas here (as far as I understand them :). It sounds like this would make for a much more secure and robust backup and, more importantly, restore. Having time-based and/or single-item-level restores available via the UI would be a huge improvement on the current restore capabilities I'm used to with Islandora.

As to your question about when: my main input is that I want to start working with the AMI tools ASAP, so I'm suspicious of anything that sounds like it might delay that! Perhaps I'll be able to answer better after our briefing.
