ISSUE-14: LoD Reconciling + CSV Export + Release Improvements #31

DiegoPino · 2021-07-25T22:39:16Z

See #14 and #4

LOD reconciliation works but not perfect yet. Still some tiny issues missing and some re-reconciling/correcting UI needed. Should be ready tonight? Left on this pull still a few DPM()s for my own enjoyment.Main issue here is when the Terms to reconciliation are hundreds and our current form spends too much time building it. Will need a workaround that only enables the autocompletes on click (opening the eternal AJAX battle of dynamic field rendering of course)
On LOD CSV is being generated and each LoD endpoint is being called.
Fix/manual attaches also reconciled data to temporary structure that can be read into the twig template natively. Need more tests
CSV export works now perfectly (really)
Webform search and replace (what a mess but it works!)
Many improvement on AMI, Separate batches for files, Solr works better, CSV and ZIP can live in S3.
Forced revisions on Search and Replace and AMI Updates with revision Notes

Not perfect yet. Still some tiny issues missing and some re-reconciling/correcting UI needed. Should be ready tonight? Left on this pull still a few DPM()s for my own enjoyment.

src/Form/AmiMultiStepIngest.php

@alliomeria

@alliomeria gosh. I made it i think? OF course the UI can get some better things and we need to add a LOT of options to avoid "overwriting" all your lovely Manually LoD. But gosh. This is so good!!

And our CSV headers are not (lowercase/normalized) And gosh. Well. But this seems to work. Last commit for tonight.

alliomeria · 2021-07-28T13:11:21Z

@DiegoPino woohoo! This is indeed so good! 🌟😄 Is this ok for me to pull & use/test later today?

@alliomeria

@alliomeria i will need you to test this when you can. This is finally working as expected. For DCMNY this gives us initially 714 headers! but then the cleanup reduces things significantly to lower numbers (e.g for gsmt:icc to 87 or so) We get more data of course now!! (because of children really providing their correct data when not collapsing). Uff. Was super hard to find my own error but the gist for my own future me: Batch driven Plugins like the SolrImport one provide the headers via their ::getInfo method but because we do not really know until the batch ends how many of the headers will really survive the cleanup we do not set them into the actual Config (config that is stored in each AMI set) which really means we have to manually pass them to the batch! Good, so good

DiegoPino · 2021-08-03T22:30:59Z

@alliomeria this includes now the correct Header normalization, inclusive for non collapsed Children. You may see a LOT of new headers appear when getting children (sorry) but its OK, better so. Still from the 714 initial fields, most of the time the real number ends being reduced to a manageable 87-88 one.

Next step will be to allow all RELS-EXT an RELS-INT to not be harvest to make the amount smaller.

Please give it a try tomorrow, this is really really better now!

…ering On Books with pages

alliomeria · 2021-08-04T21:10:23Z

@DiegoPino --AMI's LoD Reconciliation is awesome. All’s been going very well and running quickly with various testing and checking (for batches big & small). This is such a great tool for Archipelago users, one that is going to enable a lot of very good work to take place in repository lands. 🤓

Will share some additional notes/comments about a couple small UI things a little later on (preview: think your idea of separating out the various elements into different pages/sections for the Edit LoD is a good one, especially for larger sets), but wanted to drop off this kudos here first. This is really great & I am truly looking forward to using this in our work ahead!

@alliomeria

@alliomeria this should fix the double encoding issue, but just to be sure, could you try Solr import 7 with a collection that has complex weird characters in its descriptions/titles (maybe NYHS?) and also with a google sheet using, e.g Japanese + "" and & and ' somewher? WE can test tomorrow together. Thing is i removed a lot of code that i "for some reason" had added, so wonder if i will not break now something somewhere else, and thus "testing" is required!

I was resetting the array to give this a default value. happens that i needed the first key, not the first value and now that our "labels" are Capitalized and not == to the value (e.g Direct v/s direct) i was breaking the Form State. What a mess!

- Can not handle CSV rows that are either indexed arrays (default) or associative ones. - ::createAmiSet() can be passed a Name (title/label) via the \stdClass $data using $data->name. used for now on the CSV export but as easy-peasy as it gets to be added to the AMI normal Multi Step one (next pull?)

And mostly the logic generated by \Drupal\views_bulk_operations\Form\ConfigureAction::buildForm where form_state does not survive the ajax triggered dynamic fields that have internally a submit (e.g OpenStreetmaps). So will mark as experimental until i find a strange/dark hack to get this rolling. webform elements, gosh

Quite smart may i say! So much and so little code really. Test it, let me know how it goes

alliomeria · 2021-09-13T18:26:10Z

Hello @DiegoPino & @aksm, adding some follow up related to the batch offsets per our recent discussions.

When testing using the I7 Solr Importer for iterative batches for a single large (30k+ objects) collection (👀 photosnycbeyond:collection) using both the ISSUE-14 and 0.2.0 branches, I came across these 2 things:

Total Number of Rows: it seems that the "Number of Rows" specification will round up to the next hundred value (ie, 150 rows specification -> 200 rows fetched; 250 specification -> 300 rows fetched). @DiegoPino is this expected behavior? If so, I can add a note to the documentation about this; Perhaps inline documentation would be helpful for this too?
Offset sorting between batches: the automatic sorting by PID for successive batches does not seem to be going as expected. Based upon initial review of the three successive 10k batches (and other successive batches for smaller collections), it does not appear that the overall outcome is effected (no duplicate PIDs/objects found within related batches). Having the expected sorting (ascending) by PID for successive related batches would make review and confirmation of proper offsetting and grand object total per collection readily apparent.

DiegoPino · 2021-09-13T19:04:46Z

@alliomeria thanks.

Re 1. That is a glitch (a.k.a Bug): Because the batch sizes are of 100, I'm failing on limiting the last batch to the already harvested + missing part instead of going full in 100s. Can fix that. FYI. I can only do that for Top objects since we will never know upfront how many child objects we may have in case you go "uncollapsed" with your workflow
Re 2. This is a bit more complex and documentation (inline) may help. Solr (not us) does an offset and then sorts by PID. So basically it orders after doing the offset which means (somehow?) that the order of a single 30K will be different than the order of 3 x 10K, even if internally the Order is by PID. I need to test some other sort configurations and figure out what/how we can be consistent. Still good news that at the end you still get all and there is no overlap (so the offset does work) but yes, it would be great that summing all spreadsheets would have the same order as a big large one.

Thanks, will work on that!

@alliomeria

- This adds new methods for offsetting and paging CSV data (was a pain because JSON pretty printed has breaklines). Since i was here i also added options for getting headers (or not) for offset and paged CSV access - Adds an actual pager to the Edit Reconciling page - Nothing of this is Done-done. I need to now change the "saving" algorithm to allow partial / offset saves to happen. Probably using the already created KeyValue for the set instead of reading the actual Form data as we did with large full sets and also need the "this was fixed" checkbox to be added to a new Column More tomorrow @alliomeria !

Allison says..

@patdunlavey

After dealing with Direct Ingest i broke Twig templates.... @patdunlavey ... you will see this fail... should back again to its normal behavior.

@todo

It was getting weird to have such cross dependencies. Now both deal with their own concerns and there is complementation. This changes a lot Also: - Json Decoding Cells now deals with Smart Quote exceptions and tries to be as smart as possible in case of errors. But if the key is one of our core ones ap: or as: then in case of failure we reset to NULL value. This allows us to avoid breaking code in case one our controlled values via a Twig template OR a Direct ingest ends with wrong data. @todo. Implement a JSON-SCHEMA level validation for our own internal control vocabulary (WIP!)

@alliomeria

… numbers @alliomeria when you come back. Solr exact fetch is fixed, also fixes the real time report to the number of rows requested. Tested on collapsed/uncollapsed and with offsets to check if a set of consequent harvests with offsets was correct and not missing anything. All checks out!

@alliomeria

…the options This is basically the same as setting type = "node" but allows us in the future to also allow certain Actions to run on e.g AMI Sets @alliomeria how to test? - git pull this branch - Clear Caches - Edit your Search and replace View. Remove the "Global: Views bulk operations" field (instead of installing/reinstalling the module, this seems to be the best option a.k.a as TRICK to make actions show again) - Add it again. The list should be refreshed. Check the ones you want to use - Save the view - test!

alliomeria · 2021-09-30T21:00:21Z

Hi @DiegoPino! I re-pulled earlier in the week and again a few hours ago. All has been going very well with various testing & so far have checked these things without encountering any issues:

I7 Solr Importer: collapsed & non-collapsed children objects, for image & document-based object types; via Template
Fetching batch sets offset by specified increments; all good, respecting specified offset exactly and sorting by PID seems to be consistent
LoD Reconciliation:
- fetching single/top result, able to (manually) add multiple terms if desired (**note to all metadata pros/catalogers, seems there will be a fair amount of reworking needed for some of the more complex hierarchical subdivisions/headings and LoD vocabularies in our collective futures 🕯️ 😓)
- saving progress via page at-a-time; using the checked function
Again, as noted here, I know it's a simple thing but it's nice to have the option to name the AMI set during the initial steps.

Exporting:

After the initial wonkiness for Views Bulk Actions in the Search & Replace view in my local related to some of my initial missteps adding the Advanced Action separately first, I tested a large batch export (all that exists in my local, ~1200 objs) using only the "Attach CSV to a new AMI Set" option (interesting to see the overview of all the different data, can see this being helpful to have a snapshot to reference)
I also tested a few small batches using the different options combinations & all went as expected (@DiegoPino & @aksm, for our internal reference I dropped off some of the testing outputs here - AMI Test Spreadsheets doc, cells in green are good in the local csv file copy, but too large for google sheets display)

Hope some of these notes are useful. I will keep testing & using AMI and report any issues if encountered. AMI is for sure a beast 🐯🐉 of a module as you say @DiegoPino!

alliomeria · 2021-09-30T21:04:42Z

ps: @DiegoPino --your latest commit about the EntityChangedActionDeriver did address the issue I was having with that errant leftover Action. The Action was gone after clearing caches and I did not need to remove the "Global" Views bulk operations" from the view (I could though, if that's the best/a good trick to use in this case). Thank you very much for helping me sort out this issue in my local!

DiegoPino · 2021-09-30T21:09:05Z

@alliomeria thanks!!! We will come back to do some UX refinements to the CSV export when I'm back. I really appreciate all the testing!

Happens that i uncommented a line i had commented because i used to be smart. Not anymore. If you add a #name property to an AJAX driven form with select, the original value gets cached for ever and never changes even if you submit it differently. How do i know this? Just because i tested it. There is NO documentation. Wonder if that is also the issue with Webform based find and replace??

@alliomeria

Internally our CSV to data structure is already normalized for lower case (we need to document this better). So for Solr import, also inmediatelly set the column names to lower case and in general enforce checking against lowercases when doing validation. Does not require any change really on Twig templates because for Twig "data.HoLa" is the same as "data.hola". Just saying @alliomeria

all ZIP files will go eventually (once the AMI set entity is saved or when uploaded directly via "Edit" of a Set, into private://ami/zip

Since we can not predict the final extension of a download upfront without download i check now using gob() for the first part of the possible filename. If there is already a single (only a single one) File with the same starting future name (which is consistent since we use the URL of the file to generate it) then we reuse the already downloaded one. Will add also a button that deletes all files for an AMI set in the future and one that forces the download even if there. Also this piece of code ensures ZIP files (After first upload) are moved into private://ami/zip once the AMI set is saved

@aksm

And first pass works, like a charm! Ok, need to add 1.- Config entry to set "what is many files" and what not. 2.- Better reuse of as:technical metadata, so we do not reprocess that (file re use after download is already in place). Probably the best thing ever here. @aksm @alliomeria

@alliomeria

@alliomeria can you test with the same CSV with 387 objects? And please twice? First time may be slow-ish, second time quite fast

also reprocessing of cached TECHMD, etc

@dmer

@dmer @patdunlavey can you check this code please (as you know release process so as soon as you can). I wonder if you removed moderation workflows which enables by default revisions or if you have not set revisions at the bundle level?) But this will force that.

@alliomeria

Works. Looks good. I'm good. A good person right @alliomeria and @aksm ? Revisit CSV offsets once we are on PHP 8.1 I found a PHP BUG!!

…hing

Need to check now CSV write!

I have been willing to write the first PHP remote streamer library from forever and i never have the time. So for now, for the people that decide to put ALLL on S3 we have to download the file.. no other option. So we inject the StrawberryfieldFileMetadataService to reuse the ::ensureFileAvailability function. But we can not delete the file here since we may need it for the rest of our times.

But.. VBO actions is not respecting the Facet selection when using the "Select all" option... We should make sure that is clear? Documentation? Eventually patch VBO or, alter the form and remove Select All completely?

First pass on LoD Reconciling

a157a20

Not perfect yet. Still some tiny issues missing and some re-reconciling/correcting UI needed. Should be ready tonight? Left on this pull still a few DPM()s for my own enjoyment.

DiegoPino self-assigned this Jul 25, 2021

DiegoPino added enhancement New feature or request Ingest Setup Knobs and Levers you move while thinking about feelings and metadata and CSV files queue FIFO queue workers Ones taking the FI and doing the FO Reporting Errors, Logs, etc. UI Buttons and pixels labels Jul 25, 2021

DiegoPino added this to the 0.2.0 milestone Jul 25, 2021

DiegoPino commented Jul 25, 2021

View reviewed changes

src/Form/AmiMultiStepIngest.php Show resolved Hide resolved

DiegoPino added 3 commits July 26, 2021 10:49

Push headers/erase old LoD reconciled when re-running

cd1b043

Wow. We got this even with LoD Fix/Edit form!!

9909ff7

@alliomeria gosh. I made it i think? OF course the UI can get some better things and we need to add a LOT of options to avoid "overwriting" all your lovely Manually LoD. But gosh. This is so good!!

Because MADS RDF is case Sensitive.. add CONST mapping

95922b1

And our CSV headers are not (lowercase/normalized) And gosh. Well. But this seems to work. Last commit for tonight.

DiegoPino added 2 commits July 30, 2021 09:54

Make sure sort by PID ::getData is given so offset is consistent

07bd252

use RELS_EXT_isPageNumber_literal_intDerivedFromString_l for page ord…

36f53a8

…ering On Books with pages

DiegoPino added 7 commits August 19, 2021 18:05

weird left over ;

fb8cd17

Fix some silly select defaults

8f1cba7

I was resetting the array to give this a default value. happens that i needed the first key, not the first value and now that our "labels" are Capitalized and not == to the value (e.g Direct v/s direct) i was breaking the Form State. What a mess!

New Permissions to allow export to CSV

a1472be

Full Export Action to CSV

84ed6ab

Quite smart may i say! So much and so little code really. Test it, let me know how it goes

DiegoPino changed the title ~~ISSUE-14: First pass on LoD Reconciling~~ ISSUE-14: First pass on LoD Reconciling + CSV Export Sep 2, 2021

DiegoPino added 8 commits September 23, 2021 15:18

Not be super-smart. Just smart-ish with JSON attempts of decoding

a25920d

Allison says..

Persist LoD Per Page and also Globally into a CSV

3758608

Small cleanup and better checks on passed for IngestADOQueueWorker

583e6bc

Gosh. Restored Twig Template processing

c59d170

After dealing with Direct Ingest i broke Twig templates.... @patdunlavey ... you will see this fail... should back again to its normal behavior.

Allow (enforce) AMI Set Label to be provided during Setup

a161939

DiegoPino added 17 commits October 4, 2021 20:53

Small typo on Docs

19ea03b

Small fail safe update in case the queue fails for not-us reasons

fffc059

Make ZIP upload field Private instead of tmp

d71bb08

all ZIP files will go eventually (once the AMI set entity is saved or when uploaded directly via "Edit" of a Set, into private://ami/zip

This looks better

4581569

@alliomeria can you test with the same CSV with 387 objects? And please twice? First time may be slow-ish, second time quite fast

New options. Force file processing as separate item can be set per SET

f629a7c

also reprocessing of cached TECHMD, etc

Enforces revisions?

054c668

@dmer @patdunlavey can you check this code please (as you know release process so as soon as you can). I wonder if you removed moderation workflows which enables by default revisions or if you have not set revisions at the bundle level?) But this will force that.

AMI version of Entity Preview for Format Strawberryfield

e62e6ce

Works. Looks good. I'm good. A good person right @alliomeria and @aksm ? Revisit CSV offsets once we are on PHP 8.1 I found a PHP BUG!!

In case the key is not present or the set is "type" less default to t…

7faea54

…hing

Fix remote CSV load

7edc6f8

Need to check now CSV write!

Allows CSV to work from S3 too

88ba765

Another one. ZIP loading from S3 still needed

2166ce4

Webform based Search and Replace is working!

2b92b72

But.. VBO actions is not respecting the Facet selection when using the "Select all" option... We should make sure that is clear? Documentation? Eventually patch VBO or, alter the form and remove Select All completely?

DiegoPino changed the title ~~ISSUE-14: First pass on LoD Reconciling + CSV Export~~ ISSUE-14: LoD Reconciling + CSV Export + Release Improvements Nov 18, 2021

DiegoPino merged commit effe38e into 0.2.0 Nov 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ISSUE-14: LoD Reconciling + CSV Export + Release Improvements #31

ISSUE-14: LoD Reconciling + CSV Export + Release Improvements #31

DiegoPino commented Jul 25, 2021 •

edited

Loading

alliomeria commented Jul 28, 2021 •

edited

Loading

DiegoPino commented Aug 3, 2021 •

edited

Loading

alliomeria commented Aug 4, 2021 •

edited

Loading

alliomeria commented Sep 13, 2021

DiegoPino commented Sep 13, 2021

alliomeria commented Sep 30, 2021

alliomeria commented Sep 30, 2021 •

edited

Loading

DiegoPino commented Sep 30, 2021

ISSUE-14: LoD Reconciling + CSV Export + Release Improvements #31

ISSUE-14: LoD Reconciling + CSV Export + Release Improvements #31

Conversation

DiegoPino commented Jul 25, 2021 • edited Loading

alliomeria commented Jul 28, 2021 • edited Loading

DiegoPino commented Aug 3, 2021 • edited Loading

alliomeria commented Aug 4, 2021 • edited Loading

alliomeria commented Sep 13, 2021

DiegoPino commented Sep 13, 2021

alliomeria commented Sep 30, 2021

alliomeria commented Sep 30, 2021 • edited Loading

DiegoPino commented Sep 30, 2021

DiegoPino commented Jul 25, 2021 •

edited

Loading

alliomeria commented Jul 28, 2021 •

edited

Loading

DiegoPino commented Aug 3, 2021 •

edited

Loading

alliomeria commented Aug 4, 2021 •

edited

Loading

alliomeria commented Sep 30, 2021 •

edited

Loading