Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ISSUE-14: LoD Reconciling + CSV Export + Release Improvements #31

Merged
merged 42 commits into from
Nov 18, 2021

Conversation

DiegoPino
Copy link
Member

@DiegoPino DiegoPino commented Jul 25, 2021

See #14 and #4

  • LOD reconciliation works but not perfect yet. Still some tiny issues missing and some re-reconciling/correcting UI needed. Should be ready tonight? Left on this pull still a few DPM()s for my own enjoyment.Main issue here is when the Terms to reconciliation are hundreds and our current form spends too much time building it. Will need a workaround that only enables the autocompletes on click (opening the eternal AJAX battle of dynamic field rendering of course)

  • On LOD CSV is being generated and each LoD endpoint is being called.

  • Fix/manual attaches also reconciled data to temporary structure that can be read into the twig template natively. Need more tests

  • CSV export works now perfectly (really)

  • Webform search and replace (what a mess but it works!)

  • Many improvement on AMI, Separate batches for files, Solr works better, CSV and ZIP can live in S3.

  • Forced revisions on Search and Replace and AMI Updates with revision Notes

Not perfect yet. Still some tiny issues missing and some re-reconciling/correcting UI needed. Should be ready tonight? Left on this pull still a few DPM()s for my own enjoyment.
@DiegoPino DiegoPino self-assigned this Jul 25, 2021
@DiegoPino DiegoPino added enhancement New feature or request Ingest Setup Knobs and Levers you move while thinking about feelings and metadata and CSV files queue FIFO queue workers Ones taking the FI and doing the FO Reporting Errors, Logs, etc. UI Buttons and pixels labels Jul 25, 2021
@DiegoPino DiegoPino added this to the 0.2.0 milestone Jul 25, 2021
@alliomeria gosh. I made it i think?
OF course the UI can get some better things and we need to add a LOT of options to avoid "overwriting" all your lovely Manually LoD. But gosh. This is so good!!
And our CSV headers are not (lowercase/normalized) And gosh. Well. But this seems to work. Last commit for tonight.
@alliomeria
Copy link
Contributor

alliomeria commented Jul 28, 2021

@DiegoPino woohoo! This is indeed so good! 🌟😄 Is this ok for me to pull & use/test later today?

@alliomeria i will need you to test this when you can. This is finally working as expected. For DCMNY this gives us initially 714 headers! but then the cleanup reduces things significantly to lower numbers (e.g for gsmt:icc to 87 or so)
We get more data of course now!! (because of children really providing their correct data when not collapsing).

Uff. Was super hard to find my own error but the gist for my own future me:

Batch driven Plugins like the SolrImport one provide the headers via their ::getInfo method but because we do not really know until the batch ends how many of the headers will really survive the cleanup we do not set them into the actual Config (config that is stored in each AMI set) which really means we have to manually pass them to the batch!

Good, so good
@DiegoPino
Copy link
Member Author

DiegoPino commented Aug 3, 2021

@alliomeria this includes now the correct Header normalization, inclusive for non collapsed Children. You may see a LOT of new headers appear when getting children (sorry) but its OK, better so. Still from the 714 initial fields, most of the time the real number ends being reduced to a manageable 87-88 one.

Next step will be to allow all RELS-EXT an RELS-INT to not be harvest to make the amount smaller.

Please give it a try tomorrow, this is really really better now!

@alliomeria
Copy link
Contributor

alliomeria commented Aug 4, 2021

@DiegoPino --AMI's LoD Reconciliation is awesome. All’s been going very well and running quickly with various testing and checking (for batches big & small). This is such a great tool for Archipelago users, one that is going to enable a lot of very good work to take place in repository lands. 🤓

Will share some additional notes/comments about a couple small UI things a little later on (preview: think your idea of separating out the various elements into different pages/sections for the Edit LoD is a good one, especially for larger sets), but wanted to drop off this kudos here first. This is really great & I am truly looking forward to using this in our work ahead!

@alliomeria this should fix the double encoding issue, but
just to be sure, could you try Solr import 7 with a collection that has complex weird characters in its descriptions/titles (maybe NYHS?)
and also with a google sheet using, e.g Japanese + "" and & and ' somewher?

WE can test tomorrow together. Thing is i removed a lot of code that i "for some reason" had added, so wonder if i will not break now something somewhere else, and thus "testing" is required!
I was resetting the array to give this a default value. happens that i needed the first key, not the first value and now that our "labels" are Capitalized and not == to the value (e.g Direct v/s direct) i was breaking the Form State. What a mess!
- Can not handle CSV rows that are either indexed arrays (default) or associative ones.
- ::createAmiSet() can be passed a Name (title/label) via the \stdClass $data using $data->name. used for now on the CSV export but as easy-peasy as it gets to be added to the AMI normal Multi Step one (next pull?)
And mostly the logic generated by \Drupal\views_bulk_operations\Form\ConfigureAction::buildForm where form_state does not survive the ajax triggered dynamic fields that have internally a submit (e.g OpenStreetmaps). So will mark as experimental until i find a strange/dark hack to get this rolling. webform elements, gosh
Quite smart may i say! So much and so little code really. Test it, let me know how it goes
@DiegoPino DiegoPino changed the title ISSUE-14: First pass on LoD Reconciling ISSUE-14: First pass on LoD Reconciling + CSV Export Sep 2, 2021
@alliomeria
Copy link
Contributor

Hello @DiegoPino & @aksm, adding some follow up related to the batch offsets per our recent discussions.

When testing using the I7 Solr Importer for iterative batches for a single large (30k+ objects) collection (👀 photosnycbeyond:collection) using both the ISSUE-14 and 0.2.0 branches, I came across these 2 things:

  1. Total Number of Rows: it seems that the "Number of Rows" specification will round up to the next hundred value (ie, 150 rows specification -> 200 rows fetched; 250 specification -> 300 rows fetched). @DiegoPino is this expected behavior? If so, I can add a note to the documentation about this; Perhaps inline documentation would be helpful for this too?

  2. Offset sorting between batches: the automatic sorting by PID for successive batches does not seem to be going as expected. Based upon initial review of the three successive 10k batches (and other successive batches for smaller collections), it does not appear that the overall outcome is effected (no duplicate PIDs/objects found within related batches). Having the expected sorting (ascending) by PID for successive related batches would make review and confirmation of proper offsetting and grand object total per collection readily apparent.

@DiegoPino
Copy link
Member Author

@alliomeria thanks.

Re 1. That is a glitch (a.k.a Bug): Because the batch sizes are of 100, I'm failing on limiting the last batch to the already harvested + missing part instead of going full in 100s. Can fix that. FYI. I can only do that for Top objects since we will never know upfront how many child objects we may have in case you go "uncollapsed" with your workflow
Re 2. This is a bit more complex and documentation (inline) may help. Solr (not us) does an offset and then sorts by PID. So basically it orders after doing the offset which means (somehow?) that the order of a single 30K will be different than the order of 3 x 10K, even if internally the Order is by PID. I need to test some other sort configurations and figure out what/how we can be consistent. Still good news that at the end you still get all and there is no overlap (so the offset does work) but yes, it would be great that summing all spreadsheets would have the same order as a big large one.

Thanks, will work on that!

- This adds new methods for offsetting and paging CSV data (was a pain because JSON pretty printed has breaklines). Since i was here i also added options for getting headers (or not) for offset and paged CSV access
- Adds an actual pager to the Edit Reconciling page
- Nothing of this is Done-done. I need to now change the "saving" algorithm to allow partial / offset saves to happen. Probably using the already created KeyValue for the set instead of reading the actual Form data as we did with large full sets and also need the "this was fixed" checkbox to be added to a new Column

More tomorrow @alliomeria !
After dealing with Direct Ingest i broke Twig templates.... @patdunlavey ... you will see this fail... should back again to its normal behavior.
It was getting weird to have such cross dependencies. Now both deal with their own concerns and there is complementation. This changes a lot

Also:
- Json Decoding Cells now deals with Smart Quote exceptions and tries to be as smart as possible in case of errors. But if the key is one of our core ones ap: or as: then in case of failure we reset to NULL value. This allows us to avoid breaking code in case one our controlled values via a Twig template OR a Direct ingest ends with wrong data.

@todo. Implement a JSON-SCHEMA level validation for our own internal control vocabulary (WIP!)
… numbers

@alliomeria when you come back. Solr exact fetch is fixed, also fixes the real time report to the number of rows requested. Tested on collapsed/uncollapsed and with offsets to check if a set of consequent harvests with  offsets was correct and not missing anything. All checks out!
…the options

This is basically the same as setting type = "node" but allows us in the future to also allow certain Actions to run on e.g AMI Sets

@alliomeria how to test?

- git pull this branch
- Clear Caches
- Edit your Search and replace View. Remove the "Global: Views bulk operations" field (instead of installing/reinstalling the module, this seems to be the best option a.k.a as TRICK to make actions show again)
- Add it again. The list should be refreshed. Check the ones you want to use
- Save the view
- test!
@alliomeria
Copy link
Contributor

Hi @DiegoPino! I re-pulled earlier in the week and again a few hours ago. All has been going very well with various testing & so far have checked these things without encountering any issues:

  • I7 Solr Importer: collapsed & non-collapsed children objects, for image & document-based object types; via Template
  • Fetching batch sets offset by specified increments; all good, respecting specified offset exactly and sorting by PID seems to be consistent
  • LoD Reconciliation:
    • fetching single/top result, able to (manually) add multiple terms if desired (**note to all metadata pros/catalogers, seems there will be a fair amount of reworking needed for some of the more complex hierarchical subdivisions/headings and LoD vocabularies in our collective futures 🕯️ 😓)
    • saving progress via page at-a-time; using the checked function
  • Again, as noted here, I know it's a simple thing but it's nice to have the option to name the AMI set during the initial steps.

Exporting:

  • After the initial wonkiness for Views Bulk Actions in the Search & Replace view in my local related to some of my initial missteps adding the Advanced Action separately first, I tested a large batch export (all that exists in my local, ~1200 objs) using only the "Attach CSV to a new AMI Set" option (interesting to see the overview of all the different data, can see this being helpful to have a snapshot to reference)
  • I also tested a few small batches using the different options combinations & all went as expected (@DiegoPino & @aksm, for our internal reference I dropped off some of the testing outputs here - AMI Test Spreadsheets doc, cells in green are good in the local csv file copy, but too large for google sheets display)

Hope some of these notes are useful. I will keep testing & using AMI and report any issues if encountered. AMI is for sure a beast 🐯🐉 of a module as you say @DiegoPino!

@alliomeria
Copy link
Contributor

alliomeria commented Sep 30, 2021

ps: @DiegoPino --your latest commit about the EntityChangedActionDeriver did address the issue I was having with that errant leftover Action. The Action was gone after clearing caches and I did not need to remove the "Global" Views bulk operations" from the view (I could though, if that's the best/a good trick to use in this case). Thank you very much for helping me sort out this issue in my local!

@DiegoPino
Copy link
Member Author

@alliomeria thanks!!! We will come back to do some UX refinements to the CSV export when I'm back. I really appreciate all the testing!

Happens that i uncommented a line i had commented because i used to be smart. Not anymore. If you add a #name property to an AJAX driven form with select, the original value gets cached for ever and never changes even if you submit it differently. How do i know this? Just because i tested it. There is NO documentation. Wonder if that is also the issue with Webform based find and replace??
Internally our CSV to data structure is already normalized for lower case (we need to document this better). So for Solr import, also inmediatelly set the column names to lower case and in general enforce checking against lowercases when doing validation. Does not require any change really on Twig templates because for Twig "data.HoLa" is the same as "data.hola". Just saying @alliomeria
all ZIP files will go eventually  (once the AMI set entity is saved or when uploaded directly via "Edit" of a Set, into private://ami/zip
Since we can not predict the final extension of a download upfront without download i check now using gob() for the first part of the possible filename.
If there is already a single (only a single one) File with the same starting future name (which is consistent since we use the URL of the file to generate it) then we reuse the already downloaded one. Will add also a button that deletes all files for an AMI set in the future and one that forces the download even if there.

Also this piece of code ensures ZIP files (After first upload) are moved into private://ami/zip once the AMI set is saved
And first pass works, like a charm!
Ok, need to add
1.- Config entry to set "what is many files" and what not.
2.- Better reuse of as:technical metadata, so we do not reprocess that (file re use after download is already in place).

Probably the best thing ever here. @aksm @alliomeria
@alliomeria can you test with the same CSV with 387 objects?
And please twice? First time may be slow-ish, second time quite fast
@dmer @patdunlavey can you check this code please (as you know release process so as soon as you can). I wonder if you removed moderation workflows which enables by default revisions or if you have not set revisions at the bundle level?) But this will force that.
Works. Looks good. I'm good. A good person right @alliomeria and @aksm ?
Revisit CSV offsets once we are on PHP 8.1 I found a PHP BUG!!
Need to check now CSV write!
I have been willing to write the first PHP remote streamer library from forever and i never have the time. So for now, for the people that decide to put ALLL on S3 we have to download the file.. no other option.
So we inject the StrawberryfieldFileMetadataService to reuse the ::ensureFileAvailability function.
But we can not delete the file here since we may need it for the rest of our times.
But.. VBO actions is not respecting the Facet selection when using the "Select all" option... We should make sure that is clear? Documentation? Eventually patch VBO or, alter the form and remove Select All completely?
@DiegoPino DiegoPino changed the title ISSUE-14: First pass on LoD Reconciling + CSV Export ISSUE-14: LoD Reconciling + CSV Export + Release Improvements Nov 18, 2021
@DiegoPino DiegoPino merged commit effe38e into 0.2.0 Nov 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Ingest Setup Knobs and Levers you move while thinking about feelings and metadata and CSV files queue workers Ones taking the FI and doing the FO queue FIFO Reporting Errors, Logs, etc. UI Buttons and pixels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants