Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

T300 bulkrax8.0.0 #555

Merged
merged 30 commits into from
Sep 4, 2024
Merged

T300 bulkrax8.0.0 #555

merged 30 commits into from
Sep 4, 2024

Conversation

kerchner
Copy link
Member

@kerchner kerchner commented Jun 3, 2024

This PR does the following:

  • Upgrades Bulkrax from 2.3.0 to a post-8.0.0 commit (when this commit is included in the next Bulkrax release, we can at that point upgrade to that release)

Set up the code branch for testing:

  • Do a bundle install to upgrade Bulkrax.
  • Make sure to run db:migrate -- the Bulkrax upgrade affects Bulkrax-related database tables.
  • On the host server, create a /opt/scholarspace/scholarspace-ingest directory. (Creation of this directory, if not already present, has been added to Dockerfile.)

Set up data for testing:

There is an AWS S3 bucket called proquest-etds-test containing a set of ProQuest ETD zip files suitable for testing. Steps to set up the data so that it can be accessed from within the docker container app-server and sidekiq volumes:

  • Install s3fs-fuse on your server
  • Set up credentials in ${HOME}/.passwd-s3fs - please see me or @kilahimm for the bulkrax access key credentials.
  • Mount the bucket on your file system. For example, you might create a /data-s3-etds directory and then run s3fs proquest-etds-test /data-s3-etds
  • Copy some of the ETDs from the S3 mount point to /opt/scholarspace/scholarspace-ingest or a directory beneath it.

Test:

  • Run the ingest_bulkrax_prep rake task inside the container (or from the outside using docker exec. The task requires an argument, which is the path to the directory containing the ProQuest zips you wish to include in the ingest. For example, bundle exec rails gwss:ingest_pq_etds['/opt/scholarspace/scholarspace-ingest/etd-zips'] # where etds are in /opt/scholarspace/scholarspace-ingest/etd-zips
  • This will create /opt/scholarspace/scholarspace-hyrax/tmp/bulkrax_zip and within this folder, metadata.csv and files/ containing attachments (within subdirectories, one per ETD)
  • You may wish to scp tmp/bulkrax_zip/metadata.csv to your local computer so that you can more conveniently view it, for example using Excel, in order to validate test results.
  • Within the GW ScholarSpace web application, log in as an administrative user. On the Dashboard, click on Importers. Create a New importer with the following values:
    • Name = any name
    • Administrative Set = ETDs
    • Frequency = Once (on save)
    • Limit = leave blank
    • Parser = CSV - Comma Separated Values
    • Visibility = Public
    • Rights Statement = leave blank
    • Add CSV File to Import: Specify a Path on the Server. Import file path = /opt/scholarspace/scholarspace-hyrax/tmp/bulkrax_zip/metadata.csv

Before starting the import, open a tab to the Sidekiq administrator (at /sidekiq) so that you can watch progress of the queues.

Then proceed and click Create and Import.

*If you wish to re-run the task to generate the bulkrax-ready metadata and files, then you'll need to first clear out the results of the previous run: rm -r /opt/scholarspace/scholarspace-hyrax/tmp/bulkrax_zip

Validate the test results

  • Validate via the Importer in the UI, that there were no errors.
  • After waiting for all jobs to finish in Sidekiq:
    • confirm that an ETD object was created for each ETD in metadata.csv (and that at least the count of ETDs matches the number of ETD zip files).
    • confirm that files which should be embargoed are embargoed. Note that there seems to be a bug in the Hyrax UI, whereby embargo release date does not display. However, the embargo release date on a FileSet (or other object) can be confirmed via the rails console if desired.
    • confirm that the thumbnail derivatives are created for attachments of thumbnail-able types.
    • confirm that the thumbnail for the GwETD is the thumbnail for the main thesis/dissertation PDF.
    • confirm that after the ETDs have all finished importing, the ETDs are public, even if some of them have embargoed FileSets
  • Via Sidekiq, there may be occasional objects (PDFs, etc.) that failed to characterize using ghostscript. These seem to fail in the same manner when attaching through the UI. As long as the objects (PDF, etc.) do attach and are downloadable via the UI, and that these do not somehow crash the whole ingest (they shouldn't), then the PR should pass.

Other notes to the reviewer

I recognize that the code for the rake task, as currently written, is probably not the most ruby-esque in style. I am very open to suggestions for improvements to the style (and, of course, function) of the code.

Final thoughts

While this procedure for parsing the ProQuest zips, then loading via the front end, is still somewhat manual, requiring running back-end tasks, loading ETDs is generally something that we do just a few times per year. I also recognize that separating out ETDs from the S3 mount to be newly loaded, and avoiding re-loading ETDs that we already loaded, will also be somewhat of a manual process, although I intend to post suggested bash commands to make this a bit easier. The concept is to get this much working, try it for a few cycles of loads, and take what we learn and then decide what enhancements to the overall workflow would be most worthwhile.

kerchner added 22 commits May 8, 2024 18:05
… to bulkrax zip staging directory, but will need to segregate files from each ETD into separate directoroes
…sn't yet an admin set for admin to deposit to
@kerchner kerchner marked this pull request as ready for review June 7, 2024 05:56
@kerchner kerchner requested review from alepbloyd and dolsysmith June 7, 2024 05:57
Copy link
Contributor

@alepbloyd alepbloyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything seems to work as described! Woo! I took a quick look at the ingest code, and I think it's fine as is - if it's working properly and probably won't be changing frequently, I don't think it's worth trying to pick it apart to make it more ruby-ish.

Two thoughts for future iterations:

  1. I'll try to take a stab at writing a handful of RSpec tests for this. I think we could make a dummy ETD zip with embargoed files as a fixture, and then run the ingest task against that zip file. I don't think that needs to hold up merging this though and can do that as a separate PR.
  2. It seems like there's a way to use AWS Lambda to trigger events when the contents of an S3 bucket change (https://docs.aws.amazon.com/lambda/latest/dg/lambda-services.html#supported-event-source-s3). I wonder if we could set that up to run the ingest when a new ETD zip is deposited in the S3 bucket and remove the need to manually run this? Definitely something we'd want to discuss more, but I might mess around and see if I can get a proof of concept running when I have some time.

@kerchner
Copy link
Member Author

kerchner commented Jun 10, 2024

Thanks @alepbloyd !

Re: 1. I am definitely interested to see what you come up with for RSpec tests, I realize that those need to be written as well.

Re: 2. That sounds like a nice technological solution. However - and we should discuss this - I don't think we're near the point where we'd want any works at all being loaded into production without a human triggering the process and without a human doing QC. In practice we should first run any ingest in the test environment and watch the process to observe any issues and check the results. There are ETDs that can choke, especially on the steps where ghostscript characterizes them and where ImageMagick generates thumbnails; in theory we could also encounter issues parsing the ProQuest XML. So, running the ingest is always well served by having a person monitor the process.

@dolsysmith do you also want to test this PR, or should we proceed with merge?

Copy link
Contributor

@dolsysmith dolsysmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the tests on the complete contents of the etds folder in the proquest-etds-test bucket.

The ingest_bulkrax_prep task yielded 99 works with associated files. The metadata sheet seems correctly formatted, etc.

I then created an importer for these 99 works. Most works were successfully imported, but a few errors did appear:

  • Derivatives failed to be created for at least three filesets. See the screenshots below from the Sidekiq logs.
    • In at least one case, the log revealed a Ghostscript error. The file is attached and downloaded to the work, but the derivative is lacking.
    • In another case, the logs showed a timeout error. Based on what I was seeing in the logs, I think it might have been triggered by a Java heap error on the Solr app. The PDF in question is pretty large: Li_gwu_0075A_16600.pdf.
  • One work's page (work, not fileset) is not accessible at all: Investigating Multicolor Affine Urn Models with Multiple Drawings. I can retrieve the ActiveRecord instance in the Rails console, and the metadata looks fine, but the app throws an error in the UI -- I think because it's looking for an image method on the WorkShowPresenter instance.

Embargoing seems to work, but I am seeing that files that are supposed to be embargoed as restricted -- per the metadata CSV -- are actually marked private in GWSS. It's my understanding that the restricted permissions are supposed to allow GW community access (but maybe it has a different meaning in the Bullkrax context?).

Another observation, which may be more of a Bulkrax bug, is that these errors didn't bubble up to the Importer UI in GWSS. All of the works were marked as Completed, not Completed with failures, despite the aforementioned errors' appearing in the logs and the derivatives' not being created.

Let me know if you want me to re-test anything with a smaller batch. Perhaps it was the sheer amount of files that caused some of these errors.

Screenshot 2024-06-10 at 12 57 42 PM

Screenshot 2024-06-10 at 12 55 53 PM

Screenshot 2024-06-10 at 12 41 08 PM

Screenshot 2024-06-10 at 12 34 42 PM

@kerchner
Copy link
Member Author

As per discussion, an embargo value of restricted in the bulkrax metadata CSV seems to map to private visibility in the app (and a label of embargo is correctly applied as well).

@dolsysmith is going to try re-loading the ETDs in question via the UI to see if the failure to generate derivatives occurs there as well, and whether the Investigating Multicolor Affine Urn Models with Multiple Drawings ETD can be uploaded or not.

@alepbloyd
Copy link
Contributor

If additional context is helpful, I only ran the import on the ETDs numbered up through 1029184 when I was testing.

I'm working on trying to get some RSpec tests going for this, so if there's any additional troubleshooting y'all want me to try on my server - just let me know.

@dolsysmith
Copy link
Contributor

@kerchner I was able to replicate the errors in creating derivatives by uploading the works through the UI. One of these is throwing the ImageMagick/GhostScript error, and the other (the 17 MB file) is just causing a timeout -- I can't see any other errors for that in the logs. But that's the same behavior as when uploaded via Bulkrax, so I don't think the problem is with the code in this branch.

Regarding the work "nvestigating Multicolor Affine Urn Models with Multiple Drawings, I was able to upload this work directly into the UI without issue. So I then tried to replicate by re-uploading it as part of small batch with Bulkrax (only 4 works), and this also worked. So I'm afraid I'm not sure what happened the first time. I deleted the original, error-inducing work using the console, so I'm afraid I don't have a record of it anymore for troubleshooting.

@dolsysmith
Copy link
Contributor

Also, I notice that both of my Bulkrax imports still show a "pending" status in the Bulkrax menu, even though when you click on either, there are no pending jobs. Are you seeing the same on your instance?

Screenshot 2024-06-10 at 3 44 49 PM

@kerchner
Copy link
Member Author

Also, I notice that both of my Bulkrax imports still show a "pending" status in the Bulkrax menu, even though when you click on either, there are no pending jobs. Are you seeing the same on your instance?

Screenshot 2024-06-10 at 3 44 49 PM

At least when I try the full 99 test ETDs, the importer itself gets a green checkmark pretty quickly, even while its component objects are still pending. In other situations, I've also gotten the red x quickly as well. Not sure why yours still shows pending.

@kerchner
Copy link
Member Author

@dolsysmith I tried reloading all 99 ETDs as a single import. Investigating Multicolor Affine Urn Models with Multiple Drawings looks like it loaded fine, including thumbnail. Items with issues:

  • Li loaded without a problem, but there is no thumbnail. (and note that this batch has ETDs much larger than Li, so I'm not sure that file size per se is the issue)
  • Clementi loaded, but there is no thumbnail.
  • Moore loaded, and shows a thumbnail for the catalog result, but not on the item view.

The errors I'm seeing with the Sidekiq jobs are consistent with the above, in that for each of the 3, the CreateRelationships job is not stuck, but the CreateDerivatives job is. Each has a different error.

image

I haven't tried reloading via the UI, but I would expect to get the same results as you did (i.e. ok but no thumbnail).

I believe these are errors that we cannot easily fix in Hyrax/Ghostscript/Minimagick, but we can live with. Plus they are neither new nor unique to this new Bulkrax loader. Over the past few years they have become less frequent, but they do still occur as we see here.

@alepbloyd @dolsysmith @kilahimm What are your thoughts on proceeding with merge given these results of testing?

etd_date_created = get_date_created(doc)
work_metadata['date_created'] = etd_date_created unless etd_date_created.nil?
work_metadata['committee_member'] = get_committee_members(doc).join(';')
work_metadata['rights_statement'] = 'http://rightsstatements.org/vocab/InC/1.0/'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add the license and resource_type fields, since those are required when updating/editing a work in the UI.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this - I'll fix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add resource_type and @dolsysmith you'll add license in the branch for #458

@dolsysmith
Copy link
Contributor

@kerchner Merging makes sense. See my comment above about a minor change to ensure that the ETD's created have values for all of the fields required by the UI. If required values are missing, one cannot save changes when. manually editing the work, which may prove confusing for users.

@kerchner
Copy link
Member Author

kerchner commented Jun 11, 2024

Work-in-progress: Adding the "All rights reserved" license will require some problem-solving because Bulkrax wants to append a / to the license ID value in the license metadata CSV field (see https://github.com/samvera/bulkrax/blob/v8.1.0/app/models/concerns/bulkrax/import_behavior.rb#L145-L146 ) - and the "corrected" value no longer matches the value in the authority yaml file of http://www.europeana.eu/portal/rights/rr-r.html (which does not have a trailing /, nor should it)

Also noting that we do not seem to ever display the license field on works in GW ScholarSpace to the user, yet the field is required. Should it be required? I've sent the question to Barrett.

@kerchner
Copy link
Member Author

kerchner commented Jun 14, 2024

@dolsysmith @kilahimm @alepbloyd
I would propose the following while the question of the desired value for license is reviewed by ScholComm: I can create a new issue to track the deficiency where the ingest task is not setting license - which does have the aforementioned effect on content-admin users. Proceed with merge. If desired, we can use a script or task to set license for all ETDs lacking it, to either the currently used value or some new value. This would be a temporary measure to address the next batch of ETDs. If the currently used URI for "All Rights Reserved" is confirmed going forward, we will need to find a solution to work around Bulkrax aggressively slash-terminating values in the license column of the metadata CSV.

Thoughts on this?

@dolsysmith
Copy link
Contributor

I updated this ticket to reflect ScholComm's discussion of the license field.

@dolsysmith
Copy link
Contributor

@kerchner The deprecated (and problematic) license value has been replaced in #458 with a value that doesn't get parsed as a URI, so it doesn't seem to cause a problem for Bulkrax. I think the license field needs to be added to the ingest_bulkrax_prep task, and it should be set to the new value, All rights reserved.

@kerchner kerchner self-assigned this Aug 27, 2024
@dolsysmith dolsysmith merged commit 5dfd206 into master Sep 4, 2024
1 check passed
@dolsysmith dolsysmith deleted the t300-bulkrax8.0.0 branch September 4, 2024 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants