Ability to pull OAI-PMH metadata (and view records) #664

mlhale7 · 2024-07-23T16:52:48Z

Story

ref. #665
I am unable to pull metadata using OAI-PMH with my regular tools (https://github.com/vphill/pyoaiharvester), and I do not see individual records when navigating the feed (https://digitalcollections.lib.utk.edu/catalog/oai) in the browser. There appears to be an identifier issue that prevents any records from being retrieved.

Acceptance Criteria

Ability to pull OAI records in oai_dc or mods using standard command line tools or Repox
Ability to see individual records in the browser by navigating https://digitalcollections.lib.utk.edu/catalog/oai

Screenshots / Video

When I use pyoaiharvester, I get a ZeroDivisionError. Here's a screenshot showing the command and error:

When I click on "oai_dc" in the browser to retrieve an individual record, I get the error "idDoesNotExist." Here's a screenshot:

Finally, looking in the browser at https://digitalcollections.lib.utk.edu/catalog/oai, I am seeing records for attachments that I would not expect (PRESERVE, MODS, etc.). We want only a single record to appear for each digital asset.
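For anyone reproducing the "idDoesNotExist" response outside the browser, here is a minimal Python sketch that pulls the error code out of an OAI-PMH response. OAI-PMH reports failures as `<error code="...">` elements in the `http://www.openarchives.org/OAI/2.0/` namespace. The sample response, including its message text and date, is made up for illustration and was not captured from the site.

```python
import xml.etree.ElementTree as ET

# Namespace used by every OAI-PMH 2.0 response envelope.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def oai_errors(response_xml):
    """Return a list of (code, message) pairs for each <error> in the response."""
    root = ET.fromstring(response_xml)
    return [
        (err.get("code"), (err.text or "").strip())
        for err in root.findall("oai:error", OAI_NS)
    ]


# Hypothetical response body; the message text and date are illustrative only.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2024-07-23T00:00:00Z</responseDate>
  <request verb="GetRecord">https://digitalcollections.lib.utk.edu/catalog/oai</request>
  <error code="idDoesNotExist">illustrative message text</error>
</OAI-PMH>"""

if __name__ == "__main__":
    for code, message in oai_errors(SAMPLE_RESPONSE):
        print(code, "-", message)
```

An empty list means the response carried no protocol-level error, which separates "the record is missing" from "the request never reached the OAI endpoint".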

Testing Instructions and Sample Files

Notes

A conjecture - potentially this issue was introduced when we changed the URL to "digitalcollections.lib.utk.edu"?

mlhale7 · 2024-07-23T17:07:21Z

According to Repox (which we use for DPLA harvesting), our URL does not exist (I've tried both http and https just in case).

kirkkwang · 2024-08-01T18:23:51Z

An update:

The harvester (https://github.com/vphill/pyoaiharvester) was not working because https://digitalcollections.lib.utk.edu/ still has HTTP Auth enabled (you need to enter the username and password before you can see the site). Modifying the harvester's code so that the credentials can be passed in via the URL, like http://username:password@digitalcollections.lib.utk.edu/, gets past that issue.

However, we then quickly run into a different problem: there seems to be an issue with the SSL certificates.

Details in this report.

Rob said he will address the cert issue. After that is fixed we can try the harvester tool again.

kirkkwang · 2024-08-05T15:51:04Z

To summarize the findings here:

Two things are keeping pyoaiharvester from working with the production site:

The digitalcollections tenant currently has basic auth turned on (the pop-up that prompts you for a username and password). The pyoaiharvester tool does not allow passing basic auth credentials through the URL. The assumption is that basic auth will be turned off for launch, so this should not be an issue.
We use CrowdSec as part of our security measures. CrowdSec seems to be blocking the User-Agent pyoaiharvester/3.0, possibly because it has been reported as potentially malicious or as exhibiting suspicious behavior. A workaround is to change the User-Agent in the script to something like utklibraryoai. We are investigating the specific reasons for this block and assessing any necessary adjustments to our security configuration.
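To make the two workarounds above concrete, here is a rough sketch using only the Python standard library. It is not pyoaiharvester's actual code. The function name `build_oai_request` and its parameters are assumptions for illustration. It builds a request with a custom User-Agent and, optionally, a basic auth header, without sending anything.

```python
import base64
import urllib.parse
import urllib.request


def build_oai_request(base_url, verb="ListRecords", metadata_prefix="oai_dc",
                      user_agent="utklibraryoai", username=None, password=None):
    """Build (but do not send) an OAI-PMH request.

    The User-Agent default mirrors the workaround suggested in this thread;
    the basic auth header is only added when both credentials are given.
    """
    query = {"verb": verb}
    if metadata_prefix:
        query["metadataPrefix"] = metadata_prefix
    url = base_url + "?" + urllib.parse.urlencode(query)

    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    if username is not None and password is not None:
        raw = f"{username}:{password}".encode("utf-8")
        token = base64.b64encode(raw).decode("ascii")
        request.add_header("Authorization", "Basic " + token)
    return request


if __name__ == "__main__":
    req = build_oai_request("https://digitalcollections.lib.utk.edu/catalog/oai")
    print(req.full_url)
    print(req.get_header("User-agent"))
    # To actually send it: urllib.request.urlopen(req).read()
```

Sending the credentials in an `Authorization` header instead of embedding them in the URL keeps them out of logs that record request URLs. Once basic auth is turned off for launch, the two credential arguments can simply be left unset.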

mlhale7 · 2024-08-22T15:42:52Z

Thanks to @kirkkwang, I was able to successfully pull records in oai_dc and mods format. I'll comment back once I inspect these files more.

mlhale7 · 2024-08-22T15:49:07Z

This may need to be a separate ticket, but I'm finding some odd records in the OAI. For instance:

    <record>
        <header>
            <identifier>oai:hyku:9a77b15d-554d-4dfc-a49f-09fbcee8118c</identifier>
            <datestamp>2024-07-22T23:24:01Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>OCR for alumnus:1507298774</dc:description>
                <dc:title>OCR</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>

AND

    <record>
        <header>
            <identifier>oai:hyku:2f4595f1-7611-490a-b4b6-94336055d037</identifier>
            <datestamp>2024-07-22T23:23:58Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>TRANSCRIPT for jsevier:9</dc:description>
                <dc:title>TRANSCRIPT</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>
    <record>
        <header>
            <identifier>oai:hyku:f47c15e5-5055-44b7-9b55-526b3e3bfc68</identifier>
            <datestamp>2024-07-22T23:23:58Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>TEI for jsevier:9</dc:description>
                <dc:title>TEI</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>
    <record>
        <header>
            <identifier>oai:hyku:1211e8b1-040f-439d-b5d0-f1ad54f42e8d</identifier>
            <datestamp>2024-07-22T23:23:59Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>OBJ for jsevier:25</dc:description>
                <dc:title>OBJ</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>
    <record>
        <header>
            <identifier>oai:hyku:72f7889f-da05-4363-a42a-c74354351672</identifier>
            <datestamp>2024-07-22T23:24:00Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>OBJ for jsevier:2</dc:description>
                <dc:title>OBJ</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>

I'm including @laritakr here for informational purposes. None of these types of resources (OBJ, TRANSCRIPT, OCR, etc.) should be present in OAI. We just want the main record. Getting rid of these extra records would also make pulling OAI a lot faster. If they are not removed, I would need to find a way to exclude all of these extra attachments for DPLA ingests etc., since this information is not needed.
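Until the feed itself is changed, a harvest-side filter is one possible stopgap. The sketch below drops oai_dc records whose only `dc:title` values are attachment labels. The label set is an assumption drawn from the names mentioned in this thread. The heuristic would also drop a real work titled "OCR", so treat it as illustrative only.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Assumed label set, taken from the attachment names mentioned in this thread.
ATTACHMENT_TITLES = {"OBJ", "OCR", "HOCR", "TRANSCRIPT", "TEI", "PRESERVE", "MODS"}


def is_attachment(record):
    """True when every dc:title of the record is a known attachment label."""
    titles = [(t.text or "").strip().upper() for t in record.iterfind(".//dc:title", NS)]
    return bool(titles) and all(title in ATTACHMENT_TITLES for title in titles)


def kept_identifiers(list_records_xml):
    """Return the identifiers of records that do not look like attachments."""
    root = ET.fromstring(list_records_xml)
    return [
        record.findtext("oai:header/oai:identifier", namespaces=NS)
        for record in root.iterfind(".//oai:record", NS)
        if not is_attachment(record)
    ]


# Trimmed-down sample built from two records quoted in this thread.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <ListRecords>
    <record>
      <header><identifier>oai:hyku:9a77b15d-554d-4dfc-a49f-09fbcee8118c</identifier></header>
      <metadata><oai_dc:dc><dc:title>OCR</dc:title></oai_dc:dc></metadata>
    </record>
    <record>
      <header><identifier>oai:hyku:24ac22aa-106b-4a88-a346-9e264d13d972</identifier></header>
      <metadata><oai_dc:dc><dc:title>504 error (shana)</dc:title></oai_dc:dc></metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(kept_identifiers(SAMPLE))
```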

mlhale7 · 2024-08-22T16:36:37Z

Here's another odd record:

    <record>
        <header>
            <identifier>oai:hyku:24ac22aa-106b-4a88-a346-9e264d13d972</identifier>
            <datestamp>2023-09-01T03:59:26Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:publisher>utk</dc:publisher>
                <dc:rights>http://rightsstatements.org/vocab/InC/1.0/</dc:rights>
                <dc:title>504 error (shana)</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>

mlhale7 · 2024-08-22T16:46:05Z

HOCR also is not something we want a record for:

    <record>
        <header>
            <identifier>oai:hyku:ccb61e6c-0c22-471c-9a22-e4dfe2953c62</identifier>
            <datestamp>2024-07-24T05:13:52Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <mods version="3.5"
                xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-5.xsd"
                xmlns="http://www.loc.gov/mods/v3"
                xmlns:xlink="http://www.w3.org/1999/xlink"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <titleInfo>
                    <title>HOCR</title>
                </titleInfo>
                <identifier type="uuid">ccb61e6c-0c22-471c-9a22-e4dfe2953c62</identifier>
                <originInfo/>
                <physicalDescription/>
                <subject/>
                <subject>
                    <cartographics/>
                </subject>
                <location>
                    <url usage="primary" access="object in context">https://digitalcollections.lib.utk.edu/concern/attachments/ccb61e6c-0c22-471c-9a22-e4dfe2953c62</url>
                    <url access="preview" xlink:href="https://digitalcollections.lib.utk.edu/assets/work-ff055336041c3f7d310ad69109eda4a887b16ec501f35afc0a547c4adb97ee72.png"/>
                </location>
                <recordInfo>
                    <recordIdentifier>ccb61e6c-0c22-471c-9a22-e4dfe2953c62</recordIdentifier>
                    <recordOrigin>https://digitalcollections.lib.utk.edu/catalog/oai</recordOrigin>
                    <recordCreationDate>2024-05-17T19:45:38Z</recordCreationDate>
                    <recordChangeDate>2024-05-17T21:53:27Z</recordChangeDate>
                </recordInfo>
            </mods>
        </metadata>
    </record>
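In the MODS output, the primary URL of the record above points at `/concern/attachments/`, which may be a more reliable signal than the title. The sketch below checks for that path segment. It assumes, based only on this one record, that all attachment records share that URL pattern. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Path segment seen in the HOCR record quoted above; assumed to mark attachments.
ATTACHMENT_PATH = "/concern/attachments/"


def mods_is_attachment(mods_element):
    """True when any mods:location/mods:url text contains the attachment path."""
    for url in mods_element.iterfind("mods:location/mods:url", MODS_NS):
        if ATTACHMENT_PATH in (url.text or ""):
            return True
    return False


# Trimmed-down copy of the HOCR record's MODS body.
HOCR_MODS = """<mods xmlns="http://www.loc.gov/mods/v3" version="3.5">
  <titleInfo><title>HOCR</title></titleInfo>
  <location>
    <url usage="primary" access="object in context">https://digitalcollections.lib.utk.edu/concern/attachments/ccb61e6c-0c22-471c-9a22-e4dfe2953c62</url>
  </location>
</mods>"""

if __name__ == "__main__":
    print(mods_is_attachment(ET.fromstring(HOCR_MODS)))
```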

laritakr · 2024-08-22T16:58:58Z

Restricting the types of works that show in your OAI feed should be a new ticket, as it is separate from the requirements of this ticket.

This is due to the way the child works are created to allow additional metadata for file sets in your repo. We will need to identify which specific information we need to exclude and override standard OAI behavior.

mlhale7 · 2024-09-03T18:06:44Z

@kirkkwang - I was able to pull both MODS and DC. Given that all of the sets have to be pulled each time, right now the time needed to get OAI is a bit restrictive, but this will be addressed when the ability to pull separate collections is added (#680). I approve the work completed in this ticket.

kirkkwang · 2024-09-03T20:18:18Z

Great! thank you @mlhale7

mlhale7 mentioned this issue Jul 23, 2024

Ability to pull OAI-PMH metadata #665

Closed

2 tasks

kirkkwang self-assigned this Aug 1, 2024

orangewolf added the Pre-Launch label Aug 22, 2024
