Ability to pull OAI-PMH metadata (and view records) #664

Open
2 tasks done
mlhale7 opened this issue Jul 23, 2024 · 10 comments
@mlhale7
Collaborator

mlhale7 commented Jul 23, 2024

Story

ref. #665
I am unable to pull metadata using OAI-PMH with my regular tools (https://github.com/vphill/pyoaiharvester) and I do not see individual records when navigating the feed (https://digitalcollections.lib.utk.edu/catalog/oai) in the browser. There appears to be an identifier issue that is causing no records to be retrievable.

Acceptance Criteria

Screenshots / Video

When I use pyoaiharvester, I get a ZeroDivisionError. Here's a screenshot showing the command and error:

[Screenshot: pyoaiharvester command and the resulting ZeroDivisionError]

When I click on "oai_dc" in the browser to retrieve an individual record, I get the error "idDoesNotExist." Here's a screenshot:

[Screenshot: "idDoesNotExist" error shown in the browser]
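
For anyone reproducing this outside the browser: `idDoesNotExist` is one of the standard OAI-PMH error codes, returned as an `<error code="...">` element in the response. The sketch below pulls those codes out of a response body. The sample response is illustrative only (it is not the actual output from this server), and `oai_errors` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def oai_errors(response_xml):
    """Return a list of (code, message) pairs for each <error> in a response."""
    root = ET.fromstring(response_xml)
    return [
        (error.get("code"), (error.text or "").strip())
        for error in root.findall("oai:error", OAI_NS)
    ]


# Illustrative response body; the message text is made up for this example.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2024-07-23T15:45:25Z</responseDate>
  <request verb="GetRecord">https://digitalcollections.lib.utk.edu/catalog/oai</request>
  <error code="idDoesNotExist">No record with that identifier (illustrative message)</error>
</OAI-PMH>"""

errors = oai_errors(SAMPLE_RESPONSE)
```

Checking the error code this way makes it easier to tell an identifier problem apart from an authentication or network failure.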

Finally, looking in the browser at https://digitalcollections.lib.utk.edu/catalog/oai, I am seeing records for attachments that I would not expect (PRESERVE, MODS, etc.). We want only a single record to appear for each digital asset.

[Screenshot: OAI feed in the browser listing attachment records]

Testing Instructions and Sample Files

Notes

A conjecture - potentially this issue was introduced when we changed the URL to "digitalcollections.lib.utk.edu"?

@mlhale7
Collaborator Author

mlhale7 commented Jul 23, 2024

According to Repox (which we use for DPLA harvesting), our URL does not exist (I've tried both http and https just in case).

[Screenshot: Repox reporting that the URL does not exist]

@kirkkwang
Contributor

An update:

Using the harvester (https://github.com/vphill/pyoaiharvester) as-is was not working because https://digitalcollections.lib.utk.edu/ still has HTTP auth enabled (you need to enter the username and password before you can see the site). Modifying the harvester's code so that the credentials can be passed in via the URL, like http://username:[email protected]/, gets past that issue.

However, we are quickly met with a different issue. There seems to be an issue with the SSL certificates.

[Screenshot: SSL certificate error]

Details in this report.

Rob said he will address the cert issue. After that is fixed we can try the harvester tool again.

@kirkkwang kirkkwang self-assigned this Aug 1, 2024
@kirkkwang
Contributor

To summarize the findings here:

There are two things at play here that make pyoaiharvester not work with the production site.

  1. The digitalcollections tenant currently has basic auth turned on (the pop-up where you are prompted for the username/password). The pyoaiharvester tool does not allow passing in basic auth through the URL. The assumption is that this will be turned off for launch, so this should not be an issue.
  2. We use CrowdSec as part of our security measures. CrowdSec seems to be blocking the User-Agent pyoaiharvester/3.0. This may be because it has been reported as potentially malicious or exhibiting suspicious behavior. A workaround is to change the User-Agent in the script to something like utklibraryoai. We are investigating the specific reasons for this block and assessing any necessary adjustments to our security configuration.
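
As a rough illustration of both points, here is a minimal sketch (not pyoaiharvester's actual code) of building an OAI-PMH request with a replacement User-Agent and, optionally, a basic auth header. `build_oai_request` is a hypothetical helper, and the credentials in the usage note are placeholders.

```python
import base64
import urllib.parse
import urllib.request


def build_oai_request(base_url, verb="ListRecords", metadata_prefix="oai_dc",
                      user_agent="utklibraryoai", username=None, password=None):
    """Build (but do not send) an OAI-PMH request with a custom User-Agent."""
    query = urllib.parse.urlencode({"verb": verb, "metadataPrefix": metadata_prefix})
    request = urllib.request.Request(f"{base_url}?{query}")
    # Replace the default agent string so it is not matched by the block.
    request.add_header("User-Agent", user_agent)
    # Only needed while basic auth is still enabled on the tenant.
    if username is not None and password is not None:
        token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
        request.add_header("Authorization", f"Basic {token}")
    return request


request = build_oai_request("https://digitalcollections.lib.utk.edu/catalog/oai")
# To actually fetch the feed (requires network access):
# with urllib.request.urlopen(request) as response:
#     xml_bytes = response.read()
```

While basic auth is still on, the same helper could be called with `username="..."` and `password="..."`; sending an `Authorization` header avoids having to embed credentials in the URL.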

@mlhale7
Collaborator Author

mlhale7 commented Aug 22, 2024

Thanks to @kirkkwang, I was able to successfully pull records in oai_dc and mods format. I'll comment back once I inspect these files more.

@mlhale7
Collaborator Author

mlhale7 commented Aug 22, 2024

This may need to be a separate ticket, but I'm finding some odd records in the OAI. For instance:

    <record>
        <header>
            <identifier>oai:hyku:9a77b15d-554d-4dfc-a49f-09fbcee8118c</identifier>
            <datestamp>2024-07-22T23:24:01Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>OCR for alumnus:1507298774</dc:description>
                <dc:title>OCR</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>

AND

    <record>
        <header>
            <identifier>oai:hyku:2f4595f1-7611-490a-b4b6-94336055d037</identifier>
            <datestamp>2024-07-22T23:23:58Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>TRANSCRIPT for jsevier:9</dc:description>
                <dc:title>TRANSCRIPT</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>
    <record>
        <header>
            <identifier>oai:hyku:f47c15e5-5055-44b7-9b55-526b3e3bfc68</identifier>
            <datestamp>2024-07-22T23:23:58Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>TEI for jsevier:9</dc:description>
                <dc:title>TEI</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>
    <record>
        <header>
            <identifier>oai:hyku:1211e8b1-040f-439d-b5d0-f1ad54f42e8d</identifier>
            <datestamp>2024-07-22T23:23:59Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>OBJ for jsevier:25</dc:description>
                <dc:title>OBJ</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>
    <record>
        <header>
            <identifier>oai:hyku:72f7889f-da05-4363-a42a-c74354351672</identifier>
            <datestamp>2024-07-22T23:24:00Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>OBJ for jsevier:2</dc:description>
                <dc:title>OBJ</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>

I'm including @laritakr here for informational purposes. None of these types of resources (OBJ, Transcript, OCR, etc.) should be present in OAI. We just want the main record. Getting rid of all of these extra records would also make pulling OAI a lot faster. If they are not removed, I would need to find a way to exclude all of these extra attachments for DPLA ingests and similar harvests, since this information is not needed.
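
If the extra records do stay in the feed, one possible harvester-side stopgap is to drop them after download. The sketch below assumes that attachment records can be recognized by a `dc:title` matching one of the attachment type names seen in this thread; that is a heuristic, not a confirmed rule, and a real work that happened to be titled "OCR" would also be dropped. The function names and the sample wrapper element are hypothetical.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
# Assumption: these titles (taken from the examples in this thread) mark attachments.
ATTACHMENT_TITLES = {"OCR", "HOCR", "OBJ", "TRANSCRIPT", "TEI", "PRESERVE", "MODS"}


def is_attachment(record):
    """True if every dc:title in the record is a known attachment type name."""
    titles = [el.text.strip() for el in record.iter(f"{{{DC_NS}}}title") if el.text]
    return bool(titles) and all(title.upper() in ATTACHMENT_TITLES for title in titles)


def drop_attachments(records):
    """Return only the records that do not look like attachments."""
    return [record for record in records if not is_attachment(record)]


# Trimmed-down sample based on the records quoted above.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record><metadata><dc:title>OCR</dc:title></metadata></record>
  <record><metadata><dc:title>504 error (shana)</dc:title></metadata></record>
</records>"""

records = list(ET.fromstring(SAMPLE))
kept = drop_attachments(records)
```

Filtering on the server side would still be preferable, since this approach only hides the records after they have already been harvested.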

@mlhale7
Collaborator Author

mlhale7 commented Aug 22, 2024

Here's another odd record:

    <record>
        <header>
            <identifier>oai:hyku:24ac22aa-106b-4a88-a346-9e264d13d972</identifier>
            <datestamp>2023-09-01T03:59:26Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:publisher>utk</dc:publisher>
                <dc:rights>http://rightsstatements.org/vocab/InC/1.0/</dc:rights>
                <dc:title>504 error (shana)</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>

@mlhale7
Collaborator Author

mlhale7 commented Aug 22, 2024

HOCR also is not something we want a record for:

    <record>
        <header>
            <identifier>oai:hyku:ccb61e6c-0c22-471c-9a22-e4dfe2953c62</identifier>
            <datestamp>2024-07-24T05:13:52Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <mods version="3.5"
                xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-5.xsd"
                xmlns="http://www.loc.gov/mods/v3"
                xmlns:xlink="http://www.w3.org/1999/xlink"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <titleInfo>
                    <title>HOCR</title>
                </titleInfo>
                <identifier type="uuid">ccb61e6c-0c22-471c-9a22-e4dfe2953c62</identifier>
                <originInfo/>
                <physicalDescription/>
                <subject/>
                <subject>
                    <cartographics/>
                </subject>
                <location>
                    <url usage="primary" access="object in context">https://digitalcollections.lib.utk.edu/concern/attachments/ccb61e6c-0c22-471c-9a22-e4dfe2953c62</url>
                    <url access="preview" xlink:href="https://digitalcollections.lib.utk.edu/assets/work-ff055336041c3f7d310ad69109eda4a887b16ec501f35afc0a547c4adb97ee72.png"/>
                </location>
                <recordInfo>
                    <recordIdentifier>ccb61e6c-0c22-471c-9a22-e4dfe2953c62</recordIdentifier>
                    <recordOrigin>https://digitalcollections.lib.utk.edu/catalog/oai</recordOrigin>
                    <recordCreationDate>2024-05-17T19:45:38Z</recordCreationDate>
                    <recordChangeDate>2024-05-17T21:53:27Z</recordChangeDate>
                </recordInfo>
            </mods>
        </metadata>
    </record>
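
In the MODS output, the primary URL of this record points at `/concern/attachments/`, which suggests a way to spot attachments without relying on the title. The sketch below assumes that every attachment record carries such a primary URL; only this one example confirms it. `is_attachment_mods` is a hypothetical helper, and the second sample URL is a made-up placeholder for a regular work.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}
# Assumption: attachment records link to this path in their primary URL.
ATTACHMENT_PATH = "/concern/attachments/"


def is_attachment_mods(mods_element):
    """True if the record's primary location URL points at an attachment."""
    for url in mods_element.findall("mods:location/mods:url", MODS_NS):
        if url.get("usage") == "primary" and url.text and ATTACHMENT_PATH in url.text:
            return True
    return False


ATTACHMENT_SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>HOCR</title></titleInfo>
  <location>
    <url usage="primary" access="object in context">https://digitalcollections.lib.utk.edu/concern/attachments/ccb61e6c-0c22-471c-9a22-e4dfe2953c62</url>
  </location>
</mods>"""

# Hypothetical non-attachment record; the URL is a placeholder, not a real work.
WORK_SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example work</title></titleInfo>
  <location>
    <url usage="primary" access="object in context">https://example.org/concern/works/placeholder-id</url>
  </location>
</mods>"""

attachment_flag = is_attachment_mods(ET.fromstring(ATTACHMENT_SAMPLE))
work_flag = is_attachment_mods(ET.fromstring(WORK_SAMPLE))
```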

@laritakr
Contributor

laritakr commented Aug 22, 2024

Restricting the types of works that show in your OAI feed should be a new ticket, as it is separate from the requirements of this ticket.

This is due to the way the child works are created to allow additional metadata for file sets in your repo. We will need to identify which specific information we need to exclude and override standard OAI behavior.

@mlhale7
Collaborator Author

mlhale7 commented Sep 3, 2024

@kirkkwang - I was able to pull both MODS and DC. Given that all of the sets have to be pulled each time, right now the time needed to get OAI is a bit restrictive, but this will be addressed when the ability to pull separate collections is added (#680). I approve the work completed in this ticket.

@kirkkwang
Contributor

Great! Thank you, @mlhale7.

Status: Deploy to Production