WIP Bad file urls in OAI results for theses #1201

Closed
pbinkley opened this issue Jul 12, 2019 · 4 comments


@pbinkley

Describe the bug
The file URLs in ETDMS and ORE responses are triple-URL-encoded when the file name contains a space, and they lead to a 404. Items whose file names do not contain a space are fine.

To Reproduce
Steps to reproduce the behavior:

  1. Fetch this ETDMS record from the OAI server
  2. Find the etd_ms:identifier field that points to a pdf: https://era.library.ualberta.ca/items/03d12bd1-3559-4d03-927c-dc7c7e7b8106/view/a40f5a31-ee78-42dc-8bc9-8e8648acf596/Hashemi_Seyed_Fall%2525202013.pdf
  3. Note the triple-encoded space in the file name: %252520
  4. Try retrieving that link - you get a 404. The same happens with the double- and single-encoded versions.
  5. Compare the View link in the application view of this item: https://era.library.ualberta.ca/items/03d12bd1-3559-4d03-927c-dc7c7e7b8106/view/a40f5a31-ee78-42dc-8bc9-8e8648acf596/Hashemi_Seyed_Fall-202013.pdf. Here the space has been replaced with a hyphen rather than being encoded. The link works.
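
To make the encoding in steps 2 and 3 concrete, here is a minimal Python sketch. It is not taken from the ERA codebase: the helper names `encode_n` and `peel_encoding` are hypothetical, and the original file name is only inferred by decoding the link above. It also assumes the real file name contains no literal `%`, since otherwise the number of layers would be ambiguous.

```python
from urllib.parse import quote, unquote


def encode_n(name: str, times: int) -> str:
    """Percent-encode `name` repeatedly, as a chain of layers that each encode would."""
    for _ in range(times):
        name = quote(name)
    return name


def peel_encoding(name: str, max_rounds: int = 10) -> tuple:
    """Decode until the string stops changing; return (layers removed, final string)."""
    layers = 0
    for _ in range(max_rounds):
        decoded = unquote(name)
        if decoded == name:
            break
        name = decoded
        layers += 1
    return layers, name


# File name inferred from the OAI link (assumption: no literal '%' in the name).
original = "Hashemi_Seyed_Fall 2013.pdf"

for n in range(1, 4):
    print(n, encode_n(original, n))

print(peel_encoding("Hashemi_Seyed_Fall%2525202013.pdf"))
```

Three rounds of encoding turn the space into `%20`, then `%2520`, then `%252520`, which matches the sequence noted in step 3.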

Expected behavior
The link in the OAI record should lead to a successful download of the pdf. The link in the OAI record should also meet LAC's harvesting requirements, which are still being clarified (hence the WIP in the title of this issue).

Additional context
This is related to LAC's requirement that file names in download urls should not have "illegal characters", including spaces, when presented in OAI records. We need to investigate further to see whether our file names contain special characters other than spaces, and whether they are handled well in the View urls. @sfarnel is investigating based on @leahvanderjagt's email forwarding LAC's requirements.

This may be related to a problem on which we worked with them in 2017, when we found that their harvester url-decodes file urls before requesting them, causing a single-encoded space to become a simple space and break the http request. We don't know the current status of this bug. I've emailed the history of our 2017 investigation to them and will update this issue when we have more complete information.
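
As an illustration of the 2017 problem described above, the sketch below mimics a client that URL-decodes a link once before requesting it. This is only a model of the reported behavior, not LAC's actual harvester code; the host `repository.example`, the file name, and the function names are all hypothetical. It relies on the fact that an HTTP/1.1 request line is space-delimited (method, target, version), so a raw space in the path yields a malformed line.

```python
from urllib.parse import unquote, urlsplit


def decode_before_request(url: str) -> str:
    """Model a harvester that URL-decodes the link once before using it."""
    return unquote(url)


def request_line(url: str) -> str:
    """Build the HTTP/1.1 request line a naive client would send for `url`."""
    path = urlsplit(url).path
    return f"GET {path} HTTP/1.1"


# Hypothetical single-encoded link, not a real ERA URL.
single_encoded = "https://repository.example/files/Thesis%20Fall%202013.pdf"

ok_line = request_line(single_encoded)
broken_line = request_line(decode_before_request(single_encoded))

print(ok_line)
print(broken_line)

# A well-formed request line has exactly three space-separated parts.
print(len(ok_line.split(" ")), len(broken_line.split(" ")))
```

The second request line contains extra spaces inside the target, so a server cannot parse it as method, target, and version.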

@sfarnel

sfarnel commented Jul 12, 2019

See related issue at ualbertalib/metadata#457

@pbinkley

pbinkley commented Jan 6, 2020

The OAI instructions sent by LAC on 2019-12-20 still include the no-space requirement for links to files:

Direct links to full text files
The LAC harvesting workflow uses the ETD-MS identifier field to locate and download the associated files. We can download multiple files per thesis. The identifier URLs must point directly to the files (and not only to a repository landing page). There should be no spaces in the URLs.
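
One way to check harvested records against the no-space rule quoted above might look like the following sketch. Everything in it is an assumption made for illustration: the sample XML, the placeholder namespace `urn:example:etdms`, the host `repository.example`, and the function names. It matches elements by local name `identifier` so that it does not depend on the real ETD-MS namespace URI, and it treats a space hidden behind any number of percent-encoding layers as a violation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote

# Hypothetical record; the namespace URI is a placeholder, not the real ETD-MS one.
SAMPLE = """<record xmlns:etd_ms="urn:example:etdms">
  <etd_ms:identifier>https://repository.example/view/Thesis-2013.pdf</etd_ms:identifier>
  <etd_ms:identifier>https://repository.example/view/Thesis%2525202013.pdf</etd_ms:identifier>
</record>"""


def contains_space(url: str, max_rounds: int = 5) -> bool:
    """Return True if `url` has a space, raw or under repeated percent-encoding."""
    for _ in range(max_rounds):
        if " " in url:
            return True
        decoded = unquote(url)
        if decoded == url:
            return False
        url = decoded
    return " " in url


def identifiers_with_spaces(xml_text: str) -> list:
    """Collect identifier URLs whose decoded form contains a space."""
    flagged = []
    for elem in ET.fromstring(xml_text).iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # strip "{namespace}" prefix
        if local_name == "identifier" and elem.text:
            url = elem.text.strip()
            if contains_space(url):
                flagged.append(url)
    return flagged


print(identifiers_with_spaces(SAMPLE))
```

A check like this could be run over a full OAI harvest to list the affected file links.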

@anayram

anayram commented Feb 27, 2020

Here is a list of triple-encoded file urls, based on an OAI harvest done on 2020-02-27:

file-links-triple-encoding.txt

@mbarnett

This particular set of encoding issues is now fixed with the launch of OAISys:

<etd_ms:identifier> https://era.library.ualberta.ca/items/03d12bd1-3559-4d03-927c-dc7c7e7b8106/view/a40f5a31-ee78-42dc-8bc9-8e8648acf596/Hashemi_Seyed_Fall-202013.pdf </etd_ms:identifier>
