-
Notifications
You must be signed in to change notification settings - Fork 769
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update arXiv translator to use recommended Atom API instead of OAI #3366
Conversation
- Detect the new (2018) search interface - Move the codes that obtains search results from new search interface and from old search/listing/catchup into their respective functions - Asyncify doWeb - For the legacy search function, prefer selector-based approach to XPath - Add test cases for new search
Thanks! I'd hold off on any further changes here for a sec because I do think we want to get #3168 merged. I'll try to do that this week. (The |
That said, if you could rebase on #3168, we could just do everything here. |
I've found the oai2 endpoint the least reliable arXiv API option for quite some time, so it'd be nice to switch away from it. Last time I looked, data quality wasn't exactly the same, but that was quite some time back. |
In terms of data quality, it seems that for at least one of the test cases, the OAI endpoint contained a "published" DOI whereas the Atom endpoint did not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main points remaining here it seems is determining whether version should be saved in the item's URL, within extra and whether this depends on the URL that the user is currently visitng.
arXiv.org.js
Outdated
const categories = Array.from(entry.querySelectorAll("category")).map((el) => el.getAttribute("term")).map(sub => arXivCategories[sub] ?? false).filter(Boolean); | ||
if (categories && categories.length) newItem.tags.push(...categories); | ||
|
||
const arxivURL = text(entry, "id").replace(/v\d+/, ''); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I wanted to align with the old code, which ignores the version entirely, but this might be due to the limitations of the oai2
endpoint rather than a design choice.
arXiv.org.js
Outdated
newItem.archiveID = "arXiv:" + articleID; | ||
newItem.complete(); | ||
} | ||
} | ||
|
||
function parseXML(text) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove this entirely if we commit to the "new" endpoint?
arXiv.org.js
Outdated
const categories = Array.from(entry.querySelectorAll("category")).map((el) => el.getAttribute("term")).map(sub => arXivCategories[sub] ?? false).filter(Boolean); | ||
if (categories && categories.length) newItem.tags.push(...categories); | ||
|
||
const arxivURL = text(entry, "id").replace(/v\d+/, ''); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new endpoint works well with versions, it's just that the old code (in doWeb
) looks for the id within the HTML instead of the URL and fetches the "unversioned" ID, which points to the latest version. I'm not sure what the user expects when they're looking at a specific version, of the expect the latest version ot b imported or the one they're currently looking at. Most of the time, I guess you wouldn't land on the "versioned" page anyway.
id = ZU.xpathText(doc, '(//span[@class="arxivid"]/a)[1]') | ||
|| ZU.xpathText(doc, '//b[starts-with(normalize-space(text()),"arXiv:")]'); | ||
|
||
if (!id) { // Honestly not sure where this might still be needed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Atom endpoint handles version numbers well, so I don't see cases where we wouldn't rely on the url to get the ID. Might want to delete this.
.filter(Boolean); | ||
newItem.tags.push(...categories); | ||
|
||
let arxivURL = text(entry, "id").replace(/v\d+/, ''); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a clear determination here wrt to user's expectations. The Atom endpoint always indicates the version in the url, so that's why the version is removed here. However, we need to clarify when and where the version should be saved in the item's url.
This is looking great. I'm getting more and more timeouts from the old export endpoint, so I'd love to get it merged. @adam3smith, what do you think? |
Note that https://github.com/zotero/utilities/blob/e00d98d3a11f6233651a052c108117cf44873edc/utilities.js#L435 should be updated after this PR is merged since the new endpoint explicitly does support versions. |
- Full Text PDF -> Preprint PDF - Extra: 'arXiv: [ID]' -> 'arXiv:[ID]' again Changed my mind on this - consistency is better for now, but we can consider changing it later.
OK, I think this is ready. @dstillman or @adam3smith, would appreciate a third opinion before we merge. |
Thank you so, so much! This is a huge improvement. |
Based on various tests, the currently used
oai2
endpoint is very slow (up to 20s for a single query). Conversely, the endpoint documented by arXiv is much faster. This is currently a WIP.Seems to be the similar idea as #3168