Update arXiv translator to use recommended Atom API instead of OAI #3366

Merged
merged 15 commits into from
Oct 9, 2024

thebluepotato
Contributor

@thebluepotato thebluepotato commented Oct 1, 2024

Based on various tests, the currently used oai2 endpoint is very slow (up to 20s for a single query). Conversely, the endpoint documented by arXiv is much faster. This is currently a WIP.
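For reference, a minimal sketch of what a lookup against the documented Atom endpoint might look like. The `id_list` and `max_results` query parameters are taken from arXiv's public API documentation; the helper name `buildAtomQueryURL` and the example ID are made up for illustration and are not the code in this PR.

```javascript
// Hypothetical helper (not from the translator): builds a query URL for the
// arXiv Atom API from a list of arXiv IDs.
function buildAtomQueryURL(ids) {
	// Old-style IDs contain a slash ("hep-th/9901001"), so encode each one.
	const idList = ids.map(id => encodeURIComponent(id)).join(",");
	return "https://export.arxiv.org/api/query"
		+ "?id_list=" + idList
		+ "&max_results=" + ids.length;
}

// Illustrative ID only; a versioned ID is passed through unchanged.
console.log(buildAtomQueryURL(["2410.00001v2"]));
```

The response is an Atom feed with one `entry` element per requested ID, which is what the `entry.querySelectorAll(...)` calls in the review excerpts operate on.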

This seems to be a similar idea to #3168.

zoe-translates and others added 5 commits October 23, 2023 18:29
- Detect the new (2018) search interface
- Move the code that obtains search results from the new search interface
  and from the old search/listing/catchup pages into their respective functions
- Asyncify doWeb
- For the legacy search function, prefer a selector-based approach to
  XPath
- Add test cases for new search
@AbeJellinek
Member

Thanks! I'd hold off on any further changes here for a sec because I do think we want to get #3168 merged. I'll try to do that this week. (The oai2 endpoint is really, really slow right now, but I don't remember that always being the case...)

@AbeJellinek
Member

That said, if you could rebase on #3168, we could just do everything here.

@adam3smith
Collaborator

I've found the oai2 endpoint the least reliable arXiv API option for quite some time, so it'd be nice to switch away from it. Last time I looked, data quality wasn't exactly the same, but that was quite some time back.

@thebluepotato
Contributor Author

> I've found the oai2 endpoint the least reliable arXiv API option for quite some time, so it'd be nice to switch away from it. Last time I looked, data quality wasn't exactly the same, but that was quite some time back.

In terms of data quality, it seems that for at least one of the test cases, the OAI endpoint contained a "published" DOI whereas the Atom endpoint did not.

Contributor Author

@thebluepotato thebluepotato left a comment

The main remaining points seem to be whether the version should be saved in the item's URL or within Extra, and whether this depends on the URL the user is currently visiting.

arXiv.org.js Outdated
const categories = Array.from(entry.querySelectorAll("category")).map((el) => el.getAttribute("term")).map(sub => arXivCategories[sub] ?? false).filter(Boolean);
if (categories && categories.length) newItem.tags.push(...categories);

const arxivURL = text(entry, "id").replace(/v\d+/, '');
Contributor Author

Ok, I wanted to align with the old code, which ignores the version entirely, but this might be due to the limitations of the oai2 endpoint rather than a design choice.
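To make the trade-off concrete, here is a small sketch of separating the version from the Atom `<id>` value rather than discarding it, so the caller can decide later whether to keep it. The function name `splitVersion` is hypothetical, and the example value assumes the `http://arxiv.org/abs/<id>vN` shape the Atom feed uses for `<id>`. Unlike the unanchored `/v\d+/` in the excerpt above, the pattern here only treats a trailing `vN` as the version.

```javascript
// Hypothetical helper: separates the version suffix from an Atom <id> value
// such as "http://arxiv.org/abs/2410.00001v2".
function splitVersion(idURL) {
	// Anchor at the end so only a trailing "vN" is treated as the version.
	const match = idURL.match(/^(.+?)(?:v(\d+))?$/);
	if (!match) return null; // empty or multi-line input
	return {
		url: match[1],
		version: match[2] ? Number(match[2]) : null
	};
}

console.log(splitVersion("http://arxiv.org/abs/2410.00001v2"));
```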

arXiv.org.js Outdated
newItem.archiveID = "arXiv:" + articleID;
newItem.complete();
}
}

function parseXML(text) {
Contributor Author

Should we remove this entirely if we commit to the "new" endpoint?

arXiv.org.js Outdated
const categories = Array.from(entry.querySelectorAll("category")).map((el) => el.getAttribute("term")).map(sub => arXivCategories[sub] ?? false).filter(Boolean);
if (categories && categories.length) newItem.tags.push(...categories);

const arxivURL = text(entry, "id").replace(/v\d+/, '');
Contributor Author

The new endpoint works well with versions. It's just that the old code (in doWeb) looks for the ID within the HTML instead of the URL and fetches the "unversioned" ID, which points to the latest version. I'm not sure what users expect when they're looking at a specific version: do they expect the latest version to be imported, or the one they're currently looking at? Most of the time, I guess you wouldn't land on the "versioned" page anyway.

id = ZU.xpathText(doc, '(//span[@class="arxivid"]/a)[1]')
|| ZU.xpathText(doc, '//b[starts-with(normalize-space(text()),"arXiv:")]');

if (!id) { // Honestly not sure where this might still be needed
Contributor Author

The Atom endpoint handles version numbers well, so I don't see cases where we wouldn't rely on the URL to get the ID. Might want to delete this.
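As a rough illustration of relying on the URL alone, the sketch below pulls the ID (and the version, if present) out of an `/abs/` or `/pdf/` page URL. The function name `arxivIDFromURL` and the set of URL shapes it accepts are assumptions for the example, not what the translator does.

```javascript
// Hypothetical helper: extracts the arXiv ID and optional version from a
// page URL such as "https://arxiv.org/abs/2410.00001v2".
function arxivIDFromURL(url) {
	// Use only the path so query strings and fragments are ignored.
	const path = new URL(url).pathname;
	const match = path.match(/^\/(?:abs|pdf)\/(.+?)(v\d+)?(?:\.pdf)?$/);
	if (!match) return null; // not an abstract or PDF page
	return { id: match[1], version: match[2] || null };
}

console.log(arxivIDFromURL("https://arxiv.org/abs/2410.00001v2"));
console.log(arxivIDFromURL("https://arxiv.org/pdf/hep-th/9901001.pdf"));
```

Returning the version separately leaves the open question from this thread (import the viewed version or the latest one) to the caller.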

.filter(Boolean);
newItem.tags.push(...categories);

let arxivURL = text(entry, "id").replace(/v\d+/, '');
Contributor Author

We need a clear determination here with respect to users' expectations. The Atom endpoint always includes the version in the URL, which is why the version is removed here. However, we need to clarify when and where the version should be saved in the item's URL.

@thebluepotato thebluepotato marked this pull request as ready for review October 1, 2024 20:45
@AbeJellinek
Member

This is looking great. I'm getting more and more timeouts from the old export endpoint, so I'd love to get it merged.

@adam3smith, what do you think?

@thebluepotato
Contributor Author

Note that https://github.com/zotero/utilities/blob/e00d98d3a11f6233651a052c108117cf44873edc/utilities.js#L435 should be updated after this PR is merged since the new endpoint explicitly does support versions.

- Full Text PDF -> Preprint PDF
- Extra: 'arXiv: [ID]' -> 'arXiv:[ID]' again
  Changed my mind on this - consistency is better for now, but we can consider
  changing it later.
@AbeJellinek
Member

OK, I think this is ready. @dstillman or @adam3smith, would appreciate a third opinion before we merge.

@AbeJellinek AbeJellinek merged commit 30664ce into zotero:master Oct 9, 2024
1 check failed
@AbeJellinek
Member

Thank you so, so much! This is a huge improvement.

@thebluepotato thebluepotato deleted the arxiv_fast branch October 14, 2024 23:15