Update arXiv translator to use recommended Atom API instead of OAI #3366

thebluepotato · 2024-10-01T14:28:39Z

Based on various tests, the currently used oai2 endpoint is very slow (up to 20s for a single query), whereas the endpoint documented by arXiv is much faster. This is currently a WIP.
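
For illustration, a minimal sketch of how a lookup URL for the Atom endpoint might be built. The base URL `https://export.arxiv.org/api/query` and the `id_list` / `max_results` parameters reflect my reading of arXiv's public API documentation and should be treated as assumptions here; `buildAtomQueryURL` is a hypothetical helper, not a function from the translator.

```javascript
// Hypothetical helper: builds a query URL for arXiv's Atom API.
// Assumption: the API is served at export.arxiv.org/api/query and accepts
// a comma-separated `id_list` parameter plus `max_results`.
const ATOM_API_BASE = "https://export.arxiv.org/api/query";

function buildAtomQueryURL(ids) {
	const list = (Array.isArray(ids) ? ids : [ids])
		// Accept "arXiv:2201.00738" as well as the bare ID
		.map(id => id.trim().replace(/^arxiv:\s*/i, ""))
		.filter(Boolean);
	if (!list.length) throw new Error("No arXiv ID given");
	const params = new URLSearchParams({
		id_list: list.join(","),
		max_results: String(list.length)
	});
	return `${ATOM_API_BASE}?${params}`;
}
```

Calling `buildAtomQueryURL("arXiv:2201.00738v2")` would yield a URL whose `id_list` is `2201.00738v2`; the response would then be parsed as Atom XML.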

This seems to be a similar idea to #3168:

- Detect the new (2018) search interface
- Move the code that obtains search results from the new search interface and from the old search/listing/catchup pages into their respective functions
- Asyncify doWeb
- For the legacy search function, prefer a selector-based approach to XPath
- Add test cases for the new search

AbeJellinek · 2024-10-01T15:07:16Z

Thanks! I'd hold off on any further changes here for a sec because I do think we want to get #3168 merged. I'll try to do that this week. (The oai2 endpoint is really, really slow right now, but I don't remember that always being the case...)

AbeJellinek · 2024-10-01T15:08:46Z

That said, if you could rebase on #3168, we could just do everything here.

adam3smith · 2024-10-01T15:13:57Z

I've found the oai2 endpoint the least reliable arXiv API option for quite some time, so it'd be nice to switch away from it. Last time I looked, data quality wasn't exactly the same, but that was quite some time back.

thebluepotato · 2024-10-01T17:27:56Z

> I've found the oai2 endpoint the least reliable arXiv API option for quite some time, so it'd be nice to switch away from it. Last time I looked, data quality wasn't exactly the same, but that was quite some time back.

In terms of data quality, it seems that for at least one of the test cases, the OAI endpoint contained a "published" DOI whereas the Atom endpoint did not.

thebluepotato

The main remaining question seems to be whether the version should be saved in the item's URL or in Extra, and whether this depends on the URL that the user is currently visiting.

thebluepotato · 2024-10-01T15:38:18Z

arXiv.org.js

+	const categories = Array.from(entry.querySelectorAll("category")).map((el) => el.getAttribute("term")).map(sub => arXivCategories[sub] ?? false).filter(Boolean);
+	if (categories && categories.length) newItem.tags.push(...categories);
+
+	const arxivURL = text(entry, "id").replace(/v\d+/, '');


Ok, I wanted to align with the old code, which ignores the version entirely, but this might be due to the limitations of the oai2 endpoint rather than a design choice.

thebluepotato · 2024-10-01T15:59:46Z

arXiv.org.js

+		newItem.archiveID = "arXiv:" + articleID;
+		newItem.complete();
+	}
+}

 function parseXML(text) {


Should we remove this entirely if we commit to the "new" endpoint?

thebluepotato · 2024-10-01T18:12:52Z

arXiv.org.js

+	const categories = Array.from(entry.querySelectorAll("category")).map((el) => el.getAttribute("term")).map(sub => arXivCategories[sub] ?? false).filter(Boolean);
+	if (categories && categories.length) newItem.tags.push(...categories);
+
+	const arxivURL = text(entry, "id").replace(/v\d+/, '');


The new endpoint works well with versions; it's just that the old code (in doWeb) looks for the ID within the HTML instead of the URL and fetches the "unversioned" ID, which points to the latest version. I'm not sure what the user expects when they're looking at a specific version: do they expect the latest version to be imported, or the one they're currently looking at? Most of the time, I guess you wouldn't land on the "versioned" page anyway.
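
To make the two options concrete, here is a rough sketch of taking the ID from the page URL and then choosing which one to request. Both functions are hypothetical and not taken from the translator; the sketch assumes abstract pages look like `https://arxiv.org/abs/<id>` with an optional `v<N>` suffix.

```javascript
// Hypothetical helper, not part of the translator.
// Assumption: abstract pages have the form https://arxiv.org/abs/<id>[v<N>].
function parseAbsURL(url) {
	const match = new URL(url).pathname.match(/^\/abs\/(.+?)(?:v(\d+))?\/?$/);
	if (!match) return null;
	return {
		id: match[1],
		version: match[2] ? Number(match[2]) : null
	};
}

// Picks the ID to request: either the version being viewed, or the
// unversioned ID (which points to the latest version).
function idToRequest(url, keepViewedVersion) {
	const parsed = parseAbsURL(url);
	if (!parsed) return null;
	return keepViewedVersion && parsed.version !== null
		? `${parsed.id}v${parsed.version}`
		: parsed.id;
}
```

With `keepViewedVersion` set to `false`, a versioned page behaves like the old code and imports the latest version; with `true`, the item matches what the user is looking at.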

thebluepotato · 2024-10-01T20:35:25Z

arXiv.org.js

-			id = ZU.xpathText(doc, '(//span[@class="arxivid"]/a)[1]')
-				|| ZU.xpathText(doc, '//b[starts-with(normalize-space(text()),"arXiv:")]');
+
+		if (!id) { // Honestly not sure where this might still be needed


The Atom endpoint handles version numbers well, so I don't see cases where we wouldn't rely on the URL to get the ID. Might want to delete this.

thebluepotato · 2024-10-01T20:37:46Z

arXiv.org.js

+		.filter(Boolean);
+	newItem.tags.push(...categories);
+
+	let arxivURL = text(entry, "id").replace(/v\d+/, '');


We need a clear determination here with respect to the user's expectations. The Atom endpoint always indicates the version in the URL, which is why the version is removed here. However, we need to clarify when and where the version should be saved in the item's URL.
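
As a possible starting point for that decision, a small sketch that splits the Atom `id` into an unversioned URL and a version number, so that either can be stored later. `splitAtomId` is a hypothetical name; the sketch assumes the `id` has the form `http://arxiv.org/abs/<id>v<N>`. Anchoring the version match to the end of the string is a defensive choice compared with `.replace(/v\d+/, '')`, not a fix for a known failure.

```javascript
// Hypothetical helper, not part of the translator.
// Assumption: the Atom <id> looks like http://arxiv.org/abs/<id>v<N>.
function splitAtomId(atomId) {
	// The lazy prefix plus the end anchor means only a trailing "v<digits>"
	// is treated as the version.
	const match = atomId.match(/^(.*?)(?:v(\d+))?$/);
	return {
		url: match[1],
		version: match[2] ? Number(match[2]) : null
	};
}

const { url, version } = splitAtomId("http://arxiv.org/abs/2201.00738v2");
// url is the unversioned abstract URL; version is 2
```

Whether `version` ends up in the item's URL, in Extra, or nowhere is exactly the open question; the split just keeps both options available.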

AbeJellinek · 2024-10-03T19:58:11Z

This is looking great. I'm getting more and more timeouts from the old export endpoint, so I'd love to get it merged.

@adam3smith, what do you think?

thebluepotato · 2024-10-04T06:50:59Z

Note that https://github.com/zotero/utilities/blob/e00d98d3a11f6233651a052c108117cf44873edc/utilities.js#L435 should be updated after this PR is merged since the new endpoint explicitly does support versions.

AbeJellinek · 2024-10-04T15:07:07Z

OK, I think this is ready. @dstillman or @adam3smith, would appreciate a third opinion before we merge.

AbeJellinek · 2024-10-09T14:00:37Z

Thank you so, so much! This is a huge improvement.

zoe-translates and others added 5 commits October 23, 2023 18:29

ArXiv: Asyncify doSearch() and add a test for search by identifier

098a19c

[ESLint] Make "arXiv" a valid input key for search tests

c6cd378

ArXiv: [Minor] Add comment about selectors in use for legacy search

883ae25

Update arXiv translator to use faster API

f296963

thebluepotato mentioned this pull request Oct 1, 2024

Bringing Crossref, Semantic Scholar, Open Citations and Open Alex lookup + auto-import + redesign to Cita for Zotero 7 diegodlh/zotero-cita#300

Merged

AbeJellinek requested changes Oct 1, 2024

thebluepotato added 4 commits October 1, 2024 17:52

Update based on initial review

1f7bfb3

Linting fixes

a96513f

Merge remote-tracking branch 'zoe-translates/arxiv-search-updates' into arxiv_fast

c025628

Post-merge fixes

0634264

thebluepotato added 2 commits October 1, 2024 20:08

Update tag parsing

0d2fcea

XPath be gone

9789876

thebluepotato commented Oct 1, 2024

thebluepotato marked this pull request as ready for review October 1, 2024 20:45

thebluepotato requested a review from AbeJellinek October 1, 2024 20:45

AbeJellinek requested changes Oct 3, 2024

Rename attachments and standardize Extra field spacing

6c824d3

thebluepotato mentioned this pull request Oct 4, 2024

Update arXiv regex zotero/utilities#37

Open

thebluepotato commented Oct 4, 2024

thebluepotato requested a review from AbeJellinek October 4, 2024 10:15

AbeJellinek added 2 commits October 4, 2024 10:56

Update minVersion

3065a2b

Misc. changes

bed5e0b

- Full Text PDF -> Preprint PDF
- Extra: 'arXiv: [ID]' -> 'arXiv:[ID]' again
  Changed my mind on this - consistency is better for now, but we can consider changing it later.

Cleanup

165654c

AbeJellinek merged commit 30664ce into zotero:master Oct 9, 2024
1 check failed

thebluepotato deleted the arxiv_fast branch October 14, 2024 23:15
