
Add a reader for harvesting directly from purl-fetcher HTTP API #1511

Open · wants to merge 1 commit into main from purl-fetcher-reader
Conversation

thatbudakguy
Member

This adds a traject reader that can be useful in development when
you want to quickly index many records from purl-fetcher without
resorting to Kafka. It is intended for dev use only.

It can point at any release target (searchworks, earthworks) and
index all of the items currently released to that target.
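As a rough illustration of the reader's shape, here is a minimal sketch of an enumerable that pages through the druids released to a target. The paging logic and response shape are assumptions for illustration, not the actual purl-fetcher API; the page fetcher is injected as a callable so the sketch stays self-contained.

```ruby
# Hypothetical sketch of a reader that enumerates druids released to a
# target (e.g. "searchworks", "earthworks"). NOT the real purl-fetcher
# API: the paging contract here is an assumption for illustration.
class PurlFetcherReader
  include Enumerable

  # fetch_page: a callable returning an array of druid strings for a
  # given (target, page), injected so the sketch needs no HTTP client.
  def initialize(target, fetch_page:)
    @target = target
    @fetch_page = fetch_page
  end

  # Yield every druid currently released to the target, page by page,
  # stopping when a page comes back empty.
  def each
    page = 1
    loop do
      druids = @fetch_page.call(@target, page)
      break if druids.empty?
      druids.each { |druid| yield druid }
      page += 1
    end
  end
end
```

Because it mixes in `Enumerable`, the reader composes naturally with traject, which only needs an object it can iterate over.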

This PR also modifies PublicCocinaRecord and PublicXmlRecord to
optionally accept a connection object, so that a single Faraday
connection can be shared by the reader and the records. This makes
it possible to parallelize record fetching from purl to match the
number of traject threads.
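The parallel-fetch idea can be sketched with a fixed pool of worker threads draining a queue of druids. This is a simplified, stdlib-only illustration (no Faraday, no real HTTP): the fetch step is passed in as a block, standing in for a record fetch over the shared connection.

```ruby
# Sketch: fetch records for a list of druids using a fixed pool of
# worker threads, mirroring how the reader can match traject's thread
# count. The block stands in for a fetch over a shared connection.
def fetch_in_parallel(druids, threads: 4, &fetch)
  work = Queue.new
  druids.each { |druid| work << druid }
  results = Queue.new

  workers = threads.times.map do
    Thread.new do
      loop do
        druid = begin
          work.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break # queue drained, worker exits
        end
        results << fetch.call(druid)
      end
    end
  end

  workers.each(&:join)
  Array.new(results.size) { results.pop }
end
```

Note that results come back in completion order, not input order; traject itself doesn't depend on record order, so that trade-off is acceptable here.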

This setup allows indexing everything released to Earthworks in
a little under 5 minutes with 4 threads on my machine.

@thatbudakguy thatbudakguy force-pushed the purl-fetcher-reader branch 3 times, most recently from 076f69d to c6a538f Compare August 30, 2024 15:53
@thatbudakguy thatbudakguy force-pushed the purl-fetcher-reader branch 3 times, most recently from 890fb28 to ccdecf6 Compare September 6, 2024 22:04
@thatbudakguy thatbudakguy marked this pull request as ready for review September 6, 2024 22:08
@jcoyne
Contributor

jcoyne commented Sep 9, 2024

@thatbudakguy Can you explain how you would use the purl-fetcher reader in development? This seems like functionality that is already built into Argo, and we're building it a second time, creating more code paths to maintain.

@thatbudakguy
Member Author

I use this to reindex SDR content from my local machine. The past few times I've needed to do a full reindex of Earthworks, in both staging and prod, I've used it.

I also heard from some folks on the Earthworks workcycle who were working on Solr changes that they wanted to index a lot of records (thousands) locally for testing. I think this is probably the easiest way to do that, or to create a local copy of all of production in order to test indexing changes.

In all of the cases where I've used this, the underlying data in Argo has not changed – only the indexing configuration. Republishing the data there would move a lot of objects through the system unnecessarily.

@jcoyne
Contributor

jcoyne commented Sep 9, 2024

@thatbudakguy When you do a "Republish" in Argo, it's basically just ensuring the metadata is fresh and telling the indexer to go. It's not a lot of "unnecessary moving". But if this is what you need, then go for it.
