
Add a reader for harvesting directly from purl-fetcher HTTP API #1511

Open · wants to merge 1 commit into main from purl-fetcher-reader
Conversation

thatbudakguy
Member

This adds a traject reader that can be useful in development when
you want to quickly index many records from purl-fetcher without
resorting to Kafka. It is intended for dev use only.

It can point at any release target (searchworks, earthworks) and
index all of the items currently released to that target.
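As a rough illustration of the reader's shape, here is a minimal sketch of an enumerable that pages through the druids released to a target. The paging logic and response shape are assumptions for illustration, not the actual purl-fetcher API; the page fetcher is injected as a callable so the sketch stays self-contained.

```ruby
# Hypothetical sketch of a reader that enumerates druids released to a
# target (e.g. "searchworks", "earthworks"). NOT the real purl-fetcher
# API: the paging contract here is an assumption for illustration.
class PurlFetcherReader
  include Enumerable

  # fetch_page: a callable returning an array of druid strings for a
  # given (target, page), injected so the sketch needs no HTTP client.
  def initialize(target, fetch_page:)
    @target = target
    @fetch_page = fetch_page
  end

  # Yield every druid currently released to the target, page by page,
  # stopping when a page comes back empty.
  def each
    page = 1
    loop do
      druids = @fetch_page.call(@target, page)
      break if druids.empty?
      druids.each { |druid| yield druid }
      page += 1
    end
  end
end
```

Because it mixes in `Enumerable`, the reader composes naturally with traject, which only needs an object it can iterate over.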

This PR also modifies PublicCocinaRecord and PublicXmlRecord to
optionally accept a connection object, so that a single Faraday
connection can be shared by the reader and the records. This makes
it possible to parallelize record fetching from purl to match the
number of traject threads.
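The parallel-fetch idea can be sketched with a fixed pool of worker threads draining a queue of druids. This is a simplified, stdlib-only illustration (no Faraday, no real HTTP): the fetch step is passed in as a block, standing in for a record fetch over the shared connection.

```ruby
# Sketch: fetch records for a list of druids using a fixed pool of
# worker threads, mirroring how the reader can match traject's thread
# count. The block stands in for a fetch over a shared connection.
def fetch_in_parallel(druids, threads: 4, &fetch)
  work = Queue.new
  druids.each { |druid| work << druid }
  results = Queue.new

  workers = threads.times.map do
    Thread.new do
      loop do
        druid = begin
          work.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break # queue drained, worker exits
        end
        results << fetch.call(druid)
      end
    end
  end

  workers.each(&:join)
  Array.new(results.size) { results.pop }
end
```

Note that results come back in completion order, not input order; traject itself doesn't depend on record order, so that trade-off is acceptable here.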

This setup allows indexing everything released to Earthworks in
a little under 5 minutes with 4 threads on my machine.

@thatbudakguy thatbudakguy force-pushed the purl-fetcher-reader branch 3 times, most recently from 076f69d to c6a538f Compare August 30, 2024 15:53
@thatbudakguy thatbudakguy force-pushed the purl-fetcher-reader branch 3 times, most recently from 890fb28 to ccdecf6 Compare September 6, 2024 22:04
@thatbudakguy thatbudakguy marked this pull request as ready for review September 6, 2024 22:08
@jcoyne
Contributor

jcoyne commented Sep 9, 2024

@thatbudakguy Can you explain how you would use the purl-fetcher reader in development? This seems like functionality that is already built into Argo, and we're building it a second time, creating more code paths to maintain.

@thatbudakguy
Member Author

I use this to reindex SDR content from my local machine. The past few times I've needed to do a full reindex of Earthworks, in both staging and prod, I've used it.

I also heard from some folks on the Earthworks workcycle who were working on Solr changes that they wanted to index a lot of records (thousands) locally for testing. I think this is probably the easiest way to do that, or to create a local copy of all of production in order to test indexing changes.

In all of the cases where I've used this, the underlying data in Argo has not changed – only the indexing configuration. Republishing the data there would move a lot of objects through the system unnecessarily.

@jcoyne
Contributor

jcoyne commented Sep 9, 2024

@thatbudakguy When you do a "Republish" in Argo, it's basically just ensuring the metadata is fresh and telling the indexer to go. It's not a lot of "unnecessary moving". But if this is what you need, then go for it.
