Add a reader for harvesting directly from purl-fetcher HTTP API #1511
This adds a traject reader that can be useful in development when
you want to quickly index many records from purl-fetcher without
resorting to Kafka. It is intended for dev use only.
It can point at any release target (searchworks, earthworks) and
index all of the items currently released to that target.
This PR also modifies PublicCocinaRecord and PublicXmlRecord to
optionally accept a connection object, so that a single Faraday
connection can be shared between the reader and the records. This
lets record fetching from purl be parallelized to match the
number of traject threads.
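
A minimal sketch of the shared-connection idea, not the code in this PR: `StubConnection`, `SketchRecord`, and `fetch_all` are hypothetical names, the druids are made up, and the stub stands in for a Faraday connection (anything that responds to `get` and returns an object with a `body`) so that the example runs without network access.

```ruby
# Hypothetical sketch; class and method names are illustrative only.

# Stand-in for a Faraday connection: responds to #get and returns
# an object with #body, as a Faraday response does.
class StubConnection
  Response = Struct.new(:body)

  attr_reader :requests

  def initialize
    @requests = Queue.new # thread-safe log of requested paths
  end

  def get(path)
    @requests << path
    Response.new("<publicObject id=\"#{path}\"/>")
  end
end

# A record that optionally accepts a connection and otherwise
# builds its own (assumed behaviour, mirroring the description above).
class SketchRecord
  attr_reader :druid

  def initialize(druid, connection: nil)
    @druid = druid
    @connection = connection || StubConnection.new
  end

  def public_xml
    @public_xml ||= @connection.get("/#{druid}.xml").body
  end
end

# Fetch every druid using a fixed number of worker threads that all
# share the same connection object.
def fetch_all(druids, connection:, threads: 4)
  queue = Queue.new
  druids.each { |druid| queue << druid }
  queue.close # pop returns nil once the queue is drained

  results = Queue.new
  workers = Array.new(threads) do
    Thread.new do
      while (druid = queue.pop)
        record = SketchRecord.new(druid, connection: connection)
        results << [druid, record.public_xml]
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }.to_h
end
```

Calling `fetch_all(druids, connection: conn, threads: 4)` returns a hash of druid to document body. Because the connection is injected rather than created per record, every worker reuses the same object.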
This setup allows indexing everything released to Earthworks in
a little under 5 minutes with 4 threads on my machine.