Skip to content
This repository has been archived by the owner on Sep 21, 2024. It is now read-only.

chore: Task Workers Revamp #796

Merged
merged 1 commit into from
Feb 29, 2024
Merged

chore: Task Workers Revamp #796

merged 1 commit into from
Feb 29, 2024

Conversation

jsantell
Copy link
Contributor

@jsantell jsantell commented Jan 24, 2024

Decoupled Worker Tasks

Worker tasks are now abstracted such that routes now send jobs via a JobClient decoupled from where the work is performed. A standalone noosphere_gateway::jobs::worker_queue module provides the work distribution logic, with each processor receiving a GatewayJobContext that provides access to sphere contexts, IPFS, and name resolvers when processing a job.

  • Worker jobs can now queue up additional tasks on success (IPFS Syndication -> Name Record Publish) (LinkRecord can propagate before blocks are available #606).
  • Jobs can be retried upon failure.
  • Some timeout logic has been implemented, but not yet ready.
  • Republishing name records is now its own job type. Eliminates the need for the scheduler to resolve the latest name record.

Minimize external queries

GatewayManager and routes were updated to minimize GM queries (#753).

In addition to a single, cacheable baseline query for scope, unauthenticated routes now has one less external query, and authenticated routes also has one less external query, with another query moved behind authorization.

Ultimately handler logic should not be changed, other than moving Push handler's queueing of name record publishing behind IPFS syndication.

Caveats/Concerns/Enhancements

  • We should do some profiling to measure lock contention across workers where appropriate amongst shared resources like KuboClient and the name resolver.
  • Job processors may need to be rethought in order to better take advantage of retries e.g. classifying a failure between retriables, invalid state/unretriable, and failed-but-ok-to-lazily-try-again-later.
  • Implement timeouts in worker queue. Timeouts now implemented
  • Ipfs syndication revision parameter now optional (unused? to confirm)
  • Ensure name record provided to PUSH is still valid after Ipfs syndication completes due to record expiry

@jsantell jsantell force-pushed the task-workers branch 6 times, most recently from b8f9f6b to a24a2af Compare January 31, 2024 18:31
@jsantell jsantell changed the title [WIP] Task Workers chore: Task Workers Revamp Jan 31, 2024
@jsantell jsantell marked this pull request as ready for review January 31, 2024 18:50
@jsantell jsantell requested a review from cdata January 31, 2024 18:50
Copy link
Contributor

github-actions bot commented Jan 31, 2024

Test flake analysis

No flakes detected 🎉

status platform features toolchain
🟢 ubuntu-latest test-kubo,headers stable
🟢 ubuntu-latest test-kubo,headers,rocksdb stable
🟢 ubuntu-latest test-kubo,headers nightly
🟢 macos-13 test-kubo,headers,rocksdb stable
🟢 macos-13 test-kubo,headers stable
🟢 windows-latest test-kubo,headers stable

@jsantell jsantell force-pushed the task-workers branch 3 times, most recently from 959058a to 8edb673 Compare February 1, 2024 19:20
@jsantell
Copy link
Contributor Author

jsantell commented Feb 1, 2024

Added explicit checks on existence of name records after syncing changes in tests, possibly clearing up a handful of intermittent failures.

Windows builds are having an unrelated issue: #805

@jsantell jsantell requested review from cdata and removed request for cdata February 1, 2024 19:55
@jsantell
Copy link
Contributor Author

jsantell commented Feb 5, 2024

Added time out support, recreating the hanging worker thread and attempting to resubmit the job. Of note, the SingleTenantJobClient is set to only 1 attempt for now, until we normalize what an error is in the job processors

@jsantell jsantell force-pushed the task-workers branch 2 times, most recently from a6ca055 to 30c88a1 Compare February 5, 2024 22:29
const JOB_RETRIES: usize = 1;
/// How many seconds before a job is considered broken,
/// and potentially restarted.
const TIMEOUT_SECONDS: u64 = 180;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Syndicating large histories could be longer than this timeout, and notably, name records' publishing timeframe is currently 2 minutes.

@jsantell
Copy link
Contributor Author

Rebasing on #815

@cdata
Copy link
Collaborator

cdata commented Feb 26, 2024

Ipfs syndication revision parameter now optional (unused? to confirm)

IIRC we now syndicate full spheres at a time to IPFS, so it isn't necessary to track this.

Copy link
Collaborator

@cdata cdata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bones look excellent, but I have some light opinions about some of the flesh :)

Great work so far!

rust/noosphere-gateway/src/jobs/worker_queue/mod.rs Outdated Show resolved Hide resolved
rust/noosphere-gateway/src/jobs/worker_queue/queue.rs Outdated Show resolved Hide resolved
rust/noosphere-gateway/src/jobs/worker_queue/worker.rs Outdated Show resolved Hide resolved
rust/noosphere-gateway/src/jobs/worker_queue/worker.rs Outdated Show resolved Hide resolved
rust/noosphere-ns/src/utils.rs Outdated Show resolved Hide resolved
rust/noosphere-ns/src/utils.rs Show resolved Hide resolved
rust/noosphere/tests/distributed_basic.rs Outdated Show resolved Hide resolved
rust/noosphere-ns/src/utils.rs Outdated Show resolved Hide resolved
rust/noosphere-ns/src/utils.rs Show resolved Hide resolved
rust/noosphere-gateway/src/jobs/worker_queue/queue.rs Outdated Show resolved Hide resolved
rust/noosphere-gateway/src/jobs/worker_queue/worker.rs Outdated Show resolved Hide resolved
rust/noosphere-gateway/src/jobs/worker_queue/worker.rs Outdated Show resolved Hide resolved
@jsantell
Copy link
Contributor Author

jsantell commented Feb 27, 2024

Updates (commit):

  • Changes/cleanups from feedback
  • Remove vestigial revision in IpfsSyndication job
  • Reduce indirection from cloning (e.g. use outer Arcs, not inner)
  • s/ContextResolver/SphereContextResolver
  • s/NameResolverExt/NameResolverPoller
  • s/WorkerQueueThread/WorkerQueueOrchestrator
  • More documentation for worker_queue
  • Updated retries to explicitly be "retrying" (defaulting to 0) rather than "number of attempts" (defaulting to 1)

Copy link
Collaborator

@cdata cdata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the hard work on this!

* Decouple worker tasks to run independently of the gateway. Fixes #720.
* Reduce GatewayManager queries in the gateway handlers on request.
  Fixes #723.
* Introduce polling helpers in tests ensuring async name record publication
  occurs before continuing the test.
@jsantell jsantell merged commit 887ebff into main Feb 29, 2024
21 of 22 checks passed
@jsantell jsantell deleted the task-workers branch February 29, 2024 18:27
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants