Improve argument streaming for remote job execution (part 1) #1397

Draft
wants to merge 10 commits into
base: devel
from

Conversation

Shrews
Contributor

@Shrews Shrews commented Sep 24, 2024

Remote Job Execution Protocol Changes

Reason For Change

Remote job execution involves a Transmitter node streaming the job keyword arguments as a JSON object to a Worker node. Adding new keyword arguments to the interface.run() or interface.run_async() methods can cause errors within the remote job streaming interface when an older Worker node receives the new, unrecognized keyword argument from a newer Transmitter node (worker/transmitter version mismatch).

Fixes #1324

Solution

The complete fix requires a multi-stage solution:

  • The first stage (implemented in this PR) has the Transmitter process stream only keyword arguments that are explicitly specified and whose value differs from the default. This does not fix the case where a new keyword argument is introduced and used with a non-default value, but it does prevent older Worker nodes from failing when the new keyword is not used, since the keyword is then never sent in the argument stream.
  • The second stage (not implemented here) will have the Worker process gracefully fail on unrecognized keywords.
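A minimal sketch of the failure mode and the first-stage fix, not the code in this PR. The `DEFAULTS` table, `streamable_kwargs()`, `old_worker_run()`, and the `new_option` keyword are hypothetical stand-ins; the sketch assumes the Worker splats the decoded JSON object into a function with a fixed signature.

```python
import json

# Hypothetical defaults table standing in for the real keyword defaults.
DEFAULTS = {
    "private_data_dir": None,
    "playbook": None,
    "timeout": None,
    "new_option": False,  # keyword added after the old Worker was released
}


def streamable_kwargs(kwargs, defaults=DEFAULTS):
    """Keep only arguments that were supplied and differ from their default."""
    return {k: v for k, v in kwargs.items() if k not in defaults or v != defaults[k]}


def old_worker_run(private_data_dir=None, playbook=None, timeout=None):
    """Stand-in for an older Worker that predates `new_option`."""
    return {"private_data_dir": private_data_dir, "playbook": playbook, "timeout": timeout}


# A newer Transmitter that fills in every keyword, including the unused new one.
job = {
    "private_data_dir": "/tmp/demo",
    "playbook": "site.yml",
    "timeout": None,
    "new_option": False,
}

# Streaming everything breaks the old Worker when it splats the payload.
try:
    old_worker_run(**json.loads(json.dumps(job)))
    full_payload_ok = True
except TypeError:
    full_payload_ok = False

# Streaming only non-default values leaves the unused keyword out of the payload.
payload = json.dumps(streamable_kwargs(job))
result = old_worker_run(**json.loads(payload))
```

If `new_option` were set to `True`, it would still be streamed and the old Worker would still fail, which is the case left for the second stage.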

Implementation Details

  • RunnerConfig object will become the main source for supplying job parameters instead of kwargs.
    • Turns the BaseConfig and RunnerConfig classes into dataclasses to allow us to properly define arguments, their data types, and their default values.
      • Since the __init__() method is now autogenerated, a few class attributes whose names differed from their constructor arguments need care to keep their existing names. This is done with Python property decorators that alias the old attribute name. Internally, the attribute is referenced by the non-alias version.
    • By default, all RunnerConfig attributes are streamable to a Worker. Attributes that are NEVER streamable will have explicit dataclass metadata to mark them as such.
  • interface.run() and interface.run_async() changed to accept a RunnerConfig object.
    • Can now be used in one of two ways:
      • Old-style: method takes only kwargs parameters; backwards compatible
      • New-style: method takes a RunnerConfig object and a few select parameters.
    • Non-public methods lower in the call hierarchy (init_runner(), dump_artifacts(), etc.) are modified to accept a RunnerConfig object instead of keyword arguments.
  • Transmitter modified to query the RunnerConfig object to retrieve the keyword arguments to transmit.
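The points above could look roughly like the sketch below. It is an illustration under assumptions, not the PR's implementation: `SketchRunnerConfig`, the `"streamable"` metadata key, the `example_arg`/`example_alias` pair, `streamable_kwargs()`, and `sketch_run()` are all hypothetical names, and the fields shown are only a small sample.

```python
from dataclasses import MISSING, dataclass, field, fields
from typing import Any, Callable, Dict, Optional


@dataclass
class SketchRunnerConfig:
    # Fields are streamable unless their metadata says otherwise.
    private_data_dir: Optional[str] = None
    playbook: Optional[str] = None
    timeout: Optional[int] = None
    # Field named after the constructor argument; see the property alias below.
    example_arg: Optional[str] = None
    # Never streamable: a callable cannot be serialized as JSON.
    # The "streamable" metadata key is an assumption made for this sketch.
    event_handler: Optional[Callable[..., Any]] = field(
        default=None, metadata={"streamable": False}
    )

    @property
    def example_alias(self) -> Optional[str]:
        """Older attribute name, kept as an alias for `example_arg`."""
        return self.example_arg

    @example_alias.setter
    def example_alias(self, value: Optional[str]) -> None:
        self.example_arg = value

    def streamable_kwargs(self) -> Dict[str, Any]:
        """Return streamable fields whose value differs from the default."""
        out = {}
        for f in fields(self):
            if not f.metadata.get("streamable", True):
                continue
            value = getattr(self, f.name)
            if f.default is not MISSING and value == f.default:
                continue
            out[f.name] = value
        return out


def sketch_run(config: Optional[SketchRunnerConfig] = None, **kwargs: Any) -> Dict[str, Any]:
    """Accept either a config object (new-style) or bare kwargs (old-style)."""
    if config is None:
        config = SketchRunnerConfig(**kwargs)
    # A Transmitter would serialize this dict instead of the raw kwargs.
    return config.streamable_kwargs()
```

Both call styles yield the same payload here: `sketch_run(playbook="site.yml")` and `sketch_run(SketchRunnerConfig(playbook="site.yml"))`. Keeping the streamability flag in field metadata means the Transmitter can decide what to send by inspecting the dataclass, without a separate allow-list to maintain.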

TODO

  • Harden and add tests for new API.
  • Update documentation.
  • Testing with Controller?

@Shrews Shrews force-pushed the kwarg-protocol-3 branch 2 times, most recently from c0b642f to 1931b5f Compare September 24, 2024 19:30
@github-actions github-actions bot added the docs Changes to documentation label Oct 14, 2024
@Shrews Shrews changed the title Keyword argument protocol - config dataclasses Improve argument streaming for remote job execution (part 1) Oct 14, 2024
@Shrews Shrews force-pushed the kwarg-protocol-3 branch 3 times, most recently from 368faa7 to 15fe5e5 Compare October 17, 2024 18:27
Labels
docs Changes to documentation
Development

Successfully merging this pull request may close these issues.

Account for streaming payload splatting version incompatibilities
1 participant