Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor the agent #684

Merged
merged 8 commits into from
Jul 15, 2024
Merged

Conversation

diconico07
Copy link
Contributor

This PR broadly refactors the agent to:

  • use kube Controller construct
  • take advantage of Server Side Apply
  • prepare for resource split and CDI+DRA
  • don't put everything under a util directory
  • use closer to kube upstream kube client
  • update proto definitions for device plugins
  • use kubelet pod resources monitoring interface rather than CRI to do slot reconciliation
  • Use CRD definition in Rust code to generate yaml file

While a strict refactor would not change anything user-facing, in order to facilitate some elements, the capacity is added to the Instance object, as well as the CDI fully qualified device name.

This Still WIP for now, as most of the unit tests were not ported to new architecture.

What this PR does / why we need it:

Special notes for your reviewer:
Sorry for the huge PR

If applicable:

  • this PR has an associated PR with documentation in akri-docs
  • this PR contains unit tests
  • added code adheres to standard Rust formatting (cargo fmt)
  • code builds properly (cargo build)
  • code is free of common mistakes (cargo clippy)
  • all Akri tests succeed (cargo test)
  • inline documentation builds (cargo doc)
  • all commits pass the DCO bot check by being signed off -- see the failing DCO check for instructions on how to retroactively sign commits

@diconico07 diconico07 force-pushed the revamp-agent branch 2 times, most recently from 47bbb6e to 8b30652 Compare December 28, 2023 15:29
@bfjelds
Copy link
Collaborator

bfjelds commented Jan 4, 2024

assuming that there are instances of files that are just existing code that has been moved (but otherwise functionally unchanged), it would be helpful if you could point those out so we don't waste time reviewing them :)

@bfjelds
Copy link
Collaborator

bfjelds commented Jan 4, 2024

prepare for resource split and CDI+DRA

yay! will we get deallocate calls from DRA? (and be able to do away with the reconciler!!!)

&mut self,
request: impl tonic::IntoRequest<super::PreferredAllocationRequest>,
) -> Result<tonic::Response<super::PreferredAllocationResponse>, tonic::Status> {
self.inner.ready().await.map_err(|e| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use this preferredallocation concept to avoid kubelet requesting shared resources? (or maybe this isn't worth investing in if we move to DRA)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the preferred allocation is preference, there are no guarantees that the kubelet will follow those preferences (especially if multiple resources are requested), so I think it's not worth it (esp. compared to DRA)

@diconico07
Copy link
Contributor Author

assuming that there are instances of files that are just existing code that has been moved (but otherwise functionally unchanged), it would be helpful if you could point those out so we don't waste time reviewing them :)

Some parts of files are code that just got moved, but most of it required rework (hence the fact I couldn't just copy/paste the unit tests)

Copy link
Contributor

@kate-goldenring kate-goldenring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving some comments based on our first synchronous review

shared/src/k8s/crud.rs Outdated Show resolved Hide resolved
agent/src/util/discovery_configuration_controller.rs Outdated Show resolved Hide resolved
@@ -19,11 +19,46 @@ pub type InstanceList = ObjectList<Instance>;
#[derive(CustomResource, Deserialize, Serialize, Clone, Debug, JsonSchema)]
#[serde(rename_all = "camelCase")]
// group = API_NAMESPACE and version = API_VERSION
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diconico07 do we need to add an annotation to declare this immutable? Can we add a todo to support capacity changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not enforce immutability on the Instance as it is a purely internal one, concerning the capacity changes, I added a TODO comment about it in the device plugin instance controller (where the change will be needed)

agent/src/util/discovery_configuration_controller.rs Outdated Show resolved Hide resolved

/// Real world implementation of the Discovery Handler Request
struct DHRequestImpl {
endpoints: RwLock<Vec<watch::Receiver<Vec<Arc<DiscoveredDevice>>>>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where can we document in code that we support multiple discovery handlers of the same type? AKA if i have deployed 3 onvif discovery handlers, for each configuration applied, i create 3 connections.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if there is a right place for documenting this in code (maybe in the crate documentation of the discovery utils crate), but we may want to add a paragraph about this in the docs

@diconico07 diconico07 force-pushed the revamp-agent branch 3 times, most recently from e7a0872 to 83d3e7f Compare February 6, 2024 13:44
This commit broadly refactors the agent to:
- use kube Controller construct
- take advantage of Server Side Apply
- prepare for resource split and CDI+DRA
- don't put everything under a util directory
- use closer to kube upstream kube client
- update proto definitions for device plugins
- use kubelet pod resources monitoring interface rather than CRI to do
  slot reconciliation
- Use CRD definition in Rust code to generate yaml file

Signed-off-by: Nicolas Belouin <[email protected]>
@diconico07 diconico07 marked this pull request as ready for review February 15, 2024 10:01
Copy link
Contributor

@kate-goldenring kate-goldenring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got about half way through and am really liking how you've brought in the kube-rs controller updates. All i have left is reviewing the discovery_configuration_controller, though that is a bulk of the implementation. My main confusion so far is with why we are using Signal.

agent/build.rs Outdated Show resolved Hide resolved
deployment/helm/crds/akri-instance-crd.yaml Show resolved Hide resolved
shared/src/k8s/crud.rs Outdated Show resolved Hide resolved
shared/src/akri/configuration.rs Show resolved Hide resolved
agent/src/plugin_manager/device_plugin_slot_reclaimer.rs Outdated Show resolved Hide resolved
agent/src/main.rs Outdated Show resolved Hide resolved
agent/src/util/discovery_configuration_controller.rs Outdated Show resolved Hide resolved
agent/src/util/discovery_configuration_controller.rs Outdated Show resolved Hide resolved
agent/src/util/discovery_configuration_controller.rs Outdated Show resolved Hide resolved
Signed-off-by: Nicolas Belouin <[email protected]>
Signed-off-by: Nicolas Belouin <[email protected]>
Signed-off-by: Nicolas Belouin <[email protected]>
Signed-off-by: Nicolas Belouin <[email protected]>
Copy link
Contributor

@kate-goldenring kate-goldenring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the implementation looks good. I had mainly nit comments. I really like the slot reconciliation changes. However, I wasn't able to successfully get the changes running with a debugEcho discovery handler to test the changes. It looks like an issue with kube_client resolving the CRD updates. I am running the agent and debugEcho DH locally and applying this config.

The client errors with .spec.capacity: field not declared in schema and .spec.cdiName: field not declared in schema. Here is a snippet of error from the agent:

[2024-06-27T22:23:46Z INFO  kube_runtime::controller] reconciling object; object.ref=Configuration.v0.akri.sh/akri-debug-echo-foo.default object.reason=unknown
[2024-06-27T22:23:46Z ERROR kube_client::client::builder] failed with status 500 Internal Server Error
[2024-06-27T22:23:46Z WARN  agent::util::discovery_configuration_controller] Error during reconciliation for Some("default")::akri-debug-echo-foo, retrying in 1s: Other(ApiError: failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema:  (ErrorResponse { status: "Failure", message: "failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema", reason: "", code: 500 })
    
    Caused by:
        failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema: )
[2024-06-27T22:23:47Z INFO  kube_runtime::controller] reconciling object; object.ref=Configuration.v0.akri.sh/akri-debug-echo-foo.default object.reason=error policy requested retry
[2024-06-27T22:23:47Z ERROR kube_client::client::builder] failed with status 500 Internal Server Error
[2024-06-27T22:23:47Z WARN  agent::util::discovery_configuration_controller] Error during reconciliation for Some("default")::akri-debug-echo-foo, retrying in 2s: Other(ApiError: failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema:  (ErrorResponse { status: "Failure", message: "failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema", reason: "", code: 500 })
...


[2024-06-27T22:24:31Z WARN  agent::util::discovery_configuration_controller] Error during reconciliation for Some("default")::akri-debug-echo-foo, retrying in 64s: Other(ApiError: the name of the object (akri-debug-echo-foo based on URL) was undeterminable: name must be provided: BadRequest (ErrorResponse { status: "Failure", message: "the name of the object (akri-debug-echo-foo based on URL) was undeterminable: name must be provided", reason: "BadRequest", code: 400 })
    
    Caused by:
        the name of the object (akri-debug-echo-foo based on URL) was undeterminable: name must be provided: BadRequest)

agent/proto/pluginapi.proto Show resolved Hide resolved
let state = self.state.borrow();
let cdi_kind = state.get(kind)?;
let mut device = cdi_kind.devices.iter().find(|dev| dev.name == id)?.clone();
device.name = format!("{}-{}", kind, id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this modification of the name should be put in it's own function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not to, this name modification is really only relevant for this in-memory/non full CDI approach we are currently using with device plugins, so won't get re-used, and would probably make it more complicated to read from my point of view.

},
Ok(endpoint) = rec.recv() => {
if endpoint.get_name() != self.handler_name {
// We woke up for another kind of DH, let's get back to sleep
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this happen? When would DH Foo be woken up by DH Bar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will rename rec and endpoint here to be more explicit, this can happen if another discovery handler registers itself:
We are awaking all active requests whenever a new discovery handler registers itself, so that if it is a DH with the same name/kind as the one of this request we also send the request to that newly registered endpoint.
To simplify I chose to have a single MQ to broadcast those registrations (that should be pretty rare anyway), rather than creating one per name/kind of DH.

}

#[async_trait]
impl DiscoveryHandlerEndpoint for NetworkEndpoint {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be moved to it's own module, say network_handler.rs? It is unexpected to have it bundled in with the registration module

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels weird to me to have them separated as the registration socket only cares for network handlers anyway.
Maybe I should just rename the module network_handler.rs, as it basically defines the network handlers and its registration mechanism (akin to the embedded_handlers.rs module that also contains the registration of the active embedded handlers)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Maybe we do this in a followup pr so as to not add more changes

Copy link
Contributor

@kate-goldenring kate-goldenring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you @diconico07 for revamping the agent and the patience iterating on this.

@kate-goldenring kate-goldenring added version/minor Minor version change is needed and removed version/minor Minor version change is needed labels Jul 12, 2024
@kate-goldenring
Copy link
Contributor

/version minor

@github-actions github-actions bot added the version/minor Minor version change is needed label Jul 12, 2024
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@kate-goldenring kate-goldenring merged commit 3175bb9 into project-akri:main Jul 15, 2024
3 checks passed
@diconico07 diconico07 mentioned this pull request Jul 31, 2024
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
version/minor Minor version change is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants