Decision Tracker - Describe the device capabilities specification #9

Open
Tracked by #117
ajcraig opened this issue Jun 5, 2024 · 9 comments
Labels: Decision Tracker; Preview Release 1: Workload Management (Included within the preview release 1.)

Comments

@ajcraig
Contributor

ajcraig commented Jun 5, 2024

Below I have outlined a proposal for the Device Capability Specification used by the Workload Orchestration Software (WOS). This file informs the WOS of the properties, resources, and components of Margo-compliant devices.

The associated workflow / use case for this is detailed below:

  1. Device Owner creates the Device Capability file associated with the device and stores it locally.
  2. Workload Orchestration Software enrolls the Workload Orchestration Agent into its platform.
  3. During the enrollment process, the agent sends the device capabilities specification to the WOS.
  4. The WOS then uses this information to ensure compatibility with applications, inform the user of the device's information, and manage the usage of the device to ensure it isn't over capacity.
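
Steps 3 and 4 of the workflow can be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: the `EnrolledDevice` and `WorkloadOrchestrator` names, the `reserve` method, and the sample values are all assumptions, and a plain dict stands in for the parsed capability file. Only the key names (`deviceId`, `deviceResources`, `memoryCapacity`, `storageCapacity`) come from the proposal.

```python
from dataclasses import dataclass


@dataclass
class EnrolledDevice:
    """Hypothetical per-device record kept by the WOS (capacities in MB)."""
    device_id: str
    memory_capacity: int
    storage_capacity: int
    memory_in_use: int = 0
    storage_in_use: int = 0


class WorkloadOrchestrator:
    """Toy stand-in for the WOS side of steps 3 and 4."""

    def __init__(self):
        self.devices = {}

    def enroll(self, capability):
        """Step 3: record the capabilities the agent sends during enrollment."""
        props = capability["properties"]
        resources = props["deviceResources"]
        device = EnrolledDevice(
            device_id=props["deviceId"],
            memory_capacity=resources["memoryCapacity"],
            storage_capacity=resources["storageCapacity"],
        )
        self.devices[device.device_id] = device
        return device

    def reserve(self, device_id, memory, storage):
        """Step 4: accept a workload only if the device stays within capacity."""
        device = self.devices[device_id]
        fits = (
            device.memory_in_use + memory <= device.memory_capacity
            and device.storage_in_use + storage <= device.storage_capacity
        )
        if fits:
            device.memory_in_use += memory
            device.storage_in_use += storage
        return fits


# Illustrative capability document, as the agent might send it after parsing the file.
capability = {
    "kind": "devicecapabilityspecification",
    "properties": {
        "deviceId": "device-001",
        "deviceResources": {"memoryCapacity": 4096, "storageCapacity": 32768},
    },
}

wos = WorkloadOrchestrator()
wos.enroll(capability)
print(wos.reserve("device-001", memory=3072, storage=1024))  # True: fits
print(wos.reserve("device-001", memory=2048, storage=1024))  # False: exceeds memory
```

Tracking reservations against the declared capacity is only one way the WOS could avoid over-committing a device; the proposal itself does not prescribe how this is done.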

Proposed Margo Device Capability Specification

kind: devicecapabilityspecification
properties:
    deviceId: 
    deviceVendor:
    modelNumber:
    serialNumber:
    margoDeviceRole:
        role1:
    deviceResources:
        cpuArchitecture:
        cpuModel:
        vcpuCount:
        cpuFrequency: 
        memoryCapacity: 
        storageCapacity:
    devicePeripherals:
        peripheral1: 
            - name:
            - type: 
            - modelNumber: 
            - description:
        peripheral2:
            - name:
            - type:
            - modelNumber:
            - description:
    deviceInterfaces:
        interface1:
            - name:
            - type: 
            - modelNumber: 
        interface2:
            - name:
            - type:
            - modelNumber: 

Device Capability Attributes

Top-level Attributes

| Attribute | Type | Required? | Description |
|---|---|---|---|
| properties | Properties | Y | Metadata element specifying characteristics about the device. See the Properties section below. |

Properties Attributes

| Attribute | Type | Required? | Description |
|---|---|---|---|
| deviceId | string | Y | Unique ID assigned to the device via the Device Orchestration Software. |
| deviceVendor | string | Y | Defines the device vendor. |
| modelNumber | string | Y | Defines the model number of the device. |
| serialNumber | string | Y | Defines the serial number of the device. |
| margoDeviceRole | Margo Device Role | Y | Spec element that defines the role(s) the device can provide to the Margo environment. See the Margo Device Role section below. |
| deviceResources | Device Resources | Y | Spec element that defines the device's resources available to the application deployed on the device. See the Device Resources section below. |
| devicePeripherals | Device Peripherals | Y | Spec element that defines the device's peripherals available to the application deployed on the device. See the Device Peripherals section below. |
| deviceInterfaces | Device Interfaces | Y | Spec element that defines the device's interfaces available to the application deployed on the device. See the Device Interfaces section below. |

Margo Device Role Attributes

| Attribute | Type | Required? | Description |
|---|---|---|---|
| role | string | Y | Defines the role(s) the device can provide to the Margo environment. |

Device Resources

| Attribute | Type | Required? | Description |
|---|---|---|---|
| cpuArchitecture | string | Y | Defines the CPU architecture, e.g. ARM or x86. |
| cpuModel | string | Y | Defines the CPU model of the device. |
| vcpuCount | integer | Y | Defines the vCPU count available on the device. |
| cpuFrequency | integer | Y | Defines the frequency of the CPU. |
| memoryCapacity | integer | Y | Defines the memory capacity available for applications on the device. This MUST be defined in MB. |
| storageCapacity | integer | Y | Defines the storage capacity available for applications to utilize. This MUST be defined in MB. |
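
As a rough illustration of how the required attributes and types in the Device Resources table could be checked, here is a minimal sketch. The attribute names and types come from the table; the `validate_device_resources` function and its error messages are hypothetical, and the input is assumed to be the already-parsed `deviceResources` mapping.

```python
# Attribute names and types come from the Device Resources table;
# the checker itself is a hypothetical helper.
REQUIRED_RESOURCES = {
    "cpuArchitecture": str,
    "cpuModel": str,
    "vcpuCount": int,
    "cpuFrequency": int,
    "memoryCapacity": int,   # MB
    "storageCapacity": int,  # MB
}


def validate_device_resources(resources):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, expected in REQUIRED_RESOURCES.items():
        if name not in resources:
            problems.append(f"missing required attribute: {name}")
            continue
        value = resources[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{name} must be of type {expected.__name__}")
    return problems


# Illustrative input with one wrongly typed and one missing attribute.
incomplete = {
    "cpuArchitecture": "ARM",
    "cpuModel": "example-model",
    "vcpuCount": "4",        # string instead of integer
    "cpuFrequency": 2400,
    "storageCapacity": 32768,
}
for problem in validate_device_resources(incomplete):
    print(problem)
```

Returning every problem at once, rather than failing on the first, lets a device owner fix the whole file in one pass.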

Device Peripherals

| Attribute | Type | Required? | Description |
|---|---|---|---|
| peripheral | Peripheral | Y | Defines a peripheral that is present on the edge device. One or more peripherals can be described in this section. See the Peripheral Attributes section below. |

Peripheral Attributes

| Attribute | Type | Required? | Description |
|---|---|---|---|
| name | string | Y | Name of the peripheral. |
| type | string | Y | Type of the peripheral, e.g. GPU. |
| modelNumber | string | Y | Model number of the peripheral. |
| description | string | Y | Description of the peripheral, which can be displayed in the WOS GUI. |

Device Interfaces

| Attribute | Type | Required? | Description |
|---|---|---|---|
| interface | Interface | Y | Defines an interface that is present on the edge device. One or more interfaces can be described in this section. See the Interface Attributes section below. |

Interface Attributes

| Attribute | Type | Required? | Description |
|---|---|---|---|
| name | string | Y | Name of the interface. |
| type | string | Y | Type of the interface, e.g. Ethernet NIC. |
| modelNumber | string | Y | Model number of the interface. |
| description | string | Y | Description of the interface, which can be displayed in the WOS GUI. |

@ajcraig
Contributor Author

ajcraig commented Jun 5, 2024

@margo/technical-wg Please review the proposal above.

Thanks!

@ajcraig ajcraig self-assigned this Jun 5, 2024
@phil-abb
Contributor

phil-abb commented Jun 6, 2024

Here's a proposed set of modifications

  • Following convention, the kind should be Pascal case, and I don't think we need "specification" in the kind.
  • We know it's for the device's capabilities, so I don't think we need to prefix anything with "device".
  • I prefer grouping things over repeating prefixes and suffixes, so I'd have one group for CPU and one for capacity.
  • role should be plural since a device can fill multiple roles. I don't think we need "margo" in the name either because the whole thing is Margo.
  • Instead of using peripheral1, peripheral2, interface1, interface2, etc., we can just key it off the name.
  • I prefer properties instead of description, unless you meant description to be a short sentence about the item. If so, that's fine, but I think we still need a properties dictionary for indicating the unique properties of that device. This could be free-form, or we may need to define specific properties for certain types of hardware.

apiVersion: device.margo/v1
kind: DeviceCapability
properties:
  id: 
  vendor:
  modelNumber:
  serialNumber:
  roles:
  resources:
    cpu:
      architecture:
      model:
      cores:
      frequency: 
    capacity:
      memory: 
      storage:
  peripherals:
    - name: 
      type: 
      modelNumber: 
      properties:
  interfaces:
    - name: 
      type: 
      modelNumber: 
      properties:

Simple Example

apiVersion: device.margo/v1
kind: DeviceCapability
properties:
  id: northstarida.xtapro.edge
  vendor: Northstar Industrial Applications
  modelNumber: 332ANZE1-N1
  serialNumber: PF45343-AA
  roles:
    - standalone cluster
    - cluster lead
  resources:
    cpu:
      architecture: Intel x64
      model: i9-14900KS
      cores: 24
      frequency: 6.2 GHz
    capacity:
      memory: 64.0 GB
      storage: 2 TB
  peripherals:
    - name: NVIDIA GeForce RTX 4070 Ti SUPER OC Edition Graphics Card 
      type: GPU  
      modelNumber: TUF-RTX4070TIS-O16G
      properties:
        manufacturer: NVIDIA
        series: NVIDIA GeForce RTX 40 Series
        gpu: GeForce RTX 4070 Ti SUPER
        ram: 16 GB
        clockSpeed: 2640 MHz
  interfaces:
    - name: RTL8125 NIC 2.5G Gigabit LAN Network Card
      type: Ethernet
      modelNumber: RTL8125 
      properties:
        maxSpeed: 2.5 Gbps
    - name: WiFi 6E Intel AX411NGW M.2 Cnvio2
      type:  Wi-Fi
      modelNumber: AX411NGW
      properties:
        bands: ["2.4 GHz", "5 GHz", "6 GHz"]
        maxSpeed: 2.4 Gbps

@phil-abb
Contributor

phil-abb commented Jun 6, 2024

For peripherals.type and interfaces.type I think we'll need to make a list of types that should be used. We may not be able to come up with a complete list but we should try to come up with as many as we can think of so people are using the type consistently. If we don't, trying to match up the requirements is going to be too difficult if it's intended to be automatable.
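
One possible shape for such a list is a small controlled vocabulary plus a normalization step, sketched below. Everything here is an assumption for illustration: the `HardwareType` enum, the alias table, and the `canonical_type` helper are hypothetical, and the vocabulary only contains the types that appear in this thread (GPU, Ethernet, Wi-Fi).

```python
from enum import Enum


class HardwareType(Enum):
    """Hypothetical controlled vocabulary; only types seen in this thread."""
    GPU = "GPU"
    ETHERNET = "Ethernet"
    WIFI = "Wi-Fi"


# Spellings a device owner might plausibly write, mapped to one canonical type.
_ALIASES = {
    "gpu": HardwareType.GPU,
    "ethernet": HardwareType.ETHERNET,
    "ethernet nic": HardwareType.ETHERNET,
    "wi-fi": HardwareType.WIFI,
    "wifi": HardwareType.WIFI,
}


def canonical_type(raw):
    """Map a free-form type string onto the vocabulary, or raise ValueError."""
    key = " ".join(raw.lower().split())  # lowercase and collapse whitespace
    try:
        return _ALIASES[key]
    except KeyError:
        raise ValueError(f"unknown hardware type: {raw!r}") from None


print(canonical_type("Ethernet NIC").value)
print(canonical_type("  Wi-Fi").value)
```

Rejecting unknown types outright, instead of passing them through, is what keeps automated matching predictable; whether the real list should be closed or extensible is an open question for this thread.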

For anything with a unit (GB, TB, GHz, MHz, Gbps, etc.) we'll probably need to define what the unit should be (or at least a consistent naming convention) if the intention is to pair these resource requirements with the application's requirements. If we don't, trying to match up the requirements is going to be too difficult if it's intended to be automatable.
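
To make the unit problem concrete, here is a minimal sketch of normalizing free-form quantities such as "64.0 GB" or "6GHz" to a single canonical unit before comparison. The canonical units (MB for capacity, MHz for frequency), the decimal multipliers (1 GB = 1000 MB), and the `normalize` helper are all assumptions; the thread has not settled on any of them.

```python
import re

# A number, optional whitespace, then an alphabetic unit, e.g. "64.0 GB" or "6GHz".
_QUANTITY = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z]+)\s*$")

# Assumed canonical units and decimal multipliers (not decided in this thread).
CAPACITY_TO_MB = {"mb": 1, "gb": 1000, "tb": 1000000}
FREQUENCY_TO_MHZ = {"mhz": 1, "ghz": 1000}


def normalize(text, factors):
    """Convert a quantity string to the canonical unit of the given factor table."""
    match = _QUANTITY.match(text)
    if match is None:
        raise ValueError(f"not a quantity: {text!r}")
    number, unit = match.groups()
    try:
        factor = factors[unit.lower()]  # case-insensitive: "Mhz" and "MHz" both work
    except KeyError:
        raise ValueError(f"unsupported unit in {text!r}") from None
    return float(number) * factor


print(normalize("64.0 GB", CAPACITY_TO_MB))
print(normalize("2 TB", CAPACITY_TO_MB))
print(normalize("6GHz", FREQUENCY_TO_MHZ))
```

An alternative that avoids parsing altogether is to fix the unit in the specification and store bare integers, as the original proposal does for `memoryCapacity` and `storageCapacity`.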

@phil-abb
Contributor

phil-abb commented Jun 6, 2024

Do we need to include any information about the container platform it is running (e.g., Docker, Podman, Kubernetes distribution, version, etc.)?

@julienduquesnay-se

julienduquesnay-se commented Jun 7, 2024 via email

@phil-abb
Contributor

phil-abb commented Jun 7, 2024

> In the resource section, "cpu" and "capacity" are not at the same semantic level ("cpu" being a device and "capacity" a characteristic).

Fair point. Maybe for memory and storage we don't need to group them under capacity:

resources:
  memory:
  storage:
  cpu:
  - model:
    architecture:
    cores:
    - type:
      count:
      maxFrequency:
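
To show how this nested structure might be consumed, here is a small sketch using a dict that mirrors the keys above. The values are illustrative only: the core type names ("performance", "efficiency"), the frequencies (assumed to be in MHz), and the helper functions are assumptions, not part of the proposal.

```python
# Illustrative values only; the key names follow the structure proposed above.
resources = {
    "memory": 64000,     # unit not yet agreed in this thread; MB assumed here
    "storage": 2000000,  # MB assumed
    "cpu": [
        {
            "model": "i9-14900KS",
            "architecture": "x86_64",
            "cores": [
                {"type": "performance", "count": 8, "maxFrequency": 6200},
                {"type": "efficiency", "count": 16, "maxFrequency": 4500},
            ],
        }
    ],
}


def total_cores(resources):
    """Sum core counts across every CPU and every core group."""
    return sum(group["count"] for cpu in resources["cpu"] for group in cpu["cores"])


def peak_frequency(resources):
    """Highest maxFrequency reported by any core group."""
    return max(group["maxFrequency"] for cpu in resources["cpu"] for group in cpu["cores"])


print(total_cores(resources))     # 24
print(peak_frequency(resources))  # 6200
```

Making `cpu` and `cores` lists lets the file describe multi-socket devices and hybrid core layouts, at the cost of consumers having to aggregate when they only want a single core count.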

@phil-abb
Contributor

Here is a real use case we have for an application requiring a specific piece of hardware

An application vendor has an application requiring an NVIDIA GPU on the device. The application is designed to work only with NVIDIA GPUs, and the vendor recommends that the GPU have at least 16 GB of RAM to run efficiently. For the application to run on the device, the application vendor expects the device to have the NVIDIA device drivers installed, as well as the NVIDIA GPU Operator on Kubernetes or the NVIDIA Container Toolkit for docker-compose based deployments. The drivers, operator, and toolkit require elevated permissions to install, so the expectation is that these are installed and configured by the device owner.

@margo/technical-wg I think it would be good if we could provide actual use cases we have where applications require a specific piece of hardware so we can see what the requirements are instead of trying to guess what they might be. I think this will help us figure out what needs to be in this file. Does anyone else have any real use cases?
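
For discussion, the hardware part of this use case could be checked against a capability document roughly as follows. This is a sketch under assumptions: the `find_peripheral` helper and the `ramMb` property (GPU memory pre-normalized to MB) are hypothetical, and the device dict is adapted from the Simple Example earlier in the thread. The driver, operator, and toolkit expectations are not covered, since the thread has not yet decided how the file would express installed software.

```python
def find_peripheral(capability, hw_type, manufacturer, min_ram_mb):
    """Return the first peripheral satisfying the requirement, or None."""
    for peripheral in capability["properties"].get("peripherals") or []:
        props = peripheral.get("properties") or {}
        if (
            peripheral.get("type") == hw_type
            and props.get("manufacturer") == manufacturer
            and props.get("ramMb", 0) >= min_ram_mb
        ):
            return peripheral
    return None


# Adapted from the Simple Example; "ramMb" is a hypothetical normalized property.
device = {
    "apiVersion": "device.margo/v1",
    "kind": "DeviceCapability",
    "properties": {
        "id": "northstarida.xtapro.edge",
        "peripherals": [
            {
                "name": "NVIDIA GeForce RTX 4070 Ti SUPER OC Edition Graphics Card",
                "type": "GPU",
                "modelNumber": "TUF-RTX4070TIS-O16G",
                "properties": {"manufacturer": "NVIDIA", "ramMb": 16000},
            }
        ],
    },
}

match = find_peripheral(device, "GPU", "NVIDIA", min_ram_mb=16000)
print(match["modelNumber"] if match else "no compatible GPU")
```

Even this simple check depends on agreed type names, agreed property keys, and agreed units, which is why concrete use cases are useful for deciding what the file must contain.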

@gunjald

gunjald commented Jun 12, 2024

> Here is a real use case we have for an application requiring a specific piece of hardware
>
> An application vendor has an application requiring an NVIDIA GPU on the device. The application is designed to work only with NVIDIA GPUs, and the vendor recommends that the GPU have at least 16 GB of RAM to run efficiently. For the application to run on the device, the application vendor expects the device to have the NVIDIA device drivers installed, as well as the NVIDIA GPU Operator on Kubernetes or the NVIDIA Container Toolkit for docker-compose based deployments. The drivers, operator, and toolkit require elevated permissions to install, so the expectation is that these are installed and configured by the device owner.
>
> @margo/technical-wg I think it would be good if we could provide actual use cases we have where applications require a specific piece of hardware so we can see what the requirements are instead of trying to guess what they might be. I think this will help us figure out what needs to be in this file. Does anyone else have any real use cases?

A few observations:

  1. In such use cases, even NVIDIA GPUs have different architectures, and their CUDA capabilities may also differ. More specifics will be needed to determine which GPU is compatible with the application.
  2. I'm not sure the applications are supposed to consume the "K8s Operator" themselves, as it is something the management platform will use to manage the cluster resources or for similar functions.

@ajcraig
Contributor Author

ajcraig commented Aug 19, 2024

This issue is now tied directly to the Margo Management Interface PR where the latest device capabilities proposal can be found.

@margo/approvers - Let's consider this another Decision tracker item. This issue will be closed when the PR is merged.

@ajcraig ajcraig linked a pull request Aug 19, 2024 that will close this issue
4 tasks
@ajcraig ajcraig changed the title from "Proposal/Discussion: Describe the device capabilities specification" to "Decision Tracker - Describe the device capabilities specification" Aug 19, 2024
@ajcraig ajcraig added the "Preview Release 1: Workload Management" (Included within the preview release 1.) label Jan 6, 2025