Support for requesting max SHM size from SDL #179

Closed
2 tasks done
anilmurty opened this issue Jan 23, 2024 · 7 comments
Labels: P1, repo/akash-api, repo/node (Akash node repo issues), repo/provider (Akash provider-services repo issues)

Comments

@anilmurty commented Jan 23, 2024

Is your feature request related to a problem? Please describe.

Customers (particularly those running AI/ML training workloads) frequently need multiple services to share storage. For example, one service that downloads and labels data is CPU-bound, while another that uses that data for training is GPU-bound; the two can run in parallel but need access to a large shared-memory segment. We currently don't let the user control the maximum SHM size, which makes such workloads hard to run.

Describe the solution you'd like

Support specifying and requesting the SHM size as part of the SDL.
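
For illustration, a minimal sketch of what this could look like in an SDL. The `ram` storage class, `shm` volume name, and /dev/shm mount below match the syntax exercised later in this thread; at this point in the discussion they are a proposal, not a shipped API:

services:
  train:
    params:
      storage:
        shm:
          mount: /dev/shm
profiles:
  compute:
    train:
      resources:
        cpu:
          units: 1
        memory:
          size: 4Gi
        storage:
          - size: 10Gi
          - name: shm
            size: 2Gi
            attributes:
              class: ram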

Describe alternatives you've considered

  1. Manually applying it on the provider: this is the workaround we have been pursuing so far, but it's painful because it must be repeated every time a new deployment is created or an existing deployment restarts. It also requires coordination with the provider, who may not be in the same time zone as the tenant (see the sketch after this list).
    Note that we have verified these changes can be applied manually on the provider side during our work with Thumper training on the FoundryStaking provider.
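
A sketch of the kind of manual change involved, assuming standard Kubernetes practice (the container name and size below are illustrative, not a confirmed procedure): the provider operator edits the lease's Deployment so the workload container mounts a memory-backed emptyDir at /dev/shm:

spec:
  template:
    spec:
      containers:
        - name: app                # the workload container (name illustrative)
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory         # tmpfs-backed; consumes node RAM
            sizeLimit: 2Gi         # illustrative size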

Search

  • I did search for other open and closed issues before opening this

Code of Conduct

  • I agree to follow this project's Code of Conduct

Additional context

No response

@anilmurty anilmurty changed the title from "Support/ Workaround for SHM" to "Support for SHM" Jan 23, 2024
@troian troian added repo/node Akash node repo issues repo/provider Akash provider-services repo issues repo/akash-api P1 and removed awaiting-triage labels Jan 24, 2024
@troian troian changed the title from "Support for SHM" to "SHM support" Jan 24, 2024
@anilmurty (Author)

Per Feb 20 call: We are leaning towards implementing full support for SHM (not just the workaround with bid attributes). @boz is planning to take this on (Thanks Adam!)

@anilmurty (Author)

In the interim, @troian and @chainzero are going to look into the workaround of using bid attributes plus a daemon running on the provider that checks the attributes and applies SHM via kubectl commands.
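
A sketch of the attribute half of that interim idea (an assumed shape; the eventual provider attributes shown later in this thread advertise storage classes as capabilities/storage/<n>/class entries):

attributes:
  - key: capabilities/storage/3/class
    value: ram
  - key: capabilities/storage/3/persistent
    value: "false"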

@anilmurty anilmurty changed the title from "SHM support" to "Support for requesting max SHM size from SDL" Feb 23, 2024
@brewsterdrinkwater brewsterdrinkwater moved this from Up Next (prioritized) to In Progress (prioritized) in Akash Cohesive Product / Engineering Roadmap Feb 23, 2024
boz added a commit to akash-network/provider that referenced this issue Feb 26, 2024
* Implement `"ram"` storage class with "empty dir" memory-backed
  volumes.
* No changes to resource accounting - service memory size
  must include size allocated to ram storage.

refs akash-network/support#179

Signed-off-by: Adam Bozanich <[email protected]>
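
To make the accounting note above concrete (a sketch; the 4Gi/2Gi figures are assumptions): with 2Gi of `ram` storage, the service's declared memory size must cover both the application's working RAM and the memory backing /dev/shm:

profiles:
  compute:
    app:
      resources:
        memory:
          size: 4Gi          # must include the 2Gi backing /dev/shm
        storage:
          - size: 10Gi       # regular ephemeral storage
          - name: shm
            size: 2Gi        # memory-backed; carved out of the 4Gi above
            attributes:
              class: ram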
boz added a commit to akash-network/node that referenced this issue Feb 26, 2024
* Add SDL support for `"ram"` storage class.
* `"ram"` volumes cannot be persistent or `ReadOnly`.

refs akash-network/support#179

Signed-off-by: Adam Bozanich <[email protected]>
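
A sketch of what that constraint rules out (attribute names as in the SDLs tested later in this thread):

storage:
  - name: shm
    size: 2Gi
    attributes:
      class: ram
      persistent: true       # invalid: a "ram" volume cannot be persistent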
boz added a commit to akash-network/provider that referenced this issue Feb 26, 2024
boz added a commit to akash-network/node that referenced this issue Feb 26, 2024
troian pushed a commit to akash-network/node that referenced this issue Feb 27, 2024
@brewsterdrinkwater (Collaborator)

March 4th, 2024:

  • This will be tested and then merged into the provider sometime this week.
  • Should be code complete for now.

@anilmurty anilmurty moved this from In Progress (prioritized) to In Test (or staging) in Akash Cohesive Product / Engineering Roadmap Mar 5, 2024
boz added a commit to akash-network/provider that referenced this issue Mar 7, 2024
boz added a commit to akash-network/provider that referenced this issue Mar 8, 2024
@brewsterdrinkwater (Collaborator)

March 12th, 2024:

  • Will be tested after the network upgrade.
  • Will merge after testing.

Update: this does not need a network upgrade, and there are no SDL changes.

troian pushed a commit to akash-network/provider that referenced this issue Mar 20, 2024
troian pushed a commit to akash-network/provider that referenced this issue Mar 20, 2024
troian pushed a commit to akash-network/provider that referenced this issue Mar 21, 2024
@andy108369 (Contributor)

akash network 0.32.2
provider-services 0.5.9

SHM doesn't seem to be working yet.

Provider attributes

$ provider-services query provider get akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk -o text
attributes:
- key: host
  value: akash
- key: organization
  value: overclock
- key: datacenter
  value: hurricane
- key: capabilities/gpu/vendor/nvidia/model/t4
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/t4/ram/16Gi
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/t4/ram/16Gi/interface/pcie
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/t4/interface/pcie
  value: "true"
- key: capabilities/storage/1/class
  value: default
- key: capabilities/storage/1/persistent
  value: "true"
- key: capabilities/storage/2/class
  value: beta3
- key: capabilities/storage/2/persistent
  value: "true"
- key: capabilities/storage/3/class
  value: ram
- key: capabilities/storage/3/persistent
  value: "false"
- key: ip-lease
  value: "true"
host_uri: https://provider.hurricane.akash.pub:8443
info:
  email: [email protected]
  website: https://akash.network
owner: akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk

SDL

---
version: "2.0"

services:
  ssh:
    image: ubuntu:22.04
    env:
      - 'SSH_PUBKEY=ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINNFxqDbY0BlEjJ2y9B2IKUUoimOq6oAC7WcsQT8qmII [email protected]'
    command:
      - "sh"
      - "-c"
    args:
      - 'apt-get update;
      apt-get install -y --no-install-recommends -- ssh;
      mkdir -p -m0755 /run/sshd;
      mkdir -m700 ~/.ssh;
      echo "$SSH_PUBKEY" | tee ~/.ssh/authorized_keys;
      chmod 0600 ~/.ssh/authorized_keys;
      ls -lad ~ ~/.ssh ~/.ssh/authorized_keys;
      md5sum ~/.ssh/authorized_keys;
      exec /usr/sbin/sshd -D'
    params:
      storage:
        shm:
          mount: /dev/shm
    expose:
      - port: 8080
        as: 80
        to:
          - global: true
      # SSH
      - port: 22
        as: 22
        to:
          - global: true


profiles:
  compute:
    ssh:
      resources:
        cpu:
          units: 1
        memory:
          size: 4Gi
        storage:
          - size: 10Gi
          - name: shm
            size: 2Gi
            attributes:
              class: ram
  placement:
    akash:
      attributes:
        host: akash
        #organization: someorg
      #signedBy:
      #  anyOf:
      #    - "akash1365yvmc4s7awdyj3n2sav7xfx76adc6dnmlx63"
      pricing:
        ssh:
          denom: uakt
          amount: 1000000

deployment:
  ssh:
    akash:
      profile: ssh
      count: 1

After send-manifest:

E[2024-03-30|21:36:59.567] applying deployment                          cmp=provider client=kube err="Deployment.apps "ssh" is invalid: spec.template.spec.containers[0].volumeMounts[0].name: Not found: "ssh-shm"" lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15662958/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk service=ssh
E[2024-03-30|21:36:59.567] unable to deploy lid=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15662958/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk. last known state:
cmp=provider client=kube
E[2024-03-30|21:36:59.567] deploying workload                           module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15662958/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=akash err="Deployment.apps "ssh" is invalid: spec.template.spec.containers[0].volumeMounts[0].name: Not found: "ssh-shm""
E[2024-03-30|21:36:59.567] execution error                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15662958/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=akash state=deploy-active err="Deployment.apps "ssh" is invalid: spec.template.spec.containers[0].volumeMounts[0].name: Not found: "ssh-shm""

@troian

@andy108369 (Contributor)

SDL (pers.volume + /dev/shm)

With two volumes - a persistent volume plus shm (ram) - I'm getting "manifest version validation failed" from the provider.

SDL:

---
version: "2.0"

services:
  ssh:
    image: ubuntu:22.04
    env:
      - 'SSH_PUBKEY=ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINNFxqDbY0BlEjJ2y9B2IKUUoimOq6oAC7WcsQT8qmII [email protected]'
    command:
      - "sh"
      - "-c"
    args:
      - 'apt-get update;
      apt-get install -y --no-install-recommends -- ssh;
      mkdir -p -m0755 /run/sshd;
      mkdir -m700 ~/.ssh;
      echo "$SSH_PUBKEY" | tee ~/.ssh/authorized_keys;
      chmod 0600 ~/.ssh/authorized_keys;
      ls -lad ~ ~/.ssh ~/.ssh/authorized_keys;
      md5sum ~/.ssh/authorized_keys;
      exec /usr/sbin/sshd -D'
    params:
      storage:
        data:
          mount: /root
        shm:
          mount: /dev/shm
    expose:
      - port: 8080
        as: 80
        to:
          - global: true
      # SSH
      - port: 22
        as: 22
        to:
          - global: true


profiles:
  compute:
    ssh:
      resources:
        cpu:
          units: 1
        memory:
          size: 4Gi
        storage:
          - size: 10Gi
          - name: data
            size: 5Gi
            attributes:
              persistent: true
              class: beta3
          - name: shm
            size: 2Gi
            attributes:
              class: ram
  placement:
    akash:
      attributes:
        host: akash
        #organization: someorg
      #signedBy:
      #  anyOf:
      #    - "akash1365yvmc4s7awdyj3n2sav7xfx76adc6dnmlx63"
      pricing:
        ssh:
          denom: uakt
          amount: 1000000

deployment:
  ssh:
    akash:
      profile: ssh
      count: 1

Client:

provider-services 0.5.9

arno@x1:~/git/akash-tools/cli-booster[https://rpc.akashnet.net:443][default][]$ akash_deploy ssh-shm-and-pers.yaml
INFO: Broadcasting 'provider-services deployment create -y --deposit 500000uakt -- ssh-shm-and-pers.yaml' transaction...
INFO: Waiting for the TX 1CCD212E8E216E23A168B32C922CDAED988C9710F2C15FF5FE601A61B6069BAB to get processed by the Akash network
INFO: Success

arno@x1:~/git/akash-tools/cli-booster[https://rpc.akashnet.net:443][default][15663077--1]$ akash_accept 
	rate	monthly	usd	dseq/gseq/oseq	provider					host
0>	1.00	0.42	$2.05	15663077/1/1	akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk	provider.hurricane.akash.pub:8443	
Choose your bid from the list [0]: 0
INFO: Accepting the bid offered by akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk provider for 15663077/1/1 deployment
INFO: Broadcasting 'provider-services market lease create -y' transaction...
INFO: Waiting for the TX 8C491ED0121C492912A0F38D57725758D66ABA16E9184531DF16EA4E70A976E4 to get processed by the Akash network
INFO: Success
8C491ED0121C492912A0F38D57725758D66ABA16E9184531DF16EA4E70A976E4

arno@x1:~/git/akash-tools/cli-booster[https://rpc.akashnet.net:443][default][15663077-1-1]$ akash_send_manifest ssh-shm-and-pers.yaml
Detected provider for 15663077/1/1: akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
[{"provider":"akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk","status":"FAIL","error":"remote server returned 500","errorMessage":"manifest version validation failed\n"}]
Error: submit manifest to some providers has been failed
ERROR: provider-services  send-manifest failed with '1' code.

Provider (v0.5.9):

$ kubectl -n akash-services logs akash-provider-0 --tail=100 -f | grep -Evi 'check|result|IP|replicas|dump'
Defaulted container "provider" out of: provider, init (init)
I[2024-03-30|21:47:55.201] order detected                               module=bidengine-service cmp=provider order=order/akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
I[2024-03-30|21:47:55.203] group fetched                                module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
I[2024-03-30|21:47:55.203] requesting reservation                       module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
D[2024-03-30|21:47:55.203] reservation requested. order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1, resources=[{"resource":{"id":1,"cpu":{"units":{"val":"1000"}},"memory":{"size":{"val":"4294967296"}},"storage":[{"name":"shm","size":{"val":"2147483648"},"attributes":[{"key":"class","value":"ram"},{"key":"persistent","value":"false"}]},{"name":"data","size":{"val":"5368709120"},"attributes":[{"key":"class","value":"beta3"},{"key":"persistent","value":"true"}]},{"name":"default","size":{"val":"10737418240"}}],"gpu":{"units":{"val":"0"}},"endpoints":[{"kind":1,"sequence_number":0},{"sequence_number":0}]},"count":1,"price":{"denom":"uakt","amount":"1000000.000000000000000000"}}] module=provider-cluster cmp=provider cmp=service cmp=inventory-service
D[2024-03-30|21:47:55.203] reservation count                            module=provider-cluster cmp=provider cmp=service cmp=inventory-service cnt=1
I[2024-03-30|21:47:55.203] Reservation fulfilled                        module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
D[2024-03-30|21:47:55.205] submitting fulfillment                       module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1 price=1.000000000000000000uakt


I[2024-03-30|21:48:01.322] bid complete                                 module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1


I[2024-03-30|21:48:13.520] lease won                                    module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1 lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-03-30|21:48:13.520] shutting down                                module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
I[2024-03-30|21:48:13.520] lease won                                    module=provider-manifest cmp=provider lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-03-30|21:48:13.520] new lease                                    module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
D[2024-03-30|21:48:13.521] watchdog start                               module=provider-manifest cmp=provider leaseID=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-03-30|21:48:13.525] data received                                module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 version=9d34f853c8a02e32abd64aaec0900c67abbdf9be5584177c33753198638d8ab3



I[2024-03-30|21:48:20.899] watchdog done                                module=provider-manifest cmp=provider lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077
I[2024-03-30|21:48:20.899] manifest received                            module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077
I[2024-03-30|21:48:20.901] data received                                module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 version=9d34f853c8a02e32abd64aaec0900c67abbdf9be5584177c33753198638d8ab3
I[2024-03-30|21:48:20.901] deployment version mismatch                  module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 expected=9D34F853C8A02E32ABD64AAEC0900C67ABBDF9BE5584177C33753198638D8AB3 got=95A422D963420A7C974C8F8B8EC0569CDA801EFB2B1C3595634F5891B5030A4E
E[2024-03-30|21:48:20.901] invalid manifest: %s                         module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 manifestversionvalidationfailed=(MISSING)
D[2024-03-30|21:48:20.901] requests valid                               module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 num-requests=0
E[2024-03-30|21:48:20.901] manifest submit failed                       cmp=provider err="manifest version validation failed"

troian added a commit to akash-network/provider that referenced this issue Mar 30, 2024
* feat(cluster/kube/builder): `"ram"` storage class

* Implement `"ram"` storage class with "empty dir" memory-backed
  volumes.
* No changes to resource accounting - service memory size
  must include size allocated to ram storage.

refs akash-network/support#179

Signed-off-by: Adam Bozanich <[email protected]>

* feat(shm): add e2e tests

Signed-off-by: Artur Troian <[email protected]>

---------

Signed-off-by: Adam Bozanich <[email protected]>
Signed-off-by: Artur Troian <[email protected]>
Co-authored-by: Adam Bozanich <[email protected]>
andy108369 added a commit to andy108369/helm-charts that referenced this issue Mar 31, 2024
andy108369 added a commit to akash-network/helm-charts that referenced this issue Mar 31, 2024
@andy108369 (Contributor)

I've tested provider-services 0.5.11 - everything is working there.

Details
akash-network/helm-charts#268

@troian troian closed this as completed Mar 31, 2024
@github-project-automation github-project-automation bot moved this from In Test (or staging) to Released (in Prod) in Akash Cohesive Product / Engineering Roadmap Mar 31, 2024