Support for requesting max SHM size from SDL #179
Per Feb 20 call: we are leaning towards implementing full support for SHM (not just the workaround with bid attributes). @boz is planning to take this on (thanks, Adam!).

In the interim, @troian and @chainzero are going to look into the workaround: using bid attributes plus a daemon running on the provider that checks the attributes and applies SHM using kubectl commands.
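The interim workaround described above would rest on Kubernetes memory-backed `emptyDir` volumes. A minimal sketch of the kind of pod spec such a daemon might apply (hypothetical example; the pod name, image, and sizes are illustrative and not taken from this thread):

```yaml
# Hypothetical example: a memory-backed emptyDir volume mounted at /dev/shm.
# Kubernetes backs medium: Memory volumes with tmpfs; sizeLimit caps the SHM size.
apiVersion: v1
kind: Pod
metadata:
  name: shm-demo              # illustrative name
spec:
  containers:
    - name: app
      image: ubuntu:22.04
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory        # tmpfs-backed
        sizeLimit: 1Gi        # requested max SHM size
```

A daemon could apply a patch along these lines (e.g. via `kubectl patch` or `kubectl apply`) to deployments whose bids carry the relevant attribute.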
* Implement `"ram"` storage class with "empty dir" memory-backed volumes.
* No changes to resource accounting - service memory size must include size allocated to ram storage.

refs akash-network/support#179
Signed-off-by: Adam Bozanich <[email protected]>

* Add SDL support for `"ram"` storage class.
* `"ram"` volumes cannot be persistent or `ReadOnly`.

refs akash-network/support#179
Signed-off-by: Adam Bozanich <[email protected]>
March 4th, 2024:
* Implement `"ram"` storage class with "empty dir" memory-backed volumes.
* No changes to resource accounting - service memory size must include size allocated to ram storage.

refs akash-network/support#179
Signed-off-by: Adam Bozanich <[email protected]>
March 12th, 2024:
Does not need a network upgrade. No SDL changes.
* Implement `"ram"` storage class with "empty dir" memory-backed volumes.
* No changes to resource accounting - service memory size must include size allocated to ram storage.

refs akash-network/support#179
Signed-off-by: Adam Bozanich <[email protected]>
Akash Network 0.32.2: SHM doesn't seem to be working yet.

Provider attributes:

SDL:

After send-manifest:
SDL (persistent volume + /dev/shm): in the case of two volumes, a persistent volume plus shm (ram), I'm getting "manifest version validation failed" from the provider.

SDL:

Client:

Provider (v0.5.9):
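For context, a two-volume manifest like the one described might look roughly like this (a hypothetical sketch assembled from the Akash SDL conventions and the `"ram"` class named in the commits above; service name, image, sizes, and the `beta2` class are illustrative assumptions, and the exact syntax may differ):

```yaml
# Hypothetical SDL fragment: one persistent volume plus one ram-backed volume.
services:
  app:
    image: myorg/trainer:latest     # illustrative image
    params:
      storage:
        data:
          mount: /data
        shm:
          mount: /dev/shm
profiles:
  compute:
    app:
      resources:
        cpu:
          units: 2
        memory:
          size: 4Gi                 # per the commits, must also cover the ram volume
        storage:
          - name: data
            size: 100Gi
            attributes:
              persistent: true
              class: beta2          # assumed persistent storage class
          - name: shm
            size: 1Gi
            attributes:
              class: ram            # memory-backed; cannot be persistent or ReadOnly
```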
* feat(cluster/kube/builder): `"ram"` storage class
* Implement `"ram"` storage class with "empty dir" memory-backed volumes.
* No changes to resource accounting - service memory size must include size allocated to ram storage.
* feat(shm): add e2e tests

refs akash-network/support#179
Signed-off-by: Adam Bozanich <[email protected]>
Signed-off-by: Artur Troian <[email protected]>
Co-authored-by: Adam Bozanich <[email protected]>
I've tested provider-services 0.5.11 - everything is working there.
Is your feature request related to a problem? Please describe.
Customers (particularly AI/ML training workloads) frequently need multiple services to share storage. For example, one service that downloads and labels data is CPU-bound, while another that uses that data for training is GPU-bound; the two can run in parallel but need access to large shared memory. We currently don't allow the max SHM size to be controlled by the user, which makes it hard to run such workloads.
Describe the solution you'd like
Support specifying and requesting the SHM size as part of the SDL.
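Based on the linked commits, this took the shape of a `"ram"` storage class in the SDL. A minimal sketch of what such a request might look like (hypothetical service name, image, and sizes; the syntax is assumed from the commit messages elsewhere in this thread, not verified):

```yaml
# Hypothetical minimal SDL fragment requesting a fixed SHM size via a
# memory-backed ("ram" class) volume mounted at /dev/shm.
services:
  trainer:
    image: myorg/trainer:latest   # illustrative
    params:
      storage:
        shm:
          mount: /dev/shm
profiles:
  compute:
    trainer:
      resources:
        cpu:
          units: 4
        memory:
          size: 8Gi               # per the commits: must include the ram volume size
        storage:
          - name: shm
            size: 2Gi
            attributes:
              class: ram          # cannot be persistent or ReadOnly
```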
Describe alternatives you've considered
Note that we have tested applying these changes manually on the provider side during our work with Thumper training on the FoundryStaking provider.
Additional context
No response