Fully Dockerized container of Grok for Akash #509

Open
wants to merge 13 commits into
base: master

Conversation

Zblocker64

No description provided.

@andy108369
Collaborator

andy108369 commented Mar 18, 2024

Thank you for the PR!

This has been tested only up to the SHM-related error.

It awaits akash-network/support#179 first.

If you have access to the provider, you can run it by setting up /dev/shm as a memory-backed (Memory) K8s path, as explained in #507 (comment)
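
For anyone verifying the setup from inside the pod, a quick check that /dev/shm is actually memory-backed can save a failed run. The sketch below assumes a memory-backed Kubernetes volume shows up as a tmpfs mount in /proc/mounts; the function name is_tmpfs_mount and its optional second argument are made up for illustration and are not part of this PR.

```shell
#!/bin/sh
# is_tmpfs_mount PATH [MOUNTS_FILE]
# Succeeds (exit 0) if PATH is listed as a tmpfs mount point.
# MOUNTS_FILE defaults to /proc/mounts; the override exists only so the
# check can be exercised against a sample file (an illustrative assumption).
is_tmpfs_mount() {
    target="$1"
    mounts="${2:-/proc/mounts}"
    # /proc/mounts columns: device, mount point, filesystem type, options, ...
    awk -v t="$target" '$2 == t && $3 == "tmpfs" { found = 1 } END { exit !found }' "$mounts"
}

if is_tmpfs_mount /dev/shm 2>/dev/null; then
    echo "/dev/shm is a tmpfs mount"
else
    echo "/dev/shm is not a tmpfs mount (or /proc/mounts is unavailable)"
fi
```

This only confirms the mount type; it says nothing about whether the volume is large enough for the checkpoint.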

Grok/deploy.yaml Outdated Show resolved Hide resolved
Grok/deploy.yaml Show resolved Hide resolved
@andy108369
Collaborator

@Zblocker64 it appears you are using the /dev/shm => /root/shm workaround; please remove it:

root@grok-1-596d68d5c7-5cq9f:/app# ps auxwwf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          32  0.0  0.0   4608  2016 pts/0    Ss   20:14   0:00 bash
root         206  0.0  0.0   8480  2016 pts/0    R+   20:15   0:00  \_ ps auxwwf
root           1  0.0  0.0   2576     0 ?        Ss   20:12   0:00 /bin/sh -c pip install -U "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html --user ; huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/* --local-dir /app/checkpoints --local-dir-use-symlinks False ;  mv /app/checkpoints/ckpt /app/checkpoints/ckpt-0 ; mkdir /root/shm ; sed -i "s;/dev/shm/;/root/shm/;g" /app/checkpoint.py ; pip install -r requirements.txt ; python run.py
root          22  284  0.0 715020 323064 ?       Sl   20:13   6:51 /usr/local/bin/python /usr/local/bin/huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/* --local-dir /app/checkpoints --local-dir-use-symlinks False

Additionally, it is suggested to use pip install -r requirements.txt instead of pip install <one-by-one-manually>
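
As a rough illustration of the request above, the start sequence from the ps output could look like this once the mkdir /root/shm and sed steps are dropped. This is a sketch, not the PR's actual entrypoint: the DRY_RUN switch and the run and start_grok helper names are hypothetical, added only so the steps can be inspected without downloading anything. The --include pattern is quoted here so the shell does not expand it.

```shell
#!/bin/sh
# Start sequence from the ps output above, minus the /dev/shm => /root/shm
# workaround (the "mkdir /root/shm" and "sed ... /app/checkpoint.py" steps).
# DRY_RUN is a hypothetical switch for illustration: with the default of 1
# each step is only printed; set DRY_RUN=0 to actually execute the steps.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

start_grok() {
    run pip install -U "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html --user
    run huggingface-cli download xai-org/grok-1 --repo-type model --include 'ckpt-0/*' --local-dir /app/checkpoints --local-dir-use-symlinks False
    # Kept as in the original command line.
    run mv /app/checkpoints/ckpt /app/checkpoints/ckpt-0
    run pip install -r requirements.txt
    run python run.py
}

start_grok
```

Like the original `;`-chained command, this keeps going after a failed step; adding `set -e` would be one way to stop on the first error.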

refs.

  1. Readme https://github.com/xai-org/grok-1
  2. Segmentation fault in K8s Pod (8x H100's) xai-org/grok-1#164 (comment)

@Zblocker64
Author

Zblocker64 commented Mar 18, 2024

@Zblocker64 it appears you are using the /dev/shm => /root/shm workaround; please remove it:

root@grok-1-596d68d5c7-5cq9f:/app# ps auxwwf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          32  0.0  0.0   4608  2016 pts/0    Ss   20:14   0:00 bash
root         206  0.0  0.0   8480  2016 pts/0    R+   20:15   0:00  \_ ps auxwwf
root           1  0.0  0.0   2576     0 ?        Ss   20:12   0:00 /bin/sh -c pip install -U "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html --user ; huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/* --local-dir /app/checkpoints --local-dir-use-symlinks False ;  mv /app/checkpoints/ckpt /app/checkpoints/ckpt-0 ; mkdir /root/shm ; sed -i "s;/dev/shm/;/root/shm/;g" /app/checkpoint.py ; pip install -r requirements.txt ; python run.py
root          22  284  0.0 715020 323064 ?       Sl   20:13   6:51 /usr/local/bin/python /usr/local/bin/huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/* --local-dir /app/checkpoints --local-dir-use-symlinks False

Additionally, it is suggested to use pip install -r requirements.txt instead of pip install <one-by-one-manually>

refs.

  1. Readme https://github.com/xai-org/grok-1
  2. python3 process exits eventually (8x h100's) xai-org/grok-1#164 (comment)

Just pushed an update to Docker Hub. You can use latest or 1.0 as the tag.

@andy108369
Collaborator

andy108369 commented Mar 18, 2024

I've tested your image with /dev/shm enabled for the pod (set up from the K8s host), and it eventually segfaults:

image
image

Upstream issue xai-org/grok-1#164 (comment)

Refs.

xai-org/grok-1#164 (comment)
#507 (comment)
xai-org/grok-1#152 (comment)

@andy108369
Collaborator

Please do not use this image (or any xai-org grok-1 image) on H100s!
It still locks up the latest NVIDIA driver (550.54.15), which then forces us to reboot these nodes.

Details
xai-org/grok-1#164 (comment)
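
One way to act on this warning is a guard that refuses to start when an H100 is present. The following is only a sketch under assumptions: the helper names gpu_info, has_h100 and check_gpus are hypothetical, and the H100 match is a plain substring test on the GPU name reported by nvidia-smi.

```shell
#!/bin/sh
# gpu_info prints one "name, driver_version" line per GPU.
gpu_info() {
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}

# has_h100 reads gpu_info-style lines on stdin and succeeds if any GPU
# name mentions H100 (simple substring match, an illustrative heuristic).
has_h100() {
    grep -qi 'H100'
}

# check_gpus returns 0 when it is safe to continue, 1 when an H100 is
# present, and 2 when nvidia-smi is unavailable.
check_gpus() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; cannot check GPU model" >&2
        return 2
    fi
    if gpu_info | has_h100; then
        echo "H100 detected: refusing to start grok-1 (see xai-org/grok-1#164)" >&2
        return 1
    fi
    return 0
}
```

Calling `check_gpus || exit 1` before `python run.py` would stop the container early on affected nodes instead of locking up the driver.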


2 participants