Default AMI Fails to detect Nvidia driver on AWS g6e #1480

Open
OLSecret opened this issue Oct 18, 2024 · 4 comments

@OLSecret commented Oct 18, 2024
I'm getting the following error:

Digest: sha256:5e8ed922ecacdb1071096eebef5af11563fd0c2c8bce9143ea3898768994080f
  Status: Downloaded newer image for iterativeai/cml:0-dvc3-base1-gpu
  docker.io/iterativeai/cml:0-dvc3-base1-gpu
  /usr/bin/docker create --name 41bde5f6557b4c82bb0400b08e5ca5b0_iterativeaicml0dvc3base1gpu_78f5fb --label 380bf3 --workdir /__w/SecretModels/SecretModels --network github_network_5168857de2994b2fabc54139db02ee1f --gpus all -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work":"/__w" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/externals":"/__e":ro -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_temp":"/__w/_temp" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_actions":"/__w/_actions" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_tool":"/__w/_tool" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_temp/_github_home":"/github/home" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" iterativeai/cml:0-dvc
  215e811a3e2b95ada680b3f3db404ac68abec62295af2daf3a516db6e0d4099a
  /usr/bin/docker start 215e811a3e2b95ada680b3f3db404ac68abec62295af2daf3a516db6e0d4099a
  Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
  nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
  Error: failed to start containers: 215e811a3e2b95ada680b3f3db404ac68abec62295af2daf3a516db6e0d4099a
  Error: Docker start fail with exit code 1

from a setup like this:

name: model-style-train-on-manual_call

on:
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Hugging Face model name to use for training'
        required: true
        default: 'euclaise/gpt-neox-122m-minipile-digits'

jobs:
  launch-runner:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      actions: write
    steps:
      - uses: actions/setup-node@v3
        with:
          node-version: '16'
      - uses: actions/setup-python@v4
        with:
          python-version: '3.x'
      - uses: actions/checkout@v3
      - uses: iterative/setup-cml@v2
      - name: Deploy runner on EC2
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.CML_AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.CML_AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner launch \
              --cloud=aws \
              --cloud-hdd-size=256 \
              --cloud-region=us-west-2 \
              --cloud-type=g6e.xlarge \
              --cloud-gpu=v100 \
              --labels=cml-gpu

  run:
    needs: launch-runner
    runs-on: [self-hosted, cml-gpu]
    container:
      image: docker://iterativeai/cml:0-dvc3-base1-gpu
      options: --gpus all
    timeout-minutes: 40000
    permissions:
      contents: read
      actions: write
    steps:
      - uses: actions/setup-node@v3
        with:
          node-version: '16'
      - uses: actions/checkout@v3
      - uses: robinraju/release-downloader@v1
        with:
          tag: 'style'
          fileName: '*.jsonl'
      - name: Train models
        env:
          GITHUB_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          REPO_TOKEN: ${{ github.token }}
          DEBIAN_FRONTEND: noninteractive
          MODEL_NAME: ${{ github.event.inputs.model_name }}
        run: |
          echo $NODE_OPTIONS
@0x2b3bfa0 (Member) commented Oct 23, 2024

This issue could be literally anything related to GPU drivers.

Please run a non-GPU workload like sleep infinity and SSH into the instance using either these instructions or e.g. mxschmitt/action-tmate; then take a look at journalctl in case the GPU drivers failed to install, and run nvidia-smi to check whether the host detects the GPU outside the container runtime...

I currently can't be of much help, but with these hints you should be able to find out what's happening.
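For reference, a minimal sketch of those checks once you have a shell on the instance (assuming a systemd-based AMI; exact package and log contents vary by image):

  # Does the host see the GPU at all? This fails if the driver never loaded.
  nvidia-smi

  # Is the NVIDIA kernel module actually loaded?
  lsmod | grep nvidia

  # Look for driver installation failures since boot
  sudo journalctl -b | grep -i nvidia | tail -n 50

If nvidia-smi works on the host but the container still fails, the problem is more likely in the nvidia-container-toolkit layer than in the driver itself.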

@ajithvcoder

@OLSecret @0x2b3bfa0 is it possible to specify which AMI ID to use, either a private or a public one I have in AWS? I'm getting CUDA 11.4 with a T4 GPU when launching a g4dn.xlarge instance this way, whereas launching it manually I get CUDA 12.4 with a T4 GPU.
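(Note that nvidia-smi reports the installed driver version and the highest CUDA version that driver supports, not the CUDA toolkit inside a container, so comparing the two instances with a query like the following should show whether the AMIs ship different driver branches:

  # Print driver version and GPU name; e.g. 470.x drivers top out at CUDA 11.4,
  # while 550.x drivers support up to CUDA 12.4
  nvidia-smi --query-gpu=driver_version,name --format=csv

A lower reported CUDA version therefore usually means an older driver on the AMI, not a different GPU.)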

@0x2b3bfa0 (Member) commented Nov 30, 2024 via email
