Compute compatibility? #6

Open
slerman12 opened this issue Jun 10, 2021 · 6 comments

Comments

@slerman12

Would you happen to have a rough estimate of the compute needed to run this model? Unfortunately, we have very limited compute available, and I am getting memory allocation errors when running with the default settings.

Thank you for any support.

@wilson1yan
Owner

The model should run on 4 GPUs with ~24GB of memory each. I will change the default batch size in scripts/train_videogpt.py, as it should be something like 4 or 8 (batch size per GPU) to get a total batch size across all GPUs of around 32.
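The batch size arithmetic in the paragraph above can be sketched as follows. This is only an illustration: `per_gpu_batch_size` is a hypothetical helper, not part of the VideoGPT repo, and it assumes the total batch size is split evenly across GPUs.

```python
def per_gpu_batch_size(total_batch_size: int, num_gpus: int) -> int:
    """Split a target total batch size evenly across GPUs (hypothetical helper)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if total_batch_size % num_gpus != 0:
        raise ValueError("total batch size must divide evenly across GPUs")
    return total_batch_size // num_gpus


# Figures from the comment above: 4 GPUs and a total batch size of around 32.
print(per_gpu_batch_size(32, 4))  # 8

# Assumption for illustration: the same total on only 2 GPUs doubles the
# per-GPU share, which is the quantity that drives per-GPU memory use.
print(per_gpu_batch_size(32, 2))  # 16
```

With fewer GPUs, keeping the per-GPU batch size at 4 or 8 means the total batch size drops below 32 accordingly.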

If you haven't tried it yet, I also suggest using sparse attention, as you get some memory usage reduction and speed-up when training the model.

@slerman12
Author

slerman12 commented Jun 11, 2021

Thank you so much! I'll give that a try.

@slerman12
Author

Don't want to keep prodding you, but I ran the provided Sparse Attention installation script:

sudo apt-get install llvm-9-dev

And received this trace:

Reading package lists... Done
Building dependency tree    
Reading state information... Done
E: Unable to locate package llvm-9-dev

I tried installing llvm another way:

bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"

This worked, but the subsequent deepspeed install command did not:

Command errored out with exit status 1

@slerman12 slerman12 reopened this Jun 11, 2021
@wilson1yan
Owner

Hmm, not too sure what the issue is. Have you tried running sudo apt update or sudo apt-get install before installing llvm-9-dev? This page might also have some useful information.

For the deepspeed install, do you know what the exact error was?

@slerman12
Author

The trace is pretty long, but I think it was this:

csrc/sparse_attention/utils.cpp:110:90: warning: narrowing conversion of ‘H’
 from ‘size_t {aka long unsigned int}’ to ‘long int’ inside { } [-Wnarrowing]
    error: command '/usr/bin/gcc' failed with exit code 1

Maybe our system has some issue with gcc? I'm not too familiar with this system-level stuff.

@wilson1yan
Owner

I believe that is essentially the same error you mentioned above (failed with exit code 1). The line right above it is just a warning, not the error. The actual error should be somewhere further up in the logs.
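One way to dig the real error out of a long build trace is to filter for lines that look like errors and skip the warnings. The following is a minimal sketch, assuming the install output has been saved as text; `find_error_lines` and the regular expression are illustrative choices, not anything provided by DeepSpeed or pip.

```python
import re
from typing import List, Tuple

# Matches compiler-style error lines such as "file.cpp:12:3: error: ..." or
# "fatal error: ...". This pattern is an assumption about what the log looks like.
ERROR_PATTERN = re.compile(r"\b(fatal error|error):", re.IGNORECASE)


def find_error_lines(log_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines that look like errors, not warnings."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ERROR_PATTERN.search(line) and "warning:" not in line.lower():
            hits.append((number, line.strip()))
    return hits


# Example input: the first and last lines come from the trace quoted in this
# thread; the middle line is a made-up placeholder for a real compiler error.
sample_log = """\
csrc/sparse_attention/utils.cpp:110:90: warning: narrowing conversion of 'H'
csrc/hypothetical_file.cpp:42:10: fatal error: hypothetical.h: No such file or directory
error: command '/usr/bin/gcc' failed with exit code 1
"""

for number, line in find_error_lines(sample_log):
    print(f"{number}: {line}")
```

The first hit is usually the one worth reading; the final `failed with exit code 1` line only reports that an earlier step failed.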

Have you tried looking at some of the GitHub issues on the DeepSpeed repo that might be relevant, such as this one?

One other option is to try out the Dockerfile in the other VideoGPT-related repo.

2 participants