Compute compatibility? #6

slerman12 · 2021-06-10T15:49:04Z

Would you happen to have a rough estimate of the kind of compute needed to run this model? Unfortunately, we are subject to a very limited compute scenario and I am getting memory allocation errors when trying to run under the default settings.

Thank you for any support.

wilson1yan · 2021-06-10T19:25:39Z

The model should run on 4 GPUs with ~24GB of memory each. I will change the default batch size in scripts/train_videogpt.py, as it should be something like 4 or 8 (batch size per GPU) to get a total batch size across all GPUs of around 32.
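The batch-size arithmetic above can be sketched as follows. This is only an illustration: `per_gpu_batch_size` is a hypothetical helper, not something defined in scripts/train_videogpt.py, and it assumes the total batch size is split evenly across GPUs.

```python
def per_gpu_batch_size(total_batch_size: int, num_gpus: int) -> int:
    """Split a target total batch size evenly across GPUs (hypothetical helper)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if total_batch_size % num_gpus != 0:
        raise ValueError("total batch size must divide evenly across GPUs")
    return total_batch_size // num_gpus


# A total batch size of 32 on 4 GPUs gives 8 per GPU; on 8 GPUs it gives 4.
print(per_gpu_batch_size(32, 4))
print(per_gpu_batch_size(32, 8))
```

With fewer or smaller GPUs, keeping the per-GPU value low is what reduces memory use, at the cost of a smaller total batch.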

If you haven't tried it yet, I also suggest using sparse attention, as you get some memory usage reduction and speed-up when training the model.

slerman12 · 2021-06-11T14:42:51Z

Thank you so much! I'll give that a try.

slerman12 · 2021-06-11T17:42:01Z

Don't want to keep prodding you, but I ran the provided Sparse Attention installation script:

sudo apt-get install llvm-9-dev

And received this trace:

Reading package lists... Done
Building dependency tree    
Reading state information... Done
E: Unable to locate package llvm-9-dev

I tried installing llvm another way:

bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"

This worked, but the subsequent DeepSpeed install command did not:

Command errored out with exit status 1

wilson1yan · 2021-06-11T20:44:18Z

Hmm, not too sure what the issue is. Have you tried running sudo apt update or sudo apt-get update before installing llvm-9-dev? This page might also have some useful information.

For the deepspeed install, do you know what the exact error was?

slerman12 · 2021-06-12T22:48:31Z

The trace is pretty long, but I think it was this:

csrc/sparse_attention/utils.cpp:110:90: warning: narrowing conversion of ‘H’
 from ‘size_t {aka long unsigned int}’ to ‘long int’ inside { } [-Wnarrowing]
    error: command '/usr/bin/gcc' failed with exit code 1

Maybe our system has some issue with gcc? I'm not too familiar with this system-level stuff.

wilson1yan · 2021-06-15T04:57:55Z

I believe that is essentially the same error you mentioned above: the "failed with exit code 1" line only reports that the build failed, and the line right above it is just a warning, not the error. The actual error should be somewhere further up in the logs.
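One way to find the real error in a long build trace is to filter out warnings and list only the lines that contain an error marker. The sketch below is a minimal example under stated assumptions: `find_error_lines` is a hypothetical helper, the pattern only covers gcc-style `error:` and `fatal error:` markers, and `SAMPLE_LOG` is made-up text for illustration, not the actual trace from this thread.

```python
import re
from typing import List, Tuple

# Matches gcc-style markers such as "fatal error:" and "error:".
ERROR_PATTERN = re.compile(r"\b(fatal error|error):", re.IGNORECASE)


def find_error_lines(log_text: str) -> List[Tuple[int, str]]:
    """Return (line number, text) for lines with an error marker, skipping warnings."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ERROR_PATTERN.search(line) and "warning:" not in line.lower():
            hits.append((number, line.strip()))
    return hits


# Made-up log text for illustration only.
SAMPLE_LOG = """\
csrc/example.cpp:110:90: warning: narrowing conversion [-Wnarrowing]
csrc/example.cpp:12:10: fatal error: example_header.h: No such file or directory
error: command '/usr/bin/gcc' failed with exit code 1
"""

for number, text in find_error_lines(SAMPLE_LOG):
    print(number, text)
```

The earliest hit is usually the most informative one; the final "failed with exit code" line is only the summary.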

Have you tried looking at GitHub issues on the DeepSpeed repo that might be relevant, such as this one?

One other option is to try out the Dockerfile in the other VideoGPT-related repo.

slerman12 closed this as completed Jun 11, 2021

slerman12 reopened this Jun 11, 2021
