About ERROR:root:Failed to execute the training process: name 'str2optimizer8bit_blockwise' is not defined #180

Open
HEUfcy opened this issue Aug 10, 2024 · 2 comments


HEUfcy commented Aug 10, 2024

Thank you very much for your excellent work. I am now encountering this problem while training my model in a virtual environment.

The passed generator was created on 'cpu' even though a tensor on cuda:0 was expected. Tensors will be created on 'cpu' and then moved to cuda:0. Note that one can probably slighly speed up this function by passing a generator that was created on the cuda:0 device.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:24<00:00, 1.65it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 33.16it/s]
Moviepy - Building video ./exp_output/stage2/validation/1_1_1.mp4.
MoviePy - Writing audio in 1_1_1TEMP_MPY_wvf_snd.mp4
MoviePy - Done.
Moviepy - Writing video ./exp_output/stage2/validation/1_1_1.mp4

Moviepy - Done !
Moviepy - video ready ./exp_output/stage2/validation/1_1_1.mp4
Steps: 0%| | 1/3000 [06:10<6:40:15, 8.01s/it, lr=1e-5, step_loss=0.271, td=3.17s][2024-08-10 10:01:34,981] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
Steps: 0%| | 2/3000 [06:20<185:23:13, 222.61s/it, lr=1e-5, step_loss=0.258, td=4.30s][2024-08-10 10:01:44,611] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
Steps: 0%|▏ | 3/3000 [06:30<104:21:45, 125.36s/it, lr=1e-5, step_loss=0.371, td=3.83s][2024-08-10 10:01:53,991] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
Steps: 0%|▏ | 4/3000 [06:39<66:13:19, 79.57s/it, lr=1e-5, step_loss=0.374, td=3.64s][2024-08-10 10:02:03,559] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
Steps: 0%|▎ | 5/3000 [06:49<45:11:54, 54.33s/it, lr=1e-5, step_loss=0.373, td=4.09s][2024-08-10 10:02:13,085] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
Steps: 0%|▎ | 6/3000 [06:58<32:30:51, 39.10s/it, lr=1e-5, step_loss=0.262, td=3.67s][2024-08-10 10:02:21,432] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
Steps: 0%|▍ | 7/3000 [07:07<24:08:47, 29.04s/it, lr=1e-5, step_loss=0.259, td=3.89s][2024-08-10 10:02:31,178] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432, reducing to 16777216
Steps: 0%|▍ | 8/3000 [07:17<19:01:57, 22.90s/it, lr=1e-5, step_loss=0.297, td=3.85s][2024-08-10 10:02:40,832] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216, reducing to 8388608
Steps: 0%|▌ | 9/3000 [07:26<15:35:07, 18.76s/it, lr=1e-5, step_loss=0.284, td=3.83s][2024-08-10 10:02:50,562] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608, reducing to 4194304
Steps: 0%|▌ | 10/3000 [07:36<13:15:55, 15.97s/it, lr=1e-5, step_loss=0.316, td=3.91s][2024-08-10 10:03:00,072] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304, reducing to 2097152
Steps: 0%|▋ | 11/3000 [07:45<11:37:07, 13.99s/it, lr=1e-5, step_loss=0.243, td=3.69s]
ERROR:root:Failed to execute the training process: name 'str2optimizer8bit_blockwise' is not defined
ERROR:root:Failed to execute the training process: name 'str2optimizer8bit_blockwise' is not defined
ERROR:root:Failed to execute the training process: name 'str2optimizer8bit_blockwise' is not defined
c

xumingw (Contributor) commented Aug 14, 2024

HEUfcy (Author) commented Aug 19, 2024

Thank you for your reply.
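
For anyone landing here with the same error: the name `str2optimizer8bit_blockwise` appears to come from `bitsandbytes.functional`, where the 8-bit optimizer lookup tables are, as far as I know, only defined when the package manages to load its compiled CUDA library. The sketch below is not from this repository. It is a small check, assuming that module layout, that reports whether the package is installed, whether it imports, and whether the missing name exists. The helper name `diagnose_bitsandbytes` and the keys of the returned dict are hypothetical, chosen only for illustration.

```python
import importlib
import importlib.util


def diagnose_bitsandbytes(module_name: str = "bitsandbytes") -> dict:
    """Report whether the package imports and exposes the name from the error.

    Assumption: the name lives in `<module_name>.functional`; this may differ
    between bitsandbytes versions.
    """
    report = {
        "installed": False,
        "imported": False,
        "has_blockwise_table": False,
        "error": None,
    }

    # find_spec locates a top-level package without importing it.
    if importlib.util.find_spec(module_name) is None:
        return report
    report["installed"] = True

    try:
        functional = importlib.import_module(f"{module_name}.functional")
    except Exception as exc:  # a broken install can fail in many ways
        report["error"] = repr(exc)
        return report
    report["imported"] = True

    # If this is False, 8-bit optimizers would raise the NameError shown above.
    report["has_blockwise_table"] = hasattr(
        functional, "str2optimizer8bit_blockwise"
    )
    return report


if __name__ == "__main__":
    print(diagnose_bitsandbytes())
```

If `has_blockwise_table` comes back `False` inside the virtual environment used for training, the likely cause is a bitsandbytes build that does not match the installed CUDA/PyTorch combination. Running `python -m bitsandbytes` in the same environment prints the package's own diagnostics and may narrow this down further.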
