
cannot run PACO train on prod (vGPU server) #1181

Closed
homework36 opened this issue Jun 19, 2024 · 3 comments

homework36 (Contributor) commented Jun 19, 2024

We can now clearly access the GPU on the production server. Classification works, but PACO training always fails with the message below, even though the same workflow and input files completed successfully on staging:

Task Training model for Patchwise Analysis of Music Document, Training[eacd36d5-c8dd-4b02-b9cd-38ca31c92959] raised unexpected: RuntimeError("The job did not produce the output file for Background Model.\n\n{'Log File': [{'resource_type': 'text/plain', 'uuid': UUID('c031c7c9-c86d-481f-ae62-59f0b2491828'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/0fd193e7-8b3d-45a0-84d8-b99c0b2b8fc0'}], 'Background Model': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('c7a3b8ca-429a-4fcd-be92-2957e00497ba'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/faebcd1f-c384-44b9-9572-25f5ac27b12e'}], 'Model 1': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('eea42920-5100-4a57-9604-7688d582b482'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/a69ba529-0053-4c7f-bc3d-333942300b15'}], 'Model 2': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('3fd388db-1cf3-4400-9fb7-712bd3f6738e'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/03ac3bb9-3323-42a0-b989-5eadae3a0529'}]}")
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 412, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/code/Rodan/rodan/jobs/base.py", line 843, in run
    ).format(opt_name, outputs)
RuntimeError: The job did not produce the output file for Background Model.
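
For context, the check that raises this error appears to be a simple post-run existence test on each declared output. Here is a minimal sketch of what such a check might look like, based only on the traceback above (the function and variable names are guesses, not the actual Rodan source):

```python
import os

# Hypothetical reconstruction of the output-existence check near
# rodan/jobs/base.py line 843; names are assumptions inferred from
# the traceback, not the actual Rodan source.
def verify_outputs(outputs):
    for opt_name, resources in outputs.items():
        for resource in resources:
            if not os.path.isfile(resource["resource_temp_path"]):
                raise RuntimeError(
                    "The job did not produce the output file for {}.\n\n{}".format(
                        opt_name, outputs
                    )
                )
```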

I thought it was an out-of-memory issue, so I didn't think too much of it, since we are still waiting for the larger vGPU instance. However, after testing and looking into this further, it seems to be a different problem. I have closed the vGPU driver issue (#1170) and will work on this instead.


I'm baffled. The directory tmpa8eq61gp was created successfully with all the necessary permissions, but no HDF5 files were written during the process. There were also no other logs or error messages to help identify the exact cause. Since the same workflow runs without any issue on staging, I don't think it's a bug in the PACO repo; /rodan-main/code/rodan/jobs/base.py is also just checking whether the file exists. So it looks like it might be a bug in the Rodan PACO wrapper or something else.
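
To confirm that no HDF5 files are ever written (rather than written and then deleted), something like the following can be run on the worker while the training task executes. This is just a debugging sketch; the glob pattern is an assumption based on the /tmp/tmpXXXX paths in the error above:

```python
import glob
import time

# Poll the worker's temp directories for a few minutes and report whether
# any files (in particular .hdf5 outputs) ever appear.
def watch_tmp(pattern="/tmp/tmp*/*", interval=5, duration=300):
    deadline = time.time() + duration
    while time.time() < deadline:
        files = glob.glob(pattern)
        print(time.strftime("%H:%M:%S"), "{} file(s)".format(len(files)), files[:5])
        time.sleep(interval)

if __name__ == "__main__":
    watch_tmp()
```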

This is also strange because, if training has not started, no HDF5 files would have been written yet, of course. In this case, however, it is the missing output file that seems to prevent training, which suggests the job is failing before training even begins.
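
One way to tell the two cases apart: the outputs dict in the error above also includes a text/plain "Log File" resource. If that file can be read before the worker's temp directory is cleaned up, its contents (or absence) should show whether training ever started. A quick check, using the path taken verbatim from the error message:

```python
# Path copied from the 'Log File' entry in the error above; it only
# exists while the worker's temp directory is still around.
log_path = "/tmp/tmpa8eq61gp/0fd193e7-8b3d-45a0-84d8-b99c0b2b8fc0"

try:
    with open(log_path) as f:
        print(f.read())
except FileNotFoundError:
    print("Log file was never written either; the job likely failed "
          "before training began.")
```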

The same error is reproduced on a local machine with an Intel chip (where we don't have the GPU container problem that affects Apple Silicon / ARM machines).

homework36 (Contributor, Author) commented Jul 10, 2024

Since we are likely to eventually use the distributed version for Rodan prod (everything else has been set up successfully there, see #1184), I will do all the testing on the current single-instance version (rodan2.simssa.ca).

homework36 (Contributor, Author) commented:

The issue has now been transferred to the PACO repo here.

homework36 added the help-wanted label Jul 16, 2024
homework36 (Contributor, Author) commented:

We can run training now.
