We can clearly access the GPU now on the production server. Classification works, but PACO training always fails with the message below, even though the same workflow and input files finish successfully on staging:
Task Training model for Patchwise Analysis of Music Document, Training[eacd36d5-c8dd-4b02-b9cd-38ca31c92959] raised unexpected: RuntimeError("The job did not produce the output file for Background Model.\n\n{'Log File': [{'resource_type': 'text/plain', 'uuid': UUID('c031c7c9-c86d-481f-ae62-59f0b2491828'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/0fd193e7-8b3d-45a0-84d8-b99c0b2b8fc0'}], 'Background Model': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('c7a3b8ca-429a-4fcd-be92-2957e00497ba'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/faebcd1f-c384-44b9-9572-25f5ac27b12e'}], 'Model 1': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('eea42920-5100-4a57-9604-7688d582b482'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/a69ba529-0053-4c7f-bc3d-333942300b15'}], 'Model 2': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('3fd388db-1cf3-4400-9fb7-712bd3f6738e'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/03ac3bb9-3323-42a0-b989-5eadae3a0529'}]}")
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 412, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 704, in __protected_call__
return self.run(*args, **kwargs)
File "/code/Rodan/rodan/jobs/base.py", line 843, in run
).format(opt_name, outputs)
RuntimeError: The job did not produce the output file for Background Model.
I thought it was an out-of-memory issue, so I didn't think too much of it since we are still waiting for the larger vGPU instance. However, after testing and looking into this further, it seems to be a different problem. I closed the vGPU driver issue (#1170) and will work on this instead.
I'm baffled. The directory `tmpa8eq61gp` was created successfully with all the necessary permissions, but no `hdf5` files were written during the run. There were also no other logs or error messages to help identify the exact cause. Since the same workflow runs without any issue on staging, I don't think it's a bug in the PACO repo; `/rodan-main/code/rodan/jobs/base.py` is also just checking whether the output file exists. So it looks like it might be a bug in the Rodan PACO wrapper or something else.
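For reference, the check in `base.py` (around line 843 in the traceback) appears to be a simple post-run existence test on each declared output. A minimal sketch of that kind of check, assuming an `outputs` dict shaped like the one in the error above (the function and variable names here are illustrative, not Rodan's actual code):

```python
import os

def verify_outputs(outputs):
    """Raise if a declared output resource was never written to its temp path.

    `outputs` is assumed to look like the dict in the error above:
    {port_name: [{'resource_temp_path': '...', ...}, ...], ...}
    """
    for opt_name, resources in outputs.items():
        for resource in resources:
            temp_path = resource['resource_temp_path']
            # The error fires whenever the job finishes without writing the file.
            if not os.path.isfile(temp_path) or os.path.getsize(temp_path) == 0:
                raise RuntimeError(
                    "The job did not produce the output file for {0}.\n\n{1}".format(
                        opt_name, outputs
                    )
                )
```

If the production check really is just this, then the interesting question is why the training step exits "successfully" without ever writing to those temp paths.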
This is also strange because, if the training never started, it naturally would not have written any `hdf5` files yet; but in this case it is the missing output file itself that seems to be what stops the training job.
The same error is reproduced on a local machine with an Intel chip (where we don't have the GPU container problem that affects M-series/ARM machines).
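As a quick sanity check while reproducing this locally, one can poll the worker's temp directories during a training run to confirm whether any output files ever appear. A rough sketch, assuming the worker writes into `/tmp/tmp*` directories like the ones in the traceback:

```python
import glob
import time

# Poll the Celery worker's temp directories for output files while the
# PACO training job runs. The /tmp/tmp* pattern matches the tempfile-style
# directories seen in the traceback (e.g. /tmp/tmpa8eq61gp); adjust it if
# the worker container uses a different TMPDIR.
for _ in range(60):  # roughly 10 minutes at 10 s intervals
    files = glob.glob("/tmp/tmp*/*")
    print(time.strftime("%H:%M:%S"), files if files else "no output files yet")
    time.sleep(10)
```

Running this inside the worker container should tell us whether the training process writes anything at all before the job is marked as finished.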
Since we are likely to eventually use the distributed version for Rodan prod (everything else has been successfully set up, #1184), I will do all the testing on the current single-instance version (rodan2.simssa.ca).