
cannot run PACO train on prod (vGPU server) #1181

Closed
homework36 opened this issue Jun 19, 2024 · 3 comments

homework36 (Contributor) commented Jun 19, 2024

We can now clearly access the GPU on the production server. Classification works, but PACO training always fails with the message below, even though the same workflow and input files completed successfully on staging:

Task Training model for Patchwise Analysis of Music Document, Training[eacd36d5-c8dd-4b02-b9cd-38ca31c92959] raised unexpected: RuntimeError("The job did not produce the output file for Background Model.\n\n{'Log File': [{'resource_type': 'text/plain', 'uuid': UUID('c031c7c9-c86d-481f-ae62-59f0b2491828'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/0fd193e7-8b3d-45a0-84d8-b99c0b2b8fc0'}], 'Background Model': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('c7a3b8ca-429a-4fcd-be92-2957e00497ba'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/faebcd1f-c384-44b9-9572-25f5ac27b12e'}], 'Model 1': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('eea42920-5100-4a57-9604-7688d582b482'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/a69ba529-0053-4c7f-bc3d-333942300b15'}], 'Model 2': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('3fd388db-1cf3-4400-9fb7-712bd3f6738e'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/03ac3bb9-3323-42a0-b989-5eadae3a0529'}]}")
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 412, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/code/Rodan/rodan/jobs/base.py", line 843, in run
    ).format(opt_name, outputs)
RuntimeError: The job did not produce the output file for Background Model.
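
For context, the check that raises this error appears to be a simple post-run existence test on each declared output. Here is a minimal sketch of what such a check might look like, based only on the traceback above (the function and variable names are guesses, not the actual Rodan source):

```python
import os

# Hypothetical reconstruction of the output-existence check near
# rodan/jobs/base.py line 843; names are assumptions inferred from
# the traceback, not the actual Rodan source.
def verify_outputs(outputs):
    for opt_name, resources in outputs.items():
        for resource in resources:
            if not os.path.isfile(resource["resource_temp_path"]):
                raise RuntimeError(
                    "The job did not produce the output file for {}.\n\n{}".format(
                        opt_name, outputs
                    )
                )
```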

I thought it was an out-of-memory issue, so I didn't think too much of it, since we are still waiting for the larger vGPU instance. However, after testing and looking into this further, it seems to be a different problem. I have closed the vGPU driver issue (#1170) and will work on this instead.


I'm baffled. The directory tmpa8eq61gp was created successfully with all the necessary permissions, but no HDF5 files were written during the process. There were also no other logs or error messages to help identify the exact cause. Since the same workflow runs without any issue on staging, I don't think it's a bug in the PACO repo; /rodan-main/code/rodan/jobs/base.py is also just checking whether the file exists. So it looks like it might be a bug in the Rodan PACO wrapper or something else.
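
To confirm that no HDF5 files are ever written (rather than written and then deleted), something like the following can be run on the worker while the training task executes. This is just a debugging sketch; the glob pattern is an assumption based on the /tmp/tmpXXXX paths in the error above:

```python
import glob
import time

# Poll the worker's temp directories for a few minutes and report whether
# any files (in particular .hdf5 outputs) ever appear.
def watch_tmp(pattern="/tmp/tmp*/*", interval=5, duration=300):
    deadline = time.time() + duration
    while time.time() < deadline:
        files = glob.glob(pattern)
        print(time.strftime("%H:%M:%S"), "{} file(s)".format(len(files)), files[:5])
        time.sleep(interval)

if __name__ == "__main__":
    watch_tmp()
```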

This is also strange because, if training has not started, no HDF5 files would have been written yet, of course. In this case, however, it is the missing output file that seems to prevent training, which suggests the job is failing before training even begins.
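
One way to tell the two cases apart: the outputs dict in the error above also includes a text/plain "Log File" resource. If that file can be read before the worker's temp directory is cleaned up, its contents (or absence) should show whether training ever started. A quick check, using the path taken verbatim from the error message:

```python
# Path copied from the 'Log File' entry in the error above; it only
# exists while the worker's temp directory is still around.
log_path = "/tmp/tmpa8eq61gp/0fd193e7-8b3d-45a0-84d8-b99c0b2b8fc0"

try:
    with open(log_path) as f:
        print(f.read())
except FileNotFoundError:
    print("Log file was never written either; the job likely failed "
          "before training began.")
```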

The same error is reproduced on a local machine with an Intel chip (where we don't have the GPU container problem that affects Apple Silicon / ARM machines).

homework36 (Contributor, Author) commented Jul 10, 2024

Since we are likely to eventually use the distributed version for Rodan prod (everything else has been set up successfully there, see #1184), I will do all the testing on the current single-instance version (rodan2.simssa.ca).

homework36 (Contributor, Author) commented:

The issue has now been transferred to the PACO repo here.

homework36 added the help-wanted label Jul 16, 2024
homework36 (Contributor, Author) commented:

We can run training now.
