Random Cluster Bugs #56
You probably did not specify that you want the calculation to occur in the scratch directory? Another possibility is that you are not on the correct pygromos commit (although I checked the version I have, and it doesn't differ in any major way from pygromos v1)?
I don't pass the work_dir argument, so I should be using the default of the new branch. The pygromos version is correct for sure, as it is the standard one for reeds. I wonder whether there was a cluster anomaly, or whether the approach with the ssh-script is not robust?
I'll also check out the branch and see if it works for me.
We now have the impression that this might be related to a temporary communication problem between the nodes. So for now, let's collect all awkward bugs in the pipeline here and maybe we can make some sense of them. For me, the problems apparently occur only rarely.
For me, the same thing happened: after checking out the newest version of the eoff rebalancing branch (which includes the minor rework of the submission pipeline), only the files from one node are copied back correctly; the rest are missing... Does it (usually) work for you @candidechamp @schroederb, even when the job is distributed among different nodes?
@SalomeRonja I haven't had a single issue so far. I just diffed my local branch against origin/main and I don't see anything wrong. Are you 100% sure the files are actually missing?
Ah, after a closer look, the problem was that the job timed out - I didn't think to increase the wall-time.
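(For reference, a minimal sketch of requesting a longer wall-time at submission, assuming the job is submitted directly via LSF's bsub; the limit, core count, and script name below are placeholders, not the values used here:)

```python
import subprocess

# Resubmit with a longer wall-clock limit so the copy-back from the node-local
# scratch can finish before LSF kills the job. LSF's "-W" flag takes the limit
# as HH:MM. The 24:00 limit, core count, and script name are placeholders.
subprocess.run(
    ["bsub", "-W", "24:00", "-n", "8", "./run_production.sh"],
    check=True,
)
```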
@SalomeRonja Thanks for looking into it. That's unfortunately a drawback we can't really do anything about when running multi-node jobs. If the wall-time is reached, we have no way of getting the data back; the only people who could fix this are the developers of the LSF queuing system.
@candidechamp But we can still fall back to the old work_dir flag if desired, right?
@schroederb You can, but this makes the cluster slow for everyone.
Uh, was that stated by the cluster support? I thought you told me it was not such a big deal?
Oh no, sorry, actually the cluster people said something slightly different: "An advantage of the local scratch is that it is independent. If people do stupid stuff on /cluster/work, it will slow down the entire file system, i.e., your job could be negatively affected by the actions of other users, whereas you don't have this problem on the local scratch. Copying the data from/to local scratch can even be optimized and parallelized (using gnu parallel to untar several tar archives in parallel to $TMPDIR, using multiple cores). In one test, a user could copy 360 GB of data from /cluster/work to $TMPDIR within 3 or 4 minutes. When a job runs for several hours, a few minutes will not cause a lot of overhead compared to the total runtime."
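(A minimal Python sketch of the same idea, in case gnu parallel is not available; the archive location and worker count are assumptions, and $TMPDIR is the node-local scratch set by LSF:)

```python
import os
import tarfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def untar_to_scratch(archive: Path, scratch: Path) -> Path:
    """Extract a single tar archive into the node-local scratch directory."""
    with tarfile.open(archive) as tf:
        tf.extractall(scratch)
    return archive


if __name__ == "__main__":
    scratch = Path(os.environ["TMPDIR"])  # node-local scratch provided by LSF
    # Hypothetical input location on the shared file system.
    archives = sorted(Path("/cluster/work/my_project/input").glob("*.tar"))
    # Extract several archives concurrently, analogous to the gnu-parallel suggestion.
    with ProcessPoolExecutor(max_workers=4) as pool:
        for done in pool.map(untar_to_scratch, archives, [scratch] * len(archives)):
            print(f"extracted {done.name} into {scratch}")
```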
Ah ok, yes, I think the default should be the scratch solution on the node. It's still nice to keep the option of opting out, in case we want to test/debug something.
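(To illustrate the behaviour we're discussing, here is a hypothetical helper, not the actual pygromos API, that defaults to the node-local scratch but keeps the opt-out:)

```python
import os
from pathlib import Path


def resolve_work_dir(work_dir: str | None = None) -> Path:
    """Hypothetical helper: default to the node-local scratch ($TMPDIR),
    but let the caller opt out by passing an explicit work_dir,
    e.g. for testing/debugging on /cluster/work."""
    if work_dir is not None:
        return Path(work_dir)  # explicit opt-out
    return Path(os.environ.get("TMPDIR", "/tmp"))  # default: local scratch
```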
@SalomeRonja @epbarros I think this issue may be closed now?
Hi @candidechamp,
the new copying back from the cluster scratch failed for me!
The run did not copy all files back to the work folder (the cnfs are missing).
It looks like the copying after the run fails, or the scratch folder is not found anymore?
Does that also happen for you?
I have attached the output file:
CHK1_nd5_enr3_complex_prod_1SS_21r_3_sopt4_rb3_max8_md.txt