First, I think the issue is not directly related to […]. For Ray to (temporarily) work, have you tried the workaround proposed here by setting […]?

To analyze the issue, I think we could debug into psutil.Process.parent() (maybe only using some simple standalone examples first, i.e., w/o Ray) and see what really happens there.

Maybe you could share a little more about your use case and/or motivation (i.e., why Ray in SGX LibOS: a random tryout, or a specific scenario)? I'm personally interested in having this integrated, though not in CI-Examples; maybe in our examples repo, or at least in the contrib repo.
Hi @kailun-qin, @dimakuv:

Re: "To analyze the issue, I think we could debug into psutil.Process.parent() (maybe only using some giampaolo/psutil#1905 first, i.e., w/o Ray) and see what really happens there."

I did a small experiment to see if even basic use of […]

Here is a very small repro script: ./test_sgx_psutil_simple.py
Standalone output: (Without Gramine-SGX)
Output when run under gramine-sgx (same behaviour seen under gramine-direct)
What I am finding is that basic usage of […]

I may be making very silly pilot errors in my test repro, but I think it is a fair expectation that Python's […]

Kindly take a look at this repro test case to see if my expectations of a valid return from […]

This may well be the issue that Ray's code here (ray/_private/process_watcher.py) is running into when Ray is run under Gramine. It may be that we have to 'handle' this on Ray's side by not falling over and dying immediately. But that would mask the real errors that Ray's code is written to catch, namely the parent-process-not-found scenario. Your thoughts?
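For anyone who wants to narrow this down without psutil or Ray in the picture, here is a minimal stdlib-only sketch. It is not the repro script mentioned above. It only compares what `os.getppid()` reports with the parent PID recorded in `/proc/self/stat`, and checks whether a `/proc/<ppid>` directory is visible. The function names are made up for illustration, and the assumption that psutil's Linux backend gets its parent information from procfs is my understanding, not something verified in this thread.

```python
import os
from pathlib import Path


def ppid_from_proc_stat(stat_text: str) -> int:
    """Extract the parent PID from the contents of a /proc/<pid>/stat file."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or ')' characters, so split after the LAST ')'.
    fields_after_comm = stat_text[stat_text.rindex(")") + 1:].split()
    # fields_after_comm[0] is the process state, [1] is the parent PID.
    return int(fields_after_comm[1])


def describe_parent() -> dict:
    """Collect the parent-process view from the syscall and from procfs."""
    report = {"pid": os.getpid(), "getppid": os.getppid()}
    try:
        stat_text = Path("/proc/self/stat").read_text()
        report["proc_stat_ppid"] = ppid_from_proc_stat(stat_text)
    except (OSError, ValueError, IndexError) as exc:
        # Missing or unparsable procfs entry: record the error, don't crash.
        report["proc_stat_ppid"] = None
        report["proc_stat_error"] = repr(exc)
    # Is the parent's procfs directory visible from inside this process?
    report["parent_proc_dir_exists"] = Path(f"/proc/{report['getppid']}").exists()
    return report


if __name__ == "__main__":
    for key, value in describe_parent().items():
        print(f"{key}: {value}")
```

Running the same script natively, under gramine-direct, and under gramine-sgx, then diffing the three outputs, should show whether the parent information is missing at the syscall level, at the procfs level, or only once psutil is involved.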
Hi, folks,
This post is a follow-on to the discussion thread #1664, on which I got some help / answers from @kailun-qin and @dimakuv. Thanks for that. [ I had posted this to the Gramine users' group, but at Dmitrii's request I'm starting a new discussion thread here. ]
This post is about using Ray under gramine-sgx / gramine-direct.
I'm on gramine-SGX-V1: Gramine was built from commit: 4212a2525efffecbc787419ccf349299957b679f
Thanks to the help received from Kailun-Qin and Dmitrii, I am now past the basic config / template / memory resources issues. I am still searching for someone else out there who has successfully started a Ray Cluster under gramine-sgx.
The problem:
The Ray cluster never seems to come up successfully under gramine-sgx or gramine-direct. 'ray start' does seem to go through successfully [1], but almost immediately thereafter the bootstrap process fails with this cryptic message:
[P1:T1:python3.8] libos_syscall_exit_group() -> do_process_exit: First time=1
[P1:T1:python3.8] do_process_exit() -> do_thread_exit(): process 1 exited with status 0
vsgx-vm:[43] $ [P12:T187:python3.8] libos_syscall_exit_group() -> do_process_exit: First time=1
[P13:T189:python3.8] libos_syscall_exit_group() -> do_process_exit: First time=1
The brief function names in the above output are from debug instrumentation I added to figure out what is going on in the Gramine / LibOS execution that leads to the process(es) exiting.
In [2] below I have shown snippets of the Ray dashboard_agent.log file showing more diagnostic messages leading to the failure. These messages simply give more info about the retry attempts, and show that eventually the node crashes.
The thing relevant to Gramine in that log is this brief message: FileNotFoundError: [Errno 2] No such file or directory: '/proc/net/dev'.
Questions to Gramine-devs:
Do you know if /proc/net/dev is supported under gramine-sgx or gramine-direct? I did go through the online docs, but can't recall whether it is supported.
On the same SGX-enabled box, I am able to bring up a Ray cluster (ray start, ray status) directly, without using gramine-sgx. The question is: has anyone in this group tried integrating Ray under gramine-sgx or gramine-direct?
Would it be possible for someone in your dev / QA team to try this integration out, and let me know whether you are able to get 'ray start' to work under gramine-sgx? It should be a fairly simple install of the Ray software to get this working on a Linux box.
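To make the /proc/net/dev question easier to answer empirically, a small probe like the following could be run natively and then under gramine-direct / gramine-sgx. This is only a sketch: the list of paths is my own guess at files that monitoring code commonly reads, not a statement about what Gramine does or does not emulate, and `probe_paths` is a hypothetical helper name.

```python
# Paths chosen for illustration only; extend with whatever the workload reads.
CANDIDATE_PATHS = [
    "/proc/net/dev",
    "/proc/meminfo",
    "/proc/cpuinfo",
    "/proc/stat",
    "/proc/self/stat",
]


def probe_paths(paths):
    """Try to open and read one byte from each path; report 'ok' or the error."""
    results = {}
    for path in paths:
        try:
            with open(path, "rb") as handle:
                handle.read(1)
            results[path] = "ok"
        except OSError as exc:
            # e.g. "FileNotFoundError: No such file or directory"
            results[path] = f"{type(exc).__name__}: {exc.strerror}"
    return results


if __name__ == "__main__":
    for path, status in probe_paths(CANDIDATE_PATHS).items():
        print(f"{path}: {status}")
```

Any path that reports "ok" natively but an error inside the enclave is a candidate for the failures seen in dashboard_agent.log.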
Digging further on the Ray-side:
I found these two threads. Some of this may be useful to Gramine-devs to help triage / troubleshoot the problems I am seeing. The signature of the problem I am seeing is exactly the same as the issues reported here:
Ray Issue-29412: [Ray Core] Ray agent getting killed unexpectedly
which led to a tentative code fix in the Ray Python libraries:
Ray PR-29540: [Agent] Make agent shutdown more informative and graceful
The point of these two threads is this: there seems to have been an issue with the Python library call psutil.Process.parent() misreporting that the parent node is down, causing cascading shutdowns on the Ray side.
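To illustrate the failure mode, and one way a watcher could tolerate a transient lookup failure without hiding a genuinely dead parent, here is a hypothetical sketch. It is not Ray's process_watcher code and not what PR-29540 implements. The function names, the `max_misses` threshold, and the use of `os.getppid()` in place of psutil are all assumptions made for illustration.

```python
import os
import time
from typing import Callable, Optional


def default_parent_lookup() -> Optional[int]:
    """Return the current parent PID, or None if it looks invalid."""
    ppid = os.getppid()
    return ppid if ppid > 0 else None


def wait_until_parent_gone(
    expected_ppid: int,
    lookup: Callable[[], Optional[int]] = default_parent_lookup,
    max_misses: int = 3,
    interval_s: float = 0.5,
    max_checks: Optional[int] = None,
) -> bool:
    """Poll the parent; return True only after max_misses consecutive misses.

    Returns False if max_checks polls complete without that many consecutive
    misses. With max_checks=None the loop runs until the parent is gone.
    """
    misses = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if lookup() == expected_ppid:
            misses = 0  # parent seen again: forget earlier transient misses
        else:
            misses += 1
            if misses >= max_misses:
                return True  # persistent failure: treat the parent as gone
        time.sleep(interval_s)
    return False
```

With `max_misses=1` this degenerates to the "die on the first failed lookup" behaviour. A larger value trades shutdown latency for robustness against a single bad lookup, while a parent that is really gone is still detected after `max_misses` polls.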
Question(s) are:
Can Gramine-devs speculate whether such issues with node patrolling on the Gramine side, induced by some Python library hiccups, could lead to 'ray start' totally aborting?
I was hoping someone would have tested out this integration on your end and put-up a nice tutorial on this page (Several other interesting integrations have been tried out.)
Given that Ray / ML/ Python workloads are becoming so very popular, I would have thought it would get some push from Gramine-CI/QA folks to try out this integration. And give us a helpful tutorial on how-to get this to work.
Thanks in advance, and thanks for reading this far. Any help / tips will be most graciously accepted.
--AdityA>
References
[1] The above lines indicate that 'ray start' did go through cleanly ... albeit very briefly. Immediately after these 'successful' start-up messages, we get that message from Ray about 'process exiting' (presumably because either it received some kill, or there was some faulty logic in detecting if the parent process is up).
[2] These messages show "the issue" with Python psutil.Process.parent() not being able to 'locate' a parent node.