-
Hi @gapisback, thanks for the question and detailed information!
Unfortunately no, I'm not aware of such an attempt at this moment.
Yes, from a quick look, I noticed the configs below in your template:
Instead, I think you would actually need the following. I also noticed:
This looks stale. Please use the latest Gramine master branch (with the feature included) and see https://gramine.readthedocs.io/en/latest/manifest-syntax.html#untrusted-shared-memory for more details. Would you please give the above another try? Thanks!
No, files in a shared-memory mount need to be explicitly listed as allowed_files to be accessed (please take a look at our shm regression test and manifest).
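To make that concrete, a minimal sketch of what such allow-listing could look like in the manifest is below. The file name is purely hypothetical, and the exact URI form should be verified against the shm regression test manifest and the docs linked above:

```toml
# Hypothetical example: each file that will be accessed under /dev/shm
# must be allow-listed explicitly; "plasma_demo" is a made-up name.
sgx.allowed_files = [
  "dev:/dev/shm/plasma_demo",
]
```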
-
Thanks, @kailun-qin, for your quick and clear response. We rebased to the latest commit off
And we made the following changes to our template file: (Brief diffs of changes shown below.)
We are now able to get past the bootstrap issues. The startup messages from the Ray code look like so:
This is looking better. But we still have other issues while trying to bring up Ray under Gramine, which I will describe below. These need troubleshooting on the Gramine side, and hopefully you / someone on Gramine-dev can help.
-
The basic question still remains: What configuration do we need in the template file to allow Ray cluster's
-
Hello,
Has anyone attempted to start a Ray cluster from inside `gramine-sgx`?
Problem Statement
Starting from sample template manifest files in this Gramine-examples repo (and a few other Gramine manifest doc sources), I am using a basic `python.manifest.template` file.
The command I'm using to start the head node of a Ray cluster is like so:
Btw, I should mention that on this same SGX-enabled machine, starting the Ray cluster (`ray start --head` for the head node), starting / stopping worker nodes, `ray status`, ... -- all commands work just fine OUTSIDE Gramine. So, there is NO issue with the `ray` installation I have done on this machine.
With Gramine, here is what happens: Ray startup goes through for a while, creates the required ray-session dirs in `/tmp/ray` and so on. The startup process eventually fails with these errors:
Triage Notes
From my analysis, the error is not really to do with not having enough space in `/dev/shm`, but is possibly due to some interaction (or lack of support) between accessing `/dev/shm` and how Ray's code does `mmap()` on a file created in this `/dev/shm` directory.
I have chased this down somewhat through Gramine code changes and have found that shared-memory support was added to Gramine under PR #827.
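For reference, the feature from PR #827 is enabled through an `untrusted_shm` mount in the manifest; per the Gramine manifest-syntax docs, the relevant fragment looks roughly like this (verify the exact syntax against the Gramine version you have installed):

```toml
# Map the host's /dev/shm into the enclave as untrusted shared memory.
fs.mounts = [
  { type = "untrusted_shm", path = "/dev/shm", uri = "dev:/dev/shm" },
]
```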
I have checked that this feature support is present in the installed version of `gramine-sgx` I am using.
The relevant snippets from my `python.manifest.template` are the following (see Appendix for the full template file):
All the right incantations seem to be in place in the above template specification. But I'm still running into the error reported above.
Relevant code from Ray Sources:
Ray's code that runs during `ray start` is doing this in ray/object_manager/plasma/dlmalloc.cc @ L211:
The failure is coming from the call to `mmap()` on L211 above.
My questions, to the general members of the Gramine community and specifically to the implementers / reviewers of PR #827, are:
Has anyone attempted this integration of running Ray from inside gramine-sgx?
Am I missing something in my Gramine template specifications?
With Gramine's support for shared memory (PR #827, "[LibOS,PAL/Linux-SGX] Add shared untrusted memory support"), is this usage of doing `mmap(PROT_WRITE)` on a randomly-named file from `/dev/shm` supposed to work? Or am I simply running into some feature incompatibility here?
Acknowledgements
Thanks for reading this far! I'm stuck ... so any guidance to get me unblocked will be most appreciated!
Appendix
Contents of `python.manifest.template`