segfault when accessing metadata buffer #212
Comments
@kleinschmidt this was using Ray.jl v0.0.2, right?
It was on 7f3aec0a39b6db86356ad658ab89d29531303caa (the commit before the workflows landed), and everything was built from source in the Docker image (using the Ray.jl-provided Dockerfile as the base).
Just hit this again: same code, same circumstances (just started a fresh cluster and submitted the job, got through ~4% of the work).
...and again
Can reproduce this pretty regularly (unfortunately not in a way that I can share, since it's internal data). I also ruled out the async reducer as the root cause; using a channel to make sure the I did a bit of poking around at whether we can check The next step for debugging this would probably be to just print the entire bytes during deserialization if
This was observed during some internal Beacon benchmarking of a large job (~25k tasks + a reduce step) with kuberay. I haven't been able to reproduce it (re-submitting exactly the same job with no changes has been running smoothly).
`Ray.get` here is being called in the context of an async reduction step. The tasks being reduced over are generated via `map`ping a task that returns a DataFrame, and then reduced like this:
My hunch is that there may be some kind of race condition here, where the Julia async `Task`s are somehow yielding in such a way as to cause the underlying memory to be freed. But that's really only a hunch. Full stacktrace from the segfault is below.
The other thing I could think of off the top of my head is that there's something we're not handling around non-localmemorybuffer buffers, but it's hard to say. The mysterious thing is that we're basically only interacting with the metadata via CoreWorker API methods. We gate a call to `GetMetadata` behind a call to `HasMetadata`:
https://github.com/beacon-biosignals/ray_core_worker_julia_jll.jl/blob/7f3aec0a39b6db86356ad658ab89d29531303caa/src/ray_julia_jll/common.jl#L258-L267
both of which are directly wrapping the C++ methods:
https://github.com/beacon-biosignals/ray_core_worker_julia_jll.jl/blob/7f3aec0a39b6db86356ad658ab89d29531303caa/build/wrapper.cc#L649-L661
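To make the lifetime concern concrete, here is a minimal C++ sketch of the gate-then-read pattern described above. It is not the code from `wrapper.cc` and it does not use Ray's real classes: `SketchBuffer`, `SketchObject`, and `CopyMetadata` are hypothetical stand-ins, and only the method names `HasMetadata` and `GetMetadata` are borrowed from the issue. The assumption being illustrated is that a raw pointer into the metadata buffer is only safe while something still owns the buffer, so the sketch copies the bytes while a `shared_ptr` is held.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a buffer type (NOT Ray's actual Buffer class).
class SketchBuffer {
 public:
  explicit SketchBuffer(std::vector<std::uint8_t> bytes)
      : bytes_(std::move(bytes)) {}
  const std::uint8_t* Data() const { return bytes_.data(); }
  std::size_t Size() const { return bytes_.size(); }

 private:
  std::vector<std::uint8_t> bytes_;
};

// Hypothetical stand-in for an object that may or may not carry metadata.
class SketchObject {
 public:
  explicit SketchObject(std::shared_ptr<SketchBuffer> metadata)
      : metadata_(std::move(metadata)) {}
  bool HasMetadata() const {
    return metadata_ != nullptr && metadata_->Size() > 0;
  }
  std::shared_ptr<SketchBuffer> GetMetadata() const { return metadata_; }

 private:
  std::shared_ptr<SketchBuffer> metadata_;
};

// Gate GetMetadata behind HasMetadata, then copy the bytes while the local
// shared_ptr keeps the buffer alive. Returns false if there is no metadata.
bool CopyMetadata(const SketchObject& object, std::vector<std::uint8_t>* out) {
  if (!object.HasMetadata()) {
    return false;
  }
  // Holding a shared_ptr (rather than a raw Data() pointer) for the duration
  // of the copy means the bytes cannot be freed underneath us.
  std::shared_ptr<SketchBuffer> buffer = object.GetMetadata();
  assert(buffer != nullptr);
  out->assign(buffer->Data(), buffer->Data() + buffer->Size());
  return true;
}
```

In this sketch the copied vector stays valid after the owning object is destroyed, whereas a raw `Data()` pointer handed across a language boundary would dangle. Whether anything like that is actually happening here is unconfirmed; it is just one way the "memory freed while a task yields" hunch could play out.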
Logs