[Ray Data]: ray data map_batches leaks memory/doesn't update objectRef count #49757
Comments
I tried avoiding the nesting of Ray Data inside Ray tasks for this on the latest version. Here is the revised script I used so that I could remove the call to `ray.remote` after the Ray Data stage:

```python
import ray
import numpy as np
from ray.util.state import summarize_actors, summarize_objects, summarize_tasks
import time
import gc


class dummyActor:
    def __init__(self, num) -> None:
        self.num = num

    def __call__(self, recs):
        for x in recs:
            pass
        return recs


@ray.remote
def dummy_fnc(arg):
    for i in arg.iter_rows():
        pass
    return arg


def dummy_transform(arg):
    arg['item'] = arg['id'] * 2
    return arg


if __name__ == "__main__":
    data = ray.data.range(1000000)
    data = data.map_batches(dummyActor, fn_constructor_kwargs={'num': 10}, concurrency=2)
    data = data.map(dummy_transform, concurrency=10)
    data.write_parquet('local://tmp/datab/')

    del data
    gc.collect()
    time.sleep(1)
    print(summarize_objects(raise_on_missing_output=False))
```

I still see memory not being released.
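For what it's worth, a small diagnostic I would try here (my own sketch, not part of the original report) is to list the individual objects that survive the `del` / `gc.collect()` and group them by reference type and owning process, using the same `ray.util.state` API as above. Field names such as `reference_type` and `pid` reflect my reading of the state API and may differ slightly between Ray versions.

```python
from collections import Counter

from ray.util.state import list_objects


def report_surviving_objects():
    # The state API truncates to 100 entries by default, so raise the limit.
    objs = list_objects(limit=10_000, raise_on_missing_output=False)

    # How are the leftover objects being held? LOCAL_REFERENCE entries owned
    # by worker PIDs (rather than the driver) would point at the leak.
    print("by reference type:", dict(Counter(o.reference_type for o in objs)))
    print("by owner pid:", Counter(o.pid for o in objs).most_common(10))
```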
OK, thanks. This is reproducible. Does this cause issues / OOM in your pipeline?
I've seen and debugged into this before. The problem is that some objects are still being referenced by idle workers.
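One way to see that concretely (a rough sketch of my own, assuming `psutil` is available and that idle Ray workers keep their usual `ray::IDLE` process title) is to watch the resident memory of idle worker processes after the dataset has been deleted:

```python
import psutil


def idle_worker_rss_mb():
    """Return {pid: RSS in MB} for processes whose title contains ray::IDLE."""
    usage = {}
    for proc in psutil.process_iter(attrs=["pid", "name", "memory_info"]):
        try:
            if "ray::IDLE" in (proc.info["name"] or ""):
                usage[proc.info["pid"]] = proc.info["memory_info"].rss / 1e6
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return usage


# If these numbers stay high (or keep growing across pipeline runs) after
# `del data; gc.collect()`, the idle workers are still holding references.
print(idle_worker_rss_mb())
```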
This does cause OOM for me in the case of long-running processes: memory in these processes keeps building up over time and they are eventually killed by the OOM killer. Also, in that scenario Ray tries to recreate the killed worker processes with their last known state (i.e. including references that should have been released), which makes the whole cluster unusable.
@richardliaw, hey, any pointers/insights I can use to free up this memory (forced GC?) so that a long-running task can keep going?
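Not a Ray-internal fix, but a blunt workaround sketch for the long-running case (my own suggestion, with a hypothetical `run_one_iteration` standing in for the real pipeline): run each iteration in a short-lived subprocess that connects to the existing cluster, so every ObjectRef owned by that job is dropped when the subprocess exits, regardless of what the idle workers do.

```python
import multiprocessing as mp


def run_one_iteration():
    import ray

    # Assumes an already-running cluster reachable at address="auto".
    ray.init(address="auto", ignore_reinit_error=True)
    data = ray.data.range(1_000_000)
    data = data.map(lambda row: {"id": row["id"], "item": row["id"] * 2})
    data.write_parquet("local://tmp/datab/")
    ray.shutdown()


if __name__ == "__main__":
    # e.g. periodic batches in a long-running driver
    for _ in range(24):
        ctx = mp.get_context("spawn")
        p = ctx.Process(target=run_one_iteration)
        p.start()
        p.join()
```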
Ray Data groupby also has similar problems. The Ray version I use is 2.40.
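For reference, the kind of groupby pipeline I mean looks roughly like this (my own minimal sketch against Ray 2.40, using the same del/gc/summarize check as the script above, not the exact workload):

```python
import gc
import time

import ray
from ray.util.state import summarize_objects

if __name__ == "__main__":
    data = ray.data.range(1_000_000)
    counts = data.groupby("id").count()
    counts.write_parquet("local://tmp/datab_grouped/")

    del data, counts
    gc.collect()
    time.sleep(1)
    # The remaining object references should be (close to) empty here,
    # but in practice they are not.
    print(summarize_objects(raise_on_missing_output=False))
```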
What happened + What you expected to happen
When using Ray Data map_batches, memory is not released once the reference to the dataset goes out of scope. Memory therefore builds up over time and workers ultimately get killed due to OOM.
Versions / Dependencies
Ray: 2.34
Python: 3.9
Reproduction script
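The script body did not survive in this copy of the issue. Based on the first comment above ("so I could remove call to ray remote after ray data"), the original reproduction presumably passed the dataset into a `ray.remote` task after `map_batches`, roughly along these lines (my reconstruction, not the reporter's exact code):

```python
import gc
import time

import ray
from ray.util.state import summarize_objects


class dummyActor:
    def __init__(self, num) -> None:
        self.num = num

    def __call__(self, recs):
        return recs


@ray.remote
def dummy_fnc(arg):
    for _ in arg.iter_rows():
        pass
    return arg


if __name__ == "__main__":
    data = ray.data.range(1_000_000)
    data = data.map_batches(dummyActor, fn_constructor_kwargs={"num": 10}, concurrency=2)
    # Nesting the dataset inside a Ray task, as described in the thread.
    data = ray.get(dummy_fnc.remote(data))

    del data
    gc.collect()
    time.sleep(1)
    print(summarize_objects(raise_on_missing_output=False))
```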
output:
Issue Severity
High: It blocks me from completing my task.