Memory leak in task serialization #16
@gench This looks like an issue in your code.
The serialized func object only includes the function's code bytes and some parameter data used to run the task. When a task is launched on a Spark executor, the memory is allocated on the executor side.
That is the problem. My function takes a big dataframe generated in the driver, as you can see below. Each time the function is serialised for an executor, its memory is not released afterwards. When I run the following code, it takes 8 times the memory of the X pandas dataframe.
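(The original snippet is not preserved in this thread; roughly, the pattern being described is something like the sketch below, where `X`, `fit_on_chunk`, and the sizes are purely illustrative placeholders and the joblib-spark backend is assumed.)

```python
# Illustrative sketch only -- not the original code from this issue.
import numpy as np
import pandas as pd
from joblib import Parallel, delayed, parallel_backend
from joblibspark import register_spark

register_spark()  # make the "spark" joblib backend available

# Large DataFrame built on the driver (size is hypothetical).
X = pd.DataFrame(np.random.rand(1_000_000, 50))

def fit_on_chunk(i):
    # Closes over X, so X is pickled into every serialized task.
    return X.sample(frac=0.5, random_state=i).mean().sum()

with parallel_backend("spark", n_jobs=8):
    results = Parallel()(delayed(fit_on_chunk)(i) for i in range(8))
```

With 8 tasks, each serialized closure carries its own copy of `X`, which matches the roughly 8x memory growth described above.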
Could you try converting the dataframe into a Spark broadcast variable? Something like:
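(The suggested snippet is also not preserved here; a minimal sketch of the broadcast approach, assuming the same hypothetical `X` and task function as above, could look like this.)

```python
# Sketch of the suggested workaround: broadcast X once instead of
# pickling it into every serialized task.
from joblib import Parallel, delayed, parallel_backend
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
X_bc = spark.sparkContext.broadcast(X)  # X = the large pandas DataFrame

def fit_on_chunk(i):
    X_local = X_bc.value  # materialised on the executor, not shipped per task
    return X_local.sample(frac=0.5, random_state=i).mean().sum()

with parallel_backend("spark", n_jobs=8):
    results = Parallel()(delayed(fit_on_chunk)(i) for i in range(8))
```

The closure then only captures the small `Broadcast` handle, and Spark ships the full dataframe to each executor once rather than once per serialized task.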
When multiple tasks are parallelized, the memory footprint of the Spark driver grows by the memory requirement of each task. I think the problem is in the `apply_async` function, which keeps references to the serialised pickle objects, so the driver easily runs out of memory when `number_of_tasks x memory_of_a_task > driver_memory`.
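As a rough illustration of that bound (the numbers are purely hypothetical): with 8 tasks each carrying a 4 GB serialized dataframe, the driver holds about 8 x 4 GB = 32 GB of pickled payloads at once, which already exceeds a typical 16 GB driver, whereas the broadcast approach keeps a single copy on the driver.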