
Memory leak in task serialization #16

Open
gench opened this issue May 1, 2020 · 3 comments

gench commented May 1, 2020

When multiple tasks are parallelized, the memory footprint of the Spark driver grows by the memory requirement of each task. I think the problem is in the apply_async function, which keeps references to the serialized pickle objects, so the driver easily runs out of memory when number_of_tasks x memory_take_of_a_task > driver_memory.
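
As a rough illustration of that condition (the numbers below are hypothetical, not measurements from this report):

```python
# Hypothetical figures only, to illustrate the failure condition described above.
number_of_tasks = 8                 # e.g. n_jobs=8 cross-validation folds
memory_take_of_a_task_gb = 2.0      # each retained pickled task embeds its own copy of the data
driver_memory_gb = 10.0

if number_of_tasks * memory_take_of_a_task_gb > driver_memory_gb:
    print("driver OOM: 16 GB of retained pickles exceeds the 10 GB driver heap")
```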

WeichenXu123 (Collaborator) commented Jun 25, 2020

@gench This is likely an issue in your code.

in the apply_async function, which keeps references to the serialized pickle objects

The serialized func object only includes the function's code bytes and the parameter data used to run the task. When a task is launched on a Spark executor, its memory is allocated on the executor side.
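
One way to check what ends up in that payload is to measure the pickled size of a task directly. The sketch below is illustrative, not from this thread, and assumes cloudpickle as the task serializer:

```python
# Rough sketch: measure how large a serialized task payload becomes when a big
# pandas DataFrame is passed as a parameter. Assumes cloudpickle is used to
# serialize the task function together with its arguments.
import cloudpickle
import numpy as np
import pandas as pd

def my_function(X, tr_ind, val_ind):
    # Placeholder task body for illustration.
    return float(X.iloc[val_ind].mean().mean())

X = pd.DataFrame(np.random.rand(1_000_000, 10))   # ~80 MB of float64 data

# Serialize the function plus its arguments, the way a task payload would be built.
payload = cloudpickle.dumps((my_function, (), {"X": X, "tr_ind": [0], "val_ind": [1]}))
print(f"serialized task payload: {len(payload) / 1e6:.0f} MB")  # roughly the size of X
```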

gench (Author) commented Jul 9, 2020

The serialized func object only includes the function's code bytes and the parameter data used to run the task.

That is the problem.

My function takes a big dataframe generated in the driver, as you can see below. Each time the function is serialized for an executor, that memory is not released afterwards. When I run the following code, the driver uses 8 times the memory of the X pandas dataframe.

```python
Parallel(backend="spark", n_jobs=8)(
    delayed(my_function)(X=X, ...)
    for fold, (tr_ind, val_ind) in enumerate(cv_iterator)
)
```
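
For context, a fuller, self-contained sketch of that pattern (the function body, data sizes, and CV iterator are placeholders filled in for illustration, not the reporter's actual code; it assumes joblib-spark's register_spark and an active Spark session):

```python
# With n_jobs=8 the driver serializes eight tasks, each embedding its own pickled
# copy of X, which is where the ~8x driver memory growth comes from.
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from joblibspark import register_spark
from sklearn.model_selection import KFold

register_spark()  # register the "spark" joblib backend

X = pd.DataFrame(np.random.rand(1_000_000, 20))   # large driver-side dataframe
y = np.random.rand(len(X))

def my_function(X, y, tr_ind, val_ind):
    # Placeholder fold worker: "train" on tr_ind, "score" on val_ind.
    return float(X.iloc[val_ind].mean().mean())

cv_iterator = KFold(n_splits=8).split(X)

scores = Parallel(backend="spark", n_jobs=8)(
    delayed(my_function)(X=X, y=y, tr_ind=tr_ind, val_ind=val_ind)
    for fold, (tr_ind, val_ind) in enumerate(cv_iterator)
)
```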

WeichenXu123 (Collaborator) commented:

@gench

My function takes a big dataframe

Could you try converting the dataframe into a Spark broadcast variable?

For example: `bc_pandas_df = sparkContext.broadcast(pandas_df)`; then, in the remotely executed function, get the broadcast variable's value with `bc_pandas_df.value`.
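
A minimal sketch of that suggestion (the names mirror the snippets above and are illustrative; it assumes joblib-spark's register_spark and an active Spark session):

```python
# Ship the pandas dataframe to the executors once as a Spark broadcast variable,
# and pass only the lightweight Broadcast handle to each task instead of the
# dataframe itself.
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from joblibspark import register_spark
from pyspark.sql import SparkSession
from sklearn.model_selection import KFold

register_spark()
spark = SparkSession.builder.getOrCreate()

X = pd.DataFrame(np.random.rand(1_000_000, 20))
bc_pandas_df = spark.sparkContext.broadcast(X)   # broadcast the big dataframe once

def my_function(bc_X, tr_ind, val_ind):
    X = bc_X.value                   # look up the broadcast dataframe on the executor
    return float(X.iloc[val_ind].mean().mean())

scores = Parallel(backend="spark", n_jobs=8)(
    delayed(my_function)(bc_pandas_df, tr_ind, val_ind)
    for fold, (tr_ind, val_ind) in enumerate(KFold(n_splits=8).split(X))
)
```

Because the Broadcast handle pickles to a small reference rather than the full dataframe, each serialized task should stay small regardless of the size of X.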
