The spark driver has stopped unexpectedly and is restarting AFTER using excel lib #689
-
In one part of my code I read several tabs of a spreadsheet, and doing so started to fill up the driver node's memory on my cluster. Why does this happen? When I read Excel files, is the data not distributed to the worker nodes? Why does setting maxRowsInMemory = 20 solve the problem? Does it help to read the data on demand?
-
Spark-excel uses the excel-streaming-reader library to read data without having all data in memory.
-
The `maxRowsInMemory` option uses a streaming reader. The v1 version (the one you're using if you do a `.format("com.crealytics.spark.excel")`) actually reads all rows into memory on the driver and only then calls `parallelize` to distribute the data to the workers. The v2 version (`.format("excel")`) reads directly on the workers.