-
Hi @echokhan, thanks for the praise 😄
-
I was trying to read Excel files residing on AWS S3. As I already had PySpark pipelines set up, I attempted to use com.crealytics.spark.excel. It worked fine for files under 10 MB; however, with large files (50 to 150 MB) the jobs started failing with:
"java.lang.OutOfMemoryError: Java heap space"
I referred to AWS Glue's docs and found the following troubleshooting guide: AWS Glue OOM Heap Space. It, however, only dealt with the "large number of small files" problem and other driver-intensive operations, and the only suggestion it had for my situation was to scale up.
For the 50 MB files, scaling up to 20-30 workers got the job to succeed; the 150 MB file, however, still could not be read.
I then approached the problem with a different toolset, i.e. boto3 and pandas, or awswrangler. That did the job with just 4 workers in under 10 minutes, and I suspect it wouldn't even need 3.
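Roughly what that looked like (bucket and key below are placeholders):

```python
import io

import awswrangler as wr
import boto3
import pandas as pd

# Option 1: awswrangler fetches the object from S3 and parses it with pandas.
pdf = wr.s3.read_excel("s3://my-bucket/path/to/file.xlsx")  # placeholder path

# Option 2: plain boto3 + pandas, reading the object into memory first.
obj = boto3.client("s3").get_object(Bucket="my-bucket", Key="path/to/file.xlsx")
pdf = pd.read_excel(io.BytesIO(obj["Body"].read()))
```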
Of course, I then came across .option("maxRowsInMemory", x). Initially, with a higher value, it still failed; however, after starting from x = 1, the job read the 150 MB file with no heap space error and the minimum number of workers.
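In other words, something along these lines (path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("excel-read-streaming").getOrCreate()

# maxRowsInMemory switches spark-excel to its streaming reader, so only a
# handful of rows are kept in memory at a time instead of the whole workbook.
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("maxRowsInMemory", 1)
    .load("s3://my-bucket/path/to/large_file.xlsx")  # placeholder path
)
```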