-
Hi @echokhan, thanks for the praise 😄
-
I was trying to read Excel files residing on AWS S3. As I already had PySpark pipelines set up, I attempted to use com.crealytics.spark.excel. It worked fine for files under 10 MB; however, with large files (50 to 150 MB) the jobs started failing with:
"java.lang.OutOfMemoryError: Java heap space"
I referred to AWS Glue's docs and found the following troubleshooting guide: AWS Glue OOM Heap Space. It, however, only dealt with the "large number of small files" problem and other driver-intensive operations, and the only suggestion it had for my situation was to scale up.
For the 50 MB files, scaling up to 20-30 workers got the job to succeed; the 150 MB file, however, still could not be read.
I then approached the problem with a different toolset, i.e. boto3 and pandas, or awswrangler. That did the job with just 4 workers in under 10 minutes, and I suspect it wouldn't even need 3.
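Roughly what that looked like (bucket and key below are placeholders):

```python
import io

import awswrangler as wr
import boto3
import pandas as pd

# Option 1: awswrangler fetches the object from S3 and parses it with pandas.
pdf = wr.s3.read_excel("s3://my-bucket/path/to/file.xlsx")  # placeholder path

# Option 2: plain boto3 + pandas, reading the object into memory first.
obj = boto3.client("s3").get_object(Bucket="my-bucket", Key="path/to/file.xlsx")
pdf = pd.read_excel(io.BytesIO(obj["Body"].read()))
```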
Of course, I then came across .option("maxRowsInMemory", x). Initially, with a higher value, it still failed; however, after starting from x = 1, the job read the 150 MB file with no heap space error and the minimum number of workers.
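In other words, something along these lines (path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("excel-read-streaming").getOrCreate()

# maxRowsInMemory switches spark-excel to its streaming reader, so only a
# handful of rows are kept in memory at a time instead of the whole workbook.
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("maxRowsInMemory", 1)
    .load("s3://my-bucket/path/to/large_file.xlsx")  # placeholder path
)
```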