Skip to content

Commit

Permalink
Add details about corrupted shuffle blocks.
Browse files Browse the repository at this point in the history
  • Loading branch information
holdenk committed Sep 25, 2024
1 parent 150ad68 commit 17ca694
Show file tree
Hide file tree
Showing 4 changed files with 19 additions and 1 deletion.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -2,3 +2,4 @@
.DS_Store
site/
.python-version
.*~
6 changes: 6 additions & 0 deletions docs/details/error-shuffle.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ FetchFailed Exceptions can be bucketed into below 4 categories:
2. Ran out of overhead memory on an Executor
3. Shuffle block greater than 2 GB
4. Network TimeOut.
5. Corrupted Shuffle Block

### Ran out of heap memory(OOM) on an Executor

Expand Down Expand Up @@ -57,3 +58,8 @@ Error that you normally see in the executor/task logs:
* org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
* org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-xxxxxxxx
* Caused by: org.apache.spark.shuffle.FetchFailedException: Too large frame: xxxxxxxxxxx

### Corrupted Shuffle Block

You'll see (on the driver logs) a lot of `Block shuffle_[somenumbershere] is corrupted but the cause is unknown`
See [corrupted shuffle fetch](../shuffle_fetch_corrupted)
8 changes: 8 additions & 0 deletions docs/details/shuffle_fetch_corrupted.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
# Corrupted Shuffle Blocks

Corrupted Shuffle Blocks are generally rare and most commonly indicate a hardware level problem. If you see a lot of `Block shuffle_[somenumbershere] is corrupted but the cause is unknown` the next step is to see if they're all coming from the same host. You can do this by grepping the driver stderr log for `FetchFailed(BlockManagerId(` and if the hostname of the block manger is the same (note: there may be different block manager ids) then that host is likely the problematic host.


For example in `24/09/25 00:21:37 WARN task-result-getter-0 TaskSetManager: Lost task 613.0 in stage 1.0 (TID 12901) (fetching-host-name.internal executor 42): FetchFailed(BlockManagerId(420, serving-host-name.internal, 7337, None), shuffleId=0, mapIndex=11187, mapId=11187, reduceId=613, message=` the host we are concerned about is the `serving-host-name.internal`. At this point you'll need to work with your cluster administrator to remove that host. If you're running on EC2 (or another cloud) you may also need to create a ticket with upstream so that the node (once removed) is not added back from auto-scaling.


5 changes: 4 additions & 1 deletion docs/flowchart/error.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,9 @@ Error --> OtherError[Others]
OtherError --> SchemaColumnConvertNotSupportedException
OtherError --> TooLargeJar[JAR too large]
OtherError --> LostShuffleFiles[Lost Shuffle Files]
ShuffleError --> LostShuffleFiles[Lost Shuffle Files]
ShuffleError --> ShuffleFetchCorrupted[Corrupted Shuffle Files]
LostShuffleFiles --> ShuffleExchangeLosesExecReg[ShuffleExchange Loses Executor Registration]
LostShuffleFiles --> NodeFailures[Node Failures]
Expand Down Expand Up @@ -57,6 +59,7 @@ click ExecutorMemoryError "../../details/error-executor-out-of-memory"
click ExecutorDiskError "../../details/error-executor-out-of-disk"
click ShuffleError "../../details/error-shuffle"
click ShuffleFetchCorrupted "../../details/shuffle_fetch_corrupted"
click SqlAnalysisError "../../details/error-sql-analysis"
click OtherError "../../details/error-other"
Expand Down

0 comments on commit 17ca694

Please sign in to comment.