Add details about corrupted shuffle blocks.

holdenk · Sep 25, 2024 · 17ca694 · 17ca694
1 parent 150ad68
commit 17ca694
Show file tree

Hide file tree

Showing 4 changed files with 19 additions and 1 deletion.
diff --git a/.gitignore b/.gitignore
@@ -2,3 +2,4 @@
 .DS_Store
 site/
 .python-version
+.*~
diff --git a/docs/details/error-shuffle.md b/docs/details/error-shuffle.md
@@ -16,6 +16,7 @@ FetchFailed Exceptions can be bucketed into below 4 categories:
 2. Ran out of overhead memory on an Executor
 3. Shuffle block greater than 2 GB
 4. Network TimeOut.
+5. Corrupted Shuffle Block
 
 ### Ran out of heap memory(OOM) on an Executor
 
@@ -57,3 +58,8 @@ Error that you normally see in the executor/task logs:
 * org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
 * org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-xxxxxxxx
 * Caused by: org.apache.spark.shuffle.FetchFailedException: Too large frame: xxxxxxxxxxx
+
+### Corrupted Shuffle Block
+
+You'll see (on the driver logs) a lot of `Block shuffle_[somenumbershere] is corrupted but the cause is unknown`
+See [corrupted shuffle fetch](../shuffle_fetch_corrupted)
diff --git a/docs/details/shuffle_fetch_corrupted.md b/docs/details/shuffle_fetch_corrupted.md
@@ -0,0 +1,8 @@
+# Corrupted Shuffle Blocks
+
+Corrupted Shuffle Blocks are generally rare and most commonly indicate a hardware level problem. If you see a lot of `Block shuffle_[somenumbershere] is corrupted but the cause is unknown` the next step is to see if they're all coming from the same host. You can do this by grepping the driver stderr log for `FetchFailed(BlockManagerId(` and if the hostname of the block manger is the same (note: there may be different block manager ids) then that host is likely the problematic host.
+
+
+For example in `24/09/25 00:21:37 WARN task-result-getter-0 TaskSetManager: Lost task 613.0 in stage 1.0 (TID 12901) (fetching-host-name.internal executor 42): FetchFailed(BlockManagerId(420, serving-host-name.internal, 7337, None), shuffleId=0, mapIndex=11187, mapId=11187, reduceId=613, message=` the host we are concerned about is the `serving-host-name.internal`. At this point you'll need to work with your cluster administrator to remove that host. If you're running on EC2 (or another cloud) you may also need to create a ticket with upstream so that the node (once removed) is not added back from auto-scaling.
+
+
diff --git a/docs/flowchart/error.md b/docs/flowchart/error.md
@@ -13,7 +13,9 @@ Error --> OtherError[Others]
 
 OtherError --> SchemaColumnConvertNotSupportedException
 OtherError --> TooLargeJar[JAR too large]
-OtherError --> LostShuffleFiles[Lost Shuffle Files]
+
+ShuffleError --> LostShuffleFiles[Lost Shuffle Files]
+ShuffleError --> ShuffleFetchCorrupted[Corrupted Shuffle Files]
 
 LostShuffleFiles --> ShuffleExchangeLosesExecReg[ShuffleExchange Loses Executor Registration]
 LostShuffleFiles --> NodeFailures[Node Failures]
@@ -57,6 +59,7 @@ click ExecutorMemoryError "../../details/error-executor-out-of-memory"
 click ExecutorDiskError "../../details/error-executor-out-of-disk"
 
 click ShuffleError "../../details/error-shuffle"
+click ShuffleFetchCorrupted "../../details/shuffle_fetch_corrupted"
 click SqlAnalysisError "../../details/error-sql-analysis"
 click OtherError "../../details/error-other"
-Original file line number
+Diff line change
@@ Expand Up / @@ -2,3 +2,4 @@ @@
     .DS_Store
     site/
     .python-version
+    .*~